The Prevalence of Errors in Machine Learning Experiments

09/10/2019
by Martin Shepperd, et al.
Context: Conducting experiments is central to machine learning research, whether to benchmark, evaluate or compare learning algorithms. Consequently, it is important that we conduct reliable, trustworthy experiments. Objective: We investigate the incidence of errors in a sample of machine learning experiments in the domain of software defect prediction. Our focus is simple arithmetical and statistical errors. Method: We analyse 49 papers describing 2456 individual experimental results from a previously undertaken systematic review comparing supervised and unsupervised defect prediction classifiers. We extract the confusion matrices and test for relevant constraints, e.g., that the marginal probabilities must sum to one. We also check for multiple statistical significance testing errors. Results: We find that a total of 22 out of 49 papers contain demonstrable errors. Of these, 7 were statistical and 16 related to confusion matrix inconsistency (one paper contained both classes of error). Conclusions: Whilst some errors may be of a relatively trivial nature, e.g., transcription errors, their presence does not engender confidence. We strongly urge researchers to follow open science principles so that errors can be more easily detected and corrected, and thus, as a community, reduce this worryingly high error rate in our computational experiments.
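
To illustrate the kind of consistency checking described in the Method section, here is a minimal sketch, not the authors' actual tooling: it recomputes metrics from a reported confusion matrix (tp, fp, fn, tn) and flags disagreements with the figures quoted in a paper. The function name, parameters, and tolerance are illustrative assumptions.

```python
def check_confusion_matrix(tp, fp, fn, tn, reported_n=None,
                           reported_recall=None, reported_precision=None,
                           tol=0.005):
    """Return a list of detected inconsistencies (empty if none found).

    Illustrative sketch only; names and tolerance are assumptions,
    not the checks used in the paper.
    """
    problems = []

    # Cell counts must be non-negative.
    if any(v < 0 for v in (tp, fp, fn, tn)):
        problems.append("negative cell count")

    # The four cells must account for every instance, i.e. the
    # marginal probabilities must sum to one.
    n = tp + fp + fn + tn
    if reported_n is not None and n != reported_n:
        problems.append(f"cells sum to {n}, paper reports n={reported_n}")

    # Recomputed metrics should match the reported ones within tolerance.
    if reported_recall is not None and (tp + fn) > 0:
        recall = tp / (tp + fn)
        if abs(recall - reported_recall) > tol:
            problems.append(f"recall recomputes to {recall:.3f}, "
                            f"paper reports {reported_recall:.3f}")

    if reported_precision is not None and (tp + fp) > 0:
        precision = tp / (tp + fp)
        if abs(precision - reported_precision) > tol:
            problems.append(f"precision recomputes to {precision:.3f}, "
                            f"paper reports {reported_precision:.3f}")

    return problems


if __name__ == "__main__":
    # Hypothetical example: the reported n disagrees with the cell totals.
    print(check_confusion_matrix(tp=40, fp=10, fn=20, tn=25,
                                 reported_n=100, reported_recall=0.667))
```

A check for the multiple statistical significance testing errors mentioned in the abstract would follow the same pattern, for instance verifying that the significance threshold reflects a correction (such as Bonferroni) when many comparisons are reported.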


