The Pitfalls of Sample Selection: A Case Study on Lung Nodule Classification

08/11/2021
by Vasileios Baltatzis, et al.

Using publicly available data to determine the performance of methodological contributions is important, as it facilitates reproducibility and allows scrutiny of the published results. In lung nodule classification, for example, many works report results on the publicly available LIDC dataset. In theory, this should allow a direct comparison of the performance of proposed methods and an assessment of the impact of individual contributions. When analyzing seven recent works, however, we find that each employs a different data selection process, leading to widely varying total numbers of samples and ratios between benign and malignant cases. As each subset has different characteristics and a different level of classification difficulty, a direct comparison between the proposed methods is not always possible, nor fair. We study the particular effect of truthing when aggregating labels from multiple experts. We show that specific choices can have a severe impact on the data distribution, such that it may be possible to achieve superior performance on one sample distribution but not on another. While we show that we can further improve on the state of the art on one sample selection, we also find that on a more challenging sample selection from the same database, the more advanced models underperform very simple baseline methods, highlighting that the selected data distribution may play an even more important role than the model architecture. This raises concerns about the validity of claimed methodological contributions. We believe the community should be aware of these pitfalls, and we make recommendations on how they can be avoided in future work.
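The effect of truthing on the resulting sample selection can be illustrated with a small sketch. It assumes each nodule carries malignancy ratings from several experts on a 1 to 5 scale, with 3 treated as indeterminate. The ratings below are invented toy values, and the names `RATINGS`, `select` and `summarize` are hypothetical; this is not the selection procedure of any of the surveyed works, only an example of how the aggregation rule alone can change the number of samples and the class ratio.

```python
from statistics import mean, median

# Hypothetical toy data (not taken from LIDC): per-nodule expert ratings, 1-5.
RATINGS = {
    "n1": [1, 2, 2],
    "n2": [3, 3, 4],
    "n3": [2, 4, 5, 5],
    "n4": [3, 3, 3],
    "n5": [1, 5],
    "n6": [4, 4, 5, 5],
    "n7": [2, 3, 3, 3],
}


def select(ratings, aggregate, drop_ambiguous=True, threshold=3.0):
    """Aggregate expert ratings into a binary label per nodule.

    Returns {nodule_id: 1 (malignant) or 0 (benign)}. Nodules whose
    aggregated score equals the threshold are dropped when
    drop_ambiguous is True, and labelled benign otherwise.
    """
    labels = {}
    for nodule_id, scores in ratings.items():
        score = aggregate(scores)
        if score == threshold and drop_ambiguous:
            continue  # indeterminate case excluded from the selection
        labels[nodule_id] = int(score > threshold)
    return labels


def summarize(labels):
    """Return (total, malignant, benign) counts for a selection."""
    malignant = sum(labels.values())
    return len(labels), malignant, len(labels) - malignant


if __name__ == "__main__":
    for name, kwargs in [
        ("mean, drop ambiguous", dict(aggregate=mean)),
        ("median, drop ambiguous", dict(aggregate=median)),
        ("mean, keep ambiguous", dict(aggregate=mean, drop_ambiguous=False)),
    ]:
        total, malignant, benign = summarize(select(RATINGS, **kwargs))
        print(f"{name}: {total} nodules, {malignant} malignant, {benign} benign")
```

On this toy data the three rules keep 5, 3 and 7 of the 7 nodules respectively, with different benign-to-malignant ratios, so a classifier evaluated on one selection is not being tested on the same problem as one evaluated on another.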

