Semantically Redundant Training Data Removal and Deep Model Classification Performance: A Study with Chest X-rays

Deep learning (DL) can independently learn hierarchical features from complex, multi-dimensional data. A common understanding is that its performance scales with the amount of training data. A second relevant attribute of the data is its inherent variety. It follows that semantic redundancy, the presence of similar or repetitive information, would tend to lower performance and limit generalizability to unseen data. In medical imaging, semantic redundancy can arise when multiple images have highly similar presentations of the disease of interest. Further, the augmentation methods commonly used to generate variety in DL training may limit performance when applied to semantically redundant data. We propose an entropy-based sample scoring approach to identify and remove semantically redundant training data. Using the publicly available NIH chest X-ray dataset, we demonstrate that the model trained on the resulting informative subset of training data significantly outperforms the model trained on the full training set in both internal testing (recall: 0.7164 vs 0.6597, p<0.05) and external testing (recall: 0.3185 vs 0.2589, p<0.05). Our findings emphasize the importance of information-oriented training sample selection as opposed to the conventional practice of using all available training data.
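The abstract does not spell out how the entropy-based score is computed, so the following is only a minimal sketch of one plausible reading, not the authors' procedure. It assumes each training sample already has class probabilities from some preliminary model, scores each sample by the Shannon entropy of those probabilities, and keeps the highest-scoring fraction. The function names and the `keep_fraction` parameter are hypothetical and chosen for illustration.

```python
import math


def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of one sample's class-probability vector."""
    # eps guards against log(0) for classes with zero predicted probability.
    return -sum(p * math.log(p + eps) for p in probs)


def select_informative(prob_rows, keep_fraction=0.5):
    """Return sorted indices of the highest-entropy samples.

    Assumption: a higher entropy score marks a more informative sample,
    and low-scoring samples are treated as redundant. The scoring rule
    and the cutoff are illustrative choices, not taken from the paper.
    """
    scores = [predictive_entropy(row) for row in prob_rows]
    n_keep = max(1, round(keep_fraction * len(scores)))
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy probabilities for four samples in a binary task (made-up values).
toy_probs = [
    [0.99, 0.01],  # confident prediction, low entropy
    [0.55, 0.45],  # uncertain prediction, high entropy
    [0.98, 0.02],  # confident prediction, low entropy
    [0.60, 0.40],  # uncertain prediction, high entropy
]
kept = select_informative(toy_probs, keep_fraction=0.5)
```

With these toy values, `kept` holds the indices of the two uncertain samples. In practice the threshold, and whether entropy is computed on model outputs or on image content, would need to follow the full text of the paper.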


