Reliability-based cleaning of noisy training labels with inductive conformal prediction in multi-modal biomedical data mining

09/13/2023
by   Xianghao Zhan, et al.

Accurately labeling biomedical data presents a challenge. Traditional semi-supervised learning methods often under-utilize available unlabeled data. To address this, we propose a novel reliability-based training data cleaning method employing inductive conformal prediction (ICP). The method capitalizes on a small set of accurately labeled training data and leverages ICP-calculated reliability metrics to rectify mislabeled data and outliers within vast quantities of noisy training data. Its efficacy is validated across three classification tasks in distinct modalities: filtering drug-induced-liver-injury (DILI) literature by title and abstract, predicting ICU admission of COVID-19 patients from CT radiomics and electronic health records, and subtyping breast cancer using RNA-sequencing data. Varying levels of noise were introduced into the training labels through label permutation. Results show significant enhancements in classification performance: accuracy enhancement in 86 out of 96 DILI experiments (up to 11.4%), enhancements in all 48 COVID-19 experiments (up to 23.8%), and accuracy and macro-average F1 score improvements in 47 out of 48 RNA-sequencing experiments (up to 74.6%). The method can substantially boost classification performance in multi-modal biomedical machine learning tasks. Importantly, it accomplishes this without necessitating an excessive volume of meticulously curated training data.
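
To make the idea of ICP-based label cleaning concrete, the following is a minimal sketch and not the authors' implementation. It assumes a nearest-centroid nonconformity score, integer labels 0..K-1, and two hypothetical thresholds (`conf_thr`, `cred_thr`) for relabeling and outlier removal; the abstract does not specify the underlying classifier or the decision rules. The function names and the toy data generator are likewise illustrative only.

```python
import numpy as np

def nonconformity(X, labels, centroids):
    # Assumed score: distance from each sample to the centroid of the tested label.
    return np.linalg.norm(X - centroids[labels], axis=1)

def icp_p_values(X, centroids, cal_scores):
    # p-value per candidate label: share of calibration scores that are
    # at least as large as the sample's score (with the usual +1 correction).
    n, k = X.shape[0], centroids.shape[0]
    p = np.empty((n, k))
    for label in range(k):
        scores = nonconformity(X, np.full(n, label), centroids)
        ge = (cal_scores[None, :] >= scores[:, None]).sum(axis=1)
        p[:, label] = (ge + 1) / (cal_scores.size + 1)
    return p

def clean_labels(X_noisy, y_noisy, X_clean, y_clean,
                 conf_thr=0.9, cred_thr=0.05, seed=0):
    # Split the small clean set into proper-training and calibration halves.
    # Assumes every class appears in the proper-training half.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_clean))
    train, cal = idx[: len(idx) // 2], idx[len(idx) // 2:]
    classes = np.unique(y_clean)
    centroids = np.stack(
        [X_clean[train][y_clean[train] == c].mean(axis=0) for c in classes]
    )
    cal_scores = nonconformity(X_clean[cal], y_clean[cal], centroids)

    p = icp_p_values(X_noisy, centroids, cal_scores)
    pred = p.argmax(axis=1)
    sorted_p = np.sort(p, axis=1)
    credibility = sorted_p[:, -1]        # largest p-value
    confidence = 1.0 - sorted_p[:, -2]   # one minus second-largest p-value

    # Hypothetical rules: drop low-credibility samples as outliers,
    # relabel confident disagreements, leave everything else untouched.
    keep = credibility >= cred_thr
    relabel = keep & (pred != y_noisy) & (confidence >= conf_thr)
    y_out = y_noisy.copy()
    y_out[relabel] = pred[relabel]
    return y_out, keep

def make_toy_data(seed=0, n_clean=40, n_noisy=100, flip=0.3):
    # Two well-separated Gaussian classes; a fixed share of noisy labels is flipped.
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [6.0, 6.0]])
    y_clean = np.repeat([0, 1], n_clean)
    X_clean = centers[y_clean] + rng.normal(size=(2 * n_clean, 2))
    y_true = np.repeat([0, 1], n_noisy)
    X_noisy = centers[y_true] + rng.normal(size=(2 * n_noisy, 2))
    y_noisy = y_true.copy()
    flipped = rng.choice(y_true.size, size=int(flip * y_true.size), replace=False)
    y_noisy[flipped] = 1 - y_true[flipped]
    return X_clean, y_clean, X_noisy, y_noisy, y_true

if __name__ == "__main__":
    X_c, y_c, X_n, y_n, y_t = make_toy_data()
    y_fixed, keep = clean_labels(X_n, y_n, X_c, y_c)
    print("label accuracy before:", (y_n == y_t).mean())
    print("label accuracy after :", (y_fixed[keep] == y_t[keep]).mean())
    print("samples kept         :", int(keep.sum()), "of", keep.size)
```

In this sketch, the cleaned labels and the `keep` mask would then feed the downstream classifier's training set; the thresholds trade off how aggressively labels are rewritten against how many samples are discarded.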
