Survival Prediction from Imbalance colorectal cancer dataset using hybrid sampling methods and tree-based classifiers

09/04/2023
by   Sadegh Soleimani, et al.

Background and Objective: Colorectal cancer is a high-mortality cancer. Clinical data analysis plays a crucial role in predicting the survival of colorectal cancer patients, enabling clinicians to make informed treatment decisions. However, utilizing clinical data can be challenging, especially when dealing with imbalanced outcomes. This paper focuses on developing algorithms to predict 1-, 3-, and 5-year survival of colorectal cancer patients using clinical datasets, with particular emphasis on the highly imbalanced 1-year survival prediction task. To address this issue, we propose a method that creates a pipeline of standard balancing techniques to increase the true positive rate. Evaluation is conducted on a colorectal cancer dataset from the SEER database. Methods: The pre-processing step consists of removing records with missing values and merging categories. The minority class of 1-year and 3-year survival tasks consists of 10 respectively. Edited Nearest Neighbor (ENN), Repeated Edited Nearest Neighbor (RENN), Synthetic Minority Over-sampling Technique (SMOTE), and pipelines of SMOTE and RENN were used and compared for balancing the data with tree-based classifiers. The classifiers used in this article are Decision Trees, Random Forest, Extra Tree, eXtreme Gradient Boosting, and Light Gradient Boosting (LGBM). Results: The performance evaluation uses a 5-fold cross-validation approach. In the case of highly imbalanced datasets (1-year), our proposed method with LGBM outperforms other sampling methods with a sensitivity of 72.30 and LGBM achieves a sensitivity of 80.81 works best for highly imbalanced datasets. Conclusions: Our proposed method significantly improves mortality prediction for the minority class of colorectal cancer patients.
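
The hybrid balancing idea named in the abstract (oversampling with SMOTE, then cleaning with RENN) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation. The order of the two steps, the neighbour count `k=3`, the two-class setting, the toy Gaussian data, and all function names (`knn_indices`, `smote`, `enn`, `renn`) are assumptions made for illustration. The ENN step here applies Wilson's majority-vote rule to every class, which may differ from the variant used in the paper.

```python
import math
import random


def knn_indices(points, query, k, exclude=None):
    """Indices of the k points nearest to `query` (Euclidean), skipping `exclude`."""
    order = sorted(
        (i for i in range(len(points)) if i != exclude),
        key=lambda i: math.dist(points[i], query),
    )
    return order[:k]


def smote(X, y, minority, k=3, seed=0):
    """Add interpolated minority samples until both classes have equal size.

    Assumes a binary problem: every label other than `minority` is the majority.
    """
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_new = (len(y) - len(minority_pts)) - len(minority_pts)
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.randrange(len(minority_pts))
        j = rng.choice(knn_indices(minority_pts, minority_pts[i], k, exclude=i))
        gap = rng.random()  # position on the segment between the two points
        base, neigh = minority_pts[i], minority_pts[j]
        X_out.append(tuple(b + gap * (n - b) for b, n in zip(base, neigh)))
        y_out.append(minority)
    return X_out, y_out


def enn(X, y, k=3):
    """One ENN pass: keep a sample only if most of its k neighbours share its label."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        votes = [y[j] for j in knn_indices(X, x, k, exclude=i)]
        if votes.count(label) * 2 > len(votes):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def renn(X, y, k=3, max_iter=100):
    """Repeat ENN until a pass removes nothing (or max_iter passes have run)."""
    for _ in range(max_iter):
        X_new, y_new = enn(X, y, k)
        if len(y_new) == len(y):
            break
        X, y = X_new, y_new
    return X, y


# Toy data (hypothetical): 90 majority and 10 minority points in 2-D.
data_rng = random.Random(42)
X = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(90)]
X += [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(10)]
y = [0] * 90 + [1] * 10

X_bal, y_bal = smote(X, y, minority=1)    # oversample the minority class
X_clean, y_clean = renn(X_bal, y_bal)     # then remove ambiguous samples
print(len(y), len(y_bal), len(y_clean))
```

In practice the resampling would be fitted on the training folds only during cross-validation, so that synthetic or removed samples never influence the held-out fold used to measure sensitivity.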

research
04/13/2023

Supervised Machine Learning for Breast Cancer Risk Factors Analysis and Survival Prediction

The choice of the most effective treatment may eventually be influenced ...
research
03/24/2018

Balanced Random Survival Forests for Extremely Unbalanced, Right Censored Data

Accuracies of survival models for life expectancy prediction as well as ...
research
03/16/2023

On a fundamental problem in the analysis of cancer registry data

In epidemiology research with cancer registry data, it is often of prima...
research
03/07/2020

Large-scale benchmark study of survival prediction methods using multi-omics data

Multi-omics data, that is, datasets containing different types of high-d...
research
11/07/2022

Multimodal Learning for Non-small Cell Lung Cancer Prognosis

This paper focuses on the task of survival time analysis for lung cancer...
research
10/29/2020

Limitations of ROC on Imbalanced Data: Evaluation of LVAD Mortality Risk Scores

Objective: This study illustrates the ambiguity of ROC in evaluating two...
research
07/05/2013

Supervised Learning and Anti-learning of Colorectal Cancer Classes and Survival Rates from Cellular Biology Parameters

In this paper, we describe a dataset relating to cellular and physical c...
