Variable selection in high-dimensional logistic regression models using a whitening approach

06/29/2022
by Wencan Zhu, et al.

In bioinformatics, the rapid development of sequencing technology has made it possible to collect an increasing amount of omics data. Classification based on omics data is one of the central problems in biomedical research. However, omics data usually have a limited sample size but a high feature dimension, and it is assumed that only a few features (biomarkers) are active, i.e. informative for discriminating between categories (for example, cancer subtypes or responders versus non-responders to treatment). Identifying active biomarkers for classification has therefore become fundamental to omics data analysis. Focusing on binary classification, we propose a new feature selection method designed to handle high correlations between biomarkers. Several studies have shown that correlated biomarkers degrade selection and make it difficult to identify the active ones accurately. Our method, WLogit, consists of whitening the design matrix to remove the correlations between biomarkers, then using a penalized criterion adapted to the logistic regression model to select features. The performance of WLogit is assessed on synthetic data in several scenarios and compared with other approaches. The results suggest that WLogit can identify almost all active biomarkers even when the biomarkers are highly correlated, whereas the other methods fail to do so, and that this leads to higher classification accuracy. The performance is also evaluated on the classification of two lymphoma subtypes, where the resulting classifier also outperforms the other methods. Our method is implemented in an R package available from the Comprehensive R Archive Network (CRAN).
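
The two steps named in the abstract (whitening the design matrix, then fitting a penalized logistic criterion) can be illustrated with a minimal Python sketch. This is not the authors' R implementation, and several choices here are assumptions made only for illustration: the whitening matrix is the inverse square root of the ridge-regularized sample covariance, the penalty is a plain L1 penalty solved by proximal gradient descent, the final selection keeps the `k` largest back-transformed coefficients, and the simulated data use fewer features than samples. The names `whiten`, `l1_logistic`, `ridge`, `lam` and `k` are hypothetical.

```python
import numpy as np


def whiten(X, ridge=1e-6):
    """Center X and decorrelate its columns.

    Returns (X_white, W) with X_white = Xc @ W and W = (S + ridge*I)^(-1/2),
    where S is the sample covariance. The ridge term is an assumption that
    keeps the inverse square root well defined.
    """
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(S + ridge * np.eye(X.shape[1]))
    W = (vecs * vals ** -0.5) @ vecs.T  # V diag(vals^-1/2) V^T
    return Xc @ W, W


def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def l1_logistic(X, y, lam, n_iter=2000):
    """Proximal gradient for mean logistic loss + lam * ||beta||_1.

    The intercept b0 is not penalized.
    """
    n, p = X.shape
    b0, beta = 0.0, np.zeros(p)
    # Step 1/L, with L bounded by (n + ||X||_2^2) / (4n) for the mean logistic loss.
    step = 4.0 * n / (n + np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        eta = np.clip(b0 + X @ beta, -30.0, 30.0)
        resid = 1.0 / (1.0 + np.exp(-eta)) - y
        grad = X.T @ resid / n
        b0 -= step * resid.mean()
        beta = soft_threshold(beta - step * grad, step * lam)
    return b0, beta


# Simulated data (assumption): AR(1)-correlated features, three active ones.
rng = np.random.default_rng(0)
n, p, rho = 200, 20, 0.9
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
beta_true = np.zeros(p)
beta_true[[0, 5, 10]] = [2.0, -2.0, 2.0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# Step 1: whiten. Step 2: penalized logistic fit on the whitened design.
X_white, W = whiten(X)
b0, beta_tilde = l1_logistic(X_white, y, lam=0.02)

# Map coefficients back to the original features: Xc @ beta = X_white @ beta_tilde
# when beta = W @ beta_tilde. The back-transformed vector is generally not sparse,
# so the top-k rule below is an illustrative assumption.
beta_hat = W @ beta_tilde
k = 3
selected = np.sort(np.argsort(-np.abs(beta_hat))[:k])
print("selected feature indices:", selected)
```

Because the penalty is applied in the whitened coordinates, sparsity holds for `beta_tilde` rather than for `beta_hat`, which is why a separate selection rule is needed after mapping back. In a truly high-dimensional setting (more features than samples) the sample covariance is singular, so a different covariance estimator would be needed than the one assumed here.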


