Optirank: classification for RNA-Seq data with optimal ranking reference genes

01/11/2023
by Paola Malsot, et al.

Classification algorithms that take RNA-Sequencing (RNA-Seq) data as input are used in a variety of biological applications. By nature, RNA-Seq data is subject to uncontrolled fluctuations both within and especially across datasets, which makes it difficult for a trained classifier to generalize to an external dataset. Replacing raw gene counts with the rank of each gene count within an observation has proven effective at mitigating this problem. However, the rank of a feature is by definition relative to all other features, including highly variable features that introduce noise into the ranking. To address this problem and obtain more robust ranks, we propose a logistic regression model, optirank, which simultaneously learns the parameters of the model and the genes to use as a reference set in the ranking. We show the effectiveness of this method on simulated data. We also consider real classification tasks, which present different kinds of distribution shift between train and test data. These tasks cover a variety of applications, such as classification of cancers of unknown primary, identification of specific gene signatures, and determination of cell type in single-cell RNA-Seq datasets. On these real tasks, optirank performs at least as well as vanilla logistic regression on classical ranks, while producing sparser solutions. In addition, to increase robustness against dataset shifts, we propose a multi-source learning scheme and demonstrate its effectiveness when used in combination with rank-based classifiers.
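
To make the idea of ranking against a reference set concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function name `ranks_relative_to_reference`, the toy counts, and the specific definition used (the rank of a gene is the number of reference genes with a strictly smaller count) are assumptions made for illustration. In optirank the reference set is learned jointly with the classifier, whereas here it is fixed by hand.

```python
import numpy as np


def ranks_relative_to_reference(x, ref_mask):
    """Rank every gene against a reference subset of genes.

    Assumed definition (illustrative only): the rank of gene j is the
    number of reference genes whose count is strictly smaller than x[j].
    With ref_mask all True this reduces to a classical rank (ties aside).
    """
    x = np.asarray(x, dtype=float)
    ref_values = x[np.asarray(ref_mask, dtype=bool)]
    # Compare each gene (rows) with each reference gene (columns).
    return (ref_values[None, :] < x[:, None]).sum(axis=1)


# Toy observation with four genes; gene 2 is highly variable and takes
# very different values in two otherwise identical samples.
sample_a = [5, 1, 0, 3]
sample_b = [5, 1, 100, 3]

all_genes = [True, True, True, True]        # classical ranking
reference = [True, True, False, True]       # hand-picked: excludes noisy gene 2

full_a = ranks_relative_to_reference(sample_a, all_genes)
full_b = ranks_relative_to_reference(sample_b, all_genes)
ref_a = ranks_relative_to_reference(sample_a, reference)
ref_b = ranks_relative_to_reference(sample_b, reference)

stable = [0, 1, 3]  # indices of the genes that did not change
print("classical ranks of stable genes:", full_a[stable], full_b[stable])
print("reference ranks of stable genes:", ref_a[stable], ref_b[stable])
```

In this toy case the classical ranks of the three unchanged genes shift when the noisy gene fluctuates, while their ranks relative to the reference set stay the same, which is the kind of robustness the abstract describes.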


