Variable selection and missing data imputation in categorical genomic data analysis by integrated ridge regression and random forest

11/10/2021
by Siru Wang, et al.

Genomic data arising from a genome-wide association study (GWAS) are often not only large-scale but also incomplete. A specific form of this incompleteness is missing values with a non-ignorable missingness mechanism. The intrinsic complications of genomic data present significant challenges in developing an unbiased and informative procedure for phenotype-genotype association analysis by a statistical variable selection approach. In this paper we develop a coherent procedure for categorical phenotype-genotype association analysis, in the presence of missing values with a non-ignorable missingness mechanism in GWAS data, by integrating state-of-the-art methods: random forest for variable selection, weighted ridge regression with the EM algorithm for missing data imputation, and linear statistical hypothesis testing for determining the missingness mechanism. Two simulated GWAS are used to validate the performance of the proposed procedure. The procedure is then applied to analyze a real data set from a breast cancer GWAS.
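To make the term "non-ignorable missingness mechanism" concrete, the following is a minimal sketch, not the authors' procedure. It simulates genotypes coded 0/1/2 and compares two masking schemes: one where values go missing completely at random, and one where the chance of being missing depends on the unobserved genotype itself. The minor-allele frequency, the missingness probabilities, and all function names are assumptions chosen for illustration only.

```python
import random


def simulate_genotypes(n, maf, rng):
    """Draw n genotypes coded 0/1/2 (minor-allele counts), two independent alleles each."""
    return [int(rng.random() < maf) + int(rng.random() < maf) for _ in range(n)]


def mask_at_random(genotypes, p_missing, rng):
    """Ignorable case: every value has the same chance of being missing."""
    return [None if rng.random() < p_missing else g for g in genotypes]


def mask_non_ignorable(genotypes, p_missing_by_value, rng):
    """Non-ignorable case: the chance of being missing depends on the value itself."""
    return [None if rng.random() < p_missing_by_value[g] else g for g in genotypes]


def observed_mean(values):
    """Mean of the observed (non-missing) entries only."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen)


rng = random.Random(0)  # fixed seed so the run is reproducible
truth = simulate_genotypes(20000, maf=0.3, rng=rng)
true_mean = sum(truth) / len(truth)

# Assumed missingness rates (hypothetical, for illustration only).
random_masked = mask_at_random(truth, 0.3, rng)
value_masked = mask_non_ignorable(truth, {0: 0.05, 1: 0.30, 2: 0.60}, rng)

print("mean of complete data:           ", round(true_mean, 3))
print("observed mean, random masking:   ", round(observed_mean(random_masked), 3))
print("observed mean, value-dependent:  ", round(observed_mean(value_masked), 3))
```

Under these assumed rates, the observed mean stays close to the complete-data mean when masking is random, but it is pulled downward when carriers of the minor allele are more likely to be missing. This is the kind of bias that motivates testing the missingness mechanism before imputing.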

research
05/04/2011

MissForest - nonparametric missing value imputation for mixed-type data

Modern data acquisition based on high-throughput technology is often fac...
research
02/25/2022

Flexible variable selection in the presence of missing data

In many applications, it is of interest to identify a parsimonious set o...
research
08/10/2017

Contextuality from missing and versioned data

Traditionally categorical data analysis (e.g. generalized linear models)...
research
10/20/2022

Adaptive greedy forward variable selection for linear regression models with incomplete data using multiple imputation

Variable selection is crucial for sparse modeling in this age of big dat...
research
07/20/2021

Strategies for variable selection in large-scale healthcare database studies with missing covariate and outcome data

Prior work has shown that combining bootstrap imputation with tree-based...
research
03/29/2023

Correcting for Selection Bias and Missing Response in Regression using Privileged Information

When estimating a regression model, we might have data where some labels...
research
03/25/2020

Missing at Random or Not: A Semiparametric Testing Approach

Practical problems with missing data are common, and statistical methods...
