Statistical Integration of Heterogeneous Data with PO2PLS

03/24/2021
by Said el Bouhaddani, et al.

The availability of multi-omics data has revolutionized the life sciences by creating avenues for integrated system-level approaches. Data integration links the information across datasets to better understand the underlying biological processes. However, high dimensionality, correlations and heterogeneity pose statistical and computational challenges. We propose a general framework, probabilistic two-way partial least squares (PO2PLS), which addresses these challenges. PO2PLS models the relationship between two datasets using joint and data-specific latent variables. For maximum likelihood estimation of the parameters, we implement a fast EM algorithm and show that the estimator is asymptotically normally distributed. We also propose a global test for the relationship between the two datasets and derive its asymptotic distribution. Notably, several existing omics integration methods are special cases of PO2PLS. Via extensive simulations, we show that PO2PLS outperforms alternatives in feature selection and prediction. The simulations also suggest that the asymptotic distribution holds when the sample size is sufficiently large. We illustrate PO2PLS with two examples from commonly used study designs: a large population cohort and a small case-control study. Besides recovering known relationships, PO2PLS also yielded novel findings. The methods are implemented in our R package PO2PLS. Supplementary materials for this article are available online.
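To make the idea of joint and data-specific latent variables concrete, the following is a minimal simulation sketch in Python. It is not the PO2PLS R package and not the estimation procedure from the paper. The function name `simulate_two_block`, the symbols (`W`, `C`, `Wo`, `Co`, `B`), the orthonormal loadings, and the Gaussian noise are all assumptions chosen for illustration. It only generates two datasets whose shared signal comes from correlated joint components, with an extra component specific to each dataset.

```python
import numpy as np

def simulate_two_block(n=200, p=30, q=20, r=2, rx=1, ry=1, noise=0.1, seed=0):
    """Simulate two datasets X (n x p) and Y (n x q) that share r joint
    latent components and have rx / ry dataset-specific components.
    All names and distributional choices are illustrative assumptions."""
    rng = np.random.default_rng(seed)

    def orth(rows, cols):
        # Loading matrix with orthonormal columns, via reduced QR.
        qmat, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
        return qmat

    W, C = orth(p, r), orth(q, r)        # joint loadings for X and Y
    Wo, Co = orth(p, rx), orth(q, ry)    # dataset-specific loadings

    T = rng.standard_normal((n, r))      # joint latent variables of X
    B = np.diag(np.linspace(1.0, 0.5, r))  # assumed diagonal link between T and U
    U = T @ B + noise * rng.standard_normal((n, r))  # joint latent variables of Y

    To = rng.standard_normal((n, rx))    # latent variables specific to X
    Uo = rng.standard_normal((n, ry))    # latent variables specific to Y

    X = T @ W.T + To @ Wo.T + noise * rng.standard_normal((n, p))
    Y = U @ C.T + Uo @ Co.T + noise * rng.standard_normal((n, q))
    return X, Y, T, U

X, Y, T, U = simulate_two_block()
# The joint components are strongly correlated across the two datasets,
# while the specific components contribute variation unique to each one.
joint_corr = [np.corrcoef(T[:, k], U[:, k])[0, 1] for k in range(T.shape[1])]
```

In this sketch the relationship between the datasets lives entirely in the link between `T` and `U`; setting `B` to zero would produce two unrelated datasets, which is the kind of null situation a global test of association would target.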


