Modeling High-Dimensional Data with Case-Control Sampling and Dependency Structures

01/11/2018
by Omer Weissbrod, et al.

Modern data sets in various domains often include units that were sampled non-randomly from the population and have a complex latent correlation structure. Here we investigate a common form of this setting, where every unit is associated with a latent variable, all latent variables are correlated, and the probability of sampling a unit depends on its response. Such settings often arise in case-control studies, where the sampled units are correlated due to spatial proximity, family relations, or other sources of relatedness. Maximum likelihood estimation in such settings is challenging from both a computational and a statistical perspective, necessitating approximation techniques that take the sampling scheme into account. We propose a family of approximate likelihood approaches by combining state-of-the-art methods from statistics and machine learning, including composite likelihood, expectation propagation, and generalized estimating equations. We demonstrate the efficacy of our proposed approaches via extensive simulations. We then use them to investigate the genetic architecture of several complex disorders, using data collected in case-control genetic association studies, where hundreds of thousands of genetic variants are measured for every individual, and the underlying disease liabilities of individuals are correlated due to genetic similarity. Our work is the first to provide a tractable likelihood-based solution for case-control data with complex dependency structures.
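
To make the setting concrete, the following is a minimal simulation sketch of case-control sampling with correlated latent variables. It is not the authors' method or code. It assumes a liability-threshold model (a unit is a case when its latent liability exceeds a cutoff) and a simple block correlation structure standing in for family relatedness. The function name `simulate_case_control` and all parameters (`n_pop`, `prevalence`, `rho`, `block`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_case_control(n_pop=2000, n_cases=100, n_controls=100,
                          prevalence=0.1, rho=0.3, block=4, seed=0):
    """Illustrative sketch (assumed model, not taken from the paper)."""
    rng = np.random.default_rng(seed)
    n_blocks = n_pop // block
    n = n_blocks * block

    # Latent liabilities: a shared block effect plus independent noise gives
    # unit variance and within-block correlation rho (assumed structure).
    shared = np.repeat(rng.standard_normal(n_blocks), block)
    liability = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * rng.standard_normal(n)

    # Binary response: units above the (1 - prevalence) quantile are cases.
    threshold = np.quantile(liability, 1.0 - prevalence)
    y = (liability > threshold).astype(int)

    # Case-control sampling: the chance of being sampled depends on the
    # response, so cases are heavily over-represented in the sample.
    cases = rng.choice(np.flatnonzero(y == 1), n_cases, replace=False)
    controls = rng.choice(np.flatnonzero(y == 0), n_controls, replace=False)
    idx = np.concatenate([cases, controls])

    return liability[idx], y[idx], y.mean()

if __name__ == "__main__":
    liab, y, pop_prev = simulate_case_control()
    print("population prevalence:", pop_prev)
    print("sample case fraction:", y.mean())
    print("mean sampled liability:", liab.mean())
```

Under these assumptions, the sampled liabilities no longer follow the population distribution: their mean shifts upward because cases are over-sampled. This is one reason a likelihood that ignores the sampling scheme can be misleading.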

