Haplotype frequency inference from pooled genetic data with a latent multinomial model

08/31/2023
by Yong See Foo, et al.

In genetic studies, haplotype data provide more refined information than data about separate genetic markers. However, large-scale studies that genotype hundreds to thousands of individuals may report only pooled data, in which only the total allele count of each marker in each pool is available. Existing methods for inferring haplotype frequencies from pooled genetic data that scale well with pool size rely on a normal approximation, which we observe to produce unreliable inference when applied to real data. We illustrate cases where the approximation breaks down because the covariance matrix of the normal approximation is near-singular. As an alternative to approximate methods, we propose exact methods to infer haplotype frequencies from pooled genetic data based on a latent multinomial model, in which the observed allele counts are treated as integer combinations of latent, unobserved haplotype counts. One of our methods, latent count sampling via Markov bases, achieves approximately linear runtime with respect to pool size. Our exact methods produce more accurate inference than existing approximate methods on synthetic data and on data based on haplotype information from the 1000 Genomes Project. We also demonstrate how our methods can be applied to time series of pooled genetic data, as a proof of concept of their relevance to more complex hierarchical settings, such as spatiotemporal models.
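
To make the latent multinomial setup concrete, the following is a minimal sketch, not the authors' implementation. It assumes biallelic markers coded 0/1, so that each haplotype is a binary tuple and the observed allele counts are a 0/1 matrix applied to the latent haplotype counts. All function names here are hypothetical, and the brute-force enumeration is only an illustration for tiny pools; it is not the Markov basis sampler described in the abstract.

```python
import itertools
import random


def haplotypes(num_markers):
    """All binary haplotypes over num_markers biallelic markers (assumed 0/1 coding)."""
    return list(itertools.product((0, 1), repeat=num_markers))


def configuration_matrix(haps):
    """Rows index markers, columns index haplotypes; entry is the allele carried."""
    num_markers = len(haps[0])
    return [[h[m] for h in haps] for m in range(num_markers)]


def sample_latent_counts(pool_size, freqs, rng):
    """Draw multinomial haplotype counts by sampling pool_size haplotypes."""
    counts = [0] * len(freqs)
    for idx in rng.choices(range(len(freqs)), weights=freqs, k=pool_size):
        counts[idx] += 1
    return counts


def allele_counts(A, z):
    """Observed pooled data: integer combination y = A z of the latent counts."""
    return [sum(a * c for a, c in zip(row, z)) for row in A]


def compositions(total, parts):
    """All ways to split `total` into `parts` non-negative integers."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def compatible_latent_counts(A, y, pool_size):
    """Brute-force list of latent count vectors consistent with the observed y."""
    num_haps = len(A[0])
    return [z for z in compositions(pool_size, num_haps)
            if allele_counts(A, z) == list(y)]


haps = haplotypes(2)                      # (0,0), (0,1), (1,0), (1,1)
A = configuration_matrix(haps)            # [[0, 0, 1, 1], [0, 1, 0, 1]]

rng = random.Random(0)
z = sample_latent_counts(10, [0.4, 0.1, 0.1, 0.4], rng)  # latent, unobserved
y = allele_counts(A, z)                                  # what a pooled study reports

# Several latent count vectors can explain the same pooled observation:
print(compatible_latent_counts(A, [2, 2], 4))
# -> [(0, 2, 2, 0), (1, 1, 1, 1), (2, 0, 0, 2)]
```

The final line shows why exact inference has to account for the latent counts: with two markers and a pool of four haplotypes, the allele counts (2, 2) are consistent with three different haplotype count vectors, which imply very different haplotype frequencies.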


research
06/08/2021

A Unified Approach to Robust Inference for Genetic Covariance

Genome-wide association studies (GWAS) have identified thousands of gene...
research
05/09/2020

Time Varying Markov Process with Partially Observed Aggregate Data; An Application to Coronavirus

A major difficulty in the analysis of propagation of the coronavirus is ...
research
03/27/2019

Bayesian Multinomial Logistic Normal Models through Marginally Latent Matrix-T Processes

Bayesian multinomial logistic-normal (MLN) models are popular for the an...
research
11/02/2016

A nonparametric HMM for genetic imputation and coalescent inference

Genetic sequence data are well described by hidden Markov models (HMMs) ...
research
10/29/2017

A Fast, Accurate Two-Step Linear Mixed Model for Genetic Analysis Applied to Repeat MRI Measurements

Large-scale biobanks are being collected around the world in efforts to ...
research
01/11/2018

Modeling High-Dimensional Data with Case-Control Sampling and Dependency Structures

Modern data sets in various domains often include units that were sample...
research
10/27/2021

Poisson PCA for matrix count data

We develop a dimension reduction framework for data consisting of matric...
