ERStruct: An Eigenvalue Ratio Approach to Inferring Population Structure from Sequencing Data

04/05/2021
by Yuyang Xu, et al.

Inference of population structure from genetic data plays an important role in population and medical genetics studies. The traditional EIGENSTRAT method has been widely used for computing and selecting the top principal components that capture population structure information (Price et al., 2006). With the advancement and decreasing cost of sequencing technology, whole-genome sequencing data provide much richer information about the underlying population structure. However, the EIGENSTRAT method was originally developed for analyzing array-based genotype data and thus may not perform well on sequencing data, for two reasons. First, in sequencing data the number of genetic variants p is much larger than the sample size n, so the sample-to-marker ratio n/p is nearly zero, which violates the assumption of the Tracy-Widom test used in the EIGENSTRAT method. Second, the EIGENSTRAT method might not handle linkage disequilibrium (LD) in sequencing data well. To resolve these two issues, we propose a new statistical method called ERStruct to estimate the number of sub-populations from sequencing data. We use the ratio of successive eigenvalues as a more robust test statistic, and we approximate the null distribution of this test statistic using modern random matrix theory. Simulation studies found that ERStruct performs better than the traditional Tracy-Widom test on sequencing data. We further illustrate ERStruct using the sequencing data set from the 1000 Genomes Project. ERStruct is also implemented as a MATLAB toolbox, which is publicly available on GitHub: https://github.com/bglvly/ERStruct.
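To make the idea of a successive-eigenvalue ratio concrete, the following is a minimal Python sketch; it is not the ERStruct toolbox and does not reproduce the paper's test. Several things here are assumptions introduced for illustration only: the function names, the Balding-Nichols style simulation of sub-populations, the per-marker standardization, and the "sharpest drop" rule used to pick the number of sub-populations. The abstract states that ERStruct instead compares the ratio against a null distribution approximated with random matrix theory, which this sketch does not attempt.

```python
import numpy as np


def simulate_genotypes(n_per_pop=50, n_pops=3, n_markers=5000, fst=0.1, seed=0):
    """Simulate an (n x p) genotype matrix with entries in {0, 1, 2}.

    Assumption: a Balding-Nichols style model, chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    ancestral = rng.uniform(0.1, 0.9, size=n_markers)
    a = ancestral * (1.0 - fst) / fst
    b = (1.0 - ancestral) * (1.0 - fst) / fst
    # One allele-frequency vector per sub-population.
    freqs = rng.beta(a, b, size=(n_pops, n_markers))
    blocks = [rng.binomial(2, freqs[k], size=(n_per_pop, n_markers))
              for k in range(n_pops)]
    return np.vstack(blocks)


def eigenvalue_ratios(genotypes):
    """Return decreasing eigenvalues of the n x n sample matrix and the
    ratios of successive eigenvalues, ratios[i] = eig[i + 1] / eig[i]."""
    x = np.asarray(genotypes, dtype=float)
    x = x - x.mean(axis=0)            # center each marker
    sd = x.std(axis=0)
    keep = sd > 0                     # drop markers with no variation
    x = x[:, keep] / sd[keep]         # standardize each remaining marker
    gram = x @ x.T / x.shape[1]       # n x n, cheap when n << p
    eig = np.linalg.eigvalsh(gram)[::-1]
    eig = eig[:-1]                    # centering forces one eigenvalue to ~0
    return eig, eig[1:] / eig[:-1]


def estimate_k_by_sharpest_drop(ratios, max_k=10):
    """Hypothetical heuristic (NOT the paper's test): take the smallest of the
    first max_k ratios. A sharp drop at index i suggests i + 1 large
    eigenvalues, i.e. i + 2 sub-populations."""
    i = int(np.argmin(ratios[:max_k]))
    return i + 2


if __name__ == "__main__":
    G = simulate_genotypes()
    eig, ratios = eigenvalue_ratios(G)
    print("leading ratios:", np.round(ratios[:5], 3))
    print("estimated number of sub-populations:",
          estimate_k_by_sharpest_drop(ratios))
```

In this simulated setting, with K well-separated sub-populations, roughly K - 1 eigenvalues stand clear of the rest, so the ratio just after them is much smaller than its neighbours. The argmin rule is only a stand-in: it always reports at least two sub-populations and gives no control of the error rate, which is the gap a proper null distribution for the ratio is meant to fill.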


