ERStruct: An Eigenvalue Ratio Approach to Inferring Population Structure from Sequencing Data

04/05/2021 ∙ by Yuyang Xu, et al. ∙ The University of Hong Kong 0

Inference of population structure from genetic data plays an important role in population and medical genetics studies. The traditional EIGENSTRAT method has been widely used for computing and selecting top principal components that capture population structure information (Price et al., 2006). With the advancement and decreasing cost of sequencing technology, whole-genome sequencing data provide much richer information about the underlying population structures. However, the EIGENSTRAT method was originally developed for analyzing array-based genotype data and thus may not perform well on sequencing data for two reasons. First, the number of genetic variants p is much larger than the sample size n in sequencing data such that the sample-to-marker ratio n/p is nearly zero, violating the assumption of the Tracy-Widom test used in the EIGENSTRAT method. Second, the EIGENSTRAT method might not be able to handle the linkage disequilibrium (LD) well in sequencing data. To resolve those two critical issues, we propose a new statistical method called ERStruct to estimate the number of sub-populations based on sequencing data. We propose to use the ratio of successive eigenvalues as a more robust testing statistic, and then we approximate the null distribution of our proposed test statistic using modern random matrix theory. Simulation studies found that our proposed ERStruct method has improved performance compared to the traditional Tracy-Widom test on sequencing data. We further illustrate our ERStruct method using the sequencing data set from the 1000 Genomes Project. We also implemented our ERStruct in a MATLAB toolbox which is now publicly available on github: https://github.com/bglvly/ERStruct.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Inference of population structure is a fundamentally important problem in population genetics and also plays a critical role in genetic association studies (Novembre et al., 2008; Skoglund et al., 2017; Cao et al., 2020)

. Principal component analysis (PCA) based method has been popularized to capture the population structures from array-based genotype data

(Menozzi, Piazza and Cavalli-Sforza, 1978; Patterson, Price and Reich, 2006; Reich, Price and Patterson, 2008). PCA works by selecting top principal components (PCs) that can sufficiently capture population structures, and then those top PCs can be further used to correct for population stratification bias in genetic association studies, for example, using the popular EIGENSTRAT method (Price et al., 2006).

A natural and necessary question when applying PCA on genetic data is to determine the number of PCs that can sufficiently capture the underlying unknown population structure. Patterson, Price and Reich (2006) developed the so-called sequential Tracy–Widom test using the “effective number of markers” as a plug-in estimate of the original number of markers in the data set. The effective number of markers is used to estimate the number of the underlying uncorrelated markers so that the number of markers can be effectively reduced. With the use of effective number of markers, Patterson, Price and Reich (2006) tried to alleviate the possible violation of the assumption in the Tracy–Widom test that the sample size and the number of genetic markers should be comparably large. We will refer to this method as the PCA-TW test throughout this paper. However, after being applied in various empirical studies for several years, it has been found that the PCA-TW test might not perform well for capturing the true population structure in sequencing data (Song, Hao and Storey, 2015; Prokopenko et al., 2015; Zhang, Dey and Lee, 2020). This is because the traditional PCA-TW test was originally developed for array-based genotype data sets that typically contain a moderate number of genetic markers, while the number of genetic markers is much larger in sequencing data. As large-scale sequencing data sets become increasingly available, we need to develop a new method that can resolve the following two issues:

2 Method

Suppose that one is interested in estimating the number of sub-populations based on a raw -by- genotype matrix which contains genetic markers from individuals. Each entry represents the raw count of the minor alleles for the genetic marker on the individual . Assume that there are unknown (or latent) sub-populations and the th sub-population is of size , where and . To refer to sub-population specific individuals, the index of individual is written as follows:

where indexed the latent sub-populations and indexed the individuals in the th sub-population. Using this notation, the raw count matrix can be rewritten as

where is a

-dimensional vector containing information of the genetic variants for the

th individual. We consider the following asymptotic regime throughout this paper.

Asymptotic Regime.

This asymptotic regime is reasonable in sequencing data, where the number of markers is far greater than the sample size .

2.1 The ANOVA Model

Our model builds on the key fact that individuals from different sub-populations have different minor allele frequencies (MAF) and individuals from the same sub-population have the same MAF. This observation motivates us to model the raw minor allele count data matrix using the following analysis of variance (ANOVA) model,

(1)

where the -dimensional vectors are the sub-population-specific mean counts of minor alleles across latent sub-populations, and the vectors are independent and identically distributed noise vectors with mean zeros and covariance . We emphasize here that even though we use this ANOVA model for the raw minor allele count matrix, however our model differs from the classical ANOVA model in three ways. First, we do not know the total number of sub-populations a priori and which individual belongs to which sub-population. Second, the sample to marker ratio is nearly zero. Third, our model allows for the presence of different LD patterns. Those three salient features of our ANOVA model thus require modern random matrix theory to understand the sources of the variation in the sequencing data.

Following Patterson, Price and Reich (2006), we normalize the raw count matrix such that each column (genetic marker) has mean zero and unit variance. The estimates of the th sub-population-specific mean vector and the estimate of the overall mean vector are given by

(2)

We also need the following diagonal matrix with diagonal elements equal to the standard deviations of the

genetic markers
Then the normalized genotype data matrix is given by
(3)

Define the sample covariance matrix of the normalized data as . Then, we have the following standard ANOVA decomposition

where the between-group variation is given by
and the within-group variation is given by
(4)

In the next section, we will perform spectral analysis for using modern random matrix theory.

2.2 The Spikes and the Bulk

It can be seen from Equation 4 that the within-group covariance matrix is essentially the noise covariance matrix. The between-group covariance matrix is of rank because it is the sum of rank one matrices and the between-group vectors satisfy the following linear constrain:

Hence the matrix can be viewed as a rank perturbation of the sample noise covariance matrix . According to the finite-rank perturbation theory (Benaych-Georges and Nadakuditi, 2011; Benaych-Georges, Guionnet and Maida, 2011), under the assumption that , the non-zero ordered sample eigenvalues of matrix can be separated into two sets by relating them to the eigenvalues of either or (graphically illustrated in Figure 1):

3 Simulation Studies

In this section, we compare the performances of our new method ERStruct versus the traditional PCA-TW test based on their final estimated number of sub-populations. We consider two different settings depending on whether or not the LD is present.

4 Real Data Analysis

In this section, we apply our proposed ERStruct to the 1000 Genomes Project sequencing data to estimate the number of sub-populations for illustration. We also include the PCA-TW method for comparison purpose. The 1000 Genomes Project was established in January 2008 with the aim to build by then the most detailed catalogue of genetic variations in the human population. In 2015, the final phase of the project was completed (1000 Genomes Project Consortium, 2015; Sudmant et al., 2015). Here, we analyze the data from the 1000 Genomes Project, phase 3 v5a, which is publicly accessible at https://www.internationalgenome.org/data. In this data set, there are individuals from different sub-populations (as shown in Table 2).

Sub-population Sample size
ACB (African Caribbean in Barbados) 96
ASW (African Ancestry in Southwest US) 61
BEB (Bengali in Bangladesh) 86
CDX (Chinese Dai in Xishuangbanna, China) 93
CEU (Utah residents with Northern and Western European ancestry) 99
CHB (Han Chinese in Beijing, China) 103
CHS (Southern Han Chinese, China) 105
CLM (Colombian in Medellin, Colombia) 94
ESN (Esan in Nigeria) 99
FIN (Finnish in Finland) 99
GBR (British in England and Scotland) 91
GIH (Gujarati Indian in Houston, TX) 103
GWD (Gambian in Western Division, The Gambia) 113
IBS (Iberian populations in Spain) 107
ITU (Indian Telugu in the UK) 102
JPT (Japanese in Tokyo, Japan) 104
KHV (Kinh in Ho Chi Minh City, Vietnam) 99
LWK (Luhya in Webuye, Kenya) 99
MSL (Mende in Sierra Leone) 85
MXL (Mexican Ancestry in Los Angeles, California) 64
PEL (Peruvian in Lima, Peru) 85
PJL (Punjabi in Lahore, Pakistan) 96
PUR (Puerto Rican in Puerto Rico) 104
STU (Sri Lankan Tamil in the UK) 102
TSI (Toscani in Italy) 107
YRI (Yoruba in Ibadan, Nigeria) 108
Table 2: Geographical information of the 1000 Genomes Project data.

The raw data is pre-processed as follows. We first filter extremely rare genetic variants and retain genetic markers with MAF greater than using the PLINK software (Purcell et al., 2007). We applied our ERStruct and the traditional PCA-TW methods to this data set and perform the following three analyses.

5 Discussions

The traditional PCA-TW test was originally developed for array-based genotype data set where the number of genetic markers is much smaller than that from modern sequencing data (Patterson, Price and Reich, 2006). Hence, the PCA-TW method has been found to perform not well when the sample to marker ratio is nearly zero as in the sequencing data. Our proposed method ERStruct has been shown to have improved performance over the traditional PCA-TW test in both simulated and real sequencing data.

Our ERStruct method enjoys several advantages. First, our ERStruct is based on the more robust eigenvalue ratios rather than the original eigenvalues for inference as it preserves the scale invariant property. Second, we obtain the more accurate null distribution for the ER test statistic from the most recent random matrix theory. Third, our method is not confined to a specific LD structure among genetic markers. Even though in the real genetic data with complicated LD structures, our ERStruct can still separate the spikes from the bulk. Fourth, our ERStruct method is also computationally efficient. In fact, our ERStruct achieves almost the same computational speed as the PCA-TW test, given that . For example, it took around 40 minutes for analyzing the whole 1000 Genomes data with 126G RAM and 5 cores of CPU using our ERStruct MATLAB toolbox.

Our proposed ERStruct method can also be used for inferring latent structures in other types of ultra-dimensional data. However, both our proposed ERStruct method and the PCA-TW test assume that the raw data is complete, that is, there are no missing values. But in single-cell sequencing data, it is common that most of the entries in the data matrix are missing. Inference of latent structures in such sparse data matrix is still very challenging because a large proportion of missing values will seriously impact the null distribution of the ER test statistic. One possible solution is to impute those missing values as described by

Aparicio et al. (2020). We will leave this as future work by extending our ERStruct method for sparse data matrices.

References

  • Aparicio et al. (2020) [author] Aparicio, LuisL., Bordyuh, MykolaM., Blumberg, Andrew J.A. J. Rabadan, RaulR. (2020). A Random Matrix Theory Approach to Denoise Single-Cell Data. Patterns 1 100035. https://doi.org/10.1016/j.patter.2020.100035
  • Arnold (1971)

    [author] Arnold, LudwigL. (1971). On Wigner’s semicircle law for the eigenvalues of random matrices. Probability Theory and Related Fields 19 191–198.

  • Bai and Yao (2008)

    [author] Bai, ZhidongZ. Yao, JianfengJ. (2008). Central limit theorems for eigenvalues in a spiked population model. Annales de l’IHP Probabilités et Statistiques 44 447–474.

  • Baik, Arous and Péché (2005)

    [author] Baik, JinhoJ., Arous, Gérard BenG. B. Péché, SandrineS. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability 33 1643–1697.

  • Baik and Silverstein (2006)

    [author] Baik, JinhoJ. Silverstein, Jack WJ. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis 97 1382–1408.

  • Benaych-Georges, Guionnet and Maida (2011) [author] Benaych-Georges, FlorentF., Guionnet, AliceA. Maida, MylèneM. (2011). Fluctuations of the extreme eigenvalues of finite rank deformations of random matrices. Electronic Journal of Probability 16 1621–1662.
  • Benaych-Georges and Nadakuditi (2011)

    [author] Benaych-Georges, FlorentF. Nadakuditi, Raj RaoR. R. (2011). The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics 227 494–521.

  • Cao et al. (2020) [author] Cao, YananY., Li, LinL., Feng, ZhiminZ., Sun, XiaohuiX., Lu, JieliJ., Xu, YuY., Du, PeinaP., Wang, TiangeT., Hu, RuyingR., Ye, ZhenZ., Shi, LixinL., Tang, XuleiX., Yan, LiL., Gao, ZhengnanZ., Chen, GangG., Zhang, YinfeiY., Chen, LuluL., Ning, GuangG. Wang, WeiqingW. (2020). The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals. Cell Research 30 717–731.
  • 1000 Genomes Project Consortium (2015) [author] 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature 526 68–74.
  • Johnstone (2001) [author] Johnstone, Iain MI. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics 29 295–327.
  • Li, Wang and Yao (2017)

    [author] Li, ZengZ., Wang, QinwenQ. Yao, JianfengJ. (2017). Identifying the number of factors from singular values of a large sample auto-covariance matrix. The Annals of Statistics 45 257–288.

  • Menozzi, Piazza and Cavalli-Sforza (1978) [author] Menozzi, PaoloP., Piazza, AlbertoA. Cavalli-Sforza, LL. (1978). Synthetic maps of human gene frequencies in Europeans. Science 201 786–792.
  • Novembre et al. (2008) [author] Novembre, JohnJ., Johnson, TobyT., Bryc, KatarzynaK., Kutalik, ZoltánZ., Boyko, Adam RA. R., Auton, AdamA., Indap, AmitA., King, Karen SK. S., Bergmann, SvenS. Nelson, Matthew RM. R. (2008). Genes mirror geography within Europe. Nature 456 98.
  • Palla and Dudbridge (2015) [author] Palla, LuigiL. Dudbridge, FrankF. (2015). A fast method that uses polygenic scores to estimate the variance explained by genome-wide marker panels and the proportion of variants affecting a trait. The American Journal of Human Genetics 97 250–259.
  • Patterson, Price and Reich (2006) [author] Patterson, NickN., Price, Alkes LA. L. Reich, DavidD. (2006). Population structure and eigenanalysis. PLoS Genetics 2 e190.
  • Paul (2007) [author] Paul, DebashisD. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica 17 1617–1642.
  • Price et al. (2006) [author] Price, Alkes LA. L., Patterson, Nick JN. J., Plenge, Robert MR. M., Weinblatt, Michael EM. E., Shadick, Nancy AN. A. Reich, DavidD. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 904.
  • Prokopenko et al. (2015)

    [author] Prokopenko, DmitryD., Hecker, JulianJ., Silverman, Edwin K.E. K., Pagano, MarcelloM., Nöthen, Markus M.M. M., Dina, ChristianC., Lange, ChristophC. Fier, Heide LoehleinH. L. (2015). Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project. Bioinformatics 32 1366–1372. 10.1093/bioinformatics/btv752

  • Purcell et al. (2007) [author] Purcell, ShaunS., Neale, BenjaminB., Todd-Brown, KatheK., Thomas, LoriL., Ferreira, Manuel ARM. A., Bender, DavidD., Maller, JulianJ., Sklar, PamelaP., De Bakker, Paul IWP. I. Daly, Mark JM. J. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81 559–575.
  • Reich, Price and Patterson (2008) [author] Reich, DavidD., Price, Alkes LA. L. Patterson, NickN. (2008). Principal component analysis of genetic data. Nature Genetics 40 491–492.
  • Shriner (2012) [author] Shriner, DanielD. (2012). Improved eigenanalysis of discrete subpopulations and admixture using the minimum average partial test. Human Heredity 73 73–83.
  • Skoglund et al. (2017) [author] Skoglund, PontusP., Thompson, Jessica CJ. C., Prendergast, Mary EM. E., Mittnik, AlissaA., Sirak, KendraK., Hajdinjak, MatejaM., Salie, TasneemT., Rohland, NadinN., Mallick, SwapanS. Peltzer, AlexanderA. (2017). Reconstructing prehistoric African population structure. Cell 171 59–71.
  • Song, Hao and Storey (2015) [author] Song, MinsunM., Hao, WeiW. Storey, John DJ. D. (2015). Testing for genetic associations in arbitrarily structured populations. Nature Genetics 47 550.
  • Sudmant et al. (2015) [author] Sudmant, Peter HP. H., Rausch, TobiasT., Gardner, Eugene JE. J., Handsaker, Robert ER. E., Abyzov, AlexejA., Huddleston, JohnJ., Zhang, YanY., Ye, KaiK., Jun, GooG. Fritz, Markus Hsi-YangM. H.-Y. (2015). An integrated map of structural variation in 2,504 human genomes. Nature 526 75–81.
  • Wang and Paul (2014) [author] Wang, LiliL. Paul, DebashisD. (2014). Limiting spectral distribution of renormalized separable sample covariance matrices when p/n→0. Journal of Multivariate Analysis 126 25–52. https://doi.org/10.1016/j.jmva.2013.12.015
  • Wigner (1958) [author] Wigner, Eugene PE. P. (1958). On the distribution of the roots of certain symmetric matrices. Annals of Mathematics 67 325–327.
  • Zhang, Dey and Lee (2020) [author] Zhang, DaiweiD., Dey, RounakR. Lee, SeunggeunS. (2020). Fast and robust ancestry prediction using principal component analysis. Bioinformatics 36 3439–3446. 10.1093/bioinformatics/btaa152
  • Zhou, Marron and Wright (2018) [author] Zhou, Yi-HuiY.-H., Marron, JSJ. Wright, Fred AF. A. (2018). Eigenvalue significance testing for genetic association. Biometrics 74 439–447.

References

  • Aparicio et al. (2020) [author] Aparicio, LuisL., Bordyuh, MykolaM., Blumberg, Andrew J.A. J. Rabadan, RaulR. (2020). A Random Matrix Theory Approach to Denoise Single-Cell Data. Patterns 1 100035. https://doi.org/10.1016/j.patter.2020.100035
  • Arnold (1971)

    [author] Arnold, LudwigL. (1971). On Wigner’s semicircle law for the eigenvalues of random matrices. Probability Theory and Related Fields 19 191–198.

  • Bai and Yao (2008)

    [author] Bai, ZhidongZ. Yao, JianfengJ. (2008). Central limit theorems for eigenvalues in a spiked population model. Annales de l’IHP Probabilités et Statistiques 44 447–474.

  • Baik, Arous and Péché (2005)

    [author] Baik, JinhoJ., Arous, Gérard BenG. B. Péché, SandrineS. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability 33 1643–1697.

  • Baik and Silverstein (2006)

    [author] Baik, JinhoJ. Silverstein, Jack WJ. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis 97 1382–1408.

  • Benaych-Georges, Guionnet and Maida (2011) [author] Benaych-Georges, FlorentF., Guionnet, AliceA. Maida, MylèneM. (2011). Fluctuations of the extreme eigenvalues of finite rank deformations of random matrices. Electronic Journal of Probability 16 1621–1662.
  • Benaych-Georges and Nadakuditi (2011)

    [author] Benaych-Georges, FlorentF. Nadakuditi, Raj RaoR. R. (2011). The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics 227 494–521.

  • Cao et al. (2020) [author] Cao, YananY., Li, LinL., Feng, ZhiminZ., Sun, XiaohuiX., Lu, JieliJ., Xu, YuY., Du, PeinaP., Wang, TiangeT., Hu, RuyingR., Ye, ZhenZ., Shi, LixinL., Tang, XuleiX., Yan, LiL., Gao, ZhengnanZ., Chen, GangG., Zhang, YinfeiY., Chen, LuluL., Ning, GuangG. Wang, WeiqingW. (2020). The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals. Cell Research 30 717–731.
  • 1000 Genomes Project Consortium (2015) [author] 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature 526 68–74.
  • Johnstone (2001) [author] Johnstone, Iain MI. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics 29 295–327.
  • Li, Wang and Yao (2017)

    [author] Li, ZengZ., Wang, QinwenQ. Yao, JianfengJ. (2017). Identifying the number of factors from singular values of a large sample auto-covariance matrix. The Annals of Statistics 45 257–288.

  • Menozzi, Piazza and Cavalli-Sforza (1978) [author] Menozzi, PaoloP., Piazza, AlbertoA. Cavalli-Sforza, LL. (1978). Synthetic maps of human gene frequencies in Europeans. Science 201 786–792.
  • Novembre et al. (2008) [author] Novembre, JohnJ., Johnson, TobyT., Bryc, KatarzynaK., Kutalik, ZoltánZ., Boyko, Adam RA. R., Auton, AdamA., Indap, AmitA., King, Karen SK. S., Bergmann, SvenS. Nelson, Matthew RM. R. (2008). Genes mirror geography within Europe. Nature 456 98.
  • Palla and Dudbridge (2015) [author] Palla, LuigiL. Dudbridge, FrankF. (2015). A fast method that uses polygenic scores to estimate the variance explained by genome-wide marker panels and the proportion of variants affecting a trait. The American Journal of Human Genetics 97 250–259.
  • Patterson, Price and Reich (2006) [author] Patterson, NickN., Price, Alkes LA. L. Reich, DavidD. (2006). Population structure and eigenanalysis. PLoS Genetics 2 e190.
  • Paul (2007) [author] Paul, DebashisD. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica 17 1617–1642.
  • Price et al. (2006) [author] Price, Alkes LA. L., Patterson, Nick JN. J., Plenge, Robert MR. M., Weinblatt, Michael EM. E., Shadick, Nancy AN. A. Reich, DavidD. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 904.
  • Prokopenko et al. (2015)

    [author] Prokopenko, DmitryD., Hecker, JulianJ., Silverman, Edwin K.E. K., Pagano, MarcelloM., Nöthen, Markus M.M. M., Dina, ChristianC., Lange, ChristophC. Fier, Heide LoehleinH. L. (2015). Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project. Bioinformatics 32 1366–1372. 10.1093/bioinformatics/btv752

  • Purcell et al. (2007) [author] Purcell, ShaunS., Neale, BenjaminB., Todd-Brown, KatheK., Thomas, LoriL., Ferreira, Manuel ARM. A., Bender, DavidD., Maller, JulianJ., Sklar, PamelaP., De Bakker, Paul IWP. I. Daly, Mark JM. J. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81 559–575.
  • Reich, Price and Patterson (2008) [author] Reich, DavidD., Price, Alkes LA. L. Patterson, NickN. (2008). Principal component analysis of genetic data. Nature Genetics 40 491–492.
  • Shriner (2012) [author] Shriner, DanielD. (2012). Improved eigenanalysis of discrete subpopulations and admixture using the minimum average partial test. Human Heredity 73 73–83.
  • Skoglund et al. (2017) [author] Skoglund, PontusP., Thompson, Jessica CJ. C., Prendergast, Mary EM. E., Mittnik, AlissaA., Sirak, KendraK., Hajdinjak, MatejaM., Salie, TasneemT., Rohland, NadinN., Mallick, SwapanS. Peltzer, AlexanderA. (2017). Reconstructing prehistoric African population structure. Cell 171 59–71.
  • Song, Hao and Storey (2015) [author] Song, MinsunM., Hao, WeiW. Storey, John DJ. D. (2015). Testing for genetic associations in arbitrarily structured populations. Nature Genetics 47 550.
  • Sudmant et al. (2015) [author] Sudmant, Peter HP. H., Rausch, TobiasT., Gardner, Eugene JE. J., Handsaker, Robert ER. E., Abyzov, AlexejA., Huddleston, JohnJ., Zhang, YanY., Ye, KaiK., Jun, GooG. Fritz, Markus Hsi-YangM. H.-Y. (2015). An integrated map of structural variation in 2,504 human genomes. Nature 526 75–81.
  • Wang and Paul (2014) [author] Wang, LiliL. Paul, DebashisD. (2014). Limiting spectral distribution of renormalized separable sample covariance matrices when p/n→0. Journal of Multivariate Analysis 126 25–52. https://doi.org/10.1016/j.jmva.2013.12.015
  • Wigner (1958) [author] Wigner, Eugene PE. P. (1958). On the distribution of the roots of certain symmetric matrices. Annals of Mathematics 67 325–327.
  • Zhang, Dey and Lee (2020) [author] Zhang, DaiweiD., Dey, RounakR. Lee, SeunggeunS. (2020). Fast and robust ancestry prediction using principal component analysis. Bioinformatics 36 3439–3446. 10.1093/bioinformatics/btaa152
  • Zhou, Marron and Wright (2018) [author] Zhou, Yi-HuiY.-H., Marron, JSJ. Wright, Fred AF. A. (2018). Eigenvalue significance testing for genetic association. Biometrics 74 439–447.