1 Introduction
Inference of population structure is a fundamentally important problem in population genetics and also plays a critical role in genetic association studies (Novembre et al., 2008; Skoglund et al., 2017; Cao et al., 2020)
. Principal component analysis (PCA) based method has been popularized to capture the population structures from arraybased genotype data
(Menozzi, Piazza and CavalliSforza, 1978; Patterson, Price and Reich, 2006; Reich, Price and Patterson, 2008). PCA works by selecting top principal components (PCs) that can sufficiently capture population structures, and then those top PCs can be further used to correct for population stratification bias in genetic association studies, for example, using the popular EIGENSTRAT method (Price et al., 2006).A natural and necessary question when applying PCA on genetic data is to determine the number of PCs that can sufficiently capture the underlying unknown population structure. Patterson, Price and Reich (2006) developed the socalled sequential Tracy–Widom test using the “effective number of markers” as a plugin estimate of the original number of markers in the data set. The effective number of markers is used to estimate the number of the underlying uncorrelated markers so that the number of markers can be effectively reduced. With the use of effective number of markers, Patterson, Price and Reich (2006) tried to alleviate the possible violation of the assumption in the Tracy–Widom test that the sample size and the number of genetic markers should be comparably large. We will refer to this method as the PCATW test throughout this paper. However, after being applied in various empirical studies for several years, it has been found that the PCATW test might not perform well for capturing the true population structure in sequencing data (Song, Hao and Storey, 2015; Prokopenko et al., 2015; Zhang, Dey and Lee, 2020). This is because the traditional PCATW test was originally developed for arraybased genotype data sets that typically contain a moderate number of genetic markers, while the number of genetic markers is much larger in sequencing data. As largescale sequencing data sets become increasingly available, we need to develop a new method that can resolve the following two issues:
2 Method
Suppose that one is interested in estimating the number of subpopulations based on a raw by genotype matrix which contains genetic markers from individuals. Each entry represents the raw count of the minor alleles for the genetic marker on the individual . Assume that there are unknown (or latent) subpopulations and the th subpopulation is of size , where and . To refer to subpopulation specific individuals, the index of individual is written as follows:
where indexed the latent subpopulations and indexed the individuals in the th subpopulation. Using this notation, the raw count matrix can be rewritten as
where is a
dimensional vector containing information of the genetic variants for the
th individual. We consider the following asymptotic regime throughout this paper.Asymptotic Regime.
This asymptotic regime is reasonable in sequencing data, where the number of markers is far greater than the sample size .
2.1 The ANOVA Model
Our model builds on the key fact that individuals from different subpopulations have different minor allele frequencies (MAF) and individuals from the same subpopulation have the same MAF. This observation motivates us to model the raw minor allele count data matrix using the following analysis of variance (ANOVA) model,
(1) 
where the dimensional vectors are the subpopulationspecific mean counts of minor alleles across latent subpopulations, and the vectors are independent and identically distributed noise vectors with mean zeros and covariance . We emphasize here that even though we use this ANOVA model for the raw minor allele count matrix, however our model differs from the classical ANOVA model in three ways. First, we do not know the total number of subpopulations a priori and which individual belongs to which subpopulation. Second, the sample to marker ratio is nearly zero. Third, our model allows for the presence of different LD patterns. Those three salient features of our ANOVA model thus require modern random matrix theory to understand the sources of the variation in the sequencing data.
Following Patterson, Price and Reich (2006), we normalize the raw count matrix such that each column (genetic marker) has mean zero and unit variance. The estimates of the th subpopulationspecific mean vector and the estimate of the overall mean vector are given by
(2)  
We also need the following diagonal matrix with diagonal elements equal to the standard deviations of the genetic markers  
Then the normalized genotype data matrix is given by  
(3) 
Define the sample covariance matrix of the normalized data as . Then, we have the following standard ANOVA decomposition
where the betweengroup variation is given by  
and the withingroup variation is given by  
(4) 
In the next section, we will perform spectral analysis for using modern random matrix theory.
2.2 The Spikes and the Bulk
It can be seen from Equation 4 that the withingroup covariance matrix is essentially the noise covariance matrix. The betweengroup covariance matrix is of rank because it is the sum of rank one matrices and the betweengroup vectors satisfy the following linear constrain:
Hence the matrix can be viewed as a rank perturbation of the sample noise covariance matrix . According to the finiterank perturbation theory (BenaychGeorges and Nadakuditi, 2011; BenaychGeorges, Guionnet and Maida, 2011), under the assumption that , the nonzero ordered sample eigenvalues of matrix can be separated into two sets by relating them to the eigenvalues of either or (graphically illustrated in Figure 1):
3 Simulation Studies
In this section, we compare the performances of our new method ERStruct versus the traditional PCATW test based on their final estimated number of subpopulations. We consider two different settings depending on whether or not the LD is present.
4 Real Data Analysis
In this section, we apply our proposed ERStruct to the 1000 Genomes Project sequencing data to estimate the number of subpopulations for illustration. We also include the PCATW method for comparison purpose. The 1000 Genomes Project was established in January 2008 with the aim to build by then the most detailed catalogue of genetic variations in the human population. In 2015, the final phase of the project was completed (1000 Genomes Project Consortium, 2015; Sudmant et al., 2015). Here, we analyze the data from the 1000 Genomes Project, phase 3 v5a, which is publicly accessible at https://www.internationalgenome.org/data. In this data set, there are individuals from different subpopulations (as shown in Table 2).
Subpopulation  Sample size  

ACB  (African Caribbean in Barbados)  96 
ASW  (African Ancestry in Southwest US)  61 
BEB  (Bengali in Bangladesh)  86 
CDX  (Chinese Dai in Xishuangbanna, China)  93 
CEU  (Utah residents with Northern and Western European ancestry)  99 
CHB  (Han Chinese in Beijing, China)  103 
CHS  (Southern Han Chinese, China)  105 
CLM  (Colombian in Medellin, Colombia)  94 
ESN  (Esan in Nigeria)  99 
FIN  (Finnish in Finland)  99 
GBR  (British in England and Scotland)  91 
GIH  (Gujarati Indian in Houston, TX)  103 
GWD  (Gambian in Western Division, The Gambia)  113 
IBS  (Iberian populations in Spain)  107 
ITU  (Indian Telugu in the UK)  102 
JPT  (Japanese in Tokyo, Japan)  104 
KHV  (Kinh in Ho Chi Minh City, Vietnam)  99 
LWK  (Luhya in Webuye, Kenya)  99 
MSL  (Mende in Sierra Leone)  85 
MXL  (Mexican Ancestry in Los Angeles, California)  64 
PEL  (Peruvian in Lima, Peru)  85 
PJL  (Punjabi in Lahore, Pakistan)  96 
PUR  (Puerto Rican in Puerto Rico)  104 
STU  (Sri Lankan Tamil in the UK)  102 
TSI  (Toscani in Italy)  107 
YRI  (Yoruba in Ibadan, Nigeria)  108 
The raw data is preprocessed as follows. We first filter extremely rare genetic variants and retain genetic markers with MAF greater than using the PLINK software (Purcell et al., 2007). We applied our ERStruct and the traditional PCATW methods to this data set and perform the following three analyses.
5 Discussions
The traditional PCATW test was originally developed for arraybased genotype data set where the number of genetic markers is much smaller than that from modern sequencing data (Patterson, Price and Reich, 2006). Hence, the PCATW method has been found to perform not well when the sample to marker ratio is nearly zero as in the sequencing data. Our proposed method ERStruct has been shown to have improved performance over the traditional PCATW test in both simulated and real sequencing data.
Our ERStruct method enjoys several advantages. First, our ERStruct is based on the more robust eigenvalue ratios rather than the original eigenvalues for inference as it preserves the scale invariant property. Second, we obtain the more accurate null distribution for the ER test statistic from the most recent random matrix theory. Third, our method is not confined to a specific LD structure among genetic markers. Even though in the real genetic data with complicated LD structures, our ERStruct can still separate the spikes from the bulk. Fourth, our ERStruct method is also computationally efficient. In fact, our ERStruct achieves almost the same computational speed as the PCATW test, given that . For example, it took around 40 minutes for analyzing the whole 1000 Genomes data with 126G RAM and 5 cores of CPU using our ERStruct MATLAB toolbox.
Our proposed ERStruct method can also be used for inferring latent structures in other types of ultradimensional data. However, both our proposed ERStruct method and the PCATW test assume that the raw data is complete, that is, there are no missing values. But in singlecell sequencing data, it is common that most of the entries in the data matrix are missing. Inference of latent structures in such sparse data matrix is still very challenging because a large proportion of missing values will seriously impact the null distribution of the ER test statistic. One possible solution is to impute those missing values as described by
Aparicio et al. (2020). We will leave this as future work by extending our ERStruct method for sparse data matrices.References
 Aparicio et al. (2020) [author] Aparicio, LuisL., Bordyuh, MykolaM., Blumberg, Andrew J.A. J. Rabadan, RaulR. (2020). A Random Matrix Theory Approach to Denoise SingleCell Data. Patterns 1 100035. https://doi.org/10.1016/j.patter.2020.100035

Arnold (1971)
[author] Arnold, LudwigL. (1971). On Wigner’s semicircle law for the eigenvalues of random matrices. Probability Theory and Related Fields 19 191–198.

Bai and Yao (2008)
[author] Bai, ZhidongZ. Yao, JianfengJ. (2008). Central limit theorems for eigenvalues in a spiked population model. Annales de l’IHP Probabilités et Statistiques 44 447–474.

Baik, Arous and
Péché (2005)
[author] Baik, JinhoJ., Arous, Gérard BenG. B. Péché, SandrineS. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability 33 1643–1697.

Baik and
Silverstein (2006)
[author] Baik, JinhoJ. Silverstein, Jack WJ. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis 97 1382–1408.
 BenaychGeorges, Guionnet and Maida (2011) [author] BenaychGeorges, FlorentF., Guionnet, AliceA. Maida, MylèneM. (2011). Fluctuations of the extreme eigenvalues of finite rank deformations of random matrices. Electronic Journal of Probability 16 1621–1662.

BenaychGeorges and
Nadakuditi (2011)
[author] BenaychGeorges, FlorentF. Nadakuditi, Raj RaoR. R. (2011). The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics 227 494–521.
 Cao et al. (2020) [author] Cao, YananY., Li, LinL., Feng, ZhiminZ., Sun, XiaohuiX., Lu, JieliJ., Xu, YuY., Du, PeinaP., Wang, TiangeT., Hu, RuyingR., Ye, ZhenZ., Shi, LixinL., Tang, XuleiX., Yan, LiL., Gao, ZhengnanZ., Chen, GangG., Zhang, YinfeiY., Chen, LuluL., Ning, GuangG. Wang, WeiqingW. (2020). The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals. Cell Research 30 717–731.
 1000 Genomes Project Consortium (2015) [author] 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature 526 68–74.
 Johnstone (2001) [author] Johnstone, Iain MI. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics 29 295–327.

Li, Wang and Yao (2017)
[author] Li, ZengZ., Wang, QinwenQ. Yao, JianfengJ. (2017). Identifying the number of factors from singular values of a large sample autocovariance matrix. The Annals of Statistics 45 257–288.
 Menozzi, Piazza and CavalliSforza (1978) [author] Menozzi, PaoloP., Piazza, AlbertoA. CavalliSforza, LL. (1978). Synthetic maps of human gene frequencies in Europeans. Science 201 786–792.
 Novembre et al. (2008) [author] Novembre, JohnJ., Johnson, TobyT., Bryc, KatarzynaK., Kutalik, ZoltánZ., Boyko, Adam RA. R., Auton, AdamA., Indap, AmitA., King, Karen SK. S., Bergmann, SvenS. Nelson, Matthew RM. R. (2008). Genes mirror geography within Europe. Nature 456 98.
 Palla and Dudbridge (2015) [author] Palla, LuigiL. Dudbridge, FrankF. (2015). A fast method that uses polygenic scores to estimate the variance explained by genomewide marker panels and the proportion of variants affecting a trait. The American Journal of Human Genetics 97 250–259.
 Patterson, Price and Reich (2006) [author] Patterson, NickN., Price, Alkes LA. L. Reich, DavidD. (2006). Population structure and eigenanalysis. PLoS Genetics 2 e190.
 Paul (2007) [author] Paul, DebashisD. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica 17 1617–1642.
 Price et al. (2006) [author] Price, Alkes LA. L., Patterson, Nick JN. J., Plenge, Robert MR. M., Weinblatt, Michael EM. E., Shadick, Nancy AN. A. Reich, DavidD. (2006). Principal components analysis corrects for stratification in genomewide association studies. Nature Genetics 38 904.

Prokopenko
et al. (2015)
[author] Prokopenko, DmitryD., Hecker, JulianJ., Silverman, Edwin K.E. K., Pagano, MarcelloM., Nöthen, Markus M.M. M., Dina, ChristianC., Lange, ChristophC. Fier, Heide LoehleinH. L. (2015). Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project. Bioinformatics 32 1366–1372. 10.1093/bioinformatics/btv752
 Purcell et al. (2007) [author] Purcell, ShaunS., Neale, BenjaminB., ToddBrown, KatheK., Thomas, LoriL., Ferreira, Manuel ARM. A., Bender, DavidD., Maller, JulianJ., Sklar, PamelaP., De Bakker, Paul IWP. I. Daly, Mark JM. J. (2007). PLINK: a tool set for wholegenome association and populationbased linkage analyses. The American Journal of Human Genetics 81 559–575.
 Reich, Price and Patterson (2008) [author] Reich, DavidD., Price, Alkes LA. L. Patterson, NickN. (2008). Principal component analysis of genetic data. Nature Genetics 40 491–492.
 Shriner (2012) [author] Shriner, DanielD. (2012). Improved eigenanalysis of discrete subpopulations and admixture using the minimum average partial test. Human Heredity 73 73–83.
 Skoglund et al. (2017) [author] Skoglund, PontusP., Thompson, Jessica CJ. C., Prendergast, Mary EM. E., Mittnik, AlissaA., Sirak, KendraK., Hajdinjak, MatejaM., Salie, TasneemT., Rohland, NadinN., Mallick, SwapanS. Peltzer, AlexanderA. (2017). Reconstructing prehistoric African population structure. Cell 171 59–71.
 Song, Hao and Storey (2015) [author] Song, MinsunM., Hao, WeiW. Storey, John DJ. D. (2015). Testing for genetic associations in arbitrarily structured populations. Nature Genetics 47 550.
 Sudmant et al. (2015) [author] Sudmant, Peter HP. H., Rausch, TobiasT., Gardner, Eugene JE. J., Handsaker, Robert ER. E., Abyzov, AlexejA., Huddleston, JohnJ., Zhang, YanY., Ye, KaiK., Jun, GooG. Fritz, Markus HsiYangM. H.Y. (2015). An integrated map of structural variation in 2,504 human genomes. Nature 526 75–81.
 Wang and Paul (2014) [author] Wang, LiliL. Paul, DebashisD. (2014). Limiting spectral distribution of renormalized separable sample covariance matrices when p/n→0. Journal of Multivariate Analysis 126 25–52. https://doi.org/10.1016/j.jmva.2013.12.015
 Wigner (1958) [author] Wigner, Eugene PE. P. (1958). On the distribution of the roots of certain symmetric matrices. Annals of Mathematics 67 325–327.
 Zhang, Dey and Lee (2020) [author] Zhang, DaiweiD., Dey, RounakR. Lee, SeunggeunS. (2020). Fast and robust ancestry prediction using principal component analysis. Bioinformatics 36 3439–3446. 10.1093/bioinformatics/btaa152
 Zhou, Marron and Wright (2018) [author] Zhou, YiHuiY.H., Marron, JSJ. Wright, Fred AF. A. (2018). Eigenvalue significance testing for genetic association. Biometrics 74 439–447.
References
 Aparicio et al. (2020) [author] Aparicio, LuisL., Bordyuh, MykolaM., Blumberg, Andrew J.A. J. Rabadan, RaulR. (2020). A Random Matrix Theory Approach to Denoise SingleCell Data. Patterns 1 100035. https://doi.org/10.1016/j.patter.2020.100035

Arnold (1971)
[author] Arnold, LudwigL. (1971). On Wigner’s semicircle law for the eigenvalues of random matrices. Probability Theory and Related Fields 19 191–198.

Bai and Yao (2008)
[author] Bai, ZhidongZ. Yao, JianfengJ. (2008). Central limit theorems for eigenvalues in a spiked population model. Annales de l’IHP Probabilités et Statistiques 44 447–474.

Baik, Arous and
Péché (2005)
[author] Baik, JinhoJ., Arous, Gérard BenG. B. Péché, SandrineS. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability 33 1643–1697.

Baik and
Silverstein (2006)
[author] Baik, JinhoJ. Silverstein, Jack WJ. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis 97 1382–1408.
 BenaychGeorges, Guionnet and Maida (2011) [author] BenaychGeorges, FlorentF., Guionnet, AliceA. Maida, MylèneM. (2011). Fluctuations of the extreme eigenvalues of finite rank deformations of random matrices. Electronic Journal of Probability 16 1621–1662.

BenaychGeorges and
Nadakuditi (2011)
[author] BenaychGeorges, FlorentF. Nadakuditi, Raj RaoR. R. (2011). The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics 227 494–521.
 Cao et al. (2020) [author] Cao, YananY., Li, LinL., Feng, ZhiminZ., Sun, XiaohuiX., Lu, JieliJ., Xu, YuY., Du, PeinaP., Wang, TiangeT., Hu, RuyingR., Ye, ZhenZ., Shi, LixinL., Tang, XuleiX., Yan, LiL., Gao, ZhengnanZ., Chen, GangG., Zhang, YinfeiY., Chen, LuluL., Ning, GuangG. Wang, WeiqingW. (2020). The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals. Cell Research 30 717–731.
 1000 Genomes Project Consortium (2015) [author] 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature 526 68–74.
 Johnstone (2001) [author] Johnstone, Iain MI. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics 29 295–327.

Li, Wang and Yao (2017)
[author] Li, ZengZ., Wang, QinwenQ. Yao, JianfengJ. (2017). Identifying the number of factors from singular values of a large sample autocovariance matrix. The Annals of Statistics 45 257–288.
 Menozzi, Piazza and CavalliSforza (1978) [author] Menozzi, PaoloP., Piazza, AlbertoA. CavalliSforza, LL. (1978). Synthetic maps of human gene frequencies in Europeans. Science 201 786–792.
 Novembre et al. (2008) [author] Novembre, JohnJ., Johnson, TobyT., Bryc, KatarzynaK., Kutalik, ZoltánZ., Boyko, Adam RA. R., Auton, AdamA., Indap, AmitA., King, Karen SK. S., Bergmann, SvenS. Nelson, Matthew RM. R. (2008). Genes mirror geography within Europe. Nature 456 98.
 Palla and Dudbridge (2015) [author] Palla, LuigiL. Dudbridge, FrankF. (2015). A fast method that uses polygenic scores to estimate the variance explained by genomewide marker panels and the proportion of variants affecting a trait. The American Journal of Human Genetics 97 250–259.
 Patterson, Price and Reich (2006) [author] Patterson, NickN., Price, Alkes LA. L. Reich, DavidD. (2006). Population structure and eigenanalysis. PLoS Genetics 2 e190.
 Paul (2007) [author] Paul, DebashisD. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica 17 1617–1642.
 Price et al. (2006) [author] Price, Alkes LA. L., Patterson, Nick JN. J., Plenge, Robert MR. M., Weinblatt, Michael EM. E., Shadick, Nancy AN. A. Reich, DavidD. (2006). Principal components analysis corrects for stratification in genomewide association studies. Nature Genetics 38 904.

Prokopenko
et al. (2015)
[author] Prokopenko, DmitryD., Hecker, JulianJ., Silverman, Edwin K.E. K., Pagano, MarcelloM., Nöthen, Markus M.M. M., Dina, ChristianC., Lange, ChristophC. Fier, Heide LoehleinH. L. (2015). Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project. Bioinformatics 32 1366–1372. 10.1093/bioinformatics/btv752
 Purcell et al. (2007) [author] Purcell, ShaunS., Neale, BenjaminB., ToddBrown, KatheK., Thomas, LoriL., Ferreira, Manuel ARM. A., Bender, DavidD., Maller, JulianJ., Sklar, PamelaP., De Bakker, Paul IWP. I. Daly, Mark JM. J. (2007). PLINK: a tool set for wholegenome association and populationbased linkage analyses. The American Journal of Human Genetics 81 559–575.
 Reich, Price and Patterson (2008) [author] Reich, DavidD., Price, Alkes LA. L. Patterson, NickN. (2008). Principal component analysis of genetic data. Nature Genetics 40 491–492.
 Shriner (2012) [author] Shriner, DanielD. (2012). Improved eigenanalysis of discrete subpopulations and admixture using the minimum average partial test. Human Heredity 73 73–83.
 Skoglund et al. (2017) [author] Skoglund, PontusP., Thompson, Jessica CJ. C., Prendergast, Mary EM. E., Mittnik, AlissaA., Sirak, KendraK., Hajdinjak, MatejaM., Salie, TasneemT., Rohland, NadinN., Mallick, SwapanS. Peltzer, AlexanderA. (2017). Reconstructing prehistoric African population structure. Cell 171 59–71.
 Song, Hao and Storey (2015) [author] Song, MinsunM., Hao, WeiW. Storey, John DJ. D. (2015). Testing for genetic associations in arbitrarily structured populations. Nature Genetics 47 550.
 Sudmant et al. (2015) [author] Sudmant, Peter HP. H., Rausch, TobiasT., Gardner, Eugene JE. J., Handsaker, Robert ER. E., Abyzov, AlexejA., Huddleston, JohnJ., Zhang, YanY., Ye, KaiK., Jun, GooG. Fritz, Markus HsiYangM. H.Y. (2015). An integrated map of structural variation in 2,504 human genomes. Nature 526 75–81.
 Wang and Paul (2014) [author] Wang, LiliL. Paul, DebashisD. (2014). Limiting spectral distribution of renormalized separable sample covariance matrices when p/n→0. Journal of Multivariate Analysis 126 25–52. https://doi.org/10.1016/j.jmva.2013.12.015
 Wigner (1958) [author] Wigner, Eugene PE. P. (1958). On the distribution of the roots of certain symmetric matrices. Annals of Mathematics 67 325–327.
 Zhang, Dey and Lee (2020) [author] Zhang, DaiweiD., Dey, RounakR. Lee, SeunggeunS. (2020). Fast and robust ancestry prediction using principal component analysis. Bioinformatics 36 3439–3446. 10.1093/bioinformatics/btaa152
 Zhou, Marron and Wright (2018) [author] Zhou, YiHuiY.H., Marron, JSJ. Wright, Fred AF. A. (2018). Eigenvalue significance testing for genetic association. Biometrics 74 439–447.
Comments
There are no comments yet.