1.1 Principal component analysis (PCA)
Principal component analysis (PCA) was developed by Pearson and Hotelling in the early part of the 20th century. PCA performs a linear transformation of multivariate data into a new set of variables, the principal components (PCs), that are linear combinations of the original variables, are uncorrelated and have sequentially maximum variance. PCA can be equivalently defined as the set of uncorrelated vectors that provide the best low-rank matrix approximation in a least squares sense. The solution to PCA is given by the eigenvalue decomposition of the sample covariance matrix, with the variance of the PCs specified by the eigenvalues and the PC directions defined by the eigenvectors.
Because PCA is defined in terms of the eigenvectors and eigenvalues of the sample covariance matrix, it is related to a wide range of matrix analysis methods and multivariate statistical techniques with an extremely large number of applications. Although most commonly used for unsupervised linear dimensionality reduction and visualization, PCA has been successfully applied to statistical problems including regression (e.g., principal components regression), clustering (e.g., clustering on PCs, seed directions for self-organizing maps (SOMs)) and analysis of time series data (e.g., functional PCA). In the biomedical domain, PCA has been extensively employed for the analysis of genomic data, including measures of DNA variation, DNA methylation, RNA expression and protein abundance. Common features of these datasets, and the motivation for eigenvalue decomposition methods, are the high dimensionality of the feature space (i.e., from thousands to over one million features), comparatively low sample size (i.e., $n \ll p$) and significant collinearity between the features. The most common uses of PCA with genomic data involve dimensionality reduction for visualization or clustering, with population genetics an important use case. PCA has also been used as the basis for feature selection and gene clustering. More recent applications include gene set enrichment of bulk or single cell gene expression data [11, 12].
1.2 Mathematical notation for PCA
Let $\mathbf{X}$ be an $n \times p$ matrix that holds $n$ independent samples drawn from a $p$-dimensional joint distribution with population covariance matrix $\boldsymbol{\Sigma}$. Without loss of generality, we can assume the columns of $\mathbf{X}$ are mean-centered. The unbiased sample covariance matrix is therefore given by $\mathbf{S} = \frac{1}{n-1} \mathbf{X}^T \mathbf{X}$. PCA can be performed via the direct eigenvalue decomposition of $\mathbf{S}$ with the eigenvalues equal to the PC variances and the eigenvectors equal to the PC loadings. For computational reasons, PCA is more commonly performed via the singular value decomposition (SVD) of $\mathbf{X}$: $\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T$, where $\mathbf{U}$ and $\mathbf{V}$ are both orthonormal matrices (i.e., $\mathbf{U}^T\mathbf{U} = \mathbf{V}^T\mathbf{V} = \mathbf{I}$), the columns of $\mathbf{V}$ represent the PC loading vectors, the entries $d_i$ of the diagonal matrix $\mathbf{D}$, arranged in decreasing order, are proportional to the square roots of the PC variances ($\lambda_i = d_i^2/(n-1)$) and the columns of $\mathbf{U}\mathbf{D}$ are the principal components. Consistent with the optimal low-rank matrix approximation property of PCA, the first $k$ components of the SVD minimize the squared Frobenius norm between the original and a rank $k$ reconstructed version of $\mathbf{X}$. Specifically, if $\mathbf{U}_k$, $\mathbf{D}_k$ and $\mathbf{V}_k$ hold the first $k$ columns of $\mathbf{U}$, $\mathbf{D}$ and $\mathbf{V}$, then $\hat{\mathbf{X}}_k = \mathbf{U}_k \mathbf{D}_k \mathbf{V}_k^T$ is the SVD rank $k$ reconstruction of $\mathbf{X}$, which minimizes $\|\mathbf{X} - \hat{\mathbf{X}}_k\|_F^2$ over all possible rank $k$ reconstructions.
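These relationships are easy to check numerically. The following sketch (an illustration in numpy, not code from the paper) confirms that the SVD of a mean-centered matrix reproduces the eigen-decomposition of the unbiased sample covariance matrix and that the rank-$k$ truncation yields the stated Frobenius-norm reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 10, 2
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                 # mean-center the columns

# SVD-based PCA: X = U D V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)
pc_vars = d**2 / (n - 1)               # PC variances from singular values

# Eigen-decomposition of the unbiased sample covariance matrix
S = X.T @ X / (n - 1)
evals = np.linalg.eigvalsh(S)[::-1]    # sort in decreasing order

# Rank-k SVD reconstruction and its squared Frobenius error
Xk = (U[:, :k] * d[:k]) @ Vt[:k, :]
err = np.sum((X - Xk) ** 2)
```

The final assertion-ready identity, $\|\mathbf{X} - \hat{\mathbf{X}}_k\|_F^2 = \sum_{i > k} d_i^2$, follows directly from the orthonormality of $\mathbf{U}$ and $\mathbf{V}$.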
1.3 Running example
To help illustrate the concepts discussed in this paper, we will introduce a simple example data set based on a 10-dimensional multivariate normal (MVN) distribution. The population mean of the data is set to the zero vector, $\boldsymbol{\mu} = \mathbf{0}$, and the population covariance matrix $\boldsymbol{\Sigma}$ is given the following block covariance structure:
To simplify the example, we set the population variance for all variables to 1, which aligns with the common practice of performing PCA after standardization. For this $\boldsymbol{\Sigma}$, the first population PC has equal non-zero loadings for just the first four variables and the second population PC has equal non-zero loadings for just the last two variables. If the matrix $\mathbf{V}$ holds the population PC loadings, the first two columns are:
The variances of these population PCs are:
For one example set of 100 independent samples drawn from this MVN distribution, the loadings of the first two sample PCs (rounded to three decimal places) are:
The variances of the first two sample PCs (again rounded to three decimal places) are:
The minimum rank 2 reconstruction error for this case (computed as the squared Frobenius norm of the residual matrix) is 597.531.
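A similar example can be simulated directly; in this numpy sketch the within-block covariance values (0.7 for the first four variables, 0.3 for the last two) are illustrative assumptions since the displayed covariance matrix did not survive extraction, so the resulting sample quantities will differ from the numbers quoted in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 10

# Block covariance with unit variances; the within-block covariances
# (0.7 and 0.3) are illustrative assumptions, not the text's values.
Sigma = np.eye(p)
Sigma[:4, :4] = 0.7
Sigma[8:, 8:] = 0.3
np.fill_diagonal(Sigma, 1.0)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
pc_vars = d**2 / (n - 1)               # sample PC variances
loadings = Vt[:2]                      # loadings of the first two sample PCs

# Rank 2 reconstruction error (squared Frobenius norm of the residual)
X2 = (U[:, :2] * d[:2]) @ Vt[:2, :]
err2 = np.sum((X - X2) ** 2)
```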
1.4 Sparse PCA
As demonstrated by the simple example above, all variables typically make a non-zero contribution to each sample PC even when the data follows a statistical model with sparse population PCs. This property makes interpretation of the PCs challenging, especially when the underlying data is high dimensional. For example, PCA of gene expression data will generate PCs that are linear combinations of thousands of genes, and attempting to ascertain the biological relevance of such a large linear combination of genes is a very difficult task. The challenge of PC interpretation has motivated a number of approaches for generating approximate PCs that have non-zero loadings for just a small subset of the measured variables. Such sparse PCA techniques include simple components (i.e., PC loading vectors constrained to values from $\{-1, 0, 1\}$) and methods which compute approximate PCs using cardinality [15, 16, 17] or LASSO-based [18, 19, 20, 21] constraints on the component loadings. By generating approximate PCs with few non-zero loadings, all of these techniques improve interpretability by associating only a small number of variables with each PC.
Among the existing approaches for sparse PCA, we believe the LASSO-penalized method developed by Witten, Tibshirani and Hastie still represents the current state-of-the-art, with numerous successful applications over the past decade. Evaluation of our proposed approach is therefore based on a comparison relative to the Witten et al. method, which we will refer to as Sparse Principal Components (SPC) in the remainder of this paper. The SPC method has an elegant formulation via LASSO-penalized matrix decomposition. Specifically, SPC modifies the standard SVD matrix reconstruction optimization problem to include a LASSO penalty on the PC loadings. Using the notation introduced in Section 1.2, the SPC optimization problem for the first PC can be stated as:

$$\underset{\mathbf{u}, \mathbf{v}}{\text{maximize}} \;\; \mathbf{u}^T \mathbf{X} \mathbf{v} \quad \text{subject to} \quad \|\mathbf{u}\|_2^2 \le 1, \; \|\mathbf{v}\|_2^2 \le 1, \; \|\mathbf{v}\|_1 \le c$$

Due to the LASSO constraint on the components of $\mathbf{v}$, this optimization problem generates a sparse version of the first PC with the level of sparsity controlled by the parameter $c$. Subsequent sparse PCs can be computed by iteratively applying SPC to the residual matrix generated by subtracting the rank 1 reconstruction from the original $\mathbf{X}$. Witten et al. also outline a more complex approach for computing multiple PCs that ensures orthogonality. Similar to LASSO-penalized generalized linear models, the parameter $c$ can be specified to achieve a desired level of sparsity in the PC loadings or can be selected via cross-validation to minimize the average matrix reconstruction error. Since the true level of sparsity is typically not known, the cross-validation approach is usually employed to determine $c$, and we will use this approach in our comparative evaluation of EESPCA and SPC. A common alternative to picking $c$ to minimize the average reconstruction error is to set $c$ to the smallest value whose reconstruction error is within 1 standard error of the minimum error. This alternate version, which we will refer to as SPC.1se, generates a sparser version of the PC loadings at the cost of increased reconstruction error.
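For intuition, the core of the SPC computation can be sketched as an alternating soft-thresholding scheme. The simplified numpy illustration below uses a fixed soft-threshold `lam` in place of the binary search over the $\ell_1$ bound performed by the PMA implementation, so it is a sketch of the idea rather than the actual algorithm:

```python
import numpy as np

def soft(a, lam):
    """Element-wise soft-thresholding operator."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_rank1(X, lam, n_iter=100):
    """Rank-1 penalized matrix decomposition via alternating updates.

    `lam` is a fixed soft-threshold on the loadings v; the PMA package
    instead searches for the threshold matching an l1 constraint.
    """
    v = np.linalg.svd(X, full_matrices=False)[2][0]  # warm start at PC1
    u = X @ v / np.linalg.norm(X @ v)
    for _ in range(n_iter):
        v = soft(X.T @ u, lam)
        nv = np.linalg.norm(v)
        if nv == 0:                      # threshold removed everything
            break
        v /= nv
        u = X @ v
        u /= np.linalg.norm(u)
    d = u @ X @ v                        # rank-1 "singular value"
    return u, d, v

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))
X[:, :3] += 2.0 * rng.normal(size=(50, 1))   # shared signal, first 3 vars
X = X - X.mean(axis=0)
u, d, v = sparse_rank1(X, lam=2.0)
```

With a sufficiently large threshold, the loadings for the pure-noise columns are driven exactly to zero while the signal columns are retained.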
Applying the SPC method as implemented in the PMA R package to compute sparse versions of the first two PCs of the example data simulated according to Section 1.3 generates the following sparse loadings when the penalty parameter is set to minimize the average cross-validation reconstruction error (orthogonality not enforced):
The SPC solution in this case generates zero loadings on the first PC for four of the six variables with a zero population loading. For the second PC, only one of the eight variables with a zero population loading is set to zero. As expected, the reconstruction error for this sparse solution is larger than that generated via PCA (598.148 vs. 597.531). If the SPC.1se variant is used instead, the following sparse loadings are produced:
The SPC.1se solution does a better job of capturing the true sparsity of the PCs (all true zero loadings for the first PC are captured and two out of eight are captured for the second PC) at the cost of significantly increased reconstruction error (642.760 vs. 598.148).
1.5 Eigenvectors from eigenvalues
Denton et al. recently reported on a fascinating association between the squared normed eigenvector elements of a Hermitian matrix and the eigenvalues of the full matrix and associated sub-matrices. Importantly in this context, the sample covariance matrix is Hermitian, which implies that squared normed PC loadings can be computed as a function of PC variances for the full data set and all of the leave-one-out variable subsets. To briefly restate the results from Denton et al., let $\mathbf{A}$ be a $p \times p$ Hermitian matrix with eigenvalues $\lambda_i(\mathbf{A})$ and normed eigenvectors $\mathbf{v}_i$ for $i \in \{1, \ldots, p\}$. Let the elements of each eigenvector $\mathbf{v}_i$ be denoted $v_{i,j}$ for $j \in \{1, \ldots, p\}$. Let $\mathbf{M}_j$ be the $(p-1) \times (p-1)$ submatrix of $\mathbf{A}$ that results from removing the $j$th column and the $j$th row from $\mathbf{A}$. Denote the eigenvalues of $\mathbf{M}_j$ by $\lambda_k(\mathbf{M}_j)$ for $k \in \{1, \ldots, p-1\}$. Given these values, Denton et al. state their key result in Lemma 2:

Lemma 2 (from Denton et al.). The norm squared of the elements of the eigenvectors are related to the eigenvalues and the submatrix eigenvalues:

$$|v_{i,j}|^2 \prod_{k=1; k \neq i}^{p} \left(\lambda_i(\mathbf{A}) - \lambda_k(\mathbf{A})\right) = \prod_{k=1}^{p-1} \left(\lambda_i(\mathbf{A}) - \lambda_k(\mathbf{M}_j)\right)$$

This can alternatively be represented as a ratio of products of eigenvalue differences:

$$|v_{i,j}|^2 = \frac{\prod_{k=1}^{p-1} \left(\lambda_i(\mathbf{A}) - \lambda_k(\mathbf{M}_j)\right)}{\prod_{k=1; k \neq i}^{p} \left(\lambda_i(\mathbf{A}) - \lambda_k(\mathbf{A})\right)} \quad (3)$$
In the context of PCA, the Hermitian matrix can be replaced by the sample covariance matrix $\mathbf{S}$ and (3) provides a formula for computing squared normed PC loadings from the PC variances of the full matrix and associated sub-matrices. In the remainder of this paper, we will detail an approximation of (3) relevant to the sparse PCA problem, our proposed EESPCA method based on this approximation and results from a comparative evaluation of the EESPCA, SPC and SPC.1se techniques on simulated data.
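The identity is easy to verify numerically. The following sketch (illustrative, not from the paper) checks (3) against a direct eigen-decomposition for a random real symmetric matrix, which is Hermitian:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 6
A = rng.normal(size=(p, p))
A = (A + A.T) / 2                      # real symmetric => Hermitian

lam, V = np.linalg.eigh(A)             # V[:, i] is the i-th eigenvector

def sq_element(i, j):
    """|v_{i,j}|^2 computed from eigenvalues alone via identity (3)."""
    M = np.delete(np.delete(A, j, axis=0), j, axis=1)   # drop row/column j
    mu = np.linalg.eigvalsh(M)                          # sub-matrix eigenvalues
    num = np.prod(lam[i] - mu)
    den = np.prod(np.delete(lam[i] - lam, i))
    return num / den

from_eigenvalues = np.array(
    [[sq_element(i, j) for j in range(p)] for i in range(p)]
)
```

Because the products run over all eigenvalues, the identity holds regardless of how the eigenvalues are ordered, provided the same labeling is used throughout (it requires distinct eigenvalues, which hold almost surely for a random matrix).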
2.1 Approximate eigenvector from eigenvalue formulation
If the sample covariance matrix $\mathbf{S}$ plays the role of the Hermitian matrix in (3), $\mathbf{S}_{-j}$ represents the sample covariance sub-matrix with variable $j$ removed, and computation focuses on the loadings for the first PC, then (3) can be restated as:

$$v_{1,j}^2 = \frac{\prod_{k=1}^{p-1} \left(\lambda_1(\mathbf{S}) - \lambda_k(\mathbf{S}_{-j})\right)}{\prod_{k=2}^{p} \left(\lambda_1(\mathbf{S}) - \lambda_k(\mathbf{S})\right)} \quad (4)$$

Although a theoretically interesting result, (4) does not have obvious practical utility for PCA since it only generates squared normed values (i.e., it fails to capture the sign of the loading) and requires computation of the eigenvalues of all of the covariance sub-matrices. However, if we make the approximation that the sub-matrix eigenvalues $\lambda_k(\mathbf{S}_{-j})$ for $k \in \{2, \ldots, p-1\}$ are equal to the corresponding eigenvalues $\lambda_{k+1}(\mathbf{S})$ of the full matrix, many of the eigenvalue difference terms in (4) cancel and we can simplify as follows:

$$v_{1,j}^2 \approx \frac{\lambda_1(\mathbf{S}) - \lambda_1(\mathbf{S}_{-j})}{\lambda_1(\mathbf{S}) - \lambda_2(\mathbf{S})}$$

If we also assume that $\lambda_2(\mathbf{S}) \approx 0$, we can further simplify the approximate squared normed loadings to:

$$\tilde{v}_{1,j}^2 = 1 - \frac{\lambda_1(\mathbf{S}_{-j})}{\lambda_1(\mathbf{S})} \quad (5)$$
This approximation has several appealing properties in the context of sparse PCA. First, it greatly reduces the computational cost since only the largest eigenvalues of the full covariance matrix and associated sub-matrices are needed, which can be efficiently computed even for very large matrices using the method of power iteration. The computational cost can be further lowered by using the estimated principal eigenvector of the full matrix to initialize the power iteration calculation for the sub-matrices. When this approximation is applied to the population covariance matrix $\boldsymbol{\Sigma}$, it will correctly estimate zero squared loadings since $\lambda_1(\boldsymbol{\Sigma}_{-j}) = \lambda_1(\boldsymbol{\Sigma})$ for these variables; for variables that have a non-zero loading on the first population PC, the approximation will be less than or equal to the squared value of the true loading. For the example introduced in Section 1.3, the squared normed loadings for the first population PC are:
and the approximate squared normed loadings computed via (5) are:
When applied to the sample covariance matrix $\mathbf{S}$, the approximate squared loadings will again be less than or equal to the squared PC loadings, with the values for variables with a zero loading on the population PC approaching zero as $n \to \infty$ with $p$ fixed, given the consistency of the sample PC loading vectors in this asymptotic regime. Importantly, the ratio of approximate-to-true loadings, $\tilde{v}_{1,j} / |v_{1,j}|$, will tend to be larger for variables that have a non-zero loading on the population PC than for variables with a zero population loading. This is due to the fact that the eigenvalue difference term retained in the numerator of the approximation, $\lambda_1(\mathbf{S}) - \lambda_1(\mathbf{S}_{-j})$, captures most of the squared loading value for variables with a true non-zero loading on the first population PC. If variable $j$ only has a non-zero loading on the first population PC, then it follows from PCA consistency that the ratio for that variable approaches 1 as $n \to \infty$. On the other hand, variables that have a zero loading on the first population PC will have a sample loading that approaches 0 as $n \to \infty$ and, for finite $n$, have a non-zero value based on random contributions spread across the spectrum of eigenvalue difference terms. In this case, the approximation based on the difference term for just the largest eigenvalue will tend to underestimate the sample loading by a greater degree than for variables with a true non-zero population PC loading. This property of approximation (5) provides important information regarding the true sparsity of the PC loadings and plays a key role in the EESPCA method detailed below. For the example data, the ratio (rounded to three decimal places) of approximate-to-real normed loadings for the first sample PC is:
Figure 1 illustrates the relationship between approximate and true squared loadings for the first PC as computed on 1,000 data sets simulated according to the model in Section 1.3. As shown in this figure, the ratio for variables with a non-zero loading on the population PC is markedly larger on average than the ratio for variables with a zero population loading.
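This behavior can be reproduced with a short numpy sketch (illustrative parameter choices, not the paper's simulation code) that compares approximation (5) against the true squared loadings of the first sample PC:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 10

# Sparse first population PC: only the first 4 variables are correlated
# (the covariance value 0.6 is an illustrative choice)
Sigma = np.eye(p)
Sigma[:4, :4] = 0.6
np.fill_diagonal(Sigma, 1.0)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X = X - X.mean(axis=0)
S = X.T @ X / (n - 1)

evals, evecs = np.linalg.eigh(S)
lam1 = evals[-1]
v1 = evecs[:, -1]                      # first sample PC loadings

# Approximation (5): only the largest eigenvalue of each sub-matrix is needed
approx_sq = np.empty(p)
for j in range(p):
    S_j = np.delete(np.delete(S, j, axis=0), j, axis=1)
    approx_sq[j] = 1.0 - np.linalg.eigvalsh(S_j)[-1] / lam1

ratio = np.sqrt(approx_sq) / np.abs(v1)    # approximate-to-true loading ratio
```

By Cauchy interlacing, $\lambda_1(\mathbf{S}_{-j}) \le \lambda_1(\mathbf{S})$, so the approximate squared loadings always fall in $[0, 1]$ and the ratio never exceeds 1.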
Our proposed EESPCA method, which is based on the eigenvector-from-eigenvalue approximation (5) detailed above, computes a sparse version of the first sample PC for a mean-centered $n \times p$ matrix $\mathbf{X}$ using the following procedure:
1. Estimate the unbiased sample covariance matrix $\mathbf{S} = \frac{1}{n-1} \mathbf{X}^T \mathbf{X}$.
2. Use the power iteration method to compute the principal eigenvector $\hat{\mathbf{v}}_1$ and associated eigenvalue $\lambda_1(\mathbf{S})$ of $\mathbf{S}$.
3. Use formula (5) to approximate the squared normed principal eigenvector $\tilde{\mathbf{v}}_1^2$. For efficient calculation of the required sub-matrix eigenvalues $\lambda_1(\mathbf{S}_{-j})$, the power iteration method is used with the initial value set to a subsetted version of $\hat{\mathbf{v}}_1$, i.e., $\hat{\mathbf{v}}_1$ with the relevant variable $j$ removed.
4. Compute the element-wise ratio of approximate-to-real normed eigenvectors: $\mathbf{r} = \tilde{\mathbf{v}}_1 / |\hat{\mathbf{v}}_1|$.
5. Use $\mathbf{r}$ to compute a sparse version of $\hat{\mathbf{v}}_1$. As outlined in Section 2.1 above, the entries in $\mathbf{r}$ will be less than one and larger for variables with a non-zero population loading than for variables with a zero population loading. This property suggests that one could use $\mathbf{r}$ to scale $\hat{\mathbf{v}}_1$, followed by renormalization, to generate sample PC loadings that more closely match the population loadings; however, this would not generate sparse loadings since the elements of $\mathbf{r}$ are themselves non-zero. To produce a sparse version of $\hat{\mathbf{v}}_1$, we follow the initial scaling and renormalization with a thresholding operation that sets any adjusted loadings less than $1/\sqrt{p}$ (the element of a unit length vector whose values are all identical) to 0 and then renormalize a final time to produce a sparse and unit length loadings vector $\hat{\mathbf{v}}_{1,sparse}$.
6. Compute the associated eigenvalue as: $\hat{\lambda}_{1,sparse} = \hat{\mathbf{v}}_{1,sparse}^T \mathbf{S} \hat{\mathbf{v}}_{1,sparse}$.
To compute multiple sparse PCs, the EESPCA method uses a similar approach to that employed by the SPC method, i.e., it repeatedly applies the procedure outlined above for calculating the first sparse PC to the residual matrix formed by subtracting from the input $\mathbf{X}$ the rank 1 reconstruction of $\mathbf{X}$ generated using the sparse loadings vector. Note that multiple sparse PCs generated using this recursive approach are not guaranteed to be orthogonal. When applied to the example data described in Section 1.3, the EESPCA method produces the following sparse PC loading vectors (rounded to three decimal places):
For this example, the EESPCA method was more accurate than either the SPC or SPC.1se methods in estimating the true population sparsity structure with no errors for either the first or second PC; by contrast, SPC had nine errors and SPC.1se had six errors. This improved sparsity estimation was achieved with a reconstruction error (605.072) close to the error for the SPC method (598.148) and much less than the error for the SPC.1se method (642.760).
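Putting the pieces together, the full procedure can be sketched in numpy as follows. This is an illustrative re-implementation of the steps described above, not the authors' code; details such as the power iteration count, the illustrative block covariance used in the demo, and the use of rank-1 residual deflation for subsequent PCs are assumptions:

```python
import numpy as np

def power_iteration(S, v0=None, n_iter=200):
    """Principal eigenvector and eigenvalue of a symmetric matrix."""
    v = np.ones(S.shape[0]) if v0 is None else v0.copy()
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = S @ v
        v /= np.linalg.norm(v)
    return v, v @ S @ v

def eespca_pc1(X):
    """Sparse first PC via the eigenvector-from-eigenvalue approximation."""
    n, p = X.shape
    S = X.T @ X / (n - 1)                        # step 1: sample covariance
    v1, lam1 = power_iteration(S)                # step 2: principal pair
    approx_sq = np.empty(p)
    for j in range(p):                           # step 3: approximation (5)
        S_j = np.delete(np.delete(S, j, 0), j, 1)
        _, lam_j = power_iteration(S_j, v0=np.delete(v1, j))
        approx_sq[j] = 1.0 - lam_j / lam1
    r = np.sqrt(np.clip(approx_sq, 0, None)) / np.abs(v1)  # step 4: ratios
    v = r * v1                                   # step 5: scale ...
    v /= np.linalg.norm(v)                       # ... renormalize ...
    v[np.abs(v) < 1.0 / np.sqrt(p)] = 0.0        # ... threshold at 1/sqrt(p)
    v /= np.linalg.norm(v)                       # ... and renormalize again
    return v, v @ S @ v                          # step 6: Rayleigh quotient

def eespca(X, k=2):
    """k sparse PCs via rank-1 deflation of the residual matrix."""
    R = X.copy()
    loadings = []
    for _ in range(k):
        v, _ = eespca_pc1(R)
        loadings.append(v)
        R = R - np.outer(R @ v, v)               # subtract rank-1 reconstruction
    return np.array(loadings)

rng = np.random.default_rng(5)
Sigma = np.eye(10)
Sigma[:4, :4] = 0.6                              # illustrative block covariance
np.fill_diagonal(Sigma, 1.0)
X = rng.multivariate_normal(np.zeros(10), Sigma, size=200)
X = X - X.mean(axis=0)
V_sparse = eespca(X, k=2)
```

Note that the procedure involves no tuning parameter: the $1/\sqrt{p}$ threshold is fixed by the dimensionality of the data.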
2.3 Simulation study design
To explore the comparative performance of the EESPCA, SPC and SPC.1se methods, we simulated MVN data for a range of sample sizes, variable dimensionalities and covariance structures. Specifically, we simulated data sets containing $n$ independent samples drawn from a $p$-dimensional MVN distribution with a zero population mean vector and a population covariance matrix $\boldsymbol{\Sigma}$ with all variances set to 1 and all covariances set to 0, except for the covariances between the first $q$ variables, which were set to a non-zero value $\rho$. According to this $\boldsymbol{\Sigma}$, the first population PC will have equal, non-zero loadings for the first $q$ variables and zero loadings for the remaining $p - q$ variables. Data was simulated for a range of $n$, $p$, $q$ and $\rho$ values with 50 data sets generated per unique parameter combination.
The EESPCA, SPC and SPC.1se methods were used to compute the first sparse PC of each simulated data set and method performance was quantified according to classification accuracy (i.e., the ability of each method to correctly assign zero loadings to variables with a zero population loading), computational speed and reconstruction error. The SPC and SPC.1se methods were run using the SPC() function in version 1.2.1 of the PMA R package with the optimal penalty parameter computed using the SPC.cv() function with nfolds=5, niter=10 and sumabsv=seq(1, sqrt(p), len=20).
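The classification metrics can be made concrete with a small helper. In the sketch below (an illustrative implementation, not the paper's evaluation code), a zero population loading is treated as the positive class, matching the sensitivity definition used in the results:

```python
import numpy as np

def balanced_accuracy(true_zero, est_zero):
    """Balanced accuracy for zero/non-zero loading classification.

    A zero population loading is the "positive" class: sensitivity is the
    fraction of truly-zero loadings estimated as zero, specificity the
    fraction of truly non-zero loadings estimated as non-zero.
    """
    sensitivity = np.mean(est_zero[true_zero])
    specificity = np.mean(~est_zero[~true_zero])
    return (sensitivity + specificity) / 2.0

# Toy example: population PC has zero loadings on the last 6 of 10 variables;
# the estimate misses one zero (0.1) and incorrectly zeroes one signal variable
true_zero = np.array([False] * 4 + [True] * 6)
est_loadings = np.array([0.5, 0.5, 0.4, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0])
ba = balanced_accuracy(true_zero, est_loadings == 0)
```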
3 Results and discussion
3.1 Classification performance
As illustrated in Figure 2, the EESPCA method has superior classification performance (as measured by balanced accuracy) relative to the SPC and SPC.1se methods across the full range of simulation models, with the difference particularly pronounced at low parameter values. As expected, the accuracy of all methods tends to increase as the sample size, the number of correlated variables or the covariance magnitude increases; improved performance at larger variable dimensionality with a fixed number of correlated variables was unexpected. In contrast to EESPCA and SPC.1se, the accuracy of the SPC method plateaus and then declines slightly at high parameter values. The corresponding sensitivity and specificity values, shown in Figures 3 and 4, provide more insight into the relative performance of the methods. Looking at the sensitivity plots in Figure 3, the SPC.1se method had almost perfect sensitivity (i.e., ability to assign zero loadings to variables with a zero population loading), which is consistent with the fact that the SPC.1se method generates a sparser solution than the SPC variant. The sensitivity plots also demonstrate that the poor balanced accuracy of the SPC method at higher parameter values is due largely to poor sensitivity, i.e., the SPC solution, which is driven by the minimization of reconstruction error, is insufficiently sparse at these parameter values. Looking at the specificity plots in Figure 4, both EESPCA and SPC have near perfect specificity (i.e., ability to assign a non-zero loading to variables with a non-zero population loading) for most parameter values. In contrast, the SPC.1se method has much lower specificity, which is expected given the sparser solution generated by this approach. Overall, EESPCA is able to achieve high sensitivity and specificity for almost all simulation models whereas SPC and SPC.1se have consistently good performance on just one of these criteria (specificity for SPC and sensitivity for SPC.1se).
3.2 Computational speed
Figure 5 illustrates the relative computational cost of the EESPCA and SPC methods. Because the SPC.1se and SPC methods have identical computational cost, only the SPC method is included in the plots. As shown in this figure, the execution time of the SPC method is two orders of magnitude larger than the execution time of the EESPCA method for almost all parameter settings. As expected, computational cost is fairly insensitive to changes in the number of correlated variables or the covariance magnitude for both methods and tends to increase as either the sample size or the variable dimensionality is increased. Importantly, the computational cost of EESPCA grows only slightly across the range of tested dimensionality values while the cost of SPC grows substantially, suggesting that the EESPCA performance benefits will be even greater on very large data sets.
3.3 Reconstruction error
Figure 6 shows the relative rank 1 reconstruction error for EESPCA, SPC and SPC.1se as compared to the reconstruction error for standard PCA. As expected, the error for all three methods decreases as parameter values increase with SPC.1se producing the largest error for all models. Although the SPC method has the lowest error for most simulation models, the EESPCA error is lower at small parameter values and only slightly higher for larger settings.
The EESPCA method is a novel sparse PCA technique based on an approximation of the recently described formula for calculating eigenvectors from eigenvalues for Hermitian matrices . Compared to the state-of-the-art SPC technique of Witten et al.  (and the SPC.1se variant), EESPCA can more accurately estimate the true population sparsity structure at a dramatically lower computational cost without the need for parameter tuning. Importantly, EESPCA provides improved classification accuracy and lower computational cost without significantly increasing the matrix reconstruction error. For analysis problems that prioritize correct estimation of the PC sparsity structure, EESPCA should be preferred over SPC or SPC.1se. EESPCA should also be preferred for computationally demanding problems, such as the analysis of very large data matrices or the use of techniques like resampling that require repeated application of sparse PCA. If minimization of reconstruction error is the key goal, a less sparse solution is acceptable and computational speed is not a major concern, then the SPC method is optimal.
This work was funded by National Institutes of Health grants K01LM012426, P20GM130454 and P30CA023108. We would like to acknowledge the supportive environment at the Geisel School of Medicine at Dartmouth where this research was performed.
-  Pearson, K.: On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2(6), 559–572 (1901)
-  Hotelling, H.: Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24, 498–520 (1933)
-  Jolliffe, I.T.: Principal Component Analysis. Springer Series in Statistics. Springer, New York, NY (2002)
-  Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer Series in Statistics. Springer, New York, NY (2009)
-  Kohonen, T., Schroeder, M.R., Huang, T.S. (eds.): Self-Organizing Maps, 3rd edn. Springer, Secaucus, NJ, USA (2001)
-  Ma, S., Dai, Y.: Principal component analysis based methods in bioinformatics studies. Briefings in Bioinformatics 12(6, SI), 714–722 (2011). doi:10.1093/bib/bbq09
-  Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck, W.M. 3rd, Hao, Y., Stoeckius, M., Smibert, P., Satija, R.: Comprehensive integration of single-cell data. Cell 177(7), 1888–1902.e21 (2019). doi:10.1016/j.cell.2019.05.031
-  Patterson, N., Price, A.L., Reich, D.: Population structure and eigenanalysis. PLOS Genetics 2(12), 190 (2006). doi:10.1371/journal.pgen.0020190
-  Lu, J., Kerns, R.T., Peddada, S.D., Bushel, P.R.: Principal component analysis-based filtering improves detection for affymetrix gene expression arrays. Nucleic Acids Research 39(13), 86 (2011). doi:10.1093/nar/gkr241
-  Kluger, Y., Basri, R., Chang, J.T., Gerstein, M.: Spectral biclustering of microarray data: Coclustering genes and conditions. Genome Research 13(4), 703–716 (2003). doi:10.1101/gr.648603
-  Tomfohr, J., Lu, J., Kepler, T.B.: Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics 6, 225 (2005). doi:10.1186/1471-2105-6-225
-  Fan, J., Salathia, N., Liu, R., Kaeser, G.E., Yung, Y.C., Herman, J.L., Kaper, F., Fan, J.-B., Zhang, K., Chun, J., Kharchenko, P.V.: Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat Methods 13(3), 241–4 (2016). doi:10.1038/nmeth.3734
-  Frost, H.R., Li, Z., Moore, J.H.: Principal component gene set enrichment (PCGSE). BioData Mining 8, 25 (2015). doi:10.1186/s13040-015-0059-z
-  Vines, S.: Simple principal components. Journal of the Royal Statistical Society. Series C (Applied Statistics) 49(Part 4), 441–451 (2000). doi:10.1111/1467-9876.0020
-  Moghaddam, B., Weiss, Y., Avidan, S.: Spectral bounds for sparse pca: Exact and greedy algorithms. Advances in neural information processing systems 18, 915 (2006)
-  d’Aspremont, A., El Ghaoui, L., Jordan, M.I., Lanckriet, G.R.G.: A direct formulation for sparse PCA using semidefinite programming. SIAM Review 49(3), 434–448 (2007). doi:10.1137/050645506
-  Sriperumbudur, B.K., Torres, D.A., Lanckriet, G.R.G.: A majorization-minimization approach to the sparse generalized eigenvalue problem. Machine Learning 85(1-2), 3–39 (2011). doi:10.1007/s10994-010-5226-3
-  Jolliffe, I., Trendafilov, N., Uddin, M.: A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics 12(3), 531–547 (2003). doi:10.1198/106186003214
-  Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. Journal of Computational and Graphical Statistics 15(2), 265–286 (2006). doi:10.1198/106186006X113430
-  Shen, H., Huang, J.Z.: Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis 99(6), 1015–1034 (2008). doi:10.1016/j.jmva.2007.06.00
-  Witten, D.M., Tibshirani, R., Hastie, T.: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515–534 (2009). doi:10.1093/biostatistics/kxp008
-  Tibshirani, R.: Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 73(Part 3), 273–282 (2011)
-  Denton, P.B., Parke, S.J., Tao, T., Zhang, X.: Eigenvectors from eigenvalues. arXiv preprint arXiv:1908.03795 (2019)
-  Johnstone, I.M., Lu, A.Y.: On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association 104(486), 682–693 (2009). doi:10.1198/jasa.2009.0121