Large matrices are a common way of representing modern, massive datasets, since an real-valued matrix provides a natural structure for encoding information about objects, each of which is described by
features. Principal Components Analysis (PCA) and the Singular Value Decomposition (SVD) are fundamental data analysis tools, expressing a data matrix in terms of a sequence of orthogonal vectors of decreasing importance. While these vectors enjoy strong optimality properties and are often interpreted as fundamental latent factors that underlie the observed data, they are linear combinations of up to all the data points and features. As a result, they are notoriously difficult to interpret in terms of the underlying processes generating the data[MD09].
The seminal work of [dGJ07] introduced the concept of Sparse PCA, where sparsity constraints are enforced on the singular vectors in order to improve interpretability. As noted in [dGJ07, MD09, PDK13], an example where sparsity implies interpretability is document analysis, where sparse principal components can be mapped to specific topics by inspecting the (few) keywords in their support. Formally, Sparse PCA can be defined as111Recall that the norm of a vector is defined as when ; we will only use , and . Furthermore, note that is the “norm”, which is the number of non-zero entries of the input vector . (see eqn. (1) in [PDK13]):
In the above formulation, the parameter controls the sparsity of the resulting vector and is part of the input; is the symmetric positive semidefinite (PSD) covariance matrix that represents all pairwise feature similarities for the data matrix and denotes a vector that achieves the optimal value in the above formulation. In words, the above optimization problem seeks a sparse, unit norm vector
that maximizes the data variance. The following two facts are well-known: first, solving the above optimization problem is NP-hard [MWA12] and, second, this hardness is due to the sparsity constraint. Indeed, if the sparsity constraint was removed, then the resulting optimization problem can be easily solved: the optimal vector is the top (left or right) singular vector of and the related maximal value is equal to the top singular value of . Finally, we note that more than one sparse singular vectors can be constructed using a simple deflation procedure; see Section 3.1 for details. Following the lines of [PDK13], we will only focus on the formulation of eqn. (1).
Related work. The simplest sparse PCA approaches are to either rotate [Jol95] or threshold [CJ95] the top singular vector of the matrix . Such simple methods are computationally efficient and tend to perform very well in practice (see Section 3). However, there exist cases where they fail (see [CJ95] and Section 3 here). An alternative line of research focused on solving relaxations of eqn. (1). For example, an relaxation of eqn. (1) was first used in SCoTLASS [JTU03]. Another possible relaxation is a regression-type approximation [ZHT06], which was implemented in [SCLE12]. (We will compare our approach to this method.) Finally, efficient optimization methods have been developed for the sparse PCA problem. For example, the generalized power method was proposed in [JNRS10]: this method calculates stationary points for penalized versions of eqn. (1).
Despite the many approaches that were developed for sparse PCA, only a handful of them provide any type of theoretical guarantees regarding the quality of the obtained (approximate) solution. For example, the semidefinite relaxation of [dGJ07] was analyzed in [AW08], albeit for the special case where is a spiked covariance matrix with a sparse maximal singular vector. Briefly, [AW08] studies conditions for the dimensions and of the data matrix , and the sparsity parameter , so that the semidefinite relaxation of [dGJ07] recovers the sparsity pattern of the optimal solution of eqn. (1). Other attempts for provable results include the work of [dBG08], which was later analyzed in [dBG14]. In the latter paper, the authors show bounds for the semidefinite relaxation of [dBG08], in the special case that the data points are sampled using Gaussian models with a single sparse leading singular vector. Strong compressed-sensing-type conditions were used in [YZ13] to guarantee recovery of the optimal solution of eqn. (1) using a truncated power method. However, [YZ13] requires that the optimal solution is approximately sparse and also that the noise matrix has sparse submatrices with small spectral norm. Finally, [PDK13] describes a greedy combinatorial approach for sparse PCA and provides relative-error bounds for the resulting solution under the assumption that the covariance matrix has a decaying spectrum. It is important to note that in all the above papers special assumptions are necessary regarding the input data in order to guarantee the theoretical bounds. Our work here does not make any assumptions on the input data, but we do pay the cost of increased sparsity in the final solution (see Theorem 1 for details).
Our algorithm. We present and analyze a simple, two-step algorithm to approximate the optimal solution of the problem of eqn. (1). Our approach first solves a -penalized version of this NP-hard optimization problem. Then, a randomized rounding strategy is employed to sparsify the resulting dense solution of the -penalized problem. More precisely, we first solve:
Notice that we replaced the constraint on the norm of the vector by a (tighter) constraint on the norm of . It is important to mention that problem (2) is difficult and all we can hope in practice is to calculate a stationary point. However, one should not discount the quality of stationary points. In Section 3 we show that by calculating stationary points we capture as much of the variance as computationally expensive convex relaxations. Having said that, the theoretical analysis which follows assumes that we work with the globally optimal solution of problem (2).
Let be a vector that achieves the optimal value for problem (2). Clearly, is not necessarily sparse. Therefore, we employ a randomized rounding strategy to sparsify it by keeping larger entries of
with higher probability. Specifically, we employ Algorithm1 on the vector to get a sparse vector that is our approximate solution to the sparse PCA formulation of eqn. (1).
It is obvious that, in expectation, the vector has at most non-zero entries. We will discuss the appropriate choice for in Theorem 1 below.
Our results: theory. Surprisingly, this simple randomized rounding approach has not been analyzed in prior work on Sparse PCA. Theorem 1 is our main theoretical result and guarantees an additive error approximation to the NP-hard problem of eqn. (1). For simplicity of presentation, we will assume that the rows (and therefore columns) of the matrix have at most unit norms222We can relax this assumption by increasing – our sampling factor – by a factor that depends on the upper bound of the row and column norms of ..
Let be the optimal solution of the Sparse PCA problem of eqn. (1) satisfying and . Let be the vector returned when Algorithm 1 is applied on the optimal solution of the optimization problem of eqn. (2), with , where is an accuracy parameter. Then, has the following properties:
With probability at least 3/4,
With probability at least 3/4,
In words, the above theorem states that our sparse vector is almost as good as the optimal vector in terms of capturing (with constant probability) almost as much of the spectrum of as does. This comes at a penalty: the sparsity of , which is equal to , has to be relaxed to . This provides an elegant trade-off between sparsity and accuracy333A less important artifact of our approach is the fact that the Euclidean norm of the vector is slightly larger than one.. It is worth emphasizing that one should not worry about the small success probability of our approach: by repeating the rounding times and keeping the vector that satisfies the second bound and maximizes , we can immediately guarantee that we will achieve both bounds with probability at least .
Our results:experiments. We empirically evaluate our approach on real and synthetic data. We chose to compare our algorithm with the state-of-the-art Spasm toolbox of [SCLE12, ZHT06]. We also compare our solution with the simple MaxCompheuristic, which computes the top singular vector of matrix and returns a sparse vector by greedily keeping the top largest components (in absolute value) and setting the remaining ones to zero. Our empirical evaluation indicates that our simple, provably accurate approach is competitive in practice.
2 Proof of Theorem 1
Consider the indicator random variablesfor all which take the following values:
It is easy to see that for all . The following trivial properties hold for all and will be used repeatedly in the proofs: , , and .
2.2 Bounds for and
The following lemma immediately implies the second bound of Theorem 1 by setting .
Given our notation, with probability at least 3/4,
Proof: For simplicity of notation, we will drop the subscript opt from , , and in this proof. It is more intuitive to provide a bound on the expectation of and then leverage the triangle inequality in order to bound . Using the indicator variables and linearity of expectation, we get
We will now prove the following inequality, which will be quite useful in later proofs:
Towards that end, we will split the set of indices in two subsets: the set corresponding to indices such that and the set corresponding to indices such that . Note that for all it must be the case that
We now proceed as follows:
For the last inequality we used the fact that . We now use Markov’s inequality to conclude that, with probability at least 3/4,
To conclude the proof note that, from the triangle inequality,
Combining with eqn. (5), after taking the square root of both sides, concludes the proof of the lemma.
2.3 Proving eqn. (3)
Our first lemma states that the solution of the penalized version of the Sparse PCA problem is at least as good as the solution of the Sparse PCA problem. (SeeAppendix, Section 5.2 for the proof.)
Given our notation, is a feasible solution of the relaxed Sparse PCA formulation of eqn. (2). Thus,
Our next goal is to get a lower bound for the quantity after the randomized rounding step. Intuitively, one would expect to be close to , since the (entry-wise) expectation of the vector is equal to . Then, combining with Lemma 2, it is reasonable to expect that is larger than minus a small error term. It is actually easy to prove (see Appendix, Section 5.3) that
Unfortunately, the above equation only holds in expectation and cannot be used to derive the results of Theorem 1.
Getting a lower bound for is the toughest part of Theorem 1. Towards that end, the next lemma bounds the error as a function of two other quantities which will be easier to bound.
Given our notation,
The proof of the above lemma emerges by repeatedly applying the triangle inequality (see Appendix, Section 5.4). Our next lemma will provide a bound for the first of the two quantities of interest in Lemma 3 (see Appendix, Section 5.5 for the proof).
Given our notation, with probability at least 7/8,
Our next lemma will provide a bound for the second of the two quantities of interest in Lemma 3. The proof of the lemma is tedious and a number of cases need to be considered.
Given our notation, with probability at least 7/8,
Proof: For simplicity of notation, we will drop the subscript opt from , , and in this proof. Using the indicator variables and linearity of expectation, we get
Recall that for all . Notice that if any of the four indices appears only once, then the expectation corresponding to those indices equals zero. This expectation is non-zero if the four indices are paired in couples or if all four are equal. That is, non-zero expectation happens if
For case (A), let and let , in which case the corresponding terms in eqn. (7) become (notice that and are independent random variables since ):
In the first inequality we used for all and and added extra positive terms (corresponding to ), which reinforce the inequality. The last inequality follows from eqn. (4).
For case (B), let and let , in which case the corresponding terms in eqn. (7) become (notice that and are independent random variables since ):
In the first inequality we used for all and the fact that the diagonal entries of a symmetric positive definite matrix are non-negative, which allows us to add extra positive terms (corresponding to ) to reinforce the inequality. The remainder of the derivation is identical to case (A).
For case (C), let and let , in which case the corresponding terms in eqn. (7) become (notice that and are independent random variables since ):
In the first equality we used the fact that is symmetric; the remainder of the derivation is identical to case (A).
Finally, for case (D), let , in which case the corresponding terms in eqn. (7) become:
In the above derivation, we used . We also split the set of indices in two subsets: the set corresponding to indices such that and the set corresponding to indices such that . Note that for all it must be the case that Thus,
Recall that for any and use and to get
Using Markov’s inequality and taking square roots concludes the proof of the lemma.
Since each of Lemmas 4 and 5 fail with probability at most 1/8, it follows from the union bound that the above inequality holds with probability at least . Therefore, setting guarantees (after some algebra) that with probability at least ,
Using the triangle inequality and Lemma 2 we get that with probability at least 3/4
which concludes the proof of eqn. (3).
We perform experiments on both real and synthetic datasets. We chose to compare our algorithm with the solution returned by the state-of-the-art Spasm toolbox of [SCLE12], which implements the approach proposed in [ZHT06]. We also compare our solution with the simple MaxComp heuristic: this method computes the top singular vector of matrix and returns a sparse vector by keeping the top largest components (in absolute value) and setting the remaining ones to zero.
The final version of the paper will include a link to all data and code.
3.1 Experimental setup
Let with denote the object-feature data matrix, where every column has zero mean, and recall that in eqn. (2). We use the following function to measure the quality of an approximate solution to the sparse PCA problem:
Notice that for all which satisfy . The closer is to one the more the vector is parallel to the top singular vector of , or, equivalently, the closer is to one the more captures the variance of the matrix that corresponds to its top singular value. Our goal is to calculate sparse vectors with .
Our approach first finds a stationary point of the optimization problem of eqn. (2) and then uses Algorithm 1 (with ) to obtain a sparse solution vector . We note that the chosen value of is much smaller than the value stipulated by Theorem 1. Also, in practice, our choice of works very well and results in solutions that are comparable or better than competing methods in our data.
We note that in our use of Spasm, we used soft thresholding by setting the STOP parameter to and (both STOP and are parameters of Spasm toolbox). This implies that the solutions obtained by Spasm and our approach will have the same number of non-zeros, thus making the comparison fair. Similarly, for MaxComp, after computing the top singular vector of , we select the largest (in absolute value) coordinates to form the sparse solution. A final technical note is that the solutions obtained using either our method or MaxComp may result in vectors with non-unit Euclidean norms. In order to achieve a fair comparison in terms of eqn. (11), there are two options. The first one (naive approach) is to simply normalize the resulting vectors. However, a better approach (SVD-based) is possible: given a sparse solution vector , we could keep the rows and columns of that correspond to the non-zero entries in and compute the top singular vector of the induced matrix. Note that the induced matrix would be a matrix and its top singular vector would be a
-dimensional vector. Obviously, this latter vector would be a unit norm vector and it could be padded with zeros to derive a unit norm vector inwith the same sparsity pattern as . It is straight-forward to argue that this vector would improve the value of compared to the naive normalized vector . In our experimental evaluation, we will evaluate both the naive and the SVD-based normalization methods.
In some of our experiments we will need to extract more than one sparse singular vectors. Towards that end, we can use a deflation approach (Algorithm 2) to construct more than one sparse singular vectors by first projecting the residual matrix into the space that is perpendicular to the top sparse singular vector and then computing the top sparse singular vector of the residual matrix. In Algorithm 2, rspca refers to the solution of the optimization problem of eqn. (2) followed by Algorithm 1).
3.2 Data Sets
We used 22 matrices emerging from population genetics, namely the 22 matrices (one for each chromosome) that encode all autosomal genotypes that are available in both the Human Genome Diversity Panel [The07] and the HAPMAP [LAT08] project. Each of these matrices has approximately 2,500 samples (objects) described with respect to tens of thousands of features (Single Nucleotide Polymorphisms or SNPs); see [PLJD10] for details. We also used a gene expression dataset (GSE10072, lung cancer gene expression data) from the NCBI Gene Expression Omnibus database. This dataset includes 107 samples (58 lung tumor cases and 49 normal lung controls) measured using 22,215 probes (features) from the GPL96 platform annotation table. Both the population genetics and the gene expression datasets are interesting in the context of sparse PCA beyond numerical evaluations, since the sparse components can be directly interpreted to identify small sets of SNPs or genes that capture the data variance.
Our synthetic dataset has been carefully designed in order to highlight a setting where the MaxComp heuristic will fail. More specifically, the absolute values of the entries of the largest singular vector of a matrix in this family of matrices is not a good indicator of the importance of the respective entry in a sparse PCA solution. Instead, the vector that emerges from the optimization problem of eqn. (2) is a much better indicator. For a detailed description of the synthetic data generator see Section 5.6 in the Appendix.
Our third data set comes from the field of text categorization and information retrieval. In such applications, documents are often represented as a “bag of words” and a vector space model is used. We use a subset of the Classic-3444ftp://ftp.cs.cornell.edu/pub/smart/ document collection, which we will call Classic-2. This subset consists of the CISI (Comités Interministériels pour la Société de l’Information) collection (1,460 information retrieval abstracts) and the CRANFIELD collection (1,398 aeronautical systems abstracts). We created a document-by-term matrix using the Text-to-Matrix Generator (TMG) [ZG06]555The matrix was created using the stoplist provided by TMG, the tf-idf weighting scheme, and document normalization. We removed numbers and alphanumerics and we didn’t use stemming.; the final matrix is a sparse matrix with entries between zero and one, representing the weight of each term in the corresponding document.
First we compare the performance of the different methods on a synthetic dataset, using the data generator which was described in Section 3.2 with , , and . In Figure (a)a we plot (in the -axis) the value of the performance ratio (as defined in eqn. (11)) for our method (cvx+Alg.1), the Spasm software, and the MaxComp heuristic. We also note that for each of the three methods, we use two different approaches to normalize the resulting sparse vector: the naive and the SVD-based ones (see the last paragraph in Section 3.1). As a result, we have a total of six possible methods to create and normalize a sparse vector for sparse PCA. The -axis shows the sparsity ratio of the resulting vector, namely . We remark that all six methods produce sparse vectors with exactly the same sparsity in order to have a fair comparison. Notice that in Figure (a)a, the MaxComp heuristic has worse performance when compared to either our approach or to Spasm: this is expected, since we constructed this family of matrices in order to precisely guarantee that the largest components of the top singular vector would not be good elements to retain in the construction of a sparse vector. To further visualize this, we look at the sparse vectors, returned by the different methods, in Figure 4 in the Appendix. In this figure, we present the resulting sparse vectors for each of the three methods (normalization, obviously, is not relevant in Figure 4) for a particular choice of the sparsity ratio (). Notice that MaxComp fails to capture the right sparsity pattern, whereas our method and Spasm succeed.
In the second experiment, we present the performance of the different methods on the real datasets described in Section 3.2. The results are shown in Figures (b)b, (c)c, and (d)d. (We only show results for the first two chromosomes of the joint HapMap and HGDP datasets; the other 20 chromosomes behave very similarly and are shown in Figures 5, 6, 7, and 8 in the Appendix.) We note that in the population genetics data, our method has approximately the same or better performance compared to both MaxComp and Spasm. Not surprisingly, the naive normalization approach is consistently worse than the SVD-based one. It is worth noting that our SVD-based normalization approach easily improves the output of Spasm. This is because Spasm does detect the correct sparsity pattern but fails to compute the appropriate coefficients of the resulting sparse vectors.
Finally, we evaluate the performance of our algorithm in a clustering application, using the Classic-2 document collection. First, we run Algorithm 2 to obtain two sparse singular vectors. In Figure (b)b, our sparse PCA approach (Randomized sPCA) separates two clusters, each roughly aligning to each of the axes, separating the two collections of Classic-2. Table 1 summarizes the variance captured by the top two principal components using PCA, randomized sPCA (rspca) or solving eqn. (2) (cvx). Table 2 summarizes the sparsity of the first and second principal components in each step of our approach.
|Top Principal Comp.||41.9%||41.8%||40.8%|
|Top two Principal Comp.||61.7%||61.5%||59.7%|
|Top Principal Comp.|
|2nd Top Principal Comp.|
4 Open problems
From a theoretical perspective, it would be interesting to explore whether other relaxations of the sparse PCA problem of eqn. (1), combined with randomized rounding procedures, could improve our error bounds in Theorem 1. It would also be interesting to formally analyze the deflation algorithm that computes more than one sparse singular vectors in a randomized manner (Algorithm 2). Finally, from a complexity theory perspective, we are not aware of any inapproximability results for the sparse PCA problem; to the best of our knowledge, it is not known whether relative error approximations are possible (without any assumptions on the input matrix).
- [AW08] A. A. Amini and M. J. Wainwright. High-dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on,, pages 2454–2458, 2008.
- [CJ95] J. Cadima and I. T. Jolliffe. Loading and correlations in the interpretation of principle compenents. Journal of Applied Statistics, pages 203–214, 1995.
A. d Aspremont, F. Bach, and L. El Ghaoui.
Optimal Solutions for Sparse Principal Component Analysis.
The Journal of Machine Learning Research, pages 1269–1294, 2008.
- [dBG14] A. d Aspremont, F. Bach, and L. El Ghaoui. Approximation Bounds for Sparse Principal Component Analysis. Math. Program., Ser. B, pages 89–110, 2014.
- [dGJ07] A. d Aspremont, L. El Ghaoui, and M. I. Jordan. A Direct Formulation for Sparse PCA using Semidefinite Programming. SIAM Review, pages 434–448, 2007.
- [JNRS10] Michel Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized Power Method for Sparse Principal Component Analysis. Journal of Machine Learning Research, pages 517–553, 2010.
- [Jol95] I. T. Jolliffe. Rotation of principal components: choice of normalization constraints. Journal of Applied Statistics, pages 29–35, 1995.
- [JTU03] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, pages 531–547, 2003.
- [LAT08] J. Z. Li, D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto, S. Ramachandran, H. M. Cann, G. S. Barsh, M. Feldman, L. L. Cavalli-Sforza, and R. M. Myers. Worldwide human relationships inferred from genome-wide patterns of variation. Science, 319:1100–1104, 2008.
- [MD09] M.W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. In Proceedings of the National Academy of Sciences, pages 697–702, 106 (3), 2009.
- [MWA12] B. Moghaddam, Y. Weiss, and S. Avidan. Generalized Spectral Bounds for Sparse lda. In Proceedings of the 23rd international conference on Machine learning, pages 641–648, 2012.
- [PDK13] D. S. Papailiopoulos, A. G. Dimakis, and S. Korokythakis. Sparse PCA through Low-rank Approximations. http://arxiv.org/abs/1303.0551, 2013.
- [PLJD10] P. Paschou, J. Lewis, A. Javed, and P. Drineas. Ancestry informative markers for fine-scale individual assignment to worldwide populations. J. Med. Genet., 47(12):835–847, Dec 2010.
- [SCLE12] K. Sj strand, L.H. Clemmensen, R. Larsen, and B. Ersb ll. Spasm: A matlab toolbox for sparse statistical modeling. In Journal of Statistical Software (Accepted for publication), 2012.
- [The07] The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449:851–861, 2007.
X.-T. Yuan and T. Zhang.
Truncated Power Method for Sparse Eigenvalue Problems.Journal of Machine Learning Research, pages 899–925, 2013.
- [ZG06] D. Zeimpekis and E. Gallopoulos. TMG: A MATLAB toolbox for generating term document matrices from text collections. In J. Kogan, C. Nicholas, and M. Teboulle, editors, Grouping Multidimensional Data: Recent Advances in Clustering, pages 187–210. Springer, 2006.
- [ZHT06] H. Zou, T. Hastie, and R Tibshirani. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, pages 265–286, 2006.
5.1 A bound for
5.2 Proof of Lemma 2
Recall that is a unit norm vector whose zero norm is at most . Then, if we let denote the vector of signs for (with the additional convention that if is equal to zero then the -th entry of is also set to zero), we get
The second inequality follows since has at most non-zero entries. Thus, is feasible for the optimization problem of eqn. (2) and the conclusion of the lemma follows immediately.
5.3 Proving eqn. (6)
For simplicity of notation, we will drop the subscript opt from , , and in this proof. Using the indicator variables (see Section 2.1) for notation) and linearity of expectation, we get