Sparse Principal Component Analysis via Rotation and Truncation

03/06/2014 · Zhenfang Hu, et al. · Zhejiang University

Sparse principal component analysis (sparse PCA) aims at finding a sparse basis to improve interpretability over the dense basis of PCA, while the sparse basis should still cover the data subspace as much as possible. In contrast to most existing work, which deals with the problem by adding sparsity penalties to various objectives of PCA, in this paper we propose a new method, SPCArt, whose motivation is to find a rotation matrix and a sparse basis such that the sparse basis approximates the basis of PCA after the rotation. The algorithm of SPCArt consists of three alternating steps: rotate the PCA basis, truncate small entries, and update the rotation matrix. Its performance bounds are also given. SPCArt is efficient, with each iteration scaling linearly with the data dimension. It is easy to choose parameters in SPCArt, due to their explicit physical explanations. Besides, we give a unified view of several existing sparse PCA methods and discuss their connection with SPCArt. Some ideas in SPCArt are extended to GPower, a popular sparse PCA algorithm, to overcome its drawback. Experimental results demonstrate that SPCArt achieves state-of-the-art performance. It also achieves a good tradeoff among various criteria, including sparsity, explained variance, orthogonality, balance of sparsity among loadings, and computational speed.

1 Introduction

In many research areas, such as signal processing, machine learning, computer vision, document processing, computer networks, and genetics, the data we encounter are usually of very high dimension. However, almost all data in these areas have a much lower intrinsic dimension. Thus, how to handle such data is a long-standing problem.

1.1 PCA

Principal component analysis (PCA) [1] is one of the most popular analysis tools for dealing with this situation. Given a set of data whose mean has been removed, PCA approximates the data by representing them in another orthonormal basis, called the loading vectors. The coefficients of the data when represented using these loadings are called principal components. They are obtained by projecting the data onto the loadings, i.e., by taking inner products between the loading vectors and the data vector. Usually, the loadings are viewed as an ordered set of vectors, in that the variances of the data explained by them are in decreasing order; e.g., the leading loading points in the maximal-variance direction. If the data lie in a low-dimensional subspace, i.e., the distribution mainly varies in a few directions, a few loadings are enough to obtain a good approximation, and the original high-dimensional data can then be represented by the low-dimensional principal components, so dimensionality reduction is achieved.
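As a concrete illustration (not the paper's code), the following is a minimal NumPy sketch of this procedure; the names A, loadings, and scores are mine, and the data layout (rows as samples) is an assumption.

```python
import numpy as np

def pca(A, r):
    """Return the first r PCA loadings and principal components of A.

    A: (n, p) data matrix with rows as samples (an assumed layout;
    the paper's own notation is defined in Table 2).
    """
    A = A - A.mean(axis=0)                      # remove the mean
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    loadings = Vt[:r].T                         # (p, r) orthonormal loading vectors
    scores = A @ loadings                       # principal components: projections onto loadings
    return loadings, scores

# toy usage: 100 samples in 5 dimensions, keep 2 loadings
V, pcs = pca(np.random.randn(100, 5), 2)
```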

Commonly, the dimensions of the original data have some physical explanation. For example, in financial or biological applications, each dimension may correspond to a specific asset or gene [2]. However, the loadings obtained by PCA are usually dense, so each principal component, obtained via an inner product, is a mixture of all dimensions, which makes it difficult to interpret. If most of the entries in the loadings are zero (sparse), each principal component becomes a linear combination of a few non-zero entries. This facilitates the understanding of the physical meaning of the loadings as well as of the principal components [1]. Further, the physical interpretation would be clearer if different loadings had different non-zero entries, i.e., corresponded to different dimensions.

1.2 Sparse PCA

Sparse PCA aims at finding a sparse basis to make the result more interpretable [3]. At the same time, the basis is required to represent the data distribution faithfully. Thus, there is a tradeoff between the statistical fidelity and the interpretability.

During the past decade, a variety of methods for sparse PCA have been proposed. Most of them consider the tradeoff between sparsity and explained variance. However, three points have not yet received enough attention: the orthogonality between loadings, the balance of sparsity among loadings, and the pitfall of deflation algorithms.

  • Orthogonality. PCA loadings are orthogonal, but in pursuing sparse loadings this property is easily lost. Orthogonality is desirable in that it indicates independence of the physical meaning of the loadings. When the loadings are sufficiently sparse, orthogonality usually implies non-overlapping supports. So, in the context of improving the interpretability of PCA, each loading is then associated with distinct physical variables, and so are the principal components; this makes interpretation much easier. Besides, if the loadings do not form an orthogonal basis, the inner products between the data and the loadings that are used to compute the components do not constitute an exact projection. As an extreme example, if two loadings are very close, the two components would be similar too, which is meaningless.

  • Balance of sparsity. No loading should be highly dense, particularly the leading ones that account for most of the variance; otherwise the result is of little use. We emphasize this point because quite a few existing algorithms yield loadings in which the leading ones are highly dense (close to those of PCA) while the minor ones are highly sparse; sparsity is then achieved by the minor loadings while variance is explained by the dense ones. This is unreasonable.

  • Pitfall of deflation. Existing work can be categorized into two groups: the deflation group and the block group. To obtain sparse loadings, the deflation group computes one loading at a time; more are obtained by removing the components that have already been computed [4]. This follows traditional PCA. The block group finds all loadings together. Generally, the optimal sparse loadings found when the subspace is restricted to a given dimension may not overlap with the optimal loadings found when the dimension increases [5]. This problem does not occur for PCA, whose loadings successively maximize the variance, so the loadings found via deflation are always globally optimal for any target dimension. But this is not the case for sparse PCA: the deflation approach is greedy and cannot find optimal sparse loadings, whereas the block group has that potential.

Finally, we mention that the loadings obtained by deflation are nearly orthogonal, while the block group is usually not equipped with a mechanism to ensure orthogonality.

1.3 Our Method: SPCArt

In this paper, we propose a new approach called SPCArt (Sparse PCA via rotation and truncation). In contrast to most traditional work, which is based on adding a sparsity penalty to the objective of PCA, the motivation of SPCArt is distinctive: SPCArt aims to find a rotation matrix and a sparse basis such that the sparse basis approximates the loadings of PCA after the rotation. The resulting algorithm consists of three alternating steps: rotate the PCA loadings, truncate their small entries, and update the rotation matrix.

SPCArt turns out to resolve or alleviate the three points above. It has the following merits. (1) It is able to explain as much variance as the PCA loadings, since the sparse basis spans almost the same subspace as the PCA loadings. (2) The new basis is close to orthogonal, since it approximates the rotated PCA loadings. (3) The truncation tends to produce more balanced sparsity, since the vectors of the rotated PCA loadings are of equal length. (4) It is not greedy; it belongs to the block group rather than the deflation group.

The contributions of this paper are four-fold: (1) we propose an efficient algorithm, SPCArt, achieving good performance over a series of criteria, some of which have been overlooked by previous work; (2) we devise various truncation operations for different situations and provide a performance analysis; (3) we give a unified view of a series of previous sparse PCA approaches, together with ours; (4) under the unified view, we find the relation between GPower, rSVD, and our method, and extend GPower [6] and rSVD [7] to a new implementation, called rSVD-GP, to overcome their drawbacks: the parameter tuning problem and the imbalance of sparsity among loadings.

The rest of the paper is organized as follows: Section 2 introduces representative work on sparse PCA. Section 3 presents our method SPCArt and four types of truncation operations, and analyzes their performance. Section 4 gives a unified view for a series of previous work. Section 5 shows the relation between GPower, rSVD, and our method, and extends GPower and rSVD to a new implementation, called rSVD-GP. Experimental results are provided in Section 6. Finally, we conclude this paper in Section 7.

PCA [1] | ST [8] | SPCA [9] | PathSPCA [2] | ALSPCA [10] | GPower [6], rSVD-GP, TPower [11] | GPowerB [6], rSVD-GPB | SPCArt

Table 1: Time complexities for computing loadings from samples of dimension . is the number of iterations. is the cardinality of a loading. The preprocessing and initialization overheads are omitted. ST and SPCArt have the additional cost of PCA. The complexities of SPCArt listed are for the truncation types T- and T-; those of T-sp and T-en are .

2 Related Work

Various sparse PCA methods have been proposed during the past decade. We give a brief review below.

1. Post-processing PCA. In the early days, interpretability was gained via post-processing the PCA loadings. Loading rotation (LR) [5] applies various criteria to rotate the PCA loadings so that a 'simple structure' emerges; e.g., the varimax criterion drives the entries to be either small or large, which is close to a sparse structure. Simple thresholding (ST) [8] instead obtains sparse loadings by directly setting the entries of the PCA loadings below a small threshold to zero.

2. Covariance matrix maximization. More recently, systematic approaches based on solving explicit objectives were proposed, starting from SCoTLASS [3], which optimizes the classical objective of PCA, i.e., maximizing the quadratic form of the covariance matrix, while additionally imposing a sparsity constraint on each loading.

3. Matrix approximation. SPCA [9] formulates the problem as a regression-type optimization, so as to facilitate the use of LASSO [12] or elastic-net [13] techniques to solve the problem. rSVD [7] and SPC [14] obtain sparse loadings by solving a sequence of rank-1 matrix approximations, with sparsity penalty or constraint imposed.

4. Semidefinite convex relaxation. Most of the methods proposed so far are local ones, which suffer from getting trapped in local minima. DSPCA [15] transforms the problem into a semidefinite convex relaxation problem, so global optimality of the solution is guaranteed. This distinguishes it from most of the other, local methods. Unfortunately, its computational complexity is as high as ( is the number of variables), which is expensive for most applications. Later, a variable elimination method [16] of complexity was developed to make application to large-scale problems feasible.

5. Greedy methods. In [17], greedy search and branch-and-bound methods are used to solve small instances of the problem exactly. Each step of the algorithm has a complexity of , leading to a total complexity of for a full set of solutions (solutions of cardinality ranging from 1 to ). This bound was later improved in the classification setting [18]. Alternatively, a greedy algorithm, PathSPCA [2], was presented to further approximate the solution process of [17], resulting in a complexity of for a full set of solutions. For a review of DSPCA, PathSPCA, and their applications, see [19].

6. Power methods. GPower [6] formulates the problem as the maximization of a convex objective function, and the solution is obtained by generalizing the power method [20] that is used to compute the PCA loadings. More recently, a new power method, TPower [11], and a somewhat different but related power method, ITSPCA [21], which aims at recovering a sparse principal subspace, were proposed.

7. Augmented Lagrangian optimization. ALSPCA [10] solves the problem via augmented Lagrangian optimization. Its most distinctive feature is that it simultaneously considers the explained variance, orthogonality, and correlation among principal components.

Among these methods, only LR [5], SCoTLASS [3], and ALSPCA [10] have considered the orthogonality of loadings. SCoTLASS, rSVD [7], SPC [14], the greedy methods [17, 2], one version of GPower [6], and TPower [11] belong to the deflation group. Only DSPCA's solution [15] is guaranteed to be globally optimal.

The computational complexities of some of the above algorithms are summarized in Table 1.

3 SPCArt: Sparse PCA via Rotation and Truncation

We first give a brief overview of SPCArt, then introduce its motivation, its objective and optimization, and the truncation types, and finally provide a performance analysis.

The idea of SPCArt is simple. Since any rotation of the PCA loadings constitutes an orthonormal basis spanning the same subspace, we want to find a rotation matrix through which the PCA loadings are transformed into a basis that is as sparse as possible. It is difficult to solve this problem directly, so instead we find a rotation matrix and a sparse basis such that the sparse basis approximates the PCA loadings after the rotation.

The major notations used are listed in Table 2.


notation
interpretation

data matrix with samples of variables

PCA loadings arranged column-wise. denotes the th column. denotes the first columns

rotation matrix

rotated PCA loadings, i.e.

sparse loadings arranged column-wise, similar to

for a matrix , , let the thin SVD be , , then

. For a vector , is entry-wise soft thresholding: , where if and otherwise

. For a vector , is entry-wise hard thresholding: , i.e. if , otherwise

. For a vector , sets the smallest entries (absolute value) to be zero

. For a vector , sets the smallest entries, whose energy takes up at most , to zero. is found as follows: sort in ascending order to be ,

Table 2: Major notations.

3.1 Motivation

Our method is motivated by the solution of the Eckart-Young theorem [22]. This theorem considers the problem of approximating a matrix by the product of two low-rank factors.

Theorem 1.

(Eckart-Young Theorem) Assume the SVD of a matrix is , in which , , is diagonal with , and . A rank () approximation of is to solve the following problem:

(1)

where and . A solution is

(2)

where is the first columns of .

Alternatively, the solution can be expressed as

(3)

where denotes the orthonormal component of the polar decomposition of a matrix [6]. From the more familiar SVD perspective, its equivalent definition is provided in Table 2.
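For concreteness, here is a small sketch (my own, not the authors' implementation) of this polar-decomposition operator, computed from the thin SVD as described in Table 2; the function name polar is mine.

```python
import numpy as np

def polar(M):
    """Orthonormal factor of the polar decomposition of M.

    If the thin SVD is M = U @ diag(s) @ Wt, the polar factor is U @ Wt,
    i.e. the closest matrix with orthonormal columns to M in Frobenius
    norm (when M has full column rank).
    """
    U, s, Wt = np.linalg.svd(M, full_matrices=False)
    return U @ Wt
```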

Note that if the row vectors of the data matrix have been centered to have mean zero, these are the loadings obtained by PCA. Clearly, the rotated factors also solve (1). This implies that any rotation of the orthonormal leading eigenvectors is a solution of the best rank approximation of the data. That is, any orthonormal basis of the corresponding eigen-subspace is capable of representing the original data distribution as well as the original basis. Thus, a natural idea for sparse PCA is to find a rotation matrix so that the rotated loadings become sparse, i.e.,

(4)

where denotes the sum of the (pseudo-)norms of the columns of a matrix, i.e., it counts the non-zero entries of the matrix.

3.2 Objective and Optimization

Unfortunately, the above problem is hard to solve, so we approximate it instead. Since the rotated loadings span the same subspace, we want to find a rotation matrix through which a sparse basis approximates the rotated PCA loadings. Without confusion, we use the same symbol to denote the first loadings hereafter. For simplicity, the ℓ0 version is postponed to the next section; we consider the ℓ1 version first:

(5)

Here the ℓ1 norm of a vector is the sum of its absolute values. It is well known that the ℓ1 norm is sparsity inducing and is a convex surrogate of the ℓ0 norm [23]. Under this objective, the solution may not be orthogonal and may deviate from the eigen-subspace spanned by the PCA loadings. However, if the approximation is accurate enough, the sparse basis would be nearly orthogonal and explain a similar amount of variance as the PCA loadings. Note that the above objective turns out to be a matrix approximation problem, as in the Eckart-Young theorem. The key difference is that a sparsity penalty is added, but the solutions still share some common features.

There are no closed-form solutions for the rotation matrix and the sparse basis simultaneously. We can solve the problem by fixing one and optimizing the other alternately. Both subproblems have closed-form solutions.

3.2.1 Fix , solve

When is fixed, it becomes a Procrustes problem [9]:

(6)

The solution has the same form as the right-hand side of (3).

3.2.2 Fix , solve

When is fixed, it becomes

(7)

There are independent subproblems, one for each column. Each has a closed-form solution given by entry-wise soft thresholding [6], defined in Table 2. This is truncation type T-ℓ1: soft thresholding.

This has the following physical explanation. The rotated PCA loadings form an orthonormal basis, and the sparse basis is obtained by truncating their small entries. On one hand, because each column of the rotated loadings has unit length, a single threshold is enough to make the sparsity distribute evenly among the columns; otherwise we would have to apply different thresholds to different columns, which are hard to determine. On the other hand, because of the orthogonality of the rotated loadings and the small truncations, the sparse basis can still be nearly orthogonal. These are the most distinctive features of SPCArt. They enable easy analysis and parameter setting.

The algorithm of SPCArt is presented in Algorithm 1, where the truncation in line 7 can be of any type, including T-ℓ1 and the others introduced in the next section.

The computational complexity of SPCArt is shown in Table 1. Apart from the cost of computing the PCA loadings, SPCArt scales linearly with the data dimension. When the number of loadings is not too large, it is efficient.

1:  Input: data matrix , number of loadings , truncation type , truncation parameter .
2:  Output: sparse loadings .
3:  PCA: compute rank- SVD of : , .
4:  Initialize : .
5:  repeat
6:     Rotation: .
7:     Truncation: , .
8:     Update : thin SVD of : , .
9:  until convergence
Algorithm 1 SPCArt
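To make the three alternating steps concrete, the following is a hedged NumPy sketch of Algorithm 1 with the soft-thresholding truncation. The variable names (A, V, R, X, Z), the normalization of the truncated columns, and the default threshold 1/sqrt(p) (the entry value of a uniform vector mentioned in Section 3.4) are my reading of the text, not the released code.

```python
import numpy as np

def polar(M):
    """Orthonormal polar factor of M via thin SVD."""
    U, s, Wt = np.linalg.svd(M, full_matrices=False)
    return U @ Wt

def spcart(A, r, lam=None, n_iter=200, tol=1e-10):
    """Sketch of SPCArt (Algorithm 1) with T-l1 (soft thresholding).

    A: (n, p) centered data matrix.  Returns sparse loadings X of shape (p, r).
    lam defaults to 1/sqrt(p), a common choice suggested in the text.
    """
    n, p = A.shape
    if lam is None:
        lam = 1.0 / np.sqrt(p)
    # PCA loadings: first r right singular vectors of A
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V = Vt[:r].T                       # (p, r)
    R = np.eye(r)                      # initialize rotation
    X_old = V.copy()
    for _ in range(n_iter):
        Z = V @ R                                            # rotate PCA loadings
        X = np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)    # truncate (soft threshold)
        norms = np.linalg.norm(X, axis=0)
        X = X / np.where(norms > 0, norms, 1.0)              # normalize nonzero columns
        R = polar(V.T @ X)                                   # update rotation (Procrustes solution)
        if np.linalg.norm(X - X_old) < tol * max(np.linalg.norm(X_old), 1.0):
            break
        X_old = X
    return X
```

With R initialized to the identity, the first truncation acts directly on the PCA loadings, which matches the connection to simple thresholding noted at the end of Section 3.3.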

3.3 Truncation Types

In this section, given the rotated PCA loadings, we introduce the truncation operation, which can be any of the following four types: T-ℓ1 (soft thresholding), T-ℓ0 (hard thresholding), T-sp (truncation by sparsity), and T-en (truncation by energy). T-ℓ1 was introduced in the last section; it results from the ℓ1 penalty.

T-ℓ0: hard thresholding. Set the entries below the threshold to zero (the operator is defined in Table 2). It results from the ℓ0 penalty:

(8)

The optimization is similar to the ℓ1 case. Fixing the sparse basis, the rotation update is unchanged. Fixing the rotation, the problem decomposes into entry-wise subproblems whose solution is apparent: an entry is kept if its magnitude reaches the threshold and set to zero otherwise. Hence the solution can be expressed as entry-wise hard thresholding.

There is no normalization here, in contrast to the ℓ1 case. This is because, if the unit-length constraint is added, there is no closed-form solution. However, in practice we still normalize the truncated columns for consistency, since empirically no significant difference is observed.

Note that both the ℓ1 and ℓ0 penalties result only in a thresholding operation on the rotated loadings and nothing else (they only make line 7 of Algorithm 1 different). Hence, we may devise other heuristic truncation types irrespective of an explicit objective:

T-sp: truncation by sparsity. Truncate a fixed number of the smallest entries; Table 2 gives the precise definition. It can be shown that this heuristic type results from a cardinality constraint:

(9)

When the sparse basis is fixed, the solution for the rotation is the same as in the ℓ1 and ℓ0 cases above. When the rotation is fixed, the solution is given by the T-sp truncation defined in Table 2. The proof is given in Appendix A.

T-en: truncation by energy. Truncate the smallest entries whose energy (sum of squares) takes up at most a given percentage; the operator is described in Table 2. However, we are not aware of any objective associated with this type.
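The four truncation types can be summarized operationally. The sketch below (function names are mine) gives one plausible reading of each operator acting on a single vector z; the exact definitions are those of Table 2, which the code only approximates.

```python
import numpy as np

def t_l1(z, lam):
    """T-l1: entry-wise soft thresholding with threshold lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def t_l0(z, lam):
    """T-l0: entry-wise hard thresholding (zero out entries below lam)."""
    return np.where(np.abs(z) >= lam, z, 0.0)

def t_sp(z, k):
    """T-sp: zero out the k smallest entries in absolute value."""
    x = z.copy()
    x[np.argsort(np.abs(z))[:k]] = 0.0
    return x

def t_en(z, eps):
    """T-en: zero out the smallest entries whose energy (sum of squares)
    takes up at most a fraction eps of the total energy of z."""
    order = np.argsort(np.abs(z))                 # ascending by magnitude
    cum = np.cumsum(z[order] ** 2)                # cumulative energy of smallest entries
    m = np.searchsorted(cum, eps * np.sum(z ** 2), side='right')
    x = z.copy()
    x[order[:m]] = 0.0
    return x
```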

Algorithm 1 describes the complete algorithm of SPCArt with any truncation type.

SPCArt promotes the seminal ideas of simple thresholding [8] and loading rotation [5]. With the thresholding truncation, the first iteration of SPCArt (with the rotation initialized to the identity) corresponds to the ad hoc simple thresholding ST, which is frequently used in practice and sometimes produces good results [9, 17]. On the other hand, the motivation of SPCArt, i.e. (4), is similar to loading rotation: whereas SPCArt explicitly seeks sparse loadings via the ℓ0 pseudo-norm, loading rotation seeks a 'simple structure' via various criteria, e.g., the varimax criterion, which maximizes the variance of the squared loadings and thereby drives the entries to distribute unevenly, either small or large (see Section 7.2 in [1]).

3.4 Performance Analysis

This section discusses the performance bounds of SPCArt with each truncation type. For , , we study the following problems:

(1) How much sparsity of is guaranteed?

(2) How much deviates from ?

(3) How is the orthogonality of ?

(4) How much variance is explained by ?

The performance bounds derived are functions of the truncation parameter. Thus, we can directly or indirectly control sparsity, orthogonality, and explained variance via this parameter. (Footnote 1: Theorem 13 is specific to SPCArt and concerns the important explained variance. The other results apply to more general situations: Propositions 6-11 apply to any orthonormal basis, and Theorem 12 applies to any matrix.) To obtain results specific to SPCArt, we may have to make assumptions about the data distribution. Nevertheless, these are still absolute performance bounds of SPCArt and can guide us in setting the parameter for some performance guarantee. We give some definitions first.

Definition 2.

, the sparsity of is the proportion of zero entries:

Definition 3.

, , , the deviation of from is , where is the included angle between and , . If , is defined to be .

Definition 4.

, , , the nonorthogonality between and is , where is the included angle between and .

Definition 5.

Given data matrix containing samples of dimension , basis , , the explained variance of is . Let be any orthonormal basis in the subspace spanned by , then the cumulative percentage of explained variance is [7].

Intuitively, a larger truncation parameter leads to higher sparsity and larger deviation. When two truncated vectors deviate from their originally orthogonal counterparts, in the worst case their nonorthogonality degrades to the 'sum' of their deviations. On the other hand, if the deviations of the sparse basis from the rotated loadings are small, we expect the sparse basis still represents the data well, and the explained variance or cumulative percentage of explained variance remains at a level similar to that of PCA. So both the nonorthogonality and the explained variance depend on the deviations, and the deviation and sparsity are in turn controlled by the truncation parameter. We now go into details. The proofs of some of the following results are included in Appendix B.

3.4.1 Orthogonality

Proposition 6.

The relative upper bound of nonorthogonality between and , , is

(10)

The bounds can be obtained by considering the two conical surfaces generated by axes with the rotational angles. The proposition implies that the nonorthogonality is determined by the sum of the deviation angles. When the deviations are small, the orthogonality is good. The deviation depends on the truncation parameter, which is analyzed below.

3.4.2 Sparsity and Deviation

The following results only concern a single vector of the basis. We will denote by , and by for simplicity, and derive bounds of sparsity and deviation for each . They depend on a key value , which is the entry value of a uniform vector.

Proposition 7.

For T-, the sparsity bounds are

(11)

Deviation , where is the truncated part: if , and otherwise. The absolute bounds are:

(12)

All the above bounds are achievable.

Because there is no sparsity guarantee when the threshold is below the key value above (the entry value of a uniform vector), the threshold is usually set to this value in practice. Generally this works well.

Proposition 8.

For T-, the bounds of and lower bound of are the same as T-. In addition, there are relative deviation bounds

(13)

It is still an open question whether T-ℓ0 and T-ℓ1 share the same upper bound on the deviation. By the relative lower bounds, we have

Corollary 9.

The deviation due to soft thresholding is always larger than that of hard thresholding, if the same is applied.

This implies that results obtained by soft thresholding potentially have greater sparsity and less explained variance than those obtained by hard thresholding.

Proposition 10.

For T-sp, , and

(14)

Except in the unusual case where the vector already contains many zeros, the achieved sparsity matches the prescribed level. The main advantage of T-sp lies in its direct control over sparsity: if a specific sparsity level is desired, it can be applied.

Proposition 11.

For T-en, . In addition

(15)

If , there is no sparsity guarantee. When is moderately large, .

Due to the discrete nature of the operand, the actually truncated energy can be less than the target. But in practice, especially when is moderately large, the effect is negligible, so the target is usually attained. The main advantage of T-en is its direct control over deviation. Recall that the deviation has a direct influence on the explained variance. Thus, if a specific level of explained variance is desired, T-en is preferable. Besides, if is moderately large, T-en also gives good control over sparsity.

3.4.3 Explained Variance

Finally, we derive bounds on the explained variance. Two results are provided. The first one is general and applicable to any basis, not only sparse ones. The second is tailored to SPCArt.

Theorem 12.

Let rank- SVD of be , . Given , assume SVD of is , , . Then

(16)

and .

The theorem can be interpreted as follows. If the sparse basis approximates the rotated PCA loadings well, then the factor is close to one, and the variance explained by the sparse basis is close to that explained by PCA. Note that the variance explained by the PCA loadings is the largest value achievable by an orthonormal basis. Conversely, if the sparse basis deviates much from the rotated PCA loadings, then the factor tends to zero, and the explained variance is not guaranteed to be large. We see that the less the sparse loadings deviate from the rotated PCA loadings, the more variance they explain.

When SPCArt converges, i.e., the three update equations hold simultaneously, we have another estimate. It is mainly valid for T-en.

Theorem 13.

Let , i.e. , and let be with diagonal elements removed. Assume and , , then

(17)

When is sufficiently small,

(18)

Since the sparse loadings are obtained by truncating small entries of the rotated PCA loadings, and the deviation angle is measured between these sparse loadings and the rotated PCA loadings, the theorem implies that if the deviation is small, the variance explained by the sparse loadings is close to that of PCA. For example, if the truncated energy is about 0.05, then about 95% of the EV of the PCA loadings is guaranteed.

The assumptions , , , are roughly satisfied by T-en using small . Uniform deviation can be achieved by T-en, as indicated by Proposition 11. The other assumption means the sum of projected lengths is less than 1 when projected onto each loading. It must be satisfied if the basis is exactly orthogonal, and it is likely to be satisfied if the basis is nearly orthogonal (note the vector may not lie in the subspace spanned by the basis), which can be achieved by setting the parameter small, according to Proposition 6. In this case, about is guaranteed.

In practice, we prefer CPEV [7] to EV. CPEV measures the variance explained by the subspace rather than by the basis. Since it is also the projected length of the data onto the subspace spanned by the loadings, the higher the CPEV, the better the loadings represent the data. If the loadings are not an orthogonal basis, EV may overestimate or underestimate the variance. However, if they are nearly orthogonal, the difference is small, and EV is nearly proportional to CPEV.

4 A Unified View of Some Prior Work

A series of methods, PCA [1], SCoTLASS [3], SPCA [9], GPower [6], rSVD [7], TPower [11], SPC [14], and SPCArt, though proposed independently and formulated in various forms, can be derived from the common source of Theorem 1, the Eckart-Young theorem. Most of them can be seen as matrix approximation problems of the form (1), with different sparsity penalties. Most of them have two matrix variables, and their solutions can usually be obtained by an alternating scheme: fix one and solve the other. As in SPCArt, the two subproblems are a sparsity-penalized/constrained regression problem and a Procrustes problem.

PCA [1]. Since , substituting into (1) and optimizing , the problem is equivalent to

(19)

The solution is provided by the Ky Fan theorem [24]. If the data have been centered to have mean zero, the special solution is exactly the loadings obtained by PCA.

SCoTLASS [3]. Constraining the loadings to be sparse in (19), we get SCoTLASS

(20)

However, the problem is not easy to solve.

SPCA [9]. If we substitute into (1), separate the two factors into two independent variables (so as to solve the problem via alternating optimization), and then impose some penalties on one of them, we get SPCA

(21)

where is treated as target sparse loadings and ’s are weights. When is fixed, the problem is equivalent to elastic-net problems: . When is fixed, it is a Procrustes problem: , and .

GPower [6]. Apart from some artificial factors, the original GPower solves the following ℓ1 and ℓ0 versions of the objective:

(22)
(23)

They can be seen as derived from the following more fundamental ones (details are included in Appendix C).

(24)
(25)

These two objectives can be seen as derived from (1): a mirror version of Theorem 1 exists , , where is still seen as a data matrix containing samples of dimension . The solution is and . Adding sparsity penalties to , we get (24) and (25).

Following the alternating optimization scheme: when the sparse factor is fixed, the other factor has the same closed-form update in both cases; when the other factor is fixed, the ℓ1-penalized case reduces to entry-wise soft thresholding and the ℓ0-penalized case to entry-wise hard thresholding. Each loading is then obtained by normalizing the result to unit length.

The iterative steps, combined together, produce essentially the same solution process as the original one in [6]. However, the matrix approximation formulation makes the relation of GPower to SPCArt and the other methods apparent. The three methods below, rSVD, TPower, and SPC, can be seen as special cases of GPower.

rSVD [7]. rSVD can be seen as a special case of GPower, i.e., the single-component case. Here the polar-decomposition operator reduces to unit-length normalization. More loadings can be obtained via deflation [4, 7], e.g., by removing the computed component from the data and running the procedure again. Since that component has been removed, the subsequent loadings are nearly orthogonal to the previous ones.

If the penalties in rSVD are replaced with constraints, we obtain TPower and SPC.

TPower [11]. The case is

(26)

There are closed-form solutions: the operator sets the smallest entries to zero. (Footnote 2: [7] did implement this version for rSVD, but as a heuristic trick.) By iterating, the procedure is seen to be equivalent to the original TPower.

SPC [14]. The constrained case introduces an extra scale variable, which serves as the length of the component in (26). If the other variables are fixed, the scale has a closed-form solution. A small modification of the constrained problem leads to SPC, which is biconvex. However, there is no analytic solution for one of the subproblems; it is solved by line search.

5 Relation of GPower to SPCArt and an Extension

5.1 Relation of GPower to SPCArt

Note that (24) and (25) have forms similar to (8) and (5), respectively. There are two important differences. First, SPCArt operates on the orthonormal PCA loadings rather than on the original data. Second, SPCArt takes a rotation matrix, rather than merely an orthonormal matrix, as the variable. These differences are key to the success of SPCArt.

Compared with SPCArt, GPower has some drawbacks. GPower can work in both the deflation mode (i.e., rSVD) and the block mode. In the block mode, there is no mechanism to ensure the orthogonality of the loadings: the factor being truncated is not orthogonal, so after thresholding the loadings do not tend to be orthogonal either. Besides, it is not easy to determine the weights, since the lengths of the columns being truncated usually vary over a great range. For example, if the iteration is initialized with the PCA solution, the columns are scaled PCA loadings whose lengths usually decay exponentially. Thus, if we simply set the thresholds uniformly, this easily leads to unbalanced sparsity among the loadings, in which the leading loadings may be much denser. This deviates from the goal of sparse PCA. The deflation mode, though it produces nearly orthogonal loadings, is greedy, which makes its solution suboptimal; and there still remains the problem of how to set the weights appropriately. (Footnote 3: Even if the iterate is initialized with the maximum-length column of the data, as [6] does, it is likely to align with the leading direction.) Besides, for both modes, performance analysis may be difficult to obtain.

5.2 Extending rSVD and GPower to rSVD-GP

A major drawback of rSVD and GPower is that they cannot use uniform thresholds when applying thresholding. This problem does not exist in SPCArt, since the inputs are of unit length. But we can extend a similar idea to GPower and rSVD: scale the threshold by the length of the current iterate, which is equivalent to truncating the iterate relative to its length, i.e., using adaptive thresholds. The other truncation types, T-en and T-sp, can be introduced into GPower too. T-sp is insensitive to length, so there is no trouble in parameter setting; its deflation version happens to be TPower.

The deflation version of the improved algorithm, rSVD-GP, is shown in Algorithm 2, and the block version, rSVD-GPB, is shown in Algorithm 3. rSVD-GPB follows the optimization described in Section 4. For rSVD-GP, since the polar-decomposition step reduces to vector normalization, and the extended truncation is insensitive to the length of its input, we can combine the two alternating steps and ignore the length during the iterations. Besides, it is more efficient to work with the covariance matrix if the number of samples exceeds the dimension.

1:  Input: data matrix (or covariance matrix ), number of loadings , truncation type , parameter .
2:  Output: sparse loading vectors .
3:  for  to  do
4:     Initialize : (or ), set , , .
5:     repeat
6:         (or ).
7:        Truncation: .
8:     until convergence
9:     Normalization: .
10:     Deflation: (or ).
11:  end for
Algorithm 2 rSVD-GP (deflation version)
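To illustrate Algorithm 2, here is a hedged NumPy sketch of the deflation mode with the length-adaptive soft threshold of Section 5.2. The initialization by the maximum-length column, the covariance-form deflation, and all variable names are my own reading of the pseudocode, not the released implementation.

```python
import numpy as np

def rsvd_gp(A, r, lam=None, n_iter=200):
    """Sketch of rSVD-GP (deflation mode) with adaptive soft thresholding.

    A: (n, p) centered data matrix.  Returns sparse loadings X of shape
    (p, r), computed one at a time with deflation.  lam defaults to 1/sqrt(p).
    """
    n, p = A.shape
    if lam is None:
        lam = 1.0 / np.sqrt(p)
    C = A.T @ A                                    # covariance form
    X = np.zeros((p, r))
    for j in range(r):
        x = C[:, np.argmax(np.linalg.norm(C, axis=0))]     # init: max-length column (assumed)
        x = x / np.linalg.norm(x)
        for _ in range(n_iter):
            z = C @ x                                       # power step
            thr = lam * np.linalg.norm(z)                   # length-adaptive threshold
            x_new = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)
            nrm = np.linalg.norm(x_new)
            if nrm == 0:
                break
            x_new = x_new / nrm                             # normalization
            if np.linalg.norm(x_new - x) < 1e-10:
                x = x_new
                break
            x = x_new
        X[:, j] = x
        # deflation in covariance form: C <- (I - x x^T) C (I - x x^T)
        C = C - np.outer(C @ x, x) - np.outer(x, x @ C) + np.outer(x, x) * (x @ C @ x)
    return X
```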
1:  Input: data matrix , number of loadings , truncation type , parameter .
2:  Output: sparse loadings .
3:  PCA: compute rank- SVD of : .
4:  repeat
5:     .
6:     Truncation: , .
7:     Update : thin SVD of : , .
8:  until convergence
9:  Normalize : , .
Algorithm 3 rSVD-GPB (block version)

6 Experiments

The data sets used include: (1) a synthetic data set with some underlying sparse loadings [9]; (2) the classical Pitprops data [25]; (3) a natural image data set with moderate dimension and relatively large sample size, on which comprehensive evaluations are conducted; (4) a gene expression data set with high dimension and small sample size [26]; (5) a set of random data with increasing dimensions, used for speed tests.

We compare our methods with five others: SPCA [9], PathSPCA [2], ALSPCA [10], GPower [6], and TPower [11]. For SPCA, we use the toolbox [27], which implements the constraint versions. We use GPowerB to denote the block version of GPower, analogous to rSVD-GPB. We use SPCArt(T-sp), for example, to denote SPCArt using T-sp; the other methods use similar abbreviations. Note that rSVD-GP(T-sp) is equivalent to TPower [11]. Except for our SPCArt and rSVD-GP(B), the codes of the other methods are downloaded from the authors' websites.

There are mainly five criteria for the evaluation. (1) SP: mean sparsity of the loadings. (2) STD: standard deviation of the sparsity of the loadings. (3) CPEV: cumulative percentage of explained variance (that of the PCA loadings is CPEV(V)). (4) NOR: nonorthogonality of the loadings, computed from the pairwise nonorthogonality of Definition 4 and the number of loadings. (5) Time cost, including initialization. Sometimes we use the worst sparsity among the loadings instead of STD, when it is more appropriate for showing the imbalance of sparsity.
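For reference, the sketch below shows one way these criteria could be computed; SP and STD follow Definition 2 directly, while the NOR and CPEV expressions are plausible readings of Definitions 4 and 5 rather than the paper's exact formulas (which are not reproduced above).

```python
import numpy as np

def sparsity(X):
    """SP: mean proportion of zero entries over all loadings (Definition 2)."""
    return np.mean(X == 0)

def sparsity_std(X):
    """STD: standard deviation of the per-loading sparsity."""
    return np.std(np.mean(X == 0, axis=0))

def nonorthogonality(X):
    """NOR: mean absolute cosine over distinct pairs of loadings
    (a plausible reading of Definition 4; assumes nonzero columns and r >= 2)."""
    Xn = X / np.linalg.norm(X, axis=0)
    G = np.abs(Xn.T @ Xn)
    r = X.shape[1]
    return (G.sum() - r) / (r * (r - 1))

def cpev(A, X):
    """CPEV: variance of A captured by the subspace spanned by X, relative to
    the total variance (following the description of [7]; columns of X are
    assumed linearly independent)."""
    Q, _ = np.linalg.qr(X)                     # orthonormal basis of span(X)
    return np.trace(Q.T @ A.T @ A @ Q) / np.trace(A.T @ A)
```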

All methods involved in the comparison have only one parameter that induces sparsity. For those methods that have direct control on sparsity, we view them as belonging to T-sp and let the parameter denote the number of zeros of a vector. GPowerB is initialized with PCA, and its thresholds are set uniformly for all loadings. (Footnote 4: The original random initialization for rank-1 vectors [6] may fall outside the data subspace and result in a zero solution. When PCA is used as the initialization, a distinct weight setting in effect artificially alters the data variance.) For ALSPCA, since we do not consider correlation among principal components, the corresponding parameters are set accordingly. In SPCArt, for T-ℓ1 and T-ℓ0 we set the threshold to its default, since it is the minimal threshold that ensures sparsity and the maximal threshold that avoids truncating a vector to zero. The termination condition of SPCArt and SPCA is that the relative change of the loadings falls below a tolerance or the number of iterations exceeds a limit. rSVD-GP(B) uses similar settings. All codes are implemented in MATLAB and run on a computer with a 2.93GHz dual-core CPU and 2GB memory.


algorithm | type | parameter | loading pattern | CPEV
SPCA | T-sp | 4 | 5-10; 1-4 | 0.9848
SPCA | | 2.2 | 1-4,9-10; 5-8 | 0.8286
PathSPCA | T-sp | 4 | 5-10; 1-4,9-10 | 0.9960
ALSPCA | | 0.7 | 5-10; 1-4 | 0.9849
rSVD-GP | T- | | 5-10; 1-4 | 0.9849
rSVD-GP | T- | | 5-10; 1-4 | 0.9808
rSVD-GP | T-sp | 4 | 5-10; 1-4,9-10 | 0.9960
rSVD-GP | T-en | 0.1 | 5-10; 1-4 | 0.9849
SPCArt | T- | | 5-10; 1-4 | 0.9848
SPCArt | T- | | 5-10; 1-4 | 0.9728
SPCArt | T-sp | 4 | 5-10; 1-4,9-10 | 0.9968
SPCArt | T-en | 0.1 | 5-10; 1-4 | 0.9848

Table 3: Recovery of sparse loadings on the synthetic data. CPEV(V) = 0.9973. The loading pattern 5-10; 1-4,9-10 means the nonzero support of the first loading vector is 5 to 10, and that of the second is 1 to 4 and 9 to 10.
algorithm | type | parameter | NZ | loading patterns | STD | NOR | CPEV
ALSPCA | | 0.65 | 17 | 722213 | 0.1644 | 0.0008 | 0.8011
GPower | T- | 0.1 | 19 | 712162 | 0.2030 | 0.0259 | 0.8111
rSVD-GP | T- | 0.27 | 17 | 612422 | 0.1411 | 0.0209 | 0.8117
GPowerB | T- | 0.115 | 17 | 724112 | 0.1782 | 0.0186 | 0.8087
rSVD-GPB | T- | 0.3 | 18 | 534132 | 0.1088 | 0.0222 | 0.7744
SPCArt | T- | | 18 | 424332 | 0.0688 | 0.0181 | 0.8013
SPCA | T-sp | 10 | 18 | 333333 | 0 | 0.0095 | 0.7727
PathSPCA | T-sp | 10 | 18 |