# The Sparse Principal Component of a Constant-rank Matrix

The computation of the sparse principal component of a matrix is equivalent to the identification of its principal submatrix with the largest maximum eigenvalue. Finding this optimal submatrix is what renders the problem NP-hard. In this work, we prove that, if the matrix is positive semidefinite and its rank is constant, then its sparse principal component is polynomially computable. Our proof utilizes the auxiliary unit vector technique that has been recently developed to identify problems that are polynomially solvable. Moreover, we use this technique to design an algorithm which, for any sparsity value, computes the sparse principal component with complexity O(N^{D+1}), where N and D are the matrix size and rank, respectively. Our algorithm is fully parallelizable and memory efficient.

## I Introduction

Principal component analysis (PCA) is a well studied and broadly used dimensionality reduction tool. The principal components (PCs) of a set of observations on N variables capture orthogonal directions of maximum variance and offer a distance-optimal, low-dimensional representation that, for many purposes, conveys a sufficient amount of information. Without additional constraints, the PCs of a data set can be computed in time polynomial in N using the eigenvalue decomposition.

A disadvantage of conventional PCA is that, in general, the extracted components are expected to have nonzero elements in all their entries. In many applications, sparse vectors that convey information are more favorable either due to sparsity of the actual signals [1][2] or because sparsity implies interpretability [3] when each coordinate of a PC corresponds, for example, to a different word in text analysis applications or the expression of a particular gene in bio data sets. Thus, provided that the application requires it, some of the maximum variance of the true PCs may be traded for sparsity. Recently, there has been an increasing interest in computing sparse components of data sets with applications that range from signal processing, communication networks, and machine learning, to bioinformatics, finance, and meteorology [4]-[13].

To enforce sparsity on the extracted components, a linearly constrained ℓ1-norm minimization problem is usually considered [1][2][14][15]. This problem is equivalent to the sparse variance maximization, that is, the maximization of the Rayleigh quotient of a matrix under an ℓ0-norm constraint on the maximizing argument [4]-[6][8][11][13][16]-[23]. In both problems, due to the additional cardinality constraint that is enforced, the sparsity-aware flavor of PCA, termed sparse PCA, comes at a higher cost: sparse PCA is an NP-hard problem [16].

To approximate sparse PCA, various methods have been introduced in the literature. A modified PCA technique based on the LASSO was introduced in [24]. In [3], a nonconvex regression-type optimization approach combined with LASSO penalty was used to approximately tackle the problem. A nonconvex technique, locally solving difference-of-convex-functions programs, was presented in [17]. Semidefinite programming (SDP) was used in [5][22], while [18] augmented the SDP approach with an extra greedy step that offers favorable optimality guarantees under certain sufficient conditions. The authors of [4] considered greedy and branch-and-bound approaches. Generalized power method techniques using convex programs were also used to approximately solve sparse PCA [20]. A sparse-adjusted deflation procedure was introduced in [19] and in [6] optimality guarantees were shown for specific types of covariance matrices under thresholding and SDP relaxations. Iterative thresholding was also considered in [25] in conjunction with certain guarantees while a truncated power method was presented in [26].

In this present work, we prove that the sparse principal component of an N×N matrix C can be obtained in polynomial time under a new sufficient condition: when C can be written as the sum of a scaled identity matrix and a positive semidefinite update, i.e., C = σI_N + A, and the rank D of the update A is not a function of the problem size N. (If σ = 0, then C = A is simply a constant-rank matrix.) Under this condition, we show that sparse PCA is solvable with complexity O(N^{D+1}). Our proof utilizes the auxiliary unit vector technique that we developed in [27][28]. This technique was inspired by the work in [29], which reappeared in [30] and was used in [31]. It introduces an auxiliary unit vector that unlocks the constant-rank structure of a matrix (in this present work, matrix A). The constant-rank property, along with the auxiliary vector, enables us to scan a constant-dimensional space and identify a polynomial number of candidate vectors (i.e., candidate solutions to the original problem). Interestingly, the optimal solution always lies among these candidates, and a polynomial-time search can always retrieve it. As a result, we have applied the auxiliary unit vector technique to identify the polynomial solvability of certain optimization problems and to provide polynomial-time algorithms that are directly implementable, fully parallelizable, and memory efficient [32]-[35].

The rest of this paper is organized as follows. In Section II, we state the sparse PCA problem and indicate its NP-hardness. Then, in Section III, we follow the principles of the auxiliary unit vector technique to present a proof of the polynomial solvability of the sparse PCA problem under a constant-rank condition. Moreover, we design a novel algorithm (early versions of which appeared in [36][37]) that, for any sparsity value, computes the sparse principal component with complexity O(N^{D+1}). Our algorithm is simply implementable, fully parallelizable, and memory efficient. Especially for the case D = 2, an alternative nonparallelizable version of our algorithm with complexity O(N^2 log N) is presented in Section IV. (Alternative (non)parallelizable implementations of our algorithm for D = 2 have been presented in [36].) A few conclusions are drawn in Section V.

## II Problem Statement

We are interested in the computation of the real, unit-norm, and at most K-sparse principal component of an N×N matrix C, i.e.,

$$\mathbf{x}_{\mathrm{opt}} \triangleq \operatorname*{argmax}_{\substack{\mathbf{x}\in\mathbb{R}^N \\ \|\mathbf{x}\|=1,\ \|\mathbf{x}\|_0\le K}} \mathbf{x}^T\mathbf{C}\mathbf{x}. \tag{1}$$

Interestingly, when C can be decomposed as a constant-rank positive semidefinite update of the identity matrix, i.e.,

$$\mathbf{C} = \sigma\mathbf{I}_N + \mathbf{A} \tag{2}$$

where σ ∈ ℝ, I_N is the N×N identity matrix, and A is a positive semidefinite matrix with rank D, then the optimization in (1) can always be rewritten as

$$\mathbf{x}_{\mathrm{opt}} = \operatorname*{argmax}_{\substack{\mathbf{x}\in\mathbb{R}^N \\ \|\mathbf{x}\|=1,\ \|\mathbf{x}\|_0\le K}} \mathbf{x}^T\mathbf{A}\mathbf{x}, \tag{3}$$

since x^T C x = σ∥x∥² + x^T A x = σ + x^T A x for any unit-norm x.
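As a quick numeric sanity check (our own sketch, with made-up dimensions), the step from (1) to (3) holds because the identity term contributes the constant σ for every unit-norm x:

```python
# Check that x^T C x = sigma + x^T A x for unit-norm x, so the sigma*I_N term
# in (2) only shifts the objective and can be dropped from the maximization.
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma = 6, 2, 0.7              # small, arbitrary instance
V = rng.standard_normal((N, D))
A = V @ V.T                          # positive semidefinite, rank D
C = sigma * np.eye(N) + A            # the structure assumed in (2)

x = rng.standard_normal(N)
x /= np.linalg.norm(x)               # unit-norm (not necessarily sparse) vector

lhs = x @ C @ x
rhs = sigma + x @ A @ x
```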

Since A is positive semidefinite and has rank D, it can be decomposed as

$$\mathbf{A} = \mathbf{V}\mathbf{V}^T, \tag{4}$$

where

$$\mathbf{V} \triangleq \left[\sqrt{\lambda_1}\,\mathbf{q}_1,\ \sqrt{\lambda_2}\,\mathbf{q}_2,\ \ldots,\ \sqrt{\lambda_D}\,\mathbf{q}_D\right] \tag{5}$$

is an N×D matrix (λ_d and q_d denote the nonzero eigenvalues and corresponding eigenvectors of A), and problem (1) can be written as

$$\mathbf{x}_{\mathrm{opt}} = \operatorname*{argmax}_{\substack{\mathbf{x}\in\mathbb{R}^N \\ \|\mathbf{x}\|=1,\ \|\mathbf{x}\|_0\le K}} \left\|\mathbf{V}^T\mathbf{x}\right\|. \tag{6}$$

For the optimization problem in (6), we note that

$$\max_{\substack{\|\mathbf{x}\|=1 \\ \|\mathbf{x}\|_0\le K}} \left\|\mathbf{V}^T\mathbf{x}\right\| = \max_{\substack{I\subseteq[N] \\ |I|=K}}\ \max_{\substack{\mathbf{x}\in\mathbb{R}^K \\ \|\mathbf{x}\|=1}} \left\|\mathbf{V}_{I,:}^T\mathbf{x}\right\| \tag{7}$$

where [N] ≜ {1, 2, …, N}. In (7), set I (which we call the support) consists of the indices of the potentially nonzero elements of x. For a given support I, the inner maximization is a K-dimensional principal-component problem, where V_{I,:} is the corresponding K×D submatrix of V. The solution to the inner maximization is denoted by

$$\mathbf{x}(I) \triangleq \operatorname*{argmax}_{\substack{\mathbf{x}\in\mathbb{R}^K \\ \|\mathbf{x}\|=1}} \left\|\mathbf{V}_{I,:}^T\mathbf{x}\right\| \tag{8}$$

and equals the principal left singular vector of V_{I,:}. Then, our optimization problem in (7) becomes

$$I_{\mathrm{opt}} \triangleq \operatorname*{argmax}_{\substack{I\subseteq[N] \\ |I|=K}} \sigma_{\max}\left(\mathbf{V}_{I,:}\right) \tag{9}$$

where σ_max(·) denotes the principal singular value of its matrix argument. That is, to solve our original problem in (1), according to (9), we need to find the K-row submatrix of V whose principal singular value is the maximum one among all such submatrices. The K indices contained in the optimal support I_opt that solves (9) correspond to the nonzero loadings of the solution to (1). Then, according to (8), the values of these nonzero loadings are directly computed as the principal left singular vector of V_{I_opt,:}.

From the above discussion, it turns out that the hardness of the original problem in (1) comes from the identification of the optimal support in (9). To obtain the optimal support I_opt, we could simply perform an exhaustive search among all N-choose-K possible supports and compare them against the metric of interest in (9). However, if K is not constant but grows with N, then such an approach has complexity that is exponential in N, indicating the NP-hardness of (1), which was shown in [16]. In this present work, we show that, if the rank D of A is constant, then (9) can be solved in time polynomial in N. In fact, we develop an algorithm that has complexity O(N^{D+1}) and returns O(N^D) candidate supports, one of which is guaranteed to be the solution to (9). Then, by an exhaustive search among only these candidate supports, we identify the optimal support in (9) and, hence, the sparse principal component of C and A with complexity polynomial in N, for any sparsity value K between 1 and N (that is, even if K grows with N).
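To make the equivalence of (1) and (9) concrete, here is a minimal brute-force sketch (our own illustrative code on a made-up instance; it enumerates all supports, so it is exponential in K and useful only for tiny problems):

```python
from itertools import combinations
import numpy as np

def sparse_pc_bruteforce(V, K):
    """Exhaustive search over all K-element supports, following (8)-(9)."""
    N = V.shape[0]
    best_s, best_I, best_u = -1.0, None, None
    for I in combinations(range(N), K):
        # principal singular pair of the K x D submatrix V_{I,:}
        U, s, _ = np.linalg.svd(V[list(I), :], full_matrices=False)
        if s[0] > best_s:
            best_s, best_I, best_u = s[0], I, U[:, 0]
    x = np.zeros(N)
    x[list(best_I)] = best_u   # nonzero loadings: principal left singular vector
    return best_I, x

rng = np.random.default_rng(1)
V = rng.standard_normal((6, 2))        # made-up N = 6, D = 2 instance
I_opt, x_opt = sparse_pc_bruteforce(V, 3)
```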

## III Computation of the Sparse Principal Component in Time O(N^{D+1})

Prior to presenting the main result for the general rank-D case, in the following subsection we provide insight into why the sparse principal component of a constant-rank matrix can be computed in polynomial time, by first considering the trivial case D = 1.

### III-A Rank-1: A motivating example

In this case, V has rank 1, i.e., V = v ∈ ℝ^{N×1}. For a given support I, we have V_{I,:} = v_I. Then, our optimization problem in (7) becomes

$$\max_{\substack{I\subseteq[N] \\ |I|=K}}\ \max_{\substack{\mathbf{x}\in\mathbb{R}^K \\ \|\mathbf{x}\|=1}} \left|\mathbf{v}_I^T\mathbf{x}\right| \tag{10}$$

where, for any given support I, the corresponding vector in (8) is

$$\mathbf{x}(I) = \operatorname*{argmax}_{\substack{\mathbf{x}\in\mathbb{R}^K \\ \|\mathbf{x}\|=1}} \left|\mathbf{v}_I^T\mathbf{x}\right| = \frac{\mathbf{v}_I}{\|\mathbf{v}_I\|}. \tag{11}$$

Therefore, (7) becomes

$$\max_{\substack{I\subseteq[N] \\ |I|=K}} \|\mathbf{v}_I\| \tag{12}$$

and the optimal support is

$$I_{\mathrm{opt}} = \operatorname*{argmax}_{\substack{I\subseteq[N] \\ |I|=K}} \|\mathbf{v}_I\| = \operatorname*{argmax}_{\substack{I\subseteq[N] \\ |I|=K}} \sum_{n\in I} v_n^2. \tag{13}$$

That is, to determine the solution to (9), we only need to compare the N elements of |v| and select the K largest ones. Then, their indices are the elements of I_opt.

The above observation, although simple, turns out to be critical for the developments that follow. Hence, to simplify the presentation, we define function top_k, which is parameterized in an integer k, has as input a vector u of length N, and returns the indices of the k absolutely largest values in u:

$$\mathrm{top}_k(\mathbf{u}) \triangleq \operatorname*{argmax}_{\substack{I\subseteq[N] \\ |I|=k}} \|\mathbf{u}_I\|. \tag{14}$$

Function top_k operates by selecting the indices of the k largest values among |u_1|, |u_2|, …, |u_N|. Its complexity is O(N) [38].

We conclude this subsection by noting that, if D = 1, then the optimal support in (9) is

$$I_{\mathrm{opt}} = \mathrm{top}_K(\mathbf{v}) \tag{15}$$

and is computed with linear complexity.
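In code, the selection in (14) and the rank-1 solution (15) take a few lines; a sketch using NumPy's linear-time partial selection (`np.argpartition`), on made-up data:

```python
import numpy as np

def topk(u, k):
    """Indices of the k absolutely largest entries of u; O(N) selection."""
    return set(np.argpartition(np.abs(u), -k)[-k:].tolist())

# Rank-1 case: the optimal K-sparse support is simply top_K(v), cf. (15).
v = np.array([0.3, -2.0, 0.1, 1.2, -0.7])
I_opt = topk(v, 2)   # indices of the 2 absolutely largest entries of v
```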

### III-B Rank-D: Utilizing the auxiliary unit vector technique

We consider now the case where A has rank D and, hence, V is an N×D matrix. Without loss of generality (w.l.o.g.), we assume that each row of V has at least one nonzero element, i.e., ∥V_{n,:}∥ > 0 for all n ∈ [N]. Indeed, as explained in [28], if there exists an index n such that ∥V_{n,:}∥ = 0, then, independently of the value of the corresponding element of x, the contribution of this row to the value of ∥V^T x∥ in (6) will be zero. Hence, there is no point in "spending" in x_n a weight that could be distributed to other elements of x; we can ignore the nth row of V and, hence, reduce the problem size from N to N−1. In the final solution x_opt, x_n will be set to zero.

In our subsequent developments, we rely on the auxiliary unit vector technique that was introduced in [27] for matrices of rank 2 and generalized in [28] for matrices of rank D. This technique utilizes an auxiliary vector c ∈ ℝ^D to generate the subspace spanned by the columns of V and result in a rank-1 problem for each value of c. Interestingly, for several problems, the number of different solutions that we obtain as c scans the unit-radius hypersphere has polynomial size. If the rank-1 problem for each value of c is polynomially solvable (as, for example, in the optimization problem of this work, as indicated in Subsection III-A), then the optimal solution is obtained with overall polynomial complexity. In a few words, the auxiliary unit vector technique of [27][28] is a fully parallelizable and memory-efficient technique that translates rank-D problems into a polynomial-size collection of rank-1 problems, among which one results in the overall optimal solution.

For our sparse-principal-component problem, the auxiliary unit vector technique works as follows. Consider a unit vector c ∈ ℝ^D. By the Cauchy-Schwarz inequality, for any a ∈ ℝ^D,

$$\left|\mathbf{a}^T\mathbf{c}\right| \le \|\mathbf{a}\|\,\|\mathbf{c}\| = \|\mathbf{a}\| \tag{16}$$

with equality if and only if c is collinear with a. Then,

$$\max_{\substack{\mathbf{c}\in\mathbb{R}^D \\ \|\mathbf{c}\|=1}} \left|\mathbf{a}^T\mathbf{c}\right| = \|\mathbf{a}\|. \tag{17}$$

Using (17), our optimization problem in (7) becomes

$$\max_{\substack{I\subseteq[N] \\ |I|=K}}\ \max_{\substack{\mathbf{x}\in\mathbb{R}^K \\ \|\mathbf{x}\|=1}} \left\|\mathbf{V}_{I,:}^T\mathbf{x}\right\| = \max_{\substack{I\subseteq[N] \\ |I|=K}}\ \max_{\substack{\mathbf{x}\in\mathbb{R}^K \\ \|\mathbf{x}\|=1}}\ \max_{\substack{\mathbf{c}\in\mathbb{R}^D \\ \|\mathbf{c}\|=1}} \left|\mathbf{x}^T\mathbf{V}_{I,:}\mathbf{c}\right| = \max_{\substack{\mathbf{c}\in\mathbb{R}^D \\ \|\mathbf{c}\|=1}}\ \max_{\substack{I\subseteq[N] \\ |I|=K}}\ \max_{\substack{\mathbf{x}\in\mathbb{R}^K \\ \|\mathbf{x}\|=1}} \left|\mathbf{x}^T\left[\mathbf{u}(\mathbf{c})\right]_I\right| \tag{18}$$

where

$$\mathbf{u}(\mathbf{c}) \triangleq \mathbf{V}\mathbf{c}. \tag{19}$$

The rightmost equality in (18) is obtained by interchanging the maximizations. This is a critical step of the auxiliary unit vector technique. It unlocks the constant-rank structure of V and allows us to consider a simple rank-1 problem for each value of c. Indeed, for each c, the inner double maximization problem

$$\max_{\substack{I\subseteq[N] \\ |I|=K}}\ \max_{\substack{\mathbf{x}\in\mathbb{R}^K \\ \|\mathbf{x}\|=1}} \left|\mathbf{x}^T\left[\mathbf{u}(\mathbf{c})\right]_I\right| \tag{20}$$

is equivalent to the rank-1 optimization problem in (10) which, according to (15), results in the optimal support (for fixed c)

$$I(\mathbf{c}) \triangleq \mathrm{top}_K(\mathbf{u}(\mathbf{c})) \tag{21}$$

which is obtained with complexity O(N). Then, according to (18), the solution to our original problem in (9) is met by collecting all possible supports I(c) as c scans the unit-radius D-dimensional hypersphere. That is, I_opt in (9) belongs to

$$\mathcal{S} \triangleq \bigcup_{\substack{\mathbf{c}\in\mathbb{R}^D \\ \|\mathbf{c}\|=1}} I(\mathbf{c}). \tag{22}$$

Set S contains candidate supports, one of which is the solution to our original optimization problem. If S were available, then one would only have to compare the elements of S against the metric of interest in (9) to obtain the optimal support I_opt. Therefore, the size of S and the complexity to build S determine the overall complexity to solve (9). Our major contribution in this work is that we show that the cardinality of S is upper bounded as

$$|\mathcal{S}| = O\!\left(N^{D}\right) \tag{23}$$

and develop an algorithm to build S with complexity O(N^{D+1}). After S is constructed, each element (support) I of it is mapped to the principal singular value of the matrix V_{I,:} with complexity O(K), since D is constant. Finally, all computed singular values are compared with each other to obtain the optimal support I_opt in (9). Then, the solution to our original problem in (1) is the principal left singular vector of the matrix V_{I_opt,:}, computed with complexity O(K). Therefore, we compute the optimal support and the sparse principal component of a rank-D matrix with complexity O(N^{D+1}).

A constructive proof is presented in detail in the next three subsections. To give some insight into the proof, we begin with the simple case D = 2. Then, we generalize our proof to the case of arbitrary D.

### III-C Rank-2: A simple instance of our proof

If D = 2, then V has size N×2 and the auxiliary vector c is a length-2, unit vector that, as in [27], can be parameterized in an auxiliary angle φ. That is,

$$\mathbf{c}(\phi) \triangleq \begin{bmatrix} \sin\phi \\ \cos\phi \end{bmatrix},\quad \phi\in\Phi \triangleq \left(-\tfrac{\pi}{2},\,\tfrac{\pi}{2}\right]. \tag{24}$$

Hence, c(φ) lies on the unit-radius semicircle. (We ignore the other semicircle because any pair of angles φ and φ′ with difference π results in opposite vectors which, however, are equivalent with respect to the optimization metric in (18) and produce the same support in (21).) Then, the candidate set in (22) is re-expressed as

$$\mathcal{S} = \bigcup_{\phi\in\Phi} I(\phi) \tag{25}$$

where, according to (21),

$$I(\phi) \triangleq \mathrm{top}_K(\mathbf{u}(\phi)) \tag{26}$$

and, according to (19),

$$\mathbf{u}(\phi) \triangleq \mathbf{V}\mathbf{c}(\phi). \tag{27}$$

That is, for any given φ, the corresponding support I(φ) is obtained with complexity O(N) by selecting the indices of the K absolutely largest elements of u(φ).

However, why should the auxiliary angle φ simplify the computation of a solution? The intuition behind the auxiliary unit vector technique is that every element of u(φ) is actually a continuous function of φ, i.e., a curve (or 1-manifold) over Φ, since

$$\mathbf{u}(\phi) = \mathbf{V}\mathbf{c}(\phi) = \begin{bmatrix} V_{1,1}\sin\phi + V_{1,2}\cos\phi \\ V_{2,1}\sin\phi + V_{2,2}\cos\phi \\ \vdots \\ V_{N,1}\sin\phi + V_{N,2}\cos\phi \end{bmatrix}. \tag{28}$$

Hence, the support that corresponds to the K absolutely largest elements of u(φ₀) at a given point φ₀ ∈ Φ is a function of φ₀. Due to the continuity of the curves and the discrete nature of the support, we expect that the support will retain the same elements in an area around φ₀. Therefore, we expect the formation of intervals in Φ, within which the indices of the K absolutely largest elements of u(φ) remain unaltered. A support might change only if the sorting of the amplitudes of two elements of u(φ), say |u_i(φ)| and |u_j(φ)|, changes. This occurs at points where |u_i(φ)| = |u_j(φ)|, that is, points where two curves intersect. Finding all these intersection points is sufficient to determine the intervals and construct all possible candidate supports. Among these candidate supports lies the support that corresponds to the optimal K-sparse principal component. Exhaustively checking the supports of all intervals suffices to retrieve the optimal one. The number of these intervals is exactly equal to the number of possible intersections among the amplitudes of u(φ), which, counting all possible combinations of element pairs and the two intersections per pair, equals N(N−1).
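This interval picture can be checked empirically with a quick scan over a dense grid of angles (a made-up 5×2 instance of our own); the number of distinct supports encountered is small, consistent with the interval count above:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 5, 2
V = rng.standard_normal((N, 2))   # arbitrary rank-2 instance

supports = set()
for phi in np.linspace(-np.pi / 2, np.pi / 2, 10001):
    u = np.abs(V @ np.array([np.sin(phi), np.cos(phi)]))
    supports.add(tuple(sorted(np.argpartition(u, -K)[-K:].tolist())))
```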

Before we proceed, in Fig. 1, we illustrate the interval partition of Φ for an arbitrary matrix V. We plot the curves that originate from the rows of V and observe the intervals that are formed, within which the sorting of the curves does not change. The borders of the intervals are denoted by vertical dashed lines at the points of curve intersections. In this small example, our approach creates more intervals than the total number of possible supports; however, this is not true for greater values of N. In addition, for the sparsity value considered, we observe that Φ is partitioned into regions (sets of adjacent intervals); within each region, although the sorting changes, the set of K largest curves does not change. For example, in Fig. 1, we identify four regions where the candidate support remains fixed. These regions are an interesting feature that might further decrease the number of intervals we need to check. We exploit this feature in the serial implementation of our algorithm in Section IV.

### III-D Rank-2: Algorithmic developments and complexity

Our goal is the construction of all possible candidate K-sparse supports, determined by the support of each interval in Φ. This is a two-step process: first, we identify the interval borders and, then, we determine the supports associated with these intervals.

Algorithmic Steps: We first determine all possible intersections of curve pairs in u(φ). Any pair of distinct elements u_i(φ) and u_j(φ) is associated with two intersections: u_i(φ) = u_j(φ) and u_i(φ) = −u_j(φ). Solving these two equations with respect to φ determines two points

$$\hat\phi = -\tan^{-1}\!\left(\frac{V_{i,2} \mp V_{j,2}}{V_{i,1} \mp V_{j,1}}\right) \in \Phi \tag{29}$$

where a new support of indices of the K absolutely largest values of u(φ) might occur. We note that, as φ varies in Φ (from −π/2 to π/2), changes of the support may occur only at the intersection points given in (29), for any i, j ∈ [N] with i < j.

Next, we observe that, at an intersection point φ̂, the intersecting curves have equal amplitudes, i.e., |u_i(φ̂)| = |u_j(φ̂)|, while their relative order changes over φ̂. That is, if |u_i(φ)| > |u_j(φ)| in the interval immediately preceding φ̂, then |u_i(φ)| < |u_j(φ)| in the interval immediately following φ̂, and vice versa. Exactly at the point of intersection, we can determine the order of all curves in |u(φ̂)|.

• If both or none of the ith and jth curves are included in the set of K largest curves at φ̂, then none of the two curves leaves or joins, respectively, that set at φ̂, despite the change in their relative order. Assuming that no other pair of curves intersects exactly at φ̂ (recall that the support can change only at an intersection point; if a second pair of curves intersects exactly at φ̂, causing the support to change over φ̂, then the correct support sets for the two incident intervals will be determined when the intersections of that second pair are examined), the two intervals incident to φ̂ are associated with the same support, which can be determined exactly at the intersection point φ̂. (In fact, since the support does not change at φ̂, there exists another intersection point where the same support will be collected; hence, the support at φ̂ may safely be ignored.)

• If that is not the case, i.e., if the ith and jth curves occupy the Kth and (K+1)st positions in the sorting at φ̂, then the change in their relative order affects the support: one of the two curves leaves and the other one joins the set of K largest curves at φ̂. The support sets associated with the two adjacent intervals differ only in one element (one contains index i and the other contains index j instead), while the remaining K−1 common indices correspond to the K−1 largest curves at the intersection point φ̂. (Since any interval in Φ follows an intersection point, we actually need to evaluate the support of only the interval that follows the intersection point φ̂; the other interval will be examined at the intersection point that precedes it.) To identify which of the two indices i and j is contained in the support of the interval that follows φ̂, we can visit the "rightmost" point of Φ, that is, φ = π/2. There, due to the continuity of |u_i(φ)| and |u_j(φ)|, the relative order of the two curves within the interval that follows φ̂ will be identical or opposite to their relative order at π/2, depending on whether φ̂ is the later or the earlier, respectively, of the pair's two intersection points in (29).

We have fully described a way to calculate the (at most) two support sets associated with the two intervals incident to one intersection point. Since all intervals are incident to at least one intersection point, it suffices to consider all intersection points, that is, all pairwise combinations of elements of u(φ), to collect the corresponding support sets.

Computational Complexity: A single intersection point is computed in time O(1) through (29). At an intersection point φ̂, the Kth-order element of |u(φ̂)| (and the elements larger than that) can be determined in time O(N), which is also the construction time of the (at most) two support sets associated with the intervals incident to φ̂. Collecting all candidate supports requires examining all N(N−1) intersection points, implying a total construction cost of O(N^3). Since we obtain (at most) two supports for any of the N(N−1) intersection points, the size of the candidate support set S is O(N^2).
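The two steps can be sketched as follows; this is our own illustrative reconstruction (function names, tolerances, and the ±ε probing of the two incident intervals are our assumptions), not the paper's Algorithm 1 verbatim:

```python
import numpy as np
from itertools import combinations

def candidate_supports_rank2(V, K):
    """Collect the top-K supports just left/right of every intersection (29)."""
    N = V.shape[0]
    supports = set()
    for i, j in combinations(range(N), 2):
        for sgn in (1.0, -1.0):              # the two equations u_i = +/- u_j
            den = V[i, 0] - sgn * V[j, 0]
            num = V[i, 1] - sgn * V[j, 1]
            if abs(den) < 1e-12 and abs(num) < 1e-12:
                continue                     # degenerate pair: curves coincide
            phi = -np.arctan2(num, den)      # an intersection angle, cf. (29)
            for p in (phi - 1e-7, phi + 1e-7):   # probe both incident intervals
                u = np.abs(V @ np.array([np.sin(p), np.cos(p)]))
                supports.add(tuple(sorted(np.argpartition(u, -K)[-K:].tolist())))
    return supports

def sparse_pc_rank2(V, K):
    """Return the candidate support with the largest principal singular value."""
    score = lambda I: np.linalg.svd(V[list(I), :], compute_uv=False)[0]
    return max(candidate_supports_rank2(V, K), key=score)
```

On generic data this visits every interval of Φ (each interval begins or ends at some intersection point), so the optimal support is among the O(N²) candidates.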

### III-E Rank-D: A generalized proof

In the general case, V is an N×D matrix. In this subsection, we present our main result, proving that the problem of identifying the K-sparse principal component of a rank-D matrix is solvable with complexity O(N^{D+1}). The statement is true for any value of K (that is, even if K is a function of N). The rest of this subsection contains a constructive proof of this statement.

Since V has size N×D, the auxiliary vector c is a length-D, unit vector. We begin our constructive proof by introducing the auxiliary-angle vector φ ∈ Φ^{D−1} and parameterizing c, as in [28], according to

$$\mathbf{c}(\boldsymbol{\phi}) \triangleq \begin{bmatrix} \sin\phi_1 \\ \cos\phi_1\sin\phi_2 \\ \vdots \\ \cos\phi_1\cdots\cos\phi_{D-2}\sin\phi_{D-1} \\ \cos\phi_1\cdots\cos\phi_{D-2}\cos\phi_{D-1} \end{bmatrix},\quad \boldsymbol{\phi}\in\Phi^{D-1}. \tag{30}$$

Hence, c(φ) lies on the unit-radius semihypersphere. (As in the rank-2 case, we ignore the other semihypersphere because any pair of vectors whose first angles differ by π results in opposite vectors which, however, are equivalent with respect to the optimization metric in (18) and produce the same support in (21).) Then, the candidate set in (22) is re-expressed as

$$\mathcal{S} = \bigcup_{\boldsymbol{\phi}\in\Phi^{D-1}} I(\boldsymbol{\phi}) \tag{31}$$

where, according to (21),

$$I(\boldsymbol{\phi}) \triangleq \mathrm{top}_K(\mathbf{u}(\boldsymbol{\phi})) \tag{32}$$

and, according to (19),

$$\mathbf{u}(\boldsymbol{\phi}) \triangleq \mathbf{V}\mathbf{c}(\boldsymbol{\phi}). \tag{33}$$

That is, for any given φ, the corresponding support I(φ) is obtained with complexity O(N) by selecting the indices of the K absolutely largest elements of u(φ).

To gain some intuition into the purpose of inserting the auxiliary-angle vector φ, notice that every element of c(φ) in (30) is actually a continuous function of φ, and so are the elements of u(φ) in (33). That is, each element of |u(φ)| is a hypersurface (or (D−1)-manifold) over the (D−1)-dimensional space Φ^{D−1}. When we sort the elements of |u(φ₀)| at a given point φ₀, we actually sort the hypersurfaces at point φ₀. The key observation in our algorithm is that, due to their continuity, the hypersurfaces retain their sorting in an area "around" φ₀. This implies the partition of Φ^{D−1} into cells, each of which is associated with a single set of indices of hypersurfaces that lie above it and a single set of indices of hypersurfaces that lie below it. Moreover, each cell contains at least one vertex (that is, an intersection of D hypersurfaces). Finally, for any point of Φ^{D−1} in general position, there is a unique cell containing it, called "normal," which contains uncountably many points of Φ^{D−1} and is associated with a single index-set of cardinality K (that is, exactly K hypersurfaces lie above it). In fact, the indices of these K hypersurfaces are the elements of the support I(φ). Although our discussion refers to the general-D case, for illustrative purposes we consider again the case D = 2 and, in Fig. 2, we revisit the example that we presented in Subsection III-C. The normal cells that are created by the curves are the shaded ones. These cells carry the property that exactly K curves lie above them. We observe that there is a one-to-one correspondence between normal cells and the regions identified in Fig. 1.

According to (31) and the above observations, we need to determine the index-set of every normal cell in the partition. If we collect all such index-sets, then we have constructed S in (31). This will be achieved if, instead, we identify all cells in Φ^{D−1} and, for each cell, determine the K largest hypersurfaces that lie above an arbitrary point of it. The latter returns the desired index-set if the cell is normal. In Fig. 2, we observe that, for each normal cell, the indices of the K largest curves that lie above it can be computed at its leftmost vertex (we can ignore the leftmost normal cell because it produces the same indices as the rightmost one). In the following, we identify all cells in the partition and compute a size-K support for each such cell. This way, we obtain the index-set of every normal cell, among which lies the optimal support I_opt in (9).

Since each cell contains at least one vertex, we only need to find all vertices in the partition and determine the index-sets of all neighboring cells. Recall that a vertex is an intersection of D hypersurfaces. Consider D arbitrary hypersurfaces; say, for example, |u_1(φ)|, |u_2(φ)|, …, |u_D(φ)|. Their intersection satisfies |u_1(φ)| = |u_2(φ)| = ⋯ = |u_D(φ)| and is computed by solving the system of D−1 equations

$$\left\{\begin{aligned} u_1(\boldsymbol{\phi}) \pm u_2(\boldsymbol{\phi}) &= 0 \\ u_1(\boldsymbol{\phi}) \pm u_3(\boldsymbol{\phi}) &= 0 \\ &\;\;\vdots \\ u_1(\boldsymbol{\phi}) \pm u_D(\boldsymbol{\phi}) &= 0 \end{aligned}\right\} \tag{35}$$

or, equivalently,

$$\begin{bmatrix} \mathbf{V}_{1,:} \pm \mathbf{V}_{2,:} \\ \mathbf{V}_{1,:} \pm \mathbf{V}_{3,:} \\ \vdots \\ \mathbf{V}_{1,:} \pm \mathbf{V}_{D,:} \end{bmatrix} \mathbf{c}(\boldsymbol{\phi}) = \mathbf{0}. \tag{36}$$

For any sign combination, the solution to the latter consists of the spherical coordinates of the unit vector in the null space of the leftmost (D−1)×D matrix. (If this matrix is full-rank, then its null space has dimension 1 and c(φ) is uniquely determined, within a sign ambiguity which, however, does not affect the final decision on the index-set. If, instead, the matrix is rank-deficient, then the intersection of the D hypersurfaces, i.e., the solution of (36), is a d-manifold, with d ≥ 1, in the (D−1)-dimensional space Φ^{D−1} and does not generate a new cell. Hence, combinations of rows of V that result in linearly dependent rows of the matrix in (36) can simply be ignored.) Then, the index-set that corresponds to a neighboring cell is computed by

$$\mathrm{top}_K(\mathbf{V}\mathbf{c}(\boldsymbol{\phi})). \tag{37}$$

Note that the D intersecting hypersurfaces have the same value at the vertex. Hence, (37) involves an ambiguity regarding the sorting of these particular D hypersurfaces. If d of these hypersurfaces belong to the K largest ones, then, due to this ambiguity, we have to consider all combinations of d hypersurfaces among the D intersecting ones, where d ∈ {0, 1, …, D}. Finally, we have to repeat the above procedure for all 2^{D−1} sign combinations in (36) and any combination of D intersecting hypersurfaces among the N ones. The total number of combinations is O(N^D); hence, the cardinality of S is upper bounded as stated in (23).

A pseudocode that includes all the above steps is presented in Algorithm 1. Its complexity to build S is determined by the complexity to build each element of it (i.e., each index-set) for each examined intersection through (37). Note that function top_K has complexity O(N) and the cardinality of S is O(N^D). Hence, the overall complexity to build S is O(N^{D+1}). Finally, we mention that the computation of each element of S is performed independently of the others. Therefore, the proposed Algorithm 1, which builds S and solves (1) or, equivalently, (9) with complexity O(N^{D+1}), is fully parallelizable and memory efficient.
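A compact sketch of the general-D construction follows. This is our own reconstruction: the vertex enumeration follows (36), but the ambiguity bookkeeping around each vertex is replaced here by random probing of the incident cells, a simplification that is adequate for small generic instances:

```python
import numpy as np
from itertools import combinations

def candidate_supports(V, K, probes=48, seed=0):
    """Collect top-K supports near every vertex defined by the systems (36)."""
    N, D = V.shape
    rng = np.random.default_rng(seed)
    supports = set()
    for rows in combinations(range(N), D):
        for pattern in range(2 ** (D - 1)):      # sign pattern in (36)
            M = np.array([V[rows[0]] - (1.0 if (pattern >> k) & 1 else -1.0) * V[rows[k + 1]]
                          for k in range(D - 1)])
            _, s, Vh = np.linalg.svd(M)          # M is (D-1) x D
            if s[-1] < 1e-10:
                continue                         # rank-deficient: no isolated vertex
            c = Vh[-1]                           # unit vector spanning the null space
            for _ in range(probes):              # sample the cells incident to the vertex
                cp = c + 1e-6 * rng.standard_normal(D)
                cp /= np.linalg.norm(cp)
                u = np.abs(V @ cp)
                supports.add(tuple(sorted(np.argpartition(u, -K)[-K:].tolist())))
    return supports
```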

## IV An Algorithm of Complexity O(N^2 log N) for Rank-2 Matrices

In the special case of a rank-2 matrix (i.e., D = 2), Algorithm 1 computes the sparse principal component of C with complexity O(N^3) for any sparsity value. In this section, we develop an algorithm that computes the sparse principal component of C with complexity O(N^2 log N).

### IV-A A serial algorithm for rank-2 matrices

For the rank-2 case, Algorithm 1, discussed in Subsection III-D, relied on identifying a polynomial number of non-overlapping intervals on Φ, each associated with a candidate support set. These intervals are induced by the pairwise intersections of the curves in |u(φ)|. The candidate support set associated with an interval is determined at the leftmost point of the interval. The two-step algorithm first computes all pairwise intersection points. In the second step, it determines the set of indices of the K largest elements of |u(φ̂)| at each intersection point φ̂ individually, in linear time. In this section, we present an algorithmic enhancement that reduces the overall computational complexity by exploiting the correlation between the candidate support sets of neighboring intervals.

The relative order of |u_i(φ)| and |u_j(φ)| changes only at the points where the ith and jth curves intersect. Conversely, if those two are the only curves intersecting at a point φ̂, then the ordering of the remaining elements of |u(φ)| in the intervals immediately preceding and succeeding φ̂ is identical. The limited change in the ordering of the curves across an intersection point implies that the differences between the candidate support sets of adjacent intervals cannot be arbitrary: the support sets associated with two neighboring intervals differ in at most one element. More formally, let φ̂ be the intersection point of the ith and jth curves of |u(φ)|. There exists an ε > 0 such that φ̂ is the only intersection point lying in (φ̂−ε, φ̂+ε). If |u_i(φ)| > |u_j(φ)| in (φ̂−ε, φ̂), then |u_i(φ)| < |u_j(φ)| in (φ̂, φ̂+ε), and vice versa. The ordering of all other curves remains unaltered over φ̂. If the ith and jth curves are both members of the candidate support in the interval preceding φ̂, then the support does not change over φ̂; the same holds if neither is included in it. On the contrary, if exactly one of the two curves belongs to the support of the preceding interval, then the candidate sets associated with the two neighboring intervals differ in exactly one element: if, w.l.o.g., the ith curve is the one belonging to the support of the preceding interval, then the support of the succeeding interval is obtained by replacing index i with index j.

The key observation is that, if the candidate support set associated with a particular interval is known, then the set associated with a neighboring interval can be determined with constant complexity. The above observations readily suggest the following procedure for determining the candidate support sets:

1. Compute all pairwise intersection points of the curves in |u(φ)|.

2. Sort the intersection points in increasing order. Let φ̂_1 ≤ φ̂_2 ≤ ⋯ ≤ φ̂_{N(N−1)} be the sorted sequence.

3. Determine I_1, the set of indices of the K largest curves in |u(φ)| at the first intersection point φ̂_1.

4. Successively determine the candidate support sets at the consecutive intersection points. At φ̂_m, the candidate support set I_m is determined by appropriately updating I_{m−1}, the set associated with the previous interval.

Once all candidate support sets have been collected, the optimal solution is determined as in Algorithm 1.

An illustrative figure that explains the above steps is presented in Fig. 3, using the same matrix V as the one examined in Section III (Figs. 1 and 2). We seek the optimal K-sparse solution. The algorithm starts at φ = −π/2 and scans Φ, keeping track of the Kth-order curve, highlighted with a black solid line. Vertical dashed lines indicate the intersection points at which the support of the optimal solution changes. Regions of Φ corresponding to different supports have different background colors, and the optimal support in each region is depicted at the top of the figure. Note that two of the possible support sets need not be considered: they do not appear for any φ ∈ Φ.

A pseudocode that implements the above serial algorithmic steps is presented in Algorithm 2. It improves upon the computational complexity of its parallel counterpart (Algorithm 1) by circumventing the construction of the candidate set from scratch at each individual intersection point. The computation and sorting of the O(N²) intersection points (steps 1 and 2) is performed in O(N² log N) operations. The construction of the initial candidate support set is performed in linear time. Finally, each successive update in the last step requires a constant number of operations. Therefore, the overall complexity of the serial Algorithm 2 is O(N² log N), as opposed to the O(N³) complexity of the parallel Algorithm 1 for D = 2. The disadvantage of Algorithm 2 is that it is not parallelizable.

### IV-B A modified serial algorithm for rank-2 matrices

Algorithms 1 and 2 described so far collect the candidate support sets by examining all pairwise intersection points. The number of distinct candidate support sets, however, can be significantly smaller: multiple intervals may be associated with the same support set. In the following, we describe a simple modification of Algorithm 2 that aims at reducing the total number of intersection points computed and examined. Although the worst-case complexity remains the same, the modified serial algorithm may significantly speed up execution in certain cases.

Consider three consecutive intersection points x1, x2, and x3, such that the candidate support sets on the two sides of x2 are different (see, for example, Fig. 4). Let I1 denote the set of indices of the K largest curves in the interval (x1, x2). Similarly, I2 is the set associated with (x2, x3). The two sets differ by one element: if I is the set of their common elements, then I1 = I ∪ {i} and I2 = I ∪ {j}, for some curves i and j.

At x2, over which the candidate support set changes, the i-th and j-th curves intersect. The j-th curve joins the set of K largest curves, displacing the i-th curve which was a member of I1. In particular, the i-th curve must be the smallest element of I1 in (x1, x2). To see that, assume for the sake of contradiction that, among the curves in I1, the m-th curve was the smallest one in (x1, x2), for some m ≠ i. Then, the i-th curve lies above the m-th one in (x1, x2), whereas, by assumption, it drops below all curves of I2 (and hence below the m-th one) immediately after x2. Due to the continuity of the curves, there must exist a point in (x1, x2] where the i-th and m-th curves intersect. However, no intersection point lies in (x1, x2), and the only curves intersecting at x2 are the i-th and j-th ones; a contradiction. Following a similar argument, the j-th curve must be the largest curve in (x1, x2) among those not included in I1. Moreover, the j-th curve becomes the K-th order element in (x2, x3), i.e., it is the smallest curve in I2.

The key observation is that, along the interval, the candidate support set changes only at the intersection points of the K-th order curve. Assume that the candidate support set I is known at a point x and the i-th curve is the K-th order curve, i.e., the smallest curve in I. Moving to the right of x, the first point where the candidate support set can potentially change is the closest intersection point of the i-th curve. Let x' be that point and let the m-th curve be the intersecting curve. If m ∉ I, then the m-th curve joins the set at x', displacing the i-th curve. If m ∈ I, then x' is only a point of internal reordering for I, hence the candidate support set remains unchanged. In either case, however, the m-th curve becomes the K-th order curve immediately after x'. Proceeding in a similar fashion, the next point where the candidate support set can potentially change is the point closest to x' where the m-th curve intersects one of the other curves.

A pseudocode that implements the above steps and modifications is presented in Algorithm 3. In summary, instead of computing all pairwise intersections at the first step, Algorithm 3 postpones the computation of the intersection points of a curve until the latter becomes the K-th order curve. The motivation behind this enhancement lies in the fact that multiple curves might never become K-th in order.
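The lazy strategy can be outlined in code. As before, this is only a hedged sketch under a hypothetical toy model (curve i is the line y_i(x) = a[i]·x + b[i] over a finite window [x_lo, x_hi]), not the paper's Algorithm 3; the function name and arguments are our own.

```python
def lazy_candidate_supports(a, b, K, x_lo, x_hi):
    """Sketch of the lazy sweep: intersections are computed only for the
    curve that currently holds the K-th order, since the candidate
    support set can change nowhere else. Toy model: y_i(x) = a[i]*x + b[i]."""
    N = len(a)
    order = sorted(range(N), key=lambda i: a[i] * x_lo + b[i], reverse=True)
    support = set(order[:K])
    supports = {frozenset(support)}
    kth = order[K - 1]          # current K-th order curve
    x = x_lo
    while True:
        # nearest intersection of the K-th order curve to the right of x
        best = None
        for m in range(N):
            if m == kth or a[m] == a[kth]:
                continue
            xi = (b[m] - b[kth]) / (a[kth] - a[m])
            if x < xi <= x_hi and (best is None or xi < best[0]):
                best = (xi, m)
        if best is None:        # no further crossings: done
            return supports
        x, m = best
        if m not in support:    # m joins the top K, displacing kth
            support.discard(kth)
            support.add(m)
            supports.add(frozenset(support))
        # in either case, m is the new K-th order curve just after x
        kth = m
```

Note the design choice: each iteration computes at most N − 1 intersections for the current K-th order curve, so curves that never reach the K-th order contribute no intersection computations at all.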

To justify this, in Fig. 5 we present the average number of intersection points computed by Algorithm 3 as a function of N, for three sparsity values. (Although even fewer points will eventually be visited, the complexity of Algorithm 3 is dominated by the number of intersections computed.) The average is estimated over independent instances of the matrix. We also present the average number of distinct candidate supports for the three sparsity values, as well as the total number of pairwise intersection points. We observe that Algorithm 3 computes noticeably fewer points than the total number of pairwise intersection points, indicating that a significant fraction of the curves never rises to the K-th order. Finally, we note that the average number of distinct candidate supports is significantly smaller than the number of intersection points computed by Algorithm 3, revealing potential for further reduction of the computational complexity of the algorithms developed in this work.

## V Conclusions

We proved that the sparse principal component of an N × N matrix is computable with complexity O(N^(D+1)) if the matrix is positive semidefinite and its rank D is constant. This holds true for any sparsity value (that is, even if the sparsity grows with N). Our constructive proof was accompanied by a fully parallelizable and memory-efficient algorithm which computes the sparse principal component with complexity O(N^(D+1)). For the special case of rank-2 matrices, we developed alternative serial algorithms of complexity O(N² log N). The construction steps and properties of the algorithms that we presented in this work indicate that implementations of even lower complexity may be possible.
