Innovation Pursuit: A New Approach to Subspace Clustering

12/02/2015 ∙ by Mostafa Rahmani, et al. ∙ University of Central Florida

In subspace clustering, data points drawn from a union of subspaces are assigned membership to their respective subspaces. This paper presents a new approach dubbed Innovation Pursuit (iPursuit) to the problem of subspace clustering using a new geometrical idea whereby subspaces are identified based on their relative novelties. We present two frameworks in which the idea of innovation pursuit is used to distinguish the subspaces. Underlying the first framework is an iterative method that finds the subspaces consecutively by solving a series of simple linear optimization problems, each searching for a direction of innovation in the span of the data that is potentially orthogonal to all subspaces except for the one to be identified in that step of the algorithm. A detailed mathematical analysis is provided establishing sufficient conditions for iPursuit to correctly cluster the data. The proposed approach can provably yield exact clustering even when the subspaces have significant intersections. It is shown that the complexity of the iterative approach scales only linearly in the number of data points and subspaces, and quadratically in the dimension of the subspaces. The second framework integrates iPursuit with spectral clustering to yield a new variant of spectral-clustering-based algorithms. Numerical simulations with both real and synthetic data demonstrate that iPursuit can often outperform the state-of-the-art subspace clustering algorithms, more so for subspaces with significant intersections, and that it significantly improves the state-of-the-art result for subspace-segmentation-based face clustering.


I Introduction

The grand challenge of contemporary data analytics and machine learning lies in dealing with ever-increasing amounts of high-dimensional data from multiple sources and different modalities. The high-dimensionality of data increases the computational complexity and memory requirements of existing algorithms and can adversely degrade their performance [1, 2]. However, the observation that high-dimensional datasets often have intrinsic low-dimensional structures has enabled some noteworthy progress in analyzing such data. For instance, high-dimensional digital facial images under different illumination were shown to approximately lie in a very low-dimensional subspace, which led to the development of efficient algorithms that leverage low-dimensional representations of such images [3, 4].

Linear subspace models are widely used in data analysis since many datasets can be well-approximated with low-dimensional subspaces [5]. When data in a high-dimensional space lies in a single subspace, conventional techniques such as Principal Component Analysis can be efficiently used to find the underlying low-dimensional subspace [6, 7, 8]. However, in many applications the data points may be originating from multiple independent sources, in which case a union of subspaces can better model the data [9].

Subspace clustering is concerned with learning these low-dimensional subspaces and clustering the data points to their respective subspaces [10, 11, 12, 13, 14, 9, 15, 16, 17]. This problem arises in many applications, including computer vision (e.g., motion segmentation [9], face clustering [18]), gene expression analysis [19, 20], and image processing [21]. Some of the difficulties associated with subspace clustering are that neither the number of subspaces nor their dimensions are known, in addition to the unknown membership of the data points to the subspaces.

I-A Related work

Numerous approaches for subspace clustering have been studied in prior work, including statistical approaches [22], spectral clustering [10], the algebraic-geometric approach [15] and iterative methods [23]. In this section, we briefly discuss some of the most popular existing approaches for subspace clustering. We refer the reader to [9] for a comprehensive survey on the topic. Iterative algorithms such as [23, 16] were some of the first methods addressing the multi-subspace learning problem. These algorithms alternate between assigning the data points to the identified subspaces and updating the subspaces. Some of the drawbacks of this class of algorithms are that they can converge to a local minimum and that they typically assume the dimensions of the subspaces and their number to be known.

Another notable idea for subspace segmentation is based on the algebraic-geometric approach. These algorithms, such as Generalized Principal Component Analysis (GPCA) [15], fit the data using a set of polynomials whose gradients at a point are orthogonal to the subspace containing that point. GPCA does not impose any restrictive conditions on the subspaces (they do not need to be independent), but it is sensitive to noise and has exponential complexity in the number of subspaces and their dimensions.

A class of clustering algorithms, termed statistical clustering methods, makes assumptions about the distribution of the data in the subspaces. For example, the iterative algorithms in [24, 25] assume that the distribution of the data points in the subspaces is Gaussian. These algorithms typically require prior specifications for the number and dimensions of the subspaces, and are generally sensitive to initialization. Random sample consensus (RANSAC) is another iterative statistical method for robust model fitting [26], which recovers one subspace at a time by repeatedly sampling small subsets of data points and identifying a consensus set consisting of all the points in the entire dataset that belong to the subspace spanned by the selected points. The consensus set is then removed and the steps are repeated until all the subspaces are identified. The main drawback of this approach is scalability, since the number of trials required to select points in the same subspace grows exponentially with the number and dimension of the subspaces.

Much of the recent research work on subspace clustering is focused on spectral clustering [27] based methods [28, 29, 10, 11, 13, 30, 14, 31, 17, 32]. These algorithms consist of two main steps and mostly differ in the first step. In the first step, a similarity matrix is constructed by finding a neighborhood for each data point, and in the second step spectral clustering [27] is applied to the similarity matrix. Recently, several spectral clustering based algorithms were proposed with both theoretical guarantees and superior empirical performance. Sparse Subspace Clustering (SSC) [10] uses ℓ1-minimization for neighborhood construction. In [14], it was shown that under certain conditions, SSC can yield exact clustering even for subspaces with intersection. Another algorithm called Low-Rank Representation (LRR) [13] uses nuclear norm minimization to find the neighborhoods (i.e., build the similarity matrix). LRR is robust to outliers but has provable guarantees only when the data is drawn from independent subspaces. In this paper, we show that the idea of innovation pursuit underlying the proposed iterative subspace clustering method can also be used to design a new spectral clustering based method. The resulting spectral clustering based algorithm is shown to notably outperform the existing spectral clustering based methods.

I-B Contributions

The proposed method advances the state-of-the-art research in subspace clustering on several fronts. First, iPursuit rests on a novel geometrical idea whereby the subspaces are identified by searching the directions of innovation in the span of the data. Second, to the best of our knowledge this is the first scalable iterative algorithm with provable guarantees – the computational complexity of iPursuit only scales linearly in the number of subspaces and quadratically in their dimensions (c.f. Section II-E). By contrast, GPCA [15, 9] (without spectral clustering) and RANSAC [26], which are popular iterative algorithms, have exponential complexity in the number of subspaces and their dimensions. Third, innovation pursuit in the data span enables superior performance when the subspaces have considerable intersections, and brings about substantial speedups, in comparison to the existing subspace clustering approaches. Fourth, the formulation enables many variants of the algorithm to inherit robustness properties in highly noisy settings (c.f. Section III). Fifth, the proposed innovation search approach can be integrated with spectral clustering to yield a new spectral clustering based algorithm, which is shown to notably outperform the existing spectral clustering based methods and yields the state-of-the-art results in the challenging face clustering problem.

I-C Notation and definitions

Bold-face upper-case letters are used to denote matrices and bold-face lower-case letters are used to denote vectors. Given a matrix A, ‖A‖ denotes its spectral norm. For a vector a, ‖a‖ denotes its ℓ2-norm and ‖a‖_1 its ℓ1-norm. Given two matrices A_1 and A_2 with an equal number of rows, the matrix A = [A_1 A_2] is the matrix formed from the concatenation of A_1 and A_2. Given matrices {A_i} with an equal number of rows, we use the union symbol to define the matrix ⋃_i A_i as the concatenation of the matrices {A_i}. For a matrix A, we overload the set membership operator by using the notation a ∈ A to signify that a is a column of A. A collection of subspaces {G_i}_{i=1}^n is said to be independent if dim(⊕_{i=1}^n G_i) = Σ_{i=1}^n dim(G_i), where ⊕ denotes the direct sum operator and dim(G_i) is the dimension of G_i. Given a vector a, |a| is the vector whose elements are equal to the absolute values of the elements of a. For a real number a, sgn(a) is equal to 1 if a > 0, 0 if a = 0, and -1 if a < 0. The complement of a set L is denoted L^c. Also, for any positive integer n, the index set {1, …, n} is denoted [n].

Innovation subspace

Consider two subspaces S_1 and S_2 such that one is not contained in the other. This means that each subspace carries some innovation w.r.t. the other. As such, corresponding to each subspace we define an innovation subspace, which is its novelty (innovation) relative to the other subspace. More formally, the innovation subspace is defined as follows.

Definition 1.

Assume that V_1 and V_2 are two orthonormal bases for S_1 and S_2, respectively. We define the subspace I(S_2 ⊥ S_1) as the innovation subspace of S_2 over S_1, that is, the subspace spanned by the columns of (I − V_1 V_1^T) V_2. In other words, I(S_2 ⊥ S_1) is the complement of S_1 in the subspace S_1 ⊕ S_2.

In a similar way, we can define I(S_1 ⊥ S_2) as the innovation subspace of S_1 over S_2. The subspace I(S_1 ⊥ S_2) is the complement of S_2 in S_1 ⊕ S_2. Fig. 1 illustrates a scenario in which the data lies in a two-dimensional subspace S_1 and a one-dimensional subspace S_2. The innovation subspace of S_2 over S_1 is orthogonal to S_1. Since S_1 and S_2 are independent, S_2 and I(S_2 ⊥ S_1) have equal dimension. It is easy to see that the dimension of I(S_2 ⊥ S_1) is equal to the dimension of S_1 ⊕ S_2 minus the dimension of S_1.
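To make Definition 1 concrete, the following minimal Python sketch computes an orthonormal basis for the innovation subspace I(S_2 ⊥ S_1) from orthonormal bases of the two subspaces; the function name and the numerical tolerance are illustrative choices, not notation from the paper.

```python
import numpy as np

def innovation_basis(V1, V2, tol=1e-10):
    """Orthonormal basis for the innovation subspace of span(V2) over span(V1).

    V1, V2: matrices with orthonormal columns spanning S1 and S2.
    Following Definition 1, the innovation subspace is spanned by the columns
    of (I - V1 V1^T) V2, i.e., the component of S2 orthogonal to S1.
    """
    # Project the basis of S2 onto the orthogonal complement of S1.
    R = V2 - V1 @ (V1.T @ V2)
    # Orthonormalize the projected columns and drop numerically null directions.
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    return U[:, s > tol]

# Toy example: S1 is a plane, S2 is a line not contained in S1.
V1 = np.eye(5)[:, :2]                       # span{e1, e2}
v2 = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
V2 = (v2 / np.linalg.norm(v2)).reshape(-1, 1)
P = innovation_basis(V1, V2)
print(P.shape)           # (5, 1): a one-dimensional innovation subspace
print(np.abs(V1.T @ P))  # ~0: the innovation subspace is orthogonal to S1
```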

II Proposed Approach

In this section, the core idea underlying iPursuit is described by first introducing a non-convex optimization problem. Then, we propose a convex relaxation and show that solving the convex problem yields the correct subspaces under mild sufficient conditions. First, we present the idea of iPursuit and its analysis within the proposed iterative framework. In Section IV, we present an alternative framework wherein the proposed innovation search approach is integrated with spectral clustering to yield a new variant of spectral-clustering-based algorithms.

In this section, it is assumed that the given data matrix follows the following data model.

Data Model 1. The data matrix D can be represented as D = [D_1 … D_N] T, where T is an arbitrary permutation matrix. The columns of D_i lie in S_i, where S_i is an r_i-dimensional linear subspace, for 1 ≤ i ≤ N, and n_i denotes the number of columns of D_i. Define V_i as an orthonormal basis for S_i. In addition, define span(D) as the space spanned by the data, i.e., span(D) = ⊕_{i=1}^N S_i. Moreover, it is assumed that every subspace in the set of subspaces {S_i}_{i=1}^N has an innovation over the other subspaces, that is to say, for 1 ≤ i ≤ N, the subspace S_i does not completely lie in ⊕_{j≠i} S_j. In addition, the columns of D are normalized, i.e., have unit ℓ2-norm.
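For concreteness, a small synthetic-data generator consistent with Data Model 1 (random subspaces, unit ℓ2-norm columns) is sketched below; all dimensions, names, and the random-seed handling are illustrative choices.

```python
import numpy as np

def data_model_1(M1, dims, n_points, rng=None):
    """Generate a data matrix following Data Model 1: for each i, n_points[i]
    unit-norm columns drawn from a random dims[i]-dimensional subspace of R^M1.
    Returns the (unpermuted) data matrix and the ground-truth labels.
    """
    rng = np.random.default_rng(rng)
    blocks, labels = [], []
    for i, (r, n) in enumerate(zip(dims, n_points)):
        V = np.linalg.qr(rng.standard_normal((M1, r)))[0]   # random basis of S_i
        Di = V @ rng.standard_normal((r, n))
        blocks.append(Di / np.linalg.norm(Di, axis=0))      # unit l2-norm columns
        labels += [i] * n
    return np.concatenate(blocks, axis=1), np.array(labels)

# Example: N = 3 subspaces of dimension 5 in R^50, 100 points each.
D, labels = data_model_1(50, dims=[5, 5, 5], n_points=[100, 100, 100], rng=0)
```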

II-A Innovation pursuit: Insight

iPursuit is a multi-step algorithm that identifies one subspace at a time. In each step, the data is clustered into two subspaces. One subspace is the identified subspace and the other one is the direct sum of the other subspaces. The data points of the identified subspace are removed and the algorithm is applied to the remaining data to find the next subspace. Accordingly, each step of the algorithm can be interpreted as a subspace clustering problem with two subspaces. Therefore, for ease of exposition we first investigate the two-subspace scenario and then extend the results to multiple (more than two) subspaces. Thus, in this subsection, it is assumed that the data follows Data Model 1 with N = 2.

To gain some intuition, we consider an example before stating our main result. Consider the case where S_1 and S_2 are not orthogonal and assume that n_1 > n_2. The non-orthogonality of S_1 and S_2 is not a requirement, but is merely used herein to easily explain the idea underlying the proposed approach. Let c* be the optimal point of the following optimization problem

min_c ‖c^T D‖_0   subject to   c ∈ span(D),  ‖c‖ = 1,     (1)

where ‖·‖_0 is the ℓ0-norm. Hence, ‖c^T D‖_0 is equal to the number of non-zero elements of c^T D. The first constraint forces the search for c in the span of the data, and the equality constraint is used to avoid the trivial solution c = 0. Assume that the data points are distributed in S_1 and S_2 uniformly at random. Thus, the data is not aligned with any direction in S_1 and S_2 with high probability (whp).

Fig. 1: The subspace I(S_2 ⊥ S_1) is the innovation subspace of S_2 over S_1. The subspace I(S_2 ⊥ S_1) is orthogonal to S_1.

The optimization problem (1) searches for a non-zero vector in the span of the data that is orthogonal to the maximum number of data points. We claim that the optimal point of (1) lies in I(S_2 ⊥ S_1) whp given the assumption that the number of data points in S_1 is greater than the number of data points in S_2. In addition, since the feasible set of (1) is restricted to span(D), there is no feasible vector that is orthogonal to the whole data. To further clarify this argument, consider the following scenarios:

  1. If c lies in S_1, then it cannot be orthogonal to most of the data points in S_1 since the data is uniformly distributed in the subspaces. In addition, it cannot be orthogonal to the data points in S_2 given that S_1 and S_2 are not orthogonal. Therefore, the optimal point of (1) cannot be in S_1 given that the optimal vector should be orthogonal to the maximum number of data points. Similarly, the optimal point cannot lie in S_2.

  2. If c lies in I(S_1 ⊥ S_2), then it is orthogonal to the data points in S_2. However, n_1 > n_2. Thus, if c lies in I(S_2 ⊥ S_1) (which is orthogonal to S_1), the cost function of (1) can be decreased.

  3. If c lies in none of the subspaces S_1, S_2, and their innovation subspaces, then it is not orthogonal to the data points in S_1 nor to those in S_2. Therefore, c cannot be orthogonal to the maximum number of data points.

Therefore, the algorithm is likely to choose the optimal point from I(S_2 ⊥ S_1). Thus, if c* ∈ I(S_2 ⊥ S_1), we can obtain S_2 from the span of the columns of D corresponding to the non-zero elements of (c*)^T D. The following lemma ensures that these columns span S_2.

Lemma 1.

The columns of D corresponding to the non-zero elements of (c*)^T D span S_2 if both conditions below are satisfied:

  1. The optimal point c* lies in I(S_2 ⊥ S_1).

  2. D_2 cannot follow Data Model 1 with N > 1, that is, the data points in S_2 do not lie in a union of lower-dimensional subspaces each with innovation w.r.t. the other subspaces.

We note that if the second requirement of Lemma 1 is not satisfied, the columns of D_2 follow Data Model 1 with N > 1, in which case the problem can be viewed as one of subspace clustering with more than two subspaces. The clustering problem with more than two subspaces will be addressed in Section II-D.

Remark 1.

At a high level, the innovation search optimization problem (1) finds the most sparse vector in the row space of D. Interestingly, finding the most sparse vector in a linear subspace has bearing on, and has been effectively used in, other machine learning problems, including dictionary learning and spectral estimation [33, 34]. In addition, it is interesting to note that, in contrast to SSC, which finds the most sparse vectors in the null space of the data, iPursuit searches for the most sparse vector in the row space of the data.

Fig. 2: Data distributions in a two-dimensional subspace. The blue stars and red circles are the data points and their projections on the unit circle, respectively. In the left plot, the data points are distributed uniformly at random. Thus, they are not aligned along any specific directions and the permeance statistic cannot be small. In the right plot, the data points are aligned, hence the permeance statistic is small.

II-B Convex relaxation

The cost function of (1) is non-convex and the combinatorial ℓ0-norm minimization may not be computationally tractable. Since the ℓ1-norm is known to provide an efficient convex approximation of the ℓ0-norm, we relax the non-convex cost function and rewrite (1) as

min_c ‖c^T D‖_1   subject to   c ∈ span(D),  ‖c‖ = 1.     (2)

The optimization problem (2) is still non-convex in view of the non-convexity of its feasible set. Therefore, we further substitute the equality constraint with a linear constraint and rewrite (2) as

(IP)   min_c ‖c^T D‖_1   subject to   c ∈ span(D),  c^T q = 1.     (3)

(IP) is the core program of iPursuit to find a direction of innovation. Here, q is a unit ℓ2-norm vector that is not orthogonal to span(D). The vector q can be chosen as a random unit vector in span(D). In Section II-F, we develop a methodology to learn a good choice for q from the given data matrix. The relaxation of the quadratic equality constraint to a linear constraint is a common technique in the literature [34].
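Before turning to the performance guarantees, a minimal sketch of how (3) can be solved in practice: once c is parameterized by an orthonormal basis of span(D), (IP) becomes a linear program. The formulation below, using scipy's LP solver with local names Q, G, f, is an illustrative alternative to the ADMM solver developed in Section II-E, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def innovation_direction(D, q, rank=None):
    """Solve min ||c^T D||_1  s.t.  c in span(D), c^T q = 1  (program (IP)).

    Parameterize c = Q a with Q an orthonormal basis of span(D); the problem
    becomes an LP in (a, t):  min sum(t)  s.t. -t <= D^T Q a <= t, q^T Q a = 1.
    """
    U, s, _ = np.linalg.svd(D, full_matrices=False)
    r = rank if rank is not None else int(np.sum(s > 1e-10 * s[0]))
    Q = U[:, :r]                       # orthonormal basis of the data span
    G = D.T @ Q                        # rows are d_i^T Q
    f = Q.T @ q
    N = G.shape[0]
    # Decision variable x = [a (r entries), t (N entries)]; minimize sum(t).
    c_obj = np.concatenate([np.zeros(r), np.ones(N)])
    A_ub = np.block([[G, -np.eye(N)], [-G, -np.eye(N)]])
    b_ub = np.zeros(2 * N)
    A_eq = np.concatenate([f, np.zeros(N)]).reshape(1, -1)
    res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] * r + [(0, None)] * N, method="highs")
    a = res.x[:r]
    return Q @ a                       # the direction of innovation c
```

Clustering then follows by inspecting |c^T D|: columns with (near-)zero inner product belong to the subspace(s) orthogonal to the returned direction.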

II-C Segmentation of two subspaces: Performance guarantees

Based on Lemma 1, to show that the proposed program (3) yields correct clustering, it suffices to show that the optimal point of (3) lies in I(S_2 ⊥ S_1) given that condition ii of Lemma 1 is satisfied, or lies in I(S_1 ⊥ S_2) given that condition ii of Lemma 1 is satisfied for D_1. The following theorem provides sufficient conditions for the optimal point of (3) to lie in I(S_2 ⊥ S_1) provided that

(4)

If the inequality in (4) is reversed, then parallel sufficient conditions can be established for the optimal point of (3) to lie in the alternative subspace I(S_1 ⊥ S_2). Hence, assumption (4) does not lead to any loss of generality.

Since I(S_2 ⊥ S_1) and I(S_1 ⊥ S_2) are orthogonal to S_1 and S_2, respectively, condition (4) is equivalent to

(5)

Conceptually, assumption (5) is related to the assumption used in the example of Section II-A in the sense that it makes it more likely for the direction of innovation to lie in I(S_2 ⊥ S_1). (Henceforth, when the two-subspace scenario is considered and (5) is satisfied, the innovation subspace refers to I(S_2 ⊥ S_1).)

The sufficient conditions of Theorem 2 for the optimal point of (3) to lie in I(S_2 ⊥ S_1) are characterized in terms of the optimal solution to an oracle optimization problem (OP), in which the feasible set of (IP) is replaced by the innovation subspace I(S_2 ⊥ S_1). The oracle problem (OP) is defined as

(OP)   min_c ‖c^T D‖_1   subject to   c ∈ I(S_2 ⊥ S_1),  c^T q = 1.     (6)

Before we state the theorem, we also define the index set L_0 comprising the indices of the columns of D_2 that are orthogonal to c* (the optimal solution to (OP)),

L_0 = { i : (c*)^T d_i = 0, d_i a column of D_2 },     (7)

with cardinality n_0 and a complement set L_0^c.

Theorem 2.

Suppose the data matrix D follows Data Model 1 with N = 2. Also, assume that condition (5) and the requirement of Lemma 1 for D_2 are satisfied (condition ii of Lemma 1). Let c* be the optimal point of the oracle program (OP) and define

(8)

Also, let P denote an orthonormal basis for I(S_2 ⊥ S_1), n_0 the cardinality of L_0 defined in (7), and assume that q is a unit ℓ2-norm vector in span(D) that is not orthogonal to I(S_2 ⊥ S_1). If

(9)

then the vector defined in (8), which lies in I(S_2 ⊥ S_1), is the optimal point of (IP) in (3), and iPursuit clusters the data correctly.

In what follows, we provide a detailed discussion of the significance of the sufficient conditions (9), which reveal some interesting facts about the properties of iPursuit.

1. The distribution of the data matters.
The LHS of (9) is known as the permeance statistic [8]. For a set of data points {d_i} in a subspace S, the permeance statistic is defined as

P({d_i}, S) = min_{u ∈ S, ‖u‖ = 1} Σ_i |u^T d_i|.     (10)

The permeance statistic is an efficient measure of how well the data is distributed in the subspace. Fig. 2 illustrates two scenarios for the distribution of data in a two-dimensional subspace. In the left plot, the data points are distributed uniformly at random. In this case, the permeance statistic cannot be small since the data points are not concentrated along any directions. In the right plot, the data points are concentrated along some direction and hence the data is not well distributed in the subspace. In this case, we can find a direction along which the data has small projection.

Having n_0 on the RHS underscores the relevance of the distribution of the data points within S_2, since c* cannot be simultaneously orthogonal to a large number of columns of D_2 if the data does not align along particular directions. Hence, the distribution of the data points within each subspace has bearing on the performance of iPursuit. We emphasize that the uniform distribution of the data points is not a requirement of the algorithm, as shown in the numerical experiments. Rather, it is used in the proof of Theorem 2, which establishes sufficient conditions for correct subspace identification under uniform data distribution in worst case scenarios.
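The permeance statistic (10) can be estimated numerically to gauge how well a set of points covers its subspace. The sketch below samples random unit directions inside the subspace and returns the smallest aggregate projection found; since the true statistic is a minimum over all unit directions, this sampling estimate is an upper bound, and the function name and sample count are illustrative.

```python
import numpy as np

def permeance_estimate(D_s, V, n_samples=2000, rng=None):
    """Monte Carlo estimate of the permeance statistic of the columns of D_s
    within the subspace spanned by the orthonormal basis V, i.e.,
    min over unit u in span(V) of sum_i |u^T d_i| (cf. (10)).

    Random unit directions inside the subspace give an upper bound on the minimum.
    """
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((V.shape[1], n_samples))
    A /= np.linalg.norm(A, axis=0)           # unit vectors in the coordinates of V
    U = V @ A                                 # unit directions inside the subspace
    return np.min(np.sum(np.abs(U.T @ D_s), axis=1))

# Well-spread points give a large statistic; points aligned with one direction do not.
rng = np.random.default_rng(0)
V = np.linalg.qr(rng.standard_normal((50, 2)))[0]    # a random 2-D subspace
spread = V @ rng.standard_normal((2, 100))           # well-distributed points
spread /= np.linalg.norm(spread, axis=0)
aligned = V @ (np.outer(np.ones(2), rng.standard_normal(100))
               + 0.05 * rng.standard_normal((2, 100)))
aligned /= np.linalg.norm(aligned, axis=0)
print(permeance_estimate(spread, V), permeance_estimate(aligned, V))
```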

2. The coherency of q with I(S_2 ⊥ S_1) is an important factor.
An important performance factor for iPursuit is the coherency of the vector q with the innovation subspace I(S_2 ⊥ S_1). To clarify, suppose that (5) is satisfied and assume that the vector q lies in S_1 ⊕ S_2. If the optimal point of (3) lies in I(S_2 ⊥ S_1), iPursuit will yield exact clustering. However, if q is strongly coherent with S_1 (i.e., the vector q has small projection on I(S_2 ⊥ S_1)), then the optimal point of (3) may not lie in I(S_2 ⊥ S_1). The rationale is that the Euclidean norm of any feasible point of (3) lying in I(S_2 ⊥ S_1) will have to be large to satisfy the equality constraint when q is incoherent with I(S_2 ⊥ S_1), which in turn would increase the cost function. As a matter of fact, the coherence factor appearing in the second inequality of (9) confirms our intuition about the importance of the coherency of q with I(S_2 ⊥ S_1). In particular, (9) suggests that iPursuit could have more difficulty yielding correct clustering if the projection of q on the subspace S_1 is increased (i.e., the projection of the vector q on the subspace I(S_2 ⊥ S_1) is decreased). The coherence property could have a more serious effect on the performance of the algorithm for non-independent subspaces, especially when the dimension of their intersection is significant. For instance, consider the scenario where the vector q is chosen randomly from S_1 ⊕ S_2, and define y as the dimension of the intersection of S_1 and S_2. It follows that I(S_2 ⊥ S_1) has dimension r_2 − y. Thus, the expected squared norm of the projection of q onto I(S_2 ⊥ S_1) is (r_2 − y)/(r_1 + r_2 − y). Therefore, a randomly chosen vector q is likely to have a small projection on the innovation subspace when y is large. As such, in dealing with subspaces with significant intersection, it may not be favorable to choose the vector q at random. In Section II-F and Section III, we develop a simple technique to learn a good choice for q from the given data. This technique makes iPursuit remarkably powerful in dealing with subspaces with intersection, as shown in the numerical results section.

Now, we demonstrate that the sufficient conditions (9) are not restrictive. The following lemma simplifies the sufficient conditions of Theorem 2 when the data points are randomly distributed in the subspaces. In this setting, we show that the conditions are naturally satisfied.

Lemma 3.

Assume D follows Data Model 1 with N = 2 and the data points are drawn uniformly at random from the intersection of the unit sphere and each subspace. If

(11)

then, with high probability, the optimal point of (3) lies in I(S_2 ⊥ S_1).

When the data does not align along any particular directions, will be much smaller than since the vector can only be simultaneously orthogonal to a small number of the columns of . Noting that the LHS of (11) has order and the RHS has order (which is much smaller than when is sufficiently large), we see that the sufficient conditions are naturally satisfied when the data is well-distributed within the subspaces.

II-D Clustering multiple subspaces

In this section, the performance guarantees provided in Theorem 2 are extended to more than two subspaces. iPursuit identifies the subspaces consecutively, i.e., one subspace is identified in each step. The data lying in the identified subspace is removed and optimal direction-search (3) is applied to the remaining data points to find the next subspace. This process is continued to find all the subspaces. In order to analyze iPursuit for the scenarios with more than two subspaces, it is helpful to define the concept of minimum innovation.

Definition 2.

A subspace S_i is said to have minimum innovation w.r.t. the vector q in the set {S_j}_{j=1}^N if and only if

(12)

for every j ≠ i, where q is a unit ℓ2-norm vector in the span of the data.

If S_1 has minimum innovation in the set {S_i}_{i=1}^N (w.r.t. the vector q_1 used in the first step), then we expect iPursuit to find S_1 in the first step. Similar to Theorem 2, we make the following assumption without loss of generality.

Assumption 1.

Assume that S_k has minimum innovation w.r.t. q_k in the set {S_i}_{i=k}^N, for 1 ≤ k ≤ N − 1.

According to Assumption 1, if q_1 is used in the first step as the linear constraint of the innovation pursuit optimization problem, iPursuit is expected to first identify S_1. In each step, the problem is equivalent to disjoining two subspaces. In particular, in the k-th step, the algorithm is expected to identify S_k, which can be viewed as separating the two subspaces S_k and ⊕_{i=k+1}^N S_i by solving

(IP_k)   min_c ‖c^T D_k‖_1   subject to   c ∈ ⊕_{i=k}^N S_i,  c^T q_k = 1.     (13)

Note that ⊕_{i=k}^N S_i is the span of the data points that have not yet been removed, and D_k denotes the matrix of these remaining data points. Based on this observation, we can readily state the following theorem, which provides sufficient conditions for iPursuit to successfully identify S_k in the k-th step. The proof of this theorem follows directly from Theorem 2 with two subspaces, namely, S_k and ⊕_{i=k+1}^N S_i. Similar to Theorem 2, the sufficient conditions are characterized in terms of the optimal solution to an oracle optimization problem (OP_k) defined below.

(OP_k)   min_c ‖c^T D_k‖_1   subject to   c ∈ I(S_k ⊥ ⊕_{i=k+1}^N S_i),  c^T q_k = 1.     (14)

We also define the index set L_0^k with cardinality n_0^k analogously to (7), where c_k* is the optimal point of (14).

Theorem 4.

Suppose that the data follows Data Model 1 with N > 2 and assume that D_k cannot follow Data Model 1 with N > 1. Assume that S_k has minimum innovation with respect to q_k in the set {S_i}_{i=k}^N. Define c_k* as the optimal point of (OP_k) in (14), with the corresponding quantities defined as in Theorem 2, and assume that q_k is a unit ℓ2-norm vector in ⊕_{i=k}^N S_i. If

(15)

where V_k is an orthonormal basis for S_k and P_k is an orthonormal basis for I(S_k ⊥ ⊕_{i=k+1}^N S_i), then the vector c_k*, which lies in I(S_k ⊥ ⊕_{i=k+1}^N S_i), is the optimal point of (IP_k) in (13), i.e., the subspace S_k is correctly identified.

The sufficient conditions provided in Theorem 4 reveal another intriguing property of iPursuit. Contrary to conventional wisdom, increasing the number of subspaces may improve the performance of iPursuit, for if the data points are well distributed in the subspaces, the LHS of (15) is more likely to dominate the RHS. In the appendix section, we further investigate the sufficient conditions (15) and simplify the LHS to the permeance statistic.

II-E Complexity analysis

In this section, we use the Alternating Direction Method of Multipliers (ADMM) [35] to develop an efficient algorithm for solving (IP). Define Q as an orthonormal basis for the column space of D, where r is the rank of D. Thus, with c = Q a, the optimization problem (3) is equivalent to minimizing ‖a^T Q^T D‖_1 over a subject to a^T Q^T q = 1. Define G = D^T Q and f = Q^T q. Hence, this optimization problem is equivalent to

(16)

with a regularization parameter μ. The Lagrangian function of (16) can be written as

(17)

where y_1 and y_2 are the Lagrange multipliers. The ADMM approach uses an iterative procedure. Define (a^k, z^k) as the optimization variables and (y_1^k, y_2^k) as the Lagrange multipliers at the k-th iteration. Define the element-wise soft-thresholding function T_ε(x) = sgn(x) max(|x| − ε, 0). Each iteration consists of the following steps:

1. Obtain a^{k+1} by minimizing the Lagrangian function with respect to a while the other variables are held constant. The optimal a is obtained in closed form by solving a small linear system of equations.

2. Similarly, update z^{k+1} by minimizing the Lagrangian with respect to z, which reduces to applying the element-wise soft-thresholding operator T.

3. Update the Lagrange multipliers via a gradient-ascent step on the dual variables.

These 3 steps are repeated until the algorithm converges or the number of iterations exceeds a predefined threshold. The complexity of the initialization step of the solver is plus the complexity of obtaining . Obtaining an appropriate has complexity by applying the clustering algorithm to a random subset of the rows of (with the rank of sampled rows equal to ). In addition, the complexity of each iteration of the solver is . Thus, the overall complexity is less than since the number of data points remaining keeps decreasing over the iterations. In most cases, , hence the overall complexity is roughly .
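For concreteness, the following sketch spells out one standard set of closed-form updates consistent with the three steps above, using the scaled form of ADMM with an auxiliary variable z for the residual G a (G = D^T Q, f = Q^T q as above). The explicit expressions and the penalty parameter mu are our reconstruction under these assumptions, not a verbatim transcription of the solver in the paper.

```python
import numpy as np

def ipursuit_admm(D, q, rank, mu=10.0, n_iter=200):
    """ADMM sketch for  min ||c^T D||_1  s.t.  c in span(D), c^T q = 1.

    With Q an orthonormal basis of span(D) and c = Q a, split the problem as
    min ||z||_1  s.t.  z = G a,  f^T a = 1,   where G = D^T Q, f = Q^T q.
    """
    U, s, _ = np.linalg.svd(D, full_matrices=False)
    Q = U[:, :rank]
    G, f = D.T @ Q, Q.T @ q
    N, r = G.shape
    a, z = np.zeros(r), np.zeros(N)
    u, v = np.zeros(N), 0.0                       # scaled dual variables
    # The a-update solves a fixed r x r linear system.
    M = G.T @ G + np.outer(f, f)
    soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
    for _ in range(n_iter):
        # Step 1: minimize the augmented Lagrangian over a (least squares).
        a = np.linalg.solve(M, G.T @ (z - u) + f * (1.0 - v))
        # Step 2: minimize over z (element-wise soft-thresholding).
        z = soft(G @ a + u, 1.0 / mu)
        # Step 3: gradient-ascent update of the (scaled) multipliers.
        u += G @ a - z
        v += f @ a - 1.0
    return Q @ a

# Usage: c = ipursuit_admm(D, q, rank=r); cluster by inspecting |c^T D|.
```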

iPursuit brings about substantial speedups over most existing algorithms due to the following: i) Unlike existing iterative algorithms (such as RANSAC) which have exponential complexity in the number and dimension of subspaces, the complexity of iPursuit is linear in the number of subspaces and quadratic in their dimension. In addition, while iPursuit has linear complexity in , spectral clustering based algorithms have complexity for their spectral clustering step plus the complexity of obtaining the similarity matrix, ii) More importantly, the solver of the proposed optimization problem has complexity per iteration, while the other operations – whose complexity are and – sit outside of the iterative solver. This feature makes the proposed method notably faster than most existing algorithms which solve high-dimensional optimization problems. For instance, solving the optimization problem of the SSC algorithm has roughly complexity per iteration [10].

II-F How to choose the vector q?

The previous analysis revealed that the coherency of the vector q with the innovation subspace is a key performance factor for iPursuit. While our investigations have shown that the proposed algorithm performs very well when the subspaces are independent, even when the vector q is chosen at random, randomly choosing the vector q may not be favorable when the dimension of the intersection of the subspaces is increased (cf. Section II-C). This motivates the methodology described next, which aims to identify a “good” choice of the vector q.

Consider the following least-squares optimization problem,

min_q ‖q^T D‖_2   subject to   q ∈ span(D),  ‖q‖ = 1.     (18)

The optimization problem (18) searches for a vector in span(D) that has a small projection on the columns of D. The optimal point of (18) has a closed-form solution, namely, the singular vector corresponding to the least non-zero singular value of D. When the subspaces are close to each other, the optimal point of (18) is very close to the innovation subspace I(S_2 ⊥ S_1). This is due to the fact that I(S_2 ⊥ S_1) is orthogonal to S_1, hence a vector in the innovation subspace will have a small projection on the data. As such, when the subspaces are close to each other, the least singular vector is coherent with the innovation subspace and can be a good candidate for the vector q. In the numerical results section, it is shown that this choice of q leads to substantial improvement in performance compared to using a randomly generated q. However, in settings in which the singular values of D decay rapidly and the data is noisy, we may not be able to obtain an exact estimate of the rank of the clean data. This may lead to the undesirable usage of a singular vector corresponding to noise as the constraint vector. In the next section, we investigate stability issues and present robust variants of the algorithm in the presence of noise. We remark that with real data, or when the data is noisy, by the least singular vector we refer to the least dominant singular vector and not to the one corresponding to the smallest singular value, which is surely associated with the noise component.

III Noisy data

In the presence of additive noise, we model the data as

D_e = D + E,     (19)

where D_e is the noisy data matrix, D the clean data which follows Data Model 1, and E the noise component. The rank of D is equal to r. Thus, the singular values of D_e can be divided into two subsets: the dominant singular values (the first r singular values) and the small singular values (or the singular values corresponding to the noise component). Consider the optimization problem (IP) using D_e, i.e.,

min_c ‖c^T D_e‖_1   subject to   c ∈ span(D_e),  c^T q = 1.     (20)

Clearly, the optimal point of (20) is very close to the subspace spanned by the singular vectors corresponding to the small singular values. Thus, if c* denotes the optimal solution of (20), then all the elements of (c*)^T D_e will be fairly small and we cannot distinguish the subspaces. However, the span of the dominant singular vectors is approximately equal to span(D). Accordingly, we propose the following approximation to (IP),

min_c ‖c^T D_e‖_1   subject to   c ∈ span(Q),  c^T q = 1,     (21)

where Q is an orthonormal basis for the span of the dominant singular vectors. The first constraint of (21) forces the optimal point to lie in span(Q), which serves as a good approximation to span(D). For instance, consider a noisy data matrix D_e = [D_1 D_2] + E, where the columns of D_1 lie in a 5-dimensional subspace S_1, and the columns of D_2 lie in another 5-dimensional subspace S_2. Define c_1* and c_2* as the optimal points of (20) and (21), respectively. Fig. 3 shows |(c_1*)^T D_e| and |(c_2*)^T D_e| with the maximum element scaled to one. Clearly, the latter can be used to correctly cluster the data. In addition, when D is low rank, the subspace constraint in (21) can filter out a remarkable portion of the noise component.

Fig. 3: The left plot shows the output of (20), while the right plot shows the output of iPursuit when its search domain is restricted to the subspace spanned by the dominant singular vectors as per (21).

When the data is noisy and the singular values of D decay rapidly, it may be hard to accurately estimate r. If the dimension is incorrectly estimated, Q may contain some singular vectors corresponding to the noise component, wherefore the optimal point of (21) could end up near a noise singular vector. In the sequel, we present two techniques to effectively avoid this undesirable scenario.

1. Using a data point as a constraint vector: A singular vector corresponding to the noise component is nearly orthogonal to the entire data, i.e., it has small projection on all the data points. Thus, if the optimal vector is forced to have a strong projection on a data point, it is unlikely to be close to a noise singular vector. Accordingly, we modify (21) as

min_c ‖c^T D_e‖_1   subject to   c ∈ span(Q),  c^T d_k = 1,     (22)

where d_k is the k-th column of D_e. The modified constraint in (22) ensures that the optimal point is not orthogonal to d_k. If d_k lies in the subspace S_i, the optimal point of (22) will lie in the innovation subspace corresponding to S_i whp. To determine a good data point for the constraint vector, we leverage the principle presented in Section II-F. Specifically, we use the data point closest to the least dominant singular vector rather than the least dominant singular vector itself.
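The selection rule just described can be sketched in a few lines: among the columns of the noisy data matrix, pick the one most correlated with the least dominant singular vector of the estimated signal subspace. The function name and the column normalization are illustrative.

```python
import numpy as np

def constraint_data_point(D_e, rank):
    """Pick the column of D_e closest (in absolute inner product) to the least
    dominant singular vector of the rank-'rank' signal subspace, to be used as
    the constraint vector in (22). Returns the column index and the column.
    """
    U, _, _ = np.linalg.svd(D_e, full_matrices=False)
    v = U[:, rank - 1]                              # least dominant signal direction
    cols = D_e / np.linalg.norm(D_e, axis=0)        # unit-norm columns
    k = int(np.argmax(np.abs(cols.T @ v)))
    return k, D_e[:, k]
```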

2. Sparse representation of the optimal point: When D is low rank, i.e., r is much smaller than the number of data points, any direction in the span of the data – including the optimal direction sought by iPursuit – can be represented as a sparse combination of the data points. For such settings, we can rewrite (22) as

(23)

where λ is a regularization parameter. Forcing a sparse representation in (23) for the optimal direction averts a solution that lies in close proximity to the small singular vectors, which are normally obtained through linear combinations of a large number of data points. Using the data points as constraint vectors effectively accords robustness to the imperfection in forming the basis matrix Q. Moreover, our investigations show that enforcing the sparse representation of the optimal direction can enhance the performance in some cases. The table of Algorithm 1 details the proposed method for noisy data along with the used notation and definitions.

Initialization Set and as integers greater than 1, and set and equal to positive real numbers less than 1.

While The number of identified subspaces is less than or the number of columns of is greater than .

1. Obtaining the basis for the remaining Data: Construct as the orthonormal matrix formed by the dominant singular vectors of .

2. Choosing the vector : Set the column of closest to the last column of .

3. Solve (23) and define , where is the optimal point and define .

4. Finding a basis for the identified subspace: Construct the matrix from the columns of corresponding to the elements of greater than . Define matrix as a orthonormal basis for the dominant left singular vectors of .

5. Finding a basis for the rest of the data: Define the vector whose entries are equal to the -norm of the columns of . Normalize as . Construct as the columns of corresponding to the elements of greater than . Define as an orthonormal basis for the columns of .

6. Find the data point belonging to the identified subspace: Assign to the identified subspace if .

7. Remove the data points belonging to the identified subspace: Update by removing the columns corresponding to the identified subspace.

End While

Algorithm 1 Innovation pursuit for noisy data (the iterative framework)
Remark 2.

The proposed method can be made parameter-free if we can avoid thresholding in Steps 4 and 5 of Algorithm 1. Indeed, in Step 4 we can construct the basis using the columns of D corresponding to the largest elements of the obtained vector |(c*)^T D|. The number of sampled columns has to be chosen large enough such that the sampled columns span the identified subspace, and hence it can be set if we have access to an upper bound on the dimension of the subspaces, which is naturally available in many applications. For example, in motion segmentation we know that the dimension of each subspace is at most 4. The matrix used in Step 5 can be constructed in a similar way, i.e., without thresholding.
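A sketch of the parameter-free variant of Step 4 suggested in Remark 2: instead of thresholding, keep the columns with the k_max largest values of |c^T d_i|, where k_max is an assumed upper bound on the subspace dimension; the helper below and its return of the complementary columns are illustrative, not the exact procedure of Algorithm 1.

```python
import numpy as np

def split_by_topk(D, c, k_max):
    """Parameter-free variant of Step 4 (cf. Remark 2): rank the columns by
    |c^T d_i| and take the k_max largest as the columns spanning the identified
    subspace. k_max is an assumed upper bound on the subspace dimension
    (e.g., 4 in motion segmentation). The remaining columns are returned as
    well (illustrative; Algorithm 1 forms the complementary basis in Step 5).
    """
    scores = np.abs(c @ D)                  # |c^T d_i| for every column d_i
    order = np.argsort(scores)[::-1]
    picked, rest = order[:k_max], order[k_max:]
    return D[:, picked], D[:, rest]
```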

III-A Minimizing error propagation

If the thresholds (or the number of sampled columns) in Algorithm 1 are chosen appropriately, the algorithm exhibits strong robustness in the presence of noise. Nonetheless, if the data is too noisy, an error incurred in one step of the algorithm may propagate and unfavorably affect the performance in subsequent steps. In the following, we discuss the two main sources of error and present some techniques to effectively neutralize their impact on subsequent iterations.

A.1 Some data points are erroneously included in the identified cluster: Suppose that S_k is the subspace to be identified in a given step of the algorithm, i.e., the optimal point of (23) lies in the innovation subspace corresponding to S_k. If the noise component is too strong, a few data points from the other subspaces may be erroneously included in the identified cluster. In this subsection, we present a technique to remove these erroneous data points (in [36], we analyze the proposed technique as a robust subspace recovery algorithm). Consider two columns d_i and d_j of the identified cluster, where d_i belongs to S_k and d_j to one of the other subspaces, and consider their inner products with the other columns of the cluster. Since the identified cluster contains many data points that are coherent with d_i, the aggregate inner product of d_i with the cluster is larger than that of d_j whp. Thus, by removing a portion of the columns with small inner products, the columns belonging to the other subspaces are likely to be removed. In addition, we obtain the basis of the identified subspace from the principal directions of the retained columns, which mitigates the impact of noise and erroneous data points. The same technique can be used to remove wrong data points from the complementary set. The table of Algorithm 2 presents the details of using the proposed idea in the fourth step of Algorithm 1 to remove the erroneous data points. Even with this additional step, the proposed method is remarkably faster than the state-of-the-art subspace clustering algorithms, since the complexity of solving the underlying optimization problem is linear in the number of data points. In Section V-D, we compare the run time of the proposed method to that of the state-of-the-art algorithms.

Initialization Set γ equal to an integer between 0 and 50 (a percentage).
4. Finding a basis for the identified subspace
4.1 Construct the matrix from the columns of corresponding to the elements of that are greater than , or using the columns of corresponding to the largest elements of .
4.2 Define R as the matrix of inner products between the columns of the constructed matrix. Remove γ percent of these columns, namely those corresponding to the columns of R with the smallest ℓ1-norms.
4.3 Define as an orthonormal basis for .

Algorithm 2 Fourth step of Algorithm 1 with a technique for removing erroneous data points
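A compact sketch of the screening idea used in Algorithm 2: columns wrongly pulled into the identified cluster tend to have weak inner products with the rest of the cluster, so the γ percent of columns whose column of the Gram matrix has the smallest ℓ1-norm are dropped before the basis is fit. Function and parameter names are illustrative.

```python
import numpy as np

def prune_cluster(D1_hat, gamma=20, dim=None):
    """Remove likely-erroneous columns from an identified cluster D1_hat and
    return an orthonormal basis for the cleaned cluster.

    gamma: percentage (0-50) of columns to drop, namely those whose column of
    the (absolute) Gram matrix has the smallest l1-norm, i.e., the columns
    least coherent with the rest of the cluster.
    """
    R = np.abs(D1_hat.T @ D1_hat)            # mutual inner products
    scores = R.sum(axis=0)                   # l1-norm of each column of R
    n_drop = int(np.floor(gamma / 100.0 * D1_hat.shape[1]))
    keep = np.argsort(scores)[n_drop:]       # discard the weakest gamma percent
    U, s, _ = np.linalg.svd(D1_hat[:, keep], full_matrices=False)
    r = dim if dim is not None else int(np.sum(s > 1e-6 * s[0]))
    return U[:, :r]                          # principal directions of the cleaned cluster
```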

A.2 Some of the data points remain unidentified: Suppose S_k is to be identified in a given iteration, yet not all the data points belonging to S_k are identified, i.e., some of these points remain unidentified. In this case, an error may occur if one such point is used for the constraint vector. However, such an error can be easily detected because, if one such point is used as the constraint vector, the resulting vector of inner products would be too sparse, since the optimal direction is orthogonal to all the data points except the few remaining points of S_k. As an example, consider the setting where the data follows Data Model 1 with two clusters, one containing 100 data points and the other only 6 data points. Fig. 4 shows the output of (22) when a column of the large cluster and a column of the small cluster are used as the constraint vector. The right plot shows a solution that is too sparse. Accordingly, if the output of (22) is too sparse, we solve (22) again using a new constraint vector.

Fig. 4: The output of the proposed method with different choices of the constraint vector. In the left and right plots, the first column and a column of the small cluster are used as the constraint vector, respectively. The first column lies in a cluster with 100 data points and the other column lies in a cluster with 6 data points.
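The too-sparse check described in A.2 can be implemented as a simple guard around the direction search: if the returned direction has non-negligible inner products with only a tiny fraction of the remaining columns, the constraint vector is replaced and (22) is solved again. The thresholds below are illustrative choices, not values specified in the paper.

```python
import numpy as np

def is_too_sparse(c, D, frac=0.05, tol=1e-3):
    """Flag a direction whose inner products c^T d_i are non-negligible for only
    a tiny fraction of the remaining columns, which indicates the constraint
    vector came from a cluster that is already almost exhausted (case A.2).
    """
    h = np.abs(c @ D)
    active = np.sum(h > tol * h.max())
    return active < frac * D.shape[1]
```

If the flag is raised, the direction search is repeated with the next-best candidate constraint column.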

A.3 Error correction: The robustness of spectral clustering based algorithms stems from the fact that the spectral clustering step considers the representation of each of the data points, so a few errors in the similarity matrix do not significantly impact the overall performance of the algorithm. The proposed multi-step algorithm is of low complexity; however, if a data point is assigned to an incorrect cluster in a given step, its assignment will not change in the subsequent steps of the algorithm, which only consider the remaining data points. Thus, for scenarios where the data is highly noisy, we propose a final error correction technique to account for and correct erroneous assignments of data points to clusters. The table of Algorithm 3 presents the proposed technique, which is applied to the clustered data after Algorithm 1 terminates to minimize the clustering error. It uses the idea presented in Algorithm 2 to obtain a set of bases for the subspaces, with respect to which the clustering is subsequently updated. In fact, Algorithm 3 can be applied multiple times, each time updating the clustering.

III-B Using multiple constraint vectors

In each step of the proposed algorithm, we could solve (23) with multiple choices of the constraint vector and pick the one leading to the most favorable solution. For instance, we can find the data points that are nearest neighbors to the least dominant singular vector, or the data points that are closest to the few least dominant singular vectors, and solve (22) with all these choices of the constraint vector. The question remains as to how to identify the best choice. Ideally, one would favor the constraint vector that minimizes the clustering error. Note that each step of the proposed algorithm clusters the data points into two subspaces. When the clustering error increases, the distance between the identified subspaces decreases. To clarify, consider D_1 and D_2 spanning subspaces S_1 and S_2, respectively. A subspace clustering algorithm clusters the data matrix into D̂_1 and D̂_2. If the number of data points belonging to D_2 in D̂_1 increases, the identified subspace corresponding to D̂_1 gets closer to S_2. As such, we choose the constraint vector which leads to the maximum distance between the identified subspaces. The distance between two subspaces can be measured using their orthonormal bases. For instance, ‖U_1^T U_2‖_F, where U_1 and U_2 are orthonormal bases for the identified subspaces, can be used as such a measure [32, 14], as it is inversely proportional to the distance between the subspaces.
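The selection rule above can be sketched as follows: run the two-way split for each candidate constraint vector, fit orthonormal bases to the two resulting clusters, and keep the candidate whose bases give the smallest ‖U_1^T U_2‖_F (i.e., the largest subspace distance). The helper split_and_fit is a stand-in for one clustering step supplied by the caller, not a function defined in the paper.

```python
import numpy as np

def best_constraint_vector(D, candidates, split_and_fit):
    """Among candidate constraint vectors, pick the one whose induced two-way
    clustering yields the most separated subspaces.

    split_and_fit(D, q) is assumed to return orthonormal bases (U1, U2) of the
    two identified subspaces; ||U1^T U2||_F is small when the subspaces are far
    apart, so we keep the candidate minimizing it.
    """
    best_q, best_score = None, np.inf
    for q in candidates:
        U1, U2 = split_and_fit(D, q)
        score = np.linalg.norm(U1.T @ U2, 'fro')   # inversely related to distance
        if score < best_score:
            best_q, best_score = q, score
    return best_q
```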

Define the matrices as the clustered data.

Error Correction
1 For

1.1 Define R_i as the matrix of inner products between the columns of the i-th cluster. Remove γ percent of the columns of the cluster, namely those corresponding to the columns of R_i with the smallest ℓ1-norms.

1.2 Obtain Û_i as an orthonormal basis for the column space of the retained columns.
End For

2 Update the data clustering with respect to the obtained bases (the clustered matrices are updated), i.e., a data point d is assigned to the i-th cluster if ‖Û_i^T d‖ is the largest among all the obtained bases.

Algorithm 3 Final error correction
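A sketch of the reassignment rule in Algorithm 3: fit an orthonormal basis to each cluster and reassign every data point to the subspace on which it has the largest projection ‖Û_i^T d‖. The column-pruning step of Algorithm 3 (dropping the γ percent least coherent columns, as in the sketch after Algorithm 2) is omitted here for brevity, and the assumed per-cluster dimensions are supplied by the caller.

```python
import numpy as np

def error_correction(clusters, dims):
    """One pass of final error correction (cf. Algorithm 3): fit an orthonormal
    basis of dimension dims[i] to each cluster, then reassign every data point
    to the subspace on which it has the largest projection ||U_i^T d||.
    Returns the updated clustering as a list of column-index arrays into the
    concatenated data matrix.
    """
    bases = []
    for Di, r in zip(clusters, dims):
        U, _, _ = np.linalg.svd(Di, full_matrices=False)
        bases.append(U[:, :r])
    D = np.concatenate(clusters, axis=1)
    proj = np.stack([np.linalg.norm(U.T @ D, axis=0) for U in bases])
    labels = np.argmax(proj, axis=0)
    return [np.flatnonzero(labels == i) for i in range(len(clusters))]
```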

IV Innovation pursuit with spectral clustering

In Section II, we showed analytically that the direction of innovation is a powerful tool for distinguishing the subspaces. The presented iterative algorithm, Algorithm 1, uses the directions of innovation to consecutively identify the subspaces. Algorithm 1 is provable, fast, and strongly robust to intersections between the subspaces, but it has some limitations. In this section, we first discuss the limitations of the proposed iterative approach. Then, we show that the proposed innovation search method can be integrated with spectral clustering to yield a new spectral-clustering-based subspace segmentation algorithm that does not share the limitations of Algorithm 1.

IV-A Limitations of Algorithm 1

Fig. 5: The data lies in a union of 4 random 5-dimensional subspaces. In the left plot, the ambient dimension is large enough for every subspace to carry innovation relative to the others; in the right plot, it is not. For both plots, the first column is used as a constraint vector.

IV-A1 Innovation

The main requirement of Algorithm 1 is that no subspace should lie in the direct sum of the other subspaces. In other words, the dimension of the innovation subspace corresponding to each subspace should be at least 1. Thus, the number of subspaces cannot be larger than the ambient dimension. More specifically, suppose the data lies in a union of N d-dimensional subspaces and that the dimension of the innovation subspace corresponding to each subspace is equal to d. In this case, the dimension of the ambient space should be greater than or equal to Nd. As an example, assume the data points lie in a union of 4 random 5-dimensional subspaces, each with 100 data points. Fig. 5 shows the output of the innovation search with the first column used as the constraint vector. The left and right plots correspond to a larger and a smaller ambient dimension, respectively. When the ambient dimension is at least 20, the dimension of the innovation subspace corresponding to each subspace is equal to 5 whp. But, when the ambient dimension is smaller, not every subspace carries innovation relative to the other subspaces. In the left plot, the non-zero elements correspond to only one subspace, and also the span of the data points corresponding to the zero elements does not contain the span of those corresponding to the non-zero elements. Thus, Algorithm 1 can successfully identify and isolate the first subspace. By contrast, both of these conditions are violated in the scenario of the right plot.

IV-A2 Sub-clusters with innovation

According to the second condition of Lemma 1, the data points in a cluster should not lie in a union of lower dimensional subspaces each with innovation w.r.t. the other ones. For instance, assume N = 2 and D_1 = [D_{11} D_{12}], where the data points in D_{11}, D_{12}, and D_2 span three independent 1-dimensional subspaces. In this case, Algorithm 1 will identify three subspaces, since the column spaces of D_{11} and D_{12} each have innovation relative to the other subspaces. If it is sensible to cluster the data into three clusters, then the algorithm will yield correct clustering in that sense. However, if it only makes sense to cluster the data into exactly two clusters, then the algorithm may not cluster the data correctly. For example, one possible output would be