Efficient Online Minimization for Low-Rank Subspace Clustering

03/28/2015 · Jie Shen et al., Rutgers University

Low-rank representation (LRR) has been a significant method for segmenting data that are generated from a union of subspaces. It is, however, known that solving the LRR program is challenging in terms of time complexity and memory footprint, in that the size of the nuclear norm regularized matrix is n-by-n (where n is the number of samples). In this paper, we therefore develop a fast online implementation of LRR that reduces the memory cost from O(n^2) to O(pd), with p being the ambient dimension and d being some estimated rank (d < p ≪ n). The crux of this approach is a non-convex reformulation of the LRR program, which pursues the basis dictionary that generates the (uncorrupted) observations. We establish the theoretical guarantee that the sequence of solutions produced by our algorithm converges asymptotically to a stationary point of the empirical and the expected loss functions. Extensive experiments on synthetic and realistic datasets further substantiate that our algorithm is fast, robust and memory efficient.


1 Introduction

In the past few years, subspace clustering [Vid10, SC12] has been extensively studied and has found solid applications, for example, in computer vision [EV09] and network topology inference [EBN11]. Among the many subspace clustering algorithms that aim to obtain a structured representation fitting the underlying data, two prominent examples are Sparse Subspace Clustering (SSC) [EV09, SEC14] and Low-Rank Representation (LRR) [LLY13]. Both utilize the idea of self-expressiveness, i.e., expressing each sample as a linear combination of the remaining samples. The difference is that SSC pursues a sparse solution while LRR prefers a low-rank structure.

In this paper, we are interested in the LRR method, which is shown to achieve state-of-the-art performance on a broad range of real-world problems [LLY13]. Recently, [LL14] demonstrated that, when equipped with a proper dictionary, LRR can even handle coherent data, a challenging issue in the literature [CR09, CLMW11] that commonly emerges in realistic datasets such as Netflix.

Formally, the LRR problem we investigate here is formulated as follows [LLY13]:

min_{X,E}  (1/2)‖Z − YX − E‖_F^2 + λ1‖X‖_* + λ2‖E‖_1.    (1.1)

Here, Z ∈ R^{p×n} is the observation matrix, with n samples lying in the p-dimensional ambient space. The matrix Y ∈ R^{p×n} is a given dictionary, E models possible sparse corruption, and λ1 and λ2 are two tunable parameters. Typically, Y is chosen as the dataset Z itself. The program seeks a low-rank representation X among all samples, each of which can be approximated by a linear combination of the atoms of the dictionary Y.

While LRR is mathematically elegant, three issues immediately arise in the face of big data:

Issue 1 (Memory cost of X).

In the LRR formulation (1.1), there is typically no sparsity assumption on X. Hence, the memory footprint of X is proportional to n^2, which precludes most of the recently developed nuclear norm solvers [LCM10, JS10, AKKS12, HO14].

Issue 2 (Computational cost of X).

Since the size of the nuclear norm regularized matrix is n × n, optimizing such problems can be computationally expensive even when n is not too large [RFP10].

Issue 3 (Memory cost of Y).

Since the dictionary Y is of size p × n, it is prohibitive to store the entire dictionary during optimization when manipulating a huge volume of data.

To remedy these issues, especially the memory bottleneck, one potential way is to solve the problem in an online manner. That is, we sequentially reveal the samples and update the components of X and E. Nevertheless, such a strategy appears difficult to execute due to the residual term Z − YX − E in (1.1). To be more precise, note that each column of X contains the coefficients of one sample with respect to the entire dictionary Y. This indicates that, without further techniques, we have to load the entire dictionary in order to update the columns of X. Hence, for our purpose, we need to tackle a more serious challenge:

Issue 4 (Partial realization of Y).

We are required to guarantee the optimality of the solution while only being able to access part of the atoms of Y in each iteration.

1.1 Related Works

There is a vast body of work attempting to mitigate the memory and computational bottleneck of the nuclear norm regularizer. However, to the best of our knowledge, none of it can handle Issue 3 and Issue 4 in the LRR problem.

One of the most popular ways to alleviate the huge memory cost is online implementation. [FXY13] devised an online algorithm for the Robust Principal Component Analysis (RPCA) problem, which makes the memory cost independent of the sample size. Yet, compared to RPCA, where the size of the nuclear norm regularized matrix is p × n, that of LRR is n × n, a worse and more challenging case. Moreover, their algorithm cannot address the partial dictionary issue that emerges in our case. It is also worth mentioning that [QVLH14] established another online variant of RPCA. But since we are dealing with a different problem setting, i.e., the multiple subspaces regime, it is not clear how to extend their method to LRR.

To tackle the computational overhead, [CCS10] considered the singular value thresholding technique. However, it is not scalable to large problems since it calls singular value decomposition (SVD) in each iteration. [JS10] utilized a sparse semi-definite programming solver to derive a simple yet efficient algorithm. Unfortunately, the memory requirement of their algorithm is proportional to the number of observed entries, making it impractical when the regularized matrix is large and dense (which is the case for LRR). [AKKS12] combined stochastic subgradient and incremental SVD to boost efficiency. But for the LRR problem, the type of loss function does not meet their requirements and thus it is still not practical to use that algorithm in our case.

Another line in the literature explores structured formulations of LRR beyond low-rankness. For example, [WXL13] provably showed that combining LRR with SSC can take advantage of both methods. [LL14] demonstrated that LRR is able to cope with the intrinsic group structure of the data. Very recently, [SL16] argued that the vanilla LRR program does not fully characterize the nature of multiple subspaces, and presented several effective alternatives to LRR.

1.2 Summary of Contributions

In this paper, we propose a new algorithm called Online Low-Rank Subspace Clustering (OLRSC), which admits a low computational complexity. In contrast to existing solvers, OLRSC reduces the memory cost of LRR from O(n^2) to O(pd) (d < p ≪ n). This nice property makes OLRSC an appealing solution for large-scale subspace clustering problems. Furthermore, we prove that the sequence of solutions produced by OLRSC converges asymptotically to a stationary point of the expected loss function, even though only one atom of Y is available at each iteration. In a nutshell, OLRSC resolves all the practical issues of LRR and still promotes a global low-rank structure, the merit of LRR.

1.3 Roadmap

The paper is organized as follows. In Section 2, we reformulate the LRR program (1.1) in a way which is amenable for online optimization. Section 3 presents the algorithm that incrementally minimizes a surrogate function to the empirical loss. Along with that, we establish a theoretical guarantee in Section 4. The experimental study in Section 5 confirms the efficacy and efficiency of our proposed algorithm. Finally, we conclude the work in Section 6 and the lengthy proof is deferred to the appendix.

Notation.  We use bold lowercase letters, e.g., v, to denote a column vector. The ℓ1 norm and ℓ2 norm of a vector v are denoted by ‖v‖_1 and ‖v‖_2 respectively. Bold capital letters such as M are used to denote a matrix, and its transpose is denoted by M^⊤. For an invertible matrix M, we write its inverse as M^{-1}. The capital letter I is reserved for the identity matrix, where a subscript indicates its size. The i-th column of a matrix M is denoted by m_i if not otherwise specified. Three matrix norms will be used: ‖M‖_* for the nuclear norm, i.e., the sum of the singular values, ‖M‖_F for the Frobenius norm and ‖M‖_1 for the ℓ1 norm of a matrix seen as a long vector. The trace of a square matrix is denoted by Tr(M). For an integer n > 0, we use [n] to denote the integer set {1, 2, …, n}.

2 Problem Formulation

Our goal is to efficiently learn the representation matrix and the corruption matrix in an online manner so as to mitigate the issues mentioned in Section 1. The first technique for our purpose is a non-convex reformulation of the nuclear norm. Assume that the rank of the global optimum of (1.1) is at most d. Then a standard result in the literature (see, e.g., [FHB01]) shows that

‖X‖_* = min_{U,V: X = UV^⊤}  (1/2)(‖U‖_F^2 + ‖V‖_F^2),    (2.1)

where U ∈ R^{n×d} and V ∈ R^{n×d}. The minimum can be attained at, for example, U = U_X Σ_X^{1/2} and V = V_X Σ_X^{1/2}, where X = U_X Σ_X V_X^⊤ is the singular value decomposition of X.
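To make (2.1) concrete, the following small NumPy check (our own illustration, not part of the paper) compares the nuclear norm of a random low-rank matrix with the value attained by the SVD-based factorization described above.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 5
    X = rng.standard_normal((n, d)) @ rng.standard_normal((d, n))   # rank-d matrix

    # Nuclear norm: sum of singular values.
    nuclear = np.linalg.svd(X, compute_uv=False).sum()

    # Factorization attaining the minimum in (2.1): U = U_X Sigma^{1/2}, V = V_X Sigma^{1/2}.
    U_X, S, Vt_X = np.linalg.svd(X, full_matrices=False)
    U = U_X[:, :d] * np.sqrt(S[:d])
    V = Vt_X[:d, :].T * np.sqrt(S[:d])

    value = 0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2)
    assert np.allclose(X, U @ V.T)      # X = U V^T
    assert np.isclose(nuclear, value)   # ||X||_* = (||U||_F^2 + ||V||_F^2) / 2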

In this way, (1.1) can be written as follows:

min_{U,V,E}  (1/2)‖Z − YUV^⊤ − E‖_F^2 + (λ1/2)(‖U‖_F^2 + ‖V‖_F^2) + λ2‖E‖_1.    (2.2)

Note that by this reformulation, updating the entries of X amounts to sequentially updating the rows of U and V. Also note that this technique is utilized in [FXY13] for online RPCA. Unfortunately, the sizes of U and V in our problem are both proportional to n, and the dictionary Y is only partially observed in each iteration, making the algorithm in [FXY13] not applicable to LRR. Related to online implementation, another challenge is that all the rows of U are coupled together at this point, as U is left-multiplied by Y in the first term. This makes it difficult to sequentially update the rows of U.

For the sake of decoupling the rows of U, and as part of the crux of our technique, we introduce an auxiliary variable D = YU, whose size is p × d (i.e., independent of the sample size n). Interestingly, in this way we are approximating the uncorrupted data Z − E with DV^⊤, which provides an intuition on the role of D: namely, D can be seen as a basis dictionary of the clean data, with V being the coefficients.

These key observations allow us to derive an equivalent reformulation of LRR (1.1):

min_{D,U,V,E}  (1/2)‖Z − DV^⊤ − E‖_F^2 + (λ1/2)(‖U‖_F^2 + ‖V‖_F^2) + λ2‖E‖_1,   s.t.  D = YU.    (2.3)

By penalizing the constraint in the objective, we obtain a regularized version of LRR on which our algorithm is based:

min_{D,U,V,E}  (1/2)‖Z − DV^⊤ − E‖_F^2 + (λ1/2)(‖U‖_F^2 + ‖V‖_F^2) + λ2‖E‖_1 + (λ3/2)‖D − YU‖_F^2.    (2.4)
Remark 1 (Superiority to LRR).

There are two advantages of (2.4) compared to (1.1). First, it is amenable to online optimization. Second, it is more informative since it explicitly models the basis of the union of subspaces, hence yields better subspace recovery and clustering (see Section 5). This actually meets the core idea of [LL14], but they assumed that Y contains the true subspaces whereas we learn the true subspaces.

Remark 2 (Parameter λ3).

Note that λ3 may be gradually increased until some maximum value is attained so as to enforce the equality constraint. In this way, (2.4) attains the same minimum as (1.1). Actually, the choice of λ3 depends on how much information Y brings for the subspace basis. As mentioned above, D is the basis dictionary of the clean data and is in turn approximated by (or equal to) YU. This suggests that the range of D is a subset of that of Y. Since a typical choice of Y is the dataset Z itself, if Z is only slightly corrupted, we would like to pick a large quantity for λ3.

Remark 3 (Connection to RPCA).

Due to our explicit modeling of the basis, we unify LRR and RPCA as follows: for LRR, D ≈ YU (or D = YU if λ3 tends to infinity), while for RPCA, D itself plays the role of the basis of the low-rank component. That is, ORPCA [FXY13] considers a problem on D whose size is independent of n; hence D can be kept in memory, which naturally resolves Issues 3 and 4. This is why RPCA can be easily implemented in an online fashion while LRR cannot.

Remark 4 (Connection to Dictionary Learning).

Generally speaking, LRR (1.1) can be seen as a coding algorithm, with the dictionary Y known in advance and X being a desired structured code, while other popular algorithms such as dictionary learning (DL) [MBPS10] simultaneously optimize the dictionary and the sparse code. Interestingly, in view of (2.4), the link between LRR and DL becomes clearer in the sense that the difference lies in how the dictionary is constrained. That is, for LRR we have D = YU and D is further regularized by the Frobenius norm, whereas for DL a norm constraint is imposed on each column of the dictionary.

Let z_i, e_i and y_i be the i-th columns of Z, E and Y respectively, let v_i and u_i denote the i-th rows of V and U (written as column vectors), and define

(2.5)
(2.6)

In addition, let

(2.7)
(2.8)

Then (2.4) can be rewritten as:

(2.9)

which amounts to minimizing the empirical loss function:

(2.10)

In stochastic optimization, we are interested in analyzing the optimality of the obtained solution with respect to the expected loss function. To this end, we first derive the optimal solutions V, E and U that minimize (2.9), which renders a concrete form of the empirical loss function, hence allowing us to derive the expected loss.

Given D, we need to compute the optimal solutions V, E and U to evaluate the objective value. What is of interest here is that the optimization procedure for U is totally different from that for V and E. According to (2.6), when D is given, each v_i and e_i can be solved for by accessing only the i-th sample z_i. However, the optimal U depends on the whole dictionary Y, as the second term of the objective couples all the u_i's. Fortunately, it is possible to obtain a closed form solution for U, which simplifies our analysis. To be more precise, the first order optimality condition for (2.8) gives

(2.11)

which implies

(2.12)

Likewise, another component in (2.7) can be derived as follows:

(2.13)

where we denote

(2.14)

Recall that u_i is the i-th row of U. So for each u_i, we immediately have

(2.15)

Plugging these optimal solutions back in gives

(2.16)

Now we derive the expected loss function, which is defined as the limit of the empirical loss function when n tends to infinity. If we assume that all the samples are drawn independently and identically from some (unknown) distribution, we have

(2.17)

If we further assume that the smallest singular value of the matrix involved is bounded away from zero (which implies that it is invertible and that its spectrum is bounded from above), we have

(2.18)

Here is some absolute constant since is fixed and ’s are bounded. Hence, it follows that

(2.19)

Finally, the expected loss function is given by

(2.20)

3 Algorithm

Our OLRSC algorithm is summarized in Algorithm 1. Recall that OLRSC is an online implementation for solving (2.10), which is derived from the regularized version of LRR (2.4). The main idea is to optimize the variables in an alternating manner. That is, at the t-th iteration, assuming the basis dictionary is given, we compute the optimal solutions v_t and e_t by minimizing the objective function over v and e. For u_t, we need a more carefully designed paradigm, since a direct optimization involves loading the full dictionary (see (2.15)). We will elaborate on the details later. Subsequently, we update the basis dictionary by optimizing a surrogate function to the empirical loss. In our algorithm, we need to maintain three additional accumulation matrices whose sizes are independent of n.

0:   Observed samples {z_1, …, z_n}, dictionary Y, parameters λ1, λ2 and λ3, initial basis D_0, zero accumulation matrices.
0:  Optimal basis D_n.
1:  for t = 1 to n do
2:     Access the t-th sample z_t and the t-th atom y_t.
3:     Compute the coefficient and noise:
4:     Update the accumulation matrices:
5:     Update the basis dictionary:
6:  end for
Algorithm 1 Online Low-Rank Subspace Clustering

Solving v and e.  We observe that if e is fixed, we can optimize v in closed form:

v = (D^⊤D + λ1 I)^{-1} D^⊤ (z − e).    (3.1)

Conversely, given v, the variable e is obtained via soft-thresholding [Don95]:

e = S_{λ2}[z − Dv],    (3.2)

where S_θ[·] denotes the element-wise shrinkage operator S_θ[x] = sign(x) · max(|x| − θ, 0).

Thus, we utilize a block coordinate minimization algorithm that alternates between updating v and e.
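As a concrete illustration of this block coordinate scheme, here is a minimal NumPy sketch assuming the per-sample objective (1/2)‖z − Dv − e‖^2 + (λ1/2)‖v‖^2 + λ2‖e‖_1; the function names and the stopping rule are ours, not the paper's.

    import numpy as np

    def soft_threshold(x, thresh):
        """Element-wise soft-thresholding operator S_thresh[x]."""
        return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

    def solve_v_e(z, D, lam1, lam2, n_iters=50, tol=1e-7):
        """Alternate the closed-form v-update (3.1) and the shrinkage e-update (3.2)."""
        p, d = D.shape
        v, e = np.zeros(d), np.zeros(p)
        G = D.T @ D + lam1 * np.eye(d)      # fixed (d x d) system matrix
        for _ in range(n_iters):
            v_new = np.linalg.solve(G, D.T @ (z - e))      # minimize over v
            e_new = soft_threshold(z - D @ v_new, lam2)    # minimize over e
            if np.linalg.norm(v_new - v) + np.linalg.norm(e_new - e) < tol:
                v, e = v_new, e_new
                break
            v, e = v_new, e_new
        return v, e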

Solving u.  The closed form solution (2.15) tells us that it is impossible to derive an accurate estimate of u_i without the entire dictionary Y. Thus, we have to “approximately” solve for it during the online optimization procedure.¹ (¹Here, “accurately” and “approximately” refer to whether, given only the information available at the current iteration, we can obtain the same solution as for the batch problem (2.10).)

Our carefully designed approximate process for solving (2.7) is motivated by applying the coordinate minimization method to U. As a convention, such a method starts with the initial guess u_i = 0 for all i and updates the u_i's in cyclic order, i.e., u_1, u_2, …, u_n, u_1, u_2, …. Let us consider the first pass, where we have already updated u_1, …, u_{t-1} and are to optimize over u_t for some t. Note that since the initial values are zero, u_{t+1} = … = u_n = 0. Thereby, the optimal u_t is actually given by minimizing the following function:

(3.3)

where

(3.4)

We easily obtain the closed form solution to (3.3) as follows:

(3.5)

Now let us turn to the alternating minimization algorithm, where D is updated iteratively rather than fixed as in (3.5). The above coordinate minimization process can be adjusted to this scenario as we did in Algorithm 1. That is, given D_{t-1}, after revealing a new atom y_t, we compute u_t by minimizing (3.3), followed by updating D_t. In this way, when the algorithm terminates, we in essence run a one-pass update on the u_i's with a simultaneous computation of the new basis dictionary.

Solving D.  As soon as the past filtration is available, we can compute a new iterate D_t by optimizing the surrogate function

(3.6)

Expanding the first term, we find that D_t is given by

(3.7)

where A_t and B_t are accumulation matrices built from the past information. We point out that the size of A_t is d × d and that of B_t is p × d, i.e., independent of the sample size. In practice, as [MBPS10] suggested, one may apply a block coordinate descent approach to minimize (3.6) over D. Compared to the closed form solution given above, such an algorithm usually converges very fast after revealing a sufficient number of samples. In fact, we observe that a one-pass update on the columns of D suffices to ensure favorable performance. See Algorithm 2 and the sketch that follows it.

0:   D from the previous iteration, accumulation matrices, and parameters.
0:  Optimal D (updated).
1:  Denote and .
2:  repeat
3:     for j = 1 to d do
4:        Update the j-th column of D:
5:     end for
6:  until convergence
Algorithm 2 Solving D
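The sketch below illustrates the one-pass column-wise update performed by Algorithm 2. It assumes that the D-dependent part of the surrogate is the quadratic (1/2)Tr(D^⊤D Ā) − Tr(D^⊤ B̄) with Ā = A_t + λ3 I and B̄ = B_t + λ3 M_t, where A_t accumulates v_i v_i^⊤, B_t accumulates (z_i − e_i) v_i^⊤ and M_t accumulates y_i u_i^⊤; the third matrix and the exact constants are our own reading of (2.4) and (3.6), not notation taken from the paper.

    import numpy as np

    def update_basis(D, A, B, M, lam3, n_passes=1):
        """One-pass block coordinate update of the basis D (sketch).

        Assumed objective (up to additive constants):
            0.5 * tr(D^T D (A + lam3*I)) - tr(D^T (B + lam3*M)).
        Requires lam3 > 0 (or enough accumulated data) so the diagonal is positive.
        """
        D = D.copy()
        d = D.shape[1]
        A_bar = A + lam3 * np.eye(d)
        B_bar = B + lam3 * M
        for _ in range(n_passes):
            for j in range(d):
                # Minimize over the j-th column with all other columns held fixed.
                D[:, j] += (B_bar[:, j] - D @ A_bar[:, j]) / A_bar[j, j]
        return D

    # Closed-form alternative (the exact minimizer of the same quadratic):
    #   D = (B + lam3 * M) @ np.linalg.inv(A + lam3 * np.eye(d))

One pass over the d columns costs O(pd^2) and touches only the cached accumulation matrices, which is what keeps the per-iteration cost independent of n.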

Memory Cost.  It is remarkable that the memory cost of Algorithm 1 is O(pd). To see this, note that when solving v_t and e_t, we load the basis D and a sample z_t into memory, which costs O(pd). To compute the optimal u_t, we need to access D and the atom y_t. Although we aim to minimize (3.6), which seems to require all the past information, we actually only need to record the accumulation matrices, whose sizes are at most p × d (since d < p).

Computational Efficiency.  In addition to memory efficiency, we further clarify that the computation in each iteration is cheap. To compute v_t and e_t, one may utilize the block coordinate method in [RT14], which enjoys linear convergence due to strong convexity. One may also apply stochastic variance reduced algorithms, which likewise ensure a geometric rate of convergence [XZ14, DBL14]. The update of u_t is given by simple matrix-vector multiplications, which cost O(pd). It is easy to see that the update of the accumulation matrices is O(pd) and that of D_t is O(pd^2).

A Fully Online Scheme.  Now we have provided a way to (approximately) optimize the LRR problem (1.1) in an online fashion. Usually, researchers in the literature take an optional post-processing step to refine the segmentation accuracy, for example, applying spectral clustering on the representation matrix X. In this case, one has to collect all the u_i's and v_i's to compute X = UV^⊤, which again increases the memory cost to O(n^2). Here, we suggest an alternative scheme whose memory usage depends only on p, d and the number of subspaces k. The idea is utilizing the well-known k-means on the v_i's. There are two notable advantages compared to spectral clustering. First, updating the k-means model can be implemented in an online manner and the computation is O(kd) for each iteration. Second, we observe that v_i is actually a robust feature for the i-th sample. Combining online k-means with Algorithm 1, we obtain a fully online and efficient subspace clustering scheme whose memory cost is O(pd). For the reader's convenience, we summarize this pipeline in Algorithm 3.

0:   Observed samples {z_1, …, z_n}, dictionary Y, parameters λ1, λ2 and λ3, initial basis D_0, zero accumulation matrices, number of clusters k, initial centroids.
0:  Optimal basis D_n, cluster centroids, cluster assignments.
1:  Initialize .
2:  for t = 1 to n do
3:     Access the t-th sample z_t and the t-th atom y_t.
4:     Compute v_t, e_t, u_t and D_t by Algorithm 1.
5:     Compute the cluster assignment of v_t.
6:     Update the corresponding cluster center:
7:  end for
Algorithm 3 Fully Online Pipeline for Low-Rank Subspace Clustering
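As an illustration of the k-means component of Algorithm 3, here is a minimal sequential k-means update on the features v_i; the random initialization and the 1/n_j step size are our own choices, not prescribed by the paper.

    import numpy as np

    class OnlineKMeans:
        """Sequential k-means on the d-dimensional features v_i (sketch)."""

        def __init__(self, k, d, seed=0):
            rng = np.random.default_rng(seed)
            self.centers = rng.standard_normal((k, d))   # k centroids
            self.counts = np.zeros(k, dtype=int)         # samples seen per cluster

        def update(self, v):
            """Assign v to the nearest centroid and move that centroid toward v."""
            j = int(np.argmin(np.linalg.norm(self.centers - v, axis=1)))
            self.counts[j] += 1
            eta = 1.0 / self.counts[j]                   # standard 1/n_j step size
            self.centers[j] += eta * (v - self.centers[j])
            return j                                     # cluster assignment of v

Inside the loop of Algorithm 3, one would call update(v_t) right after v_t is computed, so that both the basis and the cluster model are refreshed while the memory footprint stays independent of n.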

An Accurate Online Implementation.  Our strategy for solving u_t is based on an approximate routine which resolves Issue 4 and has a low complexity. Yet, to tackle Issue 4, another potential way is to avoid the variable U altogether.² (²We would like to thank the anonymous NIPS 2015 reviewer for pointing out this potential solution for the online algorithm. Here we explain why this alternative can be computationally expensive.) Recall that the optimal solution for U (provided that D is given) to (2.4) is available in closed form. Plugging it back into (2.4), we obtain

where

Here, the relevant quantity was given in (2.14), and its size is independent of the sample size. Hence, if we incrementally compute the corresponding accumulation matrix, we can update the variable D in an online fashion. Namely, at the t-th iteration, we re-define the surrogate function as follows:

Again, by noting that this surrogate only involves recording accumulation matrices of fixed size, we see that the memory cost is independent of the sample size.

While this procedure is promising, since it avoids the approximate computation, its main shortcoming is that it computes a matrix inverse in each iteration and is hence not efficient. Moreover, as we will show in Theorem 1, although the u_t's are approximate solutions, we are still guaranteed the convergence of D_t.

4 Theoretical Analysis

We make three assumptions underlying our analysis.

Assumption 1.

The observed data are generated i.i.d. from some distribution, and there exist constants such that the boundedness conditions on the data hold almost surely.

Assumption 2.

The smallest singular value of the matrix is lower bounded away from zero.

Assumption 3.

The surrogate functions are strongly convex for all t.

Based on these assumptions, we establish the main theoretical result, justifying the validity of Algorithm 1.

Theorem 1.

Assume Assumptions 1, 2 and 3 hold. Let {D_t} be the sequence of optimal bases produced by Algorithm 1. Then, the sequence converges to a stationary point of the expected loss function as t goes to infinity.

Note that since the reformulation of the nuclear norm (2.1) is non-convex, we can only guarantee that the solution is a stationary point in general [Ber99]. We also remark that OLRSC asymptotically fulfills the first order optimality condition of (1.1). To see this, we follow the proof technique of Prop. 3 in [MMG15] with a suitable change of variables. Due to our uniform bound (Prop. 7), we can justify the optimality condition; see the details in [MMG15].

More interestingly, as we mentioned in Section 3, the solution (3.5) is not accurate in the sense that it is not equal to that of (2.15) given D. Yet, our theorem asserts that this does not prevent convergence to a stationary point. The intuition underlying this phenomenon is that the expected loss function (2.20) is determined only by D and does not involve U. What matters for the u_t's is their uniform boundedness and concentration, which establish the convergence. Thanks to the carefully chosen loss and surrogate functions, we are able to prove the desired property by mathematical induction, which is a crucial step in our proof.

In particular, we have the following lemma that facilitates our analysis:

Lemma 2.

Assume 1 and 2 and 3. Let be the sequence of the matrices produced by Algorithm 1. Then, there exists some universal constant , such that for all , .

Due to the above lemma, the solution is essentially determined by and when is a very large quantity since . We also have a non-asymptotic rate for the numerical convergence of as . See Appendix B for more details and a full proof.

5 Experiments

Before presenting the empirical results, we first introduce the universal settings used throughout the section.

Algorithms.  For the subspace recovery task, we compare our algorithm with state-of-the-art solvers including ORPCA [FXY13], LRR [LLY13] and PCP [CLMW11]. For the subspace clustering task, we choose ORPCA, LRR and SSC [EV09] as the competitive baselines. Recently, [LL14] improved the vanilla LRR by utilizing some low-rank matrix for the dictionary Y. We denote this variant of LRR by LRR2 and, accordingly, our algorithm equipped with such a dictionary is denoted OLRSC2.

Evaluation Metric.  We evaluate the fitness between the recovered subspace (with each column of its basis normalized) and the ground truth by the Expressed Variance (EV) [XCM10]:

EV(U, Û) ≜ Tr(Û Û^⊤ U U^⊤) / Tr(U U^⊤),    (5.1)

where U denotes the ground truth basis and Û the recovered one.

The value of EV scales between 0 and 1, and a higher value means better recovery.
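A small NumPy helper for this metric might look as follows (assuming the definition in (5.1); orthonormalizing the recovered basis is our own safeguard so that the numerator involves a proper projection).

    import numpy as np

    def expressed_variance(U_true, U_hat):
        """EV = Tr(U_hat U_hat^T U U^T) / Tr(U U^T), following (5.1)."""
        Q, _ = np.linalg.qr(U_hat)               # orthonormalize the recovered basis
        num = np.trace(Q @ Q.T @ U_true @ U_true.T)
        den = np.trace(U_true @ U_true.T)
        return num / den

With an orthonormal ground-truth basis, the value equals 1 exactly when the column space of the recovered basis contains that of the ground truth.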

The performance of subspace clustering is measured by the clustering accuracy, which also ranges in the interval [0, 1]; a higher value indicates a more accurate clustering.
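The text does not spell out how accuracy is computed; a common recipe, sketched below with our own function name, matches predicted to true labels via the Hungarian algorithm before counting agreements.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_accuracy(y_true, y_pred):
        """Accuracy under the best one-to-one matching of predicted and true labels."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        k = int(max(y_true.max(), y_pred.max())) + 1
        cost = np.zeros((k, k), dtype=int)
        for t, p in zip(y_true, y_pred):
            cost[t, p] += 1                      # contingency table
        row, col = linear_sum_assignment(-cost)  # maximize matched counts
        return cost[row, col].sum() / len(y_true)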

Parameters.  We set the regularization parameters λ1 and λ2 as in ORPCA, and let λ3 depend on the iteration counter t. We follow the default parameter settings for the baselines.

5.1 Subspace Recovery

Simulation Data.  We use 4 disjoint subspaces, whose bases are denoted by U_1, …, U_4. The clean data matrix of the k-th subspace is produced by X̄_k = U_k V_k^⊤, where the entries of the U_k's and V_k's are sampled i.i.d. from the normal distribution. Finally, the observed data matrix Z is generated by Z = X̄ + E, where X̄ is the column-wise concatenation of the X̄_k's followed by a random permutation, and E is the sparse corruption, a fraction of whose entries are non-zero and follow an i.i.d. uniform distribution. We independently conduct each experiment 10 times and report the averaged results.
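A sketch of this generative process, with hypothetical sizes and an illustrative corruption range (the paper's exact settings are given in the figure captions and the text):

    import numpy as np

    def make_union_of_subspaces(p=100, d_k=10, n_k=500, n_sub=4, rho=0.3, seed=0):
        """Samples from a union of n_sub low-dimensional subspaces in R^p,
        plus sparse corruption on roughly a rho fraction of the entries (sketch)."""
        rng = np.random.default_rng(seed)
        blocks, labels = [], []
        for k in range(n_sub):
            U_k = rng.standard_normal((p, d_k))      # basis of the k-th subspace
            V_k = rng.standard_normal((n_k, d_k))    # coefficients
            blocks.append(U_k @ V_k.T)               # clean data of subspace k
            labels += [k] * n_k
        X_bar = np.hstack(blocks)
        labels = np.array(labels)

        perm = rng.permutation(X_bar.shape[1])       # random column permutation
        X_bar, labels = X_bar[:, perm], labels[perm]

        E = np.zeros_like(X_bar)                     # sparse uniform corruption
        mask = rng.random(X_bar.shape) < rho
        E[mask] = rng.uniform(-1.0, 1.0, size=mask.sum())
        return X_bar + E, X_bar, labels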

Robustness.  We illustrate by simulation that OLRSC can effectively recover the underlying subspaces, confirming that D converges to a basis of the union of subspaces. For the two online algorithms, OLRSC and ORPCA, we compute the EV after revealing all the samples. We examine the performance under different intrinsic dimensions and corruption fractions. In more detail, the intrinsic dimension is varied over a range of values, and the corruption fraction is varied from 0 to 0.5 with a step size of 0.05.

Figure 1: Subspace recovery under different intrinsic dimensions and corruptions. Brighter is better. LRR and PCP are batch methods. OLRSC consistently outperforms ORPCA and even improves on the performance of LRR. Compared to PCP, OLRSC is competitive in most cases and degrades a little for highly corrupted data, possibly because the number of samples is not sufficient for its convergence.

The results are presented in Figure 1. The most intriguing observation is that OLRSC, as an online algorithm, outperforms its batch counterpart LRR! Such improvement may come from the explicit modeling of the basis, which makes OLRSC more informative than LRR. Interestingly, [GQV14] also observed that in some situations an online algorithm can outperform its batch counterpart. Fully understanding the rationale behind this phenomenon is an important direction for future research. Notably, OLRSC consistently beats ORPCA (an online version of PCP), which may be a consequence of the fact that OLRSC takes into account that the data are produced by a union of low-dimensional subspaces. While PCP works well in almost all scenarios, OLRSC degrades a little when addressing difficult cases (high rank and corruption). This is not surprising, since Theorem 1 is based on asymptotic analysis and hence we expect OLRSC to converge to the true subspace after acquiring more samples.

Convergence Rate.  Now we test on a large dataset to show that our algorithm usually converges to the true subspace faster than ORPCA. We plot the EV curve against the number of samples in Figure 2. Firstly, when equipped with a proper dictionary Y, OLRSC2 and LRR2 always produce an exact recovery of the subspace, as PCP does. When using the dataset itself for Y, OLRSC still converges to a favorable point after revealing all the samples. Compared to ORPCA, OLRSC is more robust and converges much faster in hard cases (e.g., under heavy corruption). Again, we note that in such hard cases OLRSC outperforms LRR, which agrees with the observation in Figure 1.

Figure 2: Convergence rate and time complexity. A higher EV means better subspace recovery. OLRSC always matches or outperforms its batch counterpart LRR. For hard cases, OLRSC converges much faster than ORPCA. Both PCP and LRR2 achieve the best EV value. When equipped with the same dictionary as LRR2, OLRSC2 also handles the highly corrupted data well. Our methods are more efficient than all competitors except PCP when the corruption is small, possibly because PCP utilizes a highly optimized C++ toolkit while ours are written in Matlab.

Computational Efficiency.  We also illustrate the time complexity of the algorithms in the last panel of Figure 2. In short, our algorithms (OLRSC and OLRSC2) admit the lowest computational cost in all cases. One may argue that PCP spends slightly less time than ours for small corruption fractions (0.01 and 0.1). However, we remark that PCP utilizes a highly optimized C++ toolkit to boost computation while our algorithms are fully written in Matlab. We believe that ours would run more efficiently if properly optimized by, e.g., BLAS routines. Another important message conveyed by the figure is that OLRSC is orders of magnitude more computationally efficient than the batch method LRR, while producing comparable or even better solutions.

5.2 Subspace Clustering

Datasets.  We examine the performance of subspace clustering on 5 realistic databases shown in Table 1, which can be downloaded from the LibSVM website. For MNIST, we randomly select 20,000 samples to form MNIST-20K, since we find it time-consuming to run the batch methods on the entire database.

Dataset     #classes   #samples   #features
Mushrooms   2          8,124      112
DNA         3          3,186      180
Protein     3          24,387     357
USPS        10         9,298      256
MNIST-20K   10         20,000     784
Table 1: Datasets for subspace clustering.

Standard Clustering Pipeline.  In order to focus on the solution quality of the different algorithms, we follow the standard pipeline which feeds a similarity matrix to a spectral clustering algorithm [NJW01]. To this end, we collect all the u_i's and v_i's produced by OLRSC to form the representation matrix X = UV^⊤. For ORPCA, we follow [LLY13] and build the similarity matrix from the row space of the clean matrix recovered by ORPCA. We run our algorithm and ORPCA with 2 epochs so as to apply a backward correction on the coefficients (U and V in ours, and the coefficient matrix in ORPCA).
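For concreteness, a sketch of this pipeline for OLRSC using scikit-learn's spectral clustering; the symmetrized affinity W = |X| + |X^⊤| is the usual choice for LRR-style methods, though the paper does not pin down this exact form.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def cluster_from_representation(U, V, n_clusters):
        """Build X = U V^T from the collected rows of U and V, then run spectral
        clustering on the symmetrized affinity W = |X| + |X^T| (sketch)."""
        X = U @ V.T                      # (n x n) representation matrix
        W = np.abs(X) + np.abs(X).T      # symmetric non-negative affinity
        model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
        return model.fit_predict(W)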

Fully Online Pipeline.  As we discussed in Section 3, the (optional) spectral clustering procedure needs the n × n similarity matrix, making the memory proportional to n^2. To tackle this issue, we proposed a fully online scheme whose key idea is performing k-means on the v_i's. Here, we examine the efficacy of this variant, which we call OLRSC-F.

            OLRSC    OLRSC-F   ORPCA    LRR       SSC
Mushrooms   85.09    89.36     65.26    58.44     54.16
            8.78     8.78      8.30     46.82     32 min
DNA         67.11    83.08     53.11    44.01     52.23
            2.58     2.58      2.09     23.67     3 min
Protein     43.30    43.94     40.22    40.31     44.27
            24.66    24.66     22.90    921.58    65 min
USPS        65.95    70.29     55.70    52.98     47.58
            33.93    33.93     27.01    257.25    50 min
MNIST-20K   57.74    55.50     54.10    55.23     43.91
            129      129       121      32 min    7 hours
Table 2: Clustering accuracy (%) and running time (in seconds unless noted otherwise). For each dataset, the first row reports the accuracy and the second row the running time. For all the large-scale datasets, OLRSC (or OLRSC-F) has the highest clustering accuracy. Regarding the running time, our method spends time comparable to ORPCA (the fastest solver) while dramatically improving the accuracy. Although SSC is slightly better than OLRSC on Protein, it consumes about one hour while OLRSC takes about 25 seconds.

The results are recorded in Table 2, where the time cost of spectral clustering or k-means is not included so that we can focus on comparing the efficiency of the algorithms themselves. Also note that we use the dataset itself as the dictionary Y, because we find that an alternative choice of Y does not help much on this task. OLRSC and ORPCA require an estimate of the true rank; here we use an estimate proportional to the number of classes of each dataset. Our algorithm significantly outperforms the two state-of-the-art methods LRR and SSC in both accuracy and efficiency. One may argue that SSC is slightly better than OLRSC on Protein. Yet, it spends 1 hour while OLRSC only costs 25 seconds. Hence, SSC is not practical. Compared to ORPCA, OLRSC always identifies more correct samples while consuming comparable running time. For example, on the USPS dataset, OLRSC achieves an accuracy of 65.95% while that of ORPCA is 55.7%. Regarding the running time, OLRSC uses only 7 seconds more than ORPCA, i.e., the same order of computational complexity, which agrees with the qualitative analysis in Section 3.

           Mushrooms   DNA   Protein   USPS   MNIST-20K
Spectral   295         18    7567      482    4402
k-means    2           6     5         19     91
Table 3: Time cost (seconds) of spectral clustering and k-means.

More interestingly, Table 2 shows that the k-means alternative (OLRSC-F) usually outperforms the spectral clustering pipeline. This suggests that, for robust subspace clustering formulations, the simple k-means paradigm may suffice to guarantee an appealing result. On the other hand, we report the running time of spectral clustering and k-means in Table 3. As expected, since spectral clustering computes an SVD of an n-by-n similarity matrix, it is quite slow. In fact, it sometimes dominates the running time of the whole pipeline. In contrast, k-means is extremely fast and scalable, as it can be implemented in an online fashion.

5.3 Influence of d

A key ingredient of our formulation is a factorization of the nuclear norm regularized matrix, which requires an estimate d of its rank (see (2.1)). Here we examine the influence of the selection of d (which serves as an upper bound on the true rank). We report both EV and clustering accuracy for different values of d under a range of corruptions. The simulation data are generated as in Section 5.1. Since the four subspaces are disjoint, the true rank is 40.

Figure 3: Examining the influence of d. The true rank is 40.

From Figure 3, we observe that our algorithm cannot recover the true subspace if d is smaller than the true rank. On the other hand, when d is sufficiently large (at least as large as the true rank), our algorithm can perfectly estimate the subspace. This agrees with the results in [BM05], which say that as long as d is large enough, any local minimum is a global optimum. We also illustrate the influence of d on subspace clustering. Generally speaking, OLRSC can consistently identify the clusters of the data points if d is sufficiently large. Interestingly, different from the subspace recovery task, here the requirement on d seems to be slightly relaxed. In particular, we notice that if we pick d as 20 (smaller than the true rank), OLRSC still performs well. Such a relaxed requirement on d may benefit from the fact that the spectral clustering step can correct some wrongly represented points, as suggested by [SEC14].

6 Conclusion

In this paper, we have proposed an online algorithm called OLRSC for subspace clustering, which dramatically reduces the memory cost of LRR from O(n^2) to O(pd), i.e., it is orders of magnitude more memory efficient. One of the key techniques in this work is an explicit basis modeling, which essentially renders the model more informative than LRR. Another important component is a non-convex reformulation of the nuclear norm. Combining these techniques allows OLRSC to simultaneously recover the union of the subspaces, identify the possible corruptions and perform subspace clustering. We have also established the theoretical guarantee that the solutions produced by our algorithm converge to a stationary point of the expected loss function. Moreover, we have analyzed the time complexity and empirically demonstrated that our algorithm is computationally very efficient compared to competing baselines. Our extensive experimental study on synthetic and realistic datasets also illustrates the robustness of OLRSC. In a nutshell, OLRSC is an appealing algorithm in all three respects: memory cost, computation and robustness.

Appendix A Proof Preliminaries

Lemma 3 (Corollary of Thm. 4.1 [Bs98]).

Let f: R^p × R^q → R. Suppose that for all x ∈ R^p the function f(x, ·) is differentiable, and that f and ∇_u f(x, u) are continuous on R^p × R^q. Let v(u) be the optimal value function v(u) = min_{x ∈ C} f(x, u), where C is a compact subset of R^p. Then v(u) is directionally differentiable. Furthermore, if for u_0 ∈ R^q, f(·, u_0) has a unique minimizer x_0, then v(u) is differentiable at u_0 and ∇_u v(u_0) = ∇_u f(x_0, u_0).

Lemma 4 (Corollary of Donsker theorem [vdV00]).

Let F = {f_θ : θ ∈ Θ} be a set of measurable functions indexed by a bounded subset Θ of R^d. Suppose that there exists a constant K such that |f_{θ1}(x) − f_{θ2}(x)| ≤ K‖θ1 − θ2‖_2 for every θ1 and θ2 in Θ and x in X. Then, F is P-Donsker. For any f in F, let us define P_n f, P f and G_n f as

P_n f = (1/n) Σ_{i=1}^n f(X_i),   P f = E_X[f(X)],   G_n f = √n (P_n f − P f).

Let us also suppose that for all f, P f^2 < δ^2 and ‖f‖_∞ < M, and that the random elements X_1, X_2, … are Borel-measurable. Then, we have

E‖G_n‖_F = O(1),

where ‖G_n‖_F = sup_{f ∈ F} |G_n f|.

Lemma 5 (Sufficient condition of convergence for a stochastic process [Bot98]).

Let (Ω, F, P) be a measurable probability space, u_t, for t ≥ 0, be the realization of a stochastic process and F_t be the filtration determined by the past information at time t. Let

δ_t = 1 if E[u_{t+1} − u_t | F_t] > 0, and δ_t = 0 otherwise.

If for all t, u_t ≥ 0 and Σ_{t=1}^∞ E[δ_t (u_{t+1} − u_t)] < ∞, then u_t is a quasi-martingale and converges almost surely. Moreover,

Σ_{t=1}^∞ |E[u_{t+1} − u_t | F_t]| < +∞  almost surely.

Lemma 6 (Lemma 8 from [MBPS10]).

Let (a_n)_{n≥1} and (b_n)_{n≥1} be two real sequences such that for all n, a_n ≥ 0, b_n ≥ 0, Σ_{n=1}^∞ a_n = ∞, Σ_{n=1}^∞ a_n b_n < ∞, and there exists K > 0 such that |b_{n+1} − b_n| < K a_n. Then, lim_{n→∞} b_n = 0.

Appendix B Proof Details

b.1 Proof of Boundedness

Proposition 7.

Let , , and be the optimal solutions produced by Algorithm 1. Then,

  1. , , and are uniformly bounded.

  2. is uniformly bounded.

  3. is supported by some compact set .

  4. is uniformly bounded.

Proof.

Let us consider the optimization problem of solving v and e. As the trivial solutions v = 0 and e = 0 are feasible, we have

Therefore, the optimal solution should satisfy:

which implies

Since z is uniformly bounded (Assumption 1), v and e are uniformly bounded.

To examine the uniform bound for and , note that

Since for each , , and are uniformly bounded, and are uniformly bounded.

Now we derive the bound for . All the information we have is:

  1. (definition of ).

  2. (closed form solution).

  3. (first order optimality condition for ).

  4. , , are uniformly upper bounded (Claim 1).

  5. The smallest singular values of and are uniformly lower bounded away from zero (Assumption 2 and 3).

For simplicity, we write as:

(B.1)

where

Note that as we assume is positive definite, is always invertible.

From the definition of and (3.5), we know that