1 Introduction
High-dimensional data often have a small intrinsic dimension. For example, in the area of computer vision, face images of a subject
[1], [26], handwritten images of a digit [9], and trajectories of a moving object [20] can all be well-approximated by a low-dimensional subspace of the high-dimensional ambient space. Thus, data from multiple classes often lie in a union of low-dimensional subspaces. The subspace clustering problem is to partition high-dimensional data into clusters corresponding to their underlying subspaces. Standard clustering methods such as k-means are in general not applicable to subspace clustering. Various methods have recently been suggested for subspace clustering, such as Sparse Subspace Clustering (SSC)
[6] (see also its extensions and analysis in [11, 17, 18, 24]), Local Subspace Affinity (LSA) [27], Local Best-fit Flats (LBF) [28], Generalized Principal Component Analysis
[22], Agglomerative Lossy Compression [13], Locally Linear Manifold Clustering [8], and Spectral Curvature Clustering [5]. A recent survey on subspace clustering can be found in [21].

Low-dimensional intrinsic structures, which enable subspace clustering, are often violated for real-world computer vision observations (as well as for other types of real data). For example, under the assumption of Lambertian reflectance, [1] shows that face images of a subject obtained under a wide variety of lighting conditions can be accurately approximated with a 9-dimensional linear subspace. However, real-world face images are often captured under pose variations; in addition, faces are not perfectly Lambertian, and exhibit cast shadows and specularities [3]. Therefore, it is critical for subspace clustering to handle corrupted underlying structures of realistic data, and, as such, deviations from ideal subspaces.
When data from the same low-dimensional subspace are arranged as columns of a single matrix, this matrix should be approximately low-rank. Thus, a promising way to handle corrupted data for subspace clustering is to restore such low-rank structure. Recent efforts have been invested in seeking transformations such that the transformed data can be decomposed as the sum of a low-rank matrix component and a sparse error matrix [14, 16, 29]. [14] and [29] are proposed for image alignment (see [10] for the extension to multiple classes, with applications in cryo-electron tomography), and [16] is discussed in the context of salient object detection. All these methods build on recent theoretical and computational advances in rank minimization.
In this paper, we propose to robustify subspace clustering by learning a linear transformation on subspaces, using matrix rank, via its nuclear-norm convex surrogate, as the optimization criterion. The learned linear transformation recovers a low-rank structure for data from the same subspace and, at the same time, forces a high-rank structure for data from different subspaces. In this way, we reduce variations within the subspaces and increase separations between the subspaces, for more accurate subspace clustering.
This paper makes the following main contributions:
- Subspace low-rank transformation is introduced in the context of subspace clustering;
- A learned Robust Subspace Clustering framework is proposed to enhance existing subspace clustering methods;
- We propose a specific subspace clustering technique, called Robust Sparse Subspace Clustering, which exploits the low-rank structure of the learned transformed subspaces;
- We discuss online learning of subspace low-rank transformations for big data;
- We discuss learning of subspace low-rank transformations with compression, where the learned matrix simultaneously reduces the data embedding dimension.
The proposed approach can be considered as a way of learning data features, with such features learned in order to reduce rank and encourage subspace clustering. As such, the framework and criteria here introduced can be incorporated into other data classification and clustering problems.
2 Subspace Clustering using Low-rank Transformations
Let $\{\mathcal{S}_c\}_{c=1}^{C}$ be $C$ $d$-dimensional subspaces of $\mathbb{R}^n$ (not all subspaces are necessarily of the same dimension; this is assumed here only to simplify notation). Given a data set $\mathbf{Y} = \{\mathbf{y}_i \in \mathbb{R}^n\}_{i=1}^{N}$, with each data point in one of the subspaces, the data are in general arranged as columns of $\mathbf{Y} \in \mathbb{R}^{n \times N}$. $\mathbf{Y}_c$ denotes the set of points in the $c$-th subspace $\mathcal{S}_c$, and these points are arranged as columns of the matrix $\mathbf{Y}_c$. The subspace clustering problem is to partition the data set $\mathbf{Y}$ into $C$ clusters corresponding to their underlying subspaces.
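To make this setup concrete, the following sketch (ours, not the paper's; all sizes are arbitrary choices) synthesizes data lying in a union of low-dimensional subspaces, using the notation above:

```python
# Illustrative sketch: C random d-dimensional subspaces of R^n, N_c points each.
import numpy as np

rng = np.random.default_rng(0)
n, d, C, N_c = 50, 3, 4, 100

Ys, labels = [], []
for c in range(C):
    basis, _ = np.linalg.qr(rng.standard_normal((n, d)))  # orthonormal basis of S_c
    Ys.append(basis @ rng.standard_normal((d, N_c)))      # Y_c: n x N_c
    labels.extend([c] * N_c)
Y = np.hstack(Ys)          # full data matrix, n x (C * N_c)
labels = np.array(labels)  # ground-truth subspace assignments

# Each Y_c has rank (at most) d, far below the ambient dimension n.
print(np.linalg.matrix_rank(Ys[0]))  # -> 3
```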
As the data points in $\mathbf{Y}_c$ lie in a low-dimensional subspace, the matrix $\mathbf{Y}_c$ is expected to be low-rank, and such low-rank structure is critical for accurate subspace clustering. However, as discussed above, this low-rank structure is often violated for realistic data.
Our proposed approach is to robustify subspace clustering by learning a global linear transformation on subspaces. Such a linear transformation restores a low-rank structure for data from the same subspace and, at the same time, encourages a high-rank structure for data from different subspaces. In this way, we reduce the variation within the subspaces and introduce separations between the subspaces for more accurate subspace clustering. In other words, the learned transform prepares the data for the “ideal” conditions of subspace clustering.
2.1 Low-rank Transformation on Subspaces
We now discuss low-rank transformations on subspaces in the context of subspace clustering. We first introduce a method to learn a low-rank transformation using gradient descent (other optimization techniques could be considered). Then, we present an online version for big data. We further discuss the learning of a transformation with compression (dimensionality reduction) enabled.
2.1.1 Problem Formulation
We first assume that the data cluster labels are known beforehand; this assumption is removed when discussing the full clustering approach in Section 2.2. We adopt matrix rank, actually its convex surrogate, as the key criterion, and compute one global linear transformation on all subspaces as
$$\hat{\mathbf{T}} = \arg\min_{\mathbf{T}} \sum_{c=1}^{C} \|\mathbf{T}\mathbf{Y}_c\|_* \;-\; \lambda\,\|\mathbf{T}\mathbf{Y}\|_*, \qquad (1)$$

where $\mathbf{T} \in \mathbb{R}^{n \times n}$ is one global linear transformation on all data points, and $\|\cdot\|_*$ denotes the nuclear norm. Intuitively, minimizing the first (representation) term encourages a consistent representation for the transformed data from the same subspace, and minimizing the second (discrimination) term encourages a diverse representation for transformed data from different subspaces. The parameter $\lambda$ balances between representation and discrimination.
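As a minimal illustration of (1), the sketch below evaluates the objective for a candidate $\mathbf{T}$, assuming the data layout of the earlier synthetic-data sketch; `nuclear_norm` is simply the sum of singular values:

```python
# Evaluate objective (1) for a candidate transformation T (illustrative sketch).
import numpy as np

def nuclear_norm(X):
    return np.linalg.norm(X, ord='nuc')  # sum of singular values

def objective(T, Y_per_cluster, Y, lam):
    # Representation term: encourage low rank within each transformed cluster...
    rep = sum(nuclear_norm(T @ Yc) for Yc in Y_per_cluster)
    # ...minus the discrimination term: high rank across all transformed data.
    return rep - lam * nuclear_norm(T @ Y)

# Example usage with the synthetic data above, starting from the identity:
# objective(np.eye(Y.shape[0]), Ys, Y, lam=0.5)
```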
2.1.2 Gradient Descent Learning
Given any matrix $\mathbf{X}$ of rank at most $r$, the matrix 2-norm $\|\mathbf{X}\|_2$ is equal to its largest singular value, and the nuclear norm $\|\mathbf{X}\|_*$ is equal to the sum of its singular values. Thus, these two norms are related by the inequality

$$\|\mathbf{X}\|_2 \;\le\; \|\mathbf{X}\|_* \;\le\; r\,\|\mathbf{X}\|_2. \qquad (2)$$
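A quick numeric sanity check of inequality (2), on a random matrix of rank at most $r$ (an illustrative sketch, not from the paper):

```python
# Verify ||X||_2 <= ||X||_* <= r * ||X||_2 for a random rank-r matrix.
import numpy as np

rng = np.random.default_rng(1)
r = 4
X = rng.standard_normal((30, r)) @ rng.standard_normal((r, 40))  # rank <= r

spec = np.linalg.norm(X, ord=2)      # largest singular value
nuc = np.linalg.norm(X, ord='nuc')   # sum of singular values
assert spec <= nuc <= r * spec + 1e-9
```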
We use a simple gradient descent (though other modern nuclear-norm optimization techniques could be considered, including recent real-time formulations [19]) to search for the transformation matrix $\mathbf{T}$ that minimizes (1). The partial derivative of (1) w.r.t. $\mathbf{T}$ is written as

$$\frac{\partial}{\partial \mathbf{T}} \Big( \sum_{c=1}^{C} \|\mathbf{T}\mathbf{Y}_c\|_* - \lambda\,\|\mathbf{T}\mathbf{Y}\|_* \Big). \qquad (3)$$
Due to property (2), by minimizing the matrix 2-norm one also minimizes an upper bound to the nuclear norm. (3) can now be evaluated as

$$\sum_{c=1}^{C} \partial\|\mathbf{T}\mathbf{Y}_c\|_*\,\mathbf{Y}_c^{\top} \;-\; \lambda\,\partial\|\mathbf{T}\mathbf{Y}\|_*\,\mathbf{Y}^{\top}, \qquad (4)$$
where $\partial\|\mathbf{X}\|_*$ is the subdifferential of the norm $\|\cdot\|_*$ at $\mathbf{X}$. Given a matrix $\mathbf{X}$, the subdifferential $\partial\|\mathbf{X}\|_*$ can be evaluated using the simple approach shown in Algorithm 1 [25]. By evaluating (4), the optimal transformation matrix $\hat{\mathbf{T}}$ can be searched for with gradient descent steps $\mathbf{T} \leftarrow \mathbf{T} - \delta\,\frac{\partial}{\partial \mathbf{T}}$, where $\delta$ defines the step size. After each iteration, we normalize $\mathbf{T}$ as $\mathbf{T} \leftarrow \mathbf{T}/\|\mathbf{T}\|_2$. This algorithm guarantees convergence to a local minimum.
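The following sketch implements the gradient descent just described. Since Algorithm 1 is not reproduced here, it uses the standard SVD-based nuclear-norm subgradient $\mathbf{U}\mathbf{V}^{\top}$ as a stand-in; the step size and iteration count are assumptions, not the paper's settings:

```python
# Gradient-descent sketch for minimizing (1), under the assumptions above.
import numpy as np

def nuclear_subgrad(X):
    # U V^T from the SVD is an element of the subdifferential of ||.||_* at X.
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def learn_transform(Y_per_cluster, Y, lam=0.5, delta=1e-3, iters=100):
    n = Y.shape[0]
    T = np.eye(n)
    for _ in range(iters):
        # Evaluate (4): per-cluster subgradients minus the discrimination term.
        grad = sum(nuclear_subgrad(T @ Yc) @ Yc.T for Yc in Y_per_cluster)
        grad -= lam * nuclear_subgrad(T @ Y) @ Y.T
        T -= delta * grad
        T /= np.linalg.norm(T, ord=2)  # normalize T after each iteration
    return T
```

The normalization step mirrors the one above; without it, the discrimination term could be improved simply by rescaling $\mathbf{T}$.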
2.1.3 Online Learning
When the data is big, we use an online algorithm to learn the low-rank transformation on subspaces (a sketch follows the list below):

- We first randomly partition the data set $\mathbf{Y}$ into $B$ mini-batches;
- Using mini-batch gradient descent, a variant of stochastic gradient descent, the gradient $\frac{\partial}{\partial \mathbf{T}}$ is approximated by a sum of gradients obtained from each mini-batch of samples, $\frac{\partial}{\partial \mathbf{T}} = \sum_{b=1}^{B} \frac{\partial_b}{\partial \mathbf{T}}$, where $\frac{\partial_b}{\partial \mathbf{T}}$ is obtained from (4) using only the data points in the $b$-th mini-batch;
- Starting with the first mini-batch, we learn the subspace transformation $\mathbf{T}_b$ using data only in the $b$-th mini-batch, with $\mathbf{T}_{b-1}$ as warm restart.
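A sketch of this online variant, reusing `nuclear_subgrad` from the earlier sketch; the mini-batch data layout and the per-batch iteration count are assumptions:

```python
# Online learning sketch: one pass over mini-batches, each warm-started from
# the transformation learned on the previous mini-batch.
import numpy as np

def learn_transform_online(batches, lam=0.5, delta=1e-3, iters_per_batch=10):
    # batches: list of (Y_per_cluster_b, Y_b) pairs, one per mini-batch b
    T = np.eye(batches[0][1].shape[0])
    for Y_per_cluster_b, Y_b in batches:   # warm restart: T carries over
        for _ in range(iters_per_batch):
            grad = sum(nuclear_subgrad(T @ Yc) @ Yc.T for Yc in Y_per_cluster_b)
            grad -= lam * nuclear_subgrad(T @ Y_b) @ Y_b.T
            T -= delta * grad
            T /= np.linalg.norm(T, ord=2)
    return T
```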
2.1.4 Subspace Transformation with Compression
Given data $\mathbf{Y} \in \mathbb{R}^{n \times N}$, we have so far considered a square linear transformation of size $n \times n$. If we instead devise a “fat” linear transformation of size $m \times n$, where $m < n$, we enable dimension reduction along with the transformation (the algorithm discussed above is directly applicable to learning a linear transformation with fewer rows than columns); a minimal initialization sketch is given below. This connects the proposed framework with the literature on compressed sensing, though the goal here is to learn a sensing matrix for subspace classification and not for reconstruction [4]. The nuclear-norm minimization provides a new metric for such a sensing design paradigm.
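Only the initialization changes for the compressive case; a random row-orthonormal $m \times n$ starting point is one possible choice (an assumption on our part, not specified here):

```python
# Initialize a "fat" m x n transformation; the same learning loop then applies.
import numpy as np

def init_fat_transform(n, m, seed=0):
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, m)))
    return Q.T  # m x n with orthonormal rows; T @ Y embeds the data in R^m
```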
2.2 Learning for Subspace Clustering
We first present a general procedure to enhance the performance of existing subspace clustering methods. We then propose a specific subspace clustering technique to fully exploit the low-rank structure of the (learned) transformed subspaces.
2.2.1 A Learned Robust Subspace Clustering (RSC) Framework
In practice, the data labeling (clustering) is not known beforehand. The proposed algorithm, Algorithm 2, iterates between two stages. In the first, assignment stage, we obtain clusters using any subspace clustering method, e.g., SSC [6], LSA [27], or LBF [28]; in particular, in this paper we often use the new technique introduced in Section 2.2.2. In the second, update stage, based on the current clustering result, we compute the optimal subspace transformation that minimizes (1). The algorithm is repeated until the clustering assignments stop changing, as sketched below. Algorithm 2 is a general procedure to enhance the performance of any subspace clustering method.
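A compact sketch of this alternation; `cluster` stands for any subspace clustering method (its callable signature is our assumption), and `learn_transform` is the earlier gradient-descent sketch:

```python
# RSC alternation sketch (in the spirit of Algorithm 2).
import numpy as np

def rsc(Y, C, cluster, max_rounds=10, lam=0.5):
    # cluster: callable (data matrix, number of clusters) -> label array
    n, N = Y.shape
    T = np.eye(n)
    labels = cluster(T @ Y, C)                    # assignment stage
    for _ in range(max_rounds):
        Y_per_cluster = [Y[:, labels == c] for c in range(C)]
        T = learn_transform(Y_per_cluster, Y, lam)  # update stage: minimize (1)
        new_labels = cluster(T @ Y, C)
        if np.array_equal(new_labels, labels):    # stop when assignments settle
            break
        labels = new_labels
    return labels, T
```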
2.2.2 Robust Sparse Subspace Clustering (RSSC)
Though Algorithm 2 can adopt any subspace clustering method, to fully exploit the low-rank structure of the transformed subspaces we further propose the following specific technique for the clustering step in the RSC framework, called Robust Sparse Subspace Clustering (RSSC):
For the transformed subspaces, we first recover their low-rank representation by performing a low-rank decomposition (5), e.g., using RPCA [3],

$$\min_{\mathbf{L},\mathbf{E}} \|\mathbf{L}\|_* + \gamma\,\|\mathbf{E}\|_1 \quad \text{s.t.} \quad \mathbf{T}\mathbf{Y} = \mathbf{L} + \mathbf{E}, \qquad (5)$$

where $\mathbf{L}$ is the recovered low-rank component and $\mathbf{E}$ a sparse error matrix.
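A compact sketch of solving (5) with a basic inexact-ALM/ADMM iteration (singular-value thresholding for $\mathbf{L}$, elementwise soft thresholding for $\mathbf{E}$); the weight $\gamma$ and the step-size heuristic follow common RPCA defaults and are assumptions, not the paper's settings:

```python
# Principal component pursuit sketch for the low-rank decomposition (5).
import numpy as np

def svt(X, tau):  # singular value thresholding
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

def shrink(X, tau):  # elementwise soft thresholding
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)

def rpca(M, iters=200):
    m, n = M.shape
    gamma = 1.0 / np.sqrt(max(m, n))            # standard RPCA weight
    mu = m * n / (4 * np.abs(M).sum() + 1e-12)  # common step-size heuristic
    L = np.zeros_like(M); E = np.zeros_like(M); Z = np.zeros_like(M)
    for _ in range(iters):
        L = svt(M - E + Z / mu, 1 / mu)         # low-rank update
        E = shrink(M - L + Z / mu, gamma / mu)  # sparse-error update
        Z += mu * (M - L - E)                   # dual update
    return L, E  # M ~ L (low-rank) + E (sparse)
```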
Each transformed point $\mathbf{T}\mathbf{y}_i$ is then sparsely decomposed over $\mathbf{L}$,

$$\min_{\mathbf{x}_i} \|\mathbf{T}\mathbf{y}_i - \mathbf{L}\mathbf{x}_i\|_2^2 \quad \text{s.t.} \quad \|\mathbf{x}_i\|_0 \le s, \qquad (6)$$

where $s$ is a predefined sparsity value. As explained in [6], a data point in a linear or affine subspace of dimension $d$ can be written as a linear or affine combination of $d$ or $d+1$ points in the same subspace. Thus, if we represent a point as a linear or affine combination of all other points, a sparse linear or affine combination can be obtained by choosing $d$ or $d+1$ nonzero coefficients.
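For illustration, (6) can be approximated greedily with orthogonal matching pursuit. In the sketch below, scikit-learn is our convenience choice (not the paper's implementation), and excluding each point's own column from its dictionary is an assumption following the SSC convention:

```python
# Greedy approximation of the sparse decomposition (6) via OMP.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def sparse_codes_omp(L, s):
    n, N = L.shape
    X = np.zeros((N, N))
    for i in range(N):
        D = np.delete(L, i, axis=1)  # dictionary without the point itself
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=s, fit_intercept=False)
        omp.fit(D, L[:, i])
        X[np.arange(N) != i, i] = omp.coef_
    return X  # column i holds the sparse code of point i
```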
As the optimization process for (6) is computationally demanding, we further simplify (6) using Local Linear Embedding [15], [23]. Each transformed point $\mathbf{T}\mathbf{y}_i$ is represented using its $s$ Nearest Neighbors (NN) in $\mathbf{L}$, which are denoted as $\mathbf{L}_i$,

$$\min_{\mathbf{x}_i} \|\mathbf{T}\mathbf{y}_i - \mathbf{L}_i\mathbf{x}_i\|_2^2 \quad \text{s.t.} \quad \mathbf{1}^{\top}\mathbf{x}_i = 1. \qquad (7)$$
Let $\mathbf{L}_i$ be centered through $\bar{\mathbf{L}}_i = \mathbf{L}_i - \mathbf{T}\mathbf{y}_i\mathbf{1}^{\top}$. $\mathbf{x}_i$ can then be obtained in closed form as $\mathbf{x}_i = \hat{\mathbf{x}}_i / (\mathbf{1}^{\top}\hat{\mathbf{x}}_i)$,
where $\hat{\mathbf{x}}_i$ solves the system of linear equations $(\bar{\mathbf{L}}_i^{\top}\bar{\mathbf{L}}_i)\,\hat{\mathbf{x}}_i = \mathbf{1}$. As suggested in [15], if the correlation matrix $\bar{\mathbf{L}}_i^{\top}\bar{\mathbf{L}}_i$ is nearly singular, it can be conditioned by adding a small multiple of the identity matrix.
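A sketch of this closed-form computation, following the LLE recipe [15]; the brute-force neighbor search, the exclusion of the point itself, and the regularization constant are the usual choices and are assumptions here:

```python
# Closed-form local weights (7), LLE-style.
import numpy as np

def lle_code(i, L, s, reg=1e-3):
    # Code the i-th column of L over its s nearest neighbors (excluding itself).
    ty = L[:, i]
    d2 = ((L - ty[:, None]) ** 2).sum(axis=0)
    d2[i] = np.inf                          # exclude the point itself
    nbrs = np.argsort(d2)[:s]
    Li = L[:, nbrs] - ty[:, None]           # centered neighbors
    G = Li.T @ Li                           # s x s correlation (Gram) matrix
    G += reg * np.trace(G) / s * np.eye(s)  # condition a nearly singular G
    w = np.linalg.solve(G, np.ones(s))      # solve the linear system G x = 1
    return nbrs, w / w.sum()                # normalize so weights sum to one
```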
Given the sparse representation $\mathbf{x}_i$ of each transformed data point $\mathbf{T}\mathbf{y}_i$, we denote the sparse representation matrix as $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_N]$. It is noted that each $\mathbf{x}_i$ is written as an $N$-sized vector with no more than $s$ nonzero values ($N$ being the total number of data points). The pairwise affinity matrix is now defined as $\mathbf{A} = |\mathbf{X}| + |\mathbf{X}|^{\top}$, and the subspace clustering is obtained using spectral clustering [12].
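A sketch of this final step, with the affinity built as above and the spectral clustering delegated to scikit-learn (an implementation convenience, not the paper's code):

```python
# Symmetrize the sparse representation matrix into an affinity and cluster.
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_codes(X, C):
    A = np.abs(X) + np.abs(X).T   # pairwise affinity matrix
    sc = SpectralClustering(n_clusters=C, affinity='precomputed')
    return sc.fit_predict(A)      # cluster labels, one per data point
```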
Based on the experimental results presented in Section 3, the proposed RSSC outperforms state-of-the-art subspace clustering techniques, both in accuracy and running time; e.g., it is about 500 times faster than the original SSC using the implementation provided in [6]. The accuracy is even further enhanced when RSSC is used as an internal step of RSC.
3 Experimental Evaluation
This section presents experimental evaluations on three public datasets (standard benchmarks): the MNIST handwritten digit dataset, the Extended YaleB face dataset [7], and the CMU Motion Capture (Mocap) dataset at http://mocap.cs.cmu.edu. The MNIST dataset consists of 8-bit grayscale handwritten digit images of “0” through “9”, with 7000 examples for each class. The Extended YaleB face dataset contains 38 subjects with near-frontal pose imaged under 64 lighting conditions; all the images are resized to a common reduced resolution. The Mocap dataset contains measurements from 42 (non-imaging) sensors that capture the motions of 149 subjects performing multiple actions.
The subspace clustering methods compared are SSC [6], LSA [27], and LBF [28]. Based on the studies in [6], [21], and [28], these three methods exhibit state-of-the-art subspace clustering performance. We adopt the LSA and SSC implementations provided in [6] from http://www.vision.jhu.edu/code/, and the LBF implementation provided in [28] from http://www.ima.umn.edu/~zhang620/lbf/.
3.1 Evaluation with Illustrative Examples
We conduct the first set of experiments on a subset of the MNIST dataset. We adopt a setup similar to the one described in [28], using the same sets of 2 or 3 digits, and randomly choose 200 images for each digit. Unlike [28], we do not perform dimension reduction to preprocess the data. We set the sparsity value $s$ for RSSC, and perform a fixed number of iterations for the gradient descent updates while learning the transformation on subspaces.
Fig. 1 shows the misclassification rate (e) and running time (t) for clustering subspaces of two digits. The misclassification rate is the ratio of misclassified points to the total number of points. For visualization purposes, the data are plotted with the dimension reduced to 2 using Laplacian Eigenmaps [2]. Different clusters are represented by different colors, and the ground truth is plotted using the true cluster labels. The proposed RSSC outperforms state-of-the-art methods, both in terms of clustering accuracy and running time. The clustering error of RSSC is further reduced using the proposed RSC framework in Algorithm 2 through the learned low-rank subspace transformation. The clustering results converge after about 3 RSC iterations. After convergence, the learned subspace transformation not only recovers a low-rank structure for data from the same subspace, but also increases the separation between the subspaces for more accurate clustering.
Fig. 2 shows the misclassification rate (e) for clustering subspaces of three digits. Here we adopt LBF in our RSC framework, denoted as Robust LBF (RLBF), to illustrate that the performance of existing subspace clustering methods can be enhanced using the proposed RSC framework. After convergence, RLBF, which uses the proposed learned subspace transformation, significantly outperforms state-of-the-art methods.
3.1.1 Online vs. Batch Learning
In this set of experiments, we use digits {1, 2} from the MNIST dataset. We select 1000 images for each digit, and randomly partition them into 5 mini-batches. We first perform one iteration of RSC in Algorithm 2 over all selected data with various $\lambda$ values. As shown in panel (a) of the corresponding convergence figure, we always observe empirical convergence for the subspace transformation learning in (1).
As discussed, the value of $\lambda$ balances the representation and discrimination terms in the objective function (1). In general, the value of $\lambda$ can be estimated through cross-validation. In our experiments, we always choose $\lambda$ as a simple function of $C$, the number of subspaces.

Starting with the first mini-batch, we then perform one iteration of RSC over one mini-batch at a time, with the subspace transformation learned from the previous mini-batch as warm restart. We adopt here a fixed number of iterations for the gradient descent updates. As shown in panel (b), we observe similar empirical convergence for online transformation learning, and in our runs online learning converges to the same objective function value in considerably less time than batch learning.
3.2 Application to Face Clustering
In the Extended YaleB dataset, each of the 38 subjects is imaged under 64 lighting conditions, as shown in panel (a) of the corresponding figure. We conduct the face clustering experiments on the first 9 subjects, shown in panel (b). We set the sparsity value $s$ for RSSC, and perform a fixed number of iterations for the gradient descent updates while learning the transformation. Fig. 5 shows the error rate (e) and running time (t) for clustering subspaces of 2 subjects, 3 subjects, and 9 subjects. The proposed RSSC outperforms state-of-the-art methods in both accuracy and running time. Using the proposed RSC framework (that is, learning the transform), the misclassification errors of RSSC are further reduced significantly, for example, for subjects {1, 2} and for the full set of 9 subjects.
3.3 Application to Motion Segmentation
In the Mocap dataset, we consider the trial 2 sequence performed by subject 86, which consists of eight different actions, shown in Fig. 6. As discussed in [18], the data from each action lie in a low-dimensional subspace. We achieve temporal motion segmentation by clustering sensor measurements corresponding to different actions. We set the sparsity value $s$ for RSSC and downsample the sequence by a factor of 2, as in [18]. As shown in Table 1, the proposed approach again significantly outperforms state-of-the-art clustering methods for motion segmentation.
3.4 Discussion on the Size of the Transformation Matrix
In the experiments presented above, images are resized so that each vectorized image lies in $\mathbb{R}^n$; the learned subspace transformation is thus of size $n \times n$. If we instead learn a transformation of size $m \times n$ with $m < n$, we enable dimension reduction while performing the subspace transformation. Through experiments, we notice that the peak clustering accuracy is usually obtained when $m$ is smaller than the dimension of the ambient space. For example, in the setup of Fig. 5, exhaustive search for the optimal $m$ further reduces the misclassification rate for subjects {2, 3} and for subjects {4, 5, 6}, in both cases at a value of $m$ below the ambient dimension. As discussed before, this provides a framework to sense for clustering and classification, connecting the work presented here with the extensive literature on compressed sensing, and in particular on sensing design, e.g., [4]. We plan to study in detail the optimal size of the learned transformation matrix for subspace clustering and to further investigate such connections with compressed sensing.
4 Conclusion
We presented a subspace low-rank transformation approach to robustify subspace clustering. Using matrix rank as the optimization criterion, we learn a subspace transformation that reduces variations within the subspaces and increases separations between the subspaces, for more accurate subspace clustering. We demonstrated that the proposed approach significantly outperforms state-of-the-art subspace clustering methods.
References
 [1] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Trans. on Patt. Anal. and Mach. Intell., 25(2):218–233, February 2003.
 [2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2003.
 [3] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, June 2011.
 [4] W. R. Carson, M. Chen, M. R. D. Rodrigues, R. Calderbank, and L. Carin. Communications-inspired projection design with application to compressive sensing. SIAM J. Imaging Sci., 5(4):1185–1212, 2012.
 [5] G. Chen and G. Lerman. Spectral curvature clustering (SCC). International Journal of Computer Vision, 81(3):317–330, Mar. 2009.
 [6] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. on Patt. Anal. and Mach. Intell., 2013.

 [7] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. on Patt. Anal. and Mach. Intell., 23(6):643–660, June 2001.
 [8] A. Goh and R. Vidal. Segmenting motions of different types by unsupervised manifold clustering. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., 2007.
 [9] T. Hastie and P. Y. Simard. Metrics and models for handwritten character recognition. Statistical Science, 13(1):54–65, 1998.
 [10] O. Kuybeda, G. A. Frank, A. Bartesaghi, M. Borgnia, S. Subramaniam, and G. Sapiro. A collaborative framework for 3D alignment and classification of heterogeneous subvolumes in cryo-electron tomography. Journal of Structural Biology, 181:116–127, 2013.

 [11] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In International Conference on Machine Learning, 2010.
 [12] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, Dec. 2007.
 [13] Y. Ma, H. Derksen, W. Hong, and J. Wright. Segmentation of multivariate mixed data via lossy data coding and compression. IEEE Trans. on Patt. Anal. and Mach. Intell., 29(9):1546–1562, 2007.
 [14] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., 2010.
 [15] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
 [16] X. Shen and Y. Wu. A unified approach to salient object detection via low rank matrix recovery. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., June 2012.

 [17] M. Soltanolkotabi and E. J. Candès. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, 40(4):2195–2238, 2012.
 [18] M. Soltanolkotabi, E. Elhamifar, and E. J. Candès. Robust subspace clustering. CoRR, abs/1301.2603, 2013.
 [19] P. Sprechmann, A. M. Bronstein, and G. Sapiro. Learning efficient sparse and low rank models. CoRR, abs/1212.3631, 2012.
 [20] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9:137–154, 1992.
 [21] R. Vidal. Subspace clustering. Signal Processing Magazine, IEEE, 28(2):52–68, 2011.
 [22] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA). In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., 2003.
 [23] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., 2010.
 [24] Y. Wang and H. Xu. Noisy sparse subspace clustering. In International Conference on Machine Learning, 2013.
 [25] G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170:1039–1053, 1992.
 [26] J. Wright, M. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. on Patt. Anal. and Mach. Intell., 31(2):210–227, 2009.
 [27] J. Yan and M. Pollefeys. A general framework for motion segmentation: independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In Proc. European Conference on Computer Vision, 2006.
 [28] T. Zhang, A. Szlam, Y. Wang, and G. Lerman. Hybrid linear modeling via local best-fit flats. International Journal of Computer Vision, 100(3):217–240, 2012.
 [29] Z. Zhang, X. Liang, A. Ganesh, and Y. Ma. TILT: Transform invariant low-rank textures. In Proceedings of the 10th Asian Conference on Computer Vision, 2011.