I Introduction
The key to graph-based learning algorithms is a sparse eigenvalue problem, i.e., constructing a block-diagonal affinity matrix whose nonzero entries correspond to intra-class data points. Based on the affinity matrix, a series of algorithms can be derived for various tasks, e.g., data clustering and feature extraction. These algorithms are called graph-oriented learning methods [1, 2, 3], because the affinity matrix represents a similarity graph in which each vertex is a data point and each edge weight denotes the similarity between the connected vertices.
There are generally two ways to build a similarity graph. One is based on pairwise distances (e.g., Euclidean distance), and the other is based on reconstruction coefficients. The second method assumes that each data point can be represented as a linear combination of the other points, as intra-subspace points share the same basis. When the data are clean, i.e., the data points are strictly sampled from the subspaces, several approaches are able to recover the subspace structures [4, 5, 6]. However, in real applications, the data set may lie in the intersection of multiple subspaces or contain noise and outliers. As a result, inter-class data points may connect to each other with very high connection weights. Hence, eliminating the effects of errors is a major challenge. (Footnote 1: In this article, we treat noise, outliers, and other factors that impair the discrimination of the data representation as errors. Consequently, we use the terms denoising and error handling to denote eliminating the effects of errors.)
To address these problems, several algorithms have been proposed, e.g., Locally Linear Manifold Clustering (LLMC) [7], Agglomerative Lossy Compression (ALC) [8], Sparse Subspace Clustering (SSC) [9, 10], L1-graph [11, 12], Low Rank Representation (LRR) [13, 14, 15], Spectral Multi-Manifold Clustering (SMMC) [16], Latent Low Rank Representation (LatLRR) [17], Fixed Rank Representation (FRR) [18], Least Squares Regression (LSR) [19], and Sparse Representation Classifier Steered Discriminative Projection (SRCDR) [20]. In [21], Vidal provided a comprehensive survey of these algorithms in the context of subspace clustering.
Of the above methods, SSC [9, 10] and the L1-graph [11] obtain a sparse similarity graph from the sparsest coefficients. One of the main differences between these techniques is that [9, 10] formulate the noise and outliers in the objective function and provide more theoretical analysis, whereas [11] derives a series of algorithms upon the L1-graph for various tasks. The popular LRR model [13, 14] and its extensions [15, 17, 18] obtain a similarity graph from the lowest-rank representation rather than the sparsest one. Both $\ell_1$- and rank-minimization-based methods can automatically select the neighbors for each data point, and have achieved impressive results in numerous applications. However, they must solve a convex problem whose computational complexity is proportional to the cube of the problem size. Moreover, SSC requires that the corruption over each data point has a sparse structure, and LRR assumes that only a small portion of the data are contaminated (so-called sample-specific corruption); otherwise, the performance will be degraded. In fact, these two problems are mainly caused by the adopted error-handling strategy, i.e., removing the errors from the data set to obtain a clean dictionary over which each sample is encoded.
Although these methods have achieved considerable success, could we find a better way to obtain a block-diagonal affinity matrix? Based on the assumption that the trivial coefficients are always distributed over trivial data points, such as noise, outliers, or faraway data points, this article proposes a method of eliminating the effects of errors by encoding each sample over the dictionary and then zeroing the trivial components of the coefficients, i.e., encoding and then denoising from the representation.
To verify the effectiveness of our scheme, we develop a simple but effective algorithm, named L2-graph. Experimental results show that the proposed L2-graph is superior to state-of-the-art methods in subspace learning and subspace clustering with respect to accuracy, robustness, and computation time. Moreover, the theoretical analysis shows that our assumption is correct when the $\ell_p$-norm or nuclear norm is enforced over the representation.
Fig. 1: A data set and the similarity graph based on a linear regression model, where the points are categorized into two subjects. (a) Data set drawn from two subjects. (b) Similarity graph based on the linear regression model. (c) Representation coefficient of a data point based on the linear regression model. (d) L2-graph. (e) Representation coefficient of a data point achieved by the L2-graph. Here, the L2-graph disconnects the edges from faraway data points, which are regarded as a type of error.
Fig. 1 illustrates our basic idea and its effectiveness. For a given data set drawn from two clusters, its linear representation, which is non-sparse, as demonstrated in Figs. 1(b) and 1(c), cannot satisfy the sparsity requirement of graph-oriented approaches. After zeroing the trivial coefficients, as suggested by our scheme, we obtain a sparse similarity graph, called the L2-graph (Figs. 1(d) and 1(e)). The L2-graph removes the connections with faraway data points, as these are regarded as a type of error.
This example could be misconstrued as implying that the L2-graph is not very competitive, as a $k$NN graph based on the pairwise distance could also successfully separate these data points. In Section IV, we will show that this is not the case, and that the L2-graph is superior to the $k$NN graph by a considerable performance margin. Moreover, Fig. 1 shows that two points (so-called exemplars) are frequently connected with the other intra-subspace points. This implies that the L2-graph can reveal the latent structure of a data distribution, an ability that is important to many applications, such as data summarization [22]. This is another advantage of the L2-graph over a $k$NN graph based on pairwise distances.
Several aspects of this work should be highlighted:

We propose a novel scheme for finding sparse similarity graphs by eliminating the effect of errors from the representation rather than from the dictionary, and develop the L2-graph algorithm to corroborate the effectiveness of the scheme.

Under some conditions, we prove the correctness of the assumption that trivial coefficients are always distributed over trivial data points, when the $\ell_p$-norm or nuclear norm is enforced over the representation.

The L2-graph has an analytic solution that does not involve an iterative optimization process. Moreover, it avoids the need to construct different dictionaries for different data points. Thus, it is much faster than other graph-construction methods, such as the L1-graph.

The proposed algorithm is robust to various corruptions and real occlusions in the context of subspace learning and subspace clustering.
Except in some specified cases, lowercase bold letters represent column vectors and uppercase bold letters represent matrices. $\mathbf{A}^T$ denotes the transpose of the matrix $\mathbf{A}$, whose pseudo-inverse is $\mathbf{A}^{-1}$, and $\mathbf{I}$ denotes the identity matrix.
The rest of the article is organized as follows: Section II presents some graph construction methods, and Section III provides the theoretical analysis to show that it is feasible to eliminate the effects of errors from the representation rather than from the dictionary. Moreover, we propose the L2-graph algorithm, and derive a series of methods upon the L2-graph for subspace learning and subspace clustering. Section IV reports the results of a series of experiments to examine the effectiveness of the algorithm in the context of feature extraction, image clustering, and motion segmentation. Finally, Section V summarizes this work.
II Related Work
Over the past two decades, a number of graph-oriented algorithms have been proposed to address various problems, e.g., data clustering [23], dimension reduction [24], and object tracking [25]. The key to these algorithms is the construction of a similarity graph to depict the relationship between data points. Hence, the performance of the algorithms largely depends on whether the graph can accurately determine the neighborhood of each data point, particularly when the data contain many errors. Thus, eliminating the effects of errors to obtain a sparse similarity graph is a major challenge.
There are two ways to obtain a similarity graph, i.e., methods based on the pairwise distance and those that use reconstruction coefficients. The first scheme uses the distance between two points to measure their similarity. The distance between any two data points is independent of the other points, so the scheme is sensitive to noise and outliers. Alternatively, reconstruction-coefficient-based similarity is datum-adaptive. In other words, the similarity between two points may vary with the other data. This global property is beneficial to the robustness of such algorithms. Therefore, the second option has become increasingly popular, especially in high-dimensional data analysis.
Locally Linear Embedding (LLE) [1] was possibly the first work to construct a similarity graph using reconstruction coefficients. Specifically, the coefficient for each data point can be calculated by solving
$$\min_{\mathbf{c}_i} \; \|\mathbf{x}_i - \mathbf{D}_i \mathbf{c}_i\|_2^2 \quad \text{s.t.} \quad \mathbf{1}^T \mathbf{c}_i = 1, \tag{1}$$
where $\mathbf{c}_i$ is the coefficient of $\mathbf{x}_i$ over $\mathbf{D}_i$, and $\mathbf{D}_i$ consists of the $k$ nearest neighbors of $\mathbf{x}_i$ in Euclidean space. Clearly, the LLE-graph is sparse. A major problem with the LLE-graph is that it will not obtain a good result when the data are poorly or non-uniformly sampled.
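For illustration, the following NumPy sketch computes LLE-style reconstruction weights for one point, assuming (1) takes the standard LLE form (minimize the squared reconstruction error subject to the weights summing to one). The function name `lle_weights` and the regularization argument `reg` are hypothetical and are not taken from [1].

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Weights of x over its neighbors (one neighbor per column), summing to one."""
    # Shift the neighbors so that x sits at the origin.
    Z = neighbors - x[:, None]
    # Local Gram matrix, regularized because it is singular when k > dimension.
    G = Z.T @ Z
    G = G + reg * np.trace(G) * np.eye(G.shape[0])
    w = np.linalg.solve(G, np.ones(G.shape[0]))
    return w / w.sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))            # 20 points in R^5, one per column
i, k = 0, 4
dist = np.linalg.norm(X - X[:, [i]], axis=0)
nn = np.argsort(dist)[1:k + 1]              # k nearest neighbors, skipping x_i itself
w = lle_weights(X[:, i], X[:, nn])
```

Because the neighborhood is chosen by Euclidean distance before the weights are computed, the quality of the weights depends on how well the data are sampled, which is the weakness noted above.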
Recently, some studies have exploited the inherent sparsity of sparse representation to obtain a block-diagonal affinity matrix, e.g., SSC [9, 10], the L1-graph [11, 12], and Sparsity Preserving Projection (SPP) [26].
In [11], the L1-graph is proposed for image analysis, which solves the following problem:
$$\min_{\mathbf{c}_i} \; \|\mathbf{c}_i\|_1 \quad \text{s.t.} \quad \|\mathbf{x}_i - \mathbf{D}_i \mathbf{c}_i\|_2 \le \epsilon, \tag{2}$$
where $\mathbf{c}_i$ is the sparse representation of $\mathbf{x}_i$ over the dictionary $\mathbf{D}_i$, and $\epsilon$ is the error tolerance.
SSC [9, 10] was proposed for subspace segmentation, which solves the following problem:
$$\min_{\mathbf{C}, \mathbf{E}, \mathbf{Z}} \; \|\mathbf{C}\|_1 + \lambda_E \|\mathbf{E}\|_1 + \frac{\lambda_Z}{2} \|\mathbf{Z}\|_F^2 \quad \text{s.t.} \quad \mathbf{X} = \mathbf{X}\mathbf{C} + \mathbf{E} + \mathbf{Z}, \; \mathrm{diag}(\mathbf{C}) = \mathbf{0}, \tag{3}$$
where $\mathbf{C}$ is the sparse representation of the data set $\mathbf{X}$, $\mathbf{E}$ corresponds to the sparse outlying entries, $\mathbf{Z}$ denotes the reconstruction errors owing to the limited representational capability, and the parameters $\lambda_E$ and $\lambda_Z$ balance the three terms in the objective function.
From (2) and (3), we can see that both the L1-graph and SSC aim to obtain a similarity graph by solving an $\ell_1$-minimization problem. In fact, the objective functions will be the same if $\mathbf{X}$ does not contain outliers, and if the error tolerance in (2) and $\lambda_Z$ in (3) are selected so as to correspond. However, these schemes have contributed to this field in different ways. Elhamifar et al. proved that SSC will only choose intra-subspace data points to represent each other if $\mathbf{X}$ is drawn from the union of independent or disjoint subspaces, whereas Cheng et al. demonstrated the success of the L1-graph in several important applications, such as feature extraction (unsupervised and semi-supervised) and image clustering.
Another recently proposed method, LRR [13, 14], aims to find the lowest-rank representation, rather than the sparsest, by solving
$$\min_{\mathbf{C}, \mathbf{E}} \; \mathrm{rank}(\mathbf{C}) + \lambda \|\mathbf{E}\|_{2,1} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{X}\mathbf{C} + \mathbf{E}, \tag{4}$$
where $\mathbf{C}$ is the coefficient matrix of $\mathbf{X}$ over the data set itself, and $\mathbf{E}$ is used to deal with sample-specific corruptions.
As the rank-minimization problem is not convex, Liu et al. replaced the rank of $\mathbf{C}$ by its nuclear norm, i.e.,
$$\min_{\mathbf{C}, \mathbf{E}} \; \|\mathbf{C}\|_* + \lambda \|\mathbf{E}\|_{2,1} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{X}\mathbf{C} + \mathbf{E}, \tag{5}$$
where $\|\mathbf{C}\|_* = \sum_i \sigma_i(\mathbf{C})$, and $\sigma_i(\mathbf{C})$ is the $i$th singular value of $\mathbf{C}$.
The objective functions of the L1-graph, SSC, and LRR can be solved using convex optimization algorithms, e.g., the Augmented Lagrange Multiplier method (ALM) [27]. As a result, the computational complexity of these methods is at least proportional to the cube of the problem size. Moreover, it is easy to show that these methods obtain the optimal representation over a clean dictionary. In other words, the methods eliminate the effect of errors by removing them from the data set and then encoding each sample over a clean dictionary. This denoising strategy has been extensively adopted in this community [15, 17, 18, 28, 29, 30].
III Learning with the L2-graph
In this section, we prove the correctness of our scheme, i.e., the effects of errors can be eliminated by zeroing the trivial coefficients. Moreover, we develop the L2-graph algorithm for subspace learning and subspace clustering.
III-A Theoretical Analysis
This article is based on the assumption that the trivial coefficients are always distributed over the trivial data points (i.e., errors). In this subsection, we prove the correctness of this assumption in two steps. Lemmas 1–3 show that the method will perform well when the $\ell_p$-norm is enforced over the representation. It should be pointed out that the proof is motivated by the theoretical analysis in [10]. Moreover, Lemmas 4–6 show the correctness of our assumption when the nuclear norm is enforced over the representation. Lemma 2 is a preliminary step toward Lemma 3, and Lemmas 4 and 5 are preliminaries for Lemma 6. For clarity, we provide proofs of these lemmas in the appendix.
Let be a data point in the subspace that is spanned by , where is the clean data set and consists of the errors that probably exist in . Without loss of generality, let denote the subspace spanned by and denote the subspace spanned by . Hence, is drawn from the intersection between and , or from except the intersection, denoted as or .
III-A1 $\ell_p$-norm-based Model
Let and be the optimal solutions of
(6) 
over and , where denotes the norm and . We aim to investigate the conditions under which, for every nonzero data point , if the norm of is smaller than that of , then and such that . Here, denotes the th largest norm of the coefficients of , and is the dimensionality of .
Lemma 1
For any nonzero data point in the subspace except the intersection between and , i.e., , the optimal solution of (6) over is given by , which is partitioned according to the sets and , i.e., . Thus, we must have .
Lemma 2
Consider a nonzero data point in the intersection between and , i.e., , where and denote the subspace spanned by the clean data set and the errors , respectively. Let , , and be the optimal solution of
(7) 
over , , and , where denotes the norm, , and are partitioned according to the sets . If , then and .
Definition 1 (The First Principal Angle)
Let be a Euclidean vector space, and consider the two subspaces , with . There exists a set of angles called the principal angles, the first one being defined as:
(8) 
where and .
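As a concrete reading of Definition 1, the sketch below computes the first principal angle numerically, assuming the usual definition in which its cosine is the largest inner product between unit vectors taken from the two subspaces. The function name `first_principal_angle` is hypothetical, and each subspace is assumed to be given by a basis matrix with full column rank.

```python
import numpy as np

def first_principal_angle(A, B):
    """Smallest principal angle (radians) between span(A) and span(B)."""
    Qa, _ = np.linalg.qr(A)   # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)   # orthonormal basis of span(B)
    # The singular values of Qa^T Qb are the cosines of the principal angles,
    # sorted in descending order, so the first one gives the smallest angle.
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return float(np.arccos(np.clip(s[0], -1.0, 1.0)))

# Two lines in the plane that are 45 degrees apart.
A = np.array([[1.0], [0.0]])
B = np.array([[1.0], [1.0]])
theta = first_principal_angle(A, B)
```

A small first principal angle means the two subspaces are close to each other, which is the situation in which separating clean data from errors is hardest.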
Lemma 3
Consider the nonzero data point in the intersection between and , i.e., , where and denote the subspace spanned by the clean data set and the errors , respectively. The dimensionality of is , and that of is . Let be the optimal solution of
(9) 
over , where denotes the norm, , and are partitioned according to the sets and . If the sufficient condition
(10) 
is satisfied, then . Here, is the smallest nonzero singular value of , is the first principal angle between and , is the maximum norm of the columns of , and denotes the th largest norm of the coefficients of .
III-A2 Nuclear-norm-based Model
Based on two existing conclusions [31, 15], we theoretically show that our scheme is correct when the nuclear norm is used to obtain the lowest-rank representation.
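Lemma 4 ([31])
Let $\mathbf{D} = \mathbf{U}_r \boldsymbol{\Sigma}_r \mathbf{V}_r^T$ be the skinny singular value decomposition (SVD) of the data matrix $\mathbf{D}$. The unique solution to
$$\min_{\mathbf{C}} \; \|\mathbf{C}\|_* \quad \text{s.t.} \quad \mathbf{D} = \mathbf{D}\mathbf{C} \tag{11}$$
is given by $\mathbf{C}^* = \mathbf{V}_r \mathbf{V}_r^T$, where $r$ is the rank of $\mathbf{D}$.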
Lemma 5 ([15])
Let be the SVD of the data matrix . The optimal solution to
(12) 
is given by and , where , , and are the top singular values and singular vectors of , respectively.
Lemma 6
Let be the skinny SVD of the optimal solution to
(13) 
where consists of the clean data set and the errors , i.e., .
The optimal solution to
(14) 
is given by , where is a truncation operator that retains the first elements and sets the other elements to zero, , and is the th largest singular value of .
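The closed form in Lemma 4 can be checked numerically. The sketch below assumes that the lemma states the well-known result that the minimum-nuclear-norm solution of the self-representation constraint is the product of the right singular vectors of the skinny SVD with their transpose; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Rank-3 data matrix: 30 points (columns) in R^10.
D = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 30))

U, s, Vt = np.linalg.svd(D, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))     # numerical rank of D
Vr = Vt[:r].T                          # right singular vectors of the skinny SVD
C = Vr @ Vr.T                          # candidate solution V_r V_r^T

# C is an orthogonal projector of rank r, so its nuclear norm equals r.
nuclear_norm = np.linalg.svd(C, compute_uv=False).sum()
```

The candidate reproduces the data exactly (the product of `D` and `C` equals `D`), and its singular values are all one or zero, which is why truncating the small singular values, as in Lemma 6, acts directly on the coefficients.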
III-B Algorithm for Constructing the L2-graph
Let be a set of linear subspaces embedded into , be a collection of data points located in the union of the , and be the specified dictionary for the data point , where .
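To construct a similarity graph, the L2-graph algorithm calculates the representation $\mathbf{c}_i$ of each data point $\mathbf{x}_i$ over its dictionary $\mathbf{D}_i$ by solving
$$\min_{\mathbf{c}_i} \; \|\mathbf{x}_i - \mathbf{D}_i \mathbf{c}_i\|_2^2 + \lambda \|\mathbf{c}_i\|_2^2, \tag{15}$$
where $\lambda > 0$ is a regularization parameter.
For each $\mathbf{x}_i$, solving the optimization problem (15) gives
$$\mathbf{c}_i^* = (\mathbf{D}_i^T \mathbf{D}_i + \lambda \mathbf{I})^{-1} \mathbf{D}_i^T \mathbf{x}_i. \tag{16}$$
This solution requires the calculation of $(\mathbf{D}_i^T \mathbf{D}_i + \lambda \mathbf{I})^{-1}$ for each $\mathbf{x}_i$, which is very inefficient. Hence, we derive another solution for the optimization problem in Theorem 1.
Theorem 1
The optimal solution of the problem (15) is given by
$$\mathbf{c}_i^* = \mathbf{P} \left( \mathbf{X}^T \mathbf{x}_i - \frac{\mathbf{e}_i^T \mathbf{Q} \mathbf{x}_i}{\mathbf{e}_i^T \mathbf{P} \mathbf{e}_i} \, \mathbf{e}_i \right), \tag{17}$$
where $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n]$ is the data matrix, $\mathbf{Q} = \mathbf{P}\mathbf{X}^T$, $\mathbf{P} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1}$, and the union of $\mathbf{e}_i$ ($i = 1, \dots, n$) is the standard orthogonal basis of $\mathbb{R}^n$, i.e., all entries in $\mathbf{e}_i$ are zero, except for the $i$th entry, which is one.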
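To make the efficiency argument behind Theorem 1 concrete, the following NumPy sketch compares a per-point ridge regression with a variant that reuses a single matrix inverse for all points. It assumes that the dictionary of each point is the data matrix with that point removed, and that the shared-inverse form is the ridge solution under the constraint that the self-coefficient is zero. The function names `ridge_direct` and `ridge_shared` are hypothetical and are not the authors' MATLAB code.

```python
import numpy as np

def ridge_direct(X, i, lam):
    """Ridge coefficients of column i over the remaining columns (one inverse per point)."""
    n = X.shape[1]
    idx = np.delete(np.arange(n), i)
    Di = X[:, idx]
    ci = np.linalg.solve(Di.T @ Di + lam * np.eye(n - 1), Di.T @ X[:, i])
    c = np.zeros(n)
    c[idx] = ci                      # re-insert a zero at position i
    return c

def ridge_shared(X, lam):
    """All coefficient vectors at once (one per column) from a single shared inverse."""
    n = X.shape[1]
    P = np.linalg.inv(X.T @ X + lam * np.eye(n))
    QX = P @ X.T @ X                 # column i equals P X^T x_i
    gamma = np.diag(QX) / np.diag(P) # multiplier that forces the i-th entry to zero
    return QX - P * gamma            # column i: P (X^T x_i - gamma_i e_i)

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 12))     # 12 points in R^6, one per column
C = ridge_shared(X, 0.1)
max_gap = max(np.abs(C[:, i] - ridge_direct(X, i, 0.1)).max() for i in range(12))
```

Both routes give the same coefficients up to rounding, but the shared version inverts one matrix instead of one per data point, which is where the speed advantage claimed for the analytic solution comes from.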
For a given data set, the construction process of the L2-graph is summarized as follows:

Eliminate the effects of errors by performing the $k$NN or $\epsilon$-ball method over the coefficients of each data point, e.g., by applying an operator that retains the $k$ largest coefficients and sets the other entries to zero.

Construct a similarity graph by connecting node , denoted by , with node , denoted by . Assign the connection weight between and , where is an element of .
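The construction steps above can be sketched as follows. Two details are assumptions made for illustration: the largest coefficients are selected by magnitude, and the symmetric connection weight is the sum of the absolute values of the two mutual coefficients. The function names `keep_k_largest` and `build_affinity` are hypothetical.

```python
import numpy as np

def keep_k_largest(c, k):
    """Zero all but the k entries of c with the largest magnitude."""
    out = np.zeros_like(c)
    top = np.argsort(np.abs(c))[-k:]
    out[top] = c[top]
    return out

def build_affinity(C, k):
    """C holds one coefficient vector per column; returns a symmetric affinity matrix."""
    C_hat = np.column_stack([keep_k_largest(C[:, i], k) for i in range(C.shape[1])])
    A = np.abs(C_hat)
    return A + A.T                   # assumed weight: |c_ij| + |c_ji|

# Toy coefficient matrix for three points (zero diagonal: no self-representation).
C = np.array([[0.00,  0.9,  0.1],
              [0.80,  0.0, -0.2],
              [0.05, -0.3,  0.0]])
W = build_affinity(C, k=1)
```

With `k=1`, each point keeps only its strongest connection, so the small coefficients that link it to the remaining points are removed before the graph is symmetrized.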
Note that the objective function (15) is actually derived from the well-known ridge regression [32], which has been extensively used in different works. Based on the ridge regression model, Zhang et al. [33] presented a classifier (named CRC) for face recognition, and Lu et al. [19] proposed the LSR for subspace clustering. Although these methods are derived from the same model, their solutions are totally different. Moreover, Lu et al. proved that LSR will produce a block-diagonal matrix if the data are drawn from sufficiently independent subspaces, whereas our theoretical analysis shows that the L2-graph will obtain a sparse similarity graph even when the data lie in multiple dependent subspaces. Finally, the L2-graph eliminates the effect of errors by zeroing the trivial coefficients. This is a novel way of handling errors in data.
III-C Computational Complexity Analysis
Suppose the data points are drawn from a union of subspaces. The L2graph takes to construct and store the projection matrix . It then projects each data point into another space via (17), with complexity . Moreover, to eliminate the effects of errors in the data, it requires to find the largest coefficients. Putting everything together, the computational complexity of constructing the graph is .
The complexity of our method is high, meaning that a medium-sized data set will bring about scalability issues. However, the L2-graph will not fall into local minima, and is more efficient than its counterparts. For example, $\ell_1$-norm-based methods [10, 11, 26] need operations to construct a similarity graph using the Homotopy optimizer [34] to find the sparsest solution, where denotes the number of iterations of the Homotopy algorithm. According to [35], the Homotopy optimizer is one of the fastest $\ell_1$-minimization algorithms. Moreover, LRR has a complexity of at least to solve the rank-minimization problem, where is the rank of the dictionary and is the number of iterations of the ALM method [27].
Once the L2-graph has been built, we can follow the scheme of graph-oriented learning methods to integrate various algorithms for different tasks, e.g., subspace learning and subspace clustering. In the following, we briefly describe how to apply the L2-graph for these tasks.
III-D Subspace Learning with the L2-graph
Subspace learning aims to find a projection matrix to transform a high-dimensional datum into a lower one via . According to [3], most existing dimension reduction techniques can be unified into a graph framework, i.e., they embed the similarity graph from the input space into a low-dimensional space. In this article, following the Neighborhood Preserving Embedding (NPE) program [37], we conduct subspace learning on the L2-graph by solving
(18)
where is the affinity matrix produced by the L2-graph, and the constraint term provides scale invariance.
The optimal solution of (18) is achieved by solving
(19) 
Clearly, the optimal solution consists of the eigenvectors corresponding to the smallest nonzero eigenvalues of the above generalized eigenvalue problem.
III-E Subspace Segmentation with the L2-graph
Because of its effectiveness, spectral clustering is one of the most popular subspace clustering methods. The basis of spectral clustering is a sparse eigenvalue problem, i.e., the construction of a similarity graph in which only the intra-subspace points connect with each other. In this subsection, we demonstrate spectral clustering [4] using the L2-graph.

Construct a Laplacian matrix using the affinity matrix produced by the L2-graph, where with .

Obtain a matrix using the first normalized eigenvectors of that correspond to its smallest nonzero eigenvalues, where is the number of clusters.

Calculate the clustering membership of the data by performing a $k$-means clustering algorithm over the rows of .
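The three steps above can be sketched in NumPy as follows. Several details are assumptions: the Laplacian is taken to be the symmetric normalized Laplacian, the rows of the eigenvector matrix are normalized to unit length, the sketch simply uses the eigenvectors with the smallest eigenvalues, and k-means is a plain Lloyd iteration with farthest-point initialization. The function names are hypothetical.

```python
import numpy as np

def kmeans(V, k, n_iter=50):
    """Plain Lloyd iterations with a deterministic farthest-point initialization."""
    centers = [V[0]]
    for _ in range(1, k):
        d = np.min([np.sum((V - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(V[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = V[labels == j].mean(axis=0)
    return labels

def spectral_clustering(A, k):
    """Cluster the nodes of a symmetric affinity matrix A into k groups."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}.
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)      # eigenvalues are returned in ascending order
    V = vecs[:, :k]
    V = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    return kmeans(V, k)

# Block-diagonal affinity: two groups of three mutually connected points.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)
labels = spectral_clustering(A, 2)
```

On a perfectly block-diagonal affinity matrix the embedded rows collapse to one point per block, so even this simple k-means recovers the two groups; the quality of the affinity matrix is therefore the deciding factor.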
IV Experimental Verification and Analysis
In this section, we evaluate the performance of the L2-graph in the context of unsupervised subspace learning, data clustering, and motion segmentation. We consider the results in terms of three aspects: 1) accuracy, 2) robustness, and 3) computational cost.
IV-A Experimental Configuration
Baseline: We compare the L2-graph with several state-of-the-art algorithms, which are listed in TABLE I. Note that Locality Preserving Projections (LPP) and the G-graph construct a similarity graph using the heat-kernel function with the Euclidean distance, but embed the graph in different ways. The G-graph takes the embedding function of NPE, which calculates the similarity among data points in the same way as LLE. Specifically, NPE and LLE obtain a local dictionary for each data point via a $k$NN search using the Euclidean distance. Moreover, we solve the objective function of the L1-graph using the Homotopy optimizer [34]. According to [35], this is one of the most competitive $\ell_1$-minimization algorithms in terms of accuracy, robustness, and convergence speed. We downloaded the MATLAB code for LRR and LatLRR from the homepage of Liu (https://sites.google.com/site/guangcanliu/). To allow for a fair comparison, we adopt the embedding function (18) and the spectral clustering algorithm [4] over the G-graph, L1-graph, LRR-graph, LatLRR-graph, and L2-graph. (The MATLAB code for the L2-graph and the databases used in the experiment can be downloaded from https://www.dropbox.com/sh/hkwyc2v97mvp96q/Q3ZcmHiSG.) Similar to [11, 17], in each experiment, we tune the parameters of all methods to obtain the best results.
Algorithms  Parameters  Similarity metrics  SL  SC

PCA [36]
LPP [6]  ,  Eu.+Heat kernel
NPE [37]  Locally Linear rep.
LLE-graph [1]  Locally Linear rep.
L1-graph [11]  ,  $\ell_1$-norm based rep.
G-graph [4]  ,  Eu.+Heat kernel
LRR-graph [38]  Low rank rep.
LatLRR-graph [17]  Low rank rep.
L2-graph  ,  Linear rep.
Databases: We examine the performance of the algorithms using several popular image databases, i.e., Extended Yale B (ExYaleB) [39], AR [40], Multiple PIE (MPIE) [41], and COIL-100 [42]. TABLE II provides an overview of these databases. ExYaleB contains 2414 frontal-face images of 38 subjects (about 64 images for each subject), and we use the first 58 samples of each subject. Moreover, we extract three subsets from the AR database by randomly selecting from 50 male and 50 female subjects. AR1 contains 1400 clean images, AR2 contains 600 images occluded by sunglasses and 600 clean images, and AR3 contains 600 images occluded by scarves and 600 clean samples. MPIE contains the facial images of 337 subjects captured in four sessions with simultaneous variations in pose, expression, and illumination. Limited by the computational capacity, we perform experiments over the first seven samples per subject of Session 1 and all images of the other sessions.
Database  Subjects  Samples per subject  Original size  Cropped size  Feature Dim.

ExYaleB  38  58  192×168  54×48  116
AR1  100  14  165×120  55×40  167
AR2  100  12  165×120  55×40  173
AR3  100  12  165×120  55×40  170
MPIE-S1  249  7  100×82  50×41  91
MPIE-S2  203  10  100×82  50×41  103
MPIE-S3  164  10  100×82  50×41  100
MPIE-S4  176  10  100×82  50×41  94
COIL-100  100  10  128×128  64×64  280
IV-B Subspace Learning
To compare the performance of the algorithms, the L2-graph, Principal Component Analysis (PCA), LPP, NPE, G-graph, LRR-graph, and LatLRR-graph are used to extract the features of some databases. Similar to [11], the 1-NN classifier is then applied to the features to calculate the classification accuracy. Note that LPP, NPE, L2-graph, LRR-graph, and LatLRR-graph are graph-oriented algorithms, but focus on different problems. LPP and NPE embed a similarity graph from a high-dimensional space into a low-dimensional one, whereas the L2-graph, LRR-graph, and LatLRR-graph mainly focus on the construction of a similarity graph. In this experiment, we did not investigate the performance of the L1-graph, as its optimization program requires the dictionary to be an underdetermined matrix. This requirement is questionable, because it means that dimension reduction based on the L1-graph depends on the result of other dimension reduction techniques when the data size is less than its dimensionality.
IV-B1 Subspace Learning on Clean Facial Images
We randomly draw 3, 5, 7, 9, and 11 of the 14 images per subject in AR1 for use as a training data set, and use the remaining data for testing. Thus, we form five different data sets from AR1. Similarly, we select 10, 20, 29 (half of the total number of samples), and 40 images from the ExYaleB data to give four data sets.
TABLE III and TABLE IV report the results obtained from each algorithm. We make the following observations:

The L2-graph outperforms the other methods in all tests, whereas LPP achieves the worst results.

NPE generally achieves the second-best result over all databases. The LRR-graph and LatLRR-graph are somewhat superior to LPP, PCA, and the G-graph. Moreover, the LatLRR-graph outperforms the LRR-graph in all the tests.

Representation-based similarity graphs (L2-graph, NPE, LRR-graph, and LatLRR-graph) are more competitive than pairwise distance-based ones (LPP and G-graph).
Training number  L2-graph  LPP  NPE  G-graph  PCA  LRR-graph  LatLRR-graph

3  86.46 (179)  64.09 (111)  85.00 (156)  81.00 (278)  74.91 (201)  80.18 (298)  81.09 (244) 
5  93.11 (430)  61.89 (186)  91.78 (239)  80.67 (461)  79.80 (154)  87.56 (498)  83.78 (499) 
7  95.57 (456)  74.71 (272)  94.57 (275)  81.43 (283)  86.71 (506)  89.14 (257)  72.43 (332) 
9  95.02 (491)  76.60 (366)  94.00 (319)  87.80 (567)  85.20 (332)  91.80 (317)  81.80 (260) 
11  92.18 (240)  73.36 (113)  85.55 (150)  90.00 (283)  76.46 (293)  86.82 (300)  87.00 (296) 
Training number  L2-graph  LPP  NPE  G-graph  PCA  LRR-graph  LatLRR-graph

10  92.11 (334)  72.70 (127)  90.68 (190)  87.06 (374)  73.08 (338)  91.61 (349)  89.80 (326) 
20  90.72 (554)  76.73 (312)  87.54 (405)  81.23 (596)  76.11 (353)  89.96 (309)  88.99 (502) 
29  98.46 (264)  86.66 (436)  97.73 (178)  84.30 (297)  93.74 (407)  93.01 (300)  94.74 (585) 
40  98.39 (321)  86.70 (594)  98.39 (492)  88.01 (579)  96.49 (344)  95.47 (496)  94.88 (510) 
IV-B2 Subspace Learning on Corrupted Facial Images
In this section, we investigate the robustness of the algorithms to two types of corruption using the ExYaleB database. For each subject, we randomly choose half of the images (29 samples per subject) and corrupt them using white Gaussian noise or random pixel corruption. We then randomly divide the 58 images into two groups of equal size, one for training and the other for testing. Thus, both the training data and the testing data may be contaminated by noise. For the chosen image , we add white Gaussian noise according to , where , is a corruption ratio of either 10% or 30%, and the noise follows a standard normal distribution. For the random pixel corruption, we replace the value of a randomly selected percentage of pixels in image with values from a uniform distribution over , where is the largest pixel value of . Clearly, white Gaussian noise is additive, whereas random pixel corruption is non-additive.
TABLE V reports the classification accuracy of the 1-NN classifier using each of the algorithms. This demonstrates that:

The L2-graph achieves the best results in all tests, except when 30% of the data are corrupted by random pixel corruption, in which case it is second best.

In general, the representation-based methods (L2-graph, NPE, LRR-graph, and LatLRR-graph) outperform the pairwise distance-based methods (LPP and the G-graph). This corroborates the claim that representation-based methods are more robust than pairwise distance-based ones.

The algorithms perform better when the data contain white Gaussian noise than when they are affected by random pixel corruption.
Corruption + noise level  L2-graph  LPP  NPE  G-graph  PCA  LRR-graph  LatLRR-graph

Gaussian + 10%  95.28 (519)  82.67 (495)  94.37 (536)  84.57 (506)  79.40 (474)  92.02 (385)  91.11 (384) 
Gaussian + 30%  92.65 (259)  71.87 (444)  89.93 (377)  66.97 (529)  70.51 (128)  87.39 (370)  85.21 (421) 
Random Pixel + 10%  87.75 (298)  57.53 (451)  86.48 (392)  55.81 (514)  69.78 (96)  80.49 (351)  77.13 (381) 
Random Pixel + 30%  68.97 (425)  45.83 (378)  71.87 (516)  46.82 (480)  61.07 (600)  58.89 (361)  57.17 (364) 
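The two corruption models used in this subsection can be sketched as follows. The exact formulas are not recoverable from the text, so both functions are assumptions: the Gaussian model adds standard normal noise scaled by the corruption ratio, and the random pixel model replaces a fraction of the pixels by uniform draws between zero and the largest pixel value of the image. The function names are hypothetical.

```python
import numpy as np

def add_gaussian_noise(x, ratio, rng):
    """Additive white Gaussian noise scaled by `ratio` (assumed form: x + ratio * n)."""
    return x + ratio * rng.standard_normal(x.shape)

def corrupt_random_pixels(x, fraction, rng):
    """Replace a random `fraction` of the pixels by uniform draws from [0, x.max()]."""
    out = x.astype(float).copy()
    n_corrupt = int(round(fraction * out.size))
    idx = rng.choice(out.size, size=n_corrupt, replace=False)
    out.flat[idx] = rng.uniform(0.0, x.max(), size=n_corrupt)
    return out

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8)).astype(float)   # toy 8x8 "image"
noisy = add_gaussian_noise(image, 0.1, rng)
corrupted = corrupt_random_pixels(image, 0.3, rng)
```

The first model perturbs every pixel slightly, whereas the second destroys a subset of pixels completely and leaves the rest untouched, which is consistent with the observation that random pixel corruption is the harder case.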
IV-C Data Clustering
In this subsection, we investigate the clustering quality of the L2-graph, G-graph, LLE-graph, L1-graph, LRR-graph, and LatLRR-graph using clean, corrupted, and occluded images.
Two popular metrics, the accuracy (AC, sometimes called the purity) [43] and the normalized mutual information (NMI) [44], are used to evaluate the clustering quality. AC or NMI values of 1 indicate perfect matching with the ground truth, whereas 0 indicates a perfect mismatch.
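As a rough illustration of the two metrics, the sketch below computes purity and NMI with the Python standard library only. It assumes that AC is measured as purity, as the text suggests, and that NMI is the mutual information divided by the geometric mean of the two label entropies; the definitions in [43] and [44] may differ in detail. The function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt

def purity(truth, pred):
    """Fraction of points that belong to the majority class of their cluster."""
    total = 0
    for cluster in set(pred):
        members = [t for t, p in zip(truth, pred) if p == cluster]
        total += Counter(members).most_common(1)[0][1]
    return total / len(truth)

def entropy(labels):
    """Shannon entropy (natural logarithm) of a label sequence."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def nmi(truth, pred):
    """Mutual information normalized by sqrt(H(truth) * H(pred))."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    ct, cp = Counter(truth), Counter(pred)
    mi = sum(c / n * log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    denom = sqrt(entropy(truth) * entropy(pred))
    return mi / denom if denom > 0 else 0.0

truth = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]      # same partition with swapped cluster names
unrelated = [0, 1, 0, 1]    # partition that is independent of the ground truth
```

Both metrics ignore the names of the clusters, so `perfect` scores 1 on each of them even though its labels are swapped relative to `truth`.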
IV-C1 Clustering of Clean Images
Seven image data sets (AR1, ExYaleB, MPIE-S1, MPIE-S2, MPIE-S3, MPIE-S4, and COIL-100) are used in this experiment. TABLE VI shows that the L2-graph achieves the best results in the tests, except with MPIE-S4, where it is second best. With respect to the AR1 database, the AC of the L2-graph is about 45.00% higher than that of the G-graph, 37.57% higher than that of the LLE-graph, 7.29% higher than that of the L1-graph, 7.71% higher than that of the LRR-graph, and 8.29% higher than that of the LatLRR-graph. In addition, the LRR-graph and LatLRR-graph, which exhibit similar performance, are far superior to the G-graph, LLE-graph, and L1-graph. Cheng et al. [11] reported that the clustering accuracy of their L1-graph with ExYaleB is about 78.5%, which is higher than the 71.14% achieved in our experiment. This could be because they used a different solver to compute the sparsest representation. However, no details were reported in their paper. Another possible reason could be differences in data processing. In their experiments, each image was cropped to pixels and had its dark homogeneous background removed, whereas we performed the experiments over 116 features extracted by PCA. In spite of this, the accuracy of the L1-graph reported in their work (78.5%) is still lower than that of the L2-graph (86.78%).
Databases  L2-graph AC  L2-graph NMI  G-graph AC  G-graph NMI  LLE-graph AC  LLE-graph NMI  L1-graph AC  L1-graph NMI  LRR-graph AC  LRR-graph NMI  LatLRR-graph AC  LatLRR-graph NMI

ExYaleB  86.78  92.84  40.34  52.63  51.82  61.61  71.14  77.92  85.25  91.19  83.85  91.69
AR1  80.50  91.99  35.50  64.65  42.93  70.14  73.21  88.37  72.79  89.49  72.21  91.13
MPIE-S1  88.12  96.75  27.71  70.31  40.22  76.57  68.39  89.60  83.88  95.75  84.17  96.20
MPIE-S2  82.32  96.23  29.47  72.45  31.77  74.21  76.60  95.27  81.03  96.73  80.89  96.72
MPIE-S3  77.56  94.53  25.85  70.79  28.48  72.44  66.83  92.05  75.61  95.40  75.98  95.37
MPIE-S4  82.96  96.24  29.83  71.80  42.96  80.60  77.84  95.31  83.24  97.09  83.01  97.00
COIL-100  52.40  77.57  44.80  73.47  48.60  75.30  51.40  76.93  50.10  76.29  42.50  73.89
IV-C2 Clustering of Corrupted Images
To examine the robustness of the algorithms, we divide ExYaleB into two parts of equal size and corrupt one part with Gaussian noise or random pixel corruption. For each type, the corruption percentage increases from 10% to 90% in intervals of 20%. Fig. 2 shows some samples.
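| Corruption | Ratio (%) | L2graph AC | L2graph NMI | Ggraph AC | Ggraph NMI | LLEgraph AC | LLEgraph NMI | L1graph AC | L1graph NMI | LRRgraph AC | LRRgraph NMI | LatLRRgraph AC | LatLRRgraph NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gaussian | 10 | 89.26 | 92.71 | 46.37 | 60.41 | 47.82 | 69.4 | 70.191 | 75.57 | 87.79 | 92.12 | 89.16 | 93.20 |
| Gaussian | 30 | 88.70 | 92.18 | 47.73 | 60.12 | 46.51 | 59.84 | 69.28 | 73.66 | 81.31 | 86.05 | 81.72 | 87.59 |
| Gaussian | 50 | 86.57 | 90.43 | 39.52 | 55.25 | 37.48 | 52.1 | 65.93 | 71.10 | 84.96 | 79.15 | 79.17 | 83.53 |
| Gaussian | 70 | 74.32 | 77.70 | 36.21 | 47.69 | 32.76 | 44.96 | 59.35 | 66.08 | 60.66 | 69.57 | 69.65 | 75.74 |
| Gaussian | 90 | 56.31 | 63.43 | 32.21 | 45.3 | 29.81 | 42.9 | 52.41 | 61.43 | 49.96 | 57.9 | 47.64 | 58.11 |
| Random pixel | 10 | 85.52 | 88.64 | 50.23 | 61.28 | 46.82 | 59.26 | 67.59 | 58.24 | 78.68 | 87.19 | 84.39 | 88.99 |
| Random pixel | 30 | 68.97 | 75.89 | 35.12 | 48.11 | 33.26 | 42.33 | 59.44 | 64.79 | 60.80 | 67.47 | 49.91 | 64.31 |
| Random pixel | 50 | 48.15 | 56.67 | 39.52 | 55.25 | 19.51 | 27.77 | 43.51 | 48.95 | 38.61 | 49.93 | 27.18 | 39.80 |
| Random pixel | 70 | 34.98 | 45.56 | 14.88 | 21.48 | 13.39 | 18.82 | 33.26 | 38.42 | 30.54 | 38.13 | 13.02 | 21.52 |
| Random pixel | 90 | 30.04 | 38.39 | 12.25 | 20.43 | 14.07 | 23.04 | 25.95 | 34.30 | 19.01 | 29.16 | 14.61 | 24.22 |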

From TABLE VII, we can observe that:

In general, the algorithms perform better when the data are contaminated by white Gaussian noise than by random pixel corruption. Moreover, the AC and NMI of each method tend to decrease as the corruption ratio increases.

The L2graph is more robust than the other graphs. For example, the difference in AC between the L2graph and the L1graph varies from 3.90% (90% corruption ratio) to 20.64% (50% corruption ratio) with Gaussian noise, and from 1.72% (70% corruption ratio) to 17.93% (10% corruption ratio) under random pixel corruption.

The LRRgraph and LatLRRgraph are more robust than the L1graph under Gaussian corruption. This could be because the LRRgraph and LatLRRgraph formulate additive noise into their objective function, whereas the L1graph does not.

The LatLRRgraph outperforms the LRRgraph when the images are contaminated by Gaussian noise, but the LRRgraph is superior under random pixel corruption.
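The AC gaps between the L2graph and the L1graph can be recomputed directly from the AC columns of TABLE VII. The snippet below is a small arithmetic check. The values are copied from the table as printed (including the L1graph entry 70.191), and the helper name `ac_gaps` is hypothetical.

```python
# AC values (%) copied from TABLE VII, indexed by corruption ratio (%).
l2_ac = {
    "Gaussian": {10: 89.26, 30: 88.70, 50: 86.57, 70: 74.32, 90: 56.31},
    "Random pixel": {10: 85.52, 30: 68.97, 50: 48.15, 70: 34.98, 90: 30.04},
}
l1_ac = {
    "Gaussian": {10: 70.191, 30: 69.28, 50: 65.93, 70: 59.35, 90: 52.41},
    "Random pixel": {10: 67.59, 30: 59.44, 50: 43.51, 70: 33.26, 90: 25.95},
}


def ac_gaps(kind):
    """AC of the L2graph minus AC of the L1graph for every corruption ratio."""
    return {r: round(l2_ac[kind][r] - l1_ac[kind][r], 2) for r in l2_ac[kind]}


for kind in l2_ac:
    gaps = ac_gaps(kind)
    smallest = min(gaps, key=gaps.get)
    largest = max(gaps, key=gaps.get)
    print(f"{kind}: smallest gap {gaps[smallest]} at {smallest}%, "
          f"largest gap {gaps[largest]} at {largest}%")
```

The gap is positive at every ratio for both corruption types, which is the basis of the robustness observation about the L2graph and the L1graph.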
IV-C3 Clustering of Images with Real Disguises
To investigate the robustness of the algorithms to real occlusions, we use two subsets of the AR images, i.e., AR2 and AR3 (as shown in Fig. 3). AR2 contains 600 images occluded by sunglasses (occlusion percentage is about ) and 600 clean images. AR3 consists of 600 images occluded by scarves (occlusion percentage is about ) and 600 clean samples.
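| Databases | Metrics | L2graph | Ggraph | LLEgraph | L1graph | LRRgraph | LatLRRgraph |
|---|---|---|---|---|---|---|---|
| Faces wearing sunglasses (AR2) | AC (%) | 74.00 | 22.33 | 27.92 | 45.33 | 62.00 | 68.58 |
| Faces wearing sunglasses (AR2) | NMI (%) | 87.89 | 57.73 | 61.28 | 73.81 | 84.81 | 86.16 |
| Faces wearing scarves (AR3) | AC (%) | 75.42 | 20.92 | 25.67 | 38.83 | 61.50 | 63.25 |
| Faces wearing scarves (AR3) | NMI (%) | 88.93 | 55.71 | 59.15 | 80.84 | 82.88 | 84.49 |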
TABLE VIII reports the clustering results of the tested algorithms. Clearly, the L2graph again outperforms the other evaluated algorithms by a considerable margin. For example, its AC is 5.42% and 12.17% higher than that of the second-best algorithm (LatLRRgraph) on AR2 and AR3, respectively. Moreover, the LRRgraph and LatLRRgraph are superior to the L1graph in terms of both AC and NMI. We also find that the evaluated algorithms perform similarly for the two different occlusions, even though the occlusion rates are very different, e.g., the AC of the L2graph is 74.00% on AR2 and 75.42% on AR3.
IV-D Motion Segmentation
Motion segmentation aims to separate a video sequence into multiple spatiotemporal regions, with each region representing a moving object. Generally, segmentation algorithms are based on the feature point trajectories of multiple moving objects. Therefore, the motion segmentation problem can be thought of as the clustering of these trajectories into different subspaces, each of which corresponds to an object. Recently, a number of subspace clustering approaches have been proposed to resolve this problem, e.g., GPCA [45, 46], SCC [47], LSA [48], and ALC [8], and these algorithms have achieved impressive results.
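To make the clustering view concrete, the tracked feature points are usually arranged as the columns of a data matrix before any subspace clustering method is applied. The sketch below assumes the common convention of stacking the image coordinates of each point over all F frames into a single vector of length 2F. The text does not spell out this layout, and the function name is hypothetical.

```python
def trajectory_matrix(tracks):
    """Stack N point tracks over F frames into a 2F x N matrix (list of rows).

    `tracks[j][f]` is the (x, y) image position of point j in frame f, so
    column j of the result is the trajectory of point j.
    """
    n_frames = len(tracks[0])
    if any(len(t) != n_frames for t in tracks):
        raise ValueError("every track must cover the same frames")
    rows = []
    for f in range(n_frames):
        rows.append([t[f][0] for t in tracks])  # x-coordinates in frame f
        rows.append([t[f][1] for t in tracks])  # y-coordinates in frame f
    return rows
```

Each column of this matrix is then treated as one data point, and the segmentation task becomes the clustering of the columns.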
Mathematical models of motion segmentation depend on the camera projection. The affine camera model has been extensively studied in recent years. SSC (another $\ell_1$-norm-based graph), LRR, and LatLRR have each been extended to include affine mathematical models by introducing affine constraints into their objective functions [38, 9, 10]. Theoretically, it is better to depict the affine camera model using the corresponding mathematical models. However, there is no empirical evidence to support the assertion that the affine constraint noticeably improves the segmentation quality of SSC and the LRR-based methods, as pointed out in [21]. Therefore, in this article, we use only the linear models of the evaluated algorithms for the motion segmentation task.
A common problem in motion segmentation is that some data entries are missing owing to occlusions or other reasons. There are two simple ways to solve this problem, i.e., filling the missing entries with random values, or removing all properties corresponding to the missing entries from the data set. The first method, which we adopt here, transforms the problem to the clustering of corrupted data.
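The two strategies for missing entries can be sketched as follows. This is an illustration under assumptions: the data matrix is stored as a list of rows (features) by columns (data points), missing entries are marked with `None`, and random fill values are drawn uniformly from a caller-supplied range, since the text does not specify the distribution. The function names are hypothetical.

```python
import random


def fill_missing(data, low, high, seed=0):
    """First strategy: replace every missing entry (None) with a random value."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) if v is None else v for v in row]
            for row in data]


def drop_incomplete_features(data):
    """Second strategy: remove every feature (row) with a missing entry."""
    return [row for row in data if None not in row]
```

The first strategy keeps the matrix size and turns the task into clustering of corrupted data, whereas the second one discards information shared by all data points.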
To examine the performance of the L2graph for motion segmentation, we conduct experiments on the Hopkins155 raw data [49], some frames of which are shown in Fig. 4. The data include the feature point trajectories of 155 image sequences, consisting of 120 video sequences with two motions and 35 video sequences with three motions. Thus, there are a total of 155 independent clustering tasks. In the experiments, we run the L2graph over the clean image sequences, and form corrupted images by adding Gaussian noise to each sequence with corruption ratios of 5% and 10%. For each algorithm, similar to [13, 50, 18], we report the best average and corresponding median clustering errors using the tuned parameters over each kind of image sequence (two and three motions). The clustering error is defined as
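$$\text{clustering error} = \frac{\#\ \text{of misclassified points}}{\text{total}\ \#\ \text{of points}} \times 100\% \tag{20}$$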
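| Corrupted ratio | Databases | Clustering error (%) | L2graph | L1graph | LRRgraph | LatLRRgraph | LLEgraph | Ggraph |
|---|---|---|---|---|---|---|---|---|
| 0% | 2 motions | mean | 1.91 | 7.63 | 2.22 | 1.97 | 12.46 | 14.67 |
| 0% | 2 motions | median | 0.00 | 0.61 | 0.00 | 0.00 | 3.28 | 8.15 |
| 0% | 3 motions | mean | 4.94 | 17.77 | 5.45 | 3.63 | 19.62 | 23.14 |
| 0% | 3 motions | median | 0.43 | 18.29 | 1.57 | 0.32 | 18.95 | 19.75 |
| 5% | 2 motions | mean | 37.53 | 39.10 | 42.45 | 44.94 | 43.56 | 47.23 |
| 5% | 2 motions | median | 39.78 | 41.42 | 44.44 | 45.74 | 45.34 | 48.00 |
| 5% | 3 motions | mean | 56.00 | 58.55 | 61.21 | 61.88 | 60.72 | 61.19 |
| 5% | 3 motions | median | 57.23 | 59.49 | 62.43 | 62.68 | 61.34 | 61.70 |
| 10% | 2 motions | mean | 41.23 | 43.59 | 42.44 | 45.64 | 44.34 | 43.81 |
| 10% | 2 motions | median | 44.07 | 45.28 | 44.54 | 46.50 | 45.42 | 47.46 |
| 10% | 3 motions | mean | 59.23 | 60.25 | 60.94 | 62.15 | 60.92 | 58.08 |
| 10% | 3 motions | median | 59.70 | 61.62 | 61.95 | 62.73 | 61.17 | 62.24 |

Results in the recent works.

| Corrupted ratio | Databases | Clustering error (%) | SSC [10] | LRRgraph [10] | LatLRRgraph [10] | SSC [18] | LRRgraph [18] |
|---|---|---|---|---|---|---|---|
| 0% | 2 motions | mean | 1.52 | 7.42 | 2.13 | 3.70 | 3.20 |
| 0% | 2 motions | median | 0.00 | 0.84 | 0.00 | 0.00 | 0.30 |
| 0% | 3 motions | mean | 4.40 | 10.89 | 4.03 | 11.40 | 7.80 |
| 0% | 3 motions | median | 0.56 | 6.51 | 1.43 | 3.30 | 2.80 |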
Fig. 5 shows the clustering errors of the evaluated algorithms over all sequences. Moreover, TABLE IX reports the mean and median clustering errors on the data sets. For a fair comparison, we also give the results for SSC, LRRgraph, and LatLRRgraph reported in [10, 18]. Note that SSC and the L1graph construct a similarity graph using an $\ell_1$-norm-based sparse representation (we discussed their connections in Section II). From the results, we can make the following observations:

All the algorithms perform better with two-motion data than with three-motion data, and corruption substantially degrades the clustering quality of the tested algorithms.
IV-E Computational Costs
In this subsection, we investigate the time costs of constructing the similarity graph using four state-of-the-art reconstruction-coefficient-based methods, i.e., the L2graph, L1graph, LRRgraph, and LatLRRgraph. TABLE X reports the time costs obtained by averaging the elapsed CPU time over 10 independent experiments for each algorithm. We can see that the computational time of the L2graph is markedly lower than that of the other methods on every data set. This is because the L1graph, LRRgraph, and LatLRRgraph find the optimal solutions for their objective functions in an iterative manner, whereas the L2graph projects data points into another space via algebraic operations. In addition, the LRRgraph is more efficient than the L1graph on every data set except COIL100, and the LatLRRgraph is noticeably slower than the LRRgraph owing to the introduction of latent variables.
| Databases | L2graph | L1graph | LRRgraph | LatLRRgraph |
|---|---|---|---|---|
| AR1 | 19.57 | 125.75 | 44.74 | 124.61 |
| ExYaleB | 73.29 | 237.43 | 117.35 | 506.34 |
| MPIES1 | 37.46 | 290.53 | 50.08 | 143.30 |
| MPIES2 | 23.13 | 165.50 | 72.89 | 345.34 |
| MPIES3 | 29.44 | 375.65 | 49.82 | 224.47 |
| MPIES4 | 46.05 | 182.64 | 89.53 | 292.68 |
| COIL100 | 8.51 | 63.43 | 144.27 | 343.79 |
| Hopkins155 | 2.22 | 7.37 | 3.11 | 11.77 |
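The measurement protocol behind the table (average CPU time over 10 independent runs) can be sketched as below. This is an assumed harness, not the authors' benchmarking code: `build_graph` stands for any graph construction routine, and `time.process_time` is used as one way to read CPU time in Python.

```python
import time


def average_cpu_time(build_graph, data, repeats=10):
    """Average CPU time in seconds of build_graph(data) over `repeats` runs."""
    total = 0.0
    for _ in range(repeats):
        start = time.process_time()
        build_graph(data)
        total += time.process_time() - start
    return total / repeats
```

Measuring CPU time rather than wall-clock time makes the result less sensitive to other processes on the machine, although it still depends on the hardware and on the implementation of each method.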
V Conclusion
Based on the assumption that trivial coefficients are always distributed over trivial data points such as noise, we proposed a simple but effective scheme to eliminate the errors in data so as to obtain a block-diagonal affinity matrix. The scheme adopts an "encoding, then denoising from the representation" strategy, which is the inverse of the popular procedure. The theoretical analysis has shown that the scheme is correct when the $\ell_p$-norm or the nuclear norm is enforced over the representation. Based on the scheme, we developed an algorithm, named the L2graph. The L2graph calculates the linear representation of a given data set, and then eliminates the effects of errors by zeroing the trivial components. Extensive experiments demonstrated the efficiency and effectiveness of the L2graph in feature extraction, image clustering, and motion segmentation.
There are several ways to improve and extend this work. First, Lemma 2 and Lemma 3 establish the condition under which our scheme is correct. However, this condition is somewhat strict, as it requires the coefficients over errors to be zero. Second, although the theoretical analysis and experimental studies showed some connections between the parameter and the intrinsic dimensionality of a subspace, it is challenging to determine the parameter values. Hence, we intend to exploit more theoretical results in the future. Third, the experimental comparisons illustrate that the L2graph is more efficient than the L1graph, LRRgraph, and LatLRRgraph. However, even a medium-sized data set raises a scalability issue for the L2graph. Therefore, it would be interesting to make the L2graph and its counterparts more efficient in handling large-scale data sets.
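The two steps summarized above (compute a linear representation, then zero the trivial coefficients) can be sketched in a few lines. The code below is a simplified illustration and not the authors' algorithm as specified: it assumes a ridge-regularized least-squares representation solved through the normal equations, keeps the `k` largest coefficient magnitudes per point, and symmetrizes the result by adding absolute values. The parameter names `lam` and `k` and all function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_graph(points, lam=0.1, k=2):
    """Return a symmetric affinity matrix built from thresholded ridge coefficients."""
    n = len(points)
    coef = [[0.0] * n for _ in range(n)]
    for i, x in enumerate(points):
        others = [j for j in range(n) if j != i]
        # Normal equations of ridge regression: (D^T D + lam I) c = D^T x.
        gram = [[dot(points[p], points[q]) + (lam if p == q else 0.0)
                 for q in others] for p in others]
        rhs = [dot(points[p], x) for p in others]
        c = solve(gram, rhs)
        # Zero the trivial coefficients: keep only the k largest magnitudes.
        keep = sorted(range(len(c)), key=lambda t: abs(c[t]), reverse=True)[:k]
        for t in keep:
            coef[i][others[t]] = c[t]
    return [[abs(coef[i][j]) + abs(coef[j][i]) for j in range(n)]
            for i in range(n)]
```

On data drawn from two orthogonal lines, for example, every cross-subspace entry of the returned matrix is zero, which gives the block-diagonal structure that spectral clustering relies on. The per-point linear solve is also where the scalability issue mentioned above appears as the number of points grows.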
Appendix
In this material, Lemmas 1 to 6 show that it is possible to eliminate the effect of errors by zeroing the trivial coefficients. Moreover, Theorem 1 gives an analytic solution to the L2graph that makes the algorithm efficient.
A $\ell_p$-Norm-Based Model
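Let $\mathbf{x}$ be a data point in the subspace $\mathcal{S}_{\mathbf{D}}$ that is spanned by $\mathbf{D} = [\mathbf{D}_x\ \mathbf{D}_{-x}]$, where $\mathbf{D}_x$ is the clean data set and $\mathbf{D}_{-x}$ consists of the errors that probably exist in $\mathbf{D}$. Without loss of generality, let $\mathcal{S}_{\mathbf{D}_x}$ denote the subspace spanned by $\mathbf{D}_x$ and $\mathcal{S}_{\mathbf{D}_{-x}}$ denote the subspace spanned by $\mathbf{D}_{-x}$. Hence, $\mathbf{x}$ is drawn from the intersection between $\mathcal{S}_{\mathbf{D}_x}$ and $\mathcal{S}_{\mathbf{D}_{-x}}$, or from $\mathcal{S}_{\mathbf{D}_x}$ except the intersection, denoted as $\mathbf{x} \in \mathcal{S}_{\mathbf{D}_x} \cap \mathcal{S}_{\mathbf{D}_{-x}}$ or $\mathbf{x} \in \mathcal{S}_{\mathbf{D}_x} \setminus \mathcal{S}_{\mathbf{D}_{-x}}$.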
Let and be the optimal solutions of
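$$\min_{\mathbf{c}} \|\mathbf{c}\|_p \quad \text{s.t.} \quad \mathbf{x} = \mathbf{D}\mathbf{c} \tag{21}$$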
over and , where denotes the norm, where . We aim to investigate the conditions under which, for every nonzero data point , if the norm of is smaller than that of , then and such that . Here is the optimal solution of (21) over , denotes the th largest norm of the coefficients of , and is the dimensionality of .
Lemma 1
For any nonzero data point in the subspace except the intersection between and , i.e., , the optimal solution of (21) over is given by which is partitioned according to the sets and , i.e., . Thus, we must have .
Since the nonzero only could be represented as a linear combination of the data points from , we must have and which implies that .
Lemma 2
Consider a nonzero data point in the intersection between and , i.e., , where and denote the subspace spanned by the clean data set and the errors , respectively. Let , , and be the optimal solution of
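$$\min_{\mathbf{c}} \|\mathbf{c}\|_p \quad \text{s.t.} \quad \mathbf{x} = \mathbf{D}\mathbf{c} \tag{22}$$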
over , , and , where denotes the norm, , and are partitioned according to the sets . We have and , if and only if .
() We prove the result using contradiction. Assume , then
(23) 
Note that the left side and the right side of (23) correspond to a data point from and , respectively. Then, we must have
(24) 
and
(25) 
Clearly, and are feasible solutions of (22) over . According to the triangle inequality and the condition , we have
(26) 
From (25), we have as is the optimal solution of (22) over . Then, . This contradicts the fact that is the optimal solution of (22) over .
() We prove the result using contradiction. For a nonzero data point , assume . Thus, for the data point , it is possible that (22) will only choose the points from to represent . This contradicts the assumption that and .
Lemma 2 extends Theorem 2 in [10], which provides the necessary and sufficient condition only when the $\ell_1$-norm is enforced over the coefficients. More importantly, that work assumes that the data are drawn from a union of independent or disjoint subspaces, whereas we do not make this assumption, which is very strong in practice because most data are sampled from dependent subspaces.
Definition 1 (The First Principal Angle)
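Let $\xi$ be a Euclidean vector space, and consider the two subspaces $\mathcal{W}$, $\mathcal{V}$ with $\dim(\mathcal{W}) := r_{\mathcal{W}} \le \dim(\mathcal{V}) := r_{\mathcal{V}}$. There exists a set of angles $\{\theta_i\}_{i=1}^{r_{\mathcal{W}}}$ called the principal angles, the first one being defined as:
$$\theta_{\min} := \min_{\boldsymbol{\mu}, \boldsymbol{\nu}} \left\{ \arccos\left( \frac{\boldsymbol{\mu}^T \boldsymbol{\nu}}{\|\boldsymbol{\mu}\|_2 \|\boldsymbol{\nu}\|_2} \right) \right\} \tag{27}$$
where $\boldsymbol{\mu} \in \mathcal{W}$ and $\boldsymbol{\nu} \in \mathcal{V}$.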
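As a small worked case of the definition, consider two one-dimensional subspaces, i.e., two lines through the origin. The first principal angle is then simply the angle between the lines. The sketch below covers only this special case and uses a hypothetical function name. For higher-dimensional subspaces, the cosines of the principal angles are usually obtained as the singular values of the product of orthonormal bases of the two subspaces, which is not shown here.

```python
import math


def first_principal_angle_lines(u, v):
    """Angle in radians between the lines spanned by nonzero vectors u and v."""
    inner = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A line contains both w and -w, so only the magnitude of the inner
    # product matters; min() guards against rounding slightly above 1.
    cosine = min(1.0, abs(inner) / (norm_u * norm_v))
    return math.acos(cosine)
```

A small angle means that the two subspaces are nearly aligned, which is the situation in which points from one subspace are easily represented by points from the other.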
Lemma 3
Consider the data point in the intersection between and , i.e., , where and denote the subspace spanned by the clean data set and the errors , respectively. The dimensionality of is , and that of is . Let be the optimal solution of
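$$\min_{\mathbf{c}} \|\mathbf{c}\|_p \quad \text{s.t.} \quad \mathbf{x} = \mathbf{D}\mathbf{c} \tag{28}$$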
over , where denotes the norm, , and are partitioned according to the sets and . If the sufficient condition
(29) 
is satisfied, then . Here, is the smallest nonzero singular value of , is the first principal angle between and , is the maximum norm of the columns of , and denotes the th largest norm of the coefficients of .
Since , we could write , where is the skinny SVD of , , is the rank of , and is the optimal solution of (28) over . Thus, .
From the propositions of norm, i.e., , , and , we have
(30) 
Since the Frobenius norm is subordinate to the Euclidean vector norm, we must have
(31) 
where is the smallest nonzero singular value of .
Moreover, could be represented as a linear combination of since it lies in the intersection between and , i.e., , where is the optimal solution of (28) over . Multiplying both sides of the equation by , we obtain . According to Hölder's inequality, we have
(32) 