# Constructing the L2-Graph for Robust Subspace Learning and Subspace Clustering

Under the framework of graph-based learning, the key to robust subspace clustering and subspace learning is to obtain a good similarity graph that eliminates the effects of errors and retains only connections between data points from the same subspace (i.e., intra-subspace data points). Recent works achieve good performance by modeling errors in their objective functions so as to remove the errors from the inputs. However, these approaches face the limitation that the structure of the errors must be known a priori and that a complex convex problem must be solved. In this paper, we present a novel method to eliminate the effects of the errors from the projection space (representation) rather than from the input space. We first prove that ℓ_1-, ℓ_2-, ℓ_∞-, and nuclear-norm based linear projection spaces share the property of Intra-subspace Projection Dominance (IPD), i.e., the coefficients over intra-subspace data points are larger than those over inter-subspace data points. Based on this property, we introduce a method to construct a sparse similarity graph, called the L2-Graph. Subspace clustering and subspace learning algorithms are then developed upon the L2-Graph. Experiments show that the L2-Graph algorithms outperform the state-of-the-art methods for feature extraction, image clustering, and motion segmentation in terms of accuracy, robustness, and time efficiency.


## I Introduction

The key to graph-based learning algorithms is a sparse eigenvalue problem, i.e., constructing a block-diagonal affinity matrix whose nonzero entries correspond to the intra-class data points. Based on the affinity matrix, a series of algorithms can be derived for various tasks, e.g., data clustering and feature extraction. These algorithms are called graph-oriented learning methods [1, 2, 3], because the affinity matrix actually represents a similarity graph in which each vertex is a data point and each edge weight denotes the similarity between the connected vertices.

There are generally two ways to build a similarity graph. One is based on pairwise distances (e.g., Euclidean distance), and the other is based on reconstruction coefficients. The second method assumes that each data point can be represented as a linear combination of the other points, as the intra-subspace points share the same basis. When the data are clean, i.e., the data points are strictly sampled from the subspaces, several approaches are able to recover the subspace structures [4, 5, 6]. However, in real applications, the data set may lie in the intersection of multiple subspaces or contain noise and outliers. As a result, inter-class data points may connect to each other with very high connection weights. Hence, eliminating the effects of errors is a major challenge. (In this article, we assume that noise, outliers, and other factors that impair the discrimination of the data representation are kinds of error; consequently, we use the terms denoising and error handling to denote eliminating the effects of errors.)

To address these problems, several algorithms have been proposed, e.g., Locally Linear Manifold Clustering (LLMC) [7], Agglomerative Lossy Compression (ALC) [8], Sparse Subspace Clustering (SSC) [9, 10], the L1-graph [11, 12], Low Rank Representation (LRR) [13, 14, 15], Spectral Multi-Manifold Clustering (SMMC) [16], Latent Low Rank Representation (LatLRR) [17], Fixed Rank Representation (FRR) [18], Least Squares Regression (LSR) [19], and Sparse Representation Classifier Steered Discriminative Projection (SRC-DP) [20]. In [21], Vidal provided a comprehensive survey of these algorithms in the context of subspace clustering.

Of the above methods, SSC [9, 10] and the L1-graph [11] obtain a sparse similarity graph from the sparsest coefficients. One of the main differences between these techniques is that [9, 10] formulate the noise and outliers in the objective function and provide more theoretical analysis, whereas [11] derives a series of algorithms upon the L1-graph for various tasks. The popular LRR model [13, 14] and its extensions [15, 17, 18] obtain a similarity graph from the lowest-rank representation rather than the sparsest one. Both ℓ1- and rank-minimization-based methods can automatically select the neighbors for each data point, and have achieved impressive results in numerous applications. However, they must solve a convex problem whose computational complexity is proportional to the cube of the problem size. Moreover, SSC requires that the corruption over each data point has a sparse structure, and LRR assumes that only a small portion of the data are contaminated (so-called sample-specific corruption); otherwise, the performance will be degraded. In fact, these two problems are mainly caused by the adopted error-handling strategy, i.e., removing the errors from the data set to obtain a clean dictionary over which each sample is encoded.

Although these methods have achieved a lot of success, could we find a better way to achieve a block-diagonal affinity matrix? Based on the assumption that the trivial coefficients are always distributed over trivial data points, such as noise, outliers, or faraway data points, this article proposes a method of eliminating the effects of errors by encoding each sample over the dictionary and then zeroing the trivial components of the coefficients, i.e., encoding and then denoising from the representation.

To verify the effectiveness of our scheme, we develop a simple but effective algorithm, named the L2-graph. Experimental results show that the proposed L2-graph is superior to state-of-the-art methods in subspace learning and subspace clustering with respect to accuracy, robustness, and computation time. Moreover, the theoretical analysis shows that our assumption is correct when the ℓp-norm (p ∈ {1, 2, ∞}) or nuclear norm is enforced over the representation.

Fig. 1 illustrates our basic idea and its effectiveness. For a given data set drawn from two clusters, its linear representation is non-sparse, as demonstrated in Fig. 1, and so cannot satisfy the sparsity requirement of graph-oriented approaches. After zeroing the trivial coefficients, as suggested by our scheme, we obtain a sparse similarity graph, called the L2-graph (Fig. 1). The L2-graph removes the connections with faraway data points, as these are regarded as a type of error.

This example might suggest that the L2-graph is not very competitive, as a k-NN graph based on the pairwise distance could also successfully separate these data points. In Section IV, we will show that this is not the case, and that the L2-graph is superior to the k-NN graph by a considerable performance margin. Moreover, Fig. 1 shows that two points (so-called exemplars) are frequently connected with the other intra-subspace points. This implies that the L2-graph can reveal the latent structure of a data distribution, an ability that is important to many applications, such as data summarization [22]. This is another advantage of the L2-graph over a k-NN graph based on pairwise distances.

Several aspects of this work should be highlighted:

• We propose a novel scheme for finding sparse similarity graphs by eliminating the effect of errors from the representation rather than from the dictionary, and develop the L2-graph algorithm to corroborate the effectiveness of the scheme.

• Under some conditions, we prove the correctness of the assumption that trivial coefficients are always distributed over trivial data points, when the ℓp-norm (p ∈ {1, 2, ∞}) or nuclear norm is enforced over the representation.

• The L2-graph has an analytic solution that does not involve an iterative optimization process. Moreover, it avoids the need to construct different dictionaries for different data points. Thus, it is much faster than other graph-construction methods, such as the L1-graph.

• The proposed algorithm is robust to various corruptions, and real occlusions in the context of subspace learning and subspace clustering.

Except in some specified cases, lower-case bold letters represent column vectors and upper-case bold letters represent matrices. For a matrix $\mathbf{M}$, $\mathbf{M}^T$ denotes its transpose and $\mathbf{M}^{\dagger}$ its pseudo-inverse, and $\mathbf{I}$ denotes the identity matrix.

The rest of the article is organized as follows: Section II presents some graph construction methods, and Section III provides the theoretical analysis to show that it is feasible to eliminate the effects of errors from the representation rather than from the dictionary. Moreover, we propose the L2-graph algorithm, and derive a series of methods upon the L2-graph for subspace learning and subspace clustering. Section IV reports the results of a series of experiments to examine the effectiveness of the algorithm in the context of feature extraction, image clustering, and motion segmentation. Finally, Section V summarizes this work.

## II Related Work

Over the past two decades, a number of graph-oriented algorithms have been proposed to address various problems, e.g., data clustering [23], dimension reduction [24], and object tracking [25]. The key to these algorithms is the construction of a similarity graph to depict the relationship between data points. Hence, the performance of the algorithms largely depends on whether the graph can accurately determine the neighborhood of each data point, particularly when the data contain a lot of errors. Thus, eliminating the effects of errors to obtain a sparse similarity graph is a major challenge.

There are two ways to obtain a similarity graph, i.e., methods based on the pairwise distance and those that use reconstruction coefficients. The first scheme uses the distance between two points to measure their similarity. The distance between any two data points is independent of the other points, so the scheme is sensitive to noise and outliers. Alternatively, reconstruction coefficient-based similarity is datum adaptive. In other words, the similarity between two points may vary with the other data. This global property is beneficial to the robustness of such algorithms. Therefore, the second option has become increasingly popular, especially in high-dimensional data analysis.

Locally Linear Embedding (LLE) [1] was possibly the first work to construct a similarity graph using reconstruction coefficients. Specifically, the coefficient for each data point can be calculated by solving

$$\min_{\mathbf{c}_i} \|\mathbf{y}_i - \mathbf{D}_i\mathbf{c}_i\|_2^2 \quad \mathrm{s.t.} \quad \mathbf{1}^T\mathbf{c}_i = 1, \tag{1}$$

where $\mathbf{c}_i$ is the coefficient of $\mathbf{y}_i$ over $\mathbf{D}_i$, and $\mathbf{D}_i$ consists of the $k$ nearest neighbors of $\mathbf{y}_i$ in Euclidean space. Clearly, the LLE-graph is sparse. A big problem with the LLE-graph is that it will not obtain a good result when the data are poorly or non-uniformly sampled.
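As a concrete illustration, problem (1) for a single point can be solved with the standard LLE trick: shift the neighbors to the query point, solve a small Gram-matrix linear system, and normalize the solution so that its entries sum to one. The sketch below assumes `numpy`; the regularization term `reg` is our numerical safeguard, not part of the original formulation.

```python
import numpy as np

def lle_weights(y, D, reg=1e-3):
    """Solve min ||y - D c||^2 s.t. 1^T c = 1 for one point.

    y: (m,) query point; D: (m, K) its K nearest neighbors as columns.
    Trick: solve the (regularized) Gram system G c = 1, then rescale c
    so its entries sum to one, which satisfies the constraint.
    """
    Z = D - y[:, None]                        # shift neighbors to the query
    G = Z.T @ Z                               # K x K Gram matrix
    G += reg * np.trace(G) * np.eye(len(G))   # regularize for stability
    c = np.linalg.solve(G, np.ones(len(G)))
    return c / c.sum()                        # enforce sum-to-one
```

The rescaling step works because the minimizer of the constrained quadratic is proportional to `G^{-1} 1`.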

Recently, some studies have exploited the inherent sparsity of sparse representation to obtain a block-diagonal affinity matrix, e.g., SSC [9, 10], the L1-graph [11, 12], and Sparsity Preserving Projection (SPP) [26].

In [11], the L1-graph is proposed for image analysis, which solves the following problem:

$$\min \|\mathbf{c}_i\|_1 \quad \mathrm{s.t.} \quad \|\mathbf{y}_i - \mathbf{Y}_i\mathbf{c}_i\|_2 < \delta, \tag{2}$$

where $\mathbf{c}_i$ is the sparse representation of $\mathbf{y}_i$ over the dictionary $\mathbf{Y}_i = [\mathbf{y}_1, \ldots, \mathbf{y}_{i-1}, \mathbf{y}_{i+1}, \ldots, \mathbf{y}_n]$, and $\delta$ is the error tolerance.

SSC [9, 10] was proposed for subspace segmentation, which solves the following problem:

$$\min_{\mathbf{C},\mathbf{E},\mathbf{Z}} \|\mathbf{C}\|_1 + \lambda_E\|\mathbf{E}\|_1 + \frac{\lambda_Z}{2}\|\mathbf{Z}\|_F^2 \quad \mathrm{s.t.} \quad \mathbf{Y} = \mathbf{Y}\mathbf{C} + \mathbf{E} + \mathbf{Z},\ \mathbf{C}^T\mathbf{1} = \mathbf{1},\ \mathrm{diag}(\mathbf{C}) = \mathbf{0}, \tag{3}$$

where $\mathbf{C}$ is the sparse representation of the data set $\mathbf{Y}$, $\mathbf{E}$ corresponds to the sparse outlying entries, $\mathbf{Z}$ denotes the reconstruction errors owing to the limited representational capability, and the parameters $\lambda_E$ and $\lambda_Z$ balance the three terms in the objective function.

From (2) and (3), we can see that both the L1-graph and SSC aim to obtain a similarity graph by solving an ℓ1-minimization problem. In fact, the objective functions will be the same if $\mathbf{Y}$ does not contain outliers, and if the tolerance parameters in (2) and (3) are selected so as to correspond. However, these schemes have contributed to this field in different ways. Elhamifar et al. proved that SSC will only choose intra-subspace data points to represent each other if $\mathbf{Y}$ is drawn from the union of independent or disjoint subspaces, whereas Cheng et al. demonstrated the success of the L1-graph in several important applications, such as feature extraction (unsupervised and semi-supervised) and image clustering.

Another recently proposed method, LRR [13, 14], aims to find the lowest-rank representation, rather than the sparsest, by solving

$$\min_{\mathbf{C},\mathbf{E}} \operatorname{rank}(\mathbf{C}) + \lambda\|\mathbf{E}\|_{2,1} \quad \mathrm{s.t.} \quad \mathbf{Y} = \mathbf{Y}\mathbf{C} + \mathbf{E}, \tag{4}$$

where $\mathbf{C}$ is the coefficient matrix of $\mathbf{Y}$ over the data set itself, and the $\ell_{2,1}$-norm $\|\mathbf{E}\|_{2,1}$ is used to deal with sample-specific corruptions.

As the rank-minimization problem is not convex, Liu et al. replaced the rank of $\mathbf{C}$ by its nuclear norm, i.e.,

$$\min_{\mathbf{C},\mathbf{E}} \|\mathbf{C}\|_\ast + \lambda\|\mathbf{E}\|_{2,1} \quad \mathrm{s.t.} \quad \mathbf{Y} = \mathbf{Y}\mathbf{C} + \mathbf{E}, \tag{5}$$

where $\|\mathbf{C}\|_\ast = \sum_i \sigma_i(\mathbf{C})$, and $\sigma_i(\mathbf{C})$ is the $i$-th singular value of $\mathbf{C}$.

The objective functions of the L1-graph, SSC, and LRR can be solved using convex optimization algorithms, e.g., the Augmented Lagrange Multiplier method (ALM) [27]. As a result, the computational complexity of these methods is at least proportional to the cube of the problem size. Moreover, it is easy to demonstrate that the methods actually obtain an optimal $\mathbf{C}$ over a clean dictionary. In other words, the methods eliminate the effect of errors by denoising the data set and then encoding each sample over a clean dictionary; this denoising strategy has been extensively adopted in this community [15, 17, 18, 28, 29, 30].

## III Learning with the L2-graph

In this section, we prove the correctness of our scheme, i.e., the effects of errors can be eliminated by zeroing the trivial coefficients. Moreover, we develop the L2-graph algorithm for subspace learning and subspace clustering.

### III-A Theoretical Analysis

This article is based on the assumption that the trivial coefficients are always distributed over the trivial data points (i.e., errors). In this subsection, we prove the correctness of this assumption in two steps. Lemmas 1–3 show that the method will perform well when the ℓp-norm (p ∈ {1, 2, ∞}) is enforced over the representation. It should be pointed out that the proof is motivated by the theoretical analysis in [10]. Moreover, Lemmas 4–6 show the correctness of our assumption when the nuclear norm is enforced over the representation. Lemma 2 is a preliminary step toward Lemma 3, and Lemmas 4 and 5 are preliminaries for Lemma 6. For clarity, we provide proofs of these lemmas in the appendix.

Let $\mathbf{y}$ be a data point in the subspace $S_{\mathbf{D}}$ that is spanned by $\mathbf{D} = [\mathbf{D}_0, \mathbf{D}_e]$, where $\mathbf{D}_0$ is the clean data set and $\mathbf{D}_e$ consists of the errors that probably exist in $\mathbf{D}$. Without loss of generality, let $S_{\mathbf{D}_0}$ denote the subspace spanned by $\mathbf{D}_0$ and $S_{\mathbf{D}_e}$ denote the subspace spanned by $\mathbf{D}_e$. Hence, $\mathbf{y}$ is drawn from the intersection between $S_{\mathbf{D}_0}$ and $S_{\mathbf{D}_e}$, or from $S_{\mathbf{D}_0}$ except the intersection, denoted as $\mathbf{y} \in S_{\mathbf{D}_0} \cap S_{\mathbf{D}_e}$ or $\mathbf{y} \in S_{\mathbf{D}_0} \setminus S_{\mathbf{D}_e}$.

#### III-A1 ℓp Norm-based Model

Let $\mathbf{c}^\ast$ and $\tilde{\mathbf{c}}$ be the optimal solutions of

$$\min_{\mathbf{c}} \|\mathbf{c}\|_p \quad \mathrm{s.t.} \quad \mathbf{y} = \mathbf{D}\mathbf{c} \tag{6}$$

over $\mathbf{D}$ and $\mathbf{D}_0$, respectively, where $\|\cdot\|_p$ denotes the $\ell_p$-norm and $p \in \{1, 2, \infty\}$. We aim to investigate the conditions under which, for every nonzero data point $\mathbf{y}$, the coefficients over the intra-subspace data points $\mathbf{D}_0$ dominate those over the errors $\mathbf{D}_e$, so that zeroing the trivial coefficients does not harm the representation.

###### Lemma 1

For any nonzero data point $\mathbf{y}$ in the subspace $S_{\mathbf{D}_0}$ except the intersection between $S_{\mathbf{D}_0}$ and $S_{\mathbf{D}_e}$, i.e., $\mathbf{y} \in S_{\mathbf{D}_0} \setminus S_{\mathbf{D}_e}$, the optimal solution of (6) over $\mathbf{D}$ is given by $\mathbf{c}^\ast = [\mathbf{c}_0^\ast;\ \mathbf{c}_e^\ast]$, which is partitioned according to the sets $\mathbf{D}_0$ and $\mathbf{D}_e$. Thus, we must have $\mathbf{c}_e^\ast = \mathbf{0}$.

###### Lemma 2

Consider a nonzero data point $\mathbf{y}$ in the intersection between $S_{\mathbf{D}_0}$ and $S_{\mathbf{D}_e}$, i.e., $\mathbf{y} \in S_{\mathbf{D}_0} \cap S_{\mathbf{D}_e}$, where $S_{\mathbf{D}_0}$ and $S_{\mathbf{D}_e}$ denote the subspaces spanned by the clean data set $\mathbf{D}_0$ and the errors $\mathbf{D}_e$, respectively. Let $\mathbf{c}^\ast$, $\mathbf{c}_0^\ast$, and $\mathbf{c}_e^\ast$ be the optimal solutions of

$$\min_{\mathbf{c}} \|\mathbf{c}\|_p \quad \mathrm{s.t.} \quad \mathbf{y} = \mathbf{D}\mathbf{c} \tag{7}$$

over $\mathbf{D}$, $\mathbf{D}_0$, and $\mathbf{D}_e$, respectively, where $\|\cdot\|_p$ denotes the $\ell_p$-norm, $p \in \{1, 2, \infty\}$, and $\mathbf{c}^\ast$ is partitioned according to the sets $\mathbf{D}_0$ and $\mathbf{D}_e$. If $\|\mathbf{c}_0^\ast\|_p \le \|\mathbf{c}_e^\ast\|_p$, then $\|\mathbf{c}^\ast\|_p \le \|\mathbf{c}_0^\ast\|_p$ and $\|\mathbf{c}^\ast\|_p \le \|\mathbf{c}_e^\ast\|_p$.

###### Definition 1 (The First Principal Angle)

Let $V$ be a Euclidean vector space, and consider the two subspaces $U, W \subseteq V$ with $\dim(U) \le \dim(W)$. There exists a set of angles called the principal angles, the first one being defined as:

$$\theta_{\min} := \min\left\{\arccos\left(\frac{\boldsymbol{\mu}^T\boldsymbol{\nu}}{\|\boldsymbol{\mu}\|_2\,\|\boldsymbol{\nu}\|_2}\right)\right\}, \tag{8}$$

where $\boldsymbol{\mu} \in U$ and $\boldsymbol{\nu} \in W$.

###### Lemma 3

Consider the nonzero data point $\mathbf{y}$ in the intersection between $S_{\mathbf{D}_0}$ and $S_{\mathbf{D}_e}$, i.e., $\mathbf{y} \in S_{\mathbf{D}_0} \cap S_{\mathbf{D}_e}$, where $S_{\mathbf{D}_0}$ and $S_{\mathbf{D}_e}$ denote the subspaces spanned by the clean data set $\mathbf{D}_0$ and the errors $\mathbf{D}_e$, respectively. The dimensionality of $S_{\mathbf{D}_0}$ is $r_0$, and that of $S_{\mathbf{D}_e}$ is $r_e$. Let $\mathbf{c}^\ast$ be the optimal solution of

$$\min_{\mathbf{c}} \|\mathbf{c}\|_p \quad \mathrm{s.t.} \quad \mathbf{y} = \mathbf{D}\mathbf{c} \tag{9}$$

over $\mathbf{D} = [\mathbf{D}_0, \mathbf{D}_e]$, where $\|\cdot\|_p$ denotes the $\ell_p$-norm, $p \in \{1, 2, \infty\}$, and $\mathbf{c}^\ast = [\mathbf{c}_0^\ast;\ \mathbf{c}_e^\ast]$ is partitioned according to the sets $\mathbf{D}_0$ and $\mathbf{D}_e$. If the sufficient condition

$$\sigma_{\min}(\mathbf{D}_0) \ge r_e \cos\theta_{\min}\, \|\mathbf{D}_e\|_{1,2} \tag{10}$$

is satisfied, then the $r_0$-th largest coefficient of $\mathbf{c}^\ast$ (in magnitude) lies over $\mathbf{D}_0$ rather than over $\mathbf{D}_e$, i.e., the intra-subspace coefficients dominate. Here, $\sigma_{\min}(\mathbf{D}_0)$ is the smallest nonzero singular value of $\mathbf{D}_0$, $\theta_{\min}$ is the first principal angle between $S_{\mathbf{D}_0}$ and $S_{\mathbf{D}_e}$, and $\|\mathbf{D}_e\|_{1,2}$ is the maximum $\ell_2$-norm of the columns of $\mathbf{D}_e$.

#### III-A2 Nuclear Norm-based Model

Based on two existing conclusions [31, 15], we theoretically show that our scheme is correct when the nuclear norm is used to obtain the lowest-rank representation.

###### Lemma 4 ([31])

Let $\mathbf{D} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ be the skinny singular value decomposition (SVD) of the data matrix $\mathbf{D}$. The unique solution to

$$\min_{\mathbf{C}} \|\mathbf{C}\|_\ast \quad \mathrm{s.t.} \quad \mathbf{D} = \mathbf{D}\mathbf{C} \tag{11}$$

is given by $\mathbf{C}^\ast = \mathbf{V}\mathbf{V}^T$, where $\mathbf{V} \in \mathbb{R}^{n \times r}$ and $r$ is the rank of $\mathbf{D}$.
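Lemma 4 is easy to check numerically: for a low-rank $\mathbf{D}$ with skinny SVD $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$, the matrix $\mathbf{V}\mathbf{V}^T$ is feasible for $\mathbf{D} = \mathbf{D}\mathbf{C}$ and its nuclear norm equals $\operatorname{rank}(\mathbf{D})$. A minimal sanity check in `numpy` (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
# Build a rank-3 data matrix D by multiplying thin factors.
D = rng.standard_normal((15, 3)) @ rng.standard_normal((3, 40))

# Skinny SVD: keep only the r nonzero singular values/vectors.
U, s, Vt = np.linalg.svd(D, full_matrices=False)
r = int(np.sum(s > 1e-10))
V = Vt[:r].T

C = V @ V.T          # the claimed unique minimizer of ||C||_* s.t. D = DC

assert np.allclose(D @ C, D, atol=1e-8)          # feasibility: D = DC
assert np.isclose(np.linalg.norm(C, 'nuc'), r)   # ||C||_* equals rank(D)
```

Since $\mathbf{V}\mathbf{V}^T$ is an orthogonal projection of rank $r$, all its nonzero singular values are one, which is why the nuclear norm evaluates to $r$.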

###### Lemma 5 ([15])

Let $\mathbf{D} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ be the SVD of the data matrix $\mathbf{D}$. The optimal solution to

$$\min_{\mathbf{C},\mathbf{A}} \|\mathbf{C}\|_\ast + \frac{\alpha}{2}\|\mathbf{D} - \mathbf{A}\|_F^2 \quad \mathrm{s.t.} \quad \mathbf{A} = \mathbf{A}\mathbf{C} \tag{12}$$

is given by $\mathbf{A}^\ast = \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^T$ and $\mathbf{C}^\ast = \mathbf{V}_k\mathbf{V}_k^T$, where $\boldsymbol{\Sigma}_k$, $\mathbf{U}_k$, and $\mathbf{V}_k$ correspond to the top $k$ singular values and singular vectors of $\mathbf{D}$, respectively.

###### Lemma 6

Let $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ be the skinny SVD of the optimal solution $\mathbf{A}$ to

$$\min_{\mathbf{C},\mathbf{A}} \|\mathbf{C}\|_\ast + \frac{\alpha}{2}\|\mathbf{D} - \mathbf{A}\|_F^2 \quad \mathrm{s.t.} \quad \mathbf{A} = \mathbf{A}\mathbf{C}, \tag{13}$$

where $\mathbf{D}$ consists of the clean data set $\mathbf{D}_0$ and the errors $\mathbf{D}_e$, i.e., $\mathbf{D} = \mathbf{D}_0 + \mathbf{D}_e$.

The optimal solution to

$$\min_{\mathbf{C}_0,\mathbf{D}_0} \|\mathbf{C}_0\|_\ast + \frac{\alpha}{2}\|\mathbf{D}_e\|_F^2 \quad \mathrm{s.t.} \quad \mathbf{D}_0 = \mathbf{D}_0\mathbf{C}_0,\ \mathbf{D} = \mathbf{D}_0 + \mathbf{D}_e \tag{14}$$

is given by $\mathbf{C}_0^\ast = \mathbf{V}_k\mathbf{V}_k^T$ with $\mathbf{V}_k = \mathcal{H}_k(\mathbf{V})$, where $\mathcal{H}_k(\cdot)$ is a truncation operator that retains the first $k$ elements and sets the other elements to zero, and $k$ is determined by $\sigma_k$, the $k$-th largest singular value of $\mathbf{A}$.

### III-B Algorithm for Constructing the L2-graph

Let $\{S_j\}_{j=1}^{s}$ be a set of linear subspaces embedded into $\mathbb{R}^m$, $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_n]$ be a collection of data points located in the union of the subspaces, and $\mathbf{Y}_i = [\mathbf{y}_1, \ldots, \mathbf{y}_{i-1}, \mathbf{y}_{i+1}, \ldots, \mathbf{y}_n]$ be the specified dictionary for the data point $\mathbf{y}_i$, where $\mathbf{y}_i \in \mathbb{R}^m$.

To construct a similarity graph, the L2-graph algorithm calculates the representation for each data point over by solving

$$\min_{\mathbf{c}_i} \frac{1}{2}\|\mathbf{y}_i - \mathbf{Y}_i\mathbf{c}_i\|_2^2 + \lambda\|\mathbf{c}_i\|_2^2, \tag{15}$$

where $\lambda > 0$ is a regularization parameter.

For each $\mathbf{y}_i$, solving the optimization problem (15) gives

$$\mathbf{c}_i = (\mathbf{Y}_i^T\mathbf{Y}_i + \lambda\mathbf{I})^{-1}\mathbf{Y}_i^T\mathbf{y}_i, \quad i = 1, \ldots, n. \tag{16}$$

This solution requires the calculation of a separate inverse $(\mathbf{Y}_i^T\mathbf{Y}_i + \lambda\mathbf{I})^{-1}$ for each $\mathbf{y}_i$, which is very inefficient. Hence, we derive another solution for the optimization problem in Theorem 1.

###### Theorem 1

The optimal solution of the problem (15) is given by

$$\mathbf{c}_i^\ast = \mathbf{P}\left[\mathbf{Y}^T\mathbf{y}_i - \frac{\mathbf{e}_i^T\mathbf{P}\mathbf{Y}^T\mathbf{y}_i}{\mathbf{e}_i^T\mathbf{P}\mathbf{e}_i}\,\mathbf{e}_i\right], \tag{17}$$

where $\mathbf{P} = (\mathbf{Y}^T\mathbf{Y} + \lambda\mathbf{I})^{-1}$, and the union of $\{\mathbf{e}_i\}_{i=1}^{n}$ is the standard orthogonal basis of $\mathbb{R}^n$, i.e., all entries in $\mathbf{e}_i$ are zero, except for the $i$-th entry, which is one.
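Theorem 1 can be verified numerically: the closed form (17), which reuses one shared inverse $\mathbf{P}$, must agree with the naive per-point solution (16) that excludes $\mathbf{y}_i$ from the dictionary. A small `numpy` check (the sizes and $\lambda$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 20, 12, 0.5
Y = rng.standard_normal((m, n))

# Shared projection matrix P = (Y^T Y + lam I)^{-1} from Theorem 1.
P = np.linalg.inv(Y.T @ Y + lam * np.eye(n))

i = 3                                   # check one data point
y_i = Y[:, i]
e_i = np.zeros(n); e_i[i] = 1.0

# Closed form (17): a single shared inverse for all points.
g = P @ (Y.T @ y_i)
c_thm = g - (e_i @ g) / (e_i @ P @ e_i) * (P @ e_i)

# Naive solution (16): a fresh inverse for the leave-one-out dictionary.
Yi = np.delete(Y, i, axis=1)
c_naive = np.linalg.inv(Yi.T @ Yi + lam * np.eye(n - 1)) @ Yi.T @ y_i

# (17) reproduces (16) on the remaining coordinates, with a zero at i.
assert np.allclose(np.delete(c_thm, i), c_naive, atol=1e-8)
assert abs(c_thm[i]) < 1e-8
```

The second term in (17) is the Lagrange-multiplier correction that enforces a zero self-coefficient, which is why entry $i$ of the result vanishes.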

For a given data set $\mathbf{Y}$, the construction process of the L2-graph is summarized as follows:

1. Calculate and store the projection matrix $\mathbf{P} = (\mathbf{Y}^T\mathbf{Y} + \lambda\mathbf{I})^{-1}$. For each point $\mathbf{y}_i$, obtain the optimal solution $\mathbf{c}_i^\ast$ to the problem (15) via (17), and normalize it to unit $\ell_2$-norm.

2. Eliminate the effects of errors by performing the $k$-NN or $\epsilon$-ball method over $\mathbf{c}_i^\ast$, e.g., $\mathbf{c}_i^\ast \leftarrow \mathcal{H}_k(\mathbf{c}_i^\ast)$, where $\mathcal{H}_k(\cdot)$ retains the $k$ largest coefficients of $\mathbf{c}_i^\ast$ and sets the other entries to zero.

3. Construct a similarity graph by connecting node $i$, denoted by $\mathbf{y}_i$, with node $j$, denoted by $\mathbf{y}_j$. Assign $w_{ij}$ as the connection weight between $\mathbf{y}_i$ and $\mathbf{y}_j$, where $w_{ij}$ is an element of $\mathbf{W} = [\mathbf{c}_1^\ast, \ldots, \mathbf{c}_n^\ast]$.
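The three steps above can be sketched as follows, assuming samples are stored as columns; `lam` and `k` are hypothetical defaults to be tuned per data set, and symmetrizing the coefficient matrix is one common way to obtain an undirected affinity:

```python
import numpy as np

def l2_graph(Y, lam=0.1, k=5):
    """Construct an L2-graph affinity for Y (samples as columns).

    Step 1: one shared projection matrix P and the closed form (17);
    Step 2: keep the k largest coefficients per point (thresholding);
    Step 3: symmetrize the coefficient matrix into an affinity W.
    """
    n = Y.shape[1]
    P = np.linalg.inv(Y.T @ Y + lam * np.eye(n))
    G = P @ (Y.T @ Y)                      # column i equals P Y^T y_i
    C = np.zeros((n, n))
    for i in range(n):
        c = G[:, i] - (G[i, i] / P[i, i]) * P[:, i]   # closed form (17)
        c[i] = 0.0                                    # no self-connection
        c /= np.linalg.norm(c) + 1e-12                # unit l2-norm
        c[np.argsort(np.abs(c))[:-k]] = 0.0           # keep k largest
        C[:, i] = c
    return np.abs(C) + np.abs(C).T                    # symmetric affinity
```

Note that only one matrix inverse is computed for the whole data set, which is the efficiency gain promised by Theorem 1.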

Note that the objective function (15) is actually derived from the well-known ridge regression [32], which has been extensively used in different works. Based on the ridge regression model, Zhang et al. [33] presented a classifier (named CRC) for face recognition, and Lu et al. [19] proposed the LSR for subspace clustering. Although these methods are derived from the same model, their solutions are totally different. Moreover, Lu et al. proved that LSR will produce a block-diagonal matrix if the data are drawn from sufficiently independent subspaces, whereas our theoretical analysis shows that the L2-graph will obtain a sparse similarity graph even when the data lie in multiple dependent subspaces. Finally, the L2-graph eliminates the effect of errors by zeroing the trivial coefficients. This is a novel way of handling errors in data.

### III-C Computational Complexity Analysis

Suppose the $n$ data points are drawn from a union of subspaces. The L2-graph takes $O(mn^2 + n^3)$ operations to construct and store the projection matrix $\mathbf{P}$. It then projects each data point into another space via (17), with complexity $O(mn + n^2)$ per point. Moreover, to eliminate the effects of errors in the data, it requires $O(n \log n)$ per point to find the $k$ largest coefficients. Putting everything together, the computational complexity of constructing the L2-graph is $O(mn^2 + n^3)$.

The complexity of our method is high, meaning that a medium-sized data set will bring about scalability issues. However, the L2-graph will not fall into local minima, and is more efficient than its counterparts. For example, ℓ1-norm-based methods [10, 11, 26] need $O(t_1 m n^2)$ operations to construct a similarity graph using the Homotopy optimizer [34] to find the sparsest solution, where $t_1$ denotes the number of iterations of the Homotopy algorithm. According to [35], the Homotopy optimizer is one of the fastest ℓ1-minimization algorithms. Moreover, LRR has a complexity of at least $O(t_2 r n^2)$ to solve the rank-minimization problem, where $r$ is the rank of the dictionary and $t_2$ is the number of iterations of the ALM method [27].

Once the L2-graph has been built, we can follow the scheme of graph-oriented learning methods to integrate various algorithms for different tasks, e.g., subspace learning and subspace clustering. In the following, we briefly describe how to apply the L2-graph for these tasks.

### III-D Subspace Learning with the L2-graph

Subspace learning aims to find a projection matrix $\boldsymbol{\Theta}$ to transform a high-dimensional datum $\mathbf{y}$ into a low-dimensional one via $\mathbf{z} = \boldsymbol{\Theta}^T\mathbf{y}$. According to [3], most existing dimension reduction techniques can be unified into a graph framework, i.e., they embed the similarity graph from the input space into a low-dimensional space. In this article, following the Neighborhood Preserving Embedding (NPE) program [37], we conduct subspace learning on the L2-graph by solving

$$\min_{\boldsymbol{\Theta}} \|\boldsymbol{\Theta}^T\mathbf{Y} - \boldsymbol{\Theta}^T\mathbf{Y}\mathbf{W}\|_F^2 \quad \mathrm{s.t.} \quad \boldsymbol{\Theta}^T\mathbf{Y}\mathbf{Y}^T\boldsymbol{\Theta} = \mathbf{I}, \tag{18}$$

where $\mathbf{W}$ is the affinity matrix produced by the L2-graph, and the constraint term provides the scale-invariance.

The optimal solution of (18) is achieved by solving

$$(\mathbf{I} + \mathbf{W}^T\mathbf{W} - \mathbf{W} - \mathbf{W}^T)\mathbf{Y}^T\boldsymbol{\Theta} = \lambda\mathbf{Y}^T\boldsymbol{\Theta}. \tag{19}$$

Clearly, the optimal solution $\boldsymbol{\Theta}$ consists of the eigenvectors corresponding to the $d$ smallest nonzero eigenvalues of the above generalized eigenvalue problem, where $d$ is the target dimensionality.
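In practice, the embedding can be computed with a symmetric generalized eigensolver. The sketch below assumes `scipy` and uses $\mathbf{M} = (\mathbf{I}-\mathbf{W})^T(\mathbf{I}-\mathbf{W})$, which expands to the matrix in (19); the small ridge on $\mathbf{Y}\mathbf{Y}^T$ is our addition for numerical stability:

```python
import numpy as np
from scipy.linalg import eigh

def l2_graph_embedding(Y, W, d):
    """Solve problem (18): keep the d eigenvectors of the generalized
    problem  Y M Y^T theta = lam Y Y^T theta  with smallest eigenvalues.

    Y: (m, n) data with samples as columns; W: (n, n) graph weights.
    """
    n = Y.shape[1]
    M = (np.eye(n) - W).T @ (np.eye(n) - W)   # I + W^T W - W - W^T
    A = Y @ M @ Y.T
    B = Y @ Y.T + 1e-8 * np.eye(Y.shape[0])   # ridge for stability
    vals, vecs = eigh(A, B)                   # ascending eigenvalues
    return vecs[:, :d]                        # columns form Theta (m x d)
```

`eigh` normalizes the eigenvectors to be $\mathbf{B}$-orthonormal, which matches the scale-invariance constraint in (18).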

### III-E Subspace Segmentation with the L2-graph

Because of its effectiveness, spectral clustering is one of the most popular subspace clustering methods. The basis of spectral clustering is a sparse eigenvalue problem, i.e., the construction of a similarity graph in which only the intra-subspace points connect with each other. In this subsection, we demonstrate spectral clustering [4] using the L2-graph.

1. Construct a Laplacian matrix $\mathbf{L} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$ using the affinity matrix $\mathbf{W}$ produced by the L2-graph, where $\mathbf{D}$ is a diagonal matrix with $\mathbf{D}_{ii} = \sum_j w_{ij}$.

2. Obtain a matrix $\mathbf{V}$ using the first $c$ normalized eigenvectors of $\mathbf{L}$ that correspond to its smallest nonzero eigenvalues, where $c$ is the number of clusters.

3. Calculate the clustering membership of the data by performing a $k$-means clustering algorithm over the rows of $\mathbf{V}$.
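The three steps can be sketched with `numpy`/`scipy`; `kmeans2` stands in for the k-means step, and the eigenvector count equals the number of clusters as in step 2 (a sketch, not the authors' exact implementation):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(W, n_clusters, seed=0):
    """Spectral clustering on an affinity matrix W (steps 1-3 above)."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Step 1: normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)                    # ascending eigenvalues
    V = vecs[:, :n_clusters]                          # step 2: first c vectors
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12  # row-normalize
    _, labels = kmeans2(V, n_clusters, minit='++', seed=seed)  # step 3
    return labels
```

On a perfectly block-diagonal affinity, the rows of `V` collapse to one point per cluster, so k-means recovers the partition exactly.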

## IV Experimental Verification and Analysis

In this section, we evaluate the performance of the L2-graph in the context of unsupervised subspace learning, data clustering, and motion segmentation. We consider the results in terms of three aspects: 1) accuracy, 2) robustness, and 3) computational cost.

### IV-A Experimental Configuration

Baseline: We compare the L2-graph with several state-of-the-art algorithms, which are listed in TABLE I. Note that Locality Preserving Projections (LPP) and the G-graph construct a similarity graph using the heat-kernel function with the Euclidean distance, but embed the graph in different ways. The G-graph takes the embedding function of NPE, which calculates the similarity among data points in the same way as LLE. Specifically, NPE and LLE obtain a local dictionary for each data point via a k-NN search using the Euclidean distance. Moreover, we solve the objective function of the L1-graph using the Homotopy optimizer [34]. According to [35], this is one of the most competitive ℓ1-minimization algorithms in terms of accuracy, robustness, and convergence speed. From the homepage of Liu, we downloaded MATLAB code for LRR and LatLRR. To allow for a fair comparison, we adopt the embedding function (18) and the spectral clustering algorithm [4] over the G-graph, L1-graph, LRR-graph, LatLRR-graph, and L2-graph. (The MATLAB code for the L2-graph and the databases used in the experiments can be downloaded from https://www.dropbox.com/sh/hkwyc2v97mvp96q/Q-3ZcmHiSG.) Similar to [11, 17], in each experiment, we tune the parameters of all methods to obtain the best results.

Databases: We examine the performance of the algorithms using several popular image databases, i.e., Extended Yale B (ExYaleB) [39], AR [40], Multiple PIE (MPIE) [41], and COIL100 [42]. TABLE II provides an overview of these databases. ExYaleB contains 2414 frontal-face images of 38 subjects (about 64 images per subject), and we use the first 58 samples of each subject. Moreover, we extract three subsets from the AR database by randomly selecting from 50 male and 50 female subjects. AR1 contains 1400 clean images, AR2 contains 600 images occluded by sunglasses and 600 clean images, and AR3 contains 600 images occluded by scarves and 600 clean samples. MPIE contains the facial images of 337 subjects captured in four sessions with simultaneous variations in pose, expression, and illumination. Owing to limited computational capacity, we perform experiments over the first seven samples per subject of Session 1 and all images of the other sessions.

### Iv-B Subspace Learning

To compare the performance of the algorithms, the L2-graph, Principal Component Analysis (PCA), LPP, NPE, G-graph, LRR-graph, and LatLRR-graph are used to extract the features of some databases. Similar to [11], the 1-NN classifier is then applied to the features to calculate the classification accuracy. Note that LPP, NPE, L2-graph, LRR-graph, and LatLRR-graph are graph-oriented algorithms, but focus on different problems. LPP and NPE embed a similarity graph from a high-dimensional space into a low-dimensional one, whereas the L2-graph, LRR-graph, and LatLRR-graph mainly focus on the construction of a similarity graph. In this experiment, we did not investigate the performance of the L1-graph, as its optimization program requires the dictionary to be an under-determined matrix. This requirement is questionable, because it means that dimension reduction based on the L1-graph depends on the result of other dimension reduction techniques when the data size is less than its dimensionality.

#### IV-B1 Subspace Learning on Clean Facial Images

We randomly draw 3, 5, 7, 9, and 11 of the 14 images per subject in AR1 for use as a training data set, and use the remaining data for testing. Thus, we form five different data sets from AR1. Similarly, we select 10, 20, 29 (half of the total number of samples), and 40 images from the ExYaleB data to give four data sets.

TABLE III and TABLE IV report the results obtained from each algorithm. We make the following observations:

• The L2-graph outperforms the other methods in all tests, whereas LPP achieves the worst results.

• NPE generally achieves the second-best result over all databases. The LRR-graph and LatLRR-graph are somewhat superior to LPP, PCA, and the G-graph. Moreover, the LatLRR-graph outperforms the LRR-graph in all the tests.

• Representation-based similarity graphs (L2-graph, NPE, LRR-graph, and LatLRR-graph) are more competitive than pairwise distance-based ones (LPP and G-graph).

#### IV-B2 Subspace Learning on Corrupted Facial Images

In this section, we investigate the robustness of the algorithms to two types of corruption using the ExYaleB database. For each subject, we randomly choose half of the images (29 samples per subject) and corrupt them using white Gaussian noise or random pixel corruption. We then randomly divide the 58 images into two groups of equal size, one for training and the other for testing. Thus, both the training data and the testing data may be contaminated by noise. For a chosen image $\mathbf{x}$, we add white Gaussian noise according to $\tilde{\mathbf{x}} = \mathbf{x} + \rho\mathbf{n}$, where $\rho$ is a corruption ratio and the noise $\mathbf{n}$ follows a standard normal distribution. For the random pixel corruption, we replace the values of a randomly selected percentage of pixels in the image $\mathbf{x}$ with values drawn from a uniform distribution over $[0, p_{\max}]$, where $p_{\max}$ is the largest pixel value of $\mathbf{x}$. Clearly, white Gaussian noise is additive, whereas random pixel corruption is non-additive.
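The two corruptions can be reproduced as follows (a sketch; the function and parameter names are ours, and scaling the Gaussian noise by the image's maximum value is our assumption about the dynamic range):

```python
import numpy as np

def corrupt(img, ratio, mode, seed=0):
    """Apply one of the two corruptions used in the experiments.

    mode='gaussian': additive white Gaussian noise scaled by `ratio`.
    mode='pixel': replace a `ratio` fraction of randomly chosen pixels
    with uniform values in [0, p_max], p_max the largest pixel value.
    """
    rng = np.random.default_rng(seed)
    out = img.astype(float).copy()
    if mode == 'gaussian':
        out += ratio * img.max() * rng.standard_normal(img.shape)
    elif mode == 'pixel':
        mask = rng.random(img.shape) < ratio
        out[mask] = rng.uniform(0, img.max(), mask.sum())
    return out
```

The Gaussian case perturbs every pixel slightly (additive), while the pixel case destroys a subset of pixels completely (non-additive), matching the distinction drawn above.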

TABLE V reports the classification accuracy of the 1-NN classifier using each of the algorithms. This demonstrates that:

• The L2-graph achieves the best results in all tests, except when 30% of the data are corrupted by random pixel corruption, in which case it is second best.

• In general, the representation-based methods (L2-graph, NPE, LRR-graph, and LatLRR-graph) outperform the pairwise distance-based methods (LPP and the G-graph). This corroborates the claim that representation-based methods are more robust than pairwise distance-based ones.

• The algorithms perform better when the data contain white Gaussian noise than when they are affected by random pixel corruption.

### Iv-C Data Clustering

In this subsection, we investigate the clustering quality of the L2-graph, G-graph, LLE-graph, L1-graph, LRR-graph, and LatLRR-graph using clean, corrupted, and occluded images.

Two popular metrics, the accuracy (AC, sometimes called the purity) [43] and the normalized mutual information (NMI) [44], are used to evaluate the clustering quality. AC or NMI values of 1 indicate perfect matching with the ground truth, whereas 0 indicates a perfect mismatch.
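For reference, both metrics can be computed directly; the purity-style AC assigns each predicted cluster to its majority ground-truth class, and NMI here uses the common $\sqrt{H_u H_v}$ normalization (a sketch; the definitions in [43, 44] may differ in details such as the normalization):

```python
import numpy as np

def purity(labels_true, labels_pred):
    """Clustering accuracy as purity: each predicted cluster votes
    for its majority ground-truth class."""
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()
    return total / len(labels_true)

def nmi(labels_true, labels_pred):
    """Normalized mutual information, sqrt(H_u * H_v) normalization."""
    n = len(labels_true)
    joint = np.histogram2d(labels_true, labels_pred,
                           bins=(len(np.unique(labels_true)),
                                 len(np.unique(labels_pred))))[0] / n
    pu, pv = joint.sum(1), joint.sum(0)
    nz = joint > 0
    mi = (joint[nz] * np.log(joint[nz] / np.outer(pu, pv)[nz])).sum()
    h = lambda p: -(p[p > 0] * np.log(p[p > 0])).sum()
    denom = np.sqrt(h(pu) * h(pv))
    return mi / denom if denom > 0 else 1.0
```

Both functions return 1 for a labeling identical to the ground truth (up to cluster renaming for purity) and approach 0 when predictions are independent of the true classes.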

#### IV-C1 Clustering of Clean Images

Seven image data sets (AR1, ExYaleB, MPIE-S1, MPIE-S2, MPIE-S3, MPIE-S4, and COIL100) are used in this experiment. TABLE VI shows that the L2-graph achieves the best results in the tests, except with MPIE-S4, where it is second best. With respect to the AR1 database, the AC of the L2-graph is about 45.00% higher than that of the G-graph, 37.57% higher than that of the LLE-graph, 9.36% higher than the L1-graph, 7.71% higher than the LRR-graph, and 8.29% higher than the LatLRR-graph. In addition, the LRR-graph and LatLRR-graph, which exhibit similar performance, are far superior to the G-graph, LLE-graph, and L1-graph. Cheng et al. [11] reported that the clustering accuracy of their L1-graph with ExYaleB is about 78.5%, which is higher than that achieved in our experiment. This could be because they used a different ℓ1-solver to compute the sparsest representation. However, no details were reported in their paper. Another possible reason could be differences in data processing. In their experiments, each image was cropped and had its dark homogeneous background removed, whereas we performed the experiments over 116 features extracted by PCA. In spite of this, the accuracy of the L1-graph reported in their work (78.5%) is still lower than that of the L2-graph (86.78%).

#### Iv-C2 Clustering of Corrupted Images

To examine the robustness of the algorithms, we divide ExYaleB into two parts of equal size and corrupt one part with Gaussian noise or random pixel corruption. For each type of corruption, the corruption percentage is increased in equal intervals. Fig. 2 shows some samples.
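The two corruption schemes can be sketched as follows. The noise level `sigma` and the corruption `ratio` below are illustrative assumptions; the paper's exact settings are not reproduced here.

```python
# Sketch of the two corruption types: additive white Gaussian noise and
# random pixel corruption (a fraction of pixels replaced by random values).
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, sigma=0.1):
    """Additive white Gaussian noise on an image scaled to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def random_pixel_corruption(img, ratio=0.3):
    """Replace a `ratio` fraction of pixels with uniform random values."""
    out = img.copy().ravel()
    idx = rng.choice(out.size, size=int(ratio * out.size), replace=False)
    out[idx] = rng.uniform(0.0, 1.0, idx.size)
    return out.reshape(img.shape)

img = rng.uniform(0.0, 1.0, (32, 32))
noisy = add_gaussian_noise(img)
corrupted = random_pixel_corruption(img)
```

Gaussian noise perturbs every pixel slightly, whereas pixel corruption destroys a subset of pixels entirely, which is consistent with the observation below that the latter is harder for all algorithms.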

From TABLE VII, we can observe that:

• All of the algorithms perform better when the data are contaminated by white Gaussian noise than by random pixel corruption. Moreover, the AC and NMI of each method decrease as the corruption rate increases.

• The L2-graph is more robust than the other graphs. For example, it achieves a higher AC than the L1-graph at every tested corruption ratio, under both Gaussian noise and random pixel corruption.

• The LRR-graph and LatLRR-graph are more robust than the L1-graph under Gaussian corruption. This could be because the LRR-graph and LatLRR-graph formulate additive noise into their objective function, whereas the L1-graph does not.

• The LatLRR-graph outperforms the LRR-graph when the images are contaminated by Gaussian noise, but the LRR-graph is superior under random pixel corruption.

#### Iv-C3 Clustering of Images with Real Disguises

To investigate the robustness of the algorithms to real occlusions, we use two subsets of the AR images, i.e., AR2 and AR3 (as shown in Fig. 3). AR2 contains 600 images occluded by sunglasses and 600 clean images. AR3 consists of 600 images occluded by scarves and 600 clean samples.

TABLE VIII reports the clustering results of the tested algorithms. Clearly, the L2-graph again outperforms the other evaluated algorithms by a considerable margin. For example, its AC is 5.42% and 12.17% higher than that of the second-best algorithm (LatLRR-graph) over AR2 and AR3, respectively. Moreover, the LRR-graph and LatLRR-graph are superior to the L1-graph in terms of both AC and NMI. We also find that the evaluated algorithms perform similarly for the two different occlusions, even though the occlusion rates are very different; e.g., the AC of the L2-graph is 74.00% on AR2 and 75.42% on AR3.

### Iv-D Motion Segmentation

Motion segmentation aims to separate a video sequence into multiple spatiotemporal regions, with each region representing a moving object. Generally, segmentation algorithms are based on the feature point trajectories of multiple moving objects. Therefore, the motion segmentation problem can be thought of as the clustering of these trajectories into different subspaces, each of which corresponds to an object. Recently, a number of subspace clustering approaches have been proposed to resolve this problem, e.g., GPCA [45, 46], SCC [47], LSA [48], and ALC [8], and these algorithms have achieved impressive results.

Mathematical models of motion segmentation depend on the camera projection model. The affine camera model has been extensively studied in recent years. SSC (another ℓ1-norm based graph), LRR, and LatLRR have each been extended to the affine camera model by introducing affine constraints into their objective functions [38, 9, 10]. Theoretically, it is better to depict the affine camera model using the corresponding mathematical models. However, there is no empirical evidence that the affine constraint obviously improves the segmentation quality of SSC and the LRR variants, as pointed out in [21]. Therefore, in this article, we use only the linear models of the evaluated algorithms for the motion segmentation task.

A common problem in motion segmentation is that some data entries are missing owing to occlusions or other reasons. There are two simple ways to handle this problem: filling the missing entries with random values, or removing all features corresponding to the missing entries from the data set. The first method, which we adopt here, transforms the problem into the clustering of corrupted data.
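The first strategy can be sketched as follows. The NaN encoding of missing entries and the uniform fill range are illustrative assumptions, not details from the paper.

```python
# Sketch of the "random fill" strategy for missing trajectory entries:
# replace NaNs with random values drawn from the observed data range,
# turning the missing-data problem into clustering of corrupted data.
import numpy as np

rng = np.random.default_rng(1)

def fill_missing(X):
    X = X.copy()
    mask = np.isnan(X)
    lo, hi = np.nanmin(X), np.nanmax(X)
    X[mask] = rng.uniform(lo, hi, mask.sum())  # random fill within data range
    return X

X = np.array([[1.0, np.nan], [3.0, 4.0]])
print(np.isnan(fill_missing(X)).any())  # False
```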

To examine the performance of the L2-graph for motion segmentation, we conduct experiments on the Hopkins155 raw data [49], some frames of which are shown in Fig. 4. The data include the feature point trajectories of 155 image sequences, consisting of 120 video sequences with two motions and 35 video sequences with three motions. Thus, there are a total of 155 independent clustering tasks. In the experiments, we run the L2-graph over the clean image sequences, and form corrupted sequences by adding Gaussian noise to each sequence with corruption ratios of 5% and 10%. For each algorithm, similar to [13, 50, 18], we report the best average and corresponding median clustering errors using the tuned parameters over each kind of image sequence (two and three motions). The clustering error is defined as

 clustering error = 1 − AC. (20)

Fig. 5 shows the clustering errors of the evaluated algorithms over all sequences. Moreover, TABLE IX reports the mean and median clustering errors on the data sets. For a fair comparison, we also give the results for SSC, the LRR-graph, and the LatLRR-graph reported in [10, 18]. Note that SSC and the L1-graph construct a similarity graph using an ℓ1-norm-based sparse representation (we discussed their connections in Section II). From the results, we can make the following observations:

• The L2-graph again outperforms its counterparts in most tests, and achieves results that are comparable to those reported recently. Note that [10] used PCA as a preprocessing step to extract features for each sequence, whereas [18] carried out the experiments on the raw data, as did we.

• All the algorithms perform better with two-motion data than with three-motion data, and the corrupted data largely decrease the clustering quality of the tested algorithms.

### Iv-E Computational Costs

In this subsection, we investigate the time costs of constructing the similarity graph using four state-of-the-art reconstruction coefficient-based methods, i.e., the L2-graph, L1-graph, LRR-graph, and LatLRR-graph. TABLE X reports the time costs obtained by averaging the elapsed CPU time over 10 independent experiments for each algorithm. We can see that the computational time of the L2-graph is remarkably lower than that of the other methods. This is because the L1-graph, LRR-graph, and LatLRR-graph find the optimal solutions for their objective functions in an iterative manner, whereas the L2-graph projects data points into another space via algebraic operations. In addition, the LRR-graph is more efficient than the L1-graph, and the LatLRR-graph is noticeably slower than the LRR-graph owing to the introduction of latent variables.
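The efficiency gap can be illustrated with a small sketch. This is not the paper's exact Theorem 1 solver: the ℓ2-regularized objective, `lambda_`, and the zero-diagonal step are illustrative assumptions. The point is that an ℓ2-based representation of all points comes from a single linear solve, whereas ℓ1- and nuclear-norm problems require iterative optimization.

```python
# Sketch: an l2-regularized self-representation has a closed form, so the
# coefficients of every data point are obtained from one matrix
# factorization instead of per-point iterative optimization.
import numpy as np

def l2_coefficients(D, lambda_=0.1):
    """Represent every column of D over all columns via one linear solve."""
    n = D.shape[1]
    G = D.T @ D + lambda_ * np.eye(n)  # regularized Gram matrix
    C = np.linalg.solve(G, D.T @ D)    # n x n coefficient matrix, one solve
    np.fill_diagonal(C, 0.0)           # crudely suppress self-representation
    return C

rng = np.random.default_rng(2)
D = rng.standard_normal((50, 20))
C = l2_coefficients(D)
print(C.shape)  # (20, 20)
```

Sparsifying such a coefficient matrix (e.g., keeping only the largest entries per column) and symmetrizing it yields a similarity graph in the spirit of the L2-graph.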

## V Conclusion

Based on the assumption that trivial coefficients are always distributed over trivial data points such as noise, we proposed a simple but effective scheme to eliminate the errors in data so as to obtain a block-diagonal affinity matrix. The scheme adopts an "encode, then denoise in the representation" strategy, which inverts the popular procedure of denoising the input first. The theoretical analysis has shown that the scheme is correct when the ℓ1-, ℓ2-, or ℓ∞-norm, or the nuclear norm, is enforced over the representation. Based on this scheme, we developed an algorithm named the L2-graph. The L2-graph calculates the linear representation of a given data set and then eliminates the effects of errors by zeroing the trivial components. Extensive experiments demonstrated the efficiency and effectiveness of the L2-graph in feature extraction, image clustering, and motion segmentation.

There are several ways to improve and extend this work. First, Lemma 2 and Lemma 3 establish the condition under which our scheme is correct. However, this condition is somewhat strict, as it requires the coefficients over the errors to be zero. Second, although the theoretical analysis and experimental studies showed some connections between the parameter of the L2-graph and the intrinsic dimensionality of a subspace, it remains challenging to determine the parameter value. Hence, we intend to develop more theoretical results in the future. Third, the experimental comparisons illustrate that the L2-graph is more efficient than the L1-graph, LRR-graph, and LatLRR-graph. However, even a medium-sized data set raises a scalability issue for the L2-graph. Therefore, it would be interesting to make the L2-graph and its counterparts more efficient in handling large-scale data sets.

## Appendix

In this appendix, Lemma 1 – Lemma 6 show that it is possible to eliminate the effect of errors by zeroing the trivial coefficients. Moreover, Theorem 1 gives an analytic solution to the L2-graph that makes the algorithm efficient.

### -a ℓp norm-based Model

Let x ≠ 0 be a data point in the subspace that is spanned by D = [D0, De], where D0 is the clean data set and De consists of the errors that probably exist in D. Without loss of generality, let S0 denote the subspace spanned by D0 and Se denote the subspace spanned by De. Hence, x is drawn either from the intersection between S0 and Se, or from S0 except the intersection, denoted as x ∈ S0 ∩ Se or x ∈ S0 ∖ (S0 ∩ Se).

Let c∗ and c∗0 be the optimal solutions of

 min ∥c∥p s.t. x = Dc, (21)

over D and D0, respectively, where ∥⋅∥p denotes the ℓp-norm with 1 ≤ p ≤ ∞. We aim to investigate the conditions under which, for every nonzero data point x, if the ℓp-norm of c∗0 is smaller than that of c∗e, then the r0 largest entries (in absolute value) of c∗ correspond to D0. Here, c∗e is the optimal solution of (21) over De, [c]r0,1 denotes the r0-th largest absolute value of the coefficients of c, and r0 is the dimensionality of S0.

###### Lemma 1

For any nonzero data point x in the subspace S0 except the intersection between S0 and Se, i.e., x ∈ S0 ∖ (S0 ∩ Se), the optimal solution of (21) over D is given by c∗, which is partitioned according to the sets D0 and De, i.e., c∗ = [c∗0; c∗e]. Thus, we must have c∗e = 0.

Since the nonzero x ∈ S0 ∖ (S0 ∩ Se) can only be represented as a linear combination of the data points from D0, we must have x = Dc∗ = D0c∗0 + Dec∗e and Dec∗e = 0, which implies that c∗e = 0.

###### Lemma 2

Consider a nonzero data point x in the intersection between S0 and Se, i.e., x ∈ S0 ∩ Se, where S0 and Se denote the subspaces spanned by the clean data set D0 and the errors De, respectively. Let c∗, c∗0, and c∗e be the optimal solutions of

 min ∥c∥p s.t. x = Dc, (22)

over D, D0, and De, respectively, where ∥⋅∥p denotes the ℓp-norm, 1 ≤ p ≤ ∞, and c∗ is partitioned according to the sets D0 and De. We have x = D0c∗0 and c∗ = [c∗0; 0], if and only if ∥c∗0∥p < ∥c∗e∥p.

(⇐) We prove the result using contradiction. Assume that the part of c∗ over De is nonzero, i.e., c∗ = [c∗0; c∗e] with c∗e ≠ 0; then

 x−D0c∗0=Dec∗e. (23)

Note that the left side and the right side of (23) correspond to a data point from S0 and Se, respectively. Then, we must have

 x=D0c∗0+D0z0, (24)

and

 x=D0c∗0+Deze, (25)

for some z0 and ze satisfying D0z0 = Deze = Dec∗e, since Dec∗e lies in the intersection S0 ∩ Se. Clearly, [c∗0 + z0; 0] and [c∗0; ze] are feasible solutions of (22) over D. According to the triangle inequality and the condition ∥c∗0∥p < ∥c∗e∥p, we have

 (26)

From (25), we have ∥ze∥p ≥ ∥c∗e∥p, as c∗e is the optimal solution of (22) over De. Then, the feasible solution [c∗0 + z0; 0] attains a smaller ℓp-norm than c∗ = [c∗0; c∗e]. This contradicts the fact that c∗ is the optimal solution of (22) over D.

(⇒) We prove the result using contradiction. For a nonzero data point x ∈ S0 ∩ Se, assume ∥c∗0∥p ≥ ∥c∗e∥p. Thus, for the data point x, it is possible that (22) will only choose points from De to represent x. This contradicts the assumption that x = D0c∗0 and c∗ = [c∗0; 0].

Lemma 2 is an extension of Theorem 2 in [10], which only provided the necessary and sufficient condition when the ℓ1-norm is enforced over the coefficients. More importantly, that work assumes that the data are drawn from a union of independent/disjoint subspaces, whereas we do not make this assumption; it is very strong in practice, since most real data are sampled from dependent subspaces.
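The intra-subspace dominance underlying Lemmas 1 and 2 can be checked numerically in the simplest regime, where S0 ∩ Se = {0}: any representation of x ∈ S0 must then place zero weight on De. The dimensions, sample sizes, and the use of the minimum ℓ2-norm solution below are illustrative assumptions, not the paper's experiment.

```python
# Toy check of Intra-subspace Projection Dominance (IPD): encode x over
# D = [D0, De] with the minimum l2-norm solution and compare coefficient
# magnitudes over the intra-subspace columns D0 vs the error columns De.
import numpy as np

rng = np.random.default_rng(3)
m, r0 = 50, 3
basis = rng.standard_normal((m, r0))
D0 = basis @ rng.standard_normal((r0, 10))  # 10 points in a 3-dim subspace S0
De = rng.standard_normal((m, 10))           # 10 "error" points in general position
x = basis @ rng.standard_normal(r0)         # query drawn from the same subspace

D = np.hstack([D0, De])
c = np.linalg.pinv(D) @ x                   # minimum l2-norm solution of x = Dc
intra = np.abs(c[:10]).max()                # largest coefficient over D0
inter = np.abs(c[10:]).max()                # largest coefficient over De
print(intra > inter)  # True: here S0 and span(De) intersect only at 0
```

Generic random subspaces of dimensions 3 and 10 in R^50 intersect only at the origin, so the coefficients over De vanish (up to numerical error), which is exactly the regime of Lemma 1.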

###### Definition 1 (The First Principal Angle)

Let E be a Euclidean vector space, and consider two subspaces U, W ⊆ E with 1 ≤ dim(U) ≤ dim(W). There exists a set of angles called the principal angles, the first one being defined as:

 θmin:=min{arccos(μTν/(∥μ∥2∥ν∥2))}, (27)

where μ ∈ U and ν ∈ W are nonzero vectors.
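Definition 1 can be computed numerically: with orthonormal bases for the two subspaces, the cosine of the first principal angle is the largest singular value of the product of the bases. A minimal sketch (the function name is illustrative):

```python
# First principal angle between span(A) and span(B): orthonormalize both
# bases with QR, then the largest singular value of Qa^T Qb is the cosine
# of the smallest angle between the subspaces.
import numpy as np

def first_principal_angle(A, B):
    """Smallest angle between span(A) and span(B), in radians."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s.max(), -1.0, 1.0))  # largest cosine -> smallest angle

# Two planes in R^3 that share the x-axis: the first principal angle is 0.
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # xy-plane
B = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])  # xz-plane
print(first_principal_angle(A, B))  # 0.0
```

A first principal angle of 0 means the subspaces intersect nontrivially, which is the difficult case that the sufficient condition in Lemma 3 must handle through cosθmin.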

###### Lemma 3

Consider a data point x in the intersection between S0 and Se, i.e., x ∈ S0 ∩ Se, where S0 and Se denote the subspaces spanned by the clean data set D0 and the errors De, respectively. The dimensionality of S0 is r0, and that of Se is re. Let c∗ be the optimal solution of

 min ∥c∥p s.t. x = Dc, (28)

over D, where ∥⋅∥p denotes the ℓp-norm, 1 ≤ p ≤ ∞, and c∗ is partitioned according to the sets D0 and De. If the sufficient condition

 σmin(D0) ≥ re cosθmin ∥De∥1,2, (29)

is satisfied, then the r0 largest entries (in absolute value) of c∗ correspond to D0. Here, σmin(D0) is the smallest nonzero singular value of D0, θmin is the first principal angle between S0 and Se, ∥De∥1,2 is the maximum ℓ2-norm of the columns of De, and [c]r0,1 denotes the r0-th largest absolute value of the coefficients of c.

Since x ∈ S0, we can write x = D0z0 with z0 = Vr0Σr0−1Ur0Tx, where Ur0Σr0Vr0T is the skinny SVD of D0, r0 is the rank of D0, and z0 is the optimal solution of (28) over D0. Thus, ∥z0∥2 = ∥Vr0Σr0−1Ur0Tx∥2.

From the properties of the ℓp-norm, i.e., ∥z∥p ≤ ∥z∥1 for p ≥ 1 and ∥z∥1 ≤ √r0∥z∥2 for z ∈ Rr0, we have

 ∥z0∥p ≤ ∥z0∥1 ≤ √r0∥z0∥2 = √r0∥Vr0Σr0−1Ur0Tx∥2. (30)

Since the Frobenius norm is subordinate to the Euclidean vector norm, we must have

 ∥z0∥p ≤ √r0∥Vr0Σr0−1Ur0T∥F∥x∥2 = √r0·√(σ1−2(D0)+⋯+σr0−2(D0))·∥x∥2 ≤ r0σmin−1(D0)∥x∥2, (31)

where σmin(D0) is the smallest nonzero singular value of D0.

Moreover, x can be represented as a linear combination of De since it lies in the intersection between S0 and Se, i.e., x = Deze, where ze is the optimal solution of (28) over De. Multiplying both sides of this equation by xT gives ∥x∥22 = xTDeze. According to Hölder's inequality, we have

 ∥x∥22≤∥DTex∥∞∥ze∥1, (32)

According to the definition of the first principal angle (Definition 1), we have

 ∥DeTx∥∞ = max(|[De]1Tx|, |[De]2Tx|, ⋯) ≤ cosθmin∥De∥1,2∥x∥2, (33)

where [De]i denotes the i-th column of De.