I Introduction
Subspace learning or metric learning aims to find a projection matrix from the training data , so that the highdimensional datum can be transformed into a lowdimensional space via . Existing subspace learning methods can be roughly divided into three categories: supervised, semisupervised, and unsupervised. Supervised method incorporates the class label information of to obtain discriminative features. The wellknown works include but not limit to linear discriminant analysis [1], neighbourhood components analysis [2], and their variants such as [3, 4, 5, 6]. Moreover, Xu et al. [7] recently propose to formulate the problem of supervised multiple view subspace learning as one multiple source communication system, which provide a novel insight to the community. Semisupervised methods [8, 9, 10] utilize limited labeled training data as well as unlabeled ones for better performance. Unsupervised methods seek a lowdimensional subspace without using any label information of training samples. Typical methods in this category include Eigenfaces [11], neighbourhood preserving embedding (NPE) [12], locality preserving projections (LPP) [13], sparsity preserving projections (SPP) [14] or known as L1graph [15], and multiview intact space learning [16]. For these subspace learning methods, Yan et al. [17] have shown that most of them can be unified into the framework of graph embedding, i.e., low dimensional features can be achieved by embedding some desirable properties (described by a similarity graph) from a highdimensional space into a lowdimensional one. By following this framework, this paper focuses on unsupervised subspace learning, i.e., dimension reduction considering the unavailable label information in training data.
Although a large number of subspace learning methods have been proposed, less works have investigated the following two challenging problems simultaneously: 1) how to automatically determine the dimension of the feature space, referred to as automatic subspace learning, and 2) how to immune the influence of corruptions, referred to as robust subspace learning.
Automatic subspace learning involves the technique of dimension estimation which aims at identifying the number of features necessary for the learned lowdimensional subspace to describe a data set. In previous studies, most existing methods manually set the feature dimension by exploring all possible values based on the classification accuracy. Clearly, such a strategy is timeconsuming and easily overfits to the specific data set. In the literature of manifold learning, some dimension estimation methods have been proposed,
e.g., spectrum analysis based methods [18, 19], boxcounting based methods [20], fractalbased methods [21, 22], tensor voting
[23], and neighbourhood smoothing [24]. Although these methods have achieved some impressive results, this problem is still far from solved due to the following limitations: 1) these existing methods may work only when data are sampled in a uniform way and data are free to corruptions, as pointed out by Saul and Roweis [25]; 2) most of them can accurately estimate the intrinsic dimension of a single subspace but would fail to work well for the scenarios of multiple subspaces, especially, in the case of the dependent or disjoint subspaces; 3) although some dimension estimation techniques can be incorporated with subspace learning algorithms, it is more desirable to design a method that can not only automatically identify the feature dimension but also reduce the dimension of data.Robust subspace learning aims at identifying underlying subspaces even though the training data contains gross corruptions such as the Gaussian noise. Since is corrupted by itself, accurate prior knowledge about the desired geometric properties is hard to be learned from
. Furthermore, gross corruptions will make dimension estimation more difficult. This socalled robustness learning problem has been challenging in machine learning and computer vision. One of the most popular solutions is recovering a clean data set from inputs and then performing dimension reduction over the clean data. Typical methods include the wellknown principal component analysis (PCA) which achieves robust results by removing the bottom eigenvectors corresponding to the smallest eigenvalues. However, PCA can achieve a good result only when data are sampled from a single subspace and only contaminated by a small amount of noises. Moreover, PCA needs specifying a parameter (
e.g., 98% energy) to distinct the principal components from the minor ones. To improve the robustness of PCA, Candes et al. recently proposed robust PCA (RPCA) [26] which can handle the sparse corruption and has achieved a lot of success [27, 28, 29, 30, 31, 32]. However, RPCA directly removes the errors from the input space, which cannot obtain the lowdimensional features of inputs. Moreover, the computational complexity of RPCA is too high to handle the largescale data set with very high dimensionality. Bao et al. [33] proposed an algorithm which can handle the gross corruption. However, they did not explore the possibility to automatically determine feature dimension. Tzimiropoulos et al. [34] proposed a subspace learning method from image gradient orientations by replacing pixel intensities of images with gradient orientations. Shu et al. [35]proposed to impose the lowrank constraint and group sparsity on the reconstruction coefficients under orthonormal subspace so that the Laplacian noise can be identified. Their method outperforms a lot of popular methods such as Gabor features in illumination and occlusionrobust face recognition. Liu and Tao have recently carried out a series of comprehensive works to discuss how to handle various noises,
e.g., Cauchy noise [36], Laplacian noise [37], and noisy labels [38]. Their works provide some novel theoretical explanations towards understanding the role of these errors. Moreover, some recent developments have been achieved in the field of subspace clustering [39, 40, 41, 42, 43, 44, 45], which use , , or nuclearnorm based representation to achieve robustness.In this paper, we proposed a robust unsupervised subspace learning method which can automatically identify the number of features. The proposed method, referred to as principal coefficients embedding (PCE), formulates the possible corruptions as a term of an objective function so that a clean data set and the corresponding reconstruction coefficients can be simultaneously learned from the training data . By embedding into an dimensional space, PCE obtains a projection matrix , where is adaptively determined by the rank of with theoretical guarantees.
PCE is motivated by a recent work in subspace clustering [39, 46] and the wellknown locally linear embedding (LLE) method [47]. The former motivates us the way to achieve robustness, i.e., the errors such as the Gaussian noise can be mathematically formulated as a term into the objective function and thus the errors can be explicitly removed. The major differences between [39] and PCE are: 1) [39] is proposed for clustering, whereas PCE is for dimension reduction; 2) the objective functions are different. PCE is based on the Frobenius norm instead of the nuclear norm, thus resulting a closedform solution and avoiding iterative optimization procedure; 3) PCE can automatically determine the feature dimension, whereas [39] does not investigate this challenging problem. LLE motivated us the way to estimate feature dimension even though it does not overcome this problem. LLE is one of most popular dimension reduction methods, which encodes each data point as a linear combination of its neighborhood and preserves such reconstruction relationship into different projection spaces. LLE implies the possibility to estimate the feature dimension using the size of neighborhood of data points. However, this parameter needs to be specified by users rather than automatically learning from data. Thus, LLE still suffers from the issue of dimension estimation. Moreover, the performance of LLE would be degraded when the data is contaminated by noises. The contributions of this work are summarized as follows:

The proposed method (i.e.
, PCE) can handle the Gaussian noise that probably exists into data with theoretical guarantees. Different from the existing dimension reduction methods such as L1Graph, PCE formulates the corruption into its objective function and only calculates the reconstruction coefficients using a clean data set instead of the original one. Such a formulation can perform data recovery and improve the robustness of PCE;

Unlike previous subspace learning methods, PCE can automatically determine the feature dimension of the learned lowdimensional subspace. This largely reduces the efforts for finding an optimal dimension and thus PCE is more competitive in practice;

PCE is computationally efficient, which only involves performing singular value decomposition (SVD) over the training data set one time.
The rest of this paper is organized as follows. Section II briefly introduces some related works. Section III presents our proposed algorithm. Section IV reports the experimental results and Section V concludes this work.
Notation  Definition 

the number of data points  
data size of the th subject  
the dimension of input  
the dimension of feature space  
the number of subject  
the rank of a given matrix  
a given testing sample  
the lowdimensional feature  
training data set  
full and skinny SVD of  
the desired clean data set  
the errors existing into  
the representation of  
the th singular value of  
the projection matrix 
Ii Related Works
Iia Notations and Definitions
In the following, lowercase bold letters
represent column vectors and
UPPERCASE BOLD ONES denote matrices. and denote the transpose and pseudoinverse of the matrix , respectively.denotes the identity matrix.
For a given data matrix , the Frobenius norm of is defined as
(1) 
where denotes the th singular value of .
The full Singular Value Decomposition (SVD) and the skinny SVD of are defined as and , where and are in descending order. , and consist of the top (i.e., largest) singular vectors and singular values of . Table I summarizes some notations used throughout the paper.
IiB Locally Linear Embedding
In [17], Yan et al. have shown that most unsupervised, semisupervised, and supervised subspace learning methods can be unified into a framework known as graph embedding. Under this framework, subspace learning methods obtain lowdimensional features by preserving some desirable geometric relationships from a highdimensional space into a lowdimensional one. Thus, the performance of subspace learning largely depends on the identified relationship which is usually described by a similarity graph (i.e.
, affinity matrix). In the graph, each vertex corresponds to a data point and the edge weight denotes the similarity between two connected points. There are two popular ways to measure the similarity among data points,
i.e., pairwise distance such as Euclidean distance [48] and linear reconstruction coefficients introduced by LLE [47].For a given data matrix , LLE solves the following problem:
(2) 
where is the linear representation of over , denotes the th entry of , and consists of nearest neighbors of that are chosen from the collection of in terms of Euclidean distance.
By assuming the reconstruction relationship is invariant to ambient space, LLE obtains the lowdimensional features of by
(3) 
where and the nonzero entries of corresponds to .
IiC L1Graph
By following the framework of LLE and NPE, Qiao et al. [14] and Cheng et al. [15] proposed SPP and L1graph, respectively. The methods sparsely encode each data points by solving the following sparse coding problem:
(4) 
After obtaining , SPP and L1graph embed into the feature space by following NPE. The advantage of sparsity based subspace methods is that they can automatically determine the neighbourhood for each data point without the parameter of neighbourhood size. Inspired by the success of SPP and L1graph, a number of spectral embedding methods [51, 52, 53, 54, 55, 56] have been proposed. However, these methods including L1graph and SPP have still required specifying the dimension of feature space.
IiD Robust Principal Component Analysis
RPCA [26] is proposed to improve the robustness of PCA, which solves the following optimization problem:
(5) 
where is the parameter to balance the possible corruptions and the desired clean data, and is norm to count the number of nonzero entries of a given matrix or vector.
Iii Principal Coefficients Embedding for Unsupervised Subspace Learning
In this section, we propose an unsupervised algorithm for subspace learning, i.e., principal coefficients embedding (PCE). The method not only can achieve robust results but also can automatically determine the feature dimension.
For a given data set containing the errors , PCE achieves robustness and dimension estimation in two steps: 1) the first step achieves the robustness by recovering a clean data set from and building a similarity graph with the reconstruction coefficients of , where and are jointly learned by solving a SVD problem; 2) the second step automatically estimates the feature dimension using the rank of and learns the projection matrix by embedding into an dimensional space. In the following, we will introduce these two steps in details.
Iiia Robustness Learning
For a given training data matrix , PCE removes the corruption from and then linearly encodes the recovered clean data set over itself. The proposed objective function is as follows:
(7) 
The proposed objective function mainly considers the constraints on the representation and the errors . We enforce Frobenius norm on because some recent works have shown that the Frobenius norm based representation is more computationally efficient than the  and nuclearnorm based representation while achieving competitive performance in face recognition [58] and subspace clustering [41]. Moreover, Frobeniusnorm based representation has shared some desirable properties with nuclearnorm based representation as shown in our previous theoretical studies [59, 46].
The term is motivated by the recent development in subspace clustering [60, 61], which can be further derived from the formulation of LLE (i.e., eqn.(3)). More specifically, ones reconstruct by itself to obtain this socalled selfexpression as the similarity of data set. The major differences between this work and the existing methods are: 1) the objective functions are different. Our method is based on Frobenius norm instead of  or nuclearnorm; 2) the methods directly project the original data into the space spanned by itself, whereas we simultaneously learn a clean data set from and compute the selfexpression of ; 3) PCE is proposed for subspace learning, whereas the methods are proposed for clustering.
By formulating the error as a term into our objective function, we can achieve robustness by . The constraint on (i.e., ) could be chosen as , , or norm. Different choices of correspond to different types of noises. For example, norm is usually used to formulate the Laplacian noise, norm is adopted to describe the Gaussian noise, and
norm is used to represent the samplespecified corruption such as outlier
[61]. Here, we mainly consider the Gaussian noise which is commonly assumed in signal transmission problem. Thus, we have the following objective function:(8) 
where
denotes the error that follows the Gaussian distribution. It is worthy to point out that, although the above formulation only considers the Gaussian noise, our experimental result show that PCE is also robust to other corruptions such as random pixel corruption (nonadditive noise) and real disguises.
To efficiently solve (8), we first consider the case of corruptionfree, i.e., . In such a setting, (8) is simplified as follows:
(9) 
Note that, is a feasible solution to , where denotes the pseudoinverse of . In [46], an unique minimizer to (9) is given as follows.
Lemma 1.
Let be the skinny SVD of the data matrix . The unique solution to
(10) 
is given by , where is the rank of and is a clean data set without any corruptions.
Proof.
Let be the full SVD of . The pseudoinverse of is . Defining by and . To prove that is the unique solution to (10), two steps are required.
First, we prove that is the minimizer to (10), i.e., for any satisfying , it must hold that
. Since for any column orthogonal matrix
, it must hold that . Then, we have(11)  
As satisfies , then , i.e., . Since , . Denote , then . Because , we have . Then, it follows that
(12) 
Since for any matrixes and with the same number of columns, it holds that
(13) 
Furthermore, since
(15) 
we have .
Second, we prove that is the unique solution of (10). Let be another minimizer, then, and . From (14) and (15),
(16) 
Since , it must hold that , and then . Together with , this gives . Because is an orthogonal matrix, it must hold that . ∎
Based on Lemma 1, the following theorem can be used to solve the robust version of PCE (i.e., ).
Theorem 1.
Let be the full SVD of , where the diagonal entries of are in descending order, and are corresponding left and right singular vectors, respectively. Suppose there exists a clean data set and errors, denoted by and , respectively. The optimal to (8) is given by , where is a balanced factor, consists of the first right singular vectors of , , and denotes the th diagonal entry of .
Proof.
(8) can be rewritten as
(17) 
Let be the skinny SVD of , where is the rank of . Let and be the basis that orthogonal to and , respectively. Clearly, . By Lemma 1, the representation over the clean data is given by . Next, we will bridge and .
Using Lagrange method, we have
(18) 
where denotes the Lagrange multiplier and the operator denotes dot product.
Letting , it gives that
(19) 
Letting , it gives that
(20) 
Thus, we have . Clearly, is minimized when is a diagonal matrix and can be denoted by , i.e., . Thus, the SVD of could be chosen as
(22) 
Theorem 1 shows that the skinny SVD of is automatically separated into two parts, the top and the bottom one correspond to a desired clean data and the possible corruptions , respectively. Such a PCAlike result provides a good explanation toward the robustness of our method, i.e., the clean data can be recovered by using the first leading singular vectors of . It should be pointed out that the above theoretical results (Lemma 1 and Theorem 1) have been presented in [46] for building the connections between Frobenius norm based representation and nuclear norm based representation in theory. Different from [46], this paper mainly considers how to utilize this result to achieve robust and automatic subspace learning.
Fig. 1 gives an example to show the effectiveness of PCE. We carried out experiment using 700 clean AR facial images [62] as training data that distribute over 100 individuals. Fig. 1(a) shows the coefficient matrix obtained by PCE. One can find that the matrix is approximately blockdiagonal, i.e., if and only if the corresponding points and belong to the same class. Moreover, we perform SVD over and show the singular values of in Fig. 1(b). One can find that only the first singulars values are nonzero. In other words, the intrinsic dimension of the entire data set is and the first singular values can preserve 100% information. It should be pointed out that, PCE does not set a parameter to truncate the trivial singular values like PCA and PCAlike methods [19], which incorporates all energy into a small number of dimension.
IiiB Intrinsic Dimension Estimation and Projection Learning
After obtaining the coefficient matrix , PCE builds a similarity graph and embeds it into an dimensional space by following NPE [12], i.e.,
(24) 
where denotes the projection matrix.
One challenging problem arising in dimension reduction is to determine the value of , most existing methods experimentally set this parameter, which is very computational inefficiency. To solve this problem, we propose estimating the feature dimension using the rank of the affinity matrix and have the following theorem:
Theorem 2.
For a given data set , the feature dimension is upper bounded by the rank of , i.e.,
(25) 
Proof.
It is easy to see that Eq.(24) has the following equivalent variation:
(26) 
We can see that the optimal solution to Eq.(26) consists of leading eigenvectors of the following generalized Eigen decomposition problem:
(27) 
where is the corresponding singular value of the problem.
As , then (27) can be rewritten
(28) 
Algorithm 1 summarizes the procedure of PCE. Note that, it does not require to be a symmetric matrix.
IiiC Computational Complexity Analysis
For a training data set , PCE performs the skinny SVD over in . However, a number of fast SVD methods can speed up this procedure. For example, the complexity can be reduced to by Brand’s method [63], where is the rank of . Moreover, PCE estimates the feature dimension in and solves a sparse generalized eigenvector problem in with Lanczos eigensolver. Putting everything together, the time complexity of PCE is due to .
Iv Experiments and Results
In this section, we reported the performance of PCE and six stateoftheart unsupervised feature extraction methods including Eigenfaces
[11], Locality Preserving Projections (LPP) [48, 13], neighbourhood Preserving Embedding (NPE) [12], L1graph [15], Nonnegative Matrix Factorization (NMF) [64, 65], RPCA [26], NeNMF [66], and Robust Orthonormal Subspace Learning (ROSL) [35]. Noticed that, NeNMF is one of the most efficient NMF solvers, which can effectively overcome the slow convergence rate, numerical instability and nonconvergence issue of NMF. All algorithms are implemented in MATLAB. The used data sets and the codes of our algorithm can be downloaded from the website http://machineilab.org/users/pengxi.Iva Experimental Setting and Data Sets
We implemented a fast version of L1graph by using Homotopy algorithm [67] to solve the minimization problem. According to [49], Homotopy is one of the most competitive optimization algorithms in terms of accuracy, robustness, and convergence speed. For RPCA, we adopted the accelerated proximal gradient method with partial SVD [68] which has achieved a good balance between computation speed and reconstruction error. As mentioned above, RPCA cannot obtain the projection matrix for subspace learning. For fair comparison, we incorporated Eigenfaces with RPCA (denoted by RPCA+PCA) and ROSL (denoted by ROSL+PCA) to obtain the lowdimensional features of the inputs. Unless otherwise specified, we assigned for all the tested methods except PCE which automatically determines the value of .
In our experiments, we evaluated the performance of these subspace learning algorithms with three classifiers,
i.e., sparse representation based classification (SRC) [69, 70], support vector machine (SVM) with linear kernel
[71], and the nearest neighbor classifier (NN). For all the evaluated methods, we first identify their optimal parameters using a data partitions and then reported the mean and standard deviation of classification accuracy using 10 randomly sampling data partitions.
We used eight image data sets including AR facial database [62], Expended Yale Database B (ExYaleB) [72], four sessions of Multiple PIE (MPIE) [73], COIL100 objects database [74], and the handwritten digital database USPS^{1}^{1}1http://archive.ics.uci.edu/ml/datasets.html.
The used AR data set contains 2600 samples from 50 male and 50 female subjects, of which 1400 samples are clean images, 600 samples are disguised by sunglasses, and the remaining 600 samples are disguised by scarves. ExYaleB contains 2414 frontalface images of 38 subjects, and we use the first 58 samples of each subject. MPIE contains the facial images captured in four sessions. In the experiments, all the frontal faces with 14 illuminations^{2}^{2}2illuminations: 0,1,3,4,6,7,8,11,13,14,16,17,18,19. are investigated. For computational efficiency, we downsized all the data sets from the original size to smaller one. Table II provides an overview of the used data sets.
Databases  Original Size  Cropped Size  

AR  100  26  
ExYaleB  38  58  
MPIES1  249  14  
MPIES2  203  10  
MPIES3  164  10  
MPIES4  176  10  
COIL100  100  10  
USPS  10  1100   
IvB The Influence of the Parameter
In this section, we investigate the influence of parameters of PCE. Besides the aforementioned subspace clustering methods, we also report the performance of CorrEntropy based Sparse Representation (CESR) [75] as a baseline. Noticed that, CESR is a not subspace learning method, which performs like SRC to classify each testing sample by finding which subject produces the minimal reconstruction error. By following the experimental setting in [75], we evaluated CESR using the nonnegativity constraint with 0.
The influence of : PCE uses the parameter to measure the possible corruptions and estimate the feature dimension . To investigate the influence of on the classification accuracy and the estimated dimension, we increased the value of from 1 to 99 with an interval of 2 by performing experiment on a subset of AR database and a subset of Extended Yale Database B. The used data sets include 1400 clean images over 100 individuals and 2204 samples over 38 subjects. In the experiment, we randomly divided each data set into two parts with equal size for training and testing.
Fig. 2 shows that a larger will lead to a larger but does not necessarily bring a higher accuracy since the value of does reflect the errors contained into inputs. For example, while increases from to , the recognition accuracy of PCE on AR almost remains unchanged, which ranges from 93.86% to 95.29%.
Classifiers  SRC  SVM  NN  

Algorithms  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para. 
PCE  96.900.74  23.502.36  5, 118  98.930.18  7.440.37  50, 329  97.030.57  6.960.71  5, 118 
PCE2  96.920.59  28.022.84  16.00  98.200.43  8.070.67  26.00  96.860.57  7.890.88  19.00 
Eigenfaces  95.320.80  27.790.22    95.530.85  5.650.14    82.531.70  4.970.14   
LPP  83.876.59  17.200.71  9.00  87.929.12  7.400.12  2.00  79.971.36  7.180.19  3.00 
NPE  90.4715.72  37.800.45  50.00  82.508.74  27.570.24  47.00  93.350.53  28.370.30  49.00 
L1graph  91.290.60  633.9547.94  1e2,1e1  82.081.66  870.0461.01  1e3,1e3  89.750.70  988.2774.98  1e2,1e3 
NMF  87.541.15  137.466.26    91.591.09  19.390.36    72.111.44  11.130.03   
NeNMF  87.091.11  72.106.37    76.732.14  10.380.66    47.631.03  6.890.02   
RPCA+PCA  95.880.56  497.4832.72  0.30  95.791.02  466.1735.85  0.10  82.571.18  466.1142.70  0.20 
ROLS+PCA  95.730.77  765.9515.95  0.23  95.030.86  733.4618.09  0.19  81.331.65  732.6115.58  0.35 
PCE with the fixed : To further show the effectiveness of our dimension determination method, we investigated the performance of PCE by manually specifying , denoted by PCE2. We carried out the experiments on ExYaleB by choosing 40 samples from each subject as training data and using the rests for testing. Table III reports the result from which we can find that:

the automatic version of our method, i.e., PCE, performs competitive to PCE2 which manually set . This shows that our dimension estimation method can accurately estimate the feature dimension.

both PCE and PCE2 outperform the other methods by a considerable performance margin. For example, PCE is 3.68% at least higher than the second best method when the NN classifier is used.

Although PCE is not the fastest algorithm, it achieves a good balance between recognition rate and computational efficiency. In the experiments, PCE, Eigenfaces, LPP, NPE, and NeNMF are remarkably faster than other baseline methods. Moreover, NeNMF is remarkably faster than NMF while achieving a competitive performance.
Tuning for the baseline methods: To show the dominance of the dimension estimation of PCE, we reported the performance of all the baseline methods in two settings, i.e., and the optimal . The later setting is achieved by finding an optimal from 1 to 600 so that the algorithm achieves their highest classification accuracy. We carried out the experiments on ExYaleB by selecting 20 samples from each subject as training data and using the rests for testing. Note that, we only tuned for the baseline algorithms and PCE automatically identifies this parameter. Table IV shows that PCE remarkably outperforms the investigated methods in two settings even though all parameters including are tuned for achieving the best performance of the baselines.
Methods  Fixed ()  The Tuned  

Accuracy  Para.  Accuracy  Para.  
PCE  93.51  263  94.63  95  162 
Eigenfaces  70.71  76.11    353  
LPP  77.76  3.00  76.73  3  312 
NPE  85.18  43.00  87.54  65  405 
L1graph  88.78  1e2,1e3  89.18  1e2,1e2  532 
NMF  62.21    70.15    214 
NeNMF  51.42    69.88    148 
RPCA+PCA  70.86  0.10  76.25  0.1  375 
ROLS+PCA  70.46  0.35  89.18  0.39  322 
CESR  88.71  1e3,1e3  88.85  1e3,1e3  336 
IvC Performance with Increasing Training Data and Feature Dimension
Classifiers  SRC  SVM  NN  

Algorithms  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para. 
PCE  99.270.32  51.960.29  75.00  96.561.23  14.380.52  85.00  97.720.55  13.210.59  40.00 
Eigenfaces  92.640.56  90.640.73    90.731.81  12.870.20    55.030.93  6.210.22   
LPP  81.840.94  30.582.60  10.00  70.160.07  7.380.54  55.00  71.312.39  4.850.39  4.00 
NPE  80.560.41  58.950.77  29.00  80.250.15  36.380.55  43.00  77.711.65  36.190.38  49.00 
L1graph  80.360.17  3856.69280.16  1e1,1e1  86.791.62  5726.08444.82  1e6,1e5  89.821.44  8185.55503.80  1e6,1e4 
NMF  65.180.87  520.946.27    66.421.66  121.890.58    41.781.18  11.030.00   
NeNMF  27.421.18  277.3326.87    17.571.44  36.791.83    14.661.07  10.240.04   
RPCA+PCA  92.680.57  1755.27490.99  0.10  90.511.26  1497.13329.00  0.30  54.951.38  1557.33358.93  0.10 
ROLS+PCA  92.390.91  658.2939.47  0.19  89.91.71  578.5338.61  0.35  54.821.22  593.0760.41  0.27 
Classifiers  SRC  SVM  NN  

Algorithms  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para. 
PCE  93.870.82  29.000.36  90.00  92.630.95  4.970.11  40.00  93.180.87  4.140.14  75.00 
Eigenfaces  64.362.42  81.7114.99    51.722.81  0.500.11    30.861.44  0.360.06   
LPP  59.622.33  36.697.64  2.00  34.282.53  2.730.60  2.00  62.642.20  2.730.84  3.00 
NPE  84.650.77  33.031.51  41.00  64.663.03  12.450.30  27.00  85.560.92  12.240.24  49.00 
L1graph  47.673.09  874.9153.69  1e3,1e3  65.411.69  657.6953.51  1e3,1e3  74.151.67  703.5437.97  1e2,1e3 
NMF  81.881.31  323.938.70    83.191.47  46.721.22    57.211.38  26.010.01   
NeNMF  33.381.46  128.338.43    19.651.08  26.791.39    11.940.55  14.580.06   
RPCA+PCA  91.181.11  401.627.46  0.20  91.181.11  401.627.46  0.20  67.801.93  366.508.78  0.10 
ROLS+PCA  91.070.91  875.7974.21  0.27  83.022.2  633.8542.38  0.43  43.70.98  761.9867.65  0.23 
Classifiers  SRC  SVM  NN  

Algorithms  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para. 
PCE  97.790.81  13.140.19  75.00  95.371.82  2.740.08  65.00  94.040.84  2.290.04  65.00 
Eigenfaces  88.040.70  29.510.36    80.992.28  2.010.05    37.961.18  0.940.05   
LPP  78.732.04  28.614.62  40.00  60.442.49  1.610.25  3.00  65.962.49  1.030.13  75.00 
NPE  77.833.14  25.791.02  46.00  72.290.99  7.560.07  7.00  79.182.38  7.060.09  48.00 
L1graph  70.400.22  1315.37192.65  1e1,1e5  79.282.54  1309.27193.38  1e3,1e3  89.402.80  1539.26226.57  1e3,1e3 
NMF  60.940.80  90.640.91    51.341.68  40.040.37    39.891.04  4.280.01   
NeNMF  39.901.19  61.243.21    26.661.79  20.300.41    11.930.95  3.110.01   
RPCA+PCA  88.492.17  630.0888.89  0.10  81.022.52  491.3626.75  0.30  37.850.83  481.8725.01  0.30 
ROLS+PCA  87.131.53  327.1430.97  0.15  78.293.09  291.571.64  0.47  37.231.24  297.7524.91  0.35 
Classifiers  SRC  SVM  NN  

Algorithms  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para. 
PCE  98.360.41  14.070.31  100.00  90.551.02  3.040.12  45.00  97.340.78  2.730.09  80.00 
Eigenfaces  92.051.37  32.430.32    82.183.88  2.340.05    43.741.17  1.120.05   
LPP  64.672.52  27.381.57  3.00  61.471.12  1.940.20  2.00  73.692.68  1.110.17  2.00 
NPE  84.741.50  30.451.28  46.00  63.801.56  9.870.49  49.00  87.301.10  8.540.36  45.00 
L1graph  70.450.31  1928.24212.21  1e3,1e3  84.672.46  1825.09197.62  1e3,1e3  93.561.13  1767.57156.61  1e3,1e3 
NMF  69.411.73  98.911.37    53.482.07  47.260.44    25.471.40  4.850.00   
NeNMF  40.611.21  58.453.76    23.781.80  20.871.17    14.830.61  3.360.02   
RPCA+PCA  93.161.17  682.2739.20  0.30  84.453.02  535.3119.08  0.10  43.660.63  514.5120.82  0.10 
ROLS+PCA  91.81.01  200.8325.47  0.07  82.612.12  265.637.03  0.23  43.011.58  264.216.46  0.27 
Classifiers  SRC  SVM  NN  

Algorithms  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para. 
PCE  59.601.94  12.250.32  15.00  53.001.22  1.360.01  45.00  57.401.83  1.150.03  5.00 
Eigenfaces  57.401.67  12.970.25    44.402.21  1.040.06    54.761.14  0.670.06   
LPP  45.861.51  13.220.54  60.00  30.203.08  0.800.11  2.00  41.102.15  0.630.02  90.00 
NPE  47.722.25  15.300.28  43.00  32.782.90  5.330.08  36.00  44.882.12  6.810.03  49.00 
L1graph  45.161.83  960.80123.43  1e2,1e4  39.422.81  801.73147.83  1e3,1e3  38.061.96  664.9293.75  1e1,1e5 
NMF  51.422.17  76.051.21    41.742.05  32.810.18    56.821.46  6.470.00   
NeNMF  57.482.13  39.213.18    35.963.73  25.640.52    59.021.55  10.990.01   
RPCA+PCA  58.040.90  244.9250.17  0.30  45.522.70  229.5451.06  0.20  56.481.32  227.2752.66  0.10 
ROLS+PCA  58.111.67  447.0527.01  0.03  44.741.69  747.6698.89  0.19  57.101.68  379.8118.42  0.03 
In this section, we examined the performance of PCE with increasing training samples and increasing feature dimension. In the first test, we randomly sampled clean AR images from each subject for training and used the rest for testing. Besides the result of RPCA+PCA, we also reported the performance of RPCA without dimension reduction.
In the second test, we randomly chose a half of images from ExYaleB for training and used the rest for testing. We reported the recognition rate of the NN classifier with the first features extracted by all the tested subspace learning methods, where increases from to with an interval of 10. From Fig. 3, we can conclude:

PCE performs well even though only a few of training samples are available. Its accuracy is about when , whereas the second best method achieves the same accuracy when .

RPCA and RPCA+PCA perform very close, however, RPCA+PCA is more efficient than RPCA.

Fig. 3 shows that PCE consistently outperforms the other methods. This benefits an advantage of PCE, i.e., PCE obtains a more compact representation which can use a few of variables to represent the entire data.
IvD Subspace Learning on Clean Images
In this section, we performed the experiments using MPIE and COIL100. For each data set, we split it into two parts with equal size. As did in the above experiments, we set for all the tested methods except PCE. Tables V–IX report the results, from which one can find that:

with three classifiers, PCE outperforms the other investigated approaches on these five data sets by a considerable performance margin. For example, the recognition rates of PCE with these three classifiers are 6.59%, 5.83%, and 7.90% at least higher than the rates of the second best subspace learning method on MPIES1.

PCE is more stable than other tested methods. Although SRC generally outperforms SVM and NN with the same feature, such superiority is not distinct for PCE. For example, SRC gives an accuracy improvement of 1.02% over NN to PCE on MPIES4. However, the corresponding improvement to RPCA+PCA is about 49.50%.

PCE achieves the best results in all the tests, while using the least time to perform dimension reduction and classification. PCE, Eigenfaces, LPP, NPE, and NeNMF are remarkably efficient than L1graph, NMF, RPCA+PCA, and ROSL+PCA.
IvE Subspace Learning on Corrupted Facial Images
In this section, we investigated the robustness of PCE against two corruptions using ExYaleB and the NN classifier. The corruptions include the white Gaussian noise (additive noise) and the random pixel corruption (nonadditive noise) [69].
In our experiments, we use a half of images (29 images per subject) to corrupt using these two noises. Specifically, we added white Gaussian noise into the sampled data via , where , is the corruption ratio, and
is the noise following the standard normal distribution. For random pixel corruption, we replaced the value of a percentage of pixels randomly selected from the image with the values following a uniform distribution over
, where is the largest pixel value of . After adding the noises into the images, we randomly divide the data into training and testing sets. In other words, both training data and testing data probably contains corruptions. Fig. 4 illustrates some results achieved by our method. We can see that PCE successfully identifies the noises from the corrupted samples and recovers the clean data. Table X reports the comparison from which we can see that:
PCE is more robust than the other tested approaches. When pixels are randomly corrupted, the accuracy of PCE is at least higher than that of the other methods.

with the increase of level of noise, the dominance of PCE is further strengthen. For example, the improvement in accuracy of PCE increases from to when increases to 30%.
Corruptions  Gaussian (10%)  Gaussian (30%)  RPC (10%)  RPC (30%)  

Algorithms  Accuracy  Para.  Accuracy  Para.  Accuracy  Para.  Accuracy  Para. 
PCE  95.050.63  10.00  93.180.87  5.00  90.120.98  5.00  83.481.04  10.00 
Eigenfaces  41.692.01    30.861.44    30.352.05    25.371.56   
LPP  76.940.75  2.00  62.642.20  3.00  55.861.27  2.00  42.761.53  2.00 
NPE  91.540.76  49.00  85.560.92  49.00  80.660.86  49.00  60.251.64  43.00 
L1graph  87.360.81  1e3,1e4  74.151.67  1e2,1e3  71.630.90  1e3,1e4  55.022.07  1e4,1e4 
NMF  67.421.41    57.211.38    60.571.88    46.131.41   
NeNMF  44.751.28    42.480.68    43.271.01    25.441.21   
RPCA+PCA  76.261.12  0.20  67.801.93  0.10  64.560.67  0.10  52.121.34  0.10 
ROSL+PCA  76.520.83  0.27  66.501.49  0.27  75.941.44  0.07  65.760.98  0.07 
IvF Subspace Learning on Disguised Facial Images
Besides the above tests on the robustness to corruptions, we also investigated the robustness to real disguises. Tables XI–XII reports results on two subsets of AR database. The first subset contains 600 clean images and 600 images disguised with sunglasses (occlusion rate is about ), and the second one includes 600 clean images and 600 images disguised by scarves (occlusion rate is about ). Like the above experiment, both training data and testing data will contains the disguised images. From the results, one can conclude that:

PCE significantly outperforms the other tested methods. When the images are disguised by sunglasses, the recognition rates of PCE with SRC, SVM, and NN are , , and higher than the best baseline method. With respect to the images with scarves, the corresponding improvements are , , and .

PCE is one of the most computationally efficient methods. When SRC is used, PCE is 2.27 times faster than NPE and 497.16 times faster than L1graph on the faces with sunglasses. When the faces are disguised by scarves, the corresponding speedup are 2.17 and 484.94 times, respectively.
Classifiers  SRC  SVM  NN  

Algorithms  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para. 
PCE  83.881.38  8.730.90  65.00  87.801.57  0.900.10  65.00  68.581.96  0.710.11  55.00 
Eigenfaces  72.871.99  45.485.18    64.772.96  1.620.40    36.421.69  0.780.19   
LPP  51.732.77  44.208.25  95.00  44.881.93  1.600.57  2.00  37.372.19  1.120.30  85.00 
NPE  78.002.27  19.840.51  47.00  49.173.33  4.300.04  47.00  56.831.83  4.160.04  49.00 
L1graph  52.001.42  4340.22573.64  1e4,1e4  48.532.06  3899.81487.89  1e4,1e4  49.282.68  4189.73431.98  1e4,1e4 
NMF  47.872.64  108.462.98    43.052.39  24.340.81    31.352.04  8.010.01   
NeNMF  37.572.41  55.364.31    26.172.30  15.140.34    26.521.43  7.620.01   
RPCA+PCA  72.072.30  1227.08519.27  0.10  63.703.74  1044.46462.33  0.20  36.930.90  965.76385.19  0.10 
ROLS+PCA  71.421.46  1313.65501.05  0.31  62.673.27  1336.00549.66  0.43  35.582.14  1245.63406.74  0.19 
Classifiers  SRC  SVM  NN  

Algorithms  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para.  Accuracy  Time (s)  Para. 
PCE  83.571.16  8.950.96  50.00  87.701.62  0.900.11  95.00  66.921.75  0.690.10  50.00 
Eigenfaces  69.322.58  36.183.97    63.932.84  1.420.34    30.581.27  0.710.19   
LPP  49.481.79  30.806.29  2.00  43.071.80  1.340.39  2.00  33.701.70  0.850.21  90.00 
NPE  62.752.16  19.440.62  47.00  58.232.75  4.400.03  49.00  54.332.37  4.090.03  49.00 
L1graph  49.651.42  4340.22453.64  1e3,1e4  48.532.06  5381.96467.89  1e4,1e4  49.282.68  5189.73411.98  1e4,1e4 
NMF  47.172.18  109.332.22    40.552.20  23.670.92    24.581.88  5.820.01   
NeNMF  32.581.76  52.201.81    21.982.10  14.710.08    18.521.46  3.490.00   
RPCA+PCA  71.402.67  241.454.39  0.10  66.402.62  193.677.76  0.10  32.271.46  195.218.01  0.20 
ROLS+PCA  68.571.37  557.5492.71  0.15  60.382.53  441.7995.61  0.23  31.031.05  432.0295.28  0.07 
IvG Comparisons with Some Dimension Estimation Techniques
Methods  Accuracy  Time Cost  Para.  

PCE  90.31  0.57  109  30 
MLE+PCA  68.29  3.34  11.6  10 
MiNDML+PCA  67.14  3.95  11  10 
MiNDKL+PCA  72.71  2791.19  16  22 
DANCo+PCA  71.71  28804.01  15  22 
In this section, we compare PCE with three dimension estimators, i.e., maximum likelihood estimation (MLE) [76], minimum neighbor distance Estimators (MiND) [77], and DANCo [78]. MiND has two variants which are denoted as MiNDML and MiNDKL. All these estimators need specifying the size of neighborhood of which the optimal value is found from the range of [10 30] with an interval of 2. Since these estimators cannot be used for dimension reduction, we report the performance of these estimators with PCA, i.e., we first estimate the feature dimension with an estimator and then extract features using PCA with the estimated dimension. We carry out experiments with the NN classifier on a subset of the AR data set of which both the training and testing set include 700 nondisguised facial images. Table XIII shows that our approach outperforms the baseline estimators by a considerable performance margin in terms of classification accuracy and time cost.
IvH Scalability Evaluation
In this section, we investigate the scalability performance of PCE by using the whole USPS data set, where of PCE is fixed as . In the experiments, we randomly split the whole data set into two partitions for training and testing, where the number of training samples increases from 500 to 9,500 with an interval of 500 and thus 19 partitions are obtained. Fig. 5 reports the classification accuracy and the time cost taken by PCE. From the results, we could see that the recognition rate of PCE almost remains unchanged when 1,500 samples are available for training. Considering different classifiers, SRC slightly performs better than NN, and both of them remarkably outperform SVM. PCE is computational efficient, it only take about seven seconds to handle 9,500 samples. Moreover, PCE could be further speeded up by adopting large scale SVD methods. However, this has been out of scope for this paper.
V Conclusion
In this paper, we have proposed a novel unsupervised subspace learning method, called principal coefficients embedding (PCE). Unlike existing subspace learning methods, PCE can automatically determine the optimal dimension of feature space and obtain the lowdimensional representation of a given data set. Experimental results on several popular image databases have shown that our PCE achieves a good performance with respect to additive noise, nonadditive noise, and partial disguised images.
The work would be further extended or improved from the following aspects. First, the paper currently only considers one category of image recognition, i.e., image identification. In the future, PCE can be extended to handle the other category of image recognition, i.e., face verification which aims to determine whether a given pair of facial images is from the same subject or not. Second, PCE is a unsupervised method which does not adopt the label information. If such information is available, one can develop the supervised or semisupervised version of PCE under the framework of graph embedding. Third, PCE can be extended to handle outliers by enforcing norm or Laplacian noises by enforcing norm over the errors term in our objective function.
Acknowledgment
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions that significantly improve the quality of this paper.
References
 [1] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
 [2] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, “Neighbourhood components analysis,” in Proc. of 18th Adv. in Neural Inf. Process. Syst., Vancouver, Canada, Dec. 2004, pp. 513–520.

[3]
H.T. Chen, H.W. Chang, and T.L. Liu, “Local discriminant embedding and its
variants,” in
Proc. of 18th IEEE Conf. Comput. Vis. and Pattern Recognit.
, vol. 2, San Diego, CA, Jun. 2005, pp. 846–853.  [4] J. Lu, X. Zhou, Y.P. Tan, Y. Shang, and J. Zhou, “Neighborhood repulsed metric learning for kinship verification,” IEEE Trans. Pattern. Anal. Mach. Intell., vol. 36, no. 2, pp. 331–345, 2014.
 [5] H. Wang, X. Lu, Z. Hu, and W. Zheng, “Fisher discriminant analysis with l1norm,” IEEE Trans. Cybern., vol. 44, no. 6, pp. 828–842, Jun. 2014.

[6]
Y. Yuan, J. Wan, and Q. Wang, “Congested scene classification via efficient unsupervised feature learning and density estimation,”
Pattern Recognit., vol. 56, pp. 159 – 169, 2016.  [7] C. Xu, D. Tao, and C. Xu, “Largemargin multiviewinformation bottleneck,” IEEE Trans Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1559–1572, Aug. 2014.
 [8] D. Cai, X. He, and J. Han, “Semisupervised discriminant analysis,” in Proc. of 11th Int. Conf. Comput. Vis., Rio de Janeiro, Brazil, Oct. 2007, pp. 1–7.

[9]
S. Yan and H. Wang, “Semisupervised learning by sparse representation,” in
Proc. of 9th SIAM Int. Conf. Data Min., Sparks, Nevada, Apr. 2009, pp. 792–801.  [10] C. Xu, D. Tao, C. Xu, and Y. Rui, “Largemargin weakly supervised dimensionality reduction,” in Proc. of 31th Int. Conf. Machine Learn., Beijing, Jun. 2014, pp. 865–873.
 [11] M. Turk and A. Pentland, “Eigenfaces for recognition,” J. Cogn. Neurosci., vol. 3, no. 1, pp. 71–86, 1991.
 [12] X. He, D. Cai, S. Yan, and H.J. Zhang, “Neighborhood preserving embedding,” in Proc. of 10th IEEE Conf. Comput. Vis., Beijing, China, Oct. 2005, pp. 1208–1213.
 [13] X. He, S. Yan, Y. Hu, P. Niyogi, and H.J. Zhang, “Face recognition using laplacianfaces,” IEEE Trans. Pattern. Anal. Mach. Intell., vol. 27, no. 3, pp. 328–340, 2005.
 [14] L. S. Qiao, S. C. Chen, and X. Y. Tan, “Sparsity preserving projections with applications to face recognition,” Pattern Recognit., vol. 43, no. 1, pp. 331–341, 2010.
 [15] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. Huang, “Learning with L1graph for image analysis,” IEEE Trans. Image Process., vol. 19, no. 4, pp. 858–866, 2010.
 [16] C. Xu, D. Tao, and C. Xu, “Multiview intact space learning,” IEEE Trans Pattern Anal. Mach. Intell., vol. 37, no. 12, pp. 2531–2544, Dec. 2015.
 [17] S. C. Yan, D. Xu, B. Y. Zhang, H. J. Zhang, Q. Yang, and S. Lin, “Graph embedding and extensions: A general framework for dimensionality reduction,” IEEE Trans. Pattern. Anal. Mach. Intell., vol. 29, no. 1, pp. 40–51, 2007.
 [18] M. Polito and P. Perona, “Grouping and dimensionality reduction by locally linear embedding,” in Proc. of 14th Adv. in Neural Inf. Process. Syst., Vancouver, British Columbia, Canada, Dec. 2001.
 [19] Y. Shuicheng, J. Liu, X. Tang, and T. Huang, “A parameterfree framework for general supervised subspace learning,” IEEE Trans. Inf. Forensics Security, vol. 2, no. 1, pp. 69–76, Mar. 2007.
 [20] B. Kégl, “Intrinsic dimension estimation using packing numbers,” in Proc. of 15th Adv. in Neural Inf. Process. Syst., Vancouver, British Columbia, Canada 2002, pp. 681–688.
 [21] F. Camastra and A. Vinciarelli, “Estimating the intrinsic dimension of data with a fractalbased method,” IEEE Trans. Pattern. Anal. Mach. Intell., vol. 24, no. 10, pp. 1404–1407, Oct 2002.
 [22] D. Mo and S. Huang, “Fractalbased intrinsic dimension estimation and its application in dimensionality reduction,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 1, pp. 59–71, Jan 2012.
 [23] P. Mordohai and G. Medioni, “Dimensionality estimation, manifold learning and function approximation using tensor voting,” J. Mach. Learn. Res., vol. 11, pp. 411–450, 2010.
 [24] K. M. Carter, R. Raich, and A. Hero, “On local intrinsic dimension estimation and its applications,” IEEE Trans. Signal Process., vol. 58, no. 2, pp. 650–663, Feb 2010.

[25]
L. K. Saul and S. T. Roweis, “Think globally, fit locally: Unsupervised learning of low dimensional manifolds,”
J. Mach. Learn. Res., vol. 4, pp. 119–155, Dec. 2003.  [26] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM, vol. 58, no. 3, p. 11, 2011.
 [27] R. He, B.G. Hu, W.S. Zheng, and X.W. Kong, “Robust principal component analysis based on maximum correntropy criterion,” IEEE Trans. Image Process., vol. 20, no. 6, pp. 1485–1494, 2011.
 [28] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma, “RASL: Robust alignment by sparse and lowrank decomposition for linearly correlated images,” IEEE Trans. Pattern. Anal. Mach. Intell., vol. 34, no. 11, pp. 2233–2246, 2012.
 [29] Q. Zhao, D. Meng, Z. Xu, W. Zuo, and L. Zhang, “Robust principal component analysis with complex noise,” in Proc. of 31th Int. Conf. Mach. Learn., Beijing, China, Jun. 2014, pp. 55–63.
 [30] F. Nie, J. Yuan, and H. Huang, “Optimal mean robust principal component analysis,” in Proc. of 31th Int. Conf. Mach. Learn., Beijing, China, Jun. 2014, pp. 1062–1070.
 [31] X. Liu, Y. Mu, D. Zhang, B. Lang, and X. Li, “Largescale unsupervised hashing with shared structure learning,” IEEE Trans. Cybern., vol. 45, no. 9, pp. 1811–1822, Sep. 2015.
 [32] D. Hsu, S. M. Kakade, and T. Zhang, “Robust matrix decomposition with sparse corruptions,” IEEE Trans. Inf. Theory, vol. 57, no. 11, pp. 7221–7234, Nov. 2011.
 [33] B.K. Bao, G. Liu, R. Hong, S. Yan, and C. Xu, “General subspace learning with corrupted training data via graph embedding,” IEEE Trans. Image Process., vol. 22, no. 11, pp. 4380–4393, Nov. 2013.
 [34] G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “Subspace learning from image gradient orientations,” IEEE Trans. Pa