Subspace and manifold learning, also referred to as representation learning [bengio2013representation, zhong2016overview]ghojogh2019feature]. The subspace of data is a lower-dimensional space than the input space of data which can appropriately represent the data with the smallest possible representation error. In other words, the data usually exist on a submanifold or subspace with a lower intrinsic dimensionality [wang2012geometric, warner2013foundations].
The submanifold of data can be either linear or nonlinear. Different linear and nonlinear methods have been proposed in the literature for subspace and manifold learning. Principal Component Analysis (PCA) [jolliffe2011principal], first proposed in [pearson1901liii], was one of the first methods in linear subspace learning. The Fisher Discriminant Analysis (FDA) [friedman2001elements], first proposed in [fisher1936use], was one of the first linear supervised subspace learning methods. Both PCA and FDA were based on the scatter of data. Metric Multi-Dimensional Scaling (MDS) [cox2000multidimensional] was a linear method which tried to preserve the similarity of the data points. In later approaches after MDS, the cost function in MDS was changed to preserving the distances of points [lee2007nonlinear], yielding to Sammon mapping [sammon1969nonlinear]
. Sammon mapping can probably be considered as the first nonlinear subspace learning method. Another approach to handle nonlinearity of data is to modify the data instead of changing the linear algorithm. That was the perspective of kernel PCA[scholkopf1997kernel, scholkopf1998nonlinear] which used the dual of PCA and the kernel trick [hofmann2008kernel] to pull the data to the feature space hoping that it becomes roughly linear in that space. Kernel FDA [mika1999fisher, mika2000invariant] was also proposed for handling the nonlinear data in a supervised manner. However, it did not use the kernel trick (which we will explain why in this paper) but used representation theory [alperin1993local] instead. Recently, deep FDA [diaz2017deep, diaz2019deep] was proposed which uses a least squares approach [ye2007least, zhang2010regularized].
After Sammon mapping, there was not any important method, which is actually nonlinear without changing the data like kernel PCA/FDA, until Isomap [tenenbaum2000global] and Locally Linear Embedding (LLE) [roweis2000nonlinear]. The former used geodesic distance rather than Euclidean distance in the kernel of MDS [ham2004kernel, strange2014open] and the latter reconstructed every point by its -Nearest Neighbors (-NN) to locally fit the data [saul2003think]. Stochastic Neighbor Embedding (SNE) [hinton2003stochastic]
was a probabilistic approach to manifold learning where the probability of a point being neighbor of others was tried to be preserved. While the Gaussian distribution was used in SNE, the Student-distribution was used in the embedded space in -SNE method in order to tackle the crowding problem [maaten2008visualizing]hinton2006reducing]
was proposed in order to learn the latent subspace of data in an undercomplete autoencoder. One of the most recent important subspace learning methods was Supervised PCA (SPCA)[barshan2011supervised]
. The SPCA made use of the empirical estimation of the Hilbert-Schmidt Independence Criterion (HSIC)[gretton2005measuring]
. The HSIC computes the dependence of two random variables by calculating their correlation in the feature space. The SPCA tried to maximize the dependence of projected data and the labels in order to use the information of labels for better embedding. Recently, supervised random projection[karimi2018srp] used low-rank kernel approximation in the formulation of SPCA for better efficiency.
In this paper, we propose a generalized subspace learning method, named Roweis Discriminant Analysis (RDA), named after Sam Roweis (1972–2010) who contributed significantly to subspace and manifold learning. He proposed many important methods in subspace/manifold learning including LLE [roweis2000nonlinear, saul2003think], SNE [hinton2003stochastic], and metric learning by class collapsing [globerson2006metric]. The proposed RDA is a family of infinite number of subspace learning methods including PCA, FDA, and SPCA which have been proposed in the literature. It is a generalized method and contains many methods which use linear projection into a lower dimensional subspace. The main contributions of our paper can be listed as the following:
proposing RDA as a generalized subspace learning method based on eigenvalue and generalized eigenvalue problems,
proposing Double Supervised Discriminant Analysis (DSDA), which is not yet proposed in the literature to the best of our knowledge, as one of the extreme cases in RDA,
proposing RDA in the feature space to have kernel RDA which generalizes kernel PCA, kernel SPCA, and kernel FDA,
explaining the reasons behind some of the characteristics of PCA, SPCA, and FDA such as (I) why SPCA can be used for both classification and regression but FDA is only for classification, (II) why PCA and SPCA have their dual methods but FDA does not have a dual, (III) why kernel PCA and kernel SPCA can use kernel trick but kernel FDA uses representation theory rather than kernel trick, (IV) why PCA is the best linear method for reconstruction,
proposing Roweisfaces and kernel Roweisfaces for demonstrating the generalizability of the RDA approach for the eigenfaces [turk1991eigenfaces, turk1991face], kernel eigenfaces [yang2000face], Fisherfaces [belhumeur1997eigenfaces], kernel Fisherfaces [yang2002kernel] and supervised eigenfaces [barshan2011supervised, ghojogh2019unsupervised].
The remainder of the paper is organized as follows. Section II introduces projection into a subspace and the reconstruction of data after projection. Two general forms of optimization for subspace learning are also introduced. In Section III, the theory of PCA, FDA, and SPCA is briefly reviewed. The proposed RDA is explained in Section IV. A dual is also proposed for the RDA in Section V but for some special cases. Section VI explains two types of kernel RDA, one based on kernel trick and another using representation theory. The experimental results are reported in Section VII where the Roweisfaces are also proposed. Section VIII summarizes and concludes the article.
Ii Subspace and Projection
Let , , , , , , , and denote the training sample size, test (out-of-sample) sample size, dimensionality of data, number of classes, sample size of the -th class, the -th training instance, the -th training instance in the -th class, and the -th test instance, respectively. We stack the training data points column-wise in a matrix and similarly for the out-of-sample data points as .
Let be the projection matrix whose columns are the projection directions spanning the desired subspace so the subspace is the column-space of . If we truncate the projection matrix to have , the subspace is spanned by projection directions and it will be a dimensional subspace where . We want the projection directions to be orthonormal to capture different information; therefore:
is the identity matrix.
The projection of training data into the subspace and its reconstruction after the projection are [wang2012geometric]:
respectively, where and . The projection and reconstruction of the out-of-sample data are:
respectively, where and . Note that in subspace learning, it is recommended to center the data, either training or out-of-sample, using the training mean. If we do that, the training mean should be added back to the reconstructed data. If the mean of training data is:
the centered training data are:
is the centering matrix. The covariance matrix, or the total scatter, is defined as:
where it is noticed that the centering matrix is symmetric and idempotent.
If we project the centered training data, we have:
where and denote the Frobenius norm and trace of matrix, respectively. Hence,
is the squared length of the reconstruction. This term can also be interpreted as the variance of the projected data according to quadratic characteristic of variance. We want the variance of projection to be maximized where the projection matrix is orthogonal; otherwise, the optimization problem is ill-defined. If we consider a general scatter of data,, the optimization is:
where the constraint makes the problem well-defined. The Lagrangian [boyd2004convex] of the problem is:
where is a diagonal matrix including the Lagrange multipliers. Setting the derivative of Lagrangian to zero gives:
which is the eigenvalue problem for where the columns of and the diagonal of
are the eigenvectors and eigenvalues of, respectively [ghojogh2019eigenvalue]. The eigenvectors and eigenvalues are sorted from the leading (largest eigenvalue) to the trailing (smallest eigenvalue) because we are maximizing in the optimization problem (the reason lies in the second order condition).
We can also have two types of scatters, e.g., and . As the scatter matrix is symmetric and positive semi-definite, we can decompose it as:
is because of Singular Value Decomposition (SVD) andis for . The can be interpreted as manipulated (or rotated) projection directions. We want the manipulated projection matrix to be orthogonal; thus:
In this case, the optimization is expressed as:
and the constraint makes the problem well-defined. The Lagrangian [boyd2004convex] of the problem is:
where is a diagonal matrix which includes the Lagrange multipliers. Setting the derivative of Lagrangian to zero gives:
which is the generalized eigenvalue problem where the columns of and the diagonal of are the eigenvectors and eigenvalues, respectively [ghojogh2019eigenvalue]. The eigenvectors and eigenvalues are again sorted from the leading to the trailing because of maximization.
Iii PCA, FDA, and SPCA
The optimization problem in PCA [pearson1901liii, jolliffe2011principal, ghojogh2019unsupervised] is expressed as:
The FDA [fisher1936use, friedman2001elements] maximizes the Fisher criterion [xu2006analysis, fukunaga2013introduction]:
The Fisher criterion is a generalized Rayleigh-Ritz Quotient [parlett1998symmetric]. Hence, the optimization in Eq. (18) is equivalent to [ghojogh2019fisher]:
where the and are the between and within scatters, respectively, defined as:
where the mean of the -th class is:
The total scatter can be considered as the summation of the between and within scatters [ye2007least, welling2005fisher]:
Therefore, the Fisher criterion can be written as [welling2005fisher]:
The is a constant and can be dropped in the optimization problem because the variable and not the objective is the goal; therefore, the optimization in FDA can now be expressed as:
Hence, the FDA directions can be obtained by the generalized eigenvalue problem [welling2005fisher]. Note that some articles, such as [ye2007least, zhang2010regularized, diaz2017deep], solve the generalized eigenvalue problem by considering another version of the Fisher criterion which is . This criterion is obtained if we consider minimization of the inverse of Eq. (18), use Eq. (23) for , drop the constant , and convert minimization to maximization by inverting the criterion again.
Comparing the Eqs. (17) and (25) shows that PCA captures the orthonormal directions with the maximum variance of data; however, FDA has the same goal but also it requires the manipulated directions to be orthonormal. This manipulation is done by the within scatter which makes sense because the within scatters make use of the class labels. This comparison gives a hint for the connection between PCA and FDA.
The SPCA [barshan2011supervised] makes use of the empirical estimation of the Hilbert-Schmidt Independence Criterion (HSIC) [gretton2005measuring]:
where and are the kernels over the first and second random variable. The idea of HSIC is to measure the dependence of two random variables by calculating the correlation of their pulled values to the Hilbert space. The SPCA uses HSIC for the projected data and the labels and maximizes the dependence of them in order to make use of the labels. It uses the linear kernel for the projected data and an arbitrary valid kernel for the labels . Therefore, the scaled Eq. (26) in SPCA is:
where is because of the cyclic property of the trace. The optimization problem in SPCA is expressed as:
where is the kernel matrix over the labels of data, either for classification or regression. The solution to Eq. (28) is the eigenvalue problem for according to Eq. (12). Hence, the SPCA directions are the eigenvectors of . Note that this term, restricted to a linear kernel for , is used as the between scatter in [zhang2010regularized, diaz2017deep, diaz2019deep] hinting for a connection between SPCA and FDA if we compare the objectives in Eqs. (19) and (28).
PCA, FDA, and SPCA, which were explained, are several fundamental subspace learning methods which are different in terms of whether/how to use the labels. Comparing Eqs. (17), (25), and (28) and considering the general forms in Eqs. (11) and (15) show that these methods belong to a family of methods based on eigenvalue and generalized eigenvalue problems. This gave us a motivation to propose a generalized subspace learning method, named RDA, as a family of methods including PCA, FDA, and SPCA.
Iv Roweis Discriminant Analysis (RDA)
RDA aims at maximizing the interpreted as the squared length of the reconstruction or the scatter of projection (see Eq. (10)), while requiring the manipulated projection directions to be orthonormal (see Eq. (14)). Therefore, the optimization of RDA is formalized as:
where and are the first and second Roweis matrices defined as:
The and are the first and second Roweis factors. Note that in Eq. (30) is double-centering the matrix .
The solution to Eq. (29) is the generalized eigenvalue problem according to Eq. (16). Therefore, the RDA directions are the eigenvectors of this generalized eigenvalue problem. The RDA directions are sorted from the leading to trailing eigenvalues because of the maximization in Eq. (29).
The optimization of the RDA can be interpreted in another way using what we define as the Roweis criterion. We want to maximize this criterion:
As in FDA, the Roweis criterion is a generalized Rayleigh-Ritz Quotient [parlett1998symmetric]; thus, the optimization in Eq. (33) is equivalent to Eq. (29). It is noteworthy that if we consider only a one-dimensional RDA subspace, the Roweis criterion is:
where is the only RDA projection direction. In this case, the Eq. (29) becomes:
Iv-B The Special Cases of RDA & the Roweis Map
In fact, RDA is a family of infinite number of algorithms for subspace learning. By choosing any number for the and in the range , RDA gives us a new algorithm for learning the subspace of data. We define a map, named Roweis map, which includes the infinite number of special cases of RDA where three of its corners are PCA, FDA, and SPCA. The rows and columns of the Roweis map are the values of and , respectively. Figure 1-a shows this map.
An interesting thing about RDA is that one of the four extreme values for and is not yet proposed in the literature to the best of our knowledge. If and , the Eq. (29) becomes:
whose solution is the generalized eigenvalue problem according to Eq. (16). As can be seen, this optimization uses the labels twice, once in the kernel over the labels and once in the within scatter. Therefore, we name this special subspace learning method the Double Supervised Discriminant Analysis (DSDA).
Iv-C Properties of the Roweis Matrices and Factors
Iv-C1 Properties of the Roweis Factors
In Eq. (32), if , we are fully using the kernel over labels and if , it reduces to the identity matrix. So, we have:
On the other hand, in Eq. (31), if , we are fully using the within scatter using the labels and if , it reduces to the identity matrix. Hence:
Therefore, we can conclude that if the Roweis factor is one, we fully use the labels as a supervised method and if the Roweis factor is zero, we are not using the labels at all. Therefore, the Roweis factor is a measure of using labels or being supervised. As we have two Roweis factors, we define:
as the supervision level which is a planar function depicted in Fig. 2. Note that the extremes and refer to the unsupervised and fully (double) supervised subspace learning, respectively. Recall that the Roweis map includes the different subspace learning methods from unsupervised PCA (top-left corner of map) to the DSDA (bottom-right corner of map). The value can be interpreted as using the labels once (as in both FDA and SPCA).
It is noteworthy that if , we have according to Eq. (41). Therefore, according to Eqs. (29) and (40), the optimization of RDA merely includes and not regarding the labels. On the other hand, the is a soft measure of similarity between the labels while the uses the labels strictly for knowing which data instance belongs to which class. It shows that if we use (), the labels must be for classes in classification; however, if using solely (), the labels can be either for classification or regression. This sheds light to why SPCA (with ) can be used for both classification and regression tasks [barshan2011supervised] but FDA (with ) is only for classification.
Iv-C2 Properties of the Roweis Matrices
The Roweis matrices, and , are positive semi-definite. The reason is that the within scatter, the kernel matrix, and the identity matrix are positive semi-definite and , so:
The Roweis matrices are also symmetric because the within scatter, the kernel matrix, and the identity matrix are symmetric; hence:
It is also worth mentioning that according to the definitions of , , and , the and the second Roweis matrices have the essence of kernel and scatter matrices, respectively, while the first Roweis matrix has the mixture essence of scatter and kernel matrices. According to Eq. (27), the first Roweis matrix is the term used in the HSIC over the projected data and the labels.
Now, we analyze the rank of the Roweis matrices. We have where is the pulled to the feature space [hofmann2008kernel]. We usually have . Hence, . According to Eq. (32) and the subadditivity property of the rank:
Note that in the extreme cases and , the rank of is and , respectively. According to Eq. (30):
where the is because of subtracting the mean in centering the matrix (e.g., consider the case with only one data instance). In general, the rank of a covariance (scatter) matrix over the -dimensional data with sample size is at most . The is because the covariance matrix is a matrix, the is because we iterate over data instances for calculating the covariance matrix, and the is again because of subtracting the mean. Hence, according to Eq. (21), we have . According to Eq. (31) and the subadditivity property of the rank:
In the extreme cases and , the rank of is and , respectively.
Iv-D Dimensionality of the RDA Subspace
There exist a rigorous solution for the generalized eigenvalue problem [ghojogh2019eigenvalue]. However, we can solve this problem as:
where stacks the eigenvectors column-wise. If is singular, we strengthen its diagonal:
where is a very small positive number, large enough to make full rank. In the literature, this approach is known as regularized discriminant analysis [friedman1989regularized].
Therefore, the , which is the dimensionality of the RDA subspace, is at most . The leading eigenvectors are considered as the RDA directions and the rest of eigenvectors are invalid and ignored for having zero or very small eigenvalues.
Iv-E Robust RDA
According to Eq. (48), the rank of is at most . In the extreme case , its rank is at most . Thus, for the case is very close to one, i.e., , if , the is singular. This happens when the dimensionality of data is huge but the sample size is small. In this case, we face a problem if we use the Eq. (49), for solving the generalized eigenvalue problem . Two possible approaches to tackle this problem are the pseudo-inverse of [webb2003statistical] and strengthening the diagonal of [friedman1989regularized, diaz2017deep]. We can also propose the Robust RDA (RRDA) inspired by robust FDA [deng2007robust, guo2015feature] to be robust against singularity. We emphasis that RRDA is useful if , , where we want to use Eq. (49) for solving the generalized eigenvalue problem.
In RRDA, the is decomposed using eigenvalue decomposition: where and include the eigenvectors and eigenvalues of , respectively. The eigenvalues are sorted as and the eigenvectors are sorted accordingly. If is close to singularity, the first eigenvalues are valid and the rest eigenvalues are either very small or zero. The appropriate is obtained as:
where is a selected number close to . The invalid eigenvalues are replaced with :
Hence, the is replaced with :
and the robust RDA directions are the eigenvectors of the generalized eigenvalue problem .
V Dual RDA for
The dual RDA exists for for the reason explained later in this section. This statement clarifies why, in the literature, PCA and SPCA (with ) have their dual methods but FDA (with ) does not have a dual.
As the matrix is symmetric and positive semi-definite, we can decompose it as:
where is because of SVD and is for:
Thus, the can be decomposed as:
and is because the and are both symmetric and in the the same form so their summation is also symmetric (recall Eq. (45)) and in the same form. We can obtain Eq. (58) in another way, too: as is symmetric and positive semi-definite (see Eq. (43)), we can decompose it:
where is because of SVD and is for:
For the reason that will be explained at the end of this section, we use the incomplete SVD in for where , , [golub1971singular]. Therefore, we have . If we had used the complete SVD, we would have .
. According to the properties of SVD, the matrix of left singular vectors ofis equivalent to the matrix of eigenvectors of . Therefore, we can use incomplete SVD for :
where the columns of and are the left and right singular vectors (i.e., the eigenvectors of and ), respectively. The diagonal entries of are the singular value of , i.e., the square root of eigenvalues of or .
The dual RDA exists only for because in Eq. (62), the
is an orthogonal matrix so. This implies that the constraint in Eq. (29) should be which means . Recall that this is the reason for FDA (with ) not having a dual.
We obtain from Eq. (62):
where the orthogonality of is noticed. The projection of data in dual RDA is:
Note that is symmetric. Similarly, out-of-sample projection in dual RDA is:
The reconstruction of training data in dual RDA is:
To summarize, in RDA, the projection of training and out-of-sample data is Eq. (2) and (4), respectively. In RDA, the reconstruction of training and out-of-sample data is Eq. (3) and (5), respectively. However, in dual RDA, the projection of training and out-of-sample data is Eq. (64) and (65), respectively. In dual RDA, the reconstruction of training and out-of-sample data is Eq. (66) and (67), respectively.
Note that in RDA and dual RDA, we can truncate , , and , respectively, to , , and in order to have a -dimensional subspace (). For determining the appropriate in RDA, the scree plot [cattell1966scree] or the ratio [abdi2010principal] can be used, where is the -th largest eigenvalue.
The dual RDA is very useful especially if the dimensionality of data is much greater than the sample size of data, i.e., . In this case . According to Eq. (62), and are the eigenvectors of and , respectively. As in this case, the computation of eigenvectors of is much faster and needs less storage than . Therefore, in this case, we use the eigenvalue decomposition:
rather than using Eq. (62). Note that in dual RDA, we do not require but only and . Hence, if , it is recommended to use dual RDA which is more efficient than RDA in terms of speed of calculation and storage.
Vi Kernel RDA
Let be the pulling function mapping the data to the feature space . In other words, . Let denote the dimensionality of the feature space, i.e., while . Note that we usually have . The kernel over two vectors and is the inner product of their pulled data [hofmann2008kernel]:
We can have two types of kernel RDA, one using the dual RDA and kernel trick and the other one using the representation theory [alperin1993local]. The former, which makes use of the inner product of data instances, holds for only two special cases but the latter works for the entire range of the Roweis map.
Vi-a Kernel RDA Using Dual RDA for Special Cases
If , we have the dual RDA and can use it for working out the kernel RDA using the kernel trick. However, the dual RDA is useful for the kernel trick in the two cases of and . The reason is that in these cases, the inner product of data points appear enabling us to use the kernel trick. Therefore, we can use the dual RDA for kernel trick in (i.e., PCA) and (i.e., SPCA). This explains why, in the literature, PCA and SPCA (with ) have their kernel methods using kernel trick (or their dual) but kernel FDA (with ) uses representation theory and not the kernel trick.
Vi-A1 Special Case of Kernel RDA: Kernel PCA
If , we have according to Eq. (58). In this case, . So, we decompose in Eq. (62). In this case, RDA is reduced to PCA where the data should be centered. Hence, in reconstruction, we add the training mean back. Therefore, the Eqs. (64), (65), (66), and (67) are modified to:
respectively, where is because and are orthogonal matrices and is the centered out-of-sample data using the training mean:
In the kernel RDA for (i.e., kernel PCA [scholkopf1997kernel, scholkopf1998nonlinear]), the incomplete SVD is applied on the which is the centered data in the feature space. We use the Eqs. (70), (71), (72), and (73) and replace the inner products by the kernels:
where is the centered out-of-sample data in the feature space using the mean of the pulled training data and is because the double-centered kernel over the training and out-of-sample data is calculated as: