 # Roweis Discriminant Analysis: A Generalized Subspace Learning Method

We present a new method which generalizes subspace learning based on eigenvalue and generalized eigenvalue problems. This method, Roweis Discriminant Analysis (RDA), is named after Sam Roweis to whom the field of subspace learning owes significantly. RDA is a family of infinite number of algorithms where Principal Component Analysis (PCA), Supervised PCA (SPCA), and Fisher Discriminant Analysis (FDA) are special cases. One of the extreme special cases, which we name Double Supervised Discriminant Analysis (DSDA), uses the labels twice; it is novel and has not appeared elsewhere. We propose a dual for RDA for some special cases. We also propose kernel RDA, generalizing kernel PCA, kernel SPCA, and kernel FDA, using both dual RDA and representation theory. Our theoretical analysis explains previously known facts such as why SPCA can use regression but FDA cannot, why PCA and SPCA have duals but FDA does not, why kernel PCA and kernel SPCA use kernel trick but kernel FDA does not, and why PCA is the best linear method for reconstruction. Roweisfaces and kernel Roweisfaces are also proposed generalizing eigenfaces, Fisherfaces, supervised eigenfaces, and their kernel variants. We also report experiments showing the effectiveness of RDA and kernel RDA on some benchmark datasets.

Comments

There are no comments yet.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Subspace and manifold learning, also referred to as representation learning [bengio2013representation, zhong2016overview]

, are very useful in machine learning and pattern analysis for feature extraction, data visualization, and dimensionality reduction

[ghojogh2019feature]. The subspace of data is a lower-dimensional space than the input space of data which can appropriately represent the data with the smallest possible representation error. In other words, the data usually exist on a submanifold or subspace with a lower intrinsic dimensionality [wang2012geometric, warner2013foundations].

The submanifold of data can be either linear or nonlinear. Different linear and nonlinear methods have been proposed in the literature for subspace and manifold learning. Principal Component Analysis (PCA) [jolliffe2011principal], first proposed in [pearson1901liii], was one of the first methods in linear subspace learning. The Fisher Discriminant Analysis (FDA) [friedman2001elements], first proposed in [fisher1936use], was one of the first linear supervised subspace learning methods. Both PCA and FDA were based on the scatter of data. Metric Multi-Dimensional Scaling (MDS) [cox2000multidimensional] was a linear method which tried to preserve the similarity of the data points. In later approaches after MDS, the cost function in MDS was changed to preserving the distances of points [lee2007nonlinear], yielding to Sammon mapping [sammon1969nonlinear]

. Sammon mapping can probably be considered as the first nonlinear subspace learning method. Another approach to handle nonlinearity of data is to modify the data instead of changing the linear algorithm. That was the perspective of kernel PCA

[scholkopf1997kernel, scholkopf1998nonlinear] which used the dual of PCA and the kernel trick [hofmann2008kernel] to pull the data to the feature space hoping that it becomes roughly linear in that space. Kernel FDA [mika1999fisher, mika2000invariant] was also proposed for handling the nonlinear data in a supervised manner. However, it did not use the kernel trick (which we will explain why in this paper) but used representation theory [alperin1993local] instead. Recently, deep FDA [diaz2017deep, diaz2019deep] was proposed which uses a least squares approach [ye2007least, zhang2010regularized].

After Sammon mapping, there was not any important method, which is actually nonlinear without changing the data like kernel PCA/FDA, until Isomap [tenenbaum2000global] and Locally Linear Embedding (LLE) [roweis2000nonlinear]. The former used geodesic distance rather than Euclidean distance in the kernel of MDS [ham2004kernel, strange2014open] and the latter reconstructed every point by its -Nearest Neighbors (-NN) to locally fit the data [saul2003think]. Stochastic Neighbor Embedding (SNE) [hinton2003stochastic]

was a probabilistic approach to manifold learning where the probability of a point being neighbor of others was tried to be preserved. While the Gaussian distribution was used in SNE, the Student-

distribution was used in the embedded space in -SNE method in order to tackle the crowding problem [maaten2008visualizing]

. Gradually, deep learning became popular where the Deep Belief Network (DBN)

[hinton2006reducing]

was proposed in order to learn the latent subspace of data in an undercomplete autoencoder. One of the most recent important subspace learning methods was Supervised PCA (SPCA)

[barshan2011supervised]

. The SPCA made use of the empirical estimation of the Hilbert-Schmidt Independence Criterion (HSIC)

[gretton2005measuring]

. The HSIC computes the dependence of two random variables by calculating their correlation in the feature space. The SPCA tried to maximize the dependence of projected data and the labels in order to use the information of labels for better embedding. Recently, supervised random projection

[karimi2018srp] used low-rank kernel approximation in the formulation of SPCA for better efficiency.

In this paper, we propose a generalized subspace learning method, named Roweis Discriminant Analysis (RDA), named after Sam Roweis (1972–2010) who contributed significantly to subspace and manifold learning. He proposed many important methods in subspace/manifold learning including LLE [roweis2000nonlinear, saul2003think], SNE [hinton2003stochastic], and metric learning by class collapsing [globerson2006metric]. The proposed RDA is a family of infinite number of subspace learning methods including PCA, FDA, and SPCA which have been proposed in the literature. It is a generalized method and contains many methods which use linear projection into a lower dimensional subspace. The main contributions of our paper can be listed as the following:

1. proposing RDA as a generalized subspace learning method based on eigenvalue and generalized eigenvalue problems,

2. proposing Double Supervised Discriminant Analysis (DSDA), which is not yet proposed in the literature to the best of our knowledge, as one of the extreme cases in RDA,

3. proposing RDA in the feature space to have kernel RDA which generalizes kernel PCA, kernel SPCA, and kernel FDA,

4. explaining the reasons behind some of the characteristics of PCA, SPCA, and FDA such as (I) why SPCA can be used for both classification and regression but FDA is only for classification, (II) why PCA and SPCA have their dual methods but FDA does not have a dual, (III) why kernel PCA and kernel SPCA can use kernel trick but kernel FDA uses representation theory rather than kernel trick, (IV) why PCA is the best linear method for reconstruction,

5. proposing Roweisfaces and kernel Roweisfaces for demonstrating the generalizability of the RDA approach for the eigenfaces [turk1991eigenfaces, turk1991face], kernel eigenfaces [yang2000face], Fisherfaces [belhumeur1997eigenfaces], kernel Fisherfaces [yang2002kernel] and supervised eigenfaces [barshan2011supervised, ghojogh2019unsupervised].

The remainder of the paper is organized as follows. Section II introduces projection into a subspace and the reconstruction of data after projection. Two general forms of optimization for subspace learning are also introduced. In Section III, the theory of PCA, FDA, and SPCA is briefly reviewed. The proposed RDA is explained in Section IV. A dual is also proposed for the RDA in Section V but for some special cases. Section VI explains two types of kernel RDA, one based on kernel trick and another using representation theory. The experimental results are reported in Section VII where the Roweisfaces are also proposed. Section VIII summarizes and concludes the article.

## Ii Subspace and Projection

Let , , , , , , , and denote the training sample size, test (out-of-sample) sample size, dimensionality of data, number of classes, sample size of the -th class, the -th training instance, the -th training instance in the -th class, and the -th test instance, respectively. We stack the training data points column-wise in a matrix and similarly for the out-of-sample data points as .

Let be the projection matrix whose columns are the projection directions spanning the desired subspace so the subspace is the column-space of . If we truncate the projection matrix to have , the subspace is spanned by projection directions and it will be a dimensional subspace where . We want the projection directions to be orthonormal to capture different information; therefore:

 U⊤U=I, (1)

where

is the identity matrix.

The projection of training data into the subspace and its reconstruction after the projection are [wang2012geometric]:

 Rp×n∋˜X:=U⊤X, (2) Rd×n∋ˆX:=UU⊤X=U˜X, (3)

respectively, where and . The projection and reconstruction of the out-of-sample data are:

 Rp×nt∋˜Xt:=U⊤Xt, (4) Rd×n∋ˆXt:=UU⊤Xt=U˜Xt, (5)

respectively, where and . Note that in subspace learning, it is recommended to center the data, either training or out-of-sample, using the training mean. If we do that, the training mean should be added back to the reconstructed data. If the mean of training data is:

 μ:=1nn∑i=1xi, (6)

the centered training data are:

 Rd×n∋˘X:=XH=X−μ, (7)

where and:

 Rn×n∋H:=I−(1/n)11⊤, (8)

is the centering matrix. The covariance matrix, or the total scatter, is defined as:

 Rn×n∋ST :=n∑i=1(xi−μ)(xi−μ)⊤ (9) =˘X˘X⊤(???)=XHHX⊤=XHX⊤,

where it is noticed that the centering matrix is symmetric and idempotent.

If we project the centered training data, we have:

 ||ˆX||2F=||UU⊤˘X||2F=tr(U⊤STU), (10)

where and denote the Frobenius norm and trace of matrix, respectively. Hence,

is the squared length of the reconstruction. This term can also be interpreted as the variance of the projected data according to quadratic characteristic of variance. We want the variance of projection to be maximized where the projection matrix is orthogonal; otherwise, the optimization problem is ill-defined. If we consider a general scatter of data,

, the optimization is:

 maximizeU tr(U⊤SU), (11) subject to U⊤U=I,

where the constraint makes the problem well-defined. The Lagrangian [boyd2004convex] of the problem is:

 L=tr(U⊤SU)−tr(Λ⊤(U⊤U−I)),

where is a diagonal matrix including the Lagrange multipliers. Setting the derivative of Lagrangian to zero gives:

 Rd×p∋∂L∂U=2SU−2UΛset=0⟹SU=UΛ, (12)

which is the eigenvalue problem for where the columns of and the diagonal of

are the eigenvectors and eigenvalues of

, respectively [ghojogh2019eigenvalue]. The eigenvectors and eigenvalues are sorted from the leading (largest eigenvalue) to the trailing (smallest eigenvalue) because we are maximizing in the optimization problem (the reason lies in the second order condition).

We can also have two types of scatters, e.g., and . As the scatter matrix is symmetric and positive semi-definite, we can decompose it as:

 S2(a)=ΨSΩSΨ⊤S=ΨSΩ(1/2)SΩ(1/2)SΨ⊤S(b)=Δ⊤Δ, (13)

where

is because of Singular Value Decomposition (SVD) and

is for . The can be interpreted as manipulated (or rotated) projection directions. We want the manipulated projection matrix to be orthogonal; thus:

 (ΔU)⊤(ΔU)=U⊤S2Uset=I. (14)

In this case, the optimization is expressed as:

 maximizeU tr(U⊤S1U), (15) subject to U⊤S2U=I,

and the constraint makes the problem well-defined. The Lagrangian [boyd2004convex] of the problem is:

 L=tr(U⊤S1U)−tr(Λ⊤(U⊤S2U−I)),

where is a diagonal matrix which includes the Lagrange multipliers. Setting the derivative of Lagrangian to zero gives:

 ∂L∂U=2S1U−2S2UΛset=0⟹S1U=S2UΛ, (16)

which is the generalized eigenvalue problem where the columns of and the diagonal of are the eigenvectors and eigenvalues, respectively [ghojogh2019eigenvalue]. The eigenvectors and eigenvalues are again sorted from the leading to the trailing because of maximization.

## Iii PCA, FDA, and SPCA

The optimization problem in PCA [pearson1901liii, jolliffe2011principal, ghojogh2019unsupervised] is expressed as:

 maximizeU tr(U⊤STU), (17) subject to U⊤U=I,

where is the total scatter defined in Eq. (9). The solution to Eq. (17) is the eigenvalue problem for according to Eq. (12). Thus, the PCA directions are the eigenvectors of the total scatter.

The FDA [fisher1936use, friedman2001elements] maximizes the Fisher criterion [xu2006analysis, fukunaga2013introduction]:

 maximizeU   fF(U):=dB(U)dW(U):=tr(U⊤SBU)tr(U⊤SWU). (18)

The Fisher criterion is a generalized Rayleigh-Ritz Quotient [parlett1998symmetric]. Hence, the optimization in Eq. (18) is equivalent to [ghojogh2019fisher]:

 maximizeU tr(U⊤SBU), (19) subject to U⊤SWU=I,

where the and are the between and within scatters, respectively, defined as:

 Rd×d∋SB:=c∑j=1nj(μj−μ)(μj−μ)⊤, (20) Rd×d∋SW:=c∑j=1nj∑i=1(x(j)i−μj)(x(j)i−μj)⊤, (21)

where the mean of the -th class is:

 Rt∋μj:=1njnj∑i=1x(j)i. (22)

The total scatter can be considered as the summation of the between and within scatters [ye2007least, welling2005fisher]:

 ST=SB+SW⟹SB=ST−SW. (23)

Therefore, the Fisher criterion can be written as [welling2005fisher]:

 fF(U)=dT(U)dW(U)−1:=tr(U⊤STU)tr(U⊤SWU)−1. (24)

The is a constant and can be dropped in the optimization problem because the variable and not the objective is the goal; therefore, the optimization in FDA can now be expressed as:

 maximizeU tr(U⊤STU), (25) subject to U⊤SWU=I.

Hence, the FDA directions can be obtained by the generalized eigenvalue problem [welling2005fisher]. Note that some articles, such as [ye2007least, zhang2010regularized, diaz2017deep], solve the generalized eigenvalue problem by considering another version of the Fisher criterion which is . This criterion is obtained if we consider minimization of the inverse of Eq. (18), use Eq. (23) for , drop the constant , and convert minimization to maximization by inverting the criterion again.

Comparing the Eqs. (17) and (25) shows that PCA captures the orthonormal directions with the maximum variance of data; however, FDA has the same goal but also it requires the manipulated directions to be orthonormal. This manipulation is done by the within scatter which makes sense because the within scatters make use of the class labels. This comparison gives a hint for the connection between PCA and FDA.

The SPCA [barshan2011supervised] makes use of the empirical estimation of the Hilbert-Schmidt Independence Criterion (HSIC) [gretton2005measuring]:

 HSIC:=1(n−1)2tr(K1HK2H), (26)

where and are the kernels over the first and second random variable. The idea of HSIC is to measure the dependence of two random variables by calculating the correlation of their pulled values to the Hilbert space. The SPCA uses HSIC for the projected data and the labels and maximizes the dependence of them in order to make use of the labels. It uses the linear kernel for the projected data and an arbitrary valid kernel for the labels . Therefore, the scaled Eq. (26) in SPCA is:

 tr(X⊤UU⊤XHKyH)(a)=tr(U⊤XHKyHX⊤U), (27)

where is because of the cyclic property of the trace. The optimization problem in SPCA is expressed as:

 maximizeU tr(U⊤XHKyHX⊤U), (28) subject to U⊤U=I,

where is the kernel matrix over the labels of data, either for classification or regression. The solution to Eq. (28) is the eigenvalue problem for according to Eq. (12). Hence, the SPCA directions are the eigenvectors of . Note that this term, restricted to a linear kernel for , is used as the between scatter in [zhang2010regularized, diaz2017deep, diaz2019deep] hinting for a connection between SPCA and FDA if we compare the objectives in Eqs. (19) and (28).

PCA, FDA, and SPCA, which were explained, are several fundamental subspace learning methods which are different in terms of whether/how to use the labels. Comparing Eqs. (17), (25), and (28) and considering the general forms in Eqs. (11) and (15) show that these methods belong to a family of methods based on eigenvalue and generalized eigenvalue problems. This gave us a motivation to propose a generalized subspace learning method, named RDA, as a family of methods including PCA, FDA, and SPCA.

## Iv Roweis Discriminant Analysis (RDA)

### Iv-a Methodology

RDA aims at maximizing the interpreted as the squared length of the reconstruction or the scatter of projection (see Eq. (10)), while requiring the manipulated projection directions to be orthonormal (see Eq. (14)). Therefore, the optimization of RDA is formalized as:

 maximizeU tr(U⊤R1U), (29) subject to U⊤R2U=I,

where and are the first and second Roweis matrices defined as:

 Rd×d∋R1:=XHPHX⊤, (30) Rd×d∋R2:=r2SW+(1−r2)I, (31)

respectively, where:

 Rn×n∋P:=r1Ky+(1−r1)I. (32)

The and are the first and second Roweis factors. Note that in Eq. (30) is double-centering the matrix .

The solution to Eq. (29) is the generalized eigenvalue problem according to Eq. (16). Therefore, the RDA directions are the eigenvectors of this generalized eigenvalue problem. The RDA directions are sorted from the leading to trailing eigenvalues because of the maximization in Eq. (29).

The optimization of the RDA can be interpreted in another way using what we define as the Roweis criterion. We want to maximize this criterion:

 maximizeU   fR(U):=dR1(U)dR2(U):=tr(U⊤R1U)tr(U⊤R2U). (33)

As in FDA, the Roweis criterion is a generalized Rayleigh-Ritz Quotient [parlett1998symmetric]; thus, the optimization in Eq. (33) is equivalent to Eq. (29). It is noteworthy that if we consider only a one-dimensional RDA subspace, the Roweis criterion is:

 fR(u):=dR1(u)dR2(u):=u⊤R1uu⊤R2u, (34)

where is the only RDA projection direction. In this case, the Eq. (29) becomes:

 maximizeu u⊤R1u, (35) subject to u⊤R2u=1.

### Iv-B The Special Cases of RDA & the Roweis Map

Consider the Eqs. (30), (31), and (32). We know that the range of both and is . If we consider the extreme cases of and , we find an interesting relationship:

 r1=0,r2=0⟹RDA≡PCA, (36) r1=0,r2=1⟹RDA≡FDA, (37) r1=1,r2=0⟹RDA≡SPCA, (38)

by comparing Eq. (29) with Eqs. (17), (25), and (28) and noticing the Eq. (9). We see that PCA, FDA, and SPCA are all special cases of RDA.

In fact, RDA is a family of infinite number of algorithms for subspace learning. By choosing any number for the and in the range , RDA gives us a new algorithm for learning the subspace of data. We define a map, named Roweis map, which includes the infinite number of special cases of RDA where three of its corners are PCA, FDA, and SPCA. The rows and columns of the Roweis map are the values of and , respectively. Figure 1-a shows this map. Figure 1: The Roweis map: (a) the map including infinite number of subspace learning methods and its special cases, (b) the map in its input and feature spaces where the map in the feature space is pulled from the map in the input space.

An interesting thing about RDA is that one of the four extreme values for and is not yet proposed in the literature to the best of our knowledge. If and , the Eq. (29) becomes:

 maximizeU tr(U⊤XHKyHX⊤U), (39) subject to U⊤SWU=I,

whose solution is the generalized eigenvalue problem according to Eq. (16). As can be seen, this optimization uses the labels twice, once in the kernel over the labels and once in the within scatter. Therefore, we name this special subspace learning method the Double Supervised Discriminant Analysis (DSDA).

### Iv-C Properties of the Roweis Matrices and Factors

#### Iv-C1 Properties of the Roweis Factors

In Eq. (32), if , we are fully using the kernel over labels and if , it reduces to the identity matrix. So, we have:

 P={Ky%ifr1=1,Iif r1=0. (40)

On the other hand, in Eq. (31), if , we are fully using the within scatter using the labels and if , it reduces to the identity matrix. Hence:

 R2={SWif r2=1,Iif r2=0. (41)

Therefore, we can conclude that if the Roweis factor is one, we fully use the labels as a supervised method and if the Roweis factor is zero, we are not using the labels at all. Therefore, the Roweis factor is a measure of using labels or being supervised. As we have two Roweis factors, we define:

 [0,1]∋s:=(r1+r2)/2, (42)

as the supervision level which is a planar function depicted in Fig. 2. Note that the extremes and refer to the unsupervised and fully (double) supervised subspace learning, respectively. Recall that the Roweis map includes the different subspace learning methods from unsupervised PCA (top-left corner of map) to the DSDA (bottom-right corner of map). The value can be interpreted as using the labels once (as in both FDA and SPCA). Figure 2: The supervision level shown on the Roweis map (best viewed in color).

It is noteworthy that if , we have according to Eq. (41). Therefore, according to Eqs. (29) and (40), the optimization of RDA merely includes and not regarding the labels. On the other hand, the is a soft measure of similarity between the labels while the uses the labels strictly for knowing which data instance belongs to which class. It shows that if we use (), the labels must be for classes in classification; however, if using solely (), the labels can be either for classification or regression. This sheds light to why SPCA (with ) can be used for both classification and regression tasks [barshan2011supervised] but FDA (with ) is only for classification.

#### Iv-C2 Properties of the Roweis Matrices

The Roweis matrices, and , are positive semi-definite. The reason is that the within scatter, the kernel matrix, and the identity matrix are positive semi-definite and , so:

 r1Ky⪰0,(1−r1)I⪰0⟹P⪰0,R1⪰0, (43) r2SW⪰0,(1−r2)I⪰0⟹R2⪰0. (44)

The Roweis matrices are also symmetric because the within scatter, the kernel matrix, and the identity matrix are symmetric; hence:

 R1=R⊤1,   R2=R⊤2,   P=P⊤. (45)

It is also worth mentioning that according to the definitions of , , and , the and the second Roweis matrices have the essence of kernel and scatter matrices, respectively, while the first Roweis matrix has the mixture essence of scatter and kernel matrices. According to Eq. (27), the first Roweis matrix is the term used in the HSIC over the projected data and the labels.

Now, we analyze the rank of the Roweis matrices. We have where is the pulled to the feature space [hofmann2008kernel]. We usually have . Hence, . According to Eq. (32) and the subadditivity property of the rank:

 rank(P)≤rank(Ky)+rank(In)≤min(n,t)+n. (46)

Note that in the extreme cases and , the rank of is and , respectively. According to Eq. (30):

 rank(R1) ≤min(rank(X)+rank(H)+rank(P)) ≤min(d,n−1,t)=min(d,n−1), (47)

where the is because of subtracting the mean in centering the matrix (e.g., consider the case with only one data instance). In general, the rank of a covariance (scatter) matrix over the -dimensional data with sample size is at most . The is because the covariance matrix is a matrix, the is because we iterate over data instances for calculating the covariance matrix, and the is again because of subtracting the mean. Hence, according to Eq. (21), we have . According to Eq. (31) and the subadditivity property of the rank:

 rank(R2)≤rank(SW)+rank(Id)≤min(d,n−1)+d. (48)

In the extreme cases and , the rank of is and , respectively.

### Iv-D Dimensionality of the RDA Subspace

There exist a rigorous solution for the generalized eigenvalue problem [ghojogh2019eigenvalue]. However, we can solve this problem as:

 R1U=R2UΛ ⟹R−12R1U=UΛ ⟹U=eig(R−12R1), (49)

where stacks the eigenvectors column-wise. If is singular, we strengthen its diagonal:

 U=eig((R2+εI)−1R1), (50)

where is a very small positive number, large enough to make full rank. In the literature, this approach is known as regularized discriminant analysis [friedman1989regularized].

According to Eqs. (47) and (48), the rank of in Eq. (49) is:

 rank(R−12R1) ≤min(rank(R−12),% rank(R1)) ≤min(d,n−1). (51)

Therefore, the , which is the dimensionality of the RDA subspace, is at most . The leading eigenvectors are considered as the RDA directions and the rest of eigenvectors are invalid and ignored for having zero or very small eigenvalues.

It is noteworthy that according to Eq. (20), we have . If Eq. (19) is used for FDA, we have in FDA. In RDA, for the values , RDA is reduced to FDA when the Eq. (25) is used; hence, the dimensionality of FDA as a special case of RDA is at most and not .

### Iv-E Robust RDA

According to Eq. (48), the rank of is at most . In the extreme case , its rank is at most . Thus, for the case is very close to one, i.e., , if , the is singular. This happens when the dimensionality of data is huge but the sample size is small. In this case, we face a problem if we use the Eq. (49), for solving the generalized eigenvalue problem . Two possible approaches to tackle this problem are the pseudo-inverse of [webb2003statistical] and strengthening the diagonal of [friedman1989regularized, diaz2017deep]. We can also propose the Robust RDA (RRDA) inspired by robust FDA [deng2007robust, guo2015feature] to be robust against singularity. We emphasis that RRDA is useful if , , where we want to use Eq. (49) for solving the generalized eigenvalue problem.

In RRDA, the is decomposed using eigenvalue decomposition: where and include the eigenvectors and eigenvalues of , respectively. The eigenvalues are sorted as and the eigenvectors are sorted accordingly. If is close to singularity, the first eigenvalues are valid and the rest eigenvalues are either very small or zero. The appropriate is obtained as:

 d′:=argminm(∑mj=1λjd∑k=1λk≥0.98), (52)

where is a selected number close to . The invalid eigenvalues are replaced with :

 Rd×d∋Λ′:=% diag([λ1,…,λd′,λ∗,…,λ∗]⊤), (53)

where [deng2007robust]:

 λ∗:=1d−d′d∑j=d′+1λj. (54)

Hence, the is replaced with :

 Rd×d∋R′2:=Φ⊤Λ′Φ, (55)

and the robust RDA directions are the eigenvectors of the generalized eigenvalue problem .

## V Dual RDA for r2=0

The dual RDA exists for for the reason explained later in this section. This statement clarifies why, in the literature, PCA and SPCA (with ) have their dual methods but FDA (with ) does not have a dual.

As the matrix is symmetric and positive semi-definite, we can decompose it as:

 Ky(a)=ΨKΩKΨ⊤K=ΨKΩ(1/2)KΩ(1/2)KΨ⊤K(b)=ΥΥ⊤, (56)

where is because of SVD and is for:

 Rn×n∋Υ:=ΨKΩ(1/2)K. (57)

Thus, the can be decomposed as:

 R1 =XH(r1Ky+(1−r1)I)HX⊤ (a)=r1XHΥΥ⊤H⊤X⊤+(1−r1)XHH⊤X⊤ (b)=r1QQ⊤+(1−r1)˘X˘X⊤ (???)=r1QQ⊤+(1−r1)ST(c)=WW⊤, (58)

where is because of Eq. (56) and is symmetric, is because of Eq. (7) and:

 Rd×n∋Q:=XHΥ, (59)

and is because the and are both symmetric and in the the same form so their summation is also symmetric (recall Eq. (45)) and in the same form. We can obtain Eq. (58) in another way, too: as is symmetric and positive semi-definite (see Eq. (43)), we can decompose it:

 R1(a)=ΨRΩRΨ⊤R=ΨRΩ(1/2)RΩ(1/2)RΨ⊤R(b)=WW⊤, (60)

where is because of SVD and is for:

 W:=ΨRΩ(1/2)R. (61)

For the reason that will be explained at the end of this section, we use the incomplete SVD in for where , , [golub1971singular]. Therefore, we have . If we had used the complete SVD, we would have .

If , we have according to Eq. (31). In this case, the Eq. (29) becomes similar to Eq. (11) and thus the solution which is the generalized eigenvalue problem gets reduced to the eigenvalue problem for

. According to the properties of SVD, the matrix of left singular vectors of

is equivalent to the matrix of eigenvectors of . Therefore, we can use incomplete SVD for :

 Rd×k∋W=UΣV⊤, (62)

where the columns of and are the left and right singular vectors (i.e., the eigenvectors of and ), respectively. The diagonal entries of are the singular value of , i.e., the square root of eigenvalues of or .

The dual RDA exists only for because in Eq. (62), the

is an orthogonal matrix so

. This implies that the constraint in Eq. (29) should be which means . Recall that this is the reason for FDA (with ) not having a dual.

We obtain from Eq. (62):

 WV(???)=UΣ⟹U=WVΣ−1, (63)

where the orthogonality of is noticed. The projection of data in dual RDA is:

 ˜X (???)=U⊤X(???)=(WVΣ−1)⊤X=Σ−⊤V⊤W⊤X (???)=Σ−1V⊤Ω(1/2)⊤RΨ⊤RX. (64)

Note that is symmetric. Similarly, out-of-sample projection in dual RDA is:

 ˜Xt=Σ−1V⊤Ω(1/2)⊤RΨ⊤RXt. (65)

The reconstruction of training data in dual RDA is:

 ˆX (???)=UU⊤X=U˜X (a)=WVΣ−1Σ−1V⊤Ω(1/2)⊤RΨ⊤RX (???)=ΨRΩ(1/2)RVΣ−2V⊤Ω(1/2)⊤RΨ⊤RX, (66)

where is because of Eqs. (63) and (64). Similarly, reconstruction of out-of-sample data in dual RDA is:

 ˆXt=ΨRΩ(1/2)RVΣ−2V⊤Ω(1/2)⊤RΨ⊤RXt. (67)

To summarize, in RDA, the projection of training and out-of-sample data is Eq. (2) and (4), respectively. In RDA, the reconstruction of training and out-of-sample data is Eq. (3) and (5), respectively. However, in dual RDA, the projection of training and out-of-sample data is Eq. (64) and (65), respectively. In dual RDA, the reconstruction of training and out-of-sample data is Eq. (66) and (67), respectively.

Note that in RDA and dual RDA, we can truncate , , and , respectively, to , , and in order to have a -dimensional subspace (). For determining the appropriate in RDA, the scree plot [cattell1966scree] or the ratio [abdi2010principal] can be used, where is the -th largest eigenvalue.

The dual RDA is very useful especially if the dimensionality of data is much greater than the sample size of data, i.e., . In this case . According to Eq. (62), and are the eigenvectors of and , respectively. As in this case, the computation of eigenvectors of is much faster and needs less storage than . Therefore, in this case, we use the eigenvalue decomposition:

 W⊤W=VΣ2V⊤, (68)

rather than using Eq. (62). Note that in dual RDA, we do not require but only and . Hence, if , it is recommended to use dual RDA which is more efficient than RDA in terms of speed of calculation and storage.

## Vi Kernel RDA

Let be the pulling function mapping the data to the feature space . In other words, . Let denote the dimensionality of the feature space, i.e., while . Note that we usually have . The kernel over two vectors and is the inner product of their pulled data [hofmann2008kernel]:

 R∋k(x1,x2):=ϕ(x1)⊤ϕ(x2). (69)

We can have two types of kernel RDA, one using the dual RDA and kernel trick and the other one using the representation theory [alperin1993local]. The former, which makes use of the inner product of data instances, holds for only two special cases but the latter works for the entire range of the Roweis map.

### Vi-a Kernel RDA Using Dual RDA for Special Cases

If , we have the dual RDA and can use it for working out the kernel RDA using the kernel trick. However, the dual RDA is useful for the kernel trick in the two cases of and . The reason is that in these cases, the inner product of data points appear enabling us to use the kernel trick. Therefore, we can use the dual RDA for kernel trick in (i.e., PCA) and (i.e., SPCA). This explains why, in the literature, PCA and SPCA (with ) have their kernel methods using kernel trick (or their dual) but kernel FDA (with ) uses representation theory and not the kernel trick.

#### Vi-A1 Special Case of Kernel RDA: Kernel PCA

If , we have according to Eq. (58). In this case, . So, we decompose in Eq. (62). In this case, RDA is reduced to PCA where the data should be centered. Hence, in reconstruction, we add the training mean back. Therefore, the Eqs. (64), (65), (66), and (67) are modified to:

 ˘X=UΣV⊤(a)⟹U⊤˘X=ΣV⊤(???)⟹˜X=ΣV⊤, (70) ˜Xt=Σ−1V⊤˘X⊤˘Xt, (71) ˘X=UΣV⊤(a)⟹U=˘XVΣ−1 ⟹ˆX(???)=U˜X+μ(???)=˘XVV⊤+μ, (72) ˆXt=˘XVΣ−2V⊤˘X⊤˘Xt+μ, (73)

respectively, where is because and are orthogonal matrices and is the centered out-of-sample data using the training mean:

 ˘Xt:=Xt−μ. (74)

In the kernel RDA for (i.e., kernel PCA [scholkopf1997kernel, scholkopf1998nonlinear]), the incomplete SVD is applied on the which is the centered data in the feature space. We use the Eqs. (70), (71), (72), and (73) and replace the inner products by the kernels:

 ˘Φ(X)=UΣV⊤⟹Φ(˜X)=ΣV⊤, (75) Φ(˜Xt)=Σ−1V⊤˘Φ(X)⊤˘Φ(Xt)(a)=Σ−1V⊤˘Kt, (76) Φ(ˆX)=˘Φ(X)VV⊤+μ, (77) Φ(ˆXt)=˘Φ(X)VΣ−2V⊤˘Φ(X)⊤˘Φ(Xt)+μ (a)=˘Φ(X)VΣ−2V⊤˘Kt+μ, (78)

where is the centered out-of-sample data in the feature space using the mean of the pulled training data and is because the double-centered kernel over the training and out-of-sample data is calculated as:

 Rn×nt∋˘Kt :=Kt−1n1n×nKt