Supervised Linear Dimension-Reduction Methods: Review, Extensions, and Comparisons

09/09/2021
by Shaojie Xu, et al. (Wells Fargo)

Principal component analysis (PCA) is a well-known linear dimension-reduction method that has been widely used in data analysis and modeling. It is an unsupervised learning technique that identifies a suitable linear subspace for the input variables that contains maximal variation and preserves as much information as possible. PCA has also been used in prediction models where the original, high-dimensional space of predictors is reduced to a smaller, more manageable set before conducting regression analysis. However, this approach does not incorporate information in the response during the dimension-reduction stage and hence can have poor predictive performance. To address this concern, several supervised linear dimension-reduction techniques have been proposed in the literature. This paper reviews selected techniques, extends some of them, and compares their performance through simulations. Two of these techniques, partial least squares (PLS) and least-squares PCA (LSPCA), consistently outperform the others in this study.


1 Introduction

Dimension reduction has been widely used in exploratory data analysis (EDA) and as a pre-processing step in high-dimensional prediction problems. In EDA, dimension reduction is used to discover the underlying structure in a dataset: whether the data lie in a smaller subspace so that we can interpret them more easily. In prediction problems, the goal of dimension reduction is to reduce the high-dimensional space of the predictors before fitting a regression model. The rationale is that conventional classification and regression methods often suffer from the “curse of dimensionality”, which states that the number of data points required for learning grows exponentially with the dimensionality of the data. Dimension reduction is related to, but distinct from, variable selection; both share the goal of developing parsimonious and interpretable prediction models.

Principal component analysis (PCA) is among the earliest dimension-reduction techniques (Pearson, 1901; Hotelling, 1933). It identifies the linear subspace that best approximates the original dataset in terms of retaining the maximum variation. The linear nature of PCA makes the results intrinsically interpretable, as the transformed variables are weighted combinations of the original ones and are orthogonal.

PCA has also been used in the context of predictive modeling, where the approach is called principal component regression (PCR): PCA first reduces the dimension of the predictors, and regression is then performed on the reduced set. However, since the information in the responses is not used during dimension reduction, there is no guarantee that PCR effectively identifies the “best” subspace. In practice, the subspace containing the maximal data variation can differ substantially from the subspace that is most relevant for the prediction problem.

To address this concern, multiple supervised dimension-reduction techniques have been proposed to take advantage of the information in the responses (Bair et al., 2006; Piironen & Vehtari, 2018; Yu et al., 2006; Barshan et al., 2011; Geladi & Kowalski, 1986; Ritchie et al., 2019). This paper provides a review of these selected approaches and proposes some minor extensions. We restrict our attention to continuous responses.

The rest of the paper is structured as follows. In Section 2, we review PCA and introduce the notation used throughout the paper. We divide supervised linear dimension-reduction techniques into two categories. First, Section 3 discusses the class of “wrapper methods”, which are built around the classical PCA and contain it as a subroutine. Section 4 discusses the second category, the “intrinsic methods”; these techniques incorporate the label information directly into the objective function of the subspace-learning problem.

We compare the performance of all the algorithms on synthetic and real data sets. The experimental results and analysis are provided in Sections 5 and 6, using simulated and real data sets respectively. Section 7 concludes the paper with comments on non-linear and semi-supervised dimension-reduction techniques.

2 Preliminaries: PCA

Consider an unsupervised learning problem where the data are represented in matrix form $X \in \mathbb{R}^{N \times P}$. Here, each row of $X$ contains one of the $N$ data points, and each point $x_i$ is a $P$-dimensional vector of variables. Throughout, we assume the data have been column-centered, so that the mean over each column of $X$ is zero. The goal in PCA is to find an appropriate $K$-dimensional linear subspace of $\mathbb{R}^P$ and the corresponding orthogonal basis, which can be done by computing the eigenvalue (spectral) decomposition of the covariance matrix $C = \frac{1}{N} X^\top X$. The algorithm is summarized in Algorithm 1. In practice, the PCA basis vectors can be calculated directly using the singular value decomposition (SVD) of $X$.

PCA has several attractive properties. First, the $K$-dimensional linear subspace spanned by the principal components (PCs), $a_1, \dots, a_K$, captures the maximum variability in the data. Second, the transformed variable vectors $z_k = X a_k$ are uncorrelated:

$$\frac{1}{N} Z^\top Z = \operatorname{diag}(\lambda_1, \dots, \lambda_K), \tag{1}$$

where $Z = [z_1, \dots, z_K] = XA$ and $\lambda_1 \ge \dots \ge \lambda_K$ are the $K$ largest eigenvalues of $C$. Last, the reconstructed data $X A A^\top$ are also the best linear approximation of the original data in Euclidean norm. That is, the PCA basis $A_{\mathrm{PCA}} = [a_1, \dots, a_K]$ is the maximizer of the following optimization problem:

$$\max_{A \in \mathbb{R}^{P \times K}:\; A^\top A = I_K} \operatorname{tr}\!\left(A^\top X^\top X A\right). \tag{2}$$

In the rest of the paper, we focus on the prediction problem where the continuous responses $y \in \mathbb{R}^N$ associated with $X$ are available. We use $A = [a_1, \dots, a_K]$ to represent the learned basis and add the algorithm name as a subscript when the context is not clear. In addition, we use superscripts to denote variable selection or iteration.

  Given centered training data $X \in \mathbb{R}^{N \times P}$ and a desired lower dimension $K$.
  Calculate the covariance matrix $C = \frac{1}{N} X^\top X$.
  Calculate the first $K$ largest eigenvalues of $C$, $\lambda_1 \ge \dots \ge \lambda_K$, and their corresponding eigenvectors $a_1, \dots, a_K$.
  Build the PCA basis $A = [a_1, \dots, a_K]$, $A \in \mathbb{R}^{P \times K}$.
  To transform a new sample $x \in \mathbb{R}^P$:
     Return $z = A^\top x$.
Algorithm 1 PCA
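
The following is a minimal NumPy sketch of Algorithm 1 (not from the original paper); it assumes the rows of X hold the samples and that X has already been column-centered, and the function names pca_fit and pca_transform are ours.

import numpy as np

def pca_fit(X, K):
    """Return a P x K basis of the top-K principal directions of centered X."""
    N = X.shape[0]
    C = X.T @ X / N                       # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:K]
    return eigvecs[:, order]              # columns are the basis vectors a_1, ..., a_K

def pca_transform(A, X_new):
    """Project (already centered) samples onto the learned basis."""
    return X_new @ A

# Example usage with random centered data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
X = X - X.mean(axis=0)
A = pca_fit(X, K=5)
Z = pca_transform(A, X)   # 200 x 5 transformed data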

3 Wrapper methods

We first consider the wrapper methods that contain PCA as a subroutine and build around it to incorporate the label information. Since PCA learns the subspace in an unsupervised manner, the resulting subspace may not be the “best” in developing a regression model. Supervised linear dimension-reduction methods address this concern by incorporating the response information into the learning process. One way to utilize the information is to add pre-processing/post-processing steps before/after performing PCA.

3.1 Bair method

Bair et al. (2006) proposed a method that adds a variable-selection step before running PCA; we refer to it as the Bair method in this paper. It selects a subset of the original variables that are the most related to the responses and performs PCA on this reduced variable matrix. Define a score function $s_j$ that measures how closely the $j$th variable $X_{\cdot j}$ is associated with the responses $y$. One choice for the score function is the absolute value of the covariance:

$$s_j = \frac{1}{N} \left| X_{\cdot j}^\top y \right|. \tag{3}$$

However, since the variables may have different variances, it may be better to consider the Pearson correlation coefficient:

$$s_j = \frac{\left| X_{\cdot j}^\top y \right|}{\left\| X_{\cdot j} \right\|_2 \left\| y \right\|_2}. \tag{4}$$

We can rank the “importance” of the variables by their scores using either one of the two score functions. We then select the top $M$ variables to form a new data matrix $X_s$ using the corresponding columns from $X$. Finally, PCA is performed on the new variable matrix $X_s$.

  Given centered training data $X$, their corresponding centered label $y$, and a desired lower dimension $K$.
  Given a variable score function $s(\cdot)$, e.g. (3) or (4).
  Calculate the score of each variable, $s_j$, $j = 1, \dots, P$, and rank them in decreasing order.
  for each candidate number of selected variables $M$ do
     Choose the first $M$ variables, which form an index set $\mathcal{S}_M$.
     Construct the new data matrix $X_s$ by selecting the corresponding columns $X_{\cdot j}$, $j \in \mathcal{S}_M$, from $X$.
     Perform PCA as in Algorithm 1 on $X_s$. Get the basis $A$ and transform the data: $Z = X_s A$.
     Build the subsequent prediction model using $Z$ and $y$.
     Record the prediction performance.
  end for
  Choose the $M$ with the best performance and keep the corresponding $\mathcal{S}_M$ and $A$.
  To transform a new sample $x$:
     Form a reduced sample $x_s$ by selecting the variables in $\mathcal{S}_M$.
     Return $z = A^\top x_s$.
Algorithm 2 Bair Method

The Bair method contains one hyperparameter, $M$ (the number of selected variables), which can be selected based on the performance of the subsequent prediction model measured on the training or a validation set. The method is summarized in Algorithm 2. In this version, we directly use the performance on the training set to select $M$.

By eliminating the variables that contain little information about the responses, the Bair method learns a more “informative” linear subspace. Variables with large variation tend to dominate the classic PCA, but they may not be relevant for prediction. The Bair method addresses this problem and effectively controls the impact of such variables.
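
Below is a minimal sketch of the Bair method in Algorithm 2, reusing pca_fit from the PCA sketch above and using the correlation score (4). The candidate values of M, the plain least-squares downstream model, and the training-MSE selection rule follow our reading of the text; all names are illustrative.

import numpy as np

def corr_scores(X, y):
    """Score (4): absolute Pearson correlation of each centered column of X with centered y."""
    return np.abs(X.T @ y) / (np.linalg.norm(X, axis=0) * np.linalg.norm(y) + 1e-12)

def bair_fit(X, y, K, candidate_M=None):
    """Screen variables by score, then run PCA on the selected columns."""
    P = X.shape[1]
    order = np.argsort(corr_scores(X, y))[::-1]       # most relevant variables first
    if candidate_M is None:
        candidate_M = range(K, P + 1)
    best = None
    for M in candidate_M:
        cols = order[:M]
        A = pca_fit(X[:, cols], K)                    # from the PCA sketch above
        Z = X[:, cols] @ A
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)  # downstream linear model
        mse = np.mean((y - Z @ beta) ** 2)            # training MSE, as in the text
        if best is None or mse < best[0]:
            best = (mse, cols, A)
    _, cols, A = best
    return cols, A                                    # transform a new x via A.T @ x[cols]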

3.2 Piironen-Vehtari (PV) method

An iterative version of the Bair method was proposed by Piironen and Vehtari (2018). Although it was named ISPCA in the original paper, we refer to it as the PV method. It runs the Bair algorithm iteratively and obtains one PC at a time. In the $k$th iteration, a subset of variables that are most related to the current responses is chosen to form a new dataset. Denote the new dataset as $X_s^{(k)}$, which contains the selected columns from the full data matrix $X^{(k)}$. Only the first PC is calculated from $X_s^{(k)}$. As the transformed variable within each iteration is now one-dimensional, the number of selected variables can be chosen so that the transformed variable is most correlated with the response.

Denote $\mathcal{S}_k$ as the set of the selected variables and $z_k = X_s^{(k)} a_k$ as the one-dimensional transformed data. The PV method then subtracts the variation explained by the chosen PC from the data by regressing each feature (both selected and unselected), as well as the responses, on the one-dimensional transformed data:

$$X^{(k+1)} = X^{(k)} - \frac{z_k z_k^\top}{z_k^\top z_k} X^{(k)}, \tag{5}$$
$$y^{(k+1)} = y^{(k)} - \frac{z_k z_k^\top}{z_k^\top z_k} y^{(k)}. \tag{6}$$
  Given centered training data $X$, their corresponding centered label $y$, and a desired lower dimension $K$.
  Given a variable score function $s(\cdot)$, e.g. (3) or (4).
  Set $X^{(1)} = X$, $y^{(1)} = y$.
  for $k = 1, \dots, K$ do
     Calculate the score of each variable, $s_j$, and rank them in decreasing order.
     for each candidate number of selected variables $M$ do
        Choose the first $M$ variables, which form a set $\mathcal{S}_{k,M}$.
        Construct the new data matrix $X_s$ by selecting the corresponding columns $X^{(k)}_{\cdot j}$, $j \in \mathcal{S}_{k,M}$, from $X^{(k)}$.
        Perform PCA as in Algorithm 1 on $X_s$ to get the first PC, $a_{k,M}$, and the transformed data, $z_{k,M} = X_s a_{k,M}$.
        Calculate the score $s(z_{k,M})$ using (4).
     end for
     Choose the $M$ with the highest score and record $\mathcal{S}_k$ and the corresponding $a_k$, $z_k$.
     Update the variables and the responses using (5)-(6) and record $w_k = (X^{(k)})^\top z_k / (z_k^\top z_k)$.
  end for
  To transform a new sample $x$:
     Set $x^{(1)} = x$.
     for $k = 1, \dots, K$ do
        Construct a reduced sample $x_s^{(k)}$ by selecting the variables in $\mathcal{S}_k$.
        Calculate $z_k = a_k^\top x_s^{(k)}$.
        Update the variables: $x^{(k+1)} = x^{(k)} - z_k w_k$.
     end for
     Return $z = (z_1, \dots, z_K)^\top$.
Algorithm 3 Piironen-Vehtari (PV) Method

The PV method is summarized in Algorithm 3. Unlike the Bair method, it uses an iterative scheme that can select a different set of variables in each iteration, all of which contribute to the transformed data. The iterative scheme breaks one of the major properties of the classic PCA: the basis vectors obtained by this method are no longer orthogonal to each other. Nevertheless, the features in the transformed space are still uncorrelated with each other (Piironen & Vehtari, 2018).
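
A minimal sketch of the PV method in Algorithm 3 follows, reusing corr_scores and pca_fit from the earlier sketches. The grid of candidate sizes M and the deflation in (5)-(6) follow our reconstruction above, so treat the details as illustrative rather than the authors' exact implementation.

import numpy as np

def pv_fit(X, y, K, candidate_M=(1, 2, 5, 10, 20, 50)):
    """Iteratively pick a screened first PC, then deflate X and y as in (5)-(6)."""
    Xk, yk = X.copy(), y.copy()
    supports, pcs, ws = [], [], []
    for _ in range(K):
        order = np.argsort(corr_scores(Xk, yk))[::-1]
        best = None
        for M in candidate_M:
            if M > Xk.shape[1]:
                break
            cols = order[:M]
            a = pca_fit(Xk[:, cols], 1)[:, 0]          # first PC of the selected columns
            z = Xk[:, cols] @ a
            score = np.abs(z @ yk) / (np.linalg.norm(z) * np.linalg.norm(yk) + 1e-12)
            if best is None or score > best[0]:
                best = (score, cols, a, z)
        _, cols, a, z = best
        w = Xk.T @ z / (z @ z)                          # regression of each feature on z, eq. (5)
        Xk = Xk - np.outer(z, w)
        yk = yk - z * (z @ yk) / (z @ z)                # eq. (6)
        supports.append(cols); pcs.append(a); ws.append(w)
    return supports, pcs, ws

def pv_transform(x, supports, pcs, ws):
    """Transform a single new sample following Algorithm 3."""
    x = x.copy()
    z = np.zeros(len(pcs))
    for k, (cols, a, w) in enumerate(zip(supports, pcs, ws)):
        z[k] = a @ x[cols]
        x = x - z[k] * w
    return z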

3.3 PC post-selection (PCPS) method

The previous two methods use variable selection as a pre-processing step before applying the classic PCA. We also consider a post-processing step after PCA to incorporate the response information. We refer to it as the PC post-selection (PCPS) method. We first conduct PCA with all variables to obtain the full set of PCs, and then calculate the transformed data along each PC: $z_j = X a_j$, $j = 1, \dots, P$. One of the two score functions discussed earlier, (3) or (4), is then used to measure the relationship between each transformed dimension and the responses: $s(z_j)$. The $K$ PCs with the highest scores are chosen to form the basis of the learned linear subspace. This technique is summarized in Algorithm 4.

  Given centered training data $X$, their corresponding centered label $y$, and a desired lower dimension $K$.
  Given a PC score function $s(\cdot)$, e.g. (3) or (4).
  Perform PCA as in Algorithm 1 on $X$ and get all $P$ PCs, $a_1, \dots, a_P$.
  Calculate the score of each PC: $s(X a_j)$, $j = 1, \dots, P$.
  Build the basis $A$ by selecting the $K$ PCs with the highest scores.
  To transform a new sample $x$:
     Return $z = A^\top x$.
Algorithm 4 PC Post-Selection (PCPS) Method
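
A short sketch of the PCPS method in Algorithm 4, again reusing pca_fit and corr_scores from the earlier sketches; the names are ours.

import numpy as np

def pcps_fit(X, y, K):
    """Compute all PCs, score each transformed dimension against y, keep the best K."""
    P = X.shape[1]
    A_full = pca_fit(X, P)                 # all P principal directions
    Z = X @ A_full                         # transformed data along each PC
    scores = corr_scores(Z, y)             # score (4) applied to the transformed dimensions
    keep = np.argsort(scores)[::-1][:K]
    return A_full[:, keep]                 # transform a new sample x via A.T @ x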

4 Intrinsic methods

As noted in (2), PCA can be derived from an optimization problem. One can expand the formulation in (2) to incorporate the label information into the objective function, which is the underlying idea of several supervised linear dimension reduction algorithms. Since the objective function contains both supervised and unsupervised components, we call them “intrinsic methods”.

4.1 Partial Least Squares (PLS)

Partial least squares (PLS) has a long history. It was originally proposed in Wold (1975), and many variations have been developed since then (Geladi & Kowalski, 1986). Although PLS has been used mainly with multivariate responses, our review focuses on its application to one-dimensional responses.

PLS constructs the basis of a linear subspace iteratively. Within each iteration $k$, a basis vector $a_k$ is calculated by maximizing the (squared) covariance between the responses and the transformed input variables:

$$a_k = \arg\max_{\|a\|_2 = 1} \left( (y^{(k)})^\top X^{(k)} a \right)^2, \tag{7}$$

where $X^{(1)} = X$ and $y^{(1)} = y$.

This optimization problem can be solved directly using eigenvalue decomposition, while the iterative solver in the original PLS can be viewed as the power-iteration method for computing the eigenvectors. After each iteration, both the variable matrix and the responses are updated by subtracting the parts explained along the direction of $z_k = X^{(k)} a_k$:

$$X^{(k+1)} = X^{(k)} - \frac{z_k z_k^\top}{z_k^\top z_k} X^{(k)}, \tag{8}$$
$$y^{(k+1)} = y^{(k)} - \frac{z_k z_k^\top}{z_k^\top z_k} y^{(k)}. \tag{9}$$

PLS is described in Algorithm 5.

We also consider an extended version of the PLS algorithm by combining the supervised objective in (7) with the unsupervised objective of PCA in (2). The new objective function in each iteration is now:

$$a_k = \arg\max_{\|a\|_2 = 1} \; (1-\lambda) \left( (y^{(k)})^\top X^{(k)} a \right)^2 + \lambda \left\| X^{(k)} a \right\|_2^2, \tag{10}$$

and an equivalent form is:

$$a_k = \arg\max_{\|a\|_2 = 1} \; a^\top (X^{(k)})^\top \left[ (1-\lambda)\, y^{(k)} (y^{(k)})^\top + \lambda I_N \right] X^{(k)} a. \tag{11}$$

The hyperparameter $\lambda \in [0, 1]$ is used to balance between the supervised and the unsupervised objectives. When $\lambda = 0$, the new model reduces to the original PLS method. When $\lambda = 1$, the new model becomes the classic PCA. Similar to PCA, (11) can be solved by eigenvalue decomposition. Our extended version of PLS is summarized in Algorithm 6.

  Given centered training data $X$, their corresponding centered label $y$, and a desired lower dimension $K$.
  Set $X^{(1)} = X$, $y^{(1)} = y$.
  for $k = 1, \dots, K$ do
     Solve (7) by calculating the largest eigenvalue of $(X^{(k)})^\top y^{(k)} (y^{(k)})^\top X^{(k)}$ and setting its corresponding eigenvector as $a_k$.
     Update the data matrix and the responses using (8)-(9).
  end for
  Build the basis $A = [a_1, \dots, a_K]$.
  To transform a new sample $x$:
     Return $z = A^\top x$.
Algorithm 5 PLS
  Given centered training data $X$, their corresponding centered label $y$, and a desired lower dimension $K$.
  Select a value for the hyperparameter $\lambda$.
  Set $X^{(1)} = X$, $y^{(1)} = y$.
  for $k = 1, \dots, K$ do
     Solve (10) by calculating the largest eigenvalue of $(X^{(k)})^\top \left[(1-\lambda)\, y^{(k)} (y^{(k)})^\top + \lambda I_N\right] X^{(k)}$ and setting its corresponding eigenvector as $a_k$.
     Update the data matrix and the responses using (8)-(9).
  end for
  Build the basis $A = [a_1, \dots, a_K]$.
  To transform a new sample $x$:
     Return $z = A^\top x$.
Algorithm 6 Extended PLS
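
Below is a minimal sketch of Algorithms 5 and 6 in one routine: lam = 0 corresponds to plain PLS and larger lam moves the solution toward PCA. The convex-combination weighting follows our reconstruction of (10)-(11), so the exact form of the blend is an assumption.

import numpy as np

def extended_pls_fit(X, y, K, lam=0.0):
    """Extract K directions from a blend of the PLS and PCA objectives, with deflation."""
    Xk, yk = X.copy(), y.copy()
    basis = []
    for _ in range(K):
        v = Xk.T @ yk
        # Matrix of the quadratic form in (11): (1 - lam) X'yy'X + lam X'X
        M = (1 - lam) * np.outer(v, v) + lam * (Xk.T @ Xk)
        a = np.linalg.eigh(M)[1][:, -1]              # eigenvector of the largest eigenvalue
        z = Xk @ a
        Xk = Xk - np.outer(z, Xk.T @ z) / (z @ z)    # deflation (8)
        yk = yk - z * (z @ yk) / (z @ z)             # deflation (9)
        basis.append(a)
    return np.column_stack(basis)                    # project new samples via X_new @ A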

4.2 Barshan method

Barshan et al. (2011) formulated their method in a general manner using reproducing kernel Hilbert spaces; we refer to it as the Barshan method in this paper. The “kernel trick” allows the method to achieve non-linear dimension reduction, but this level of generality is beyond the scope of this review. We focus on a specific version of the Barshan method with a linear target-variable (response) kernel. The objective function of this version has the following form:

$$A_{\mathrm{Barshan}} = \arg\max_{A:\; A^\top A = I_K} \operatorname{tr}\!\left(A^\top X^\top y\, y^\top X A\right). \tag{12}$$

It can also be represented in an equivalent form:

$$A_{\mathrm{Barshan}} = \arg\max_{A:\; A^\top A = I_K} \sum_{k=1}^K \left( y^\top X a_k \right)^2. \tag{13}$$

Equation (13) can be viewed as maximizing the sum of squares of the covariances between the response and each dimension of the transformed variable $X a_k$. The method is described in Algorithm 7.

Notice that the formulation in (13) becomes (7) when $A$ contains only one column. Therefore, PLS can be viewed as running the Barshan method with a linear target-variable kernel within each iteration (Barshan et al., 2011). The bases learned by Algorithms 7 and 5 share the same first basis vector. However, the remaining vectors are different, and they span different subspaces. Both bases are orthogonal.

Similar to how we extended PLS to include an unsupervised objective, we can also modify the Barshan method to include an unsupervised component. The new objective function is now:

$$A = \arg\max_{A:\; A^\top A = I_K} \; (1-\lambda) \operatorname{tr}\!\left(A^\top X^\top y\, y^\top X A\right) + \lambda \operatorname{tr}\!\left(A^\top X^\top X A\right), \tag{14}$$

with its equivalent form:

$$A = \arg\max_{A:\; A^\top A = I_K} \operatorname{tr}\!\left(A^\top X^\top \left[(1-\lambda)\, y y^\top + \lambda I_N\right] X A\right). \tag{15}$$

Our extended version of the Barshan method is summarized in Algorithm 8.

  Given centered training data $X$, their corresponding centered label $y$, and a desired lower dimension $K$.
  Solve (12) by calculating the first $K$ largest eigenvalues of $X^\top y\, y^\top X$ and their corresponding eigenvectors $a_1, \dots, a_K$.
  Build the basis $A = [a_1, \dots, a_K]$.
  To transform a new sample $x$:
     Return $z = A^\top x$.
Algorithm 7 Barshan Method with Linear Target Variable Kernel
  Given centered training data $X$, their corresponding centered label $y$, and a desired lower dimension $K$.
  Select a value for the hyperparameter $\lambda$.
  Solve (15) by calculating the first $K$ largest eigenvalues of $X^\top \left[(1-\lambda)\, y y^\top + \lambda I_N\right] X$ and their corresponding eigenvectors $a_1, \dots, a_K$.
  Build the basis $A = [a_1, \dots, a_K]$.
  To transform a new sample $x$:
     Return $z = A^\top x$.
Algorithm 8 Extended Barshan Method with Linear Target Variable Kernel
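
A minimal sketch of Algorithms 7 and 8 with a linear target-variable kernel: lam = 0 recovers the plain Barshan objective (12), and the weighting in the extended objective (15) is our reconstruction.

import numpy as np

def barshan_fit(X, y, K, lam=0.0):
    """Top-K eigenvectors of X'[(1 - lam) yy' + lam I]X, as in (12) / (15)."""
    v = X.T @ y
    M = (1 - lam) * np.outer(v, v) + lam * (X.T @ X)
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1][:K]
    return eigvecs[:, order]               # project new samples via X_new @ A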

4.3 Least-squares PCA (LSPCA)

In the Barshan method, the supervised objective measures the sum of squared covariances between the responses and the transformed variables along each direction. An alternative is to measure the least-squares error of regressing the responses onto these variables. This approach, named least-squares PCA (LSPCA), was proposed by Ritchie et al. (2019). It learns a linear subspace by solving the following optimization problem:

$$\left(A_{\mathrm{LSPCA}}, \beta^*\right) = \arg\min_{A^\top A = I_K,\; \beta \in \mathbb{R}^K} \; \left\| y - X A \beta \right\|_2^2 + \lambda \left\| X - X A A^\top \right\|_F^2. \tag{16}$$

Unlike the Barshan method, this optimization problem does not have a known closed-form solution. One method to solve it, based on matrix manifold optimization, is proposed in Ritchie et al. (2019). Other optimization methods with orthogonality constraints have been discussed in Absil et al. (2009) and Wen & Yin (2013). We refer readers to Algorithm 1 in the original paper (Ritchie et al., 2019) for the details.

We have already noted the similarities between the objective functions of the Barshan method (14) and LSPCA (16). It can be shown that the two are indeed equivalent if $X^\top X = I_P$, that is, when the data have been standardized and decorrelated. In this scenario, the optimal $\beta$ for any given $A$ has the form:

$$\beta^* = \left(A^\top X^\top X A\right)^{-1} A^\top X^\top y = A^\top X^\top y. \tag{17}$$

By plugging $\beta^*$ back into the first term in (16), we have:

$$\left\| y - X A \beta^* \right\|_2^2 = \|y\|_2^2 - \operatorname{tr}\!\left(A^\top X^\top y\, y^\top X A\right). \tag{18}$$

Since the second term in (16) is constant when $X^\top X = I_P$ and $A^\top A = I_K$, minimizing (16) over $A$ amounts to maximizing the Barshan objective. Therefore, the Barshan method in Algorithm 8 and LSPCA learn the same subspace when $X^\top X = I_P$.
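
The equivalence argument in (17)-(18) can be checked numerically. The sketch below builds a small whitened dataset (so that X'X = I), evaluates the LSPCA objective (16) with the closed-form beta from (17), and confirms that it differs from the negated Barshan objective only by a constant; the whitening step, data sizes, and lam value are our choices, not the paper's.

import numpy as np

rng = np.random.default_rng(1)
N, P, K, lam = 200, 10, 3, 2.0
X = rng.normal(size=(N, P))
X, _ = np.linalg.qr(X)                       # force X'X = I (whitened/decorrelated data)
y = X @ rng.normal(size=P) + 0.1 * rng.normal(size=N)

def lspca_objective(A):
    beta = A.T @ X.T @ y                     # closed form (17) when X'X = I and A'A = I
    return np.sum((y - X @ A @ beta) ** 2) + lam * np.sum((X - X @ A @ A.T) ** 2)

def barshan_objective(A):
    return np.trace(A.T @ X.T @ np.outer(y, y) @ X @ A)

# For any orthonormal A, lspca_objective(A) + barshan_objective(A) is the same constant
# (equal to y'y + lam * (P - K)), so minimizing one is equivalent to maximizing the other.
for _ in range(3):
    A, _ = np.linalg.qr(rng.normal(size=(P, K)))
    print(lspca_objective(A) + barshan_objective(A))   # numerically identical values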

4.4 Supervised probabilistic PCA (SPPCA) method

The SPPCA method (Yu et al., 2006) is a model-based approach. It assumes that both the high-dimensional observed input variables and their responses are controlled by some low-dimensional latent variables and are conditionally independent given these latent variables. Specifically,

$$x = W_x \tilde{z} + \epsilon_x, \tag{19}$$
$$y = w_y^\top \tilde{z} + \epsilon_y, \tag{20}$$

where $\tilde{z} \in \mathbb{R}^K$ is the latent variable, $W_x \in \mathbb{R}^{P \times K}$ and $w_y \in \mathbb{R}^K$ are loading parameters, and $\tilde{z}$, $\epsilon_x$, $\epsilon_y$ are random variables following Gaussian distributions: $\tilde{z} \sim \mathcal{N}(0, I_K)$, $\epsilon_x \sim \mathcal{N}(0, \sigma_x^2 I_P)$, $\epsilon_y \sim \mathcal{N}(0, \sigma_y^2)$. This probabilistic model contains four parameters: $\Theta = (W_x, w_y, \sigma_x^2, \sigma_y^2)$.

Given observation pairs $\{(x_i, y_i)\}_{i=1}^N$, we can obtain the maximum likelihood estimates (MLEs) of the model parameters by solving:

$$\hat{\Theta} = \arg\max_{\Theta} \sum_{i=1}^N \log p(x_i, y_i \mid \Theta). \tag{21}$$

We refer readers to Algorithm 1 in the original paper (Yu et al., 2006) for an expectation-maximization (EM) algorithm to obtain the MLE.¹

¹ The term in equation (6) of the original paper should have been multiplied by the number of samples.

As the response is assumed to have a linear relationship with the latent variable, SPPCA returns a dimension-reduction scheme and a prediction model simultaneously. However, the basis obtained by this method is no longer orthogonal. Given a new sample $x$, the predicted value of the corresponding latent variable is

$$\hat{z} = \left( W_x^\top W_x + \sigma_x^2 I_K \right)^{-1} W_x^\top x, \tag{22}$$

and its predicted label can be calculated as $\hat{y} = w_y^\top \hat{z}$.

SPPCA maximizes the likelihood of the data-label pairs jointly, and its performance is often sensitive to the dimension of the data (Ritchie et al., 2019). High-dimensional data tend to overpower the information provided by the response, leading to results that are very similar to the classic PCA.
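
A short sketch of the SPPCA generative model (19)-(20) and the latent-variable prediction (22) follows, with made-up parameter values; the EM fitting step of Yu et al. (2006) is omitted, so this only illustrates the model and the prediction formula.

import numpy as np

rng = np.random.default_rng(2)
N, P, K = 500, 8, 2
Wx = rng.normal(size=(P, K))       # loading matrix for x, eq. (19)
wy = rng.normal(size=K)            # loading vector for y, eq. (20)
sx, sy = 0.3, 0.1                  # noise standard deviations (illustrative)

# Sample from the generative model: z ~ N(0, I), x = Wx z + eps_x, y = wy'z + eps_y
Z = rng.normal(size=(N, K))
X = Z @ Wx.T + sx * rng.normal(size=(N, P))
y = Z @ wy + sy * rng.normal(size=N)

# Posterior-mean prediction of the latent variable for a new sample, eq. (22),
# followed by the predicted label y_hat = wy' z_hat.
def sppca_predict(x, Wx, wy, sx):
    z_hat = np.linalg.solve(Wx.T @ Wx + sx**2 * np.eye(Wx.shape[1]), Wx.T @ x)
    return z_hat, wy @ z_hat

z_hat, y_hat = sppca_predict(X[0], Wx, wy, sx)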

5 Comparison of the methods using simulated data sets

We compared the performance of the reviewed algorithms using simulations. For the Barshan method and PLS, the experiments were based on the extended versions rather than the original ones. The simulated data were generated by i.i.d. sampling from a multivariate Gaussian distribution. We examined two training sample sizes, $N = 150$ and $N = 1500$; the smaller sample size revealed potential problems associated with the “curse of dimensionality”. We simulated additional samples for testing. We designed two scenarios for the distribution of the eigenvalues of the covariance matrix of $X$. As shown in Figure 1, the decay of the eigenvalues in the first case was exponential in nature and was much faster than that in the second case, which was close to linear.

Figure 1: The eigenvalues of X’s covariance matrix, (a) the case of fast decay and (b) the case of slow decay.

The relationship between the input variables and their responses satisfied a linear model:

$$y = X\beta + \epsilon, \tag{23}$$

where the coefficient vector $\beta = A_0 \gamma$ was confined to a 10-dimensional linear subspace spanned by the columns of $A_0$, with its corresponding lower-dimensional representation $\gamma$. We fixed $\gamma$ as a vector of all ones. The term $\epsilon$ was additive Gaussian noise, with its standard deviation set separately for the fast-decay and slow-decay cases.

We designed three cases to cover different alignments between the subspace containing the maximal data variation and the subspace where the coefficient $\beta$ lived. In the “well-aligned” case, we constructed $A_0$ such that its columns were the eigenvectors of the covariance matrix corresponding to the ten largest eigenvalues. In the “misaligned” case, the columns of $A_0$ were the 11th to the 20th eigenvectors. In the “partially-aligned” case, we included the 11th, 13th, 15th, 17th, and 19th eigenvectors in $A_0$, and also added another five randomly chosen vectors that were orthogonal to the five included eigenvectors. The randomly chosen vectors tended to capture some of the space spanned by the top eigenvectors.

Since the algorithms did not know the true dimension of the linear embedding, we fixed the dimension $K$ of the learned subspace in advance. The hyperparameter $\lambda$ in the extended Barshan method, extended PLS, and LSPCA was tuned on a validation set. After the dimension-reduction step, we built a regression model using the transformed input variables and the responses. We also included a baseline algorithm that ran ordinary least squares (OLS) directly on the original high-dimensional data. For each test case, we ran 100 trials. In each trial, the eigenvectors of the covariance matrix were randomly constructed, and the training and testing data were randomly drawn from the resulting distribution. The mean squared errors (MSEs) of each algorithm were calculated on the training and testing samples in each trial. We used the average MSEs over the 100 trials to compare the performance of the different models.
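
A sketch of this data-generating process for the well-aligned, fast-decay case is given below; the input dimension P, the test-set size, the exact decay rate, and the noise level are illustrative choices, since these constants are not fully specified here.

import numpy as np

rng = np.random.default_rng(3)
N_train, N_test, P, K_true = 150, 2000, 50, 10   # P and N_test are illustrative assumptions
noise_sd = 0.5                                   # assumed noise level

# Random orthonormal eigenvectors with fast (exponentially decaying) eigenvalues
Q, _ = np.linalg.qr(rng.normal(size=(P, P)))
eigvals = np.exp(-0.3 * np.arange(P))
cov = Q @ np.diag(eigvals) @ Q.T

# Well-aligned case: beta lives in the span of the top-10 eigenvectors, gamma = all ones
A0 = Q[:, :K_true]
beta = A0 @ np.ones(K_true)

def draw(n):
    X = rng.multivariate_normal(np.zeros(P), cov, size=n)
    return X, X @ beta + noise_sd * rng.normal(size=n)

X_train, y_train = draw(N_train)
X_test, y_test = draw(N_test)

# Center with the training means, since all reviewed methods assume centered inputs
x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
X_train, y_train = X_train - x_mean, y_train - y_mean
X_test, y_test = X_test - x_mean, y_test - y_mean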

5.1 Results of the case with fast-decaying eigenvalues of the covariance matrix

We first discuss the results of the case with fast-decaying eigenvalues, as shown in Figure 1.a. The average training and testing MSEs are summarized in Table 1. The top-performing algorithms under each testing setting are highlighted. When the number of training samples is limited, OLS severely overfits the data as it does not exploit the low-dimensional embedding. However, the classic PCA performs well when there is a good alignment between the subspace containing the maximal data variation and the subspace where the coefficient $\beta$ lives. The tuned values of $\lambda$ in the Barshan method, PLS, and LSPCA are large, so these algorithms behave essentially the same as the classic PCA. The PV and PCPS methods show some signs of overfitting.

When the subspaces are not well-aligned, the classic PCA uses an “unsatisfactory” subspace for the prediction task, so its performance is much worse. As discussed earlier, the results of SPPCA are very similar to those of PCA, since the information contained in the high-dimensional data themselves overpowers the information provided by the responses. The Bair method, the Barshan method, and PCPS all find better subspaces by exploiting the information in the responses. The best performance is achieved by PV, PLS, and LSPCA. Combining these with the results from the well-aligned setting, we see that the flexible tuning scheme allows LSPCA and PLS to consistently achieve the best performance.

5.2 Results of the case with slowly decaying eigenvalues of the covariance matrix

We repeated the simulation tests with new data whose covariance matrix has slowly decaying eigenvalues, as shown in Figure 1.b. The average training and testing MSEs are summarized in Table 2. Many of the findings are similar to those in the previous case, with the main difference being for PCA. When the eigenvalues of the covariance matrix decay slowly, PCA is unable to achieve good performance even when the subspace spanned by the top PCs is well-aligned with the subspace in which $\beta$ lives, especially when the number of training samples is small. A closer investigation shows that the limited number of training samples causes relatively large errors in the estimation of the eigenvectors. Additionally, the lack of significant differences among the top eigenvalues amplifies the impact of these estimation errors by causing PCA to choose the wrong subspace. Supervised dimension-reduction algorithms use the information in the responses to alleviate this problem. LSPCA and PLS continue to achieve top performance.

Well-Aligned Subspaces Misaligned Subspaces Partially-Aligned Subspaces
N 150 1500 150 1500 150 1500
OLS 0.082 / 0.763 0.233 / 0.267 0.081 / 0.781 0.234 / 0.267 0.081 / 0.771 0.234 / 0.268
PCA 0.226 / 0.285 0.248 / 0.253 0.875 / 1.154 0.990 / 1.019 0.538 / 0.700 0.583 / 0.595
Bair 0.222 / 0.287 0.248 / 0.253 0.548 / 0.741 0.552 / 0.567 0.404 / 0.544 0.464 / 0.476
PV 0.206 / 0.316 0.253 / 0.262 0.216 / 0.330 0.261 / 0.270 0.209 / 0.320 0.258 / 0.267
PCPS 0.204 / 0.370 0.246 / 0.255 0.247 / 0.356 0.253 / 0.259 0.263 / 0.376 0.286 / 0.292
Barshan 0.224 / 0.285 0.248 / 0.252 0.508 / 0.689 0.544 / 0.560 0.423 / 0.560 0.465 / 0.475
PLS 0.216 / 0.284 0.247 / 0.252 0.237 / 0.338 0.257 / 0.264 0.231 / 0.323 0.262 / 0.268
LSPCA 0.216 / 0.285 0.247 / 0.252 0.192 / 0.315 0.246 / 0.254 0.193 / 0.311 0.248 / 0.257
SPPCA 0.225 / 0.285 0.248 / 0.253 0.810 / 1.074 0.911 / 0.939 0.510 / 0.666 0.554 / 0.566
Table 1: The average training/testing MSEs from 100 simulation trials under different settings, for the case where the eigenvalues of the covariance matrix decay fast. The top three testing MSEs under each setting are in bold.
Well-Aligned Subspaces Misaligned Subspaces Partially-Aligned Subspaces
N 150 1500 150 1500 150 1500
OLS 2.05 / 19.79 5.81 / 6.70 2.05 / 19.23 5.83 / 6.69 2.04 / 19.28 5.84 / 6.68
PCA 33.60 / 52.77 22.41 / 24.41 38.72 / 57.36 45.19 / 49.57 37.04 / 52.11 43.04 / 46.46
Bair 17.40 / 32.83 11.59 / 12.50 18.93 / 34.97 14.38 / 15.73 18.53 / 34.59 19.54 / 21.40
PV 4.67 / 17.14 6.72 / 7.55 4.88 / 17.88 7.68 / 8.66 5.32 / 19.64 9.03 / 10.15
PCPS 18.51 / 32.58 12.90 / 13.85 20.11 / 34.74 16.75 / 17.95 19.83 / 34.18 23.71 / 25.35
Barshan 11.17 / 24.44 7.27 / 7.96 11.71 / 25.59 7.45 / 8.25 12.10 / 26.28 10.44 / 11.55
PLS 2.77 / 13.02 5.81 / 6.70 2.86 / 13.18 5.83 / 6.69 2.90 / 14.65 5.84 / 6.68
LSPCA 3.03 / 13.16 5.86 / 6.66 3.02 / 13.19 5.83 / 6.69 2.98 / 14.49 5.84 / 6.68
SPPCA 12.09 / 27.30 13.73 / 15.30 12.41 / 27.74 14.26 / 15.94 12.31 / 27.22 15.05 / 16.75
Table 2: The average training/testing MSEs from 100 simulation trials under different settings, for the case where the eigenvalues of the covariance matrix decay slowly. The top three testing MSEs under each setting are in bold.

5.3 The effect of $\lambda$ on LSPCA and on the extensions of Barshan and PLS

Three of the algorithms (LSPCA, extended Barshan, and extended PLS) use a hyperparameter $\lambda$ to balance between the supervised and the unsupervised objectives. We explore how the value of $\lambda$ affects this balance in the simulation tests. Figure 2 shows the results for the smaller training sample size ($N = 150$) when the eigenvalues of the covariance matrix decay slowly.

Figure 2 clearly shows the impact of tuning $\lambda$ in all three subspace-alignment cases for all algorithms. As the value of $\lambda$ increases, the performance of all three algorithms approaches that of the classic PCA, as expected. As $\lambda$ decreases, the supervised components play a bigger role. Different algorithms utilize the label information differently, and their performances converge to different levels. The performance of LSPCA becomes very close to that of OLS, which overfits because of the small number of training samples. Both extended Barshan and extended PLS achieve their best performance with the right balance between the unsupervised and supervised objectives. A similar balancing effect of $\lambda$ is also observed under the other simulation settings.

Figure 2: The effect of $\lambda$ on LSPCA, extended Barshan, and extended PLS, for $N = 150$ and slowly decaying eigenvalues of the covariance matrix: (a) the case of well-aligned subspaces, (b) the case of misaligned subspaces, and (c) the case of partially-aligned subspaces.

6 Comparison of the methods using two real data sets

We compared the performance of the reviewed algorithms using two real data sets from the UCI repository (Dua & Graff, 2017): wine-quality (Cortez et al., 2009) and Parkinsons-telemonitoring (Tsanas et al., 2009). The wine-quality dataset uses physicochemical features of a wine to predict its sensory quality. The Parkinsons-telemonitoring dataset uses biomedical voice measurements to predict clinicians' scores of Parkinson's disease symptoms on the UPDRS scale. As in Section 5, the extended versions of the Barshan method and PLS are used.

For the wine-quality dataset, we scaled each of the 11 features to between 0 and 1. We split the dataset into 3919 training samples and 979 testing samples. We gradually increased $K$, the dimension of the subspace to be learned, and observed the change in model performance. The eigenvalues of the covariance matrix, the training MSEs, and the testing MSEs are shown in Figure 3. SPPCA and PCA again have very similar performance, as we observed in the simulation tests. All other supervised methods outperform the classic PCA. LSPCA, PLS, and the PV method are able to find proper linear subspaces of very low dimension, demonstrating their effectiveness in incorporating the label information.

Figure 3: Test results on the wine-quality dataset: (a) the eigenvalues of the covariance matrix, (b) training MSE vs. subspace dimension, and (c) testing MSE vs. subspace dimension. Different curves correspond to different methods.
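
For readers who want a quick baseline on this dataset, the sketch below uses scikit-learn building blocks as stand-ins: PCA followed by linear regression (PCR) versus PLS regression, with features scaled to [0, 1]. The white-wine CSV URL and the random 80/20 split are assumptions and do not reproduce the paper's exact split; the other reviewed methods would slot into the same loop.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Assumed location of the white wine-quality CSV on the UCI repository
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-white.csv")
df = pd.read_csv(url, sep=";")
X, y = df.drop(columns="quality").values, df["quality"].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for K in (1, 2, 3, 5, 8, 11):
    pcr = make_pipeline(MinMaxScaler(), PCA(n_components=K), LinearRegression()).fit(X_tr, y_tr)
    pls = make_pipeline(MinMaxScaler(), PLSRegression(n_components=K)).fit(X_tr, y_tr)
    print(K,
          round(mean_squared_error(y_te, pcr.predict(X_te)), 3),
          round(mean_squared_error(y_te, pls.predict(X_te)), 3))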

For the Parkinsons-telemonitoring dataset, we scaled each of the 16 medical features to between 0 and 1. We split the dataset into 4700 training samples and 1175 testing samples. The test results are shown in Figure 4, and similar conclusions can be drawn. All supervised methods except SPPCA outperform the classic PCA. PLS, PCPS, and LSPCA are able to find proper linear subspaces of very low dimension.

Figure 4: Test results on the Parkinsons-telemonitoring dataset: (a) the eigenvalues of the covariance matrix, (b) training MSE vs. subspace dimension, and (c) testing MSE vs. subspace dimension. Different curves correspond to different methods.

7 Conclusion

We have reviewed several popular supervised linear dimension-reduction techniques for one-dimensional continuous responses. We also extended some of these methods to include both unsupervised and supervised components. We divided the techniques into two categories: the wrapper methods, which contain classical PCA as a subroutine, and the intrinsic methods. Our experimental results showed that the extended versions of PLS and LSPCA consistently outperform the other methods. Both are intrinsic methods with a balancing scheme between the supervised and unsupervised objectives.

Since many methods contain steps to evaluate the correlation between the input variables (original or transformed) and the responses, they can also be extended to work with binary/discrete labels. Other methods such as SPPCA, PLS, and LSPCA rely on an underlying linear relationship between the transformed variables and the responses. Extending them to binary/discrete labels requires further investigation.

Our review has focused only on linear dimension-reduction techniques. One way to extend them to non-linear models is to use the kernel trick. Moreover, all the reviewed methods can be extended to semi-supervised settings where we have both labeled and unlabeled data.

References

  • Absil et al. (2009) Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
  • Bair et al. (2006) Bair, E., Hastie, T., Paul, D., and Tibshirani, R. Prediction by supervised principal components. Journal of the American Statistical Association, 101(473):119–137, 2006.
  • Barshan et al. (2011) Barshan, E., Ghodsi, A., Azimifar, Z., and Jahromi, M. Z. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognition, 44(7):1357–1371, 2011.
  • Cortez et al. (2009) Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. Modeling wine preferences by data mining from physicochemical properties. Decision support systems, 47(4):547–553, 2009.
  • Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Geladi & Kowalski (1986) Geladi, P. and Kowalski, B. R. Partial least-squares regression: a tutorial. Analytica chimica acta, 185:1–17, 1986.
  • Hotelling (1933) Hotelling, H. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.
  • Pearson (1901) Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
  • Piironen & Vehtari (2018) Piironen, J. and Vehtari, A. Iterative supervised principal components. In International Conference on Artificial Intelligence and Statistics, pp. 106–114, 2018.
  • Ritchie et al. (2019) Ritchie, A., Scott, C., Balzano, L., Kessler, D., and Sripada, C. S. Supervised principal component analysis via manifold optimization. In 2019 IEEE Data Science Workshop (DSW), pp. 6–10. IEEE, 2019.
  • Tsanas et al. (2009) Tsanas, A., Little, M., McSharry, P., and Ramig, L. Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests. Nature Precedings, pp. 1–1, 2009.
  • Wen & Yin (2013) Wen, Z. and Yin, W. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397–434, 2013.
  • Wold (1975) Wold, H. Soft modelling by latent variables: the non-linear iterative partial least squares (NIPALS) approach. Journal of Applied Probability, 12(S1):117–142, 1975.
  • Yu et al. (2006) Yu, S., Yu, K., Tresp, V., Kriegel, H.-P., and Wu, M. Supervised probabilistic principal component analysis. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 464–473, 2006.