1 Introduction
Dimension reduction has been widely used in exploratory data analysis (EDA) and as a preprocessing step in highdimensional prediction problems. In EDA, dimension reduction is used to discover the underlying structure in a dataset: whether the data lie in a smaller subspace so that we can interpret them more easily. In prediction problems, the goal of dimension reduction is to reduce the highdimensional space of the predictors before fitting a regression model. The rationale is that conventional classification and regression methods often suffer from the “curse of dimensionality”, which states the number of data points required for learning grows exponentially with the dimensionality of the data. Dimension reduction is related to, but different from variable selection, but both share the goal of developing parsimonious and interpretable prediction models.
Principal component analysis (PCA) is among the earliest dimensionreduction techniques Pearson (1901); Hotelling (1933). It identifies the linear subspace that best approximates the original dataset in terms of retaining the maximum variation. The linear nature of PCA makes the results intrinsically interpretable, as the transformed variables are weighted combinations of the original ones and are orthogonal.
PCA has also been used in the context of predictive modeling, where it is called Principal Component Regression (PCR) since it uses PCA to first reduce the dimension and then do regression. However, since the information in the responses is not used during dimension reduction, there is no guarantee that PCR effectively identifies the “best” subspace. In practice, the subspace containing the maximal data variation can differ substantially from the subspace that is most relevant for the prediction problem.
To address this concern, multiple supervised dimensionreduction techniques have been proposed to take advantage of the information in responses Bair et al. (2006); Piironen & Vehtari (2018); Yu et al. (2006); Barshan et al. (2011); Geladi & Kowalski (1986); Ritchie et al. (2019). This paper provides a review of these selected approaches and proposes some minor extensions. We restrict our attention to continuous responses.
The rest of the paper is structured as follows: In Section 2, we review PCA and introduce the notation to be used throughout the paper. We divide supervised linear dimensionreduction techniques into two categories. First, Section 3 discusses the class of “wrapper methods” that includes methods built around the classical PCA and contain it as a subroutine. Section 4 discusses the second category named “intrinsic methods”. These techniques incorporate the label information directly into the objective function of the subspace learning problem.
We compare the performance of all the algorithms on synthetic and real data sets. The experimental results and analysis are provided in Section 5 and Section 6, using simulated data sets and real data sets respectively. Section 7 concludes the paper with comments on nonlinear and semisupervised dimension reduction techniques.
2 Preliminaries: PCA
Consider an unsupervised learning problem where the data are represented in matrix form . Here, each row of contains one of the data points, and point is a
dimensional vector of variables. Throughout, we assume the data have been columncentered, so that the mean over each column of
is zero. The goal in PCA is to find an appropriate linear subspace ofand the corresponding orthogonal basis, which can be done by computing the eigenvalue spectral decomposition of the matrix
. The algorithm is summarized in Algorithm 1. In practice, the PCA basis vectors can be calculated directly using singular value decomposition (SVD) on
.PCA has several attractive properties. First, the Kdimensional linear subspace spanned by the principle components (PCs), , captures the maximum variability in the data. Second, the transformed variable vectors are uncorrelated:
(1) 
where are the eigenvalues of . Last, the reconstructed data are also the best linear approximations of the original data measured by Euclidean norm. That is, is the maximizer of the following optimization problem:
(2) 
In the rest of the paper, we focus on the prediction problem where the continuous responses associated with are available. We use to represent the basis. We will add the algorithm name as a subscript when the context is not clear. In addition, we will use superscripts to denote variable selection or iteration. In the presence of a superscript, the inner product will be written as .
3 Wrapper methods
We first consider the wrapper methods that contain PCA as a subroutine and build around it to incorporate the label information. Since PCA learns the subspace in an unsupervised manner, the resulting subspace may not be the “best” in developing a regression model. Supervised linear dimensionreduction methods address this concern by incorporating the response information into the learning process. One way to utilize the information is to add preprocessing/postprocessing steps before/after performing PCA.
3.1 Bair method
Bair et al. Bair et al. (2006) proposed a method that adds a variable selection step before running PCA. We call this method Bair method in our paper. It selects a subset of the original variables that are the most related to the responses and does PCA on this reduced variable matrix. Define a score function that measures how closely the th variable is associated with the responses. One choice for the score function is the absolute value of the covariance:
(3) 
However, since the variables may have different variances, it may be better to consider the Pearson correlation coefficient:
(4) 
We can rank the “importance” of the variables by their scores using either one of the two score functions. We then select the top variables to form a new data matrix using the corresponding columns from . Finally, PCA is performed on the new variable matrix .
Bair method contains one hyperparameter
, which can be selected based on the performance of the subsequent prediction model measured on the training or validation set. The method is summarized in Algorithm 2. In this version, we directly use the performance on the training set to select .By eliminating the variables that contain little information about the responses, Bair method learns a more “informative” linear subspace. Variables with large variation tend to dominate the classic PCA, but they may not be relevant for prediction. Bair method addresses this problem and effectively controls the impact of these variables.
3.2 PiironenVehtari (PV) method
An iterative version of Bair method was proposed in Piironen & Vehtari (2018) by Piironen and Vehtari. Although it was named ISPCA in the original paper, we refer to it as PV method. It runs Bair algorithm iteratively and gets one PC each time. Under the th iteration, a subset of variables that are most related to the responses are chosen to form a new dataset. Denote the new dataset as which contains the selected columns from the full dataset matrix . Only the first PC is calculated from . As the transformed variables within each iteration are now onedimensional, can be chosen so that they are most correlated with the response.
Denote as the set of the selected variables. PV method then subtracts the variation explained by the chosen PC, , from the data by regressing each feature (both selected and unselected) on the onedimensional transformed data:
(5)  
(6) 
PV method is summarized in Algorithm 3. Unlike Bair method, it uses an iterative scheme to select a diverse set of variables in each iteration, all of which will contribute to the transformed data. The iterative scheme breaks one of the major properties of the classic PCA: the basis vectors obtained by this method are no longer orthogonal to each other. Nevertheless, the features in the transformed space are still uncorrelated with each other: Piironen & Vehtari (2018).
3.3 PC postselection (PCPS) method
The previous two methods use variable selection as a preprocessing step before applying the classic PCA. We also consider a postprocessing step after PCA to incorporate the response information. We refer to it as the PC postselection method (PCPS). We first conduct PCA with all variables to obtain the full set of PCs, and then calculates the transformed data along each PC: . One of the two score functions discussed earlier (3, 4) is then used to measure the relationship between the transformed dimensions and the responses: . The first K PCs corresponding to the highest scores are chosen to form the basis of the learned linear subspace. This technique is summarized in Algorithm 4.
4 Intrinsic methods
As noted in (2), PCA can be derived from an optimization problem. One can expand the formulation in (2) to incorporate the label information into the objective function, which is the underlying idea of several supervised linear dimension reduction algorithms. Since the objective function contains both supervised and unsupervised components, we call them “intrinsic methods”.
4.1 Partial Least Squares (PLS)
Partial least squares (PLS) has a long history. It was originally proposed in Wold (1975) and there have been many variations developed since then Geladi & Kowalski (1986). Although PLS has been used mainly with multivariate responses, our review focuses on its application to onedimensional responses.
PLS constructs the basis of a linear subspace iteratively. Within each iteration , a basis vector is calculated by maximizing the covariance between the responses and the transformed input variables:
(7) 
This optimization problem can be solved directly using eigenvalue decomposition, while the iterative solver in the original PLS can be viewed as the poweriteration method for computing the eigenvectors. After each iteration, both the variable matrix and the responses are updated by subtracting the parts explained along the direction of :
(8)  
(9) 
PLS is described in Algorithm 5.
We also consider an extended version of the PLS algorithm by combining the supervised objective in (7) and the unsupervised objective in PCA (2). The new objective function in each iteration is now:
(10) 
and an equivalent form is:
(11) 
The hyperparameter is used to balance between the supervised and the unsupervised objectives. When , the new model reduces to the original PLS method. When , the new model becomes the classic PCA. Similar to PCA, (11) can be solved by eigenvalue decomposition. Our extended version of PLS is summarized in Algorithm 6.
4.2 Barshan method
Barshan et. al. formulated their method Barshan et al. (2011) in a general manner using reproducing kernel Hilbert space. We call this method Barshan method in our paper. The “kernel trick” allows this method to achieve nonlinear dimension reduction. This level of generality is beyond the scope of this review paper. We focus on a specific version of Barshan method with a linear target variable (response) kernel. The objective function of this version has the following form:
(12) 
It can also be represented in an equivalent form:
(13) 
Equation (13) can be viewed as maximizing the sum of squares of the covariance between the response and each dimension of the transformed variable . The method is described in Algorithm 7.
Notice that the formulation in (13) becomes (7) when contains only one column. Therefore, PLS can be viewed as running Barshan method with a linear target variable kernel within each iteration Barshan et al. (2011). The basis learned by Algorithm 7 and 5 share the same first basis vector. However, the rest of the vectors are different and they span different subspaces. Both bases are orthogonal.
Similar to how we extended PLS to include an unsupervised objective, we can also modify Barshan method to include an unsupervised component. The new objective function is now:
(14) 
with its equivalent form:
(15) 
Our extended version of Barshan method is summarized in Algorithm 8.
4.3 Leastsquares PCA (LSPCA)
In Barshan method, the supervised objective measures the sum of the covariance between the responses and the transformed variables along each direction. An alternative is to measure the least square error of regressing the responses onto these variables. This approach, named leastsquares PCA (LSPCA) was proposed by Ritchie et al. (2019). It learns a linear subspace by solving the following optimization problem:
(16) 
Unlike Barshan method, this optimization problem does not have a known closedform solution. One method to solve it, based on matrix manifold optimization, is proposed in Ritchie et al. (2019). Other optimization methods with orthogonal constraints have been discussed in Absil et al. (2009); Wen & Yin (2013). We refer readers to Algorithm 1 in the original paper Ritchie et al. (2019) for the details.
We have already noted the similarities between the objective functions of Barshan method (14) and LSPCA (16). It can be shown that the two are indeed equivalent if , that is, when the data have been standardized and decorellated. In this scenario, the optimal for any given has the form:
(17) 
By plugging back into the first term in (16), we have:
(18) 
Therefore, Barshan method in Algorithm 8 and LSPCA learn the same subspace when .
4.4 Supervised probabilistic PCA (SPPCA) method
The SPPCA method Yu et al. (2006) is a modelbased approach. It assumes that both the highdimensional observed input variables and their responses are controlled by some low dimensional latent variables and are conditionally independent given these latent variables. Specifically,
(19)  
(20) 
where , , , ,
are random variables following Gaussian distributions:
, , . This probabilistic model contains four parameters: = .Given observation pairs
, we can obtain the maximum likelihood estimates (MLEs) of the model parameters by solving:
(21) 
We refer readers to Algorithm 1 in the original paper Yu et al. (2006)
for an expectationmaximization (EM) algorithm to obtain the MLE.
^{1}^{1}1The term in equation (6) in the original paper should have been multiplied by the number of samples.As the response is assumed to have a linear relationship with the latent variable, SPPCA returns a dimension reduction scheme and a prediction model simultaneously. However, the basis obtained by this method is no longer orthogonal. Given a new sample , the predicted value of the corresponding latent variable is
(22) 
and its predicted label can be calculated as .
SPPCA maximizes the likelihood of the datalabel pair jointly and its performance is often sensitive to the dimension of the data Ritchie et al. (2019)
. High dimensional data tend to overpower the information provided by the response, leading to results that are very similar to the classic PCA.
5 Comparison of the methods using simulated data sets
We compared the performance of the reviewed algorithms using simulations. For Barshan and PLS, the experiments were based on the extended versions rather than the original ones. The simulated data were generated by i.i.d. sampling from multivariate Gaussian. We examined two different sample sizes: . The first case with and revealed potential problems when there was “curseofdimensionality”. We simulated samples for testing. We designed different scenarios for the distribution of the the eigenvalues of . As shown in Figure 1, the decay of the eigenvalues in the first case was exponential in nature and was much faster than that in the second case which was close to linear.
The relationship between the input variables and their responses satisfied a linear model:
(23) 
where the coefficient vector was confined in a 10dimensional linear subspace spanned by the columns of with its corresponding lowerdimensional representation as . We fixed as a vector of all s. The term
was an additive Gaussian noise with standard deviation of
when the eigenvalues of decayed fast and was when the decay was slow.We designed three cases to cover different alignments between the subspace containing the maximal data variation and the subspace where the coefficient lived. In the “wellaligned” case, we constructed such that its columns were the eigenvectors of corresponding to the top ten largest eigenvalues. In the “misaligned” case, the columns of were the 11th to the 20th eigenvectors. In the “partiallyaligned” case, we included the 11th, 13th, 15th, 17th, and 19th eigenvectors in , and also added another 5 randomly chosen vectors that were orthogonal to the 5 included eigenvectors. The randomly chosen vectors tended to capture some of the space spanned by top vectors.
Since the algorithms did not know the true dimension of the linear embedding, we chose to learn a dimensional subspace. The
hyperparameter in Barshan method, PLS, and LSPCA was tuned on a validation set. After the dimensionreduction step, we built a regression model using the transformed input variables and responses. We also included a baseline algorithm which ran ordinary least square (OLS) directly on the original highdimensional data. For each test case, we ran 100 trials. In each trial, the eigenvectors of the covariance matrix
were randomly constructed, and the training and testing data were randomly drawn from the distribution. The mean squared errors (MSEs) of each algorithm were calculated on the training and testing samples in each trial. We used the average MSEs over the 100 trials to compare the performance of different models.5.1 Results of the case with fast decaying eigenvalues in
We first discuss the results of the case with fast decaying eigenvalues as shown in Figure 1.a. The average training and testing MSEs are summarized in Table 1. The top performing algorithms under each testing setting are highlighted. When the number of training samples is limited, OLS severely overfits the data as it does not exploit the lowdimensional embedding. However, the classic PCA performs well when there is a good alignment between the subspace containing the maximal data variation and the subspace where the coefficient lives. The tuned s in Barshan method, PLS, and LSPCA have large values, so the algorithms behave the same as the classic PCA. PV and PCPS methods show some sign of overfitting.
When the subspaces are not wellaligned, the classic PCA uses a “unsatisfactory” subspace for the prediction task, so the performance is much worse. As discussed earlier, the results of SPPCA are very similar to the PCA, since the information contained in the highdimensional data themselves has overpowered the information provided by the responses. Bair method, Barshan method, and PCPS all find better subspace by exploiting the information in the responses. The best performance is achieved by PV, PLS, and LSPCA. Combining with the results from the first simulation test, we see that the flexible tuning scheme allows LSPCA and PLS to consistently achieve the best performance.
5.2 Results of the case with slowly decaying eigenvalues in
We repeated the simulation tests with new data where the covariance matrix has slowly decaying eigenvalues as shown in Figure 1.b. The average training and testing MSEs are summarized in Table 2. Many of the findings are similar to those in the previous case, with the main difference being for PCA. When the eigenvalues of decay slowly, PCA is unable to achieve good performance even when the subspace spanned by the top PCs is wellaligned with the subspace in which lives, especially when the number of training samples is small. A closer investigation shows that the limited number of training samples causes relatively large error in the estimation of the eigenvectors. Additionally, the lack of significant differences among the top eigenvalues manifests the impact of these estimation errors by causing PCA to choose the wrong subspace. Supervised dimensionreduction algorithms use the information in the responses to alleviate this problem. LSPCA and PLS continue to achieve top performance.
WellAligned Subspaces  Misaligned Subspaces  PartiallyAligned Subspaces  

150  1500  150  1500  150  1500  
OLS  0.082 / 0.763  0.233 / 0.267  0.081 / 0.781  0.234 / 0.267  0.081 / 0.771  0.234 / 0.268 
PCA  0.226 / 0.285  0.248 / 0.253  0.875 / 1.154  0.990 / 1.019  0.538 / 0.700  0.583 / 0.595 
Bair  0.222 / 0.287  0.248 / 0.253  0.548 / 0.741  0.552 / 0.567  0.404 / 0.544  0.464 / 0.476 
PV  0.206 / 0.316  0.253 / 0.262  0.216 / 0.330  0.261 / 0.270  0.209 / 0.320  0.258 / 0.267 
PCPS  0.204 / 0.370  0.246 / 0.255  0.247 / 0.356  0.253 / 0.259  0.263 / 0.376  0.286 / 0.292 
Barshan  0.224 / 0.285  0.248 / 0.252  0.508 / 0.689  0.544 / 0.560  0.423 / 0.560  0.465 / 0.475 
PLS  0.216 / 0.284  0.247 / 0.252  0.237 / 0.338  0.257 / 0.264  0.231 / 0.323  0.262 / 0.268 
LSPCA  0.216 / 0.285  0.247 / 0.252  0.192 / 0.315  0.246 / 0.254  0.193 / 0.311  0.248 / 0.257 
SPPCA  0.225 / 0.285  0.248 / 0.253  0.810 / 1.074  0.911 / 0.939  0.510 / 0.666  0.554 / 0.566 
WellAligned Subspaces  Misaligned Subspaces  PartiallyAligned Subspaces  

150  1500  150  1500  150  1500  
OLS  2.05 / 19.79  5.81 / 6.70  2.05 / 19.23  5.83 / 6.69  2.04 / 19.28  5.84 / 6.68 
PCA  33.60 / 52.77  22.41 / 24.41  38.72 / 57.36  45.19 / 49.57  37.04 / 52.11  43.04 / 46.46 
Bair  17.40 / 32.83  11.59 / 12.50  18.93 / 34.97  14.38 / 15.73  18.53 / 34.59  19.54 / 21.40 
PV  4.67 / 17.14  6.72 / 7.55  4.88 / 17.88  7.68 / 8.66  5.32 / 19.64  9.03 / 10.15 
PCPS  18.51 / 32.58  12.90 / 13.85  20.11 / 34.74  16.75 / 17.95  19.83 / 34.18  23.71 / 25.35 
Barshan  11.17 / 24.44  7.27 / 7.96  11.71 / 25.59  7.45 / 8.25  12.10 / 26.28  10.44 / 11.55 
PLS  2.77 / 13.02  5.81 / 6.70  2.86 / 13.18  5.83 / 6.69  2.90 / 14.65  5.84 / 6.68 
LSPCA  3.03 / 13.16  5.86 / 6.66  3.02 / 13.19  5.83 / 6.69  2.98 / 14.49  5.84 / 6.68 
SPPCA  12.09 / 27.30  13.73 / 15.30  12.41 / 27.74  14.26 / 15.94  12.31 / 27.22  15.05 / 16.75 
5.3 The effect of on LSPCA and on the extensions of Barshan and PLS
Three of the algorithms (LSPCA, extended Barshan, and extended PLS) use a hyperparameter to balance between the supervised and the unsupervised objectives. We explore how the value of affects this balance in the simulation tests. Figure 2 shows the results when and the eigenvalues of decay slowly.
Figure 2 clearly shows the impact of tuning in all three subspace alignment cases for all algorithms. As value increases, the performance of all three algorithms approaches the classic PCA’s as expected. As decreases, the supervised components play a bigger role. Different algorithms utilize the label information differently and their performance converge to different levels. The performance of LSPCA becomes very close to the performance of OLS. The small number of training samples causes OLS to overfit. Both extended Barshan and extended PLS achieve the best performance with the right balance between the unsupervised and supervised objectives. Similar balancing effect of is also observed under other simulation test settings.
6 Comparison of the methods using two real data sets
We compared the performance of the reviewed algorithms using two real data sets on the UCI repository Dua & Graff (2017): winequality Cortez et al. (2009) and Parkinsonstelemonitoring Tsanas et al. (2009). The winequality dataset uses physicochemical features of the wine to predict its sensory quality. The Parkinsonstelemonitoring uses biomedical voice measurements to predict the clinicians’ scores of Parkinson’s disease symptom on the UPDRS scale. As in Section 5, for Barshan and PLS, the extended version is used.
For the winequality dataset, we scaled each of the 11 features to between 0 and 1. We split the dataset to 3919 training samples and 979 testing samples. We gradually increase , the dimension of the subspace to be learned, and observe the change of the model performance. The eigenvalues of the covariance matrix , the training MSEs, and the testing MSEs are shown in Figure 3. SPPCA and PCA again have very similar performance as we observed in the simulation tests. All other supervised methods outperform the classic PCA. LSPCA, PLS and PV method are able to find proper linear subspaces of very low dimension, demonstrating their effectiveness of incorporating the label information.
For the Parkinsonstelemonitoring dataset, we scaled each of the 16 medical features to between 0 and 1. We split the dataset to 4700 training samples and 1175 testing samples. The test results are shown in Figure 4 and similar conclusions can be drawn. All supervised methods except SPPCA outperform the classic PCA. PLS, PCPS, and LSPCA are able to find proper linear subspaces of very low dimension.
7 Conclusion
We have reviewed several popular supervised linear dimensionreduction techniques for onedimensional continuous response. We also extended some of these methods to include both the unsupervised and supervised components. We divided the techniques into two categories: the wrapper methods contain and the intrinsic methods. Our experimental results showed that the extended versions of PLS and LSPCA consistently outperform the other methods. Both are intrinsic methods with a balancing scheme between the supervised and unsupervised objectives.
Since many methods contain steps to evaluate the correlation between the input variables (original or transformed) and the responses, they can also be extended to work with binary/discrete labels. Other methods such as SPPCA, PLS, and LSPCA rely on an underlying linear relationship between the transformed variables and the responses. Extending them to binary/discrete labels requires further investigation.
Our review has focused only on linear dimensionreduction techniques. One way to extend them to nonlinear models is to use the kernel trick. Moreover, all the reviewed methods can be extended to semisupervised settings where we have both labeled and unlabled data.
References
 Absil et al. (2009) Absil, P.A., Mahony, R., and Sepulchre, R. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
 Bair et al. (2006) Bair, E., Hastie, T., Paul, D., and Tibshirani, R. Prediction by supervised principal components. Journal of the American Statistical Association, 101(473):119–137, 2006.
 Barshan et al. (2011) Barshan, E., Ghodsi, A., Azimifar, Z., and Jahromi, M. Z. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognition, 44(7):1357–1371, 2011.
 Cortez et al. (2009) Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. Modeling wine preferences by data mining from physicochemical properties. Decision support systems, 47(4):547–553, 2009.

Dua & Graff (2017)
Dua, D. and Graff, C.
UCI machine learning repository, 2017.
URL http://archive.ics.uci.edu/ml.  Geladi & Kowalski (1986) Geladi, P. and Kowalski, B. R. Partial leastsquares regression: a tutorial. Analytica chimica acta, 185:1–17, 1986.
 Hotelling (1933) Hotelling, H. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.
 Pearson (1901) Pearson, K. Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.

Piironen & Vehtari (2018)
Piironen, J. and Vehtari, A.
Iterative supervised principal components.
In
International Conference on Artificial Intelligence and Statistics
, pp. 106–114, 2018. 
Ritchie et al. (2019)
Ritchie, A., Scott, C., Balzano, L., Kessler, D., and Sripada, C. S.
Supervised principal component analysis via manifold optimization.
In
2019 IEEE Data Science Workshop (DSW)
, pp. 6–10. IEEE, 2019.  Tsanas et al. (2009) Tsanas, A., Little, M., McSharry, P., and Ramig, L. Accurate telemonitoring of parkinson’s disease progression by noninvasive speech tests. Nature Precedings, pp. 1–1, 2009.
 Wen & Yin (2013) Wen, Z. and Yin, W. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(12):397–434, 2013.

Wold (1975)
Wold, H.
Soft modelling by latent variables: the nonlinear iterative partial
least squares (nipals) approach.
Journal of Applied Probability
, 12(S1):117–142, 1975.  Yu et al. (2006) Yu, S., Yu, K., Tresp, V., Kriegel, H.P., and Wu, M. Supervised probabilistic principal component analysis. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 464–473, 2006.
Comments
There are no comments yet.