Principal component analysis (PCA) (Jolliffe, 2002) is a fundamental statistical tool for dimensionality reduction, data processing, and visualization of multivariate data, with various applications in biology, engineering, and social science. In regression analysis, it can be useful to replace many original explanatory variables with a few principal components, which is called the principal component regression (PCR) (Massy, 1965; Jolliffe, 1982). PCR is widely used in various fields of research and many extensions of PCR have been proposed (see, e.g., Hartnett et al., 1998; Rosital et al.
, 2001; Reiss and Ogden, 2007; Wang and Abbott, 2008). Whereas PCR is a useful tool for analyzing multivariate data, this method may not have enough prediction accuracy if the response variable depends on the principal components with small eigenvalues. The problem arises from the two-stage procedure for PCR; a few principal components are selected with large eigenvalues, but without any relation to response variable, and then the regression model is constructed using them as new explanatory variables.
In this paper, we deal with PCA and regression analysis simultaneously, and propose a one-stage procedure for PCR to address this problem. The procedure combines two loss functions; one is the ordinary regression analysis loss and the other is PCA loss with some devices proposed by Zouet al.
(2006). In addition, in order to easily interpret estimated principal component loadings and select the number of principal components automatically, we impose thetype regularization on the parameters. This one-stage procedure is called the sparse principal component regression (SPCR) in this paper. SPCR gives sparse principal component loadings that are related to the response variable and selects the number of principal components simultaneously. We also establish a monotonically decreasing estimation procedure for the loss function using the coordinate descent algorithm (Friedman et al., 2010), because SPCR can be obtained via the convex optimization problem for each of parameters.
The partial least squares regression (PLS) (Wold, 1975; Frank and Friedman, 1993) is a dimension reduction technique, which incorporates information between the explanatory variables and the response variable. Recently, Chun and Keleş (2010) have proposed the sparse partial least squares regression (SPLS) that imposes sparsity in the dimension reduction step of PLS, and then constructed a regression model regarding some SPLS components as new explanatory variables, although it is a two-stage procedure. Besides PLS and SPLS, several methods have been proposed for performing dimension reduction and regression analysis simultaneously. Bair et al. (2006) proposed the supervised principal component analysis, which is regression analysis in which the explanatory variables are related to the response variable with respect to correlation. Yu et al. (2006) presented the supervised probabilistic principal component analysis from the Bayesian viewpoint. By imposing the type regularization into the objective function, Allen et al. (2013) and Chen and Huang (2012) introduced the regularized partial least squares and the sparse reduced-rank regression, respectively. However, none of them integrated the two loss functions for ordinary regression analysis and PCA along with the type regularization.
This paper is organized as follows. In Section 2, we review PCA and the sparse principal component analysis (SPCA) by Zou et al. (2006). We propose SPCR and discuss alternative methods to SPCR in Section 3. Section 4 provides an efficient algorithm for SPCR and a method for selecting tuning parameters in SPCR. Monte Carlo simulations and real data analyses are provided in Section 5. Concluding remarks are given in Section 6. The R language software package spcr, which implements SPCR, is available on the Comprehensive R Archive Network (http://cran.r-project.org). Supplementary materials can be found in https://sites.google.com/site/shuichikawanoen/research/suppl_spcr.pdf.
2.1 Principal component analysis
Let be an data matrix, where and denote the sample size and the number of variables, respectively. Without loss of generality, we assume that the column means of the matrix are all zero.
PCA is usually implemented by using the singular value decomposition (SVD) of. When the SVD of is represented by
the principal components are and the corresponding loadings of the principal components are the columns of . Here, is an orthogonal matrix, is a orthogonal matrix, and is an matrix given by
where , , and is the
matrix with all zero elements. Note that the vectorsare also the principal components, since .
The loading matrix can be obtained by solving the following least squares problem (see, e.g., Hastie et al., 2009);
where is a loading matrix, denotes the number of principal components, and is the identity matrix. The solution is given by
where and is a arbitrary orthogonal matrix.
2.2 Sparse principal component analysis
Zou et al. (2006) proposed an alternative least squares problem given by
where is a matrix and is a regularization parameter. The minimizer of is given by
where , is a positive constant, and is an arbitrary orthogonal matrix. The case yields the same solution as (1). Formula (2) is a quadratic programming problem with respect to each parameter matrix and , but Formula (1) is not.
In addition, Zou et al. (2006) proposed to add a sparse regularization term for to easily interpret the estimate , which is called SPCA;
where ’s are regularization parameters with positive value and is the norm of . Note that the minimization problem (4) is also the quadratic programming problem with respect to each parameter matrix and . After simple calculation, the problem (4) becomes
This optimization problem is analogous to the elastic net problem in Zou and Hastie (2005), and hence Zou et al. (2006) proposed an alternating algorithm to estimate and iteratively. In particular, the LARS algorithm (Efron et al., 2004) is employed to obtain the estimate of numerically.
Another approach to obtain a sparse loading matrix is SCoTLASS (Jolliffe et al., 2003). However, Zou et al. (2006) pointed out that the loadings obtained by SCoTLASS are not sparse enough. Also, Lee et al. (2010) and Lee and Huang (2013) developed SPCA for binary data.
3 Sparse principal component regression
3.1 Sparse principal component regression with adaptive loading
Suppose that we have data for response variables in addition to data . We consider regression analysis in the situation that the response variable is explained by variables aggregated by PCA of . A naive approach is to construct a regression model with a few principal components corresponding to large eigenvalues, which are previously constructed. This approach is called PCR. In general, principal components are irrelevant with the response variables. Therefore, PCR might fail to predict the response if the response is associated with principal components corresponding to small eigenvalues.
To overcome this drawback, we propose SPCR using the principal components as follows:
where is an intercept, is a coefficient vector, and are regularization parameters with positive value, and and are tuning parameters whose values are between zero and one.
The first term in Formula (5) means the least squares loss between the response and the principal components . The second term induces PCA loss of data . The tuning parameter controls the trade-off between the first and second terms, and then the value of can be determined by users for any purpose. For example, a smaller value for is used when we aim to obtain better prediction accuracies, while a larger value for is used when we aim to obtain the exact formulation of the principal component loadings. The third and fifth terms encourage sparsity on and , respectively. The sparsity on enables us to easily interpret the loadings of the principal components. Meanwhile, the sparsity on induces automatic selection of the number of principal components. The tuning parameter controls the trade-off between the and norms for the parameter , which was introduced in Zou and Hastie (2005). For detailed roles of this parameter and the norm, see Zou and Hastie (2005).
We see that (5) is a quadratic programming problem with respect to each parameter, because the problem only combines a regression loss with PCA loss. The optimization problem appears to be simple. However, it is not easy to numerically obtain the estimates of the parameters if we do not introduce the regularization terms for and , because there exists an identification problem for and . For an arbitrary orthogonal matrix , we have
where and . This causes non-unique estimators for and . However, we incorporate the -penalties on (5) and then we can expect to obtain the minimizer, because the parameter exists on a hypersphere due to orthogonal invariance in (3) and the -penalty implies a hypersquare region. For more details, see, e.g., Tibshirani (1996), Jennrich (2006), Choi et al. (2011), and Hirose and Yamamoto (2014). The -penalties on and play two types of roles on sparsity and identification problem.
3.2 Adaptive sparse principal component regression
In the numerical study in Sect. 5, we observe that SPCR does not produce enough sparse solution for the loading matrix . We, therefore, assign different weights to different parameters in the loading matrix . This idea was adopted in the adaptive lasso (Zou, 2006). Let us consider the weighted sparse principal component regression, given by
where is an incorporated weight for the parameter . We call this procedure the adaptive sparse principal component regression (aSPCR). In this paper, we define the weight as , where is an estimate of the parameter obtained from SPCR. In the adaptive lasso, the weight is constructed using the least squares estimators, but it is not applicable due to the identification problem, as described in Sect. 3.1.
Since aSPCR is a quadratic programming problem with respect to each parameter, we can estimate the parameters according to an efficient estimation algorithm for SPCR. In addition, aSPCR enjoys properties similar to SPCR as described in Sect. 3.1.
3.3 Related work
PLS (see, e.g., Wold, 1975; Frank and Friedman, 1993) seeks directions that relate to and capture the most variable directions in the -space, which is, in general, formulated by
for , where and is the covariance matrix of . The solutions in the problem (6) are derived from NIPALS (Wold, 1975) or SIMPLS (de Jong, 1993).
To incorporate sparsity into PLS, SPLS was introduced by Chun and Keleş (2010). The first SPLS direction vector is obtained by
where , and are tuning parameters with positive value. Note that the problem (7) becomes the original maximum eigenvalue problem of PLS when , , and . This SPLS problem is solved by alternately estimating the parameters and . The idea is similar to that used in SPCA. Chun and Keleş (2010) furthermore introduced the SPLS-NIPALS and SPLS-SIMPLS algorithm for deriving the rest of the direction vectors, and then predicted the response variable by a linear model with SPLS loading vectors as new explanatory variables; it is a two-stage procedure.
To emphasize a difference between our proposed method and the related work described above, we consider an example as follows. Suppose that
This model has another expression in the form
The covariance structures are given by
Let us select the explanatory variable that maximizes the covariance:
Consider the case . It follows that and . In this case, it is clear that the first variable has a larger effect in than the second variable . Remember that PLS and SPLS are based on the maximization of covariance, so that they will firstly select the variable on the second maximization, whereas they will firstly select the variable on the first maximization. Therefore, on the first maximization, PLS and SPLS fail to select the explanatory variable largely associated with the response. Meanwhile, SPCR will select the first variable on both maximizations, because the prediction error remains unchanged after normalization.
4.1 Computational algorithm
For estimating the parameter , we utilize the same algorithm given by Zou et al. (2006). The parameters and are estimated by the coordinate descent algorithm (Friedman et al., 2010), because the optimization problems include the regularization terms, respectively.
The optimization problem in aSPCR is rewritten as follows:
where is the -th element of the vector . SPCR is a special case of aSPCR with . The detailed algorithm is given as follows.
- given , and :
The coordinate-wise update for has the form:
and is the soft-threshholding operator with value
- given , and :
The update expression for is given by
- given , and :
The estimate of is obtained by
- given , and :
The estimate of is derived from
These procedures are iterated until convergence.
4.2 More efficient algorithm
To speed up our algorithm, we apply the covariance updates, which was proposed by Friedman et al. (2010), into the parameter updates.
We can rewrite the update of the parameter in (8) in the form
where is the current estimate of , and . After simple calculation, the first term on the right-hand side (up to ) becomes
and the second term on the right-hand side (up to ) is
These formulas largely reduces computational task, because we update only the last term on (10) and (11) when the estimate of is non-zero, while we do not update (10) and (11) when the estimate of is zero.
4.3 Selection of tuning parameters
SPCR and aSPCR depend on four tuning parameters . To avoid this hard computational task, we fix the values of and , and then optimize only two tuning parameters and .
The tuning parameter plays a role in prediction accuracy. While a smaller value for provides good prediction, the estimated models often tend to be unstable due to the flexibility of . We tried many simulations with several values for , and then we concluded to set in this study. The tuning parameter takes the role in the trade-off between the and penalties on . The value of in elastic net is usually determined by users (Zou and Hastie, 2005). In our simulation studies we fixed as 0.01.
The tuning parameters and are optimized using -fold cross-validation. When the original dataset is divided into datasets , the CV criterion is given by
where is a vector of which the elements are all one, and are the estimates computed with the data removing the -th part. We employed in our simulation. The tuning parameters and were, respectively, selected from 10 equally-spaced values on , where and were determined according to the function glmnet in R.
5 Numerical study
5.1 Monte Carlo simulations
Monte Carlo simulations were conducted to investigate the performances of our proposed method. Three models were examined in this study.
In the first model, we considered the 10-dimensional covariate vector, and generated the response
from the linear regression model given by
We used (Case 1(a)), where is the identity matrix, and (Case 1(b)). Case 1(a) is a simple situation. Case 1(b) corresponds to the situation discussed in Sect. 3.3.
In the second model, we considered the 20-dimensional covariate vector according to a multivariate normal distribution , and generated the response by
We used and , where and is a sparse approximation of the fourth eigenvector of
is a sparse approximation of the fourth eigenvector of(Case 2). This case deals with the situation where the response is associated with the principal component loading with small eigenvalue. Note that even if each explanatory variable is normalized, the principal component does not have unit variance in general.
In the third model, we assumed the 30-dimensional covariate vector according to a multivariate normal distribution , and generated the response by
We used with , and . Two cases were considered for , where the first nine and last 15 values are zero. First, we used that is an approximation of the first eigenvector of (Case 3(a)). Second, we used that is a sparse approximation of the third eigenvector of (Case 3(b)). Case 3 is a more complex situation.
The sample size was set to
. The standard errorwas set to 0.1 or 1. Our proposed methods, SPCR and aSPCR, were fitted to the simulated data with one or 10 components for Case 1, one or five components for Case 2, and 10 components for Case 3. Our proposed methods were compared with SPLS, PLS, and PCR. SPLS was computed by the package spls in R, and PLS and PCR by the package pls in R. The number of components and the values of tuning parameters in SPLS, PLS, and PCR were selected by 10-fold cross-validation. The performance was evaluated by . The simulation was conducted 100 times and the MSE was estimated by 1,000 random samples.