Consider the typical multiple linear regression setting where the available data consist of i.i.d. copies of the continuous response
and the predictor vectorWithout loss of generality (WLOG), we assume ’s are centered and ’s are standardized throughout the article. Thus the intercept term is presumed to be 0 in linear models, for which the general form is given by with and For the sake of convenience, we sometimes omit the subscript When the design matrix is of full column rank , as well as its corresponding predicted value at a new observation , enjoys many attractive properties.
However, OLS becomes problematic when is rank-deficient, in which case the Gram matrix is singular. This may happen either because of multicollinearity when the predictors are highly correlated or because of high dimensionality when A wealth of proposals have been made to combat the problem. Besides others, we are particularly concerned with a group of techniques that include ridge regression (RR; Hoerl and Kennard, 1970), principal components regression (PCR; Massy, 1965), partial least squares regression (PLSR; Wold, 1966 & 1978), and continuum regression (CR; Stone and Brooks, 1990). One common feature of these approaches lies in the fact that they first extract orthogonal or uncorrelated components that are linear combinations of and then regress the response directly on the orthogonal components. The number of orthogonal components doesn’t exceed and , hence reducing the dimensionality. This is the key how these types of methods approach high-dimensional or multicollinear data.
In this article, we first introduce a general framework, termed weighted orthogonal components regression (WOCR), which puts the aforementioned methods into a unified class. Compared to the original predictors in , there is a natural ordering in the orthogonal components. This information allows us to parameterize the weight function in WOCR with low-dimensional parameters, which are essentially the tuning parameters, and estimate the tuning parameters via optimization. The WOCR formulation also facilitates convenient comparison of the available methods and suggests their new natural variants by introducing more intuitive weight functions.
We shall restrict our attention to PCR and RR models. The remainder of the article is organized as follows. In Section 2, we introduce the general framework of WOCR. Section 3 exemplifies the applications of WOCR with RR and PCR. More specifically, we demonstrate how WOCR formulation can be used to estimate the tuning parameter in RR and select the number of principal components in PCR, and then introduce their better variants on the basis of WOCR. Section 4 presents numerical results from simulated studies that are designed to illustrate and assess WOCR and make comparisons with others. We also provide real data illustrations in Section 5. Section 6 concludes with a brief discussion, including the implication of WOCR on PLSR and CR models.
2 Weighted Orthogonal Components Regression (WOCR)
Denote so that Let be the orthogonal components extracted in some principled way, satisfying that if and 1 otherwise. Here forms an orthonormal basis of the column space of , Since , suppose for The condition implies that , i.e., vectors and are orthogonal, which implies that and are orthogonal if, furthermore, or
is an eigenvector of
associated with a non-zero eigenvalue. In matrix form, let, and . We have with but it is not necessarily true that The construction of matrix may (e.g., in RR and PCR) or may not (e.g., in PLSR and CR) depend on the response again, our discussion will be restricted to the former scenario. It is worth noting that extracting components reduces the original problem into an (with problem, hence making automatic dimension reduction.
2.1 Model Specification
The general form of a WOCR model can be conveniently expressed in terms of the fitted vector
where is the regression coefficient and is the weight for the -th orthogonal component . In matrix form, (1) becomes
We will see that RR, PCR, and many others are all special cases of the above WOCR specification, with different choices of For example, if or , then (1) amounts to the least square fitting since
This WOCR formulation allows us to conveniently study its general properties. It follows immediately from (2) that the associated hat matrix is
The resultant sum of square errors (SSE) is given by Note that is not an idempotent or projection matrix in general, neither is Instead,
The diagonal matrix has diagonal element . Therefore,
From (2), the WOCR estimate of is
It follows that, given new data matrix , the predicted vector is
Although not further pursued here, many other quantities and properties of WOCR can be derived accordingly with the generic form, including as studied in Hoerl and Kennard (1970) and Hwang and Nettleton (2003).
2.2 Parameterizing the Weights
The next important component in specifying WOCR is to parameterize the weights in in a principled way. The key motivation stems from the observation that, compared to the original regressors in , the orthogonal components in are naturally ordered in terms of some measure. This ordering may be attributed to some specific variation that each is intended to account for. Another natural ordering is based on the coefficients Because of orthogonality, the regression coefficient remains the same for in both the simple regression and multiple regression settings.
This motivates us to parameterize the weights based on the ordering measure. It is intuitive to assign more weights to more important components. To do so, can be specified as a function monotone in the ordering measure and parameterized with a low-dimensional vector Two such examples are given in Figure 1
. Among many other choices, the usage of sigmoid functions will be advocated in this article because they provide a smooth approximation to the 0-1 threshold indicator function that is useful for the component selection purpose and they are also flexible enough to adjust for achieving improved prediction accuracy. In general, we denoteThe vector in the weight function are essentially the tuning parameters. This parameterization conveniently expands these conventional modeling methods by providing several natural WOCR variants that are more attractive, as illustrated in the next section.
Determining the tuning parameters is yet another daunting task. In common practice, one fits the model for a number of fixed tuning parameters and then resorts to cross-validation or a model selection criterion to compare the model fittings. This can be computationally intensive, especially with big data. When a model selection criterion is used, WOCR provides a computationally efficient way of determining the tuning parameter . The key idea is to plug the specification (1) in a model selection criterion and optimize with respect to Depending on the scenarios, commonly used model selection criteria include the Akaike information criterion (AIC; Akaike, 1974), the generalized cross-validation (GCV; Golub, Heath, and Wahba, 1979), and the Bayesian information criterion (BIC; Schwarz, 1978
). What is involved in these model selection criteria are essentially SSE and the degrees of freedom (DF). A general form of SSE is given by (4). For DF, we follow the generalized definition by Efron (2004):
If neither the components nor the weights depends on then DF, often termed as the effective degrees of freedom (EDF) in this scenario, is computed as
With either components or the weights depends on the computation of DF is more difficult and will be treated on a case-by-case basis.
The specific forms of GCV, AIC, and BIC can be obtained accordingly. We treat the model selection as an objective function for . The best tuning parameter can then be estimated by optimization. Since is of low dimension, the optimization can be solved efficiently. This saves the computational cost in selecting the tuning parameter.
3 WOCR Examples
We show how several conventional models relate to WOCR with different weight specifications and different ways of constructing the orthogonal components and then how the WOCR formulation can help improve and expand them. In this section, we first discuss how WOCR helps determine the optimal tuning parameter in ridge regression and make inference accordingly. Next, we show that WOCR facilitates an efficient computational method for selecting the number of components in PCR. The key idea is to approximate the 0-1 threshold function with a smooth sigmoid weight function. Several natural variants of RR and PCR that are advantageous in predictive modeling are then derived within the WOCR framework.
3.1 Pre-Tuned Ridge Regression
The ridge regression (Hoerl and Kennard, 1970) can be formulated as a penalized least square optimization problem
with the tuning parameter The solution yields the ridge estimator
The singular value decomposition (SVD) of data matrixoffers a useful insight into RR (see, e.g., Hastie, Tibshirani, and Friedman, 2009). Suppose that the SVD of is given by
where both and have orthonormal column vectors that form an orthonormal basis for the column space and the row space of , respectively, and matrix with singular values satisfying Noticing that the column vectors of yield the principal directions. Since , it can be seen that is the -th normalized principal component.
The fitted vector in RR conforms well to the general form (1) of WOCR, as established by the following proposition.
Regardless of the magnitude of , the fitted vector in ridge regression can be written as
with and for
The proof when (i.e., and hence ) can be found in, e.g., Hastie, Tibshirani, and Friedman (2009). We consider the general case including the scenario. With the general SVD form (9) of , we have , but it is not necessarily true that , nor for
First, plugging the SVD of into yields
Define by completing an orthonormal basis for Hence is invertible with Also define and by appending 0 matrix or components to and Then it can be easily checked that in (11) can be rewritten as
One natural ordering of the principal components s is based on their associated singular values . Hence, the weight function is monotone in and parameterized with one single parameter See Figure 1(a) for a graphical illustration of this weight function. In view of , matrix in WOCR is given as
Since RR is most useful for predictive modeling without considering component selection, GCV is an advisable criterion for selecting the best tuning parameter With our WOCR approach, we first plugging (10) into GCV to form an objective function for and then optimize it with respect to . On the basis of (4) and (8), the specific form of is given up to some irrelevant constant, by
GCV has a wide applicability even in the ultra-high dimensions. Alternatively, AIC can be used instead. If GCV is asymptotically equivalent to
The best tuning parameter in RR can be estimated as Bringing back to yields the final RR estimator. Since the tuning parameter is determined beforehand, we call this method ‘pre-tuning’. We denote this pre-tuned RR method as where the first argument indicates the ordering on which basis the components are sorted and the second argument indicates the tuning parameter . We shall use this as a generic notation for other new WOCR models. As we shall demonstrate with simulation in Section 4.1, provides nearly identical fitting results to RR; however, pre-tuning dramatically improves the computational efficiency, especially when dealing with massive data.
One statistically awkward issue with regularization is selection of the tuning parameter. First of all, this is a one-dimensional optimization problem, yet done in a poor way in current practice by selecting a grid of values and evaluating the objective function at each value. The pre-tuned version helps amend this deficiency. Secondly, although the tuning parameter is often selected in a data-adaptive way and hence clearly is a statistic, no statistical inference is made for the tuning parameter unless within the Bayesian setting. The above pre-tuning method yields a convenient way of making inference on Since the objective function is smooth in , the statistical properties of follow well through standard M-estimation arguments. However, this is not the theme of WOCR, thus we shall not pursue further.
3.2 Pre-Tuned PCR
PCR regresses the response on the first () principal components as given by the SVD of in (9). The fitted vector in PCR can be rewritten as
where and for Clearly, PCR can be put in the WOCR form with Conventionally, the ordering of principal components is aligned with the singular values thus we may rewrite with a threshold value if is known. Either the number of components or the threshold is the tuning parameter. Selecting the optimal by examining many PCR models is a discrete process.
To facilitate pre-tuning, we replace the indicator weight with a smooth sigmoid function. While many other choices are available, it is convenient to use the logistic or expit function so that
Figure 1(b) plots with for different choices of . It can be seen that a larger value yields a better approximation to the indicator function while a smaller yields a smoother function which is favorable for optimization. In order to emulate PCR, the parameter can be fixed a priori at a relatively large value. Our numerical studies shows that the performance of the method is quite robust with respect to the choice of . On that basis, we recommend fixing in the range of
Since PCR involves selection of the optimal number of PCs, BIC, given by is selection-consistent (Yang, 2005) and often has a superior empirical performance in variable selection. The hat matrix in PCR is idempotent, so is Thus the SSE can be reduced a little bit as , which then can be approximated by substituting with The DF can be approximately in a similar way as This results in the following form for BIC
which is treated as an objective function of . We estimate the best cutoff point by optimizing with respect to . This is a one-dimensional smooth optimization problem with a search range Once is available, we use it as a threshold to select the components and fit a regular PCR. We denote this pre-tuned PCR approach as Compared to the discrete selection in PCR, is computationally more efficient. Furthermore, it performs better in selecting the correct number of components, especially when weak signals are present. This is an additional benefit of smoothing as opposed to the discrete selection process in PCR, as we will demonstrate with simulation.
3.3 WOCR Variants of RR and PCR Models
Not only can many existing models be cast into the WOCR framework, but it also suggests new favorable variants. We explore some of them. One first variant of PCR is leave both and free in (14). More specifically, we first obtain and then compute the WOCR fitted vector in (1) with weight for This will give PCR more flexibility and adaptivity and hence may lead to improved predictive power. In this approach, selecting components is no longer a concern; thus GCV or AIC can be used as the objective function instead. We denote this approach as
The principal components are constructed independently from the response. Artemiou and Li (2009) and Ni (2011) argued that the response tends to be more correlated with the leading principal components; this is usually not the case in reality, however. See, e.g., Jollife (1982) and Hadi and Ling (1998) for real-life data illustrations. Nevertheless, there has not been a principled way to deal with this issue in PCR. WOCR can provide a convenient solution: one simply bases the ordering of on the regression coefficients and defines the weights via a monotone function of or, preferably, . However, doing so will induce dependence on the response to the weights. As a result, the associated DF has to be computed differently, as established in Proposition 3.2.
Suppose that the WOCR model (1) has orthogonal components constructed independently of and weights where is a smooth monotonically increasing function and is the parameter vector. Its degrees of freedom (DF) can be estimated as
Clearly both PCR and RR can be benefited from this reformulation. As a variant of RR, the weight now becomes and hence It follows that the estimated DF is
The best tuning parameter can be obtained by minimizing GCV. Using similar notations as earlier, we denote this RR variant as It is worth noting that is, in fact, not a ridge regression model. Its solution can no longer be nicely motivated by a regularized or constrained least square optimization problem as in the original RR. But what really matters in these methods is the predictive power. By directly formulating the fitted values , the WOCR model (1) facilitates a direct and flexible model specification that focuses on prediction.
For PCR, the weight becomes Hence, and
Depending on whether or not we want to select components, we may fix at a larger value or leave it free. This results in two PCR variants, which we denote as and respectively.
Table 1 summarizes the WOCR models that we have discussed so far. Among them, and resemble the conventional RR and PCR, yet with pre-tuning. Depending on the analytic purpose, we also suggest a preferable objective function for each WOCR model. In general, we have recommended using GCV for predictive purposes, in which scenarios AIC can be used as an alternative. AIC is equivalent to GCV if , both being selection-efficient in the sense prescribed by Shibata (1981). On the other hand, if selecting components is desired, using BIC is recommended.
It is worth noting that the WOCR model has a close connection with the MIC (Minimum approximated Information Criterion) sparse estimation method of Su (2015), Su et al. (2016), and Su et al. (2017). MIC yields sparse estimation in the ordinary regression setting by solving a -dimensional smooth optimization problem
where with diagonal element approximating the indicator function Comparatively, solves a one-dimensional optimization problem
where with diagonal element approximating The substantial simplification in is because of the orthogonality of the design matrix Hence the coefficient estimates in multiple regression are the same as those in simple regression and can be computed ahead. Furthermore, the orthogonal regressors , i.e., the columns of , are naturally ordered by . This allows us to formulate a one-parameter smooth approximation to the indicator function which induces selection of in this PCR variant.
3.4 Implementation: R Package Wocr
The proposed WOCR method is implemented in an R package WOCR. The current version is hosted on GitHub at https://github.com/xgsu/WOCR.
The main function WOCR() has an argument model= with values in RR.d.lambda, RR.gamma.lambda, PCR.d.c, PCR.gamma.c, PCR.d.a.c, and PCR.gamma.a.c, which corresponds to the six WOCR variants as listed in Table 1. Among them, , , , and involves one-dimensional smooth optimization. This can be solved via the Brent (1973) method, which is conveniently available in the R function optim(). Owing to the nonconvex nature, dividing the search range of the decision variable can be helpful. The other two methods, and , involve two-dimensional smooth nonconvex optimization. Mullen (2014) provides a comprehensive comparison of many global optimization algorithms currently available in R (R Core Team, 2018). We have followed her suggestion to choose the generalized simulated annealing method (Tsallis and Stariolo, 1996), which is available from the R package GenSA (Xiang et al., 2013). More details of the implementation can be found from the help file of the WOCR package.
4 Simulation Studies
This section presents some of the simulation studies that we have conducted to investigate the performance of WOCR models and compare them to other methods.
4.1 Comparing Ridge Regression with
We first compare the conventional ridge regression with its pretuned version, i.e., . The data are generated as follows. We first simulate the design matrix
from a multivariate normal distributionwith and for Apply SVD to extract matrix and . Then we form the mean response as
For each simulated data set, we apply RR (as implemented by the R function lm.ridge) and , both selecting with minimum GCV.
To compare, we consider the mean square error (MSE) for prediction. To this end, a test data set of is generated in advance. The fitted RR and from each simulation run will be applied to the test set and the MSE is obtained accordingly. The ‘best’ tuning parameter is also recorded. We only report the results for the setting , , Two sample sizes are considered. For each model configuration, a total of simulation runs are considered.
In the simulation, we found how to specify the search points could be a problem in the current practice of ridge regression. Initially, we found the ridge regression gave inferior performance compared to in many scenarios. However, after adjusting its search range, the results became nearly identical to what had. This point will be further illustrated in Section 4.3. It is also worth noting that the minimum GCV tends to select a very small in the ultra-high dimensional case with
To demonstrate the computational advantages of over RR, we generated data from the same model A in (16). We first fix and let vary in And then we fix and let vary in For each setting, we recorded the CPU computing time for RR and averaged from three simulation runs. We have set the search range for as The results are plotted in Figure 2(a) and 2(b). It can be seen that is much faster than RR, especially when either or gets large.
4.2 Comparing PCR with and
Next we compare the two WOCR variants that are close to PCR. Data of dimension and are generated from Model A in (16), yet with two sets of coefficients given as follows:
The first set (i) fits perfectly to ordinary PCR and hence with number of useful components being 5, while the second set (ii) corresponds to the situation where the response is only associated with the fifth principal components, a scenario that fits best to Recall that the shape parameter in both and is fixed at a relatively larger value. Concerning its choice, we consider four values A total of 200 simulation run is made for each configuration. For each simulated data set, the ordinary PCR is fit with minimum cross-validated error, as implemented in R package pls while and are fit with minimum BIC. Figure 3 plots the number of components selected by each method via boxplot and the MSE for predicting an independent test data set of
generated from the same model setting via mean plus/minus standard error bar plot.
It can be seen that PCR substantially overfits in both model settings, resulting in high prediction errors as well. In the first scenario (i), and both do well with similar performance. In the second scenario (ii), fails in identifying the correct principal components while remains successful by switching the ordering from singular values to regression coefficients For the different choices, the performance of and is quite stable with some minor variations.
4.3 Predictive Performance Comparisons
To assess the predictive performance of WOCR models, we generate data of size from two nonlinear models used in (Friedman, 1991), which are given as follows:
The covariates of dimension are independently generated from the uniform[0,1] distribution and the random error term follows In both models, only the first five predictors are involved in the mean response function. Two choices of are considered with For each simulated data set, ridge regression, PCR, and six WOCR variants in Table 1 are applied with default or recommended settings. In particular, we fix the scale parameter in and To apply ridge regression, we have used
Table 2 presents the prediction MSE (mean and SE) and the median number of selected components by each method, out of 200 simulation runs. First of all, it can be seen that the ridge regression appears to provide the worst results in terms of MSE. This is because of deficiencies involved in the current practice of ridge regression that computes ridge estimators for a discrete set of within some specific range, which may not even include the true global GCV minimum. Comparatively, provides a computationally efficient and reliable way of finding the ‘best’ tuning parameter. We could have refit the ridge regression according to suggested by Another interesting observation is that tends to give more favorable results than because sorting the components according to borrows strength from the association with the response.
Among PCR variants, neither nor performs well. On the basis of BIC, they are aimed to find a parsimonious true model when the true model is among the candidate models, which, however, is not the case here. In terms of prediction accuracy, it can be seen that , , and are highly competitive, all yielding similar performance to PCR. Note that PCR determines the best tuning parameter via 10-fold cross-validation, while , and are based on a smooth optimization of GCV and hence are computationally advantageous. In these simulation settings, PCR has selected all components and hence simply amounts to the ordinary least square fitting.
5 Real Data Examples
For further illustration, we apply WOCR to two well-known data sets, which are BostonHousing2 and concrete. The Boston housing data relates to prediction the median value of owner-occupied homes for 506 census tracts of Boston from the 1970 census. We used the corrected version BostonHousing2 available from R package mlbench (Leisch and Dimitriadou, 2012), with dimension observations and predictors. The concrete
data is available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/). The goal of this data set is to predict the concrete compressive strength based on a few characteristics of the concrete. The data set has observations and continuous predictors.
Figure 4 plots the singular values and the regression coefficients in absolute value for both data sets. It can be seen that decreases gradually as expected. The bar plot of , however, shows different patterns. In the BostonHousing2 data, the very first component is highly correlated with the response, while others shows alternate weak correlations. In the concrete data, the third component is most correlated with the response, followed by the 6th and 5th principal components. The first two components are only very weakly correlated. This data set shows a good example where the top components are not necessarily the most relevant components in terms of association with the response.
To compare different models, a unified approach is taken. We randomly partition the data into the training set and the test set with a ratio of approximately 2:1 in sample sizes. The training set is used to construct models and then the constructed models are applied to the test set for prediction. The default settings in Table 1 are used for each WOCR, while the default 10-fold CV method is used to select the best model for ridge regression and PCR. We repeat this entire procedure for 200 runs. The prediction MSE and the number of components for every method is recorded for each run. The results are summarized in Table 3.
|Method||average-MSE||SE-MSE||# comps||average-MSE||SE-MSE||# comps|
While most methods provide largely similar results, some details are noteworthy. For ridge regression, outperforms the original ridge regression slightly but it is much faster in computation time. Comparatively, improves the prediction accuracy by basing the weights on ’s for the concrete data, where the top components are not the most relevant to the response as shown in Figure 4. Among the PCR models, both and are among top performers in terms of prediction.
Neither nor perform as well as others in terms of prediction accuracy owing to their different emphasis. Concerning component selection, yields simpler models than and PCR. This is determined by the nature of each method and data set. Referring to Figure 4, clearly helps extract parsimonious models with simpler structures.
We have proposed a new way of constructing predictive models based on orthogonal components extracted from the original data. The approach makes good use of the natural monotonicity associated with those orthogonal components. It allows efficient determination of the tuning parameters. The approach results in several interesting alternative models to RR and PCR. These new variants make improvement on either predictive performance or selection of the components. Overall speaking, , , and are highly competitive in terms of predictive performance. better aims for model parsimony by making selection on the basis of association with the response.
WOCR can be implemented with more flexibility. First of all, we have advocated the use of logistic or expit function in regulating the weights. The logistic function is rotationally symmetrical about the point . To have more flexible weights, we may consider a generalized version of the expit function, The range of the gexpit function remains (0, 1). Since its value at is now the parameter changes the rotational symmetry unless Secondly, selecting the number of principal components is a major concern in PCR. We have used BIC in both and for this purpose. BIC is derived in the fixed dimensional setting (i.e., fixed and ). It is worth noting that the dimension in the WOCR family is instead of . If is close to , the modified or generalized BIC (see, e.g., Chen and Chen, 2008) can be used instead. In particular, the complexity penalty suggested by Wang, Li, and Leng (2009) to replace in (14) for diverging dimensions fits well for WOCR models since the dimension cannot exceed . If there is prior information or belief that the optimal is less than some pre-specified number, it is helpful to further restrain the search range of on the basis of
The WOCR model framework generates several future research revenues. First of all, WOCR can be directly applicable to regression with components after a varimax rotation (Kaiser, 1958). WOCR can also be extended to PLSR and CR models. In those approaches, extraction of the orthogonal components takes associations with the response into consideration; thus both matrices and relate to To select the tuning parameter, -fold cross validation can be conveniently used on the basis of Equations (5) and (6). To implement pre-tuning, finding the degrees of freedom involved in these approach becomes more complicated but remains doable by following Krämer and Sugiyama (2011). The weighting and pre-tuning strategy introduced in WOCR may help make improvement in terms of predictive accuracy, computational speed, and model parsimony for these models. Secondly, the simulation results for Model B in (17) and Model C in (18) with presented in Section 4.3 highlight the variable selection issue in high-dimensional modeling. To this end, Bair (2006) considered a univariate screening step; Ishwaran and Rao (2014) showed the generalized ridge regression (Hoerl and Kennard, 1970) can help suppress the influence of unneeded predictors in certain conditions. Both approaches may be incorporated into WOCR to improve its predictive ability. Finally, WOCR can be extended to generalized linear models, e.g., via a local quadratic approximation of the log-likelihood function. The kernel trick (see, e.g., Rosipal, Trejo, and Cichoki 2011, Rosipal and Trejo 2002, and Lee and Liu 2013) can be integrated into WOCR as well.
- Akaike (1974) Akaike, H. (1974). A new look at model identification, IEEE Transactions an Automatic Control, 19: 716–723.
- Artemiou and Li (2009) Artemiou, A. A. and Li, B. (2009). On principal components and regression: A statistical explanation of a natural phenomenon. Statistica Sinica, 19: 1557–1565.
- Bair (2006) Bair, E., Hastie, T., Paul, D., and Tibshirani, R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association, 101(473): 119–137.
- Brent (1973) Brent, R. (1973). Algorithms for Minimization without Derivatives. Englewood Cliffs, NJ: Prentice-Hall.
- Butler and Denham (2000) Butler, N. and Denham, M. (2000). The peculiar shrinkage properties of partial least squares regression. Journal of the Royal Statistical Society, Series B, 62: 585–594.
- Chen and Chen (2008) Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771.
- Efron (2004) Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association, 99: 619–633.
- Frank and Friedman (1993) Frank, I. and Friedman, J. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35: 109–135.
Friedman, J. H. (1991). Multivariate adaptive regression splines.Annals of Statistics, 19: 1–67.
- Golub, Heath, and Wahba (1979) Golub, G. H., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2): 215–223.
- Hadi and Ling (1998) Hadi, A. S. and Ling, R. F. (1998). Some cautionary notes on the use of principal components regression. The Amercian Statistician, 52: 15–19.
- Hastie, Tibshirani, and Friedman (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition.
- Hwang and Nettleton (2003) Hwang, J. T. and Nettleton, D. (2003). Princial components regression with data-chosen components and related methods. Technometrics, 45: 70–79.
- Hoerl and Kennard (1970) Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12: 55–67.
- Ishwaran and Rao (2014) Ishwaran, H. and Rao, J. S. (2014). Geometry and properties of generalized ridge regression in high dimensions. Contemporary Mathematics, 622: 81–93.
- Jollife (1982) Jollife, I. T. (1982). A note on the use of principal components in regression. Applied Statistics, 31: 300–303.
- Kaiser (1958) Kaiser, H. F. (1958). The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23: 187–200.
- Krämer and Sugiyama (2011) Krämer, N. and Sugiyama M. (2011). The degrees of freedom of partial least squares regression. Journal of the American Statistical Association, 106: 697–705.
- Lee and Liu (2013) Lee, M. H. and Liu, Y. (2013). Kernel continuum regression. Computational Statistics & Data Analysis, 68: 190–201.
- Leisch and Dimitriadou (2012) Leisch, F. and Dimitriadou, E. (2012). R Package mlbench: Machine Learning benchmark problems. URL https://cran.r-project.org/web/packages/mlbench/
- Massy (1965) Massy, W. F. (1965). Principal components regression in exploratary statistical research. Journal of the Americal Statistical Association, 60: 234–256.
- Mullen (2014) Mullen, K. M. (2014). Continuous global optimization in R. Journal of Statistical Software, 60(6).
- Ni (2011) Ni, L. (2011). Principal component regression revisited. Statistica Sinica, 21: 741–747.
- R Core Team (2018) R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
- Rosipal, Trejo, and Cichoki (2011) Rosipal, R., Trejo, L. J., and Cichoki, A. (2001). Kernel principal component regression with em approach to nonlinear principal component extraction. Technical Report, University of Paisley, UK.
- Rosipal and Trejo (2002) Rosipal, R. and Trejo, L. J. (2002). Kernel partial least squares regression in reproducing kernel Hilbert space. Journal of Machine Learning Research, 2: 97–123.
- Schwarz (1978) Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6: 461–464.
- Shibata (1981) Shibata, R. (1981). An optimal selection of regression variables. Biometrika, 68: 45–54.
- Stone and Brooks (1990) Stone, M. and Brooks, R. J. (1990). Continuum regression: Cross-validated sequentially contrsucted predition embracing ordinary least squares, partial least squares and principal components regression. Journal of the Royal Statistical Soceity, 52: 237–269.
- Su (2015) Su, X. (2015).Variable selection via subtle uprooting. Journal of Computational and Graphical Statistics, 24: 1092–1113.
- Su et al. (2016) Su, X., Wijayasinghe, C. S., Fan, J., and Zhang, Y. (2016). Sparse estimation of Cox proportional hazards models via approximated information criteria. Biometrics, 72: 751–759.
- Su et al. (2017) Su, X., Fan, J., Levine, R., Nunn, M., and Tsai, C.-L. (2017). Sparse Estimation of Generalized Linear Models (GLM) via Approximated Information Criteria. In press, Statistica Sinica.
- Tsallis and Stariolo (1996) Tsallis, C. and Stariolo, D. A. (1996). Generalized simulated annealing. Physica A, 233: 395–406.
- Wang, Li, and Leng (2009) Wang, H., Li, B., and Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Serires B, 71: 671–683.
- Wold (1966) Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In Multivariate Analysis (Ed. P.R. Krishnaiaah), pp. 391–420. New York, NY: Academic Press.
- Wold (1978) Wold, H. (1984). PLS regression. In Encycolpedia of Statistical Sciences (eds N. L. Johnson and S. Kotz), vol. 6, pp. 581–591. New York, NY: Wiley.
- Xiang et al. (2013) Xiang, Y., Gubian, S., Suomela, B., and Hoeng, J. (2013). Generalized simulated annealing for global optimization: The GenSA package. The R Journal, 5(1).
- Yang (2005) Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model indentification and regression estimation. Biometrika, 92: 937–950.