We consider the linear regression model
$$y = x^\top \beta + \varepsilon, \tag{1}$$
where the error $\varepsilon$ is assumed independent of $x$, with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$. We focus on the estimation of linear functions, $\ell^\top \beta$, of the regression coefficients. For example, $\ell = e_j$, the $j$th canonical basis vector, when the interest resides in a particular effect $\beta_j$, or $\ell = x_0$, when the interest resides in the prediction of a new outcome $y_0 = x_0^\top \beta + \varepsilon_0$. We aim for estimators $\hat\beta$ of $\beta$ such that the mean squared error, $\mathrm{MSE}(\ell^\top \hat\beta) = E[(\ell^\top \hat\beta - \ell^\top \beta)^2]$, is minimized. It holds that
$$\mathrm{MSE}(\ell^\top \hat\beta) = \big( E[\ell^\top \hat\beta] - \ell^\top \beta \big)^2 + \mathrm{Var}(\ell^\top \hat\beta),$$
where the two terms on the right-hand side are the squared bias and the variance, respectively. The standard estimator for $\beta$ is the ordinary least squares (LS) estimator
$$\hat\beta_{LS} = (X^\top X)^{-1} X^\top y,$$
where $X$ is the $n \times p$ design matrix and $y$ is the response vector. Since $\hat\beta_{LS}$ is unbiased, we obtain that $\mathrm{MSE}(\ell^\top \hat\beta_{LS}) = \mathrm{Var}(\ell^\top \hat\beta_{LS})$ with $\mathrm{Var}(\hat\beta_{LS}) = \sigma^2 (X^\top X)^{-1}$. Moreover, by the Gauss-Markov theorem we know that $\mathrm{Var}(\ell^\top \hat\beta_{LS}) \le \mathrm{Var}(\ell^\top \tilde\beta)$ for any linear unbiased estimator $\tilde\beta$. Therefore, one focuses on biased estimators when one aims to reduce $\mathrm{Var}(\ell^\top \hat\beta)$ below that of LS.
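These properties of the LS estimator are easy to verify by simulation. The following numpy sketch uses an illustrative simulated design and illustrative values for $\beta$ and $\sigma$ (none of them taken from the text), and checks unbiasedness and the covariance formula by Monte Carlo:

```python
import numpy as np

# Monte Carlo check that the LS estimator is unbiased with
# covariance sigma^2 (X'X)^{-1}; design and parameters are illustrative.
rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 1.0
X = rng.standard_normal((n, p))              # fixed design
beta = np.array([1.0, -2.0, 0.5])
XtX_inv = np.linalg.inv(X.T @ X)

reps = 5000
est = np.empty((reps, p))
for i in range(reps):
    y = X @ beta + sigma * rng.standard_normal(n)
    est[i] = XtX_inv @ X.T @ y               # LS estimate for this sample

bias = est.mean(axis=0) - beta               # should be close to 0
cov_mc = np.cov(est.T)                       # should be close to sigma^2 (X'X)^{-1}
cov_theory = sigma**2 * XtX_inv
```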
A first well-known shrinkage family is ridge regression (Hoerl, 1962; Hoerl and Kennard, 1970; Tikhonov, 1963), defined by $\hat\beta_R(\lambda) = (X^\top X + \lambda I)^{-1} X^\top y$ with penalty parameter $\lambda \ge 0$. It can be shown that $\mathrm{Var}(\hat\beta_R(\lambda)) \preceq \mathrm{Var}(\hat\beta_{LS})$, for any given $\lambda \ge 0$, where $\preceq$ represents inequality in the Loewner partial order for symmetric matrices [i.e. $A \preceq B$ if $B - A$ is non-negative definite, see e.g. Horn and Johnson (2012)]. Indeed, taking (w.l.g.) $\sigma^2 = 1$ we have that
$$\mathrm{Var}(\hat\beta_{LS}) - \mathrm{Var}(\hat\beta_R(\lambda)) = (X^\top X)^{-1} - (X^\top X + \lambda I)^{-1} X^\top X\, (X^\top X + \lambda I)^{-1} \succeq 0.$$
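This Loewner dominance can be checked numerically by verifying that the smallest eigenvalue of the difference of the two covariance matrices is nonnegative. A minimal numpy sketch with an arbitrary simulated design (and $\sigma^2 = 1$):

```python
import numpy as np

# Check the Loewner dominance Var(LS) - Var(ridge) >= 0 by verifying that
# the smallest eigenvalue of the difference is nonnegative (sigma^2 = 1).
rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.standard_normal((n, p))              # illustrative design
XtX = X.T @ X
I = np.eye(p)

def ridge_cov(lam):
    # Var(beta_ridge) = W X'X W with W = (X'X + lam I)^{-1}
    W = np.linalg.inv(XtX + lam * I)
    return W @ XtX @ W

ls_cov = np.linalg.inv(XtX)
gaps = [np.linalg.eigvalsh(ls_cov - ridge_cov(lam)).min()
        for lam in (0.0, 0.1, 1.0, 10.0)]
```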
Another well-known shrinkage family is the non-negative garrote (Breiman, 1995). A simplified version of this method (which we use in this paper) is defined as
$$\hat\beta_G(c) = c\, \hat\beta_{LS},$$
with $0 \le c \le 1$. In Section 3 we show that both the total variance and the generalized variance of $\hat\beta_G(c)$ are smaller than those of $\hat\beta_{LS}$. There is an extensive literature on shrinkage in linear regression which mainly focuses on automatic variable selection, with the lasso as the most prominent example; see e.g. Hastie, Tibshirani, and Wainwright (2015).
In Section 2 we consider a new family of estimators, which we call split regression estimators (SplitReg). In Section 3, we show that the total variance and generalized variance of the simplified garrote and SplitReg are smaller than those of LS. We illustrate the potential usefulness of SplitReg in Section 4 through results of some numerical comparisons with ridge regression, the simplified garrote and the lasso. Finally, Section 5 concludes.
2 Split Regression Estimators

Consider a partitioning $X = (X_1 \; X_2)$ of the design matrix, with $X_1$ an $n \times p_1$ matrix and $X_2$ an $n \times p_2$ matrix ($p_1 + p_2 = p$), and partition $\beta = (\beta_1^\top \; \beta_2^\top)^\top$ accordingly. According to this partition, consider the matrices
$$A_1 = (X_1^\top X_1)^{-1} X_1^\top \quad \text{and} \quad A_2 = (X_2^\top X_2)^{-1} X_2^\top.$$
Then, we define the corresponding split LS estimator by
$$\hat\beta_{\mathrm{split}} = \begin{pmatrix} A_1 y \\ A_2 y \end{pmatrix},$$
which has mean
$$E[\hat\beta_{\mathrm{split}}] = \begin{pmatrix} \beta_1 + A_1 X_2 \beta_2 \\ \beta_2 + A_2 X_1 \beta_1 \end{pmatrix}$$
and covariance matrix
$$\mathrm{Var}(\hat\beta_{\mathrm{split}}) = \sigma^2 \begin{pmatrix} A_1 A_1^\top & A_1 A_2^\top \\ A_2 A_1^\top & A_2 A_2^\top \end{pmatrix} = \sigma^2 \begin{pmatrix} (X_1^\top X_1)^{-1} & A_1 A_2^\top \\ A_2 A_1^\top & (X_2^\top X_2)^{-1} \end{pmatrix}. \tag{2}$$
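The mean and covariance expressions follow from $E[y] = X_1 \beta_1 + X_2 \beta_2$ and $\mathrm{Var}(y) = \sigma^2 I$. A numpy sketch with simulated blocks (illustrative dimensions and coefficients, $\sigma^2 = 1$) confirms them:

```python
import numpy as np

# Verify the mean and covariance of the split LS estimator (sigma^2 = 1);
# dimensions and coefficients below are illustrative.
rng = np.random.default_rng(2)
n, p1, p2 = 100, 2, 3
X1 = rng.standard_normal((n, p1))
X2 = rng.standard_normal((n, p2))
A1 = np.linalg.inv(X1.T @ X1) @ X1.T
A2 = np.linalg.inv(X2.T @ X2) @ X2.T
A = np.vstack([A1, A2])                      # beta_split = A y

beta1 = np.array([1.0, -1.0])
beta2 = np.array([0.5, 0.0, 2.0])
mu = X1 @ beta1 + X2 @ beta2                 # E[y]

# E[beta_split] = (beta1 + A1 X2 beta2, beta2 + A2 X1 beta1)
mean_theory = np.concatenate([beta1 + A1 @ X2 @ beta2,
                              beta2 + A2 @ X1 @ beta1])
mean_direct = A @ mu
cov = A @ A.T                                # Var(beta_split) for sigma^2 = 1
```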
Example 1. Consider the linear regression model with $p = 2$. To simplify the notation we assume that the explanatory variables have been centered and scaled so that $x_1^\top x_1 = x_2^\top x_2 = n$, and let us denote the sample correlation by $r = x_1^\top x_2 / n$.
Suppose that $\beta_1$ is the main effect and $\beta_2$ is a control variable. Hence, we are mainly interested in estimating $\beta_1$. Let us compare the LS estimator $\hat\beta_{1,LS}$ with the SplitReg estimator $\hat\beta_{1,\mathrm{split}} = x_1^\top y / n$. The MSE for $\hat\beta_{1,LS}$ and $\hat\beta_{1,\mathrm{split}}$ is $\sigma^2 / (n (1 - r^2))$ and $r^2 \beta_2^2 + \sigma^2 / n$, respectively. Therefore, $\hat\beta_{1,\mathrm{split}}$ is preferred if and only if
$$\beta_2^2 < \frac{\sigma^2}{n (1 - r^2)}.$$
Hence, SplitReg becomes more attractive when there is a larger error variance and/or a larger correlation between the explanatory variables (multicollinearity) and/or a smaller sample size. For example, if $\sigma = 1$ and $n = 100$, then the SplitReg estimator is preferred if $|\beta_2| < 1 / (10 \sqrt{1 - r^2})$.
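The trade-off in Example 1 can be sketched directly from the two MSE formulas. The numerical values below are illustrative choices, not taken from the text; the check confirms that the split estimator wins exactly when $|\beta_2|$ falls below the threshold $\sigma / \sqrt{n (1 - r^2)}$:

```python
import numpy as np

# MSE comparison of Example 1: split wins iff beta2^2 < sigma^2 / (n (1 - r^2)).
# Parameter values are illustrative.
def mse_pair(beta2, sigma, n, r):
    mse_ls = sigma**2 / (n * (1 - r**2))       # LS: unbiased, variance only
    mse_split = (r * beta2)**2 + sigma**2 / n  # split: bias^2 + variance
    return mse_split, mse_ls

sigma, n, r = 1.0, 100, 0.6
threshold = sigma / np.sqrt(n * (1 - r**2))    # boundary value for |beta2|
below, mse_ls = mse_pair(0.5 * threshold, sigma, n, r)
above, _ = mse_pair(2.0 * threshold, sigma, n, r)
```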
3 Variance Inequalities
To compare the variability of the estimators $\hat\beta_{LS}$, $\hat\beta_G(c)$ and $\hat\beta_{\mathrm{split}}$ we use the following expression for $\mathrm{Var}(\hat\beta_{LS})$. Let $H_1 = X_1 (X_1^\top X_1)^{-1} X_1^\top$ and $H_2 = X_2 (X_2^\top X_2)^{-1} X_2^\top$, then we have
$$\mathrm{Var}(\hat\beta_{LS}) = \sigma^2 (X^\top X)^{-1} = \sigma^2 \begin{pmatrix} \big( X_1^\top (I - H_2) X_1 \big)^{-1} & - \big( X_1^\top (I - H_2) X_1 \big)^{-1} X_1^\top X_2 (X_2^\top X_2)^{-1} \\ - \big( X_2^\top (I - H_1) X_2 \big)^{-1} X_2^\top X_1 (X_1^\top X_1)^{-1} & \big( X_2^\top (I - H_1) X_2 \big)^{-1} \end{pmatrix}. \tag{3}$$
Proposition 1 (Generalized Variance Inequality). Consider the three estimators above, and let $\mathrm{gVar}(\hat\beta) = \det(\mathrm{Var}(\hat\beta))$ denote the generalized variance. Then we have the following inequalities:
(a) $\mathrm{gVar}(\hat\beta_{\mathrm{split}}) \le \mathrm{gVar}(\hat\beta_{LS})$;
(b) $\mathrm{gVar}(\hat\beta_G(c)) = c^{2p}\, \mathrm{gVar}(\hat\beta_{LS}) \le \mathrm{gVar}(\hat\beta_{LS})$ for $0 \le c \le 1$.
Proof. To prove (a) we notice that
$$\mathrm{gVar}(\hat\beta_{LS}) = \frac{\sigma^{2p}}{\det(X^\top X)} = \frac{\sigma^{2p}}{\det(X_1^\top X_1)\, \det\!\big( X_2^\top (I - H_1) X_2 \big)}.$$
Similarly, using that the Schur complement of the block $(X_1^\top X_1)^{-1}$ in the covariance matrix of $\hat\beta_{\mathrm{split}}$ equals $(X_2^\top X_2)^{-1} X_2^\top (I - H_1) X_2 (X_2^\top X_2)^{-1}$, we have
$$\mathrm{gVar}(\hat\beta_{\mathrm{split}}) = \frac{\sigma^{2p}\, \det\!\big( X_2^\top (I - H_1) X_2 \big)}{\det(X_1^\top X_1)\, \det(X_2^\top X_2)^2} = \mathrm{gVar}(\hat\beta_{LS}) \left( \frac{\det\!\big( X_2^\top (I - H_1) X_2 \big)}{\det(X_2^\top X_2)} \right)^{\!2} \le \mathrm{gVar}(\hat\beta_{LS}).$$
This holds because $H_1 \succeq 0$ implies $X_2^\top (I - H_1) X_2 \preceq X_2^\top X_2$ and hence $\det\!\big( X_2^\top (I - H_1) X_2 \big) \le \det(X_2^\top X_2)$.
Part (b) follows because
$$\mathrm{gVar}(\hat\beta_G(c)) = \det\!\big( c^2\, \mathrm{Var}(\hat\beta_{LS}) \big) = c^{2p}\, \mathrm{gVar}(\hat\beta_{LS}) \le \mathrm{gVar}(\hat\beta_{LS}). \qquad \square$$
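Both parts of Proposition 1 can be verified numerically on a simulated correlated design (illustrative dimensions, $\sigma^2 = 1$, and an arbitrary shrinkage weight):

```python
import numpy as np

# Numerical check of Proposition 1 (sigma^2 = 1) on a correlated design.
rng = np.random.default_rng(3)
n, p1, p2 = 60, 2, 2
X1 = rng.standard_normal((n, p1))
X2 = 0.5 * X1 @ rng.standard_normal((p1, p2)) + rng.standard_normal((n, p2))
X = np.hstack([X1, X2])

A1 = np.linalg.inv(X1.T @ X1) @ X1.T
A2 = np.linalg.inv(X2.T @ X2) @ X2.T
A = np.vstack([A1, A2])

gvar_ls = np.linalg.det(np.linalg.inv(X.T @ X))
gvar_split = np.linalg.det(A @ A.T)            # part (a)
c = 0.7
gvar_garrote = c ** (2 * (p1 + p2)) * gvar_ls  # part (b)
```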
Proposition 2 (Total Variance Inequality). Consider again the three estimators, and let $\mathrm{tVar}(\hat\beta) = \mathrm{tr}(\mathrm{Var}(\hat\beta))$ denote the total variance. Then we have the following inequalities:
(a) $\mathrm{tVar}(\hat\beta_{\mathrm{split}}) \le \mathrm{tVar}(\hat\beta_{LS})$;
(b) $\mathrm{tVar}(\hat\beta_G(c)) = c^2\, \mathrm{tVar}(\hat\beta_{LS}) \le \mathrm{tVar}(\hat\beta_{LS})$ for $0 \le c \le 1$.
Proof. From (2) it follows that
$$\mathrm{tVar}(\hat\beta_{\mathrm{split}}) = \sigma^2 \Big[ \mathrm{tr}\big( (X_1^\top X_1)^{-1} \big) + \mathrm{tr}\big( (X_2^\top X_2)^{-1} \big) \Big].$$
On the other hand, using equation (3) we obtain that
$$\mathrm{tVar}(\hat\beta_{LS}) = \sigma^2 \Big[ \mathrm{tr}\big( \big( X_1^\top (I - H_2) X_1 \big)^{-1} \big) + \mathrm{tr}\big( \big( X_2^\top (I - H_1) X_2 \big)^{-1} \big) \Big].$$
Now, notice that $X_1^\top (I - H_2) X_1 \preceq X_1^\top X_1$ implies $(X_1^\top X_1)^{-1} \preceq \big( X_1^\top (I - H_2) X_1 \big)^{-1}$, and similarly for the second term, which proves (a).
To prove (b) let us set $\hat\beta_G(c) = c\, \hat\beta_{LS}$, then we have that
$$\mathrm{tVar}(\hat\beta_G(c)) = \mathrm{tr}\big( c^2\, \mathrm{Var}(\hat\beta_{LS}) \big) = c^2\, \mathrm{tVar}(\hat\beta_{LS}) \le \mathrm{tVar}(\hat\beta_{LS}). \qquad \square$$
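Proposition 2 admits the same kind of numerical check as Proposition 1, again on an illustrative correlated design with $\sigma^2 = 1$:

```python
import numpy as np

# Numerical check of Proposition 2 (sigma^2 = 1) on a correlated design.
rng = np.random.default_rng(4)
n, p1, p2 = 60, 3, 2
X1 = rng.standard_normal((n, p1))
X2 = rng.standard_normal((n, p2)) + X1[:, :2] @ rng.standard_normal((2, p2))
X = np.hstack([X1, X2])

tvar_ls = np.trace(np.linalg.inv(X.T @ X))
tvar_split = (np.trace(np.linalg.inv(X1.T @ X1))
              + np.trace(np.linalg.inv(X2.T @ X2)))  # part (a)
c = 0.8
tvar_garrote = c**2 * tvar_ls                        # part (b)
```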
To show that split LS estimators improve the total and generalized variance of the least squares estimator more generally, we now consider nested partitions of the set $\{1, \dots, p\}$. A partition $\mathcal{P}' = \{S_1', \dots, S_m'\}$ is nested in the partition $\mathcal{P} = \{S_1, \dots, S_k\}$ if for all $j$ there is a subset $S_i \in \mathcal{P}$ such that $S_j' \subseteq S_i$.
For simplicity, we will consider the partitions $\mathcal{P} = \{S_1, S_2\}$ and $\mathcal{P}' = \{S_1', S_2', S_3'\}$ with $S_1 = S_1' \cup S_2'$ and $S_2 = S_3'$. We will also assume (w.l.g.) that $S_1' = \{1, \dots, p_1\}$, $S_2' = \{p_1 + 1, \dots, p_1 + p_2\}$ and $S_3' = \{p_1 + p_2 + 1, \dots, p\}$. The corresponding partitions of the matrix $X$ are
$$X = (X_1 \; X_2) = (X_{11} \; X_{12} \; X_2), \qquad X_1 = (X_{11} \; X_{12}).$$
We now compare the split estimators $\hat\beta_{\mathcal{P}}$ and $\hat\beta_{\mathcal{P}'}$ based on these two partitions.
Proposition 3 (Nested Partitions Inequality). Let $\mathcal{P}'$ be nested in $\mathcal{P}$ as above. Then
(a) $\mathrm{gVar}(\hat\beta_{\mathcal{P}'}) \le \mathrm{gVar}(\hat\beta_{\mathcal{P}})$ and (b) $\mathrm{tVar}(\hat\beta_{\mathcal{P}'}) \le \mathrm{tVar}(\hat\beta_{\mathcal{P}})$.
Proof. To simplify the notation, assume that $\sigma^2 = 1$, and write $A_{11} = (X_{11}^\top X_{11})^{-1} X_{11}^\top$ and $A_{12} = (X_{12}^\top X_{12})^{-1} X_{12}^\top$. Since $X_{11}$ and $X_{12}$ lie in the column space of $X_1$, we have $A_{11} X_1 A_1 = A_{11}$ and $A_{12} X_1 A_1 = A_{12}$, so that
$$\hat\beta_{\mathcal{P}'} = C\, \hat\beta_{\mathcal{P}}, \qquad C = \begin{pmatrix} A_{11} X_1 & 0 \\ A_{12} X_1 & 0 \\ 0 & I \end{pmatrix}.$$
Therefore, to prove (a) write
$$\mathrm{gVar}(\hat\beta_{\mathcal{P}'}) = \det(C)^2\, \mathrm{gVar}(\hat\beta_{\mathcal{P}}).$$
Since $\det(C) = \det\!\big( I - (X_{12}^\top X_{12})^{-1} X_{12}^\top H_{11} X_{12} \big)$, with $H_{11} = X_{11} (X_{11}^\top X_{11})^{-1} X_{11}^\top$, we just need to show that $|\det(C)| \le 1$. But this has been shown in Proposition 1(a) with $X_1 = (X_{11} \; X_{12})$ playing the role of $X = (X_1 \; X_2)$.
Similarly, to prove (b) we write
$$\mathrm{tVar}(\hat\beta_{\mathcal{P}'}) - \mathrm{tVar}(\hat\beta_{\mathcal{P}}) = \mathrm{tr}\big( (X_{11}^\top X_{11})^{-1} \big) + \mathrm{tr}\big( (X_{12}^\top X_{12})^{-1} \big) - \mathrm{tr}\big( (X_1^\top X_1)^{-1} \big).$$
Hence, we just need to show that this difference is non-positive, but this has already been shown in Proposition 2(a) with $X_1 = (X_{11} \; X_{12})$ playing the role of $X = (X_1 \; X_2)$. $\square$
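Proposition 3 can also be checked numerically. The sketch below compares a two-block partition against its three-block refinement on an illustrative simulated design ($\sigma^2 = 1$):

```python
import numpy as np

# Numerical check of Proposition 3: refining the partition {S1, S2} into
# {S1', S2', S3'} does not increase either variance measure (sigma^2 = 1).
rng = np.random.default_rng(5)
n = 80
X11 = rng.standard_normal((n, 2))
X12 = rng.standard_normal((n, 2))
X2 = rng.standard_normal((n, 2))
X1 = np.hstack([X11, X12])

def split_map(blocks):
    # stacked rows of the blockwise LS maps: beta_split = A y
    return np.vstack([np.linalg.inv(B.T @ B) @ B.T for B in blocks])

A_coarse = split_map([X1, X2])                 # partition {S1, S2}
A_fine = split_map([X11, X12, X2])             # nested partition {S1', S2', S3'}

gv_coarse = np.linalg.det(A_coarse @ A_coarse.T)
gv_fine = np.linalg.det(A_fine @ A_fine.T)
tv_coarse = np.trace(A_coarse @ A_coarse.T)
tv_fine = np.trace(A_fine @ A_fine.T)
```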
From Propositions 2 and 3 it follows that refining a partition never increases the total variance, and by Propositions 1 and 3 the same holds for the generalized variance. In particular, the fully split estimator, which regresses $y$ on each explanatory variable separately, has the smallest total and generalized variance among all split LS estimators.
4 Potential Benefit of Split Regression Estimators
Our aim is to illustrate the potential benefits of SplitReg for the prediction of new outcomes. We use the model and notation introduced in Example 1. The prediction of a new outcome $y_0 = x_0^\top \beta + \varepsilon_0$ corresponds to the choice $\ell = x_0$, and the prediction performance of an estimator $\hat\beta$ is measured by averaging over the values of $x_0$, that is, we calculate the prediction mean squared error $\mathrm{PMSE}(\hat\beta) = E_{x_0}\big[ E\big( y_0 - x_0^\top \hat\beta \big)^2 \big]$. To calculate the expectation we assume that $x_0 \sim N(0, \Sigma_\rho)$, where $\Sigma_\rho$ is a correlation matrix with parameter $\rho$. It can be shown that for LS prediction
$$\mathrm{PMSE}(\hat\beta_{LS}) = \sigma^2 \Big( 1 + \mathrm{tr}\big( \Sigma_\rho (X^\top X)^{-1} \big) \Big).$$
For garrote prediction with weight $c$ we obtain that
$$\mathrm{PMSE}(\hat\beta_G(c)) = \sigma^2 + (1 - c)^2\, \beta^\top \Sigma_\rho \beta + c^2 \sigma^2\, \mathrm{tr}\big( \Sigma_\rho (X^\top X)^{-1} \big).$$
For ridge regression prediction with penalty $\lambda$ we have
$$\mathrm{PMSE}(\hat\beta_R(\lambda)) = \sigma^2 + \lambda^2\, \beta^\top W_\lambda \Sigma_\rho W_\lambda \beta + \sigma^2\, \mathrm{tr}\big( \Sigma_\rho W_\lambda X^\top X W_\lambda \big),$$
where $W_\lambda = (X^\top X + \lambda I)^{-1}$. We also consider prediction based on the lasso estimator, denoted by $\hat\beta_{\mathrm{lasso}}(\lambda)$. However, in this case we do not have a closed-form expression for the PMSE, so we calculate it numerically. We determine the optimal prediction performance of ridge regression, lasso and garrote corresponding to the value of the shrinkage parameter $c$ or $\lambda$ which minimizes the PMSE.
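The closed-form PMSE curves for LS, the simplified garrote and ridge can be evaluated and minimized on a grid. The sketch below does this in the Example 1 setting; the values of $n$, $r$, $\rho$, $\sigma$ and $\beta$ are illustrative choices, not taken from the text:

```python
import numpy as np

# Closed-form PMSE curves in the Example 1 setting (p = 2); the values of
# n, r, rho, sigma and beta are illustrative.
n, r, rho, sigma = 50, 0.5, 0.5, 1.0
beta = np.array([1.0, 1.0])
XtX = n * np.array([[1.0, r], [r, 1.0]])
S = np.array([[1.0, rho], [rho, 1.0]])         # Sigma_rho
I = np.eye(2)

def pmse_ls():
    return sigma**2 * (1.0 + np.trace(S @ np.linalg.inv(XtX)))

def pmse_garrote(c):
    b = (c - 1.0) * beta                        # bias of c * beta_LS
    return (sigma**2 + b @ S @ b
            + c**2 * sigma**2 * np.trace(S @ np.linalg.inv(XtX)))

def pmse_ridge(lam):
    W = np.linalg.inv(XtX + lam * I)
    b = -lam * W @ beta                         # bias of ridge
    return sigma**2 + b @ S @ b + sigma**2 * np.trace(S @ W @ XtX @ W)

best_garrote = min(pmse_garrote(c) for c in np.linspace(0.0, 1.0, 201))
best_ridge = min(pmse_ridge(l) for l in np.linspace(0.0, 50.0, 501))
```

Since $c = 1$ and $\lambda = 0$ recover LS, the minimized curves can never exceed the LS value.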
To make SplitReg more flexible, similarly as for the garrote, we add a shrinkage parameter $c$ to the split coefficient estimates, yielding $c\, \hat\beta_{\mathrm{split}}$. Moreover, since splitting is not always better than joint estimation, as illustrated in Example 1, we make SplitReg adaptive by allowing it to choose whether to use the joint LS estimates or the split LS estimates. The decision to split or not to split is made by comparing the performance of the best shrunken SplitReg estimator to the best shrunken joint LS estimator, i.e. the simplified garrote estimator $\hat\beta_G(c)$. The PMSE of the shrunken SplitReg estimator is given by
$$\mathrm{PMSE}(c\, \hat\beta_{\mathrm{split}}) = \sigma^2 + \beta^\top (c M - I)^\top \Sigma_\rho (c M - I) \beta + \frac{c^2 \sigma^2}{n}\, \mathrm{tr}(\Sigma_\rho M),$$
with $0 \le c \le 1$ and where $M$ is the $2 \times 2$ matrix with $M_{11} = M_{22} = 1$ and $M_{12} = M_{21} = r$.
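The adaptive split-or-not decision can be sketched by minimizing both shrunken PMSE curves over $c$ and keeping the smaller one. Parameter values below are again illustrative:

```python
import numpy as np

# PMSE of the shrunken SplitReg estimator c * beta_split in the Example 1
# setting, and the adaptive split-or-not decision; parameter values are
# illustrative.
n, r, rho, sigma = 50, 0.5, 0.5, 1.0
beta = np.array([1.0, 0.2])
M = np.array([[1.0, r], [r, 1.0]])             # E[beta_split] = M beta
S = np.array([[1.0, rho], [rho, 1.0]])         # Sigma_rho
XtX_inv = np.linalg.inv(n * M)                 # here X'X = n M

def pmse_split(c):
    b = (c * M - np.eye(2)) @ beta             # bias of c * beta_split
    return sigma**2 + b @ S @ b + c**2 * sigma**2 * np.trace(S @ M) / n

def pmse_garrote(c):
    b = (c - 1.0) * beta                       # bias of c * beta_LS
    return sigma**2 + b @ S @ b + c**2 * sigma**2 * np.trace(S @ XtX_inv)

cs = np.linspace(0.0, 1.0, 401)
best_split = min(pmse_split(c) for c in cs)
best_joint = min(pmse_garrote(c) for c in cs)
# adaptive SplitReg uses whichever shrunken estimator predicts better
best_adaptive = min(best_split, best_joint)
```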
In Figure 1 we compare the minimal attainable PMSE of the LS, ridge regression, simplified garrote, lasso and adaptive SplitReg methods in the setting of Example 1. The error variance was chosen such that the signal-to-noise ratio equals 1 or 3, and several values of the correlation $r$ between the predictors in the training sample were considered. From these plots we can see that all methods can improve on LS, as expected. Moreover, the signal-to-noise ratio has little effect on the relative order of performance. On the other hand, changing the correlation does affect this order. When $r$ increases, the differences between the methods become larger as well, and the simplified garrote in particular is affected by the increasing correlation in the training data. However, the adaptive SplitReg consistently yields the best performance.
5 Conclusion

In this note we have shown that, besides regularization, split regressions can also be used to reduce the variability of LS. Both the generalized variance and the total variance of linear functions of the SplitReg coefficients are smaller than those based on the LS coefficients. This smaller variability can result in a lower MSE, as illustrated in a simple setting by comparing the PMSE of an adaptive SplitReg method with those of ridge regression, the simplified garrote and the lasso. We expect that the benefits are higher in high-dimensional regression models where spurious correlations frequently occur. However, determining the optimal splits is a difficult task in practice. An exhaustive search over all possible splits is not feasible, so approximate methods need to be developed. This is a topic of ongoing research. For example, minimizing a penalized loss function is a possibility that yields promising results, see Christidis et al. (2017).
References

- Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37(4), 373–384.
- Christidis, A., Lakshmanan, L. V. S., Smucler, E., and Zamar, R. (2017). Ensembles of Regularized Linear Models. Technical report, available at https://arxiv.org/abs/1712.03561
- Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, Florida.
- Hoerl, A. E. (1962). Application of Ridge Analysis to Regression Problems. Chemical Engineering Progress, 58(3), 54–59.
- Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.
- Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis, 2nd edition. Cambridge University Press, New York.
- Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics Doklady, 4, 1035–1038.