## 1 Introduction

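We consider the linear regression model

$$y = \mathbf{x}^\top \boldsymbol{\beta} + \varepsilon,$$

where the error $\varepsilon$ is assumed independent of $\mathbf{x}$, with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$. We focus on the estimation of linear functions, $\mathbf{a}^\top \boldsymbol{\beta}$, of the regression coefficients. For example, $\mathbf{a} = \mathbf{e}_j$ (the $j$-th unit vector) when the interest resides in a particular effect, or $\mathbf{a} = \mathbf{x}_0$ when the interest resides in the prediction of a new outcome $y_0$. We aim for estimators $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$ such that the mean squared error, $\mathrm{MSE}(\mathbf{a}^\top \hat{\boldsymbol{\beta}}) = E[(\mathbf{a}^\top \hat{\boldsymbol{\beta}} - \mathbf{a}^\top \boldsymbol{\beta})^2]$, is minimized. It holds that

$$\mathrm{MSE}(\mathbf{a}^\top \hat{\boldsymbol{\beta}}) = \left(E[\mathbf{a}^\top \hat{\boldsymbol{\beta}}] - \mathbf{a}^\top \boldsymbol{\beta}\right)^2 + \mathrm{Var}(\mathbf{a}^\top \hat{\boldsymbol{\beta}}), \tag{1}$$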

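where the two terms on the right-hand side are the squared bias and the variance, respectively. The standard estimator for $\boldsymbol{\beta}$ is the ordinary least squares (LS) estimator

$$\hat{\boldsymbol{\beta}}_{LS} = (X^\top X)^{-1} X^\top \mathbf{y},$$

where $X$ is the $n \times p$ design matrix and $\mathbf{y}$ is the response vector. Since $\hat{\boldsymbol{\beta}}_{LS}$ is unbiased, we obtain that $\mathrm{MSE}(\mathbf{a}^\top \hat{\boldsymbol{\beta}}_{LS}) = \mathbf{a}^\top \Sigma_{LS}\, \mathbf{a}$ with $\Sigma_{LS} = \sigma^2 (X^\top X)^{-1}$. Moreover, by the Gauss-Markov theorem we know that $\mathrm{Var}(\mathbf{a}^\top \hat{\boldsymbol{\beta}}_{LS}) \le \mathrm{Var}(\mathbf{a}^\top \tilde{\boldsymbol{\beta}})$ for any linear unbiased estimator $\tilde{\boldsymbol{\beta}}$.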

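Therefore, one focuses on biased estimators when one aims to reduce the MSE. A common approach in this regard is regularization by shrinkage (Tikhonov, 1963; Hoerl, 1962). The most popular family of shrinkage estimators is ridge regression (Hoerl and Kennard, 1970), given by

$$\hat{\boldsymbol{\beta}}_{R}(\lambda) = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}, \qquad \lambda > 0.$$

It can be shown that $\mathrm{Cov}(\hat{\boldsymbol{\beta}}_{R}(\lambda)) \preceq \mathrm{Cov}(\hat{\boldsymbol{\beta}}_{LS}) = \sigma^2 (X^\top X)^{-1}$ for any given $\lambda > 0$, where $\preceq$ represents inequality in the Loewner partial order for symmetric matrices [i.e. $A \preceq B$ if $B - A$ is non-negative definite, see e.g. Horn and Johnson (2012)]. Indeed, taking (w.l.g.) $\sigma^2 = 1$ and writing $S = X^\top X$, we have that

$$S^{-1} - (S + \lambda I)^{-1} S\, (S + \lambda I)^{-1} = (S + \lambda I)^{-1} \left( 2\lambda I + \lambda^2 S^{-1} \right) (S + \lambda I)^{-1},$$

which is non-negative definite.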
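The Loewner ordering between the ridge and LS covariance matrices can also be checked numerically. The following is a minimal sketch, assuming a fixed simulated design; the helper names `ls_cov`, `ridge_cov` and `loewner_leq` are hypothetical and introduced only for illustration.

```python
import numpy as np

def ls_cov(X, sigma2=1.0):
    """Covariance of the LS estimator for fixed X: sigma^2 (X'X)^{-1}."""
    return sigma2 * np.linalg.inv(X.T @ X)

def ridge_cov(X, lam, sigma2=1.0):
    """Covariance of the ridge estimator (X'X + lam I)^{-1} X'y for fixed X."""
    S = X.T @ X
    A = np.linalg.inv(S + lam * np.eye(S.shape[0]))
    return sigma2 * A @ S @ A

def loewner_leq(A, B, tol=1e-10):
    """True if B - A is non-negative definite (up to a numerical tolerance)."""
    D = B - A
    return bool(np.linalg.eigvalsh((D + D.T) / 2).min() >= -tol)

rng = np.random.default_rng(0)          # simulated design, illustration only
X = rng.standard_normal((50, 4))
checks = [loewner_leq(ridge_cov(X, lam), ls_cov(X)) for lam in (0.1, 1.0, 10.0)]
print(checks)
```

The check symmetrizes the difference before calling `eigvalsh` so that rounding errors in the matrix products do not affect the eigenvalue computation.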

Another well-known shrinkage family is the non-negative garrote (Breiman, 1995). A simplified version of this method (which we use in this paper) is defined as

with In Section 3 we show that both the total variance and the generalized variance of are smaller than those of . There is an extensive literature on shrinkage in linear regression, which mainly focuses on automatic variable selection, with the lasso as the most prominent example; see e.g. Hastie, Tibshirani, and Wainwright (2015).

In Section 2 we consider a new family of estimators, which we call split regression estimators (SplitReg). In Section 3, we show that the total variance and generalized variance of the simplified garrote and SplitReg are smaller than those of LS. We illustrate the potential usefulness of SplitReg in Section 4 through numerical comparisons with ridge regression, the garrote and the lasso. Finally, Section 5 concludes.

## 2 Split Estimators

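Consider a partitioning of the design matrix $X = (X_1, X_2)$ with $X_1$ an $n \times p_1$ matrix and $X_2$ an $n \times p_2$ matrix, $p_1 + p_2 = p$, and partition $\boldsymbol{\beta} = (\boldsymbol{\beta}_1^\top, \boldsymbol{\beta}_2^\top)^\top$ accordingly. According to this partition, consider the matrices $X_1^\top X_1$, $X_2^\top X_2$ and $X_1^\top X_2$. Then, we define the corresponding split LS estimator by

$$\hat{\boldsymbol{\beta}}_{S} = \begin{pmatrix} (X_1^\top X_1)^{-1} X_1^\top \mathbf{y} \\ (X_2^\top X_2)^{-1} X_2^\top \mathbf{y} \end{pmatrix},$$

which has mean

$$E(\hat{\boldsymbol{\beta}}_{S}) = \begin{pmatrix} \boldsymbol{\beta}_1 + (X_1^\top X_1)^{-1} X_1^\top X_2\, \boldsymbol{\beta}_2 \\ \boldsymbol{\beta}_2 + (X_2^\top X_2)^{-1} X_2^\top X_1\, \boldsymbol{\beta}_1 \end{pmatrix}$$

and covariance matrix

$$\mathrm{Cov}(\hat{\boldsymbol{\beta}}_{S}) = \sigma^2 \begin{pmatrix} (X_1^\top X_1)^{-1} & (X_1^\top X_1)^{-1} X_1^\top X_2 (X_2^\top X_2)^{-1} \\ (X_2^\top X_2)^{-1} X_2^\top X_1 (X_1^\top X_1)^{-1} & (X_2^\top X_2)^{-1} \end{pmatrix}. \tag{2}$$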
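As an illustration of the definition, a split LS fit can be sketched by running LS separately on each block of columns and stacking the coefficients. This is a minimal sketch under the assumption of a fixed design and a user-supplied partition; the function names `split_ls` and `split_ls_moments` and the simulated data are hypothetical.

```python
import numpy as np

def split_ls(X, y, groups):
    """Split LS: fit LS separately on each column group and stack the coefficients.

    groups: list of lists of column indices forming a partition of range(p).
    """
    beta = np.zeros(X.shape[1])
    for g in groups:
        Xg = X[:, g]
        beta[g] = np.linalg.solve(Xg.T @ Xg, Xg.T @ y)
    return beta

def split_ls_moments(X, beta, groups, sigma2=1.0):
    """Mean and covariance of the split LS estimator for fixed X and true beta."""
    n, p = X.shape
    M = np.zeros((p, n))                  # linear map y -> split LS estimate
    for g in groups:
        Xg = X[:, g]
        M[g, :] = np.linalg.solve(Xg.T @ Xg, Xg.T)
    return M @ X @ beta, sigma2 * M @ M.T

rng = np.random.default_rng(1)            # simulated data, illustration only
X = rng.standard_normal((40, 3))
beta = np.array([1.0, -0.5, 2.0])
groups = [[0, 1], [2]]
mean_s, cov_s = split_ls_moments(X, beta, groups)
print(mean_s - beta)                      # bias induced by splitting
```

Because the split estimator is linear in the response, its mean and covariance follow directly from the matrix `M`; with a single group containing all columns the function reduces to ordinary LS.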

Example 1. Consider the linear regression model with . To simplify the notation we assume that the explanatory variables have been centered and scaled so that and let us denote

Suppose that is the main effect and is a control variable. Hence, we are mainly interested in estimating . Let us compare the LS estimator with the SplitReg estimator . The MSE for and is and , respectively. Therefore, is preferred if and only if Hence, SplitReg becomes more attractive when there is a larger error variance and/or a larger correlation between the explanatory variables (multicollinearity) and/or a smaller sample size. For example, if and , then the SplitReg estimator is preferred if
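The comparison in this example can be sketched numerically. The code below is a minimal sketch that assumes the two predictors are centered and scaled so that each satisfies $\mathbf{x}_j^\top \mathbf{x}_j = n$, and writes $r = \mathbf{x}_1^\top \mathbf{x}_2 / n$ for their sample correlation; this scaling, the function names and the parameter values are assumptions made for illustration. Under these assumptions the LS estimator of the first coefficient has MSE $\sigma^2 / (n(1 - r^2))$, while the split estimator has variance $\sigma^2 / n$ and bias $r \beta_2$.

```python
import numpy as np

def mse_ls_beta1(sigma2, n, r):
    """MSE of the LS estimator of beta_1 (unbiased, so MSE = variance)."""
    return sigma2 / (n * (1.0 - r**2))

def mse_split_beta1(sigma2, n, r, beta2):
    """MSE of the split estimator of beta_1: variance sigma2/n plus squared bias."""
    return sigma2 / n + (r * beta2) ** 2

# Illustrative design: centered columns scaled so that x_j'x_j = n (an assumption).
rng = np.random.default_rng(2)
n, sigma2, beta = 30, 4.0, np.array([1.0, 0.5])
Z = rng.standard_normal((n, 2))
Z[:, 1] += 0.8 * Z[:, 0]                      # induce correlation
Z -= Z.mean(axis=0)
X = Z * np.sqrt(n) / np.linalg.norm(Z, axis=0)
x1, x2 = X[:, 0], X[:, 1]
r = x1 @ x2 / n

# Exact matrix-based values to compare with the closed forms above.
exact_ls = sigma2 * np.linalg.inv(X.T @ X)[0, 0]
exact_split = sigma2 / (x1 @ x1) + ((x1 @ x2) / (x1 @ x1) * beta[1]) ** 2
split_is_better = mse_split_beta1(sigma2, n, r, beta[1]) < mse_ls_beta1(sigma2, n, r)
print(r, exact_ls, exact_split, split_is_better)
```

For $r \neq 0$ the comparison reduces to $\beta_2^2 < \sigma^2 / (n(1 - r^2))$, which makes explicit why a larger error variance, a larger correlation or a smaller sample size favors splitting.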

## 3 Variance Inequalities

To compare the variability of the estimators , and we use the following expression for . Let and , then we have

(3) |

Proposition 1 (Generalized Variance Inequality). Consider the three estimators above, then we have the following inequalities for the generalized variance

Proof. To prove (a) we notice that

(4) |

Similarly using that we have

(5) |

From (4) and (5) it suffices to show that

Or equivalently,

This holds because and implies and respectively.

Part (b) follows because

Proposition 2 (Total Variance Inequality). Consider again the three estimators , . Then we have the following inequalities for the total variance:

Proof. From (2) it follows that

On the other hand, using equation (3) we obtain that

Now, notice that

which implies

To prove (b) let us set , then we have that
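The generalized and total variance inequalities for the split LS estimator relative to LS can be sanity-checked on simulated designs. The following is a minimal sketch, assuming a fixed design with two blocks of columns; `split_cov` is a hypothetical helper and the simulated designs are for illustration only.

```python
import numpy as np

def split_cov(X, groups, sigma2=1.0):
    """Covariance of the split LS estimator for fixed X and a column partition."""
    n, p = X.shape
    M = np.zeros((p, n))                  # linear map y -> split LS estimate
    for g in groups:
        Xg = X[:, g]
        M[g, :] = np.linalg.solve(Xg.T @ Xg, Xg.T)
    return sigma2 * M @ M.T

rng = np.random.default_rng(3)            # simulated designs, illustration only
results = []
for _ in range(20):
    X = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 5))  # correlated columns
    cov_ls = np.linalg.inv(X.T @ X)
    cov_split = split_cov(X, [[0, 1, 2], [3, 4]])
    # Generalized variance = determinant (compared on the log scale for stability).
    gen_ok = np.linalg.slogdet(cov_split)[1] <= np.linalg.slogdet(cov_ls)[1] + 1e-9
    # Total variance = trace.
    tot_ok = np.trace(cov_split) <= np.trace(cov_ls) * (1 + 1e-9)
    results.append((gen_ok, tot_ok))
print(all(g and t for g, t in results))
```

Such a check does not replace the proofs; it only confirms the direction of the inequalities on a handful of random designs.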

To show that split LS estimators improve the total and generalized variance of the least squares estimator more generally, we now consider nested partitions of the set . A partition is nested in the partition if for all there is a subset such that

For simplicity, we will consider the partitions and with and We will also assume (w.l.g.) that and The corresponding partitions of the matrix are

with

We now compare the split estimators
and .

Proposition 3 (Nested Partitions Inequality).

Proof. To simplify the notation, assume that . We have

and

Therefore, to prove (a) write

Since we just need to show that .
But this has been shown in Proposition 1(a) with
playing the role of .

Similarly, to prove (b) we write

Hence, we just need to show that but this has already been shown in Proposition 2(a) with playing the role of .
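The effect of refining a partition can be illustrated in the same way. The sketch below assumes that the proposition orders the generalized and total variances so that a finer (nested) partition gives values no larger than a coarser one, which in turn gives values no larger than LS; the helper `split_cov`, the partitions and the simulated design are hypothetical.

```python
import numpy as np

def split_cov(X, groups, sigma2=1.0):
    """Covariance of the split LS estimator for fixed X and a column partition."""
    n, p = X.shape
    M = np.zeros((p, n))                  # linear map y -> split LS estimate
    for g in groups:
        Xg = X[:, g]
        M[g, :] = np.linalg.solve(Xg.T @ Xg, Xg.T)
    return sigma2 * M @ M.T

rng = np.random.default_rng(5)            # simulated design, illustration only
X = rng.standard_normal((80, 5)) @ rng.standard_normal((5, 5))
partitions = {
    "fine": [[0], [1, 2], [3, 4]],        # nested in "coarse"
    "coarse": [[0, 1, 2], [3, 4]],
    "ls": [[0, 1, 2, 3, 4]],              # a single block is ordinary LS
}
logdet = {k: np.linalg.slogdet(split_cov(X, g))[1] for k, g in partitions.items()}
trace = {k: np.trace(split_cov(X, g)) for k, g in partitions.items()}
print(logdet, trace)
```

Treating LS as the split estimator with a single block makes the chain of comparisons uniform: each step moves from a partition to one nested in it.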

###### Remark 1

From Propositions 2 and 3 it follows that

and also

## 4 Potential benefit of split regression estimators

Our aim is to illustrate the potential benefits of SplitReg for the prediction of new outcomes. We use the model and notation introduced in Example 1. The prediction of a new outcome corresponds to the choice and the prediction performance of an estimator is measured by averaging over the values of , that is we calculate the prediction mean squared error . To calculate the expectation we assume that where is a correlation matrix with parameter It can be shown that for LS prediction

For garrote prediction with weights we obtain that

For ridge regression prediction with penalty we have

where with and with , We also consider prediction based on the lasso estimator, denoted by . However, in this case we do not have a closed-form expression for , so we calculate it numerically. We determine the optimal prediction performance of ridge regression, lasso and garrote corresponding to the values of the shrinkage parameter or that minimize the PMSE.
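The search for the PMSE-minimizing ridge penalty can be sketched as a simple grid search. This is a minimal sketch, assuming a fixed training design, a new predictor vector with mean zero and covariance `Sigma0` that is independent of the training data, and that the PMSE is the MSE of the predicted mean averaged over the new predictor; the function names, grid and parameter values are hypothetical.

```python
import numpy as np

def pmse(bias, cov, Sigma0):
    """Average over x0 of MSE(x0' beta_hat) when x0 has mean 0 and covariance Sigma0."""
    return float(np.trace(Sigma0 @ (cov + np.outer(bias, bias))))

def ridge_pmse(X, beta, lam, sigma2, Sigma0):
    """PMSE of ridge regression with penalty lam, for fixed X and true beta."""
    S = X.T @ X
    A = np.linalg.inv(S + lam * np.eye(len(beta)))
    bias = A @ S @ beta - beta
    cov = sigma2 * A @ S @ A
    return pmse(bias, cov, Sigma0)

rng = np.random.default_rng(4)            # simulated training design, illustration only
n, p, sigma2, rho0 = 25, 2, 1.0, 0.5
X = rng.standard_normal((n, p))
beta = np.array([1.0, 1.0])
Sigma0 = np.array([[1.0, rho0], [rho0, 1.0]])

lams = np.linspace(0.0, 20.0, 401)
values = np.array([ridge_pmse(X, beta, lam, sigma2, Sigma0) for lam in lams])
best_lam, best_pmse = lams[values.argmin()], values.min()
ls_pmse = values[0]                       # lam = 0 corresponds to LS
print(best_lam, best_pmse, ls_pmse)
```

Since the grid includes a zero penalty, the minimum found can never exceed the LS value; a finer grid or a one-dimensional optimizer could replace the grid if more precision is needed.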

To make SplitReg more flexible, similarly as for the garrote, we add a shrinkage parameter to the split coefficient estimates . Moreover, since splitting is not always better than joint estimation as illustrated in Example 1, we make SplitReg adaptive by allowing it to choose whether to use the joint LS estimates or to use the split LS estimates. The decision to split or not to split is made by comparing the performance of the best shrunken SplitReg estimator to the best shrunken joint LS estimator, i.e. the simplified garrote estimator . The PMSE of the shrunken SplitReg estimator is given by

with and where is a matrix with and .
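The adaptive choice between shrunken split estimates and shrunken joint LS estimates can be sketched as follows. This is only a sketch under a strong simplifying assumption: the shrinkage is taken to be a single scalar factor `w` in [0, 1] multiplying all coefficients, which may differ from the parametrization used here. The function names, the grid and the PMSE definition (MSE of the predicted mean averaged over a new predictor with mean zero and covariance `Sigma0`) are likewise assumptions made for illustration.

```python
import numpy as np

def split_matrix(X, groups):
    """Linear map M such that M @ y is the split LS estimate for the given partition."""
    n, p = X.shape
    M = np.zeros((p, n))
    for g in groups:
        Xg = X[:, g]
        M[g, :] = np.linalg.solve(Xg.T @ Xg, Xg.T)
    return M

def shrunken_pmse(M, X, beta, w, sigma2, Sigma0):
    """PMSE of the shrunken linear estimator w * (M @ y), x0 with mean 0 and cov Sigma0."""
    bias = w * (M @ X @ beta) - beta
    cov = sigma2 * w**2 * (M @ M.T)
    return float(np.trace(Sigma0 @ (cov + np.outer(bias, bias))))

def adaptive_split_pmse(X, beta, sigma2, Sigma0, groups, grid=np.linspace(0.0, 1.0, 101)):
    """Pick the better of the best shrunken split fit and the best shrunken joint LS fit."""
    M_joint = split_matrix(X, [list(range(X.shape[1]))])   # one block = joint LS
    M_split = split_matrix(X, groups)
    best_joint = min(shrunken_pmse(M_joint, X, beta, w, sigma2, Sigma0) for w in grid)
    best_split = min(shrunken_pmse(M_split, X, beta, w, sigma2, Sigma0) for w in grid)
    return ("split", best_split) if best_split < best_joint else ("joint", best_joint)

rng = np.random.default_rng(6)            # simulated training design, illustration only
X = rng.standard_normal((25, 2))
X[:, 1] += 0.9 * X[:, 0]                  # correlated predictors
beta = np.array([1.0, 1.0])
Sigma0 = np.array([[1.0, 0.5], [0.5, 1.0]])
choice, value = adaptive_split_pmse(X, beta, 1.0, Sigma0, groups=[[0], [1]])
print(choice, value)
```

Because the grid contains `w = 1`, the adaptive value is never worse than the unshrunken LS PMSE in this sketch; whether splitting or joint estimation is selected depends on the design, the coefficients and the error variance.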

In Figure 1 we compare the minimal attainable PMSE by the LS, ridge regression, simplified garrote, lasso and adaptive SplitReg methods as a function of when , and . The error variance was chosen such that the signal to noise ratio equals 1 or 3 and the correlation between the predictors in the training sample was given by or . From these plots we can see that all methods can improve on LS as expected. Moreover, the signal to noise ratio has little effect on the order of performance. On the other hand, changing the correlation does affect the order of performance. When increases, the differences between the methods become larger as well and especially the simplified garrote is affected by the increasing correlation in the training data. However, the adaptive SplitReg consistently yields the best performance.

## 5 Conclusions

In this note we have shown that, next to regularization, split regressions can also be used to reduce the variability of LS. Both the generalized variance and the total variance of linear functions of the SplitReg coefficients are smaller than those based on LS coefficients. This smaller variability can result in a lower MSE, as illustrated in a simple setting by comparing the PMSE of an adaptive SplitReg method with ridge regression, the simplified garrote and the lasso. We expect that the benefits are higher in high-dimensional regression models where spurious correlations frequently occur. However, determining the optimal splits is a difficult task in practice. An exhaustive search over all possible splits is not feasible, so approximate methods need to be developed. This is a topic of ongoing research. For example, minimizing a penalized loss function is a possibility that yields promising results, see Christidis et al. (2017).

## References

- Breiman, L. (1995). Better subset regression using the nonnegative garrote, Technometrics, 37(4), 373–384.
- Christidis, A., Lakshmanan, L. V. S., Smucler, E., and Zamar, R. (2017). Ensembles of Regularized Linear Models. Technical report, available at https://arxiv.org/abs/1712.03561
- Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations, Chapman and Hall/CRC, Florida.
- Hoerl, A. E. (1962). Application of Ridge Analysis to Regression Problems, Chemical Engineering Progress, 58(3), 54–59.
- Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems, Technometrics, 12(1), 55–67.
- Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis, 2nd edition, Cambridge University Press, New York.
- Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method, Soviet Mathematics Doklady, 4, 1035–1038.
