    Split regression modeling

In this note we study the benefits of splitting variables for reducing the variance of linear functions of the regression coefficient estimate. We show that splitting combined with shrinkage can result in estimators with smaller mean squared error compared to popular shrinkage estimators such as lasso, ridge regression and the garrote.

02/12/2020

1 Introduction

We consider the linear regression model

 y_i = x_i′β + ε_i,   i = 1,…,n,

where the error ε_i is assumed independent from x_i, with E(ε_i) = 0 and Var(ε_i) = σ². We focus on the estimation of linear functions, γ = c′β, of the regression coefficients. For example, c = e_j (the j-th unit vector) when the interest resides in a particular effect β_j, or c = x_0 when the interest resides in the prediction of a new outcome y_0 = x_0′β + ε_0. We aim for estimators γ̂ of γ such that the mean squared error, MSE(γ̂) = E[(γ̂ − γ)²], is minimized. It holds that

 MSE(γ̂) = (E(γ̂) − γ)² + Var(γ̂),   (1)

where the two terms on the right hand side are the squared bias and the variance, respectively. The standard estimator for β is the ordinary least squares (LS) estimator

 β̂ = (X′X)^{-1} X′y,

where X is the n×d design matrix and y = (y_1,…,y_n)′ is the response vector. Since β̂ is unbiased, we obtain that MSE(c′β̂) = Var(c′β̂), with Var(c′β̂) = σ² c′(X′X)^{-1} c. Moreover, by the Gauss-Markov theorem we know that Var(c′β̂) ≤ Var(c′b) for any linear unbiased estimator b of β. Therefore, one focuses on biased estimators when one aims to reduce the variance.
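As a sanity check, the variance formula Var(c′β̂) = σ²c′(X′X)^{-1}c can be verified by simulation. The sketch below (NumPy, with arbitrary illustrative choices of n, d, β and c) redraws the errors many times while holding the design fixed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 1.0
X = rng.normal(size=(n, d))
beta = np.array([1.0, -2.0, 0.5])
c = np.array([1.0, 0.0, 0.0])              # target: the first coefficient, gamma = c'beta

A_inv = np.linalg.inv(X.T @ X)
var_theory = sigma**2 * c @ A_inv @ c      # Var(c'beta_hat) = sigma^2 c'(X'X)^{-1} c

# Monte Carlo: redraw the error vector 20000 times with X held fixed
Y = X @ beta + sigma * rng.normal(size=(20000, n))
B_hat = Y @ X @ A_inv                      # each row is one LS estimate beta_hat'
est = B_hat @ c                            # 20000 draws of c'beta_hat
print(est.mean(), est.var(), var_theory)
```

The empirical mean of c′β̂ matches c′β (unbiasedness) and its empirical variance matches the plug-in formula.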

A common approach in this regard is regularization by shrinkage  (Tikhonov, 1963; Hoerl, 1962). The most popular family of shrinkage estimators is ridge regression (Hoerl and Kennard, 1970), given by

 β̂_λ = (X′X + λI)^{-1} X′y,   λ ≥ 0.

It can be shown that Cov(β̂_λ) ⪯ Cov(β̂) for any given λ ≥ 0, where ⪯ represents inequality in the Loewner partial order for symmetric matrices [i.e. C ⪯ D if D − C is non-negative definite, see e.g. Horn and Johnson (2012)]. Indeed, taking σ² = 1 (w.l.g.) and writing A = X′X, we have that

 Cov(β̂) − Cov(β̂_λ) = A^{-1} − (A + λI)^{-1} A (A + λI)^{-1} ⪰ 0,

which follows since (A + λI) A^{-1} (A + λI) = A + 2λI + λ² A^{-1} ⪰ A.
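This Loewner comparison is easy to spot-check numerically: for any design and penalty, the difference Cov(β̂) − Cov(β̂_λ) should have no negative eigenvalues. A minimal sketch (arbitrary design and λ, σ² = 1):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
A = X.T @ X
lam = 3.0

cov_ls = np.linalg.inv(A)                        # Cov(beta_hat), sigma^2 = 1 w.l.g.
M = np.linalg.inv(A + lam * np.eye(4))
cov_ridge = M @ A @ M                            # Cov(beta_hat_lambda)

# Loewner order: the difference must be non-negative definite
eigs = np.linalg.eigvalsh(cov_ls - cov_ridge)
print(eigs.min())
```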

Another well-known shrinkage family is the non-negative garrote (Breiman, 1995). A simplified version of this method (which we use in this paper) is defined as

 β̂_ω = D β̂,   D = diag(ω),

with ω = (ω_1,…,ω_d)′ and 0 ≤ ω_i ≤ 1. In Section 3 we show that both the total variance and the generalized variance of β̂_ω are smaller than those of β̂. There is an extensive literature on shrinkage in linear regression which mainly focuses on automatic variable selection, with lasso as the most prominent example; see e.g. Hastie, Tibshirani, and Wainwright (2015).

In Section 2 we consider a new family of estimators, which we call split regression estimators (SplitReg). In Section 3, we show that the total variance and generalized variance of the simplified garrote and SplitReg are smaller than those of LS. We illustrate the potential usefulness of SplitReg in Section 4 through results of some numerical comparisons with ridge regression, garrote and LASSO. Finally, Section 5 concludes.

2 Split Estimators

Consider a partitioning X = (X_1, X_2) of the design matrix, with X_1 an n×d_1 matrix and X_2 an n×d_2 matrix (d_1 + d_2 = d). According to this partition, consider the matrices

 A = X′X = ( A_11  A_12 ; A_21  A_22 )   and   Ā = ( A_11  −A_12 ; −A_21  A_22 ).

Then, we define the corresponding split LS estimator by

 β̃ = diag(A_11^{-1}, A_22^{-1}) X′y,

which has mean

 E(β̃) = diag(A_11^{-1}, A_22^{-1}) A β = (β_1 + A_11^{-1} A_12 β_2 , β_2 + A_22^{-1} A_21 β_1)′

and covariance matrix

 Cov(β̃) = σ² diag(A_11^{-1}, A_22^{-1}) A diag(A_11^{-1}, A_22^{-1}).   (2)

Example 1. Consider the linear regression model with d = 2. To simplify the notation we assume that the explanatory variables have been centered and scaled so that x_j′x_j = n (j = 1, 2), and let us denote by r = x_1′x_2/n their sample correlation, so that A = X′X = n ( 1  r ; r  1 ).

Suppose that x_1 is the main effect and x_2 is a control variable. Hence, we are mainly interested in estimating β_1. Let us compare the LS estimator β̂_1 with the SplitReg estimator β̃_1 = x_1′y/n corresponding to the partition ({1}, {2}). The MSE for β̂_1 and β̃_1 is σ²/(n(1−r²)) and r²β_2² + σ²/n, respectively. Therefore, β̃_1 is preferred if and only if β_2² ≤ σ²/(n(1−r²)), that is, |β_2| ≤ σ/√(n(1−r²)). Hence, SplitReg becomes more attractive when there is a larger error variance and/or a larger correlation between the explanatory variables (multicollinearity) and/or a smaller sample size.
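The two MSE expressions in Example 1 can be confirmed by simulation. The sketch below (illustrative values n = 50, r = 0.9, σ = 1, with β_2 chosen below the threshold, so splitting should win) builds a design with exact unit scaling and correlation r:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, sigma = 50, 0.9, 1.0
beta = np.array([1.0, 0.1])            # beta_2 below the threshold sigma/sqrt(n(1-r^2))

# centered design with exact scaling x_j'x_j = n and correlation x_1'x_2 = n r
Z = rng.normal(size=(n, 2))
Z -= Z.mean(axis=0)
Q, _ = np.linalg.qr(Z)                 # orthonormal (and still centered) columns
X = np.sqrt(n) * np.column_stack([Q[:, 0], r * Q[:, 0] + np.sqrt(1 - r**2) * Q[:, 1]])

A_inv = np.linalg.inv(X.T @ X)
Ys = X @ beta + sigma * rng.normal(size=(20000, n))   # 20000 replicated samples
ls = (Ys @ X @ A_inv)[:, 0]            # joint LS estimates of beta_1
split = Ys @ X[:, 0] / n               # split LS: regress y on x_1 alone

mse_ls = np.mean((ls - beta[0])**2)
mse_split = np.mean((split - beta[0])**2)
mse_ls_theory = sigma**2 / (n * (1 - r**2))           # variance of the joint LS estimate
mse_split_theory = r**2 * beta[1]**2 + sigma**2 / n   # bias^2 + variance of the split estimate
print(mse_ls, mse_ls_theory, mse_split, mse_split_theory)
```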

3 Variance Inequalities

To compare the variability of the estimators β̂, β̂_ω and β̃ we use the following expression for A^{-1}. Let B_11 = (A_11 − A_12 A_22^{-1} A_21)^{-1} and B_22 = (A_22 − A_21 A_11^{-1} A_12)^{-1}; then we have

 A^{-1} = diag(B_11, B_22) Ā diag(A_11^{-1}, A_22^{-1}).   (3)
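Identity (3) is easy to verify numerically for a random design; the sketch below (illustrative block sizes d_1 = 2, d_2 = 3) compares both sides entrywise:

```python
import numpy as np

rng = np.random.default_rng(3)
d1, d2 = 2, 3
X = rng.normal(size=(40, d1 + d2))
A = X.T @ X
A11, A12 = A[:d1, :d1], A[:d1, d1:]
A21, A22 = A[d1:, :d1], A[d1:, d1:]

# B_11 and B_22 are the inverse Schur complements
B11 = np.linalg.inv(A11 - A12 @ np.linalg.inv(A22) @ A21)
B22 = np.linalg.inv(A22 - A21 @ np.linalg.inv(A11) @ A12)
A_bar = np.block([[A11, -A12], [-A21, A22]])
D_B = np.block([[B11, np.zeros((d1, d2))], [np.zeros((d2, d1)), B22]])
D_A = np.block([[np.linalg.inv(A11), np.zeros((d1, d2))],
                [np.zeros((d2, d1)), np.linalg.inv(A22)]])

lhs = np.linalg.inv(A)
rhs = D_B @ A_bar @ D_A                # right hand side of identity (3)
max_err = np.max(np.abs(lhs - rhs))
print(max_err)
```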

Proposition 1 (Generalized Variance Inequality). Consider the three estimators above. Then we have the following inequalities for the generalized variance:

 (a) det(Cov(β̃)) ≤ det(Cov(β̂)),   (b) det(Cov(β̂_ω)) ≤ det(Cov(β̂)).

Proof.  To prove (a) we notice that

 det(Cov(β̃)) = σ^{2d} [det(A_11^{-1}) det(A_22^{-1})]² det(A).   (4)

Similarly, using (3) and the fact that det(Ā) = det(A), we have

 det(Cov(β̂)) = σ^{2d} [det(A_11^{-1}) det(A_22^{-1})] [det(B_11) det(B_22)] det(A).   (5)

From (4) and (5)  it suffices to show that

 det(A_11^{-1}) det(A_22^{-1}) ≤ det(B_11) det(B_22),

or equivalently,

 det(A_11) det(A_22) ≥ det(A_11 − A_12 A_22^{-1} A_21) det(A_22 − A_21 A_11^{-1} A_12).

This holds because A_11 ⪰ A_11 − A_12 A_22^{-1} A_21 and A_22 ⪰ A_22 − A_21 A_11^{-1} A_12 imply det(A_11) ≥ det(A_11 − A_12 A_22^{-1} A_21) and det(A_22) ≥ det(A_22 − A_21 A_11^{-1} A_12), respectively.

Part (b) follows because det(D) = ∏_{i=1}^d ω_i ≤ 1, so that

 det(Cov(β̂_ω)) = σ^{2d} det(D)² det(A^{-1}) ≤ σ^{2d} det(A^{-1}) = det(Cov(β̂)).

Proposition 2 (Total Variance Inequality). Consider again the three estimators β̂, β̂_ω and β̃. Then we have the following inequalities for the total variance:

 (a) tr(Cov(β̃)) ≤ tr(Cov(β̂)),   (b) tr(Cov(β̂_ω)) ≤ tr(Cov(β̂)).

Proof.  From (2) it follows that

 tr(Cov(β̃)) = σ² [tr(A_11^{-1}) + tr(A_22^{-1})].

On the other hand, using equation (3) we obtain that

 tr(Cov(β̂)) = σ² tr(A^{-1}) = σ² [tr((A_11 − A_12 A_22^{-1} A_21)^{-1}) + tr((A_22 − A_21 A_11^{-1} A_12)^{-1})].

Now, notice that

 A_11 ⪰ A_11 − A_12 A_22^{-1} A_21 ⟹ A_11^{-1} ⪯ (A_11 − A_12 A_22^{-1} A_21)^{-1},
 A_22 ⪰ A_22 − A_21 A_11^{-1} A_12 ⟹ A_22^{-1} ⪯ (A_22 − A_21 A_11^{-1} A_12)^{-1},

which implies

 tr(A_11^{-1}) ≤ tr((A_11 − A_12 A_22^{-1} A_21)^{-1}),   tr(A_22^{-1}) ≤ tr((A_22 − A_21 A_11^{-1} A_12)^{-1}).

To prove (b), let us set B = A^{-1} with diagonal elements b_ii; then we have that

 tr(Cov(β̂_ω)) = σ² tr(D²B) = σ² ∑_{i=1}^d ω_i² b_ii ≤ σ² ∑_{i=1}^d b_ii = tr(Cov(β̂)).
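Propositions 1 and 2 can be spot-checked numerically. The sketch below (random design, random garrote weights in [0, 1], σ² = 1, illustrative dimensions) evaluates all four inequalities:

```python
import numpy as np

rng = np.random.default_rng(4)
d1, d2 = 2, 2
X = rng.normal(size=(30, d1 + d2))
A = X.T @ X                                   # sigma^2 = 1 w.l.g.

S = np.zeros((d1 + d2, d1 + d2))              # diag(A_11^{-1}, A_22^{-1})
S[:d1, :d1] = np.linalg.inv(A[:d1, :d1])
S[d1:, d1:] = np.linalg.inv(A[d1:, d1:])

cov_ls = np.linalg.inv(A)                     # Cov(beta_hat)
cov_split = S @ A @ S                         # Cov(beta_tilde), equation (2)
D = np.diag(rng.uniform(0, 1, size=d1 + d2))  # random garrote weights in [0, 1]
cov_gar = D @ cov_ls @ D                      # Cov(beta_hat_omega)

checks = [np.linalg.det(cov_split) <= np.linalg.det(cov_ls),   # Prop 1(a)
          np.linalg.det(cov_gar) <= np.linalg.det(cov_ls),     # Prop 1(b)
          np.trace(cov_split) <= np.trace(cov_ls),             # Prop 2(a)
          np.trace(cov_gar) <= np.trace(cov_ls)]               # Prop 2(b)
print(checks)
```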

To show that split LS estimators improve the total and generalized variance of the least squares estimator more generally, we now consider nested partitions of the set {1,…,d}. A partition {J_1,…,J_m} is nested in the partition {I_1,…,I_M} if for all l there is a subset M_l ⊆ {1,…,M} such that

 J_l = ∪_{α ∈ M_l} I_α.

For simplicity, we will consider the partitions {J_1, J_2} and {I_1, I_2, I_3} with J_1 = I_1 ∪ I_2 and J_2 = I_3. We will also assume (w.l.g.) that I_1 = {1,…,d_1}, I_2 = {d_1+1,…,d_1+d_2} and I_3 = {d_1+d_2+1,…,d}. The corresponding partitions of the matrix A are

 A = ( A_11  A_12  A_13 ; A_21  A_22  A_23 ; A_31  A_32  A_33 )   and   A = ( C_11  C_12 ; C_21  C_22 ),

with

 C_11 = ( A_11  A_12 ; A_21  A_22 ),   C_12 = ( A_13 ; A_23 )   and   C_22 = A_33.

We now compare the split estimators β̃ (based on {J_1, J_2}) and α̃ (based on {I_1, I_2, I_3}).

Proposition 3 (Nested Partitions Inequality).

 (a) det(Cov(α̃)) ≤ det(Cov(β̃)),   (b) tr(Cov(α̃)) ≤ tr(Cov(β̃)).

Proof. To simplify the notation, assume that σ² = 1. We have

 Cov(β̃) = diag(C_11^{-1}, C_22^{-1}) A diag(C_11^{-1}, C_22^{-1})

and

 Cov(α̃) = diag(A_11^{-1}, A_22^{-1}, A_33^{-1}) A diag(A_11^{-1}, A_22^{-1}, A_33^{-1}).

Therefore, to prove (a) write

 det(Cov(β̃)) = [det(C_11^{-1}) det(C_22^{-1})]² det(A),
 det(Cov(α̃)) = [det(A_11^{-1}) det(A_22^{-1}) det(A_33^{-1})]² det(A).

Since C_22 = A_33, we just need to show that det(A_11^{-1}) det(A_22^{-1}) ≤ det(C_11^{-1}), that is, det(A_11) det(A_22) ≥ det(C_11). But this has been shown in Proposition 1(a) with C_11 playing the role of A.
Similarly, to prove (b) we write

 tr(Cov(β̃)) = tr(C_11^{-1}) + tr(C_22^{-1})   and   tr(Cov(α̃)) = tr(A_11^{-1}) + tr(A_22^{-1}) + tr(A_33^{-1}).

Hence, we just need to show that tr(A_11^{-1}) + tr(A_22^{-1}) ≤ tr(C_11^{-1}), but this has already been shown in Proposition 2(a) with C_11 playing the role of A.

Remark 1

From Propositions 1, 2 and 3 it follows that

 det(Cov(α̃)) ≤ det(Cov(β̃)) ≤ det(Cov(β̂)),

and also

 tr(Cov(α̃)) ≤ tr(Cov(β̃)) ≤ tr(Cov(β̂)).
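Both chains can be checked for a random design; the helper below (illustrative dimensions, σ² = 1) forms the split covariance of equation (2) for an arbitrary partition into consecutive blocks:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 6))
A = X.T @ X                               # sigma^2 = 1 w.l.g.

def split_cov(A, blocks):
    """Covariance of the split LS estimator for consecutive index blocks (sigma^2 = 1)."""
    S = np.zeros_like(A)
    start = 0
    for b in blocks:
        S[start:start + b, start:start + b] = np.linalg.inv(A[start:start + b, start:start + b])
        start += b
    return S @ A @ S                      # equation (2)

cov_ls = np.linalg.inv(A)                 # no split
cov_beta = split_cov(A, [4, 2])           # coarse partition {J_1, J_2}
cov_alpha = split_cov(A, [2, 2, 2])       # finer nested partition {I_1, I_2, I_3}

dets = [np.linalg.det(c) for c in (cov_alpha, cov_beta, cov_ls)]
traces = [np.trace(c) for c in (cov_alpha, cov_beta, cov_ls)]
print(dets, traces)                       # both lists should be non-decreasing
```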

4 Potential benefit of split regression estimators

Our aim is to illustrate the potential benefits of SplitReg for the prediction of new outcomes. We use the model and notation introduced in Example 1. The prediction of a new outcome y_0 = x_0′β + ε_0 corresponds to the choice c = x_0, and the prediction performance of an estimator β* is measured by averaging over the values of x_0; that is, we calculate the prediction mean squared error PMSE(β*) = E[(x_0′β* − x_0′β)²]. To calculate the expectation we assume that x_0 ∼ N(0, Γ_ρ), where Γ_ρ is a correlation matrix with off-diagonal element ρ. It can be shown that for LS prediction

 PMSE(β̂) = 2 (σ²/n) (1 − rρ)/(1 − r²).

For garrote prediction with weights ω = (ω_1, ω_2)′ we obtain that

 PMSE(β̂_ω) = (ω_1 − 1)² β_1² + (ω_2 − 1)² β_2² + 2ρ (ω_1 − 1)(ω_2 − 1) β_1 β_2 + σ²/(n(1 − r²)) (ω_1² + ω_2² − 2 r ρ ω_1 ω_2).

For ridge regression prediction with penalty λ we have

 PMSE(β̂_λ) = ((1 + λ/n)² − r²)^{-2} ((λ/n)² β′ B Γ_ρ B β + σ² tr(Γ_ρ V)/n),

where B = ( 1 + λ/n  −r ; −r  1 + λ/n ) and V = B Γ_r B, with Γ_r the training correlation matrix with off-diagonal element r. We also consider prediction based on the lasso estimator. However, in this case we do not have a closed form expression for the PMSE, so we calculate it numerically. We determine the optimal prediction performance of ridge regression, lasso and garrote corresponding to the values of the shrinkage parameters λ and ω which minimize the PMSE.
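Several symbols in the displayed PMSE formulas above were lost in extraction; the sketch below states our reading of them explicitly (B, V, Γ_r and Γ_ρ as defined above) and checks each closed form against the direct computation PMSE = bias′Γ_ρ bias + tr(Γ_ρ Cov), for arbitrary illustrative parameter values:

```python
import numpy as np

n, sigma, r, rho, lam = 50, 1.0, 0.5, 0.3, 10.0
beta = np.array([1.0, -0.5])
w = np.array([0.8, 0.6])

Gr = np.array([[1, r], [r, 1]])        # training correlation, A = n * Gr
Grho = np.array([[1, rho], [rho, 1]])  # Gamma_rho: covariance of the new x_0
A = n * Gr
A_inv = np.linalg.inv(A)
I = np.eye(2)

def pmse(bias, cov):                   # PMSE = bias' Grho bias + tr(Grho cov)
    return bias @ Grho @ bias + np.trace(Grho @ cov)

# garrote: bias (D - I) beta, covariance sigma^2 D A^{-1} D
D = np.diag(w)
gar_direct = pmse((D - I) @ beta, sigma**2 * D @ A_inv @ D)
gar_formula = ((w[0]-1)**2*beta[0]**2 + (w[1]-1)**2*beta[1]**2
               + 2*rho*(w[0]-1)*(w[1]-1)*beta[0]*beta[1]
               + sigma**2/(n*(1-r**2))*(w[0]**2 + w[1]**2 - 2*r*rho*w[0]*w[1]))

# ridge: bias -lam (A + lam I)^{-1} beta, covariance sigma^2 (A + lam I)^{-1} A (A + lam I)^{-1}
M = np.linalg.inv(A + lam * I)
ridge_direct = pmse(-lam * M @ beta, sigma**2 * M @ A @ M)
u = lam / n
B = np.array([[1 + u, -r], [-r, 1 + u]])
V = B @ Gr @ B
ridge_formula = ((1 + u)**2 - r**2)**(-2) * (u**2 * beta @ B @ Grho @ B @ beta
                                             + sigma**2 * np.trace(Grho @ V) / n)
print(gar_direct, gar_formula, ridge_direct, ridge_formula)
```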

To make SplitReg more flexible, similarly as for the garrote, we add shrinkage weights ω = (ω_1, ω_2)′ to the split coefficient estimates, yielding the shrunken SplitReg estimator β̃_ω = W β̃ with W = diag(ω). Moreover, since splitting is not always better than joint estimation, as illustrated in Example 1, we make SplitReg adaptive by allowing it to choose whether to use the joint LS estimates or the split LS estimates. The decision to split or not to split is made by comparing the performance of the best shrunken SplitReg estimator to the best shrunken joint LS estimator, i.e. the simplified garrote estimator β̂_ω. The PMSE of the shrunken SplitReg estimator is given by

 PMSE(β̃_ω) = β′ T′ Γ_ρ T β + (σ²/n) tr(Γ_ρ W Γ_r W),

with W = diag(ω) and T = W Γ_r − I, where Γ_r is the 2×2 matrix with ones on the diagonal and off-diagonal element r.
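Since the definitions of W and T above are reconstructed, the PMSE expression can be double-checked by Monte Carlo; the sketch below (illustrative parameter values) simulates the shrunken split estimator on an exact-correlation design as in Example 1:

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma, r, rho = 100, 1.0, 0.5, 0.5
beta = np.array([1.0, 0.5])
w = np.array([0.7, 0.7])
W = np.diag(w)
Gr = np.array([[1, r], [r, 1]])
Grho = np.array([[1, rho], [rho, 1]])

T = W @ Gr - np.eye(2)                 # bias matrix: E(W beta_tilde) = W Gr beta
pmse_formula = beta @ T.T @ Grho @ T @ beta + sigma**2 / n * np.trace(Grho @ W @ Gr @ W)

# design with exact unit scaling and correlation r, as in Example 1
Z = rng.normal(size=(n, 2))
Z -= Z.mean(axis=0)
Q, _ = np.linalg.qr(Z)
X = np.sqrt(n) * np.column_stack([Q[:, 0], r * Q[:, 0] + np.sqrt(1 - r**2) * Q[:, 1]])

reps = 20000
Ys = X @ beta + sigma * rng.normal(size=(reps, n))
Bt = (Ys @ X / n) * w                  # shrunken split LS: w_j * x_j'y / n, one row per replicate
X0 = rng.normal(size=(reps, 2)) @ np.linalg.cholesky(Grho).T   # x_0 ~ N(0, Grho)
err = ((X0 * (Bt - beta)).sum(axis=1))**2                      # squared prediction errors
print(err.mean(), pmse_formula)
```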

In Figure 1 we compare the minimal attainable PMSE of the LS, ridge regression, simplified garrote, lasso and adaptive SplitReg methods as a function of β_2 (with β_1, n and the remaining parameters fixed). The error variance σ² was chosen such that the signal to noise ratio equals 1 or 3, and we considered two values for the correlation r between the predictors in the training sample. From these plots we can see that all methods can improve on LS, as expected. Moreover, the signal to noise ratio has little effect on the relative performance of the methods. On the other hand, changing the correlation does affect the order of performance. When the correlation increases, the differences between the methods become larger as well, and especially the simplified garrote is affected by the increasing correlation in the training data. However, the adaptive SplitReg consistently yields the best performance.

Figure 1: Optimal PMSE obtained by LS, garrote, ridge regression, lasso, and adaptive SplitReg as a function of β_2.

5 Conclusions

In this note we have shown that, next to regularization, split regressions can also be used to reduce the variability of LS. Both the generalized variance and the total variance of linear functions of the SplitReg coefficients are smaller than those based on the LS coefficients. This smaller variability can result in a lower MSE, as illustrated in a simple setting by comparing the PMSE of an adaptive SplitReg method with those of ridge regression, the simplified garrote and the lasso. We expect that the benefits are higher in high-dimensional regression models where spurious correlations frequently occur. However, determining the optimal splits is a difficult task in practice. An exhaustive search over all possible splits is not feasible, so approximate methods need to be developed. This is a topic of ongoing research. For example, minimizing a penalized loss function is a possibility that yields promising results; see Christidis et al. (2017).

References

• Breiman (1995) Breiman, L. (1995) Better subset regression using the nonnegative garrote, Technometrics, 37(4), 373–384.
• Christidis et al. (2017) Christidis, A., Lakshmanan, L. V. S., Smucler, E., and Zamar, R. (2017). Ensembles of Regularized Linear Models. Technical report, available at https://arxiv.org/abs/1712.03561
• Hastie, Tibshirani, and Wainwright (2015) Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations, Chapman and Hall/CRC, Florida.
• Hoerl (1962) Hoerl, A. E. (1962). Application of Ridge Analysis to Regression Problems, Chemical Engineering Progress, 58(3), 54–59.
• Hoerl and Kennard (1970) Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems, Technometrics, 12(1), 55–67.
• Horn and Johnson (2012) Horn, R. A. and Johnson, C. R. (2012) Matrix Analysis, 2nd edition, Cambridge University Press, New York.
• Tikhonov (1963) Tikhonov, A.N. (1963). Solution of incorrectly formulated problems and the regularization method, Soviet Mathematics Doklady, 4, 1035–1038.