1 Ridge and linear regression
We first learn of ridge when we study linear regression. Ridge provides a remedy for an ill-conditionedmatrix. If our regression matrix has column rank less than
(or nearly so in terms of its condition number, the ratio of largest to smallest singular value), then the usual least-squares regression equation is in trouble:
The poor (large) condition number of is inherited by , which is either singular or nearly so, and here we try to invert it. The problem is that
has some eigenvalues of zero or nearly zero, so inverting is not a good idea. So what we do is add aridge on the diagonal — with — which increases all the eigenvalues by and takes the problem away:
This is the ridge regression solution proposed byHoerl and Kennard (1970) 50 years ago, and as we will see is alive and strong today.
We can write out the optimization problem that ridge is solving,
What value of shall we use? If our main concern is resolving the numerical issues with solving the least squares equation, then a small value suffices — say or perhaps that fraction of the largest eigenvalue of .
The ridge remedy (2amounts to a bias-variance trade-off. We address this further in Section 3.
The ridge modification works in many situations where we fit linear models, and the effect is not as transparent as in (2).
With GLMs we model and , and fit by maximum likelihood. If is ill conditioned, this will lead to a Hessian that is flat (near zero) in some directions, and the Newton algorithm will be unstable. We can instead maximize
which, as in (2), adds
to the Hessian, and removes the problem. One example is logistic regression. Here even ifis well behaved, if the classes are separated in -space, the usual maximum-likelihood estimator is undefined — some coefficients are infinite. A little bit of ridge resolves the issue.
The same is true in the Cox model, multiclass logistic regression, and any other model linear in the parameters.
In wide-data situations where , for example in genomics where the variables are SNPs (single nucleotide polymorphisms) and can number in the millions, and in document classification in the tens of thousands. Here regularization is essential, and requires careful tuning.
Typically we do not penalize the intercept in the linear model; in the case of least squares, we can center and upfront, and ignore the intercept. In other GLMs it is handled not quite as simply, but is a detail which for now we will ignore.
We have expressed the ridge problem in Lagrange form in (3), as opposed to the more evocative bound form
This penalty both shrinks coefficients like ridge, but also sets some to zero and thus selects. Figure 1 compares their constraint regions, which explains the ability of lasso to set coefficients to zero. The lasso has inspired ridge-like generalizations, such as the two mentioned here:
elastic net (Zou and Hastie, 2005) which mixes the ridge and lasso penalty
It still selects variables like the lasso, but deals more gracefully with correlated variables.
group lasso (Yuan and Lin, 2007) that does lasso-like selection for pre-defined groups of variables
is a vector of parameters. Notice the lack of the square on thenorms, which changes what would be an overall ridge penalty to a group-wise lasso. The overlap group lasso (Jacob et al., 2009; Hastie et al., 2015b) allows for hierarchy in selection, and has been used (by us, among others) for selecting interactions (Lim and Hastie, 2015) and additive models (Chouldechova and Hastie, 2015).
There is a Bayesian view of ridge regression. We assume , with Here we think of as random as well, and having a prior distribution . Then the negative log posterior distribution is proportional to (3) with , and the posterior mean is the ridge estimator (2). The smaller the prior variance parameter , the more the posterior mean is shrunk toward zero, the prior mean for . The Bayesian version of the lasso uses a Laplace prior, which puts much more mass at and around zero, and has wider tails than the Gaussian, per unit variance. Ridge therefore expects more variables in the model, and shrinks them all to stabilize variance. Lasso expects fewer variables, and hence is able to shrink those included less to stabilize variance.
2 Ridge computations and the SVD
In many wide-data and other ridge applications, we need to treat as a tuning parameter, and select a good value for the problem at hand. For this task we have a number of approaches available for selecting from a series of candidate values:
With a validation dataset separate from the training data, we can evaluate the prediction performance at each value of .
Cross-validation does this efficiently using just the training data, and leave-one-out (LOO) CV is especially efficient; see Section 7.
and unbiased risk estimates, where we correct the bias in the training risk at each .
Whatever the approach, they all require computing a number of solutions at different values of : the ridge regularization path. With squared-error loss as in (3
), we can achieve great efficiency via the SVD (singular-value decomposition):. If is , we use the full form of the SVD, with orthogonal, orthogonal and diagonal, with diagonal entries , where . Plugging this into (2) we get
So once we have the SVD of , we have the ridge solution for all values of .
One can use this representation to show that if two or more variables are identical (a serious problem for linear regression), their ridge coefficients are identical (and sum to what would be the ridge coefficient if just one were included in the model). Similarly, correlated variables have their coefficients shrunk toward each other — the mechanism exploited by the elastic net.
It is also illuminating to look at the fitted values using (9):
The form an orthonormal basis for the column space of . If has its column means removed (which we would have done if there were an unpenalized intercept in the model), then the are the principal components of . If , then (10) says the fit is the projection onto the column space of , using the principal components as a basis ( is the coordinate of on the th principal component). With the coordinates of are shrunk by an increasing amount as we go through the succession of principal components.
When the ridge solution with is simply the OLS solution for . When , there are infinitely many least squares solutions for , all leading to a zero-residual solution. This is because the least-squares estimating equations
are under determined (more unknowns than equations). But evidently from (9) we can get a unique solution
This is the least-squares solution with minimum norm.
3 Ridge and the bias-variance trade-off
The coefficients of ridge regression are explicitly shrunk toward the origin. If the data arise from a linear model
with i.i.d zero-mean errors , then will be a biased estimate of . If the are assumed fixed, and has full column rank, we can get an explicit expression for this bias from (9)
So the coordinates of the true coefficient along different principal components (of the training data) are differentially shrunk toward zero — the smaller the PC, the more the shrinkage. The bias increases in all components as increases. Note, when , lies in the at most -dimensional row space of , and any component of in the orthogonal component will contribute to the bias as well.
Similarly there is a nice expression for the covariance matrix under the sampling model (15):
with . With this is the usual OLS covariance in factored form. Here we see that the covariance decreases uniformly (in a PSD sense) as increases.
These play off each other when we make predictions at new locations , leading to a bias-variance trade-off.
For the expected prediction error (EPE) we add , the variance of the target response.
Even when the model is correct, a shrunken estimate can perform considerably better than the OLS solution if the bias and variances are traded off correctly. This is especially true in cases when
is large and/or the signal-to-noise ratio is small. Figure2 shows the results of a linear-model simulation with and and the (this is a population of 77%). The OLS coefficients (left panel, far left in plot) are wild but unbiased. The ridge-shrunken versions at are much closer to the true coefficients (red marks), and predictions using them achieve minimum mean-squared error from the true linear function (right panel). Of course we don’t know the best , and typically use a left-out data set or cross-validation to estimate the prediction-error curve empirically. Included in the plot is the leave-one-out (LOO) CV curve, discussed in Section 7, which finds a reasonable value for .
A general form of shrinkage for any multivariate estimator is given by the celebrated James-Stein formula (James and Stein, 1961). In the context of linear regression we have
where is the OLS estimator, and can be estimated in the usual fashion via the residual sum-of-squares for linear regression (Efron and Hastie, 2016, Chapter 7). The left panel of Figure 2 includes the James-Stein estimates on the right axis. The corresponding MSE is above 15, and so is not that good on this example.
4 Ridge and data augmentation
There are some interesting tricks for fitting a ridge regression using standard linear-model software. Suppose we augment and in the following way:
Think of the case where and are centered. We have added additional points around the origin, each a distance along a coordinate axis. It is straightforward to check that the OLS coefficient is . Another way to do this approximately is to augment in a similar way with random draws from a distribution (or any multivariate distribution with mean zero and covariance ).
is again padded withzeros. If we fit the model by weighted least squares, giving weight 1 to the original points, and to the augmentation points, we approximately achieve . Approximate because . See the left plot in Figure 3 for an example.
Another way to achieve this is to perturb each observed data vector by a random amount. We make perturbed copies of each : , where . Each of the vectors gets the same response . Then
because the zero-mean cross-terms get averaged away as gets large.
Here the augmentation is somewhat artificial. A more realistic augmentation is used in image classification with deep neural networks(Chollet and Allaire, 2018, for example). We have a limited number of labeled training images. The idea is to take each image and apply some natural deformations in such a way that humans would have no trouble making the classification. For example, suppose we have an image of a poodle. By making small rotations, scale and location changes, hue and contrast changes, it will still look like a poodle. We can now augment our training set with all these additional versions, creating a cloud of images around the original, as in Figure 3[right panel]. This kind of regularization prevents the model from over-training on the original image, and by analogy is a form of ridge regularization.
5 Ridge and dropout regularization
Two forms of regularization have arisen in recent years in the statistical and machine learning communities. They are similar in spirit to each other, and rather closely aligned with ridge regression.
Random forests (Breiman, 2001) have an option whereby each time a tree is grown, and a split is contemplated at a node, only a subset of the variables available are considered as candidates for splitting. This causes variables to stand in for each other, or share the predictive burden — a characteristic shared by ridge regression. The smaller , the more regularized the fit.
Modern deep neural networks employ a similar mechanism: dropout learning (Srivastava et al., 2014). Neural networks have layers of activations or transformations that feed forward into successive layers through a series of linear models followed by nonlinear transformations. For example going from layer with units to layer with units, consider computing the activation for a single observation during the feed-forward stage of training.
The idea is to randomly set each of the activations
to zero with probability, and inflate the remaining ones by a factor . Hence, for this observation, those nodes that survive have to stand in for those omitted. This is done independently for each observation, and can also be seen as be a form of ridge regularization, and when done correctly improves performance. The fraction omitted is a tuning parameter, and for deep networks it appears to be better to use different values at different layers.
We illustrate using a simple version of dropout for linear regression (see also Wager et al. (2013)). For simplicity we assume all variables have mean zero, so we can ignore intercepts. Consider the following random least-squares criterion:
Here the are i.i.d variables for all with
(this particular form is used so that ). Using simple probability it can be shown that the expected score equations can be written
with , where is the th column of . Hence the solution is given by
a generalized ridge regression. If the variables are standardized, the term becomes a scalar, and the solution is identical to ridge regression.
6 Ridge and the kernel trick
We start with the ridge problem (3) for the case and write out the score equation (up to a factor 2):
For it is easy to see that the solution must satisfy for some vector . In other words, the solution lies in the row space of , an -dimensional subspace of (we assume has full row rank, for ease of exposition).
From this we can easily derive
where is the gram matrix of pairwise inner products. Likewise the fit vector is given by
So even though can be very large, all we need to do are -dimensional calculations to compute the solution (although the computations needed to produce here are ). This is the kernel trick in its simplest form.
This is also the case for GLMs, and in fact any linear model fit obtained by optimizing a quadratically penalized objective. We will use a GLM as an example. Denoting the -vector of fits by , we can write the quadratically penalized log-likelihood problem as
Again it can be shown (Hastie and Tibshirani, 2004) that the solution has the form , the fit vector , and the optimization in becomes
Here the linear kernel matrix corresponds to ridged linear models, but the formulation opens the door to the rich world of function fitting in reproducing kernel Hilbert spaces (RKHS). For any positive definite bivariate function , we can think of it computing inner products in an implicit feature space : . We end up fitting functions of the form , by solving problems of the form (31) where . See Hastie et al. (2009, Chapter 6)
for details. The support-vector machine for two-class classification is of this form. Withand , the optimization problem is
For GLMs we can solve (31) using standard ridge software. We compute the Cholesky decomposition of , and reparametrize via (again an -vector). Then the objective in becomes
another ridged linear GLM, but in rather dimensions. For the linear kernel , it is easy to see that the solution for the original is , where
is the QR decomposition of(same ). The beauty here is that we reduce our wide matrix to an matrix , and then perform our ridge MLE with it instead. We can perform CV to select in this space as well, and need only map back if we want to make predictions on new wide vectors .
7 Ridge and leave-one-out cross validation
We have already talked of using the SVD to compute the ridge path of solutions efficiently. This eases the burden when computing the solution paths times during -fold cross-validation. For -fold (LOO) CV, we have another beautiful result for ridge and other linear operators.
Here is the ridge estimate computed using the -observation dataset with the pair omitted, and
is the ridge operator matrix for the original -observation matrix. The equation says we can compute all the LOO residuals for ridge from the original residuals, each scaled up by . From (10) we see we can obtain efficiently for all via
with the diagonal shrinkage matrix with elements .
One can derive this result using the Sherman-Morrison-Woodbury identity, but a more general and elegant derivation due to Golub et al. (1979) is as follows. For each pair left out we are required to solve
with solution . Let . If we insert the pair back into the size dataset, it will not change the solution to (37), since this point is on the solution surface (and hence has zero loss at .) Back at a full dataset, and using the linearity of the ridge operator, we have
from which we see that with .
For from (29) we get
Hence the summands in formula (34) can be written as
(the multiplier from (39) in the numerator and denominator cancel). Now we can set to obtain
This formula does not apply if there is an unpenalized intercept in the model. One can show in this case that the corresponding formula is
where is the doubly-centered kernel matrix (with rank at most ), is the mean projection operator, and the is a pseudo inverse.
8 Ridge, minimum norm and double descent
There has been a burst of activity in the machine learning community over some surprising behavior of a particular class of estimators. Before we get into detail, a bit of context is in order. Deep learning models are dominant in certain high SNR domains, such as image classification. Practitioners have found that fitting a deep convolutional network with many layers by gradient descent all the way down to zero training error performs very well on test data. These networks typically have more parameters than training observations and so this would normally be classed as severeoverfitting. What has also been observed is that increasing the number of parameters (via more hidden units) can improve performance in some instances (Zhang et al., 2016; Belkin et al., 2018; Hastie et al., 2019).
We first make the simple observation (Zhang et al., 2016) that with and squared-error loss, gradient descent starting at down to zero training error gives the minimum -norm solution. This is because the gradient (11) is in the row space of , and hence all the gradient updates remain there. This characterizes the minimum-norm solution (see end of Section 2).
Figure 4 is a simple demonstration of this double descent behavior. We generated data from a model with . The true function is nonlinear and nonadditive, is distributed , is , and is chosen so the SNR (signal-to-noise ratio) . The sample size is , and we use a very large test set (10K) to measure out-of-sample error. For our estimation method we use an additive model
where each is a
-vector of natural-spline basis functions, with knots chosen at uniform quantiles of the training values for variable(Hastie et al., 2009, Chapter 5). The total number of parameters or dimension of the model is , with stepping from 1 thru 30. The dimension reaches the training sample size of when .
The black curve shows the test prediction error of the minimum-norm least squares fits, and the orange curve their training error. The training error behaves as expected, reaching and staying at zero as the dimension passes 100. Before 100, the fits are OLS and overfit almost immediately, and increase to a dramatic level (around 2000) before descending down again as the dimension increases. The blue curve corresponds to the optimally tuned ridge estimator which is fairly flat, with a minimum around dimension 20.
The apparent dilemma here is that the black curve does not show the usual bias-variance trade-off as the dimension increases. The explanation for the increase around 100 is that the model has to interpolate the training data to achieve zero training error. As a result the prediction curve has to wiggle a lot between training points, and thenorm of gets large. This does not bode well for out-of-sample prediction, since the test features fall in this in-between region. But as the dimension grows beyond 100, zero error can be achieved more smoothly with decreasing norm, and leads to improved out-of-sample prediction. So the complexity of the model is not determined by dimension alone; the potential for smaller norm of the solution at each dimension plays a role as well. The “objective” that is optimized at each dimension combines loss and complexity (OLS fit with minimum norm), which clouds the usual bias-variance picture.
It is interesting to note that the minimum-norm fitting method is linear in (see (12)). However, as the number of columns (dimension) of the basis matrix increases, the small singular values increase leading to the potential for more stable and smaller -norm solutions. Figure 5 illustrates this for our example.
9 Ridge + ridge = lasso
Suppose is an matrix. Consider the optimization problem
Even though the rank constraint makes this problem non-convex, the solution is well known and easy to compute. We compute the SVD of , set all but the top singular values of to zero, and reconstruct. There is a convex relaxation of this problem (Fazel, 2002)
where is the nuclear norm of — the sum of the singular values. The solution is again via the SVD of . Now we replace the singular values by their soft-thresholded values , and reconstruct. This is the lasso version of rank selection for matrices. Suppose is such that all but of the are greater than zero, and hence the solution has rank . Now consider the doubly ridged problem (Srebro et al., 2005)
over matrices and . This biconvex problem has a solution that coincides with that of (45): . Quite remarkable that a biconvex -regularized problem is equivalent to a convex regularized problem! At this point these connections may seem somewhat academic, since the SVD of provides solutions to all the problems above. These connections show their strength in variations of these problems.
If is massive and sparse, we can compute a low-rank matrix approximation by alternating ridge regressions. Given , we obtain via , which is a dense skinny matrix multiplying a sparse matrix. Likewise for given .
When has missing values, we can solve the matrix completion objective
where projects onto the observed values of (i.e. the Frobenius norm ignores the entries corresponding to missing values in ). Objective (47) is convex in , and for large solutions can be obtained again using the alternating ridged version (46) (Hastie et al., 2015a).
We see through these examples that ridge regularization and its extensions are pervasive in applied statistics. Although the lasso enjoys widespread popularity in wide-data scenarios, both ridge and elastic-net claim some of the territory. In document and web-page classification using bag-of-words and n-grams, synonyms create problems for methods that select variables. Ridge includes them all, and suffers less. Furthermore, classifiers are hurt less by bias than quantitative regressors. What started off as a simple fix for wayward linear regression models has evolved into a large collection of tools for data modeling. It would be hard to imagine the life of a data scientist without them.
Thanks to Rob Tibshirani for helpful comments. This research was partially supported by grants DMS-1407548 and IIS 1837931 from the National Science Foundation, and grant 5R01 EB 001988-21 from the National Institutes of Health.
- Reconciling modern machine learning and the bias-variance trade-off. Note: arXiv: 1812.11118 Cited by: §8.
- Random forests. Machine Learning 45, pp. 5–32. Cited by: §5.
- Deep learning with r. 1st edition, Manning Publications Co., USA. External Links: Cited by: §4.
- Generalized Additive Model Selection. ArXiv e-prints. External Links: Cited by: 2nd item.
- Computer age statistical inference; algorithms, evidence and data science. Cambridge University Press. Cited by: §3.
- Matrix rank minimization with applications. Ph.D. Thesis, Department of Electrical Engineering, Stanford University. Cited by: §9.
- Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21, pp. 215–224. Cited by: §7.
- The elements of statistical learning: prediction, inference and data mining. Second edition, Springer Verlag, New York. Cited by: §6, §8.
- Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research 16, pp. 3367–3402. Cited by: 2nd item.
- Surprises in high-dimensional ridgeless least squares interpolation. External Links: Cited by: §7, §8.
- Efficient quadratic regularization for expression arrays. Biostatistics 5 (3), pp. 329–340. Cited by: §6.
- Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall, CRC Press. Cited by: 2nd item.
- Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, pp. 55–67. Cited by: §1.
- Group Lasso with overlap and graph Lasso. In Proceeding of the 26th International Conference on Machine Learning, Montreal, Canada, Cited by: 2nd item.
- Estimation with quadratic loss. In Proc. 4th Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, pp. 361–379. External Links: Cited by: §3.
- Learning interactions via hierarchical group-lasso regularization. Journal of Computational and Graphical Statistics 24 (3), pp. 627–654. Cited by: 2nd item.
- Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, pp. 1329–1336. Cited by: §9.
- Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), pp. 1929–1958. External Links: Cited by: §5.
- Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, pp. 267–288. Cited by: §1.
- Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (Eds.), pp. 351–359. External Links: Cited by: §5.
- Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68 (1), pp. 49–67. Cited by: 2nd item.
- Understanding deep learning requires rethinking generalization. Note: arXiv: 1611.03530 Cited by: §8, §8.
- Regression shrinkage and selection via the elastic net. JRSS B 67 (2), pp. 301–320. Cited by: 1st item.