1 Ridge and linear regression
We first learn of ridge when we study linear regression. Ridge provides a remedy for an ill-conditioned regression matrix. If our $n\times p$ regression matrix $X$ has column rank less than $p$ (or nearly so, in terms of its condition number, the ratio of largest to smallest singular value), then the usual least-squares regression equation is in trouble:
$$\hat\beta = (X^\top X)^{-1}X^\top y. \tag{1}$$
The poor (large) condition number of $X$ is inherited by $X^\top X$, which is either singular or nearly so, and here we try to invert it. The problem is that $X^\top X$ has some eigenvalues of zero or nearly zero, so inverting it is not a good idea. So what we do is add a ridge on the diagonal, $X^\top X + \lambda I$ with $\lambda > 0$, which increases all the eigenvalues by $\lambda$ and takes the problem away:
$$\hat\beta(\lambda) = (X^\top X + \lambda I)^{-1}X^\top y. \tag{2}$$
This is the ridge regression solution proposed by Hoerl and Kennard (1970) 50 years ago, and as we will see it is alive and strong today. We can write out the optimization problem that ridge is solving,
$$\min_\beta\; \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2, \tag{3}$$
where $\|\cdot\|_2$ is the $\ell_2$ (Euclidean) norm. Some simple matrix calculus gets us from (3) to (2).
What value of $\lambda$ shall we use? If our main concern is resolving the numerical issues with solving the least-squares equation, then a small value suffices, say a small fraction of the largest eigenvalue of $X^\top X$.
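As a concrete illustration of (2), here is a minimal numpy sketch (not from the paper; the sizes, seed, and $\lambda$ values are made up). With two nearly collinear columns the OLS coefficients are unstable, while the ridge coefficients shrink monotonically as $\lambda$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
X[:, 4] = X[:, 3] + 1e-3 * rng.standard_normal(n)   # two nearly collinear columns
y = X @ np.ones(p) + rng.standard_normal(n)

def ridge(X, y, lam):
    # (X'X + lam I)^{-1} X'y, via a linear solve rather than an explicit inverse
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # wild when X is ill-conditioned
beta_ridge = ridge(X, y, 1.0)                     # stabilized by the ridge
```

The norm of $\hat\beta(\lambda)$ decreases as $\lambda$ increases, with the OLS solution as the $\lambda = 0$ endpoint.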
The ridge remedy (2) comes with consequences. Under a true linear model, the ridge estimate is biased toward zero. It also has smaller variance than the OLS estimate. Selecting $\lambda$ amounts to a bias-variance tradeoff. We address this further in Section 3.
The ridge modification works in many situations where we fit linear models, although the effect is not always as transparent as in (2).

With GLMs we model the mean of $y$ via $E(y\mid x) = g^{-1}(x^\top\beta)$ for a link function $g$, and fit $\beta$ by maximum likelihood. If $X$ is ill-conditioned, this will lead to a Hessian that is flat (near zero) in some directions, and the Newton algorithm will be unstable. We can instead maximize the penalized log-likelihood
$$\max_\beta\; \ell(\beta;\, X, y) - \frac{\lambda}{2}\|\beta\|_2^2, \tag{4}$$
which, as in (2), adds $\lambda I$ to the (negative) Hessian, and removes the problem. One example is logistic regression. Here even if $X$ is well behaved, if the two classes are separated in $x$-space, the usual maximum-likelihood estimator is undefined: some coefficients are infinite. A little bit of ridge resolves the issue.
The same is true in the Cox model, multiclass logistic regression, and any other model linear in the parameters.
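A minimal numpy sketch of the separation phenomenon (not from the paper; the toy data and $\lambda = 0.1$ are made up for illustration). On perfectly separated one-dimensional data the unpenalized MLE diverges, but Newton steps on the ridge-penalized log-likelihood (4) converge to a finite coefficient:

```python
import numpy as np

# Perfectly separated toy data: the unpenalized logistic MLE is undefined
# (the coefficient runs off to infinity), but a small ridge penalty
# gives a finite, well-defined solution.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])   # classes separated at x = 0
X = x[:, None]
lam = 0.1                            # made-up ridge parameter

beta = np.zeros(1)
for _ in range(100):                 # Newton steps on the penalized log-likelihood
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (y - prob) - lam * beta
    W = prob * (1.0 - prob)
    hess = -(X * W[:, None]).T @ X - lam * np.eye(1)
    beta = beta - np.linalg.solve(hess, grad)
```

Without the `lam` terms, the same loop would push `beta` toward infinity; with them, the gradient of the penalized likelihood balances at a finite value.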

In wide-data situations where $p \gg n$, regularization is essential: for example in genomics, where the variables are SNPs (single-nucleotide polymorphisms) and $p$ can number in the millions, and in document classification, where the variables number in the tens of thousands. In these settings the regularization requires careful tuning.
Typically we do not penalize the intercept in the linear model; in the case of least squares, we can center $X$ and $y$ upfront, and ignore the intercept. In other GLMs it is handled not quite as simply, but this is a detail which for now we will ignore.
We have expressed the ridge problem in Lagrange form in (3), as opposed to the more evocative bound form
$$\min_\beta\; \|y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_2^2 \le c. \tag{5}$$
The two problems are of course equivalent: every solution $\hat\beta(\lambda)$ to problem (3) is a solution to (5) with $c = \|\hat\beta(\lambda)\|_2^2$. The lasso (Tibshirani, 1996) uses an $\ell_1$ rather than an $\ell_2$ penalty in linear models to achieve sparsity:
$$\min_\beta\; \|y - X\beta\|_2^2 + \lambda\|\beta\|_1. \tag{6}$$
This penalty shrinks coefficients like ridge, but also sets some to zero and thus selects variables. Figure 1 compares their constraint regions, which explains the ability of the lasso to set coefficients to zero. The lasso has inspired ridge-like generalizations, such as the two mentioned here:

the elastic net (Zou and Hastie, 2005), which mixes the ridge and lasso penalties:
$$\min_\beta\; \|y - X\beta\|_2^2 + \lambda\big[\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2\big]. \tag{7}$$
It still selects variables like the lasso, but deals more gracefully with correlated variables.

the group lasso (Yuan and Lin, 2007), which does lasso-like selection for predefined groups of variables:
$$\min_{\{\theta_j\}_1^J}\; \Big\|y - \sum_{j=1}^J X_j\theta_j\Big\|_2^2 + \lambda\sum_{j=1}^J \|\theta_j\|_2; \tag{8}$$
here each $\theta_j$ is a vector of parameters for the $j$th group. Notice the lack of the square on the $\ell_2$ norms, which changes what would be an overall ridge penalty into a group-wise lasso. The overlap group lasso (Jacob et al., 2009; Hastie et al., 2015b) allows for hierarchy in selection, and has been used (by us, among others) for selecting interactions (Lim and Hastie, 2015) and additive models (Chouldechova and Hastie, 2015).
There is a Bayesian view of ridge regression. We assume $y = X\beta + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2 I)$. Here we think of $\beta$ as random as well, having a prior distribution $\beta \sim N(0, \tau^2 I)$. Then the negative log posterior distribution is proportional to (3) with $\lambda = \sigma^2/\tau^2$, and the posterior mean is the ridge estimator (2). The smaller the prior variance parameter $\tau^2$, the more the posterior mean is shrunk toward zero, the prior mean for $\beta$. The Bayesian version of the lasso uses a Laplace prior, which puts much more mass at and around zero, and has wider tails than the Gaussian, per unit variance. Ridge therefore expects more variables in the model, and shrinks them all to stabilize variance. The lasso expects fewer variables, and hence is able to shrink those included less to stabilize variance.
2 Ridge computations and the SVD
In many wide-data and other ridge applications, we need to treat $\lambda$ as a tuning parameter, and select a good value for the problem at hand. For this task we have a number of approaches available for selecting $\lambda$ from a series of candidate values:

With a validation dataset separate from the training data, we can evaluate the prediction performance at each value of $\lambda$.

Cross-validation does this efficiently using just the training data, and leave-one-out (LOO) CV is especially efficient; see Section 7.

$C_p$ and related unbiased risk estimates, where we correct the bias in the training risk at each $\lambda$.
Whatever the approach, they all require computing a number of solutions $\hat\beta(\lambda)$ at different values of $\lambda$: the ridge regularization path. With squared-error loss as in (3), we can achieve great efficiency via the SVD (singular-value decomposition): $X = UDV^\top$. If $X$ is $n\times p$ with $n > p$, $U$ is $n\times p$ with orthonormal columns, $V$ is $p\times p$ orthogonal, and $D$ is diagonal with entries $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$. Plugging this into (2) we get
$$\hat\beta(\lambda) = V(D^2 + \lambda I)^{-1}DU^\top y = \sum_{j=1}^p v_j\,\frac{d_j}{d_j^2 + \lambda}\,u_j^\top y. \tag{9}$$
So once we have the SVD of $X$, we have the ridge solution for all values of $\lambda$.
One can use this representation to show that if two or more variables are identical (a serious problem for linear regression), their ridge coefficients are identical (and sum to what would be the ridge coefficient if just one were included in the model). Similarly, correlated variables have their coefficients shrunk toward each other — the mechanism exploited by the elastic net.
It is also illuminating to look at the fitted values using (9):
$$\hat y(\lambda) = X\hat\beta(\lambda) = \sum_{j=1}^p u_j\,\frac{d_j^2}{d_j^2 + \lambda}\,u_j^\top y. \tag{10}$$
The $u_j$ form an orthonormal basis for the column space of $X$. If $X$ has its column means removed (which we would have done if there were an unpenalized intercept in the model), then the $d_j u_j$ are the principal components of $X$. If $\lambda = 0$, then (10) says the fit is the projection of $y$ onto the column space of $X$, using the principal components as a basis ($u_j^\top y$ is the coordinate of $y$ on the $j$th principal component). With $\lambda > 0$ the coordinates of $y$ are shrunk by the factors $d_j^2/(d_j^2 + \lambda)$, by an increasing amount as we go through the succession of principal components.
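As a quick numerical check of (9), here is a numpy sketch (not from the paper; sizes and $\lambda$ values are made up): one SVD up front, then the ridge coefficients for any $\lambda$ by cheap elementwise operations, matching the direct solve of (2):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 8
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)    # one SVD up front

def ridge_from_svd(lam):
    # beta(lam) = V diag(d / (d^2 + lam)) U'y: O(np) work per lam after the SVD
    return Vt.T @ (d / (d**2 + lam) * (U.T @ y))

# the whole regularization path reuses the same factorization
beta_path = {lam: ridge_from_svd(lam) for lam in (0.1, 1.0, 10.0)}
```

Each additional $\lambda$ costs only a rescaling, which is what makes path computations and cross-validation cheap.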
When $n > p$, the ridge solution with $\lambda = 0$ is simply the OLS solution for $\beta$. When $p > n$, there are infinitely many least-squares solutions for $\beta$, all leading to a zero-residual solution. This is because the least-squares estimating equations
$$X^\top X\beta = X^\top y \tag{11}$$
are underdetermined (more unknowns than equations). But evidently from (9) we can get a unique solution in the limit:
$$\hat\beta(0) = \lim_{\lambda\downarrow 0}\hat\beta(\lambda) = VD^{+}U^\top y, \tag{12}$$
where $D^{+}$ has diagonal entries $1/d_j$ if $d_j > 0$ and zero otherwise. This is the least-squares solution with minimum $\ell_2$ norm.
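In numpy (a sketch, not from the paper; sizes and seed are made up), the minimum-norm solution (12) is what `np.linalg.pinv` computes, and ridge recovers it as $\lambda \downarrow 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 50                          # more unknowns than equations
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_min = np.linalg.pinv(X) @ y       # minimum-norm least-squares solution (12)

# ridge approaches it as lam -> 0; here computed via the identity
# (X'X + lam I)^{-1} X' = X'(XX' + lam I)^{-1}, an n x n calculation
lam = 1e-10
beta_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
```

Both interpolate the training data exactly (zero residuals), and the tiny-$\lambda$ ridge solution essentially coincides with the pseudo-inverse solution.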
3 Ridge and the biasvariance tradeoff
The coefficients of ridge regression are explicitly shrunk toward the origin. If the data arise from a linear model
$$y = X\beta + \varepsilon \tag{15}$$
with i.i.d. zero-mean errors $\varepsilon_i$, then $\hat\beta(\lambda)$ will be a biased estimate of $\beta$. If the $x_i$ are assumed fixed, and $X$ has full column rank, we can get an explicit expression for this bias from (9):
$$E[\hat\beta(\lambda)] = V(D^2 + \lambda I)^{-1}D^2V^\top\beta. \tag{16}$$
So the coordinates of the true coefficient vector along the different principal components (of the training data) are differentially shrunk toward zero: the smaller the PC, the more the shrinkage. The bias increases in all components as $\lambda$ increases. Note, when $p > n$, $\hat\beta(\lambda)$ lies in the at most $n$-dimensional row space of $X$, and any component of $\beta$ in the orthogonal complement will contribute to the bias as well.
Similarly there is a nice expression for the covariance matrix under the sampling model (15):
$$\mathrm{Cov}[\hat\beta(\lambda)] = \sigma^2\, V(D^2 + \lambda I)^{-1}D^2(D^2 + \lambda I)^{-1}V^\top, \tag{17}$$
with $\sigma^2 = \mathrm{Var}(\varepsilon_i)$. With $\lambda = 0$ this is the usual OLS covariance $\sigma^2(X^\top X)^{-1}$ in factored form. Here we see that the covariance decreases uniformly (in a positive semi-definite sense) as $\lambda$ increases.
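The PSD ordering in (17) is easy to verify numerically; a numpy sketch (not from the paper; sizes, seed, and $\sigma^2 = 1$ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 30, 5
X = rng.standard_normal((n, p))
U, d, Vt = np.linalg.svd(X, full_matrices=False)
sigma2 = 1.0                         # assumed error variance

def ridge_cov(lam):
    # Cov[beta(lam)] = sigma^2 V diag(d^2 / (d^2 + lam)^2) V'  -- equation (17)
    s = d**2 / (d**2 + lam)**2
    return sigma2 * (Vt.T * s) @ Vt

# the difference Cov(0) - Cov(lam) should be positive semi-definite
eigs = np.linalg.eigvalsh(ridge_cov(0.0) - ridge_cov(2.0))
```

At $\lambda = 0$ this reproduces the usual OLS covariance $\sigma^2(X^\top X)^{-1}$, and all eigenvalues of the difference are nonnegative.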
These play off each other when we make predictions at new locations $x_0$, leading to a bias-variance tradeoff:
$$\mathrm{MSE}(x_0) = x_0^\top \mathrm{Cov}[\hat\beta(\lambda)]\,x_0 + \big(x_0^\top E[\hat\beta(\lambda)] - x_0^\top\beta\big)^2. \tag{18}$$
For the expected prediction error (EPE) we add $\sigma^2$, the variance of the target response.
Even when the model is correct, a shrunken estimate can perform considerably better than the OLS solution if the bias and variance are traded off correctly. This is especially true in cases when $p$ is large and/or the signal-to-noise ratio is small. Figure 2 shows the results of a linear-model simulation (the settings give a population $R^2$ of 77%). The OLS coefficients (left panel, far left in plot) are wild but unbiased. The ridge-shrunken versions at a well-chosen $\lambda$ are much closer to the true coefficients (red marks), and predictions using them achieve minimum mean-squared error from the true linear function (right panel). Of course we don't know the best $\lambda$, and typically use a left-out dataset or cross-validation to estimate the prediction-error curve empirically. Included in the plot is the leave-one-out (LOO) CV curve, discussed in Section 7, which finds a reasonable value for $\lambda$.
A general form of shrinkage for any multivariate estimator is given by the celebrated James-Stein formula (James and Stein, 1961). In the context of linear regression we have
$$\hat\beta^{\mathrm{JS}} = \Big(1 - \frac{(p-2)\,\sigma^2}{\|X\hat\beta\|_2^2}\Big)\hat\beta, \tag{19}$$
where $\hat\beta$ is the OLS estimator, and $\sigma^2$ can be estimated in the usual fashion via the residual sum of squares for linear regression (Efron and Hastie, 2016, Chapter 7). The left panel of Figure 2 includes the James-Stein estimates on the right axis. The corresponding MSE is above 15, and so it is not that good on this example.
4 Ridge and data augmentation
There are some interesting tricks for fitting a ridge regression using standard linear-model software. Suppose we augment $X$ and $y$ in the following way:
$$X_{\mathrm{aug}} = \begin{bmatrix} X \\ \sqrt{\lambda}\,I_p \end{bmatrix}, \qquad y_{\mathrm{aug}} = \begin{bmatrix} y \\ 0 \end{bmatrix}. \tag{20}$$
Think of the case where $X$ and $y$ are centered. We have added $p$ additional points around the origin, each a distance $\sqrt{\lambda}$ along a coordinate axis. It is straightforward to check that the OLS coefficient computed from the augmented data is exactly $\hat\beta(\lambda)$ in (2). Another way to do this approximately is to augment $X$ in a similar way with rows drawn at random from a Gaussian distribution (or any multivariate distribution with mean zero and covariance proportional to $\lambda I$); $y$ is again padded with zeros. If we fit the model by weighted least squares, giving weight 1 to the original points and appropriate weights to the augmentation points, we approximately achieve $\hat\beta(\lambda)$: approximate because the sample covariance of the random rows is not exactly its expectation. See the left plot in Figure 3 for an example.
Another way to achieve this is to perturb each observed data vector by a random amount. We make $m$ perturbed copies of each $x_i$: $x_{ik} = x_i + \delta_{ik}$, $k = 1, \ldots, m$, where the $\delta_{ik}$ are i.i.d. with mean zero and covariance $\frac{\lambda}{n}I_p$. Each of the $m$ vectors $x_{ik}$ gets the same response $y_i$. Then
$$\frac{1}{m}\sum_{i=1}^n\sum_{k=1}^m\big(y_i - x_{ik}^\top\beta\big)^2 \approx \sum_{i=1}^n\big(y_i - x_i^\top\beta\big)^2 + \lambda\|\beta\|_2^2, \tag{21}$$
because the zero-mean cross terms get averaged away as $m$ gets large.
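The exact augmentation (20) is easy to verify numerically; a minimal numpy sketch (not from the paper; sizes, seed, and $\lambda$ are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 6
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 3.0

# p phantom points, each sqrt(lam) out along a coordinate axis, with response 0
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

beta_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]   # plain OLS on augmented data
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)   # equation (2)
```

Any off-the-shelf OLS routine applied to `X_aug, y_aug` returns the ridge solution exactly, since $X_{\mathrm{aug}}^\top X_{\mathrm{aug}} = X^\top X + \lambda I$ and $X_{\mathrm{aug}}^\top y_{\mathrm{aug}} = X^\top y$.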
Here the augmentation is somewhat artificial. A more realistic augmentation is used in image classification with deep neural networks (Chollet and Allaire, 2018, for example). We have a limited number of labeled training images. The idea is to take each image and apply some natural deformations in such a way that humans would have no trouble making the classification. For example, suppose we have an image of a poodle. By making small rotations, scale and location changes, hue and contrast changes, it will still look like a poodle. We can now augment our training set with all these additional versions, creating a cloud of images around the original, as in Figure 3 (right panel). This kind of regularization prevents the model from overtraining on the original images, and by analogy is a form of ridge regularization.
5 Ridge and dropout regularization
Two forms of regularization have arisen in recent years in the statistical and machine learning communities. They are similar in spirit to each other, and rather closely aligned with ridge regression.
Random forests (Breiman, 2001) have an option whereby each time a tree is grown, and a split is contemplated at a node, only a random subset of size $m$ of the $p$ variables available is considered as candidates for splitting. This causes variables to stand in for each other, or share the predictive burden, a characteristic shared by ridge regression. The smaller $m$, the more regularized the fit.
Modern deep neural networks employ a similar mechanism: dropout learning (Srivastava et al., 2014). Neural networks have layers of activations or transformations that feed forward into successive layers through a series of linear models followed by nonlinear transformations. For example, going from layer $\ell - 1$ with $p_{\ell-1}$ units to layer $\ell$ with $p_\ell$ units, consider computing the activation $a_j^{(\ell)}$ for a single observation during the feed-forward stage of training:
$$a_j^{(\ell)} = g\Big(\sum_{k=1}^{p_{\ell-1}} w_{jk}\,a_k^{(\ell-1)}\Big). \tag{22}$$
The idea is to randomly set each of the activations $a_k^{(\ell-1)}$ to zero with probability $\phi$, and inflate the remaining ones by a factor $1/(1-\phi)$. Hence, for this observation, those nodes that survive have to stand in for those omitted. This is done independently for each observation, and can also be seen as a form of ridge regularization, and when done correctly improves performance. The fraction $\phi$ omitted is a tuning parameter, and for deep networks it appears to be better to use different values at different layers.
We illustrate using a simple version of dropout for linear regression (see also Wager et al. (2013)). For simplicity we assume all variables have mean zero, so we can ignore intercepts. Consider the following random least-squares criterion:
$$\min_\beta\; \sum_{i=1}^n\Big(y_i - \sum_{j=1}^p x_{ij}I_{ij}\beta_j\Big)^2. \tag{23}$$
Here the $I_{ij}$ are i.i.d. variables for all $i, j$ with
$$I_{ij} = \begin{cases} 0 & \text{with probability } \phi,\\ 1/(1-\phi) & \text{with probability } 1-\phi \end{cases} \tag{24}$$
(this particular form is used so that $E[I_{ij}] = 1$). Using simple probability it can be shown that the expected score equations can be written
$$\Big(X^\top X + \frac{\phi}{1-\phi}D\Big)\beta = X^\top y, \tag{25}$$
with $D = \mathrm{diag}\big(\|x^{(1)}\|_2^2, \ldots, \|x^{(p)}\|_2^2\big)$, where $x^{(j)}$ is the $j$th column of $X$. Hence the solution is given by
$$\hat\beta = \Big(X^\top X + \frac{\phi}{1-\phi}D\Big)^{-1}X^\top y, \tag{26}$$
a generalized ridge regression. If the variables are standardized, $D$ becomes a scalar multiple of the identity, and the solution is identical to ridge regression.
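A numpy sketch of this equivalence (not from the paper; sizes, seed, and $\phi$ are made up): with columns scaled to unit norm, the expected-dropout solution (26) coincides with ridge at $\lambda = \phi/(1-\phi)$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 5
X = rng.standard_normal((n, p))
X = X / np.linalg.norm(X, axis=0)         # standardize columns to unit norm
y = rng.standard_normal(n)
phi = 0.3                                 # dropout probability

# expected-dropout score equations (25): (X'X + phi/(1-phi) D) beta = X'y
D = np.diag(np.diag(X.T @ X))             # = I here, since columns have unit norm
beta_dropout = np.linalg.solve(X.T @ X + phi / (1 - phi) * D, X.T @ y)

# ordinary ridge with the matching lambda = phi/(1-phi)
beta_ridge = np.linalg.solve(X.T @ X + phi / (1 - phi) * np.eye(p), X.T @ y)
```

Without the standardization, `D` would weight each coordinate by its column norm, giving a generalized (coordinate-wise) ridge instead.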
6 Ridge and the kernel trick
We start with the ridge problem (3) for the case $p \gg n$, and write out the score equation (up to a factor 2):
$$-X^\top(y - X\beta) + \lambda\beta = 0. \tag{27}$$
For $\lambda > 0$ it is easy to see that the solution must satisfy $\beta = X^\top\alpha$ for some $n$-vector $\alpha$. In other words, the solution lies in the row space of $X$, an $n$-dimensional subspace of $\mathbb{R}^p$ (we assume $X$ has full row rank, for ease of exposition).
From this we can easily derive
$$\hat\alpha = (K + \lambda I)^{-1}y, \tag{28}$$
where $K = XX^\top$ is the $n\times n$ gram matrix of pairwise inner products. Likewise the fit vector is given by
$$\hat y = X\hat\beta = K(K + \lambda I)^{-1}y. \tag{29}$$
So even though $p$ can be very large, all we need to do are $n$-dimensional calculations to compute the solution (although the computations needed to produce $K$ here are $O(n^2 p)$). This is the kernel trick in its simplest form.
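A minimal numpy sketch (not from the paper; sizes, seed, and $\lambda$ are made up) confirming that the $n$-dimensional kernel computation (28) reproduces the $p$-dimensional ridge solution (2) when $p \gg n$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 15, 500                        # wide data: p >> n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 2.0

K = X @ X.T                           # n x n gram matrix of pairwise inner products
alpha = np.linalg.solve(K + lam * np.eye(n), y)   # equation (28)
beta_kernel = X.T @ alpha             # beta = X'alpha lies in the row space of X

# the same solution computed directly in p dimensions
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

The kernel route solves a $15\times 15$ system instead of a $500\times 500$ one, and yields exactly the same coefficients.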
This is also the case for GLMs, and in fact any linear-model fit obtained by optimizing a quadratically penalized objective. We will use a GLM as an example. Denoting the vector of fits by $\eta = X\beta$, we can write the quadratically penalized log-likelihood problem as
$$\max_\beta\; \ell(X\beta;\, y) - \frac{\lambda}{2}\|\beta\|_2^2. \tag{30}$$
Again it can be shown (Hastie and Tibshirani, 2004) that the solution has the form $\hat\beta = X^\top\hat\alpha$, the fit vector is $\hat\eta = K\hat\alpha$, and the optimization in $\alpha$ becomes
$$\max_\alpha\; \ell(K\alpha;\, y) - \frac{\lambda}{2}\alpha^\top K\alpha. \tag{31}$$
Here the linear kernel matrix $K = XX^\top$ corresponds to ridged linear models, but the formulation opens the door to the rich world of function fitting in reproducing kernel Hilbert spaces (RKHS). For any positive-definite bivariate function $K(x, x')$, we can think of it as computing inner products in an implicit feature space: $K(x, x') = \langle h(x), h(x')\rangle$. We end up fitting functions of the form $\hat f(x) = \sum_{i=1}^n \hat\alpha_i K(x, x_i)$, by solving problems of the form (31) with $K_{ii'} = K(x_i, x_{i'})$. See Hastie et al. (2009, Chapter 6)
for details. The support-vector machine for two-class classification is of this form. With $y_i \in \{-1, +1\}$ and the hinge loss, the optimization problem is
$$\min_\alpha\; \sum_{i=1}^n\big[1 - y_i(K\alpha)_i\big]_+ + \frac{\lambda}{2}\alpha^\top K\alpha. \tag{32}$$
For GLMs we can solve (31) using standard ridge software. We compute the Cholesky decomposition $K = R^\top R$, and reparametrize via $\theta = R\alpha$ (again an $n$-vector). Then the objective in $\theta$ becomes
$$\max_\theta\; \ell(R^\top\theta;\, y) - \frac{\lambda}{2}\|\theta\|_2^2, \tag{33}$$
another ridged linear GLM, but in $n$ rather than $p$ dimensions. For the linear kernel $K = XX^\top$, it is easy to see that the solution for the original $\beta$ is $\hat\beta = Q\hat\theta$, where $X^\top = QR$ is the QR decomposition of $X^\top$ (same $R$). The beauty here is that we reduce our wide $n\times p$ matrix $X$ to the $n\times n$ matrix $R^\top$, and then perform our ridge MLE with it instead. We can perform CV to select $\lambda$ in this space as well, and need only map back to $\hat\beta$ if we want to make predictions on new wide vectors $x_0$.
7 Ridge and leave-one-out cross-validation
We have already talked of using the SVD to compute the ridge path of solutions efficiently. This eases the burden when computing the solution paths $K$ times during $K$-fold cross-validation. For $n$-fold (LOO) CV, we have another beautiful result for ridge and other linear operators:
$$y_i - x_i^\top\hat\beta_{(-i)}(\lambda) = \frac{y_i - x_i^\top\hat\beta(\lambda)}{1 - R_{ii}(\lambda)}. \tag{34}$$
Here $\hat\beta_{(-i)}(\lambda)$ is the ridge estimate computed using the $(n-1)$-observation dataset with the pair $(x_i, y_i)$ omitted, and
$$R(\lambda) = X(X^\top X + \lambda I)^{-1}X^\top \tag{35}$$
is the ridge operator matrix for the original $n$-observation matrix $X$. The equation says we can compute all $n$ of the LOO residuals for ridge from the original residuals, each scaled up by $1/(1 - R_{ii}(\lambda))$. From (10) we see we can obtain $R_{ii}(\lambda)$ efficiently for all $\lambda$ via
$$R(\lambda) = US(\lambda)U^\top, \tag{36}$$
with $S(\lambda)$ the diagonal shrinkage matrix with elements $S_{jj}(\lambda) = d_j^2/(d_j^2 + \lambda)$.
One can derive this result using the Sherman-Morrison-Woodbury identity, but a more general and elegant derivation due to Golub et al. (1979) is as follows. For each pair $(x_i, y_i)$ left out we are required to solve
$$\min_\beta\; \sum_{i' \ne i}\big(y_{i'} - x_{i'}^\top\beta\big)^2 + \lambda\|\beta\|_2^2, \tag{37}$$
with solution $\hat\beta_{(-i)}(\lambda)$. Let $\tilde y_i = x_i^\top\hat\beta_{(-i)}(\lambda)$. If we insert the pair $(x_i, \tilde y_i)$ back into the size-$(n-1)$ dataset, it will not change the solution to (37), since this point is on the solution surface (and hence has zero loss at $\hat\beta_{(-i)}(\lambda)$). Back at a full dataset of size $n$, and using the linearity of the ridge operator, we have
$$\tilde y_i = \sum_{i' \ne i} R_{ii'}(\lambda)\,y_{i'} + R_{ii}(\lambda)\,\tilde y_i, \tag{38}$$
from which we see that $y_i - \tilde y_i = \big(y_i - (R(\lambda)y)_i\big)\big/\big(1 - R_{ii}(\lambda)\big)$, which is (34).
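The shortcut (34) is exact for ridge, as a brute-force check confirms; a numpy sketch (not from the paper; sizes, seed, and $\lambda$ are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 25, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 1.0

R = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # ridge operator (35)
loo_shortcut = (y - R @ y) / (1.0 - np.diag(R))           # all n LOO residuals at once

# brute force: refit n times, each time deleting one pair (x_i, y_i)
loo_brute = np.empty(n)
for i in range(n):
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    beta_i = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
    loo_brute[i] = y[i] - X[i] @ beta_i
```

One fit replaces $n$ fits: the full-data residuals, rescaled by $1/(1 - R_{ii}(\lambda))$, reproduce every leave-one-out residual exactly.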
The LOO formula (34) appears to break down when $p > n$ in the limit as $\lambda \downarrow 0$ toward the minimum-norm solution: both the residual in the numerator and $1 - R_{ii}(\lambda)$ in the denominator go to zero. It turns out we can get an equally elegant solution in this case as well (Hastie et al., 2019).
For $p > n$, from (29) we get
$$y - \hat y(\lambda) = \big(I - K(K + \lambda I)^{-1}\big)y = \lambda(K + \lambda I)^{-1}y. \tag{39}$$
Hence the summands in formula (34) can be written as
$$y_i - x_i^\top\hat\beta_{(-i)}(\lambda) = \frac{\big[(K + \lambda I)^{-1}y\big]_i}{\big[(K + \lambda I)^{-1}\big]_{ii}} \tag{40}$$
(the multipliers $\lambda$ from (39) in the numerator and denominator cancel). Now we can set $\lambda = 0$ to obtain
$$y_i - x_i^\top\hat\beta_{(-i)}(0) = \frac{[K^{-1}y]_i}{[K^{-1}]_{ii}}. \tag{41}$$
This formula does not apply if there is an unpenalized intercept in the model. One can show in this case that the corresponding formula is
$$y_i - \hat y_{(-i),i} = \frac{\big[(\tilde K + M)^{+}y\big]_i}{\big[(\tilde K + M)^{+}\big]_{ii}}, \tag{42}$$
where $\tilde K = (I - M)K(I - M)$ is the doubly-centered kernel matrix (with rank at most $n - 1$), $M = \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the mean projection operator, and the superscript $+$ denotes a pseudo-inverse.
8 Ridge, minimum norm and double descent
There has been a burst of activity in the machine learning community over some surprising behavior of a particular class of estimators. Before we get into detail, a bit of context is in order. Deep learning models are dominant in certain high-SNR domains, such as image classification. Practitioners have found that fitting a deep convolutional network with many layers by gradient descent all the way down to zero training error performs very well on test data. These networks typically have more parameters than training observations, and so this would normally be classed as severe overfitting. What has also been observed is that increasing the number of parameters (via more hidden units) can improve performance in some instances (Zhang et al., 2016; Belkin et al., 2018; Hastie et al., 2019).
We first make the simple observation (Zhang et al., 2016) that with $p > n$ and squared-error loss, gradient descent starting at $\beta = 0$ and run down to zero training error gives the minimum-norm solution. This is because the gradient (11) is in the row space of $X$, and hence all the gradient updates remain there. This characterizes the minimum-norm solution (see the end of Section 2).
Figure 4 is a simple demonstration of this double descent behavior. We generated data from a model with 10 predictors. The true function is nonlinear and non-additive, the errors are Gaussian, with variance chosen to fix the SNR (signal-to-noise ratio). The sample size is $n = 100$, and we use a very large test set (10K) to measure out-of-sample error. For our estimation method we use an additive model
$$\hat f(x) = \hat\beta_0 + \sum_{j=1}^{10} h_j(x_j)^\top\hat\beta_j, \tag{43}$$
where each $h_j(\cdot)$ is a $d$-vector of natural-spline basis functions, with knots chosen at uniform quantiles of the training values for variable $x_j$ (Hastie et al., 2009, Chapter 5). The total number of parameters or dimension of the model is $10d$, with $d$ stepping from 1 through 30. The dimension reaches the training sample size of 100 when $d = 10$.
The black curve shows the test prediction error of the minimum-norm least-squares fits, and the orange curve their training error. The training error behaves as expected, reaching and staying at zero as the dimension passes 100. Before 100, the fits are OLS and overfit almost immediately: the test error increases to a dramatic level (around 2000) before descending down again as the dimension increases. The blue curve corresponds to the optimally tuned ridge estimator, which is fairly flat, with a minimum around dimension 20.
The apparent dilemma here is that the black curve does not show the usual bias-variance tradeoff as the dimension increases. The explanation for the increase around 100 is that the model has to interpolate the training data to achieve zero training error. As a result the prediction curve has to wiggle a lot between training points, and the $\ell_2$ norm of the coefficients gets large. This does not bode well for out-of-sample prediction, since the test features fall in this in-between region. But as the dimension grows beyond 100, zero error can be achieved more smoothly with decreasing norm, and this leads to improved out-of-sample prediction. So the complexity of the model is not determined by dimension alone; the potential for smaller norm of the solution at each dimension plays a role as well. The "objective" that is optimized at each dimension combines loss and complexity (OLS fit with minimum norm), which clouds the usual bias-variance picture.
It is interesting to note that the minimum-norm fitting method is linear in $y$ (see (12)). However, as the number of columns (dimension) of the basis matrix increases, the small singular values increase, leading to the potential for more stable and smaller-norm solutions. Figure 5 illustrates this for our example.
9 Ridge + ridge = lasso
Suppose $M$ is an $m\times n$ matrix. Consider the optimization problem
$$\min_{\mathrm{rank}(Z) \le r}\; \|M - Z\|_F^2. \tag{44}$$
Even though the rank constraint makes this problem nonconvex, the solution is well known and easy to compute. We compute the SVD of $M = UDV^\top$, set all but the top $r$ singular values of $M$ to zero, and reconstruct. There is a convex relaxation of this problem (Fazel, 2002):
$$\min_Z\; \tfrac{1}{2}\|M - Z\|_F^2 + \lambda\|Z\|_*, \tag{45}$$
where $\|Z\|_*$ is the nuclear norm of $Z$, the sum of its singular values. The solution is again via the SVD of $M$: now we replace the singular values $d_j$ by their soft-thresholded values $(d_j - \lambda)_+$, and reconstruct. This is the lasso version of rank selection for matrices. Suppose $\lambda$ is such that exactly $r$ of the $(d_j - \lambda)_+$ are greater than zero, and hence the solution $\hat Z$ has rank $r$. Now consider the doubly ridged problem (Srebro et al., 2005)
$$\min_{A, B}\; \tfrac{1}{2}\|M - AB^\top\|_F^2 + \frac{\lambda}{2}\big(\|A\|_F^2 + \|B\|_F^2\big) \tag{46}$$
over $m\times r$ matrices $A$ and $n\times r$ matrices $B$. This biconvex problem has a solution that coincides with that of (45): $\hat A\hat B^\top = \hat Z$. It is quite remarkable that a biconvex ridge-regularized problem is equivalent to a convex nuclear-norm-regularized problem! At this point these connections may seem somewhat academic, since the SVD of $M$ provides solutions to all the problems above. These connections show their strength in variations of these problems.
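The soft-thresholding solution of (45) is a one-liner once we have the SVD; a numpy sketch (not from the paper; the matrix, seed, and $\lambda$ are made up):

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.standard_normal((8, 6)) @ rng.standard_normal((6, 5))   # an 8 x 5 matrix
lam = 1.0

U, d, Vt = np.linalg.svd(M, full_matrices=False)
d_soft = np.maximum(d - lam, 0.0)       # soft-threshold the singular values
Z = (U * d_soft) @ Vt                   # solution of the nuclear-norm problem (45)
```

The singular values of `Z` are exactly the soft-thresholded values, so the rank of the solution is the number of $d_j$ exceeding $\lambda$.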

If $M$ is massive and sparse, we can compute a low-rank matrix approximation by alternating ridge regressions. Given $B$, we obtain $A$ via the ridge regression $A = MB(B^\top B + \lambda I)^{-1}$, which multiplies the sparse $M$ by a dense skinny matrix. Likewise for $B$ given $A$.

When $M$ has missing values, we can solve the matrix-completion objective
$$\min_Z\; \tfrac{1}{2}\big\|P_\Omega(M - Z)\big\|_F^2 + \lambda\|Z\|_*, \tag{47}$$
where $P_\Omega$ projects onto the observed values of $M$ (i.e., the Frobenius norm ignores the entries corresponding to missing values in $M$). Objective (47) is convex in $Z$, and for large problems solutions can be obtained again using the alternating ridged version (46) (Hastie et al., 2015a).
10 Discussion
We see through these examples that ridge regularization and its extensions are pervasive in applied statistics. Although the lasso enjoys widespread popularity in wide-data scenarios, both ridge and the elastic net claim some of the territory. In document and webpage classification using bag-of-words and n-grams, synonyms create problems for methods that select variables; ridge includes them all, and suffers less. Furthermore, classifiers are hurt less by bias than quantitative regressions. What started off as a simple fix for wayward linear regression models has evolved into a large collection of tools for data modeling. It would be hard to imagine the life of a data scientist without them.
Acknowledgements
Thanks to Rob Tibshirani for helpful comments. This research was partially supported by grants DMS-1407548 and IIS-1837931 from the National Science Foundation, and grant 5R01-EB-001988-21 from the National Institutes of Health.
References
Belkin, M., Hsu, D., Ma, S. and Mandal, S. (2018). Reconciling modern machine learning and the bias-variance tradeoff. arXiv:1812.11118.
Breiman, L. (2001). Random forests. Machine Learning 45, pp. 5–32.
Chollet, F. and Allaire, J. J. (2018). Deep Learning with R. 1st edition, Manning Publications Co., USA. ISBN 161729554X.
Chouldechova, A. and Hastie, T. (2015). Generalized additive model selection. arXiv:1506.03850.
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence and Data Science. Cambridge University Press.
Fazel, M. (2002). Matrix Rank Minimization with Applications. Ph.D. thesis, Department of Electrical Engineering, Stanford University.
Golub, G., Heath, M. and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21, pp. 215–224.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Second edition, Springer-Verlag, New York.
Hastie, T., Mazumder, R., Lee, J. and Zadeh, R. (2015a). Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research 16, pp. 3367–3402.
Hastie, T., Montanari, A., Rosset, S. and Tibshirani, R. J. (2019). Surprises in high-dimensional ridgeless least squares interpolation. arXiv:1903.08560.
Hastie, T. and Tibshirani, R. (2004). Efficient quadratic regularization for expression arrays. Biostatistics 5(3), pp. 329–340.
Hastie, T., Tibshirani, R. and Wainwright, M. (2015b). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall, CRC Press.
Hoerl, A. and Kennard, R. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, pp. 55–67.
Jacob, L., Obozinski, G. and Vert, J.-P. (2009). Group lasso with overlap and graph lasso. In Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada.
James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proc. 4th Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, pp. 361–379.
Lim, M. and Hastie, T. (2015). Learning interactions via hierarchical group-lasso regularization. Journal of Computational and Graphical Statistics 24(3), pp. 627–654.
Srebro, N., Rennie, J. and Jaakkola, T. (2005). Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, pp. 1329–1336.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), pp. 1929–1958.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, pp. 267–288.
Wager, S., Wang, S. and Liang, P. (2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, pp. 351–359.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68(1), pp. 49–67.
Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv:1611.03530.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67(2), pp. 301–320.