Consider a standard prediction problem in which a dataset is employed to learn a model for inferring information about new datapoints that are yet to be observed. One of the most popular classes of learning schemes, specially in high-dimensional settings, studies the following optimization problem:
where is a convex loss function, is a convex regularizer, is a closed convex set and is the tuning parameter that specifies the amount of regularization. By applying an appropriate regularizer in (2
), we are able to achieve better bias-variance trade-off and pursue special structures such as sparsity and low rank structure. However, the performance of such techniques hinges upon the selection of tuning parameters.
The most generally applicable tuning method is cross validation . One common choice is -fold cross validation. This method presents potential bias issues in high-dimensional settings where is comparable to
, specially when the number of folds is not very large. For instance, the phase transition phenomena that happen in such regimes[3, 15, 16, 54] indicate that any data splitting may cause dramatic effects on the solution of (2) (see Figure 1 for an example). Hence, the risk estimates obtained from -fold cross validation may not be reliable. The bias issues of -fold cross validation may be alleviated by choosing the number of folds
to be large. This makes LOOCV particularly appealing, since it offers an approximately unbiased estimate of the risk. However, the computation of LOOCV requires training the modeltimes, which is unaffordable for large datasets.
The high computational complexity of LOOCV has motivated researchers to propose computationally less demanding approximations of the quantity. Early examples offered approximations for the case and the loss function being smooth [1, 39, 30, 11, 33, 38]. In , the authors considered such approximations for smooth loss functions and smooth regularizers. In this line of work, the accuracy of the approximations was either not studied or was only studied in the large, fixed regime. In a recent paper,  employed a similar approximation strategy to obtain approximate leave-one-out formulas for smooth loss functions and smooth regularizers. They show that under some mild conditions, such approximations are accurate in high-dimensional settings. Unfortunately, the approximations offered in  only cover twice differentiable loss functions and regularizers. On the other hand, numerous modern regularizers, such as generalized LASSO and nuclear norm, and also many loss functions, such as hinge loss, are not smooth.
In this paper, we propose three powerful frameworks for calculating an approximate leave-one-out estimator (ALO) of the LOOCV risk that are capable of offering accurate parameter tuning even for non-differentiable losses and regularizers. Our first approach is based on the approximation of the dual of (2). Our second approach is based on the smoothing and quadratic approximation of the primal problem (2). The third approach is based on the proximal formulation of (2). While the three approaches consider different approximations that happen in different domains, we will show that when both and are twice differentiable, the three frameworks produce the same ALO formulas, which are also the same as the formulas proposed in .
We use our platforms to obtain concise formulas for several popular examples including generalized LASSO, support vector machine (SVM) and nuclear norm minimization. As will be clear from our examples, despite the equivalence of the three frameworks for smooth loss functions and regularizers, the technical aspects of deriving ALO formulas have major variations in different examples. In Remark 5.3
we have a short discussion about the strength of different approaches on different problems. Finally, we present extensive simulations to confirm the accuracy of our formulas on various important machine learning models.
1.2 Other Related Work
The importance of parameter tuning in learning systems has encouraged many researchers to study this problem from different perspectives. In addition to cross validation, several other approaches have been proposed including Stein’s unbiased risk estimate (SURE), Akaike information criterion (AIC), and Mallow’s
. While AIC is designed for smooth parametric models, SURE has been extended to emerging optimization problems, such as generalized LASSO and nuclear norm minimization[10, 17, 51, 52, 57].
Unlike cross validation which approximates the out-of-sample prediction error, SURE, AIC, and offer estimates for in-sample prediction error . This makes cross validation more appealing for many learning systems. Furthermore, unlike ALO, both SURE and only work on linear models (and not generalized linear models) and their unbiasedness is only guaranteed under the Gaussian model for the errors. There has been little success in extending SURE beyond this model .
Another class of parameter tuning schemes are based on approximate message passing framework [4, 36, 37]. As pointed out in , this approach is intuitively related to LOOCV. It offers consistent parameter tuning in high-dimensions [36, 53], but the results strongly depend on the independence of the elements of . This limits to application of this approach to very specific problems.
1.3 Organization of the Paper
Our paper is organized as follows: Section 2 contributes to some preliminaries which will be uesd later. Section 3, 4, 5 introduce respectively the dual approach, primal approach and proximal approach to obtain the ALO formula. Then in Section 6 we prove the equivalence of the three approaches under the smoothness conditions, followed by a corollary related to accuracy. All the above sections discuss ALO without including the intercept term in the model. Thus in Section 7 we address the case when the intercept is contained. We then apply the ALO approaches introduced in previous sections to several models and obtain their specific ALO formula in Section 8. Experimental results are presented in Section 9. Finally, after a short discussion in Section 10, we present all the proofs in Section 11.
Lowercase and uppercase bold letters denote vectors and matrices, respectively. For subsets and of indices and a matrix , let and denote the submatrices that include only rows of in , and columns of in respectively. Let denote the vector whose components are for . We may omit , in which case we consider all indices valid in the context. For a function , let , denote its 1st and 2nd derivatives. For a vector , we use to denote a diagonal matrix with . Finally, let and denote the gradient and Hessian of a function .
In this section we describe the problem to be studied in this paper and some preliminary knowledge needed for subsequent analyses. We start with the unconstrained learning problems. In Section 5.3, we will discuss the generalization to the constrained ones.
2.1 Problem Description
In this paper, we study the statistical learning models in form (2). For each value of , we evaluate the following LOOCV risk estimate with respect to some error function :
where is the solution of the leave--out problem
Calculating (4) requires training the model times, which may be time-consuming in high-dimensions. As an alternative, we propose an estimator to approximate based on the full-data estimator to reduce the computational complexity. We consider three frameworks for obtaining , and denote the corresponding risk estimate by:
The estimates we obtain will be called approximated leave-one-out (ALO) throughout the paper.
2.2 Primal and Dual Correspondence
The objective function of penalized regression problem with loss and regularizer is given by:
Here and subsequently, unless necessary, we absorb the value of into to simplify the notation. We also consider the Lagrangian dual problem, which can be written in the form:
where and denote the Fenchel conjugates111The Fenchel conjugate of a function is defined as . of and respectively. See the derivation in Appendix A. It is known that under mild conditions, (5) and (6) are equivalent . In this case, we have the primal-dual correspondence relating the primal optimal and the dual optimal :
where denotes the set of subgradients of a function with respect to its first argument. These relations will help us approximate from primal and dual perspectives.
2.3 Proximal Formulation
In this section, we review another characterization of that will be used for approximating . Consider the following definition:
The proximal operator of a function is defined as
When , we will write instead of for notational simplicity. For many modern regularizers , such as LASSO and nuclear norm, has an explicit expression. We summarize some of the properties of the proximal operator in the following lemma:
The proximal operator satisfies the following properties:
The proximal operator is nonexpansive, i.e.,
Let be a convex and piecewise smooth function with number of zeroth-order singularities222 A singular point of a function is called th order, if at this point the function is times differentiable, but its th order derivative does not exist. , then takes constant value when with denoting the left-derivative and for the right. Note that for different value of , the convexity guarantees these intervals do not overlap with each other. Further, is differentiable as long as does not lie on the boundaries of these intervals;
If is a twice differentiable convex function, then the Jacobian of
exists. In addition, the Jacobian matrix is symmetric and its eigenvalues are all between zero and one.
A function is a proximal operator of a convex function if and only if is nonexpansive and a gradient of a convex function;
Our interest in the proximal operator stems from the fact that it provides another formulation for evaluating . More specifically, under some mild conditions, the solution of the primal problem is the unique fixed point of the following equation:
3 Approximation in the Dual Domain
In this section, we introduce the dual approach to obtain the ALO formula. We first explain the idea using LASSO as an example. Then the approach is extended to general regularzers and general smooth losses.
3.1 The First Example: LASSO
Let us first start with a simple example that illustrates our dual method in deriving an approximate leave-one-out (ALO) formula for the standard LASSO. The LASSO estimator, first proposed in , can be formulated as the penalized regression framework in (5) by setting , and . We recall the general formulation of the dual for penalized regression problems (6), and note that in the case of the LASSO we have:
In particular, we note that the solution of the dual problem (6) can be obtained from:
Here denotes the projection onto , where is the polytope given by:
Let us now consider the leave--out problem. Unfortunately, the dimension of the dual problem is reduced by 1 for the leave--out problem, making it difficult to leverage the information from the full-data solution to help approximate the leave--out solution. We propose to augment the leave--out problem with a virtual th observation which does not affect the result of the optimization, but restores the dimensionality of the problem.
More precisely, let be the same as , except that its th coordinate is replaced by , the leave--out predicted value. We note that the leave--out solution is also the solution for the following augmented problem:
Additionally, the primal-dual correspondence (7) gives that , which is the residual in the augmented problem, and hence that . These two features allow us to characterize the leave--out predicted value , as satisfying:
where denotes the th standard vector. Solving exactly for the above equation is in general a procedure that is computationally comparable to fitting the model, which may be expensive. However, we may attempt to obtain an approximate solution of (11) by linearizing the projection operator at the full data solution . The approximate leave--out fitted value is thus given by:
where denotes the Jacobian of the projection operator at the full data problem . The nonexpansiveness of guarantees the almost everywhere existence of . Note that is a polytope, and thus the projection onto is almost everywhere locally affine . Furthermore, it is straightforward to calculate the Jacobian of . Let be the equicorrelation set (where denotes the column of ), then we have that the projection at the full data problem is locally given by a projection onto the orthogonal complement of the span of , thus giving . We can then obtain by plugging in (12). The risk of LASSO can be estimated through
3.2 General Case
In this section we extend the dual approach outlined in Section 3.1 to more general loss functions and regularizers.
Let us first extend the dual approach to other regularizers, while the loss function remains . In this case the dual problem (6) has the following form:
Note that the optimal value of is by definition the value of the proximal operator of at :
Following the argument of Section 3.1, we obtain
with now denoting the Jacobian of . We note that the Jacobian matrix exists almost everywhere, because the non-expansiveness of the proximal operator guarantees its almost-everywhere differentiability . In particular, if the distribution of is absolutely continuous with respect to the Lebesgue measure,
exists with probability 1. This approach is particularly useful whenis a norm, as its Fenchel conjugate is then the convex indicator of the unit ball of the dual norm, and the proximal operator reduces to a projection operator.
In summary, since , the risk of can be estimated through the following formula:
where is the Jacobian of . We calculate for several popular regularizers in Section 8.
General smooth loss
Let us now assume we have a convex smooth loss in (5), such as those that appear in generalized linear models. As we are arguing from a second-order perspective by considering Newton’s method, we will attempt to expand the loss as a quadratic form around the full data solution. We will thus consider the approximate problem obtained by expanding around the dual optimal of (6):
The constant term has been removed from (16) for simplicity. We note that we have reduced the problem to a problem with a weighted loss which may be further reduced to a simple problem by a change of variable and a rescaling of . Indeed, let be the diagonal matrix such that , and note that we have: by the primal-dual correspondence (7). Consider the change of variable to obtain:
We may thus reduce to the loss case in (13) with a modified and :
Similar to (14), the ALO formula in the case of general smooth loss can be obtained as , with
where is the Jacobian of .
4 Approximation in the Primal Domain
The dual approach is typically powerful for models with smooth losses and norm-type regularizers, such as the LASSO. However, it might be difficult to carry out the calculations for other problems. Hence, in this section we introduce our second method for finding .
4.1 Smooth Loss and Smooth Regularizer
Recall that to obtain we need to solve
Assuming is close to , we can take one Newton step from towards to obtain its approximation as:
By employing the matrix inversion lemma  we obtain:
This is the formula reported in . By calculating and in advance, we can cheaply approximate the leave--out prediction for all and efficiently evaluate the LOOCV risk. On the other hand, in order to use the above strategy, twice differentiability of both the loss and the regularizer is necessary in a neighborhood of . However, this assumption is violated for many machine learning models including LASSO, nuclear norm, and SVM. In the next two sections, we introduce a smoothing technique which lifts the scope of the above primal approach to nondifferentiable losses and regularizers.
4.2 Nonsmooth Loss and Smooth Regularizer
In this section we study the piecewise smooth loss functions and twice differentiable regularizers. Such problems arise for instance in SVM  and robust regression . Below we assume the loss is piecewise twice differentiable with zeroth-order singularities . The existence of singularities prohibits us from directly applying strategies in (20) and (21), where twice differentiability of and is necessary. A natural solution is to first smooth out the loss function , then apply the framework in the previous section to the smoothed version and finally reduce the smoothness to recover the ALO formula for the original nonsmooth problem. As the first step, consider the following smoothing idea:
where is a parameter controlling the smoothness of and is a symmetric, infinitely many times differentiable function with the following properties:
Normalization: , , ;
Compact support: for some .
where is the minimizer on the full data from loss and . is a good approximation to the leave--out estimator based on smoothed loss .
Setting , we have that converges to uniformly in the region of interest (see Appendix 11.2.1 for the proof), implying that serves as a good estimator of
, which is heuristically close to the true leave--out . Equation (23) can be simplified in the limit . We define the sets of indices and for the samples at singularities and smooth parts respectively:
The following assumptions are necessary to derive the limit as . We need the following assumptions on , and :
is locally Lipschitz, that is, for any , for any , we have , where is a constant depends only on .
is the unique minimizer.
Whenever , the subgradient of at , satisfies .
is coercive in the sense that as .
We characterize the limit of below.
Under Assumptions 4.2, as ,
For , , and for , we have:
4.3 Nonsmooth Separable Regularizer and Smooth Loss
The smoothing technique proposed in the last section can also handle many nonsmooth regularizers. In this section we focus on separable regularizers , defined as , where is piecewise twice differentiable with finite number of zeroth-order singularities in (examples on non-separable regularizers are studied in Section 8.) We further assume the loss function to be twice differentiable and denote by the active set.
For the coordinates of that lie in , our objective function, constrained to these coordinates, is locally twice differentiable. Hence we expect to be well approximated by the ALO formula using only . On the other hand, components not in are trapped at singularities. Thus as long as they are not on the boundary of being in or out of the singularities, we expect these locations of to stay at the same values. Technically, consider a similar smoothing scheme for :
and let . We then consider the ALO formula of Model (19) with regularizer :
We need the following assumptions to obtain the limiting case as . We will need the following assumptions on the problem.
is locally Lipschiz in the sense that, for any , and for any , we have , where is a constant that only depends on ;
is the unique minimizer of (65);
When , the subgradient of at satisfies .
is coercive in the sense that as .
Under Assumption 4.3, as ,
For nonsmooth problems, higher order singularities do not cause issues: the set of tuning values which cause (for regularizer) or (for loss) to fall at those higher order singularities has measure zero.
For both nonsmooth losses and regularizers, we need to invert some matrices in the ALO formula. Although the invertibility does not seem guaranteed in the general formula, as we apply ALO to specific models, the structures of the loss and/or the regularizer ensures this invertibility. For example, for LASSO, we have the size of the equi-correlation set under weak conditions on and . .
5 Approximation with Proximal Formulation
The primal and dual formulas for approximating cover a large number of optimization problems. However, carrying out the calculations involved in these two methods is still challenging for certain classes of optimization problems, such as constrained optimization problems we discussed in the introduction. Hence, in this section, we introduce our third approach which is based on the proximal formulation. We will later prove that for smooth losses and regularizers this method is equivalent to the primal formulation and the dual formulation.
5.1 Smooth Loss and Regularizer
In this section, we start with twice differentiable loss functions and regularizers. As discussed in Section 2, is the unique solution of the following fixed point equation:
Since is close to , we can obtain a good approximation of by linearizing at . Since the regularizer is twice differentiable, according to Lemma 2.1, is a differentiable function. Let denote the Jacobian of at . The following Newton step for finding root of equation systems enables us to obtain an approximation of .
Using this heuristic argument we obtain the following approximation for :
Assuming is invertible, one can use the matrix inversion lemma to obtain
Hence, our final approximation of is given by
In summary, the formula is given by
Even though we used several heuristic steps to obtain this formula, in Section 6, we will connect this formula with those derived from the primal and dual perspectives and prove the accuracy of this formula.
5.2 Generalization to Nonsmooth Regularizer
In this section, we handle non-differentiable regularizers using the approach developed in Section 5.1. Here we consider separable nonsmooth regularizers where , while similar technique can be used in more general scenarios. Suppose that has zeroth-order singularities . To use (29) and (30), we apply the same smoothing scheme introduced in Section 4.3 to and obtain its smoothed version :