Consider a standard prediction problem in which a dataset is employed to learn a model for inferring information about new datapoints that are yet to be observed. One of the most popular classes of learning schemes, specially in high-dimensional settings, studies the following optimization problem:
where is the loss function, is the regularizer, and is the tuning parameter that specifies the amount of regularization. By applying an appropriate regularizer in (2
), we are able to achieve better bias-variance trade-off and pursue special structures such as sparsity and low rank structure. However, the performance of such techniques hinges upon the selection of tuning parameters.
The most generally applicable tuning method is cross validation . One common choice is -fold cross validation, which however presents potential bias issues in high-dimensional settings where is comparable to
. For instance, the phase transition phenomena that happen in such regimes[3, 13, 14] indicate that any data splitting may cause dramatic effects on the solution of (2) (see Figure 1 for an example). Hence, the risk estimates obtained from -fold cross validation may not be reliable. The bias issues of -fold cross validation may be alleviated by choosing the number of folds to be large. However, such schemes are computationally demanding and may not be useful for emerging high-dimensional applications. An alternative choice of cross validation is LOOCV, which is unbiased in high-dimensional problems. However, the computation of LOOCV requires training the model times, which is unaffordable for large datasets.
The high computational complexity of LOOCV has motivated researchers to propose computationally less demanding approximations of the quantity. Early examples offered approximations for the case and the loss function being smooth [1, 32, 25, 9, 27, 31]. In , the authors considered such approximations for smooth loss functions and smooth regularizers. In this line of work, the accuracy of the approximations was either not studied or was only studied in the large, fixed regime. In a recent paper,  employed a similar approximation strategy to obtain approximate leave-one-out formulas for smooth loss functions and smooth regularizers. They show that under some mild conditions, such approximations are accurate in high-dimensional settings. Unfortunately, the approximations offered in  only cover twice differentiable loss functions and regularizers. On the other hand, numerous modern regularizers, such as generalized LASSO and nuclear norm, and also many loss functions are not smooth.
In this paper, we propose two powerful frameworks for calculating an approximate leave-one-out estimator (ALO) of the LOOCV risk that are capable of offering accurate parameter tuning even for non-differentiable losses and regularizers. Our first approach is based on the smoothing and quadratic approximation of the primal problem (2). The second approach is based on the approximation of the dual of (2). While the two approaches consider different approximations that happen in different domains, we will show that when both and are twice differentiable, the two frameworks produce the same ALO formulas, which are also the same as the formulas proposed in .
We use our platforms to obtain concise formulas for several popular examples including generalized LASSO, support vector machine (SVM) and nuclear norm minimization. As will be clear from our examples, despite of the equivalence of the two frameworks for smooth loss functions and regularizers, the technical aspects of the derivations involved for obtaining ALO formulas have major variations in different examples. Finally, we present extensive simulations to confirm the accuracy of our formulas on various important machine learning models. Code is available atgithub.com/wendazhou/alocv-package.
1.2 Other Related Work
The importance of parameter tuning in learning systems has encouraged many researchers to study this problem from different perspectives. In addition to cross validation, several other approaches have been proposed including Stein’s unbiased risk estimate (SURE), Akaike information criterion (AIC), and Mallow’s
. While AIC is designed for smooth parametric models, SURE has been extended to emerging optimization problems, such as generalized LASSO and nuclear norm minimization[8, 15, 42, 43, 46].
Unlike cross validation which approximates the out-of-sample prediction error, SURE, AIC, and offer estimates for in-sample prediction error . This makes cross validation more appealing for many learning systems. Furthermore, unlike ALO, both SURE and only work on linear models (and not generalized linear models) and their unbiasedness is only guaranteed under the Gaussian model for the errors. There has been little success in extending SURE beyond this model .
Another class of parameter tuning schemes are based on approximate message passing framework [4, 29, 30]. As pointed out in , this approach is intuitively related to LOOCV. It offers consistent parameter tuning in high-dimensions , but the results strongly depend on the independence of the elements of .
Lowercase and uppercase bold letters denote vectors and matrices, respectively. For subsets and of indices and a matrix , let and denote the submatrices that include only rows of in , and columns of in respectively. Let denote the vector whose components are for . We may omit , in which case we consider all indices valid in the context. For a function , let , denote its 1st and 2nd derivatives. For a vector , we use to denote a diagonal matrix with . Finally, let and denote the gradient and Hessian of a function .
2.1 Problem Description
In this paper, we study the statistical learning models in form (2). For each value of , we evaluate the following LOOCV risk estimate with respect to some error function :
where is the solution of the leave--out problem
Calculating (4) requires training the model times, which may be time-consuming in high-dimensions. As an alternative, we propose an estimator to approximate based on the full-data estimator to reduce the computational complexity. We consider two frameworks for obtaining , and denote the corresponding risk estimate by:
The estimates we obtain will be called approximated leave-one-out (ALO) throughout the paper.
2.2 Primal and Dual Correspondence
The objective function of penalized regression problem with loss and regularizer is given by:
Here and subsequently, we absorb the value of into to simplify the notation. We also consider the Lagrangian dual problem, which can be written in the form:
where and denote the Fenchel conjugates 111The Fenchel conjugate of a function is defined as . of and respectively. See the derivation in Appendix A.
where denotes the set of subgradients of a function . Below we will use both primal and dual perspectives for approximating .
3 Approximation in the Dual Domain
3.1 The First Example: LASSO
Let us first start with a simple example that illustrates our dual method in deriving an approximate leave-one-out (ALO) formula for the standard LASSO. The LASSO estimator, first proposed in , can be formulated as the penalized regression framework in (6) by setting , and .
We recall the general formulation of the dual for penalized regression problems (7), and note that in the case of the LASSO we have:
In particular, we note that the solution of the dual problem (7) can be obtained from:
Here denotes the projection onto , where is the polytope given by:
Let us now consider the leave--out problem. Unfortunately, the dimension of the dual problem is reduced by 1 for the leave--out problem, making it difficult to leverage the information from the full-data solution to help approximate the leave--out solution. We propose to augment the leave--out problem with a virtual th observation which does not affect the result of the optimization, but restores the dimensionality of the problem.
More precisely, let be the same as , except that its th coordinate is replaced by , the leave--out predicted value. We note that the leave--out solution is also the solution for the following augmented problem:
Additionally, the primal-dual correspondence (8) gives that , which is the residual in the augmented problem, and hence that . These two features allow us to characterize the leave--out predicted value , as satisfying:
where denotes the th standard vector. Solving exactly for the above equation is in general a procedure that is computationally comparable to fitting the model, which may be expensive. However, we may attempt to obtain an approximate solution of (11) by linearizing the projection operator at the full data solution , or equivalently performing a single Newton step to solve the leave--out problem from the full data solution. The approximate leave--out fitted value is thus given by:
where denotes the Jacobian of the projection operator at the full data problem . Note that is a polytope, and thus the projection onto is almost everywhere locally affine . Furthermore, it is straightforward to calculate the Jacobian of . Let be the equicorrelation set (where denotes the column of ), then we have that the projection at the full data problem is locally given by a projection onto the orthogonal complement of the span of , thus giving . We can then obtain by plugging in (12). Finally, by replaceing with in (5) we obtain an estimate of the risk.
3.2 General Case
In this section we extend the dual approach outlined in Section 3.1 to more general loss functions and regularizers.
Let us first extend the dual approach to other regularizers, while the loss function remains . In this case the dual problem (7) has the following form:
We note that the optimal value of is by definition the value of the proximal operator of at :
Following the argument of Section 3.1, we obtain
with now denoting the Jacobian of . We note that the Jacobian matrix exists almost everywhere, because the non-expansiveness of the proximal operator guarantees its almost-everywhere differentiability . In particular, if has distribution which is absolutely continuous with respect to the Lebesgue measure,
exists with probability 1. This approach is particularly useful whenis a norm, as its Fenchel conjugate is then the convex indicator of the unit ball of the dual norm, and the proximal operator reduces to a projection operator.
General smooth loss
Let us now assume we have a convex smooth loss in (6), such as those that appear in generalized linear models. As we are arguing from a second-order perspective by considering Newton’s method, we will attempt to expand the loss as a quadratic form around the full data solution. We will thus consider the approximate problem obtained by expanding around the dual optimal :
The constant term has been removed from (15) for simplicity. We note that we have reduced the problem to a problem with a weighted loss which may be further reduced to a simple problem by a change of variable and a rescaling of . Indeed, let be the diagonal matrix such that , and note that we have: by the primal-dual correspondence (8). Consider the change of variable to obtain:
We may thus reduce to the loss case in (13) with a modified and :
Similar to (14), the ALO formula in the case of general smooth loss can be obtained as , with
where is the Jacobian of .
4 Approximation in the Primal Domain
4.1 Smooth Loss and Regularizer
To obtain we need to solve
Assuming is close to , we can take a Newton step from towards to obtain its approximation as:
By employing the matrix inversion lemma  we obtain:
This is the formula reported in . By calculating and in advance, we can cheaply approximate the leave--out prediction for all and efficiently evaluate the LOOCV risk. On the other hand, in order to use the above strategy, twice differentiability of both the loss and the regularizer is necessary in a neighborhood of . However, this assumption is violated for many machine learning models including LASSO, Nuclear norm, and SVM. In the next two sections, we introduce a smoothing technique which lifts the scope of the above primal approach to nondifferentiable losses and regularizers.
4.2 Nonsmooth Loss and Smooth Regularizer
In this section we study the piecewise smooth loss functions and twice differentiable regularizers. Such problems arise in SVM  and robust regression . Before proceeding further, we clarify our assumptions on the loss function. A singular point of a function is called th order, if at this point the function is times differentiable, but its th order derivative does not exist.
Below we assume the loss is piecewise twice differentiable with zero-order singularities . The existence of singularities prohibits us from directly applying strategies in (19) and (20), where twice differentiability of and is necessary. A natural solution is to first smooth out the loss function , then apply the framework in the previous section to the smoothed version and finally reduce the smoothness to recover the ALO formula for the original nonsmooth problem.
As the first step, consider the following smoothing idea:
where is a fixed number and is a symmetric, infinitely many times differentiable function with the following properties:
Normalization: , , ;
Compact support: for some .
where is the minimizer on the full data from loss and . is a good approximation to the leave--out estimator based on smoothed loss .
Setting , we have that converges to uniformly in the region of interest (see Appendix C.1 for the proof), implying that serves as a good estimator of
, which is heuristically close to the true leave--out . Equation (22) can be simplified in the limit . We define the sets of indices and for the samples at singularities and smooth parts respectively:
We characterize the limit of below.
Under some mild conditions, as ,
For , , and for , we have:
4.3 Nonsmooth Regularizer and Smooth Loss
The smoothing technique proposed in the last section can also handle many nonsmooth regularizers. In this section we focus on separable regularizers , defined as , where is piecewise twice differentiable with finite number of zero-order singularities in . (Examples on non-separable regularizers are studied in Section 6.) We further assume the loss function to be twice differentiable and denote by the active set.
For the coordinates of that lie in , our objective function, constrained to these coordinates, is locally twice differentiable. Hence we expect to be well approximated by the ALO formula using only . On the other hand, components not in are trapped at singularities. Thus as long as they are not on the boundary of being in or out of the singularities, we expect these locations of to stay at the same values.
Technically, consider a similar smoothing scheme for :
and let . We then consider the ALO formula of Model (18) with regularizer .
Setting , (23) reduces to a simplified formula which heuristically serves as a good approximation to the true leave--out estimator , stated as the following theorem: Under some mild conditions, as ,
For nonsmooth problems, higher order singularities do not cause issues: the set of tuning values which cause (for regularizer) or (for loss) to fall at those higher order singularities has measure zero.
For both nonsmooth losses and regularizers, we need to invert some matrices in the ALO formula. Although the invertibility does not seem guaranteed in the general formula, as we apply ALO to specific models, the structures of the loss and/or the regularizer ensures this invertibility. For example, for LASSO, we have that the size of the active set .
We note that the dual approach is typically powerful for models with smooth losses and norm-type regularizers, such as the SLOPE norm and the generalized LASSO. On the other hand, the primal approach is valuable for models with nonsmooth loss or when the Hessian of the regularizer is feasible to calculate. Such regularizers often exhibit some type of separability or symmetry, such as in the case of SVM or nuclear norm.
5 Equivalence Between Primal and Dual Methods
Although the primal and dual methods may be harder or easier to carry out depending on the specific problem at hand, one may wonder if they always obtain the same result. In this section, we outline a unifying view for both methods, and state an equivalence theorem.
As both the primal and dual methods are based on a first-order approximation strategy, we will study them not as approximate solutions to the leave--out problem, but will instead show that they are exact solutions to a surrogate leave--out problem. Indeed, recall that the leave--out problem is given by (4), which cannot be solved in closed form. However, we note that the solution does exist in closed form in the case where both and are quadratic functions.
We may thus consider the approximate leave--out problem, where both and have been replaced in the leave--out problem (4) by their quadratic expansion at the full data solution:
When both and are twice differentiable at the full data solution, and can be taken to simply be their respective second order Taylor expansions at . When or is not twice differentiable at the full data solution, we have seen that it is still possible to obtain an ALO estimator through the proximal map (in the case of the dual) or through smoothing arguments (in the case of the primal). The corresponding quadratic surrogates may then be formulated as partial quadratic functions, that is, convex quadratic functions restricted to an affine subspace. However, due to space limitations we only focus on twice differentiable losses and regularizers here.
The way we obtain in (19) indicates that the primal formula in (20) and (21) are the exact leave--out solution of the surrogate primal problem (24). On the other hand, we may also wish to consider the surrogate dual problem, by replacing and by their quadratic expansion at full data dual solution in the dual problem (7). One may possibly worry that the surrogate dual problem is then different from the dual of the surrogate primal problem (24). This does not happen, and we have the following theorem. Let and be twice differentiable convex functions. Let and denote the quadratic surrogates of the loss and regularizer at the full data solution , and let and denote the quadratic surrogates of the conjugate loss and regularizer at the dual full data solution . We have that the following problems are equivalent (have the same minimizer):
Additionally, we note that the dual method described in Section 3 solves the surrogate dual problem (26). Let , be as in (16), and let be the transformed ALO obtained in (17). Let be the same as except . Then satisfies
where and denotes the quadratic surrogate of the regularizer.
In particular, is the exact leave--out predicted value for the surrogate problem described in Theorem 5.
We refer the reader to the Appendix B for the proofs. These two theorems imply that for twice differentiable losses and regularizers, the frameworks we laid out in Sections 3 and 4 lead to exactly the same ALO formulas. This equivalence theorem reflects the deep connections between the primal and dual optimization problem. The central property used by the proof is captured in the following lemma: Let be a proper closed convex function, such that both and are twice differentiable. Then, we have for any in the domain of :
By combining this lemma with the primal dual correspondence (8), we obtain a relation between the curvature of the primal and dual problems at the optimal value, ensuring that the approximation is consistent with the dual structure.
6.1 Generalized LASSO
The generalized LASSO  is a generalization of the LASSO problem which captures many applications such as the fused LASSO , trend filtering  and wavelet smoothing in a unified framework. The generalized LASSO problem corresponds to the following penalized regression problem:
where the regularizer is parameterized by a fixed matrix which captures the desired structure in the data. We note that the regularizer is a semi-norm, and hence we can formulate the dual problem as a projection. In fact, a dual formulation of (28) can be obtained as (see Appendix D):
The dual optimal solution satisfies , where is the polytope given by:
The projection onto the polytope is given in  as locally being the projection onto the affine space orthogonal to the nullspace of , where and . Since is the inverse image of under the linear map given by , the projection onto is given locally by the projection onto the affine space normal to the space spanned by the columns of , provided has full column rank. Here, denotes the Moore-Penrose pseudoinverse of . Finally, to obtain a spanning set of this space, we may consider , where is a set of vectors spanning the nullspace of . This allows us to compute , the projection onto the normal space required to compute the ALO.
6.2 Nuclear Norm
Consider the following matrix sensing problem
with . denotes the inner product. We use
for nuclear norm, which is defined as the sum of the singular values of a matrix. The nuclear norm is a unitarily invariant function of the matrix. Such functions are only indirectly related to the components of the matrix, making their analysis difficult even when they are smooth, and exacerbating the difficulties when they are non-smooth such as in the case of the nuclear norm. In particular, the smoothing framework described in Section 4.3 cannot be applied directly.
We are nonetheless able to leverage the specific structure of such functions to obtain the following theorem. Let be a smooth unitarily invariant matrix function, with:
where denotes the th singular value of . Consider the following matrix penalized regression problem:
Without loss of generality, below we assume . Let
be the singular value decomposition (SVD) of the full data estimator, where , . Let , be the th and th column of and respectively. in this section is a matrix with on the diagonal of its upper square sub-matrix and 0 elsewhere. In addition, we assume all the ’s are nonzero. We then have the following ALO formula:
Here is a matrix and is a symmetric square matrix given by:
Note that the rows of and the index of are vectorized in a consistent way. The proof can be found in Appendix E.2. A nice property of this result is that the effect on singular values decouples from the original matrix, enabling us to apply our smoothing strategy in Section 4.3 to function when it is nonsmooth. This leads to the following theorem for nuclear norm. For more details on the derivation, please refer to Appendix E.3. Consider the nuclear-norm penalized matrix regression problem (29), and let be the SVD of the full data estimator , with , . Let be the number of nonzero ’s for . Let denote the approximate of obtained from the smoothed problem. Then, as
with as defined in (30) and with given by:
where for , and is the corresponding subgradient at this singular value, which can be obtained through the SVD of . The set is then defined as:
Note that the indices of and the index set are consistent.
6.3 Linear SVM
The linear SVM optimization can be written as