In many applications, a dataset $\mathcal{D} = \{(y_1, x_1), \ldots, (y_n, x_n)\}$, with $y_i \in \mathbb{R}$ and $x_i \in \mathbb{R}^p$, is modeled as $y_i = x_i^\top \beta + \epsilon_i$, where $\beta \in \mathbb{R}^p$ denotes the vector of unknown parameters, and $\epsilon_i$ denotes the error or noise. $\beta$ is typically estimated by the solution $\hat\beta$ of the following optimization problem:
$\hat\beta := \arg\min_{\beta} \sum_{i=1}^n \ell(y_i - x_i^\top \beta) + \lambda \sum_{j=1}^p r(\beta_j),$
where $\ell$ is the loss function, $r$ is the regularizer, and $\lambda$ is a tuning parameter. The performance of $\hat\beta$ depends heavily on $\lambda$. Hence, finding the 'optimal' $\lambda$
is of major interest in machine learning and statistics. In most applications, one would ideally like to find the $\lambda$ that minimizes the out-of-sample prediction error:
$\text{Err}_{\text{out}} := \mathbb{E}\big[\ell(y_{\text{new}} - x_{\text{new}}^\top \hat\beta) \,\big|\, \mathcal{D}\big],$
where $(y_{\text{new}}, x_{\text{new}})$ is a new data point generated (independently of $\mathcal{D}$) from the same distribution as $(y_i, x_i)$.
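As a concrete illustration of this target quantity, the sketch below approximates the out-of-sample error by Monte Carlo for a ridge-regularized least-squares fit. The squared loss, the $\ell_2$ regularizer, and the data-generating process here are illustrative assumptions for the example, not choices made by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 50, 1.0

# Illustrative data-generating process (an assumption of this sketch):
# linear model with i.i.d. Gaussian features and noise.
beta = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)

# Ridge estimate: argmin 0.5*||y - X b||^2 + 0.5*lam*||b||^2.
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Monte Carlo approximation of the out-of-sample error: average loss on
# fresh draws (y_new, x_new) from the same distribution, independent of D.
m = 100_000
X_new = rng.normal(size=(m, p))
y_new = X_new @ beta + rng.normal(size=m)
err_out = np.mean((y_new - X_new @ beta_hat) ** 2)
```

In practice $\text{Err}_{\text{out}}$ cannot be computed this way because fresh samples are unavailable; this is exactly why the estimates discussed next are needed.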
The problem of estimating $\text{Err}_{\text{out}}$ from $\mathcal{D}$ has been studied for (at least) the past 50 years, and the corresponding literature is too vast to be covered here. Methods such as cross validation (CV) Stone (1974); Geisser (1975), Allen's PRESS statistic Allen (1974), generalized cross validation (GCV) Craven and Wahba (1979); Golub et al. (1979), and the bootstrap Efron (1983) are seminal approaches to estimating $\text{Err}_{\text{out}}$.
Since past studies have focused on the data regime $n \gg p$, reliable risk estimates supported by rigorous theory are lacking in high-dimensional settings. In this paper, we study the problem of risk estimation under the high-dimensional asymptotic setting in which both the number of features $p$ and the number of observations $n$ go to infinity, while their ratio remains constant, i.e., $p/n \to \delta \in (0, \infty)$ as $n, p \to \infty$. Suppose that $\widehat{\text{Err}}_{\text{out}}$ is an estimate of $\text{Err}_{\text{out}}$ obtained from the dataset $\mathcal{D}$. The fundamental consistency property we want for such an estimate is:
$\big|\widehat{\text{Err}}_{\text{out}} - \text{Err}_{\text{out}}\big| \to 0$ in probability, as $n, p \to \infty$ and $p/n \to \delta$.
As is clear from Figure 1, standard techniques such as $k$-fold cross validation (with a small number of folds) exhibit large biases and do not satisfy this consistency property.
The first contribution of this paper is to prove that the following three risk estimation techniques, which have been successful in numerical studies, satisfy this consistency property:
Leave-one-out cross validation (LOOCV): Given its negligible bias shown in Figure 1, it is expected that $\widehat{\text{Err}}_{\text{loo}}$, the estimate given by LOOCV, satisfies this property. Note, however, that LOOCV is computationally demanding and hence impractical in many applications.
Approximate leave-one-out (ALO): The high computational complexity of LOOCV prompted several authors to adapt existing heuristic arguments (Stone, 1977; Allen, 1974) to approximate $\widehat{\text{Err}}_{\text{loo}}$ and obtain another risk estimate, $\widehat{\text{Err}}_{\text{alo}}$, called ALO Beirami et al. (2017); Rad and Maleki (2018); Giordano et al. (2018); Wang et al. (2018). We formally present ALO in Section 1.4.
Approximate message passing (AMP): Assuming squared loss, estimators of the extra-sample prediction error have been presented using the approximate message passing (AMP) framework (Weng et al., 2016; Mousavi et al., 2013; Obuchi and Kabashima, 2016; Donoho and Montanari, 2016; Bayati et al., 2013). In particular, Mousavi et al. (2013); Bayati et al. (2013) showed that the AMP-based estimate satisfies the consistency property for squared loss and bridge regularizers. In this paper, we first generalize the AMP-based method to other loss functions and regularizers in Section 1.5. Then, we prove that this estimate satisfies the consistency property.
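A minimal sketch of the exact LOOCV estimate from the first item makes its computational burden explicit. As an illustrative assumption, the sketch specializes to squared loss with a ridge penalty; the general case replaces the closed-form solve with $n$ full optimizations.

```python
import numpy as np

def loocv_risk(X, y, lam):
    """Exact LOOCV risk estimate for ridge regression (squared loss, l2
    penalty): refit the model n times, once per held-out observation."""
    n, p = X.shape
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        # Leave-observation-i-out fit, using the same tuning parameter lam.
        b_i = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
        errs[i] = (y[i] - X[i] @ b_i) ** 2
    return errs.mean()
```

Each iteration requires a fresh $O(p^3)$ solve, so the total cost grows quickly with $n$ and $p$; this is precisely the cost that ALO and AMP aim to avoid.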
Consistency is a minimal requirement that a risk estimate should satisfy in high-dimensional settings; if the convergence is slow, then the risk estimate will not be useful in practice. This leads us to the next question we would like to address in this paper:
Is the convergence fast?
The second contribution of this paper is to answer this question for the three estimates mentioned above. To answer it, we develop tools that we expect to be useful in the study of other risk estimates and in other applications. For instance, the connection we derive between the residuals of leave-one-out and AMP provides more refined information on the estimates obtained from AMP. Such connections can be useful for the analysis of estimates obtained from empirical risk minimization Donoho and Montanari (2015).
1.2 Related work
The asymptotic regime of this paper was first considered in Huber (1973), but only received considerable attention in the past fifteen years Donoho and Montanari (2015); El Karoui et al. (2013); Bean et al. (2013); El Karoui (2018); Sur et al. (2017); Weng et al. (2018); Johnstone (2001); Bayati and Montanari (2012); Thrampoulidis et al. (2015); Amelunxen et al. (2014); Chandrasekaran et al. (2012); Cai et al. (2016). The inaccuracy of standard estimates of $\text{Err}_{\text{out}}$ in high-dimensional settings has recently been noticed by many researchers; see, e.g., Rad and Maleki (2018) and the references therein. Hence, several new estimates have been proposed from different perspectives. For instance, Beirami et al. (2017); Rad and Maleki (2018); Giordano et al. (2018); Wang et al. (2018) used different approximations of leave-one-out cross validation to obtain computationally efficient risk estimation techniques. In another line of work, Obuchi and Kabashima (2018); Bayati et al. (2013); Mousavi et al. (2018) used either statistical-physics heuristic arguments or the message passing framework to obtain more accurate risk estimates.
While most of these proposals have been successfully used in empirical studies, their theoretical properties have not been studied. The only exceptions are Bayati et al. (2013); Mousavi et al. (2018), in which the authors showed that the AMP-based estimate satisfies the consistency property when the loss is squared and the regularizer is bridge, i.e., $r(\beta) = |\beta|^q$ for $q \geq 1$. Furthermore, the convergence rate is not known for any risk estimate. In this paper, we study the three most promising proposals and present a detailed theoretical analysis under the asymptotic setting $n, p \to \infty$ with $p/n \to \delta$. The tools we develop here are expected to be useful in the study of other risk estimates and in other applications. For instance, the connection we derive between the leave-one-out estimate and that of approximate message passing has enabled the message passing framework to provide more refined information (such as convergence rates) for different estimates.
1.3 Notation
Let $e_i$ stand for the vector filled with zeros except for the $i$th element, which is one. Let $x_i^\top$ stand for the $i$th row of $X$. Let $y_{/i}$ and $X_{/i}$ stand for $y$ and $X$ excluding the $i$th entry $y_i$ and the $i$th row $x_i^\top$, respectively. Moreover, let $X_{[j]}$ stand for the $j$th column of $X$ and $X_{/[j]}$ stand for $X$ excluding the $j$th column. Further, let $\beta_{/j}$ denote the vector $\beta$ without its $j$th component. For a vector $v$, we use $v_i$ to denote its $i$th entry. For any scalar function $g$, we use $g(v)$ to indicate the vector obtained by applying $g$ to each component of $v$. The diagonal matrix whose diagonal elements are the entries of $v$ is referred to as $\mathrm{diag}(v)$. The component-wise ratio of two vectors $u$ and $v$ is denoted by $u/v$. Moreover, $\bar{v}$ stands for the mean of the components of $v$. We define
$\hat\beta := \arg\min_{\beta} \sum_{i=1}^n \ell(y_i - x_i^\top \beta) + \lambda \sum_{j=1}^p r(\beta_j), \qquad \hat\beta_{/i} := \arg\min_{\beta} \sum_{k \neq i} \ell(y_k - x_k^\top \beta) + \lambda \sum_{j=1}^p r(\beta_j),$
where $\hat\beta$ is the full-model, full-data estimate, and $\hat\beta_{/i}$ is the leave-observation-$i$-out estimate. We refer to $\hat\beta_{/[j]}$ as the leave-predictor-$j$-out estimate. Further, $\dot\ell$, $\ddot\ell$, $\dot r$, $\ddot r$ stand for the first and second derivatives of $\ell$ and $r$, respectively. Finally, $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ denote the maximum and minimum eigenvalues of a matrix $A$, respectively.
1.4 The estimates $\widehat{\text{Err}}_{\text{loo}}$ and $\widehat{\text{Err}}_{\text{alo}}$
Leave-one-out cross validation (LOOCV) offers the following estimate of $\text{Err}_{\text{out}}$:
$\widehat{\text{Err}}_{\text{loo}} := \frac{1}{n} \sum_{i=1}^n \ell\big(y_i - x_i^\top \hat\beta_{/i}\big),$
where $\hat\beta_{/i}$ is the solution of the leave-$i$-out problem (3), i.e., problem (1) with the $i$th observation removed.
$\widehat{\text{Err}}_{\text{loo}}$ is computationally infeasible when both $n$ and $p$ are large. To alleviate this problem, Rahnama Rad and Maleki (2018) used a single step of the Newton algorithm (with initialization $\hat\beta$) to approximate the solution of (3).
Then, using the Woodbury matrix inversion lemma, Rahnama Rad and Maleki (2018) derived the following approximate leave-one-out formula:
$\widehat{\text{Err}}_{\text{alo}} := \frac{1}{n} \sum_{i=1}^n \ell\!\left( (y_i - x_i^\top \hat\beta) + \frac{H_{ii}\, \dot\ell(y_i - x_i^\top \hat\beta)}{1 - H_{ii}\, \ddot\ell(y_i - x_i^\top \hat\beta)} \right), \quad (7)$
where $H_{ii}$ is the $i$th diagonal element of the matrix $H$ defined by:
$H := X \big( X^\top \mathrm{diag}[\ddot\ell(y - X\hat\beta)]\, X + \lambda\, \mathrm{diag}[\ddot r(\hat\beta)] \big)^{-1} X^\top.$
Despite the existing numerical results that exhibit the accuracy of $\widehat{\text{Err}}_{\text{alo}}$, its theoretical analysis is lacking. The accuracy of this approximation was confirmed in Rahnama Rad and Maleki (2018) under certain 'stability' assumptions. However, they could not show whether such 'stability' assumptions hold in high-dimensional settings. Furthermore, their results only concern the discrepancy $|\widehat{\text{Err}}_{\text{alo}} - \widehat{\text{Err}}_{\text{loo}}|$, and do not provide any information on the consistency of either estimate.
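For squared loss with a ridge penalty, the single Newton step behind ALO is exact and the formula reduces to the classical leverage shortcut. The sketch below is our own specialization to that case, not the general formula; it recovers exact LOOCV from a single fit of the full model.

```python
import numpy as np

def alo_risk_ridge(X, y, lam):
    """Leverage-based approximate leave-one-out for ridge regression
    (squared loss, l2 penalty). In this special case the single Newton
    step is exact, so the result equals exact LOOCV at the cost of one fit."""
    n, p = X.shape
    A = X.T @ X + lam * np.eye(p)
    beta_hat = np.linalg.solve(A, X.T @ y)   # full-data estimate
    H = X @ np.linalg.solve(A, X.T)          # hat matrix; H_ii are leverages
    resid = y - X @ beta_hat
    # Leave-one-out residual via the Woodbury/leverage identity.
    return np.mean((resid / (1 - np.diag(H))) ** 2)
```

The key saving is that all $n$ leave-one-out residuals come from one $p \times p$ solve, instead of $n$ separate refits.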
1.5 Risk estimation with AMP
Besides $\widehat{\text{Err}}_{\text{loo}}$ and $\widehat{\text{Err}}_{\text{alo}}$, another risk estimation technique we study in this paper is based on the approximate message passing (AMP) framework Maleki (2011). For squared loss and bridge regularizers, Mousavi et al. (2017) used the AMP framework to obtain consistent estimates of $\text{Err}_{\text{out}}$. In this section, we explain how an estimate of $\text{Err}_{\text{out}}$ can be obtained for the more general class of estimators we consider in (1). The heuristic approach that leads to the following construction of $\widehat{\text{Err}}_{\text{amp}}$ is explained in Appendix D.
1. Compute $\hat\beta$ from (1).
2. Find the $\hat\tau$ that satisfies equation (8).
3. Using $\hat\beta$ from step 1 and $\hat\tau$ from step 2, define the correction term in (9). The AMP-based risk estimator $\widehat{\text{Err}}_{\text{amp}}$ is then given by the training error plus this correction.
To ensure the existence of $\hat\tau$ in step 2, we will show in Lemma 2 that (8) defines a one-to-one relationship between $\lambda$ and $\hat\tau$. Further, we will show in Section 2.2.1 that $\widehat{\text{Err}}_{\text{amp}}$ is close to $\widehat{\text{Err}}_{\text{loo}}$. In other words, the correction term in step 3 corrects the optimistic training error $\frac{1}{n}\sum_{i=1}^n \ell(y_i - x_i^\top \hat\beta)$ and pushes it closer to the out-of-sample error.
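The corrective role of step 3 can be illustrated for squared loss with a ridge penalty, where a correction admits a closed form via the effective degrees of freedom. This GCV-style sketch is a stand-in chosen for its simplicity, not a transcription of the fixed-point construction in (8)-(9).

```python
import numpy as np

def corrected_training_error(X, y, lam):
    """Degrees-of-freedom corrected training error for ridge regression.
    The raw training error is optimistically biased; dividing by
    (1 - df/n)^2 (a GCV-style correction) pushes it toward the
    out-of-sample error, mirroring the role of the AMP correction term."""
    n, p = X.shape
    A = X.T @ X + lam * np.eye(p)
    beta_hat = np.linalg.solve(A, X.T @ y)
    df = np.trace(X @ np.linalg.solve(A, X.T))    # effective degrees of freedom
    train_err = np.mean((y - X @ beta_hat) ** 2)  # optimistic training error
    return train_err / (1.0 - df / n) ** 2
```

For general losses and regularizers, the inflation factor is not available in closed form, which is why step 2 solves a fixed-point equation for $\hat\tau$ instead.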
2 Main results
In this section, we present and discuss the assumptions used in this paper. Note that we do not require all of the assumptions for any individual result, and some of the assumptions can be weakened or replaced by other assumptions, as we also discuss below. The first few assumptions are about the structural properties of the loss function and regularizer.
Assumption O.1.
The loss function $\ell$ and the regularizer $r$ are convex and have continuous second-order derivatives. Moreover, the minimizer of (1) is finite.
Assumption O.2.
(Hölder Assumption) The second derivatives of the loss function and the regularizer are Hölder continuous: there exist constants $C > 0$ and $0 < \alpha \leq 1$ such that for all $u, v \in \mathbb{R}$, we have
$|\ddot\ell(u) - \ddot\ell(v)| \leq C|u - v|^{\alpha} \quad \text{and} \quad |\ddot r(u) - \ddot r(v)| \leq C|u - v|^{\alpha}.$
This implies that there exist constants $C_1$ and $C_2$ such that for all $u \in \mathbb{R}$, we have
$\ddot\ell(u) \leq C_1(1 + |u|^{\alpha}) \quad \text{and} \quad \ddot r(u) \leq C_2(1 + |u|^{\alpha}).$
Assumption O.2 ensures that the second derivatives are locally smooth. Given that the original assumption in the derivations of $\widehat{\text{Err}}_{\text{alo}}$ and the AMP-based estimate is twice differentiability of the loss function and regularizer, Assumption O.2 is only slightly stronger than the twice-differentiability assumptions that were used in deriving formula (7). Note that for non-differentiable cases, one can easily apply a smoothing scheme similar to the ones proposed in Koh and Liang (2017); Mousavi et al. (2018), and still use these risk estimates.
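As an example of such a smoothing scheme, one can replace the absolute value (whose second derivative does not exist at zero) with a twice-differentiable surrogate. The particular parametrization below is an illustrative choice, not one prescribed by the cited works.

```python
import numpy as np

EPS = 1e-2  # smoothing level; an illustrative choice

def smooth_abs(b, eps=EPS):
    """Twice-differentiable surrogate for |b|: sqrt(b^2 + eps^2).
    It converges to |b| as eps -> 0 and has a continuous second derivative."""
    return np.sqrt(b ** 2 + eps ** 2)

def smooth_abs_dd(b, eps=EPS):
    """Second derivative of the surrogate: eps^2 / (b^2 + eps^2)^(3/2).
    It is strictly positive everywhere, unlike the curvature of |b|."""
    return eps ** 2 / (b ** 2 + eps ** 2) ** 1.5
```

With such a surrogate in place of $|b|$, the derivatives required by formula (7) are well defined everywhere.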
Assumption O.3.
There exists a constant $c > 0$ such that the curvatures of the loss function and the regularizer are bounded below: $\ddot\ell(u) \geq c$ and $\ddot r(u) \geq c$ for all $u \in \mathbb{R}$.
Assumption O.3 ensures the uniqueness of the solutions of our optimization problems. In that vein, even if we replace Assumption O.3 with a lower bound on the curvature of the regularizer alone, most of our results will still hold. The only exception is Lemma 7. As will become clear, the proof of Lemma 7 requires a lower bound on the curvature of the loss function. This, in turn, only requires that a constant fraction of the residuals fall in regions at which the curvature of $\ell$ is positive.
Finally, we should emphasize that since in our proofs we calculate the curvature at and around the estimates $\hat\beta$ and $\hat\beta_{/i}$, we only require a lower bound in a neighborhood of these estimates. Furthermore, if the curvature in such neighborhoods goes to zero 'slowly', our risk estimates will still be consistent. We will keep the dependency of our bounds on $c$ explicit for readers interested in cases where $c$ is not constant and goes to zero. However, for notational simplicity we have assumed a global lower bound for the curvature in Assumption O.3, and in almost all of the results we will treat $c$ as a constant.
So far, the assumptions have been concerned with the geometric properties of the loss function and the regularizer. The rest of our assumptions are about the statistical properties of the problem.
Assumption O.4.
Let $X$, $\beta$, and $\epsilon$ be mutually independent. Furthermore, we assume that each data point $(y_i, x_i)$ is i.i.d. generated, and that the elements of $X$ are independent zero-mean Gaussian random variables with variance $1/n$. We assume the entries of $\beta$ are i.i.d. random variables. Finally, we assume that the entries of $\beta$ and $\epsilon$ are subGaussian random variables, i.e., there exists a constant $C > 0$ such that for any fixed entry $z$ of $\beta$ or $\epsilon$ and for all $t > 0$, we have $\mathbb{P}(|z| > t) \leq 2e^{-t^2/C}$.
Assumption O.4 is a standard assumption in the high-dimensional asymptotic analysis of regularized estimators Donoho and Montanari (2015); Bradic and Chen (2015); El Karoui et al. (2013); Bean et al. (2013); El Karoui (2018); Sur et al. (2017); Weng et al. (2018); Johnstone (2001); Bayati and Montanari (2012); Thrampoulidis et al. (2015); Amelunxen et al. (2014); Chandrasekaran et al. (2012); Cai et al. (2016).
Suppose that we have a sequence of problem instances indexed by $n$ (with $p/n = \delta$ fixed), each satisfying Assumption O.4. Then, solving (1) for this sequence of problem instances leads to a sequence of estimates $\{\hat\beta(n)\}$. Our last assumption is about this sequence.
Assumption O.5.
Every component of $\hat\beta$ remains bounded by a sufficiently small power of $n$. More specifically, $\max_j |\hat\beta_j| \leq n^{c_0}$ with high probability, where $c_0$ is a sufficiently small constant whose admissible range depends on the constants $C$ and $\alpha$ stated in O.2.
Note that in this paper, we use $\hat\beta$ and $\hat\beta(n)$ interchangeably. Assumption O.5 requires every component of the original estimate to be bounded. One can heuristically argue that this assumption holds given Assumptions O.1-O.4. Let us mention a heuristic argument here. Suppose that Assumptions O.1-O.4 hold. We can then show that the distance between $\hat\beta$ and the true signal $\beta$ remains controlled (see (36) and (40) in the proof of Lemma 3 in Section A.3). Hence, on average, the component-wise distance between $\hat\beta$ and $\beta$ should be small. Note that according to Assumption O.4, every component of the true signal can be bounded by a small power of $n$ (see Lemma 11). Therefore, intuitively, every component of $\hat\beta$ should be bounded by a small power of $n$ as well. In fact, we can show that O.5 holds if we assume O.1-O.4 hold and the regularizer satisfies an extra condition. This is described in the following lemma:
Since the proof of this lemma uses some of the results we will prove in later sections, we postpone it to Appendix C.
2.2 Main results
In this section, we address the question about the convergence rate raised in the introduction for $\widehat{\text{Err}}_{\text{loo}}$, $\widehat{\text{Err}}_{\text{alo}}$, and $\widehat{\text{Err}}_{\text{amp}}$. Toward this goal, we first bound the discrepancy between $\widehat{\text{Err}}_{\text{amp}}$ and $\widehat{\text{Err}}_{\text{loo}}$ in Section 2.2.1. The connections we derive between AMP estimates and leave-one-out estimates are instrumental in the rest of the proofs. Then, we find an upper bound for $|\widehat{\text{Err}}_{\text{loo}} - \text{Err}_{\text{out}}|$ in Section 2.2.2. Finally, we obtain an upper bound for the discrepancy between $\widehat{\text{Err}}_{\text{alo}}$ and $\widehat{\text{Err}}_{\text{loo}}$ in Section 2.2.3.
2.2.1 The discrepancy of $\widehat{\text{Err}}_{\text{amp}}$ and $\widehat{\text{Err}}_{\text{loo}}$
Our first result bounds $|\widehat{\text{Err}}_{\text{amp}} - \widehat{\text{Err}}_{\text{loo}}|$, as stated in the next theorem.
Proof sketch. Below, we sketch the proof; the details are in Section A. As the first step, in Lemma 2, we show that $\hat\tau$, introduced in (8), is uniquely defined. Hence, the heuristic recipe we mentioned in Section 1.5 leads to a well-defined estimate of $\text{Err}_{\text{out}}$. Note that Lemma 2 does not provide any information on the quality of this estimate.
Under Assumption O.1, for any $\lambda > 0$, equation (8) defines a one-to-one mapping between $\lambda$ and $\hat\tau$.
The proof is given in Section A.2. According to this lemma, a unique value of $\hat\tau$ satisfies (8). Using this unique value, we can calculate the unique quantity that satisfies (9), and obtain the corresponding estimate $\widehat{\text{Err}}_{\text{amp}}$ of $\text{Err}_{\text{out}}$.
To compare this risk estimate with $\widehat{\text{Err}}_{\text{loo}}$, we first simplify $\widehat{\text{Err}}_{\text{amp}}$ in the following proposition.
The proof of Proposition 1 is given in Section A.6. For the special case of the ridge regularizer, a similar upper bound is obtained in Theorem 2.2 of El Karoui (2018). We employ a similar proof strategy; however, due to the lack of a lower bound for the curvature of the regularizer, our argument is more involved. Given the definitions of $\widehat{\text{Err}}_{\text{amp}}$ and $\widehat{\text{Err}}_{\text{loo}}$, we have
where the last inequality is due to Proposition 1. Recall that, based on Assumption O.4, the entries of $X$ are i.i.d. zero-mean Gaussian random variables with variance $1/n$, resulting in the concentration property proved in Lemma 11. Hence, our next main objective is to bound the remaining discrepancy term.
Toward this goal, we prove (14) and (15). Our first lemma bounds the quantity in (15).
Hence, (15) holds. The final step of the proof is to bound the remaining term in (14). Note that $\hat\beta_{/i}$ and $x_i$ are independent, and the entries of $x_i$ are i.i.d. Gaussian. Hence, the following lemma, which is a standard concentration result, can address this issue:
Let $x_1, \ldots, x_n$ be mean-zero Gaussian random vectors with covariance matrix $\Sigma$. Let $A_1, \ldots, A_n$ be random matrices, where each $A_i$ is independent of $x_i$. Further, let $\rho_{\max}$ be an upper bound, holding with high probability, for the maximum eigenvalues of all the $A_i$. Then, with high probability, we have
where inequality (i) is due to Assumptions O.1 and O.3, and inequality (ii) is due to Lemma 10 (stated in Section A.1). Hence, the maximum eigenvalue of the relevant matrix is upper bounded by a constant. Therefore, Lemma 4 yields the required bound.
Hence, in order to prove (14), it remains to establish one more bound.
To achieve this goal, we first introduce an extra quadratic term into the regularizer:
$r_{\gamma}(\beta) := r(\beta) + \frac{\gamma}{2}\beta^2,$
where $\gamma > 0$. This quadratic term provides a lower bound for the curvature of the regularizer, and the existence of such a lower bound simplifies our analysis. To see the exact place where this lower bound simplifies the analysis, we refer to Lemma 16 in Appendix A.9.4. Based on the regularizer $r_{\gamma}$, define $\hat\beta^{\gamma}$ as a counterpart of $\hat\beta$ by replacing $r$ with $r_{\gamma}$, i.e.,