M-estimation is arguably the most popular approach to high-dimensional estimation. Given data-points , , we estimate a parameter vector via
is a loss function, andis a constraint set. Prominent examples of this general framework include maximum likelihood (ML) estimation [Fis22] and empirical risk minimization [Vap98].
Once the objective (1
) is formed, it remains to define a computationally efficient scheme to approximate it. Gradient descent is the most frequently applied idea. Assuming –for the moment–, this takes the form
While a large number of variants and refinements have been developed over the years (projected gradient, accelerated gradient [Nes13b], stochastic gradient [RM51], distributed gradient [TBA84], and so on), these share many of the strengths and weaknesses of the elementary iteration (3).
If gradient descent is adopted, the only freedom is in the choice of the loss function . Convexity has been a major guiding principle in this respect. If the function is convex, then the empirical risk is convex as well and hence gradient descent is globally convergent to an M-estimator (the latter is unique under strict convexity). Also, strong convexity of can be used to prove optimal statistical guarantees for the M-estimator . This line of thought can be traced back as far as Fisher’s argument for the asymptotic efficiency of maximum likelihood estimators [Fis22, Fis25], and originated many beautiful contributions. In recent years, a flourishing line of research addresses the very high-dimensional regime , by leveraging on suitable restricted strong convexity assumptions [CT05, CT07, BRT09, NRWY12].
Despite these successes, many problems of practical interest call for non-convex loss functions. Let us briefly mention a few examples of non-convex M-estimators that are often preferred by practitioners to their convex counterparts. We will revisit these examples in Section 4.
In binary linear classification we are given pairs with , , and would like to learn a model of the form with a parameter vector and a threshold function. The non-linear square loss is commonly used in practice
demonstrate superior robustness and classification accuracy of non-convex losses in contrast to convex losses (e.g. hinge or logistic loss). The same loss function is commonly used used in neural-network models[LBH15].
A similar scenario arises in robust regression. In this case, we are given pairs ,…, with , , and we assume the linear model , where the noise terms are i.i.d. with mean zero. Since Huber’s seminal work [Hub73], M-estimators are the method of choice for this problem:
Robustness naturally suggests to investigate the use of a non-convex function , either bounded or increasing slowly at infinity.
Finally, missing data problems famously lead to non-convex optimization formulations. Consider for instance a mixture-of-Gaussians problems in which we are given data points , (for the sake of simplicity we assume identity covariance and known proportions). The maximum-likelihood problem requires to minimize444Here and below denotes the -dimensional standard Gaussian density.
, tensor estimation problems[MR14], and so on.
M-estimation with non-convex loss functions is far less understood than in the convex case. Empirical process theory guarantees uniform convergence of the sample risk to the population risk [BLM13]. However, this does not provide a computationally practical scheme, since gradient descent can get stuck in stationary points that are not global minimizers.
In this paper, we present several general results on non-convex M-estimation and apply them to develop new analysis in each of the three problems mentioned above. We next overview our main results and the paper’s organization, referring to Section 2 for a discussion of related work.
- Uniform convergence of gradient and Hessian.
We prove that, under technical conditions on the loss function , and (we use to hide constant factors). We refer to Section 3.1 for formal statements.
These results complement the classical analysis that implies uniform convergence of the risk itself, but allow us to control the behavior of stationary points. Note that they guarantee uniform convergence of the gradient and Hessian provided with . Apart from logarithmic factors, this is the optimal condition.
(In this paper we will refer to the asymptotics with roughly of the same order as as high-dimensional regime555The specific asymptotics with converging to a constant is also known as ‘Kolmogorov asymptotics’ [Ser13]., to contrast it with the low-dimensional analysis for . We will refer to the asymptotics under sparsity assumptions as very high-dimensional regime.)
- Topology of the empirical risk.
As an immediate consequence of the previous result, the structure of the empirical risk function is –in many cases– surprisingly simple. Recall that a Morse function is a twice differentiable function whose stationary points are non-degenerate (i.e. have an invertible Hessian). In particular, stationary points are isolated, and have a well-defined index. Assume that the population risk is strongly Morse (i.e., at any stationary point
, all the eigenvalues of the Hessian are bounded away from zero). Then, for , the stationary points of the empirical risk are in one-to-one correspondence with those of the population risk and have the same index (minima correspond to minima, saddles to saddles, and so on). Weaker conditions ensure this correspondence for local minima alone.
- Very high-dimensional regime.
We then extend the above picture to the case in which the number of parameters exceeds the number of samples , under the assumption that the true parameter vector is -sparse. This setting is relevant to a large number of applications, ranging from genomics [PZB10] to signal processing [Don06]. In order to promote sparse estimates, we study the following -regularized non-convex problem, cf. Section 3.2:
minimize (7) subject to
We introduce a generalized gradient linearity condition on the loss function and prove that – under this condition– the above problem has a unique local minimum for . Again this is a nearly optimal scaling since no consistent estimation is possible when .
Given a particular M-estimation problem with a suitable statistical model, we combine the above results with an analysis of the population risk to derive precise characterizations of the empirical risk. In Section 4 we demonstrate that this program can be carried out by studying the three problems outlined below:
Binary linear classification. We prove that, for666Recall that, in this case, the number of parameters is equal to the ambient dimension . , the empirical risk has a unique local minimum, that is also the global minimum. Further, gradient descent converges exponentially to this minimizer: , and enjoys nearly optimal estimation error guarantees: . If the true parameter is -sparse, for , the -regularized empirical risk has a unique local minimum, that is also the global minimum. The minimizer enjoys nearly optimal estimation error guarantees: .
Robust regression. We establish similar results for the robust regression model, under technical assumptions on the loss function and on the distribution of the noise . Namely, we prove that the empirical risk has a unique local minimum, that can be found efficiently via gradient descent, provided . If the true parameter is -sparse, for , the -regularized empirical risk has a unique local minimum.
Mixture of Gaussians. We consider the special case of two Gaussians with equal proportions, i.e. with . We prove that, for , the empirical risk has two global minima that are related by exchange of the two Gaussian components and , connected via saddle points. The trust region algorithm converges to one of these two minima when initialized at random. Also the two minima are within nearly optimal statistical errors from the true centers.
We use normal font for scalars (e.g. ) and boldface for vectors (
). We will typically reserve capital letters for random variables (and capital bold for random vectors). Given, their standard scalar product is denoted by . The norm of a vector is –as usual– indicated by . The identity matrix is denoted by .
Given a matrix , we denote by , its eigenvalues in decreasing order, and by its operator norm. Finally, we shall occasionally consider third order tensors . In this case the operator (or injective) norm is defined as , where .
We let be the ball in with center and radius . We will often omit the dimension superscript when clear from the context, the subscript when , and the center when . In particular is the euclidean ball of radius . For any set , we let be the boundary of the set.
We will generally use upper case letters for random variables and lower case for deterministic values (unless the latter are matrices).
2 Related literature
While developing a theory on non-convex M-estimators is an outstanding challenge, several important facts are by now well understood thanks to a stream of beautiful works. We will provide a necessarily incomplete summary in the next paragraphs.
Uniform convergence of the empirical risk. Let denote the population risk. Under mild conditions on the loss function
and on the sample size, it is known that with high probability
for some small [VdG00, BLM13]. This immediately implies guarantees for the M-estimator in -loss (or prediction error). Under additional conditions on the population risk , bounds in estimation error can be derived as well.
For general non-convex losses, uniform convergence results of the form (8) do not preclude the existence of multiple local minima of the sample risk . Hence, this theory does not provide –by itself– computationally practical methods to compute .
Algorithmic convergence to the ‘statistical neighborhood’. In general, gradient descent and other local optimization procedures are expected to converge to local minima of the empirical risk . In several cases, it is proved that every local minimizer is ‘statistically good’. More precisely, the estimation error (e.g. the error ) is within a constant from the minimax rate for the problem at hand. Also, gradient descent converges to such a neighborhood of the true
within a small number of iterations. Results of this type have been proved, among others, for linear regression with noisy covariates[LW12], generalized linear models with non-convex regularizers [LW13], robust regression [LM13], and sparse regression [YWL15].
While these results are very important, they are not completely satisfactory. For instance, one natural question is whether the statistical error might be improved by finding a better local minimum. If, for instance, the estimation error could be improved by a factor by finding a better local minimum, it would be worth in many applications to restart gradient descent at multiple initializations. Also, since convergence to a fixed point is not guaranteed, these approaches come without a clear stopping criterion. Finally these proofs make use of the restricted strong convexity (RSC) assumption introduced [NRWY12, LW13], but do not provide any general tool to establish this condition. In contrast, we prove uniform convergence results that can be used to ensure a condition similar to RSC.
To the best of our knowledge, the only proof of unique local minimum of the regularized empirical risk is obtained in a recent paper by Po-Ling Loh [Loh15]. This works assumes the linear regression model , and establishes uniqueness for penalized regression with a certain class of bounded regularizers. This result is comparable to our Theorem 4.4, see Section 4.4, which uses regularization instead. Note that, in [Loh15], the sample size is required to scale quadratically in the sparsity: . Our proof technique is substantially different from the one of [Loh15], and we only require .
Hybrid optimization methods. It is often difficult to ensure global convergence to a minimizer of the sample risk or even to a statistical neighborhood of the true parameters. Several papers develop two-stage procedures to overcome this problem. The first stage constructs a smart initialization that is within a certain large neighborhood of the true parameters. Spectral methods are often used to implement this step. In the second stage, the estimate is refined by gradient descent (or another local procedure) initialized at . This general approach was studied in a number of problems including matrix completion [KOM09], phase retrieval [CC15], tensor decomposition [AGJ15].
In some cases, the local optimization stage is only proved to converge to a statistical neighborhood of , and hence this style of analysis shares the shortcomings emphasized in the previous paragraph. In others, it is proven to converge to a single point. Further, in practice, the smart initialization is often not needed, and descent algorithms converge from random initialization as well. Finally, as mentioned above, these analyses are typically carried on in a case-by-case manner.
3 Uniform convergence results
In this section we develop our key tools, that are uniform convergence results on the gradient and Hessian of the empirical risk. We also establish some of the direct implications of our results. Throughout, the data consists of the i.i.d. random variables . We will use if we want to refer to the corresponding realization. The empirical risk is defines by Eq. (2) and the corresponding population risk is . The true parameter vector satisfies the condition .
We consider two regimes, a high dimensional regime in which the number of parameters is allowed to diverge roughly in proportion with the number of samples , and a very high-dimensional regime in which the true parameters’ vector is sparse and the number of parameters can be much larger than . We treat these two cases separately because the theory is simpler and more general in the first regime.
3.1 High-dimensional regime
In order to avoid technical complications, we will limit optimization to a bounded set, i.e. we will let to be the Euclidean ball in dimensions.
We begin by stating our assumptions. Assumptions 3.1 and 3.1 below quantify the amount of statistical noise in the gradient and Hessian of the loss function. [Gradient statistical noise] The gradient of the loss is -sub-Gaussian. Namely, for any , and
[Hessian statistical noise] The Hessian of the loss, evaluated on a unit vector, is -sub-exponential. Namely, for any , and
Our third assumption requires the Hessian of the loss to be a Lipschitz function of the vector of parameters . [Hessian regularity] The Hessian of the population risk is bounded at one point. Namely, there exists and such that .
Further, the Hessian of the loss function is Lipschitz continuous with integrable Lipschitz constant. Namely, there exists (potentially diverging polynomially in ) such that
Further, there exists a constant such that , . Note that has the same units777By this we mean that the two quantities behave in the same way under a rescaling of the parameters . as , and has the same units as . Thus, has the same units as , has the same units as , and has the same units as . This is the reason why we bound and in the form as in Assumption 3.1. In this way, and are dimensionless.
Discrete loss functions (e.g. the loss) are common within the statistical learning literature, but do not satisfy the above assumption because the gradient and Hessian are not defined everywhere. Note however that these can be well approximated by differentiable losses, with little –if any– practical difference.
The sample gradient converges uniformly to the population gradient in Euclidean norm. Namely, if , we have
The sample Hessian converges uniformly to the population Hessian in operator norm. Namely, if , we have
The above theorem immediately implies that the structure of stationary points of the sample risk must reflect that of the population risk. In order to formalize this intuition, we introduce the notion of strongly Morse function. Given a differentiable function , we say that in the interior of the ball is critical (or stationary) if .
Recall that a twice differentiable function is Morse if all its critical points are non-degenerate, i.e. have an invertible Hessian. In other words implies for all . Our next definition provides a quantitative version of this notion. We say that a twice differentiable function is -strongly Morse if for and, for any , , the following holds
Note that, analogously to a Morse function on a compact domain, a strongly Morse function can have only a finite number of critical points which are in the interior of . Also recall that the index of a non-degenerate critical point is the number of negative eigenvalues of the Hessian at (assuming to be twice differentiable).
If the population risk is -strongly Morse in , then the sample risk is -strongly Morse in . Further there is a one-to-one correspondence between the set of critical points of , and the set of critical points of , such that (letting be the point in correspondence with , for any )
The index of coincides with the index of . (In particular, local minima correspond to local minima, and saddles to saddles.)
If we further let , and assume where , we have, for each ,
The strong Morse assumption imposes conditions on all the eigenvalues of the Hessian at near-critical points, and implies a detailed characterization of the empirical risk. If only weaker properties can be established for the population risk, Theorem 3.1 can nevertheless be very useful. For instance, in Section 4.5 we consider an example in which near critical points have a Hessian whose smallest eigenvalue is either positive or negative, but in both cases bounded away from . This weaker condition is sufficient to obtain a characterization of the local minima of the empirical risk.
3.2 Very high-dimensional regime
In the very-high dimensional regime , we will solve the -penalized risk minimization problem
We need some additional assumptions. It is fairly straightforward to check them in specific cases, see e.g. Section 4.1. The first assumption is mainly technical, and not overly restrictive: it requires the loss function to have almost surely bounded gradient, in a suitable sense. [Gradient bounds] There exists a constant such that -almost surely, for all ,
Our key structural assumption is stated next. It requires the gradient of the loss function to depend on the parameters only through a linear function of , possibly dependent on the feature vector . Note that is regarded here as fixed, and hence omitted from the arguments. [Generalized gradient linearity] There exist functions , and , , such that
In addition, is -Lipschitz to its first argument, , and is mean-zero and -sub-Gaussian. As an example, in the case of binary linear classification and robust regression, the data is given as a pair , and there exists a function such that . Assumption 3.2 is satisfied with provided the latter is Lipschitz as a function of .
The sample directional gradient converges uniformly to the population directional gradient, along the direction . Namely, we have
The sample restricted Hessian converges uniformly to the population restricted Hessian in the set for any . Namely, as we have
4.1 Binary linear classification: High dimensional regime
As mentioned in the introduction, in this case we are given pairs with , , whereby (hence in this case). We estimate by minimizing the non-linear square loss (4), which we copy here for the reader’s convenience:
This can be regarded as a smooth version of the loss.
We collect below the technical assumptions on this model. [Binary linear classification]
The activation is three times differentiable with for all , and has bounded first, second and third derivatives. Namely, for some constant :
The feature vector has zero mean and is -sub-Gaussian, that is for all .
The feature vector spans all directions in , that is, for some .
Our main results on binary linear classification are summarized in the theorem below. Under Assumption 1, further assume . There exist positive constants , and depending on parameters and the activation function , but independent of and , such that, if , the following hold with probability at least :
The empirical risk function has a unique local minimizer in , that is the global minimizer .
Gradient descent with fixed step size converges exponentially fast to the global minimizer, for any initialization : .
We have .
The proof of this theorem can be found in Appendix E.1, and is based on the following two-step strategy. First, we study the population risk , and establish its qualitative properties using analysis. Second, we use our uniform convergence result (Theorem 3.1) to prove that the same properties carry over to the sample risk . Figure 1 presents a small numerical example that illustrates how the qualitative features of the population risk apply to the empirical risk as well.
A few remarks are in order. First of all, the convergence rate of gradient descent (at point ) is independent of the dimension and number of samples . In other words, iterations are sufficient to converge within distance from the global minimizer. Classical theory of empirical risk minimization only concerns the statistical properties of the optimum, but does not provide efficient algorithms.
Next, note that our condition on the sample size is nearly optimal. Indeed, it is information-theoretically impossible to estimate from less than binary samples. Finally, the convergence rate at point also nearly matches the optimal (parametric) rate .
4.2 Binary linear classification: Very high-dimensional regime
As in the previous section, we are given pairs with , , and . However is assumed to be sparse, and the number of samples is allowed to be much smaller than the ambient dimension . We adopt again the non-linear square loss (4), but now use a -constrained -regularized risk minimization, as per Eq. (18), which we rewrite here explicitly for the reader’s ease
The very high-dimensional regime
is of interest in many contexts. In machine learning, the number of parameterscan increase when a large number of additional features are added to the model (for instance, nonlinear functions of an original set of features). In signal processing, represents an unknown signal, of which we measure noisy random linear projections , , quantized to one single bit. This scenario is relevant to group testing [AS12] and analog-to-digital conversion [LWYB11, LB12], and has been studied under the name of ‘one-bit compressed sensing’; see [PV13a] and references therein.
In the very high-dimensional regime we need additional assumptions on the distribution of as well as the activation function . [Fast-decaying activation] The activation function satisfy for some absolute constant . [Continuous and bounded features] The feature vector has a density in , that is, for all Borel sets . In addition, the feature vector is bounded: , and almost surely. Here is a dimensionless constant greater than .
Assumption 4.2 holds popular examples of activation functions, such as the logistic or probit . For unbounded sub-Gaussian feature vectors, the next theorem can be supplemented by a truncation argument at level . Hence, the conclusions of this theorem hold, with an additional factor.
In the statement of the following theorem, for convenience, we will also assume . This is a technical assumption so that we can bound . And since we are considering the very high dimensional regime, it is not meaningful to discuss .
Under Assumptions 1, 4.2 and 4.2, further assume , , and . Then there exist constants , , , and depending on and the activation function , but independent of , , , and , such that as and , the following hold with probability at least :
Any stationary point of problem (25) is in .
As long as is large enough such that and , the problem has a unique local minimizer which is also the global minimizer.
As in the previous section, our proof makes a crucial use of the sparse uniform convergence result, Theorem 3.2, together with an analysis of the population risk.
Let us emphasize that Theorem 4.2 leaves open the existence of a fast algorithm to find the global optimizer . However [Nes13a, Theorem 3] implies that, by running steps of projected gradient descent, we can find an estimate which has a subgradient of order . While we expect this sequence to converge to , we defer this question to future work.
Theorem 4.2 establishes a nearly optimal upper bound on the estimation error . Indeed this error is within a logarithmic factor from the error achieved by an oracle estimator that is given the exact support of . For comparison, [PV13a, PV13b] proves
for a linear programming formulation, under the more restrictive assumption of Gaussian feature vectors. This analysis was generalized in [ALPV14] to feature vectors with i.i.d. entries, although with the same estimation error bound. The optimal rate was obtained only recently in [PVY14], again for standard Gaussian feature vectors.
Let us finally emphasize that the estimator defined here uses a bounded loss function and is potentially more robust to outliers than other approaches that use a convex loss (e.g. logistic loss).
4.3 Robust regression: High-dimensional regime
In robust regression we are given data with , , and we assume the linear model , where the noise terms are i.i.d. with mean zero. Also in this case we have . We use the loss (5), which we copy here for the reader’s convenience:
Classical choices for loss function are the Huber loss [Hub73] which is convex with for large enough, and Tukey’s bisquare loss, which is bounded and defined as
It is common to define the associated score function as .
We next formulate our assumptions. [Robust regression]
The score function
is twice differentiable and odd inwith for all , and has bounded zero, first, and second derivatives. Namely, for some constant :
The noise has a symmetric distribution, i.e. is such that is distributed as . Further, defining we have for all , as well as .
The feature vector has zero mean and is -sub-Gaussian, that is for all .
The feature vector spans all directions in , that is, for some .
Note that the condition for all and are quite mild, and holds –for instance– if the noise has a density that is strictly positive for all , and decreasing for . Under Assumption 4.3, further assume . Then there exist positive constants , and depending on parameters , the loss function , and the law of noise but independent of and , such that as , the robust regression estimator satisfies the following with probability at least :
The empirical risk function has a unique local minimizer in , that is the global minimizer .
Gradient descent with fixed step size converges exponentially fast to the global minimizer, for any initialization :