1 Introduction
M-estimation is arguably the most popular approach to high-dimensional estimation. Given data points $Z_1, \dots, Z_n$, we estimate a parameter vector $\hat{\theta}$ via
$$\hat{\theta} \in \arg\min_{\theta \in \Theta} \hat{R}_n(\theta), \tag{1}$$
$$\hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(\theta; Z_i). \tag{2}$$
Here $\ell$ is a loss function, and $\Theta$ is a constraint set. Prominent examples of this general framework include maximum likelihood (ML) estimation [Fis22] and empirical risk minimization [Vap98]. Once the objective (1) is formed, it remains to define a computationally efficient scheme to approximate it. Gradient descent is the most frequently applied idea. Assuming –for the moment– an unconstrained problem, this takes the form
$$\theta^{t+1} = \theta^t - h\, \nabla \hat{R}_n(\theta^t). \tag{3}$$
While a large number of variants and refinements have been developed over the years (projected gradient, accelerated gradient [Nes13b], stochastic gradient [RM51], distributed gradient [TBA84], and so on), these share many of the strengths and weaknesses of the elementary iteration (3).
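To make the iteration (3) concrete, here is a minimal sketch (an illustration, not code from any referenced work) that runs plain gradient descent on an empirical risk; the least-squares risk, the step size `h`, and the iteration count are illustrative choices:

```python
import numpy as np

def gradient_descent(grad_Rn, theta0, h=0.1, n_iters=500):
    """Plain gradient descent: theta^{t+1} = theta^t - h * grad_Rn(theta^t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - h * grad_Rn(theta)
    return theta

# Example: least-squares empirical risk R_n(theta) = (1/2n) sum_i (y_i - <theta, x_i>)^2
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star  # noiseless, purely for clarity of the illustration

grad = lambda th: X.T @ (X @ th - y) / n  # gradient of the least-squares risk
theta_hat = gradient_descent(grad, np.zeros(d))
```

Because this example risk is convex, the iteration converges to the global minimizer; the point of the present paper is precisely what happens when it is not.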
If gradient descent is adopted, the only freedom is in the choice of the loss function $\ell$. Convexity has been a major guiding principle in this respect. If the function $\theta \mapsto \ell(\theta; z)$ is convex, then the empirical risk $\hat{R}_n$ is convex as well, and hence gradient descent is globally convergent to an M-estimator (the latter is unique under strict convexity). Also, strong convexity can be used to prove optimal statistical guarantees for the M-estimator $\hat{\theta}$. This line of thought can be traced back as far as Fisher's argument for the asymptotic efficiency of maximum likelihood estimators [Fis22, Fis25], and originated many beautiful contributions. In recent years, a flourishing line of research addresses the very high-dimensional regime $p \gg n$ by leveraging suitable restricted strong convexity assumptions [CT05, CT07, BRT09, NRWY12].
Despite these successes, many problems of practical interest call for nonconvex loss functions. Let us briefly mention a few examples of nonconvex M-estimators that are often preferred by practitioners to their convex counterparts. We will revisit these examples in Section 4.
In binary linear classification we are given pairs $(y_1, x_1), \dots, (y_n, x_n)$ with $y_i \in \{0, 1\}$, $x_i \in \mathbb{R}^d$, and would like to learn a model of the form $\hat{y}(x) = \sigma(\langle \theta, x \rangle)$, with $\theta \in \mathbb{R}^d$ a parameter vector and $\sigma$ a threshold function. The nonlinear square loss is commonly used in practice:
$$\hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \big( y_i - \sigma(\langle \theta, x_i \rangle) \big)^2. \tag{4}$$
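For concreteness, the loss (4) and its gradient can be sketched as follows (an illustration, not code from the paper; the logistic activation is an assumed choice, and labels are taken in $\{0,1\}$):

```python
import numpy as np

def sigma(t):
    """Logistic activation, sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def nonlinear_square_loss(theta, X, y):
    """R_n(theta) = (1/n) * sum_i (y_i - sigma(<theta, x_i>))^2."""
    p = sigma(X @ theta)
    return np.mean((y - p) ** 2)

def nonlinear_square_grad(theta, X, y):
    """Gradient: (2/n) * sum_i (sigma(u_i) - y_i) * sigma'(u_i) * x_i, u_i = <theta, x_i>."""
    u = X @ theta
    p = sigma(u)
    w = 2.0 * (p - y) * p * (1.0 - p)  # sigma'(u) = sigma(u) * (1 - sigma(u))
    return X.T @ w / len(y)
```

Note the factor $\sigma'(u_i)$ in the gradient: it vanishes for large $|u_i|$, which is the source of both the robustness and the nonconvexity of this loss.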
Several empirical studies [CDT09, WL12, NS13]
demonstrate superior robustness and classification accuracy of nonconvex losses in contrast to convex losses (e.g. the hinge or logistic loss). The same loss function is commonly used in neural-network models
[LBH15].

A similar scenario arises in robust regression. In this case, we are given pairs $(y_1, x_1), \dots, (y_n, x_n)$ with $y_i \in \mathbb{R}$, $x_i \in \mathbb{R}^d$, and we assume the linear model $y_i = \langle \theta_0, x_i \rangle + \varepsilon_i$, where the noise terms $\varepsilon_i$ are i.i.d. with mean zero. Since Huber's seminal work [Hub73], M-estimators are the method of choice for this problem:
$$\hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \rho\big( y_i - \langle \theta, x_i \rangle \big). \tag{5}$$
Robustness naturally suggests investigating the use of a nonconvex function $\rho$, either bounded or increasing slowly at infinity.
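To see why bounded or slowly growing $\rho$ helps, compare the classical Huber loss with a bounded (Cauchy-type) alternative in the sketch below: a single huge residual contributes without limit to the Huber objective, but at most a constant to the bounded one. The specific losses and the cutoff are illustrative choices, not the paper's:

```python
import numpy as np

def huber(t, t0=1.0):
    """Huber loss: quadratic for |t| <= t0, linear beyond; convex and robust."""
    a = np.abs(t)
    return np.where(a <= t0, 0.5 * t ** 2, t0 * (a - 0.5 * t0))

def bounded_rho(t, t0=1.0):
    """A bounded robust loss, rho(t) = (t/t0)^2 / (1 + (t/t0)^2): saturates at 1.

    Any bounded nonconstant loss is necessarily nonconvex."""
    return 1.0 - 1.0 / (1.0 + (t / t0) ** 2)
```

An outlier with residual 100 contributes 99.5 to the Huber objective but less than 1 to the bounded one, so a single corrupted sample cannot dominate the bounded empirical risk.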
Finally, missing data problems famously lead to nonconvex optimization formulations. Consider for instance a mixture-of-Gaussians problem in which we are given data points $x_1, \dots, x_n \in \mathbb{R}^d$ (for the sake of simplicity we assume identity covariance and known proportions $p_1, \dots, p_k$). The maximum-likelihood problem requires minimizing^{4}
$$\hat{R}_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \log\Big( \sum_{j=1}^k p_j\, \phi(x_i - \theta_j) \Big) \tag{6}$$
with respect to the cluster centers $\theta = (\theta_1, \dots, \theta_k)$. Other examples include low-rank matrix completion [KOM09], phase retrieval [SQW16], tensor estimation problems [MR14], and so on.

^{4} Here and below $\phi(x) = (2\pi)^{-d/2} e^{-\|x\|_2^2/2}$ denotes the $d$-dimensional standard Gaussian density.

M-estimation with nonconvex loss functions is far less understood than in the convex case. Empirical process theory guarantees uniform convergence of the sample risk to the population risk [BLM13]. However, this does not provide a computationally practical scheme, since gradient descent can get stuck in stationary points that are not global minimizers.
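Returning to the mixture example, the risk (6) is straightforward to evaluate numerically; the following is a minimal illustrative sketch (using a log-sum-exp for numerical stability), not code from the paper:

```python
import numpy as np

def mog_neg_loglik(centers, X, props):
    """Empirical risk (6): -(1/n) sum_i log( sum_j p_j * phi(x_i - theta_j) ),
    with phi the d-dimensional standard Gaussian density.

    centers: (k, d) array of cluster centers; X: (n, d) data; props: (k,) proportions.
    """
    n, d = X.shape
    # squared distances ||x_i - theta_j||^2, shape (n, k)
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    log_phi = -0.5 * sq - 0.5 * d * np.log(2 * np.pi)
    # log-sum-exp over components, for numerical stability
    a = np.log(props)[None, :] + log_phi
    m = a.max(axis=1, keepdims=True)
    ll = m[:, 0] + np.log(np.exp(a - m).sum(axis=1))
    return -ll.mean()
```

Observe that, with equal proportions, the risk is invariant under permutation of the centers; this symmetry is behind the multiple (exchange-related) global minima discussed in Section 4.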
In this paper, we present several general results on nonconvex M-estimation, and apply them to develop new analyses of each of the three problems mentioned above. We next give an overview of our main results and of the paper's organization, referring to Section 2 for a discussion of related work.
 Uniform convergence of gradient and Hessian.

We prove that, under technical conditions on the loss function $\ell$, $\sup_{\theta \in \Theta} \|\nabla \hat{R}_n(\theta) - \nabla R(\theta)\|_2 \lesssim \sqrt{d \log n / n}$ and $\sup_{\theta \in \Theta} \|\nabla^2 \hat{R}_n(\theta) - \nabla^2 R(\theta)\|_{\mathrm{op}} \lesssim \sqrt{d \log n / n}$ (we use $\lesssim$ to hide constant factors). We refer to Section 3.1 for formal statements.
These results complement the classical analysis that implies uniform convergence of the risk itself, but in addition allow us to control the behavior of stationary points. Note that they guarantee uniform convergence of the gradient and Hessian provided $n \gtrsim d \log d$. Apart from logarithmic factors, this is the optimal condition.
(In this paper we will refer to the asymptotics with $d$ roughly of the same order as $n$ as the high-dimensional regime^{5}, to contrast it with the low-dimensional analysis in which $d$ stays fixed as $n \to \infty$. We will refer to the asymptotics $n \ll p$ under sparsity assumptions as the very high-dimensional regime.)

^{5} The specific asymptotics with $d/n$ converging to a constant is also known as 'Kolmogorov asymptotics' [Ser13].
 Topology of the empirical risk.

As an immediate consequence of the previous result, the structure of the empirical risk function is –in many cases– surprisingly simple. Recall that a Morse function is a twice differentiable function whose stationary points are nondegenerate (i.e. have an invertible Hessian). In particular, stationary points are isolated, and have a well-defined index. Assume that the population risk is strongly Morse (i.e., at any stationary point $\theta$, all the eigenvalues of the Hessian are bounded away from zero). Then, for $n \gtrsim d \log d$, the stationary points of the empirical risk are in one-to-one correspondence with those of the population risk, and have the same index (minima correspond to minima, saddles to saddles, and so on). Weaker conditions ensure this correspondence for local minima alone.
 Very high-dimensional regime.

We then extend the above picture to the case in which the number of parameters $p$ exceeds the number of samples $n$, under the assumption that the true parameter vector $\theta_0$ is sparse. This setting is relevant to a large number of applications, ranging from genomics [PZB10] to signal processing [Don06]. In order to promote sparse estimates, we study the following regularized nonconvex problem, cf. Section 3.2:
$$\text{minimize} \quad \hat{R}_n(\theta) + \lambda \|\theta\|_1, \quad \text{subject to} \quad \|\theta\|_2 \le r. \tag{7}$$
We introduce a generalized gradient linearity condition on the loss function and prove that –under this condition– the above problem has a unique local minimum for $n \gtrsim s_0 \log p$, with $s_0$ the number of nonzero entries of $\theta_0$. Again this is a nearly optimal scaling, since no consistent estimation is possible for $n \lesssim s_0 \log(p/s_0)$.
 Applications.

Given a particular Mestimation problem with a suitable statistical model, we combine the above results with an analysis of the population risk to derive precise characterizations of the empirical risk. In Section 4 we demonstrate that this program can be carried out by studying the three problems outlined below:

Binary linear classification. We prove that, for $n \gtrsim d \log d$,^{6} the empirical risk has a unique local minimum, that is also the global minimum. Further, gradient descent converges exponentially fast to this minimizer, and the minimizer $\hat{\theta}$ enjoys nearly optimal estimation error guarantees: $\|\hat{\theta} - \theta_0\|_2 \lesssim \sqrt{d \log n / n}$. If the true parameter $\theta_0$ is sparse, for $n \gtrsim s_0 \log p$, the regularized empirical risk has a unique local minimum, that is also the global minimum. The minimizer enjoys nearly optimal estimation error guarantees: $\|\hat{\theta} - \theta_0\|_2 \lesssim \sqrt{s_0 \log p / n}$.

^{6} Recall that, in this case, the number of parameters $p$ is equal to the ambient dimension $d$.

Robust regression. We establish similar results for the robust regression model, under technical assumptions on the loss function $\rho$ and on the distribution of the noise $\varepsilon_i$. Namely, we prove that the empirical risk has a unique local minimum, which can be found efficiently via gradient descent, provided $n \gtrsim d \log d$. If the true parameter $\theta_0$ is sparse, for $n \gtrsim s_0 \log p$, the regularized empirical risk has a unique local minimum.

Mixture of Gaussians. We consider the special case of two Gaussians with equal proportions, i.e. $k = 2$ with $p_1 = p_2 = 1/2$. We prove that, for $n \gtrsim d \log d$, the empirical risk has two global minima, related to each other by exchange of the two Gaussian components $\theta_1 \leftrightarrow \theta_2$, and connected via saddle points. The trust region algorithm converges to one of these two minima when initialized at random. Also, the two minima are within nearly optimal statistical errors from the true centers.

1.1 Notations
We use normal font for scalars (e.g. $x, y, \dots$) and boldface for vectors ($\mathbf{x}, \mathbf{y}, \dots$). We will typically reserve capital letters for random variables (and capital bold for random vectors). Given $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, their standard scalar product is denoted by $\langle \mathbf{x}, \mathbf{y} \rangle$. The Euclidean norm of a vector is –as usual– indicated by $\|\mathbf{x}\|_2$. The identity matrix is denoted by $\mathbf{I}_d$. Given a matrix $\mathbf{A} \in \mathbb{R}^{d \times d}$, we denote by $\lambda_1(\mathbf{A}) \ge \lambda_2(\mathbf{A}) \ge \dots \ge \lambda_d(\mathbf{A})$ its eigenvalues in decreasing order, and by $\|\mathbf{A}\|_{\mathrm{op}}$ its operator norm. Finally, we shall occasionally consider third order tensors $\mathbf{T} \in \mathbb{R}^{d \times d \times d}$. In this case the operator (or injective) norm is defined as $\|\mathbf{T}\|_{\mathrm{op}} = \max \langle \mathbf{T}, \mathbf{u} \otimes \mathbf{v} \otimes \mathbf{w} \rangle$, where the maximum is over unit vectors $\mathbf{u}, \mathbf{v}, \mathbf{w}$.

We let $B^d(\mathbf{a}, r)$ be the ball in $\mathbb{R}^d$ with center $\mathbf{a}$ and radius $r$. We will often omit the dimension superscript $d$ when clear from the context, the radius when $r = 1$, and the center when $\mathbf{a} = \mathbf{0}$. In particular, $B^d(r)$ is the Euclidean ball of radius $r$ centered at the origin. For any set $\Omega$, we let $\partial \Omega$ be the boundary of the set.
We will generally use upper case letters for random variables and lower case for deterministic values (unless the latter are matrices).
2 Related literature
While developing a theory of nonconvex M-estimators is an outstanding challenge, several important facts are by now well understood thanks to a stream of beautiful works. We provide a necessarily incomplete summary in the next paragraphs.
Uniform convergence of the empirical risk. Let $R(\theta) = \mathbb{E}\,\ell(\theta; Z)$ denote the population risk. Under mild conditions on the loss function $\ell$ and on the sample size, it is known that with high probability
$$\sup_{\theta \in \Theta} \big| \hat{R}_n(\theta) - R(\theta) \big| \le \varepsilon \tag{8}$$
for some small $\varepsilon$ [VdG00, BLM13]. This immediately implies guarantees for the M-estimator $\hat{\theta}$ in terms of the risk (or prediction error). Under additional conditions on the population risk $R$, bounds in estimation error can be derived as well.
For general nonconvex losses, uniform convergence results of the form (8) do not preclude the existence of multiple local minima of the sample risk $\hat{R}_n$. Hence, this theory does not provide –by itself– computationally practical methods to compute $\hat{\theta}$.
Algorithmic convergence to the 'statistical neighborhood'. In general, gradient descent and other local optimization procedures are expected to converge to local minima of the empirical risk $\hat{R}_n$. In several cases, it is proved that every local minimizer is 'statistically good'. More precisely, the estimation error (e.g. the error $\|\hat{\theta} - \theta_0\|_2$) is within a constant factor of the minimax rate for the problem at hand. Also, gradient descent converges to such a neighborhood of the true parameter $\theta_0$ within a small number of iterations. Results of this type have been proved, among others, for linear regression with noisy covariates [LW12], generalized linear models with nonconvex regularizers [LW13], robust regression [LM13], and sparse regression [YWL15].

While these results are very important, they are not completely satisfactory. For instance, one natural question is whether the statistical error might be improved by finding a better local minimum. If, for instance, the estimation error could be improved by a constant factor by finding a better local minimum, it would be worthwhile in many applications to restart gradient descent at multiple initializations. Also, since convergence to a fixed point is not guaranteed, these approaches come without a clear stopping criterion. Finally, these proofs make use of the restricted strong convexity (RSC) assumption introduced in [NRWY12, LW13], but do not provide any general tool to establish this condition. In contrast, we prove uniform convergence results that can be used to ensure a condition similar to RSC.
To the best of our knowledge, the only proof of uniqueness of the local minimum of the regularized empirical risk is obtained in a recent paper by Po-Ling Loh [Loh15]. This work assumes the linear regression model $y_i = \langle \theta_0, x_i \rangle + \varepsilon_i$, and establishes uniqueness for penalized regression with a certain class of bounded regularizers. This result is comparable to our Theorem 4.4, see Section 4.4, which uses $\ell_1$ regularization instead. Note that, in [Loh15], the sample size is required to scale quadratically in the sparsity: $n \gtrsim s_0^2 \log p$. Our proof technique is substantially different from the one of [Loh15], and we only require $n \gtrsim s_0 \log p$.
Hybrid optimization methods. It is often difficult to ensure global convergence to a minimizer of the sample risk, or even to a statistical neighborhood of the true parameters. Several papers develop two-stage procedures to overcome this problem. The first stage constructs a smart initialization $\theta^0$ that is within a certain large neighborhood of the true parameters. Spectral methods are often used to implement this step. In the second stage, the estimate is refined by gradient descent (or another local procedure) initialized at $\theta^0$. This general approach was studied in a number of problems including matrix completion [KOM09], phase retrieval [CC15], and tensor decomposition [AGJ15].
In some cases, the local optimization stage is only proved to converge to a statistical neighborhood of $\theta_0$, and hence this style of analysis shares the shortcomings emphasized in the previous paragraph. In others, it is proven to converge to a single point. Further, in practice, the smart initialization is often not needed, and descent algorithms converge from random initialization as well. Finally, as mentioned above, these analyses are typically carried out in a case-by-case manner.
3 Uniform convergence results
In this section we develop our key tools, namely uniform convergence results for the gradient and Hessian of the empirical risk. We also establish some of the direct implications of our results. Throughout, the data consists of the i.i.d. random variables $Z_1, \dots, Z_n$. We will use $z_i$ if we want to refer to the corresponding realization. The empirical risk is defined by Eq. (2), and the corresponding population risk is $R(\theta) = \mathbb{E}\,\ell(\theta; Z)$. The true parameter vector $\theta_0$ satisfies the condition $\nabla R(\theta_0) = 0$.
We consider two regimes: a high-dimensional regime in which the number of parameters $d$ is allowed to diverge roughly in proportion to the number of samples $n$, and a very high-dimensional regime in which the true parameter vector $\theta_0$ is sparse and the number of parameters $p$ can be much larger than $n$. We treat these two cases separately because the theory is simpler and more general in the first regime.
3.1 High-dimensional regime
In order to avoid technical complications, we will limit the optimization to a bounded set, i.e. we will let $\Theta = B^d(r)$ be the Euclidean ball of radius $r$ in $d$ dimensions.
We begin by stating our assumptions. The first two assumptions below quantify the amount of statistical noise in the gradient and Hessian of the loss function.

[Gradient statistical noise] The gradient of the loss is sub-Gaussian. Namely, for any $\theta \in \Theta$ and $\lambda \in \mathbb{R}^d$,
$$\mathbb{E}\exp\Big( \big\langle \lambda, \nabla\ell(\theta; Z) - \nabla R(\theta) \big\rangle \Big) \le \exp\Big( \frac{\tau^2 \|\lambda\|_2^2}{2} \Big). \tag{9}$$

[Hessian statistical noise] The Hessian of the loss, evaluated on a unit vector, is sub-exponential. Namely, denoting by $\|X\|_{\psi_1}$ the sub-exponential (Orlicz) norm of a random variable $X$,
$$\|X\|_{\psi_1} \equiv \sup_{k \ge 1} k^{-1} \big( \mathbb{E}|X|^k \big)^{1/k}, \tag{10}$$
we require, for any $\theta \in \Theta$ and any unit vector $u$,
$$\big\| \big\langle u, \big( \nabla^2 \ell(\theta; Z) - \nabla^2 R(\theta) \big) u \big\rangle \big\|_{\psi_1} \le \tau^2. \tag{11}$$
Our third assumption requires the Hessian of the loss to be a Lipschitz function of the vector of parameters $\theta$.

[Hessian regularity] The Hessian of the population risk is bounded at one point. Namely, there exist $\theta^* \in \Theta$ and $H > 0$ such that $\|\nabla^2 R(\theta^*)\|_{\mathrm{op}} \le H$.
Further, the Hessian of the loss function is Lipschitz continuous with integrable Lipschitz constant. Namely, there exists $J^*$ (potentially diverging polynomially in $d$) such that, for all $\theta_1, \theta_2 \in \Theta$,
$$\big\| \nabla^2 \ell(\theta_1; Z) - \nabla^2 \ell(\theta_2; Z) \big\|_{\mathrm{op}} \le J(Z)\, \|\theta_1 - \theta_2\|_2, \tag{12}$$
$$\mathbb{E}\big[ J(Z) \big] \le J^*. \tag{13}$$
Further, there exists a constant such that , . Note that has the same units^{7} as , and has the same units as . Thus, has the same units as , has the same units as , and has the same units as . This is the reason why we bound and in the form as in Assumption 3.1. In this way, and are dimensionless.

^{7} By this we mean that the two quantities behave in the same way under a rescaling of the parameters.
Discrete loss functions (e.g. the 0-1 loss) are common within the statistical learning literature, but do not satisfy the above assumptions because the gradient and Hessian are not defined everywhere. Note however that these can be well approximated by differentiable losses, with little –if any– practical difference.
We are now in position to state our uniform convergence result. Under the three assumptions stated above (gradient statistical noise, Hessian statistical noise, and Hessian regularity), there exists a universal constant $C_0$ such that the following hold:

The sample gradient converges uniformly to the population gradient in Euclidean norm. Namely, if $n \ge C_0\, d \log d$, then with high probability
$$\sup_{\theta \in \Theta} \big\| \nabla \hat{R}_n(\theta) - \nabla R(\theta) \big\|_2 \lesssim \tau \sqrt{\frac{d \log n}{n}}. \tag{14}$$
The sample Hessian converges uniformly to the population Hessian in operator norm. Namely, if $n \ge C_0\, d \log d$, then with high probability
$$\sup_{\theta \in \Theta} \big\| \nabla^2 \hat{R}_n(\theta) - \nabla^2 R(\theta) \big\|_{\mathrm{op}} \lesssim \tau^2 \sqrt{\frac{d \log n}{n}}. \tag{15}$$
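The flavor of bound (14) can be checked in simulation for a case where the population gradient is available in closed form: for the squared loss with $x \sim \mathsf{N}(0, \mathbf{I}_d)$ and $y = \langle \theta_0, x \rangle + \varepsilon$, one has $\nabla R(\theta) = \theta - \theta_0$. The sketch below is purely illustrative (a random grid of points stands in for the supremum over $\Theta$), not part of the paper's analysis:

```python
import numpy as np

def sup_grad_error(n, d=10, n_test=50, seed=0):
    """Monte Carlo proxy for sup_theta ||grad R_n(theta) - grad R(theta)||_2
    for the squared loss with x ~ N(0, I_d), where grad R(theta) = theta - theta_0."""
    rng = np.random.default_rng(seed)
    theta0 = np.zeros(d)
    X = rng.standard_normal((n, d))
    y = X @ theta0 + rng.standard_normal(n)
    err = 0.0
    for _ in range(n_test):
        theta = rng.standard_normal(d)  # random grid point, in place of a true sup
        g_emp = X.T @ (X @ theta - y) / n     # empirical gradient
        g_pop = theta - theta0                # population gradient (identity covariance)
        err = max(err, np.linalg.norm(g_emp - g_pop))
    return err
```

Increasing $n$ by a factor of 100 should shrink this proxy error by roughly a factor of 10, matching the $\sqrt{d \log n / n}$ scaling up to logarithms.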
The above theorem immediately implies that the structure of stationary points of the sample risk must reflect that of the population risk. In order to formalize this intuition, we introduce the notion of a strongly Morse function. Given a differentiable function $f : B^d(r) \to \mathbb{R}$, we say that $x$ in the interior of the ball is critical (or stationary) if $\nabla f(x) = 0$.
Recall that a twice differentiable function is Morse if all its critical points are nondegenerate, i.e. have an invertible Hessian. In other words, $\nabla f(x) = 0$ implies $\lambda_i(\nabla^2 f(x)) \ne 0$ for all $i$. Our next definition provides a quantitative version of this notion. We say that a twice differentiable function $f$ is $(\varepsilon, \eta)$-strongly Morse if $\|\nabla f(x)\|_2 > \varepsilon$ for $x \in \partial B^d(r)$ and, for any $x \in B^d(r)$ with $\|\nabla f(x)\|_2 \le \varepsilon$, the following holds:
$$\min_{i \in [d]} \big| \lambda_i\big( \nabla^2 f(x) \big) \big| \ge \eta. \tag{16}$$
Note that, analogously to a Morse function on a compact domain, a strongly Morse function can have only a finite number of critical points in the interior of $B^d(r)$. Also recall that the index of a nondegenerate critical point $x$ is the number of negative eigenvalues of the Hessian $\nabla^2 f(x)$ (assuming $f$ to be twice differentiable).
Under the same assumptions as Theorem 3.1, and for $n \ge C_0\, d \log d$, the following happens with high probability.
If the population risk $R$ is $(\varepsilon, \eta)$-strongly Morse in $B^d(r)$, then the sample risk $\hat{R}_n$ is strongly Morse in $B^d(r)$. Further, there is a one-to-one correspondence between the set of critical points of $R$ and the set of critical points of $\hat{R}_n$ (letting $\hat{\theta}_j$ denote the critical point of $\hat{R}_n$ in correspondence with the critical point $\theta_j$ of $R$):

The index of $\hat{\theta}_j$ coincides with the index of $\theta_j$. (In particular, local minima correspond to local minima, and saddles to saddles.)

We have, for each $j$,
$$\big\| \hat{\theta}_j - \theta_j \big\|_2 \lesssim \frac{\tau}{\eta} \sqrt{\frac{d \log n}{n}}. \tag{17}$$
The strong Morse assumption imposes conditions on all the eigenvalues of the Hessian at near-critical points, and implies a detailed characterization of the empirical risk. If only weaker properties can be established for the population risk, Theorem 3.1 can nevertheless be very useful. For instance, in Section 4.5 we consider an example in which near-critical points have a Hessian whose smallest eigenvalue is either positive or negative, but in both cases bounded away from zero. This weaker condition is sufficient to obtain a characterization of the local minima of the empirical risk.
3.2 Very high-dimensional regime
In the very high-dimensional regime $n \ll p$, we will solve the penalized risk minimization problem
$$\text{minimize} \quad \hat{R}_n(\theta) + \lambda \|\theta\|_1, \tag{18}$$
$$\text{subject to} \quad \|\theta\|_2 \le r.$$
We need some additional assumptions. It is fairly straightforward to check them in specific cases, see e.g. Section 4.1. The first assumption is mainly technical, and not overly restrictive: it requires the loss function to have an almost surely bounded gradient, in a suitable sense.

[Gradient bounds] There exists a constant $L$ such that, almost surely, for all $\theta \in \Theta$,
$$\big\| \nabla \ell(\theta; Z) \big\|_\infty \le L. \tag{19}$$
Our key structural assumption is stated next. It requires the gradient of the loss function to depend on the parameters $\theta$ only through a linear function of $\theta$, possibly dependent on the feature vector $x$. Note that $z$ is regarded here as fixed, and hence omitted from the arguments.

[Generalized gradient linearity] There exists a function $\psi : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ such that
$$\nabla \ell(\theta; z) = \psi\big( \langle \theta, x \rangle, y \big)\, x. \tag{20}$$
In addition, $\psi$ is Lipschitz in its first argument, and the feature vector $x$ is mean-zero and sub-Gaussian. As an example, in the case of binary linear classification and robust regression, the data is given as a pair $z = (y, x)$, and the loss takes the form $\ell(\theta; z) = \hat{\ell}(\langle \theta, x \rangle; y)$ for some function $\hat{\ell}$. Assumption 3.2 is then satisfied with $\psi = \partial_1 \hat{\ell}$, provided the latter is Lipschitz as a function of its first argument.
Under the assumptions stated above (gradient statistical noise, Hessian statistical noise, gradient bounds, and generalized gradient linearity), there exist a constant depending on the model parameters and a universal constant such that the following hold:

The sample directional gradient converges uniformly to the population directional gradient, along the direction $\theta - \theta_0$. Namely, with high probability we have
$$\sup_{\theta \ne \theta_0} \frac{ \big| \big\langle \nabla \hat{R}_n(\theta) - \nabla R(\theta),\, \theta - \theta_0 \big\rangle \big| }{ \|\theta - \theta_0\|_1 } \lesssim \sqrt{\frac{\log p}{n}}. \tag{21}$$

The sample restricted Hessian converges uniformly to the population restricted Hessian over the set of $s$-sparse directions, for any $s$. Namely, with high probability we have
$$\sup_{\theta \in \Theta}\ \sup_{\|u\|_2 = 1,\ \|u\|_0 \le s} \Big| \big\langle u, \big( \nabla^2 \hat{R}_n(\theta) - \nabla^2 R(\theta) \big) u \big\rangle \Big| \lesssim \sqrt{\frac{s \log p}{n}}. \tag{22}$$
4 Applications
4.1 Binary linear classification: High-dimensional regime
As mentioned in the introduction, in this case we are given pairs $(y_1, x_1), \dots, (y_n, x_n)$ with $y_i \in \{0, 1\}$, $x_i \in \mathbb{R}^d$, whereby $\mathbb{P}(y_i = 1 \mid x_i) = \sigma(\langle \theta_0, x_i \rangle)$ (hence $p = d$ in this case). We estimate $\theta_0$ by minimizing the nonlinear square loss (4), which we copy here for the reader's convenience:
$$\text{minimize} \quad \hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \big( y_i - \sigma(\langle \theta, x_i \rangle) \big)^2, \tag{23}$$
$$\text{subject to} \quad \|\theta\|_2 \le r.$$
This can be regarded as a smooth version of the 0-1 loss.
We collect below the technical assumptions on this model. [Binary linear classification]

The activation $\sigma$ is three times differentiable with $\sigma'(t) > 0$ for all $t$, and has bounded first, second, and third derivatives. Namely, for some constant $L$:
$$\max\big\{ \|\sigma'\|_\infty,\ \|\sigma''\|_\infty,\ \|\sigma'''\|_\infty \big\} \le L. \tag{24}$$
The feature vector $x$ has zero mean and is sub-Gaussian, that is, $\mathbb{E}\exp\big( \langle \lambda, x \rangle \big) \le \exp\big( \tau^2 \|\lambda\|_2^2 / 2 \big)$ for all $\lambda \in \mathbb{R}^d$.

The feature vector spans all directions in $\mathbb{R}^d$, that is, $\mathbb{E}\big[ x x^{\mathsf{T}} \big] \succeq c\, \mathbf{I}_d$ for some $c > 0$.
Assumption 1 is satisfied by many classical activation functions, a prominent example being the logistic (or sigmoid) function $\sigma(t) = (1 + e^{-t})^{-1}$.

Our main results on binary linear classification are summarized in the theorem below. Under Assumption 1, there exist positive constants $C_0$, $c$, and $C$, depending on the model parameters and the activation function $\sigma$ but independent of $n$ and $d$, such that, if $n \ge C_0\, d \log d$, the following hold with high probability:

The empirical risk function $\hat{R}_n$ has a unique local minimizer in $B^d(r)$, which is also the global minimizer $\hat{\theta}$.

Gradient descent with a fixed step size converges exponentially fast to the global minimizer, for any initialization $\theta^0 \in B^d(r)$: $\|\theta^t - \hat{\theta}\|_2 \le C (1 - c)^t$.

We have $\|\hat{\theta} - \theta_0\|_2 \le C \sqrt{(d \log n)/n}$.
The proof of this theorem can be found in Appendix E.1, and is based on the following two-step strategy. First, we study the population risk $R$, and establish its qualitative properties analytically. Second, we use our uniform convergence result (Theorem 3.1) to prove that the same properties carry over to the sample risk $\hat{R}_n$. Figure 1 presents a small numerical example that illustrates how the qualitative features of the population risk apply to the empirical risk as well.
A few remarks are in order. First of all, the convergence rate of gradient descent (at point 2 of the theorem) is independent of the dimension $d$ and of the number of samples $n$. In other words, $O(\log(1/\varepsilon))$ iterations are sufficient to converge within distance $\varepsilon$ of the global minimizer. Classical theory of empirical risk minimization only concerns the statistical properties of the optimum, but does not provide efficient algorithms.
Next, note that our condition on the sample size, $n \gtrsim d \log d$, is nearly optimal. Indeed, it is information-theoretically impossible to estimate $\theta_0$ from fewer than order $d$ binary samples. Finally, the estimation error at point 3 also nearly matches the optimal (parametric) rate $\sqrt{d/n}$.
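The claims of the theorem are easy to probe numerically. The sketch below (purely illustrative; it assumes the logistic activation and arbitrarily chosen problem sizes) runs gradient descent from two different random initializations on the nonlinear square loss and finds the same minimizer, close to the true parameter:

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_risk(theta, X, y):
    """Gradient of the nonlinear square loss (1/n) sum_i (y_i - sigma(<theta, x_i>))^2."""
    p = sigma(X @ theta)
    return X.T @ (2.0 * (p - y) * p * (1.0 - p)) / len(y)

def run_gd(theta, X, y, h=1.0, n_iters=3000):
    """Fixed-step gradient descent, as in point 2 of the theorem."""
    for _ in range(n_iters):
        theta = theta - h * grad_risk(theta, X, y)
    return theta

rng = np.random.default_rng(0)
n, d = 2000, 4
theta_true = np.array([1.0, -1.0, 0.5, 0.0])
X = rng.standard_normal((n, d))
y = (rng.random(n) < sigma(X @ theta_true)).astype(float)

# Two arbitrary random initializations reach the same minimizer.
t1 = run_gd(rng.standard_normal(d), X, y)
t2 = run_gd(rng.standard_normal(d), X, y)
```

Here $n \gg d \log d$, so the theorem predicts a unique local minimizer within statistical error of $\theta_0$, and that is what the simulation exhibits.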
4.2 Binary linear classification: Very high-dimensional regime
As in the previous section, we are given pairs $(y_1, x_1), \dots, (y_n, x_n)$ with $y_i \in \{0, 1\}$, $x_i \in \mathbb{R}^p$, and $\mathbb{P}(y_i = 1 \mid x_i) = \sigma(\langle \theta_0, x_i \rangle)$. However $\theta_0$ is assumed to be sparse, and the number of samples $n$ is allowed to be much smaller than the ambient dimension $p$. We adopt again the nonlinear square loss (4), but now use constrained regularized risk minimization, as per Eq. (18), which we rewrite here explicitly for the reader's ease:
$$\text{minimize} \quad \hat{R}_n(\theta) + \lambda \|\theta\|_1, \tag{25}$$
$$\text{subject to} \quad \|\theta\|_2 \le r.$$
The very high-dimensional regime $n \ll p$ is of interest in many contexts. In machine learning, the number of parameters $p$ can increase when a large number of additional features are added to the model (for instance, nonlinear functions of an original set of features). In signal processing, $\theta_0$ represents an unknown signal, of which we measure noisy random linear projections $\langle x_i, \theta_0 \rangle$, $i \le n$, quantized to a single bit. This scenario is relevant to group testing [AS12] and analog-to-digital conversion [LWYB11, LB12], and has been studied under the name of 'one-bit compressed sensing'; see [PV13a] and references therein.

In the very high-dimensional regime we need additional assumptions on the distribution of the feature vectors as well as on the activation function $\sigma$.

[Fast-decaying activation] The activation function $\sigma$ satisfies a decay bound of the form $|\sigma'(t)| \le C e^{-|t|/C}$ for some absolute constant $C$.

[Continuous and bounded features] The feature vector $x$ has a density in $\mathbb{R}^p$, that is, $\mathbb{P}(x \in A) = \int_A f(x)\, \mathrm{d}x$ for all Borel sets $A$. In addition, the feature vector is bounded: $\|x\|_\infty \le C$ almost surely. Here $C$ is a dimensionless constant greater than one.
Assumption 4.2 holds for popular examples of activation functions, such as the logistic $\sigma(t) = (1 + e^{-t})^{-1}$ or the probit $\sigma(t) = \Phi(t)$, with $\Phi$ the standard Gaussian distribution function. For unbounded sub-Gaussian feature vectors, the next theorem can be supplemented by a truncation argument at a suitable level. Hence, the conclusions of this theorem hold, with an additional factor.
In the statement of the following theorem, for convenience, we will also impose a technical upper bound on the sample size $n$. Since we are considering the very high-dimensional regime, the excluded case of much larger $n$ is not of interest here.
Under Assumptions 1, 4.2 and 4.2, further assume , , and . Then there exist constants , , , and depending on and the activation function , but independent of , , , and , such that as and , the following hold with probability at least :

Any stationary point of problem (25) is in .

As long as is large enough such that and , the problem has a unique local minimizer which is also the global minimizer.
As in the previous section, our proof makes crucial use of the sparse uniform convergence result, Theorem 3.2, together with an analysis of the population risk.
Let us emphasize that Theorem 4.2 leaves open the existence of a fast algorithm to find the global optimizer $\hat{\theta}$. However, [Nes13a, Theorem 3] implies that, by running sufficiently many steps of projected gradient descent, we can find an estimate that has a small subgradient. While we expect this sequence to converge to $\hat{\theta}$, we defer this question to future work.
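One natural first-order scheme for problems of the form (25) is proximal gradient descent (ISTA) with soft-thresholding; the sketch below is an illustration of that generic scheme, not necessarily the exact method analyzed in [Nes13a], and the example risk is a sparse least-squares stand-in for the nonconvex one:

```python
import numpy as np

def soft_threshold(v, t):
    """Prox of t * ||.||_1: shrink each coordinate toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(grad_Rn, theta0, lam, h=0.05, n_iters=2000):
    """Proximal gradient for:  minimize R_n(theta) + lam * ||theta||_1."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = soft_threshold(theta - h * grad_Rn(theta), h * lam)
    return theta

# Illustrative sparse problem: n = 100 samples, p = 300 parameters, s0 = 3 nonzeros.
rng = np.random.default_rng(0)
n, p, s0 = 100, 300, 3
X = rng.standard_normal((n, p)) / np.sqrt(n)  # roughly unit-norm columns
theta_true = np.zeros(p)
theta_true[:s0] = 1.0
y = X @ theta_true
grad = lambda th: X.T @ (X @ th - y)  # gradient of (1/2) ||X theta - y||^2
theta_hat = ista(grad, np.zeros(p), lam=0.02)
```

The soft-thresholding step sets small coordinates exactly to zero, which is what produces sparse iterates even though $n \ll p$.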
Theorem 4.2 establishes a nearly optimal upper bound on the estimation error $\|\hat{\theta} - \theta_0\|_2$. Indeed, this error is within a logarithmic factor of the error achieved by an oracle estimator that is given the exact support of $\theta_0$. For comparison, [PV13a, PV13b] prove a weaker error bound for a linear programming formulation, under the more restrictive assumption of Gaussian feature vectors. This analysis was generalized in [ALPV14] to feature vectors with i.i.d. entries, although with the same estimation error bound. The optimal rate was obtained only recently in [PVY14], again for standard Gaussian feature vectors.

Let us finally emphasize that the estimator defined here uses a bounded loss function, and is potentially more robust to outliers than other approaches that use a convex loss (e.g. the logistic loss).
4.3 Robust regression: High-dimensional regime
In robust regression we are given data $(y_1, x_1), \dots, (y_n, x_n)$ with $y_i \in \mathbb{R}$, $x_i \in \mathbb{R}^d$, and we assume the linear model $y_i = \langle \theta_0, x_i \rangle + \varepsilon_i$, where the noise terms $\varepsilon_i$ are i.i.d. with mean zero. Also in this case we have $p = d$. We use the loss (5), which we copy here for the reader's convenience:
$$\text{minimize} \quad \hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \rho\big( y_i - \langle \theta, x_i \rangle \big), \tag{26}$$
$$\text{subject to} \quad \|\theta\|_2 \le r.$$
Classical choices for the loss function are the Huber loss [Hub73], which is convex, with $\rho(t)$ growing linearly for $|t|$ large enough, and Tukey's bisquare loss, which is bounded and defined (up to normalization) as
$$\rho(t) = \begin{cases} 1 - \big( 1 - (t/t_0)^2 \big)^3 & \text{if } |t| \le t_0, \\ 1 & \text{otherwise.} \end{cases} \tag{27}$$
It is common to define the associated score function as $\psi(t) = \rho'(t)$.
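For reference, the bisquare (27) and its score $\psi = \rho'$ can be sketched as follows (an illustration; the cutoff $t_0 = 4.685$ is a conventional default, not a value prescribed by the paper):

```python
import numpy as np

def tukey_rho(t, t0=4.685):
    """Tukey's bisquare loss (27), normalized to saturate at 1 for |t| > t0."""
    u = np.clip(t / t0, -1.0, 1.0)
    return 1.0 - (1.0 - u ** 2) ** 3

def tukey_psi(t, t0=4.685):
    """Score psi = rho': (6 t / t0^2) * (1 - (t/t0)^2)^2 for |t| <= t0, else 0.

    psi is odd and vanishes for large residuals, so outliers exert no pull."""
    u = t / t0
    return np.where(np.abs(u) <= 1.0, 6.0 * t / t0 ** 2 * (1.0 - u ** 2) ** 2, 0.0)
```

The redescending score (identically zero beyond $t_0$) is exactly what makes the estimator insensitive to gross outliers, at the price of a nonconvex $\rho$.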
We next formulate our assumptions. [Robust regression]

The score function $\psi$ is twice differentiable and odd in $t$, with $\psi(t) > 0$ for all $t > 0$, and has bounded zeroth, first, and second derivatives. Namely, for some constant $L$:
$$\max\big\{ \|\psi\|_\infty,\ \|\psi'\|_\infty,\ \|\psi''\|_\infty \big\} \le L. \tag{28}$$
The noise $\varepsilon$ has a symmetric distribution, i.e. $\varepsilon$ is distributed as $-\varepsilon$. Further, defining $h(t) \equiv \mathbb{E}\,\psi(t + \varepsilon)$, we have $h(t) > 0$ for all $t > 0$, as well as $h'(0) > 0$.

The feature vector $x$ has zero mean and is sub-Gaussian, that is, $\mathbb{E}\exp\big( \langle \lambda, x \rangle \big) \le \exp\big( \tau^2 \|\lambda\|_2^2 / 2 \big)$ for all $\lambda \in \mathbb{R}^d$.

The feature vector spans all directions in $\mathbb{R}^d$, that is, $\mathbb{E}\big[ x x^{\mathsf{T}} \big] \succeq c\, \mathbf{I}_d$ for some $c > 0$.
Note that the conditions on the noise distribution in Assumption 4.3 are quite mild, and hold –for instance– whenever the noise has a density that is strictly positive for all $t$, and decreasing for $t > 0$.

Under Assumption 4.3, there exist positive constants $C_0$, $c$, and $C$, depending on the model parameters, the loss function $\rho$, and the law of the noise, but independent of $n$ and $d$, such that, for $n \ge C_0\, d \log d$, the robust regression estimator satisfies the following with high probability:

The empirical risk function $\hat{R}_n$ has a unique local minimizer in $B^d(r)$, which is also the global minimizer $\hat{\theta}$.

Gradient descent with a fixed step size converges exponentially fast to the global minimizer, for any initialization $\theta^0 \in B^d(r)$: