1.1 Background and Motivation
Nonconvex loss functions arise in many different branches of statistics, machine learning and deep learning. These loss functions entail several advantages from a statistical point of view. For instance, in robust regression, where one requires that the influence function of the loss is bounded, nonconvex losses are widely used. Furthermore, they are unavoidable in areas such as deep learning where they arise as a byproduct of the representation of the data. Despite the exponential increase in methodologies involving nonconvex loss functions, there are still many theoretical questions that need to be answered.
Nonconvex optimization problems can usually be solved only by algorithms that guarantee convergence to a so-called stationary point. A stationary point is often not the global minimum, and recovering the latter is almost hopeless. Statistical theory, however, has mostly focused on deriving properties of an incomputable global optimum. We show that under certain circumstances stationary points satisfy sharp oracle results similar to those derived for the global optimum.
High-dimensional data (i.e. when the number of parameters to be estimated exceeds the number of observations) represent an additional challenge. A well-established way of tackling this problem is to assume that the number of “active” parameters is smaller than the dimension of the parameter space. This assumption is typically called “sparsity”. Estimators designed under the sparsity assumption are often M-estimators with either an additional constraint or a penalty term. Under convex loss functions these approaches are numerically equivalent. Here we focus on the latter approach. We consider estimators that are composed of a nonconvex differentiable loss and a penalty term. Primarily, the penalty term is chosen to be a “sparsity-inducing” norm.
We now describe the structure of the estimators that we are interested in. Let be independent observations with values in some space stemming from a distribution depending on . Let be a differentiable possibly nonconvex function such that
The function measures the “misfit” that arises by taking the decision in comparison to the given data.
We define as
is named the “empirical risk”. It is a random quantity as it depends on the random observations . The unknown quantity we are interested in estimating is given by the minimizer of the population version:
where is the risk.
Consider a norm on with dual norm of . The subdifferential of the norm is defined as
. We consider empirical risk minimization problems of the form
where is a tuning parameter that needs to be chosen.
To solve optimization problems of the type given in (1.5) one often uses gradient descent algorithms and their modifications. However, algorithms for nonconvex optimization problems typically output a local optimum of the objective function in (1.5) rather than its global minimizer. In this paper we show that points satisfying
where , enjoy some properties of the (incomputable) estimator . These points are called stationary points.
Using that and that one can see that the two point inequality is indeed satisfied by points that satisfy inequality (1.6).
Let the oracle be a non-random vector. We think of the oracle as a quantity that already “contains” some additional structural assumption about the estimation problem, such as the number of non-zero entries of the target. The oracle optimally trades off the approximation and estimation errors. In this paper we show that stationary points (i.e. points obeying inequality (1.6)) mimic the behavior of the oracle, just as the global optimum does. The oracle inequalities that we derive are typically of the following type:
where is a constant depending neither on the sample size nor on the dimension of the estimation problem. Inequalities of this kind are called sharp since the constant in front of the risk in the upper bound is $1$. This is particularly important if the approximation error is not small. We also derive rates of convergence for the estimation error measured in different norms: besides the Euclidean norm, the estimation error can be measured in the -norm.
1.2 Related literature
Nonconvex optimization problems are ubiquitous. The most recent example that makes theoretical understanding of stationary points of nonconvex optimization problems necessary is deep learning. As mentioned at the end of Chapter 4.3 of  the majority of the problems in deep learning cannot be solved via convex optimization.
Another prominent area where statistical nonconvex optimization problems arise is represented by mixture models. Typically, the estimators are computed by a version of the Expectation-Maximization (EM) algorithm or by a (coordinate) gradient descent algorithm. Examples for this can be found in where a finite mixture of regressions is considered in the high-dimensional setting. An EM-type algorithm is proposed and theoretical guarantees for the global minimizer are derived. The question about the statistical properties of stationary points (i.e. what the algorithm actually outputs) is left to future research. In Schelldorfer et al.  linear mixed-effects models in the high-dimensional setting are studied. A coordinate gradient descent algorithm is proposed and convergence to a stationary point is proven. Also in this latter work there is a gap between what the numerical algorithm outputs and the statistical properties that are shown to hold for the global minimum. However, the situation in the two mentioned papers is still more involved as the population version of the problem has several stationary points. For EM-type algorithms the work of  is the first that guarantees theoretical properties for estimates of symmetric mixtures of two Gaussians and two regressions.
Several high-dimensional estimation problems related to regression lead ineluctably to nonconvex optimization problems. In  corrected linear regression is studied. Three additional sources of noise that lead to nonconvex estimators are examined. The case of additive noise in the predictors, the case of missing data, and the case of multiplicative noise in the predictors are studied. The population versions of these estimation problems are convex. However, due to the estimators of the population covariance matrices they become nonconvex in the sample version. A gradient descent algorithm is proposed and theoretical properties of the minimum are described.
In a follow-up work  give theoretical guarantees for the stationary points of nonconvex penalized M-estimators. Their framework also includes nonconvex penalization terms. However, in contrast to the present work they do not provide sharp oracle inequalities. In  the authors give theoretical guarantees for the support recovery using nonconvex penalized M-estimators. The loss function as well as the penalization term are both allowed to be nonconvex.
As far as robust regression is concerned, the use of nonconvex loss functions is particularly appealing. The main robustness-inducing property that is exploited is the boundedness of the gradient/the Lipschitz continuity of the loss. Estimators involving e.g. the Tukey loss function seem therefore particularly well-suited for this task.  gives a general framework for this particular type of regularized M-estimators. The penalty term is allowed to be nonconvex as well.
In  a general framework to analyze the theoretical properties of -penalized and unpenalized M-estimators is proposed. The former is necessary for the high-dimensional setting whereas the latter are used for the case where the number of observations exceeds the number of parameters to be estimated. Rates of convergence are derived for stationary points of several statistical estimation problems such as robust regression, binary linear classification, and Gaussian mixtures. In contrast, we only consider the high-dimensional setting and derive sharp oracle inequalities from which the rates obtained in  can be recovered. Our framework applies also to different types of penalizing norms other than the -norm.
The nonconvex optimization problems that are considered in the present work can be subdivided into the following types:
The quantity to be estimated is the unique global minimizer of the convex risk . The nonconvexity stems exclusively from the sample optimization problem. This case has been considered for example in . An example of this type of estimation problem is the corrected linear regression with additive noise in the covariates. It is discussed in Subsection 3.1.
The quantity to be estimated is a (possibly non-unique) global minimizer of the nonconvex risk . The risk is convex in a neighborhood of the target, i.e. on a set of the form
A parallel line of research is concerned with the inspection of the theoretical properties of nonconvex penalization terms. In  a general framework for concave penalization terms is established. In general, it is argued that concave penalties reduce the bias that results from convex procedures such as e.g. the Lasso . We restrict ourselves to the case of norm penalized estimators.
1.3 Organization of the paper
In Section 2 we review the notion of an oracle and discuss the additional properties related to the penalization term that are needed for the sharp oracle inequality. The sharp oracle inequality given in Theorem 2.1 is purely deterministic. In Section 3 we show how the (deterministic) sharp oracle inequality can be applied to specific estimation problems. In Subsection 3.1 the application to corrected linear regression is presented. In Subsection 3.2 we show that the sharp oracle inequality also holds for stationary points of sparse PCA. In Subsections 3.3 and 3.4 we make use of Theorem 2.1 to derive sharp oracle inequalities also for robust regression and binary linear classification. Finally, in Subsection 3.5 we propose a new estimator, “Robust SLOPE”, and derive a sharp oracle result.
2 Sharp oracle inequality
In this section we mainly discuss the (deterministic) properties of the population version of the general estimation problem. In particular, we first describe the condition on the (population) risk. Then, we specify the kind of regularizers and their characteristics that are covered by our theory. Finally, we state a first general nonrandom sharp oracle inequality.
2.1 Conditions on the risk
In order to guarantee a “sufficient identifiability” of the parameter that is to be estimated, we assume that the risk satisfies a strong convexity condition on the convex set . It is worth noticing that this is a condition on a theoretical quantity that can be verified under the assumptions on the nonconvex loss in the specific examples.
Condition 1 (Two point margin condition).
There is an increasing strictly convex non-negative function with and a semi-norm on such that for all
Condition 1 says essentially that the curvature of the risk is sufficiently large in a certain neighborhood of . As will be demonstrated in the sequel of the paper, there are many examples where the loss function is nonconvex with some additional structural assumptions and yet the population risk is “well-behaved” on .
Condition 1 is a condition on the theoretical risk. In contrast, Restricted Strong Convexity (RSC), which was introduced in  and , combines the curvature of the empirical risk with the penalty. It was originally designed to analyze the properties of convex regularized M-estimators. In  and  it was further extended to the case of nonconvex M-estimators.  introduces the notion of local Restricted Strong Convexity. The latter can be seen as a two point margin condition on the sample version of the problem on the set .
2.2 Conditions on the regularization term
In the $\ell_1$ world one exploits the property that any vector can be decomposed into an “active” and a “non-active” part. For a subset we define the vector such that . Then the following decomposition holds:
The previous equality is a slight abuse of notation: the vectors and lie either in , or and , respectively. This property is usually named “decomposability”.
The present framework can be applied to more general norm penalties. In  the concept of weak decomposability was introduced. It relaxes decomposability by requiring that for all and certain sets the sum of certain norms of and is always smaller than or equal to .
Definition 2.1 (Weakly decomposable norm, Definition 4.1 in ).
For a subset the norm is said to be weakly decomposable if there is a norm on such that for all
Suppose that the norm is weakly decomposable for a subset . Then for all
Equation (2.4) is also called the triangle property. It imitates the properties of the $\ell_1$-norm.
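The decomposability of the $\ell_1$-norm that motivates these definitions can be checked numerically. The following is a minimal sketch in plain Python; the vector `beta`, the active set `S` and the helper names `l1_norm` and `restrict` are hypothetical and chosen only for illustration.

```python
def l1_norm(v):
    """Return the l1-norm of a vector given as a list of floats."""
    return sum(abs(x) for x in v)


def restrict(v, index_set):
    """Keep the entries of v whose index lies in index_set; zero out the rest."""
    return [x if j in index_set else 0.0 for j, x in enumerate(v)]


beta = [1.5, 0.0, -2.0, 0.25, 0.0]  # hypothetical parameter vector
S = {0, 2}                          # hypothetical "active" set
S_c = set(range(len(beta))) - S     # complement of the active set

beta_S = restrict(beta, S)          # "active" part
beta_Sc = restrict(beta, S_c)       # "non-active" part

# The vector itself splits into its active and non-active parts.
assert [a + b for a, b in zip(beta_S, beta_Sc)] == beta

# For the l1-norm the norms of the two parts add up exactly.
sum_of_parts = l1_norm(beta_S) + l1_norm(beta_Sc)
whole = l1_norm(beta)
print(sum_of_parts, whole)
```

For the $\ell_1$-norm the two printed numbers coincide. Weak decomposability only asks for the sum of the norms of the two parts to be at most the norm of the whole vector, which is why it covers a larger class of norms.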
We emphasize that the choice of the regularization term has far-ranging consequences for the properties of the estimator as well as for the techniques that are necessary to analyze it. In  the concept of weak decomposability was further extended to other norms. As a consequence, the triangle property can be shown to hold in many more cases. In the present framework, however, we sacrifice some generality for a clearer exposition of our results.
2.3 Effective sparsity
The choice of the penalization deeply influences the estimation performance of the stationary points. In particular, it affects the estimation error part of the oracle inequality. In order to provide a quantitative description of this effect, we first review some concepts introduced in the rich literature about the Lasso. The concepts developed for the $\ell_1$-norm are paradigmatic of the more general notions.
Well-studied conditions on the design in the $\ell_1$-penalized linear regression framework are the restricted eigenvalue and the more general compatibility constant . As in the well-known $\ell_1$ framework, we recall the (slightly modified) definition of an -eigenvalue.
Definition 2.2 (-eigenvalue, ).
Let be an allowed subset of and . The -eigenvalue is defined as
where is the (semi)-norm from the two point margin condition (Condition 1).
Definition 2.3 (-effective sparsity, ).
The -effective sparsity is defined as
Effective sparsity can be interpreted as a measure of how well one can distinguish between the active and non-active parts depending on the specific context of the estimation problem. In fact, one can observe that increasing the stretching factor reduces the “distance” between the sets and (as the size of this set increases). In turn, this means that the effective sparsity becomes larger. In particular, the stretching factor is shown to depend on the tuning parameter . As the amount of noise increases it is observed that the tuning parameter increases and therefore also the stretching factor. More noise then translates to less distinguishable active and non-active parts.
2.4 Main result
We denote the oracle by and the corresponding “active” set will be denoted by . The oracle is a nonrandom vector that might be described as an idealized estimator that has additional structural information about the estimation problem. For instance, the oracle could be a vector that “knows” how many non-zero entries the underlying truth has. It then minimizes the upper bound of inequality (1.1). In other words, it optimally trades off the approximation and estimation errors.
Let be a stationary point in the sense of inequality (1.6). Suppose that Condition 1 is satisfied. Suppose further that the norm is weakly decomposable. Let be the convex conjugate of (for the definition of the convex conjugate see p. 104 of ). Let and such that for all and a constant
Let and . Define , , and . Then we have
The proof of this theorem closely follows the proof of Theorem 7.1 in . The main difference lies in the fact that we do not need convexity of the empirical risk . Moreover, we allow for an additional term in the bound for the random part. This is crucial in the examples considered in this paper. The interpretation of the oracle inequality is that a given estimator achieves a rate of convergence that is almost as good (up to an additional constant term that is typically the risk of the oracle) as if it had background knowledge about the sparsity.
The term “sharp” refers to the constant ‘1’ in front of the risk in the upper bound of the oracle inequality. It also refers to the fact that the upper bound does not involve .
The noise level needs to be chosen depending on the specific structure of the problem. The term is (in an asymptotic sense) of lower order than . Asymptotically, it does not influence the rates.
The estimation error can be measured in the semi-norm by the two point margin condition or in the norm.
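As a worked illustration of the convex conjugate that enters Theorem 2.1, consider a quadratic margin function. The symbols $G$, $H$, $u$, $v$ and the constant $c > 0$ are notation assumed here for illustration, and the definition used is the usual one for convex conjugates on the non-negative half-line, not a quotation of the footnote.

```latex
% Assumed notation: G is the margin function, H its convex conjugate.
\[
  H(v) \;=\; \sup_{u \ge 0} \bigl\{ u v - G(u) \bigr\}, \qquad v \ge 0 .
\]
% Quadratic margin: G(u) = c u^2 with c > 0.
% The map u -> u v - c u^2 is concave; setting its derivative v - 2 c u
% to zero gives the maximizer u = v / (2c), which is non-negative.
\[
  G(u) = c\,u^{2}
  \quad\Longrightarrow\quad
  H(v) \;=\; \frac{v^{2}}{2c} - c \cdot \frac{v^{2}}{4c^{2}}
        \;=\; \frac{v^{2}}{4c} .
\]
```

Under this assumption a larger curvature constant $c$ yields a smaller conjugate, which matches the remark in the applications that a more strongly curved population risk makes the estimation problem “easier”.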
3 Applications to specific estimation problems
In this section several applications of Theorem 2.1 are presented. The first part is dedicated to the “usual” entrywise sparsity, where the number of active parameters in the target is assumed to be smaller than the problem dimension. In this first part the sparsity-inducing norm is taken to be the $\ell_1$-norm. In the last subsection we introduce a new estimator, “Robust SLOPE”, to demonstrate that our framework can also be applied to different penalizing norms.
3.1 Corrected linear regression
In this subsection we closely follow the notation in . We consider the linear model for :
is a response variable and are i.i.d. copies of a sub-Gaussian random vector with unknown positive definite covariance matrix , is unknown and are i.i.d. copies of a sub-Gaussian random variable independent of . We say that a random vector is sub-Gaussian if where for a real-valued random variable , is the Orlicz norm for the function , .
The matrix with rows may be additionally corrupted by additive noise in which case one would observe
The matrix is independent of and . Its rows are assumed to be i.i.d. copies of a sub-Gaussian random vector with expectation zero and known covariance matrix . Thus, the rows are i.i.d. copies of a random vector .
The estimator in this case is then given by
We assume that so that the vector lies within the region over which we compute the estimator. For ease of notation we define
The empirical risk is then given by
The first and second derivatives of the empirical risk are given by
It can be seen that in a high-dimensional setting (more parameters than observations) the matrix has negative eigenvalues due to the correction for the additional noise. The high-dimensional estimation problem is therefore nonconvex.
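The appearance of negative eigenvalues can be illustrated on a toy example. The sketch below assumes that the second derivative of the empirical risk has the form $Z^T Z / n - \Sigma_W$ that is common in the corrected-regression literature; this form, the matrix `Z`, and the choice $\Sigma_W = 0.5 \cdot I$ are assumptions made for illustration only. With two observations and three parameters there is a non-zero vector `v` with $Zv = 0$, and along this direction only the negative correction term remains.

```python
def dot(a, b):
    """Euclidean inner product of two lists of floats."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical observed design with n = 2 rows and p = 3 columns (p > n).
Z = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 4.0]]
n, p = len(Z), len(Z[0])
sigma_w2 = 0.5  # assumed noise covariance Sigma_W = sigma_w2 * identity

# Assumed form of the Hessian of the empirical risk: Z^T Z / n - Sigma_W.
H = [[sum(Z[i][j] * Z[i][k] for i in range(n)) / n
      - (sigma_w2 if j == k else 0.0)
      for k in range(p)]
     for j in range(p)]

# The cross product of the two rows is orthogonal to both, hence Z v = 0.
z1, z2 = Z
v = [z1[1] * z2[2] - z1[2] * z2[1],
     z1[2] * z2[0] - z1[0] * z2[2],
     z1[0] * z2[1] - z1[1] * z2[0]]

# Quadratic form v^T H v; only the term -v^T Sigma_W v survives.
quad_form = dot(v, [dot(row, v) for row in H])

# The Rayleigh quotient bounds the smallest eigenvalue of H from above.
rayleigh = quad_form / dot(v, v)
print(quad_form, rayleigh)
```

A negative Rayleigh quotient shows that the smallest eigenvalue of `H` is negative, so the quadratic objective built from `H` cannot be convex.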
On the other hand, the population version of the empirical risk is given by
The first and second derivatives are then given by
The population version of the estimation problem is therefore convex. The next lemma shows that the risk is not only convex but even strongly convex.
The two point margin condition is satisfied with and , where denotes the square root of .
The connection between the penalty and the norm is established in the following lemma that gives an expression for the effective sparsity (Definition 2.3).
For and we have for any set with that
We now state several lemmas that are used to establish the Empirical Process Condition (2.1).
Define . We then have for all and all
with probability at least .
The following lemma shows how the quadratic form involving the positive definite matrix is related to the (quadratic) margin function.
Define . We have for all
where and are the largest and smallest eigenvalues of the matrices and , respectively.
Define , , and for all , and for
Then we have for all
with probability at least .
Let be a constant. Define
with probability at least . If we choose
and if we assume that
then . Hence, the Empirical Process Condition (2.1) is satisfied.
and . Then, we have with probability at least
As far as the asymptotics is concerned, we consider the case where the oracle is itself. We notice that the choice leads to
We thus recover the rates obtained in . Furthermore, we notice that the rates of convergence depend on the smallest eigenvalue of the true covariance matrix . This is not surprising since the smallest eigenvalue measures the curvature of the population risk. The larger is, the higher the curvature, and the “easier” the estimation problem becomes. As far as estimators leading to convex optimization problems are concerned,  propose and analyze a method for the errors-in-variables model called MU-selector, where MU stands for matrix uncertainty, for a deterministic noise matrix . In  the MU-selector is further improved to allow for random noise in the observations. The estimator is called Compensated MU selector and has a better estimation performance, similar to that of the method that is proposed in  and analyzed in the present paper. Two further estimators leading to convex optimization problems based on an , and penalties are proposed in . Finally,  define an estimator that achieves minimax optimal rates up to a logarithmic term.  propose another (convex) method called Convex Conditioned Lasso (CoCoLasso) where the estimate of the covariance matrix such as in (3.4), which is not positive semidefinite in a high-dimensional setting, is replaced by a positive semidefinite matrix. Unlike the previously mentioned papers, we also account for the case where the underlying regression function/curve is not necessarily a linear combination of the variables. The importance of the sharp oracle inequalities for the estimator given in equation (3.3) is to be seen in this additional property rather than in the derivation that bears the dependence on and .
3.2 Sparse PCA
In principal component analysis (PCA) the aim is to find a one-dimensional representation of the data such that the variance explained by this representation is maximized. The empirical covariance matrix is given by . We write that . The target is then given by the eigenvector corresponding to the maximal eigenvalue of the covariance matrix. An estimator for the first principal component is obtained by maximizing the empirical variance with respect to :
The solution of the optimization problem (3.14) is the eigenvector corresponding to the maximal eigenvalue of the sample covariance matrix. An equivalent form (after normalization) of the optimization problem (3.14) is the following minimization problem, where an objective function is minimized with respect to :
Both optimization problems (3.14) and (3.15) lead to the same solution after normalization. In this case, even if the optimization problem is nonconvex the solution can be easily computed by finding the eigenvector corresponding to the maximal eigenvalue of the sample covariance matrix .
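The statement that the unpenalized problem is solved by the leading eigenvector of the sample covariance matrix can be sketched with a few lines of plain Python. The toy data `X`, the normalization $X^T X / n$ for centered data, and the use of power iteration are assumptions made for illustration; they are not taken from the paper.

```python
import math


def dot(a, b):
    """Euclidean inner product of two lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def mat_vec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in A]


def normalize(v):
    """Rescale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def explained_variance(A, v):
    """Empirical variance v^T A v explained by the unit direction v."""
    return dot(v, mat_vec(A, v))


# Hypothetical centered data: n = 4 observations of p = 2 features.
X = [[2.0, 2.0], [-2.0, -2.0], [1.0, -1.0], [-1.0, 1.0]]
n, p = len(X), len(X[0])

# Assumed sample covariance X^T X / n (columns already have mean zero).
Sigma_hat = [[sum(X[i][j] * X[i][k] for i in range(n)) / n
              for k in range(p)]
             for j in range(p)]

# Power iteration: repeated multiplication converges to the leading eigenvector.
u = normalize([1.0, 0.0])
for _ in range(200):
    u = normalize(mat_vec(Sigma_hat, u))
lam = explained_variance(Sigma_hat, u)

# Variance explained by other unit directions, for comparison.
others = [explained_variance(Sigma_hat,
                             [math.cos(k * math.pi / 36), math.sin(k * math.pi / 36)])
          for k in range(36)]
print(lam, max(others))
```

No other unit direction on the grid explains more variance than the direction found by power iteration. The resulting direction loads on both features, which is the lack of sparsity that motivates the penalized version.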
A major drawback of PCA is that the first principal component is typically a linear combination of all the variables in the model. In many applications it is however desirable to sacrifice some variance in order to obtain a sparse representation that is easier to interpret. Furthermore, in a high-dimensional setting PCA has been shown to be inconsistent .  shows that under the spiked covariance model () in a high-dimensional setting the eigenvector corresponding to the largest eigenvalue of is not able to recover the truth when the gap between the largest eigenvalue of and the second-largest is “small”.
We need to restrict to a neighborhood of one of the global optima in order to assure convexity and uniqueness of the minimum of the risk. Define . Let be the “oracle” as given in Section 2.
We consider the penalized optimization problem
where and are tuning parameters. The risk is given by
The first derivative of the risk is given by
The second derivative of the risk is given by
The (strong) convexity of the risk on the neighborhood
depends on the “signal strength”. In this case the latter is given by the largest singular value of the population covariance matrix . The singular value decomposition of is given by
where and with .
We assume that the features are i.i.d. copies of a sub-Gaussian random vector with positive definite covariance matrix .
It is assumed that for some
We assume that .
Assumption 1 is often referred to as the spikiness condition. It says that the signal should be sufficiently well separated from the other principal components.
What needs to be further explained is the third assumption. In order for the population risk to be convex in the neighborhood we require a sufficiently large gap between the largest eigenvalue of the true covariance matrix and its remaining eigenvalues. One might object that assuming a “good” starting value is not realistic. However, a consistent initial estimate with a slow rate of convergence is given in .
The following lemma guarantees that the risk is strictly convex around one of the local minima of the population risk.
Lemma 3.7 (Lemma 12.7 in ).
Suppose that Assumption 1 is satisfied. Then for all we have
where is the smallest eigenvalue of the Hessian on the set .
The next lemma shows that the risk is indeed sufficiently convex.
Suppose that Assumption 1 is satisfied. The two point margin condition is satisfied on with and .
As we now have a different norm as compared to the sparse corrected linear regression case, we also obtain a different effective sparsity:
For and we have for any set with that
The following lemma shows that the Empirical Process Condition (2.1) holds with large probability with appropriate constants.
Define and for
Let be a constant. Then with and
we have for all
with probability at least . If we choose
we have . Hence, the Empirical Process Condition (2.1) is satisfied.
Then we have with probability at least
For the asymptotics we assume that . For simplicity, we take the oracle to be itself. Then and
We see that the rates depend on the gap between the largest eigenvalue of the matrix and the remaining eigenvalues. It is again not surprising since the estimation problem becomes “easier” the larger this gap is.
3.3 Robust regression
We consider the linear model for all and with i.i.d. copies of a sub-Gaussian random vector : .
where we assume that the distribution of the errors is symmetric around zero. We also assume that the errors are independent of the features . In the presence of outliers and heavy-tailed noise in the linear regression model, the quadratic loss typically fails because its derivative is unbounded. Alternatives to the quadratic loss are given by e.g. the Cauchy loss.
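The contrast between the quadratic loss and a bounded-derivative alternative can be made concrete. The sketch below uses one common parametrization of the Cauchy loss, $\rho(x) = \tfrac{c^2}{2}\log(1 + x^2/c^2)$; this parametrization, the scale `c`, and the function names are assumptions for illustration, since the text does not fix a formula.

```python
import math


def cauchy_loss(x, c=1.0):
    """Assumed Cauchy loss: (c^2 / 2) * log(1 + (x / c)^2)."""
    return 0.5 * c * c * math.log1p((x / c) ** 2)


def cauchy_grad(x, c=1.0):
    """First derivative of the assumed Cauchy loss: x / (1 + (x / c)^2)."""
    return x / (1.0 + (x / c) ** 2)


def cauchy_hess(x, c=1.0):
    """Second derivative: (1 - t) / (1 + t)^2 with t = (x / c)^2."""
    t = (x / c) ** 2
    return (1.0 - t) / (1.0 + t) ** 2


def quadratic_grad(x):
    """First derivative of the quadratic loss x^2 / 2."""
    return x


# Residuals on a grid from -100 to 100, mimicking large outliers.
grid = [k / 10 for k in range(-1000, 1001)]

max_cauchy_grad = max(abs(cauchy_grad(x)) for x in grid)
max_quadratic_grad = max(abs(quadratic_grad(x)) for x in grid)
print(max_cauchy_grad, max_quadratic_grad)

# Negative curvature away from the origin: the loss is nonconvex.
print(cauchy_hess(2.0))
```

With `c = 1` the derivative of the Cauchy loss never exceeds `c / 2` in absolute value, whereas the derivative of the quadratic loss grows with the size of the residual. The negative second derivative for residuals larger than `c` is what makes the resulting estimation problem nonconvex.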
The empirical risk is given by
Its first derivative is given by
Its second derivative is given by
Lipschitz continuity of the loss: there exists such that
Lipschitz continuity of the first derivative of the loss: there exists such that
Local curvature condition: Define the tail probability as
It is assumed that for
We notice that for our framework we need to assume that also the first derivative of the loss is Lipschitz continuous. In  the assumption is weaker in the sense that it is only required that the second derivative of the loss is not “too negative”.
The usual (typically uncomputable) “argmin”-type estimator is then given by
where and are tuning parameters.
We now cite a proposition from  that establishes the restricted strong convexity conditions. It shows how the different (tuning) parameters are intertwined.