Sharp oracle inequalities for stationary points of nonconvex penalized M-estimators

by Andreas Elsener, et al.
ETH Zurich

Many statistical estimation procedures lead to nonconvex optimization problems. Algorithms for solving these problems are often guaranteed only to output a stationary point of the optimization problem. Oracle inequalities are an important theoretical instrument for assessing the statistical performance of an estimator. Oracle results have focused on the theoretical properties of the uncomputable (global) minimum or maximum. In the present work we extend a general framework, originally used to derive oracle inequalities for convex optimization problems, to stationary points. A main new ingredient of these oracle inequalities is that they are sharp: they show closeness to the best approximation within the model plus a remainder term. We apply this framework to different estimation problems.



1 Introduction

1.1 Background and Motivation

Nonconvex loss functions arise in many different branches of statistics, machine learning and deep learning. These loss functions entail several advantages from a statistical point of view. For instance, in robust regression, where one requires that the influence function of the loss is bounded, nonconvex losses are widely used. Furthermore, they are unavoidable in areas such as deep learning where they arise as a byproduct of the representation of the data. Despite the exponential increase in methodologies involving nonconvex loss functions, there are still many theoretical questions that need to be answered.

As a matter of fact, the nonconvex optimization problems can usually be solved only via algorithms that guarantee convergence to a so-called stationary point. A stationary point is often not the global minimum. It is almost hopeless to recover the latter. Statistical theory has mostly focused on deriving properties of an incomputable global optimum. We show that under certain circumstances stationary points satisfy sharp oracle results similar to those that were derived for the global optimum.

High-dimensional data (i.e. when the number of parameters to be estimated exceeds the number of observations) represent an additional challenge. A well-established way of tackling this problem is to assume that the number of “active” parameters is smaller than the dimension of the parameter space. This assumption is typically called “sparsity”. Estimators designed under the sparsity assumption are often M-estimators with either an additional constraint or a penalty term. Under convex loss functions these approaches are numerically equivalent. Here we focus on the latter approach. We consider estimators that are composed of a nonconvex differentiable loss and a penalty term. Primarily, the penalty term is chosen to be a “sparsity-inducing” norm.

We now describe the structure of the estimators that we are interested in. Let be independent observations with values in some space stemming from a distribution depending on . Let be a differentiable possibly nonconvex function such that


The function measures the “misfit” that arises by taking the decision in comparison to the given data.

We define as


is named the “empirical risk”. It is a random quantity as it depends on the random observations . The unknown quantity we are interested in estimating is given by the minimizer of the population version:


where is the risk.

Consider a norm on with dual norm of . The subdifferential of the norm is defined as


[2]. We consider empirical risk minimization problems of the form


where is a tuning parameter that needs to be chosen.

To solve optimization problems of the type given in (1.5), one often uses gradient descent algorithms and their modifications. However, algorithms for nonconvex optimization problems typically output a local optimum of the objective function (1.5) rather than its global minimizer. In this paper we show that points satisfying


where , enjoy some properties of the (incomputable) estimator . These points are called stationary points.
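To illustrate the kind of point an algorithm actually returns, the following minimal sketch runs proximal gradient descent on an $\ell_1$-penalized empirical risk built from the (nonconvex) Cauchy loss and then measures how far the final iterate is from satisfying the first-order condition. Everything specific here is an assumption made for illustration and is not taken from the paper: the loss, the simulated data, the function names, the step size, and the simplification that the parameter set is all of $\mathbb{R}^p$, in which case stationarity reduces to $0 \in \nabla R_n(\beta) + \lambda\,\partial\|\beta\|_1$.

```python
import numpy as np

def cauchy_loss_derivative(u, c=3.0):
    # Derivative of the Cauchy loss (c^2 / 2) * log(1 + (u / c)^2); bounded by c / 2.
    return u / (1.0 + (u / c) ** 2)

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def empirical_gradient(beta, X, y, c=3.0):
    # Gradient of R_n(beta) = (1/n) * sum_i rho(y_i - x_i^T beta).
    residuals = y - X @ beta
    return -(X.T @ cauchy_loss_derivative(residuals, c)) / len(y)

def proximal_gradient(X, y, lam, n_iter=2000, c=3.0):
    n, p = X.shape
    # |rho''| <= 1, so the largest eigenvalue of X^T X / n bounds the gradient's Lipschitz constant.
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta = soft_threshold(beta - step * empirical_gradient(beta, X, y, c), step * lam)
    return beta, step

def stationarity_gap(beta, X, y, lam, step, c=3.0):
    # Sup-norm of the gradient mapping; it is zero exactly at stationary points.
    nxt = soft_threshold(beta - step * empirical_gradient(beta, X, y, c), step * lam)
    return np.max(np.abs(beta - nxt)) / step

# Hypothetical simulated data: sparse linear model with Gaussian design and noise.
rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = 1.0
y = X @ beta0 + 0.5 * rng.standard_normal(n)

lam = 0.1
beta_tilde, step = proximal_gradient(X, y, lam)
gap = stationarity_gap(beta_tilde, X, y, lam, step)
print("stationarity gap:", gap)
```

The gradient mapping is used as the diagnostic because it vanishes precisely when the subgradient condition holds, without requiring a separate treatment of zero and non-zero coordinates.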

We extend a general framework introduced in [38] for convex optimization problems. The key property that is needed is called two point inequality in [38]:


Using that and that one can see that the two point inequality is indeed satisfied by points that satisfy inequality (1.6).

Let now

be a non-random vector with

. We think of the vector as a quantity that already “contains” some additional structural assumption about the estimation problem such as the number of non-zero entries of the target . The vector optimally trades off the approximation and estimation errors. In this paper we show that stationary points (i.e. points obeying inequality (1.6)) also mimic the behavior of the oracle, just as the global optimum does. The oracle inequalities that we derive are typically of the following type:


where is a constant not depending on the sample size nor on the dimension of the estimation problem. Inequalities of this kind are also named sharp since the constant in front of is . This is particularly important if the approximation error is not small. In addition, we also derive rates of convergence for the estimation error measured in different norms. In addition to the Euclidean norm the estimation error can be measured in the -norm.

1.2 Related literature

Nonconvex optimization problems are ubiquitous. The most recent example that makes theoretical understanding of stationary points of nonconvex optimization problems necessary is deep learning. As mentioned at the end of Chapter 4.3 of [11] the majority of the problems in deep learning cannot be solved via convex optimization.

Another prominent area where statistical nonconvex optimization problems arise is represented by mixture models. Typically, the estimators are computed by a version of the Expectation-Maximization (EM) algorithm or by a (coordinate) gradient descent algorithm. Examples of this can be found in [33], where a finite mixture of regressions is considered in the high-dimensional setting. An EM-type algorithm is proposed and theoretical guarantees for the global minimizer are derived. The question about the statistical properties of stationary points (i.e. what the algorithm actually outputs) is left to future research. In Schelldorfer et al. [31] linear mixed-effects models in the high-dimensional setting are studied. A coordinate gradient descent algorithm is proposed and convergence to a stationary point is proven. In this latter work, too, there is a gap between what the numerical algorithm outputs and the statistical properties that are shown to hold for the global minimum. However, the situation in the two mentioned papers is still more involved as the population version of the problem has several stationary points. For EM-type algorithms the work of [3] is the first that guarantees theoretical properties for estimates of symmetric mixtures of two Gaussians and two regressions.

Several high-dimensional estimation problems related to regression lead ineluctably to nonconvex optimization problems. In [17] corrected linear regression is studied. Three additional sources of noise that lead to nonconvex estimators are examined. The case of additive noise in the predictors, the case of missing data, and the case of multiplicative noise in the predictors are studied. The population versions of these estimation problems are convex. However, due to the estimators of the population covariance matrices they become nonconvex in the sample version. A gradient descent algorithm is proposed and theoretical properties of the minimum are described.

In a follow-up work, the authors of [19] give theoretical guarantees for the stationary points of nonconvex penalized M-estimators. Their framework also includes nonconvex penalization terms. However, in contrast to the present work, they do not provide sharp oracle inequalities. In [18] the authors give theoretical guarantees for support recovery using nonconvex penalized M-estimators. The loss function and the penalization term are both allowed to be nonconvex.

As far as robust regression is concerned, the use of nonconvex loss functions is particularly appealing. The main robustness-inducing property that is exploited is the boundedness of the gradient/the Lipschitz continuity of the loss. Estimators involving e.g. the Tukey loss function seem therefore particularly well-suited for this task. [16] gives a general framework for this particular type of regularized M-estimators. The penalty term is allowed to be nonconvex as well.

In [20] a general framework to analyze the theoretical properties of $\ell_1$-penalized and unpenalized M-estimators is proposed. The former are necessary for the high-dimensional setting whereas the latter are used for the case where the number of observations exceeds the number of parameters to be estimated. Rates of convergence are derived for stationary points of several statistical estimation problems such as robust regression, binary linear classification, and Gaussian mixtures. In contrast, we only consider the high-dimensional setting and derive sharp oracle inequalities from which the rates obtained in [20] can be recovered. Our framework also applies to penalizing norms other than the $\ell_1$-norm.

The nonconvex optimization problems that are considered in the present work can be subdivided into the following types:

  1. The quantity to be estimated is the unique global minimizer of the convex risk . The source of nonconvexity stems exclusively from the sample optimization problem. This case has been considered for example in [17]. An example for this type of estimation problems is the corrected linear regression with additive noise in the covariates. It is discussed in Subsection 3.1.

  2. The quantity to be estimated is a (possibly non-unique) global minimizer of the nonconvex risk . The risk is convex in an $\ell_2$-neighborhood of the target, i.e. on a set of the form

    for some suitable constant . This case has been studied in [19] and [16]. An example is binary linear classification in Subsection 3.4.

A parallel line of research is concerned with the theoretical properties of nonconvex penalization terms. In [41] a general framework for concave penalization terms is established. In general, it is argued that concave penalties reduce the bias that results from convex procedures such as the Lasso [35]. We restrict ourselves to the case of norm-penalized estimators.

1.3 Organization of the paper

In Section 2 we review the notion of an oracle and discuss the additional properties related to the penalization term that are needed for the sharp oracle inequality. The sharp oracle inequality given in Theorem 2.1 is purely deterministic. In Section 3 we show how the (deterministic) sharp oracle inequality can be applied to specific estimation problems. In Subsection 3.1 the application to corrected linear regression is presented. In Subsection 3.2 we show that the sharp oracle inequality also holds for stationary points of sparse PCA. In Subsections 3.3 and 3.4 we make use of Theorem 2.1 to derive sharp oracle inequalities for robust regression and binary linear classification, respectively. Finally, in Subsection 3.5 we propose a new estimator, “Robust SLOPE”, and derive a sharp oracle result.

2 Sharp oracle inequality

In this section we mainly discuss the (deterministic) properties of the population version of the general estimation problem. In particular, we first describe the condition on the (population) risk. Then, we specify the kind of regularizers and their characteristics that are covered by our theory. Finally, we state a first general nonrandom sharp oracle inequality.

2.1 Conditions on the risk

In order to guarantee a “sufficient identifiability” of the parameter that is to be estimated, we assume that the risk satisfies a strong convexity condition on the convex set . It is worth noticing that this is a condition on a theoretical quantity that can be verified under the assumptions on the nonconvex loss in the specific examples.

Condition 1 (Two point margin condition).

There is an increasing strictly convex non-negative function with and a semi-norm on such that for all


Condition 1 says essentially that the curvature of the risk is sufficiently large in a certain neighborhood of . As will be demonstrated in the sequel of the paper, there are many examples where the loss function is nonconvex with some additional structural assumptions and yet the population risk is “well-behaved” on .

Condition 1 is a condition on the theoretical risk. In contrast, Restricted Strong Convexity (RSC), which was introduced in [22] and [1], combines the curvature of the empirical risk with the penalty. It was originally designed to analyze the properties of convex regularized M-estimators. In [17] and [19] it was further extended to the case of nonconvex M-estimators. [16] introduces the notion of local Restricted Strong Convexity. The latter can be seen as a two point margin condition on the sample version of the problem on the set .

2.2 Conditions on the regularization term

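In the $\ell_1$ world one exploits the property that any vector $\beta \in \mathbb{R}^p$ can be decomposed into an “active” and a “non-active” part. For a subset $S \subset \{1, \dots, p\}$ we define the vector $\beta_S \in \mathbb{R}^p$ such that $(\beta_S)_j = \beta_j$ if $j \in S$ and $(\beta_S)_j = 0$ otherwise. Then the following decomposition holds:

$\|\beta\|_1 = \|\beta_S\|_1 + \|\beta_{S^c}\|_1 .$

The previous equality is a slight abuse of notation: the vectors $\beta_S$ and $\beta_{S^c}$ lie either in $\mathbb{R}^p$, or in $\mathbb{R}^{|S|}$ and $\mathbb{R}^{p-|S|}$, respectively. This property is usually named “decomposability”.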
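As a small numerical illustration of decomposability, the sketch below splits a vector into its active and non-active parts and compares norms. The vector, the active set and the variable names are hypothetical. The $\ell_2$ comparison is included only to show that the additive split is special to the $\ell_1$-norm.

```python
import numpy as np

# Hypothetical vector and active set (indices 0 and 2).
beta = np.array([1.5, 0.0, -2.0, 0.25, 0.0])
active = np.array([True, False, True, False, False])

# Active and non-active parts, both kept as vectors of full length p.
beta_S = np.where(active, beta, 0.0)
beta_Sc = np.where(~active, beta, 0.0)

# The l1-norm splits additively over the two parts.
l1_total = np.abs(beta).sum()
l1_split = np.abs(beta_S).sum() + np.abs(beta_Sc).sum()

# The l2-norm does not: the sum of the parts' norms is strictly larger here.
l2_total = np.linalg.norm(beta)
l2_split = np.linalg.norm(beta_S) + np.linalg.norm(beta_Sc)

print(l1_total, l1_split)
print(l2_total, l2_split)
```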

The present framework can be applied to more general norm penalties. In [37] the concept of weak decomposability was introduced. It relaxes decomposability by requiring that for all and certain sets the sum of certain norms of and is always smaller than or equal to .

Definition 2.1 (Weakly decomposable norm, Definition 4.1 in [37]).

For a subset the norm is said to be weakly decomposable if there is a norm on such that for all

Lemma 2.1.

Suppose that the norm is weakly decomposable for a subset . Then for all


Equation (2.4) is also called the triangle property. It imitates the properties of the $\ell_1$-norm.
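For the $\ell_1$-norm, a triangle-type inequality of this kind can be checked numerically. The sketch below assumes the form $\|\beta\|_1 - \|\beta'\|_1 \le \|\beta'_S - \beta_S\|_1 + \|\beta_{S^c}\|_1 - \|\beta'_{S^c}\|_1$, which follows from decomposability and the reverse triangle inequality. This specific form, the function names and the random test vectors are assumptions made for illustration and may differ in detail from the statement of Lemma 2.1.

```python
import numpy as np

def l1(v):
    return np.abs(v).sum()

def triangle_gap(beta, beta_prime, S):
    # Right-hand side minus left-hand side of the assumed l1 triangle property.
    mask = np.zeros(beta.size, dtype=bool)
    mask[S] = True
    lhs = l1(beta) - l1(beta_prime)
    rhs = l1(beta_prime[mask] - beta[mask]) + l1(beta[~mask]) - l1(beta_prime[~mask])
    return rhs - lhs

rng = np.random.default_rng(3)
S = [0, 1, 2]  # hypothetical active set
gaps = [
    triangle_gap(rng.standard_normal(10), rng.standard_normal(10), S)
    for _ in range(1000)
]
min_gap = min(gaps)
print("smallest gap over 1000 random pairs:", min_gap)
```

A non-negative gap (up to rounding) on every draw is consistent with the inequality, though a random check is of course not a proof.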

We emphasize that the choice of the regularization term has far-ranging consequences for the properties of the estimator as well as for the techniques that are necessary to analyze it. In [38] the concept of weak decomposability was further extended to other norms. As a consequence, the triangle property can be shown to hold in many more cases. In the present framework, however, we sacrifice some generality for a clearer exposition of our results.

2.3 Effective sparsity

The choice of the penalization deeply influences the estimation performance of the stationary points. In particular, this affects the estimation error part of the oracle inequality. In order to provide a quantitative description of this effect, we first review some concepts introduced in the rich literature about the Lasso. The concepts developed for the $\ell_1$-norm are paradigmatic of the more general notions.

Well-studied conditions on the design in the $\ell_1$-penalized linear regression framework are the restricted eigenvalue [6] and the more general compatibility constant [36]. As for the well-known $\ell_1$ framework, we recall the (slightly modified) definition of an $\Omega$-eigenvalue.

Definition 2.2 ($\Omega$-eigenvalue, [37]).

Let be an allowed subset of and . The -eigenvalue is defined as


where is the (semi)-norm from the two point margin condition (Condition 1).

Definition 2.3 ($\Omega$-effective sparsity, [37]).

The $\Omega$-effective sparsity is defined as

Remark 1.

Effective sparsity can be interpreted as a measure of how well one can distinguish between the active and non-active parts depending on the specific context of the estimation problem. In fact, one can observe that increasing the stretching factor reduces the “distance” between the sets and (as the size of this set increases). In turn, this means that the effective sparsity becomes larger. In particular, the stretching factor is shown to depend on the tuning parameter . As the amount of noise increases it is observed that the tuning parameter increases and therefore also the stretching factor. More noise then translates to less distinguishable active and non-active parts.

2.4 Main result

We denote the oracle by and the corresponding “active” set will be denoted by . The oracle is a nonrandom vector that might be described as an idealized estimator that has additional structural information about the estimation problem. For instance, the oracle could be a vector that “knows” how many non-zero entries the underlying truth has. It then minimizes the upper bound of inequality (1.1). In other words, it optimally trades off the approximation and estimation errors.

Theorem 2.1.

Let be a stationary point in the sense of inequality (1.6). Suppose that Condition 1 is satisfied. Suppose further that the norm is weakly decomposable. Let be the convex conjugate of (for the definition of the convex conjugate see p. 104 of [28]). Let and such that for all and a constant


Let and . Define , , and . Then we have


The proof of this theorem closely follows the proof of Theorem 7.1 in [38]. The main difference lies in the fact that we do not need convexity of the empirical risk . Moreover, we allow for an additional term in the bound for the random part. This is crucial in the examples considered in this paper. The interpretation of the oracle inequality is that a given estimator achieves a rate of convergence that is almost as good (up to an additional constant term that is typically the risk of the oracle) as if it had background knowledge about the sparsity.
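To make the role of the convex conjugate concrete, the following sketch computes it numerically for a quadratic margin function and compares the result with the closed form. The symbols are assumptions introduced for illustration: $G(u) = c\,u^2$ stands for a quadratic margin function, and its conjugate is taken as $H(v) = \sup_{u \ge 0}\{uv - G(u)\}$, which for this choice equals $v^2/(4c)$. The grid and the constant $c$ are arbitrary.

```python
import numpy as np

def convex_conjugate(G, v, u_grid):
    # Numerical approximation of H(v) = sup_{u >= 0} { u * v - G(u) } over a grid.
    return np.max(u_grid * v - G(u_grid))

c = 2.0  # hypothetical curvature constant

def G(u):
    # Quadratic margin function assumed for this illustration.
    return c * u ** 2

u_grid = np.linspace(0.0, 10.0, 100001)
vs = np.array([0.0, 0.5, 1.0, 3.0])

H_numeric = np.array([convex_conjugate(G, v, u_grid) for v in vs])
H_closed_form = vs ** 2 / (4.0 * c)

print(H_numeric)
print(H_closed_form)
```

With a quadratic margin, a larger curvature constant gives a smaller conjugate, which is one way to read the statement that more curvature makes the estimation problem easier.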

Remark 2.

Condition (2.1) is a bound for the difference between averages and means. We refer to it as the ‘Empirical Process Condition’. A main theme in the applications is to show that this condition holds with high probability for suitable constants.

Remark 3.

The term “sharp” refers to the constant ‘1’ in front of the risk in the upper bound of the oracle inequality. It also refers to the fact that the upper bound does not involve .

Remark 4.

The noise level needs to be chosen depending on the specific structure of the problem. The term is (in an asymptotic sense) of lower order than . Asymptotically, it does not influence the rates.

Remark 5.

The estimation error can be measured in the semi-norm by the two point margin condition or in the norm.

3 Applications to specific estimation problems

In this section several applications of Theorem 2.1 are presented. The first part is dedicated to the “usual” entrywise sparsity where the number of active parameters in the target/truth is assumed to be smaller than the problem dimension . In this first part the sparsity inducing norm is taken to be . In the last subsection we introduce a new estimator “Robust SLOPE” to demonstrate that our framework can be applied also to different penalizing norms.

3.1 Corrected linear regression

In this subsection we closely follow the notation in [17]. We consider the linear model for :



is a response variable and

are i.i.d. copies of a sub-Gaussian random vector with unknown positive definite covariance matrix , is unknown and

are i.i.d. copies of a sub-Gaussian random variable

independent of . We say that a random vector is sub-Gaussian if where for a real-valued random variable , is the Orlicz norm for the function , .

The matrix with rows may be additionally corrupted by additive noise in which case one would observe


The matrix is independent of and . Its rows are assumed to be i.i.d. copies of a sub-Gaussian random vector with expectation zero and known covariance matrix . Thus, the rows are i.i.d. copies of a random vector .

The estimator in this case is then given by


We assume that so that the vector lies within the region over which we compute the estimator. For ease of notation we define


The empirical risk is then given by


The first and second derivatives of the empirical risk are given by


It can be seen that in a high-dimensional setting () the matrix has negative eigenvalues due to the additional noise. The high-dimensional estimation problem is therefore nonconvex.
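The loss of convexity can be seen in a short simulation. The sketch below assumes, following the usual construction for additive measurement error, that the corrected covariance estimate has the form $Z^\top Z / n - \Sigma_W$, where $Z = X + W$ is the observed design and $\Sigma_W$ is the known noise covariance. The name `Gamma_hat`, the dimensions and the noise level are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 100          # high-dimensional regime: p > n
sigma_w = 0.5           # hypothetical noise standard deviation

X = rng.standard_normal((n, p))            # unobserved clean design
W = sigma_w * rng.standard_normal((n, p))  # additive measurement noise
Z = X + W                                  # observed, corrupted design
Sigma_W = sigma_w ** 2 * np.eye(p)         # known noise covariance

# Assumed form of the corrected covariance estimate.
Gamma_hat = Z.T @ Z / n - Sigma_W
eigenvalues = np.linalg.eigvalsh(Gamma_hat)

# Z^T Z / n has rank at most n, so at least p - n eigenvalues of Gamma_hat
# equal -sigma_w^2 and the quadratic empirical risk cannot be convex.
n_negative = int((eigenvalues < 0).sum())
print("smallest eigenvalue:", eigenvalues.min())
print("number of negative eigenvalues:", n_negative)
```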

On the other hand, the population version of the empirical risk is given by


The first and second derivatives are then given by


The population version of the estimation is therefore convex. The next lemma shows that the risk is not only convex but even strongly convex.

Lemma 3.1.

The two point margin condition is satisfied with and , where denotes the square root of .

The connection between the penalty and the norm is established in the following lemma that gives an expression for the effective sparsity (Definition 2.3).

Lemma 3.2.

For and we have for any set with that


We now state several lemmas that are used to establish the Empirical Process Condition (2.1).

Lemma 3.3.

Define . We then have for all and all

with probability at least .

The following lemma shows how the quadratic form involving the positive definite matrix is related to the (quadratic) margin function.

Lemma 3.4.

Define . We have for all


where and are the largest and smallest eigenvalues of the matrices and , respectively.

Lemma 3.5.

Define , , and for all , and for

Then we have for all


with probability at least .

Lemma 3.6.

Let be a constant. Define


with probability at least . If we choose

and if we assume that

then . Hence, the Empirical Process Condition (2.1) is satisfied.

Combining Lemma 3.6 with Theorem 2.1 we obtain the following corollary.

Corollary 3.1.

Suppose that the assumptions in Lemma 3.6 hold. Let be a stationary point of the optimization problem (3.3). Let be defined as

and . Then, we have with probability at least

As far as the asymptotics is concerned, we consider the case where the oracle is itself. We notice that the choice leads to


We thus recover the rates obtained in [17]. Furthermore, we notice that the rates of convergence depend on the smallest eigenvalue of the true covariance matrix. This is not surprising since the smallest eigenvalue measures the curvature of the population risk. The larger this eigenvalue is, the higher the curvature, and the “easier” the estimation problem becomes.

As far as estimators leading to convex optimization problems are concerned, [29] propose and analyze a method for the errors-in-variables model called the MU-selector, where MU stands for matrix uncertainty, for a deterministic noise matrix. In [30] the MU-selector is further improved to allow for random noise in the observations. The estimator is called the Compensated MU-selector and has a better estimation performance, similar to that of the method proposed in [17] and analyzed in the present paper. Two further estimators leading to convex optimization problems, based on $\ell_1$, $\ell_2$, and $\ell_\infty$ penalties, are proposed in [4]. Finally, [5] define an estimator that achieves minimax optimal rates up to a logarithmic term. [10] propose another (convex) method called Convex Conditioned Lasso (CoCoLasso), where the estimate of the covariance matrix such as in (3.4), which fails to be positive semidefinite in a high-dimensional setting, is replaced by a positive semidefinite matrix. In contrast to the previously mentioned papers, we also account for the case where the underlying regression function/curve is not necessarily a linear combination of the variables. The importance of the sharp oracle inequalities for the estimator given in equation (3.3) is to be seen in this additional property rather than in the derivation that bears the dependence on and .

3.2 Sparse PCA

Principal component analysis is a widely used dimension reduction technique. Its origins go back to [24] and [12]. Given an matrix with i.i.d. rows

the aim is to find a one dimensional representation of the data such that the variance explained by this representation is maximized. The empirical covariance matrix is given by

. We write that . The target

is then given by the eigenvector corresponding to the maximal eigenvalue of the covariance matrix

. An estimator for the first principal component is obtained by maximizing the empirical variance with respect to :


The solution of the optimization problem (3.14) is the eigenvector corresponding to the maximal eigenvalue of the sample covariance matrix. An equivalent form (after normalization) of the optimization problem (3.14) is the following minimization problem:


Both optimization problems (3.14) and (3.15) lead to the same solution after normalization. In this case, even if the optimization problem is nonconvex the solution can be easily computed by finding the eigenvector corresponding to the maximal eigenvalue of the sample covariance matrix .
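The unpenalized case can be sketched in a few lines: the leading eigenvector of the sample covariance matrix attains the maximal empirical variance among unit vectors. The sketch assumes centered data and the sample covariance $X^\top X / n$. The spiked covariance used to simulate the data, the dimensions and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 5

# Hypothetical population covariance with one dominant direction.
v = np.zeros(p)
v[0] = 1.0
Sigma = np.eye(p) + 4.0 * np.outer(v, v)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Sample covariance matrix (data assumed to be centered).
Sigma_hat = X.T @ X / n

# eigh returns eigenvalues in ascending order, so the last column is the leading eigenvector.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma_hat)
beta_hat = eigenvectors[:, -1]
top_variance = beta_hat @ Sigma_hat @ beta_hat

# Empirical variance along random unit directions never exceeds the top eigenvalue.
directions = rng.standard_normal((1000, p))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
random_variances = np.einsum("ij,jk,ik->i", directions, Sigma_hat, directions)

print("top eigenvalue:", eigenvalues[-1])
print("largest variance over random directions:", random_variances.max())
```

In this low-dimensional simulation the estimate aligns well with the simulated direction `v`. The inconsistency discussed next concerns the high-dimensional regime, which this sketch does not cover.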

A major drawback of PCA is that the first principal component is typically a linear combination of all the variables in the model. In many applications it is however desirable to sacrifice some variance in order to obtain a sparse representation that is easier to interpret. Furthermore, in a high-dimensional setting PCA has been shown to be inconsistent [14]. [21] shows that under the spiked covariance model ([13]) in a high-dimensional setting the eigenvector corresponding to the largest eigenvalue of is not able to recover the truth when the gap between the largest eigenvalue of and the second-largest is “small”.

We need to restrict to a neighborhood of one of the global optima in order to assure convexity and uniqueness of the minimum of the risk. Define . Let be the “oracle” as given in Section 2.

We consider the penalized optimization problem


where and are tuning parameters. The risk is given by


The first derivative of the risk is given by


The second derivative of the risk is given by


The (strong) convexity of the risk on the neighborhood

depends on the “signal strength”. In this case the latter is given by the largest singular value of the population covariance matrix

. The singular value decomposition of

is given by


where and with .

Assumption 1.
  1. We assume that the features are i.i.d. copies of a sub-Gaussian random vector with positive definite covariance matrix .

  2. It is assumed that for some

  3. We assume that .

Remark 6.

Assumption 1 is often referred to as a spikiness condition. It says that the signal should be sufficiently well separated from the other principal components.

Remark 7.

What needs to be further explained is the third assumption. In order for the population risk to be convex in the neighborhood we require a sufficiently large gap between the largest eigenvalue of the true covariance matrix and its remaining eigenvalues. One might object that assuming a “good” starting value is not realistic. However, a consistent initial estimate with a slow rate of convergence is given in [40].

The following lemma guarantees that the risk is strictly convex around one of the local minima of the population risk.

Lemma 3.7 (Lemma 12.7 in [38]).

Suppose that Assumption 1 is satisfied. Then for all we have


where is the smallest eigenvalue of the Hessian on the set .

The next lemma shows that the risk is indeed sufficiently convex.

Lemma 3.8.

Suppose that Assumption 1 is satisfied. The two point margin condition is satisfied on with and .

As we now have a different norm as compared to the sparse corrected linear regression case, we also obtain a different effective sparsity:

Lemma 3.9.

For and we have for any set with that


The following lemma shows that the Empirical Process Condition (2.1) holds with high probability for appropriate constants.

Lemma 3.10.

Define and for

Let be a constant. Then with and

we have for all


with probability at least . If we choose

we have . Hence, the Empirical Process Condition (2.1) is satisfied.

By combining Lemma 3.10 and Theorem 2.1 we obtain the following corollary.

Corollary 3.2.

Let be a stationary point of the optimization problem (3.16). Suppose that the conditions of Lemma 3.10 are satisfied. Let in particular be as in Lemma 3.10. Define

Then we have with probability at least


For the asymptotics we assume that . For simplicity, we take the oracle to be itself. Then and


We see that the rates depend on the gap between the largest eigenvalue of the matrix and the remaining eigenvalues. This is again not surprising since the estimation problem becomes “easier” the larger this gap is.

3.3 Robust regression

We consider the linear model for all and with i.i.d. copies of a sub-Gaussian random vector : .


where we assume that the distribution of the errors is symmetric around zero. We also assume that the errors are independent of the features. In the presence of outliers and heavy-tailed noise in the linear regression model, the quadratic loss typically fails due to its unbounded derivative. Alternatives to the quadratic loss include, for example, the Cauchy loss.
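The contrast between the two losses can be made concrete. The sketch below assumes one common parametrization of the Cauchy loss, $\rho(u) = \tfrac{c^2}{2}\log(1 + (u/c)^2)$, whose derivative $u / (1 + (u/c)^2)$ is bounded by $c/2$ and whose second derivative takes negative values, so the loss is nonconvex. The parametrization, the constant $c = 1$ and the function names are assumptions made for illustration.

```python
import numpy as np

def quadratic_loss_derivative(u):
    # Derivative of u^2 / 2: grows without bound in the residual u.
    return u

def cauchy_loss(u, c=1.0):
    return 0.5 * c ** 2 * np.log1p((u / c) ** 2)

def cauchy_loss_derivative(u, c=1.0):
    # Bounded in absolute value by c / 2, attained at u = +/- c.
    return u / (1.0 + (u / c) ** 2)

def cauchy_loss_second_derivative(u, c=1.0):
    # Negative for |u| > c, which is the source of nonconvexity.
    s = (u / c) ** 2
    return (1.0 - s) / (1.0 + s) ** 2

u = np.linspace(-100.0, 100.0, 20001)
max_abs_derivative = np.max(np.abs(cauchy_loss_derivative(u)))
min_second_derivative = np.min(cauchy_loss_second_derivative(u))

# Influence of a single gross outlier with residual 100 under each loss.
outlier = 100.0
print("quadratic:", quadratic_loss_derivative(outlier))
print("Cauchy:   ", cauchy_loss_derivative(outlier))
print("max |rho'|:", max_abs_derivative, " min rho'':", min_second_derivative)
```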

The empirical risk is given by


Its first derivative is given by


Its second derivative is given by

Assumption 2.
  1. Lipschitz continuity of the loss: there exists such that

  2. Lipschitz continuity of the first derivative of the loss: there exists such that

  3. Local curvature condition: Define the tail probability as

    It is assumed that for

We notice that for our framework we need to assume that also the first derivative of the loss is Lipschitz continuous. In [16] the assumption is weaker in the sense that it is only required that the second derivative of the loss is not “too negative”.

The usual (typically uncomputable) “argmin”-type estimator is then given by


where and are tuning parameters.

We now cite a proposition from [16] that establishes the restricted strong convexity conditions. It shows how the different (tuning) parameters are intertwined.

Proposition 3.1 (Adapted from Proposition 2 in [16]).

Suppose that are i.i.d. copies of a sub-Gaussian random vector with positive definite covariance matrix . Assume also that


and that the loss function satisfies Assumption 2 and that . Then we have with probability at least for all