Implicit Regularization via Hadamard Product Over-Parametrization in High-Dimensional Linear Regression

03/22/2019, by Peng Zhao, et al.

We consider Hadamard product parametrization as a change-of-variable (over-parametrization) technique for solving least squares problems in the context of linear regression. Despite the non-convexity and exponentially many saddle points induced by the change-of-variable, we show that under certain conditions, this over-parametrization leads to implicit regularization: if we directly apply gradient descent to the residual sum of squares with sufficiently small initial values, then under a proper early stopping rule, the iterates converge to a nearly sparse, rate-optimal solution with better accuracy than explicitly regularized approaches. In particular, the resulting estimator does not suffer from extra bias due to explicit penalties, and can achieve the parametric root-n rate (independent of the dimension) under proper conditions on the signal-to-noise ratio. We perform simulations to compare our method with explicitly regularized approaches to high-dimensional linear regression. Our results illustrate the advantages of using implicit regularization via gradient descent after over-parametrization in sparse vector estimation.


1 Introduction

In high dimensional linear regression

y = Xβ* + ε,

the goal is to parsimoniously predict the response y ∈ ℝ^n as a linear combination of a large number of covariates X = (X₁, …, X_p) ∈ ℝ^{n×p}, and to conduct statistical inference on the linear combination coefficients β* (Tibshirani, 1996; Donoho, 2006). By leveraging certain lower dimensional structure in the regression coefficient vector β*, such as the sparsity constraint ‖β*‖₀ ≤ s, where ‖·‖₀ counts the number of nonzeros, the number of covariates p is allowed to be substantially larger than the sample size n. Due to the intrinsic computational hardness of dealing with the ℓ₀ metric reflecting sparsity, people instead use different metrics as surrogates, and cast the estimation problem into various convex or nonconvex optimization problems. Many approaches have been proposed for high dimensional regression by solving certain penalized optimization problems, including basis pursuit (Chen et al., 2001), the Lasso (Tibshirani, 1996), the Dantzig selector (Candes and Tao, 2007), SCAD (Fan and Li, 2001), MCP (Zhang, 2010) and so on. In this work, we focus on the recovery of β* without explicitly specifying a penalty.

Recent work (Hoff, 2017) shows that through a change-of-variable (over-parametrization) via the Hadamard product parametrization β = u ∘ v, the Lagrangian (dual) form of the non-smooth convex optimization problem for the Lasso (1):

min_{β ∈ ℝ^p}  (1/(2n)) ‖y − Xβ‖₂² + λ ‖β‖₁    (1)

can be reformulated as a smooth optimization problem at the cost of introducing non-convexity. Due to the smoothness, simple and low-cost first-order optimization methods such as gradient descent and coordinate descent can be applied to recover β*. Despite the non-convexity and exponentially many stationary points induced by the change-of-variable, these first-order algorithms exhibit encouraging empirical performance (Hoff, 2017).

In this work, we consider the same Hadamard product over-parametrization β = u ∘ v as in Hoff (2017), where u, v ∈ ℝ^p and ∘ denotes the Hadamard product (element-wise product). Instead of solving the penalized optimization problem (1), we consider directly applying gradient descent to the quadratic loss function

f(u, v) = (1/(2n)) ‖y − X(u ∘ v)‖₂².    (2)

In the noiseless case where y = Xβ*, minimizing f(u, v) jointly over (u, v) is a highly non-convex optimization problem with exponentially many saddle points. To see this, notice that each non-zero pattern of β = u ∘ v corresponds to at least one saddle point, obtained by choosing u_j = v_j = 0 for each j such that β_j = 0 and solving the first-order condition over the remaining coordinates. In addition, due to the column rank deficiency of the design matrix X (for example, when p > n), there are infinitely many global minimizers of (2) as potential convergent points of the gradient descent. Interestingly, we show that despite these seemingly hopeless difficulties, in the noiseless case if we initialize the gradient descent arbitrarily close to zero, then under the prominent Restricted Isometry Property (RIP) condition (Candes, 2008) on the design matrix X, a properly tuned gradient descent converges to the least ℓ₁-norm solution up to a prescribed error, within a number of iterations that depends on the RIP constant, the step size of the gradient descent, and some other characteristics of the problem. Our proofs borrow ideas from Li et al. (2018), where they prove the algorithmic convergence of matrix factorized gradient descent in the context of noiseless matrix sensing under the RIP.

In high dimensional regression, the usual ℓ₁-regularized least squares is known to suffer from the so-called saturation phenomenon (Vito et al., 2005; Yao et al., 2007), where the overall estimation error is dominated by a bias term due to the penalty. In particular, since the regularization is artificially introduced for restricting the “effective size” of the parameter space, the resulting estimator may be deteriorated, and the estimation error cannot fall below the penalty level to adapt to a possibly faster convergence rate. For example, the estimator obtained by solving the Lasso achieves the minimax rate of order σ√(s log p / n). However, this worst-case rate only happens when there exist weak signals, meaning that some nonzero components of β* have a borderline magnitude of order σ√(log p / n). In fact, if all signals are sufficiently strong, that is, significantly larger than this borderline magnitude, then the faster dimension-independent parametric rate σ√(s/n) is attainable. For regularized approaches such as the Lasso, one possible way to remove the penalty-induced bias term (which is of the same order as the penalty level) is to refit the model with the selected variables. However, this two-stage procedure requires stringent assumptions on the minimal signal strength to guarantee variable selection consistency in the first stage, and will suffer in the presence of weak signals. Interestingly, we show that by combining the Hadamard product over-parametrization with early stopping, a widely used regularization technique in boosting (Zhang and Yu, 2005) and nonparametric regression (Raskutti et al., 2014), our method can overcome the saturation issue and lead to more accurate estimation. More precisely, in the presence of random noise in the linear model, the solution path of minimizing the quadratic loss function (2), as we increase the number of gradient descent iterations, still tends to converge to the least ℓ₁-norm solution, which will overfit the data. Fortunately, by terminating the gradient descent updates within a proper number of iterations, we may find a solution that optimally balances the model complexity (reflected by the increasing ℓ₁-norm of the iterate) and the goodness of fit of the model, akin to the bias-variance trade-off. In particular, we show that the estimator can adapt to the optimal convergence rate of σ√(s/n) when all signals are relatively strong. More generally, when both strong and weak signals exist, our estimator attains a rate determined by the numbers of strong and weak signals (denoted s₁ and s₂, respectively), interpolating between the parametric rate and the typical √(s log p / n) rate.

Our result also complements the recent surge of literature on over-parametrization for implicit regularization of first-order iterative methods for non-convex optimization in machine learning. Gunasekar et al. (2017) introduce the phenomenon of implicit regularization in matrix factorization, where they empirically observe the convergence of gradient methods for the matrix factorization problem to the minimal nuclear norm solution as the initial value tends to zero. However, they only provide a heuristic illustration under some hard-to-check assumptions, such as the continuity of the solution relative to changes in the initialization. Later, Li et al. (2018) rigorously prove the implicit regularization in the matrix sensing problem under a matrix RIP condition. Some other very recent works, such as Gunasekar et al. (2018) and Soudry et al. (2018), study implicit regularization in mirror descent and in classification problems. Note that all of the above works on implicit regularization only consider data that are either noiseless (regression) or perfectly separable (classification). To the best of our knowledge, we are the first to rigorously study and utilize implicit regularization in high dimensional linear regression where responses are noisy.

In a nutshell, we show that through a simple change-of-variable, the non-smooth ℓ₁-penalized optimization problem (1) can be transformed into an unconstrained smooth quadratic loss minimization problem; moreover, a simple gradient descent initialized near zero efficiently solves this non-convex optimization problem with provable guarantees. Furthermore, our method enjoys several advantages over existing procedures for high dimensional linear regression under sparsity constraints. First, our method is computationally efficient — its per-iteration cost is O(np), which is linear in both n and p. Second, despite its non-convex nature, our method has a natural initialization that provably leads to the optimal solution. In comparison, penalized M-estimators based on non-convex penalties such as SCAD and MCP require stringent conditions on their initializations: to obtain good estimators, they require initial values that are sufficiently close to the truth (theoretically) or satisfy some restricted strong convexity conditions (Zhao et al., 2018); otherwise their optimization algorithms may get stuck at bad local minima with poor generalization errors. In contrast, our algorithm only requires the initialization to be close to zero. Moreover, unlike penalized approaches such as SCAD and MCP, where both a parameter for the noise level and a parameter for the concavity of the penalty need to be tuned, our method only needs to tune the number of iterations.

To conclude, our main contributions relative to the related literature are as follows:

  1. We propose an estimator combining early stopping with implicit regularization to overcome the saturation issue suffered by explicitly regularized approaches to high dimensional regression;

  2. Unlike the recent implicit regularization literature that exclusively focuses on noiseless data, we are the first to rigorously study the effect of implicit regularization for noisy data;

  3. From a computational perspective, we transform the non-smooth ℓ₁-penalized optimization problem into an unconstrained smooth quadratic loss minimization problem to which standard optimization tools can be applied.

2 Background and Our Method

To begin with, we formally introduce the setup and notations used throughout the paper. After that, we introduce the intuition behind our new implicitly regularized algorithm for high dimensional linear regression via the Hadamard product parametrization.

2.1 Setup and notations

Recall that β* is the unknown s-sparse signal in ℝ^p to be recovered. Let S ⊆ {1, …, p} denote the index set that corresponds to the nonzero components of β*; the size of S is then s = |S|. For two vectors u, v ∈ ℝ^p, we call u ∘ v their Hadamard product, whose components are (u ∘ v)_j = u_j v_j for j = 1, …, p. For two vectors a, b ∈ ℝ^p, we use the notation a ⪰ b to indicate element-wise “greater than or equal to”. When there is no ambiguity, we use u² = u ∘ u to denote the self-Hadamard product of u. For a function f(u, v): ℝ^p × ℝ^p → ℝ, we use ∇_u f and ∇_v f to denote its partial gradients relative to u and v, respectively. For any index set A ⊆ {1, …, p} and vector a ∈ ℝ^p, we use a_A to denote the sub-vector of a formed by concatenating the components indexed by A. Let 1_p denote the p-dimensional vector with all entries equal to one, and I_p the identity matrix in ℝ^{p×p}. Let I_A be the diagonal matrix with one on the j-th diagonal entry for j ∈ A and zero elsewhere. For a vector a, we use ‖a‖_q to denote its vector ℓ_q-norm for q ≥ 1, and ‖a‖_∞ its ℓ_∞-norm. Let Unif([a, b]) denote the uniform distribution over the interval [a, b]. For a symmetric matrix A, let λ_min(A) denote its smallest eigenvalue. For two sequences {a_n} and {b_n}, we use the notation a_n ≲ b_n or a_n ≳ b_n to mean that there exists some constant C > 0 independent of n such that a_n ≤ C b_n or a_n ≥ C b_n for all n, respectively, and a_n ≍ b_n to mean that both a_n ≲ b_n and a_n ≳ b_n hold.

2.2 Gradient descent with Hadamard product parametrization

As we mentioned in the introduction, we consider augmenting the p-dimensional vector β into two p-dimensional vectors (u, v) through β = u ∘ v. Instead of solving the Lasso problem (1) with β replaced by u ∘ v, we consider directly applying gradient descent to the quadratic loss function f(u, v) in (2). In particular, we apply the updating formulas u_{t+1} = u_t − η ∇_u f(u_t, v_t) and v_{t+1} = v_t − η ∇_v f(u_t, v_t), with random initial values u_0 and v_0 chosen close enough to zero (notice that (0, 0) is a saddle point of the objective function, so we need to apply a small perturbation to the initial values). This leads to the following algorithm:

Data: Design matrix X, measurement vector y, initialization magnitude α, step size η, and stopping threshold ε;
Initialize variables u_{0,j}, v_{0,j} ∼ Unif([−α, α]) independently for j = 1, …, p, and iteration number t = 0;
while ‖∇f(u_t, v_t)‖₂ > ε do
       u_{t+1} = u_t − (η/n) v_t ∘ X^T(X(u_t ∘ v_t) − y);
       v_{t+1} = v_t − (η/n) u_t ∘ X^T(X(u_t ∘ v_t) − y);
       t = t + 1;
end while
Result: Output the final estimate β̂ = u_t ∘ v_t;
Algorithm 1: Gradient descent for linear regression
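For concreteness, here is a minimal NumPy sketch of these updates; the Unif([−α, α]) initialization, the 1/(2n) loss scaling, and running for a fixed number of iterations (rather than a gradient-norm threshold) are assumptions of this sketch, not verbatim details from the paper.

```python
import numpy as np

def hadamard_gd(X, y, alpha=1e-6, eta=0.05, n_iter=2000, seed=0):
    """Gradient descent on f(u, v) = ||y - X(u*v)||^2 / (2n), returning beta = u*v."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    u = rng.uniform(-alpha, alpha, size=p)   # small random initialization near the
    v = rng.uniform(-alpha, alpha, size=p)   # saddle point (0, 0)
    for _ in range(n_iter):
        r = X.T @ (X @ (u * v) - y) / n      # least-squares gradient in beta = u o v
        u, v = u - eta * v * r, v - eta * u * r   # simultaneous updates of u and v
    return u * v
```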

Algorithm 1 is the standard gradient descent algorithm, and the iterates tend to converge to a stationary point of f that satisfies the first-order optimality conditions ∇_u f(u, v) = 0 and ∇_v f(u, v) = 0. However, stationary points of f can be local minima, local maxima, or saddle points (when the Hessian matrix contains both positive and negative eigenvalues). The following result describes the optimization landscape of f, showing that f has no local maxima, all its local minima are global minima, and all its saddle points are strict. Strict saddle points are saddle points whose Hessian matrix has a strictly negative smallest eigenvalue.

Lemma 2.1.

The function f has no local maxima, and all of its local minima are global minima. In particular, (u, v) is a global minimum of f if and only if

X^T ( X(u ∘ v) − y ) = 0.

In addition, any saddle point (u, v) of f is a strict saddle, that is, λ_min(∇² f(u, v)) < 0.

According to the first-order condition associated with f,

v ∘ [ X^T(X(u ∘ v) − y) ] = 0   and   u ∘ [ X^T(X(u ∘ v) − y) ] = 0,

there could be exponentially many (at least 2^s − 1) saddle points solving this equation — for example, those obtained by setting u_j = v_j = 0 for all j in A ∪ S^c and solving the restricted normal equation on S ∖ A, for any non-empty subset A of S. Consequently, the gradient descent algorithm may converge to any of these bad saddle points. To see this, note that if we initialize in such a way that the components in an index set A are zero, i.e. u_{0,A} = v_{0,A} = 0, then these components will remain zero throughout the gradient iterations, i.e. u_{t,A} = v_{t,A} = 0 for all t. Fortunately, the following result implies that as long as we use a random initialization for (u_0, v_0) with a continuous probability density over [−α, α]^{2p}, as in Algorithm 1, the gradient descent almost surely converges to a global minimum.

Lemma 2.2.

Suppose the step size η is sufficiently small. Then with probability one, Algorithm 1 converges to a global minimum of f.

In the low-dimensional regime where the design matrix X has full column rank, the solution to the normal equation X^T X β = X^T y is unique, and it is also the least squares estimator. Under this scenario, Lemma 2.1 and Lemma 2.2 together certify that Algorithm 1 will converge to this optimal least squares estimator. However, in the high-dimensional regime that is the main focus of this paper, the normal equation has infinitely many solutions, and it is not clear which solution the algorithm tends to converge to. For example, if we instead apply gradient descent to the original parameter β in the least squares objective with initialization β_0 = 0, then the iterates will converge to the minimal ℓ₂-norm solution of the normal equation (see below for details). Interestingly, as we will illustrate in the following, under the Hadamard parametrization the gradient descent Algorithm 1 now tends to converge to the minimal ℓ₁-norm solution under certain conditions on the initialization, thereby inducing sparsity and naturally facilitating variable selection.

2.3 Gradient descent converges to sparse solution

In this subsection, we provide two different perspectives for understanding the following informal statement about the behavior of simple gradient descent for the loss function f defined in (2) under the Hadamard product parametrization β = u ∘ v. For simplicity, we assume throughout this subsection that the response in the linear model is noiseless, that is, σ = 0. In the next subsection, we turn to general noisy observations and propose methods that lead to optimal estimation when the true regression coefficient β* is sparse.

Informal Statement:

If we initialize the algorithm arbitrarily close to (u_0, v_0) = (0, 0), then under suitable conditions on the design X, a simple gradient descent converges to a solution of the basis pursuit problem:

min_{β ∈ ℝ^p} ‖β‖₁   subject to   Xβ = y.    (3)

Our first perspective is based on the ℓ₂-norm implicit regularization in linear systems, and the second is based on analyzing the gradient dynamical system obtained as the limit of the gradient descent algorithm as the step size η → 0. However, both perspectives in this section are heuristic, and formal statements and proofs (based on a different strategy) will be provided in Section 3.

ℓ₂-norm implicit regularization perspective:

Consider the under-determined linear system Xβ = y, where X ∈ ℝ^{n×p} with n < p has full row rank. Our first intuition comes from the fact that a zero-initialized gradient descent algorithm over β for solving

min_{β ∈ ℝ^p} (1/2) ‖y − Xβ‖₂²

finds the minimal ℓ₂-norm solution to Xβ = y.

In fact, we know that any solution to the linear system Xβ = y takes the form β = X⁺ y + ξ with Xξ = 0, where X⁺ = X^T(XX^T)^{-1} is the Moore–Penrose inverse of X. Since X⁺ y lies in the column space of X^T, it is the unique solution of Xβ = y in the column space of X^T, which is also the minimal ℓ₂-norm solution. Now the gradient updating formula for β, β_{t+1} = β_t + η X^T(y − Xβ_t), implies that the difference β_{t+1} − β_t always lies in the column span of X^T. Let β_∞ be the limiting point of the gradient algorithm. Then β_∞ must be a solution to Xβ = y. On the other hand, when β is initialized at zero, β_∞ must also belong to the column span of X^T. These two properties combined imply that β_∞ must be the minimal ℓ₂-norm solution X⁺ y.

In high dimensional linear regression with perfect observations, a popular class of penalization methods attempts to find the minimal ℓ₁-norm solution to Xβ = y. Under the Hadamard product parametrization β = u ∘ v, the fact that gradient descent tends to find the minimal ℓ₂-norm solution suggests (this is not rigorous) that the gradient descent algorithm for jointly minimizing f(u, v) over (u, v) tends to converge to a solution of X(u ∘ v) = y with minimal ℓ₂-norm ‖u‖₂² + ‖v‖₂². However, a minimal ℓ₂-norm solution to X(u ∘ v) = y must satisfy |u_j| = |v_j| for each j (otherwise we could construct another solution with strictly smaller ℓ₂-norm), which implies ‖u‖₂² + ‖v‖₂² = 2‖u ∘ v‖₁. As a consequence, u ∘ v should be the minimal ℓ₁-norm solution to Xβ = y.
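This first fact is easy to check numerically; the snippet below (our own illustration on an arbitrary toy problem, not from the paper) verifies that zero-initialized gradient descent on an under-determined least squares problem lands on the minimal ℓ₂-norm solution X⁺y.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.standard_normal((n, p))              # full row rank with probability one
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true                            # noiseless, under-determined system

beta = np.zeros(p)                           # zero initialization
eta = 1.0 / np.linalg.norm(X, 2) ** 2        # step size below 2/L for this quadratic loss
for _ in range(20000):
    beta += eta * X.T @ (y - X @ beta)       # gradient descent on (1/2)||y - X beta||^2

beta_min_norm = np.linalg.pinv(X) @ y        # minimal l2-norm solution
print(np.max(np.abs(beta - beta_min_norm)))  # agrees up to numerical error
```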

Another way to understand the difference between the evolutions of gradient descent for β and for (u, v) is by noticing that the gradient in the new parametrization has, for each coordinate j, an extra multiplicative factor v_j (respectively u_j) compared with the gradient in the usual least squares problem of minimizing ‖y − Xβ‖₂². It is precisely this extra multiplicative factor that helps select important signals (nonzero regression coefficients) and prevents unimportant signals (zero regression coefficients) from growing too fast at the early stage of the evolution, when both u and v are close to zero. Precisely, as we will show in our theory part (Section 3), under suitable conditions on the model, all unimportant signals remain in a neighbourhood of zero, while important ones tend to grow exponentially fast.
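To make the multiplicative factor explicit, the coordinate-wise chain rule under the loss (2), as reconstructed above (so the 1/n scaling is an assumption), reads:

```latex
% Let r = \tfrac{1}{n} X^\top\bigl(X(u \circ v) - y\bigr) be the usual least-squares gradient in \beta.
% Differentiating f(u,v) = \tfrac{1}{2n}\lVert y - X(u \circ v)\rVert_2^2 coordinate-wise gives
\frac{\partial f}{\partial u_j}(u,v) = v_j\, r_j,
\qquad
\frac{\partial f}{\partial v_j}(u,v) = u_j\, r_j .
% Coordinates with tiny |u_j|, |v_j| therefore receive tiny gradients and barely move,
% while coordinates persistently correlated with the residual grow multiplicatively.
```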

Gradient dynamical system perspective:

Our second perspective comes from considering the limiting gradient dynamical system of the problem (i.e., gradient descent with an infinitesimally small step size), which is motivated by the interpretation of matrix factorization problems in Gunasekar et al. (2017) and Gunasekar et al. (2018). In particular, the behavior of this limiting dynamical system is captured by the ordinary differential equations

du(t)/dt = −∇_u f(u(t), v(t)),   dv(t)/dt = −∇_v f(u(t), v(t)),    (4)

where the initialization has magnitude α and, for simplicity, its form is fixed. To emphasize the dependence of the solution on α, we write u(t) and v(t) as u_α(t) and v_α(t). For illustration purposes, we assume that the limiting point of this system is continuous and bounded as the initialization value α → 0, that is, both limits lim_{α→0} lim_{t→∞} u_α(t) and lim_{α→0} lim_{t→∞} v_α(t) exist in ℝ^p and are finite.

Let r_α(t) = X^T( y − X(u_α(t) ∘ v_α(t)) )/n denote the residual correlation; then a simple calculation based on (4) shows that each product β_{α,j}(t) = u_{α,j}(t) v_{α,j}(t) is driven by the accumulated correlation ∫₀^t r_{α,j}(s) ds, with the initialization entering only through the scale α. Under the aforementioned assumption on the existence of the limits as α → 0 and t → ∞, this relation implies one of the following three situations for each coordinate j, writing β_j^∞ for the limit of β_{α,j}(t) and ν_j for the limit of the suitably normalized accumulated correlation:

  1. β_j^∞ > 0, and ν_j = 1.

  2. β_j^∞ < 0, and ν_j = −1.

  3. β_j^∞ = 0, and ν_j ∈ [−1, 1].

Recall that r_α(t), and hence ν, always lies in the column span of X^T, so ν = X^T λ for some λ ∈ ℝ^n, and the previous three cases can be unified into

ν_j ∈ ∂|β_j^∞|   for every j = 1, …, p.

This identity, together with the limiting point condition Xβ^∞ = y, coincides with the KKT condition for the basis pursuit problem (3).
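For reference, the KKT conditions of the basis pursuit problem (3) that the limit point is being matched against are the standard ones from convex duality (not specific to this paper):

```latex
% Basis pursuit:  \min_\beta \|\beta\|_1  subject to  X\beta = y.
% KKT: primal feasibility plus a dual vector \lambda \in \mathbb{R}^n with
X\beta = y, \qquad X^\top \lambda \in \partial\|\beta\|_1 ,
% i.e. (X^\top\lambda)_j = \operatorname{sign}(\beta_j) whenever \beta_j \neq 0,
% and |(X^\top\lambda)_j| \le 1 otherwise.
```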

Again, this argument is based on the hard-to-check solution continuity assumption. In the next section, we provide a formal proof of the result without making this assumption, albeit under a somewhat strong RIP condition on the design matrix X. We conjecture this gradient descent implicit regularization property to be true for a much wider class of design matrices (see our simulation section for numerical evidence).

2.4 Gradient descent with early stopping

In this subsection, we consider the general case where the response contains noise, i.e. σ > 0. In particular, we propose applying early stopping, a widely used implicit regularization technique (Zhang and Yu, 2005; Raskutti et al., 2014), to the gradient descent Algorithm 1. As the name suggests, we use a certain criterion to decide when to terminate the gradient descent updates so as to prevent overfitting of the data. Algorithm 2 below summarizes a particular stopping criterion, widely used in the machine learning community, based on validation. In principle, we can also treat the number of iterations as a hyperparameter and repeat this procedure multiple times in the same spirit as data splitting and cross validation.

Data: Training design matrix X, measurement vector y, validation data X^{val}, y^{val}, initialization magnitude α, step size η, and maximal number of iterations T;
Initialize variables u_{0,j}, v_{0,j} ∼ Unif([−α, α]) independently for j = 1, …, p, and iteration number t = 0;
while t < T do
       u_{t+1} = u_t − (η/n) v_t ∘ X^T(X(u_t ∘ v_t) − y);
       v_{t+1} = v_t − (η/n) u_t ∘ X^T(X(u_t ∘ v_t) − y);
       t = t + 1;
end while
Result: Choose the first t̂ at which the validation error ‖y^{val} − X^{val}(u_t ∘ v_t)‖₂² is minimized over all iterations t ≤ T, then output the final estimate β̂ = u_{t̂} ∘ v_{t̂}.
Algorithm 2: Gradient descent for linear regression with validation data
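The sketch below mirrors Algorithm 2 in NumPy; the initialization scheme, loss scaling, and the choice to track the best validation iterate online are assumptions of this illustration rather than the authors' exact implementation.

```python
import numpy as np

def hadamard_gd_early_stop(X, y, X_val, y_val, alpha=1e-6, eta=0.05, max_iter=5000, seed=0):
    """Run Hadamard-parametrized GD and return the iterate with smallest validation error."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    u = rng.uniform(-alpha, alpha, size=p)
    v = rng.uniform(-alpha, alpha, size=p)
    best_err, best_beta = np.inf, u * v
    for _ in range(max_iter):
        r = X.T @ (X @ (u * v) - y) / n           # gradient of the quadratic loss in beta
        u, v = u - eta * v * r, v - eta * u * r
        beta = u * v
        val_err = np.mean((y_val - X_val @ beta) ** 2)
        if val_err < best_err:                    # early stopping via hold-out validation
            best_err, best_beta = val_err, beta
    return best_beta
```

In practice one would also record the entire validation-error path, which is what the criteria (6)–(7) below operate on.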

Recall that in the introduction we discussed the saturation issue suffered by explicitly penalized methods such as the Lasso. We now turn to our method and illustrate that it is unaffected by, or at least suffers less from, the saturation issue. In the next section, we will provide rigorous theory showing that our method can achieve a faster rate of convergence than the typical √(s log p / n) rate when all nonzero signals are relatively large.

Due to the connection of our method with the basis pursuit problem (3), one may naturally think that our method in the noisy case should be equivalent to a basis pursuit denoising problem:

min_{β ∈ ℝ^p} ‖β‖₁   subject to   ‖y − Xβ‖₂ ≤ ε,    (5)

with some error tolerance level ε depending on the stopping criterion, and would therefore be equivalent to the Lasso. Surprisingly, a simulation example below shows that the iterate path of the gradient descent Algorithm 1 contains estimates with much smaller error than the Lasso. Precisely, we adopt the simulation setting S2 in Section 4.2. As comparisons, we also report the Lasso solution path (as a function of the regularization parameter λ) solved by ISTA and FISTA (Beck and Teboulle, 2009). For our gradient descent algorithm, we use a small magnitude α for the random initialization. From Figure 1, in the regime where the iterates stabilize, even though the prediction errors in panel (c) of our algorithm and of the Lasso (with an optimally tuned λ; see panel (b) for the entire Lasso solution path) are comparable, the estimation error in panel (a) of our method is significantly lower than that of the Lasso, illustrating the occurrence of the saturation phenomenon for the Lasso. Moreover, the stabilized region of our method (GD) in panel (a) is relatively wide, and therefore the performance tends to be robust to the stopping criterion.

Figure 1: (a) log-log plot of the standardized estimation error versus the iteration number; (b) log-log plot of the standardized estimation error versus the regularization parameter for the Lasso; (c) log-log plot of the mean prediction error versus the iteration number.

Next, let us briefly illustrate why implicit regularization with early stopping works, while explicitly regularized methods may fail. Early stopping, serving as algorithmic regularization, is based on the intuition that as the iteration number grows, the bias keeps decreasing while the variance keeps increasing. Consequently, the iteration number, acting as an implicit regularization parameter, aims to optimally balance the bias and the variance, akin to the bias-variance trade-off. In our parametrization, the iteration number controls the ℓ₁ norm of the solution, reflecting the model space complexity. To see this, we plot the ℓ₁ norm versus the iteration number, and also the estimation errors versus the ℓ₁ norm, all in logarithmic scales, in Figure 2. As expected, as the number of iterations increases, the ℓ₁ norm of the iterate also increases. Over a wide range of iteration numbers, the ℓ₁ norm of the estimated coefficients stabilizes near the ℓ₁ norm of the true β*, corresponding to the most accurate estimation region in panel (a) of Figure 1. In contrast, as we can see from panel (b) of Figure 2, the estimation error is very sensitive in the regularization-parameter domain — the region corresponding to the smallest estimation error is very narrow, and a small change in the ℓ₁ norm of the solution leads to a drastic deterioration in the estimation accuracy. This numerical example provides evidence of why explicitly regularized approaches may suffer from large bias and low accuracy.

Figure 2: (a) log-log plot of the ℓ₁ norm of the estimated coefficients versus the iteration number; (b) log-log plot of the standardized estimation error versus the ℓ₁ norm of the estimated coefficients.

Finally, we discuss several commonly used early stopping rules by treating the iteration number as a tuning parameter.

Hold-out or cross validation:

The simplest method is to use hold-out data for validation: for example, we can split the training data into two halves, run gradient descent on the first half (X^{(1)}, y^{(1)}) while calculating the prediction error of every iterate on the second half (X^{(2)}, y^{(2)}), and then decide the final iteration number by (cf. Algorithm 2):

t̂ = argmin_{1 ≤ t ≤ T} ‖y^{(2)} − X^{(2)}(u_t ∘ v_t)‖₂²,    (6)

or by the first iteration at which the hold-out error stops decreasing,

t̂ = min{ t ≥ 1 : ‖y^{(2)} − X^{(2)}(u_{t+1} ∘ v_{t+1})‖₂² ≥ ‖y^{(2)} − X^{(2)}(u_t ∘ v_t)‖₂² }.    (7)

To make use of the whole dataset, we can perform cross validation: first split the data into K folds, then apply gradient descent on all possible combinations of K − 1 folds and evaluate the prediction error on the corresponding remaining fold. The final risk can be taken as the sum of the evaluations over all folds, and criterion (6) or (7) can then be used to obtain the iteration number. Finally, we apply the same iteration number obtained from cross validation to approximate the optimal one for the entire training data.
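A corresponding sketch of the K-fold variant (the fold count, step size, and the rule of summing fold-wise validation errors before taking the argmin are placeholders for illustration):

```python
import numpy as np

def cv_select_iteration(X, y, K=5, alpha=1e-6, eta=0.05, max_iter=3000, seed=0):
    """Pick the iteration number minimizing the summed K-fold validation error."""
    n, p = X.shape
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    risk = np.zeros(max_iter)
    for k in range(K):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        Xtr, ytr, Xva, yva = X[tr], y[tr], X[val], y[val]
        rng = np.random.default_rng(seed + 1 + k)
        u = rng.uniform(-alpha, alpha, size=p)
        v = rng.uniform(-alpha, alpha, size=p)
        for t in range(max_iter):
            r = Xtr.T @ (Xtr @ (u * v) - ytr) / len(tr)
            u, v = u - eta * v * r, v - eta * u * r
            risk[t] += np.mean((yva - Xva @ (u * v)) ** 2)   # accumulate fold-wise errors
    return int(np.argmin(risk))
```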

Stein’s unbiased risk estimate (SURE):

Stein (1981) suggested the use of degrees of freedom as a surrogate for the true prediction error when the noise standard deviation σ is given. Under our setting, ignoring second-order terms in the step size η, the evolution of the fitted values through gradient descent can be approximated by

X β_{t+1} ≈ X β_t + (η/n) X diag(u_t ∘ u_t + v_t ∘ v_t) X^T (y − X β_t),    (8)

where, for a vector a, diag(a) denotes the diagonal matrix with the components of a on its diagonal. Define S_t = (η/n) X diag(u_t ∘ u_t + v_t ∘ v_t) X^T; then the estimated degrees of freedom at iteration t can be approximated by df(t) = tr(H_t), where H_t is the approximate linear smoother defined recursively by H_{t+1} = H_t + S_t(I_n − H_t) with H_0 = 0. Consequently, the risk based on the C_p-type statistic (Efron, 2004) is

R̂(t) = (1/n) ‖y − X β_t‖₂² + (2σ²/n) df(t).    (9)

The total iteration number, as a tuning parameter, can then be selected by minimizing R̂(t) as in criterion (6) or (7). In practice, we can use a plug-in estimator σ̂ to replace the unknown σ in (9). According to our simulation studies (for example, see Figure 3), early stopping based on SURE generally does not work as well as the hold-out or cross validation method.
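A sketch of the resulting C_p-type early stopping rule, using the first-order smoother approximation in (8)–(9) above (which is itself a reconstruction, so the exact form of the risk is an assumption; σ is assumed known or plugged in):

```python
import numpy as np

def cp_risk_path(X, y, sigma, alpha=1e-6, eta=0.05, max_iter=2000, seed=0):
    """Approximate Cp-type risk along the GD path; select t = argmin of the returned array."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    u = rng.uniform(-alpha, alpha, size=p)
    v = rng.uniform(-alpha, alpha, size=p)
    H = np.zeros((n, n))                       # approximate linear smoother: y_hat_t ~= H_t y
    risks = np.empty(max_iter)
    for t in range(max_iter):
        S = (eta / n) * X @ np.diag(u * u + v * v) @ X.T   # first-order smoother increment S_t
        H = H + S @ (np.eye(n) - H)                        # H_{t+1} = H_t + S_t (I - H_t)
        r = X.T @ (X @ (u * v) - y) / n
        u, v = u - eta * v * r, v - eta * u * r
        resid = y - X @ (u * v)
        risks[t] = resid @ resid / n + 2 * sigma**2 * np.trace(H) / n
    return risks
```

This is only meant to expose the bookkeeping; forming H explicitly costs O(n²p) per iteration and would be organized more efficiently in practice.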

Measure of model complexity:

Raskutti et al. (2014) proposed an early stopping rule based on the local empirical Rademacher complexity of the reproducing kernel Hilbert space. However, their method cannot be directly applied in our case: their stopping rule is based on the eigenvalues of a fixed empirical kernel matrix, which in our setting corresponds to the iterate-dependent matrix X diag(u_t ∘ u_t + v_t ∘ v_t) X^T / n. Since this empirical kernel matrix keeps being updated throughout the iterations, their method is not directly applicable.

At the end of this subsection, we adopt the simulation framework S1-S4 of Section 4.2 (only changing the noise standard deviation) to compare different early stopping criteria. We record the mean estimation errors averaged over repeated trials and report them in Figure 3. From the figure, we can see that cross validation tends to be more robust than SURE. Therefore, we recommend using hold-out or cross validation to select the iteration number, and we stick to this method in the rest of the paper.

Figure 3: Comparison of the estimation errors for different early stopping rules. ‘Oracle’ stands for the optimal early stopping rule with knowledge of the truth; ‘CV’ stands for early stopping through K-fold cross validation.

2.5 Adaptive step size and variable selection

A natural extension of the gradient descent Algorithm 1 is to assign different weights (step sizes) to different coordinates of (u, v), which is related to the adaptive Lasso (Zou, 2006). This can be seen from the differential equation interpretation: by inserting a constant diagonal weighting matrix W = diag(w₁, …, w_p) with w_j ≥ 0 into equation (4), we obtain the limiting dynamical system

du(t)/dt = −W ∇_u f(u(t), v(t)),   dv(t)/dt = −W ∇_v f(u(t), v(t)).

Based on a heuristic analysis similar to that of Section 2.3 for the noiseless case, the limiting point β^∞ of this dynamical system satisfies

X β^∞ = y   and   w_j ν_j ∈ ∂|β_j^∞|   for some ν in the column span of X^T,

which is the KKT condition for the dual form of the adaptive Lasso, i.e. the weighted basis pursuit problem min_β Σ_j w_j^{-1} |β_j| subject to Xβ = y. In the limiting case when the step size of a particular component tends to 0, we are equivalently imposing an infinite penalty on that component, forcing it to remain at zero. In contrast, if we apply a larger step size to a component, then the corresponding coordinates of (u, v) tend to move faster and more freely in the parameter space, which is equivalent to a smaller penalty on that component. The original paper of Zou (2006) constructed the weights based on the ordinary least squares solution, which requires n > p. In practice, when p > n, we can construct the weights through a preprocessing step. For example, variable screening can be applied to introduce sparse weights.

To enable variable selection in our method, we can apply a component-wise hard thresholding operation to the final estimator β̂. Based on our theoretical analysis, since our method shrinks both weak signals and errors to a very small order, it is more robust to false detections than explicitly regularized methods when the same tuning parameter for the noise level is applied. Let us consider a simple example to illustrate the basic idea: the true coefficient vector contains a few weak signals and a few strong signals, with all other components equal to zero in the data generating model. Since the weak signals have small strength, they are hard to detect by all methods we have tried. However, the effect of the weak signals on the response still persists. In particular, when applying cross validation, traditional penalized methods tend to over-select the predictors, leading to many false discoveries. In comparison, due to the implicit regularization, our method tends to be more robust to the weak signals—our estimate is typically non-sparse, the effect of the non-detected weak signals is distributed over all components of the estimated vector, and no component is particularly spiked. As a consequence, our method tends to be more robust to false discoveries after applying the hard thresholding. The variable selection results are shown in Figure 4. As we can see, the Lasso cannot detect the weak signals, and two components appear to be falsely detected through cross validation (note that in the Lasso, a soft thresholding has already been applied). In contrast, in our method most unimportant signals remain small. Performing a hard thresholding with the same regularization parameter selected by the Lasso erases all false detections, leading to the selection of the strong signals only.
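The post-estimation selection step itself is a one-line hard threshold; the threshold value below is a placeholder (e.g. the regularization level selected by the Lasso, as in the experiment above):

```python
import numpy as np

def hard_threshold(beta_hat, tau):
    """Zero out every component of beta_hat whose magnitude is at most tau."""
    return np.where(np.abs(beta_hat) > tau, beta_hat, 0.0)
```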

Figure 4: Coefficient values versus index for the truth β*, the Lasso estimator tuned by minimizing the cross-validation prediction error, the gradient descent estimator tuned by minimizing the cross-validation prediction error, and the gradient descent estimator after the ‘post-estimation’ selection (hard thresholding).

2.6 Related literature

Li et al. (2018) study the theory of implicit regularization in matrix sensing, which requires the data to be perfectly measured and has a different geometric structure from linear regression. Hoff (2017) considers the Hadamard product parametrization to optimize the parameters in high-dimensional linear regression. However, his motivation is computational: the reparametrization is used to reformulate the non-smooth Lasso optimization problem into a smooth one. In particular, his objective function involves an ℓ₂ penalty on (u, v) (which is equivalent to the ℓ₁ penalty on β = u ∘ v), and the solution is precisely the Lasso solution.

3 Theoretical Analysis

In this section, we provide formal statements characterizing the behavior of the gradient descent algorithm for minimizing the Hadamard product parametrized quadratic loss f defined in (2). We start with a description of our assumptions. Then, as a warm-up to convey the main proof ideas, we turn to the case of non-negative parameters, where a simpler parametrization can be applied. Finally, we consider the general signal case and illustrate when the fast parametric root-n rate, independent of the dimension, can be achieved. All proofs are deferred to the appendix in the supplementary material of the paper.

3.1 Assumptions

Recall that the underlying true data generating model is y = Xβ* + ε with noise level σ, where the true parameter β* is s-sparse. Within the nonzero signal components of β*, we define the index set of strong signals S₁ as those components whose magnitude exceeds a threshold of order σ√(log p / n), and the index set of weak signals S₂ as the remaining nonzero components; we write s₁ = |S₁| and s₂ = |S₂|. According to the information-theoretic limits from Wainwright (2009), weak signals of order σ√(log p / n) in sparse linear regression are generally impossible to jointly recover or select (but can be detected in terms of type I/II error control in hypothesis testing, e.g. see Jin and Ke (2016)). Therefore, our primary focus is the estimation and selection of strong signals. We use |β*|_{(j)} to denote the j-th largest absolute component value of β*, and let m denote the smallest absolute value among the strong signals, which reflects the minimal strength of the strong signals. We also use κ to denote the strong-signal condition number, defined as the ratio between the largest absolute signal value and the smallest strong signal. We will also make use of the Restricted Isometry Property (RIP, Candes (2008)), which is a commonly used assumption (e.g. Candes and Tao (2007)) in the high dimensional linear regression literature.

Definition 3.1 (Restricted Isometry Property).

A matrix X ∈ ℝ^{n×p} is said to satisfy the (s, δ)-Restricted Isometry Property (RIP) if for any s-sparse vector θ in ℝ^p, we have

(1 − δ) ‖θ‖₂² ≤ (1/n) ‖Xθ‖₂² ≤ (1 + δ) ‖θ‖₂².

As an easy consequence, if the matrix X satisfies the (s, δ)-RIP, then the Euclidean inner product is also approximately preserved, that is, ⟨Xθ₁, Xθ₂⟩/n = ⟨θ₁, θ₂⟩ + O(δ ‖θ₁‖₂ ‖θ₂‖₂) holds for any two s-sparse vectors θ₁, θ₂.

With these preparations, we make the following assumptions on the true parameter β*, the design matrix X, the initialization parameter α and the step size η in Algorithm 1.

Assumption (A):

The true parameter β* is s-sparse and S = S₁ ∪ S₂, that is, each nonzero signal in β* is either weak or strong.

Assumption (B):

The design matrix X satisfies the RIP condition of Definition 3.1 at a sparsity level of order s with a sufficiently small RIP constant δ.

Assumption (C):

The initialization parameter α and the step size η are chosen sufficiently small.

Some remarks are in order. First, our current proof heavily relies on the RIP condition in Assumption (B), which is satisfied with high probability if the predictors are iid Gaussian (or sub-Gaussian) and n ≳ s log p. However, the extensive simulation studies in the next section provide strong numerical evidence suggesting that our conclusions remain valid even when the RIP condition is violated. We leave the formal theoretical investigation as an open problem for future studies. Second, Assumption (A) is made mainly for illustrating the possibility of achieving the fast parametric root-n rate of convergence when all nonzero signals are strong. In fact, our current proof can still lead to the typical high-dimensional rate of order √(s log p / n) without introducing the notion of strong and weak signals; due to space constraints we omit the details.

3.2 Non-negative Signals

To start with, we demonstrate the key ideas of our proofs and give an analysis of the non-negative case as a warm-up. More specifically, we consider the case where all components of the true signal β* are non-negative. To exploit this non-negativeness, we may instead use the self-Hadamard product parametrization β = w ∘ w for w ∈ ℝ^p, and write the fitted coefficient as w_t ∘ w_t. Now we have the optimization problem

min_{w ∈ ℝ^p} g(w) = (1/(2n)) ‖y − X(w ∘ w)‖₂²,

with gradient descent updating formula w_{t+1} = w_t − (2η/n) w_t ∘ X^T(X(w_t ∘ w_t) − y). For technical simplicity, we focus on the deterministic initialization w_0 = α 1_p. This case is simpler to analyze than the general case since the components of w_t will not change sign during the iterations, and will always stay away from saddle points. We summarize our result in the following main theorem. Since the non-negative signal constraint resembles the positive semi-definiteness constraint in matrix sensing, our proof utilizes the proof strategy of Li et al. (2018) for analyzing matrix factorized gradient descent for noiseless matrix sensing, dividing the convergence into two stages (more details are provided after the theorem).
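For completeness, a sketch of the warm-up scheme for non-negative signals (same assumed 1/(2n) scaling as before; the factor 2 w_t in the update comes from differentiating through w ∘ w):

```python
import numpy as np

def nonneg_hadamard_gd(X, y, alpha=1e-6, eta=0.05, n_iter=2000):
    """Gradient descent on g(w) = ||y - X(w*w)||^2 / (2n) with w_0 = alpha * 1."""
    n, p = X.shape
    w = alpha * np.ones(p)                   # deterministic initialization w_0 = alpha * 1_p
    for _ in range(n_iter):
        r = X.T @ (X @ (w * w) - y) / n      # least-squares gradient in beta = w o w
        w = w - 2 * eta * w * r              # chain rule through beta = w o w gives factor 2w
    return w * w                             # non-negative estimate of beta*
```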

Theorem 3.1.

Suppose assumptions (A), (B) and (C) hold. Then there exist positive constants, depending on the RIP constant, the step size and the signal strengths, and an interval of iteration numbers [T₁, T₂], such that with high probability every time-t iterate with T₁ ≤ t ≤ T₂ satisfies

‖w_t ∘ w_t − β*‖₂ ≲ σ√(s₁/n) + σ√(s₂ log p / n).

This theorem tells us that in high dimensional linear regression, combining early stopping with implicit regularization can significantly improve the estimation accuracy. In particular, when all signals are strong (s₂ = 0 and s₁ = s), the estimate attains a parametric rate of convergence that is independent of the dimension p. In general, when weak signals exist, the overall rate depends on the numbers of weak and strong signals, and is still minimax-optimal (Zhao et al., 2018). The same remark also applies to our theory in the general case.

Our proof strategy is to divide the convergence into two stages. Recall that S is the support set of the true signal β*, and S₁ ⊆ S corresponds to the subset of all strong signals. In the first “burn-in” stage, we show that each component of the strong signal part increases at an exponential rate in t until hitting the order of the corresponding true signal, while the weak signal and error parts remain uniformly small. In the second stage, the iterate enters a geometric convergence region where w_t ∘ w_t converges towards β* at a linear rate up to some higher-order error term, and then stays in a neighborhood of β* up to the time T₂. Therefore, the time interval [T₁, T₂] is the theoretical “best solution region”, corresponding to the stabilized region in Figure 1.

More specifically, in the proof we consider the decomposition of the iterate into three parts: the strong signal part, the weak signal part and the error part:

w_t = I_{S₁} w_t + I_{S₂} w_t + I_{S^c} w_t,

where we recall that I_A is the diagonal matrix with ones on the positions indexed by the subset A and zeros elsewhere. We use induction to prove the following results characterizing the gradient dynamics in the first stage. Recall that |β*|_{(j)} denotes the j-th largest absolute component value of the vector β* and m is the minimal strength of the strong signals.

Proposition 3.1 (Stage one dynamics).

Under the assumptions of Theorem 3.1, there are positive constants c₁ and c₂ such that for each t in the first stage we have:

strong signal dynamics: every component of I_{S₁} w_t grows at an exponential rate in t until it reaches the order of the corresponding true signal strength;
weak signal dynamics: the weak signal part I_{S₂} w_t remains of the order of the initialization magnitude α;
error dynamics: the error part I_{S^c} w_t remains of the order of the initialization magnitude α.

Define G as a good parameter region for w. Proposition 3.1 tells us that for all t in the first stage, the iterate w_t satisfies all constraints defining G except the lower bound on the strong signal part, but will enter this good region within at most T₁ iterations. The second stage starts when w_t first enters G. The next result summarizes the behavior of the gradient dynamics in the second stage—once the iterate enters G, it will stay in G up to time T₂, during which w_t ∘ w_t converges towards β* at a linear rate up to the statistical error.

Proposition 3.2 (Stage two dynamics).

Under the assumptions of Theorem 3.1, there exists some constant such that if the iterate w_t lies in the good region G, then