# Implicit Sparse Regularization: The Impact of Depth and Early Stopping

In this paper, we study the implicit bias of gradient descent for sparse regression. We extend results on regression with quadratic parametrization, which amounts to depth-2 diagonal linear networks, to more general depth-N networks, under more realistic settings of noise and correlated designs. We show that early stopping is crucial for gradient descent to converge to a sparse model, a phenomenon that we call implicit sparse regularization. This result is in sharp contrast to known results for noiseless and uncorrelated-design cases. We characterize the impact of depth and early stopping and show that for a general depth parameter N, gradient descent with early stopping achieves minimax optimal sparse recovery with sufficiently small initialization and step size. In particular, we show that increasing depth enlarges the scale of working initialization and the early-stopping window, which leads to more stable gradient paths for sparse recovery.

## Authors

• 2 publications
• 9 publications
• 50 publications
• 16 publications
01/14/2022

### The Implicit Regularization of Momentum Gradient Descent with Early Stopping

The study on the implicit regularization induced by gradient-based optim...
03/22/2019

### Implicit Regularization via Hadamard Product Over-Parametrization in High-Dimensional Linear Regression

We consider Hadamard product parametrization as a change-of-variable (ov...
09/11/2019

### Implicit Regularization for Optimal Sparse Recovery

We investigate implicit regularization schemes for gradient descent meth...
09/26/2019

### The Implicit Bias of Depth: How Incremental Learning Drives Generalization

A leading hypothesis for the surprising generalization of neural network...
02/01/2020

### The Statistical Complexity of Early Stopped Mirror Descent

Recently there has been a surge of interest in understanding implicit re...
05/08/2021

### Nearly Minimax-Optimal Rates for Noisy Sparse Phase Retrieval via Early-Stopped Mirror Descent

This paper studies early-stopped mirror descent applied to noisy sparse ...
11/25/2021

### Predicting the success of Gradient Descent for a particular Dataset-Architecture-Initialization (DAI)

Despite their massive success, training successful deep neural networks ...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Motivation. Central to recent research in learning theory is the insight that the choice of optimization algorithms plays an important role in model generalization Zhang et al. (2016); Hardt et al. (2016)

. A widely adopted view is that (stochastic) gradient descent — the most popular optimization algorithm in machine learning — exhibits some implicit form of regularization. Indeed for example, in the classical under-determined least squares setting, gradient descent (with small step size) starting from the origin converges to the model with minimum Euclidean norm. Similar implicit biases are also observed in deep neural network training in which the networks typically have many more parameters than the sample size. There, gradient descent without explicit regularization finds solutions that not only interpolate the training data points but also generalize well on test sets

Hardt et al. (2016); Neyshabur et al. (2017); Soudry et al. (2018); Belkin et al. (2019).

This insight, combined with the empirical success stories of deep learning, has sparked significant interest among theoretical researchers to rigorously understand implicit regularization. The majority of theoretical results focus on well-understood problems such as regression with linear models

Saxe et al. (2013); Soudry et al. (2018); Gunasekar et al. (2018a, b); Vaskevicius et al. (2019); Zhao et al. (2019); Gissin et al. (2019) and matrix factorization Gidel et al. (2019); Gunasekar et al. (2018c); Li et al. (2018); Arora et al. (2019); Woodworth et al. (2020), and show that the parametrization (or architecture) of the model plays a crucial role. For the latter, Gunasekar et al. Gunasekar et al. (2018c) conjectured that gradient descent on factorized matrix representations converges to the solution with minimum nuclear norm. The conjecture was partially proved by Li et al. Li et al. (2018) under the Restricted Isometry Property (RIP) and the absence of noise. Arora et al. Arora et al. (2019) further show the same nuclear-norm implicit bias using depth- linear networks (i.e., the matrix variable is factorized into components).

Parallel work on nonlinear models and classification Soudry et al. (2018); Gunasekar et al. (2018a) has shown that gradient descent biases the solution towards the max-margin/minimum -norm solutions over separable data. The scale of initialization in gradient descent leads to two learning regimes (dubbed “kernel" and “rich") in linear networks Woodworth et al. (2020)

, shallow ReLU networks

Williams et al. (2019)

and deep linear classifiers

Moroshko et al. (2020).

Contributions. Our focus in this paper is the implicit regularization of (standard) gradient descent for high dimensional sparse regression, namely implicit sparse regularization. Let us assume a ground-truth sparse linear model and suppose we observe noisy samples , such that ; a more formal setup is given in Section 2. Using the samples, we consider gradient descent on a squared loss

with no explicit sparsity regularization. Instead, we write the parameter vector

in the form with . Now, the regression function can be viewed as a depth- diagonal linear network Woodworth et al. (2020). Minimizing the (now non-convex) loss over and with gradient descent is then analogous to training this depth- network.

Our main contributions are the following. We characterize the impact of both the depth and early stopping for this non-convex optimization problem. Along the way, we also generalize the results of Vaskevicius et al. (2019) for . We show that under a general depth parameter and an incoherence assumption on the design matrix, gradient descent with early stopping achieves minimax optimal recovery with sufficiently small initialization and step size . The choice of step size is of order . Moreover, the upper bound of the initialization, as well as the early-stopping window, increase with , suggesting that depth leads to more stable trajectories in terms of generalization performance.

Techniques. At a high level, our work continues the line of work on implicit bias initiated in Vaskevicius et al. (2019); Zhao et al. (2019); Woodworth et al. (2020) and extends it to the deep setting. Table 1 highlights key differences between our work and Vaskevicius et al. (2019); Gissin et al. (2019); Woodworth et al. (2020). Specifically, Woodworth et al. Woodworth et al. (2020)

study the gradient flow of the squared-error loss function in the noiseless case

. Vaskevicius et al. Vaskevicius et al. (2019) analyze the finite gradient descent and characterize the implicit sparse regularization with . Lastly, Gissin et al. Gissin et al. (2019) discover the incremental learning dynamic of gradient flow for general but in an idealistic model setting where , , uncorrelated design and with infinitely many samples.

At first glance, one could attempt a straightforward extension of the proof techniques in Vaskevicius et al. (2019) to general settings of . However, this turns out to be very challenging. Consider even the simplified case where the true model is non-negative, the design matrix is unitary (i.e., ), and the noise is absent (); this is the setting studied in Gissin et al. (2019). For each entry of , the iterate of gradient descent over the depth- reparametrized model is given by:

 wi,t+1=wi,t(1+w1−2Ni,t(w⋆i−wi,t))N,

which is no longer a simple multiplicative update. As pointed out in Gissin et al. (2019) (see their Appendix C), the recurrence relation is not analytically solvable due to the presence of the (pesky) term when .

Moreover, this extra term leads to widely divergent growth rates of weights with different magnitudes, which further complicates analytical bounds.

To resolve this and rigorously analyze the dynamics for , we rely on a novel first order, continuous approximation to study growth rates without requiring additional assumptions on gradient flow, and carefully bound the approximation error due to finite step size; see Section 4.

## 2 Setup

Sparse regression/recovery. Let be a -dimensional sparse vector with non-zero entries. Assume that we observe data points such that for , where is the noise vector. We do not assume any particular scaling between the number of observations and the dimension . Due to the sparsity of , however, we allow .

The linear model can be expressed in the matrix-vector form:

 y=Xw⋆+ξ, (1)

with the design matrix , where denotes the row of . We also denote , where denotes the column of .

The goal of sparse regression is to estimate the unknown, sparse vector

from the observations. Over the past two decades, this problem has been a topic of active research in statistics and signal processing Tibshirani (1996). A common approach to sparse regression is penalized least squares with sparsity-induced regularization such as or penalties/constraints, leading to several well-known estimators Tibshirani (1996); Chen et al. (2001); Candes et al. (2007) and algorithms Bredies and Lorenz (2008); Agarwal et al. (2012). Multiple estimators enjoy optimal statistical and algorithmic recovery guarantees under some conditions of the design matrix (e.g., RIP Candes and Tao (2005)) and the noise .

We deviate from the standard penalized least squares formulation and instead learn via a polynomial parametrization:

 w=uN−vN,u,v∈Rp,

where and for any . The regression function induced by such a parametrization is equivalent to a -layer diagonal linear network Woodworth et al. (2020) with

hidden neurons and the diagonal weight matrix shared across all layers.

Given the data observed in (1), we analyze gradient descent with respect to the new parameters and over the mean squared error loss without explicit regularization:

 L(u,v)=1n∥∥X(uN−vN)−y∥∥22,u,v∈Rp.

Even though the loss function yields the same value for the two parametrizations, is non-convex in and . Unlike several recent studies in implicit regularization for matrix factorization and regression Li et al. (2018); Woodworth et al. (2020); Gissin et al. (2019), we consider the noisy setting, which is more realistic and leads to more insights into the bias induced during the optimization. Because of noise, the loss evaluated at the ground truth (i.e., any such that ) is not necessarily zero or even minimal.

 u0=v0=α1, (ut+1,vt+1)=(ut,vt)−η∂L(ut,vt)∂(ut,vt),t=0,1,…. (2)

Here, is the step size and is the initialization of . In general, we analyze the algorithm presented in (2), and at each step , we can estimate the signal of interest by simply calculating .

Vaskevicius et al. Vaskevicius et al. (2019) establish the implicit sparse regularization of gradient descent for and show minimax optimal recovery, provided sufficiently small and early stopping. Our work aims to generalize that result to and characterize the role of in convergence.

Notation. We define and . The largest and smallest absolute value on the support is denoted as and . We use to denote the vector of all ones and denotes the vector whose elements on are all one and 0 otherwise. Also, denotes coordinate-wise multiplication. We denote and meaning the signal part and error part at each time step . We use and to denote the pointwise maximum and minimum. The coordinate-wise inequalities are denoted as . We denote inequalities up to multiplicative absolute constants by , which means that they do not depend on any parameters of the problem.

###### Definition 1.

Let be a matrix with -normalized columns , i.e., for all . The coherence of the matrix is defined as

 μ\coloneqqmax1≤i≠j≤p|⟨Xi,Xj⟩|.

The matrix is said to be satisfying -incoherence.

The coherence is a measure for the suitability of the measurement matrix in compressive sensing Foucart and Rauhut (2013). In general, the smaller the coherence, the better the recovery algorithms perform. In contrast to the coherence, the Restricted Isometry Property (RIP) is a powerful performance measure for guaranteeing sparse recovery and has been widely used in many contexts. However, verifying the RIP for deterministically constructed design matrices is NP-hard. On the other hand, coherence is a computationally tractable measure and its use in sparse regression is by now classical Donoho et al. (2005). Therefore, in contrast with previous results Vaskevicius et al. (2019) (which assume RIP), the assumptions made in our main theorems are verifiable in polynomial time.

## 3 Main Results

We now introduce several quantities that are relevant for our main results. First, the condition number plays an important role when we work on the incoherence property of the design matrix. Next, we require an upper bound on the initialization , which depends on the following terms:

 Φ(w⋆max,w⋆min,ϵ,N)\coloneqq(18)2/(N−2)∧⎛⎜⎝(w⋆max)(N−2)/Nlogw⋆maxϵ⎞⎟⎠2/(N−2)∧⎛⎜⎝(w⋆min)(N−2)/Nlogw⋆minϵ⎞⎟⎠4/(N−2), Ψ(w⋆min,N)\coloneqq(2−2N−2N)1N−2(w⋆min)1N∧23N(21N−1)1N−2(w⋆min)1N.

Finally, define

 ζ\coloneqq15w⋆min∨200n∥∥XTξ∥∥∞∨200ϵ.

We are now ready to state the main theorem:

###### Theorem 1.

Suppose that and satisfies -incoherence with . Take any precision , and let the initialization be such that

 0<α≤(ϵp+1)4/N∧Φ(w⋆max,w⋆min,ϵ,N)∧Ψ(w⋆min,N). (3)

For any iteration that satisfies

 Tl(w⋆,α,N,η,ζ,ϵ)≤t≤Tu(w⋆,α,N,η,ζ,ϵ), (4)

where and are given in (25) of the Appendix, the gradient descent algorithm (2) with step size yields the iterate with the following property:

 (5)

In the special case , if , and , then we have .

Theorem 1 states the convergence of the gradient descent algorithm (2) in -norm. The error bound on the signal is invariant to the choice of , and the overall bound generalizes that of Vaskevicius et al. (2019) for . We also establish the convergence result in -norm in the following corollary:

###### Corollary 1.

Suppose the noise vector has independent -sub-Gaussian entries and . Under the assumptions of Theorem 1, the gradient descent algorithm (2) would produce iterate satisfying

with probability at least

.

Let us discuss the implications of Theorem 1 and the role of initialization and early stopping:

(a) Requirement on initialization. To roughly understand the role of initialization and the effect of , we look at the non-negative case where and . This simplifies our discussion while still capturing the essential insight of the general setting. At each step, the “update” on can be translated from the corresponding gradient update of as

 w0 =αN1, (6) wt+1 =wt⊙(1−2Nηn(XTX(wt−w∗)−XTξ)⊙w(N−2)/Nt)N.

In order to guarantee the convergence, we require the initialization to be sufficiently small so the error outside the support can be controlled. On the other hand, too small initialization slows down the convergence of the signal. Interestingly, the choice of affects the allowable initialization that results in guarantees on the entries inside and outside the support.

Specifically, the role of is played by the term in (6), which simply disappears as . Since this term only affects the update of entry-wise, we only look at a particular entry of . Let represent an entry outside the support. For , the term is increasingly small as increases and . Therefore, with a small initialization, it remains true that for the early iterations. Intuitively, this suggests that the requirement on the upper bound of the initialization would become looser when gets larger. This indeed aligns with the behavior of the upper bound we derive in our theoretical results. Since increases naturally with , we fix instead of to mimic the same initialization in terms of , for the following comparison.

We formalize this insight in Theorem 3 in Appendix A and show the convergence of (6) under the special, non-negative case. Note that, in terms of initialization requirement, the only difference from Theorem 1 is that we no longer require the term in (3).

###### Remark 1.

We investigate how the depth influences the requirement on initialization due to the change on gradient dynamics. We rewrite in terms of , and therefore the upper bound for under the simplified setting of non-negative signals (Theorem 3) is

 w0≤(18)2N/(N−2)∧⎛⎜⎝(w⋆max)(N−2)/Nlogw⋆maxϵ⎞⎟⎠2N/(N−2)∧⎛⎜⎝(w⋆min)(N−2)/Nlogw⋆minϵ⎞⎟⎠4N/(N−2).

We start by analyzing each term in the upper bound. First, we notice that is increasing with respect to . For the second term,

 ⎛⎜⎝(w⋆max)(N−2)/Nlogw⋆maxϵ⎞⎟⎠2N/(N−2)=(w⋆max)2(logw⋆maxϵ)2N/(N−2),

the denominator gets smaller as increases when we pick the error tolerance parameter small. Therefore, we get that the second term is getting larger as increases. The last term follows a similar argument. We see that it is possible to pick a larger initialization for larger . We will demonstrate that below in our experiments.

(b) Early stopping. Early stopping is shown to be crucial, if not necessary, for implicit sparse regularization Vaskevicius et al. (2019); Zhao et al. (2019). Interestingly, (Gissin et al., 2019; Woodworth et al., 2020) studied the similar depth- polynomial parametrization but did not realize the need of early stopping due to an oversimplification in the model. We will discuss this in details in Section 4.1. We are able to explicitly characterize the window of the number of iterations that are sufficient to guarantee the optimal result. In particular, we get a lower bound of the window size for early stopping to get a sense of how it changes with different .

###### Theorem 2 (Informal).

Define the early stopping window size as , the difference between the upper bound and lower bound of the number of iterations in (4) of Theorem 1. Fixing and for all , the early stopping window size is increasing with under mild conditions.

We defer the formal argument and proof of Theorem 2 to Appendix D.3. We note that the window we obtain in Theorem 1 is not necessarily the largest window that allows the guarantee, and hence the early stopping window size can be effectively regarded a lower bound of that derived from the largest window. We note that a precise characterization of the largest window is difficult. Although we only show that this lower bound increases with , we see that the conclusion matches empirically with the largest window. Fix the same initialization and step size for , we show the coordinate path in Figure 1. We can see that as increases, the early stopping window increases and the error bound captures the time point that needs stopping quite accurately. The experimental details and more experiments about early stopping is presented in Section 5.

## 4 Proof Ingredients

The goal of this paper is to understand how generalization and gradient dynamics change with different . For , gradient descent yields both statistically and computationally optimal recovery under the RIP assumption Vaskevicius et al. (2019). The matrix formulation of the same type of parametrization is considered in the setting of low-rank matrix recovery, and exact recovery can be achieved in the noiseless setting Gunasekar et al. (2018c); Li et al. (2018). The key proof ingredient is to reduce the convergence analysis to one-dimensional iterates and differentiate the convergence on the support from the error outside the support. Before we get into that, we conduct a simplified gradient flow analysis.

### 4.1 A Simplified Analysis

Consider a simplified problem where the target signal is non-negative, and the noise is absent. We omit the reparametrization of like before and the gradient descent updates on will be independent for each coordinate. The gradient flow dynamics of is derived as

 ∂wi∂t=∂wi∂ui∂ui∂t=−∂wi∂ui∂L∂ui=2N2(w⋆i−wi)w2−2Ni, (7)

for all . Notice that increases monotonically and converges to if is positive or otherwise keeps decreasing and converges to if . As such, we can easily distinguish the support and non-support. In fact, gradient flow with dynamics as in (7) would exhibit a behavior of “incremental learning” — the entries are learned separately, one at a time Gissin et al. (2019). However, with the presence of noise and perturbation arising from correlated designs, the gradient flow may end up over-fitting the noise. Therefore, early stopping as well as the choice of step size is crucial for obtaining the desired solution Vaskevicius et al. (2019). We use (7) to obtain a gradient descent update:

 wi,t+1=wi,t(1+2N2η(w⋆i−wi,t)w1−2Ni,t). (8)

The gradient descent with is analyzed in Vaskevicius et al. (2019). However, when , the presence of imposes an asymmetrical effect on the gradient dynamics. The difficulty of analyzing such gradient descent (8) is pointed out in Gissin et al. (2019). More specifically, the recurrence relation is not solvable. However, gradient descent updates still share similar dynamics with the idealized gradient flow in (7). Inspired by this effect, we are able to show that the entries inside the support and those outside the support are learned separately with a practical optimization algorithm shown in (2) and (12). As a result, we are able to explore how the depth affects the choice of step size and early stopping criterion.

### 4.2 Proof Sketch

Growth rate of gradient descent. We adopt the same decomposition as illustrated in Vaskevicius et al. (2019), and define the following error sequences:

 (9)

We can then write the updates on and as

 st+1 =st⊙(1−2Nη(st−w⋆+pt+bt)⊙s(N−2)/Nt)N, (10) et+1 =et⊙(1−2Nη(pt+bt)⊙e(N−2)/Nt)N.

To illustrate the idea, we think of the one-dimensional updates and , ignore the error perturbations and in the signal updates , and treat in the error updates .

 st+1=st(1−2Nη(st−w⋆)s(N−2)/Nt)N,et+1=et(1−2NηBe(N−2)/Nt)N. (11)

We use the continuous approximation to study the discrete updates. Therefore, we can borrow many insights from the analysis about gradient flow to overcome the difficulties caused by as pointed out in equation (8). With a proper choice of step size , the number of iterations for converging to is derived as

 Tl ≤Tl−1∑t=0st+1−st2N2η(w⋆−st)s(2N−2)/Nt ≤1N2ηw⋆∫w⋆αN1s(2N−2)/Nds+O(w⋆−αNα2N−2).

The number of iterations for staying below some threshold is derived as

 Tu≥Tu−1∑t=0et+1−et4N2ηBe(2N−2)/Nt≥14N2ηB∫αN/4αN1e(2N−2)/Nde.

With our choice of coherence in Theorem 1, we are able to control to be small so that is smaller than . This means the entries on the support converge to the true signal while the entries outside the support stay around 0, and we are able to distinguish signals and errors.

Dealing with negative targets. We now illustrate the idea about how to generalize the result about non-negative signals to general signals. The exact gradient descent updates on and are given by:

 ut+1 =ut⊙(1−2Nη(1nXT(X(wt−w⋆)−ξ)⊙uN−2t)), (12) vt+1 =vt⊙(1+2Nη(1nXT(X(wt−w⋆)−ξ)⊙vN−2t)).

The basic idea is to show that when is positive, remains small up to the early stopping criterion, and when is negative, remains small up to the early stopping criterion. We turn to studying the gradient flow of such dynamics. Write . It is easy to verify that the gradient flow has a solution:

 u(t)=(α2−N1+2N(N−2)η∫t0r(υ)dυ)12−N, v(t)=(α2−N1−2N(N−2)η∫t0r(υ)dυ)12−N.

We may observe some symmetry here, when is large, must be small. For the case , to ensure the increasing of and decreasing of as we desire, the initialization needs to be smaller than , which leads to the extra constraint on initialization with order of as defined before. It remains to build the connection between gradient flow and gradient descent, where again we uses the continuous approximation as before. The detailed derivation is presented in Appendix B.3.

## 5 Simulation Study

We conduct a series of simulation experiments to further illuminate our theoretical findings. Our simulation setup is described as follows. The entries of

are sampled as i.i.d. Rademacher random variables and the entries of the noise vector

are i.i.d. random variables. We let . The values for the simulation parameters are: , , , , unless otherwise specified. For -plots each simulation is repeated 30 times, and the median error is depicted. The shaded area indicates the region between and percentiles pointwisely.

Convergence results. We start by showing that the general choice of leads to the sparse recovery, similar to in Vaskevicius et al. (2019), as shown in our main theorem. We choose different values of to illustrate the convergence of the algorithm. Note that the ranges in the -axes of these figures differ due to different choice of and . We observe that as increases, the number of iterations increases significantly. This is due to the term and in (12), and the step size . With a very small initialization, it takes a large number of iterations to escape from the small region (close to ).

Larger initialization. As discussed in Remark 1, the upper bound on initialization gets larger with larger . We intentionally pick a relatively large where the algorithm fails to converge for . With the same initialization, the recovery manifests as increases (Fig. 3).

Early stopping window size. Apart from the coordinate path shown in Figure 1, we obtain multiple runs and plot the - error (the logarithm of the -error) of the recovered signals to further confirm the increase of early stopping window, as shown in Section 3. Note that for both Figures 1 and 4, we set and . Since would decrease quickly with , which would cause the algorithm takes a large number of iterations to escape from the small region. We fix instead of fixing for Figure 4.

Incremental learning dynamics. The dynamics of incremental learning for different is discussed in Gissin et al. (2019). The distinct phases of learning are also observed in sparse recovery (Figure 5), though we do not provide a theoretical justification. Larger values of would lead to more distinct learning phases for entries with different magnitudes under the same initialization and step size .

Kernel regime. As pointed out in Woodworth et al. (2020), the scale of initialization determines whether the gradient dynamics obey the “kernel” or “rich” regimes for diagonal linear networks. We have carefully analyzed and demonstrated the sparse recovery problem with small initialization, which corresponds to the “rich” regime. To explore the "kernel" regime in a more practical setting, we set , , and the entries of are i.i.d. random variables. The noise level is , and the initialization and step size is set as and for all . Note that we are not working in the case as Woodworth et al. (2020). We still observe that the gradient dynamics with large initialization (Figure 6

) can be connected to ridge regression if early stopping is deployed.

## 6 Conclusions and Future Work

In this paper, we extend the implicit regularization results in Vaskevicius et al. (2019) from to general , and further study how gradient dynamics and early stopping is affected by different choice . We show that the error bound is invariant with different choice of and yields the minimax optimal rate. The step size is of order . The initialization and early stopping window gets larger when increasing due to the changes on gradient dynamics.

The convergence result can be further improved by relaxing the requirement on the incoherence of design matrix from to , similar to Vaskevicius et al. (2019). Overall, we believe that such an analysis and associated techniques could be applied for studying other, deeper nonlinear models in more practical settings.

## References

• [1] A. Agarwal, S. Negahban, and M. J. Wainwright (2012)

Fast global convergence of gradient methods for high-dimensional statistical recovery

.
The Annals of Statistics, pp. 2452–2482. Cited by: §2.
• [2] S. Arora, N. Cohen, W. Hu, and Y. Luo (2019) Implicit regularization in deep matrix factorization. arXiv preprint arXiv:1905.13655. Cited by: §1.
• [3] M. Belkin, D. Hsu, S. Ma, and S. Mandal (2019)

Reconciling modern machine-learning practice and the classical bias-variance trade-off

.
Proceedings of the National Academy of Sciences of the United States of America 116 (32), pp. 15849–15854. Cited by: §1.
• [4] K. Bredies and D. A. Lorenz (2008) Linear convergence of iterative soft-thresholding. Journal of Fourier Analysis and Applications 14 (5-6), pp. 813–837. Cited by: §2.
• [5] E. J. Candes and T. Tao (2005)

Decoding by linear programming

.
IEEE transactions on information theory 51 (12), pp. 4203–4215. Cited by: §2.
• [6] E. Candes, T. Tao, et al. (2007) The dantzig selector: statistical estimation when p is much larger than n. Annals of statistics 35 (6), pp. 2313–2351. Cited by: §2.
• [7] S. S. Chen, D. L. Donoho, and M. A. Saunders (2001) Atomic decomposition by basis pursuit. SIAM review 43 (1), pp. 129–159. Cited by: §2.
• [8] D. L. Donoho, M. Elad, and V. N. Temlyakov (2005) Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on information theory 52 (1), pp. 6–18. Cited by: §2.
• [9] S. Foucart and H. Rauhut (2013) An invitation to compressive sensing. In A mathematical introduction to compressive sensing, pp. 1–39. Cited by: §2.
• [10] G. Gidel, F. Bach, and S. Lacoste-Julien (2019) Implicit regularization of discrete gradient dynamics in linear neural networks. arXiv preprint arXiv:1904.13262. Cited by: §1.
• [11] D. Gissin, S. Shalev-Shwartz, and A. Daniely (2019) The implicit bias of depth: how incremental learning drives generalization. arXiv preprint arXiv:1909.12051. Cited by: Table 1, §1, §1, §1, §2, §3, §4.1, §4.1, §5.
• [12] S. Gunasekar, J. D. Lee, N. Srebro, and D. Soudry (2018) Implicit bias of gradient descent on linear convolutional networks. Advances in Neural Information Processing Systems 2018, pp. 9461–9471. Cited by: §1, §1.
• [13] S. Gunasekar, J. Lee, D. Soudry, and N. Srebro (2018) Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, pp. 1832–1841. Cited by: §1.
• [14] S. Gunasekar, B. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro (2018) Implicit regularization in matrix factorization. In 2018 Information Theory and Applications Workshop (ITA), pp. 1–10. Cited by: §1, §4.
• [15] M. Hardt, B. Recht, and Y. Singer (2016) Train faster, generalize better: stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225–1234. Cited by: §1.
• [16] Y. Li, T. Ma, and H. Zhang (2018) Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pp. 2–47. Cited by: §1, §1, §2, §4.
• [17] E. Moroshko, B. E. Woodworth, S. Gunasekar, J. D. Lee, N. Srebro, and D. Soudry (2020) Implicit bias in deep linear classification: initialization scale vs training accuracy. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 22182–22193. Cited by: §1.
• [18] G. Neu and L. Rosasco (2018) Iterate averaging as regularization for stochastic gradient descent. In Conference On Learning Theory, pp. 3222–3242. Cited by: §1.
• [19] B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro (2017) Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071. Cited by: §1.
• [20] G. Raskutti, M. J. Wainwright, and B. Yu (2014) Early stopping and non-parametric regression: an optimal data-dependent stopping rule. The Journal of Machine Learning Research 15 (1), pp. 335–366. Cited by: §1.
• [21] A. M. Saxe, J. L. McClelland, and S. Ganguli (2013) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120. Cited by: §1.
• [22] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro (2018) The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research 19 (1), pp. 2822–2878. Cited by: §1, §1, §1.
• [23] A. Suggala, A. Prasad, and P. K. Ravikumar (2018) Connecting optimization and regularization paths. Advances in Neural Information Processing Systems 31, pp. 10608–10619. Cited by: §1.
• [24] R. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288. Cited by: §2.
• [25] T. Vaskevicius, V. Kanade, and P. Rebeschini (2019) Implicit regularization for optimal sparse recovery. In Advances in Neural Information Processing Systems, pp. 2972–2983. Cited by: Appendix A, Appendix B, Table 1, §1, §1, §1, §1, §1, §2, §2, §3, §3, §4.1, §4.1, §4.2, §4, §5, §6, §6.
• [26] F. Williams, M. Trager, C. Silva, D. Panozzo, D. Zorin, and J. Bruna (2019) Gradient dynamics of shallow univariate relu networks. arXiv preprint arXiv:1906.07842. Cited by: §1.
• [27] B. Woodworth, S. Gunasekar, J. D. Lee, E. Moroshko, P. Savarese, I. Golan, D. Soudry, and N. Srebro (2020) Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pp. 3635–3673. Cited by: Table 1, §1, §1, §1, §1, §1, §2, §2, §3, §5.
• [28] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §1.
• [29] P. Zhao, Y. Yang, and Q. He (2019)

Implicit regularization via hadamard product over-parametrization in high-dimensional linear regression

.
arXiv preprint arXiv:1903.09367. Cited by: §1, §1, §1, §3.

## Appendix

The appendix is organized as follows.

In Appendix A we present a simplied theorem about non-negative signals and illustrate the idea behind the proof.

In Appendix B we study the multiplicative updates and build connections to its continuous approximation, which will be used next.

In Appendix C we provide the proof of propositions and technical lemmas in Appendix A.

In Appendix D we prove the main results stated in the paper.

## Appendix A Proof for Non-negative Signals

We mainly follow the proof structure from [25] to obtain the convergence of similar gradient descent algorithm for the case , which is a limiting case of ours. We will demonstrate how gradient dynamics changes with , which requires us to study the growth rate of error and convergence rate more carefully.

In this section, we will start with the general set up and provide a simplified version of Theorem 1 about non-negative signals.

### a.1 Set Up

 ∇uL(w) =2NnXT(Xw−y)⊙uN−1 ∇vL(w) =−2NnXT(Xw−y)⊙vN−1.

With the step size , the gradient descent updates on and simply are

 ut+1 =ut⊙(1−2Nη(1nXT(X(wt−w⋆)−ξ)⊙uN−2t)), vt+1 =vt⊙(1+2Nη(1nXT(X(wt−w⋆)−ξ)⊙vN−2t)).

Let where and with the power taken element-wisely. We denote as the support of , and let denote the index set of coordinates with positive values, and denote the index set of coordinates with negative values. Therefore and . Then define the following signal and noise-related quantities:

 st\coloneqq1S+⊙w+t−1S−⊙w−t,et\coloneqq1Sc⊙wt+1S−⊙w+t−1S+⊙w−t,bt\coloneqq1nXTXet−1nXTξ,pt\coloneqq(1nXTX−I)(st−w⋆). (13)

Let be the initial value for each entry of and rewrite the updates on , and in a more succinct way:

 w+0=w−0=αN,wt=w+t−w−t,w+t+1=w+t⊙(1−2Nη(st−w⋆+pt+bt)⊙(w+t)(N−2)/N)N,w−t+1=w−t⊙(1+2Nη(st−w⋆+pt+bt)⊙(w−t)(N−2)/N)N. (14)

When our target is with non-negative entries, the design of is no longer needed and the algorithm could be simplied to the following form.

 w+0 =uN0=αN, (15) w+t =uNt, w+t+1 =w+t⊙(1−2Nη(st−w⋆+pt+bt)⊙(w+t)(N−2)/N)N

The results in this section are all about updates in equation (15), and will be generalized to updates in equation (14) in Section D.

### a.2 The Key Propositions

Starting from , we have and . The idea of proposition 1 is to show that after some certain number of iterations , we obtain and . Proposition 2 further reduces the approximation error from to if possible, while still maintaining .

###### Proposition 1.

Consider the updates in equations (15). Fix any and let where is some small enough absolute constant. Suppose the error sequences and for any satisfy the following:

 ∥bt∥∞ ≤Cbζ−αN/4, ∥pt∥∞ ≤γ∥st−w⋆∥∞,

where is some small enough absolute constants. If the initialization satisfies

 α≤(18)2/(N−2)∧⎛⎜⎝(w⋆max)(N−2)/Nlogw⋆maxϵ⎞⎟⎠2/(N−2),

and the step size , then for any where

 T1 =7516ηN2ζ(2N−2)/Nlog|w⋆max−αN|ϵ+158N(N−2)ηζα(N−2), T2 =5N(N−1)ηζ(1α(N−2)−1α(N−2)/2),

and any , we have

 ≤ζ, ∥et∥∞ ≤αN/2.

Note that the requirement on can be relaxed to when we just consider the updates in equation (15). However, we still consider the stronger requirement in order to further generalize to updates in equation (14) later.

###### Proposition 2.

Consider the updates in equations (15). Fix any and suppose that the error sequences and for any satisfy