# More is Less: Inducing Sparsity via Overparameterization

In deep learning it is common to overparameterize neural networks, that is, to use more parameters than training samples. Quite surprisingly, training the neural network via (stochastic) gradient descent leads to models that generalize very well, while classical statistics would suggest overfitting. In order to gain understanding of this implicit bias phenomenon, we study the special case of sparse recovery (compressive sensing), which is of interest in its own right. More precisely, in order to reconstruct a vector from underdetermined linear measurements, we introduce a corresponding overparameterized square loss functional, where the vector to be reconstructed is deeply factorized into several vectors. We show that, under a very mild assumption on the measurement matrix, vanilla gradient flow for the overparameterized loss functional converges to a solution of minimal $\ell_1$-norm. The latter is well-known to promote sparse solutions. As a by-product, our results significantly improve the sample complexity for compressive sensing in previous works. The theory accurately predicts the recovery rate in numerical experiments. For the proofs, we introduce the concept of solution entropy, which bypasses the obstacles caused by non-convexity and should be of independent interest.

## 1 Introduction

Overparameterization is highly successful in learning deep neural networks. While this empirical finding was likely observed by countless practitioners, it was systematically studied in numerical experiments in [15, 16, 23]. Increasing the number of parameters in deep neural network models far beyond the number of training samples leads to better and better generalization properties of the learned networks. This is in stark contrast to classical statistics, which would rather suggest overfitting in this scenario. The loss function typically has infinitely many global minimizers in this setting (there are usually infinitely many networks fitting the training samples exactly in the overparameterized regime [23]), so that the employed optimization algorithm has a significant influence on the computed solution. The commonly used (stochastic) gradient descent and its variants seem to have an implicit bias towards “nice” networks with good generalization properties. We conjecture that, in fact, (stochastic) gradient descent applied to learning deep networks favors solutions of low complexity. Of course, the right notion of “low complexity” needs to be identified and may depend on the precise scenario, i.e., network architecture. In the simplified setting of linear networks, i.e., matrix factorizations, and the problem of matrix recovery or more specifically matrix completion, several works [1, 2, 5, 8, 9, 10, 11, 12, 15, 16, 17, 19, 20] identified the right notion of low complexity to be low rank of the factorized matrix. Nevertheless, despite initial theoretical results, a fully convincing theory is not yet available even for this simplified setting.

In this article, we consider the compressive sensing problem of recovering an unknown high-dimensional signal $x_\star \in \mathbb{R}^N$ from few linear measurements of the form

$$ y = A x_\star \in \mathbb{R}^M, \tag{1} $$

where $A \in \mathbb{R}^{M \times N}$ models the measurement process. Whereas this is in general not possible for $M < N$, the seminal works [4, 3, 6] showed that unique reconstruction of $x_\star$ from $A$ and $y$ becomes feasible via efficient methods if $x_\star$ is $s$-sparse (i.e., at most $s$ entries of $x_\star$ are non-zero), $A$ satisfies certain conditions (that hold with high probability for various random matrices), and $M$ scales like $s \log(N/s)$. While it is well-known that $x_\star$ can be recovered via $\ell_1$-minimization, i.e.,

$$ x_\star = \operatorname*{arg\,min}_{Az = y} \|z\|_1 \tag{2} $$

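(and other reconstruction algorithms), we will study the effect of overparameterization and the corresponding implicit bias phenomenon of gradient flow, an abstraction of (stochastic) gradient descent, in this context. More precisely, we consider the square loss

$$ \mathcal{L}(x) := \frac{1}{2} \|Ax - y\|_2^2 \tag{3} $$

and its overparameterized versions

$$ \mathcal{L}_{\mathrm{over}}\big(x^{(1)}, \dots, x^{(L)}\big) := \frac{1}{2} \big\| A \big(x^{(1)} \odot \cdots \odot x^{(L)}\big) - y \big\|_2^2, \tag{4} $$

$$ \mathcal{L}^{\pm}_{\mathrm{over}}\big(u^{(1)}, \dots, u^{(L)}, v^{(1)}, \dots, v^{(L)}\big) := \frac{1}{2} \Big\| A \Big( \bigodot_{\ell=1}^{L} u^{(\ell)} - \bigodot_{\ell=1}^{L} v^{(\ell)} \Big) - y \Big\|_2^2, \tag{5} $$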

where $\odot$ is the Hadamard (or entry-wise) product. Minimizing the (non-overparameterized) square loss via gradient flow/descent starting from $x(0) = 0$ leads to the least-squares solution

$$ x_\infty := \lim_{t \to \infty} x(t) = \operatorname*{arg\,min}_{Az = y} \|z\|_2, $$

which is unrelated to the sparse ground-truth $x_\star$ in general.
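The bias of plain gradient descent towards the minimal $\ell_2$-norm solution can be checked numerically. The following is a minimal sketch in Python with NumPy; the matrix `A`, the vector `y`, the step size and the iteration count are illustrative choices of ours and are not taken from the experiments of this article.

```python
import numpy as np

# Toy underdetermined system (illustrative values, not from the article).
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, 2.0])

def gradient_descent_from_zero(A, y, step=0.1, n_steps=2000):
    """Gradient descent on 0.5 * ||A x - y||_2^2 started at x = 0."""
    x = np.zeros(A.shape[1])
    for _ in range(n_steps):
        x = x - step * (A.T @ (A @ x - y))
    return x

x_gd = gradient_descent_from_zero(A, y)
x_min_l2 = np.linalg.pinv(A) @ y  # minimal l2-norm solution of A z = y

print("gradient descent limit :", x_gd)
print("pseudo-inverse solution:", x_min_l2)
```

Every update is of the form $A^T(\cdot)$, so the iterates never leave the row space of $A$, which is why the limit coincides with the pseudo-inverse solution. In this toy system both vectors are dense, although sparser solutions such as $(0, 1/2, 3/2)$ exist.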

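Motivated by overparameterization in deep learning, we consider $\mathcal{L}_{\mathrm{over}}$ and $\mathcal{L}^{\pm}_{\mathrm{over}}$ and optimize them via gradient flow on each of the variables $x^{(k)}$ and $u^{(k)}, v^{(k)}$, respectively. The analysis for $\mathcal{L}_{\mathrm{over}}$ is simpler than for $\mathcal{L}^{\pm}_{\mathrm{over}}$ but has the drawback that only vectors with non-negative coefficients can be reconstructed. The gradient flow $x^{(j)}(t)$ for $\mathcal{L}_{\mathrm{over}}$, initialized at $x^{(j)}(0) = \alpha \mathbf{1}$ with $\alpha > 0$, is defined via the differential equation

$$ \big(x^{(j)}\big)'(t) = -\nabla_{x^{(j)}} \mathcal{L}_{\mathrm{over}}\big(x^{(1)}(t), \dots, x^{(L)}(t)\big), \qquad x^{(j)}(0) = \alpha \mathbf{1}, \qquad j = 1, \dots, L, $$

and the flow $u^{(j)}(t)$, $v^{(j)}(t)$, for $\mathcal{L}^{\pm}_{\mathrm{over}}$ is defined similarly. We then set $\tilde{x}(t) = \bigodot_{\ell=1}^{L} x^{(\ell)}(t)$ and $\tilde{x}(t) = \bigodot_{\ell=1}^{L} u^{(\ell)}(t) - \bigodot_{\ell=1}^{L} v^{(\ell)}(t)$, respectively, as the factored solutions.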

We will show in this article that $\tilde{x}(t)$ converges to an approximate solution of the $\ell_1$-minimization problem (2) provided that the initialization parameter $\alpha$ is small enough (and that $A$ satisfies a very mild condition). Hence, in situations where $\ell_1$-minimization is known to successfully recover sparse solutions, overparameterized gradient flow will succeed as well. In particular, conditions on the restricted isometry property or the null space property on $A$, or tangent cone conditions and dual certificates on $A$ and $x_\star$ ensuring recovery of $s$-sparse vectors transfer to overparameterized gradient flow. For instance, a random Gaussian matrix $A \in \mathbb{R}^{M \times N}$ ensures recovery of $s$-sparse vectors via overparameterized gradient flow for $M \gtrsim s \log(N/s)$. We refer the interested reader to the monograph [7] for more details on compressive sensing.

Our main result improves on previous work: For , it was shown in [14] that gradient descent for defined in (24) and being closely related to , converges to an -sparse if satisfies a certain coherence assumption, which requires at least measurements. In [21], it was shown for general that converges to an -sparse if the restricted isometry constant of (see below) essentially satisfies . This condition can only be satisfied if , see [7]. Hence, our result significantly reduces the required number of measurements, in fact, down to the optimal number. However, we note that [14, 21] work with gradient descent, while we use gradient flow. An extension of our result to gradient descent is not yet available.

### 1.1 Contribution

In this work, we consider vanilla gradient flow on the factorized models (4) and (5). We show that, under mild assumptions on $A$, gradient flow converges to a solution of (1) whose $\ell_1$-norm is $\varepsilon$-close to the minimum among all solutions. Our main result reads as follows.

###### Theorem 1.1 (Equivalence to ℓ1-minimization, general case).

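Let $L \ge 2$, $A \in \mathbb{R}^{M \times N}$, and $y \in \mathbb{R}^M$. Define the general overparameterized loss function as

$$ \mathcal{L}^{\pm}_{\mathrm{over}}\big(u^{(1)}, \dots, u^{(L)}, v^{(1)}, \dots, v^{(L)}\big) := \frac{1}{2} \|A(\tilde{u} - \tilde{v}) - y\|_2^2 \tag{6} $$

where $\tilde{u} = \bigodot_{k \in [L]} u^{(k)}$ and $\tilde{v} = \bigodot_{k \in [L]} v^{(k)}$. Let $u^{(k)}(t)$ and $v^{(k)}(t)$ follow the flow

$$
\begin{aligned}
\big(u^{(k)}\big)'(t) &= -\nabla_{u^{(k)}} \mathcal{L}^{\pm}_{\mathrm{over}}\big(u^{(1)}, \dots, u^{(L)}, v^{(1)}, \dots, v^{(L)}\big), & u^{(k)}(0) &= \alpha \mathbf{1}, \\
\big(v^{(k)}\big)'(t) &= -\nabla_{v^{(k)}} \mathcal{L}^{\pm}_{\mathrm{over}}\big(u^{(1)}, \dots, u^{(L)}, v^{(1)}, \dots, v^{(L)}\big), & v^{(k)}(0) &= \alpha \mathbf{1}
\end{aligned}
$$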

for some . Suppose the set of non-zero solutions is non-empty. Then the limit exists and lies in .

Further, let and assume that

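$$
\alpha \le h(c, \varepsilon) :=
\begin{cases}
\min\left( e^{-\frac{1}{2}}, \exp\left( \frac{1}{2} - \frac{c^2 + N e^{-1}}{2\varepsilon} \right) \right) & \text{if } L = 2, \\
\left( \frac{2\varepsilon}{L (c + N + \varepsilon)} \right)^{\frac{1}{L-2}} & \text{if } L > 2
\end{cases}
\tag{7}
$$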

with . Then the -norm of satisfies

$$ \|\tilde{x}_\infty\|_1 - \min_{Az = y} \|z\|_1 \le \varepsilon. $$

###### Remark 1.2.

Let us emphasize two points:

1. The power of Theorem 1.1 lies in (7), which gives an explicit non-asymptotic scaling between the initialization $\alpha$ and the error $\varepsilon$. Note that (7) takes different scaling for different $L$. Whereas $L = 2$ requires an exponentially small initialization (already observed in [22]), $L > 2$ is far less restrictive. Since in practice we cannot take $\alpha$ to be arbitrarily small due to finite precision and computation time (empirically, smaller $\alpha$ lead to slower convergence of gradient flow/descent), our theory suggests a clear advantage of deep factorizations ($L > 2$) over shallow factorization ($L = 2$). This observation is also consistent with our experiments in Section 4.

2. The assumption that is satisfied as soon as is not aligned with one of the coordinate axes. In our case the assumption can always be fulfilled by reducing the problem dimension: First determine the set of entry indices on which all vectors in are , then compute any solution of (1) and fix the entries in (all possible solutions share common values on ), and finally reduce the gradient flow dynamics to the remaining entries.

It is particularly interesting that the recent work [17], despite being in the different context of matrix recovery, provides for a counterexample to norm-minimizing properties of gradient flow that relies on violating the condition .
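To get a feeling for the different scalings in (7), the threshold $h(c, \varepsilon)$ can be evaluated for a few depths. The following is a minimal sketch; the function name `init_threshold` is hypothetical, the formula follows our reading of (7), and the values of `c`, `eps` and `N` are arbitrary illustrative choices.

```python
import math

def init_threshold(c, eps, N, L):
    """Evaluate h(c, eps) as read from (7) for depth L >= 2 and dimension N."""
    if L < 2:
        raise ValueError("the factorization depth L must be at least 2")
    if L == 2:
        return min(math.exp(-0.5),
                   math.exp(0.5 - (c**2 + N * math.exp(-1.0)) / (2.0 * eps)))
    return (2.0 * eps / (L * (c + N + eps))) ** (1.0 / (L - 2))

# Illustrative values (assumptions, not taken from the article).
c, eps, N = 1.0, 0.1, 100
thresholds = {L: init_threshold(c, eps, N, L) for L in (2, 3, 5, 10)}
for L, h in thresholds.items():
    print(f"L = {L:2d}: alpha <= {h:.3e}")
```

For these values the admissible initialization for $L = 2$ is far below machine precision, while for $L > 2$ it stays in a range that is usable in floating-point arithmetic, in line with the first point of the remark.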

The significance of Theorem 1.1 is that many state-of-the-art results on reconstructing (effectively) sparse vectors can be derived in a straight-forward way. For instance, it is easy to check that with [7, Theorem 4.14] the following holds (note that, e.g., Gaussian matrices satisfy the required stable null space property with high probability if ), cf. Figure 1 and Section 4.

###### Corollary 1.3.

Let be the -sparse solution that we wish to recover. Suppose and satisfies the stable null space property of order with constant . Define as in Theorem 1.1. Let and assume that the initialization parameter satisfies

$$ \alpha \le h\left( \|x_*\|_1, \frac{1 - \rho}{1 + \rho} \cdot \varepsilon \right) $$

with defined as in (7). Then the reconstruction error

$$ \|\tilde{x}_\infty - x_*\|_1 \le \varepsilon. \tag{8} $$
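The flow of Theorem 1.1 can be imitated numerically by explicit Euler steps, i.e., gradient descent with a small step size. This is only a sketch under assumptions: the guarantees above are stated for gradient flow, and the toy data, the step size, and the function name `signed_factor_descent` are our own illustrative choices. The factor gradients are obtained by applying the chain rule to (6).

```python
import numpy as np

def signed_factor_descent(A, y, L, alpha, step=1e-3, n_steps=20000):
    """Explicit Euler steps for the flow of (6) with u^(k)(0) = v^(k)(0) = alpha * 1."""
    N = A.shape[1]
    U = np.full((L, N), alpha)  # rows are the factors u^(1), ..., u^(L)
    V = np.full((L, N), alpha)  # rows are the factors v^(1), ..., v^(L)
    for _ in range(n_steps):
        g = A.T @ (A @ (U.prod(axis=0) - V.prod(axis=0)) - y)
        # Chain rule: each factor is multiplied by the product of the other factors.
        grad_U = np.stack([g * np.delete(U, k, axis=0).prod(axis=0) for k in range(L)])
        grad_V = np.stack([-g * np.delete(V, k, axis=0).prod(axis=0) for k in range(L)])
        U, V = U - step * grad_U, V - step * grad_V
    return U.prod(axis=0) - V.prod(axis=0)

# Toy problem (illustrative): one measurement of a vector in R^2.
A = np.array([[1.0, 2.0]])
y = np.array([-2.0])

x_over = signed_factor_descent(A, y, L=2, alpha=1e-3)
x_min_l2 = np.linalg.pinv(A) @ y

print("overparameterized:", x_over, "l1-norm:", np.abs(x_over).sum())
print("minimal l2-norm  :", x_min_l2, "l1-norm:", np.abs(x_min_l2).sum())
```

For this toy problem the minimal $\ell_1$-norm solution is $(0, -1)$, whereas the minimal $\ell_2$-norm solution is $(-0.4, -0.8)$. The overparameterized iterate is expected to end up close to the former, including the negative sign that the purely positive model (4) could not represent.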

### 1.2 Related Work

Before continuing with the main body of the work, let us give a brief overview over related results in the literature. As already mentioned before, the works [14, 21] derive robust reconstruction guarantees for gradient descent and the compressed sensing model (1). Robust means here that their model incorporates additive noise on . Whereas [21] considers and having a restricted isometry property, [14] extends the results to under a coherence assumption for .

In [22] the authors examine the limits of gradient flow when initialized by , for any . They can show that for large the limit of gradient flow approximates the least-square solution, whereas for small it approximates an -norm minimizer. This paper is maybe most related to our work. Whereas the authors discuss more general types of initialization, their proof strategy is fundamentally different from ours and has certain shortcomings. They need to assume convergence of the gradient flow and obtain only for non-asymptotic bounds on the initialization magnitude required for implicit -regularization. In contrast, we show convergence of gradient flow under mild assumptions on and provide non-asymptotic bounds for all leading to less restrictive assumptions on the initialization magnitude.

The connection between factorized overparametrization and -minimization was also observed in statistics literature. In [13] the author shows that solving the LASSO is equivalent to minimizing an -regularized overparametrized functional (for ) and uses this to solve LASSO by alternating least-squares methods. Although deeper factorizations are considered as well in the paper, the presented approach leads to different results since the overparametrized functional is equivalent to -norm and not -norm minimization, for . The subsequent work [24] builds upon those ideas to perform sparse recovery with gradient descent assuming a restricted isometry property of . Nevertheless, the presented results share the suboptimal sample complexity of [21, 14].

Apart from these closely related works, which treat the reconstruction of sparse vectors, another line of research deals with factorization/reconstruction of matrix-valued signals via overparametrization. The corresponding results are of a similar flavor and show an implicit low-rank bias of gradient flow/descent when minimizing factorized square losses [1, 2, 5, 8, 9, 10, 11, 12, 15, 16, 17, 19, 20]. It is noteworthy that the existing matrix sensing results, e.g., [20], require measurements to guarantee reconstruction of rank- -matrices via gradient descent, i.e., they share the suboptimal sample complexity of [21, 14]. For comparison, for low-rank matrix reconstruction by conventional methods like nuclear-norm minimization, only measurements are needed. For a more detailed discussion of the literature on matrix factorization/sensing via overparametrization, we refer the reader to [5].

### 1.3 Outline

Sections 2 and 3 are dedicated to proving Theorem 1.1. Section 2 illustrates the proof strategy in a simplified setting, whereas Section 3 extends the proof to full generality. For the sake of clarity, some of the proof details of Section 3 are postponed to Appendix A. Finally, we present in Section 4 numerical evidence supporting our claims and conclude with a brief summary/outlook on future research directions in Section 5.

### 1.4 Notation

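For $N \in \mathbb{N}$, we denote $[N] := \{1, \dots, N\}$. Boldface lower-case letters like $\mathbf{x}$ represent vectors with entries $x_n$, while boldface upper-case letters like $\mathbf{A}$ represent matrices with entries $A_{mn}$. For $\mathbf{x}, \mathbf{y} \in \mathbb{R}^N$, $\mathbf{x} \ge \mathbf{y}$ means that $x_n \ge y_n$, for all $n \in [N]$. We use $\odot$ to denote the Hadamard product, i.e., the vectors $\mathbf{x} \odot \mathbf{y}$ and $\mathbf{x}^{\odot p}$ have entries $(\mathbf{x} \odot \mathbf{y})_n = x_n y_n$ and $(\mathbf{x}^{\odot p})_n = x_n^p$, respectively. We abbreviate $\tilde{\mathbf{x}} := \bigodot_{k \in [L]} \mathbf{x}^{(k)} = \mathbf{x}^{(1)} \odot \cdots \odot \mathbf{x}^{(L)}$. Last but not least, we apply the logarithm entry-wise to positive vectors, i.e., $\log(\mathbf{x}) \in \mathbb{R}^N$ with $\log(\mathbf{x})_n = \log(x_n)$.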

## 2 Positive Case

To show Theorem 1.1, we first prove in this section the following simplified version, Theorem 2.1, which treats the model in (4) and is restricted to the positive orthant (recall that gradient flow applied to (4) preserves the entry-wise sign of the iterates). In fact, the proof strategy for Theorem 1.1 is then a straightforward adaptation of the proof of Theorem 2.1 and will be discussed in Section 3. Note that Theorem 2.1 can be easily adapted to other orthants by changing the signs of the initialization vector.

###### Theorem 2.1 (Equivalence to ℓ1-minimization, positive case).

Let , and . Define the overparameterized loss function as

$$ \mathcal{L}_{\mathrm{over}}\big(x^{(1)}, \dots, x^{(L)}\big) := \frac{1}{2} \|A\tilde{x} - y\|_2^2 \tag{9} $$

where . Let follow the flow with for any . Suppose the set of strictly positive solutions is non-empty. Then the limit exists and lies in . Moreover, its -norm is away from the minimum, i.e.

$$ \|\tilde{x}_\infty\|_1 - \min_{Az = y,\, z \ge 0} \|z\|_1 \le \varepsilon $$

for any if defined as in Theorem 1.1 with .

### 2.1 Reduced Factorized Loss

To analyze the dynamics , we first derive a compact expression for . We further simplify this expression by assuming identical initialization, and arrive at the Reduced Factorized Loss, which will be used in the proofs later on.

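###### Lemma 2.2.

Let $\tilde{x} = \bigodot_{\ell \in [L]} x^{(\ell)}$ and $\tilde{x}_{k^c} := \bigodot_{\ell \in [L] \setminus \{k\}} x^{(\ell)}$. Then

$$ \nabla_{x^{(k)}} \mathcal{L}_{\mathrm{over}}\big(x^{(1)}, \dots, x^{(L)}\big) = \big[A^T(A\tilde{x} - y)\big] \odot \tilde{x}_{k^c}. \tag{10} $$

###### Proof.

By the chain rule we have, for any $k \in [L]$ and $n \in [N]$, that

$$
\begin{aligned}
\nabla_{x^{(k)}_n} \mathcal{L}_{\mathrm{over}}\big(x^{(1)}, \dots, x^{(L)}\big)
&= \frac{1}{2} \sum_{m \in [M]} \nabla_{x^{(k)}_n} (A\tilde{x} - y)_m^2 \\
&= \sum_{m \in [M]} (A\tilde{x} - y)_m A_{mn} (\tilde{x}_{k^c})_n
= \big[A^T(A\tilde{x} - y)\big]_n (\tilde{x}_{k^c})_n.
\end{aligned}
$$

∎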
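The gradient formula (10) can be sanity-checked against central finite differences. The following is a minimal sketch; the helper names `loss_over` and `grad_over`, the random test data, and the difference step `h` are our own assumptions.

```python
import numpy as np

def loss_over(A, y, X):
    """Overparameterized loss (9); the rows of X are the factors x^(1), ..., x^(L)."""
    return 0.5 * np.sum((A @ X.prod(axis=0) - y) ** 2)

def grad_over(A, y, X, k):
    """Gradient with respect to the k-th factor according to (10)."""
    x_tilde = X.prod(axis=0)
    x_kc = np.delete(X, k, axis=0).prod(axis=0)  # product of all factors except k
    return (A.T @ (A @ x_tilde - y)) * x_kc

rng = np.random.default_rng(0)
M, N, L = 4, 6, 3
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)
X = rng.standard_normal((L, N))

k, h = 1, 1e-6
finite_diff = np.zeros(N)
for n in range(N):
    E = np.zeros_like(X)
    E[k, n] = h  # perturb only the n-th entry of the k-th factor
    finite_diff[n] = (loss_over(A, y, X + E) - loss_over(A, y, X - E)) / (2 * h)

max_err = np.max(np.abs(finite_diff - grad_over(A, y, X, k)))
print("maximal deviation from finite differences:", max_err)
```

Since the loss is quadratic in each single factor, the central difference is exact up to rounding, so the deviation should be tiny.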

###### Lemma 2.3 (Identical Initialization).

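Suppose $x^{(k)}(t)$ follows the negative gradient flow

$$ \big(x^{(k)}\big)'(t) = -\nabla_{x^{(k)}} \mathcal{L}_{\mathrm{over}}\big(x^{(1)}, \dots, x^{(L)}\big). $$

If all initialization vectors are identical, i.e. $x^{(k)}(0) = x^{(k')}(0)$ for all $k, k' \in [L]$, then the vectors remain identical for all $t \ge 0$, i.e. $x^{(k)}(t) = x^{(k')}(t)$. Moreover, the dynamics will be given by

$$ x'(t) = -\nabla \mathcal{L}(x), $$

where $x = x^{(1)} = \dots = x^{(L)}$ and $\nabla \mathcal{L}(x) = \big[A^T(Ax^{\odot L} - y)\big] \odot x^{\odot L-1}$.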

###### Proof.

It suffices to show that if for all , then for all . By Lemma 2.2, the only dependence of on is . Since and are identical,

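$$ \tilde{x}_{k^c} = \bigodot_{\ell \in [L] \setminus \{k\}} x^{(\ell)} = \bigodot_{\ell \in [L] \setminus \{k'\}} x^{(\ell)} = \tilde{x}_{(k')^c} $$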

and hence . Since by assumption all are identical, we can replace them with . Hence and . Plugging this into (10), we get that , which is exactly . ∎
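The statement of Lemma 2.3 can be observed on a discretized version of the flow. The sketch below runs explicit Euler steps on all $L$ factors using (10) and, in parallel, on a single vector with the reduced dynamics; the toy data, the step size and the helper names are illustrative assumptions of ours.

```python
import numpy as np

def factor_step(A, y, X, step):
    """One explicit Euler step on all L factors (rows of X), using (10)."""
    g = A.T @ (A @ X.prod(axis=0) - y)
    grads = np.stack([g * np.delete(X, k, axis=0).prod(axis=0)
                      for k in range(X.shape[0])])
    return X - step * grads

def reduced_step(A, y, x, L, step):
    """One explicit Euler step of the reduced dynamics with a single vector x."""
    return x - step * (A.T @ (A @ x**L - y)) * x ** (L - 1)

# Toy data (illustrative values).
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, 2.0])
L, alpha, step = 3, 0.5, 0.01

X = np.full((L, A.shape[1]), alpha)  # identical initialization alpha * 1
x = np.full(A.shape[1], alpha)
for _ in range(200):
    X = factor_step(A, y, X, step)
    x = reduced_step(A, y, x, L, step)

max_factor_gap = np.max(np.abs(X - X[0]))    # do the factors stay identical?
max_reduced_gap = np.max(np.abs(X[0] - x))   # do they follow the reduced dynamics?
print(max_factor_gap, max_reduced_gap)
```

Both gaps should vanish up to floating-point rounding, which is the discrete analogue of the lemma.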

At first sight, the reduction of the number of parameters in Lemma 2.3 seems counter-intuitive to the idea of overparameterization. However, as opposed to the standard loss function , we arrive at a new loss function , which changes the optimization landscape. This new loss promotes an implicit bias to the gradient flow trajectory, which can be useful in various contexts [2, 5, 21, 18]. Motivated by Lemma 2.3, we will thus consider the loss function defined below for the rest of the paper.

###### Definition 2.4 (Reduced Factorized Loss).

Let , . For and , the reduced factorized loss function is defined as

$$ \mathcal{L} : \mathbb{R}^N \to [0, \infty), \qquad \mathcal{L}(x) := \frac{1}{2} \|A x^{\odot L} - y\|_2^2. \tag{11} $$

Its derivative is given by .
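To illustrate how the reduced dynamics of Lemma 2.3 select a solution, one can discretize them by explicit Euler steps on a tiny example. This is a sketch under assumptions: the results of this article concern gradient flow rather than its discretization, and the data, the step size, the initialization scales and the name `reduced_descent` are our own illustrative choices.

```python
import numpy as np

def reduced_descent(A, y, L, alpha, step=1e-3, n_steps=20000):
    """Explicit Euler steps for x' = -[A^T(A x^L - y)] * x^(L-1), x(0) = alpha * 1."""
    x = np.full(A.shape[1], alpha)
    for _ in range(n_steps):
        x = x - step * (A.T @ (A @ x**L - y)) * x ** (L - 1)
    return x**L  # factored solution x~ = x^{⊙L}

# Toy problem (illustrative): nonnegative solutions satisfy z1 + 2 z2 = 2,
# and the minimal l1-norm among them is 1, attained at (0, 1).
A = np.array([[1.0, 2.0]])
y = np.array([2.0])

x_shallow = reduced_descent(A, y, L=2, alpha=1e-3)
x_deep = reduced_descent(A, y, L=3, alpha=0.05)
x_min_l2 = np.linalg.pinv(A) @ y

print("L = 2:", x_shallow, "l1-norm:", x_shallow.sum())
print("L = 3:", x_deep, "l1-norm:", x_deep.sum())
print("min l2:", x_min_l2, "l1-norm:", np.abs(x_min_l2).sum())
```

Both runs are expected to end close to the sparse solution $(0, 1)$, while the minimal $\ell_2$-norm solution $(0.4, 0.8)$ has $\ell_1$-norm $1.2$. Note that the deeper factorization is run with a much larger initialization scale.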

### 2.2 Solution Entropy

The lack of convexity in that causes the intended implicit bias also complicates the analysis. In particular, even if converges, might not reach zero. For instance, is a stationary point because , but clearly for any . This means that might converge to a vector that is not in the solution space .

To overcome the difficulty described above, we introduce a useful quantity which we dub the Solution Entropy.

###### Definition 2.5 (Solution Entropy).

For and , the solution entropy is defined as

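$$
f_{L,z}(x) :=
\begin{cases}
\frac{1}{2} \|x\|_2^2 - \langle z, \log(x) \rangle & \text{if } L = 2, \\
\frac{1}{2} \|x\|_2^2 + \frac{1}{L-2} \langle z, x^{\odot -(L-2)} \rangle & \text{if } L > 2, \\
\infty & \text{if } x_n = 0,\ z_n \neq 0 \text{ for some } n.
\end{cases}
\tag{12}
$$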

Recall that the logarithm is applied entry-wise here.

###### Remark 2.6.

The name Solution Entropy comes from the facts that we assume that is a solution to the linear equation in the following proofs, that is non-increasing in and is strictly convex in , and that the unique minimum of is attained at if . These properties are proven in Lemma 2.7 and Lemma 2.8 below.

###### Lemma 2.7 (Non-increasing in time).

Suppose . Let follow

$$ x'(t) = -\nabla \mathcal{L}(x) = -\big[A^T(Ax^{\odot L} - y)\big] \odot x^{\odot L-1}. \tag{13} $$

If , for all , then is non-increasing in time with decay rate , for all . This implies that the functions decrease along at the same rate independent of the choice of .

###### Proof.

By direct computation we get

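$$
\begin{aligned}
\partial_t f_{L,z} = \langle \nabla f_{L,z}(x), \partial_t x \rangle
&= -\big\langle x - z \odot x^{\odot -(L-1)}, \big[A^T(Ax^{\odot L} - y)\big] \odot x^{\odot L-1} \big\rangle \\
&= -\big\langle x^{\odot L} - z, A^T A (x^{\odot L} - z) \big\rangle \\
&= -2\mathcal{L}(x) \le 0,
\end{aligned}
$$

where we used $y = Az$ in the last two equalities. ∎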
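The identity behind this computation is purely algebraic and can be checked at an arbitrary positive point. The following sketch uses the gradient $\nabla f_{L,z}(x) = x - z \odot x^{\odot -(L-1)}$ and the right-hand side of (13); the random data and helper names are our own assumptions, and $y$ is constructed as $Az$ so that $z$ is a solution.

```python
import numpy as np

def entropy_gradient(x, z, L):
    """Gradient of the solution entropy, x - z * x^(-(L-1)), for positive x."""
    return x - z * x ** (-(L - 1))

def flow_velocity(A, y, x, L):
    """Right-hand side of (13)."""
    return -(A.T @ (A @ x**L - y)) * x ** (L - 1)

rng = np.random.default_rng(0)
M, N = 3, 6
A = rng.standard_normal((M, N))
z = rng.uniform(0.5, 2.0, size=N)  # strictly positive vector ...
y = A @ z                          # ... which solves A z = y by construction
x = rng.uniform(0.5, 2.0, size=N)  # arbitrary positive state of the flow

checks = {}
for L in (2, 4):
    time_derivative = entropy_gradient(x, z, L) @ flow_velocity(A, y, x, L)
    loss = 0.5 * np.sum((A @ x**L - y) ** 2)
    checks[L] = (time_derivative, -2.0 * loss)
    print(f"L = {L}: d/dt f = {time_derivative:.6f}, -2 * loss = {-2.0 * loss:.6f}")
```

The two printed numbers should agree for each depth, independently of which solution `z` was used to build the entropy.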

###### Lemma 2.8 (Convexity).

For , the solution entropy is strictly convex with respect to . The unique minimum is attained at , given by

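$$
\min_{x \ge 0} f_{L,z}(x) =
\begin{cases}
\frac{1}{2} \langle z, \mathbf{1} - \log(z) \rangle & \text{if } L = 2, \\
\frac{L}{2(L-2)} \langle z^{\odot \frac{2}{L}}, \mathbf{1} \rangle & \text{if } L > 2.
\end{cases}
\tag{14}
$$

###### Proof.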

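The Hessian matrix of $f_{L,z}$ is diagonal with entries $1 + (L-1) z_n x_n^{-L}$. Hence it is strictly positive definite and $f_{L,z}$ is strictly convex. The gradient of $f_{L,z}$, given by

$$ \nabla f_{L,z}(x) = x - z \odot x^{\odot -(L-1)}, $$

equals zero if and only if $x = z^{\odot \frac{1}{L}}$. Plugging this into $f_{L,z}$, we get for $L = 2$

$$ \min_{x \ge 0} f_{L,z}(x) = \sum_{n \in [N]} \Big( \frac{1}{2} z_n - z_n \log\big(z_n^{\frac{1}{2}}\big) \Big) = \frac{1}{2} \langle z, \mathbf{1} - \log(z) \rangle, $$

and for $L > 2$

$$ \min_{x \ge 0} f_{L,z}(x) = \sum_{n \in [N]} \Big( \frac{1}{2} z_n^{\frac{2}{L}} + \frac{z_n}{L-2} z_n^{-(1 - \frac{2}{L})} \Big) = \frac{L}{2(L-2)} \sum_{n \in [N]} z_n^{\frac{2}{L}} = \frac{L}{2(L-2)} \langle z^{\odot \frac{2}{L}}, \mathbf{1} \rangle. $$

∎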
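The closed-form minimum in (14) can be compared with direct evaluations of the solution entropy (12). The following is a minimal sketch; the helper names, the random vector `z` and the sampling ranges are our own illustrative assumptions.

```python
import numpy as np

def solution_entropy(x, z, L):
    """Solution entropy f_{L,z}(x) from (12) for entry-wise positive x."""
    if L == 2:
        return 0.5 * x @ x - z @ np.log(x)
    return 0.5 * x @ x + (z @ x ** (-(L - 2))) / (L - 2)

def entropy_minimum(z, L):
    """Closed-form minimum from (14) for entry-wise positive z."""
    if L == 2:
        return 0.5 * z @ (1.0 - np.log(z))
    return L / (2.0 * (L - 2)) * np.sum(z ** (2.0 / L))

rng = np.random.default_rng(1)
z = rng.uniform(0.5, 2.0, size=5)

results = {}
for L in (2, 5):
    at_minimizer = solution_entropy(z ** (1.0 / L), z, L)  # candidate x = z^(1/L)
    # Random positive points should never undercut the claimed minimum.
    samples = rng.uniform(0.1, 3.0, size=(1000, z.size))
    smallest_sample = min(solution_entropy(x, z, L) for x in samples)
    results[L] = (at_minimizer, entropy_minimum(z, L), smallest_sample)
    print(L, results[L])
```

For each depth, the first two numbers should coincide and the third should not be smaller, reflecting strict convexity with the unique minimizer $x = z^{\odot 1/L}$.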

### 2.3 Convergence and Optimality

With the help of the solution entropy, which is decreasing in and convex in , we are ready to prove that avoids the bad stationary points described in Section 2.2, converges to an element in the solution space, i.e. , and that its limit can be characterized by the solution entropy. Recall that we abbreviate the set of strictly positive solutions as

$$ S_+ := \{z > 0 : Az = y\}. \tag{15} $$

###### Lemma 2.9 (Bounded away from zero).

Assume that and let . Let follow as in (13). If , then for all ,

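$$ x(t) \ge \varepsilon(L, z) > 0 \tag{16} $$

for some time-independent lower bound $\varepsilon(L, z) \in \mathbb{R}^N$ given by

$$
\varepsilon_n(L, z) :=
\begin{cases}
\exp\left( -\frac{f_{L,z}(x(0)) + g_n(L, z)}{z_n} \right) & \text{if } L = 2, \\
\left( \frac{z_n}{(L-2)\big(f_{L,z}(x(0)) + g_n(L, z)\big)} \right)^{\frac{1}{L-2}} & \text{if } L > 2,
\end{cases}
$$

where

$$
g_n(L, z) :=
\begin{cases}
\frac{1}{2} \big( z_n (1 - \log(z_n)) - \langle z, \mathbf{1} - \log(z) \rangle \big) & \text{if } L = 2, \\
\frac{L}{2(L-2)} \big( z_n^{\frac{2}{L}} - \langle z^{\odot \frac{2}{L}}, \mathbf{1} \rangle \big) & \text{if } L > 2
\end{cases}
$$

and $f_{L,z}$ is the solution entropy from Definition 2.5.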

###### Proof.

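Let $c_0 := f_{L,z}(x(0))$ and

$$ t_* := \inf\big\{ t \ge 0 : \exists\, n \in [N] : x_n(t) < \varepsilon_n(L, z) \big\}. $$

We will first show that the claim holds for $t = 0$, and then deduce that $t_* = \infty$. Due to the different shape of $f_{L,z}$, we separate the case $L = 2$ from $L > 2$.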

#### L=2:

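Since $f_{L,z}(x(0)) = c_0$,

$$
\begin{aligned}
\frac{1}{2} x_n^2(0) - z_n \log(x_n(0))
&= c_0 - \sum_{k \in [N], k \neq n} \Big( \frac{1}{2} x_k^2(0) - z_k \log(x_k(0)) \Big) \\
&\le c_0 - \min_{\xi \ge 0} \sum_{k \in [N], k \neq n} \Big( \frac{1}{2} \xi_k^2 - z_k \log(\xi_k) \Big) \\
&= c_0 - \frac{1}{2} \sum_{k \in [N], k \neq n} \big( z_k - z_k \log(z_k) \big) \\
&= c_0 + g_n(L, z),
\end{aligned}
$$

where the second-to-last step follows from separability of the addends and basic calculus. Since $\frac{1}{2} x_n^2(0) \ge 0$,

$$ -z_n \log(x_n(0)) \le c_0 + g_n(L, z). $$

Dividing by $-z_n$ and then taking the exponential on both sides, we get that

$$ x_n(0) \ge \exp\left( -\frac{c_0 + g_n(L, z)}{z_n} \right) = \varepsilon_n(L, z). $$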

Now assume that . By continuity of , we know that there is some such that , for . By Lemma 2.7, we know that the solution entropy is non-increasing in time on and thus . By the similar calculations we have that

$$
\frac{1}{2} x_n^2(t) - z_n \log(x_n(t))
\le c_0 - \sum_{k \in [N], k \neq n} \Big( \frac{1}{2} x_k^2(t) - z_k \log(x_k(t)) \Big)
\le c_0 + g_n(L, z)
$$

which implies that for all . But this contradicts the definition of and shows that .

#### L>2:

The methodology is the same as in the case ; the only difference is the form of and . Again, since ,

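$$
\begin{aligned}
\frac{1}{2} x_n^2(0) + \frac{z_n}{(L-2)\, x_n(0)^{L-2}}
&= c_0 - \sum_{k \in [N], k \neq n} \Big( \frac{1}{2} x_k^2(0) + \frac{z_k}{(L-2)\, x_k(0)^{L-2}} \Big) \\
&\le c_0 - \min_{\xi > 0} \sum_{k \in [N], k \neq n} \Big( \frac{1}{2} \xi_k^2 + \frac{z_k}{(L-2)\, \xi_k^{L-2}} \Big) \\
&= c_0 - \frac{L}{2(L-2)} \sum_{k \in [N], k \neq n} z_k^{\frac{2}{L}} \\
&= c_0 + g_n(L, z).
\end{aligned}
$$

Since $\frac{1}{2} x_n^2(0) \ge 0$ and $z_n > 0$,

$$ 0 < \frac{z_n}{(L-2)\, x_n(0)^{L-2}} \le c_0 + g_n(L, z). $$

Rearranging terms, we get that

$$ x_n(0) \ge \left( \frac{z_n}{(L-2)\big(c_0 + g_n(L, z)\big)} \right)^{\frac{1}{L-2}} = \varepsilon_n(L, z). $$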

Now assume that . By continuity of , we know that there is some such that , for . By Lemma 2.7, we know that the solution entropy is non-increasing in time on and thus . By similar calculations we have that

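$$
\frac{1}{2} x_n^2(t) + \frac{z_n}{(L-2)\, x_n(t)^{L-2}}
\le c_0 - \sum_{k \in [N], k \neq n} \Big( \frac{1}{2} x_k^2(t) + \frac{z_k}{(L-2)\, x_k(t)^{L-2}} \Big)
\le c_0 + g_n(L, z),
$$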

which again implies that for all . But this again contradicts the definition of and shows that .

Lemma 2.9 provides uniform bounds for the trajectory of . Elaborating on the properties of the solution entropy, we can deduce further information about the convergence and the limit of . To this end, we define some related quantities.

###### Definition 2.10 (Entropy Gap).

Let follow as in (13) with . For , denote the minimum and initial value of the solution entropy as

$$ c_{L,z,\min} := \min_{\xi \ge 0} f_{L,z}(\xi), \qquad c_{L,z,0} := f_{L,z}(x(0)), \tag{17} $$

where is well-defined because of and Lemma 2.8. For , define in addition the limit value of the solution entropy as

$$ c_{L,z,\infty} := \lim_{t \to \infty} f_{L,z}(x(t)) \tag{18} $$

which exists because is non-increasing in time by Lemma 2.7 and Lemma 2.9. Define the Entropy Gap and Maximal Entropy Gap as

$$ \Delta_{L,z} := c_{L,z,0} - c_{L,z,\infty}, \qquad \Delta^{\max}_{L,z} := c_{L,z,0} - c_{L,z,\min}. \tag{19} $$

Note that by construction for , and by Lemma 2.7 remains constant for all .

###### Lemma 2.11 (Convergence).

Suppose is non-empty and . Let follow as in (13) with . Then exists, , and

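$$ x_\infty^{\odot L} = \operatorname*{arg\,min}_{z \in \overline{S_+}} h_{x(0)}(z) \tag{20} $$

where

$$
h_{x(0)}(z) =
\begin{cases}
\langle \log(z), z \rangle - \langle 2\log(x(0)) + \mathbf{1}, z \rangle & \text{if } L = 2, \\
2 \langle \mathbf{1}, z \rangle - L \langle x^{\odot L-2}(0), z^{\odot \frac{2}{L}} \rangle & \text{if } L > 2, \\
\infty & \text{if } z_n = 0 \text{ for some } n.
\end{cases}
$$

###### Proof.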

Since is non-empty, by Lemma 2.9 for we have that for all