1 Introduction
Overparameterization is highly successful in learning deep neural networks. While this empirical finding was likely observed by countless practitioners, it was systematically studied in numerical experiments in [15, 16, 23]: increasing the number of parameters in deep neural network models far beyond the number of training samples leads to better and better generalization properties of the learned networks. This is in stark contrast to classical statistics, which would rather suggest overfitting in this scenario. The loss function typically has infinitely many global minimizers in this setting (there are usually infinitely many networks fitting the training samples exactly in the overparameterized regime [23]), so that the employed optimization algorithm has a significant influence on the computed solution. The commonly used (stochastic) gradient descent and its variants seem to have an implicit bias towards “nice” networks with good generalization properties. We conjecture that, in fact, (stochastic) gradient descent applied to learning deep networks favors solutions of low complexity. Of course, the right notion of “low complexity” needs to be identified and may depend on the precise scenario, i.e., the network architecture. In the simplified setting of linear networks, i.e., matrix factorizations, and the problem of matrix recovery or, more specifically, matrix completion, several works [1, 2, 5, 8, 9, 10, 11, 12, 15, 16, 17, 19, 20] identified the right notion of low complexity to be low rank of the factorized matrix. Nevertheless, despite initial theoretical results, a fully convincing theory is not yet available even for this simplified setting.
In this article, we consider the compressive sensing problem of recovering an unknown high-dimensional signal $x \in \mathbb{R}^n$ from few linear measurements of the form
(1) $y = Ax$,
where $A \in \mathbb{R}^{m \times n}$ models the measurement process. Whereas this is in general not possible for $m < n$, the seminal works [4, 3, 6] showed that unique reconstruction of $x$ from $A$ and $y$ becomes feasible via efficient methods if $x$ is $s$-sparse (i.e., at most $s$ entries of $x$ are nonzero),
$A$ satisfies certain conditions (that hold with high probability for various random matrices), and
$m$ scales like $s \log(n/s)$. While it is well-known that $x$ can be recovered via $\ell_1$-minimization, i.e.,
(2) $\min_{z \in \mathbb{R}^n} \|z\|_1 \quad \text{subject to} \quad Az = y$
(and other reconstruction algorithms), we will study the effect of overparameterization and the corresponding implicit bias phenomenon of gradient flow – an abstraction of (stochastic) gradient descent – in this context. More precisely, we consider the square loss
(3) $\mathcal{L}(z) = \frac{1}{2} \|Az - y\|_2^2$
and its overparameterized versions
(4) $\mathcal{L}_N(w_1, \dots, w_N) = \frac{1}{2} \|A(w_1 \odot \cdots \odot w_N) - y\|_2^2$,
(5) $\mathcal{L}_N(u_1, \dots, u_N, v_1, \dots, v_N) = \frac{1}{2} \|A(u_1 \odot \cdots \odot u_N - v_1 \odot \cdots \odot v_N) - y\|_2^2$,
where $\odot$ is the Hadamard (or entrywise) product. Minimizing the (non-overparameterized) square loss (3) via gradient flow/descent starting from $z(0) = 0$ leads to the least-squares solution of minimal $\ell_2$-norm, $z_\infty = A^\dagger y$,
which is unrelated to the sparse ground truth in general.
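A quick numerical check of this fact: gradient descent on the plain square loss started at zero stays in the row space of $A$ and hence converges to the minimum-$\ell_2$-norm least-squares solution $A^\dagger y$, which is dense in general. The problem sizes below are our illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 50
A = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n)
x[:2] = 1.0                          # sparse ground truth
y = A @ x

# Gradient descent on L(z) = 0.5*||Az - y||^2, initialized at z = 0
z = np.zeros(n)
lr = 0.2
for _ in range(5000):
    z -= lr * A.T @ (A @ z - y)

# The iterates stay in the row space of A, so the limit is the
# minimum-l2-norm solution A^+ y -- dense, not the sparse ground truth.
z_min_norm = np.linalg.pinv(A) @ y
```

The limit matches the pseudoinverse solution to machine precision, yet differs substantially from the sparse ground truth.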
Motivated by overparameterization in deep learning, we consider $N \geq 2$ and optimize (4), resp. (5), via gradient flow on each of the variables $w_1, \dots, w_N$, resp. $u_1, \dots, u_N$ and $v_1, \dots, v_N$. The analysis for the flow of (4) is simpler than for the flow of (5), but has the drawback that only vectors with nonnegative coefficients can be reconstructed. The gradient flow for (4), initialized at $w_j(0) = \alpha \mathbf{1}$, is defined via the differential equation
$w_j'(t) = -\nabla_{w_j} \mathcal{L}_N(w_1(t), \dots, w_N(t)), \quad j \in [N],$
and the flow $u_j(t)$, $v_j(t)$, $j \in [N]$, for (5) is defined similarly. We then set $x(t) = w_1(t) \odot \cdots \odot w_N(t)$ and $x(t) = u_1(t) \odot \cdots \odot u_N(t) - v_1(t) \odot \cdots \odot v_N(t)$ as the factored solutions.
We will show in this article that overparameterized gradient flow converges to an approximate solution of the $\ell_1$-minimization problem (2) provided that the initialization parameter $\alpha$ is small enough (and that $A$ satisfies a very mild condition). Hence, in situations where $\ell_1$-minimization is known to successfully recover sparse solutions, overparameterized gradient flow will succeed as well. In particular, conditions such as the restricted isometry property or the null space property on $A$, or tangent cone conditions and dual certificates on $A$ and $x$, ensuring recovery of sparse vectors transfer to overparameterized gradient flow. For instance, a random Gaussian matrix $A$ ensures recovery of $s$-sparse vectors via overparameterized gradient flow for $m \gtrsim s \log(n/s)$. We refer the interested reader to the monograph [7] for more details on compressive sensing.
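As a reference point, $\ell_1$-minimization (2) can be cast as a linear program via the standard split $z = p - q$ with $p, q \geq 0$. The following sketch (problem sizes and the use of SciPy's `linprog` are our illustrative choices, not taken from the paper) recovers a 2-sparse vector from 30 Gaussian measurements:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, s = 30, 50, 2

# Gaussian measurement matrix and s-sparse ground truth
A = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = 1.0
y = A @ x

# min ||z||_1 s.t. Az = y, rewritten as the LP:
# minimize sum(p) + sum(q) subject to [A, -A][p; q] = y, p, q >= 0
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None))
z = res.x[:n] - res.x[n:]
```

In this regime ($m = 30$ Gaussian measurements, $s = 2$) the LP recovers the ground truth exactly with overwhelming probability.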
Our main result improves on previous work: For $N = 2$, it was shown in [14] that gradient descent for a loss defined in (24) and closely related to (5) converges to an $s$-sparse $x$ if $A$ satisfies a certain coherence assumption, which requires a number of measurements scaling quadratically in $s$. In [21], it was shown for general $N$ that gradient descent converges to an $s$-sparse $x$ if the restricted isometry constant of $A$ (see below) essentially satisfies $\delta \lesssim 1/\sqrt{s}$. This condition can only be satisfied if $m \gtrsim s^2$, see [7]. Hence, our result significantly reduces the required number of measurements, in fact, down to the optimal number. However, we note that [14, 21] work with gradient descent, while we use gradient flow. An extension of our result to gradient descent is not yet available.
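For intuition, the overparameterized dynamics can be simulated with plain gradient descent as a discrete proxy for the gradient flow. Everything below (the depth-2 factorization $x = u \odot u - v \odot v$, step size, iteration count, and initialization scale $\alpha$) is an illustrative sketch, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, s = 30, 50, 2
A = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = 1.0
y = A @ x

# Overparameterized loss L(u, v) = 0.5*||A(u*u - v*v) - y||^2,
# gradient descent from the small identical initialization u = v = alpha*1.
alpha, lr = 0.01, 0.02
u = alpha * np.ones(n)
v = alpha * np.ones(n)
for _ in range(20000):
    r = A @ (u * u - v * v) - y          # residual
    g = A.T @ r
    # dL/du = 2*u*g, dL/dv = -2*v*g
    u, v = u - lr * 2 * u * g, v + lr * 2 * v * g

x_hat = u * u - v * v
```

With a small $\alpha$, the iterates interpolate the measurements and land close to the sparse ground truth, in line with the implicit $\ell_1$-bias discussed above.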
1.1 Contribution
In this work, we consider vanilla gradient flow on the factorized models (4) and (5). We show that, under mild assumptions on $A$, gradient flow converges to a solution of (1) whose $\ell_1$-norm is close to the minimal $\ell_1$-norm among all solutions. Our main result reads as follows.
Theorem 1.1 (Equivalence to $\ell_1$-minimization, general case).
Let $N \geq 2$, $A \in \mathbb{R}^{m \times n}$, and $y \in \mathbb{R}^m$. Define the general overparameterized loss function as
(6) $\mathcal{L}_N(u, v) = \frac{1}{2} \|A(u^{\odot N} - v^{\odot N}) - y\|_2^2$,
where $u, v \in \mathbb{R}^n$ and $u^{\odot N} = u \odot \cdots \odot u$ denotes the $N$-fold Hadamard power. Let $u(t)$ and $v(t)$ follow the flow
$u'(t) = -\nabla_u \mathcal{L}_N(u(t), v(t)), \qquad v'(t) = -\nabla_v \mathcal{L}_N(u(t), v(t)), \qquad u(0) = v(0) = \alpha \mathbf{1},$
for some $\alpha > 0$. Suppose the set of nowhere-vanishing solutions $\{z \in \mathbb{R}^n : Az = y,\ z_i \neq 0 \text{ for all } i \in [n]\}$ is nonempty. Then the limit $x_\infty = \lim_{t \to \infty} \big( u(t)^{\odot N} - v(t)^{\odot N} \big)$ exists and lies in the solution set $\{z \in \mathbb{R}^n : Az = y\}$.
Further, let $\varepsilon > 0$ and assume that
(7) 
Then the $\ell_1$-norm of the limit $x_\infty$ satisfies
$\|x_\infty\|_1 \leq \min_{z : Az = y} \|z\|_1 + \varepsilon.$
Remark 1.2.
Let us emphasize two points:

The power of Theorem 1.1 lies in (7), which gives an explicit non-asymptotic scaling between the initialization $\alpha$ and the error $\varepsilon$. Note that (7) exhibits a different scaling for different $N$: whereas $N = 2$ requires an exponentially small initialization (already observed in [22]), the condition for $N \geq 3$ is far less restrictive. Since in practice we cannot take $\alpha$ arbitrarily small due to finite precision and computation time (empirically, smaller $\alpha$ leads to slower convergence of gradient flow/descent), our theory suggests a clear advantage of deep factorizations ($N \geq 3$) over shallow factorization ($N = 2$). This observation is also consistent with our experiments in Section 4.

The assumption that a nowhere-vanishing solution exists is satisfied as soon as the affine solution space is not aligned with one of the coordinate axes. In our case the assumption can always be fulfilled by reducing the problem dimension: first determine the set $\mathcal{I}$ of entry indices on which all vectors in the solution set are $0$, then compute any solution of (1) and fix the entries in $\mathcal{I}$ (all possible solutions share common values on $\mathcal{I}$), and finally reduce the gradient flow dynamics to the remaining entries.
It is particularly interesting that the recent work [17], despite being set in the different context of matrix recovery, provides a counterexample to norm-minimizing properties of gradient flow that relies on violating precisely this condition.
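The reduction described in the remark can be sketched numerically: an entry is zero in every solution of $Az = y$ precisely when both a particular solution and all null-space vectors of $A$ vanish there. A toy example (the matrix is hand-picked so that the first coordinate is forced to zero; this is our illustration, not from the paper):

```python
import numpy as np

# Toy system in which every solution z of Az = y has z[0] = 0:
# the first row reads z[0] = 0, the second reads z[1] + z[2] = 1.
A = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0]])
y = np.array([0.0, 1.0])

# Particular (minimum-norm) solution and an orthonormal basis of null(A)
z0, *_ = np.linalg.lstsq(A, y, rcond=None)
_, sv, Vt = np.linalg.svd(A)
null_basis = Vt[np.sum(sv > 1e-10):]     # rows spanning null(A)

# Entry i is zero in *every* solution iff z0[i] = 0 and the whole
# null-space basis vanishes in coordinate i.
forced_zero = (np.abs(z0) < 1e-10) & np.all(np.abs(null_basis) < 1e-10, axis=0)
```

Here only the first coordinate is forced to zero; the gradient flow dynamics could then be restricted to the remaining three coordinates.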
The significance of Theorem 1.1 is that many state-of-the-art results on reconstructing (effectively) sparse vectors can be derived from it in a straightforward way. For instance, it is easy to check with [7, Theorem 4.14] that the following holds (note that, e.g., Gaussian matrices satisfy the required stable null space property with high probability if $m \gtrsim s \log(n/s)$), cf. Figure 1 and Section 4.
Corollary 1.3.
1.2 Related Work
Before continuing with the main body of the work, let us give a brief overview of related results in the literature. As already mentioned, the works [14, 21] derive robust reconstruction guarantees for gradient descent and the compressed sensing model (1). Robust means here that their model incorporates additive noise on the measurements $y$. Whereas [21] considers general $N$ and matrices $A$ having a restricted isometry property, [14] extends the results for $N = 2$ to matrices satisfying a coherence assumption.
In [22] the authors examine the limits of gradient flow when initialized by $u(0) = v(0) = \alpha \mathbf{1}$, for any $\alpha > 0$. They show that for large $\alpha$ the limit of gradient flow approximates the least-squares solution, whereas for small $\alpha$ it approximates an $\ell_1$-norm minimizer. This paper is perhaps the most closely related to our work. Whereas the authors discuss more general types of initialization, their proof strategy is fundamentally different from ours and has certain shortcomings: they need to assume convergence of the gradient flow, and they obtain non-asymptotic bounds on the initialization magnitude required for implicit regularization only for $N = 2$. In contrast, we show convergence of gradient flow under mild assumptions on $A$ and provide non-asymptotic bounds for all $N \geq 2$, leading to less restrictive assumptions on the initialization magnitude.
The connection between factorized overparametrization and $\ell_1$-minimization was also observed in the statistics literature. In [13] the author shows that solving the LASSO is equivalent to minimizing an $\ell_2$-regularized overparametrized functional (for $N = 2$) and uses this to solve the LASSO by alternating least-squares methods. Although deeper factorizations are considered in that paper as well, the presented approach leads to different results since the overparametrized functional is equivalent to $\ell_{2/N}$-quasinorm and not $\ell_1$-norm minimization, for $N > 2$. The subsequent work [24] builds upon those ideas to perform sparse recovery with gradient descent assuming a restricted isometry property of $A$. Nevertheless, the presented results share the suboptimal sample complexity of [21, 14].
Apart from these closely related works, which treat the reconstruction of sparse vectors, another line of research deals with factorization/reconstruction of matrix-valued signals via overparametrization. The corresponding results are of a similar flavor and show an implicit low-rank bias of gradient flow/descent when minimizing factorized square losses [1, 2, 5, 8, 9, 10, 11, 12, 15, 16, 17, 19, 20]. It is noteworthy that the existing matrix sensing results, e.g., [20], require a number of measurements scaling quadratically in the rank to guarantee reconstruction of low-rank matrices via gradient descent, i.e., they share the suboptimal sample complexity of [21, 14]. For comparison, for low-rank matrix reconstruction by conventional methods like nuclear-norm minimization, a number of measurements scaling linearly in the rank suffices. For a more detailed discussion of the literature on matrix factorization/sensing via overparametrization, we refer the reader to [5].
1.3 Outline
Sections 2 and 3 are dedicated to proving Theorem 1.1. Section 2 illustrates the proof strategy in a simplified setting, whereas Section 3 extends the proof to full generality. For the sake of clarity, some of the proof details of Section 3 are postponed to Appendix A. Finally, we present in Section 4 numerical evidence supporting our claims and conclude with a brief summary/outlook on future research directions in Section 5.
1.4 Notation
For $n \in \mathbb{N}$, we denote $[n] = \{1, \dots, n\}$. Boldface lowercase letters like $\mathbf{x}$ represent vectors with entries $x_i$, while boldface uppercase letters like $\mathbf{A}$ represent matrices with entries $A_{i,j}$. For $x, z \in \mathbb{R}^n$, $x \geq z$ means that $x_i \geq z_i$, for all $i \in [n]$. We use $\odot$ to denote the Hadamard product, i.e., the vectors $x \odot z$ and $x^{\odot N}$ have entries $(x \odot z)_i = x_i z_i$ and $(x^{\odot N})_i = x_i^N$, respectively. We abbreviate $\mathbf{1} = (1, \dots, 1)^T$. Last but not least, we apply the logarithm entrywise to positive vectors, i.e., $\log(x) \in \mathbb{R}^n$ with $(\log(x))_i = \log(x_i)$.
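For concreteness, the entrywise conventions above translate to numpy as follows (a minimal illustration, not part of the paper):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

hadamard = x * z        # (x ⊙ z)_i = x_i * z_i
power = x ** 3          # (x^{⊙3})_i = x_i^3, the 3-fold Hadamard power
ones = np.ones(3)       # the all-ones vector 1
log_x = np.log(x)       # entrywise logarithm of a positive vector
```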
2 Positive Case
To show Theorem 1.1, we are going to prove in this section the following simplified version, Theorem 2.1, which treats the model in (4) and is restricted to the positive orthant (recall that gradient flow applied to (4) preserves the entrywise sign of the iterates). The proof strategy for Theorem 1.1 is then a straightforward adaptation of the proof of Theorem 2.1 and will be discussed in Section 3. Note that Theorem 2.1 can be easily adapted to other orthants by changing the signs of the initialization vector.
Theorem 2.1 (Equivalence to $\ell_1$-minimization, positive case).
Let $N \geq 2$, $A \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^m$. Define the overparameterized loss function as
(9) $F_N(w) = \frac{1}{2} \|A w^{\odot N} - y\|_2^2$,
where $w \in \mathbb{R}^n$. Let $w(t)$ follow the flow $w'(t) = -\nabla F_N(w(t))$ with $w(0) = \alpha \mathbf{1}$ for any $\alpha > 0$. Suppose the set of strictly positive solutions $\mathcal{S}_+ = \{z \in \mathbb{R}^n : Az = y,\ z > 0\}$ is nonempty. Then the limit $x_\infty = \lim_{t \to \infty} w(t)^{\odot N}$ exists and lies in the closure of $\mathcal{S}_+$. Moreover, its $\ell_1$-norm is at most $\varepsilon$ away from the minimum, i.e.,
$\|x_\infty\|_1 \leq \min_{z \in \mathcal{S}_+} \|z\|_1 + \varepsilon$
for any $\varepsilon > 0$ if $\alpha$ is chosen as in Theorem 1.1.
2.1 Reduced Factorized Loss
To analyze the dynamics of the flow, we first derive a compact expression for the gradient. We further simplify this expression by assuming identical initialization and arrive at the Reduced Factorized Loss, which will be used in the proofs later on.
Lemma 2.2 (Gradient).
Let $N \geq 2$ and $w_1, \dots, w_N \in \mathbb{R}^n$. Then
(10) $\nabla_{w_j} \mathcal{L}_N(w_1, \dots, w_N) = \Big( \bigodot_{k \neq j} w_k \Big) \odot A^T \big( A (w_1 \odot \cdots \odot w_N) - y \big), \quad j \in [N].$
Proof.
Lemma 2.3 (Identical Initialization).
Suppose $w_1(t), \dots, w_N(t)$ follow the negative gradient flow
$w_j'(t) = -\nabla_{w_j} \mathcal{L}_N(w_1(t), \dots, w_N(t)), \quad j \in [N].$
If all initialization vectors are identical, i.e., $w_j(0) = w(0)$ for all $j \in [N]$, then the vectors remain identical for all times, i.e., $w_j(t) = w(t)$. Moreover, the dynamics will be given by
$w'(t) = -\frac{1}{N} \nabla F_N(w(t)),$
where $F_N(w) = \frac{1}{2} \|A w^{\odot N} - y\|_2^2$ and $\nabla F_N(w) = N w^{\odot (N-1)} \odot A^T (A w^{\odot N} - y)$.
Proof.
At first sight, the reduction of the number of parameters in Lemma 2.3 seems counterintuitive to the idea of overparameterization. However, as opposed to the standard factorized loss, we arrive at a new reduced loss function, which changes the optimization landscape. This new loss promotes an implicit bias along the gradient flow trajectory, which can be useful in various contexts [2, 5, 21, 18]. Motivated by Lemma 2.3, we will thus consider the loss function defined below for the rest of the paper.
Definition 2.4 (Reduced Factorized Loss).
Let $N \geq 2$, $A \in \mathbb{R}^{m \times n}$, $y \in \mathbb{R}^m$. For $w \in \mathbb{R}^n$, the reduced factorized loss function is defined as
(11) $F_N(w) = \frac{1}{2} \|A w^{\odot N} - y\|_2^2.$
Its derivative is given by $\nabla F_N(w) = N w^{\odot (N-1)} \odot A^T (A w^{\odot N} - y)$.
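Assuming the reduced factorized loss has the form $F_N(w) = \frac{1}{2}\|A w^{\odot N} - y\|_2^2$, its gradient $N\, w^{\odot(N-1)} \odot A^T(A w^{\odot N} - y)$ can be sanity-checked against finite differences (the sizes below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, N = 5, 8, 3
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
w = rng.uniform(0.5, 1.5, size=n)        # strictly positive test point

def F(w):
    # Reduced factorized loss (assumed form): 0.5*||A w^N - y||^2,
    # with the N-th power taken entrywise
    return 0.5 * np.sum((A @ w**N - y) ** 2)

# Analytic gradient: N * w^(N-1) * A^T (A w^N - y)
grad = N * w**(N - 1) * (A.T @ (A @ w**N - y))

# Central finite differences, coordinate by coordinate
eps = 1e-6
fd = np.array([(F(w + eps * np.eye(n)[i]) - F(w - eps * np.eye(n)[i])) / (2 * eps)
               for i in range(n)])
```

The two gradients agree to high precision, confirming the chain-rule computation entrywise.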
2.2 Solution Entropy
The lack of convexity in the reduced loss that causes the intended implicit bias also complicates the analysis. In particular, even if $w(t)$ converges, the loss might not reach zero. For instance, $w = 0$ is a stationary point because $\nabla F_N(0) = 0$, but clearly $F_N(0) = \frac{1}{2}\|y\|_2^2 > 0$ for any $y \neq 0$. This means that $w(t)$ might converge to a vector that is not in the solution space.
To overcome the difficulty described above, we introduce a useful quantity which we dub the Solution Entropy.
Definition 2.5 (Solution Entropy).
For and , the solution entropy is defined as
(12) 
Recall that the logarithm is applied entrywise here.
Remark 2.6.
Lemma 2.7 (Nonincreasing in time).
Suppose . Let follow
(13) 
If , for all , then is nonincreasing in time with decay rate , for all . This implies that the functions decrease along at the same rate independent of the choice of .
Proof.
By direct computation we get
∎
Lemma 2.8 (Convexity).
For $z > 0$, the solution entropy is strictly convex with respect to $w$. The unique minimum is attained at $w_z$, given by
(14) $(w_z)_i = z_i^{1/N}, \quad i \in [n].$
Proof.
The Hessian matrix of the solution entropy is diagonal with strictly positive entries. Hence it is strictly positive definite and the solution entropy is strictly convex. The gradient of the solution entropy is given by
∎
2.3 Convergence and Optimality
With the help of the solution entropy, which is nonincreasing in time and convex in $w$, we are ready to prove that $w(t)$ avoids the bad stationary points described in Section 2.2, converges to an element of the solution space, and that its limit can be characterized by the solution entropy. Recall that we abbreviate the set of strictly positive solutions as
(15) $\mathcal{S}_+ = \{z \in \mathbb{R}^n : Az = y,\ z > 0\}.$
Lemma 2.9 (Bounded away from zero).
Proof.
Let and
We will first show that the claim holds for $t < T$, and then deduce that $T = \infty$. Due to the different shape of the solution entropy, we separate the case $N = 2$ from $N \geq 3$.
Case $N = 2$:
Since ,
where the second last step follows from separability of the addends and basic calculus. Since ,
Dividing by and then taking the exponential on both sides, we get that
Now assume that . By continuity of , we know that there is some such that , for . By Lemma 2.7, we know that the solution entropy is nonincreasing in time on and thus . By similar calculations we have that
which implies that for all . But this contradicts the definition of and shows that .
Case $N \geq 3$:
The methodology is the same as in the case $N = 2$; the only difference is the form of the solution entropy and its minimizer. Again, since ,
Since and ,
Rearranging terms, we get that
Now assume that . By continuity of , we know that there is some such that , for . By Lemma 2.7, we know that the solution entropy is nonincreasing in time on and thus . By similar calculations we have that
which again implies that for all . But this again contradicts the definition of and shows that .
∎
Lemma 2.9 provides uniform bounds for the trajectory of $w(t)$. Elaborating on the properties of the solution entropy, we can deduce further information about the convergence and the limit of the flow. To this end, we define some related quantities.
Definition 2.10 (Entropy Gap).
Let follow as in (13) with . For , denote the minimum and initial value of the solution entropy as
(17) 
where is welldefined because of and Lemma 2.8. For , define in addition the limit value of the solution entropy as
(18) 
which exists because is nonincreasing in time by Lemma 2.7 and Lemma 2.9. Define the Entropy Gap and Maximal Entropy Gap as
(19) 
Note that by construction for , and by Lemma 2.7 remains constant for all .
Lemma 2.11 (Convergence).
Proof.
Since is nonempty, by Lemma 2.9 for we have that for all