Over-parameterized models are crucial in deep learning, but their workings are far from understood. Over-parameterization — the technique of using more parameters than statistically necessary — apparently improves the training: theoretical and empirical results have suggested that it can enhance the geometric properties of the optimization landscape in simplified settings[24, 18, 17, 34] and thus make it easier to train over-parameterized models.
On the other hand, over-parameterization often doesn’t hurt the test performance, even if the number of parameters is much larger than the number of examples. Large neural networks used in practice have enough expressiveness to fit any labels of the training datasets [42, 17]. The training objective function may have multiple global minima with almost zero training error, some of which generalize better than the others [21, 11]
. However, local improvement algorithms such as stochastic gradient descent, starting with proper initialization, may prefer some generalizable local minima to the others and thus provide an implicit effect of regularization[38, 28, 19, 27, 41]. Such regularization seems to depend on the algorithmic choice, the initialization scheme, and certain intrinsic properties of the data.
The phenomenon and intuition above can be theoretically fleshed out in the context of linear models , whereas less is known for non-linear models whose training objectives are usually non-convex. The very important work of Gunasekar et al.  initiates the study of low-rank matrix factorization models with over-parameterization and conjectures that gradient descent prefers small trace norm solution in over-parameterized models with thorough empirical evidences.
This paper resolves the conjecture for the matrix sensing problem — recovering a low-rank matrix from linear measurements — under the restricted isometry property (RIP). We show that with a full-rank factorized parameterization, gradient descent on the squared loss with finite step size, starting with a small initialization, converges to the true low-rank matrix (which is also the minimum trace norm solution.) One advantage of the over-parameterized approach is that without knowing/guessing the correct rank, the algorithms can automatically pick up the minimum rank or trace norm solution that fits the data.
The analysis can be extended to learning one-hidden-layer neural networks with quadratic activations. We hope such theoretical analysis of algorithmic regularization in the non-convex setting may shed light on other more complicated models where over-parameterization is crucial (if not necessary) for efficient training.
1.1 Setup and Main Results
Let be an unknown rank- symmetric positive semidefinite (PSD) matrix in that we aim to recover. Let be given symmetric measurement matrices.111Given that the matrix is symmetric, we can assume that ’s are symmetric without loss of generality: Because for any symmetric matrix , we can always replace by .
We assume that the label vectoris generated by linear measurements
Here denotes the inner product of two matrices. Our goal is to recover the matrix . 222Our analysis can naturally handle a small amount of Gaussian noise in the label vector , but for simplicity we only work with the noiseless case.
Without loss of generality, we assume that has spectral norm 1. Let denote the singular value of a matrix , and let be the condition number of . We focus on the regime where and .
Let be a matrix variable. We consider the following mean squared loss objective function with over-parameterization:
Since the label is generated by , any matrix satisfying is a local minimum of with zero training error. These are the ideal local minima that we are shooting for. However, because the number of parameters is much larger than the number of observation , there exist other choices of satisfying but .
A priori, such over-parameterization will cause over-fitting. However, we will show that the following gradient descent algorithm with small initialization converges to a desired local minimum, instead of other non-generalizable local minima:
The following theorem assumes the measurements matrices satisfy restricted isometry property (RIP), which is formally defined in Section 2. Casual readers may simply assume that the entries of
’s are drawn i.i.d from standard normal distribution333Or equivalently, as discussed in the previous footnote, causal readers may assume where is from standard normal distribution. Such symmetrization doesn’t change the model since and the number of observations : it’s known  that in this case meet the requirement of the following theorem, that is, they satisfy -RIP with . 444Technically, to get such RIP parameters that depends on , one would need to slightly modify the proof of [31, Theorem 4.2] at the end to get the dependency of on .
Let be a sufficiently small absolute constant. Assume that the set of measurement matrices satisfies -restricted isometry property (defined in Section 2 formally) with . Suppose the initialization and learning rate satisfy and . Then for every , we have
Note that the recovery error can be viewed as the test error — it’s equal to the test error on a fresh measurement drawn from the standard normal distribution. The theorem above shows that gradient descent can provide an algorithmic regularization so that the generalization error depends on the size of the initialization , instead of the number of parameters. Because the convergence is not very sensitive to the initialization, we can choose small enough (e.g., ) to get approximately zero generalization error. Moreover, when is small, gradient descent can run for a long period of time without overfitting the data. We show in Section 6 that empirically indeed the generalization error depends on the size of the initialization and gradient descent is indeed very stable.
The analysis also applies to stochastic gradient descent, as long as each batch of the measurement matrices satisfies RIP.555Smaller batch size should also work when the learning rate is sufficiently small, although its analysis seems to require more involved techniques and is left for future work. We also remark that our theory suggests that early stopping for this problem is not necessary when the initialization is small enough — the generalization error bounds apply until iterations. We corroborate this with empirical results in Section 6.
We remark that we achieve a good iteration complexity bound for the gradient descent algorithm, which was not known in previous work even for low-rank parameterization, nor for the case with infinite samples (which is the PCA problem). Part of the technical challenges is to allow finite step size and inverse-poly initialization (instead of exponentially small initialization). The dependency of on and in the theorem is possibly not tight. We conjecture that only needs to be smaller than an absolute constant, which is left for future work.
Insights of the analysis:
Interestingly, our analysis ideas seem to be different from other previous work in a conceptual sense. The analysis of the logistic regression case relies on that the iterate eventually moves to infinity. The folklore analysis of the algorithmic regularization of SGD for least squares and the analysis in  for the matrix regression with commutable measurements both follow the two-step plan: a) the iterates always stays on a low-rank manifold that only depends on the inputs (the measurement matrices) but not on the label vector ; b) generalization follows from the low complexity of the low-rank manifold. Such input-dependent but label-independent manifold doesn’t seem to exist in the setting when ’s are random.
Instead, we show that the iterates stay in the set of matrices with approximate rank smaller or equal to the minimal possible rank that can fit the data, which is a set that depends on the labels but not on the inputs ’s. We implicitly exploit the fact that gradient descent on the population risk with small initialization only searches through the space of solutions with a lower rank than that of the true matrix . The population risk is close to the empirical risk on matrices with rank smaller than or equal to the true rank. Hence, we can expect the learning dynamics of the empirical risk to be similar to that of the population risk, and therefore the iterates of GD on the empirical risk remain approximately low-rank as well. Generalization then follows straightforwardly from the low-rankness of the iterates. See Section 3 for more high-level discussions.
We note that the factorized parameterization also plays an important role here. The intuition above would still apply if we replace with a single variable and run gradient descent in the space of with small enough initialization. However, it will converge to a solution that doesn’t
generalize. The discrepancy comes from another crucial property of the factorized parameterization: it provides certain denoising effect that encourages the empirical gradient to have a smaller eigenvalue tail. This ensures the eigenvalues tails of the iterates to grow sufficiently slowly. This point will be more precise in Section3 once we apply the RIP property. In section 6, we also empirically demonstrate that GD in the original space of with projection to the PSD cone doesn’t provide as good generalization performance as GD in the factorized space.
Finally, we remark that the cases with rank are technically much more challenging than the rank-1 case. For the rank-1 case, we show that the spectrum of remains small in a fixed rank- subspace, which is exactly the complement of the column span of . Hence the iterates are approximately rank one. By contrast, for the rank- case, a direct extension of this proof strategy only gives a much weaker result compared to Theorem 1.1. Instead, we identify an adaptive rank- subspace in which remains small. Clearly, the best choice of this adaptive subspace is the subspace of the least left singular vectors of . However, we use a more convenient surrogate. We refer the reader to Section 4 for detailed descriptions.
1.2 Extensions to Neural Networks with Quadratic Activations
Our results can be applied to learning one-hidden-layer neural networks with quadratic activations. We setup the notations and state results below and defer the details to Section 5.
Let be the input and be the first layer weight matrix. We assume that the weight on the second layer is simply the all one’s vector . Formally, the label is assumed to be generated by
where is the element-wise quadratic function. For simplicity, we assume that comes from standard normal distribution . As we will demonstrate later, the representational power of the hypothesis class with is the same as those with . Thus we only focus on the case when . For the purpose of this paper, the most interesting regime is the scenario when .
We use an over-parameterized model with a variable . The prediction is parameterized by , and we use the mean squared error
as the loss function. We use a variant of stochastic gradient descent (or gradient descent) on the mean squared loss.
The following theorem shows that the learned model will generalize with examples, despite that the number of parameters can be much larger than the number of samples (when or is considered as a constant).666The dependency on here is likely to be loose. Although we note that this is the first bound of this kind for this problem that shows over-parameterized models can still generalize. We will start with an initialization in the same way as in equation (1.2), and denote as the iterates. Let be the condition number of .
The same analysis also applies to stochastic gradient descent as long as the batch size is at least . The analysis exploits the connection (pointed out by ) between neural networks with quadratic activations and matrix sensing with rank-1 measurements [23, 45, 9]: one can view as the measurement matrix in matrix sensing. However, these measurements don’t satisfy the RIP property. We will modify the learning algorithm slightly to cope with it. See Section 5 for details.
Organization: The rest of this paper is organized as follows: In Section 2, we define notations and present a review of the restricted isometry property. In Section 3, we lay out the key theoretical insights towards proving Theorem 1.1 and give the analysis for the rank-1 case as a warm-up. In Section 4, we outline the main steps for proving Theorem 1.1 and Section A completes the proofs of these steps. Section 5 and Section B give the proof of Theorem 1.2. Section 6 contains numeric simulations. Finally, Section C provide the proofs of concentration properties we have used.
Let denotes the projection to the column span of , and let
denotes the identity matrix. Letdenote the Moore-Penrose pseudo-inverse of the matrix . Let denotes the Euclidean norm of a vector and spectral norm of a matrix. Let denote the Frobenius norm of a matrix. Suppose , then denote its largest singular value and denotes its -th largest singular value. Alternatively, we have . Let denote the inner product of two matrices. We use to denote the sine of the principal angles between the columns spaces of and .
Unless explicitly stated otherwise, -notation hides absolute multiplicative constants. Concretely, every occurrence of is a placeholder for some function that satisfies for some absolute constant . Similarly, means that there exists an absolute constant such that . We use the notation as an abbreviation for .
2 Preliminaries and Related Work
Recall that we assume is rank- and positive semidefinite. Let be the eigen-decomposition of , where is an orthonormal matrix and is a diagonal matrix. The assumptions that and translate to that . Under the above notations, we see that the target solution for the variable is equal to where can be arbitrary orthonormal matrix. For convenience, we define the matrix as
Then the update rule can be rewritten as
where is the identity matrix. One of the key challenges is to understand how the matrix transforms , so that converges the target solution quickly.
are drawn from Gaussian distribution and optimistically suppose that they areindependent with . Then, we have that since the expectation of with respect to the randomness of ’s is equal to . However, they are two fundamental issues with this wishful thinking: a) obviously depends on ’s heavily for , since in every update step ’s are used; b) even if ’s are independently with , there are not enough ’s to guarantee concentrates around its mean in Euclidean norm. To have such concentration, we need , whereas we only have samples.
Restricted isometry propety:
The restricted isometry property (RIP) allows us to partially circumvent both the technical issues a) and b) above. It says that using the set of linear measurement matrices , we can preserve the Frobenius norm of any rank- matrices approximately.
(Restricted isometry property ) A set of linear measurement matrices in satisfies -restricted isometry property (RIP) if for any matrix with rank at most , we have
The crucial consequence of RIP that we exploit in this paper is the meta statement as follows:
|behaves like for approximately low-rank||(2.4)|
We will state several lemmas below that reflect the principle above. The following lemma says that behaves like for low-rank matrices and .
[6, Lemma 2.1] Let be a family of matrices in that satisfy -restricted isometry property. Then for any matrices with rank at most , we have:
The following lemma says that behaves like when multiplied by a matrix with small operator norm.
Let be a family of matrices in that satisfy -restricted isometry property. Then for any matrix of rank at most , and any matrix , where can be any positive integer, we have:
Lemma 2.3 is proved in Section C777We suspect that Lemma 2.3 is already known, however we haven’t been able to find a reference.. We can also extend Lemma 2.3 to the cases when has a higher rank (see Lemma C.1 and Lemma C.2). The bounds are not as strong as above (which is inevitable because we only have measurements), but are useful when itself is relatively small.
2.1 Related Work
Generalization theory beyond uniform convergence:
This work builds upon the remarkable work of Gunasekar et al. , which raises the conjecture of the implicit regularization in matrix factorization models and provides theoretical evidence for the simplified setting where the measurements matrices are commutable. Implicit regularization of gradient descent is studied in the logistic regression setting by Soudry et al. .
Recently, the work of Hardt et al.  studies the implicit regularization provided by stochastic gradient descent through uniform stability [4, 25, 32]. Since the analysis therein is independent of the training labels and therefore it may give pessimistic bounds . Brutzkus et al.  use a compression bound to show network-size independent generalization bounds of one-hidden-layer neural networks on linearly separable data.
Bartlett et al. , Neyshabur et al. , and Cisse et al.  recently prove spectrally-normalized margin-based generalization bounds for neural networks. Dziugaite and Roy  provide non-vacuous generalization bounds for neural networks from PCA-Bayes bounds. As pointed out by Bartlett et al. , it’s still unclear why SGD without explicit regularization can return a large margin solution. This paper makes progress on explaining the regularization power of gradient descent, though on much simpler non-linear models.
Matrix factorization problems:
Early works on matrix sensing and matrix factorization problems use convex relaxation (nuclear norm minimization) approaches and obtain tight sample complexity bounds [31, 37, 8, 30, 7]. Tu et al.  and Zheng and Lafferty  analyze the convergence of non-convex optimization algorithms from spectral initialization. The recent work of Ge et al.  and Bhojanapalli et al. 
shows that the non-convex objectives on matrix completion and matrix sensing with low-rank parameterization don’t have any spurious local minima, and stochastic gradient descent algorithm on them converges to the global minimum. Such a phenomenon was already known for the PCA problem and recently shown for phase retrieval, robust PCA, and random tensor decomposition as well (e.g., see[36, 15, 3, 14, 13, 39] and references therein). Soltanolkotabi et al.  analyzes the optimization landscape of over-parameterized one-hidden-layer neural networks with quadratic activations. Empirically, Jose et al. 
show that factorized parameterizations of recurrent neural networks provide additional regularization effect.
3 Proof Overview and Rank-1 Case
In this section, we demonstrate the key ideas of our proofs and give an analysis of the rank-1 case as a warm-up. The main intuition is that the iterate stays approximately low-rank in the sense that:
The -th singular value remains small for any ;
The top singular vectors and singular values of converge to those of in logarithmic number of iterations.
Propositions (a) and (b) can be clearly seen when the number of observations approaches infinity and are Gaussian measurements. Let’s define the population risk as
In this case, the matrix (defined in (2.1)) corresponds to , and therefore the update rule for can be simply rewritten as
Observe that the term encourages the column span of to move towards the column span of , which causes the phenomenon in Proposition (b). On the other hand, the term is performing a contraction of all the singular values of , and therefore encourages them to remain small. As a result, decreases in those directions that are orthogonal to the span of , because there is no positive force to push up those directions.
So far, we have described intuitively that the iterates of GD on the population risk remains approximately low-rank. Recall that the difficulty was that the empirical risk doesn’t uniformly concentrate well around the population risk .888Namely, we don’t have uniform convergence results in the sense that is small for all matrices . (For examples, for many matrices we can have but because we have more variables than observations.) However, the uniform convergence can occur, at least to some extent, in the restricted set of approximately low-rank matrices! In other words, since the gradient descent algorithm only searches a limited part of the whole space, we only require restricted uniform convergence theorem such as restricted isometry property. Motivated by the observations above, a natural meta proof plan is that:
The trajectory of the iterates of GD on the population risk stays in the set of approximately low-rank matrices.
The trajectory of the empirical risk behaves similarly to that of the population risk in the set of approximately low-rank matrices.
It turns out that implementing the plan above quantitatively can be technically challenging: the distance from the iterate to the set of low-rank matrices can accumulate linearly in the number of steps. Therefore we have to augment the plan with a strong result about the rate of convergence:
The iterates converge to the true solution fast enough before its effective rank increases.
For the rest of this section, we demonstrate a short proof of the rank-1 case to implement the intuitions described above. We note that the results of the rank- case in Section 4 is significantly tighter than the results presented in this section. The analysis involves more delicate techniques to control the growth of the top eigenvectors, and requires a much stronger convergence analysis.
3.1 Warm-up: Rank-1 Case
In this subsection, we assume that for and that . We decompose the iterates into the subspace of and its complement:
where we denote by and . 999Observe that we have restricted the column subspace of the signal term , so that is always a multiple of . In section 4, we will introduce an adaptive subspace instead to decompose into the signal and the error terms.
In light of the meta proof plan discussed above, we will show that the spectral norm and Frobenius norm of the “error term” remains small throughout the iterations, whereas the “signal” term grows exponentially fast (in the sense that the norm of grows to 1.) Note that any solution with and will give exact recovery, and for the purpose of this section we will show that will converges approximately to 1 and stays small.
Throughout this section, we assume that and the set of measurement matrices satisfies -RIP with where is a sufficiently small absolute constant (e.g., suffices).
In the setting of this subsection, suppose and . Then after iterations, we have:
As we already mentioned, Theorem 3.1 is weaker than Theorem 1.1 even for the case with . In Theorem 1.1 (or Theorem 4.1), the final error depends linearly on the initialization, whereas the error here depends on the RIP parameter. Improving Theorem 3.1 would involve finer inductions, and we refer the readers to Section 4 for the stronger results.
The following lemma gives the growth rate of in spectral norm and Euclidean norm, in a typical situation when and are bounded above in Euclidean norm.
Proposition 3.2 (Error dynamics).
In the setting of Theorem 3.1. Suppose that and . Then can be bounded by
A recurring technique in this section, as alluded before, is to establish the approximation
As we discussed in Section 2, if is low-rank, then the approximation above can be established by Lemma 2.2 and Lemma 2.3. However, is only approximately low-rank, and we therefore we will decompose it into
Proof Sketch of Proposition 3.2.
Using the update rule (3.3) for , we have that
When is sufficiently small and are bounded from above, the third term on the RHS is negligible compared to the second term. Therefore, we focus on the second term that is linear in .
where in the last line we rearrange the terms and use the fact that is symmetric. Now we use Lemma 2.2 to show that equation (3.7) is close to , which is its expectation w.r.t the randomness of ’s if ’s were chosen from spherical Gaussian distribution. If was a rank-1 matrix, then this would follow from Lemma 2.2 directly. However, is approximately low-rank. Thus, we decompose it into a low-rank part and an error part with small trace norm as in equation (3.5). Since has rank at most 4, we can apply Lemma 2.2 to control the effect of ’s,
where the last inequality uses that .
Note that , which implies that and . Therefore,
We can also control the third term in RHS of equation (3.6) by . Since the bound here is less important (because one can always choose small enough to make this term dominated by the first order term), we left the details to the reader. Combining the equation above with (3.10) and (3.6), we conclude the proof of equation (3.4). Towards bounding the spectral norm of , we use a similar technique to control the difference between and in spectral norm. By decomposing as in equation (3.5) and applying Lemma 2.3 and Lemma C.2 respectively, we obtain that
|(by the assumptions that )|
Observing that . Plugging the equation above into equation (3.3), we conclude that
The next Proposition shows that the signal term grows very fast, when the signal itself is not close to norm 1 and the error term is small.
Proposition 3.3 (Signal dynamics).
In the same setting of Proposition 3.2, we have,
The following proposition shows that converges to 1 approximately and remains small by inductively using the two propositions above.
Proposition 3.4 (Control and by induction).
In the setting of Theorem 3.1, after iterations,
We will analyze the dynamics of and in two stages. The first stage consists of all the steps such that , and the second stage consists of the rest of steps. We will show that
Stage 1 has at most steps. Throughout Stage 1, we have
In Stage 2, we have that
And after at most steps in Stage 2, we have .
We use induction with Proposition 3.2 and 3.3. For , we have that because , where is an orthonormal matrix. Suppose equation (3.14) is true for some , then we can prove equation (3.15) holds by Proposition 3.3:
|(by and )|
where the last inequality uses that
For claim b), we first apply the bound obtained in claim a) on and the fact that (this follows trivially from our induction, so we omit the details). We have already proved that in the first stage. For the second stage, as long as the number of iterations is less than , we can still obtain via Proposition 3.2 that:
To summarize, we bound as follows: