Neural networks are a key component of the machine learning toolbox. Still, the reasons behind their success remain mysterious from a theoretical perspective. While sufficiently large neural networks can in principle represent a large class of functions, we do not yet understand under what conditions their parameters can be adjusted in an algorithmically tractable way for that purpose. For example, under worst-case assumptions, some functions cannot be tractably learned with neural networks [1, 2]. We also know that there exist settings with adversarial initializations where neural networks fail to generalize to new samples, while the same setting succeeds from random initial conditions. And yet, in many practical settings, neural networks are trained successfully even with simple local algorithms such as gradient descent (GD) or stochastic gradient descent (SGD).
The problem of learning the parameters of a neural network is two-fold. First, we want training on a set of data, via minimization of a suitable loss function, to succeed in finding a set of parameters for which the value of the loss is close to its global minimum. Second, and more importantly, we want such a set of parameters to also generalize well to unseen data. Theoretical guarantees have been obtained in many settings by a geometrical analysis of the loss showing that only global minima are present, see e.g. [4, 5]. In particular, it has been shown that network over-parametrization can be beneficial and lead to landscapes without spurious minima in which GD or SGD converge [6, 7, 8, 9, 10]. However, over-parametrized neural networks successfully optimized on a training set do not necessarily generalize well: for example, neural networks can achieve zero training error without learning any rule. It is therefore important to understand when zero training loss implies good generalization.
It is known empirically that deep neural networks can learn functions that can be represented with a much smaller (sometimes even shallow) neural network [12, 13, 14], but that learning the smaller network without first learning the larger one is computationally harder. Our work provides a theoretical justification for this empirical observation by providing an explicit and rigorously analyzable case where this happens.
In this work we investigate the issues of training and generalization in the context of a teacher-student set-up. We assume that both the teacher and the student are one-hidden-layer neural networks with quadratic activation function and quadratic loss. We focus on the over-parametrized or over-realizable case where the hidden layer of the teacher is smaller than that of the student . We assume that the hidden layer of the student is larger than the dimensionality , , in that case:
- We show that the value of the empirical loss is zero on all of its minimizers, but that the set of minimizers does not reduce to the singleton containing only the teacher network in general.
- We derive a critical value of the number of samples per dimension above which the set of minimizers of the empirical loss has a positive probability to reduce to the singleton containing only the teacher network in the limit with , i.e., we derive a sample complexity threshold above which the minimizer can have good generalization properties. The formula is proven for a teacher with a single hidden unit (a.k.a. phase retrieval).
- We study gradient descent flow on the empirical loss starting from random initialization and show that it converges to a network that can achieve perfect generalization above this sample complexity threshold .
- We evaluate the convergence rate of gradient descent in the limit of a large number of samples. We identify two different regimes of convergence according to the input dimension and the number of hidden units, showing that in one case the loss converges as while in the second case it converges exponentially.
- We show how the string method can be used to probe the empirical loss landscape and find minimum energy paths on this landscape connecting the initial weights of the student to those of the teacher, possibly going through flat portions or over energy barriers. This allows one to probe features of this landscape not accessible by standard GD.
In Sec. 2 we formally define the problem and derive some key properties that we use in the rest of the paper. In Sec. 3 we analyze the training and the generalization losses from a geometrical perspective, and derive the formula for the sample complexity threshold. In Sec. 4 we show that gradient descent flow can find good minima for datasets above this sample complexity threshold, and we characterize its convergence rate. In Sec. 5 we present our results using the string method to probe the loss landscape. Finally, in the appendix we give the proofs and some additional numerical results.
One-hidden-layer neural networks with quadratic activation functions in the over-parametrized regime were considered in a range of previous works [9, 15, 8, 16, 17]. Notably, it was shown that all local minima are global when the number of hidden units is larger than the dimension and that gradient descent finds the global optimum [15, 8, 16], and also when the number of hidden units with being the number of samples [15, 17]. Most of these results were established for arbitrary training data of input/output pairs, but consequently these works did not establish conditions under which the minimizers reached by gradient descent have good generalization properties. Indeed, it is intuitive that over-parametrization renders the optimization problem simpler, but it is rather non-intuitive that it does not destroy good generalization properties. In , under the assumption that the input data is Gaussian i.i.d., a generalization rate was established. However, the generalization properties of neural networks with a number of samples comparable to the dimensionality are mostly left open.
Much tighter (Bayes-optimal) generalization properties of neural networks were established for data generated by the teacher-student model, for the generalized linear models in , and for one hidden layer much smaller than the dimension in . However, these results were only shown to be achievable with approximate message passing algorithms, and the performance of gradient-descent algorithms was not analyzed. Also, studying over-parametrization with analogous tightness of generalization results is an open problem and has been achieved only for one-pass stochastic gradient descent .
A notable special case of our setting is when the teacher has only one hidden unit, in which case the teacher network is equivalent to the phase retrieval problem with random sensing matrix . For this case the performance of message passing algorithms is well understood and requires a number of samples linearly proportional to the dimension, in the high-dimensional regime for perfect generalization . For randomly initialized gradient descent the best existing rigorous result for phase retrieval requires a number of samples . The performance of gradient descent in the phase retrieval problem is studied in detail in a concurrent work , showing numerically that without overparametrization randomly initialized gradient descent needs at least samples to achieve perfect generalization. In the present work we show that overparametrized neural networks are able to solve the phase retrieval problem with samples in the high-dimensional limit. This improves upon  and falls close to the performance of the approximate message passing algorithm that is conjectured optimal among polynomial ones . But most interesting is the comparison between our results for phase retrieval obtained by overparametrized neural networks , and the results from  who show that without overparametrization a considerably larger is needed for gradient descent to succeed in learning the same function. This comparison provides a theoretical justification for how overparametrization helps gradient descent to find good generalization properties with fewer samples. We stress that the same property would not apply to message passing algorithms. We could speculate that more of the properties of overparametrization observed in deep learning are limited to gradient-descent-based algorithms and would not hold for other algorithmic classes.
Closely related to our work is Ref.  in which the authors consider the same teacher-student problem as we do. The main difference is that they only consider teachers that have more hidden units than the input dimension, , while we consider arbitrary . As we show below the regime where turns out to be interesting as it affects nontrivially the critical number of samples needed for recovery and leads to a more complex scenario in which depends also on —in particular taking allows for recovery below the threshold , which is one of our main results.
2 Problem Formulation
Consider a teacher-student scenario where a teacher network generates the dataset, and a student network aims at learning the function of the teacher. The teacher has weights , with . We will keep the teacher weights generic in most of the paper and will specify them when needed, in particular for the simulations where we consider two specific teachers: one with i.i.d. Gaussian with covariance identity, and one with orthonormal.
The student’s weights are , with and . Given an input , teacher’s and student’s outputs are respectively
where we fixed the second layer of weights to and , respectively. The teacher produces outputs from random i.i.d. Gaussian samples , . Given this dataset, we define the empirical loss
where denotes expectation with respect to the empirical measure . As usual, the population loss is obtained by taking the expectation of (2) with respect to .
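To make the setup concrete, the following is a minimal NumPy sketch of the two networks and of the empirical loss. Since the display equations are not reproduced here, several choices are assumptions made for illustration only: the weights are stored as an array of shape (hidden units, input dimension), the second layer is fixed to one over the number of hidden units, the loss carries a prefactor 1/4, and the names `network_output`, `gram` and `empirical_loss` are hypothetical. The sketch also illustrates that the output depends on the weights only through a symmetric positive semidefinite matrix built from them, which is why a rotation of the hidden units leaves the function unchanged.

```python
import numpy as np

def network_output(W, X):
    """One-hidden-layer net with quadratic activation.

    W: (m, d) first-layer weights; X: (n, d) inputs.
    Assumption: second-layer weights are all fixed to 1/m.
    """
    m = W.shape[0]
    return np.sum((X @ W.T) ** 2, axis=1) / m

def gram(W):
    """d x d matrix W^T W / m through which W enters the output."""
    return W.T @ W / W.shape[0]

def empirical_loss(W, X, y):
    """Quadratic loss averaged over the dataset (prefactor 1/4 is an assumption)."""
    return 0.25 * np.mean((network_output(W, X) - y) ** 2)

rng = np.random.default_rng(0)
d, m, m_star, n = 5, 7, 2, 50            # illustrative sizes only
W_star = rng.standard_normal((m_star, d))  # teacher weights
W = rng.standard_normal((m, d))            # student weights
X = rng.standard_normal((n, d))            # i.i.d. Gaussian inputs
y = network_output(W_star, X)              # labels produced by the teacher

# The output is the quadratic form x^T A x with A = W^T W / m.
A = gram(W)
out_via_A = np.einsum("ki,ij,kj->k", X, A, X)

# Rotating the hidden units (W -> Q W, Q orthogonal) does not change A.
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
out_rotated = network_output(Q @ W, X)

print("student loss:", empirical_loss(W, X, y))
print("teacher loss:", empirical_loss(W_star, X, y))
```

In this parametrization the student has m times d weights, but only the d(d+1)/2 independent entries of the symmetric matrix matter for the function it computes.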
The student minimizes the empirical loss (2) using gradient descent, . Explicitly
where we introduced the following matrices
We can now see that a closed equation for can be derived from (3), and this new equation reduces the effective number of weights from to without affecting either the dynamics or the other properties of the teacher and student since and :
It is also possible to write the equivalent of this lemma for the population loss:
The GD flow of the weights on the population loss induces the following evolution equation for :
where is twice the population loss written in terms of :
3 Geometrical Considerations and Sample Complexity Threshold
The empirical loss is quadratic, hence convex, with minimum zero. In addition is a minimizer since . The main question we want to address next is when this minimizer is unique.
Since the trace is a scalar product in the vector space of matrices in which symmetric matrices form a dimensional subspace, the empirical loss will be strictly convex in this subspace iff we span it using linearly independent . Yet, if we restrict considerations to matrices that are also positive semidefinite, we need fewer data to guarantee that is the unique minimizer of , at least in some probabilistic sense:
Theorem 3.1 (Single unit teacher).
Consider a teacher with and a student with hidden units respectively, so that has rank 1 and has full rank. Given a data set with each drawn independently from a standard Gaussian, denote by the set of minimizers of the empirical loss constructed with over symmetric positive semidefinite matrices , i.e.
Set for and let . Then
In words, this theorem says that there exists a threshold value such that for any there is a finite probability that the empirical loss landscape trivializes and all spurious minima disappear in the limit as . For however, this is not the case and spurious minima exist with probability 1 in the limit. Therefore, the chance to learn by minimizing the empirical loss from a random initial condition is zero if but it becomes positive if . The proof of Theorem 3.1 is presented in Appendix A. This proof shows that we can account for the constraint that be positive definite by making a connection with the classic problem of the number of extremal rays of proper convex polyhedral cones generated by a set of random vectors in general position. Interestingly, this proof also gives a criterion on the data set that guarantees that the only minimizer of the empirical loss is : it suffices to check that the proper convex polyhedral cones constructed with the data vectors have a number of extremal rays that is less than .
Heuristic extension for arbitrary .
The result of Theorem 3.1 can also be understood via a heuristic algebraic argument that has the advantage that it applies to arbitrary . The idea, elaborated upon in Appendix A.3, is to count the number of constraints needed to ensure that the only minimum of the empirical loss is , taking into account that (i) has full rank and has rank and (ii) both and are positive semidefinite and symmetric, so that the number of negative eigenvalues of can at most be . If we use a block representation of in which we diagonalize the block that contains the direction associated with the eigenvectors of with nonnegative eigenvalues, and simply count the number of nonzero entries in the resulting matrix (accounting for its symmetry), for we arrive at
which, for , agrees with the result in Theorem 3.1. The sample complexity threshold is confirmed in Fig. 1 via simulations using gradient descent (GD) on the empirical loss—we explain this figure in Sec. 4 after establishing that the GD dynamics converges.
4 Gradient Descent
Let us now analyze the performance of gradient descent over the empirical loss. Initiated with random initial weights, with probability one the GD flow in (3) for the weight will eventually reach a minimum of the empirical loss, say —indeed, the only possibility for it not to do so would be to reach a critical point of Morse index 1 or above, and the probability of that event is zero from random initial data. Since the evolution equation in (3) for the weights is completely equivalent to the evolution equation in (5) for , the solution to this equation will also eventually reach and as soon as and has full rank, must be a minimizer of the empirical loss, i.e. be such that . That is, we have established:
Let be a solution to (5) for with full rank. Then as and is a minimizer of the empirical loss.
Combined with Theorem 3.1, this proposition also indicates that, when and is large, the probability that is high when , whereas the probability that becomes positive for . If we generalize this analysis to the case and large, we expect that GD will recover the teacher only if with given by (12).
These results are confirmed by numerical simulations in Fig. 1 where we plot as a function of the number of teacher hidden units for different values of . The four colors represent different input dimensions . We use circles to represent the numerical extrapolation of obtained by several runs of GD flow on different instances of the problem, using the procedure described in Appendix B. Consistent with Proposition 4.1, the extrapolation confirms that GD flow is able to match the sample complexity threshold predicted by the theory.
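A simulation of this kind can be sketched with a plain discretization of the GD flow on the empirical loss. The code below is only an illustration under stated assumptions, not the procedure of Appendix B: the loss normalization (prefactor 1/4, second layer fixed to one over the number of hidden units), the step size `lr`, the number of steps, the problem sizes, and the function names are all hypothetical choices. The teacher has a single hidden unit, and the student has more hidden units than input dimensions.

```python
import numpy as np

def loss_and_grad(W, X, y):
    """Empirical loss and its gradient with respect to the student weights W (m, d).

    Assumed loss: (1 / 4n) * sum_k r_k^2 with residuals
    r_k = (1/m) * ||W x_k||^2 - y_k.
    """
    n, m = X.shape[0], W.shape[0]
    r = np.sum((X @ W.T) ** 2, axis=1) / m - y
    loss = 0.25 * np.mean(r ** 2)
    # d(loss)/dW = (1 / (n m)) * W * sum_k r_k x_k x_k^T
    grad = W @ ((X.T * r) @ X) / (n * m)
    return loss, grad

def gradient_descent(W0, X, y, lr=0.01, n_steps=3000):
    """Explicit Euler discretization of the GD flow; returns final weights and loss history."""
    W = W0.copy()
    losses = []
    for _ in range(n_steps):
        loss, grad = loss_and_grad(W, X, y)
        losses.append(loss)
        W -= lr * grad
    losses.append(loss_and_grad(W, X, y)[0])
    return W, np.array(losses)

rng = np.random.default_rng(1)
d, m, n = 5, 6, 60                          # illustrative sizes, m >= d
w_star = rng.standard_normal((1, d))        # single-unit teacher
X = rng.standard_normal((n, d))             # i.i.d. Gaussian inputs
y = np.sum((X @ w_star.T) ** 2, axis=1)     # teacher outputs
W0 = rng.standard_normal((m, d))            # random initialization

W, losses = gradient_descent(W0, X, y)
A = W.T @ W / m
A_star = w_star.T @ w_star
print("training loss: first %.3e, last %.3e" % (losses[0], losses[-1]))
print("distance to teacher matrix:", np.linalg.norm(A - A_star))
```

Repeating such runs over many random datasets and initializations, for several ratios of samples to dimension, gives an empirical estimate of the fraction of runs that recover the teacher.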
What Proposition 4.1 leaves open is the convergence rate of in either case. This question is hard to answer for GD on the empirical loss, but it can be addressed for GD on the population loss, as shown next.
GD on the Population Loss.
We begin by observing that we can characterize the GD flow on the population loss by considering only the evolution of the eigenvalues of .
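One way to see why eigenvalues are the natural variables is the following standard Gaussian identity: for a standard Gaussian vector x and a symmetric matrix D, Wick's theorem gives E[(x^T D x)^2] = 2 tr(D^2) + (tr D)^2, which depends on D only through its eigenvalues. The sketch below checks this identity by Monte Carlo. Identifying D with the difference between the student and teacher matrices, and the overall prefactor of the loss, are assumptions made for illustration; `population_quartic` is a hypothetical helper name.

```python
import numpy as np

def population_quartic(D):
    """Closed form of E[(x^T D x)^2] for x ~ N(0, I) and symmetric D."""
    return 2.0 * np.trace(D @ D) + np.trace(D) ** 2

rng = np.random.default_rng(0)
d = 4
B = rng.standard_normal((d, d))
D = (B + B.T) / 2                      # generic symmetric matrix

# Monte Carlo estimate of the same expectation.
X = rng.standard_normal((400_000, d))
q = np.einsum("ki,ij,kj->k", X, D, X)  # quadratic forms x_k^T D x_k
mc_estimate = np.mean(q ** 2)
exact = population_quartic(D)

# The closed form is invariant under rotations D -> Q D Q^T,
# i.e. it is a function of the eigenvalues of D only.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
rotated = population_quartic(Q @ D @ Q.T)
eigs = np.linalg.eigvalsh(D)
from_eigs = 2.0 * np.sum(eigs ** 2) + np.sum(eigs) ** 2

print("Monte Carlo:", mc_estimate, " closed form:", exact)
```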
Let be the solution to the GD flow (7) over the population loss assuming that . Denote by an orthogonal matrix whose columns are the eigenvectors of , so that with . Let so that . Then remains diagonal during the dynamics and the evolution of its entries is given by
In addition the population loss is given by
Notice that this theorem indicates that it suffices to characterize the convergence rate of the slowest eigenvalue to the target to obtain the convergence rate of the loss. The equations in (14) can easily be solved numerically. An asymptotic analysis of their solution when is large is also possible, as shown next.
Consider the case first. Then eigenvalues of are zero, and without loss of generality we can order so that the zero eigenvalues of are last. Denoting , for (14) then reads
We will call the first eigenvalues informative eigenvalues and the remaining (captured by ) non-informative eigenvalues. We make two observations. Since , initially the leading order term in the equation for the non-informative eigenvalues is
Substituting this solution into (16) we deduce
(18) and (19) imply an initial decrease in time of both the non-informative and the informative eigenvalues. However, when becomes of order one or smaller, the other terms in equation (16) take over and allow the informative eigenvalues to bounce back up. This happens at a time in . Afterwards the informative eigenvalues emerge from the non-informative ones with an exponential growth, . As a result, these informative eigenvalues eventually match the eigenvalues of the teacher at a typical time of order . This analysis also implies a quadratic decay in time of the loss at long times
In Appendix B we give additional details comparing the asymptotic analysis to the real dynamics when but not necessarily much smaller. This analysis can e.g. be done quite explicitly when the units in the teacher are orthonormal. It indicates that at all times, and as a result shows that
at all times.
Consider the case with next. Then (14) can be written as
which gives an exponential convergence to the target , and consequently an exponential convergence in the population loss. For example, let us specialize to the case of a teacher with orthonormal hidden vectors, for . The eigenvalues will converge to their target value as . Consequently the loss (15) will converge to zero exponentially in this case
The results above are confirmed in the numerics. The cases when and are shown by the first two and last two panels in Fig. 2, respectively. When the decay of the empirical loss is quadratic, consistent with (20). In contrast, when , the absence of non-informative eigenvalues removes the dominating terms in the loss (15). Therefore the loss is dominated by the informative eigenvalues and decays exponentially, consistent with (23). This can be clearly observed in Fig. 2, where the four panels show the population loss using teachers with and . The black dotted line shows the quadratic asymptotic decay predicted in (20). The last two panels of the sequence show the exponential decay as predicted in (23).
Fig. 3 shows the training and the population loss observed in the simulation using input dimension and a teacher with hidden unit. In this case our analysis suggests that the typical realization will converge to zero generalization error if . This can be observed in the right panel of Fig. 3. We used a dashed line to represent the gradient in the population loss (7) and a dotted line to represent the approximated result (21); the two are almost indistinguishable in the figure.
5 Probing the Loss Landscape with the String Method
Finally, let us show that we can use the string method [25, 26, 27] to probe the geometry of the training loss landscape and confirm Theorem 3.1 numerically. The string method consists in connecting the student and the teacher with a curve (or string) in matrix space, and evolving this curve by GD while controlling its parametrization. In practice, this can be done efficiently by discretizing the string into equidistant images or points (with the Frobenius norm as metric), and iterating upon (i) evolving these images by the descent dynamics, and (ii) reparametrizing the string to make the images equidistant again. At convergence the string will identify a minimum energy path between and which will possibly have a flat portion at zero empirical loss if this loss can be minimized by GD before reaching . That is, along the string, the student reaches the first minimum by GD, and, if , then moves along the set of minimizers of the empirical loss until it reaches . The advantage of the method is that by replacing the physical time along the trajectory by the arclength along it, it allows one to go to infinite times (when ) and beyond (when ), thereby probing features of the loss landscape not accessible by standard GD. (Of course it requires one to know the target in advance, i.e. the string method cannot be used instead of GD to identify this target in situations where it is unknown.)
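The two steps of the iteration, descent followed by reparametrization, can be sketched on a toy problem. The example below is an assumption-laden illustration and not the computation behind Fig. 4: it uses a two-dimensional double-well potential instead of the empirical loss in matrix space, the Euclidean norm instead of the Frobenius norm, linear interpolation for the reparametrization, and hypothetical function names and step sizes. The two end points of the string sit at the two minima, and the converged string passes through the saddle point between them.

```python
import numpy as np

def potential(p):
    """Toy double well V(x, y) = (x^2 - 1)^2 + y^2 with minima at (+-1, 0)."""
    x, y = p[:, 0], p[:, 1]
    return (x ** 2 - 1.0) ** 2 + y ** 2

def grad_potential(p):
    x, y = p[:, 0], p[:, 1]
    return np.stack([4.0 * x * (x ** 2 - 1.0), 2.0 * y], axis=1)

def reparametrize(images):
    """Step (ii): redistribute the images at equal arclength along the string."""
    seg = np.linalg.norm(np.diff(images, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    s /= s[-1]                                   # normalized arclength in [0, 1]
    target = np.linspace(0.0, 1.0, len(images))
    cols = [np.interp(target, s, images[:, k]) for k in range(images.shape[1])]
    return np.stack(cols, axis=1)

def string_method(images, grad, step=0.01, n_iter=2000):
    for _ in range(n_iter):
        images = images - step * grad(images)    # step (i): descent on each image
        images = reparametrize(images)           # step (ii): equidistant images
    return images

# Initial string: a half circle joining the two minima through (0, 1).
theta = np.linspace(np.pi, 0.0, 21)
initial = np.stack([np.cos(theta), np.sin(theta)], axis=1)

string = string_method(initial, grad_potential)
energy = potential(string)                        # energy profile along the string
print("barrier height along the string:", energy.max())
```

Plotting `energy` against the image index gives the analogue of the profiles discussed below: the horizontal axis is arclength along the path rather than physical time.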
In Fig. 4 we compare the strings obtained for input dimension 4 (red), 6 (purple), and 8 (blue). The strings are parametrized by 100 points represented on the horizontal axes. Moving from the leftmost to the rightmost panels in Fig. 4 the number of samples in the dataset increases, namely . Gradually all the represented will reach the critical size and will have a landscape with a single minimum, the informative one. Observe that for relatively small sample sizes, there is little correspondence between the topology of the training loss landscape and that of the population loss. As the size increases, the correspondence improves until the two are only slightly apart.
We thank Joan Bruna and Ilias Zadik for precious discussions. SSM acknowledges the Courant Institute for the hospitality during his visit. We acknowledge funding from the ERC under the European Union’s Horizon 2020 Research and Innovation Programme Grant Agreement 714608-SMiLe.
-  Avrim Blum and Ronald L Rivest. Training a 3-node neural network is NP-complete. In Advances in neural information processing systems, pages 494–501, 1989.
-  Emmanuel Abbe and Colin Sandon. Provable limitations of deep learning. arXiv preprint arXiv:1812.06369, 2018.
-  Shengchao Liu, Dimitris Papailiopoulos, and Dimitris Achlioptas. Bad global minima exist and SGD can reach them. arXiv preprint arXiv:1906.02613, 2019.
-  Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
-  Simon Du, Jason Lee, Yuandong Tian, Aarti Singh, and Barnabas Poczos. Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1339–1348, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
-  Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in neural information processing systems, pages 855–863, 2014.
-  Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on learning theory, pages 1246–1257, 2016.
-  Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2018.
-  Luca Venturi, Afonso S Bandeira, and Joan Bruna. Spurious valleys in one-hidden-layer neural network optimization landscapes. Journal of Machine Learning Research, 20(133):1–34, 2019.
-  Stefano Sarao Mannelli, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Passed & spurious: Descent algorithms and local minima in spiked matrix-tensor models. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4333–4342, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
-  Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
-  Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662, 2014.
-  Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
-  Simon Du and Jason Lee. On the power of over-parametrization in neural networks with quadratic activation. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1329–1338, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
-  Benjamin D Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
-  Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2603–2612. JMLR. org, 2017.
-  Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.
-  Benjamin Aubin, Antoine Maillard, Florent Krzakala, Nicolas Macris, Lenka Zdeborová, et al. The committee machine: Computational to statistical gaps in learning a two-layers neural network. In Advances in Neural Information Processing Systems, pages 3223–3234, 2018.
-  Sebastian Goldt, Madhu Advani, Andrew M Saxe, Florent Krzakala, and Lenka Zdeborová. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. In Advances in Neural Information Processing Systems, pages 6979–6989, 2019.
-  James R Fienup. Phase retrieval algorithms: a comparison. Applied optics, 21(15):2758–2769, 1982.
-  Yuxin Chen, Yuejie Chi, Jianqing Fan, and Cong Ma. Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming, 176(1-2):5–37, 2019.
-  Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Complex dynamics in simple neural networks: Understanding gradient flow in phase retrieval. arXiv preprint arXiv:2006.06997, 2020.
-  David Gamarnik, Eren C Kızıldağ, and Ilias Zadik. Stationary points of shallow neural networks with quadratic activation function. arXiv preprint arXiv:1912.01599, 2019.
-  Weinan E, Weiqing Ren, and Eric Vanden-Eijnden. String method for the study of rare events. Physical Review B, 66(5):052301, 2002.
-  Weinan E, Weiqing Ren, and Eric Vanden-Eijnden. Simplified and improved string method for computing the minimum energy paths in barrier-crossing events. Journal of Chemical Physics, 126(16):164103, 2007.
-  C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.
-  Thomas M Cover and Bradley Efron. Geometrical probability and random points on a hypersphere. The Annals of Mathematical Statistics, pages 213–220, 1967.
-  Alfred J Lotka. Contribution to the theory of periodic reactions (1910). The Journal of Physical Chemistry, 14(3):271–274, 2002.
-  Vito Volterra. Variazioni e fluttuazioni del numero d’individui in specie animali conviventi. C. Ferrari, 1927.
Appendix A Proofs and Technical Lemmas
We also note that (5) (and similarly (7) if we use the population loss in (8) instead of the empirical loss in (6)) can be viewed as the time-continuous limit of a simple proximal scheme involving the Cholesky decomposition of and the standard Frobenius norm as Bregman distance. We state this result as:
Given define the sequence of matrices via
where is a parameter. Then
where solves (5) for the initial condition .
Look for a solution to the minimization problem in (A.1) of the form
To leading order in , the objective function in (A.1) becomes
which we can set to zero by choosing with
In terms of the minimizer of the original problem this equation can be written as
Letting and with , we deduce that solution to
Setting we have
which is (5). ∎
A.2 Proof of Theorem 3.1
Let be a symmetric, positive semidefinite minimizer of the empirical loss and consider . Since this matrix is symmetric, there exists an orthonormal basis in made of its eigenvectors, . Since is positive semidefinite by assumption and is rank one, eigenvalues of are nonnegative, and only one can be positive, negative, or zero. Let us order the eigenvectors such that their associated eigenvalues are for and . Given the data vector , to be a minimizer of the empirical loss must satisfy
Let us analyze when (A.4) admits solutions that are not . To this end, assume first that . Then, as soon as , for each with probability one there is at least one such that . As a result, if , as soon as , the only solution to (A.4) is for all , i.e. a.s.
The worst-case scenario is actually when . In that case (A.4) can be written
This equation means that if we let (i.e. but lie in the same hemisphere as ), then the vectors must all lie on the surface of an elliptical cone centered around , with the principal axes of the ellipsoids aligned with , ; the intersection of the cone with the hyperplane is the ellipsoid whose boundary satisfies the equation
In , it takes vectors to uniquely define such an elliptical cone. This means that, in the worst-case scenario, we recover the threshold . This worst-case scenario is however unlikely. To see why, assume that , and consider the convex polyhedral cone spanned by , i.e. the region
In order that (A.6) have a nontrivial solution, the extremal rays of (i.e. its edges of dimension 1) must coincide with the set , that is, all rays , for must lie on the boundary of and none can be in the interior of ; indeed these extremal rays must also be on the boundary of the elliptical cone . However, Theorem 3’ in  asserts that, if the vectors in the set are in general position (i.e. if the vectors in any subset of size no more than are linearly independent, which happens with probability one if are i.i.d. Gaussian), the number of extremal rays of satisfies
This implies that
Since by definition, we have , which from (A.9) implies that a.s. if . In turn, this implies that the probability that all the vectors in are extremal rays of the cone tends to 1 as with and . This also means that the probability that (A.6) has a solution with also tends to 1 in this limit, i.e. (10) holds. Conversely, since for , the probability that remains positive as with and . This means that the probability that (A.6) has no solution with is positive in this limit, i.e. (11) holds.
A.3 Heuristic argument for arbitrary and
Minimizers of the empirical loss satisfy:
Clearly is always a solution to this set of equations. The question is: how large should be in order that be the only solution to these equations? If were an arbitrary symmetric matrix, we already know the answer: with probability one, we need . What makes the problem more complicated is that is required to be positive semidefinite. If we assume that has rank , this implies that must be a symmetric matrix with nonnegative eigenvalues and eigenvalues whose sign is unconstrained, and we need to understand what this requirement imposes on the solution to (A.10).
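The baseline count for unconstrained symmetric matrices can be checked numerically. For inputs of dimension d, symmetric matrices form a space of dimension d(d+1)/2, and the rank-one matrices built from generic data vectors span this space as soon as there are at least that many of them. The sketch below is an illustration with hypothetical names (`sym_features`, `span_dimension`) and an arbitrary small dimension; it represents each rank-one matrix by its upper-triangular entries and computes the rank of the resulting collection.

```python
import numpy as np

def sym_features(X):
    """Upper-triangular entries of x x^T for each row x of X, shape (n, d(d+1)/2)."""
    d = X.shape[1]
    iu = np.triu_indices(d)
    outer = X[:, :, None] * X[:, None, :]      # (n, d, d) stack of x x^T
    return outer[:, iu[0], iu[1]]

def span_dimension(X):
    """Dimension of the span of the matrices x_k x_k^T inside symmetric matrices."""
    return int(np.linalg.matrix_rank(sym_features(X)))

rng = np.random.default_rng(0)
d = 4
dim_sym = d * (d + 1) // 2                      # = 10 for d = 4

for n in (dim_sym - 1, dim_sym, dim_sym + 5):
    X = rng.standard_normal((n, d))             # i.i.d. Gaussian data vectors
    print(f"n = {n:2d}: span dimension = {span_dimension(X)} of {dim_sym}")
```

With fewer than d(d+1)/2 data vectors the span is a strict subspace, so some nonzero symmetric matrix is orthogonal to all the data matrices; the positivity constraint discussed next is what lowers the number of samples actually needed.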
In the trivial case when (i.e. ), if we decompose , where contains its eigenvectors and is a diagonal matrix with its eigenvalues , , (A.10) can be written as
where , are linearly independent eigenvectors of . In this case, since , with probability one we only need data vectors to guarantee that the only solution to this equation is for all , i.e. . Another way to think about this is to realize that the nonnegativity constraint on has removed degrees of freedom from the original in .
If , the situation is more complicated, but we can consider the projection of onto the subspace not spanned by , i.e. the matrix defined as
where is the matrix whose columns are linearly independent eigenvectors of with zero eigenvalue. All the eigenvalues of are nonnegative, and this imposes constraints in the subspace where lives. If we simply subtract this number from we obtain
which is precisely (12).
This argument is nonrigorous because we cannot a priori treat separately (A.10) in the subspace spanned by and its orthogonal complement. Yet, our numerical results suggest that this assumption is valid, at least as
a.4 Proof of Theorem 4.2
Since is symmetric and positive semidefinite, its eigenvalues are nonnegative and there exists an orthonormal basis made of its eigenvectors. Denote this basis by and let us order it in such a way that the corresponding eigenvalues are for , and for . Denote by the orthogonal matrix whose columns are the eigenvectors of , so that with . Let . Since by assumption, and from (7) this matrix evolves according to
Appendix B Additional Results
b.1 Supporting numerical results to Fig. 1
Fig. B.1 shows the average performance of GD with datapoints and a teacher with and Gaussian hidden units. The figure is intended to show a vertical cut through the dynamical phases of Fig. 1. Moving up in at fixed we observe that on average the simulations converge when and do not when , i.e. there is an abrupt change of behavior when we cross the transition. Another interesting aspect of the figure is that the first panel has which leads to and an exponential (rather than quadratic) convergence rate in the loss, consistent with our analysis. The dotted line is a reference line that represents the decay of the loss.
b.2 Supporting numerical results to Theorem 3.1
In Fig. B.2 we present a numerical verification of Theorem 3.1. According to the theorem, as with (so that ) the probability of finding the teacher should converge to zero for and to positive values for . The left panel of the figure shows the fraction of 100 simulations that achieved at least generalization loss after iterations with learning rate . The right panel shows the number of simulations for which the ratio between training and generalization losses is larger than . This second panel is meant to capture the simulations for which we expect convergence eventually, but for which the number of iterations was not enough to achieve it. In particular, we observed that when generalization fails, meaning that the training loss goes to zero while the generalization loss stays at a high value, the convergence rate of the training loss is exponential, in contrast to simulations in which the generalization loss eventually goes to zero, which have a convergence rate. Using simulations with iterations is sufficient to detect the difference between the two cases, and this therefore gives us a good criterion to distinguish between successful and unsuccessful simulations.
To provide more evidence for this reasoning, in Fig. B.3 we show the training and generalization losses of simulations for , and . We order the simulations according to the loss and show in the three panels three snapshots for different numbers of iterations. From left to right the number of iterations increases by a factor of 10 in each panel. As can be seen, the ratio between generalization loss and training loss at the end of training is a valid measure of success.
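The success criterion described above can be written as a small decision rule. The following is a minimal sketch: the function name `classify_run` and both tolerance values are hypothetical placeholders, since the thresholds used for Fig. B.2 are not reproduced here. A run counts as converged if its generalization loss is below a tolerance, as expected to converge if the training loss has not collapsed relative to the generalization loss, and as failed otherwise.

```python
def classify_run(train_loss, gen_loss, gen_tol=1e-6, ratio_tol=1e-2):
    """Classify one simulation from its final losses.

    gen_tol and ratio_tol are illustrative values, not those of the paper.
    """
    if gen_loss <= gen_tol:
        return "converged"
    # Training loss still comparable to generalization loss:
    # both are decaying together, so convergence is expected eventually.
    if train_loss / gen_loss > ratio_tol:
        return "expected to converge"
    # Training loss has collapsed while generalization loss stays high.
    return "failed to generalize"

print(classify_run(1e-8, 1e-8))
print(classify_run(1e-4, 2e-4))
print(classify_run(1e-12, 1e-1))
```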
b.3 Extrapolation procedure
We estimate the critical value of numerically by fixing a threshold in the population loss, , and simulating the problem for a large set of . Starting from the largest value in the set, as approaches the critical value the time needed to pass the threshold increases as a power law . In Fig. B.4 we fit the relaxation times to cross a threshold in the population loss of for and . The extrapolated thresholds
and their 95% confidence intervals are: for , ; for , ; for , ; and for , . Close to the threshold , namely , , , and , as expected. The larger the input dimension, the longer the time to pass the threshold, and as a result the smallest accessible value of also increases. This decreases the accuracy of the threshold estimate, as reflected in the larger confidence intervals. The same procedure has been applied for other values of to obtain the points shown in Fig. 1.
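An extrapolation of this kind can be sketched as a nonlinear least-squares fit. The example below assumes a divergence of the form t(n) = C (n - n_c)^(-theta) for the time to cross the loss threshold; this functional form, the names `fit_threshold` and `log_relaxation_time`, the synthetic data, and the normal-approximation confidence intervals are all illustrative assumptions rather than the procedure actually used for Fig. B.4.

```python
import numpy as np
from scipy.optimize import curve_fit

def log_relaxation_time(n, n_c, theta, log_c):
    # log of the assumed power law t(n) = C * (n - n_c)**(-theta)
    return log_c - theta * np.log(n - n_c)

def fit_threshold(n_values, times):
    """Fit (n_c, theta, log C) in log space; return estimates and 95% half-widths."""
    n_values = np.asarray(n_values, dtype=float)
    times = np.asarray(times, dtype=float)
    n_min = n_values.min()
    # Initial guess: critical value slightly below the smallest simulated n.
    p0 = (n_min - 0.1 * (n_values.max() - n_min), 1.0, np.log(times.mean()))
    # Keep n_c below every data point so that log(n - n_c) stays defined.
    bounds = ([-np.inf, 0.0, -np.inf], [n_min - 1e-9, np.inf, np.inf])
    popt, pcov = curve_fit(log_relaxation_time, n_values, np.log(times),
                           p0=p0, bounds=bounds)
    half_width = 1.96 * np.sqrt(np.diag(pcov))  # normal approximation
    return popt, half_width

# Synthetic relaxation times with a known critical value (illustration only).
rng = np.random.default_rng(0)
n_values = np.linspace(45.0, 100.0, 12)
times = 50.0 * (n_values - 40.0) ** (-2.0) * np.exp(0.01 * rng.standard_normal(12))

popt, half_width = fit_threshold(n_values, times)
print(f"n_c = {popt[0]:.2f} +/- {half_width[0]:.2f}, theta = {popt[1]:.2f}")
```

Fitting in log space weights all points by relative rather than absolute error, which is a design choice suited to relaxation times that span several orders of magnitude near the threshold.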