Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions

06/27/2020, by Stefano Sarao Mannelli et al.

We study the dynamics of optimization and the generalization properties of one-hidden layer neural networks with quadratic activation function in the over-parametrized regime where the layer width m is larger than the input dimension d. We consider a teacher-student scenario where the teacher has the same structure as the student, with a hidden layer of smaller width m^* < m. We describe how the empirical loss landscape is affected by the number n of data samples and the width m^* of the teacher network. In particular we determine how the probability that there are no spurious minima on the empirical loss depends on n, d, and m^*, thereby establishing conditions under which the neural network can in principle recover the teacher. We also show that under the same conditions gradient descent dynamics on the empirical loss converges and leads to small generalization error, i.e. it enables recovery in practice. Finally we characterize the convergence rate of gradient descent in the limit of a large number of samples. These results are confirmed by numerical experiments.

1 Introduction

Neural networks are a key component of the machine learning toolbox. Still, the reasons behind their success remain mysterious from a theoretical perspective. While sufficiently large neural networks can in principle represent a large class of functions, we do not yet understand under what conditions their parameters can be adjusted in an algorithmically tractable way for that purpose. For example, under worst-case assumptions, some functions cannot be tractably learned with neural networks [1, 2]. We also know that there exist settings with adversarial initializations where neural networks fail to generalize to new samples, while the same setting with random initial conditions succeeds [3]. And yet, in many practical settings, neural networks are trained successfully even with simple local algorithms such as gradient descent (GD) or stochastic gradient descent (SGD).

The problem of learning the parameters of a neural network is two-fold. First, we want training on a set of data via minimization of a suitable loss function to succeed in finding a set of parameters for which the value of the loss is close to its global minimum. Second, and more importantly, we want such a set of parameters to also generalize well to unseen data. Theoretical guarantees have been obtained in many settings by a geometrical analysis of the loss showing that only global minima are present, see e.g. [4, 5]. In particular, it has been shown that network over-parametrization can be beneficial and lead to landscapes without spurious minima in which GD or SGD converge [6, 7, 8, 9, 10]. However, over-parametrized neural networks successfully optimized on a training set do not necessarily generalize well – for example, neural networks can achieve zero training error without learning any rule [11]. It is therefore important to understand when zero training loss implies good generalization.

It is known empirically that deep neural networks can learn functions that can be represented with a much smaller (sometimes even shallow) neural network [12, 13, 14], but that learning the smaller network without first learning the larger one is computationally harder [6]. Our work gives a theoretical justification for this empirical observation by providing an explicit and rigorously analyzable case where this happens.

Main contributions:

In this work we investigate the issues of training and generalization in the context of a teacher-student set-up. We assume that both the teacher and the student are one-hidden layer neural networks with quadratic activation function and quadratic loss. We focus on the over-parametrized, or over-realizable, case where the hidden layer of the teacher is smaller than that of the student, m^* < m. We further assume that the hidden layer of the student is larger than the input dimension, m ≥ d. In that case:

  • We show that the value of the empirical loss is zero on all of its minimizers, but that the set of minimizers does not reduce to the singleton containing only the teacher network in general.

  • We derive a critical value of the number of samples per dimension n/d above which the set of minimizers of the empirical loss has a positive probability to reduce to the singleton containing only the teacher network in the limit d → ∞, i.e. we derive a sample complexity threshold above which the minimizer can have good generalization properties. The formula is proven for a teacher with a single hidden unit (a.k.a. phase retrieval).

  • We study gradient descent flow on the empirical loss starting from random initialization and show that it converges to a network that achieves perfect generalization above this sample complexity threshold.

  • We evaluate the convergence rate of gradient descent in the limit of a large number of samples. We identify two different regimes of convergence according to the input dimension and the number of hidden units, showing that in one case the loss decays as a power law in time while in the second case it decays exponentially.

  • We show how the string method can be used to probe the empirical loss landscape and find minimum energy paths on this landscape connecting the initial weights of the student to those of the teacher, possibly going through flat portions or over energy barriers. This allows one to probe features of this landscape that are not accessible by standard GD.

In Sec. 2 we formally define the problem and derive some key properties that we use in the rest of the paper. In Sec. 3 we analyze the training and generalization losses from the geometrical perspective, and derive the formula for the sample complexity threshold. In Sec. 4 we show that gradient descent flow can find good minima for datasets above this sample complexity threshold, and we characterize its convergence rate. In Sec. 5 we present our results using the string method to probe the loss landscape. Finally, in the appendix we give the proofs and some additional numerical results.

Related works:

One-hidden layer neural networks with quadratic activation functions in the over-parametrized regime were considered in a range of previous works [9, 15, 8, 16, 17]. Notably, it was shown that all local minima are global and that gradient descent finds the global optimum when the number of hidden units is larger than the input dimension [15, 8, 16], and also when the number of hidden units is sufficiently large compared to the number of samples [15, 17]. Most of these results were established for arbitrary training data of input/output pairs, but consequently these works did not establish conditions under which the minimizers reached by gradient descent have good generalization properties. Indeed, it is intuitive that over-parametrization renders the optimization problem simpler, but it is rather non-intuitive that it does not destroy good generalization properties. In [15], a generalization rate was established under the assumption that the input data are i.i.d. Gaussian. However, the generalization properties of neural networks with a number of samples comparable to the dimensionality are mostly left open.

Much tighter (Bayes-optimal) generalization properties of neural networks were established for data generated by the teacher-student model, for generalized linear models in [18], and for one hidden layer much smaller than the dimension in [19]. However, these results were only shown to be achievable with approximate message passing algorithms, and the performance of the gradient-descent algorithm was not analyzed. Studying over-parametrization with an analogous tightness of generalization results is an open problem that has been addressed only for one-pass stochastic gradient descent [20].

A notable special case of our setting is when the teacher has only one hidden unit, in which case the teacher network is equivalent to the phase retrieval problem with a random sensing matrix [21]. For this case the performance of message passing algorithms is well understood: in the high-dimensional regime, perfect generalization requires a number of samples linearly proportional to the dimension [18]. For randomly initialized gradient descent, the best existing rigorous result for phase retrieval requires a larger number of samples [22]. The performance of gradient descent on the phase retrieval problem is studied in detail in a concurrent work [23], showing numerically that without overparametrization randomly initialized gradient descent needs considerably more samples to reach perfect generalization. In the present work we show that overparametrized neural networks are able to solve the phase retrieval problem with a number of samples proportional to the dimension in the high-dimensional limit. This improves upon [22] and falls close to the performance of the approximate message passing algorithm that is conjectured optimal among polynomial ones [18]. Most interesting is the comparison between our results for phase retrieval obtained with overparametrized neural networks and the results of [23], which show that without overparametrization a considerably larger number of samples is needed for gradient descent to learn the same function. This comparison provides a theoretical justification for how overparametrization helps gradient descent reach good generalization with fewer samples. We stress that the same property does not apply to message passing algorithms. One may speculate that more of the properties of overparametrization observed in deep learning are specific to gradient-descent-based algorithms and would not hold for other algorithmic classes.

Closely related to our work is Ref. [24], in which the authors consider the same teacher-student problem as we do. The main difference is that they only consider teachers whose number of hidden units is at least the input dimension, m^* ≥ d, while we consider arbitrary m^*. As we show below, the regime where m^* < d turns out to be interesting as it affects nontrivially the critical number of samples needed for recovery and leads to a more complex scenario in which this critical number depends also on m^*; in particular, taking m^* < d allows for recovery below the threshold n = d(d+1)/2, which is one of our main results.

2 Problem formulation

Consider a teacher-student scenario where a teacher network generates the dataset, and a student network aims at learning the function of the teacher. The teacher has weights w^*_1, …, w^*_{m^*}, each in R^d. We keep the teacher weights generic in most of the paper and specify them when needed, in particular for the simulations, where we consider two specific teachers: one with i.i.d. Gaussian weights with identity covariance, and one with orthonormal weights.

The student’s weights are w_1, …, w_m, each in R^d, with m ≥ d. Given an input x in R^d, the teacher’s and student’s outputs are respectively

(1)

where the second-layer weights of the teacher and of the student are fixed (not trained). The teacher produces the n outputs y_μ corresponding to random i.i.d. standard Gaussian samples x_μ, μ = 1, …, n. Given this dataset, we define the empirical loss

(2)

where the expectation is taken with respect to the empirical measure of the data. As usual, the population loss is obtained by taking the expectation of (2) with respect to the input distribution.
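To make the setup concrete, the following minimal numerical sketch generates teacher-student data and evaluates an empirical loss of the form above. It assumes second-layer weights fixed to 1 (so that each output is a plain sum of squared projections) and a 1/(2n) normalization of the squared error; the exact constants in (1)-(2) may differ, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def quadratic_net(W, X):
    """Output of a one-hidden-layer network with quadratic activation.
    W: (d, m) weight matrix, X: (n, d) inputs. Second-layer weights are
    assumed fixed to 1, so y(x) = sum_k (w_k . x)^2 = x^T (W W^T) x."""
    return np.sum((X @ W) ** 2, axis=1)

d, m_star, m, n = 20, 1, 25, 200              # input dim, teacher/student widths, samples
W_star = rng.standard_normal((d, m_star))      # teacher weights (i.i.d. Gaussian columns)
W = rng.standard_normal((d, m)) / np.sqrt(d)   # student initialization

X_data = rng.standard_normal((n, d))           # i.i.d. standard Gaussian inputs x_mu
y_data = quadratic_net(W_star, X_data)         # teacher labels y_mu

def empirical_loss(W, X, y):
    """Assumed form of the empirical loss: (1/2n) sum_mu (yhat_mu - y_mu)^2."""
    return 0.5 * np.mean((quadratic_net(W, X) - y) ** 2)

print(empirical_loss(W, X_data, y_data))
```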

The student minimizes the empirical loss (2) using gradient descent flow on the weights. Explicitly,

(3)

where we introduced the following matrices

(4)

We can now see that a closed equation for the matrix X associated with the student in (4) can be derived from (3). This new equation reduces the effective number of parameters from d × m to d(d+1)/2 without affecting either the dynamics or the other properties of the teacher and the student, since their outputs depend on the weights only through X and the analogous teacher matrix X^*:

Lemma 2.1.

The GD flow (3) of the weights on the empirical loss induces the following evolution equation for X:

(5)

where the gradient is taken with respect to X, and the loss appearing in (5) is twice the empirical loss (2) rewritten in terms of X:

(6)

It is also possible to write the equivalent of this lemma for the population loss:

Lemma 2.2.

The GD flow of the weights on the population loss induces the following evolution equation for X:

(7)

where the loss appearing in (7) is twice the population loss written in terms of X:

(8)

Expression (8) for the population loss was already given in [24].
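To illustrate Lemma 2.1 numerically, the sketch below compares one small gradient-descent step on the weights W with one step of a candidate closed flow for the matrix X = W W^T, namely dX/dt = −(∇L̃(X) X + X ∇L̃(X)) with L̃ equal to twice the empirical loss written in terms of X. Both the loss normalization and this particular form of the flow are our assumptions (they are not a quote of Eqs. (5)-(6)), so the check should be read as a plausibility test of the structure of the lemma.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, m_star, n, dt = 8, 10, 2, 60, 1e-4

X_data = rng.standard_normal((n, d))
W_star = rng.standard_normal((d, m_star))
y = np.sum((X_data @ W_star) ** 2, axis=1)            # teacher labels (second layer = 1)

W = rng.standard_normal((d, m)) / np.sqrt(d)          # student weights
X = W @ W.T                                           # candidate matrix variable

def grad_L_tilde(X):
    """Gradient of L_tilde(X) = (1/n) sum_mu (x_mu^T X x_mu - y_mu)^2,
    i.e. twice the assumed empirical loss, taken with respect to X."""
    r = np.einsum('ni,ij,nj->n', X_data, X, X_data) - y
    return (2.0 / n) * np.einsum('n,ni,nj->ij', r, X_data, X_data)

# one Euler step of GD on W for the loss (1/2n) sum_mu r_mu^2
# (for this loss, dL/dW equals grad_L_tilde(W W^T) @ W)
W_new = W - dt * grad_L_tilde(X) @ W
X_from_W = W_new @ W_new.T

# one Euler step of the candidate closed flow dX/dt = -(G X + X G), G = grad_L_tilde(X)
G = grad_L_tilde(X)
X_from_flow = X - dt * (G @ X + X @ G)

print(np.max(np.abs(X_from_W - X_from_flow)))         # O(dt^2), i.e. negligible
```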

3 Geometrical Considerations and Sample Complexity Threshold

The empirical loss (6), viewed as a function of X, is quadratic, hence convex, with minimum value zero. In addition, X = X^* is a minimizer since it achieves zero loss. The main question we want to address next is when this minimizer is unique.

Since the trace defines a scalar product on the vector space of d × d matrices, in which symmetric matrices form a d(d+1)/2-dimensional subspace, the empirical loss will be strictly convex on this subspace iff we span it using d(d+1)/2 linearly independent matrices of the form x_μ x_μ^T [24]. Yet, if we restrict considerations to matrices X that are also positive semidefinite, less data is needed to guarantee that X^* is the unique minimizer of the empirical loss, at least in some probabilistic sense:

Theorem 3.1 (Single unit teacher).

Consider a teacher with m^* = 1 and a student with m ≥ d hidden units, so that X^* has rank 1 and X can have full rank. Given a data set of n vectors x_μ, each drawn independently from a standard Gaussian, consider the set of minimizers of the empirical loss constructed with these data over symmetric positive semidefinite matrices X, i.e.

(9)

Set n = αd for some α > 0 and let d → ∞. Then

(10)

whereas

(11)

In words, this theorem says that there exists a threshold value α_c such that for any α > α_c there is a finite probability that the empirical loss landscape trivializes and all spurious minima disappear in the limit d → ∞. For α < α_c, however, this is not the case and spurious minima exist with probability 1 in that limit. Therefore, the chance to learn the teacher by minimizing the empirical loss from a random initial condition is zero if α < α_c but becomes positive if α > α_c. The proof of Theorem 3.1 is presented in Appendix A. This proof shows that we can account for the constraint that X be positive semidefinite by making a connection with the classic problem of counting the extremal rays of proper convex polyhedral cones generated by a set of random vectors in general position. Interestingly, this proof also gives a criterion on the data set that guarantees that X^* is the only minimizer of the empirical loss: it suffices to check that the proper convex polyhedral cone constructed with the data vectors has a number of extremal rays that is less than the number of data vectors.
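The counting behind the d(d+1)/2 threshold for unconstrained symmetric matrices can be checked directly: strict convexity on the symmetric subspace requires that the rank-one matrices x_μ x_μ^T span it. A minimal sketch, assuming i.i.d. Gaussian inputs:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
dim_sym = d * (d + 1) // 2      # dimension of the space of symmetric d x d matrices

def vech(S):
    """Half-vectorization: stack the upper-triangular part of a symmetric matrix."""
    return S[np.triu_indices(S.shape[0])]

for n in (dim_sym - 1, dim_sym, 2 * dim_sym):
    X_data = rng.standard_normal((n, d))
    A = np.array([vech(np.outer(x, x)) for x in X_data])   # n x d(d+1)/2 design matrix
    print(n, np.linalg.matrix_rank(A), dim_sym)
# With Gaussian data the rank equals min(n, d(d+1)/2): spanning the whole symmetric
# subspace, hence strict convexity on it, requires n >= d(d+1)/2.
```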

Heuristic extension for arbitrary m^*.

The result of Theorem 3.1 can also be understood via a heuristic algebraic argument that has the advantage of applying to arbitrary m^*. The idea, elaborated upon in Appendix A.3, is to count the number of constraints needed to ensure that the only minimum of the empirical loss is X^*, taking into account that (i) X has full rank while X^* has rank m^*, and (ii) both X and X^* are positive semidefinite and symmetric, so that the number of negative eigenvalues of X − X^* can be at most m^*. If we use a block representation of X − X^* in which we diagonalize the block that contains the directions associated with the eigenvectors with nonnegative eigenvalues, and simply count the number of nonzero entries in the resulting matrix (accounting for its symmetry), for m^* < d we arrive at

(12)

while for m^* ≥ d we recover the threshold d(d+1)/2 already found in [9, 24]. Setting n = αd and sending d → ∞, this gives the sample complexity threshold

(13)

which, for m^* = 1, agrees with the result in Theorem 3.1. The sample complexity threshold is confirmed in Fig. 1 via simulations using gradient descent (GD) on the empirical loss; we explain this figure in Sec. 4 after establishing that the GD dynamics converges.
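For reference, the counting argument of Appendix A.3 (subtracting the constraints imposed by positive semidefiniteness in the orthogonal complement of the teacher from the d(d+1)/2 parameters of a symmetric matrix) suggests the sample complexity coded below. This is our reconstruction of the count, stated as an assumption rather than a quote of (12)-(13).

```python
def n_critical(d, m_star):
    """Heuristic critical number of samples (reconstruction of the count in App. A.3):
    d(d+1)/2 free parameters of a symmetric matrix, minus the constraints coming from
    positive semidefiniteness in the (d - m_star)-dimensional null space of the teacher."""
    k = max(d - m_star, 0)
    return d * (d + 1) // 2 - k * (k - 1) // 2

# sanity checks of the limiting cases discussed in the text
assert n_critical(10, 0) == 10                    # m* = 0: n_c = d
assert n_critical(10, 10) == 10 * 11 // 2         # m* >= d: n_c = d(d+1)/2
print([n_critical(100, m) for m in (1, 2, 5)])    # grows roughly like (m* + 1) d
```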

Figure 1: Dynamical phases of the student performance with a teacher whose number of hidden units m^* is given on the horizontal axis. The solid lines show the theoretical prediction (13) for the sample complexity threshold, and the points are obtained by extrapolation from simulations with GD. In the simulations we consider a teacher with i.i.d. Gaussian weights; other cases are reported in the Appendix.

4 Gradient Descent

Let us now analyze the performance of gradient descent on the empirical loss. Initialized with random weights, with probability one the GD flow in (3) will eventually reach a minimum of the empirical loss; indeed, the only possibility for it not to do so would be to reach a critical point of Morse index 1 or above, and the probability of that event is zero from random initial data. Since the evolution equation in (3) for the weights is completely equivalent to the evolution equation in (5) for X, the solution to the latter will also eventually reach a limit, and as soon as m ≥ d and X has full rank, this limit must be a minimizer of the empirical loss, i.e. it must achieve zero empirical loss. That is, we have established:

Proposition 4.1.

Let X(t) be a solution to (5) with a full-rank initial condition. Then X(t) converges as t → ∞, and its limit is a minimizer of the empirical loss.

Combined with Theorem 3.1, this proposition also indicates that, when m^* = 1 and d is large, the probability that the limit is a spurious minimizer is high when α < α_c, whereas the probability that the limit coincides with X^* becomes positive for α > α_c. If we generalize this analysis to the case of arbitrary m^* and large d, we expect that GD will recover the teacher only if n exceeds the critical number of samples given by (12).

These results are confirmed by the numerical simulations in Fig. 1, where we plot the critical number of samples as a function of the number of teacher hidden units m^* for different values of the input dimension d, represented by the four colors. We use circles to represent the numerical extrapolation of the threshold obtained from several runs of the GD flow on different instances of the problem, using the procedure described in Appendix B. Consistent with Proposition 4.1, the extrapolation confirms that the GD flow is able to match the sample complexity threshold predicted by the theory.
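For completeness, here is a minimal gradient-descent experiment in the spirit of the simulations behind Fig. 1, again assuming second-layer weights fixed to 1 and a 1/(2n) loss normalization; the learning rate, number of steps, and sizes are illustrative and not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m_star, m = 15, 1, 20
n = 4 * d                      # compare runs with n/d below and above the threshold
lr, steps = 1e-3, 50000

X_data = rng.standard_normal((n, d))
W_star = rng.standard_normal((d, m_star))
y = np.sum((X_data @ W_star) ** 2, axis=1)

W = rng.standard_normal((d, m)) / np.sqrt(d)
for _ in range(steps):
    r = np.sum((X_data @ W) ** 2, axis=1) - y            # residuals on the training set
    W -= lr * (2.0 / n) * (X_data.T * r) @ (X_data @ W)  # gradient of (1/2n) sum r^2

# generalization is measured on fresh samples from the same teacher
X_test = rng.standard_normal((10 * n, d))
y_test = np.sum((X_test @ W_star) ** 2, axis=1)
train_loss = 0.5 * np.mean((np.sum((X_data @ W) ** 2, axis=1) - y) ** 2)
gen_loss = 0.5 * np.mean((np.sum((X_test @ W) ** 2, axis=1) - y_test) ** 2)
print(train_loss, gen_loss)
```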

What Proposition 4.1 leaves open is the convergence rate in either case. This question is hard to answer for GD on the empirical loss, but it can be addressed for GD on the population loss, as shown next.

GD on the Population Loss.

Figure 2: Convergence rates for an increasing number of hidden units in the teacher. The figures show the log-average of the simulations for a fixed input dimension and, from left to right, an increasing teacher width m^*. The individual simulations are shown in transparency. The dotted line is the quadratic decay and serves as reference. The figure shows that if m^* ≥ d the convergence becomes faster than quadratic, and in fact exponential, as derived in Sec. 4.

We begin by observing that we can characterize the GD flow on the population loss by considering only the evolution of the eigenvalues of X.

Theorem 4.2.

Let X(t) be the solution to the GD flow (7) over the population loss, assuming that the initial condition is diagonal in the eigenbasis of X^*. Denote by O an orthogonal matrix whose columns are the eigenvectors of X^*, so that X^* = O D^* O^T with D^* diagonal. Let D(t) = O^T X(t) O, so that D(0) is diagonal. Then D(t) remains diagonal during the dynamics and the evolution of its entries is given by

(14)

In addition the population loss is given by

(15)

Notice that this theorem indicates that it suffices to characterize the convergence rate of the slowest eigenvalue to its target to obtain the convergence rate of the loss. The equations in (14) can easily be solved numerically. An asymptotic analysis of their solution is also possible, as shown next.

Consider the case m^* < d first. Then d − m^* eigenvalues of X^* are zero, and without loss of generality we can order the basis so that these zero eigenvalues come last. Distinguishing the first m^* entries of D from the remaining ones, (14) then reads

(16)
(17)

We will call the first m^* eigenvalues the informative eigenvalues and the remaining d − m^* the non-informative eigenvalues. We make two observations. First, at early times the leading-order term in the equation for the non-informative eigenvalues gives

(18)

Substituting this solution into (16) we deduce

(19)

Equations (18) and (19) imply an initial decrease in time of both the non-informative and the informative eigenvalues. However, when the eigenvalues become of order one or smaller, the other terms in equation (16) take over and allow the informative eigenvalues to bounce back up. Afterwards the informative eigenvalues emerge from the non-informative ones with an exponential growth, and they eventually match the eigenvalues of the teacher. This analysis also implies a quadratic decay in time of the loss at long times

(20)

In Sec. B we give additional details comparing the asymptotic analysis to the actual dynamics when m^* < d but m^* is not necessarily much smaller than d. This analysis can e.g. be carried out quite explicitly when the units of the teacher are orthonormal. It indicates that the approximation remains accurate at all times, and as a result shows that

(21)

at all times.

Consider next the case m^* ≥ d. Then (14) can be written as

(22)

which gives an exponential convergence of the eigenvalues to their targets, and consequently an exponential convergence of the population loss. For example, let us specialize to the case of a teacher with orthonormal hidden vectors. The eigenvalues then converge exponentially to their target values, and consequently the loss (15) converges to zero exponentially in this case:

(23)

The results above are confirmed by the numerics. The cases m^* < d and m^* ≥ d are shown by the first two and the last two panels of Fig. 2, respectively. When m^* < d the decay of the loss is quadratic, consistent with (20). In contrast, when m^* ≥ d, the absence of non-informative eigenvalues removes the dominating terms in the loss (15). Therefore the loss is dominated by the informative eigenvalues and decays exponentially, consistent with (23). This can be clearly observed in Fig. 2, where the four panels show the population loss for teachers of increasing width. The black dotted line shows the quadratic asymptotic decay predicted in (20). The last two panels of the sequence show the exponential decay predicted in (23).
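The two convergence regimes can be reproduced by integrating the eigenvalue dynamics numerically. The sketch below assumes a specific reconstruction of the population loss for standard Gaussian inputs, L̃(X) = ||X − X^*||_F^2 + (1/2)[Tr(X − X^*)]^2, and of the resulting eigenvalue flow for an orthonormal teacher, dλ_i/dt = −2 λ_i [2(λ_i − λ_i^*) + Σ_j (λ_j − λ_j^*)]; these expressions are assumptions standing in for (8) and (14), not quotes of them.

```python
import numpy as np

def eigen_flow(lam0, lam_star, dt=1e-3, steps=200000):
    """Integrate the assumed eigenvalue dynamics of the population-loss GD flow:
    d lam_i / dt = -2 lam_i [ 2 (lam_i - lam_i^*) + sum_j (lam_j - lam_j^*) ]."""
    lam = lam0.copy()
    losses = []
    for _ in range(steps):
        diff = lam - lam_star
        lam -= dt * 2.0 * lam * (2.0 * diff + diff.sum())
        losses.append(np.sum(diff ** 2) + 0.5 * diff.sum() ** 2)
    return lam, np.array(losses)

d = 30
lam0 = np.full(d, 0.5)                        # initial student spectrum (illustrative)

# m* < d: d - m* target eigenvalues are zero -> power-law decay of the loss
lam_star_small = np.zeros(d)
lam_star_small[:2] = 1.0
_, loss_small = eigen_flow(lam0, lam_star_small)

# m* >= d (orthonormal teacher): all targets nonzero -> exponential decay
lam_star_full = np.ones(d)
_, loss_full = eigen_flow(lam0, lam_star_full)

for k in (100, 1000, 10000, 100000):
    print(k, loss_small[k], loss_full[k])
# loss_small decreases roughly as 1/t^2 at long times, loss_full exponentially.
```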

Figure 3: Training loss (left figure) and population loss (right figure). The plots show the average in log-scale of 100 simulations for each value of the number of samples, and the individual realizations are shown in transparency.

Fig. 3 shows the training and the population losses observed in simulations with a single-hidden-unit teacher. In this case our analysis suggests that the typical realization converges to zero generalization error when the number of samples exceeds the sample complexity threshold. This can be observed in the right panel of Fig. 3. We use a dashed line to represent the prediction of the population-loss dynamics (7) and a dotted line to represent the approximate result (21); the two are almost indistinguishable in the figure.

5 Probing the Loss Landscape with the String Method

Figure 4: Results from the application of the string method. Training loss (solid line) and population loss (dashed line) evaluated across a string discretized with 100 images. Moving from the left to the right panels, the number of samples in the dataset increases, while the teacher always has the same number of hidden units. The critical dataset size needed to obtain a smooth landscape on average is confirmed by whether or not the string reaches zero empirical loss at a finite value of the population loss. Each string is averaged in log-scale over 10 realizations.

Finally, let us show that we can use the string method [25, 26, 27] to probe the geometry of the training loss landscape and confirm Theorem 3.1 numerically. The string method consists in connecting the student and the teacher with a curve (or string) in matrix space, and evolving this curve by GD while controlling its parametrization. In practice, this can be done efficiently by discretizing the string into equidistant images or points (with the Frobenius norm as metric), and iterating upon (i) evolving these images by the descent dynamics, and (ii) reparametrizing the string to make the images equidistant again. At convergence the string identifies a minimum energy path between the student and the teacher, which will possibly have a flat portion at zero empirical loss if this loss can be minimized by GD before reaching the teacher. That is, along the string, the student reaches the first minimum by GD, and, if this minimum is not the teacher, then moves along the set of minimizers of the empirical loss until it reaches the teacher. The advantage of the method is that, by replacing the physical time along the trajectory by the arclength along it, it permits going to infinite times and beyond, thereby probing features of the loss landscape not accessible by standard GD. (Of course, it requires one to know the target in advance, i.e. the string method cannot be used instead of GD to identify this target in situations where it is unknown.)
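A minimal sketch of the string method in this setting is given below, working in weight space with the assumed empirical loss of Sec. 2, a Frobenius-norm reparametrization, and fixed endpoints; the number of images, step size, and number of iterations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, m_star, n = 10, 12, 1, 60
n_images, lr, iters = 50, 1e-3, 2000

X_data = rng.standard_normal((n, d))
W_star_wide = np.zeros((d, m))                 # teacher padded to the student's width
W_star_wide[:, :m_star] = rng.standard_normal((d, m_star))
y = np.sum((X_data @ W_star_wide) ** 2, axis=1)

def grad(W):
    """Gradient of the assumed empirical loss (1/2n) sum (x^T W W^T x - y)^2."""
    r = np.sum((X_data @ W) ** 2, axis=1) - y
    return (2.0 / n) * (X_data.T * r) @ (X_data @ W)

def reparametrize(string):
    """Redistribute the images so they are equidistant in arclength (Frobenius metric)."""
    flat = string.reshape(len(string), -1)
    seg = np.linalg.norm(np.diff(flat, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    s_new = np.linspace(0.0, s[-1], len(string))
    new = np.stack([np.interp(s_new, s, flat[:, k]) for k in range(flat.shape[1])], axis=1)
    return new.reshape(string.shape)

# initial string: linear interpolation from a random student to the (padded) teacher
W0 = rng.standard_normal((d, m)) / np.sqrt(d)
alphas = np.linspace(0.0, 1.0, n_images)[:, None, None]
string = (1 - alphas) * W0[None] + alphas * W_star_wide[None]

for _ in range(iters):
    for i in range(1, n_images - 1):           # (i) evolve interior images, endpoints fixed
        string[i] -= lr * grad(string[i])
    string = reparametrize(string)             # (ii) restore equal arclength spacing

losses = [0.5 * np.mean((np.sum((X_data @ W) ** 2, axis=1) - y) ** 2) for W in string]
print(losses[0], min(losses), losses[-1])      # loss profile along the converged string
```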

In Fig. 4 we compare the strings obtained for input dimensions 4 (red), 6 (purple), and 8 (blue). The strings are parametrized by 100 points represented on the horizontal axes. Moving from the leftmost to the rightmost panel of Fig. 4 the number of samples in the dataset increases. Gradually, all the represented input dimensions reach the critical dataset size and display a landscape with a single minimum, the informative one. Observe that for relatively small sample sizes there is little correspondence between the topology of the training loss landscape and that of the population loss. As the sample size increases, the correlation between the two increases until they are only slightly apart.

Acknowledgements

We thank Joan Bruna and Ilias Zadik for precious discussions. SSM acknowledges the Courant Institute for the hospitality during his visit. We acknowledge funding from the ERC under the European Union’s Horizon 2020 Research and Innovation Programme Grant Agreement 714608-SMiLe.

References

Appendix A Proofs and Technical Lemmas

A.1 Proof of Lemmas 2.1 and 2.2

Equations (5) and (6) can be derived directly from (3) using the definitions in (4). Equations (7) and (8) can be derived from (5) and (6) by taking their expectation over the input distribution.

We also note that (5) (and similarly (7) if we use the population loss in (8) instead of the empirical loss in (6)) can be viewed as the time-continuous limit of a simple proximal scheme involving the Cholesky decomposition of X and the standard Frobenius norm as Bregman distance. We state this result as:

Proposition A.1.

Given an initial matrix, define the sequence of matrices via

(A.1)

where the time step is a small parameter. Then

(A.2)

where the limit solves (5) for the given initial condition.

Proof.

Look for a solution to the minimization problem in (A.1) of the form

To leading order in the time step, the objective function in (A.1) becomes

which we can set to zero by choosing with

In terms of the minimizer of the original problem this equation can be written as

Letting the time step go to zero while keeping the physical time fixed, we deduce that the iterates converge to the solution of

(A.3)

Setting we have

which is (5). ∎

A.2 Proof of Theorem 3.1

Let X be a symmetric, positive semidefinite minimizer of the empirical loss and consider the difference X − X^*. Since this matrix is symmetric, there exists an orthonormal basis of R^d made of its eigenvectors. Since X is positive semidefinite by assumption and X^* has rank one, at least d − 1 eigenvalues of X − X^* are nonnegative, and only one of them can be positive, negative, or zero. Let us order the eigenvectors so that the associated eigenvalues are nonnegative for the first d − 1 of them, the last one being unconstrained in sign. Given the data vectors x_μ, to be a minimizer of the empirical loss X must satisfy

(A.4)

Let us analyze when (A.4) admits solutions other than X = X^*. To this end, assume first that all the eigenvalues of X − X^* are nonnegative. Then, as soon as n is large enough, for each eigenvector there is with probability one at least one data vector with a nonzero projection onto it. As a result, the only solution to (A.4) is that all the eigenvalues vanish, i.e. X = X^* a.s.

The worst-case scenario is actually when the last eigenvalue is negative. In that case (A.4) can be written

(A.5)

This equation means that, up to flipping their sign so that they lie in the same hemisphere as the cone axis, the data vectors must all lie on the surface of an elliptical cone centered around the eigenvector associated with the negative eigenvalue, with the principal axes of the ellipsoids aligned with the remaining eigenvectors; the intersection of the cone with a hyperplane orthogonal to its axis is an ellipsoid whose boundary satisfies the equation

(A.6)

In d dimensions, it takes of the order of d(d+1)/2 vectors in general position to uniquely define such an elliptical cone. This means that, in the worst-case scenario, we recover the threshold n = d(d+1)/2. This worst-case scenario is however unlikely. To see why, consider the convex polyhedral cone spanned by the data vectors, i.e. the region

(A.7)

In order for (A.6) to have a nontrivial solution, the extremal rays of this cone (i.e. its edges of dimension 1) must coincide with the rays spanned by the data vectors, that is, all rays spanned by the x_μ must lie on the boundary of the cone and none can be in its interior; indeed these extremal rays must also lie on the boundary of the elliptical cone. However, Theorem 3' in [28] asserts that, if the vectors in the set are in general position (i.e. if the vectors in any subset of size no more than d are linearly independent, which happens with probability one if the x_μ are i.i.d. Gaussian), the number N of extremal rays of the polyhedral cone satisfies

(A.8)

This implies that

(A.9)

Since N ≤ n by definition, (A.9) controls the probability that some data vectors fail to be extremal rays. In turn, this implies that the probability that all the vectors x_μ are extremal rays of the cone tends to 1 as d → ∞ with n = αd and α below the threshold. This also means that the probability that (A.6) has a solution with X ≠ X^* also tends to 1 in this limit, i.e. (10) holds. Conversely, above the threshold the probability that N < n remains positive as d → ∞ with n = αd. This means that the probability that (A.6) has no solution with X ≠ X^* is positive in this limit, i.e. (11) holds.
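The key quantity in this proof, the probability that all n i.i.d. Gaussian vectors are extremal rays of their conic hull, can be estimated directly: a vector fails to be an extremal ray iff it is a nonnegative combination of the others, which is a linear feasibility problem. The following Monte Carlo sketch illustrates this; the location of the transition it exhibits is an empirical observation of the sketch at small d, not a quote of the theorem's constant.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)

def all_extremal(X):
    """True iff every row of X is an extremal ray of the cone spanned by all rows."""
    n, d = X.shape
    for i in range(n):
        others = np.delete(X, i, axis=0)
        # x_i is NOT extremal iff x_i = others^T c has a solution with c >= 0
        res = linprog(c=np.zeros(n - 1), A_eq=others.T, b_eq=X[i],
                      bounds=(0, None), method="highs")
        if res.status == 0:      # feasible -> x_i is redundant, not an extremal ray
            return False
    return True

d, trials = 12, 50
for alpha in (1.5, 2.0, 2.5):
    n = int(alpha * d)
    frac = np.mean([all_extremal(rng.standard_normal((n, d))) for _ in range(trials)])
    print(f"alpha = n/d = {alpha}: fraction with all vectors extremal = {frac:.2f}")
```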

A.3 Heuristic argument for arbitrary m^* and d

Minimizers of the empirical loss satisfy:

(A.10)

Clearly X = X^* is always a solution to this set of equations. The question is: how large should n be in order for X = X^* to be the only solution? If X were an arbitrary symmetric matrix, we would already know the answer: with probability one, we need n ≥ d(d+1)/2. What makes the problem more complicated is that X is required to be positive semidefinite. If we assume that X^* has rank m^*, this implies that X − X^* must be a symmetric matrix with at least d − m^* nonnegative eigenvalues and at most m^* eigenvalues whose sign is unconstrained, and we need to understand what this requirement imposes on the solutions to (A.10).

In the trivial case when m^* = 0 (i.e. a zero teacher matrix), if we decompose X into its eigenvectors and a diagonal matrix of its eigenvalues, (A.10) can be written as

(A.11)

where the sum runs over the linearly independent eigenvectors of X. In this case, since the eigenvalues are nonnegative, with probability one we only need d data vectors to guarantee that the only solution to this equation is that all the eigenvalues vanish, i.e. X = 0. Another way to think about this is to realize that the nonnegativity constraint on X has removed d(d−1)/2 degrees of freedom from the original d(d+1)/2 in (A.10).

If m^* ≥ 1, the situation is more complicated, but we can consider the projection of X − X^* onto the subspace not spanned by the teacher weight vectors, i.e. the matrix defined as

(A.12)

where the projector is built from d − m^* linearly independent eigenvectors of X^* with zero eigenvalue. All the eigenvalues of the projected matrix are nonnegative, and this imposes (d − m^*)(d − m^* − 1)/2 constraints in the subspace where it lives. If we simply subtract this number from d(d+1)/2 we obtain

(A.13)

which is precisely (12).

This argument is nonrigorous because we cannot a priori treat (A.10) separately in the subspace spanned by the teacher weight vectors and in its orthogonal complement. Yet, our numerical results suggest that this assumption is valid, at least as d grows large.

A.4 Proof of Theorem 4.2

Since X^* is symmetric and positive semidefinite, its eigenvalues are nonnegative and there exists an orthonormal basis made of its eigenvectors. Let us order this basis so that the eigenvectors associated with nonzero eigenvalues come first and those associated with zero eigenvalues come last. Denote by O the orthogonal matrix whose columns are these eigenvectors, so that X^* = O D^* O^T with D^* diagonal. Let D(t) = O^T X(t) O. By assumption D(0) is diagonal, and from (7) this matrix evolves according to

(A.14)

This equation shows that D(t) remains diagonal for all times. Written componentwise, (A.14) is (14).

Appendix B Additional Results

B.1 Supporting numerical results for Fig. 1

Figure B.1: Population loss for a fixed input dimension and teacher width and for several values of the number of samples n. The lines shown in full color are averages of the logarithm of 100 simulations and the individual instances are shown in transparency.

Fig. B.1 shows the average performance of GD for a fixed number of data points and a teacher with Gaussian hidden units. The figure is intended to show a vertical cut through the dynamical phases of Fig. 1. Moving up in the number of teacher hidden units at fixed n, we observe that on average the simulations converge on one side of the transition and do not converge on the other, i.e. there is an abrupt change of behavior when we cross it. Another interesting aspect of the figure is that the first panel has m^* ≥ d, which leads to an exponential (rather than quadratic) convergence rate of the loss, consistent with our analysis. The dotted line is a reference line for the decay of the loss.

B.2 Supporting numerical results for Theorem 3.1

Figure B.2: Left panel: fraction of simulations whose generalization loss went below a fixed small threshold. Right panel: complement of the fraction of simulations whose ratio between final generalization loss and training loss is larger than a fixed constant.

In Fig. B.2 we present a numerical verification of Theorem 3.1. According to the theorem, as d → ∞ with n = αd, the probability of finding the teacher should converge to zero for α below the threshold and to positive values for α above it. The left panel of the figure shows the fraction of 100 simulations that achieved a small generalization loss after a fixed number of iterations at a fixed learning rate. The right panel shows the number of simulations for which the ratio between training and generalization losses is larger than a fixed constant. This second panel is meant to capture the simulations for which we expect convergence eventually, but for which the number of iterations was not enough to achieve it. In particular, we observed that when generalization fails, meaning that the training loss goes to zero while the generalization loss stays at a high value, the convergence rate of the training loss is exponential, contrary to simulations where the generalization loss eventually goes to zero, which show a slower, power-law convergence. Running the simulations for a sufficient number of iterations is enough to detect the difference between the two cases, and this gives us a good criterion to distinguish between successful and unsuccessful runs.

Figure B.3: Final values of the training and generalization losses of several simulations at fixed input dimension and number of samples in the dataset. From left to right, the maximum number of steps in the simulation increases by a factor of 10.

To provide more evidence for this reasoning, in Fig. B.3 we show the training and generalization losses of simulations at fixed input dimension, teacher width, and number of samples. We order the simulations according to the loss and show in the three panels three snapshots for different numbers of iterations. From left to right the number of iterations increases by a factor of 10 in each panel. As can be seen, the ratio between generalization loss and training loss at the end of training is a valid measure of success.

B.3 Extrapolation procedure

Figure B.4: Extrapolation of the sample complexity threshold, assuming a power-law increase of the time needed to converge to a given value of the loss when approaching this threshold. The inset shows that the points lie on a line in log-log scale.

We estimate the critical value of n numerically by fixing a threshold on the population loss and simulating the problem for a large set of values of n. Starting from the largest value in the set, as n approaches the critical value the time needed to pass the threshold increases as a power law. In Fig. B.4 we fit the relaxation times needed to cross a fixed threshold on the population loss for several input dimensions; the extrapolated thresholds and their 95% confidence intervals are reported for each of them, and close to the threshold the relaxation times grow as expected. The larger the input dimension, the larger the time needed to pass the threshold, and as a result the smallest accessible value of n also increases. This causes a decrease in accuracy on the threshold value, measured by the larger confidence intervals. The same procedure has been applied to other values of m^* to obtain the points shown in Fig. 1.
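A sketch of the extrapolation step is given below, assuming that the time to reach the loss threshold diverges as a power law, τ(n) ≈ C (n − n_c)^(−γ), and fitting n_c by least squares; both the functional form and the synthetic data standing in for the measured relaxation times are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def tau_model(n, C, n_c, gamma):
    """Assumed power-law divergence of the relaxation time near the threshold."""
    return C * (n - n_c) ** (-gamma)

# synthetic relaxation times standing in for the measured times needed to cross
# a fixed population-loss threshold at several dataset sizes n (illustrative values)
rng = np.random.default_rng(6)
n_values = np.array([40.0, 50.0, 60.0, 80.0, 120.0, 200.0])
tau_values = tau_model(n_values, 50.0, 31.0, 1.0) * rng.normal(1.0, 0.02, n_values.size)

popt, pcov = curve_fit(tau_model, n_values, tau_values, p0=[10.0, 20.0, 1.0],
                       bounds=([0.0, 0.0, 0.1], [np.inf, 39.0, 5.0]))
err = 1.96 * np.sqrt(np.diag(pcov))       # approximate 95% confidence intervals
print(f"extrapolated n_c = {popt[1]:.1f} +/- {err[1]:.1f}, exponent = {popt[2]:.2f}")
```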

B.4 GD on the population loss with an orthogonal teacher

Figure B.5: Same as Fig. 1 in the main text but for a teacher with orthonormal hidden nodes. In that case, as soon as m^* becomes equal to or larger than d, the teacher matrix X^* becomes proportional to the identity, and therefore the student equals the teacher already at initialization.

A simple special case of (14) in Theorem 4.2 is when the teacher has orthogonal hidden weights, so that