An analytic theory of shallow networks dynamics for hinge loss classification

06/19/2020 ∙ by Franco Pellegrini, et al.

Neural networks have been shown to perform incredibly well in classification tasks over structured high-dimensional datasets. However, the learning dynamics of such networks is still poorly understood. In this paper we study in detail the training dynamics of a simple type of neural network: a single hidden layer trained to perform a classification task. We show that in a suitable mean-field limit this case maps to a single-node learning problem with a time-dependent dataset determined self-consistently from the average nodes population. We specialize our theory to the prototypical case of a linearly separable dataset and a linear hinge loss, for which the dynamics can be explicitly solved. This allows us to address in a simple setting several phenomena appearing in modern networks, such as slowing down of training dynamics, crossover between rich and lazy learning, and overfitting. Finally, we assess the limitations of mean-field theory by studying the case of a large but finite number of nodes and of training samples.


1 Introduction

Despite their proven ability to tackle a large class of complex problems LeCun et al. (2015), neural networks are still poorly understood from a theoretical point of view. While general theorems prove them to be universal approximators Barron (1993), their ability to obtain generalizing solutions given a finite set of examples remains largely unexplained. This behavior has been observed in multiple settings. The huge number of parameters and the optimization algorithms employed to optimize them (gradient descent and its variations) are thought to play key roles in it Poggio et al. (2017); Suggala et al. (2018); Gidel et al. (2019).

Consequently, a large research effort has been devoted in recent years to understanding the training dynamics of neural networks with a very large number of nodes Dauphin et al. (2014); Sagun et al. (2016); Baity-Jesi et al. (2019). Much theoretical insight has been gained into the training dynamics of linear Saxe et al. (2014); Lampinen and Ganguli (2019) and nonlinear networks for regression problems, often with quadratic loss and in a teacher-student setting Saad and Solla (1995); Advani and Saxe (2017); Goldt et al. (2019); Yoshida et al. (2019), highlighting the evolution of correlations between data and network outputs. More generally, the input-output correlation and its effect on the landscape have been used to show the effectiveness of gradient descent Du et al. (2019); Arora et al. (2019a). Other approaches have focused on infinitely wide networks, either to perform a mean-field analysis of the weights dynamics Mei et al. (2018, 2019); Kadmon and Sompolinsky (2016); Rotskoff and Vanden-Eijnden (2018); Araújo et al. (2019); Nguyen (2019), or to study their neural tangent kernel (NTK, or “lazy”) limit Chizat et al. (2019); Jacot et al. (2018); Lee et al. (2019); Geiger et al. (2020).

In this work, we investigate the learning dynamics for binary classification problems, by considering one of the most common cost functions employed in this setting: the linear hinge loss. The idea behind the hinge loss is that examples should contribute to the cost function if misclassified, but also if classified with a certainty lower than a given threshold. In our case this cost is linear in the distance from the threshold, and zero for examples classified above threshold, which we shall call satisfied henceforth. This specific choice leads to an interesting consequence: the instantaneous gradient for each node due to unsatisfied examples depends on the activation of the other nodes only through their population, while that due to satisfied examples is just zero. Describing the learning dynamics in the mean-field limit amounts to computing the effective example distribution for a given distribution of parameters: each node then evolves “independently” with a time-dependent dataset determined self-consistently from the average nodes population.

Contribution. We provide an analytical theory for the dynamics of a single hidden layer neural network trained for binary classification with linear hinge loss. In Sec. 2 we obtain the mean-field theory equations for the training dynamics. Those equations are a generalization of the ones obtained for mean-square loss in Mei et al. (2018, 2019); Kadmon and Sompolinsky (2016); Rotskoff and Vanden-Eijnden (2018); Araújo et al. (2019); Nguyen (2019). In Sec. 3 we focus on linearly separable data with spherical symmetry and present an explicit analytical solution of the dynamics of the nodes parameters. In this setting we provide a detailed study of the cross-over between the lazy Chizat et al. (2019) and rich Woodworth et al. (2020) learning regimes (Sec. 3.2). We then assess the limitations of mean-field theory by studying the case of a large but finite number of nodes and a finite number of training samples (Sec. 3.3). The most important new effect is overfitting, which we are able to describe by analyzing corrections to mean-field theory. In Sec. 3.4 we show that introducing a small fraction of mislabeled examples induces a slowing down of the dynamics and hastens the onset of the overfitting phase. Finally, in Sec. 4 we present numerical experiments on a realistic case, and show that the associated nodes dynamics in the first stage of training is in good agreement with our results.
The merit of the model we focused on is that, thanks to its simplicity, several effects happening in real networks can be studied analytically. Our analytical theory is derived using reasoning common in theoretical physics, which we expect can be made rigorous following the lines of Mei et al. (2018, 2019); Kadmon and Sompolinsky (2016); Rotskoff and Vanden-Eijnden (2018); Araújo et al. (2019); Nguyen (2019). All our results are tested throughout the paper by numerical simulations which confirm their validity.

Related works. Mean-field analyses of the training dynamics of very wide neural networks have mainly focused on regression problems with mean-square losses Mei et al. (2018, 2019); Kadmon and Sompolinsky (2016); Rotskoff and Vanden-Eijnden (2018); Araújo et al. (2019); Nguyen (2019); Chizat et al. (2019), whereas fewer works des Combes et al. (2018); Nacson et al. (2019) have tackled the dynamics for classification tasks. (In the NTK, or “lazy”, limit Chizat et al. (2019); Jacot et al. (2018); Lee et al. (2019) general losses have been considered.) The model of data we focus on bears strong similarities to the one proposed in des Combes et al. (2018), but with fewer assumptions on the dataset and initialization. With respect to des Combes et al. (2018), we show the relation with mean-field treatments Mei et al. (2018, 2019); Kadmon and Sompolinsky (2016); Rotskoff and Vanden-Eijnden (2018); Araújo et al. (2019); Nguyen (2019) and provide a full analysis of the dynamics, in particular the cross-over between rich and lazy learning. Moreover, we discuss the limitations of mean-field theory, the source of overfitting and the change in the dynamics due to mislabeling.

2 Mean-Field equation for the density of parameters

We consider a binary classification task for points in dimensions with corresponding labels . We focus on a hidden layer neural network consisting of nodes with activation . The output of the network is therefore

(1)

where represents all the trainable parameters of the model: , the -dimensional weight vectors between input and each hidden node, and , the contributions of each node to the output. All components are initialized before training from a Gaussian distribution with zero mean and unit standard deviation. The in front of the sum leads to the so-called mean-field normalization Mei et al. (2018). In the large- limit, this allows one to carry out what is called a hydrodynamic treatment in physics, a procedure that has been put on a rigorous basis in this context in Mei et al. (2018, 2019); Kadmon and Sompolinsky (2016); Rotskoff and Vanden-Eijnden (2018); Araújo et al. (2019); Nguyen (2019); Chizat et al. (2019) (here the s play the role of particle positions). In this limit one can rewrite the output function in terms of the averaged nodes population (or density) :

(2)
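As an illustration of the mean-field normalization discussed above, the following is a minimal NumPy sketch of the output of a single-hidden-layer ReLU network. The names `network_output`, `X`, `W` and `a` are hypothetical, and the sketch assumes that the pre-activation is the plain scalar product between weights and input, with no additional rescaling; the extracted text does not confirm the exact form of eq. (1).

```python
import numpy as np

def relu(z):
    """Rectified linear unit, applied elementwise."""
    return np.maximum(z, 0.0)

def network_output(X, W, a):
    """Output of a one-hidden-layer network with mean-field normalization.

    X: (N, d) inputs, W: (M, d) input-to-hidden weights, a: (M,) output weights.
    The sum over nodes is divided by M (not sqrt(M)): this is the
    mean-field scaling. The absence of any other prefactor is an assumption.
    """
    M = W.shape[0]
    hidden = relu(X @ W.T)   # (N, M) node activations
    return hidden @ a / M    # (N,) network outputs

# All parameters drawn from a standard normal, as stated in the text.
rng = np.random.default_rng(0)
N, d, M = 5, 3, 1000
X = rng.standard_normal((N, d))
W = rng.standard_normal((M, d))
a = rng.standard_normal(M)
f = network_output(X, W, a)
```

With this scaling the output is an empirical average over nodes, which is why it can be rewritten as an integral over the density of parameters when the number of nodes is large.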

To optimize the parameters we minimize the loss function

(3)

by gradient flow ( will be specified later). The dynamical equations for the parameters read:

(4)

where we have defined the effective learning rate . These equations show that the coupling between the different nodes has a mean-field form: it is through the function , i.e. only through the density . Following standard techniques one can obtain in the large limit a closed hydrodynamic-like equation on (see Appendix A.1 for details):

(5)

where we have made explicit that the is a functional of the density since it depends on , see eqs. (2, 3).

To be more concrete, in the following we consider the case of linear hinge loss, ( being the size of the hinge, often taken as 1), and rectified linear unit (ReLU) activation function: . With this choice

(6)

The notation denotes the indicator function of the unsatisfied examples, i.e. those for which the loss is positive, and denotes the average over examples and classes ( for binary classification). The dynamical equations on the node parameters simplify too:

(7)

Remarkably, the equation on the is very similar to the one induced by the Hebb rule in biological neural networks.
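The structure of these update rules can be sketched numerically. The block below is a minimal illustration, not the authors' code: it performs one explicit Euler step of gradient flow on the mean linear hinge loss for a ReLU network, assuming the loss is the average over examples of max(0, h - y f(x)) and that the 1/M factor from the output normalization is absorbed into the learning rate. The function name `hinge_gradient_step` and its arguments are hypothetical.

```python
import numpy as np

def hinge_gradient_step(X, y, W, a, h=1.0, lr=0.1):
    """One Euler step of gradient flow on mean_n max(0, h - y_n f(x_n)).

    f(x) = (1/M) sum_i a_i relu(w_i . x). The 1/M factor of the gradient
    is absorbed into lr (assumed to play the role of an effective rate).
    Returns updated (W, a) and the fraction of unsatisfied examples.
    """
    N, M = X.shape[0], W.shape[0]
    pre = X @ W.T                          # (N, M) pre-activations
    act = np.maximum(pre, 0.0)             # ReLU activations
    f = act @ a / M                        # (N,) outputs
    unsat = (h - y * f > 0).astype(float)  # indicator of unsatisfied examples
    s = unsat * y / N                      # satisfied examples contribute zero
    active = (pre > 0).astype(float)       # derivative of ReLU
    da = act.T @ s                                   # (M,) update direction for a
    dW = a[:, None] * ((active * s[:, None]).T @ X)  # (M, d) update direction for W
    return W + lr * dW, a + lr * da, unsat.mean()
```

Each node's update depends on the other nodes only through the indicator of unsatisfied examples, which is computed from the full output. The update of `a` correlates the label with the node activation, which is the Hebbian-like structure noted in the text.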

3 Analysis of a linearly separable case

We now focus on a linearly separable model, where the dynamics can be solved explicitly. We consider a reference unit vector in input space and examples distributed according to a spherical probability distribution . We label each example based on the sign of its scalar product with , leading to a distribution for : .
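A dataset of this kind is easy to generate. The sketch below draws Gaussian inputs (one of the two spherical distributions considered later) and labels them by the sign of their projection on a reference unit vector. Choosing the first basis vector as the reference direction is an assumption made for convenience, justified only by the spherical symmetry of the inputs; the names `make_separable_data` and `w_star` are hypothetical.

```python
import numpy as np

def make_separable_data(N, d, rng):
    """Spherically symmetric inputs labeled by a reference direction.

    Returns X (N, d), labels y in {-1, +1}, and the unit vector w_star.
    """
    w_star = np.zeros(d)
    w_star[0] = 1.0                              # reference unit vector (assumed)
    X = rng.standard_normal((N, d))              # isotropic Gaussian inputs
    y = np.where(X @ w_star >= 0, 1.0, -1.0)     # label = sign of projection
    return X, y, w_star

rng = np.random.default_rng(0)
X, y, w_star = make_separable_data(N=1000, d=10, rng=rng)
```

By construction every example satisfies y (w_star . x) >= 0, so the two classes are separated by the hyperplane orthogonal to the reference vector.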

In order to be able to explore different training regimes, we adopt a rescaled loss function, similar to the one proposed in Chizat et al. (2019):

(8)

where is the rescaling parameter and are the parameters at the beginning of training. Subtracting the initial output of the network ensures that no bias is introduced by the specific finite choice of parameters at initialization, while having no influence in the hydrodynamic limit since the output is 0 by construction.

3.1 Explicit solution for an infinite training set

We first consider the limit of infinite number of examples, and later discuss the effects induced by a finite training set.

Figure 1: Training of a network with , , , , , , for timesteps (until all examples are classified) with final generalization error evaluated on examples. Data and initial parameters are taken from a normal distribution of zero mean and width 1 per dimension. a, b: Evolution of ten of the s in (a) and of the s in (b) during training (circles) compared to our theoretical prediction (lines) for the same initial values. c: Evolution of obtained through numerical integration of eq. 13 for the parameters of this example. The dashed lines represent the linear approximation near and the logarithmic slope for large (shifted with a fitted constant). d: Projection of examples on the vector as a function of the time when they are first satisfied. The red line is the estimate of our theory, the dashed lines represent our estimate for a standard deviation due to the finite number of nodes (see Sec. 3.3).

The explicit solution of the training dynamics is obtained making use of the cylindrical symmetry around , which implies that

(9)

where . By plugging the identity (9) into eqs. (6, 7) one finds that the hydrodynamic equation (5) can be solved by the method of characteristics, where is obtained by transporting the initial condition through the equations (7). By decomposing the vector in its parallel and perpendicular components with respect to , i.e. , and using the solution , one finds that the parameters at time are distributed in law as:

(10)

where are given by the initial condition distributions, i.e. they are i.i.d. Gaussian. Using the distribution of at time , one can then compute and hence obtain a self-consistent equation on , which completes the mean-field solution. Similarly, one can obtain explicitly the output function and the indicator function which acquire a simple form:

(11)
(12)

where we have used that at . As expected, both functions have cylindrical symmetry around . The analytical derivation of these results and the following ones is presented in the Appendix A.2.
Since by definition the function is monotonically increasing and starts from zero at . To be more specific, we consider two cases: normally distributed data with unit variance in each dimension, and uniform data on the -dimensional unit sphere. The corresponding self-consistent equations on read respectively:

(13)
(14)

where and . Both equations imply that for small and for large .

We have now gained a full analytical description of the training dynamics: the node parameters evolve in time following eqs. (10). Note that their trajectory is independent of the training parameters and the initial distribution, which only affect the time dependence, i.e. the “clock” . The change of the output function is given by eq. (11), where one sees that only the amplitude of varies with time and is governed by . The amplitude increases monotonically so that more examples can be classified above the margin at later times; the more examples are classified the slower becomes the increase of and hence the dynamics.
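The trajectories described here can be sketched under one explicit assumption. The symmetry argument of Sec. 3.1 implies that the output weight of a node is driven by the component of its weight vector parallel to the reference direction, and vice versa, at a common rate set by the "clock". Assuming this linear coupling, the solution is a hyperbolic rotation of the initial values, while the perpendicular components stay fixed. The function `evolve_node` and its argument names are hypothetical, and the exact form of eqs. (10) should be checked against the original paper.

```python
import numpy as np

def evolve_node(a0, w_par0, gamma):
    """Node parameters after the clock has advanced to gamma.

    Assumes da/dgamma = w_par and dw_par/dgamma = a, whose solution is a
    hyperbolic rotation of the initial condition (a0, w_par0).
    """
    c, s = np.cosh(gamma), np.sinh(gamma)
    return a0 * c + w_par0 * s, w_par0 * c + a0 * s

# Trajectory of one node for increasing values of the clock.
gammas = np.linspace(0.0, 3.0, 7)
a_t, w_t = evolve_node(0.3, -1.2, gammas)
```

Two properties follow directly from this form: the difference of the squares of the two parameters is conserved along the trajectory, and for large values of the clock both grow exponentially while approaching each other, which matches the alignment described in Sec. 3.2.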

Our theoretical prediction can be directly compared with a simple numerical experiment. Fig. 1 shows the training of a network with on Gaussian input data. The top panels (a and b) compare the analytical evolution of the network parameters and obtained from eqs. (10) to the numerical one. In c we plot (computed numerically), showing that it grows linearly in the beginning and logarithmically at longer times, as expected from theory. In d we show a scatter plot illustrating that the time when an example is satisfied is proportional to its projection on the reference vector, following on average our estimate based on eq. (12). Overall, the agreement with the analytical solution is very good. The spread around the analytical solution in panel d is a finite- effect, which we analyze in Sec. 3.3. The departure from the analytical result (10) happens at large times, when the finiteness of the training set starts to matter (the larger the training set, the larger this time). In fact, for any finite number of examples the empirical average over unsatisfied examples deviates from its population average, so the dynamics is eventually modified and ultimately stops when the whole training set is classified beyond margin. We study this regime in Sec. 3.3.

3.2 Lazy learning and rich learning regimes

Figure 2: Evolution of and for a network with , , , in two different regimes. Data and initial parameters are taken from a normal distribution of zero mean and width 1 per dimension. a: First and last step of a case with (learning rate , training set is fitted by , final generalization error ). The arrows indicate the analytical derivative at , showing that the evolution is approximately linear. b: Initial steps (time indicated in legend) of a case with (learning rate , training set is fitted by , final generalization error ). The gray lines follow the evolution of each node.

The presence of the factor in the loss function (8) allows us to explore explicitly the crossover between different learning regimes, in particular the “lazy learning” regime corresponding to  Chizat et al. (2019). The dynamical equations can be studied in this limit by introducing . For concreteness, let us focus on the case of normally distributed data. Taking the limit of eq. (13) one finds the equation for :

(15)

As for the evolution of the parameters and the output function, we obtain:

(16)

The equations above provide an explicit solution of lazy learning dynamics and illustrate its main features: the evolves very little and along a fixed direction, in this case given by . Despite the small changes in the nodes parameters, of the order of , the network does learn since classification is performed through which has an order one change even for . In this regime, the correlation between and only increases slightly, but this is enough for classification, since an infinite amount of displacements in the right direction is sufficient to solve the problem.
On the contrary, when is of order one or smaller, the dynamics is in the so-called “rich learning” regime Woodworth et al. (2020). At the beginning of learning, the initial evolution of the s follows the same linear trajectories as in the lazy-learning regime. However, at later stages, the trajectories are no longer linear and the norm of the weights increases exponentially in , stopping only at very large values of when all nodes are almost aligned with (for small ). Note that, as observed in Geiger et al. (2019), with the standard normalization it would be the parameter governing the crossover between the two regimes.

We compare the two dynamical evolutions in Fig. 2. The left panel (a) shows the displacement of parameters between initialization and full classification (zero training loss) for a network with . As expected, the displacement is small and linear. A very different evolution takes place for in the right (b) panel. The trajectories are non-linear, and all nodes approach large values close to the line at the end of the training. Correspondingly, the initially isotropic Gaussian distribution evolves towards one with covariance matrix on the diagonal and off diagonal.

Note that for all values of , even very large ones, the trajectories of the s are identical and given by eqs. (10). What differs is the “clock” , in particular for large the system remains for a much longer time in the lazy regime. This is true as long as the number of training samples is infinite. Instead, if the number of data is finite, the dynamics stops once the whole training set is fitted: for large this happens before the system is able to leave the lazy regime, whereas for small a full non-linear (rich) evolution takes place. Hence, the finiteness of the training set leads to very distinct dynamics and profoundly different “trained” models (having both fitted the training dataset) with possibly different generalization properties Lee et al. (2019); Geiger et al. (2019); Arora et al. (2019b).

3.3 Beyond mean-field theory

The solution we presented in the previous sections holds in the limit of an infinite number of nodes and of training data. Here we study the corrections to this asymptotic limit, and discuss the new phenomena that they bring about.

Figure 3: a: Training (blue) and generalization (orange) error (fraction of misclassified examples), during training with same parameters as Fig. 1. b: Components of along (parallel) and perpendicular to it, during training. The dots are numerical results for the same training shown in a. The lines represent our analytical predictions and for the same parameters.

Finite number of nodes. In the large limit the and are Gaussian i.i.d. variables. By the central limit theorem, the function (2) concentrates around its average, and has negligible fluctuations of the order of when . If is large but finite (keeping an infinite training set), these fluctuations of are responsible for the leading corrections to mean-field theory. In Appendix A.3 we compute explicitly the variance of the output function, , with

(17)

The main effect of this correction is to induce a spread in the dynamics, e.g. of the data with same satisfaction time. This phenomenon is shown in Fig. 1(d) for , where we compare the numerical spread to an estimate of the values of such that the hinge is equal to the average plus or minus one standard deviation (details on this estimate in Appendix A.3).
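The scaling of these fluctuations with the number of nodes can be checked with a small Monte Carlo experiment. The sketch below only covers the network at initialization, where all parameters are standard normal; it does not reproduce the time-dependent variance of eq. (17). For a unit-norm input, the standard deviation of the output at initialization is sqrt(0.5 / M), so quadrupling the number of nodes should halve it. The function name `output_std` is hypothetical.

```python
import numpy as np

def output_std(M, x, n_trials, rng):
    """Standard deviation of the network output at x over random initializations."""
    d = x.shape[0]
    outs = np.empty(n_trials)
    for t in range(n_trials):
        W = rng.standard_normal((M, d))   # input-to-hidden weights
        a = rng.standard_normal(M)        # output weights
        outs[t] = np.maximum(W @ x, 0.0) @ a / M
    return outs.std()

rng = np.random.default_rng(0)
x = np.ones(4) / 2.0                      # unit-norm input
std_small = output_std(100, x, 2000, rng)
std_large = output_std(400, x, 2000, rng)
ratio = std_small / std_large             # central limit theorem: close to 2
```

The ratio close to 2 reflects the inverse-square-root dependence on the number of nodes that the central limit theorem argument predicts.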

Finite number of data. We now consider a finite but large number of examples (keeping infinite the number of nodes). In the large limit the empirical average over the data in converges to its mean . The main effect of considering a finite is that the empirical average fluctuates around this value. Using the central limit theorem we show in Appendix A.3 that the leading correction to the asymptotic result reads:

(18)

where is a unit random vector perpendicular to and . The term , the fraction of unsatisfied examples at time , controls the strength of the correction, as expected since only unsatisfied data contribute to the empirical average . The vector on the RHS of (18) is the one towards which all the align, see eqs. (10). Therefore, the main effect of the correction (18) is for the nodes parameters to align along a direction which is slightly different from and dependent on the training set. This naturally induces different accuracies between the training and the test sets, i.e. it leads to overfitting. (The two accuracies instead coincide for , since all possible data are seen during the training and no overfitting is present in the asymptotic limit.) Note that the strength of the signal, , is roughly of the order of the fraction of unsatisfied data , whereas the noise due to the finite training set is proportional to the square root of it. The larger the time, the smaller is, hence the stronger are the fluctuations with respect to the signal. In Fig. 3(b) we compute numerically the components of parallel and perpendicular to , and compare them to and . Remarkably, we find a very good agreement even for times when is no longer a small correction. This suggests that an estimate of the time when overfitting takes place is given by . We test this conjecture in panel (a): indeed the two contributions are of the same order of magnitude for , which is around the time when training and validation errors diverge.
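The competition between signal and noise can be illustrated on a simplified version of the empirical average. The sketch below assumes that all examples are still unsatisfied (as at the start of training) and drops the activation indicator, so it averages the label times the input over a finite Gaussian sample. Under these assumptions the component along the reference direction converges to sqrt(2/pi), while the perpendicular component has a norm of order sqrt((d - 1) / N). The function name `empirical_signal` is hypothetical.

```python
import numpy as np

def empirical_signal(N, d, rng):
    """Parallel component and perpendicular norm of the mean of y * x.

    Simplified setting: Gaussian inputs, labels from the first basis
    vector, every example counted as unsatisfied.
    """
    w_star = np.zeros(d)
    w_star[0] = 1.0
    X = rng.standard_normal((N, d))
    y = np.where(X @ w_star >= 0, 1.0, -1.0)
    v = (y[:, None] * X).mean(axis=0)            # empirical average over data
    par = v @ w_star                             # signal along the reference
    perp = np.linalg.norm(v - par * w_star)      # noise from the finite sample
    return par, perp

rng = np.random.default_rng(0)
par, perp = empirical_signal(N=5000, d=50, rng=rng)
```

As fewer examples remain unsatisfied, the signal shrinks in proportion to their fraction while the noise shrinks only as its square root, which is the mechanism behind the overfitting time estimate discussed above.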

3.4 Mislabeling

We now briefly address the effects due to noise in the labels, see Appendix A.4 for detailed results and Appendix B.2 for numerical experiments. Mislabeling is introduced by flipping the label of a small fraction of the examples. The main effect is to decrease the strength of the signal, , since the mislabeled data lead to an opposite contribution in (9) with respect to the correctly labeled ones. In the asymptotic limit of infinite and , the reduction of the signal slows down the dynamics, which stops when the number of unsatisfied correct examples equals the number of mislabeled ones. For large but finite , the noise is enhanced with respect to the signal because its strength is related to the fraction of all unsatisfied examples, and not just the correctly labeled ones. Hence, overfitting is stronger and takes place earlier with respect to the case analyzed before.

4 Discussion and Experiment

Figure 4: a: Training (blue) and generalization (orange) error for a network with , trained on MNIST data (), with parity labels. Inputs are only rescaled by a factor , no further processing is done. The training is performed with , , and the validation error on examples is after evolution steps. The shaded area represents the area where our theory applies. b: Evolution of and in the first steps of training. The color (see color bar) represents the step of evolution.

We have provided an analytical theory for the dynamics of a single hidden layer neural network trained for binary classification with linear hinge loss. We have found two dynamical regimes: a first one, correctly accounted for by mean-field theory, in which every node has its own dynamics with a time-dependent dataset determined self-consistently from the average nodes population. During this evolution the nodes parameters align with the direction of the reference classification vector. In the second regime, which is not accounted for by mean-field theory, the noise due to the finite training set becomes important and overfitting takes place. The merit of the model we focused on is that, thanks to its simplicity, several effects happening in real networks can be studied in detail analytically.

Several works have shown distinct dynamical regimes in the training dynamics: first the network learns coarse grained properties, later on it captures the finer structure, and eventually it overfits Baity-Jesi et al. (2019); Goldt et al. (2019); Saad and Solla (1996); Bordelon et al. (2020). Given the simplicity of the dataset we focused on, we expect our model to describe the first regime but not the second one, which would need a more complex model of data. To test this conjecture, we train our network to classify the parity of MNIST handwritten digits LeCun et al. (2010). To establish a relationship with our case, we define as the direction of the difference between the averages of the two parity sets. We can now define for each node, and study the dynamics of . We report in Fig. 4 the evolution of these parameters in the early steps of training, in which the training loss decreases by of its initial value (Fig. 4a). The evolution of the parameters (Fig. 4b) bears a strong resemblance to our findings, see the remarkable similarity with Fig. 2(b).
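The construction of the reference direction for a real dataset can be sketched as follows. The block uses small synthetic arrays as a stand-in for MNIST (no data is downloaded), and the function names `reference_direction` and `decompose` are hypothetical. It computes the normalized difference between the two class means and splits each node's weight vector into its component along that direction and the remainder.

```python
import numpy as np

def reference_direction(X, y):
    """Unit vector along the difference of the two class means."""
    diff = X[y > 0].mean(axis=0) - X[y < 0].mean(axis=0)
    return diff / np.linalg.norm(diff)

def decompose(W, w_ref):
    """Split each row of W into its projection on w_ref and the orthogonal part."""
    w_par = W @ w_ref                        # (M,) parallel components
    w_perp = W - np.outer(w_par, w_ref)      # (M, d) perpendicular parts
    return w_par, w_perp

# Synthetic stand-in for flattened images with parity labels.
rng = np.random.default_rng(0)
N, d, M = 200, 16, 8
y = np.where(np.arange(N) % 2 == 0, 1.0, -1.0)
X = rng.standard_normal((N, d)) + 0.5 * y[:, None]
W = rng.standard_normal((M, d))
w_ref = reference_direction(X, y)
w_par, w_perp = decompose(W, w_ref)
```

Tracking `w_par` together with the output weights during training gives the kind of two-dimensional picture shown in Fig. 4(b).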

We thank S. d’Ascoli and L. Sagun for discussions, and M. Wyart for exchanges about his work on a similar model Paccolat et al. (2020).

We acknowledge funding from the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and from the Simons Foundation collaboration “Cracking the Glass Problem” (No. 454935 to G. Biroli).

Appendix A Explicit calculations

A.1 Derivation of the hydrodynamics mean-field equation

In order to simplify the derivation in the following we use a compact notation for the function :

(19)

where , and for the gradient flow equations on the parameters of the network:

(20)

The strategy to derive hydrodynamics mean-field equations developed in physics consists in using the following equation, valid for and any test function :

(21)

and then in differentiating RHS and LHS with respect to time, see e.g. Dean (1996). The important point here (and later) is that the density , which depends on the random initial conditions, concentrates in the large limit due to the nature of the interaction between parameters, which is only through the function , and the type of distributions considered for the initial conditions. (These two features lead to mean-field interactions in which one parameter interacts weakly with all the others. In physical systems a particle instead interacts only with a finite number of other particles, hence the density field remains highly fluctuating. Only by performing coarse-graining in space and time can one get hydrodynamic equations, see Spohn (2012) for a rigorous presentation and Chaikin et al. (1995) for a more general one.) The derivative of the RHS leads to

(22)

whereas the derivative of the LHS reads:

(23)

We now use the identity:

(24)

to rewrite the LHS as

(25)

For this expression can be rewritten as

(26)

where we have used an integration by parts to obtain the last identity. Since the expressions in (22) and (26) are equal for any test function , we obtain that the density verifies Eq. 5 from the main text:

(27)

The initial condition for is a Gaussian distribution since the parameters at initialization are i.i.d. Gaussian variables.

A.2 Calculation of

We want to compute the integral of Eq. (9) of the main text:

(28)

for the task and distributions mentioned in the text.

Let us start by observing that since has spherical symmetry and has cylindrical symmetry around and is symmetric under inversion along (because of the label symmetry of the problem), the whole integrand without the is symmetric under the inversion operation. Indeed, , and . The effect of the term is to select one particular half-space over which the integral is done. However, because of the symmetry under inversion, the integral on any half-space is equivalent, hence the result is independent of . Moreover, for any direction orthogonal to , the integrand is odd under inversion of that component, and therefore integrates to zero. The only component different from zero is then the one along , dubbed in the text. Let us define and notice that so that we can for simplicity consider the integral on the positive values

(29)

We will now consider the specific expression found in the main text , and for the noiseless case .

In the case of normally distributed data, all orthogonal directions integrate to 1 and we are left with a simple Gaussian integral

(30)

With and , we recover Eq. (13) from the text.

For the case of data uniformly distributed on the -dimensional unit sphere in dimensions, we divide by the sphere surface and integrate on the angular coordinates. Because of the symmetry, we perform angular integrals and obtain the surface of the -dimensional sphere. The limit will set the extreme of integration to for and not affect the integral otherwise. Considering for simplicity directly the limit we obtain:

(31)

Using the equation for the sphere surface and properly accounting for the different cases we recover Eq. (14) from the text.

A.3 Calculation of finite size quantities

Finite number of nodes. To estimate the fluctuations due to a finite number of nodes, we have to estimate the width of the output distribution for a given set of parameters. Let us make explicit, from Eqs. (10) of the main text for the parameters evolution, that starting from i.i.d. Gaussian initialization the distribution of is

(32)

while all perpendicular components remain i.i.d.

The average output for an example can then be simply computed from its definition as

(33)

(all orthogonal integrals being equal to 1), having defined again . This proves Eq. (11) of the main text.

In order to estimate the fluctuations we should however compute the integral (we drop the t dependence for simplicity)

(34)

Since the integral is 1 for any direction perpendicular to , this is more easily done considering the distribution of (with ). Defining as , i.e. the unit vector in the direction of perpendicular to , we can write and, calling (being a component perpendicular to and therefore i.i.d.), we can write explicitly .

We can thus write the distribution for this component as

(35)

and the integral as just

(36)

The total spread due to this is thus

(37)

which is equivalent to Eq. (17) in the main text.

To estimate the error in Fig. 1(d) of the main text, we ask what are the values of such that the average output plus or minus a standard deviation, divided by , would be equal to the threshold. Since the standard deviation involves , we estimate its average value for points with a given , i.e. . The variance is thus the sum of two terms: multiplying and a constant . Requiring that , we find:

(38)

These values are the dashed lines reported in Fig. 1(d).

Figure 5: Evolution of for the same evolution as Fig. 1 of the main text.

Finite number of data. To estimate the fluctuations due to the finite number of data in in the direction perpendicular to , we use the central limit theorem, which gives fluctuations of the order . We refer to section A.2 for the general symmetry considerations about that integral: in the case of normally distributed data, and if all data are not satisfied, i.e. inside the empirical average over data, then for any given direction orthogonal to one obtains . Since there are such directions, this means that considering a finite number of data leads to a fluctuating component orthogonal to of norm of the order of .

Let us now consider the case in which only examples remain to be satisfied; the number of terms in the empirical sum is then instead of . Consequently, we obtain the same results as previously for the variance, but with an extra factor in front, thus leading to an error of order .

Estimating for normally distributed data, and with the specific expression is then a simple Gaussian integral:

(39)

Computing this for normally distributed data leads to:

(40)

as was used to compute the estimates in Fig. 3(b) in the main text.

A.4 Calculations for the mislabeling case

We now analyze the case, qualitatively described in the text, where a small fraction of the examples has been mislabeled as belonging to the opposite class.

Looking back at Eq. (29) and with , it is clear that with an infinite number of examples the mislabeled ones are simply never classified, so that the fraction of correct examples gives rise to a normal dynamics, while the fraction of opposite ones contributes an opposite term of constant magnitude. The effective integrals entering the dynamics are thus in this case

(41)

and would drive the dynamics until the two contributions are equal.

When considering a finite number of data, as discussed in Sec. A.3, the number of unsatisfied examples with the correct label amounts to , but since all the mislabeled examples are unsatisfied the total number will be incremented by leading to .

Again, evaluating this for the normally distributed case we find:

(42)
Figure 6: a: Training (blue) and generalization (orange) error (fraction of misclassified examples), during a training with a small fraction of mislabeled examples. Training parameters: , , , , , timesteps , validation on examples. b: Components of along (parallel) and perpendicular to it, during training. The dots are numerical results for the same training shown in a. The lines represent our analytical predictions and for the same parameters (Eqs. (41) and (42)). c, d: Evolution of a sample (10) of the (c) and (d) during training (circles) compared to our theoretical prediction (lines) for the noiseless case with the same initial values and parameters. e: Evolution of for the same sample of nodes.

Appendix B Further numerical experiments

B.1 Evolution of

We report in Fig. 5 the perpendicular component of the weights for a selection of nodes for the same example shown in Fig. 1 of the main text. As expected, the perpendicular component does not evolve for most of the training, and only increases moderately when we move into the overfitting regime.

B.2 Quantities for the mislabeling case

We report here in Fig. 6 some of the same quantities shown in Fig. 1 and Fig. 3 of the main text, for a case where a small fraction of the examples are mislabeled. As discussed in the main text, we can see how the dynamics still follows our estimate initially, then diverges into a much stronger overfitting state. Panel b shows a comparison of numerical quantities to our estimates of Sec. A.4: our estimates are still accurate up to the overfitting regime, after which the dynamics changes qualitatively.

Appendix C Other material

Code.

The code to reproduce all numerical results and graphs reported in this article can be found at https://github.com/phiandark/DynHingeLoss/. It consists of a single Jupyter notebook, based on Python 3 and requiring libraries numpy, scipy, tensorflow (1.xx), and matplotlib. All examples can be run in a few minutes on a moderately powerful machine. For more details, please see comments in the code.

Time evolution.

An animation showing the training and validation error and parameters evolution for the same cases reported in Fig. 2 can be found on the same page. As discussed in Sec. 3.2, the different behavior of the parameters is apparent, despite the similar final error. Moreover, the effects of overfitting can be noticed in the final phases of training.

References