1 Introduction
A major impediment for understanding the effectiveness of deep neural networks is our lack of mathematical models for the data sets on which neural networks are trained. This lack of tractable models prevents us from analysing the impact of data sets on the training of neural networks and their ability to generalise from examples, which remains an open problem both in statistical learning theory
[1, 2], and in analysing the average-case behaviour of algorithms in synthetic data models [3, 4, 5]. Indeed, most theoretical results on neural networks do not model the structure of the training data, while some works build on a setup where inputs are drawn componentwise i.i.d. from some probability distribution, and labels are either random or given by some random, but fixed, function of the inputs. Despite providing valuable insights, these approaches are by construction blind to key structural properties of real-world data sets.
Here, we focus on two types of data structure that can both already be illustrated by considering the simple canonical problem of classifying the handwritten digits in the MNIST database using a neural network
[6]. The input patterns are images with $28 \times 28$ pixels, so a priori we work in the high-dimensional space $\mathbb{R}^{784}$. However, the inputs that may be interpreted as handwritten digits, and hence constitute the “world” of our problem, span but a lower-dimensional manifold within $\mathbb{R}^{784}$ which is not easily defined. Its dimension can nevertheless be estimated to be around 14 based on the neighbourhoods of inputs in the data set [7, 8, 9, 10]. The intrinsic dimension being lower than the dimension of the input space is a property expected to be common to many real data sets used in machine learning. We should not consider presenting the network with an input that is outside of its world (or maybe we should train it to answer that the “input is outside of my world” in such cases). We will call inputs structured if they are concentrated on a lower-dimensional manifold and thus have a lower-dimensional latent representation.

The second type of structure concerns the function of the inputs that is to be learnt, which we will call the learning task. We will consider two models: the teacher task, where the label is obtained as a function of the high-dimensional input; and the latent task, where the label is a function of only the lower-dimensional latent representation of the input.
structured inputs: inputs that are concentrated on a fixed, lower-dimensional manifold in input space
latent representation: for a structured input, its coordinates in the lower-dimensional manifold
task: the function of the inputs to be learnt
latent task: for structured inputs, labels are given as a function of the latent representation only
teacher task: for all inputs, labels are obtained from a random, but fixed, function of the high-dimensional input without explicit dependence on the latent representation, if it exists
MNIST task: discriminating odd from even digits in the MNIST database
vanilla teacher-student setup: generative model due to [11], where data sets consist of componentwise i.i.d. inputs with labels given by a fixed, but random, neural network acting directly on the input
hidden manifold model (HMF): generative model introduced in Sec. 4 for data sets consisting of structured inputs (Eq. 6) with latent labels (Eq. 7)
We begin this paper by comparing neural networks trained on two different problems: the MNIST task, where one aims to discriminate odd from even digits in the MNIST data set; and the vanilla teacher-student setup, where inputs are drawn as vectors with i.i.d. components from the Gaussian distribution and labels are given by a random, but fixed, neural network acting on the high-dimensional inputs. This model is an example of a teacher task on unstructured inputs. It was introduced by Gardner & Derrida [11] and has played a major role in theoretical studies of the generalisation ability of neural networks from an average-case perspective, particularly within the framework of statistical mechanics [3, 12, 4, 5, 13, 14, 15, 16, 17], and also in recent statistical learning theory works, e.g. [18, 19, 20, 21]. We choose the MNIST data set because it is the simplest widely used example of a structured data set on which neural networks show significantly different behaviour than when trained on synthetic data from the vanilla teacher-student setup.

Our main contributions are then twofold:
1. We experimentally identify two key differences between networks trained in the vanilla teacher-student setup and networks trained on the MNIST task (Sec. 3). i) Two identical networks trained on the same MNIST task, but starting from different initial conditions, achieve the same test error on MNIST images, but they learn globally different functions. Their outputs coincide in those regions of input space where MNIST images tend to lie – the “world” of the problem – but differ significantly when tested on Gaussian inputs. In contrast, two networks trained on the teacher task learn the same function globally to within a small error. ii) In the vanilla teacher-student setup, the test error of a network is stationary during long periods of training before a sudden drop-off. These plateaus are well-known features of this setup [22, 4], but are not observed when training on the MNIST task nor on other data sets commonly used in machine learning.
2. We introduce the hidden manifold model (HMF), a probabilistic model that generates data sets containing high-dimensional inputs which lie on a lower-dimensional manifold and whose labels depend only on their position within that manifold (Sec. 4). In this model, inputs are thus structured and labels depend on their lower-dimensional latent representation. We experimentally demonstrate that training networks on data sets drawn from this model reproduces both behaviours observed when training on MNIST. We also show that the structure of both the input space and the task to be learnt plays an important role for the dynamics and the performance of neural networks.
Related work
Several works have compared neural networks trained from different initial conditions on the same task by comparing the different features learnt in vision problems [23, 24, 25], but these works did not compare the functions learnt by the networks. On the theory side, several works have appreciated the need to model the inputs and to go beyond the simple componentwise i.i.d. modelling [26, 27, 28, 29, 30]. While we will focus on the ability of neural networks to generalise from examples, two recent papers studied a network’s ability to store inputs with lower-dimensional structure and random labels: Chung et al. [31] studied the linear separability of general, finite-dimensional manifolds, while Rotondo et al. [32] extended Cover’s argument [33] to count the number of learnable dichotomies when inputs are grouped in tuples of inputs with the same label.
Accessibility and reproducibility
We provide the full code of our experiments at https://github.com/sgoldt/hiddenmanifoldmodel and give necessary parameter values to reproduce our figures beneath each plot. For ease of reading, we adopt the notation from the textbook by Goodfellow et al. [34].
2 Setup
In order to proceed on the question of what is a suitable model for structured data, we consider the setup of a feed-forward neural network with one hidden layer with a few hidden units, as described below. We chose this setting because it is the simplest one we found where we were able to identify key differences between training in the vanilla teacher-student setup and training on the MNIST task. Throughout this work, we therefore focus on the dynamics and performance of fully-connected two-layer neural networks with $K$ hidden units and first- and second-layer weights $W \in \mathbb{R}^{K \times N}$ and $v \in \mathbb{R}^{K}$, respectively. Given an input $x \in \mathbb{R}^{N}$, the output of a network with parameters $\theta = (W, v)$ is given by

$$\phi(x; \theta) = \sum_{k=1}^{K} v_k \, g\!\left(\frac{w_k \cdot x}{\sqrt{N}}\right), \qquad (1)$$

where $w_k$ is the $k$th row of $W$, and $g(\cdot)$ is the non-linear activation function of the network, acting componentwise. We will focus on sigmoidal networks with $g(x) = \mathrm{erf}\!\left(x/\sqrt{2}\right)$, or ReLU networks where $g(x) = \max(0, x)$
(see Appendix E). We will train the neural networks on data sets with $P$ input-output pairs $(x_\mu, y^*_\mu)$, $\mu = 1, \dots, P$, where we use the starred $y^*_\mu$ to denote the true label of an input $x_\mu$. We train networks by minimising the quadratic training error $E(\theta) = \frac{1}{2} \sum_{\mu=1}^{P} \left[\phi(x_\mu; \theta) - y^*_\mu\right]^2$ using stochastic gradient descent (SGD) with constant learning rate $\eta$,

$$w_k \leftarrow w_k - \eta \, \nabla_{w_k} E(\theta)\big|_{x_\mu}, \qquad v_k \leftarrow v_k - \eta \, \nabla_{v_k} E(\theta)\big|_{x_\mu}, \qquad (2)$$

where the gradients are evaluated on a single sample $(x_\mu, y^*_\mu)$ at each step.
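As a concrete illustration, the network (1) and one step of the SGD update (2) can be sketched in NumPy as follows. This is a minimal sketch, assuming the sigmoidal activation $g(x) = \mathrm{erf}(x/\sqrt{2})$; the function names are ours, not those of the accompanying code.

```python
import numpy as np
from scipy.special import erf


def g(u):
    # sigmoidal activation g(u) = erf(u / sqrt(2))
    return erf(u / np.sqrt(2))


def dg(u):
    # derivative of g: sqrt(2 / pi) * exp(-u^2 / 2)
    return np.sqrt(2 / np.pi) * np.exp(-u ** 2 / 2)


def phi(x, W, v):
    """Network output, Eq. (1): sum_k v_k g(w_k . x / sqrt(N))."""
    a = W @ x / np.sqrt(len(x))  # pre-activations w_k . x / sqrt(N)
    return v @ g(a)


def sgd_step(x, y_true, W, v, eta):
    """One SGD update, Eq. (2), on the quadratic loss (phi - y*)^2 / 2."""
    N = len(x)
    a = W @ x / np.sqrt(N)
    err = phi(x, W, v) - y_true  # derivative of the loss w.r.t. phi
    W -= eta * err * np.outer(v * dg(a), x) / np.sqrt(N)
    v -= eta * err * g(a)
    return W, v
```

For small enough $\eta$, a single such step reduces the loss on the current sample, which is all that SGD guarantees locally.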
Initial weights for both layers of sigmoidal networks were always taken componentwise i.i.d. from the normal distribution with mean 0 and variance 1. The initial weights of ReLU networks were also taken from the normal distribution, but with a smaller variance to ensure convergence.

The key quantity of interest is the test error or generalisation error of a network, for which we compare its predictions to the labels given in a test set that is composed of $P^*$ input-output pairs $(x_\mu, y^*_\mu)$ that are not used during training,

$$\epsilon_g(\theta) = \frac{1}{2 P^*} \sum_{\mu=1}^{P^*} \left[\phi(x_\mu; \theta) - y^*_\mu\right]^2. \qquad (3)$$

The test set might be composed of MNIST test images or generated by the same probabilistic model that generated the training data. For binary classification tasks with $y^*_\mu = \pm 1$, this definition is easily amended to give the fractional generalisation error $\epsilon_g^{\mathrm{frac}}(\theta) = \frac{1}{P^*} \sum_{\mu=1}^{P^*} \Theta\!\left(-y^*_\mu \, \phi(x_\mu; \theta)\right)$, i.e. the fraction of misclassified test inputs, where $\Theta(\cdot)$ is the Heaviside step function.
2.1 Learning from real data or from generative models?
We want to compare the behaviours of two-layer neural networks, Eq. (1), trained either on real data sets or on unstructured tasks. As an example of a real data set, we will use the MNIST image database of handwritten digits [6] and focus on the task of discriminating odd from even digits. Hence the inputs will be the MNIST images, with labels $y^* = \pm 1$ for odd and even digits, respectively. The joint probability distribution of input-output pairs for this task is inaccessible, which prevents analytical control over the test error and other quantities of interest. To make theoretical progress, it is therefore promising to study the generalisation ability of neural networks for data arising from a probabilistic generative model.
A classic model for data sets is the vanilla teacher-student setup [11], where unstructured i.i.d. inputs are fed through a random neural network called the teacher. We will take the teacher to have two layers and $M$ hidden nodes. We allow that $M \neq K$, and we will draw the components of the teacher’s weights $\tilde W \in \mathbb{R}^{M \times N}$ and $\tilde v \in \mathbb{R}^{M}$ i.i.d. from the normal distribution with mean zero and unit variance. Drawing the inputs i.i.d. from the standard normal distribution $\mathcal{N}(0, 1)$, we will take

$$y^*_\mu = \phi(x_\mu; \tilde\theta) = \sum_{m=1}^{M} \tilde v_m \, g\!\left(\frac{\tilde w_m \cdot x_\mu}{\sqrt{N}}\right) \qquad (4)$$

for regression tasks, or $y^*_\mu = \mathrm{sgn}\!\left(\phi(x_\mu; \tilde\theta)\right)$ for binary classification tasks. This is hence an example of a teacher task. In this setting, the network with $K$ hidden units that is trained using SGD, Eq. (2), is traditionally called the student. Notice that if $K \geq M$, there exists a student network that has zero generalisation error: the one with the same architecture and parameters as the teacher.
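A data set from this setup can be generated as follows. This is a sketch under the definitions above; the function name and the choice of teacher activation $g = \mathrm{erf}(\cdot/\sqrt{2})$ are our assumptions.

```python
import numpy as np
from scipy.special import erf


def teacher_dataset(P, N, M, classification=False, seed=0):
    """Vanilla teacher-student data, Eq. (4): i.i.d. Gaussian inputs,
    labels from a random two-layer teacher with M hidden nodes."""
    rng = np.random.default_rng(seed)
    Wt = rng.standard_normal((M, N))   # teacher first-layer weights
    vt = rng.standard_normal(M)        # teacher second-layer weights
    X = rng.standard_normal((P, N))    # unstructured i.i.d. inputs
    # pre-activations x . w_m / sqrt(N), then g(u) = erf(u / sqrt(2))
    y = erf(X @ Wt.T / np.sqrt(2 * N)) @ vt
    return X, (np.sign(y) if classification else y)
```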
3 Two characteristic behaviours of neural networks trained on structured data sets
We now proceed to demonstrate experimentally two significant differences in the dynamics and the performance of neural networks trained on realistic data sets and of networks trained within the vanilla teacher-student setup.
3.1 Independent networks achieve similar performance, but learn different functions when trained on structured tasks
We trained two sigmoidal networks with $K$ hidden units, starting from two independent draws of initial conditions, to discriminate odd from even digits in the MNIST database. We trained both networks using SGD with constant learning rate $\eta$, Eq. (2), until the generalisation error had converged to a stationary value. We plot this asymptotic fractional test error as blue circles on the left in Fig. 1 (the averages are taken over both networks and over several realisations of the initial conditions). We observed the same qualitative behaviour when we employed the early-stopping error to evaluate the networks, where we take the minimum of the generalisation error during training (see Appendix C).
First, we note that increasing the number of hidden units $K$ decreases the test error on this task. We also compared the networks to one another by counting the fraction of inputs which the two networks classify differently,

$$\epsilon^* = \frac{1}{P^*} \sum_{\mu=1}^{P^*} \Theta\!\left(-\phi(x_\mu; \theta^{(1)}) \, \phi(x_\mu; \theta^{(2)})\right), \qquad (5)$$

where $\theta^{(1)}$ and $\theta^{(2)}$ are the parameters of the two networks. This is a measure of the degree to which both networks have learned the same function. Independent networks disagree on the classification of MNIST test images at a rate that roughly corresponds to their test error (orange crosses). However, even though the additional parameters of bigger networks are helpful in the discrimination task (decreasing $\epsilon_g$), both networks learn increasingly different functions when evaluated over the whole of input space using Gaussian inputs as the network size increases (green diamonds). The networks learned the right function on the lower-dimensional manifold on which MNIST inputs concentrate, but not outside of it.
This behaviour is not reproduced if we substitute the MNIST data set with a data set of the same size drawn from the vanilla teacher-student setup of Sec. 2.1, leaving everything else the same (right of Fig. 1). The final test error decreases with $K$, and as soon as the expressive power of the network is at least equal to that of the teacher, i.e. $K \geq M$, the asymptotic test error goes to zero, since the data set is large enough for the network to recover the teacher’s weights to within a very small error, leading to a small generalisation error. We also computed $\epsilon^*$ evaluated using Gaussian i.i.d. inputs (green diamonds). Networks with fewer parameters than the teacher find different approximations to that function, yielding finite values of $\epsilon^*$. If they have just enough parameters ($K = M$), they learn the same function. Remarkably, they also learn the same function when they have significantly more parameters than the teacher. The vanilla teacher-student setup is thus unable to reproduce the behaviour observed when training on MNIST.
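The disagreement measure (5) is straightforward to evaluate on any batch of test inputs, be they MNIST images or Gaussian vectors. In this sketch, `phi_a` and `phi_b` are placeholders standing for the output functions of the two trained networks:

```python
import numpy as np


def disagreement(phi_a, phi_b, X):
    """Eq. (5): fraction of inputs in X on which the two networks'
    predicted classes sign(phi(x)) differ."""
    preds_a = np.sign([phi_a(x) for x in X])
    preds_b = np.sign([phi_b(x) for x in X])
    return np.mean(preds_a != preds_b)
```

Evaluating this quantity on held-out MNIST images probes agreement on the data manifold, while evaluating it on i.i.d. Gaussian inputs probes agreement over the whole input space.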
3.2 The generalisation error exhibits plateaus during training on i.i.d. inputs
We plot the generalisation dynamics, i.e. the test error as a function of training time, for neural networks of the form (1) in Fig. 2. For a data set drawn from the vanilla teacher-student setup (blue lines in the left-hand plot of Fig. 2), we observe that there is an extended period of training during which the test error remains constant before a sudden drop. These “plateaus” are well-known in the literature for both SGD, where they appear as a function of time [35, 22, 36], and in batch learning, where they appear as a function of the training set size [37, 4]. Their appearance is related to different stages of learning: after a brief exponential decay of the test error at the start of training, the network “believes” that data are linearly separable, and all its hidden units have roughly the same overlap with all the teacher nodes. Only after a longer time does the network pick up the additional structure of the teacher and “specialise”: each of its hidden units ideally becomes strongly correlated with one and only one hidden unit of the teacher, before the generalisation error decreases exponentially to its final value.
In contrast, the generalisation dynamics of the same network trained on the MNIST task (orange trajectories on the left of Fig. 2) shows no plateau. In fact, plateaus are rarely seen during the training of neural networks in practice (note that during training, we do not change any of the hyper-parameters, e.g. the learning rate $\eta$).
It has been an open question how to eliminate the plateaus from the dynamics of neural networks trained in the teacher-student setup. The use of second-order gradient descent methods such as natural gradient descent [38] can shorten the plateau [39], but we would like to focus on the more practically relevant case of first-order SGD. Yoshida et al. [17] recently showed that the length and existence of the plateau depend on the dimensionality of the output of the network, but we would like a model where the plateau disappears independently of the output dimension.
4 The hidden manifold model
We now introduce a new generative probabilistic model for structured data sets with the aim of reproducing the behaviour observed during training on MNIST, but with a synthetic data set. The main motivation for such a model is that a closed-form solution of the learning dynamics is expected to be accessible. To generate a data set containing $P$ inputs in $N$ dimensions, we first choose $D$ feature vectors in $N$ dimensions and collect them in a feature matrix $F \in \mathbb{R}^{D \times N}$. Next we draw $P$ vectors $c_\mu \in \mathbb{R}^{D}$ with random i.i.d. components and collect them in the matrix $C \in \mathbb{R}^{P \times D}$. The vector $c_\mu$ gives the coordinates of the $\mu$th input on the lower-dimensional manifold spanned by the feature vectors in $F$. We will call $c_\mu$ the latent representation of the input $x_\mu$, which is given by the $\mu$th row of

$$X = f\!\left(\frac{C F}{\sqrt{D}}\right), \qquad (6)$$

where $f$ is a non-linear function acting componentwise. In this model, the “world” of the data on which the true label can depend is a $D$-dimensional manifold, which is obtained from the linear subspace of $\mathbb{R}^{N}$ generated by the rows of the matrix $F$, through a folding process induced by the non-linear function $f$. As we discuss in Appendix A, the exact form of $f$ does not seem to be important, as long as it is a non-linear function.
The latent labels are obtained by applying a two-layer neural network with weights $\tilde W \in \mathbb{R}^{M \times D}$ and $\tilde v \in \mathbb{R}^{M}$ within the unfolded hidden manifold according to

$$y^*_\mu = \sum_{m=1}^{M} \tilde v_m \, g\!\left(\frac{\tilde w_m \cdot c_\mu}{\sqrt{D}}\right). \qquad (7)$$

We draw the weights in both layers componentwise i.i.d. from the normal distribution with unit variance, unless we note otherwise. The key point here is the dependency of the labels on the coordinates $c_\mu$ of the lower-dimensional manifold rather than on the high-dimensional data $x_\mu$. We believe that the exact form of this dependence is not crucial, and we expect several other choices to yield similar results to the ones we will present in the next section. In the following, we choose the entries of both $F$ and $C$ to be i.i.d. draws from the normal distribution with mean zero and unit variance. To ensure comparability of the data sets for different data-generating functions $f$, we always centre the input matrix $X$ by subtracting the mean value of the entire matrix from all components, and we rescale inputs by dividing all entries by the covariance of the entire matrix before training.
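Putting Eqs. (6) and (7) together, a data set from the hidden manifold model can be generated as follows. This is a sketch assuming $f = \mathrm{sgn}$ and the same sigmoidal $g$ as above; rescaling by the standard deviation of the input matrix is our reading of the normalisation just described.

```python
import numpy as np
from scipy.special import erf


def hmf_dataset(P, N, D, M, seed=0):
    """Hidden manifold model: structured inputs (Eq. 6) with latent labels (Eq. 7)."""
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((D, N))          # feature matrix spanning the manifold
    C = rng.standard_normal((P, D))          # latent representations c_mu
    X = np.sign(C @ F / np.sqrt(D))          # structured inputs, Eq. (6), with f = sign
    X = (X - X.mean()) / X.std()             # centre and rescale the whole input matrix
    Wt = rng.standard_normal((M, D))         # latent teacher weights
    vt = rng.standard_normal(M)
    y = erf(C @ Wt.T / np.sqrt(2 * D)) @ vt  # latent labels, Eq. (7)
    return X, y, C
```

Note that the labels depend on `C` only: two inputs with the same latent representation receive the same label regardless of their high-dimensional coordinates.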
4.1 The impact of the hidden manifold model on neural networks
We repeated the experiments with two independent networks reported in Sec. 3.1 using data sets generated from the hidden manifold model (see Appendix D). On the right of Fig. 3, we plot the asymptotic performance of a network trained on structured inputs which lie on a manifold (6) with a teacher task: the labels are a function (4) of the high-dimensional inputs and do not explicitly take the latent representation of an input into account. The final results are similar to those of networks trained on data from the vanilla teacher-student setup (cf. left of Fig. 1): given enough data, the network recovers the teacher function if it has at least as many parameters as the teacher. Once the teacher weights are recovered by both networks, they achieve zero test error (blue circles) and they agree on the classification of random Gaussian inputs because they implement the same function.
The left plot of Fig. 3 shows network performance when trained on the same inputs, but this time with a latent task, where the labels are a function (7) of the latent representation of the inputs. The asymptotic performance of the networks then resembles that of networks trained on MNIST: after convergence, the two networks will disagree on structured inputs at a rate that is roughly their generalisation error, but as $K$ increases, they also learn increasingly different functions, up to the point where they will agree on their classification of a random Gaussian input in just half the cases. The hidden manifold model thus reproduces the behaviour of independent networks trained on MNIST.
A look at the right-hand plot of Fig. 2 reveals that in this model the plateaus are absent. Again, we repeat the experiment of Sec. 3.2, but we train networks on structured inputs with teacher labels (4) and latent labels (7), respectively. It is clear from these plots that the plateaus only appear for the teacher task. In Appendix B, we demonstrate that the lack of plateaus for latent tasks in Fig. 2 is not due to the fact that the network in the latent task asymptotes at a higher generalisation error than in the teacher task.
4.2 Latent tasks and structured inputs are both necessary to model real data sets
Our quest to reproduce the behaviour of networks trained on MNIST has led us to consider three different setups so far: the vanilla teacher-student setup, i.e. a teacher task on unstructured inputs; and teacher and latent tasks on structured inputs. While it is not strictly possible to test the case of a latent task with unstructured inputs, we can approximate this setup by training a network on the MNIST task and then using the resulting network as a teacher to generate labels (4) for inputs drawn i.i.d. componentwise from the standard normal distribution. To test this idea, we trained both layers of sigmoidal networks using vanilla SGD on the MNIST task, where they reached a low generalisation error. They have thus clearly learnt some of the structure of the MNIST task. However, as we show on the left of Fig. 4, independent students trained on a data set with i.i.d. Gaussian inputs and true labels given by the pre-trained teacher network behave similarly to students trained in the vanilla teacher-student setup of Sec. 3.1. Furthermore, the learning dynamics of a network trained in this setup display the plateaus that we observed in the vanilla teacher-student setup (inset of Fig. 4).
On the right of Fig. 4, we summarise the four different setups for synthetic data sets in supervised learning problems that we have analysed in this paper. Only the hidden manifold model, consisting of a latent task on structured inputs, reproduced the behaviour of neural networks trained on the MNIST task, leading us to conclude that a model for realistic data sets has to feature both structured inputs and a latent task.
5 Concluding perspectives
We have introduced the hidden manifold model for structured data sets, which is simple to write down, yet displays some of the phenomena that we observe when training neural networks on real-world inputs. We saw that the model has two key ingredients, both of which are necessary: (1) high-dimensional inputs which lie on a lower-dimensional manifold and (2) latent labels for these inputs that depend on the inputs’ position within the lower-dimensional manifold. We hope that this model is a step towards a more thorough understanding of how the structure we find in real-world data sets impacts the training dynamics of neural networks and their ability to generalise.
We see two main lines for future work. On the one hand, the present work needs to be generalised to multilayer networks to identify how depth helps to deal with structured data sets and to build a model capturing the key properties. On the other hand, the key promise of the synthetic hidden manifold model is that the learning dynamics should be amenable to closedform analysis in some limit. Such analysis and its results would then provide further insights about the properties of learning beyond what is possible with numerical experiments.
Acknowledgements
We would like to thank the Kavli Institute for Theoretical Physics for its hospitality during an extended stay, during which parts of this work were conceived and carried out. We acknowledge funding from the ERC under the European Union’s Horizon 2020 Research and Innovation Programme Grant Agreement 714608-SMiLe, from the “Chaire de recherche sur les modèles et sciences des données”, Fondation CFM pour la Recherche-ENS, and from the French National Research Agency (ANR) grant PAIL. This research was supported in part by the National Science Foundation under Grant No. NSF PHY-1748958.
References
 [1] V. Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.
 [2] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
 [3] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056–6091, 1992.
 [4] A. Engel and C. Van den Broeck. Statistical Mechanics of Learning. Cambridge University Press, 2001.
 [5] L. Zdeborová and F. Krzakala. Statistical physics of inference: thresholds and algorithms. Adv. Phys., 65(5):453–552, 2016.
 [6] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.
 [7] P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. Physica D: Nonlinear Phenomena, 9(12):189–208, 1983.
 [8] J.A. Costa and A.O. Hero. Learning intrinsic dimension and intrinsic entropy of highdimensional datasets. In 2004 12th European Signal Processing Conference, pages 369–372, 2004.
 [9] E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems 17, 2004.
 [10] S. Spigler, M. Geiger, and M. Wyart. Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm. arXiv:1905.10843, 2019.
 [11] E. Gardner and B. Derrida. Three unfinished works on the optimal storage capacity of networks. Journal of Physics A: Mathematical and General, 22(12):1983–1994, 1989.
 [12] T.L.H. Watkin, A. Rau, and M. Biehl. The statistical mechanics of learning a rule. Reviews of Modern Physics, 65(2):499–556, 1993.
 [13] M.S. Advani and A.M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv:1710.03667, 2017.
 [14] B. Aubin, A. Maillard, J. Barbier, F. Krzakala, N. Macris, and L. Zdeborová. The committee machine: Computational to statistical gaps in learning a twolayers neural network. In Advances in Neural Information Processing Systems 31, pages 3227–3238, 2018.

 [15] J. Barbier, F. Krzakala, N. Macris, L. Miolane, and L. Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.
 [16] S. Goldt, M.S. Advani, A.M. Saxe, F. Krzakala, and L. Zdeborová. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. To appear in Advances in Neural Information Processing Systems 33, arXiv:1906.08632, 2019.

 [17] Y. Yoshida, R. Karakida, M. Okada, and S.I. Amari. Statistical mechanical analysis of learning dynamics of two-layer perceptron with multiple output units. Journal of Physics A: Mathematical and Theoretical, 52(18), 2019.
 [18] R. Ge, J.D. Lee, and T. Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
 [19] Y. Li and Y. Yuan. Convergence analysis of twolayer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
 [20] S. Mei and A. Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
 [21] S. Arora, N. Cohen, W. Hu, and Y. Luo. Implicit Regularization in Deep Matrix Factorization. In Advances in Neural Information Processing Systems 33, arXiv:1905.13655, 2019.
 [22] D. Saad and S.A. Solla. Exact Solution for On-Line Learning in Multilayer Neural Networks. Phys. Rev. Lett., 74(21):4337–4340, 1995.

 [23] Y. Li, J. Yosinski, J. Clune, H. Lipson, and J. Hopcroft. Convergent Learning: Do different neural networks learn the same representations? In D. Storcheus, A. Rostamizadeh, and S. Kumar, editors, Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015, volume 44 of Proceedings of Machine Learning Research, pages 196–212. PMLR, 2015.
 [24] M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. In Advances in Neural Information Processing Systems 30, pages 6076–6085. Curran Associates, Inc., 2017.
 [25] A.S. Morcos, M. Raghu, and S. Bengio. Insights on representational similarity in neural networks with canonical correlation. In Advances in Neural Information Processing Systems 31, pages 5727–5736, 2018.
 [26] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, (35):1872–1886, 2013.
 [27] A.B. Patel, M.T. Nguyen, and R. Baraniuk. A probabilistic framework for deep learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2558–2566. Curran Associates, Inc., 2016.
 [28] M. Mézard. Mean-field message passing equations in the Hopfield model and its generalizations. Phys. Rev. E, (95):022117, 2017.
 [29] M. Gabrié, A. Manoel, C. Luneau, J. Barbier, N. Macris, F. Krzakala, and L. Zdeborová. Entropy and mutual information in models of deep neural networks. In Advances in Neural Information Processing Systems 31, pages 1826–1836, 2018.
 [30] E. Mossel. Deep learning and hierarchical generative models. arXiv preprint arXiv:1612.09057, 2018.
 [31] S. Chung, Daniel D. Lee, and H. Sompolinsky. Classification and Geometry of General Perceptual Manifolds. Physical Review X, 8(3):31003, 2018.
 [32] P. Rotondo, M. Cosentino Lagomarsino, and M. Gherardi. Counting the learnable functions of structured data. arXiv:1903.12021, 2019.

 [33] T.M. Cover. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Transactions on Electronic Computers, EC-14(3):326–334, 1965.
 [34] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
 [35] M. Biehl and H. Schwarze. Learning by on-line gradient descent. J. Phys. A. Math. Gen., 28(3):643–656, 1995.
 [36] M. Biehl, P. Riegler, and C. Wöhler. Transient dynamics of on-line learning in two-layered neural networks. Journal of Physics A: Mathematical and General, 29(16), 1996.
 [37] H. Schwarze. Learning a rule in a multilayer neural network. Journal of Physics A: Mathematical and General, 26(21):5781–5794, 1993.
 [38] H.H. Yang and S.I. Amari. The Efficiency and the Robustness of Natural Gradient Descent Learning Rule. In M I Jordan, M J Kearns, and S A Solla, editors, Advances in Neural Information Processing Systems 10, pages 385–391, 1998.
 [39] M. Rattray, D. Saad, and S.I. Amari. Natural Gradient Descent for On-Line Learning. Physical Review Letters, 81(24):5461–5464, 1998.
Appendix A The exact form of the data-generating function is not important, as long as it is non-linear
Two questions arise when looking at the way we generate inputs in our data sets, Eq. (6): is the non-linearity $f$ necessary, and is the choice of non-linearity important?
To answer the first question, we plot the results of the experiment with independent networks described in Sec. 4.1. The setup is exactly the same, except that we now take inputs to be

$$X = \frac{C F}{\sqrt{D}}, \qquad (8)$$

i.e. inputs are just a linear combination of the feature vectors, without applying a non-linearity. In this case, two networks trained in the vanilla teacher-student setup will learn globally different functions, as can be seen from the fractional generalisation error between the networks (5) (green diamonds), which is $\epsilon^* \approx 1/2$, i.e. no better than chance. This is a direct consequence of using a linear $f$: to perfectly generalise with respect to the teacher, it is sufficient to learn only the components of the teacher weights in the directions spanned by the rows of $F$. Thus the weights of the network in the weight space orthogonal to those directions are unconstrained and, starting from random initial conditions, will converge to different values for each network.
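A quick numerical check of this degeneracy, under the definitions above: with the linear construction (8), the input matrix has rank $D$ exactly, so the inputs never leave the $D$-dimensional row space of $F$, and the weight directions orthogonal to it are never constrained by the data.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N, D = 200, 50, 5
F = rng.standard_normal((D, N))
C = rng.standard_normal((P, D))
X = C @ F / np.sqrt(D)           # linear inputs, Eq. (8): no folding non-linearity
rank = np.linalg.matrix_rank(X)  # equals D, not min(P, N)
```

Applying a componentwise non-linearity to $CF/\sqrt{D}$ lifts the inputs off this linear subspace onto a curved manifold, which is what constrains the student in all directions.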
We also checked that the qualitative behaviour of neural networks trained on the hidden manifold model does not depend on the data-generating non-linearity $f$. In Fig. 7, we therefore show the results of the same experiment described in Sec. 4.1, but this time using

$$X = \max\!\left(0, \frac{C F}{\sqrt{D}}\right), \qquad (9)$$

where the application of the non-linearity is again componentwise. Indeed, the results mirror those when we used the sign function $f(u) = \mathrm{sgn}(u)$.
Appendix B The existence of plateaus does not depend on the asymptotic generalisation error
We demonstrated on the right of Fig. 2 that neural networks trained on data drawn from the hidden manifold model (HMF) introduced here do not show the plateau phenomenon, where the generalisation error stays stationary after an initial exponential decay before dropping again. Upon closer inspection, one might think that this is because the student trained on data from the HMF asymptotes at a higher generalisation error than the student trained in the vanilla teacher-student setup. This is not the case, as we demonstrate in Fig. 7: we observe no plateau in a sigmoidal network trained on data from the HMF, even though that network asymptotes at a generalisation error that is, within fluctuations, the same as that of a network of the same size trained in the vanilla teacher-student setup, which does show a plateau.
Appendix C Early stopping yields qualitatively similar results
In Fig. 8, we reproduce Fig. 3, where we compare the performance of independent neural networks trained on the MNIST task (Left), on structured inputs with a latent task (Center), and on structured inputs with a teacher task (Right). This time, we plot the early-stopping generalisation error rather than the asymptotic value at the end of training; we define it as the minimum of the generalisation error over the whole of training. Clearly, the qualitative result of Sec. 4.1 is unchanged: although we use structured inputs (6) in both cases, independent students learn different functions which agree on those inputs only when they are trained on a latent task (7) (Center), but not when trained on a vanilla teacher task (4) (Right). Thus structured inputs and latent tasks are sufficient to reproduce the behaviour observed when training on the MNIST task.
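The early-stopping criterion used here reduces to taking the minimum of the recorded test-error trace. A minimal sketch (the function name is ours, not from the paper):

```python
import numpy as np

def early_stopping_error(test_errors):
    """Early-stopping generalisation error: the minimum test error
    recorded at any point during training, as defined above."""
    return float(np.min(test_errors))

# Toy error trace: initial decay, a minimum, then overfitting.
trace = [0.9, 0.5, 0.2, 0.15, 0.18, 0.25]
assert early_stopping_error(trace) == 0.15
```

Note that this is an oracle form of early stopping: it selects the best point in hindsight over the whole run, rather than stopping online via a validation set.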
Appendix D Dynamics with a large number of features
It is of independent interest to investigate the behaviour of networks trained on data from the hidden manifold model when the number of feature vectors is of the same order as the input dimension. We call this the regime of extensive latent dimension. It is a different regime from MNIST, where experimental studies consistently find that inputs lie on a low-dimensional manifold whose dimension is much smaller than the input dimension [8, 9, 10].
We show the results of our numerical experiments in this regime in Fig. 9, where we reproduce Fig. 3 for the asymptotic (top row) and the early-stopping (bottom row) generalisation error. The behaviour of networks trained on a teacher task with structured inputs (right column) is unchanged with respect to the case of few features. For the latent task, however, increasing the number of hidden units increases the generalisation error, indicating severe overfitting, which is only partly mitigated by early stopping. The generalisation error on this task is generally much higher than in the regime with few features. Clearly, increasing the width of the network is not the right way to learn a latent task; instead, analysing the performance of deeper networks on this task, where finding a good intermediate representation of the inputs is key, is an intriguing avenue for future research.
Appendix E Independent students with ReLU activation function
We also verified that the behaviour of independent networks we observed on MNIST with sigmoidal students persists when training networks with ReLU activation functions, and that the hidden manifold model is able to reproduce it for these networks. We show the results of our numerical experiments in Fig. 10. To that end, we trained both layers of each network starting from small initial conditions, where we draw the weights componentwise i.i.d. from a normal distribution with small variance.
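A minimal sketch of this setup for a two-layer ReLU student follows. The variance of the initial weights and the network sizes are placeholder values of ours; the text only specifies that both layers are drawn componentwise i.i.d. from a normal distribution with small variance:

```python
import numpy as np

rng = np.random.default_rng(42)

def init_two_layer_relu(N, K, sigma=1e-3):
    """Draw both layers componentwise i.i.d. from a zero-mean normal
    with small standard deviation sigma (placeholder value)."""
    W1 = sigma * rng.standard_normal((K, N))   # first-layer weights, K hidden units
    v = sigma * rng.standard_normal(K)         # second-layer weights
    return W1, v

def forward(x, W1, v):
    """Scalar output of the two-layer network with ReLU hidden units."""
    return v @ np.maximum(W1 @ x, 0.0)

W1, v = init_two_layer_relu(N=784, K=8)
y = forward(np.ones(784), W1, v)
```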
We see that the generalisation error of ReLU networks on the MNIST task (Left of Fig. 10) decreases with an increasing number of hidden units, while the generalisation error of the two independent students with respect to each other, evaluated on MNIST inputs, is comparable to or less than the generalisation error of each individual network on the MNIST task.
On structured inputs with a teacher task (Right of Fig. 10), where labels were generated by a teacher network, the student recovers the teacher, with a correspondingly small generalisation error, and both independent students learn the same function, as evidenced by their generalisation errors with respect to each other. This is the same behaviour that we see in Fig. 3 for sigmoidal networks. Where the generalisation error remains finite, this is due to two out of ten runs taking a very long time to converge, longer than our simulation lasted. Finally, we see that for a latent task on structured inputs, the generalisation error of the two networks with respect to each other increases beyond the generalisation error of each of them on structured inputs, as we observed on MNIST. Thus we have recovered in ReLU networks the phenomenology that we described for sigmoidal networks.