A major impediment to understanding the effectiveness of deep neural networks is our lack of mathematical models for the data sets on which neural networks are trained. This lack of tractable models prevents us from analysing the impact of data sets on the training of neural networks and their ability to generalise from examples, which remains an open problem both in statistical learning theory [1, 2] and in analysing the average-case behaviour of algorithms in synthetic data models [3, 4, 5].
Indeed, most theoretical results on neural networks do not model the structure of the training data, while some works build on a setup where inputs are drawn component-wise i.i.d. from some probability distribution, and labels are either random or given by some random, but fixed function of the inputs. Despite providing valuable insights, these approaches are by construction blind to key structural properties of real-world data sets.
Here, we focus on two types of data structure that can both already be illustrated by considering the simple canonical problem of classifying the handwritten digits in the MNIST database using a neural network. The input patterns are images with $28 \times 28$ pixels, so a priori we work in the high-dimensional $\mathbb{R}^{784}$. However, the inputs that may be interpreted as handwritten digits, and hence constitute the “world” of our problem, span but a lower-dimensional manifold within $\mathbb{R}^{784}$ which is not easily defined. Its dimension can nevertheless be estimated from the neighbourhoods of inputs in the data set [7, 8, 9, 10]. The intrinsic dimension being lower than the dimension of the input space is a property expected to be common to many real data sets used in machine learning. We should not consider presenting a network with an input that is outside of its world (or maybe we should train it to answer that the “input is outside of my world” in such cases). We will call inputs structured if they are concentrated on a lower-dimensional manifold and thus have a lower-dimensional latent representation.
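As an illustration of how an intrinsic dimension can be estimated from the neighbourhoods of inputs, the following is a minimal sketch of a nearest-neighbour maximum-likelihood estimator in the spirit of [9]. The function name `mle_intrinsic_dimension`, the choice `k=10` and the normalisation by the mean log-ratio are assumptions made for this example; it is not claimed to be the procedure used to obtain the estimates cited above.

```python
import numpy as np


def mle_intrinsic_dimension(X, k=10):
    """Nearest-neighbour maximum-likelihood estimate of the intrinsic dimension.

    X : array of shape (P, N), one input per row.
    k : number of nearest neighbours used for each input (illustrative default).
    """
    # Pairwise Euclidean distances computed from the Gram matrix.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    dists = np.sqrt(np.maximum(sq_dists, 0.0))
    # Sort each row; column 0 is the point itself, so it is skipped.
    knn = np.sort(dists, axis=1)[:, 1:k + 1]          # shape (P, k)
    # Log-ratio of the k-th neighbour distance to the closer neighbours.
    log_ratios = np.log(knn[:, -1:] / knn[:, :-1])    # shape (P, k - 1)
    # Per-input estimate is the inverse mean log-ratio; average over inputs.
    per_point = 1.0 / np.mean(log_ratios, axis=1)
    return float(np.mean(per_point))
```

Applied to points that lie on a two-dimensional linear subspace embedded in a higher-dimensional space, such an estimator should return a value close to two, well below the ambient dimension.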
The second type of structure concerns the function of the inputs that is to be learnt, which we will call the learning task. We will consider two models: the teacher task, where the label is obtained as a function of the high-dimensional input; and the latent task, where the label is a function of only the lower-dimensional latent representation of the input.
|structured inputs||inputs that are concentrated on a fixed, lower-dimensional manifold in input space|
|latent representation||for a structured input, its coordinates in the lower-dimensional manifold|
|task||the function of the inputs to be learnt|
|latent task||for structured inputs, labels are given as a function of the latent representation only|
|teacher task||for all inputs, labels are obtained from a random, but fixed function of the high-dimensional input without explicit dependence on the latent representation, if it exists|
|MNIST task||discriminating odd from even digits in the MNIST database|
|vanilla teacher-student setup||Generative model due to Gardner & Derrida [11], where data sets consist of component-wise i.i.d. inputs with labels given by a fixed, but random neural network acting directly on the input|
|hidden manifold model (HMF)||Generative model introduced in Sec. 4 for data sets consisting of structured inputs (Eq. 6) with latent labels (Eq. 7)|
We begin this paper by comparing neural networks trained on two different problems: the MNIST task, where one aims to discriminate odd from even digits in the MNIST data set; and the vanilla teacher-student setup [11], where inputs are drawn as vectors with i.i.d. components from the Gaussian distribution and labels are given by a random, but fixed, neural network acting on the high-dimensional inputs. This model is an example of a teacher task on unstructured inputs. It was introduced by Gardner & Derrida [11] and has played a major role in theoretical studies of the generalisation ability of neural networks from an average-case perspective, particularly within the framework of statistical mechanics [3, 12, 4, 5, 13, 14, 15, 16, 17], and also in recent statistical learning theory works, e.g. [18, 19, 20, 21]. We choose the MNIST data set because it is the simplest widely used example of a structured data set on which neural networks show significantly different behaviour than when trained on synthetic data of the vanilla teacher-student setup.
Our main contributions are then twofold:
1. We experimentally identify two key differences between networks trained in the vanilla teacher-student setup and networks trained on the MNIST task (Sec. 3). i) Two identical networks trained on the same MNIST task, but starting from different initial conditions, will achieve the same test error on MNIST images, but they learn globally different functions. Their outputs coincide in those regions of input space where MNIST images tend to lie (the “world” of the problem), but differ significantly when tested on Gaussian inputs. In contrast, two networks trained on the teacher task learn the same functions globally to within a small error. ii) In the vanilla teacher-student setup, the test error of a network is stationary during long periods of training before a sudden drop-off. These plateaus are well-known features of this setup [22, 4], but are not observed when training on the MNIST task nor on other data sets commonly used in machine learning.
2. We introduce the hidden manifold model (HMF), a probabilistic model that generates data sets containing high-dimensional inputs which lie on a lower-dimensional manifold and whose labels depend only on their position within that manifold (Sec. 4). In this model, inputs are thus structured and labels depend on their lower-dimensional latent representation. We experimentally demonstrate that training networks on data sets drawn from this model reproduces both behaviours observed when training on MNIST. We also show that the structure of both the input space and the task to be learnt plays an important role for the dynamics and the performance of neural networks.
Several works have compared neural networks trained from different initial conditions on the same task by comparing the different features learnt in vision problems [23, 24, 25], but these works did not compare the functions learned by the networks. On the theory side, several works have appreciated the need to model the inputs, and to go beyond the simple component-wise i.i.d. modelling [26, 27, 28, 29, 30]. While we will focus on the ability of neural networks to generalise from examples, two recent papers studied a network’s ability to store inputs with lower-dimensional structure and random labels: Chung et al. [31] studied the linear separability of general, finite-dimensional manifolds, while Rotondo et al. [32] extended Cover’s argument [33] to count the number of learnable dichotomies when inputs are grouped in tuples of inputs with the same label.
Accessibility and reproducibility
In order to proceed on the question of what is a suitable model for structured data, we consider the setup of a feedforward neural network with one hidden layer with a few hidden units, as described below. We chose this setting because it is the simplest one we found where we were able to identify key differences between training in the vanilla teacher-student setup and training on the MNIST task. So throughout this work, we focus on the dynamics and performance of fully-connected two-layer neural networks with $K$ hidden units and first- and second-layer weights $W \in \mathbb{R}^{K \times N}$ and $v \in \mathbb{R}^{K}$, resp. Given an input $x \in \mathbb{R}^{N}$, the output of a network with parameters $\theta = (W, v)$ is given by
$$\phi(x; \theta) = \sum_{k=1}^{K} v^k \, g\!\left( \frac{w^k x}{\sqrt{N}} \right), \tag{1}$$
where $w^k$ is the $k$th row of $W$, and $g: \mathbb{R} \to \mathbb{R}$ is the non-linear activation function of the network, acting component-wise. We will focus on sigmoidal networks with $g(x) = \operatorname{erf}(x/\sqrt{2})$, or ReLU networks where $g(x) = \max(0, x)$ (see Appendix E).
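The forward pass of such a two-layer network can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the output has the form $\phi(x) = \sum_k v^k g(w^k x / \sqrt{N})$ with a sigmoidal activation $g(x) = \operatorname{erf}(x/\sqrt{2})$; the function names `phi`, `g_sigmoidal` and `g_relu`, and the $1/\sqrt{N}$ scaling of the pre-activations, are assumptions of this example.

```python
import math

import numpy as np

# math.erf is scalar-only; vectorise it so it acts component-wise on arrays.
_erf = np.vectorize(math.erf)


def g_sigmoidal(z):
    """Sigmoidal activation erf(z / sqrt(2)), applied component-wise."""
    return _erf(np.asarray(z, dtype=float) / math.sqrt(2.0))


def g_relu(z):
    """ReLU activation max(0, z), applied component-wise."""
    return np.maximum(0.0, z)


def phi(X, W, v, g=g_sigmoidal):
    """Output of a two-layer network for a batch of inputs.

    X : inputs, shape (P, N)
    W : first-layer weights, shape (K, N)
    v : second-layer weights, shape (K,)
    """
    N = W.shape[1]
    pre_activations = X @ W.T / math.sqrt(N)   # shape (P, K)
    return g(pre_activations) @ v              # shape (P,)
```

Because the error function is odd, a sigmoidal network of this form maps the zero input to zero and satisfies `phi(-X, W, v) == -phi(X, W, v)`.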
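We will train the neural networks on data sets with $P$ input-output pairs $(x_\mu, y^*_\mu)$, $\mu = 1, \ldots, P$, where we use the starred $y^*_\mu$ to denote the true label of an input $x_\mu$. We train networks by minimising the quadratic training error $E(\theta) = \frac{1}{2} \sum_{\mu=1}^{P} \Delta_\mu^2$ with $\Delta_\mu = \phi(x_\mu; \theta) - y^*_\mu$ using stochastic gradient descent (SGD) with constant learning rate $\eta$,
$$\theta_{\mu+1} = \theta_\mu - \eta \, \nabla_\theta E(\theta) \big|_{\theta_\mu, x_\mu, y^*_\mu}. \tag{2}$$
Initial weights for both layers of sigmoidal networks were always taken component-wise i.i.d. from the normal distribution with mean 0 and variance 1. The initial weights of ReLU networks were also taken from the normal distribution, but with a variance chosen to ensure convergence.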
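A single SGD update on one training example can be written out explicitly. The sketch below assumes a sigmoidal network $\phi(x) = \sum_k v^k g(w^k x/\sqrt{N})$ with $g(x) = \operatorname{erf}(x/\sqrt{2})$, the per-example quadratic error $\frac{1}{2}\Delta^2$ with $\Delta = \phi(x) - y^*$, and the same learning rate for both layers; the names `sgd_step` and `sample_loss` are hypothetical, and any layer-dependent rescaling of the learning rate is deliberately left out.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def g(z):
    """Sigmoidal activation erf(z / sqrt(2))."""
    return _erf(z / math.sqrt(2.0))


def g_prime(z):
    """Derivative of erf(z / sqrt(2)) with respect to z."""
    return math.sqrt(2.0 / math.pi) * np.exp(-0.5 * z ** 2)


def sample_loss(W, v, x, y_true):
    """Quadratic error 0.5 * (phi(x) - y*)^2 on a single example."""
    lam = W @ x / math.sqrt(x.shape[0])
    return 0.5 * (g(lam) @ v - y_true) ** 2


def sgd_step(W, v, x, y_true, lr):
    """One SGD update of both layers on the example (x, y_true)."""
    N = x.shape[0]
    lam = W @ x / math.sqrt(N)          # pre-activations, shape (K,)
    act = g(lam)                        # hidden activations, shape (K,)
    delta = act @ v - y_true            # prediction error on this example
    grad_v = delta * act                # gradient w.r.t. second-layer weights
    grad_W = delta * np.outer(v * g_prime(lam), x) / math.sqrt(N)
    return W - lr * grad_W, v - lr * grad_v
```

Iterating `sgd_step` over the examples of a data set, one example per step, gives an online training loop with constant learning rate.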
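The key quantity of interest is the test error or generalisation error of a network, for which we compare its predictions to the labels given in a test set that is composed of $P^*$ input-output pairs $(x_\mu, y^*_\mu)$, $\mu = 1, \ldots, P^*$, that are not used during training,
$$\epsilon_g^{\mathrm{mse}}(\theta) \equiv \frac{1}{2P^*} \sum_{\mu=1}^{P^*} \left[ \phi(x_\mu; \theta) - y^*_\mu \right]^2. \tag{3}$$
The test set might be composed of MNIST test images or generated by the same probabilistic model that generated the training data. For binary classification tasks with $y^* = \pm 1$, this definition is easily amended to give the fractional generalisation error $\epsilon_g^{\mathrm{frac}}(\theta) \propto \sum_{\mu=1}^{P^*} \Theta\left[ -\phi(x_\mu; \theta) \, y^*_\mu \right]$, where $\Theta(\cdot)$ is the Heaviside step function.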
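Both error measures are straightforward to evaluate from a vector of network outputs on a test set. The sketch below assumes a prefactor of $1/2$ in the mean-squared error and normalises the fractional error by the number of test examples; the function names are hypothetical.

```python
import numpy as np


def mse_generalisation_error(predictions, y_true):
    """Half the mean squared difference between outputs and true labels."""
    predictions = np.asarray(predictions, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return 0.5 * np.mean((predictions - y_true) ** 2)


def fractional_generalisation_error(predictions, y_true):
    """Fraction of test inputs whose output sign differs from the label +-1.

    Implements the step function Theta(-phi * y): an example counts as an
    error when the product of output and label is negative.
    """
    predictions = np.asarray(predictions, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean(predictions * y_true < 0))
```

For instance, outputs `[0.5, -2.0, 1.0]` compared with labels `[1, 1, -1]` disagree in sign on two of the three examples, so the fractional error is 2/3.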
2.1 Learning from real data or from generative models?
We want to compare the behaviours of two-layer neural networks Eq. (1) trained either on real data sets or on unstructured tasks. As an example of a real data set, we will use the MNIST image database of handwritten digits [6] and focus on the task of discriminating odd from even digits. Hence the inputs will be the MNIST images with labels $y^* = \pm 1$ for odd and even digits, resp. The joint probability distribution of input-output pairs for this task is inaccessible, which prevents analytical control over the test error and other quantities of interest. To make theoretical progress, it is therefore promising to study the generalisation ability of neural networks for data arising from a probabilistic generative model.
A classic model for data sets is the vanilla teacher-student setup [11], where unstructured i.i.d. inputs are fed through a random neural network called the teacher. We will take the teacher to have two layers and $M$ hidden nodes. We allow that $M \neq K$ and we will draw the components of the teacher’s weights $\theta^* = (W^* \in \mathbb{R}^{M \times N}, v^* \in \mathbb{R}^{M})$ i.i.d. from the normal distribution with mean zero and unit variance. Drawing the inputs i.i.d. from the standard normal distribution $\mathcal{N}(x; 0, I_N)$, we will take
$$y^*_\mu = \phi(x_\mu; \theta^*) \tag{4}$$
for regression tasks, or $y^*_\mu = \operatorname{sgn}(\phi(x_\mu; \theta^*))$ for binary classification tasks. This is hence an example of a teacher task. In this setting, the network with $K$ hidden units that is trained using SGD Eq. (2) is traditionally called the student. Notice that, if $K \geq M$, there exists a student network that has zero generalisation error, namely the one that implements the same function as the teacher.
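A data set from the vanilla teacher-student setup can be generated as follows. This is a minimal sketch that assumes a sigmoidal teacher of the form $\phi(x) = \sum_m v^m g(w^m x/\sqrt{N})$ with $g(x) = \operatorname{erf}(x/\sqrt{2})$; the function name `make_teacher_student_data` and its arguments are hypothetical.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def phi(X, W, v):
    """Two-layer sigmoidal network; X has shape (P, N), W has shape (M, N)."""
    return _erf(X @ W.T / math.sqrt(2.0 * W.shape[1])) @ v


def make_teacher_student_data(P, N, M, rng, classification=False):
    """Draw i.i.d. Gaussian inputs and label them with a random teacher.

    Returns the inputs X (P, N), the labels y (P,), and the teacher weights.
    """
    W_star = rng.standard_normal((M, N))   # teacher first-layer weights
    v_star = rng.standard_normal(M)        # teacher second-layer weights
    X = rng.standard_normal((P, N))        # unstructured inputs
    y = phi(X, W_star, v_star)             # regression labels
    if classification:
        y = np.sign(y)                     # binary labels +-1
    return X, y, (W_star, v_star)
```

A student that copies the teacher weights reproduces the regression labels exactly, which is the zero-error solution available when the student has at least as many hidden units as the teacher.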
3 Two characteristic behaviours of neural networks trained on structured data sets
We now proceed to demonstrate experimentally two significant differences in the dynamics and the performance of neural networks trained on realistic data sets and networks trained within the vanilla teacher-student setup.
3.1 Independent networks achieve similar performance, but learn different functions when trained on structured tasks
We trained two sigmoidal networks with $K$ hidden units, starting from two independent draws of initial conditions, to discriminate odd from even digits in the MNIST database. We trained both networks using SGD with constant learning rate $\eta$, Eq. (2), until the generalisation error had converged to a stationary value. We plot this asymptotic fractional test error as blue circles on the left in Fig. 1 (the averages are taken over both networks and over several realisations of the initial conditions). We observed the same qualitative behaviour when we employed the early-stopping error to evaluate the networks, where we take the minimum of the generalisation error during training (see Appendix C).
First, we note that increasing the number of hidden units in the network decreases the test error on this task. We also compared the networks to one another by counting the fraction of inputs which the two networks classify differently,
$$\epsilon_{1,2}(\theta_1, \theta_2) \equiv \left\langle \Theta\left[ -\phi(x; \theta_1) \, \phi(x; \theta_2) \right] \right\rangle, \tag{5}$$
where $\theta_1$ and $\theta_2$ are the parameters of the two networks and the average is taken over the inputs $x$. This is a measure of the degree to which both networks have learned the same function $\phi(x; \theta)$. Independent networks disagree on the classification of MNIST test images at a rate that roughly corresponds to their test error (orange crosses). However, even though the additional parameters of bigger networks are helpful in the discrimination task (decreasing $\epsilon_g$), both networks learn increasingly different functions when evaluated over the whole of $\mathbb{R}^N$ using Gaussian inputs as the network size $K$ increases (green diamonds). The networks learned the right function on the lower-dimensional manifold on which MNIST inputs concentrate, but not outside of it.
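The disagreement between two networks can be estimated by Monte Carlo on a batch of inputs, for example i.i.d. Gaussian vectors to probe the whole input space. The sketch below assumes sigmoidal two-layer networks of the form $\phi(x) = \sum_k v^k g(w^k x/\sqrt{N})$ with $g(x) = \operatorname{erf}(x/\sqrt{2})$, stored as `(W, v)` tuples; the function name `disagreement` is hypothetical.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def phi(X, W, v):
    """Two-layer sigmoidal network; X has shape (P, N), W has shape (K, N)."""
    return _erf(X @ W.T / math.sqrt(2.0 * W.shape[1])) @ v


def disagreement(theta_1, theta_2, X):
    """Fraction of inputs in X on which the two networks differ in sign."""
    out_1 = phi(X, *theta_1)
    out_2 = phi(X, *theta_2)
    # Theta(-out_1 * out_2): the networks disagree when the product is negative.
    return float(np.mean(out_1 * out_2 < 0))
```

Two copies of the same network give a disagreement of zero, while flipping the sign of the second-layer weights of one copy gives a disagreement of one; a value of one half means the two networks agree no more often than chance.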
This behaviour is not reproduced if we substitute the MNIST data set with a data set of the same size drawn from the vanilla teacher-student setup from Sec. 2.1, leaving everything else the same (right of Fig. 1). The final test error decreases with $K$, and as soon as the expressive power of the network is at least equal to that of the teacher, i.e. $K \geq M$, the asymptotic test error goes to zero, since the data set is large enough for the network to recover the teacher’s weights to within a very small error, leading to a small generalisation error. We also computed the disagreement $\epsilon_{1,2}$ between the two networks, evaluated using Gaussian i.i.d. inputs (green diamonds). Networks with fewer parameters than the teacher find different approximations to that function, yielding finite values of $\epsilon_{1,2}$. If they have just enough parameters ($K = M$), they learn the same function. Remarkably, they also learn the same function when they have significantly more parameters than the teacher. The vanilla teacher-student setup is thus unable to reproduce the behaviour observed when training on MNIST.
3.2 The generalisation error exhibits plateaus during training on i.i.d. inputs
We plot the generalisation dynamics, i.e. the test error as a function of training time, for neural networks of the form (1) in Fig. 2. For a data set drawn from the vanilla teacher-student setup (blue lines in the left-hand plot of Fig. 2), we observe that there is an extended period of training during which the test error remains constant before a sudden drop. These “plateaus” are well-known in the literature both for SGD, where they appear as a function of time [35, 22, 36], and for batch learning, where they appear as a function of the training set size [37, 4]. Their appearance is related to different stages of learning: after a brief exponential decay of the test error at the start of training, the network “believes” that data are linearly separable and all its hidden units have roughly the same overlap with all the teacher nodes. Only after a longer time does the network pick up the additional structure of the teacher and “specialise”: each of its hidden units ideally becomes strongly correlated with one and only one hidden unit of the teacher before the generalisation error decreases exponentially to its final value.
In contrast, the generalisation dynamics of the same network trained on the MNIST task (orange trajectories on the left of Fig. 2) shows no plateau. In fact, plateaus are rarely seen during the training of neural networks (note that during training, we do not change any of the hyper-parameters, e.g. the learning rate $\eta$).
It has been an open question how to eliminate the plateaus from the dynamics of neural networks trained in the teacher-student setup. The use of second-order gradient descent methods such as natural gradient descent [38] can shorten the plateau [39], but we would like to focus on the more practically relevant case of first-order SGD. Yoshida et al. [17] recently showed that the length and existence of the plateau depend on the dimensionality of the output of the network, but we would like a model where the plateau disappears independently of the output dimension.
4 The hidden manifold model
We now introduce a new generative probabilistic model for structured data sets with the aim of reproducing the behaviour observed during training on MNIST, but with a synthetic data set. The main motivation for such a model is that a closed-form solution of the learning dynamics is expected to be accessible. To generate a data set containing $P$ inputs in $N$ dimensions, we first choose $D$ feature vectors in $N$ dimensions and collect them in a feature matrix $F \in \mathbb{R}^{D \times N}$. Next we draw $P$ vectors $c_\mu$ with random i.i.d. components and collect them in the matrix $C \in \mathbb{R}^{P \times D}$. The vector $c_\mu$ gives the coordinates of the $\mu$th input on the lower-dimensional manifold spanned by the feature vectors in $F$. We will call $c_\mu$ the latent representation of the input $x_\mu$, which is given by the $\mu$th row of
$$X = f\!\left( \frac{C F}{\sqrt{D}} \right) \in \mathbb{R}^{P \times N}, \tag{6}$$
where $f$ is a non-linear function acting component-wise. In this model, the “world” of the data on which the true label can depend is a $D$-dimensional manifold, which is obtained from the linear subspace of $\mathbb{R}^N$ generated by the rows of the matrix $F$, through a folding process induced by the nonlinear function $f$. As we discuss in Appendix A, the exact form of $f$ does not seem to be important, as long as it is a nonlinear function.
The latent labels are obtained by applying a two-layer neural network with weights $\tilde{\theta} = (\tilde{W} \in \mathbb{R}^{M \times D}, \tilde{v} \in \mathbb{R}^{M})$ within the unfolded hidden manifold according to
$$y^*_\mu = \phi(c_\mu; \tilde{\theta}) = \sum_{m=1}^{M} \tilde{v}^m \, g\!\left( \frac{\tilde{w}^m c_\mu}{\sqrt{D}} \right). \tag{7}$$
We draw the weights in both layers component-wise i.i.d. from the normal distribution with unit variance, unless we note it otherwise. The key point here is the dependency of labels on the coordinates $C$ of the lower-dimensional manifold rather than on the high-dimensional data $X$. We believe that the exact form of this dependence is not crucial and we expect several other choices to yield similar results to the ones we will present in the next section.
In the following, we choose the entries of both $C$ and $F$ to be i.i.d. draws from the normal distribution with mean zero and unit variance. To ensure comparability of the data sets for different data-generating functions $f$, we always center the input matrix $X$ by subtracting the mean value of the entire matrix from all components and we rescale inputs by dividing all entries by the covariance of the entire matrix before training.
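The generative procedure of this section can be summarised in a short NumPy sketch. It assumes inputs of the form $X = f(CF/\sqrt{D})$ with $f = \operatorname{sgn}$ as the default non-linearity, binary latent labels from a sigmoidal two-layer network acting on the latent coordinates with a $1/\sqrt{D}$ scaling, and a rescaling that divides by the standard deviation of the entire matrix so that the entries have unit variance. The function name `hidden_manifold_data` and all argument names are hypothetical, and the exact normalisations are assumptions of this example.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def hidden_manifold_data(P, N, D, M, rng, f=np.sign):
    """Draw structured inputs with latent binary labels.

    Returns X (P, N), labels y (P,) in {-1, +1}, and latent coordinates C (P, D).
    """
    F = rng.standard_normal((D, N))        # D feature vectors in N dimensions
    C = rng.standard_normal((P, D))        # latent representations of the inputs
    X = f(C @ F / math.sqrt(D))            # fold the linear subspace with f

    # Latent task: labels depend on C only, never on X directly.
    W_tilde = rng.standard_normal((M, D))
    v_tilde = rng.standard_normal(M)
    hidden = _erf(C @ W_tilde.T / math.sqrt(2.0 * D))
    y = np.sign(hidden @ v_tilde)

    # Centre and rescale using statistics of the entire input matrix
    # (assumption: division by the standard deviation).
    X = X - X.mean()
    X = X / X.std()
    return X, y, C
```

Passing `f=lambda z: z` yields the linear inputs discussed in Appendix A, which makes it easy to compare the folded and unfolded cases with otherwise identical settings.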
4.1 The impact of the hidden manifold model on neural networks
We repeated the experiments with two independent networks reported in Sec. 3.1 using data sets generated from the hidden manifold model with $D$ latent dimensions (see Appendix D). On the right of Fig. 3, we plot the asymptotic performance of a network trained on structured inputs which lie on a manifold (6) with a teacher task: the labels are a function of the high-dimensional inputs and do not explicitly take the latent representation of an input into account, $y^*_\mu = \phi(x_\mu; \theta^*)$. The final results are similar to those of networks trained on data from the vanilla teacher-student setup (cf. right of Fig. 1): given enough data, the network recovers the teacher function if the network has at least as many parameters as the teacher. Once the teacher weights are recovered by both networks, they achieve zero test error (blue circles) and they agree on the classification of random Gaussian inputs because they do implement the same function.
The left plot of Fig. 3 shows network performance when trained on the same inputs, but this time with a latent task where the labels are a function of the latent representation of the inputs: $y^*_\mu = \phi(c_\mu; \tilde{\theta})$. The asymptotic performance of the networks then resembles that of networks trained on MNIST: after convergence, the two networks will disagree on structured inputs at a rate that is roughly their generalisation error, but as $K$ increases, they also learn increasingly different functions, up to the point where they will agree on their classification of a random Gaussian input in just half the cases. The hidden manifold model thus reproduces the behaviour of independent networks trained on MNIST.
A look at the right-hand plot of Fig. 2 reveals that in this model the plateaus are absent. Again, we repeat the experiment of Sec. 3.2, but we train networks on structured inputs with teacher labels (4) and latent labels (7), respectively. It is clear from these plots that the plateaus only appear for the teacher task. In Appendix B, we demonstrate that the lack of plateaus for latent tasks in Fig. 2 is not due to the fact that the network in the latent task asymptotes at a higher generalisation error than the teacher task.
4.2 Latent tasks and structured inputs are both necessary to model real data sets
Our quest to reproduce the behaviour of networks trained on MNIST has led us to consider three different setups so far: the vanilla teacher-student setup, i.e. a teacher task on unstructured inputs; and teacher and latent tasks on structured inputs. While it is not strictly possible to test the case of a latent task with unstructured inputs, we can approximate this setup by training a network on the MNIST task and then using the resulting network as a teacher to generate labels (4) for inputs drawn i.i.d. component-wise from the standard normal distribution. To test this idea, we trained both layers of sigmoidal networks using vanilla SGD on the MNIST task, where they reach a low generalisation error. They have thus clearly learnt some of the structure of the MNIST task. However, as we show on the left of Fig. 4, independent students trained on a data set with i.i.d. Gaussian inputs and true labels given by the pre-trained teacher network behave similarly to students trained in the vanilla teacher-student setup of Sec. 3.1. Furthermore, the learning dynamics of a network trained in this setup display the plateaus that we observed in the vanilla teacher-student setup (inset of Fig. 4).
On the right of Fig. 4, we summarise the four different setups for synthetic data sets in supervised learning problems that we have analysed in this paper. Only the hidden manifold model, consisting of a latent task on structured inputs, reproduced the behaviour of neural networks trained on the MNIST task, leading us to conclude that a model for realistic data sets has to feature both structured inputs and a latent task.
5 Concluding perspectives
We have introduced the hidden manifold model for structured data sets, which is simple to write down, yet displays some of the phenomena that we observe when training neural networks on real-world inputs. We saw that the model has two key ingredients, both of which are necessary: (1) high-dimensional inputs which lie on a lower-dimensional manifold and (2) latent labels for these inputs that depend on the inputs’ position within the low-dimensional manifold. We hope that this model is a step towards a more thorough understanding of how the structure we find in real-world data sets impacts the training dynamics of neural networks and their ability to generalise.
We see two main lines for future work. On the one hand, the present work needs to be generalised to multi-layer networks to identify how depth helps to deal with structured data sets and to build a model capturing the key properties. On the other hand, the key promise of the synthetic hidden manifold model is that the learning dynamics should be amenable to closed-form analysis in some limit. Such analysis and its results would then provide further insights about the properties of learning beyond what is possible with numerical experiments.
We would like to thank the Kavli Institute For Theoretical Physics for its hospitality during an extended stay, during which parts of this work were conceived and carried out. We acknowledge funding from the ERC under the European Union’s Horizon 2020 Research and Innovation Programme Grant Agreement 714608-SMiLe, from “Chaire de recherche sur les modèles et sciences des données”, Fondation CFM pour la Recherche-ENS, and from the French National Research Agency (ANR) grant PAIL. This research was supported in part by the National Science Foundation under Grant No. NSF PHY-1748958.
- [1] V. Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.
- [2] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
- [3] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056–6091, 1992.
- [4] A. Engel and C. Van den Broeck. Statistical Mechanics of Learning. Cambridge University Press, 2001.
- [5] L. Zdeborová and F. Krzakala. Statistical physics of inference: thresholds and algorithms. Adv. Phys., 65(5):453–552, 2016.
- [6] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.
- [7] P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. Physica D: Nonlinear Phenomena, 9(1-2):189–208, 1983.
- [8] J.A. Costa and A.O. Hero. Learning intrinsic dimension and intrinsic entropy of high-dimensional datasets. In 2004 12th European Signal Processing Conference, pages 369–372, 2004.
- [9] E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems 17, 2004.
- [10] S. Spigler, M. Geiger, and M. Wyart. Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm. arXiv:1905.10843, 2019.
- [11] E. Gardner and B. Derrida. Three unfinished works on the optimal storage capacity of networks. Journal of Physics A: Mathematical and General, 22(12):1983–1994, 1989.
- [12] T.L.H. Watkin, A. Rau, and M. Biehl. The statistical mechanics of learning a rule. Reviews of Modern Physics, 65(2):499–556, 1993.
- [13] M.S. Advani and A.M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv:1710.03667, 2017.
- [14] B. Aubin, A. Maillard, J. Barbier, F. Krzakala, N. Macris, and L. Zdeborová. The committee machine: Computational to statistical gaps in learning a two-layers neural network. In Advances in Neural Information Processing Systems 31, pages 3227–3238, 2018.
- [15] J. Barbier, F. Krzakala, N. Macris, L. Miolane, and L. Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.
- [16] S. Goldt, M.S. Advani, A.M. Saxe, F. Krzakala, and L. Zdeborová. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. To appear in Advances in Neural Information Processing Systems 33, arXiv:1906.08632, 2019.
- [17] Y. Yoshida, R. Karakida, M. Okada, and S.-I. Amari. Statistical mechanical analysis of learning dynamics of two-layer perceptron with multiple output units. Journal of Physics A: Mathematical and Theoretical, 52(18), 2019.
- [18] R. Ge, J.D. Lee, and T. Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
- [19] Y. Li and Y. Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
- [20] S. Mei and A. Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
- [21] S. Arora, N. Cohen, W. Hu, and Y. Luo. Implicit Regularization in Deep Matrix Factorization. In Advances in Neural Information Processing Systems 33, arXiv:1905.13655, 2019.
- [22] D. Saad and S.A. Solla. Exact Solution for On-Line Learning in Multilayer Neural Networks. Phys. Rev. Lett., 74(21):4337–4340, 1995.
- [23] Y. Li, J. Yosinski, J. Clune, H. Lipson, and J. Hopcroft. Convergent Learning: Do different neural networks learn the same representations? In D. Storcheus, A. Rostamizadeh, and S. Kumar, editors, Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015, volume 44 of Proceedings of Machine Learning Research, pages 196–212. PMLR, 2015.
- [24] M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. In Advances in Neural Information Processing Systems 30, pages 6076–6085. Curran Associates, Inc., 2017.
- [25] A.S. Morcos, M. Raghu, and S. Bengio. Insights on representational similarity in neural networks with canonical correlation. In Advances in Neural Information Processing Systems 31, pages 5727–5736, 2018.
- [26] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, (35):1872–1886, 2013.
- [27] A.B. Patel, M.T. Nguyen, and R. Baraniuk. A probabilistic framework for deep learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2558–2566. Curran Associates, Inc., 2016.
- [28] M. Mézard. Mean-field message passing equations in the Hopfield model and its generalizations. Phys. Rev. E, (95):022117, 2017.
- [29] M. Gabrié, A. Manoel, C. Luneau, J. Barbier, N. Macris, F. Krzakala, and L. Zdeborová. Entropy and mutual information in models of deep neural networks. In Advances in Neural Information Processing Systems 31, pages 1826–1836, 2018.
- [30] E. Mossel. Deep learning and hierarchical generative models. arXiv preprint arXiv:1612.09057, 2018.
- [31] S. Chung, Daniel D. Lee, and H. Sompolinsky. Classification and Geometry of General Perceptual Manifolds. Physical Review X, 8(3):31003, 2018.
- [32] P. Rotondo, M. Cosentino Lagomarsino, and M. Gherardi. Counting the learnable functions of structured data. arXiv:1903.12021, 2019.
- [33] T. M. Cover. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Transactions on Electronic Computers, EC-14(3):326–334, 1965.
- [34] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT Press, 2016.
- [35] M. Biehl and H. Schwarze. Learning by on-line gradient descent. J. Phys. A. Math. Gen., 28(3):643–656, 1995.
- [36] M. Biehl, P. Riegler, and C. Wöhler. Transient dynamics of on-line learning in two-layered neural networks. Journal of Physics A: Mathematical and General, 29(16), 1996.
- [37] H. Schwarze. Learning a rule in a multilayer neural network. Journal of Physics A: Mathematical and General, 26(21):5781–5794, 1993.
- [38] H.H. Yang and S.-I. Amari. The Efficiency and the Robustness of Natural Gradient Descent Learning Rule. In M I Jordan, M J Kearns, and S A Solla, editors, Advances in Neural Information Processing Systems 10, pages 385–391, 1998.
- [39] M. Rattray, D. Saad, and S.-I. Amari. Natural Gradient Descent for On-Line Learning. Physical Review Letters, 81(24):5461–5464, 1998.
Appendix A The exact form of the data-generating function is not important, as long as it is non-linear
Two questions arise when looking at the way we generate inputs in our data sets, $X = f(CF/\sqrt{D})$: is the non-linearity necessary, and is the choice of non-linearity important?
To answer the first question, we plot the results of the experiment with independent networks described in Sec. 4.1. The setup is exactly the same, except that we now take inputs to be
$$X = \frac{C F}{\sqrt{D}},$$
i.e. inputs are just a linear combination of the feature vectors, without applying a non-linearity. In this case, two networks trained on a teacher task will learn globally different functions, as can be seen from the fractional generalisation error between the networks (5) (green diamonds), which is no better than chance. This is a direct consequence of using linear inputs: since every input lies in the subspace spanned by the feature vectors, to perfectly generalise with respect to the teacher it is sufficient to learn only the $D$ components of the teacher weights in the directions of the feature vectors $F$. Thus the weights of the network in the weight space orthogonal to the feature vectors are unconstrained and, by starting from random initial conditions, will converge to different values for each network.
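The argument about unconstrained weight directions can be checked numerically. The sketch below assumes linear inputs $X = CF/\sqrt{D}$ and a sigmoidal two-layer network $\phi(x) = \sum_k v^k g(w^k x/\sqrt{N})$ with $g(x) = \operatorname{erf}(x/\sqrt{2})$; it perturbs the first-layer weights only in directions orthogonal to the feature vectors and compares the outputs on structured and on Gaussian inputs. The function name `linear_inputs_demo` and the chosen sizes are hypothetical.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def phi(X, W, v):
    """Two-layer sigmoidal network; X has shape (P, N), W has shape (K, N)."""
    return _erf(X @ W.T / math.sqrt(2.0 * W.shape[1])) @ v


def linear_inputs_demo(P=200, N=50, D=5, K=3, seed=0):
    """Return the largest output change on linear and on Gaussian inputs."""
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((D, N))
    C = rng.standard_normal((P, D))
    X_linear = C @ F / math.sqrt(D)            # inputs in the span of the rows of F
    X_gauss = rng.standard_normal((P, N))      # inputs covering all of input space

    # Projector onto the orthogonal complement of the row space of F.
    Q, _ = np.linalg.qr(F.T)                   # Q has shape (N, D)
    proj_orth = np.eye(N) - Q @ Q.T

    W = rng.standard_normal((K, N))
    v = rng.standard_normal(K)
    # Perturb W only in directions orthogonal to every feature vector.
    W_pert = W + rng.standard_normal((K, N)) @ proj_orth

    diff_linear = np.max(np.abs(phi(X_linear, W, v) - phi(X_linear, W_pert, v)))
    diff_gauss = np.max(np.abs(phi(X_gauss, W, v) - phi(X_gauss, W_pert, v)))
    return diff_linear, diff_gauss
```

The two weight matrices give numerically identical outputs on the linear inputs but different outputs on Gaussian inputs, which is why training on linear inputs cannot pin down a single global function.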
We also checked that the qualitative behaviour of neural networks trained on the hidden manifold model does not depend on the data-generating non-linearity $f$. In Fig. 7, we therefore show the results of the same experiment described in Sec. 4.1, but this time using a different non-linearity $f$, again applied component-wise. Indeed, the results mirror those obtained with the sign function $f(x) = \operatorname{sgn}(x)$.
Appendix B The existence of plateaus does not depend on the asymptotic generalisation error
We have demonstrated on the right of Fig. 2 that neural networks trained on data drawn from the hidden manifold model (HMF) introduced here do not show the plateau phenomenon, where the generalisation error stays stationary after an initial exponential decay, before dropping again. Upon closer inspection, one might think that this is due to the fact that the student trained on data from the HMF asymptotes at a higher generalisation error than the student trained in the vanilla teacher-student setup. This is not the case, as we demonstrate in Fig. 7: we observe no plateau in a sigmoidal network trained on data from the HMF even though that network asymptotes at a generalisation error that is, within fluctuations, the same as the generalisation error of a network of the same size trained in the vanilla teacher-student setup, which does show a plateau.
Appendix C Early-stopping yields qualitatively similar results
In Fig. 8, we reproduce Fig. 3, where we compare the performance of independent neural networks trained on the MNIST task (Left), or trained on structured inputs with a latent task (Center) and a teacher task (Right), respectively. This time, we plot the early-stopping generalisation error rather than the asymptotic value at the end of training. We define as the minimum of during the whole of training. Clearly, the qualitative result of Sec. 4.1 is unchanged: although we use structured inputs (6) in both cases, independent students learn different functions that agree on those inputs only when they are trained on a latent task (7) (Center), but not when they are trained on a vanilla teacher task (4) (Right). Thus structured inputs and latent tasks are sufficient to reproduce the behaviour observed when training on the MNIST task.
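The early-stopping error defined above is simply the minimum of the recorded generalisation error over the course of training. A minimal sketch of that definition follows; the function name `early_stopping_error` and the example curve are hypothetical and chosen only for illustration.

```python
import numpy as np

def early_stopping_error(eg_curve):
    """Return the minimum of a recorded generalisation-error curve
    and the index of the measurement at which it is reached."""
    eg_curve = np.asarray(eg_curve, dtype=float)
    step = int(np.argmin(eg_curve))
    return float(eg_curve[step]), step

# Made-up curve that over-fits after the fourth measurement
curve = [0.50, 0.20, 0.08, 0.05, 0.07, 0.11]
eg_min, step_min = early_stopping_error(curve)
eg_final = curve[-1]   # value at the end of training, for comparison
print(eg_min, step_min, eg_final)
```

By construction, the early-stopping value can never exceed the value at the end of training, so the two only differ when the network over-fits at late times.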
Appendix D Dynamics with a large number of features
It is of independent interest to investigate the behaviour of networks trained on data from the hidden manifold model when the number of feature vectors is of the same order as the input dimension . We call this the regime of extensive . It is a different regime from MNIST, where experimental studies consistently find that inputs lie on a low-dimensional manifold of dimension , which is much smaller than the input dimension [8, 9, 10].
We show the results of our numerical experiments with in Fig. 9, where we reproduce Fig. 3 for the asymptotic (top row) and the early-stopping (bottom row) generalisation error. The behaviour of networks trained on a teacher task with structured inputs (right column) is unchanged w.r.t. the case with . For the latent task, however, increasing the number of hidden units increases the generalisation error, indicating severe over-fitting, which is only partly mitigated by early stopping. The generalisation error on this task is generally much higher than in the low- regime, and increasing the width of the network is clearly not the right way to learn a latent task. Instead, analysing the performance of deeper networks on this task, where finding a good intermediate representation for inputs is key, is an intriguing avenue for future research.
Appendix E Independent students with ReLU activation function
We also verified that the behaviour of independent networks we observed on MNIST with sigmoidal students persists when training networks with ReLU activation function and that the hidden manifold model is able to reproduce it for these networks. We show the results of our numerical experiments in Fig. 10. To that end, we trained both layers of a network with starting from small initial conditions, where we draw the weights component-wise i.i.d. from a normal distribution with variance .
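For concreteness, the kind of network and initialisation described here can be sketched as follows. This is an illustration under stated assumptions rather than the code used for Fig. 10: the function names, the scaling of the pre-activations and the placeholder value of `init_std` are all hypothetical, since the variance used in the experiments is not reproduced here.

```python
import numpy as np

def init_two_layer(n_inputs, n_hidden, init_std, rng):
    """Draw both layers component-wise i.i.d. from a zero-mean normal
    distribution with standard deviation init_std (a placeholder value)."""
    w = init_std * rng.standard_normal((n_hidden, n_inputs))  # first layer
    v = init_std * rng.standard_normal(n_hidden)              # second layer
    return w, v

def relu_network(x, w, v):
    """Two-layer ReLU network; the 1/sqrt(n_inputs) scaling is an assumption."""
    pre = x @ w.T / np.sqrt(x.shape[1])
    return np.maximum(pre, 0.0) @ v

rng = np.random.default_rng(0)
w, v = init_two_layer(n_inputs=784, n_hidden=8, init_std=1e-3, rng=rng)
x = rng.standard_normal((5, 784))   # stand-in inputs, not MNIST images
out = relu_network(x, w, v)
print(out.shape)
```

With small initial weights, the initial outputs are close to zero, so both layers start from a nearly uninformative function before training.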
We see that the generalisation error of ReLU networks on the MNIST task (Left of Fig. 10) decreases with increasing number of hidden units, while the generalisation error on MNIST inputs of the two independent students with respect to each other is comparable to or less than the generalisation error of each individual network on the MNIST task.
On structured inputs with a teacher task (Right of Fig. 10), where labels were generated by a teacher with hidden units, the student recovers the teacher such that its generalisation error is less than for , and both independent students learn the same function, as evidenced by their generalisation errors with respect to each other. This is the same behaviour that we see in Fig. 3 for sigmoidal networks. The finite value of the generalisation error for is due to two out of ten runs taking longer to converge than the duration of our simulation. Finally, we see that for a latent task on structured inputs, the generalisation error of the two networks with respect to each other increases beyond the generalisation error of each of them on structured inputs, as we observed on MNIST. Thus we have recovered in ReLU networks, too, the phenomenology that we described for sigmoidal networks.
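The comparison between the error of each student on the task and the error of the two students with respect to each other can be illustrated with a short sketch. It assumes binary labels and measures error as the fraction of inputs on which the signs of two outputs differ; this is a stand-in for the fractional generalisation error (5), which is not reproduced in this appendix, and the outputs below are made up.

```python
import numpy as np

def fractional_error(pred_a, pred_b):
    """Fraction of inputs on which the signs of two outputs disagree
    (assumed stand-in for the fractional generalisation error)."""
    return float(np.mean(np.sign(pred_a) != np.sign(pred_b)))

# Made-up binary labels and outputs of two independently trained students
labels    = np.array([ 1.0, -1.0, 1.0,  1.0, -1.0, -1.0, 1.0, -1.0])
student_1 = np.array([ 0.9, -0.8, 0.7, -0.2, -0.6, -0.9, 0.4, -0.3])
student_2 = np.array([ 0.8, -0.7, 0.6,  0.5,  0.3, -0.4, 0.2, -0.1])

err_1 = fractional_error(student_1, labels)      # student 1 vs. task
err_2 = fractional_error(student_2, labels)      # student 2 vs. task
err_12 = fractional_error(student_1, student_2)  # students vs. each other
print(err_1, err_2, err_12)
```

In this toy example the two students make their mistakes on different inputs, so their mutual error exceeds the error of either one on the task, which is the pattern described above for the latent task.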