Deep learning has achieved unparalleled success in various tasks of artificial intelligence such as image classification [1, 2] and speech recognition. Remarkably, in modern machine learning applications, impressive generalization performance has been observed in an overparameterized regime, in which the number of parameters in the network is much larger than the number of training data samples. Contrary to what classical learning theory teaches us, an overparameterized network can fit random labels and yet generalize very well without serious overfitting. We still lack a general theory that explains why deep learning works so well.
Recently, the learning dynamics and the generalization power of heavily overparameterized wide neural networks have been extensively studied. It has been reported that training of an overparameterized network easily achieves zero training error without getting stuck in local minima of the loss landscape [4, 5]. Mathematically rigorous results have also been obtained [6, 7]. From a different point of view, the theory of the neural tangent kernel (NTK) has been developed as a new tool for investigating an overparameterized network of infinite width [8, 9], and it gives a simple explanation of why a sufficiently wide neural network can achieve a global minimum of the training loss.
The standard bias-variance tradeoff picture predicts a U-shaped curve of the test error as a function of the model capacity, but instead one finds the double-descent curve, which tells us that increased model capacity beyond the interpolation threshold results in improved performance. This finding triggered detailed studies of the behavior of the bias and the variance in an overparameterized regime [13, 14]. The double-descent phenomenon is not explained by traditional complexity measures such as the Vapnik-Chervonenkis dimension and the Rademacher complexity, and hence one seeks new complexity measures of deep neural networks that can prove better generalization bounds [16, 17, 18, 19, 20, 21].
These theoretical efforts mainly focus on the effect of increasing the network width, while the benefits of the network depth remain unclear. It is known that the expressivity of a deep neural network grows exponentially with the depth rather than the width; see also [23, 24]. However, it is far from clear whether exponential expressivity really leads to better generalization [25, 26]. It is also nontrivial whether typical problems encountered in practice require such high expressivity. Although some works [27, 28] have shown that there exist simple and natural functions that are efficiently approximated by a network with two hidden layers but not by a network with one hidden layer, a recent work has demonstrated that, under gradient-based optimization, a deep network can only learn functions that are well approximated by a shallow network, which indicates that the benefits of depth are not due to the high expressivity of deep networks. Some other recent works have reported no clear advantage of the depth in an overparameterized regime [30, 31].
To gain insight into the advantage of the depth, we report in the present paper an experimental study of the depth and width dependences of generalization in abstract but simple, well-controlled classification tasks with fully connected neural networks. We find that whether a deep network outperforms a shallow one depends on the properties of the relevant features for a given classification rule.
In this work, we introduce local labels and global labels, both of which give simple mappings between inputs and output class labels. By “local”, we mean that the label is determined by only a few components of the input vector. A global label, on the other hand, is given by a sum of local terms and is determined by all components of the input. Our experiments show strong depth dependences of the generalization error for these simple input-output mappings. In particular, we find that deeper is better for local labels, while shallower is better for global labels. The implication of this result is that depth is not always advantageous; rather, the locality of the relevant features gives us a clue for understanding when depth helps.
We also compare the generalization performance of a trained network of finite width with that of the kernel method with the NTK. The latter corresponds to the infinite-width limit of a fully connected network with a proper initialization and an infinitesimal learning rate, which is referred to as the NTK limit. We find that, even as the width increases, the generalization error at an optimal learning rate in many cases does not converge to the NTK limit. In such cases, a finite-width network generalizes much better than kernel learning with the NTK. In the NTK limit, the network parameters stay close to their initial values during training, which is called lazy learning, and hence the result above indicates the importance of feature learning, in which the network parameters change so as to learn the relevant features.
We consider a classification task with a training dataset $\mathcal{D} = \{(\bm{x}^{(\mu)}, y^{(\mu)})\}_{\mu=1}^{M}$, where $\bm{x}^{(\mu)}$ is an input data point and $y^{(\mu)}$ is its label. In this work, we consider binary classification, $y^{(\mu)} \in \{1, 2\}$, unless otherwise stated.
Each input $\bm{x} = (x_1, x_2, \dots, x_d)^{\mathrm{T}}$ is a $d$-dimensional vector whose components are taken from i.i.d. Gaussian random variables of zero mean and unit variance, where $(\cdot)^{\mathrm{T}}$ denotes the transpose of a vector. For each input $\bm{x}^{(\mu)}$, we assign a label $y^{(\mu)}$ according to one of the following rules.
We randomly fix $k$ integers $i_1 < i_2 < \cdots < i_k$ with $1 \le i_j \le d$. In the “$k$-local” label, the relevant feature is given by the product of the $k$ components $x_{i_1}, x_{i_2}, \dots, x_{i_k}$ of an input $\bm{x}$, that is, the label is determined by
$$ y = \begin{cases} 1 & \text{if } x_{i_1} x_{i_2} \cdots x_{i_k} > 0, \\ 2 & \text{otherwise.} \end{cases} $$
This label is said to be local in the sense that $y$ is completely determined by just $k$ components of an input $\bm{x}$. (The locality here does not necessarily imply that the $k$ points are spatially close to each other. Such a use of the terminology “$k$-local” is found in the field of quantum computation.) For the fully connected networks considered in this paper, we can choose $i_1 = 1, i_2 = 2, \dots, i_k = k$ without loss of generality because of the permutation symmetry with respect to the indices of input vectors.
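As a concrete illustration, the $k$-local rule above can be sketched in a few lines of NumPy. This is our own minimal sketch: the function name and the encoding of the two classes as 1 and 2 are our choices, and the relevant components are taken to be the first $k$, as the permutation-symmetry argument allows.

```python
import numpy as np

def k_local_labels(X, k):
    """k-local label: the class is decided by the sign of the product
    of the first k components of each input row (WLOG, by permutation symmetry)."""
    feature = np.prod(X[:, :k], axis=1)
    return np.where(feature > 0, 1, 2)  # class 1 if the feature is positive, class 2 otherwise

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))  # 1000 inputs, each a d=50 i.i.d. Gaussian vector
y = k_local_labels(X, k=2)
```

Only the first $k$ columns matter here: resampling all remaining components leaves every label unchanged, which is exactly the locality property described above.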
We again fix $k$. Let us define
$$ S(\bm{x}) = \sum_{i=1}^{d} x_i x_{i+1} \cdots x_{i+k-1}, $$
where the periodic convention $x_{d+i} = x_i$ is used. The $k$-global label for $\bm{x}$ is defined by
$$ y = \begin{cases} 1 & \text{if } S(\bm{x}) > 0, \\ 2 & \text{otherwise.} \end{cases} $$
The relevant feature for this label is given by a uniform sum of products of $k$ components of the input vector. Every component of $\bm{x}$ contributes to this “$k$-global” label, in contrast to the $k$-local label, which depends on only $k$ components.
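The $k$-global rule can be sketched analogously; again the function name and the class encoding (1 and 2) are our own choices, and the periodic window product is realized via cyclic shifts.

```python
import numpy as np

def k_global_labels(X, k):
    """k-global label: the sign of S(x) = sum_i x_i x_{i+1} ... x_{i+k-1},
    with periodic indices (x_{d+i} = x_i), decides the class."""
    # Product over a sliding window of length k, realized via cyclic shifts of the columns.
    windows = np.stack([np.roll(X, -j, axis=1) for j in range(k)], axis=0)
    S = np.prod(windows, axis=0).sum(axis=1)
    return np.where(S > 0, 1, 2)
```

For instance, for the single input $(1, 2, 3)$ with $k = 2$, the periodic sum is $1\cdot2 + 2\cdot3 + 3\cdot1 = 11 > 0$, so the label is class 1.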
2.2 Network architecture
In the present work, we consider fully connected feedforward neural networks with $L$ hidden layers of width $N$. We call $L$ and $N$ the depth and the width of the network, respectively. The output $f(\bm{x}) \in \mathbb{R}^2$ of the network for an input vector $\bm{x}$ is determined as follows:
$$ \bm{h}^{(0)} = \bm{x}, \qquad \bm{h}^{(l)} = \sigma\!\left( W^{(l)} \bm{h}^{(l-1)} + \bm{b}^{(l)} \right) \quad (l = 1, 2, \dots, L), \qquad f(\bm{x}) = W^{(L+1)} \bm{h}^{(L)} + \bm{b}^{(L+1)}, $$
where $\bm{h}^{(l)}$ is the output of the $l$th layer, $\sigma(\cdot)$ is the ReLU activation function applied componentwise, and $\{W^{(l)}\}$ and $\{\bm{b}^{(l)}\}$ are the weights and the biases, respectively. Let us denote by $\theta$ the set of all the weights and biases in the network. We focus on an overparameterized regime, i.e., the number of network parameters (the number of components of $\theta$) exceeds $M$, the number of training data points.
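The architecture can be sketched directly in NumPy. This is a minimal illustration, not the authors' code: the Glorot-uniform initialization and zero biases follow the training setup described below, the two output units correspond to binary classification, and all function names are ours.

```python
import numpy as np

def init_params(d, width, depth, rng):
    """Glorot-uniform weights and zero biases for `depth` hidden layers of
    the given width, plus a 2-unit linear output layer (binary classification)."""
    sizes = [d] + [width] * depth + [2]
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        limit = np.sqrt(6.0 / (n_in + n_out))
        params.append((rng.uniform(-limit, limit, (n_out, n_in)), np.zeros(n_out)))
    return params

def forward(params, x):
    """Network output f(x): ReLU on every hidden layer, linear output layer."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(W @ h + b, 0.0)  # componentwise ReLU
    W, b = params[-1]
    return W @ h + b
```

Counting the entries of the weight matrices and bias vectors gives the number of trainable parameters, which is what must exceed the number of training samples in the overparameterized regime.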
2.3 Supervised learning
The network parameters $\theta$ are adjusted to correctly classify the training data. This is done by minimizing the softmax cross-entropy loss $\mathcal{L}(\theta)$ given by
$$ \mathcal{L}(\theta) = -\frac{1}{M} \sum_{\mu=1}^{M} \log \frac{e^{f_{y^{(\mu)}}(\bm{x}^{(\mu)})}}{\sum_{k'=1}^{2} e^{f_{k'}(\bm{x}^{(\mu)})}}, $$
where the $k$th component of $f(\bm{x})$ is denoted by $f_k(\bm{x})$. The main results of our paper do not change for other standard loss functions such as the mean-squared error.
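A numerically stable evaluation of this loss can be sketched as follows; the max-subtraction trick is a standard implementation detail, and the 0-based class indices (unlike the 1-based labels in the text) are our convention.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy, -(1/M) sum_mu log softmax(f(x_mu))_{y_mu}.
    `logits` has one row per sample; `labels` holds 0-based class indices."""
    z = logits - logits.max(axis=1, keepdims=True)  # subtract the row max for stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

With two classes and uninformative (all-zero) outputs, the loss equals $\log 2$, the entropy of a fair coin, which is a convenient sanity check.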
The training of the network is done by stochastic gradient descent (SGD) with learning rate $\eta$ and mini-batch size $B$. That is, for each mini-batch $B_t \subset \{1, 2, \dots, M\}$ with $|B_t| = B$, the network parameters $\theta_t$ at time $t$ are updated as
$$ \theta_{t+1} = \theta_t - \frac{\eta}{B} \sum_{\mu \in B_t} \nabla_\theta \ell_\mu(\theta_t), $$
where $\ell_\mu$ denotes the loss evaluated on the $\mu$th sample. Throughout the paper, we fix the mini-batch size $B$, while we optimize the learning rate $\eta$ before training (details are explained later). Biases are initialized to zero, and weights are initialized using the Glorot initialization. (We also tried the He initialization and confirmed that the results are similar to those obtained with the Glorot initialization, in particular when input vectors are normalized.)
The trained network classifies an input $\bm{x}$ to the class given by $\hat{y}(\bm{x}) = \operatorname{argmax}_k f_k(\bm{x})$. Let us then define the training error as
$$ \varepsilon_{\mathrm{train}} = \frac{1}{M} \sum_{\mu=1}^{M} \mathbb{1}\!\left[ \hat{y}(\bm{x}^{(\mu)}) \neq y^{(\mu)} \right], $$
that is, the misclassification rate for the training data $\mathcal{D}$. We train our network until $\varepsilon_{\mathrm{train}} = 0$ is achieved, i.e., all the training data samples are correctly classified, which is possible in an overparameterized regime.
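The misclassification rate used for both the training and the test error is a one-liner; as above, 0-based class indices are our convention.

```python
import numpy as np

def error_rate(logits, labels):
    """Misclassification rate: the fraction of samples whose predicted class
    argmax_k f_k(x) differs from the true label (0-based class indices)."""
    return float((logits.argmax(axis=1) != labels).mean())
```

Training is stopped once this rate, evaluated on the training set, reaches zero.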
For a training dataset $\mathcal{D}$, we first perform 10-fold cross validation to optimize the learning rate $\eta$ using the Bayesian optimization method, and then perform the training via SGD using the full training dataset. In the optimization of $\eta$, we minimize the misclassification ratio for the validation data.
The generalization performance of a trained network is measured by computing the test error. We prepare the test data $\mathcal{D}_{\mathrm{test}}$ independently from the training data $\mathcal{D}$. The test error is defined as the misclassification ratio for $\mathcal{D}_{\mathrm{test}}$, i.e.,
$$ \varepsilon_{\mathrm{test}} = \frac{1}{M_{\mathrm{test}}} \sum_{(\bm{x}, y) \in \mathcal{D}_{\mathrm{test}}} \mathbb{1}\!\left[ \hat{y}(\bm{x}) \neq y \right], $$
where $\hat{y}(\bm{x})$ is the prediction of our trained network. In our experiment discussed in Sec. 3, we fix the number $M_{\mathrm{test}}$ of test samples.
2.4 Neural Tangent Kernel
Suppose a network of depth $L$ and width $N$ with the output $f_\theta(\bm{x})$. When the network is sufficiently wide and the learning rate is sufficiently small, the network parameters stay close to their randomly initialized values $\theta_0$ during training, and hence $f_\theta(\bm{x})$ is approximated by a linear function of $\theta$:
$$ f_\theta(\bm{x}) \approx f_{\theta_0}(\bm{x}) + \nabla_\theta f_{\theta_0}(\bm{x}) \cdot (\theta - \theta_0). $$
As a result, the minimization of the mean-squared error $\frac{1}{2M} \sum_{\mu=1}^{M} \| f_\theta(\bm{x}^{(\mu)}) - \bm{y}^{(\mu)} \|^2$, where $\bm{y}^{(\mu)}$ is the one-hot representation of the label $y^{(\mu)}$, is equivalent to kernel regression with the NTK $\Theta^{(L)}$ defined as
$$ \Theta^{(L)}_{kk'}(\bm{x}, \bm{x}') = \mathbb{E}_{\theta_0}\!\left[ \nabla_\theta f_{k, \theta_0}(\bm{x}) \cdot \nabla_\theta f_{k', \theta_0}(\bm{x}') \right], $$
where $\mathbb{E}_{\theta_0}$ denotes the average over random initializations of $\theta_0$.
Let us consider a network whose biases and weights are randomly initialized as $b_i^{(l)} \sim \mathcal{N}(0, \beta^2)$ and $W_{ij}^{(l)} \sim \mathcal{N}(0, 1/n_{l-1})$ for every $l$, respectively, where $n_l$ is the number of neurons in the $l$th layer, i.e., $n_0 = d$ and $n_1 = n_2 = \dots = n_L = N$. The parameter $\beta$ controls the impact of the bias terms, and we set its value in our numerical experiments following Jacot et al. By using the ReLU activation function, we can give an explicit expression of the NTK that is suited for numerical calculations. Such formulas are given in the Supplementary Material.
It is shown that the NTK takes the form $\Theta^{(L)}_{kk'}(\bm{x}, \bm{x}') = \Theta^{(L)}(\bm{x}, \bm{x}')\, \delta_{kk'}$ with a scalar kernel $\Theta^{(L)}$, and the minimization of the mean-squared error with an infinitesimal weight decay yields the output function
$$ f(\bm{x}) = \sum_{\mu, \nu = 1}^{M} \Theta^{(L)}(\bm{x}, \bm{x}^{(\mu)}) \left( K^{-1} \right)_{\mu\nu} \bm{y}^{(\nu)}, $$
where $K^{-1}$ is the inverse matrix of the Gram matrix $K_{\mu\nu} = \Theta^{(L)}(\bm{x}^{(\mu)}, \bm{x}^{(\nu)})$. An input $\bm{x}$ is classified to $\operatorname{argmax}_k f_k(\bm{x})$.
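The kernel prediction can be sketched generically; any kernel function can supply the Gram matrices, and the tiny ridge term stands in for the infinitesimal weight decay mentioned above. The function name and the default ridge value are our own choices.

```python
import numpy as np

def kernel_predict(K_test_train, K_train_train, Y_train, ridge=1e-8):
    """Kernel-regression output f(x) = sum_{mu,nu} K(x, x_mu) (K^{-1})_{mu nu} y_nu,
    computed via a linear solve with a tiny ridge term for numerical stability.
    Y_train holds one-hot label rows; the result has one output row per test input."""
    n = K_train_train.shape[0]
    alpha = np.linalg.solve(K_train_train + ridge * np.eye(n), Y_train)
    return K_test_train @ alpha
```

Each test input is then classified to the class with the largest output component, exactly as for the finite network. Evaluated on the training inputs themselves, the prediction interpolates the one-hot training labels when the Gram matrix is nonsingular.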
3 Experimental results
We now present our experimental results. For each data point, the training dataset is fixed and we optimize the learning rate $\eta$ via 10-fold cross validation with the Bayesian optimization method (we used the open-source package by Nogueira). We used the optimized $\eta$ to train our network. Every 50 epochs we compute the training error, and we stop the training once $\varepsilon_{\mathrm{train}} = 0$. For the fixed dataset and the optimized learning rate $\eta$, the training is performed 10 times, and we calculate the average and the standard deviation of the test errors.
3.1 1-local and 1-global labels
[Figure 1: (a) 1-local; (b) 1-global.]
In the 1-local and 1-global labels, the relevant feature is a linear function of the input vector. Therefore, in principle, even a linear network can correctly classify the data. Figure 1 shows the generalization errors in nonlinear networks of varying depth and width, as well as those of the linear perceptron (the network of zero depth), for a fixed input dimension $d$. We also plot test errors calculated by the NTK, but we postpone the discussion of the NTK until Sec. 3.3.
Figure 1 shows that for both the 1-local and 1-global labels, the test error decreases with the network width, and a shallower network shows better generalization than a deeper one. The linear perceptron shows the best generalization performance, which is natural because it is the simplest network capable of learning the relevant feature associated with the 1-local or 1-global label. Remarkably, the test errors of the nonlinear networks are not much larger than those of the linear perceptron, although nonlinear networks are far more complex than the linear perceptron.
For a given network architecture, we do not see any important difference between the results for 1-local and 1-global labels, which would be explained by the fact that these labels are transformed to each other via the Fourier transformation of input vectors.
3.2 Opposite depth dependences for $k$-local and $k$-global labels with $k \geq 2$
For $k \geq 2$, it turns out that the experimental results show opposite depth dependences for $k$-local and $k$-global labels. Let us first consider $k$-local labels. Figures 2 (a) and (b) show test errors as functions of the number of training samples $M$ in various networks for the 2-local and the 3-local labels, respectively, with the input dimension $d$ fixed in each case. We see that the test error strongly depends on the network depth: a deeper network generalizes better than a shallower one. It should be noted that one of the shallower networks contains many more trainable parameters than a deeper one, and yet the deeper network outperforms it for the 2-local label as well as for the 3-local label with large $M$, which implies that increasing the number of trainable parameters does not necessarily imply better generalization. For $k$-local labels with $k \geq 2$, the network depth is more strongly correlated with generalization than the network width.
From Fig. 2 (b), it is obvious that shallower networks fail to learn the 3-local label for small $M$. We also see that the error bars of the test error are large for some architectures; the error bar represents the variance due to initialization and training. By increasing the network width $N$, both the variances and the test errors decrease. This result is consistent with the recent observation in the lazy regime that increasing the network width results in better generalization because it reduces the variance due to initialization.
[Figure 2: (a) 2-local; (b) 3-local; (c) 2-global; (d) 3-global.]
Next, we consider $k$-global labels with $k = 2$ and 3, again with the input dimension $d$ fixed in each case. We plot test errors against $M$ in Fig. 2 for (c) the 2-global label and (d) the 3-global label. Again we find strong depth dependences, but now shallow networks outperform deep ones, contrary to the results for $k$-local labels. We also find strong width dependences: the test error of a wider network decreases more quickly with $M$. In particular, for the 3-global label, the improvement of the generalization with $M$ is subtle for narrow networks; by increasing the width, the decrease of the test error with $M$ becomes much faster [see Fig. 2 (d)].
To see the effect of depth in more detail, we also plot the depth dependence of the test error for fixed training data samples. We prepare fixed training datasets for the 2-local and the 2-global labels, with the input dimension $d$ fixed in each case. Using the prepared training data samples, networks of varying depth $L$ and fixed width are trained until $\varepsilon_{\mathrm{train}} = 0$. The test errors of the trained networks are shown in Fig. 3. For the 2-local label, the test error decreases with $L$, whereas for the 2-global label it increases with $L$. Thus, Fig. 3 clearly shows the opposite depth dependences for local and global labels.
3.3 Comparison between finite networks and NTKs
In Figs. 1 and 2, test errors calculated by using the NTK are also plotted. In the case of $k = 1$ (Fig. 1), the generalization performance of the NTK is comparable with that of finite networks. For the 2-global label [Fig. 2 (c)], the test error obtained by the NTK is comparable to or lower than that of finite networks.
A crucial difference is seen for the $k$-local labels with $k = 2$ and 3 and for the 3-global label. In Figs. 2 (a) and (b), we see that the NTK almost completely fails to classify the data, although finite networks succeed in doing so. In the case of the 3-global label, the NTK of the larger depth correctly classifies the data, while the NTK of the smaller depth fails [see Fig. 2 (d)]. In those cases, the test error calculated by a finite network does not seem to converge to that obtained by the NTK as the network width increases.
The NTK has been proposed as a theoretical tool to investigate the infinite-width limit, but it should be kept in mind that the learning rate has to be sufficiently small to achieve the NTK limit [8, 9]. The discrepancy between a wide network and the NTK in Fig. 2 stems from the strong learning-rate dependence of the generalization performance. In our experiment, the learning rate has been optimized by performing the 10-fold cross validation. If the optimized learning rate is not small enough for each width, the trained network may not be described by the NTK even in the infinite-width limit.
In Fig. 4 (a), we plot the learning-rate dependence of the test error for the 2-local label and the 2-global label in a network of fixed depth and width. We observe a sharp learning-rate dependence for the 2-local label, in contrast to the 2-global label. In Fig. 4 (b), we compare the learning-rate dependences of the test error for a shallow network and a deep one of the same width in the case of the 3-global label. We see that the learning-rate dependence of the shallow network is much stronger than that of the deep one, which is consistent with the fact that the NTK fails only for the smaller depth. It should be noted that Fig. 4 (b) shows that the deep network outperforms the shallow one in the regime of small learning rates, while the shallow one performs better than the deep one at their optimal learning rates.
Figure 4 also shows that the test error for a sufficiently small learning rate approaches the one obtained by the corresponding NTK. Therefore, the regime of small learning rates is identified as a lazy learning regime, while larger learning rates correspond to a feature learning regime. The sharp learning-rate dependences found here underscore the theoretical and practical importance of feature learning.
In this work, we have studied the effect of increasing the depth in classification tasks. Instead of using real data, we have employed an abstract setting with random inputs and simple classification rules, because such a simple setup helps us understand under what circumstances deeper networks perform better or worse. We find that the locality of the relevant feature for a given classification rule plays a key role.
We note that the advantage of the depth for local labels is not due to the high expressivity of deep networks. If a network can accurately classify data with the $k$-local label for some input dimension, it can in principle do so for an arbitrarily large input dimension $d$, because the $k$-local label depends only on $k$ components among the $d$ components. Using this fact, it is confirmed that a small network with one hidden layer of width about 10-100 can express the 2-local label and the 3-local label almost perfectly. (This does not mean at all that such a small network can actually learn the local label for large $d$ with a gradient-based algorithm.) In other words, learning the $k$-local label for small $k$ does not require high expressive power. Nevertheless, a deeper network outperforms a shallower one.
It is also an interesting observation that shallower networks do better than deeper ones for the $k$-global label. This result shows that depth is not always beneficial. In future studies, we hope to investigate which properties of the data, other than the locality studied here, make depth advantageous or disadvantageous.
It is an important practical problem to find neural architecture design principles for specific machine-learning tasks. Although our paper is motivated by a theoretical question, namely why deep networks perform so well compared with shallow ones, this work also bears on that practical problem. Our work indicates that a deeper architecture is better for local features, whereas a shallower architecture is better for global features. Local features are expected to be important in typical image classification tasks, and our work suggests that a deep architecture should be used for such tasks, which is consistent with practical experience. Furthermore, if we could find a transformation of local features into global ones, then even shallow networks should be able to classify the data with great accuracy after the transformation, which is of practical merit. It would also be important to identify problems where labels are effectively global (e.g., problems where Fourier analysis works well, finance, or weather forecasting); the present research could lead to better solutions in these fields. In short, our work suggests a guiding principle for future studies of architecture design.
- Krizhevsky et al.  A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems (2012) pp. 1097–1105.
- LeCun et al.  Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature 521, 436–444 (2015).
- Hinton et al.  G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and Others, Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal processing magazine 29, 82–97 (2012).
- Zhang et al.  C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding Deep Learning Requires Rethinking of Generalization, in International Conference on Learning Representations (2017).
- Baity-Jesi et al.  M. Baity-Jesi, L. Sagun, M. Geiger, S. Spigler, G. Ben Arous, C. Cammarota, Y. LeCun, M. Wyart, and G. Biroli, Comparing Dynamics : Deep Neural Networks versus Glassy Systems, in International Conference on Machine Learning (2018).
- Allen-Zhu et al.  Z. Allen-Zhu, Y. Li, and Z. Song, A convergence theory for deep learning via over-parameterization, in International Conference on Machine Learning (2019) arXiv:1811.03962.
- Du et al.  S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai, Gradient descent finds global minima of deep neural networks, in International Conference on Machine Learning (2019) arXiv:1811.03804.
- Jacot et al.  A. Jacot, F. Gabriel, and C. Hongler, Neural tangent kernel: Convergence and generalization in neural networks, in Advances in Neural Information Processing Systems (2018) arXiv:1806.07572.
- Arora et al.  S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang, On Exact Computation with an Infinitely Wide Neural Net, in Neural Information Processing Systems (2019) arXiv:1904.11955.
- Spigler et al.  S. Spigler, M. Geiger, S. D’Ascoli, L. Sagun, G. Biroli, and M. Wyart, A jamming transition from under- To over-parametrization affects generalization in deep learning, Journal of Physics A: Mathematical and Theoretical 52, 474001 (2019), arXiv:1810.09665.
- Belkin et al.  M. Belkin, D. Hsu, S. Ma, and S. Mandal, Reconciling modern machine-learning practice and the classical bias–variance trade-off, Proceedings of the National Academy of Sciences of the United States of America 116, 15849–15854 (2019), arXiv:1812.11118.
- Geman et al.  S. Geman, E. Bienenstock, and R. Doursat, Neural Networks and the Bias/Variance Dilemma, Neural Computation 4, 1–58 (1992).
- Neal et al.  B. Neal, S. Mittal, A. Baratin, V. Tantia, M. Scicluna, S. Lacoste-Julien, and I. Mitliagkas, A Modern Take on the Bias-Variance Tradeoff in Neural Networks, in Workshop on Identifying and Understanding Deep Learning Phenomena (2019) arXiv:1810.08591.
- D’Ascoli et al.  S. D’Ascoli, M. Refinetti, G. Biroli, and F. Krzakala, Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime, arXiv:2003.01054.
- Mohri et al.  M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning (MIT press, 2018).
- Dziugaite and Roy  G. K. Dziugaite and D. M. Roy, Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data, in Uncertainty in Artificial Intelligence (2017) arXiv:1703.11008.
- Neyshabur et al.  B. Neyshabur, S. Bhojanapalli, D. Mcallester, and N. Srebro, Exploring Generalization in Deep Learning, in Advances in Neural Information Processing Systems (2017).
- Neyshabur et al.  B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro, The role of over-parametrization in generalization of neural networks, in International Conference on Learning Representations (2019) arXiv:1805.12076.
- Arora et al.  S. Arora, R. Ge, B. Neyshabur, and Y. Zhang, Stronger generalization bounds for deep nets via a compression approach, in International Conference on Machine Learning (2018) arXiv:1802.05296 .
- Nagarajan and Kolter  V. Nagarajan and J. Z. Kolter, Generalization in Deep Networks: The Role of Distance from Initialization, in Advances in Neural Information Processing Systems (2017) arXiv:1901.01672.
- Pérez et al.  G. V. Pérez, A. A. Louis, and C. Q. Camargo, Deep learning generalizes because the parameter-function map is biased towards simple functions, in International Conference on Learning Representations (2019) arXiv:1805.08522.
- Poole et al.  B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, Exponential expressivity in deep neural networks through transient chaos, in Advances in Neural Information Processing Systems (2016) arXiv:1606.05340.
- Bianchini and Scarselli  M. Bianchini and F. Scarselli, On the complexity of neural network classifiers: A comparison between shallow and deep architectures, IEEE Transactions on Neural Networks and Learning Systems 25, 1553–1565 (2014).
- Montúfar et al.  G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio, On the number of linear regions of deep neural networks, in Advances in Neural Information Processing Systems (2014) arXiv:1402.1869.
- Ba and Caruana  J. Ba and R. Caruana, Do deep networks really need to be deep?, in Advances in Neural Information Processing Systems (2014) arXiv:1312.6184.
- Becker et al.  S. Becker, Y. Zhang, and A. A. Lee, Geometry of Energy Landscapes and the Optimizability of Deep Neural Networks, Physical Review Letters 124, 108301 (2020), arXiv:1808.00408.
- Eldan and Shamir  R. Eldan and O. Shamir, The Power of Depth for Feedforward Neural Networks, in Proceedings of Machine Learning Research (2016).
- Safran and Shamir  I. Safran and O. Shamir, Depth-width tradeoffs in approximating natural functions with neural networks, in International Conference on Machine Learning (2017) arXiv:1610.09887.
- Malach and Shalev-Shwartz  E. Malach and S. Shalev-Shwartz, Is Deeper Better only when Shallow is Good?, in Advances in Neural Information Processing Systems (2019) arXiv:1903.03488.
- Geiger et al.  M. Geiger, S. Spigler, S. D’Ascoli, L. Sagun, M. Baity-Jesi, G. Biroli, and M. Wyart, Jamming transition as a paradigm to understand the loss landscape of deep neural networks, Physical Review E 100, 012115 (2019), arXiv:1809.09349.
- Geiger et al.  M. Geiger, S. Spigler, A. Jacot, and M. Wyart, Disentangling feature and lazy training in deep neural networks, arXiv:1906.08034.
- Chizat et al.  L. Chizat, E. Oyallon, and F. Bach, On Lazy Training in Differentiable Programming, in Neural Information Processing Systems (2019) arXiv:1812.07956.
- Kempe et al.  J. Kempe, A. Kitaev, and O. Regev, The complexity of the local Hamiltonian problem, SIAM Journal on Computing 35, 1070–1097 (2006).
- Glorot and Bengio  X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in Proceedings of Machine Learning Research (2010).
- He et al.  K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in Proceedings of the IEEE International Conference on Computer Vision (2015) arXiv:1502.01852.
- Snoek et al.  J. Snoek, H. Larochelle, and R. P. Adams, Practical Bayesian Optimization of Machine Learning Algorithms, in Advances in Neural Information Processing Systems (2012).
- Nogueira  F. Nogueira, Bayesian Optimization: Open source constrained global optimization tool for Python (2014).
Appendix A Explicit expression of the NTK
We consider a network whose biases and weights are randomly initialized as $b_i^{(l)} \sim \mathcal{N}(0, \beta^2)$ and $W_{ij}^{(l)} \sim \mathcal{N}(0, 1/n_{l-1})$ for every $l$, where $n_l$ is the number of neurons in the $l$th layer, i.e., $n_0 = d$ and $n_1 = \dots = n_L = N$. In the infinite-width limit $N \to \infty$, the pre-activation at every hidden layer tends to an i.i.d. Gaussian process with covariance $\Sigma^{(l)}(\bm{x}, \bm{x}')$ that is defined recursively as
$$ \Sigma^{(1)}(\bm{x}, \bm{x}') = \frac{1}{n_0} \bm{x}^{\mathrm{T}} \bm{x}' + \beta^2, $$
$$ \Lambda^{(l)}(\bm{x}, \bm{x}') = \begin{pmatrix} \Sigma^{(l)}(\bm{x}, \bm{x}) & \Sigma^{(l)}(\bm{x}, \bm{x}') \\ \Sigma^{(l)}(\bm{x}', \bm{x}) & \Sigma^{(l)}(\bm{x}', \bm{x}') \end{pmatrix}, $$
$$ \Sigma^{(l+1)}(\bm{x}, \bm{x}') = \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(l)})}\!\left[ \sigma(u) \sigma(v) \right] + \beta^2 $$
for $l = 1, 2, \dots, L$. We also define
$$ \dot{\Sigma}^{(l+1)}(\bm{x}, \bm{x}') = \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(l)})}\!\left[ \sigma'(u) \sigma'(v) \right], $$
where $\sigma'$ is the derivative of $\sigma$. The NTK is then expressed as $\Theta^{(L)}_{kk'} = \Theta^{(L)} \delta_{kk'}$, where
$$ \Theta^{(L)}(\bm{x}, \bm{x}') = \sum_{l=1}^{L+1} \Sigma^{(l)}(\bm{x}, \bm{x}') \prod_{l'=l+1}^{L+1} \dot{\Sigma}^{(l')}(\bm{x}, \bm{x}'), $$
with an empty product understood to be unity. The derivation of this formula is given by Arora et al.
Using the ReLU activation function $\sigma(u) = \max(u, 0)$, we can further calculate the Gaussian expectations, obtaining
$$ \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(l)})}\!\left[ \sigma(u) \sigma(v) \right] = \frac{\sqrt{\Sigma^{(l)}(\bm{x}, \bm{x})\, \Sigma^{(l)}(\bm{x}', \bm{x}')}}{2\pi} \left[ \sin\theta^{(l)} + \left( \pi - \theta^{(l)} \right) \cos\theta^{(l)} \right], $$
$$ \dot{\Sigma}^{(l+1)}(\bm{x}, \bm{x}') = \frac{\pi - \theta^{(l)}}{2\pi}, $$
where $\theta^{(l)}$ is defined by $\cos\theta^{(l)} = \Sigma^{(l)}(\bm{x}, \bm{x}') / \sqrt{\Sigma^{(l)}(\bm{x}, \bm{x})\, \Sigma^{(l)}(\bm{x}', \bm{x}')}$.
For $\bm{x} = \bm{x}'$, we have $\theta^{(l)} = 0$, and we obtain $\Sigma^{(l+1)}(\bm{x}, \bm{x}) = \Sigma^{(l)}(\bm{x}, \bm{x})/2 + \beta^2$ and $\dot{\Sigma}^{(l+1)}(\bm{x}, \bm{x}) = 1/2$.
By solving these recursions iteratively, the NTK is obtained. (When $\beta = 0$, i.e., no bias, the equations simplify further: $\theta^{(l)}$ is iteratively determined by $\cos\theta^{(l+1)} = \left[ \sin\theta^{(l)} + \left( \pi - \theta^{(l)} \right) \cos\theta^{(l)} \right] / \pi$, with $\theta^{(1)}$ the angle between $\bm{x}$ and $\bm{x}'$.)
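The recursion can be turned into a short program for a pair of inputs. This is our own sketch: the normalization conventions (the $1/n_0$ input scaling and the placement of $\beta^2$) follow Jacot et al.'s parameterization and may differ from the authors' implementation, and the default value of beta is only illustrative.

```python
import numpy as np

def relu_ntk(x1, x2, L, beta=0.1):
    """NTK Theta^{(L)}(x1, x2) of a ReLU network with L hidden layers, built from
    Sigma^{(1)} = x.x'/n_0 + beta^2 and the arc-cosine formulas
      E[s(u)s(v)]   = sqrt(S11*S22)/(2*pi) * (sin t + (pi - t) * cos t),
      E[s'(u)s'(v)] = (pi - t)/(2*pi),   cos t = S12 / sqrt(S11*S22)."""
    n0 = len(x1)
    s11 = x1 @ x1 / n0 + beta**2
    s22 = x2 @ x2 / n0 + beta**2
    s12 = x1 @ x2 / n0 + beta**2
    theta = s12                        # Theta^{(1)} = Sigma^{(1)}
    for _ in range(L):
        cos_t = np.clip(s12 / np.sqrt(s11 * s22), -1.0, 1.0)
        t = np.arccos(cos_t)
        sdot = (np.pi - t) / (2 * np.pi)
        s12 = np.sqrt(s11 * s22) / (2 * np.pi) * (np.sin(t) + (np.pi - t) * cos_t) + beta**2
        s11, s22 = s11 / 2 + beta**2, s22 / 2 + beta**2   # t = 0 on the diagonal
        theta = theta * sdot + s12     # Theta^{(l+1)} = Theta^{(l)} * Sigma-dot + Sigma^{(l+1)}
    return theta
```

Evaluating this kernel on all pairs of training inputs yields the Gram matrix used in the NTK predictions of Sec. 2.4.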