Trap of Feature Diversity in the Learning of MLPs

12/02/2021
by   Dongrui Liu, et al.
Shanghai Jiao Tong University

In this paper, we discover a two-phase phenomenon in the learning of multi-layer perceptrons (MLPs): in the first phase, the training loss does not decrease significantly, yet the similarity of features between different samples keeps increasing, which hurts feature diversity. We explain this two-phase phenomenon in terms of the learning dynamics of the MLP. Furthermore, we propose two normalization operations that eliminate the two-phase phenomenon, which avoids the decrease of feature diversity and speeds up the training process.



1 Introduction

Deep neural networks (DNNs) have achieved significant success in various tasks. However, the essential reasons for the superior performance of DNNs have not been fully investigated, and many phenomena have not been well explained. Therefore, many methods have been proposed to explain DNNs, e.g., investigating the lottery ticket hypothesis [15], explaining the double descent phenomenon [21, 35, 17], understanding the information bottleneck hypothesis [45, 41], exploring gradient noise and regularization [42, 32], and analyzing nonlinear learning dynamics [40, 25, 37].

In this paper, we focus on the learning dynamics of the MLP. In general, the training process of the MLP can be considered to have two phases. In the first phase, the learning process does not find a clear optimization direction, and the training loss does not decrease. In the second phase, the training loss suddenly begins to decrease, as shown in Figure 1. The first phase may be very short or even unobservable for simple tasks, but it is long for complex tasks.

Figure 1: The training process of a 9-layer MLP exhibits a two-phase phenomenon on the CIFAR-10 dataset. In the first phase, the training loss does not decrease significantly, and the average cosine similarity between intermediate features of different categories keeps increasing. We aim to explain the increase of feature similarity.

More specifically, this study investigates the learning dynamics of the MLP in the first phase. We find an interesting phenomenon, i.e., features of different categories become more and more similar to each other in the first phase, as Figure 2(a) shows. The feature diversity keeps decreasing until the second phase. More crucially, we find that this phenomenon is widely shared by MLPs with different depths and widths, different activation functions, and different learning rates, trained on different datasets. In fact, the investigation of feature diversity is of great value.

[38], [6], and [33] have pointed out that the decrease of feature diversity may hurt the classification performance of DNNs.

In this paper, we aim to investigate why and how the feature diversity decreases during the early training process of the MLP. To this end, we explain the dynamics of features and parameters in intermediate layers of the MLP. We find that both the features and parameters of the MLP are mainly optimized towards a special direction, which is called the primary direction. Furthermore, the tendency of the optimization towards the primary direction can be boosted, just like a self-enhanced system, which decreases the diversity of features of different categories.

Besides the theoretical explanation of the decrease of feature diversity, we further propose two normalization operations to eliminate the two-phase phenomenon, in order to avoid the decrease of feature diversity.

Contributions of this study can be summarized as follows. (1) We discover the decrease of feature diversity in early iterations of learning the MLP. (2) We theoretically explain this phenomenon in terms of learning dynamics. (3) We further propose two normalization operations to avoid the decrease of feature diversity.

2 Related Work

The representation capacity of DNNs. Analyzing the representation capacity of DNNs is an important direction to explain DNNs. The information bottleneck theory [45, 41] quantitatively explained the information encoded by features in intermediate layers of DNNs. [46], [1], and [8] used the information bottleneck theory to evaluate and improve the DNN's representation capacity. [5] analyzed the representation capacity of DNNs with real training data and noise. In addition, several metrics were proposed to measure the generalization capacity or robustness of DNNs, including the stiffness [14], sensitivity metrics [36], Fourier analysis [48], and the CLEVER score [44]. Some studies focused on proving generalization bounds of DNNs [30, 29].

In comparison, we explain the MLP from a new perspective, i.e. we explain the decrease of feature diversity in early iterations of the MLP.

The learning dynamics of DNNs. Analyzing the learning dynamics is another perspective to understand DNNs. Many studies analyzed local minima in the optimization landscape of linear networks [7, 40, 16, 12] and nonlinear networks [9, 23, 39]. Some studies discussed the convergence rate of gradient descent on separable data [43, 47, 34]. [18] and [22] have investigated the effects of the batch size and the learning rate on SGD dynamics. In addition, some studies analyzed the dynamics of gradient descent in the overparameterization regime [4, 20, 28, 13].

In comparison, we analyze the learning dynamics of features and weights of the MLP, in order to explain the decrease of feature diversity in the first phase.

Figure 2: We trained a 9-layer MLP on the CIFAR-10 dataset for 300 epochs, where each layer of the MLP had 512 neurons. (a) Cosine similarity of features between samples in different categories kept increasing in the first phase. Low cosine similarity indicates high feature diversity. (b) Cosine similarity of gradients w.r.t. features between different samples of a category kept increasing in the first phase. (c) Cosine similarity of weight changes between pseudo-neurons in a layer kept increasing. We reported the value averaged over different samples of a category. (d) The curve of the testing error and the curve of the training loss during the training process.
Figure 3: We trained MLPs on the MNIST dataset, each MLP being designed with different depths and widths. (left) The cosine similarity of features between different training samples. (right) The cosine similarity of gradients w.r.t. features between samples in a category. We computed the epoch-wise similarity curve for each MLP. Please see Appendix Section A for more discussion.

3 Algorithm

3.1 The decrease of feature diversity

In this paper, we aim to explain an interesting phenomenon that arises in the early iterations of training an MLP. Specifically, the training process of the MLP can usually be divided into the following two phases according to the training loss. As Figure 2(d) shows, in the first phase the training loss does not decrease significantly, and in the second phase the training loss suddenly begins to decrease.

We discover an interesting phenomenon in the first phase, i.e. both the diversity of features and the diversity of gradients w.r.t. features in intermediate layers over different samples keep decreasing.

As Figures 2(a) and 2(b) show, before the 130-th epoch (the first phase), both the feature diversity and the gradient diversity keep decreasing, i.e., both the cosine similarity between features over different samples and the cosine similarity between gradients keep increasing. After the 130-th epoch (the second phase), the feature diversity and the gradient diversity suddenly begin to increase, i.e., their similarities begin to decrease. Therefore, the MLP has the lowest feature diversity and the lowest gradient diversity at around the 130-th epoch.
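As a concrete illustration of how these two similarity curves can be measured, the sketch below computes the average cosine similarity of intermediate features between samples of two different categories, and of gradients w.r.t. those features within one category. It is only a minimal sketch: the architecture, the chosen ReLU layer, and the helper names (`mlp`, `forward_with_feature`, `mean_pairwise_cos`) are our own illustrative assumptions, not the authors' released code, and random tensors stand in for real CIFAR-10 batches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(depth=9, width=512, in_dim=3 * 32 * 32, n_classes=10):
    # Plain MLP as assumed in the paper: (Linear + ReLU) repeated, then a final Linear layer.
    layers, dims = [], [in_dim] + [width] * (depth - 1)
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    layers.append(nn.Linear(width, n_classes))
    return nn.Sequential(*layers)

def forward_with_feature(net, x, relu_idx):
    """Forward pass that also returns the output of the relu_idx-th ReLU layer."""
    h, feat, count = x.flatten(1), None, 0
    for module in net:
        h = module(h)
        if isinstance(module, nn.ReLU):
            count += 1
            if count == relu_idx:
                feat = h
    return feat, h                                   # intermediate feature, logits

def mean_pairwise_cos(a, b, exclude_diag=False):
    """Average cosine similarity between rows of a and rows of b."""
    sim = F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()
    if exclude_diag:
        sim = sim - torch.diag(torch.diag(sim))      # drop the trivial self-similarities
        return (sim.sum() / (sim.numel() - len(sim))).item()
    return sim.mean().item()

net = mlp()
x_cat = torch.randn(64, 3, 32, 32)                   # stand-ins for samples of one category
x_truck = torch.randn(64, 3, 32, 32)                 # stand-ins for samples of another category
y_cat = torch.full((64,), 3, dtype=torch.long)

# (a) Feature similarity between samples of two different categories at the 3rd ReLU layer.
with torch.no_grad():
    f_cat, _ = forward_with_feature(net, x_cat, 3)
    f_truck, _ = forward_with_feature(net, x_truck, 3)
print("feature cos-sim across categories:", mean_pairwise_cos(f_cat, f_truck))

# (b) Similarity of gradients w.r.t. the intermediate feature within one category.
feat, logits = forward_with_feature(net, x_cat, 3)
feat.retain_grad()                                   # keep the gradient on the non-leaf feature
F.cross_entropy(logits, y_cat).backward()
print("gradient cos-sim within a category:",
      mean_pairwise_cos(feat.grad, feat.grad, exclude_diag=True))
```

Tracking these two numbers at the end of every epoch reproduces the kind of curves shown in Figure 2(a) and 2(b).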

Crucially, this phenomenon is widely shared by MLPs with different architectures for different tasks.

Specifically, we train MLPs with different activation functions (including ReLU and Leaky ReLU [31] with slopes of 0.01 and 0.1), different learning rates (0.1 and 0.01), and different batch sizes (100 and 500). These MLPs are trained on the MNIST dataset [27], the CIFAR-10 dataset [24], and the first fifty categories of the Tiny ImageNet dataset [26], respectively. Please see Appendix Section A for details. Figure 3 shows that in the first phase, both the feature diversity and the gradient diversity decrease in all MLPs.

Connection to the epoch-wise double descent. The above two-phase phenomenon is closely related to the epoch-wise double descent behavior, which was investigated by [35] and [17]. The epoch-wise double descent behavior has three stages during the training process of a DNN. The testing error decreases in the first stage, then increases in the second stage, and finally continues to decrease in the third stage. As Figure 2(d) shows, the first and the second stages in the epoch-wise double descent behavior are temporally aligned with the first phase in the aforementioned two-phase phenomenon, where the training loss does not change significantly. Please see Appendix Section B for more discussion about the epoch-wise double descent behavior. Instead of explaining the epoch-wise double descent behavior, in this paper, we mainly explain the decrease of feature diversity in the first phase.

3.2 Modeling the decrease of feature diversity

Notation. We consider an MLP with stacked linear layers, each followed by a ReLU layer, except for the last linear layer before the softmax operation. Without loss of generality, we mainly focus on the learning dynamics of gradient descent. Let $W^{(l)}_t$ denote the weight matrix of the $l$-th linear layer, learned after the $t$-th iteration, whose rows correspond to the neurons of this layer. Given an input sample $x$, the layer-wise forward propagation of the $l$-th layer is represented as $h^{(l)}_t(x) = \Sigma^{(l)}_t(x)\, W^{(l)}_t\, h^{(l-1)}_t(x)$, where $\Sigma^{(l)}_t(x)$ denotes the ReLU layer and $h^{(l)}_t(x)$ denotes the output feature of the $l$-th ReLU layer in the MLP after the $t$-th iteration. $\Sigma^{(l)}_t(x)$ is a diagonal matrix, where each element on the main diagonal represents the gating state of the corresponding neuron in the ReLU layer. Accordingly, gradients of the loss w.r.t. the output features of the $l$-th ReLU layer are given as $g^{(l)}_t(x) = \partial \mathrm{Loss}(x) / \partial h^{(l)}_t(x)$.
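To make the notation concrete, the following sketch checks, for a single sample, that the layer-wise forward propagation written with the diagonal gating matrix reproduces the usual Linear-then-ReLU computation. The tensor names mirror the symbols introduced above and are our own; this is an illustrative sketch rather than the paper's code.

```python
import torch

torch.manual_seed(0)
d_in, d_out = 512, 512
W = torch.randn(d_out, d_in) / d_in ** 0.5     # weight matrix W^(l) of the l-th linear layer
h_prev = torch.relu(torch.randn(d_in))         # stand-in for the previous feature h^(l-1)(x)

pre_act = W @ h_prev                           # pre-activation of the l-th linear layer
gates = (pre_act > 0).float()                  # gating states (0 or 1) of the ReLU layer
Sigma = torch.diag(gates)                      # diagonal gating matrix Sigma^(l)(x)

h = Sigma @ W @ h_prev                         # h^(l)(x) = Sigma^(l)(x) W^(l) h^(l-1)(x)
assert torch.allclose(h, torch.relu(pre_act))  # identical to the usual ReLU(W h) computation
print(h.shape, int(gates.sum()), "neurons active")
```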

In Section 3.1, we have demonstrated the typical phenomena in the first phase of the learning process, i.e. both the similarity between features and the similarity between gradients w.r.t. features keep increasing. In this section, we mainly explore the reason for the increasing similarity between gradients w.r.t. features. Then, based on the analysis of the increasing similarity between gradients, we can explain the increasing similarity between features.

Basic idea. The theoretical explanation of the increasing similarity of gradients w.r.t. features is inspired by the following observation, i.e. weights of different neurons in the same layer are mainly changed towards a common direction.

Specifically, we disentangle the component of weight changes along the common direction. We prove that weight changes along the common direction are enhanced in the first phase. In other words, different neurons are more and more likely to be pushed towards the same direction. The enhanced common direction can explain both the increasing similarity between gradients w.r.t. features and the increasing similarity between features.

3.2.1 Observations: assuming and disentangling the common direction

We find that the increasing similarity of feature gradients in the first phase can be explained as the increasing similarity between neurons, i.e., as the phenomenon that weights of different neurons are optimized towards a common direction. Given an input sample $x$, we rewrite the back propagation in the $l$-th linear layer as follows:

$g^{(l-1)}_t(x) = \big(W^{(l)}_t\big)^{\top}\, \Sigma^{(l)}_t(x)\, g^{(l)}_t(x) = \widetilde{W}^{(l)}_t(x)\, g^{(l)}_t(x).$   (1)

In this way, we can consider gradients computed in the back propagation as the result of a pseudo-forward propagation. In this pseudo-forward propagation, the input is the gradient $g^{(l)}_t(x)$ w.r.t. features of the $l$-th layer, and the output is the gradient $g^{(l-1)}_t(x)$. Accordingly, the equivalent weight matrix $\widetilde{W}^{(l)}_t(x) = (W^{(l)}_t)^{\top}\, \Sigma^{(l)}_t(x)$ can be considered as consisting of pseudo-neurons, i.e., its rows.

Therefore, the increasing similarity between feature gradients over different samples can be explained as the increase of the similarity between pseudo-neurons in the weight matrix $\widetilde{W}^{(l)}_t(x)$. The increasing similarity between weight changes of pseudo-neurons in the pseudo-forward propagation is illustrated in Figure 2(c). Note that the initial weights of pseudo-neurons are approximately orthogonal to each other according to the law of large numbers. Thus, Figure 2(c) shows that the weight changes of most pairs of pseudo-neurons point in similar directions, which makes the cosine similarity between weights of different pseudo-neurons in the same layer keep increasing across different iterations in the first phase.
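The sketch below illustrates this pseudo-forward view, assuming the reconstruction of Equation (1) given above: the gradient that backpropagation passes to layer $l-1$ equals the matrix $W^{\top}\Sigma$ applied to the gradient at layer $l$, so the rows of $W^{\top}\Sigma$ act as pseudo-neurons whose weight changes can be compared by cosine similarity, as in Figure 2(c). The random stand-ins for the upstream gradient and for the weight change are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 512
W = torch.randn(d, d) / d ** 0.5              # weight matrix W^(l)
h_prev = torch.randn(d, requires_grad=True)   # feature h^(l-1)(x)
g_upper = torch.randn(d)                      # stand-in for the gradient g^(l)(x) from the layer above

h = torch.relu(W @ h_prev)                    # forward through the l-th layer
h.backward(g_upper)                           # autograd backpropagation to h^(l-1)(x)

# Pseudo-forward propagation of Equation (1): g^(l-1) = W^T Sigma g^(l).
Sigma = torch.diag((W @ h_prev > 0).float())
W_pseudo = W.t() @ Sigma                      # rows of W_pseudo act as pseudo-neurons
assert torch.allclose(h_prev.grad, W_pseudo @ g_upper, atol=1e-5)

# The quantity tracked in Figure 2(c): pairwise cosine similarity between the weight
# changes of pseudo-neurons (here a random stand-in replaces the real change of W_pseudo).
dW_pseudo = 1e-3 * torch.randn_like(W_pseudo)
rows = F.normalize(dW_pseudo, dim=1)
cos = (rows @ rows.t()).fill_diagonal_(0.0)
print("mean cos-sim between pseudo-neuron weight changes:", cos.sum().item() / (d * (d - 1)))
```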

Disentangling the weight change along the common direction as the basis of the further proof. Based on the above analysis, we can roughly assume that weights of different pseudo-neurons are changed along a common direction $v^{(l)}$. Thus, we aim to formulate the common direction $v^{(l)}$ and disentangle the component of weight changes along it. This serves as the basis of the further proof of the increasing similarity between feature gradients in the first phase.

Given a training sample $x$ after the $t$-th iteration, let $\Delta\widetilde{W}^{(l)}_t(x)$ denote the weight changes of pseudo-neurons in the $l$-th layer. According to the above observation, we can decompose $\Delta\widetilde{W}^{(l)}_t(x)$ into the component along the common direction $v^{(l)}$ and the component in other directions:

$\Delta\widetilde{W}^{(l)}_t(x) = a^{(l)}_t(x)\,(v^{(l)})^{\top} + \epsilon^{(l)}_t(x),$   (2)

where the $i$-th element of $a^{(l)}_t(x)$ denotes the weight change of the $i$-th pseudo-neuron (the $i$-th row of $\Delta\widetilde{W}^{(l)}_t(x)$) along the common direction $v^{(l)}$, and $\epsilon^{(l)}_t(x)$ is relatively small and is called the “noise” term. We can estimate the common direction $v^{(l)}$ by minimizing the noise term over different samples through different iterations as follows:

$v^{(l)} = \arg\min_{\|v\|=1} \sum_{t}\sum_{x} \min_{a} \big\|\Delta\widetilde{W}^{(l)}_t(x) - a\,v^{\top}\big\|_F^2.$   (3)
Lemma 1.

(Proof in Appendix Section C) For the decomposition in Equation (2), given $\Delta\widetilde{W}^{(l)}_t(x)$ and a unit common direction $v^{(l)}$, the noise term $\|\epsilon^{(l)}_t(x)\|_F^2$ reaches its minimum if and only if $a^{(l)}_t(x) = \Delta\widetilde{W}^{(l)}_t(x)\,v^{(l)}$, in which case $\epsilon^{(l)}_t(x) = \Delta\widetilde{W}^{(l)}_t(x)\big(I - v^{(l)}(v^{(l)})^{\top}\big)$.

Lemma 1 indicates that, given the weight change $\Delta\widetilde{W}^{(l)}_t(x)$ and the common direction $v^{(l)}$, we can compute $a^{(l)}_t(x)$ and $\epsilon^{(l)}_t(x)$ explicitly, and thereby obtain the decomposition in Equation (2).

Experiments to illustrate the significance of the common direction. Here, we illustrate that the weight change along the common direction is much greater than the weight change along other directions. Given the overall weight change $\Delta\widetilde{W}^{(l)}_t$ made by all training samples after the $t$-th iteration, we decompose it into components along five common directions, where the first direction $v^{(l),1} = v^{(l)}$ is termed the primary common direction. $v^{(l),2}$ represents the secondary common direction, which is the most common direction after we remove the component of the weight change along $v^{(l),1}$ from $\Delta\widetilde{W}^{(l)}_t$. Then, we compute $v^{(l),2}$ by applying Equation (3) to the residual weight change. Please see Appendix Section D for more discussion about the procedure of the decomposition. Similarly, $v^{(l),3}$, $v^{(l),4}$, and $v^{(l),5}$ represent the third, fourth, and fifth common directions, respectively. Theoretically, $v^{(l),1}, \dots, v^{(l),5}$ are orthogonal to each other. In this way, the norm of the component along the $k$-th common direction measures the strength of weight changes along that direction.
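A minimal sketch of one way to carry out such a decomposition, under the assumption that each common direction is the least-squares fit of Equation (3) applied to the residual weight change, which makes it the top right singular vector of that residual; the paper's exact procedure is given in Appendix Section D, and the synthetic weight-change matrix below is only a placeholder.

```python
import torch

torch.manual_seed(0)
n_neurons, dim, k = 512, 512, 5

# Stand-in for the overall change of the pseudo-weight matrix of one layer in one iteration;
# in the paper this is measured on a trained MLP, here it is synthetic for illustration only.
delta_W = torch.randn(n_neurons, dim) @ torch.diag(torch.logspace(0, -2, dim))

strengths, residual = [], delta_W.clone()
for i in range(k):
    # Least-squares estimate of the i-th common direction (cf. Equation (3)): the top right
    # singular vector of the residual after removing all previously found components.
    _, _, Vh = torch.linalg.svd(residual, full_matrices=False)
    v = Vh[0]                                    # unit common direction
    coeff = residual @ v                         # per-pseudo-neuron change along v
    strengths.append(coeff.norm().item())        # strength of this component
    residual = residual - torch.outer(coeff, v)  # remove the component before the next step

print("strengths along the 1st..5th common directions:", [round(s, 1) for s in strengths])
```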

Specifically, in experiments, we trained 9-layer MLPs on the MNIST dataset [27], the CIFAR-10 dataset [24], and the Tiny ImageNet dataset [26], respectively. Each layer of the MLP had 512 neurons. Figure 4 shows that the strength of the component of weight changes along the primary direction was approximately ten times greater than the strength of the component along the secondary direction.

Figure 4: The strength of different common directions in the (a) CIFAR-10 dataset, (b) MNIST dataset, and (c) Tiny ImageNet dataset. We illustrated results on the two categories with the highest training accuracies. The strength of the primary direction was much greater than the strength of other directions.

3.2.2 Proof of the decrease of feature diversity

In the above analysis, we have illustrated, assumed, and disentangled the common direction of weight changes. Based on this common direction, we aim to prove the decrease of feature diversity in the first phase. The proof consists of the following three steps.
(1) We formulate common directions by investigating the learning dynamics of the weight change made by each specific training sample.
(2) Based on the learning dynamics made by a training sample, we prove that features and weight changes can enhance the significance of each other.
(3) Based on this, we can explain the increase of both the feature similarity and the gradient similarity.

CIFAR-10

Category Cat Truck
() Layer 2 Layer 3 Layer 4 Layer 5 Layer 6 Layer 2 Layer 3 Layer 4 Layer 5 Layer 6

MNIST

Category Eight Zero
() Layer 2 Layer 3 Layer 4 Layer 5 Layer 6 Layer 2 Layer 3 Layer 4 Layer 5 Layer 6

Tiny ImageNet

Category Flagpole Bottle
() Layer 2 Layer 3 Layer 4 Layer 5 Layer 6 Layer 2 Layer 3 Layer 4 Layer 5 Layer 6
Table 1: Strength of components of weight changes along the common direction and along other directions. We trained 9-layer MLPs on the CIFAR-10 dataset, the MNIST dataset, and the Tiny ImageNet dataset, respectively. Each layer of the MLP had 512 neurons. The strength of the primary common direction was much greater than those of other directions. Appendix Section F discusses the reason why some of these strengths do not decrease monotonically.
Lemma 2.

(Proof in Appendix Section E) For the decomposition , given and , reaches its minimum if and only if , where = , and .

Lemma 2 reveals that given the weight matrix and the common direction , we can compute and explicitly, and obtain .

Theorem 1.

(Proof in Appendix Section F) The weight change made by a sample can be decomposed into terms after the -th iteration: , where , . , where the SVD of is given as , and denotes the -th singular value. and denote the -th column of the matrix and , respectively. Besides, we have .

Formulating the learning dynamics made by each specific training sample. Theorem 1 enables us to decompose common directions from the learning dynamics (weight changes) made by each specific training sample .

The primary term represents the component of weight changes along the common direction . The -th term represents the component of weight changes along the -th direction , which is orthogonal to the common direction .

Table 1 illustrates that the strength of the component of weight changes along the primary common direction is much more significant than strengths of components along other directions. To this end, we compute the average strength of weight changes along the common direction over all samples across different iterations . Similarly, the strength of weight changes along the -th direction is computed as .

Relationship between features and weight changes. Inspired by the above learning dynamics of weight changes made by each specific training sample, we discover a close relationship between features and weight changes. Specifically, according to Theorem 1, the change of weights made by a sample after the -th iteration can be computed as follows.

(4)

where , and . Please see Appendix Section F for the proof.

We discover that weight changes can be represented both as in Equation (4) and as in Equation (2). Note that the corresponding term in Equation (4) is highly related to the primary common direction in the $l$-th layer. Thus, both terms play dominant roles in the weight change. Therefore, we conjecture that the common direction is similar to the corresponding term in Equation (4), and that the feature is similar to the corresponding coefficient vector. Figure 5 verifies this conjecture, i.e., the feature is in a similar direction to the coefficient vector.

The proof of the self-enhanced common direction. Inspired by the above relationship between features and weight changes, we aim to prove that features and weights become more and more similar to each other in the first phase. Such proof can explain the self-enhancement of the common direction in the first phase.

Theorem 2.

(Proof in Appendix Section G) Given an input sample and a common direction after the -th iteration, if the maximum singular value of is small enough, we can obtain , where , and .

Figure 5: Cosine similarity between the feature and the vector in the first phase. We conducted experiments on 9-layer MLPs trained on the (a) CIFAR-10 dataset and (b) Tiny ImageNet dataset. The shade in each subfigure represents the standard deviation of the cosine similarity over different samples.

Figure 6: The change of in the first phase. We trained 9-layer MLPs on the CIFAR-10 (a), the MNIST (b), and the Tiny ImageNet (c). Each layer of the MLP had 512 neurons.

Theorem 2 describes two typical learning dynamics in the first phase, i.e., the feature is pushed towards either of the two coefficient vectors. In the following, we consider the change of features after the $t$-th iteration. In this paper, we roughly consider that the change of features is negatively related to the gradient w.r.t. features, although strictly speaking, the change of features is not exactly equal to the gradient w.r.t. features.

Theorem 2 shows the following two cases. In Case 1, if , then . In other words, if the coefficient vector is pushed towards the feature , then the feature is pushed towards the coefficient vector . In this way, we consider that the feature and the coefficient vector become more and more similar to each other in the first phase. Similarly, in Case 2, given another input training sample , if , then . In other words, if the coefficient vector is pushed towards the direction of the feature , then the feature is pushed towards the coefficient vector . In this way, the feature and the coefficient vector gradually become negatively related to each other.

Verification. We conducted experiments to verify the relationship between features and weights. To this end, we measured the change of the value in the first phase. Figure 6 reports the mean and the standard deviation of over different samples at each epoch. We repeatedly conducted experiments on the CIFAR-10 dataset, the MNIST dataset, and the Tiny ImageNet dataset, respectively, to measure the . For each sample , was always positive and usually increased over epochs, which verified Theorem 2. Besides, Theorem 2 is based on the assumption that the maximum singular value of is small enough. Experimental results in Appendix Section G verified that the maximum singular value of was usually small enough in the first phase.

Theorem 3.

(Proof in Appendix Section H) For each training sample belonging to the category , we have , and , where is constant for each category for the -th layer.

Theorem 3 indicates that training samples of the same category all push the vector towards the feature in the first phase, where is a constant shared by different samples of the same category . Figure 5 verifies that are consistently positive over different samples of the same category.

Assumption 1. We assume that the MLP encodes features of very few (one or two) categories in the first phase, while the other categories have not yet been learned in the first phase.

To verify this assumption, Figure 7 shows that only one or two categories exhibit much higher accuracies than random guessing at the end of the first phase. This indicates that the learning of the MLP is dominated by training samples of one or two categories.

The combination of Theorem 3 and Assumption 1 clarifies the overall learning dynamics in the first phase. In other words, the overall learning effects of all training samples can be roughly considered as pushing the vector towards the feature . is determined by the dominating category/categories. In addition, , which means that the change of the feature is similar to the vector . Thus, we can consider that the change of the feature is modified towards the feature . In this way, the learning process of the MLP works just like a self-enhanced system. Besides, the weight coefficient vector and the feature enhance each other.

Proof of the increasing feature similarity and the increasing gradient similarity. As aforementioned, features of different samples are consistently pushed towards the same vector. Therefore, the similarity between features of different samples increases in the first phase, which makes different training samples generate similar gating states in each ReLU layer. The increasing similarity between gating states of each ReLU layer over different samples leads to the increasing similarity between gradients w.r.t. features over different samples of the same category. Please see Appendix Section I for more discussion.
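The toy sketch below illustrates this chain of reasoning: when the features of two samples are more aligned, their ReLU gating states agree on more dimensions, and the backpropagated gradients w.r.t. the lower-layer features become more similar. The interpolation between features and the shared upstream gradient are our own simplifying assumptions, used only to make the mechanism visible.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 512
W = torch.randn(d, d) / d ** 0.5
g_upper = torch.randn(d)                       # shared upstream gradient for both samples (simplification)

h_a = torch.randn(d)                           # feature of sample a at layer l-1
h_b_far = torch.randn(d)                       # dissimilar feature of sample b
h_b_near = 0.9 * h_a + 0.1 * h_b_far           # feature of sample b pushed towards h_a

def gate_and_grad(h):
    gates = (W @ h > 0).float()                # ReLU gating states for this sample
    g_prev = W.t() @ (gates * g_upper)         # gradient w.r.t. the layer-(l-1) feature
    return gates, g_prev

for name, h_b in [("dissimilar features", h_b_far), ("similar features", h_b_near)]:
    gates_a, g_a = gate_and_grad(h_a)
    gates_b, g_b = gate_and_grad(h_b)
    agree = (gates_a == gates_b).float().mean().item()
    cos = F.cosine_similarity(g_a, g_b, dim=0).item()
    print(f"{name}: gating agreement {agree:.2f}, gradient cos-sim {cos:.2f}")
```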

How to escape from the first phase? In the first phase, the MLP only discovers a single direction to optimize one or two categories. However, the optimization of these one or two categories soon saturates, and the gradient then mainly comes from training samples of other categories, which destroys the dominating role of the one or two categories in the learning of the MLP. Therefore, the learning effects of training samples from different categories may conflict with each other. Thus, the self-enhanced system is destroyed, and the learning of the MLP enters the second phase.

Figure 7: The training accuracy of MLPs on the CIFAR-10 dataset and the MNIST dataset. The accuracy was evaluated at the end of the first phase. The MLP only learned features of one or two categories in the first phase.

3.3 Eliminating the decrease of feature diversity

Figure 8: The training loss with and without normalization operations of 9-layer MLPs trained on the CIFAR-10, the MNIST, and the Tiny ImageNet dataset, respectively.
Figure 9: The relative strength of weight changes along other directions w.r.t. the primary direction. We compared the relative strength between MLPs with and without normalization operations. The experiment was conducted on the cat category and the truck category in the CIFAR-10 dataset.
Figure 10: Cosine similarity of features between samples in different categories. We compared the cosine similarity between MLPs with and without normalization operations on the CIFAR-10 dataset.

In this section, we aim to eliminate the decrease of feature diversity in the first phase, in order to speed up the training process. The analysis in the above section shows that the similarity between features of different samples increases in the first phase. To this end, we propose to use two normalization operations to eliminate the decrease of feature diversity in the first phase.

We are given the output feature $f^{(l)}(x)$ of the $l$-th linear layer w.r.t. the input sample $x$, where $f^{(l)}_i(x)$ denotes the $i$-th dimension of the feature. The first normalization operation is given as $\hat f^{(l)}_i(x) = \big(f^{(l)}_i(x) - \mu_i\big)/\sigma_i$, where $\mu_i$ and $\sigma_i$ denote the mean value and the standard deviation of $f^{(l)}_i(x)$ over different samples, respectively. This operation is similar to batch normalization [19], but we do not compute the scaling and shifting parameters used in batch normalization. The second normalization operation is batch normalization itself. The above two normalization operations can prevent features of different samples from becoming similar to each other, because the mean feature is subtracted from the features of all samples.
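Below is a minimal sketch of the first normalization operation and of how it can be inserted after each linear layer except the last one, assuming it amounts to per-dimension standardization over the batch without the learnable scale and shift of batch normalization (roughly `nn.BatchNorm1d(width, affine=False)` in PyTorch); the module and function names are ours, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MeanStdNorm(nn.Module):
    """First normalization: subtract the batch mean and divide by the batch std,
    without the learnable scaling/shifting parameters of batch normalization.
    (Simplification: batch statistics are also used at test time.)"""
    def __init__(self, eps=1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, f):                        # f: (batch, width), output of a linear layer
        mu = f.mean(dim=0, keepdim=True)         # mean over samples, per dimension
        sigma = f.std(dim=0, keepdim=True)       # std over samples, per dimension
        return (f - mu) / (sigma + self.eps)

def mlp_with_norm(depth=9, width=512, in_dim=3 * 32 * 32, n_classes=10, norm="meanstd"):
    layers, dims = [], [in_dim] + [width] * (depth - 1)
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers.append(nn.Linear(d_in, d_out))
        # Normalization after every linear layer except the last one before the softmax.
        layers.append(MeanStdNorm() if norm == "meanstd" else nn.BatchNorm1d(d_out))
        layers.append(nn.ReLU())
    layers.append(nn.Linear(width, n_classes))
    return nn.Sequential(*layers)

net = mlp_with_norm(norm="meanstd")
logits = net(torch.randn(100, 3 * 32 * 32))      # batch of 100 flattened CIFAR-10-sized inputs
print(logits.shape)                              # torch.Size([100, 10])
```

Passing `norm="bn"` (or any value other than `"meanstd"`) in this sketch corresponds to the second operation, i.e., standard batch normalization.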

In order to verify that the two normalization operations can eliminate the two-phase phenomenon during the training process of the MLP, we trained 9-layer MLPs with and without normalization operations. Specifically, for the first normalization operation, we added it after each linear layer, except the last linear layer. Similarly, for the MLP with the batch normalization operation, we added a batch normalization layer after each linear layer. Each linear layer in the MLP had 512 neurons. Figure 8 shows that the two normalization operations successfully eliminated the two-phase phenomenon in the learning of MLPs and sped up the training process.

Figure 9 shows that in MLPs with normalization operations, weight changes along other common directions were stronger than those in the MLP without normalization operations. Here, we measured the relative strength of weight changes along the $k$-th common direction w.r.t. the strength of weight changes along the primary common direction. Furthermore, Figure 10 shows that the feature similarity in MLPs with normalization operations kept decreasing, while the feature similarity of the MLP without normalization operations kept increasing. This indicates that the normalization operations avoided the decrease of feature diversity.

4 Conclusion

In this paper, we find that in the early stage of the training process, the MLP exhibits a two-phase phenomenon, and the feature diversity keeps decreasing in the first phase. We formulate and explain this phenomenon by analyzing the learning dynamics of the MLP. Furthermore, we propose to use two normalization operations to eliminate the above two-phase phenomenon and speed up the training process.

References

  • [1] A. Achille and S. Soatto (2018) Information dropout: learning optimal representations through noisy computation. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2897–2905. Cited by: §2.
  • [2] B. Adlam and J. Pennington (2020) The neural tangent kernel in high dimensions: triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pp. 74–84. Cited by: Appendix B.
  • [3] M. S. Advani and A. M. Saxe (2017) High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667. Cited by: Appendix B.
  • [4] S. Arora, N. Cohen, and E. Hazan (2018) On the optimization of deep networks: implicit acceleration by overparameterization. In International Conference on Machine Learning, pp. 244–253. Cited by: §2.
  • [5] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. (2017) A closer look at memorization in deep networks. In International Conference on Machine Learning, pp. 233–242. Cited by: §2.
  • [6] B. O. Ayinde, T. Inanc, and J. M. Zurada (2019) Regularizing deep neural networks by enhancing diversity in feature extraction. IEEE transactions on neural networks and learning systems 30 (9), pp. 2650–2661. Cited by: §1.
  • [7] P. Baldi and K. Hornik (1989) Neural networks and principal component analysis: learning from examples without local minima. Neural networks 2 (1), pp. 53–58. Cited by: §2.
  • [8] H. Cheng, D. Lian, S. Gao, and Y. Geng (2018) Evaluating capability of deep neural networks for image classification via information plane. In European Conference on Computer Vision, pp. 181–195. Cited by: §2.
  • [9] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun (2015) The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pp. 192–204. Cited by: §2.
  • [10] S. d’Ascoli, M. Refinetti, G. Biroli, and F. Krzakala (2020) Double trouble in double descent: bias and variance(s) in the lazy regime. In International Conference on Machine Learning, pp. 2280–2290. Cited by: Appendix B.
  • [11] S. d’Ascoli, L. Sagun, and G. Biroli (2020) Triple descent and the two kinds of overfitting: where & why do they appear?. In NeurIPS, Cited by: Appendix B.
  • [12] A. Daniely, R. Frostig, and Y. Singer (2016) Toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity. Advances In Neural Information Processing Systems 29, pp. 2253–2261. Cited by: §2.
  • [13] S. S. Du, X. Zhai, B. Poczos, and A. Singh (2018) Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, Cited by: §2.
  • [14] S. Fort, P. K. Nowak, S. Jastrzebski, and S. Narayanan (2019) Stiffness: a new perspective on generalization in neural networks. arXiv preprint arXiv:1901.09491. Cited by: §2.
  • [15] J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations, Cited by: §1.
  • [16] M. Hardt and T. Ma (2016) Identity matters in deep learning. arXiv preprint arXiv:1611.04231. Cited by: §2.
  • [17] R. Heckel and F. F. Yilmaz (2020) Early stopping in deep networks: double descent and how to eliminate it. In International Conference on Learning Representations, Cited by: Appendix B, §1, §3.1.
  • [18] E. Hoffer, I. Hubara, and D. Soudry (2017) Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1729–1739. Cited by: §2.
  • [19] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. Cited by: §3.3.
  • [20] A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572. Cited by: §2.
  • [21] A. Jacot, B. Simsek, F. Spadaro, C. Hongler, and F. Gabriel (2020) Implicit regularization of random feature models. In International Conference on Machine Learning, pp. 4631–4640. Cited by: Appendix B, §1.
  • [22] S. Jastrzębski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey (2017) Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623. Cited by: §2.
  • [23] K. Kawaguchi (2016) Deep learning without poor local minima. Advances in Neural Information Processing Systems 29, pp. 586–594. Cited by: §2.
  • [24] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §A.1, §3.1, §3.2.1.
  • [25] A. K. Lampinen and S. Ganguli (2018) An analytic theory of generalization dynamics and transfer learning in deep linear networks. In International Conference on Learning Representations, Cited by: §1.
  • [26] Y. Le and X. Yang (2015) Tiny imagenet visual recognition challenge. CS 231N 7, pp. 7. Cited by: §A.3, §3.1, §3.2.1.
  • [27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §A.2, §3.1, §3.2.1.
  • [28] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2018) Deep neural networks as gaussian processes. In International Conference on Learning Representations, Cited by: §2.
  • [29] X. Li, J. Lu, Z. Wang, J. Haupt, and T. Zhao (2018) On tighter generalization bound for deep neural networks: cnns, resnets, and beyond. arXiv preprint arXiv:1806.05159. Cited by: §2.
  • [30] P. M. Long and H. Sedghi (2019) Generalization bounds for deep convolutional neural networks. In International Conference on Learning Representations, Cited by: §2.
  • [31] A. L. Maas, A. Y. Hannun, A. Y. Ng, et al. (2013) Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, Vol. 30, pp. 3. Cited by: §3.1.
  • [32] M. Mahoney and C. Martin (2019) Traditional and heavy tailed self regularization in neural network models. In International Conference on Machine Learning, pp. 4284–4293. Cited by: §1.
  • [33] J. Mellor, J. Turner, A. Storkey, and E. J. Crowley (2021) Neural architecture search without training. In International Conference on Machine Learning, pp. 7588–7598. Cited by: §1.
  • [34] M. S. Nacson, J. Lee, S. Gunasekar, P. H. P. Savarese, N. Srebro, and D. Soudry (2019) Convergence of gradient descent on separable data. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3420–3428. Cited by: §2.
  • [35] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever (2019) Deep double descent: where bigger models and more data hurt. In International Conference on Learning Representations, Cited by: Appendix B, §1, §3.1.
  • [36] R. Novak, Y. Bahri, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein (2018) Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, Cited by: §2.
  • [37] J. Pennington, S. S. Schoenholz, and S. Ganguli (2017) Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4788–4798. Cited by: §1.
  • [38] P. Rodríguez, J. Gonzalez, G. Cucurull, J. M. Gonfaus, and X. Roca (2016) Regularizing cnns with locally constrained decorrelations. arXiv preprint arXiv:1611.01967. Cited by: §1.
  • [39] I. Safran and O. Shamir (2018) Spurious local minima are common in two-layer relu neural networks. In International Conference on Machine Learning, pp. 4433–4441. Cited by: §2.
  • [40] A. M. Saxe, J. L. McClelland, and S. Ganguli (2013) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120. Cited by: §1, §2.
  • [41] R. Shwartz-Ziv and N. Tishby (2017) Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810. Cited by: §1, §2.
  • [42] U. Simsekli, L. Sagun, and M. Gurbuzbalaban (2019) A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pp. 5827–5837. Cited by: §1.
  • [43] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro (2018) The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research 19 (1), pp. 2822–2878. Cited by: §2.
  • [44] T. Weng, H. Zhang, P. Chen, J. Yi, D. Su, Y. Gao, C. Hsieh, and L. Daniel (2018) Evaluating the robustness of neural networks: an extreme value theory approach. In International Conference on Learning Representations, Cited by: §2.
  • [45] Wolchover (2017) New theory cracks open the black box of deep learning. Quanta Magazine 3. Cited by: §1, §2.
  • [46] A. Xu and M. Raginsky (2017) Information-theoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems 2017, pp. 2525–2534. Cited by: §2.
  • [47] T. Xu, Y. Zhou, K. Ji, and Y. Liang (2018) Convergence of sgd in learning relu models with separable data. arXiv preprint arXiv:1806.04339. Cited by: §2.
  • [48] Z. J. Xu (2018) Understanding training and generalization in deep learning by fourier analysis. arXiv preprint arXiv:1808.04295. Cited by: §2.
  • [49] Z. Yang, Y. Yu, C. You, J. Steinhardt, and Y. Ma (2020) Rethinking bias-variance trade-off for generalization of neural networks. In International Conference on Machine Learning, pp. 10767–10777. Cited by: Appendix B.

Appendix A Common phenomenon shared by different architectures for different tasks.

In this section, we demonstrate an interesting phenomenon that arises in the early iterations of training an MLP. Specifically, the training process of the MLP can usually be divided into the following two phases according to the training loss. In the first phase, the training loss does not decrease significantly, and in the second phase the training loss suddenly begins to decrease. More crucially, this phenomenon is widely shared by MLPs with different architectures for different tasks.

a.1 On the CIFAR-10 dataset

In this subsection, we demonstrate that the two-phase phenomenon is shared by different MLPs on the CIFAR-10 dataset [24]. For different MLPs, we adopted learning rate , batch size , the SGD optimizer, and the ReLU activation function. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. Results of MLPs trained on the CIFAR-10 dataset are shown in Figure 1.
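For reference, a hedged sketch of such a training setup in PyTorch; the learning rate, batch size, and crop padding below are placeholders (the exact values are not recoverable from the text above), and the helper `mlp` is our own construction.

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Data augmentation described above: random cropping and random horizontal flipping.
transform = T.Compose([
    T.RandomCrop(32, padding=4),          # padding value is an assumption, not stated in the paper
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                         transform=transform)
loader = DataLoader(train_set, batch_size=100, shuffle=True)   # placeholder batch size

def mlp(depth, width, in_dim=3 * 32 * 32, n_classes=10):
    layers, dims = [], [in_dim] + [width] * (depth - 1)
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    layers.append(nn.Linear(width, n_classes))
    return nn.Sequential(*layers)

net = mlp(depth=9, width=512)
opt = torch.optim.SGD(net.parameters(), lr=0.01)               # placeholder learning rate
loss_fn = nn.CrossEntropyLoss()

for x, y in loader:                                            # one epoch of plain SGD training
    opt.zero_grad()
    loss = loss_fn(net(x.flatten(1)), y)                       # flatten images before the MLP
    loss.backward()
    opt.step()
```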

Figure 1: Results of different MLPs trained on the CIFAR-10 dataset. (a) The training loss of four MLPs. (b) The testing loss of four MLPs. (c) The training accuracies of four MLPs. (d) The testing accuracies of four MLPs. (e) Cosine similarity between features of different categories. The results were evaluated in the third linear layer of MLPs. (f) Cosine similarity between gradients of different samples in a category. The results were evaluated in the third linear layer of MLPs.

a.2 On the MNIST dataset

In this subsection, we demonstrate that the two-phase phenomenon is shared by different MLPs on the MNIST dataset [27]. For different MLPs, we adopted learning rate , batch size , the SGD optimizer, and the ReLU activation function. Results of MLPs trained on the MNIST dataset are shown in Figure 2.

Figure 2: Results of different MLPs trained on the MNIST dataset. (a) The training loss of four MLPs. (b) The testing loss of four MLPs. (c) The training accuracies of four MLPs. (d) The testing accuracies of four MLPs. (e) Cosine similarity between features of different categories. Results were evaluated in the third linear layer of MLPs. (f) Cosine similarity between gradients of different samples in a category. The results were evaluated in the third linear layer of MLPs.

a.3 On the Tiny ImageNet dataset

In this subsection, we demonstrate that the two-phase phenomenon is shared by different MLPs on the Tiny ImageNet dataset [26]. Specifically, we selected the following 50 categories for training: orangutan, parking meter, snorkel, American alligator, oboe, basketball, rocking chair, hopper, neck brace, candy store, broom, seashore, sewing machine, sunglasses, panda, pretzel, pig, volleyball, puma, alp, barbershop, ox, flagpole, lifeboat, teapot, walking stick, brain coral, slug, abacus, comic book, CD player, school bus, banister, bathtub, German shepherd, black stork, computer keyboard, tarantula, sock, Arabian camel, bee, cockroach, cannon, tractor, cardigan, suspension bridge, beer bottle, viaduct, guacamole, and iPod. For different MLPs, we adopted learning rate , batch size , the SGD optimizer, and the ReLU activation function. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. Note that we took random crops of size 32×32. Results of MLPs trained on the Tiny ImageNet dataset are shown in Figure 3.

Figure 3: Results of different MLPs trained on 50 categories of Tiny ImageNet dataset. (a) The training loss of three MLPs. (b) The testing loss of three MLPs. (c) The training accuracies of three MLPs. (d) The testing accuracies of three MLPs. (e) Cosine similarity between features of different categories. Results were evaluated in the second linear layer of MLPs. (f) Cosine similarity between gradients of different samples in a category. Results were evaluated in the second linear layer of MLPs.

a.4 Different training batch sizes

In this subsection, we demonstrate that the two-phase phenomenon is shared by MLPs trained on the CIFAR-10 dataset with different training batch sizes. For different MLPs, we adopted learning rate , the SGD optimizer, and the ReLU activation function. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. We trained three 7-layer MLPs with 256 neurons in each layer, using different batch sizes, respectively. Results of MLPs trained with different batch sizes are shown in Figure 4.

Figure 4: Results of different batch sizes trained on CIFAR-10 dataset. (a) The training loss of three MLPs. (b) The testing loss of three MLPs. (c) The training accuracies of three MLPs. (d) The testing accuracies of three MLPs. (e) Cosine similarity between features of different categories. Results were evaluated in the second linear layer of MLPs. (f) Cosine similarity between gradients of different samples in a category. Results were evaluated in the second linear layer of MLPs.

a.5 Different learning rates

In this subsection, we demonstrate that the two-phase phenomenon is shared by MLPs trained on the CIFAR-10 dataset with different learning rates. For different MLPs, we adopted batch size , the SGD optimizer, and the ReLU activation function. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. We trained three 7-layer MLPs with 256 neurons in each layer, using different learning rates, respectively. Results of MLPs trained with different learning rates are shown in Figure 5.

Figure 5: Results of different learning rates trained on CIFAR-10 dataset. (a) The training loss of two MLPs. (b) The testing loss of two MLPs. (c) The training accuracies of two MLPs. (d) The testing accuracies of two MLPs. (e) Cosine similarity between features of different categories. Results were evaluated in the second linear layer of MLPs. (f) Cosine similarity between gradients of different samples in a category. Results were evaluated in the second linear layer of MLPs.

a.6 Different activation functions

In this subsection, we demonstrate that the two-phase phenomenon is shared by MLPs trained on the CIFAR-10 dataset with different activation functions. For different MLPs, we adopted learning rate , batch size , and the SGD optimizer. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. We trained three 9-layer MLPs with 512 neurons in each layer, using ReLU, Leaky ReLU (slope = 0.1), and Leaky ReLU (slope = 0.01), respectively. Results of MLPs trained with different activation functions are shown in Figure 6.

Figure 6: Results of different activation layers trained on CIFAR-10 dataset. (a) The training loss of three MLPs. (b) The testing loss of three MLPs. (c) The training accuracies of three MLPs. (d) The testing accuracies of three MLPs. (e) Cosine similarity between features of different categories. Results were evaluated in the second linear layer of MLPs. (f) Cosine similarity between gradients of different samples in a category. Results were evaluated in the second linear layer of MLPs.

a.7 Discussion

In experiments, we notice that both the cosine similarity between features over different samples and the cosine similarity between gradients keep increasing in the first phase. Furthermore, we find that gradients w.r.t. features of different samples first become more and more similar to each other, and then features of different samples become more and more similar to each other. We consider this phenomenon reasonable. In other words, when gradients w.r.t. features of different samples become similar to each other, the features of different samples are pushed towards a specific direction. Therefore, features of different samples become more and more similar to each other.

Appendix B Double Descent

The model-wise double descent behavior has emerged in many deep learning tasks: as the model size increases, the testing error first decreases, then increases, and finally decreases again [3, 21, 49, 10]. Furthermore, some recent studies discussed the existence of a triple descent curve [11, 2]. Meanwhile, the double descent behavior also occurs with respect to training epochs [35, 17], called epoch-wise double descent, i.e., as the number of epochs increases, the testing error first decreases, then increases, and finally decreases.

Appendix C Proof of Lemma 1

In this section, we present the detailed proof for Lemma 1.

Lemma 1.

For the decomposition in Equation (2), given $\Delta\widetilde{W}^{(l)}_t(x)$ and a unit common direction $v^{(l)}$, the noise term $\|\epsilon^{(l)}_t(x)\|_F^2$ reaches its minimum if and only if $a^{(l)}_t(x) = \Delta\widetilde{W}^{(l)}_t(x)\,v^{(l)}$, in which case $\epsilon^{(l)}_t(x) = \Delta\widetilde{W}^{(l)}_t(x)\big(I - v^{(l)}(v^{(l)})^{\top}\big)$.

Proof. Let denote the -th column of the matrix . Given a sample , we can represent as follows:

(1)

where