1 Introduction
Deep neural networks (DNNs) have achieved significant success in various tasks. However, the essential reason for the superior performance of DNNs is not fully investigated, and many phenomena have not been well explained. Therefore, many methods have been proposed to explain DNNs,
e.g. investigating the phenomena of the lottery ticket hypothesis [15], explaining the phenomena of the double descent [21, 35, 17], understanding the information bottleneck hypothesis [45, 41], exploring the gradient noise and regularization [42, 32], and analyzing the nonlinear learning dynamics [40, 25, 37].In this paper, we focus on the learning dynamics of the MLP. In general, the training process of the MLP can be considered to have two phases. In the first phase, the learning process does not find a clear optimization direction, and the training loss does not decrease. In the second phase, the training loss suddenly begins to decrease, as shown in Figure 1. The first phase may be very short or cannot be observed for simple tasks, but the first phase is long for complex tasks.
More specifically, this study investigates the learning dynamics of the MLP in the first phase. We find an interesting phenomenon that features of different categories become more and more similar to each other in the first phase, as Figure 2
(a) shows. The feature diversity keeps decreasing until the second phase. More crucially, we find that this phenomenon is widely shared by MLPs with different depths and widths, different activation functions, different learning rates, trained on different datasets. In fact, the investigation of feature diversity is of great value.
[38, 6], and [33] have pointed that the decrease of feature diversity may hurt the classification performance of DNNs.In this paper, we aim to investigate why and how the feature diversity decreases during the early training process of the MLP. To this end, we explain the dynamics of features and parameters in intermediate layers of the MLP. We find that both the features and parameters of the MLP are mainly optimized towards a special direction, which is called the primary direction. Furthermore, the tendency of the optimization towards the primary direction can be boosted, just like a selfenhanced system, which decreases the diversity of features of different categories.
Besides the theoretical explanation of the decrease of feature diversity, we further propose two normalization operations to eliminate the twophase phenomenon, in order to avoid the decrease of feature diversity.
Contributions of this study can be summarized as follows. (1) We discover the decrease of feature diversity in early iterations of learning the MLP. (2) We theoretically explain this phenomenon in terms of learning dynamics. (3) We further propose two normalization operations to avoid the decrease of feature diversity.
2 Related Work
The representation capacity of DNNs. Analyzing the representation capacity of DNNs is an important direction to explain DNNs. The information bottleneck theory [45, 41] quantitatively explained the information encoded by features in intermediate layers of DNNs. [46], [1], and [8] used the information bottleneck theory to evaluate and improve the DNN’s representation capacity. [5] analyzed the representation capacity of DNNs with real training data and noises. In addition, several metrics were proposed to measure the generalization capacity or robustness of DNNs, including the stiffness [14], the sensitivity metrics [36], the Fourier analysis [48], and the CLEVER score [44]. Some studies focused on proving the generalization bound of DNNs [30, 29].
In comparison, we explain the MLP from a new perspective, i.e. we explain the decrease of feature diversity in early iterations of the MLP.
The learning dynamics of DNNs. Analyzing the learning dynamics is another perspective to understand DNNs. Many studies analyzed the local minimal in the optimization landscape of linear networks [7, 40, 16, 12] and nonlinear networks [9, 23, 39]. Some studies discussed the convergence rate of gradient descent on separable data [43, 47, 34]. [18] and [22] have investigated the effects of the batch size and the learning rate on SGD dynamics. In addition, some studies analyzed the dynamics of gradient descent in the overparameterization regime [4, 20, 28, 13].
In comparison, we analyze the learning dynamics of features and weights of the MLP, in order to explain the decrease of feature diversity in the first phase.
3 Algorithm
3.1 The decrease of feature diversity
In this paper, we aim to explain an interesting phenomenon when we train an MLP in early iterations. Specifically, the training process of the MLP can usually be divided into the following two phases according to the training loss. As Figure 2(d) shows, in the first phase, the training loss does not decrease significantly, and the training loss suddenly begins to decrease in the second phase.
We discover an interesting phenomenon in the first phase, i.e. both the diversity of features and the diversity of gradients w.r.t. features in intermediate layers over different samples keep decreasing.
As Figure 2(a)(b) shows, before the 130th epoch (the first phase), both the feature diversity and the gradient diversity keep decreasing, i.e., both the cosine similarity between features over different samples and the cosine similarity between gradients keep increasing. After the 130th epoch (the second phase), the feature diversity and the gradient diversity suddenly begin to increase, i.e. their similarities begin to decrease. Therefore, the MLP has the lowest feature diversity and the lowest gradient diversity at around the 130th epoch.
Crucially, this phenomenon is widely shared by MLPs with different architectures for different tasks.
Specifically, we train MLPs with different activation functions (including ReLU and Leaky ReLU
[31] with the slope of 0.01, 0.1), different learning rates (= 0.1, 0.01), and different batch sizes (=100, 500). These MLPs are trained on the MNIST dataset [27], the CIFAR10 dataset [24], and the first fifty categories in the Tiny ImageNet dataset
[26], respectively. Please see Appendix Section A for details. Figure 3 shows that in the first phase, both the feature diversity and the gradient diversity decrease in all MLPs.Connection to the epochwise double descent. The above twophase phenomenon is closely related to the epochwise double descent behavior, which was investigated by [35] and [17]. The epochwise double descent behavior has three stages during the training process of a DNN. The testing error decreases in the first stage, then increases in the second stage, and finally continues to decrease in the third stage. As Figure 2(d) shows, the first and the second stages in the epochwise double descent behavior are temporally aligned with the first phase in the aforementioned twophase phenomenon, where the training loss does not change significantly. Please see Appendix Section B for more discussion about the epochwise double descent behavior. Instead of explaining the epochwise double descent behavior, in this paper, we mainly explain the decrease of feature diversity in the first phase.
3.2 Modeling the decrease of feature diversity
Notation. We consider an MLP with concatenated linear layers, each being followed by a ReLU layer, except the last linear layer before the softmax operation. Without loss of generality, we mainly focus on the learning dynamics of gradient descent. Let denote the weight matrix of the th linear layer with neurons , which is learned after the th iteration. Given an input sample , the layerwise forward propagation of the th layer is represented as , where denotes a ReLU layer, and denotes the output feature of the th ReLU layer in the MLP after the th iteration. denotes a diagonal matrix, where each element on the main diagonal represents the gating state of the th neuron in the ReLU layer. Accordingly, gradients of the loss w.r.t output features of the th ReLU layer are given as .
In Section 3.1, we have demonstrated the typical phenomena in the first phase of the learning process, i.e. both the similarity between features and the similarity between gradients w.r.t. features keep increasing. In this section, we mainly explore the reason for the increasing similarity between gradients w.r.t. features. Then, based on the analysis of the increasing similarity between gradients, we can explain the increasing similarity between features.
Basic idea. The theoretical explanation of the increasing similarity of gradients w.r.t. features is inspired by the following observation, i.e. weights of different neurons in the same layer are mainly changed towards a common direction.
Specifically, we disentangle the component of weight changes along the common direction. We prove that weight changes along the common direction are enhanced in the first phase. In other words, different neurons are more and more likely to be pushed towards the same direction. The enhanced common direction can explain both the increasing similarity between gradients w.r.t. features and the increasing similarity between features.
3.2.1 Observations: assuming and disentangling the common direction
We find that the increasing similarity of feature gradients in the first phase can be explained as the increasing similarity between neurons, i.e. being explained as the phenomenon that weights of different neurons are optimized towards a common direction. Given an input sample , we rewrite the back propagation in the th linear layer as follows.
(1) 
In this way, we can consider gradients computed in the back propagation as the result of a pseudoforward propagation. In this pseudoforward propagation, the input is gradients w.r.t. features of the th layer , and the output is dimensional gradients . Accordingly, the equivalent weight matrix can be considered consisting of pseudoneurons.
Therefore, the increasing similarity between feature gradients over different samples can be explained as the increase of the similarity between pseudoneurons in the weight matrix. The increasing similarity between weight changes of pseudoneurons in the pseudoforward propagation is illustrated in Figure 2
(c). Note that the initial weight of pseudoneurons are orthogonal to each other according to the law of large numbers. Thus, Figure
2(c) shows that for most pseudoneurons and , , which makes the cosine similarity between weights of different pseudoneurons in the same layer keep increasing across different iterations in the first phase.Disentangling the weight change along the common direction as the basis of the further proof. Based on the above analysis, we can roughly assume that weights of different pseudoneurons are changed along a common direction . Thus, we aim to formulate the common direction , and disentangle the component of weight changes along the common direction . This can be served as the basis of further proof of the increasing similarity between feature gradients in the first phase.
Given a training sample after the th iteration, let denote weight changes of pseudoneurons in the th layer. According to the above observation, we can decompose into the component along the common direction and the component in other directions , where denotes the common direction of weight changes .
(2)  
where denotes the weight changes of different pseudoneurons along the common direction . is relatively small and is called the “noise” term. denotes the th row of
. We can estimate the common direction
by minimizing the noise term over different samples through different iterations as follows.(3) 
Lemma 1.
(Proof in Appendix Section C) For the decomposition , given and , reaches its minimum if and only if , where = , and .
Lemma 1 indicates that given the weight change and the common direction , we can compute and explicitly, and obtain .
Experiments to illustrate the significance of the common direction. Here, we illustrate that the weight change along the common direction is much greater than the weight change along other directions. Given the overall weight change made by all training samples after the th iteration, we decompose into components along five common directions as , where = is termed the primary common direction. represents the secondary common direction, which is the most common direction when we remove the component of the weight change along the direction from the , i.e. . Then, we compute the by conducting Equation (3) on . Please see Appendix Section D for more discussion about the procedure of the decomposition. Similarly, and represent the third, forth, and fifth common directions, respectively. Theoretically, , , , , and are orthogonal to each other. In this way, measures the strength of weight changes along the th common direction.
Specifically, in experiments, we trained 9layer MLPs on the MNIST dataset [27], the CIFAR10 dataset [24]
, and the Tiny ImageNet dataset
[26], respectively. Each layer of the MLP had 512 neurons. Figure 4 shows that the strength of the component of weights changes along the primary direction was approximately ten times greater than the strength of the component along the second direction .3.2.2 Proof of the decrease of feature diversity
In the above analysis, we have illustrated, assumed, and disentangled the common direction of weight changes. Based on this common direction, we aim to prove the decrease of feature diversity in the first phase. The proof consists of the following three steps.
(1) We formulate common directions by investigating the learning dynamics of the weight change made by each specific training sample.
(2) Based on the learning dynamics made by a training sample, we prove that features and weight changes can enhance the significance of each other.
(3) Based on this, we can explain the increase of both the feature similarity and the gradient similarity.
CIFAR10 
Category  Cat  Truck  

()  Layer 2  Layer 3  Layer 4  Layer 5  Layer 6  Layer 2  Layer 3  Layer 4  Layer 5  Layer 6  
MNIST 
Category  Eight  Zero  
()  Layer 2  Layer 3  Layer 4  Layer 5  Layer 6  Layer 2  Layer 3  Layer 4  Layer 5  Layer 6  
Tiny ImageNet 
Category  Flagpole  Bottle  
()  Layer 2  Layer 3  Layer 4  Layer 5  Layer 6  Layer 2  Layer 3  Layer 4  Layer 5  Layer 6  
Lemma 2.
(Proof in Appendix Section E) For the decomposition , given and , reaches its minimum if and only if , where = , and .
Lemma 2 reveals that given the weight matrix and the common direction , we can compute and explicitly, and obtain .
Theorem 1.
(Proof in Appendix Section F) The weight change made by a sample can be decomposed into terms after the th iteration: , where , . , where the SVD of is given as , and denotes the
th singular value
. and denote the th column of the matrix and , respectively. Besides, we have .Formulating the learning dynamics made by each specific training sample. Theorem 1 enables us to decompose common directions from the learning dynamics (weight changes) made by each specific training sample .
The primary term represents the component of weight changes along the common direction . The th term represents the component of weight changes along the th direction , which is orthogonal to the common direction .
Table 1 illustrates that the strength of the component of weight changes along the primary common direction is much more significant than strengths of components along other directions. To this end, we compute the average strength of weight changes along the common direction over all samples across different iterations . Similarly, the strength of weight changes along the th direction is computed as .
Relationship between features and weight changes. Inspired by the above learning dynamics of weight changes made by each specific training sample, we discover a close relationship between features and weight changes. Specifically, according to Theorem 1, the change of weights made by a sample after the th iteration can be computed as follows.
(4) 
where , and . Please see Appendix Section F for the proof.
We discover that weight changes can be represented as both in Equation (4) and in Equation (2). Note that is highly related to the primary common direction in the th layer. Thus, both terms and play dominant roles in . Therefore, we guess that the common direction is similar to , and the feature
is similar to the vector
. Figure 5 verifies the guess that the feature is in the similar direction of the vector , where .The proof of the selfenhanced common direction. Inspired by the above relationship between features and weight changes, we aim to prove that features and weights become more and more similar to each other in the first phase. Such proof can explain the selfenhancement of the common direction in the first phase.
Theorem 2.
(Proof in Appendix Section G) Given an input sample and a common direction after the th iteration, if the maximum singular value of is small enough, we can obtain , where , and .
Theorem 2 describes two typical learning dynamics in the first phase, i.e. the feature is pushed towards either the coefficient vector or the coefficient vector . To this end, denotes the change of features after the th iteration. In this paper, we roughly consider that the change of features is negatively related to the gradient w.r.t. features, i.e. and , although strictly speaking, the change of features is not exactly equal to the gradient w.r.t. features.
Theorem 2 shows the following two cases. In Case 1, if , then . In other words, if the coefficient vector is pushed towards the feature , then the feature is pushed towards the coefficient vector . In this way, we consider that the feature and the coefficient vector become more and more similar to each other in the first phase. Similarly, in Case 2, given another input training sample , if , then . In other words, if the coefficient vector is pushed towards the direction of the feature , then the feature is pushed towards the coefficient vector . In this way, the feature and the coefficient vector gradually become negatively related to each other.
Verification. We conducted experiments to verify the relationship between features and weights. To this end, we measured the change of the value in the first phase. Figure 6 reports the mean and the standard deviation of over different samples at each epoch. We repeatedly conducted experiments on the CIFAR10 dataset, the MNIST dataset, and the Tiny ImageNet dataset, respectively, to measure the . For each sample , was always positive and usually increased over epochs, which verified Theorem 2. Besides, Theorem 2 is based on the assumption that the maximum singular value of is small enough. Experimental results in Appendix Section G verified that the maximum singular value of was usually small enough in the first phase.
Theorem 3.
(Proof in Appendix Section H) For each training sample belonging to the category , we have , and , where is constant for each category for the th layer.
Theorem 3 indicates that training samples of the same category all push the vector towards the feature in the first phase, where is a constant shared by different samples of the same category . Figure 5 verifies that are consistently positive over different samples of the same category.
Assumption 1. We assume that the MLP encodes features of very few (a single or two) categories in the first phase, while other categories have not been learned in the first phase.
To verify this assumption, Figure 7 shows that only a single or two categories exhibit much higher accuracies than the random guessing at the end of the first phase. This indicates that the learning of the MLP is dominated by training samples of a single or two categories.
The combination of Theorem 3 and Assumption 1 clarifies the overall learning dynamics in the first phase. In other words, the overall learning effects of all training samples can be roughly considered as pushing the vector towards the feature . is determined by the dominating category/categories. In addition, , which means that the change of the feature is similar to the vector . Thus, we can consider that the change of the feature is modified towards the feature . In this way, the learning process of the MLP works just like a selfenhanced system. Besides, the weight coefficient vector and the feature enhance each other.
Proof the increasing feature similarity and the increasing gradient similarity. As aforementioned, features of different samples are consistently pushed towards the same vector . Therefore, the similarity between features of different samples increases in the first phase, which makes different training samples generate similar gating states in each ReLU layer. The increasing similarity between gating states of each ReLU layer over different samples leads to the increasing the similarity between gradients w.r.t. features over different samples of the same category. Please see Appendix Section I for more discussion.
How to escape from the first phase? In the first phase, the MLP only discovers a single direction to optimize a single or two categories. However, the optimization of a single or two categories will soon saturate, and the gradient mainly comes from training samples of other categories, which destroys the dominating roles of a single or two categories in the learning of the MLP. Therefore, the learning effects of training samples from different categories may conflict with each other. Thus, the selfenhanced system is destroyed, and the learning of the MLP enters the second phase.
3.3 Eliminating the decrease of feature diversity
In this section, we aim to eliminate the decrease of feature diversity in the first phase, in order to speed up the training process. The analysis in the above section shows that the similarity between features of different samples increases in the first phase. To this end, we propose to use two normalization operations to eliminate the decrease of feature diversity in the first phase.
We are given the output feature of the th linear layer w.r.t. the input sample , where denotes the th dimension of the feature. The first normalization operation is given as , where and denote the mean value and the standard deviation of
over different samples, respectively. This operation is similar to the batch normalization
[19], but we do not compute the scaling and shifting parameters in the batch normalization. The second normalization operation is the batch normalization. The above two normalization operations can prevent features of different samples from being similar to each other, because the mean feature is subtracted from features of all samples.In order to verify the two normalization operations can eliminate the twophase phenomenon during the training process of the MLP, we trained 9layer MLPs with and without normalization operations. Specifically, for the normalization operation , we added the normalization operation after each linear layer, expect the last linear layer. Similarly, for the MLP with the batch normalization operation, we added a batch normalization layer after each linear layer. Each linear layer in the MLP had 512 neurons. Figure 8 shows that the two normalization operations successfully eliminated the twophase phenomenon of the learning of MLPs and sped up the training process.
Figure 9 shows that in MLPs with normalization operations, weight changes along other common directions were strengthened than those in the MLP without normalization operations. To this end, we used to measure the relative strength of weight changes along the th common direction w.r.t. the strength of weights changes along the primary common direction . Furthermore, Figure 10 shows that the feature similarity in MLPs with normalization operations kept decreasing, while the feature similarity of the MLP without normalization operations kept increasing. This indicated that the normalization operations avoided the decreasing of feature diversity.
4 Conclusion
In this paper, we find that in the early stage of the training process, the MLP exhibits a twophase phenomenon, and the feature diversity keeps decreasing in the first phase. We formulate and explain this phenomenon by analyzing the learning dynamics of the MLP. Furthermore, we propose to use two normalization operations to eliminate the above twophase phenomenon and speed up the training process.
References
 [1] (2018) Information dropout: learning optimal representations through noisy computation. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2897–2905. Cited by: §2.
 [2] (2020) The neural tangent kernel in high dimensions: triple descent and a multiscale theory of generalization. In International Conference on Machine Learning, pp. 74–84. Cited by: Appendix B.
 [3] (2017) Highdimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667. Cited by: Appendix B.
 [4] (2018) On the optimization of deep networks: implicit acceleration by overparameterization. In International Conference on Machine Learning, pp. 244–253. Cited by: §2.
 [5] (2017) A closer look at memorization in deep networks. In International Conference on Machine Learning, pp. 233–242. Cited by: §2.

[6]
(2019)
Regularizing deep neural networks by enhancing diversity in feature extraction
. IEEE transactions on neural networks and learning systems 30 (9), pp. 2650–2661. Cited by: §1. 
[7]
(1989)
Neural networks and principal component analysis: learning from examples without local minima
. Neural networks 2 (1), pp. 53–58. Cited by: §2. 
[8]
(2018)
Evaluating capability of deep neural networks for image classification via information plane.
In
European Conference on Computer Vision
, pp. 181–195. Cited by: §2.  [9] (2015) The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pp. 192–204. Cited by: §2.

[10]
(2020)
Double trouble in double descent: bias and variance (s) in the lazy regime
. In International Conference on Machine Learning, pp. 2280–2290. Cited by: Appendix B.  [11] (2020) Triple descent and the two kinds of overfitting: where & why do they appear?. In NeurIPS, Cited by: Appendix B.
 [12] (2016) Toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity. Advances In Neural Information Processing Systems 29, pp. 2253–2261. Cited by: §2.
 [13] (2018) Gradient descent provably optimizes overparameterized neural networks. In International Conference on Learning Representations, Cited by: §2.
 [14] (2019) Stiffness: a new perspective on generalization in neural networks. arXiv preprint arXiv:1901.09491. Cited by: §2.
 [15] (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations, Cited by: §1.

[16]
(2016)
Identity matters in deep learning
. arXiv preprint arXiv:1611.04231. Cited by: §2.  [17] (2020) Early stopping in deep networks: double descent and how to eliminate it. In International Conference on Learning Representations, Cited by: Appendix B, §1, §3.1.
 [18] (2017) Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1729–1739. Cited by: §2.
 [19] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. Cited by: §3.3.
 [20] (2018) Neural tangent kernel: convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572. Cited by: §2.
 [21] (2020) Implicit regularization of random feature models. In International Conference on Machine Learning, pp. 4631–4640. Cited by: Appendix B, §1.
 [22] (2017) Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623. Cited by: §2.
 [23] (2016) Deep learning without poor local minima. Advances in Neural Information Processing Systems 29, pp. 586–594. Cited by: §2.
 [24] (2009) Learning multiple layers of features from tiny images. Cited by: §A.1, §3.1, §3.2.1.

[25]
(2018)
An analytic theory of generalization dynamics and transfer learning in deep linear networks
. In International Conference on Learning Representations, Cited by: §1.  [26] (2015) Tiny imagenet visual recognition challenge. CS 231N 7, pp. 7. Cited by: §A.3, §3.1, §3.2.1.
 [27] (1998) Gradientbased learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §A.2, §3.1, §3.2.1.
 [28] (2018) Deep neural networks as gaussian processes. In International Conference on Learning Representations, Cited by: §2.
 [29] (2018) On tighter generalization bound for deep neural networks: cnns, resnets, and beyond. arXiv preprint arXiv:1806.05159. Cited by: §2.

[30]
(2019)
Generalization bounds for deep convolutional neural networks
. In International Conference on Learning Representations, Cited by: §2.  [31] (2013) Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, Vol. 30, pp. 3. Cited by: §3.1.
 [32] (2019) Traditional and heavy tailed self regularization in neural network models. In International Conference on Machine Learning, pp. 4284–4293. Cited by: §1.
 [33] (2021) Neural architecture search without training. In International Conference on Machine Learning, pp. 7588–7598. Cited by: §1.
 [34] (2019) Convergence of gradient descent on separable data. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3420–3428. Cited by: §2.
 [35] (2019) Deep double descent: where bigger models and more data hurt. In International Conference on Learning Representations, Cited by: Appendix B, §1, §3.1.
 [36] (2018) Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, Cited by: §2.
 [37] (2017) Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4788–4798. Cited by: §1.
 [38] (2016) Regularizing cnns with locally constrained decorrelations. arXiv preprint arXiv:1611.01967. Cited by: §1.
 [39] (2018) Spurious local minima are common in twolayer relu neural networks. In International Conference on Machine Learning, pp. 4433–4441. Cited by: §2.
 [40] (2013) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120. Cited by: §1, §2.
 [41] (2017) Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810. Cited by: §1, §2.
 [42] (2019) A tailindex analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pp. 5827–5837. Cited by: §1.
 [43] (2018) The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research 19 (1), pp. 2822–2878. Cited by: §2.
 [44] (2018) Evaluating the robustness of neural networks: an extreme value theory approach. In International Conference on Learning Representations, Cited by: §2.
 [45] (2017) New theory cracks open the black box of deep learning. Quanta Magazine 3. Cited by: §1, §2.
 [46] (2017) Informationtheoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems 2017, pp. 2525–2534. Cited by: §2.
 [47] (2018) Convergence of sgd in learning relu models with separable data. arXiv preprint arXiv:1806.04339. Cited by: §2.
 [48] (2018) Understanding training and generalization in deep learning by fourier analysis. arXiv preprint arXiv:1808.04295. Cited by: §2.
 [49] (2020) Rethinking biasvariance tradeoff for generalization of neural networks. In International Conference on Machine Learning, pp. 10767–10777. Cited by: Appendix B.
Appendix A Common phenomenon shared by different architectures for different tasks.
In this section, we aim to demonstrate an interesting phenomenon when we train an MLP in early iterations. Specifically, the training process of the MLP can usually be divided into the following two phases according to the training loss. In the first phase, the training loss does not decrease significantly, and the training loss suddenly begins to decrease in the second phase. More Crucially, this phenomenon is widely shared by MLPs with different architectures for different tasks.
a.1 On the CIFAR10 dataset
In this subsection, we demonstrate that the twophase phenomenon is shared by different MLPs on the CIFAR10 dataset [24]. For different MLPs, we adapted learning rate , batch size , SGD optimizer, and ReLU activation function. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. Results of MLPs trained on the CIFAR10 are shown in Figure 1.
a.2 On the MNIST dataset
a.3 On the Tiny ImageNet dataset
In this subsection, we demonstrate that the twophase phenomenon is shared by different MLPs on the Tiny ImageNet dataset [26]. Specifically, we selected the following 50 categories, orangutan, parking meter, snorkel, American alligator, oboe, basketball, rocking chair, hopper, neck brace, candy store, broom, seashore, sewing machine, sunglasses, panda, pretzel, pig, volleyball, puma, alp, barbershop, ox, flagpole, lifeboat, teapot, walking stick, brain coral, slug, abacus, comic book, CD player, school bus, banister, bathtub, German shepherd, black stork, computer keyboard, tarantula, sock, Arabian camel, bee, cockroach, cannon, tractor, cardigan, suspension bridge, beer bottle, viaduct, guacamole, and iPod for training. For different MLPs, we adapted learning rate , batch size , SGD optimizer, and ReLU activation function. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. Note that we took a random cropping with 3232. Results of MLPs trained on the Tiny ImageNet are shown in Figure 3.
a.4 Different training batch sizes
In this subsection, we demonstrate that the twophase phenomenon is shared by MLPs trained on the CIFAR10 dataset with different training batch sizes. For different MLPs, we adapted learning rate , SGD optimizer, and ReLU activation function. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. We trained three 7layer MLPs with 256 neurons in each layer, with respectively. Results of MLPs trained with different batch sizes are shown in Figure 4.
a.5 Different learning rates
In this subsection, we demonstrate that the twophase phenomenon is shared by MLPs trained on the CIFAR10 dataset with different learning rates. For different MLPs, we adapted batch size , SGD optimizer, and ReLU activation function. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. We trained three 7layer MLPs with 256 neurons in each layer, with learning rate respectively. Results of MLPs trained with different learning rates are shown in Figure 5.
a.6 Different activation functions
In this subsection, we demonstrate that the twophase phenomenon is shared by MLPs trained on the CIFAR10 dataset with different activation functions. For different MLPs, we adapted learning rate , batch size , SGD optimizer. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. We trained three 9layer MLPs with 512 neurons in each layer with ReLU, Leaky ReLU (slope=0.1), and Leaky ReLU (slope=0.01), respectively. Results of MLPs trained with different activation functions are shown in Figure 6.
a.7 Discussion
In experiments, we notice that both the cosine similarity between features over different samples and the cosine similarity between gradients keep increasing in the first phase. Furthermore, we find that gradients w.r.t features of different samples first become more and more similar to each other, then features of different samples become more and more similar to each other. We consider that this phenomenon is reasonable. In order words, when gradients w.r.t features of different samples become similar to each other, then the features of different samples are pushed towards a specific direction. Therefore, features of different samples become more and more similar to each other.
Appendix B Double Descent
The modelwise double descent behavior has emerged in many deep learning tasks, which means that as the model size increases, performance first decreases, then increases, and finally decreases [3, 21, 49, 10]. Furthermore, some recent studies discussed the existence of the triple descent curve [11, 2]. Meanwhile, the double descent behavior also occurs with respect to training epochs [35, 17], called epochwise double descent, i.e. as the epoch increases, the testing error first decreases, then increases, and finally decreases.
Appendix C Proof for the lemma 1
In this section, we present the detailed proof for Lemma 1.
Lemma 1.
For the decomposition , given and , reaches its minimum if and only if , where = , and .
proof. Let denote the th column of the matrix . Given a sample , we can represent as follows:
(1) 
where
Comments
There are no comments yet.