1 Introduction
Although neural networks have made strong empirical progress in a diverse set of domains (e.g., computer vision alexnet ; vgg ; resnet, speech recognition hinton2012deep ; amodei2016deep, natural language processing word2vec ; bert, and games alphago ; alphagozero ; darkforest ; mnih2013playing), a number of fundamental questions remain unsolved. How can Stochastic Gradient Descent (SGD) find good solutions to a complicated nonconvex optimization problem? Why do neural networks generalize? How can networks trained with SGD fit both random noise and structured data rethinking ; krueger2017deep ; neyshabur2017exploring, yet prioritize structured models even in the presence of massive noise rolnick2017deep? Why are flat minima related to good generalization? Why does overparameterization lead to better generalization neyshabur2018towards ; zhang2019identity ; spigler2018jamming ; neyshabur2014search ; li2018measuring? Why do lottery tickets exist lottery ; lotteryscale?

In this paper, we propose a theoretical framework for multilayered ReLU networks. Based on this framework, we try to explain these puzzling empirical phenomena with a unified view. We adopt a teacher-student setting where the label provided to an overparameterized deep student ReLU network is the output of a fixed teacher ReLU network of the same depth and unknown weights (Fig. 1(a)). Here, overparameterization means that at each layer, the number of nodes in the student network exceeds the number of nodes in the teacher network. In this perspective, hidden student nodes are randomly initialized with different activation regions (Fig. 2(a)). During optimization, student nodes compete with each other to explain teacher nodes. In this setting, Theorem 4 shows that lucky student nodes which have greater overlap with teacher nodes converge to those teacher nodes at a fast rate, resulting in winner-take-all behavior. Furthermore, Theorem 5 shows that in the 2-layer case, if a subset of student nodes are close to the teacher's, they converge to them and the fan-out weights of the other, irrelevant nodes of the same layer vanish.
With this framework, we try to intuitively explain various neural network behaviors as follows:
Fitting both structured and random data. Under gradient descent dynamics, some student nodes, which happen to overlap substantially with teacher nodes, will move towards those teacher nodes and cover them. This is true both for structured data, which corresponds to a small teacher network with few intermediate nodes, and for noisy/random data, which corresponds to a large teacher with many intermediate nodes. This explains why the same network can fit both structured and random data (Fig. 2(a-b)).
Overparameterization. In overparameterization, many student nodes are initialized randomly at each layer. Any teacher node is then more likely to have a substantial overlap with some student node, which leads to fast convergence (Fig. 2(a) and (c), Thm. 4), consistent with lottery ; lotteryscale. This also explains why training models whose capacity just fits the data (or teacher) yields worse performance li2018measuring.
Flat minima. Deep networks often converge to “flat minima” whose Hessian has many small eigenvalues sagun2016eigenvalues ; sagun2017empirical ; lipton2016stuck ; baity2018comparing. Furthermore, while controversial dinh2017sharp, flat minima seem to be associated with good generalization, while sharp minima often lead to poor generalization hochreiter1997flat ; keskar2016large ; wu2017towards ; li2018visualizing. In our theory, when fitting structured data, only a few lucky student nodes converge to the teacher, while the fan-out weights of the other nodes shrink towards zero, making them (and their fan-in weights) irrelevant to the final outcome (Thm. 5). This yields flat minima in which movement along most dimensions (“unlucky nodes”) results in minimal change in output. On the other hand, sharp minima are related to noisy data (Fig. 2(d)), in which more student nodes match with the teacher.

Implicit regularization. On the other hand, the snapping behavior enforces winner-take-all: after optimization, a teacher node is fully covered (explained) by a few student nodes, rather than being split amongst many student nodes due to overparameterization. This explains why the same network, once trained with structured data, can generalize to the test set.
Lottery Tickets. Lottery Tickets lottery ; lotteryscale ; zhou2019deconstructing are an interesting phenomenon: if we reset “salient weights” (trained weights with large magnitude) back to their values before optimization but after initialization, prune the other weights (often the majority of all weights) and retrain the model, the test performance is the same or better; if we instead reinitialize the salient weights, the test performance is much worse. In our theory, the salient weights are those lucky regions (Fig. 3) that happen to overlap with some teacher nodes after initialization and converge to them during optimization. Therefore, if we reset their weights and prune the others away, they can still converge to the same set of teacher nodes, and potentially achieve better performance due to less interference with other irrelevant nodes. However, if we reinitialize them, they are likely to fall into unfavorable regions which cannot cover the teacher nodes, and therefore lead to poor performance (Fig. 3(c)), just as in the case of under-parameterization. Recently, Supermask zhou2019deconstructing shows that a supermask can be found from winning tickets. If it is applied to the initialized weights, the network gives much better test performance than chance without any training. This is also consistent with the intuitive picture in Fig. 3(b).
2 Mathematical Framework
Notation. Consider a student network and its associated teacher network (Fig. 1(a)). Denote the input as . For each node , denote as the activation, as the ReLU gating (for the top layer, and are always ), and as the backpropagated gradient, all as functions of . We use the superscript to represent a teacher node (e.g., ). Therefore, never appears, as teacher nodes are not updated. We use to represent the weight between node and in the student network. Similarly, represents the weight between node and in the teacher network.

We focus on multilayered ReLU networks. We use the following equality extensively: . For a ReLU node , we use as the activation region of node .
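The ReLU identity and the notion of activation region can be made concrete with a short numpy sketch (our own toy code; `gate` plays the role of the paper's 0/1 ReLU gating):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def gate(x):
    # 0/1 ReLU gating, the D(x) used throughout the paper
    return (x > 0).astype(float)

x = rng.standard_normal(1000)
# The identity relu(x) = D(x) * x holds elementwise
assert np.allclose(relu(x), gate(x) * x)

# Activation region of a node with weight w: the half-space {x : w^T x > 0}
w = rng.standard_normal(5)
X = rng.standard_normal((1000, 5))
active = X @ w > 0
print(active.mean())  # roughly 0.5 for a zero-bias node on symmetric inputs
```

The fraction of inputs falling inside a node's activation region is what the later assumptions (e.g., Assumption 3) reason about.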
Teacher network versus Dataset. The reason why we formulate the problem using a teacher network rather than a dataset is the following: (1) It leads to a nice and symmetric formulation for multilayered ReLU networks (Thm. 1). (2) A teacher network corresponds to an infinite-size dataset, which separates finite-sample issues from the inductive bias of the dataset, which in turn corresponds to the structure of the teacher network. (3) If the student weights can be shown to converge to the teacher's, a generalization bound for the student naturally follows. (4) The label complexity of data generated from a teacher is automatically reduced, which could lead to better generalization bounds; in contrast, a bound for an arbitrary function class can be hard to obtain.
Objective. We assume that both the teacher and the student output probabilities over classes. We use the output of the teacher as the input of the student. At the top layer, each node in the student corresponds to a node in the teacher. Therefore, the objective is:

(1)
By the backpropagation rule, we know that for each sample , the (negative) gradient . The gradient gets backpropagated until the first layer is reached.
Note that here, the gradient sent to node is correlated with the activation of the corresponding teacher node and other student nodes at the same layer. Intuitively, this means that the gradient “pushes” the student node to align with class of the teacher. If so, then the student learns the corresponding class well. A natural question arises:
Are student nodes at intermediate layers correlated with teacher nodes at the same layers?
One might expect this to be hard, since the student's intermediate layers receive no direct supervision from the corresponding teacher layers, but rely only on the backpropagated gradient. Surprisingly, the following theorem shows that it is possible for every intermediate layer:
Theorem 1 (Recursive Gradient Rule).
Note that Theorem 1 applies to arbitrarily deep ReLU networks and allows different numbers of nodes for the teacher and student. The role played by the ReLU activation is to make the expression of concise; otherwise, and can take a very complicated (and asymmetric) form.
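To make the recursive structure of the backpropagated gradient concrete, here is a hedged numpy sketch of a toy 2-layer ReLU network (all sizes and names are our own): the hidden-layer gradient is obtained from the top-layer gradient via the gating-and-transpose pattern that Theorem 1 formalizes, and is checked against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Toy 2-layer ReLU student; sizes are illustrative, not from the paper.
W1, W2 = rng.standard_normal((4, 5)), rng.standard_normal((3, 4))
x = rng.standard_normal(5)
target = rng.standard_normal(3)

def forward(W1, W2, x):
    h = W1 @ x
    return h, relu(h), W2 @ relu(h)

def loss(W1, W2, x):
    _, _, out = forward(W1, W2, x)
    return 0.5 * np.sum((out - target) ** 2)

# Recursive rule: g_layer = D * (W_upper^T g_upper), with D the ReLU gating.
h, a, out = forward(W1, W2, x)
g_top = out - target                  # gradient at the output layer
g_hidden = (h > 0) * (W2.T @ g_top)   # gated backpropagation, one layer down
grad_W1 = np.outer(g_hidden, x)

# Finite-difference check of one weight entry
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
fd = (loss(W1p, W2, x) - loss(W1, W2, x)) / eps
assert abs(fd - grad_W1[0, 0]) < 1e-4
```

The same gated-transpose recursion repeats layer by layer, which is what makes the per-layer gradient expressions in the paper symmetric.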
In particular, we consider the overparameterization setting: the number of nodes on the student side is much larger (e.g., 5-10x) than the number of nodes on the teacher side. Using Theorem 1, we discover a novel and concise form of the gradient update rule:
Assumption 1 (Separation of Expectations).
(4)  
(5) 
Theorem 2.
Here we explain the notation. is the teacher weights, , and , , and . We can define similar notations for (which has columns/filters), , , and (Fig. 4(c)). At the lowest layer , ; at the highest layer , where there is no ReLU, we have due to Eqn. 1. According to the network structure, and only depend on the weights , while and only depend on .
3 Analysis on the Dynamics
In the following, we will use Eqn. 6 to analyze the dynamics of the multilayer ReLU networks. For convenience, we first define the two functions and ( is the ReLU function):
(7) 
We assume these two functions have the following property .
Assumption 2 (Lipschitz condition).
There exists and so that:
(8) 
Using this, we know that , , and so on. For brevity, denote (when notation is heavy) and so on. We impose the following assumption:
Assumption 3 (Small Overlap between teacher nodes).
There exists and so that:
(9) 
Intuitively, this means that the probability of the simultaneous activation of two teacher nodes and is small. If we have sufficient training data to cover the input space, then a sufficient condition for Assumption 3 to hold is that the teacher nodes have negative biases, which means that they cut corners in the space spanned by the node activations of the lower layer (Fig. 4(a)). We have empirically verified that the majority of biases in BatchNorm layers (after the data are whitened) are negative in VGG11/16 trained on ImageNet (Sec. 4.1).

3.1 Effects of BatchNorm
Batch Normalization batchnorm has been extensively used to speed up training, reduce tuning effort, and improve the test performance of neural networks. Here we use an interesting property of BatchNorm: the total “energy” of the incoming weights of each node is conserved over training iterations:
Theorem 3 (Conserved Quantity in Batch Normalization).
For the Linear-ReLU-BN or Linear-BN-ReLU configuration, the energy of a filter before BN remains constant during training (Fig. 15).
See Appendix for the proof. A similar lemma also appears in arora2018theoretical. This may partially explain why BN has a stabilizing effect: energy does not leak from one layer to nearby ones. Due to this property, in the following we assume for convenience that , and that the gradient is always orthogonal to the current weight . Note that on the teacher side we can always push the magnitude component to the upper layer; on the student side, random initialization naturally leads to constant-magnitude weights.
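A quick numerical illustration of why BN conserves the incoming weight norm (a sketch under our own toy setup, not the paper's proof): BN makes the downstream loss invariant to the scale of the pre-BN filter `w`, so the gradient has essentially no component along `w`, and under gradient flow `||w||^2` stays constant.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((256, 10))   # a batch of whitened inputs
w = rng.standard_normal(10)          # pre-BN filter
v = rng.standard_normal(256)         # arbitrary smooth downstream readout

def bn(z, eps=1e-5):
    return (z - z.mean()) / np.sqrt(z.var() + eps)

def loss(w):
    # Linear -> BN -> ReLU, followed by an arbitrary readout
    return np.dot(v, np.maximum(bn(X @ w), 0.0))

# BN makes the loss (nearly) invariant to the scale of w ...
assert abs(loss(w) - loss(3.0 * w)) < 1e-3

# ... so the directional derivative along w is (nearly) zero, which is why
# ||w||^2 is conserved under gradient flow (Theorem 3).
eps = 1e-6
dir_deriv = (loss(w * (1 + eps)) - loss(w)) / (eps * np.linalg.norm(w))
assert abs(dir_deriv) < 1e-3
```

The tiny residual comes only from BN's numerical epsilon; exact scale invariance would make both quantities identically zero.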
3.2 Same number of student nodes as teacher
We start with a simple case. Consider a single layer without overparameterization, i.e., . We also assume that , i.e., the input of that layer is whitened, and that the top-layer signal is uniform, i.e., (all entries are 1). Then the following theorem shows that weight recovery can follow (we use as ).
Theorem 4.
For the dynamics , where is a projection matrix into the orthogonal complement of , and , are the corresponding -th columns of and . Denote and assume . If , then with rate ( is the learning rate). Here and .
See Appendix for the proof. Here we list a few remarks:
Faster convergence near . We can see that since in general becomes larger when (as can be close to ), we expect superlinear convergence near . This brings about an interesting winner-take-all mechanism: if the initial overlap between a student node and a particular teacher node is large, then the student node will snap to it (Fig. 1(c)).
Importance of the projection operator . Intuitively, the projection is needed to remove the ambiguity related to weight scaling, in which the output remains constant if the top-layer weights are multiplied by a constant while the low-layer weights are divided by . Previous work du2017gradient also uses a similar technique, while we justify it with BN. Without , convergence can be harder.
Top-down modulation. Note that here we assume the top-layer signal is uniform, which means that according to , there is no preference regarding which student node corresponds to which teacher node . If there is a preference (e.g., ), then from the proof, the cross-term will be suppressed due to , making convergence easier. As we will see next, such top-down modulation plays an important role in the 2-layer and overparameterized case. We believe that it plays a similar role for deep networks.
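The winner-take-all snapping of Theorem 4 can be illustrated with a toy simulation (our own hedged sketch, not the paper's exact dynamics): a unit-norm student weight is repeatedly moved along the co-activated correlation with a fixed teacher direction, with the component along the current weight projected out.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10
wstar = rng.standard_normal(d); wstar /= np.linalg.norm(wstar)  # teacher node

def step(w, X, lr=0.3):
    # Empirical projected update: accumulate x (x^T w*) over inputs that
    # activate BOTH the student and the teacher node.
    both = (X @ w > 0) & (X @ wstar > 0)
    g = (X * both[:, None]).T @ (X @ wstar) / len(X)
    g -= np.dot(g, w) * w              # projection onto the complement of w
    w = w + lr * g
    return w / np.linalg.norm(w)       # BN keeps ||w|| constant (Theorem 3)

def overlap_after(w0, iters=200):
    w = w0.copy()
    for _ in range(iters):
        w = step(w, rng.standard_normal((4096, d)))
    return float(np.dot(w, wstar))

near = wstar + 0.5 * rng.standard_normal(d); near /= np.linalg.norm(near)
far = rng.standard_normal(d); far /= np.linalg.norm(far)
print(overlap_after(near, 10), overlap_after(far, 10))  # lucky node tends to lead early
print(overlap_after(near), overlap_after(far))
```

In our runs, the initialization with larger teacher overlap dominates at early iterations, consistent with the faster-convergence remark above; the exact update is only a stand-in for the population dynamics in Theorem 4.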
3.3 Over-Parameterization and Top-down Modulation in 2-layer Networks
In the overparameterization case (e.g., 5-10x), we arrange the variables into two parts: , where contains columns (same size as ), while contains the remaining columns. We use (or the set) to refer to nodes , and (or the set) for the remaining part.
In this case, if we want to show that “the main component” converges to , we face one core question: to where will converge, and whether will converge at all. We need to consider not only the dynamics of the current layer, but also the dynamics of the upper layer. Using a one-hidden-layer overparameterized ReLU network as an example, Theorem 5 shows that the upper-layer dynamics automatically apply top-down modulation to suppress the influence of , regardless of its convergence. Here , where are the weight components of the set. See Fig. 5.
Theorem 5 (OverParameterization and Topdown Modulation).
See Appendix for the proof (and the definition of in Eqn. 45). The intuition is: if is close to and are far away from them due to Assumption 3, then the off-diagonal elements of and are smaller than the diagonal ones. This causes to move towards and to move towards zero. When becomes small, so does for or . This in turn suppresses the effect of and accelerates the convergence of ; decays exponentially, so that stays close to its initial location and Assumption 3 holds for all iterations. A few remarks:
Flat minima. Since , can be changed arbitrarily without affecting the outputs of the neural network. This could explain why there are many flat directions in trained networks, and why many eigenvalues of the Hessian are close to zero sagun2016eigenvalues .
Understanding of pruning methods. Theorem 5 naturally relates two different unstructured network pruning approaches: pruning weights small in magnitude han2015learning ; lottery and pruning weights suggested by the Hessian lecun1990optimal ; hassibi1993optimal. It also suggests a principled structured pruning method: instead of pruning a filter by checking its weight norm, prune according to its top-down modulation.
Accelerated convergence and learning rate schedule. For simplicity, the theorem uses a uniform (and conservative) throughout the iterations. In practice, is initially small (due to the noise introduced by the set) but becomes large after a few iterations, when vanishes. Given the same learning rate, this leads to accelerated convergence. At some point, the learning rate becomes too large, leading to fluctuations, in which case needs to be reduced.
Many-to-one mapping. Theorem 5 shows that under strict conditions, there is a one-to-one correspondence between teacher and student nodes. In general this is not the case: two student nodes can both be in the vicinity of a teacher node and converge towards it, until that node is fully explained. We leave a rigorous mathematical analysis of many-to-one mappings to future work.
Random initialization. One nice thing about Theorem 5 is that it only requires the initial to be small. In contrast, there is no requirement for small . Therefore, we can expect that with more overparameterization and random initialization, in each layer it is more likely to find the set (of fixed size ), i.e., the lucky weights, so that is quite close to . At the same time, we do not need to worry about , which grows with more overparameterization. Moreover, random initialization often gives nearly orthogonal weight vectors, which naturally leads to Assumption 3.

3.4 Extension to Multilayer ReLU networks
Using a similar approach, we can extend this analysis to the multilayer case. We conjecture that similar behaviors occur: in each layer, due to overparameterization, the weights of some lucky student nodes are close to the teacher's. While these converge to the teacher, the final values of the other, irrelevant weights are initialization-dependent. If an irrelevant node connects to a lucky node in the upper layer, then, similar to Thm. 5, the corresponding fan-out weights converge to zero. On the other hand, if it connects to nodes that are also irrelevant, then these fan-out weights are not determined and their final values depend on the initialization. This does not matter, however, since those upper-layer irrelevant nodes eventually meet zero weights when going up recursively, because the topmost output layer has no overparameterization. We leave a formal analysis to future work.
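As a sanity check of the two-layer picture in Theorem 5, here is a hedged end-to-end toy run (our own sizes, plain SGD on Gaussian inputs, no biases or BN): a student whose first few hidden nodes start near the teacher is trained against the teacher's outputs, and one can inspect how the fan-out columns of the lucky versus remaining nodes evolve.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, n, C = 10, 3, 9, 2   # input dim, teacher/student hidden sizes, classes (toy)

Wt = rng.standard_normal((m, d)); Wt /= np.linalg.norm(Wt, axis=1, keepdims=True)
Vt = rng.standard_normal((C, m))                      # teacher fan-out weights

# Student: first m hidden nodes start near the teacher ("lucky"), the rest random.
W = np.vstack([Wt + 0.3 * rng.standard_normal((m, d)),
               rng.standard_normal((n - m, d))])
W /= np.linalg.norm(W, axis=1, keepdims=True)
V = np.hstack([Vt + 0.3 * rng.standard_normal((C, m)),
               0.3 * rng.standard_normal((C, n - m))])

relu = lambda z: np.maximum(z, 0.0)

def sgd_step(W, V, lr=0.05, bs=512):
    X = rng.standard_normal((bs, d))
    H, Hs = X @ Wt.T, X @ W.T
    err = relu(Hs) @ V.T - relu(H) @ Vt.T             # student minus teacher output
    gV = err.T @ relu(Hs) / bs
    gW = ((err @ V) * (Hs > 0)).T @ X / bs
    return W - lr * gW, V - lr * gV, 0.5 * np.mean(np.sum(err ** 2, axis=1))

loss0 = None
for _ in range(1000):
    W, V, l = sgd_step(W, V)
    loss0 = l if loss0 is None else loss0

# Fan-out norms of the lucky vs. remaining student nodes after training
print(np.linalg.norm(V[:, :m], axis=0), np.linalg.norm(V[:, m:], axis=0))
assert l < loss0
```

This is only illustrative: the theorem's conclusion that the fan-out of the remaining set vanishes relies on Assumption 3 and other conditions this toy run does not enforce, so we assert only that training makes progress.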
4 Simulations
4.1 Checking Assumption 3
For Theorem 4 and Theorem 5 to work, we make Assumption 3 that the activation fields of different teacher nodes should be well-separated. To justify this, we analyze the biases of the BatchNorm layers that follow the convolutional layers in pretrained VGG11/13/16/19. We check the BatchNorm bias because these models use the Linear-BatchNorm-ReLU architecture. After BatchNorm first normalizes the input data to a zero-mean distribution, the BatchNorm bias determines how much of the data passes the ReLU threshold. If the bias is negative, then only a small portion of the data passes the ReLU gating and Assumption 3 is likely to hold. From Fig. 6, it is quite clear that the majority of BatchNorm bias parameters are negative, in particular in the top layers.
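The negative-bias intuition can also be checked directly in a toy Gaussian model (our own sketch): for two random unit-norm nodes, a shared negative bias sharply reduces the probability that both are simultaneously active, which is exactly what Assumption 3 asks for.

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 20, 100_000
X = rng.standard_normal((N, d))       # whitened input, as assumed in Sec. 3
w1, w2 = rng.standard_normal(d), rng.standard_normal(d)
w1 /= np.linalg.norm(w1); w2 /= np.linalg.norm(w2)

def coactivation(b):
    # Fraction of inputs that activate BOTH nodes, for a shared bias b
    return np.mean((X @ w1 + b > 0) & (X @ w2 + b > 0))

# With zero bias the two half-spaces overlap substantially; a negative bias
# ("cutting corners") makes simultaneous activation rare.
print(coactivation(0.0), coactivation(-1.5))
assert coactivation(-1.5) < coactivation(0.0) / 3
```

The more negative the bias, the smaller each activation region and hence the smaller their overlap, at the cost of each node seeing fewer data points.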
4.2 Numerical Experiments of Thm. 5
We verify Thm. 5 by checking whether moves close to under different initializations. We use a network with one hidden layer. The teacher network has sizes 10-20-30, while the student network has more nodes in the hidden layer. The input data are Gaussian noise. We initialize the student network so that the first nodes are close to the teacher's. Specifically, we first create matrices and by filling them with i.i.d. Gaussian noise and then normalizing their columns to . The initial value of the student is then , where is a factor controlling how close is to . For , we initialize with noise. Similarly, we initialize with a factor . The larger and are, the closer the initializations and are to the ground-truth values.
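The initialization scheme described above can be sketched as follows (a hedged reconstruction; the exact mixing formula and symbols in the paper are partly lost in this copy, so here `r` simply interpolates between pure noise and the teacher's unit-norm columns):

```python
import numpy as np

rng = np.random.default_rng(5)
sizes = (10, 20, 30)          # teacher layer sizes, as in Sec. 4.2
n_student = 40                # over-parameterized student hidden size (illustrative)

def unit_cols(rows, cols):
    M = rng.standard_normal((rows, cols))
    return M / np.linalg.norm(M, axis=0, keepdims=True)

Wt = unit_cols(sizes[0], sizes[1])   # teacher weights, unit-norm columns

def init_student(r):
    # r in [0, 1]: how close the first sizes[1] student columns are to the teacher.
    # r = 0 gives pure noise; r -> 1 gives near-exact teacher copies.
    lucky = r * Wt + (1 - r) * unit_cols(sizes[0], sizes[1])
    rest = unit_cols(sizes[0], n_student - sizes[1])
    W = np.hstack([lucky, rest])
    return W / np.linalg.norm(W, axis=0, keepdims=True)

W = init_student(0.7)
overlap = np.sum(W[:, :sizes[1]] * Wt, axis=0)  # cosine with matching teacher column
print(overlap.min())  # large when r is close to 1
```

Sweeping `r` (and an analogous factor for the top-layer weights) reproduces the kind of initialization-closeness experiment reported in Fig. 7.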
Fig. 7 shows the behavior over different iterations. All experiments are repeated 32 times with different random seeds, and the mean ± std is reported. We can see that a close initialization leads to faster (and lower-variance) convergence of to small values. In particular, it is important for to be close to (large ), which leads to a clear separation between the row norms of and , even if they are close to each other at the beginning of training. Having close to makes the initial gap larger and also helps convergence. On the other hand, if is small, then even if is large, the gap between the row norms of and only shifts but does not expand over iterations.

5 Experiments
5.1 Experiment Setup
We evaluate both the fully connected (FC) and the ConvNet setting. For FC, we use a ReLU teacher network of size 50-75-100-125. For ConvNet, we use a teacher with channel sizes 64-64-64-64. The student networks have the same depth but more nodes/channels at each layer, so that they are substantially overparameterized. When BatchNorm is added, it is added after ReLU.
We use random i.i.d. Gaussian inputs with mean 0 and std (abbreviated as GAUS) and CIFAR-10 as the datasets in our experiments. GAUS generates an infinite number of samples, while CIFAR-10 is a finite dataset. For GAUS, we use a random teacher network as the label provider (with classes). To make sure the weights of the teacher are only weakly overlapped, we sample each entry of from , making sure they are nonzero and mutually different within the same layer, and sample the biases from . In the FC case, the data dimension is 20, while in the ConvNet case it is . For CIFAR-10 we use a pretrained teacher network with BatchNorm. In the FC case, it has an accuracy of ; for ConvNet, the accuracy is . We repeat all experiments 5 times with different random seeds and report min/max values.
Two metrics are used to check our prediction that some lucky student nodes converge to the teacher:
Normalized correlation . We compute the normalized correlation (cosine similarity) between teacher and student activations, evaluated on a validation set. At each layer, we average the best correlation over teacher nodes: , where is computed for each teacher-student pair . means that most teacher nodes are covered by at least one student node.

Mean rank . After training, each teacher node has a most-correlated student node . We check the correlation rank of , normalized to ( = ranked first), back at initialization and at different epochs, and average it over teacher nodes to yield the mean rank . A small means that student nodes that initially correlate well with the teacher keep the lead toward the end of training.

5.2 Results
Experiments are summarized in Fig. 8 and Fig. 9. indeed grows during training, in particular in the lower layers that are closer to the input, where moves towards . Furthermore, the final winning student nodes already have a good rank at the early stage of training, in particular after the first epoch, which is consistent with the late resetting used in lotteryscale. BatchNorm helps a lot, in particular in the CNN case with the GAUS dataset. For CIFAR-10, the final evaluation accuracy (see Appendix) learned by the student is often higher than the teacher's. Using BatchNorm accelerates the growth of accuracy and improves , but does not seem to accelerate the growth of .
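The two metrics of Sec. 5.1 can be sketched on stand-in activation matrices (our own toy data; in the real experiments the columns would be teacher and student activations on a validation set):

```python
import numpy as np

rng = np.random.default_rng(6)
n_val, m, n = 500, 5, 20      # validation size, teacher nodes, student nodes (toy)

def normalized_corr(a, b):
    a, b = a - a.mean(), b - b.mean()
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def corr_matrix(T, S):
    return np.array([[normalized_corr(T[:, j], S[:, k]) for k in range(S.shape[1])]
                     for j in range(T.shape[1])])

T = rng.standard_normal((n_val, m))                 # stand-in teacher activations
S_init = rng.standard_normal((n_val, n))            # student at initialization
S_init[:, 3] = T[:, 0] + 0.5 * rng.standard_normal(n_val)   # a "lucky" student node
S_final = S_init.copy()
S_final[:, 3] = T[:, 0] + 0.05 * rng.standard_normal(n_val) # ... that converged

rho = corr_matrix(T, S_final)
rho_bar = rho.max(axis=1).mean()          # average best correlation per teacher node

# Mean rank: for each teacher's best student after training, its correlation
# rank back at initialization, normalized so that 0 means "ranked first".
rho0 = corr_matrix(T, S_init)
best = rho.argmax(axis=1)
mean_rank = np.mean([(rho0[j] > rho0[j, best[j]]).sum() / (n - 1) for j in range(m)])
print(rho_bar, mean_rank)
```

With this construction, teacher node 0's winner is the lucky student node, which was already ranked first at initialization, mirroring the behavior reported for real training runs.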
The theory also predicts that top-down modulation helps convergence. To check this, we plot at different layers during optimization on GAUS. For better visualization, we align each student node index with a teacher node according to the highest . Although the correlations are computed from the low-layer weights, they match the top-layer modulation well (the identity-matrix structure in Fig. 11). We also perform ablation studies on GAUS.

Size of teacher network. As shown in Fig. 10(a), for small teacher networks (FC 10-15-20-25), convergence is much faster and training without BatchNorm is faster than training with BatchNorm. For large teacher networks, BatchNorm clearly increases the convergence speed and the growth of .
Degree of overparameterization. Fig. 12 shows the effect of different degrees of overparameterization (, , , , and ). We initialize 32 different teacher networks (10-15-20-25) with different random seeds, and plot the standard deviation as a shaded area. We can clearly see that grows more stably and converges to higher values with overparameterization. On the other hand, and converge more slowly due to the excessive number of parameters.
Finite versus infinite dataset. We also repeat the experiments with a pre-generated finite dataset of GAUS in the CNN case (Fig. 10(b)), and find that the convergence of node similarity stalls after a few iterations. This is because some nodes receive very few data points in their activation regions, which is not a problem for an infinite dataset. We suspect that this is probably the reason why CIFAR-10, as a finite dataset, does not show the same behavior as GAUS.
6 Conclusion and Future Work
In this paper, we propose a new theoretical framework that uses a teacher-student setting to understand the training dynamics of multilayered ReLU networks. With this framework, we are able to conceptually explain many puzzling phenomena in deep networks, such as why overparameterization helps generalization, why the same network can fit both random and structured data, and why lottery tickets lottery ; lotteryscale exist. We back up these intuitive explanations with Theorem 4 and Theorem 5, which collectively show that student nodes initialized close to teacher nodes converge to them at a faster rate, and that the fan-out weights of irrelevant nodes converge to zero. As next steps, we aim to extend Theorem 5 to the general multilayer setting (when both and are present), relax Assumption 3, and study more BatchNorm effects beyond what Theorem 3 suggests.
7 Acknowledgement
The first author thanks Simon Du, Jason Lee, Chiyuan Zhang, Rong Ge, Greg Yang, Jonathan Frankle and many others for informal discussions.
References

 [1] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173–182, 2016.
 [2] Sanjeev Arora, Zhiyuan Li, and Kaifeng Lyu. Theoretical analysis of auto rate-tuning by batch normalization. arXiv preprint arXiv:1812.03981, 2018.
 [3] Marco Baity-Jesi, Levent Sagun, Mario Geiger, Stefano Spigler, G Ben Arous, Chiara Cammarota, Yann LeCun, Matthieu Wyart, and Giulio Biroli. Comparing dynamics: Deep neural networks versus glassy systems. ICML, 2018.
 [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 [5] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1019–1028. JMLR.org, 2017.
 [6] Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. ICML, 2018.
 [7] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Training pruned neural networks. ICLR, 2019.
 [8] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. The lottery ticket hypothesis at scale. arXiv preprint arXiv:1903.01611, 2019.
 [9] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
 [10] Babak Hassibi, David G Stork, and Gregory J Wolff. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pages 293–299. IEEE, 1993.

 [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [12] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 2012.
 [13] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
 [14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
 [15] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. ICLR, 2017.
 [16] Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Ming Zhou, Klaus Neymeyr, and Thomas Hofmann. Towards a theoretical understanding of batch normalization. arXiv preprint arXiv:1805.10694, 2018.

 [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [18] David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S Kanwal, Tegan Maharaj, Emmanuel Bengio, Asja Fischer, and Aaron Courville. Deep nets don't learn via memorization. ICLR Workshop, 2017.
 [19] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
 [20] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. ICLR, 2018.
 [21] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399, 2018.
 [22] Zachary C Lipton. Stuck in a what? adventures in weight space. arXiv preprint arXiv:1602.07320, 2016.
 [23] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
 [24] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 [25] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
 [26] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of overparametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
 [27] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. ICLR Workshop, 2015.
 [28] David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694, 2017.
 [29] Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. ICLR, 2017.
 [30] Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of overparametrized neural networks. ICLR 2018 Workshop Contribution, arXiv preprint arXiv:1706.04454, 2018.
 [31] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
 [32] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 [33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [34] Stefano Spigler, Mario Geiger, Stéphane d’Ascoli, Levent Sagun, Giulio Biroli, and Matthieu Wyart. A jamming transition from under- to over-parametrization affects loss landscape and generalization. arXiv preprint arXiv:1810.09665, 2018.
 [35] Yuandong Tian. A theoretical framework for deep locally connected relu network. arXiv preprint arXiv:1809.10829, 2018.
 [36] Yuandong Tian and Yan Zhu. Better computer Go player with neural network and long-term prediction. ICLR, 2016.
 [37] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.
 [38] Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S Schoenholz. A mean field theory of batch normalization. ICLR, 2019.
 [39] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017.
 [40] Chiyuan Zhang, Samy Bengio, Moritz Hardt, and Yoram Singer. Identity crisis: Memorization and generalization under extreme overparameterization. arXiv preprint arXiv:1902.04698, 2019.
 [41] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067, 2019.
8 Appendix: Proofs
8.1 Theorem 1
Proof.
The first part of gradient backpropagated to node is:
(10)  
(11)  
(12) 
Therefore, for the gradient to node , we have:
(13)  
(14) 
A similar derivation holds for . Therefore, by mathematical induction, the gradients at nodes in all layers follow the same form. ∎
8.2 Theorem 2
Proof.
Using Thm. 1, we can write down weight update for weight that connects node and node :
(15)  
8.3 Theorem 3
Proof.
Given a batch of size , denote the pre-BatchNorm activations as and the gradients as (see Fig. 14(a)); is its whitened version, and is the final output of BN. Here and are learnable parameters. In vector notation, the gradient update in BN has a compact form with a clear geometric meaning:
Lemma 1 (Backpropagation of Batch Norm [35]).
For a top-down gradient , the BN layer gives the following gradient update ( is the orthogonal complementary projection of the subspace ):
(16) 
8.4 Lemmas
For simplicity, in the following, we use .
Lemma 2 (Bottom Bounds).
Proof.
Lemma 3 (Top Bounds).
Proof.
The proof is similar to Lemma 2. ∎
Lemma 4 (Quadratic falloff for diagonal elements of ).
For node , we have:
(32) 
Proof.
The intuition here is that both the volume of the affected area and the weight difference are proportional to . is their product and thus proportional to . See Fig. 16. ∎
8.5 Theorem 4
Proof.
First of all, note that . So given , we also have a bound for .
When , the matrix form can be written as the following:
(33) 
by using (and thus does not matter). Since is conserved, it suffices to check whether the projection of the weight vector onto the complementary space of the ground-truth node goes to zero: