The training of Deep Neural Networks (DNN) involves a great variety of architecture choices. It is therefore crucial to find tools to understand their effects and to compare them.
For example, Batch Normalization (BN) Ioffe & Szegedy (2015) has proven to be crucial in the training of DNNs but remains ill-understood. While BN was initially introduced to solve the problem of “covariate shift”, recent results Santurkar et al. (2018) suggest an effect on the smoothness of the loss surface. Some alternatives to BN have been proposed Lei Ba et al. (2016); Salimans & Kingma (2016); Klambauer et al. (2017), yet it remains difficult to compare them theoretically. Recent theoretical results Yang et al. (2019) suggest some relation to the transition from “freeze” (also known as “order”) to “chaos” observed as the depth of the NN goes to infinity Poole et al. (2016); Daniely et al. (2016); Yang & Schoenholz (2017).
The impact of architecture is very apparent in GANs Goodfellow et al. (2014): their results are heavily affected by the architecture of the generator and discriminator Radford et al. (2015); Zhang et al. (2018); Brock et al. (2018); Karras et al. (2018) and the training may fail without BN Arpit et al. (2016); Xiang & Li (2017).
Recent results Jacot et al. (2018); Allen-Zhu et al. (2018); Du et al. (2019) have allowed one to understand the training of DNNs when the number of neurons in each hidden layer is very large. These results give new tools to study the asymptotic effect of BN. In particular, the Neural Tangent Kernel (NTK) Jacot et al. (2018) illustrates the effect of architecture on the training of DNNs and also describes their loss surface Karakida et al. (2018). The NTK can easily be extended to CNNs and other architectures Yang (2019); Arora et al. (2019), hence allowing comparison.
1.1 Our Contributions
We describe how the NTK is affected by the “freeze” and “chaos” regimes Poole et al. (2016); Daniely et al. (2016); Yang & Schoenholz (2017). For fully-connected networks (FC-NNs), the scaled NTK converges to a constant in the “freeze” regime and to a Kronecker delta in the “chaos” regime. In deconvolutional networks (DC-NNs), a similar transition takes place: the “freeze” regime features checkerboard patterns Odena et al. (2016) and the “chaos” regime features a (translation invariant) Kronecker delta.
We then show that different normalization techniques, such as Batch Normalization and our proposed nonlinearity normalization with hyperparameter tuning, allow the DNN to avoid the “freeze” regime.
Besides, we prove that the traditional parametrization of DC-NNs leads to border effects in the NTK, and we propose a simple solution: a new “parent-based” parametrization. We also discuss the effect of the number of channels on the NTK, giving a theoretical motivation for decreasing the number of channels after each upsampling, and we show that a layer-dependent learning rate allows one to balance the contributions of the layers to the learning.
Finally, we demonstrate our findings numerically on DC-GANs: we show that in the “freeze” regime, the generator collapses to a checkerboard mode. We then show how a basic DC-GAN can be trained effectively by avoiding this mode collapse: with proper hyperparameter tuning, nonlinearity normalization, parametrization, and learning-rate choices, and without using batch normalization, we are able to reach the “chaos” regime and to obtain good-quality samples from a very simple DC-NN generator.
In this section, we introduce the two architectures that we will consider, FC-NNs and DC-NNs, and their training procedures.
2.1 Fully-Connected Neural Nets
The first type of architecture we consider are deep Fully-Connected Neural Nets (FC-NNs). An FC-NN with nonlinearity $\sigma:\mathbb{R}\to\mathbb{R}$ consists of $L+1$ layers ($L-1$ hidden layers), each containing $n_0,\dots,n_L$ neurons. The parameters are defined by connection weight matrices $W^{(\ell)}\in\mathbb{R}^{n_{\ell+1}\times n_\ell}$ and bias vectors $b^{(\ell)}\in\mathbb{R}^{n_{\ell+1}}$ for $\ell=0,\dots,L-1$. Following Jacot et al. (2018), the network parameters are aggregated into a single vector $\theta$ and initialized using iid standard Gaussians $\mathcal{N}(0,1)$. For $x\in\mathbb{R}^{n_0}$, the ANN is defined as $f_\theta(x) = \tilde\alpha^{(L)}(x)$, where the activations $\alpha^{(\ell)}$ and preactivations $\tilde\alpha^{(\ell)}$ are recursively constructed using the NTK parametrization: we set $\alpha^{(0)}(x) = x$ and, for $\ell = 0,\dots,L-1$,
$$\tilde\alpha^{(\ell+1)}(x) \;=\; \frac{1}{\sqrt{n_\ell}}\,W^{(\ell)}\alpha^{(\ell)}(x) \;+\; \beta\, b^{(\ell)}, \qquad \alpha^{(\ell+1)}(x) \;=\; \sigma\big(\tilde\alpha^{(\ell+1)}(x)\big),$$
where $\sigma$ is applied entry-wise and $\beta\ge 0$ is a hyperparameter.
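The parametrization above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' code: the helper names, the choice of tanh, and the value β = 0.1 are ours.

```python
import numpy as np

def init_params(widths, rng):
    """IID N(0,1) weights and biases, as in the NTK parametrization."""
    Ws = [rng.standard_normal((widths[l + 1], widths[l])) for l in range(len(widths) - 1)]
    bs = [rng.standard_normal(widths[l + 1]) for l in range(len(widths) - 1)]
    return Ws, bs

def fc_nn(x, Ws, bs, beta=0.1, sigma=np.tanh):
    """NTK-parametrized FC-NN: the 1/sqrt(n_l) factor sits in the forward pass,
    not in the initialization of the weights."""
    alpha = x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        pre = W @ alpha / np.sqrt(alpha.shape[0]) + beta * b
        # nonlinearity on hidden layers only; the output is the last preactivation
        alpha = sigma(pre) if l < len(Ws) - 1 else pre
    return alpha

rng = np.random.default_rng(0)
widths = [3, 512, 512, 1]          # n_0, n_1, n_2, n_L
Ws, bs = init_params(widths, rng)
out = fc_nn(np.ones(3), Ws, bs)
```

Because the $1/\sqrt{n_\ell}$ factor is explicit, the preactivations remain of order one at initialization even for wide layers.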
The NTK initialization is equivalent to the so-called LeCun initialization scheme LeCun et al. (2012), where each connection weight of the $\ell$-th layer is initialized with standard deviation $1/\sqrt{n_\ell}$; in our approach, the factor $1/\sqrt{n_\ell}$ instead appears in the parametrization of the network. While the behavior at initialization is similar, the NTK parametrization ensures that the training is consistent as the size of the layers grows, see Jacot et al. (2018).
The hyperparameter $\beta$ allows one to balance the relative contributions of the connection weights and of the biases during training; in our numerical experiments, we set $\beta$ to a small value. Note that the variance of the normalized bias $\beta b^{(\ell)}$ at initialization can be tuned by $\beta$.
2.2 Deconvolutional Neural Networks
The second type of architecture we consider are Deconvolutional Nets (DC-NNs), also known as Transposed ConvNets or Fractionally Strided ConvNets Dumoulin & Visin (2016). A DC-NN in dimension $d$ with $L$ layers, channel numbers $n_0,\dots,n_L$, windows $w_\ell$ and padding windows $p_\ell$ for $\ell=0,\dots,L-1$, consists of a composition of the following operations:
The upsampling $U_s$, with stride $s\in\mathbb{N}^d$, constructs a ‘blown-up’ image $y$ from $x$ by $y_p = x_{p/s}$ if $s$ divides $p$ (i.e. if $s_i$ divides $p_i$ for any $i=1,\dots,d$) and $y_p = 0$ if not.
The DC-filter constructs an ‘output’ $y$ from an input $x$ with $n$ channels, as follows: for each position $p$ and each output channel $i$, we define $y$ by
$$y_{p,i} \;=\; \frac{1}{\sqrt{n\,|w|}} \sum_{q\in w}\sum_{j=1}^{n} W_{q,i,j}\, x_{p+q,j} \;+\; \beta\, b_i,$$
where the tensor $W$ encodes a linear map and $|w|$ denotes the cardinality of the window $w$; we apply ‘zero-padding’, setting $x_{p',j} = 0$ for positions $p'$ outside the image.
The pointwise application of the nonlinearity $\sigma$ (to each channel of each pixel).
The parameters are aggregated into a vector $\theta$ and initialized as iid standard Gaussians $\mathcal{N}(0,1)$.
For an input $x$, the DC-NN is defined as the composition of these operations over the $L$ layers.
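The upsampling operation above amounts to zero-stuffing between pixels. A minimal numpy sketch (a 2D, channels-last layout is assumed for illustration):

```python
import numpy as np

def upsample(x, s):
    """Stride-s upsampling U_s for a 2D image x of shape (H, W, channels):
    y[s*p] = x[p] at positions whose coordinates are multiples of s, else 0."""
    H, W, C = x.shape
    y = np.zeros((H * s, W * s, C), dtype=x.dtype)
    y[::s, ::s, :] = x
    return y

x = np.arange(8, dtype=float).reshape(2, 2, 2)
y = upsample(x, 2)
```

Only the positions divisible by the stride carry information; the DC-filter then spreads it over the window, which is precisely where checkerboard effects can originate.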
2.3 Training and Setup
In this section, we describe the training of ANNs in the FC-NN case to keep the notation light; the generalization to the DC-NN case is straightforward. For a dataset $x_1,\dots,x_N\in\mathbb{R}^{n_0}$, we define the output matrix $Y_\theta\in\mathbb{R}^{N\times n_L}$ by $(Y_\theta)_{jk} = f_{\theta,k}(x_j)$. The ANN is trained by optimizing a cost $C(Y_\theta)$ through gradient descent, defining a parameter flow $\theta(t)$.
In this paper, we focus on the so-called over-parametrized regime, where the sizes of the hidden layers grow to infinity (either sequentially, as in Jacot et al. (2018), or simultaneously, as in Yang (2019); Arora et al. (2019)), for a fixed dataset. In the case of FC-NNs, this amounts to taking large widths for the hidden layers, while for DC-NNs, it amounts to taking large channel numbers.
3 Neural Tangent Kernel
The Neural Tangent Kernel (NTK) Jacot et al. (2018) is at the heart of our analysis of the overparametrized regime. It describes the evolution of $f_\theta$ in function space during training. In the FC-NN case, the NTK is defined by
$$\Theta^{(L)}_{kk'}(x,y) \;=\; \sum_{p=1}^{P} \partial_{\theta_p} f_{\theta,k}(x)\; \partial_{\theta_p} f_{\theta,k'}(y).$$
For a finite dataset $x_1,\dots,x_N$, the NTK Gram matrix $\tilde\Theta^{(L)}\in\mathbb{R}^{Nn_L\times Nn_L}$ is defined by $\tilde\Theta^{(L)}_{jk,j'k'} = \Theta^{(L)}_{kk'}(x_j,x_{j'})$. The evolution of $Y_\theta$ during training can then be written in terms of the NTK as follows:
$$\partial_t Y_{\theta(t)} \;=\; -\,\tilde\Theta^{(L)}\,\nabla C\big(Y_{\theta(t)}\big).$$
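For small finite networks, the NTK Gram matrix can be estimated directly from its definition as an inner product of parameter gradients. A minimal numpy sketch using central finite differences (all helper names and the architecture choices are ours, purely for illustration):

```python
import numpy as np

def f_theta(theta, X, widths, beta=0.1):
    """Scalar-output NTK-parametrized FC-NN; theta is the flat parameter vector."""
    outs = []
    for x in X:
        a, i = x, 0
        for l in range(len(widths) - 1):
            n_in, n_out = widths[l], widths[l + 1]
            W = theta[i:i + n_out * n_in].reshape(n_out, n_in); i += n_out * n_in
            b = theta[i:i + n_out]; i += n_out
            pre = W @ a / np.sqrt(n_in) + beta * b
            a = np.tanh(pre) if l < len(widths) - 2 else pre
        outs.append(a[0])
    return np.array(outs)

def empirical_ntk_gram(theta, X, widths, eps=1e-5):
    """Gram matrix G_ij = <grad f(x_i), grad f(x_j)> via central differences."""
    J = np.empty((len(X), len(theta)))
    for p in range(len(theta)):
        e = np.zeros_like(theta); e[p] = eps
        J[:, p] = (f_theta(theta + e, X, widths) - f_theta(theta - e, X, widths)) / (2 * eps)
    return J @ J.T

rng = np.random.default_rng(1)
widths = [2, 16, 1]
n_params = sum(widths[l + 1] * widths[l] + widths[l + 1] for l in range(2))
theta = rng.standard_normal(n_params)
X = rng.standard_normal((3, 2))
G = empirical_ntk_gram(theta, X, widths)
```

By construction $G = JJ^T$ is symmetric and positive semi-definite, as any NTK Gram matrix must be.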
In the DC-NN case, the NTK is defined by a similar formula: the entry $\Theta^{(L)}_{pp'}(z,z')$ represents how a ‘pressure’ to change the pixel at position $p$ produced by the ‘code’ $z$ influences the value of the pixel at position $p'$ produced by the ‘code’ $z'$.
3.1 Infinite-Width Limit for FC-NNs
Following Neal (1996); Cho & Saul (2009); Lee et al. (2018), in the overparametrized regime at initialization, the pre-activations $\tilde\alpha^{(\ell)}_i$ are described by iid centered Gaussian processes with covariance kernels constructed as follows. For a kernel $K$, set
$$L^{\sigma}_{K}(x,y) \;=\; \mathbb{E}_{f\sim\mathcal{N}(0,K)}\big[\sigma(f(x))\,\sigma(f(y))\big].$$
The activation kernels are defined recursively by $\Sigma^{(1)}(x,y) = \frac{1}{n_0}x^{T}y + \beta^2$ and $\Sigma^{(\ell+1)} = L^{\sigma}_{\Sigma^{(\ell)}} + \beta^2$.
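For the standardized ReLU $\sqrt{2}\max(0,x)$, the expectation in the recursion has a closed form: it is the arc-cosine kernel of Cho & Saul (2009). The following numpy sketch (our simplification, taking $\beta = 0$ and unit-variance inputs so that only the correlation evolves) iterates the recursion and shows the correlations drifting toward the constant kernel:

```python
import numpy as np

def relu_dual(rho):
    """For unit-variance jointly Gaussian (u, v) with correlation rho,
    E[sigma(u) sigma(v)] for the standardized ReLU sigma = sqrt(2) max(0, x)
    (arc-cosine kernel of degree 1, Cho & Saul 2009)."""
    rho = np.clip(rho, -1.0, 1.0)
    return (np.sqrt(1 - rho**2) + rho * (np.pi - np.arccos(rho))) / np.pi

# Iterating the dual gives the correlation of the activation kernels at depth l
# (beta = 0 and inputs normalized, so the diagonal stays equal to 1).
rho = 0.2
history = [rho]
for _ in range(200):
    rho = relu_dual(rho)
    history.append(rho)
```

The correlation increases monotonically toward $1$, but only polynomially fast: the ReLU sits at the edge between the two regimes discussed in Section 4.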
While random at initialization, in the infinite-width-limit, the NTK converges to a deterministic limit, which is moreover constant during training:
As $n_1,\dots,n_{L-1}\to\infty$, for any inputs $x,y$ and any time $t$, the kernel $\Theta^{(L)}$ converges to the deterministic limit
$$\Theta^{(L)}_{\infty}(x,y) \;=\; \sum_{\ell=1}^{L} \Sigma^{(\ell)}(x,y)\; \prod_{\ell'=\ell+1}^{L} \dot\Sigma^{(\ell')}(x,y),$$
where $\dot\Sigma^{(\ell)} = L^{\dot\sigma}_{\Sigma^{(\ell-1)}}$, with $\dot\sigma$ denoting the derivative of $\sigma$.
3.2 Infinite-Channel-Number for DC-NNs
The infinite-channel-number limit for DC-NNs follows a similar analysis, with the difference that the weights are shared in the architecture. In Yang (2019); Arora et al. (2019), a number of results have been derived for the initialization, and they are generalized to our setting in Appendix E. Based on the existing results, it appears natural to postulate the following:
As the number of channels of the hidden layers grows to infinity, the DC-NN NTK has a deterministic, time-constant limit.
4 Freeze and Chaos: Constant Modes, Checkerboard Artifacts
We now investigate the large-depth behavior of the NTK (in the infinite-width limit), revealing a transition between two regimes which we call “freeze” and “chaos”. We start with a few key definitions:
We say that a Lipschitz nonlinearity $\sigma$ is standardized if $\mathbb{E}_{X\sim\mathcal{N}(0,1)}[\sigma(X)^2]=1$. For a standardized $\sigma$, we define its characteristic value as $r = \mathbb{E}_{X\sim\mathcal{N}(0,1)}[\dot\sigma(X)^2]\,/\,(1+\beta^2)$, where $\dot\sigma$ denotes the (a.e. defined) derivative of $\sigma$. We denote by $\bar\Theta^{(L)}$ the normalized NTK defined by $\bar\Theta^{(L)}(x,y) = \Theta^{(L)}_\infty(x,y)\,/\,\sqrt{\Theta^{(L)}_\infty(x,x)\,\Theta^{(L)}_\infty(y,y)}$.
We define the standard $p$-spheres by $\sqrt{p}\,\mathbb{S}^{p-1} = \{x\in\mathbb{R}^p : \|x\|^2 = p\}$.
Following Daniely et al. (2016), we consider standardized nonlinearities and inputs in $\sqrt{n_0}\,\mathbb{S}^{n_0-1}$ (and similarly normalized inputs for DC-NNs). This ensures that the variance of the neurons $\Sigma^{(\ell)}(x,x)$ is constant for all depths $\ell$. Our techniques extend to inputs which share the same norm, as is approximately the case for high-dimensional datasets: for example, in GANs Goodfellow et al. (2014), the inputs of a generator are vectors of iid entries, whose norm concentrates around $\sqrt{n_0}$ when the dimension is high.
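The characteristic value can be estimated by simple Monte Carlo. The sketch below assumes the form $r = \mathbb{E}[\dot\sigma(X)^2]/(1+\beta^2)$ for the characteristic value (our reconstruction of the definition above) and checks it against the closed form for the standardized ReLU, whose derivative squared is $2\cdot\mathbf{1}_{x>0}$ with expectation $1$:

```python
import numpy as np

def characteristic_value(sigma_prime, beta, n=400_000, seed=0):
    """Monte Carlo estimate of r = E[sigma'(X)^2] / (1 + beta^2), X ~ N(0, 1)."""
    X = np.random.default_rng(seed).standard_normal(n)
    return np.mean(sigma_prime(X) ** 2) / (1 + beta**2)

# Standardized ReLU sqrt(2) * max(0, x): derivative is sqrt(2) * 1_{x > 0},
# so E[sigma'(X)^2] = 1 and r = 1 / (1 + beta^2) < 1 for any beta > 0 (frozen).
relu_std_prime = lambda x: np.sqrt(2.0) * (x > 0)
r = characteristic_value(relu_std_prime, beta=0.1)
```

With $\beta = 0.1$ the estimate sits near $1/1.01 \approx 0.99$, just below the freeze/chaos transition at $r = 1$.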
4.1 Freeze and Chaos for Fully-Connected Networks
For a standardized $\sigma$, the large-depth behavior of the NTK is governed by the characteristic value $r$:
Suppose that $\sigma$ is twice differentiable and standardized.
If $r<1$, we are in the frozen regime: there exists $c>0$ such that, for all inputs $x,y$ on the sphere, $\big|\bar\Theta^{(L)}(x,y) - 1\big| = O(e^{-cL})$.
If $r>1$, we are in the chaotic regime: for $x\neq\pm y$ on the sphere, there exist $c>0$ and $h<1$ such that $\big|\bar\Theta^{(L)}(x,y)\big| = O(h^{L})$.
Theorem 2 shows that in the frozen regime, the normalized NTK converges to a constant, whereas in the chaotic regime, it converges to a Kronecker delta (taking value $1$ on the diagonal, $0$ elsewhere). This suggests that the training of deep FC-NNs is heavily influenced by the characteristic value $r$: when $r<1$, $\bar\Theta^{(L)}$ becomes constant, thus slowing down the training, whereas when $r>1$, $\bar\Theta^{(L)}$ concentrates on the diagonal, ensuring fast training but limiting generalization. To train very deep FC-NNs, it is best to lie “on the edge of chaos” Poole et al. (2016); Yang & Schoenholz (2017). When the depth is not too large, it appears possible to lean into the chaotic regime to speed up training without sacrificing generalization.
The standardized ReLU $\sqrt{2}\max(0,x)$ has characteristic value $r = 1/(1+\beta^2)$, which lies in the frozen regime for $\beta>0$. The non-differentiability of the ReLU at $0$ leads to a different behavior as the depth $L$ grows:
With the same notation as in Theorem 2, taking $\sigma$ to be the standardized ReLU and $\beta>0$, we are in the weakly frozen regime: there exists a constant $C>0$ such that the normalized NTK approaches the constant kernel at a rate that is only polynomial in $L$, rather than the exponential rate of the frozen regime.
When $\beta$ is small, the characteristic value $1/(1+\beta^2)$ is very close to $1$, i.e. very close to the transition between the two regimes. To really lie in the freeze regime, we require a larger $\beta$. In Figure 1, we see that even at the edge of chaos, a ReLU network has a strong affinity to the constant mode, as witnessed by the large average value of the normalized NTK $\bar\Theta^{(L)}(x,y)$ for a fixed point $x$ of the circle and $y$ sampled uniformly on it. In Section 5, we present a normalization technique to reach the chaotic regime with a ReLU network.
4.2 Bulk Freeze and Chaos for Deconvolutional Nets
For DC-NNs, the value of an output neuron at a position $p$ only depends on the inputs which are ancestors of $p$, i.e. all input positions $q$ such that there is a chain of connections from $q$ to $p$. For the same reason, the NTK $\Theta_{pp'}(z,z')$ only depends on the values of $z$ and $z'$ at the ancestors of $p$ and $p'$ respectively.
For a stride $s$, we define the $s$-valuation $\nu_s(p)$ of a position $p$ as the largest $k$ such that $s_i^k$ divides $p_i$ for all $i$. The behaviour of the NTK depends on the $s$-valuation of the difference $p-p'$ of the two output positions. If $\nu_s(p-p')$ is strictly smaller than $L$, the NTK converges to a constant in the infinite-width limit for any inputs.
Again, the characteristic value $r$ plays a central role in the behavior of the large-depth limit.
Take $\beta>0$ and consider a DC-NN with upsampling stride $s$ and windows $w_\ell$ for $\ell=0,\dots,L-1$. For a standardized, twice differentiable $\sigma$, there exist constants $c_1,c_2>0$ such that the following holds for all depths $L$ and all output positions $p,p'$:
Freeze: When $r<1$, the normalized NTK $\bar\Theta_{pp'}$ converges to a constant as $L\to\infty$, at a rate governed by the $s$-valuation $\nu_s(p-p')$: positions whose difference has a higher valuation are more strongly correlated.
Chaos: When $r>1$, if either the inputs differ or the patches of input positions which are ancestors of $p$ and $p'$ are not translates of each other, then $\bar\Theta_{pp'}$ converges to $0$ exponentially fast in $L$.
This theorem suggests that in the freeze regime, the correlations between differing positions $p$ and $p'$ increase with the valuation $\nu_s(p-p')$, which is a strong feature of checkerboard patterns Odena et al. (2016). These artifacts typically appear in images generated by DC-NNs. The form of the NTK also suggests a strong affinity to these checkerboard patterns: they should dominate the NTK spectral decomposition. This is shown in Figure 2, where the eigenvectors of the NTK Gram matrix of a DC-NN are computed.
In the chaotic regime, the normalized NTK converges to a “scaled translation invariant” Kronecker delta. To two output positions $p$ and $p'$ we associate the two regions $I_p$ and $I_{p'}$ of the input space which are connected to $p$ and $p'$. Then $\bar\Theta_{pp'}$ is one if the patch $I_p$ is a translation of $I_{p'}$ and approximately zero otherwise.
5 Batch Normalization, Hyperparameters and Nonlinearity Modifications
5.1 Batch Normalization
In Section 4, we have seen that in the frozen regime, the NTK is dominated by the constant mode: more precisely, the constant functions correspond to the leading eigenvalue of the NTK. In this subsection, we explain how (a type of) Batch Normalization (BN) allows one to ‘kill’ the constant mode in the NTK. We consider the post-nonlinearity BN (PN-BN), which adds a normalizing layer to the activations (after the nonlinearity): for a batch $x_1,\dots,x_N$, it is defined by
$$\hat\alpha^{(\ell)}_i(x_j) \;=\; \frac{\alpha^{(\ell)}_i(x_j) - \mu^{(\ell)}_i}{\nu^{(\ell)}_i}, \qquad \mu^{(\ell)}_i = \frac{1}{N}\sum_{j=1}^{N}\alpha^{(\ell)}_i(x_j), \qquad \big(\nu^{(\ell)}_i\big)^2 = \frac{1}{N}\sum_{j=1}^{N}\big(\alpha^{(\ell)}_i(x_j)-\mu^{(\ell)}_i\big)^2,$$
for $i=1,\dots,n_\ell$ and $j=1,\dots,N$.
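A PN-BN layer is straightforward to sketch in numpy (batch-first layout assumed; helper name is ours). Note that after normalization, each neuron's activations sum to zero over the batch, which is the mechanism suppressing the constant mode:

```python
import numpy as np

def pn_bn(A):
    """Post-nonlinearity batch norm: A has shape (batch N, neurons n).
    Center and normalize each neuron's activations across the batch."""
    mu = A.mean(axis=0)
    nu = A.std(axis=0)
    return (A - mu) / nu

rng = np.random.default_rng(2)
A = np.tanh(rng.standard_normal((32, 8)))   # post-nonlinearity activations
A_hat = pn_bn(A)
```

Each column of `A_hat` has exactly zero mean and unit standard deviation over the batch.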
While incorporating BN would modify the NTK analysis of the overparametrized regime, the following suggests that the PN-BN plays a role which can be understood in terms of the NTK. In particular, it allows one to control the importance of the constant mode:
Consider an FC-NN with $L$ layers and a PN-BN after the last nonlinearity. For any batch $x_1,\dots,x_N$, any $i$ and any parameters $\theta$, we have $\frac{1}{N}\sum_{j=1}^{N}\Theta^{(L)}(x_i,x_j) = \beta^2\,\mathrm{Id}_{n_L}$.
When training the FC-NN with PN-BN to fit labels, a small value of $\beta$ makes it important to center the labels, since the convergence is slow along the constant mode. On the other hand, a small value of $\beta$ allows one to consider a higher learning rate, thus accelerating the convergence along the non-constant modes, even for large depths $L$.
5.2 Nonlinearity Normalization, Hyperparameter Tuning and Chaos-Freeze Transition
In Section 4, we showed the existence of the frozen and chaotic phases for FC-NNs and DC-NNs, depending on the characteristic value $r$. In this section, we show that by centering and standardizing the nonlinearity and by tuning $\beta$, one can reach both phases. Let us first observe that for a standardized $\sigma$, since $r = \mathbb{E}[\dot\sigma(X)^2]/(1+\beta^2) \to 0$ as $\beta\to\infty$, it is always possible to lie in the ordered regime. On the other hand, if we take a Lipschitz nonlinearity, by centering and standardizing $\sigma$, we can take $\beta$ sufficiently small so that $r>1$, as guaranteed by the following (variant of Poincaré's) lemma:
If $\mathbb{E}_{X\sim\mathcal{N}(0,1)}[\sigma(X)]=0$ and $\mathbb{E}[\sigma(X)^2]=1$, we have $\mathbb{E}[\dot\sigma(X)^2]\ge 1$, with equality only if $\sigma$ is linear; in particular, if $\sigma$ is nonlinear and $\beta^2 < \mathbb{E}[\dot\sigma(X)^2]-1$, we have $r>1$.
Centering and standardizing (i.e. normalizing) the nonlinearity is similar to Layer Normalization (LN) for FC-NNs, where for each input $x$ and each layer $\ell$, we normalize the (post-nonlinearity) activation vectors to center and normalize their entries:
$$\hat\alpha^{(\ell)}_i(x) = \frac{\alpha^{(\ell)}_i(x) - \mu^{(\ell)}(x)}{\nu^{(\ell)}(x)}, \qquad \mu^{(\ell)}(x) = \frac{1}{n_\ell}\sum_{i=1}^{n_\ell}\alpha^{(\ell)}_i(x), \qquad \nu^{(\ell)}(x)^2 = \frac{1}{n_\ell}\sum_{i=1}^{n_\ell}\big(\alpha^{(\ell)}_i(x)-\mu^{(\ell)}(x)\big)^2.$$
In the infinite-width limit, normalizing $\sigma$ is equivalent to LN if the input datapoints all have the same norm. For more details, see Appendix C.
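To make the nonlinearity normalization concrete, the centered and standardized ReLU can be written in closed form from the Gaussian moments of $\max(0,X)$. After normalization, $\mathbb{E}[\dot\sigma(X)^2] = \pi/(\pi-1) > 1$, so by the lemma above a small enough $\beta$ puts a ReLU network in the chaotic regime. A sketch of this computation, with Monte Carlo checks (our code):

```python
import numpy as np

# Moments of ReLU(X), X ~ N(0, 1): mean = 1/sqrt(2*pi), second moment = 1/2.
mean_relu = 1.0 / np.sqrt(2 * np.pi)
var_relu = 0.5 - mean_relu**2          # = (pi - 1) / (2 * pi)

def relu_normalized(x):
    """Centered and standardized ReLU: E[sigma(X)] = 0, E[sigma(X)^2] = 1."""
    return (np.maximum(x, 0.0) - mean_relu) / np.sqrt(var_relu)

# Derivative is 1_{x > 0} / sqrt(var_relu), so E[sigma'(X)^2] = (1/2) / var_relu.
e_dsq = 0.5 / var_relu                  # = pi / (pi - 1) > 1

X = np.random.default_rng(3).standard_normal(200_000)
mc_mean = relu_normalized(X).mean()
mc_second = (relu_normalized(X) ** 2).mean()
```

Since $\pi/(\pi-1) \approx 1.47$, any $\beta^2 < \pi/(\pi-1) - 1$ yields a characteristic value above $1$.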
6 New NTK Parametrization: Boundary Effects and Learning Rates
In DC-NNs, the neurons which lie at a position $p$ on the border of the patches behave differently from neurons in the center. Typically, these neurons have fewer parent neurons in the previous layer and, as a result, have a lower variance at initialization. Both kernels $\Sigma$ and $\Theta$ have lower intensity for $p$ on the border (see Appendix G for an example with one border pixel), which leads to border artifacts, as seen in Figure 2.
A natural solution is to adapt the normalizing factors in the definition of the DC-filters. Instead of dividing by the square root of $n\,|w|$, which is the maximal number of parents (only attained by center neurons), we divide by the square root of the actual number of parents $\pi(p)$ of the position $p$:
$$y_{p,i} \;=\; \frac{1}{\sqrt{\pi(p)}}\;\sum_{q\in w,\; p+q\ \mathrm{valid}}\;\sum_{j=1}^{n} W_{q,i,j}\, x_{p+q,j} \;+\; \beta\, b_i.$$
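The parent counts are easy to compute explicitly. The following 1D numpy sketch (our illustrative layout: same-length input and output, centered window, zero-padding) shows how border positions get smaller counts, and hence larger rescaling factors than the standard $1/\sqrt{n|w|}$:

```python
import numpy as np

def parent_counts(out_len, window, n_channels):
    """Number of parents of each output position of a 1D DC-filter:
    border positions see fewer valid input positions under zero-padding."""
    offsets = np.arange(window) - window // 2
    counts = np.array([
        n_channels * np.sum((p + offsets >= 0) & (p + offsets < out_len))
        for p in range(out_len)
    ])
    return counts

counts = parent_counts(out_len=8, window=3, n_channels=16)
scale_max = 1.0 / np.sqrt(16 * 3)       # standard: maximal parent count n * |w|
scale_parent = 1.0 / np.sqrt(counts)    # parent-based: actual parent count
```

The parent-based scaling boosts exactly the border positions whose variance the standard parametrization deflates.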
In Appendix E, in order to be self-contained and since we consider upsampling, we show again that the NTK converges as the widths of the layers grow to infinity sequentially. By doing so, we obtain formulae for the limiting NTK which allow us to prove that, with the parent-based parametrization, the border artifacts disappear for both $\Sigma$ and $\Theta$:
For the parent-based parametrization of DC-NNs, if the non-linearity is standardized, the diagonal values $\Sigma_{pp}(x,x)$ and $\Theta_{pp}(x,x)$ depend neither on the position $p$ nor on the input $x$.
6.1 Layer-dependent learning rate
The NTK is the sum of the contributions of the weights $W^{(\ell)}$ and biases $b^{(\ell)}$ of each layer. At the $\ell$-th layer, the weights and biases can only contribute to checkerboard patterns of bounded degree, i.e. patterns whose periods are determined by the products of the strides of the subsequent layers, in the following sense:
In a DC-NN with stride $s$, the contribution of the $\ell$-th layer's weights (resp. biases) to the NTK $\Theta_{pp'}$ vanishes whenever the $s$-valuation of $p-p'$ is too small relative to $\ell$.
This suggests that the supports of the contributions of the $\ell$-th layer's weights and biases increase exponentially with $\ell$, giving more importance to the last layers during training. This could explain why the checkerboard patterns of lower degree dominate in Figure 2. In the classical parametrization, the balance is restored by letting the number of channels decrease with depth Radford et al. (2015). In the NTK parametrization, the limiting NTK is not affected by the ratios of the channel numbers. To achieve the same effect, we divide the learning rates of the weights and biases of the $\ell$-th layer by factors determined by $S_\ell$, the product of the strides of the layers above $\ell$.
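The resulting learning-rate schedule can be sketched in a few lines; the convention below (dividing the $\ell$-th layer's rate by the product of the strides of layers $\ell$ and above) is a hypothetical choice for illustration, not the exact exponents of the main text:

```python
# Layer-dependent learning-rate divisors for a DC-NN.
strides = [2, 2, 2]        # stride of each deconvolution layer (illustrative)
L = len(strides)

def prod(xs):
    out = 1
    for v in xs:
        out *= v
    return out

# divisor for layer l: product of strides of layers l..L-1 (hypothetical choice)
weight_divisors = [prod(strides[l:]) for l in range(L)]
lr_base = 0.1
lrs = [lr_base / d for d in weight_divisors]
```

Early layers, whose patterns span many output pixels, receive the smallest learning rates, mimicking the balancing effect of decreasing channel numbers.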
Together with the ‘parent-based’ parametrization and the normalization of the non-linearity (in order to lie in the chaotic regime), this rescaling of the learning rate removes both border and checkerboard artifacts, as shown in Figure 2.
7 Generative Adversarial Networks
A common problem in the training of GANs is the collapse of the generator to a constant. This problem is greatly reduced by avoiding the “freeze” regime in which the constant mode dominates and by using the new NTK parametrization with adaptive learning rates. Figure 2 shows the results obtained with three GANs which differ only in the choice of non-linearity and/or the presence of Batch Normalization in the generator. In all cases, the discriminator is a convolutional network with the normalized ReLU as non-linearity. With the ReLU, the generator collapses and generates a single image with checkerboard patterns. With the normalized ReLU or with Batch Normalization, the generator is able to learn a variety of images. This motivates the use of normalization techniques in GANs to avoid the collapse of the generator.
8 Conclusion
This article shows how the NTK can be used to understand theoretically the effect of architecture choices (such as decreasing the number of channels or batch normalization) on the training of DNNs. We have shown that DNNs in a “freeze” regime have a strong affinity to constant modes and checkerboard artifacts: this slows down training and can contribute to mode collapse for the DC-NN generator of GANs. We introduce simple modifications to solve these problems: the effectiveness of normalizing the non-linearity, of the parent-based parametrization and of layer-dependent learning rates is shown both theoretically and numerically.
- Allen-Zhu et al. (2018) Allen-Zhu, Zeyuan, Li, Yuanzhi, & Song, Zhao. 2018. A Convergence Theory for Deep Learning via Over-Parameterization. CoRR, abs/1811.03962.
- Arora et al. (2019) Arora, Sanjeev, Du, Simon S, Hu, Wei, Li, Zhiyuan, Salakhutdinov, Ruslan, & Wang, Ruosong. 2019. On Exact Computation with an Infinitely Wide Neural Net. arXiv preprint arXiv:1904.11955.
- Arpit et al. (2016) Arpit, Devansh, Zhou, Yingbo, Kota, Bhargava, & Govindaraju, Venu. 2016. Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks. Pages 1168–1176 of: Balcan, Maria Florina, & Weinberger, Kilian Q. (eds), Proceedings of The 33rd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 48. New York, New York, USA: PMLR.
- Brock et al. (2018) Brock, Andrew, Donahue, Jeff, & Simonyan, Karen. 2018. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
- Cho & Saul (2009) Cho, Youngmin, & Saul, Lawrence K. 2009. Kernel Methods for Deep Learning. Pages 342–350 of: Advances in Neural Information Processing Systems 22. Curran Associates, Inc.
- Daniely et al. (2016) Daniely, Amit, Frostig, Roy, & Singer, Yoram. 2016. Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity. Pages 2253–2261 of: Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., & Garnett, R. (eds), Advances in Neural Information Processing Systems 29. Curran Associates, Inc.
- Du et al. (2019) Du, Simon S., Zhai, Xiyu, Póczos, Barnabás, & Singh, Aarti. 2019. Gradient Descent Provably Optimizes Over-parameterized Neural Networks.
- Dumoulin & Visin (2016) Dumoulin, Vincent, & Visin, Francesco. 2016. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.
- Glorot & Bengio (2010) Glorot, Xavier, & Bengio, Yoshua. 2010. Understanding the difficulty of training deep feedforward neural networks. Pages 249–256 of: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 9. Chia Laguna Resort, Sardinia, Italy: PMLR.
- Goodfellow et al. (2014) Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, & Bengio, Yoshua. 2014. Generative Adversarial Networks. NIPS’14 Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, jun, 2672–2680.
- Ioffe & Szegedy (2015) Ioffe, Sergey, & Szegedy, Christian. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. CoRR, abs/1502.03167.
- Jacot et al. (2018) Jacot, Arthur, Gabriel, Franck, & Hongler, Clément. 2018. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Pages 8580–8589 of: Advances in Neural Information Processing Systems 31. Curran Associates, Inc.
- Karakida et al. (2018) Karakida, Ryo, Akaho, Shotaro, & Amari, Shun-Ichi. 2018. Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach. jun.
- Karras et al. (2018) Karras, Tero, Laine, Samuli, & Aila, Timo. 2018. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948.
- Klambauer et al. (2017) Klambauer, Günter, Unterthiner, Thomas, Mayr, Andreas, & Hochreiter, Sepp. 2017. Self-Normalizing Neural Networks. Pages 971–980 of: Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., & Garnett, R. (eds), Advances in Neural Information Processing Systems 30. Curran Associates, Inc.
- LeCun et al. (2012) LeCun, Yann A, Bottou, Léon, Orr, Genevieve B, & Müller, Klaus-Robert. 2012. Efficient backprop. Pages 9–48 of: Neural networks: Tricks of the trade. Springer.
- Lee et al. (2018) Lee, Jae Hoon, Bahri, Yasaman, Novak, Roman, Schoenholz, Samuel S., Pennington, Jeffrey, & Sohl-Dickstein, Jascha. 2018. Deep Neural Networks as Gaussian Processes. ICLR.
- Lei Ba et al. (2016) Lei Ba, J., Kiros, J. R., & Hinton, G. E. 2016. Layer Normalization. arXiv e-prints, July.
- Neal (1996) Neal, Radford M. 1996. Bayesian Learning for Neural Networks. Secaucus, NJ, USA: Springer-Verlag New York, Inc.
- Odena et al. (2016) Odena, Augustus, Dumoulin, Vincent, & Olah, Chris. 2016. Deconvolution and checkerboard artifacts. Distill, 1(10), e3.
- Park et al. (2018) Park, Daniel S, Smith, Samuel L, Sohl-dickstein, Jascha, & Le, Quoc V. 2018. Optimal SGD Hyperparameters for Fully Connected Networks.
- Poole et al. (2016) Poole, Ben, Lahiri, Subhaneil, Raghu, Maithra, Sohl-Dickstein, Jascha, & Ganguli, Surya. 2016. Exponential expressivity in deep neural networks through transient chaos. Pages 3360–3368 of: Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., & Garnett, R. (eds), Advances in Neural Information Processing Systems 29. Curran Associates, Inc.
- Radford et al. (2015) Radford, Alec, Metz, Luke, & Chintala, Soumith. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
- Salimans & Kingma (2016) Salimans, Tim, & Kingma, Durk P. 2016. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. Pages 901–909 of: Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., & Garnett, R. (eds), Advances in Neural Information Processing Systems 29. Curran Associates, Inc.
- Santurkar et al. (2018) Santurkar, Shibani, Tsipras, Dimitris, Ilyas, Andrew, & Madry, Aleksander. 2018. How Does Batch Normalization Help Optimization? Pages 2483–2493 of: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., & Garnett, R. (eds), Advances in Neural Information Processing Systems 31. Curran Associates, Inc.
- Xiang & Li (2017) Xiang, Sitao, & Li, Hao. 2017. On the effects of batch and weight normalization in generative adversarial networks. arXiv preprint arXiv:1704.03971.
- Yang (2019) Yang, Greg. 2019. Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation. arXiv e-prints, Feb, arXiv:1902.04760.
- Yang & Schoenholz (2017) Yang, Greg, & Schoenholz, Samuel. 2017. Mean Field Residual Networks: On the Edge of Chaos. Pages 7103–7114 of: Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., & Garnett, R. (eds), Advances in Neural Information Processing Systems 30. Curran Associates, Inc.
- Yang et al. (2019) Yang, Greg, Pennington, Jeffrey, Rao, Vinay, Sohl-Dickstein, Jascha, & Schoenholz, Samuel S. 2019. A Mean Field Theory of Batch Normalization. CoRR, abs/1902.08129.
- Zhang et al. (2018) Zhang, Han, Goodfellow, Ian, Metaxas, Dimitris, & Odena, Augustus. 2018. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318.
Appendix A Choice of Parametrization
The NTK parametrization differs slightly from the one usually used, yet it ensures that the training is consistent as the size of the layers grows. In the standard parametrization, the activations are defined by
$$\tilde\alpha^{(\ell+1)}(x) = W^{(\ell)}\alpha^{(\ell)}(x) + b^{(\ell)}, \qquad \alpha^{(\ell+1)}(x) = \sigma\big(\tilde\alpha^{(\ell+1)}(x)\big),$$
and we denote by $\hat f_\theta$ the output function of the ANN. Note the absence of the factors $1/\sqrt{n_\ell}$ in comparison to the NTK parametrization. The parameters are initialized using the LeCun/He initialization LeCun et al. (2012): the connection weights of the $\ell$-th layer have standard deviation $1/\sqrt{n_\ell}$ (or $\sqrt{2/n_\ell}$ for the ReLU, but this does not change the general analysis). Using this initialization, the activations stay stochastically bounded as the widths of the ANN get large. In the forward pass, there is almost no difference between the two parametrizations: for each choice of parameters $\theta$, we can scale down the connection weights of the $\ell$-th layer by $\sqrt{n_\ell}$ and the bias weights by $1/\beta$ to obtain a new set of parameters $\tilde\theta$ such that $f_\theta = \hat f_{\tilde\theta}$.
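The equivalence of the two parametrizations in the forward pass can be checked numerically. A numpy sketch (illustrative sizes, tanh nonlinearity; the rescaling follows the correspondence described above):

```python
import numpy as np

def forward_ntk(x, Ws, bs, beta):
    """NTK parametrization: explicit 1/sqrt(n_in) factor and beta on the bias."""
    a = x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        pre = W @ a / np.sqrt(a.shape[0]) + beta * b
        a = np.tanh(pre) if l < len(Ws) - 1 else pre
    return a

def forward_std(x, Vs, cs):
    """Standard parametrization: no explicit scaling factors."""
    a = x
    for l, (V, c) in enumerate(zip(Vs, cs)):
        pre = V @ a + c
        a = np.tanh(pre) if l < len(Vs) - 1 else pre
    return a

rng = np.random.default_rng(4)
widths = [3, 10, 10, 2]
beta = 0.1
Ws = [rng.standard_normal((widths[l + 1], widths[l])) for l in range(3)]
bs = [rng.standard_normal(widths[l + 1]) for l in range(3)]
# Rescale: V = W / sqrt(n_in), c = beta * b  ->  identical output functions.
Vs = [W / np.sqrt(W.shape[1]) for W in Ws]
cs = [beta * b for b in bs]
x = rng.standard_normal(3)
y_ntk = forward_ntk(x, Ws, bs, beta)
y_std = forward_std(x, Vs, cs)
```

The two outputs agree to machine precision; the difference between the parametrizations only appears through the gradients.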
The two parametrizations exhibit a difference during backpropagation, since in the NTK parametrization the derivative with respect to a connection weight of the $\ell$-th layer carries an extra factor $1/\sqrt{n_\ell}$.
The NTK is a sum of products of these derivatives over all parameters:
$$\Theta^{(L)}(x,y) \;=\; \sum_{p} \partial_{\theta_p} f_\theta(x)\, \partial_{\theta_p} f_\theta(y).$$
With our parametrization, all summands converge to a finite limit, while with the LeCun or He parametrization the summands corresponding to the connection weights of the wide layers are scaled up by the widths, and hence explode in the infinite-width limit. One must therefore take a learning rate of order the inverse of the width Karakida et al. (2018); Park et al. (2018) to obtain meaningful training dynamics; but in this case, the contributions to the NTK of the first layer's connections and of the biases of all layers vanish, which implies that training these parameters has less and less effect on the output function as the width of the network grows. As a result, the dynamics of the output function during training can still be described by a modified kernel gradient descent: the modified learning rate compensates for the absence of normalization in the usual parametrization.
The NTK parametrization is hence more natural for large networks, as it solves both the problem of having meaningful forward and backward passes and that of tuning the learning rate, the problem that sparked multiple alternative initialization strategies in deep learning Glorot & Bengio (2010). Note that in the standard parametrization, the importance of the bias parameters shrinks as the width gets large; this can be emulated in the NTK parametrization by taking a small value for the parameter $\beta$.
Appendix B FC-NN Freeze and Chaos
In this section, we prove Theorem 2, showing the existence of two regimes, “freeze” and “chaos”, in FC-NNs. First, we improve some results of Daniely et al. (2016) and study the rate of convergence of the activation kernels as the depth grows to infinity. In a second step, this allows us to characterize the behavior of the NTK for large depth.
Let us consider a standardized differentiable non-linearity $\sigma$, i.e. satisfying $\mathbb{E}_{X\sim\mathcal{N}(0,1)}[\sigma(X)^2]=1$. Recall that the activation kernels are defined recursively from $\Sigma^{(1)}$ by $\Sigma^{(\ell+1)} = L^{\sigma}_{\Sigma^{(\ell)}} + \beta^2$. By induction, for any $\ell$, $\Sigma^{(\ell)}(x,y)$ is uniquely determined by $\Sigma^{(1)}(x,y)$. Defining the two functions $\hat\sigma, B : [-1,1]\to[-1,1]$ by
$$\hat\sigma(\rho) = \mathbb{E}_{(u,v)\sim\mathcal{N}(0,\Lambda_\rho)}[\sigma(u)\,\sigma(v)], \qquad B(\rho) = \frac{\rho+\beta^2}{1+\beta^2}, \qquad \Lambda_\rho = \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix},$$
one can formulate the (suitably normalized) activation kernels as an alternate composition of $\hat\sigma$ and $B$.
In particular, this shows that for any $\ell$, the normalized activation kernels take values in $[-1,1]$. Since the activation kernels are obtained by iterating the same function, we first study the fixed points of the composition $f = B\circ\hat\sigma$. When $\sigma$ is a standardized non-linearity, the function $\hat\sigma$, named the dual of $\sigma$, satisfies the following key properties proven in Daniely et al. (2016):
1. For any $\rho\in[-1,1]$, $|\hat\sigma(\rho)| \le \hat\sigma(1) = 1$,
2. $\hat\sigma$ is convex in $[0,1]$,
3. $\hat\sigma'(\rho) = \widehat{\dot\sigma}(\rho)$, where $\dot\sigma$ denotes the derivative of $\sigma$.
By definition, $\hat\sigma(1) = \mathbb{E}_{X\sim\mathcal{N}(0,1)}[\sigma(X)^2] = 1$, thus $\rho=1$ is a trivial fixed point: $f(1)=1$. This shows that for any $\ell$ and any input $x$ on the sphere, the diagonal value of the normalized activation kernel at $(x,x)$ equals $1$.
It appears that $\rho=-1$ is also a fixed point of $f$ if and only if the non-linearity is antisymmetric and $\beta=0$. From now on, we will focus on the region $\rho\in[0,1]$. From property 2 of $\hat\sigma$ and since $B$ is non-decreasing, any non-trivial fixed point must lie in $[0,1)$. Since $f(1)=1$, $f'(1)=r$ and $f$ is convex in $[0,1]$, there exists a non-trivial fixed point of $f$ if $r>1$, whereas if $r\le 1$ there is no fixed point in $[0,1)$. This leads to two regimes shown in Daniely et al. (2016), depending on the value of $r$:
“Freeze” when $r\le 1$: $f$ has a unique fixed point, equal to $1$, and the activation kernels become constant (at an exponential rate when $r<1$),
“Chaos” when $r>1$: $f$ has another fixed point $\rho^*<1$, and the normalized activation kernels converge to a limit equal to $1$ if $x=y$ and to $\rho^*$ if $x\neq\pm y$; if the nonlinearity is antisymmetric and $\beta=0$, the limit equals $-1$ if and only if $x=-y$.
To establish the existence of the two regimes for the NTK, we need the following bounds on the rate of convergence of $f$ in the “freeze” region and on its values in the “chaos” region:
If $\sigma$ is a standardized differentiable non-linearity,
If $r<1$, then for any $\rho\in[0,1]$, $0 \le 1 - f(\rho) \le r\,(1-\rho)$, hence $1 - f^{\circ k}(\rho) \le r^{k}\,(1-\rho)$,
If $r>1$, then there exists a fixed point $\rho^*<1$ of $f$ such that for any $\rho\in[0,1)$, $f^{\circ k}(\rho)$ converges to $\rho^*$ exponentially fast.
Let us denote $f = B\circ\hat\sigma$ and suppose that $r<1$. By Daniely et al. (2016), we know that $\hat\sigma'(\rho) = \widehat{\dot\sigma}(\rho)$ and $|\widehat{\dot\sigma}(\rho)| \le \widehat{\dot\sigma}(1) = \mathbb{E}[\dot\sigma(X)^2]$, where $X\sim\mathcal{N}(0,1)$. From now on, we will omit the distribution assumption on $X$. The previous equalities and inequalities imply that $f'(\rho) \le r$ on $[0,1]$, thus we obtain: