Freeze and Chaos for DNNs: an NTK view of Batch Normalization, Checkerboard and Boundary Effects

07/11/2019 · Arthur Jacot et al., EPFL

In this paper, we analyze a number of architectural features of Deep Neural Networks (DNNs), using the so-called Neural Tangent Kernel (NTK). The NTK describes the training trajectory and generalization of DNNs in the infinite-width limit. In this limit, we show that for (fully-connected) DNNs, as the depth grows, two regimes appear: "freeze" (also known as "order"), where the (scaled) NTK converges to a constant (slowing convergence), and "chaos", where it converges to a Kronecker delta (limiting generalization). We show that when using the scaled ReLU as a nonlinearity, we naturally end up in the "freeze". We show that Batch Normalization (BN) avoids the freeze regime by reducing the importance of the constant mode in the NTK. A similar effect is obtained by normalizing the nonlinearity which moves the network to the chaotic regime. We uncover the same "freeze" and "chaos" modes in Deep Deconvolutional Networks (DC-NNs). The "freeze" regime is characterized by checkerboard patterns in the image space in addition to the constant modes in input space. Finally, we introduce a new NTK-based parametrization to eliminate border artifacts and we propose a layer-dependent learning rate to improve the convergence of DC-NNs. We illustrate our findings by training DCGANs using our setup. When trained in the "freeze" regime, we see that the generator collapses to a checkerboard mode. We also demonstrate numerically that the generator collapse can be avoided and that good quality samples can be obtained, by tuning the nonlinearity to reach the "chaos" regime (without using batch normalization).


1 Introduction

The training of Deep Neural Networks (DNN) involves a great variety of architecture choices. It is therefore crucial to find tools to understand their effects and to compare them.

For example, Batch Normalization (BN) Ioffe & Szegedy (2015) has proven to be crucial in the training of DNNs but remains ill-understood. While BN was initially introduced to solve the problem of “covariate shift”, recent results Santurkar et al. (2018) suggest an effect on the smoothness of the loss surface. Some alternatives to BN have been proposed Lei Ba et al. (2016); Salimans & Kingma (2016); Klambauer et al. (2017), yet it remains difficult to compare them theoretically. Recent theoretical results Yang et al. (2019) suggest some relation to the transition from “freeze” (also known as “order”) to “chaos” observed as the depth of the NN goes to infinity Poole et al. (2016); Daniely et al. (2016); Yang & Schoenholz (2017).

The impact of architecture is very apparent in GANs Goodfellow et al. (2014): their results are heavily affected by the architecture of the generator and discriminator Radford et al. (2015); Zhang et al. (2018); Brock et al. (2018); Karras et al. (2018) and the training may fail without BN Arpit et al. (2016); Xiang & Li (2017).

Recently, a number of important advances Jacot et al. (2018); Du et al. (2019); Allen-Zhu et al. (2018) have allowed one to understand the training of DNNs when the number of neurons in each hidden layer is very large. These results provide new tools to study the asymptotic effect of BN. In particular, the Neural Tangent Kernel (NTK) Jacot et al. (2018) captures the effect of architecture on the training of DNNs and also describes their loss surface Karakida et al. (2018). The NTK can easily be extended to CNNs and other architectures Yang (2019); Arora et al. (2019), hence allowing comparison.

1.1 Our Contributions

We describe how the NTK is affected by the “freeze” and “chaos” regimes Poole et al. (2016); Daniely et al. (2016); Yang & Schoenholz (2017). For fully-connected networks (FC-NNs), the scaled NTK converges to a constant in the “freeze” regime and to a Kronecker delta in the “chaos” regime. In deconvolutional networks (DC-NNs), a similar transition takes place: the “freeze” regime features checkerboard patterns Odena et al. (2016) and the “chaos” regime features a (translation-invariant) Kronecker delta.

We then show that different normalization techniques, such as Batch Normalization and our proposed Nonlinearity Normalization with hyper-parameter tuning, allow the DNN to avoid the “freeze” regime.

We also prove that the traditional parametrization of DC-NNs leads to border effects in the NTK, and we propose a simple solution in the form of a new “parent-based” parametrization. We then discuss the effect of the number of channels on the NTK, giving a theoretical motivation for decreasing the number of channels after each upsampling, and we show that a layer-dependent learning rate balances the contributions of the layers during training.

Finally, we demonstrate our findings numerically on DC-GANs: we show that in the “freeze” regime, the generator collapses to a checkerboard mode. We then show how a basic DC-GAN can be trained effectively while avoiding this mode collapse: with proper hyperparameter tuning, nonlinearity normalization, parametrization, and learning-rate choices, and without batch normalization, we reach the “chaos” regime and obtain good-quality samples from a very simple DC-NN generator.

2 Setup

In this section, we introduce the two architectures that we will consider, FC-NNs and DC-NNs, and their training procedures.

2.1 Fully-Connected Neural Nets

The first type of architecture we consider are deep Fully-Connected Neural Nets (FC-NNs). An FC-NN with nonlinearity $\sigma:\mathbb{R}\to\mathbb{R}$ consists of $L+1$ layers ($L-1$ hidden layers), containing $n_0,\dots,n_L$ neurons. The parameters are defined by connection weight matrices $W^{(\ell)}\in\mathbb{R}^{n_{\ell+1}\times n_\ell}$ and bias vectors $b^{(\ell)}\in\mathbb{R}^{n_{\ell+1}}$ for $\ell=0,\dots,L-1$. Following Jacot et al. (2018), the network parameters are aggregated into a single vector $\theta\in\mathbb{R}^P$ and initialized using iid standard Gaussians $\mathcal{N}(0,1)$. For $\theta\in\mathbb{R}^P$, the ANN is defined as $f_\theta(x)=\tilde\alpha^{(L)}(x)$, where the activations $\alpha^{(\ell)}$ and preactivations $\tilde\alpha^{(\ell)}$ are recursively constructed using the NTK parametrization: we set $\alpha^{(0)}(x)=x$ and, for $\ell=0,\dots,L-1$,
$$\tilde\alpha^{(\ell+1)}(x)=\frac{1}{\sqrt{n_\ell}}W^{(\ell)}\alpha^{(\ell)}(x)+\beta\, b^{(\ell)},\qquad \alpha^{(\ell+1)}(x)=\sigma\big(\tilde\alpha^{(\ell+1)}(x)\big),$$
where $\sigma$ is applied entry-wise and $\beta>0$.

Remark 1.

The NTK initialization is equivalent to the so-called LeCun initialization scheme LeCun et al. (2012), where each connection weight of the $\ell$-th layer is initialized with standard deviation $1/\sqrt{n_\ell}$; in our approach, the $1/\sqrt{n_\ell}$ factor instead appears in the parametrization of the network. While the behavior at initialization is similar, the NTK parametrization ensures that the training is consistent as the size of the layers grows, see Jacot et al. (2018).

The hyperparameter $\beta$ allows one to balance the relative contributions of the connection weights and of the biases during training; in our numerical experiments, $\beta$ is fixed to a constant value. Note that the variance of the normalized bias $\beta b^{(\ell)}$ at initialization can be tuned by $\beta$.
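The following is a minimal NumPy sketch of this NTK-parametrized forward pass. The widths, the value of $\beta$, the input scaling and the helper names (`init_params`, `forward`) are our own choices for illustration, not taken from the paper.

```python
import numpy as np

def init_params(n, rng):
    """All parameters are iid standard Gaussians; the 1/sqrt(n_l) factors
    live in the forward pass (NTK parametrization), not in the init."""
    Ws = [rng.standard_normal((n[l + 1], n[l])) for l in range(len(n) - 1)]
    bs = [rng.standard_normal(n[l + 1]) for l in range(len(n) - 1)]
    return Ws, bs

def forward(x, Ws, bs, sigma, beta=0.1):
    """NTK-parametrized forward pass: pre-activation = W a / sqrt(n_l) + beta * b.
    The output is the last pre-activation (no nonlinearity on the output layer)."""
    a = x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        pre = W @ a / np.sqrt(a.shape[0]) + beta * b
        a = sigma(pre) if l < len(Ws) - 1 else pre
    return a

rng = np.random.default_rng(0)
n = [3, 512, 512, 1]                                       # n0, hidden widths, nL
relu_std = lambda z: np.sqrt(2.0) * np.maximum(z, 0.0)     # standardized ReLU
Ws, bs = init_params(n, rng)
x = rng.standard_normal(3)
x *= np.sqrt(3) / np.linalg.norm(x)                        # put x on a sphere of radius sqrt(n0)
print(forward(x, Ws, bs, relu_std))
```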

2.2 Deconvolutional Neural Networks

The second type of architecture we consider are Deconvolutional Nets (DC-NNs), also known as Transposed ConvNets or Fractionally Strided ConvNets Dumoulin & Visin (2016). A DC-NN in dimension $d$ with $L+1$ layers, channel numbers $n_0,\dots,n_L$, and layer-wise windows and padding windows for $\ell=0,\dots,L-1$, consists of a composition of the following operations:

The upsampling $U_s$ with stride $s\in\mathbb{N}$, which constructs a ‘blown-up’ image $\tilde y$ from $y$ by setting $\tilde y_q=y_p$ if $q=s\,p$ (i.e. if $q_k=s\,p_k$ for every coordinate $k$) and $\tilde y_q=0$ if not (a minimal sketch of this operation is given at the end of this subsection).

The DC-filter, which constructs an ‘output’ image $z$ from the blown-up image $\tilde y$: each output channel at a position $p$ is a linear combination of the entries of $\tilde y$ at the positions $p+w$ for $w$ in the window, normalized (following the NTK parametrization) by the square root of the number of input channels times the cardinality of the window, plus a bias term scaled by $\beta$; the weight matrix encodes this linear map. We apply ‘zero-padding’, setting $\tilde y_q=0$ for positions $q$ outside the image.

The pointwise application of the nonlinearity $\sigma$ (to each channel of each pixel).

The parameters are aggregated into a vector $\theta\in\mathbb{R}^{P}$ and initialized as iid standard Gaussians $\mathcal{N}(0,1)$.

For $\theta\in\mathbb{R}^{P}$, the DC-NN $f_\theta$ is defined as the composition of these three operations (upsampling, DC-filter, nonlinearity), applied layer after layer.
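A minimal NumPy sketch of the zero-upsampling step described above (2D case; sizes and names are ours):

```python
import numpy as np

def upsample(y, s):
    """Blow up a (channels, H, W) image by stride s: the pixel at position p
    is copied to position s*p, and all other positions are set to zero."""
    c, h, w = y.shape
    out = np.zeros((c, s * h, s * w), dtype=y.dtype)
    out[:, ::s, ::s] = y
    return out

y = np.arange(8, dtype=float).reshape(2, 2, 2)   # 2 channels, 2x2 image
print(upsample(y, 2)[0])                         # first channel of the blown-up image
```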

2.3 Training and Setup

In this section, we describe the training of ANNs in the FC-NN case to keep the notation light; the generalization to the DC-NN case is straightforward. For a dataset $x_1,\dots,x_N$, we define the output matrix $Y_\theta$ by $(Y_\theta)_{i\cdot}=f_\theta(x_i)$. The ANN is trained by optimizing a cost $C(Y_\theta)$ through gradient descent, defining a flow $\theta_t$ in parameter space.

In this paper, we focus on the so-called over-parametrized regime, where the sizes of the hidden layers grow to infinity (either sequentially, as in Jacot et al. (2018), or simultaneously, as in Yang (2019); Arora et al. (2019)), for fixed input and output dimensions $n_0$ and $n_L$. In the case of FC-NNs, this amounts to taking large widths for the hidden layers, while for DC-NNs, this amounts to taking large channel numbers.

3 Neural Tangent Kernel

The Neural Tangent Kernel (NTK) Jacot et al. (2018) is at the heart of our analysis of the overparametrized regime. It describes the evolution of $f_\theta$ in function space during training. In the FC-NN case, the NTK is defined by
$$\Theta^{(L)}_{kk'}(x,y)=\sum_{p=1}^{P}\partial_{\theta_p}f_{\theta,k}(x)\,\partial_{\theta_p}f_{\theta,k'}(y).$$

For a finite dataset $x_1,\dots,x_N$, the NTK Gram matrix is the $Nn_L\times Nn_L$ matrix with entries $\Theta^{(L)}_{kk'}(x_i,x_j)$. The evolution of the outputs during training can then be written in terms of the NTK as the kernel gradient descent
$$\partial_t f_{\theta_t}(x_i)=-\sum_{j=1}^{N}\Theta^{(L)}(x_i,x_j)\,\nabla_{f(x_j)}C(Y_{\theta_t}).$$

In the DC-NN case, the NTK is defined by a similar formula: the entry $\Theta^{(L)}_{pp'}(x,y)$ represents how a ‘pressure’ to change the pixel at position $p$ produced by the ‘code’ $x$ influences the value of the pixel at position $p'$ produced by the ‘code’ $y$.
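As a numerical illustration of the definition above, the sketch below computes the empirical NTK Gram matrix of a tiny finite-width network, approximating parameter gradients by finite differences (NumPy; the toy network, sizes and helper names are our own and biases are omitted for brevity):

```python
import numpy as np

def f(theta, x, n0=3, n1=16):
    """Tiny scalar-output FC-NN in the NTK parametrization, parameters flattened
    into a single vector theta (biases omitted for brevity)."""
    W0 = theta[: n1 * n0].reshape(n1, n0)
    W1 = theta[n1 * n0 :]
    a = np.sqrt(2.0) * np.maximum(W0 @ x / np.sqrt(n0), 0.0)   # standardized ReLU
    return W1 @ a / np.sqrt(n1)

def grad(theta, x, eps=1e-5):
    """Finite-difference gradient of f(., x) with respect to all parameters."""
    g = np.zeros_like(theta)
    for p in range(theta.size):
        e = np.zeros_like(theta); e[p] = eps
        g[p] = (f(theta + e, x) - f(theta - e, x)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n0, n1 = 3, 16
theta = rng.standard_normal(n1 * n0 + n1)
xs = [v / np.linalg.norm(v) for v in rng.standard_normal((4, n0))]
G = np.array([[grad(theta, x) @ grad(theta, y) for y in xs] for x in xs])
print(G)   # 4x4 empirical NTK Gram matrix: sums of products of parameter derivatives
```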

3.1 Infinite-Width Limit for FC-NNs

Following Neal (1996); Cho & Saul (2009); Lee et al. (2018), in the overparametrized regime at initialization, the pre-activations are described by iid centered Gaussian processes with covariance kernels $\Sigma^{(\ell)}$ constructed as follows. For a kernel $K$, set
$$L_{K}(x,y)=\mathbb{E}_{f\sim\mathcal{N}(0,K)}\big[\sigma(f(x))\,\sigma(f(y))\big].$$

The activation kernels are defined recursively by $\Sigma^{(1)}(x,y)=\frac{1}{n_0}x^{T}y+\beta^{2}$ and $\Sigma^{(\ell+1)}(x,y)=L_{\Sigma^{(\ell)}}(x,y)+\beta^{2}$.

While random at initialization, in the infinite-width limit the NTK converges to a deterministic limit, which is moreover constant during training:

Theorem 1.

As $n_1,\dots,n_{L-1}\to\infty$, for any inputs $x,y$ and any time $t$ in a fixed training interval, the kernel $\Theta^{(L)}$ converges to a deterministic limit $\Theta^{(L)}_{\infty}\otimes\mathrm{Id}_{n_L}$, where
$$\Theta^{(L)}_{\infty}(x,y)=\sum_{\ell=1}^{L}\Sigma^{(\ell)}(x,y)\prod_{\ell'=\ell+1}^{L}\dot\Sigma^{(\ell')}(x,y),$$
and $\dot\Sigma^{(\ell)}(x,y)=\mathbb{E}_{f\sim\mathcal{N}(0,\Sigma^{(\ell-1)})}\big[\dot\sigma(f(x))\,\dot\sigma(f(y))\big]$, with $\dot\sigma$ denoting the derivative of $\sigma$.

Jacot et al. (2018) gives a proof for the sequential limit and Yang (2019); Arora et al. (2019) a proof in the simultaneous limit. As a consequence, in the infinite-width limit, the dynamics of the function $f_{\theta_t}$ acquires a simple form:
$$\partial_t f_{\theta_t}(x)=-\sum_{i=1}^{N}\Theta^{(L)}_{\infty}(x,x_i)\,\nabla_{f(x_i)}C(Y_{\theta_t}).$$
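For the standardized ReLU, the recursion behind Theorem 1 can be evaluated in closed form using the arc-cosine formulas of Cho & Saul (2009). The sketch below (NumPy) iterates $\Sigma^{(\ell+1)}=L_{\Sigma^{(\ell)}}$ and $\Theta^{(\ell+1)}=\dot\Sigma^{(\ell+1)}\odot\Theta^{(\ell)}+\Sigma^{(\ell+1)}$ for unit-variance inputs; taking $\beta=0$ and parametrizing the pair of inputs by their cosine similarity are our own simplifications.

```python
import numpy as np

def relu_ntk(rho, L):
    """Normalized limiting NTK of an L-layer FC-NN with standardized ReLU and
    beta = 0, for two inputs with Sigma^(1)(x,x) = 1 and Sigma^(1)(x,y) = rho."""
    sig, theta = rho, rho                      # Sigma^(1) and Theta^(1)
    for _ in range(L - 1):
        r = np.clip(sig, -1.0, 1.0)
        sig_dot = (np.pi - np.arccos(r)) / np.pi                         # E[sigma'(u) sigma'(v)]
        sig = (np.sqrt(1 - r**2) + r * (np.pi - np.arccos(r))) / np.pi   # E[sigma(u) sigma(v)]
        theta = theta * sig_dot + sig
    return theta / L                           # divide by the diagonal value Theta(x, x) = L

for L in (3, 6, 12):
    print(L, [round(relu_ntk(np.cos(t), L), 3) for t in (0.5, 1.5, np.pi)])
```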

3.2 Infinite-Channel-Number for DC-NNs

The infinite-channel-number limit for DC-NNs follows a similar analysis, with the difference that the weights are shared in the architecture. In Yang (2019); Arora et al. (2019), a number of results have been derived for the initialization; they are generalized to our setting in Appendix E. Based on the existing results, it appears natural to postulate the following:

Conjecture 1.

As the numbers of channels of the hidden layers grow to infinity, the DC-NN NTK has a deterministic, time-constant limit.

4 Freeze and Chaos: Constant Modes, Checkerboard Artifacts

We now investigate the large-depth behavior of the NTK (in the infinite-width limit), revealing a transition between two phases which we call “freeze” and “chaos”. We start with a few key definitions:

Definition 1.

We say that a Lipschitz nonlinearity $\sigma$ is standardized if $\mathbb{E}_{X\sim\mathcal{N}(0,1)}[\sigma(X)^{2}]=1$. For a standardized $\sigma$, we define its characteristic value $r_{\sigma,\beta}$ in terms of $\mathbb{E}_{X\sim\mathcal{N}(0,1)}[\dot\sigma(X)^{2}]$ and of the bias parameter $\beta$ (for $\beta=0$ it equals $\mathbb{E}[\dot\sigma(X)^{2}]$, and it decreases as $\beta$ grows), where $\dot\sigma$ denotes the (a.e. defined) derivative of $\sigma$. We denote by $\tilde\Theta^{(L)}$ the normalized NTK, obtained by dividing $\Theta^{(L)}_{\infty}(x,y)$ by its value on the diagonal.

We define the standard spheres $\mathbb{S}_{n_0}\subset\mathbb{R}^{n_0}$ as the spheres of inputs of fixed norm (scaling with $\sqrt{n_0}$).

Following Daniely et al. (2016), we consider standardized nonlinearities and inputs in $\mathbb{S}_{n_0}$ (and the corresponding spheres for DC-NNs). This ensures that the variance of the neurons is constant for all depths: $\Sigma^{(\ell)}(x,x)$ does not depend on $\ell$. Our techniques extend to inputs which have the same norm, as is approximately the case for high-dimensional datasets: for example, in GANs Goodfellow et al. (2014), the inputs of a generator are vectors of iid entries whose norm concentrates around a fixed value when the dimension is high.
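The standardization condition and the quantity $\mathbb{E}[\dot\sigma(X)^{2}]$ entering the characteristic value can be checked numerically. The Monte Carlo sketch below (NumPy, our own illustration) does so for the standardized ReLU $\sigma(x)=\sqrt{2}\max(0,x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(2_000_000)                        # X ~ N(0, 1)

sigma = lambda x: np.sqrt(2.0) * np.maximum(x, 0.0)       # standardized ReLU
dsigma = lambda x: np.sqrt(2.0) * (x > 0)                 # its a.e. derivative

print("E[sigma(X)^2]  ~", np.mean(sigma(X) ** 2))   # ~ 1: sigma is standardized
print("E[sigma'(X)^2] ~", np.mean(dsigma(X) ** 2))  # ~ 1: the value relevant at beta = 0
```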

4.1 Freeze and Chaos for Fully-Connected Networks

For a standardized $\sigma$, the large-depth behavior of the NTK is governed by the characteristic value $r_{\sigma,\beta}$:

Theorem 2.

Suppose that $\sigma$ is twice differentiable and standardized.

If $r_{\sigma,\beta}<1$, we are in the frozen regime: there exist constants $C,c>0$ such that, for all $x,y\in\mathbb{S}_{n_0}$, $|\tilde\Theta^{(L)}(x,y)-1|\le C\,e^{-cL}$.

If $r_{\sigma,\beta}>1$, we are in the chaotic regime: for $x\neq\pm y$ in $\mathbb{S}_{n_0}$, there exist constants $C,c>0$ such that $|\tilde\Theta^{(L)}(x,y)|\le C\,e^{-cL}$.

Theorem 2 shows that in the frozen regime, the normalized NTK converges to a constant, whereas in the chaotic regime, it converges to a Kronecker delta (taking the value one on the diagonal, zero elsewhere). This suggests that the training of deep FC-NNs is heavily influenced by the characteristic value $r_{\sigma,\beta}$: when $r_{\sigma,\beta}<1$, the NTK becomes constant, thus slowing down the training, whereas when $r_{\sigma,\beta}>1$, it concentrates on the diagonal, ensuring fast training but limiting generalization. To train very deep FC-NNs, it is best to lie “on the edge of chaos” Poole et al. (2016); Yang & Schoenholz (2017). When the depth is not too large, it appears possible to lean into the chaotic regime to speed up training without sacrificing generalization.

The standardized ReLU $\sigma(x)=\sqrt{2}\max(0,x)$ has a characteristic value strictly below one whenever $\beta>0$, which places it in the frozen regime. The non-differentiability of the ReLU at zero leads to a different behavior as the depth $L$ grows:

Theorem 3.

With the same notation as in Theorem 2, taking $\sigma$ to be the standardized ReLU and $\beta=0$, we are in the weakly frozen regime: there exists a constant $C>0$ such that the normalized NTK converges to the constant kernel, but only at a polynomial rate in $L$ (in contrast with the exponential rate of the frozen regime).

When $\beta=0$, the characteristic value of the standardized ReLU is equal to $1$, which lies at (or very close to) the transition between the two regimes. To really lie in the freeze regime, we require a larger $\beta$. In Figure 1, we see that even at the edge of chaos, a ReLU network has a strong affinity to the constant mode, as witnessed by the large average value of the normalized NTK on the circle, for a fixed point $x$ and $y$ sampled uniformly on the circle. In Section 5, we present a normalization technique to reach the chaotic regime with a ReLU network.


Figure 1: (Left) the normalized NTK on the unit circle for a fixed depth and (right) the average value of the normalized NTK on the circle as a function of the depth $L$. Four architectures are plotted: a vanilla ReLU network in the freeze regime and at the edge of chaos, a network with a normalized ReLU (chaos), and a network with Batch Norm.

4.2 Bulk Freeze and Chaos for Deconvolutional Nets

For DC-NNs, the value of an output neuron at a position $p$ only depends on the inputs which are ancestors of $p$, i.e. all input positions $q$ such that there is a chain of connections from $q$ to $p$. For the same reason, the NTK entry associated with two output positions $p$ and $p'$ only depends on the input values at the ancestors of $p$ and $p'$ respectively.

For a stride $s$, we define the $s$-valuation of a position $p$ as the largest integer $k$ such that $s^{k}$ divides every coordinate of $p$. The behaviour of the NTK depends on the $s$-valuation of the difference $p-p'$ of the two output positions: when this valuation is strictly smaller than the number of upsampling layers, the corresponding entries of the NTK converge to a constant in the infinite-width limit, for any pair of inputs.

Again, the characteristic value $r_{\sigma,\beta}$ plays a central role in the behavior of the large-depth limit.

Theorem 4.

Take a DC-NN with upsampling stride $s$ and fixed windows, and a standardized, twice differentiable nonlinearity $\sigma$. Then there exist constants such that the following holds for any depth $L$, any inputs and any output positions $p,p'$:

Freeze: when $r_{\sigma,\beta}<1$, the normalized NTK entries converge, exponentially in the depth, to constants which depend only on the $s$-valuation of $p-p'$ and increase with it.

Chaos: when $r_{\sigma,\beta}>1$, if either the inputs differ on the common ancestors of $p$ and $p'$, or the two positions do not see translated copies of the same input patch, then the normalized NTK entries vanish exponentially with the depth.

This theorem suggests that in the freeze regime, the correlations between differing positions $p$ and $p'$ increase with the $s$-valuation of $p-p'$, which is a strong feature of checkerboard patterns Odena et al. (2016). These artifacts typically appear in images generated by DC-NNs. The form of the NTK also suggests a strong affinity to these checkerboard patterns: they should dominate the NTK spectral decomposition. This is shown in Figure 2, where the eigenvectors of the NTK Gram matrix of a DC-NN are computed.

In the chaotic regime, the normalized NTK converges to a “scaled translation-invariant” Kronecker delta. To two output positions $p$ and $p'$ we associate the two regions of the input space which are connected to $p$ and $p'$ respectively. The normalized NTK is then close to one if the input patch connected to $p$ is a translation of the patch connected to $p'$, and approximately zero otherwise.

5 Batch Normalization, Hyperparameters and Nonlinearity Modifications

5.1 Batch Normalization

In Section 4, we have seen that in the frozen scenario, the NTK is dominated by the constant mode: more precisely, the constant functions correspond to the leading eigenvalue of the NTK. In this subsection, we explain how (a type of) Batch Normalization (BN) allows one to ‘kill’ the constant mode in the NTK. We consider the post-nonlinearity BN (PN-BN), which adds a normalizing layer to the activations (after the nonlinearity), defined by
$$\mathrm{BN}\big(\alpha^{(\ell)}\big)_{i}(x_j)=\frac{\alpha^{(\ell)}_{i}(x_j)-\mu_{i}}{\nu_{i}},$$
for $i=1,\dots,n_\ell$ and $j=1,\dots,N$, where $\mu_{i}$ and $\nu_{i}$ are the mean and standard deviation of the activations of neuron $i$ over the batch $x_1,\dots,x_N$.
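A minimal sketch of such a post-nonlinearity normalization layer over a batch (NumPy; this is our own simplified version, without the learned scale and shift of standard BN):

```python
import numpy as np

def pn_bn(activations, eps=1e-8):
    """activations: array of shape (batch, n_neurons), post-nonlinearity.
    Each neuron is centered and rescaled across the batch."""
    mu = activations.mean(axis=0, keepdims=True)
    nu = activations.std(axis=0, keepdims=True)
    return (activations - mu) / (nu + eps)

rng = np.random.default_rng(0)
alpha = np.sqrt(2.0) * np.maximum(rng.standard_normal((32, 8)), 0.0)  # ReLU activations
out = pn_bn(alpha)
print(out.mean(axis=0).round(6), out.std(axis=0).round(6))  # ~0 and ~1 per neuron
```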

While incorporating the BN would modify the overparametrized regime of the NTK analysis, the following suggests that the PN-BN plays a role which can be understood in terms of the NTK. In particular, it allows one to control the importance of the constant mode:

Lemma 1.

Consider an FC-NN with $L+1$ layers, with a PN-BN after the last nonlinearity. For any batch $x_1,\dots,x_N$ and any parameter $\theta$, the constant vector over the batch is an eigenvector of the NTK Gram matrix, with an eigenvalue scaling as $\beta^{2}$.

When training the FC-NN with PN-BN and a small value of $\beta$ to fit labels, it is important to center the labels, since the convergence is slow along the constant mode. On the other hand, a small value of $\beta$ allows one to consider a higher learning rate, thus accelerating the convergence along the non-constant modes, including for large values of $L$.

5.2 Nonlinearity Normalization, Hyperparameter Tuning and Chaos-Freeze Transition

In Section 4, we showed the existence of the frozen and chaotic phases for FC-NNs and DC-NNs, which depend on the characteristic value $r_{\sigma,\beta}$. In this section, we show that by centering and standardizing the nonlinearity and by tuning $\beta$, one can reach both phases. Let us first observe that if we standardize $\sigma$, since $r_{\sigma,\beta}\to 0$ as $\beta\to\infty$, it is always possible to lie in the ordered regime. On the other hand, if we take a Lipschitz nonlinearity, by centering and standardizing it, we can take $\beta$ sufficiently small so that $r_{\sigma,\beta}>1$, as guaranteed by the following (variant of Poincaré’s) lemma:

Proposition 1.

If $\mathbb{E}_{X\sim\mathcal{N}(0,1)}[\sigma(X)]=0$ and $\mathbb{E}[\sigma(X)^{2}]=1$, then $\mathbb{E}[\dot\sigma(X)^{2}]\ge 1$; in particular, if $\sigma$ is not linear and $\beta$ is small enough, we have $r_{\sigma,\beta}>1$.
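For the ReLU, this centering and standardization can be done in closed form, and the resulting $\mathbb{E}[\dot\sigma(X)^{2}]$ is indeed strictly larger than one. The sketch below (NumPy) is our own illustration; the constants are specific to the ReLU:

```python
import numpy as np

# Centered, standardized ("normalized") ReLU with respect to X ~ N(0, 1).
mean = 1.0 / np.sqrt(2.0 * np.pi)              # E[max(X, 0)]
std = np.sqrt(0.5 - 1.0 / (2.0 * np.pi))       # sqrt(Var[max(X, 0)])
relu_norm = lambda x: (np.maximum(x, 0.0) - mean) / std
drelu_norm = lambda x: (x > 0) / std

rng = np.random.default_rng(0)
X = rng.standard_normal(2_000_000)
print("E[sigma(X)]    ~", np.mean(relu_norm(X)))        # ~ 0  (centered)
print("E[sigma(X)^2]  ~", np.mean(relu_norm(X) ** 2))   # ~ 1  (standardized)
print("E[sigma'(X)^2] =", 0.5 / std**2)                 # ~ 1.47 > 1: chaos reachable for small beta
```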

Remark 2.

Centering and standardizing (i.e. normalizing) the nonlinearity is similar to Layer Normalization (LN) for FC-NNs, where for each input $x$ and each layer $\ell$, we normalize the (post-nonlinearity) activation vector $\alpha^{(\ell)}(x)$ to center and standardize its entries. In the infinite-width limit, normalizing $\sigma$ is equivalent to LN if the input datapoints all have the same norm. For more details, see Appendix C.

6 New NTK Parametrization: Boundary Effects and Learning Rates

In DC-NNs, the neurons which lie at a position on the border of the patches behave differently from neurons in the center. Typically, these neurons have fewer parent neurons in the previous layer and as a result have a lower variance at initialization. Both the activation kernels and the NTK have lower intensity for positions on the border (see Appendix G for an example with a single border pixel), which leads to border artifacts, as seen in Figure 2.

A natural solution is to adapt the normalization factors in the definition of the DC-filters. Instead of dividing by the square root of the maximal number of parents (which is only attained by center neurons), we divide by the square root of the actual number of parents of each output position.

In Appendix E, in order to be self-contained and since we consider upsampling, we show again that the NTK converges as the widths of the layers grow to infinity sequentially. By doing so, we obtain formulae for the limiting NTK which allow us to prove that, with the parent-based parametrization, the border artifacts disappear for both the activation kernels and the NTK:

Proposition 2.

For the parent-based parametrization of DC-NNs, if the nonlinearity is standardized, the diagonal values of the activation kernels and of the NTK depend neither on the position nor on the input.
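The parent counts entering this parametrization can be computed explicitly. The 1D toy sketch below (NumPy, with our own sizes and padding convention) counts, for each output position, how many copied (non-zero) pixels of the blown-up image fall inside its window; this count replaces the maximal parent count in the normalization:

```python
import numpy as np

def parent_counts(h, s, k):
    """Number of 'real' parents of each output position of a stride-s upsampling
    of an input of length h, followed by a zero-padded convolution with k offsets."""
    m = s * h                                   # length of the blown-up image
    counts = np.zeros(m, dtype=int)
    for p in range(m):
        for w in range(k):
            q = p + w - (k // 2)                # zero-padding: offsets centered on p
            if 0 <= q < m and q % s == 0:       # q is a copied (non-zero) pixel
                counts[p] += 1
    return counts

c = parent_counts(h=6, s=2, k=5)
print(c)   # border positions have fewer parents than interior positions of the same phase
# Parent-based parametrization: divide the output at position p by sqrt(c[p] * n_l)
# instead of by the maximal count sqrt(max(c) * n_l).
```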

6.1 Layer-dependent learning rate

The NTK is the sum of the contributions of the weights and of the biases of each layer. At the $\ell$-th layer, the weights and biases can only contribute to checkerboard patterns of a degree determined by $\ell$, i.e. patterns whose periods are given by products of the strides of the subsequent layers, in the following sense:

Proposition 3.

In a DC-NN with stride $s$, the contribution of the weights (resp. biases) of the $\ell$-th layer to the NTK vanishes for pairs of output positions whose difference has too small an $s$-valuation; in other words, these contributions are supported on checkerboard patterns whose period is the corresponding product of strides.

This suggests that the supports of the contributions of the weights and of the biases increase exponentially with $\ell$, giving more importance to the last layers during training. This could explain why the checkerboard patterns of lower degree dominate in Figure 2. In the classical parametrization, the balance is restored by letting the number of channels decrease with depth Radford et al. (2015). In the NTK parametrization, the limiting NTK is not affected by the ratios of the numbers of channels. To achieve the same effect, we divide the learning rates of the weights and of the biases of the $\ell$-th layer by factors proportional to the sizes of their supports, expressed in terms of the product of the strides.
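A minimal sketch of how such a layer-dependent learning rate could be wired up (NumPy). The exact scaling factor used below, the product of the strides up to each layer raised to the spatial dimension, is our reading of the rescaling described above, not a verbatim formula from the paper:

```python
import numpy as np

def layerwise_lr(base_lr, strides, d=2):
    """Divide the base learning rate of each layer by a factor that grows with the
    cumulative upsampling (product of strides up to that layer, to the power d).
    This rebalances the contributions of the layers to the NTK (our assumption)."""
    lrs = []
    for l in range(len(strides)):
        upscale = np.prod(strides[: l + 1]) ** d   # how much layer l's support is blown up
        lrs.append(base_lr / upscale)
    return lrs

print(layerwise_lr(base_lr=0.1, strides=[2, 2, 2]))   # later layers get smaller learning rates
```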

Together with the ‘parent-based’ parametrization and the normalization of the non-linearity (in order to lie in the chaotic regime), this rescaling of the learning rate removes both border and checkerboard artifacts in Figure 2.

[Figure 2 panels. Rows: FREEZE, CHAOS, BATCH NORM. Columns: standard parametrization, parent-based parametrization + layer-dependent learning rate, GAN samples.]

Figure 2: The first 8 eigenvectors of the NTK Gram matrix of a DC-NN (L=3) on 4 inputs, (left) with the parametrization of Section 2.2 and (middle) with the proposed modifications of Section 6; (right) results of a GAN on CelebA. Each row corresponds to a choice of non-linearity/normalization for the generator: (top) ReLU, (middle) normalized ReLU and (bottom) ReLU with Batch Normalization.

7 Generative Adversarial Networks

A common problem in the training of GANs is the collapse of the generator to a constant. This problem is greatly reduced by avoiding the “freeze” regime in which the constant mode dominates and by using the new NTK parametrization with adaptive learning rates. Figure 2 shows the results obtained with three GANs which differ only in the choice of non-linearity and/or the presence of Batch Normalization in the generator. In all cases, the discriminator is a convolutional network with the normalized ReLU as non-linearity. With the ReLU, the generator collapses and generates a single image with checkerboard patterns. With the normalized ReLU or with Batch Normalization, the generator is able to learn a variety of images. This motivates the use of normalization techniques in GANs to avoid the collapse of the generator.

8 Conclusion

This article shows how the NTK can be used theoretically to understand the effect of architecture choices (such as decreasing the number of channels or batch normalization) on the training of DNNs. We have shown that DNNs in the “freeze” regime have a strong affinity to constant modes and checkerboard artifacts: this slows down training and can contribute to a mode collapse of the DC-NN generator of GANs. We introduce simple modifications to solve these problems: the effectiveness of normalizing the non-linearity, of a parent-based parametrization and of a layer-dependent learning rate is shown both theoretically and numerically.

References

Appendix A Choice of Parametrization

The NTK parametrization differs slightly from the one usually used, yet it ensures that the training is consistent as the size of the layers grows. In the standard parametrization, the activations are defined by
$$\hat\alpha^{(\ell+1)}(x)=\sigma\big(\hat W^{(\ell)}\hat\alpha^{(\ell)}(x)+\hat b^{(\ell)}\big),$$
and we denote by $\hat f_{\hat\theta}$ the output function of the ANN. Note the absence of the $1/\sqrt{n_\ell}$ factor in comparison to the NTK parametrization. The parameters are initialized using the LeCun/He initialization LeCun et al. (2012): the connection weights of the $\ell$-th layer have standard deviation $1/\sqrt{n_\ell}$ (or $\sqrt{2/n_\ell}$ for the ReLU, but this does not change the general analysis). Using this initialization, the activations stay stochastically bounded as the widths of the ANN get large. In the forward pass, there is almost no difference between the two parametrizations: for each choice of NTK parameters $\theta$, we can rescale the connection weights by $1/\sqrt{n_\ell}$ and the bias weights by $\beta$ to obtain a new set of parameters $\hat\theta$ such that $\hat f_{\hat\theta}=f_\theta$.

The two parametrizations however exhibit a difference during backpropagation, since
$$\partial_{W^{(\ell)}_{ij}}f_\theta=\frac{1}{\sqrt{n_\ell}}\,\partial_{\hat W^{(\ell)}_{ij}}\hat f_{\hat\theta},\qquad \partial_{b^{(\ell)}_{i}}f_\theta=\beta\,\partial_{\hat b^{(\ell)}_{i}}\hat f_{\hat\theta}.$$

The NTK is a sum of products of these derivatives over all parameters (as in Section 3).

With our parametrization, all summands converge to a finite limit, while with the LeCun or He parametrization some summands explode in the infinite-width limit. One must therefore take a learning rate that scales like the inverse of the width Karakida et al. (2018); Park et al. (2018) to obtain meaningful training dynamics, but in this case the contributions to the NTK of the connections of the first layers and of the biases of all layers vanish, which implies that training these parameters has less and less effect on the function as the width of the network grows. As a result, the dynamics of the output function during training can still be described by a modified kernel gradient descent: the modified learning rate compensates for the absence of normalization in the usual parametrization.

The NTK parametrization is hence more natural for large networks, as it solves both the problem of having meaningful forward and backward passes and the problem of tuning the learning rate, which is what sparked multiple alternative initialization strategies in deep learning Glorot & Bengio (2010). Note that in the standard parametrization, the importance of the bias parameters shrinks as the width gets large; this can be implemented in the NTK parametrization by taking a small value for the parameter $\beta$.
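The correspondence between the two parametrizations, and the different gradient scales, can be checked on a single linear layer (NumPy; our own toy check):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
W = rng.standard_normal((1, n))          # NTK parametrization: W ~ N(0, 1)
x = rng.standard_normal(n)

f_ntk = (W @ x) / np.sqrt(n)             # NTK forward pass
W_std = W / np.sqrt(n)                   # equivalent standard-parametrization weights
f_std = W_std @ x                        # identical forward output
print(np.allclose(f_ntk, f_std))         # True

# Gradients with respect to the trainable parameters differ by sqrt(n):
g_ntk = x / np.sqrt(n)                   # d f_ntk / d W
g_std = x                                # d f_std / d W_std
print(g_ntk @ g_ntk, g_std @ g_std)      # NTK summand is O(1); the standard one is O(n)
```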


Figure 3: Results of two GANs on CelebA: (left) with Nonlinearity Normalization and (right) with Batch Normalization. In both cases the discriminator uses a normalized ReLU.

Appendix B FC-NN Freeze and Chaos

In this section, we prove Theorem 2, showing the existence of the two regimes, ‘freeze’ and ‘chaos’, in FC-NNs. First, we improve some results of Daniely et al. (2016) and study the rate of convergence of the activation kernels as the depth grows to infinity. In a second step, this allows us to characterise the behavior of the NTK for large depth.

Let us consider a standardized differentiable non-linearity $\sigma$, i.e. satisfying $\mathbb{E}_{X\sim\mathcal{N}(0,1)}[\sigma(X)^{2}]=1$. Recall that the activation kernels are defined recursively by $\Sigma^{(1)}(x,y)=\frac{1}{n_0}x^{T}y+\beta^{2}$ and $\Sigma^{(\ell+1)}(x,y)=L_{\Sigma^{(\ell)}}(x,y)+\beta^{2}$. By induction, for any $\ell$, $\Sigma^{(\ell)}(x,y)$ is uniquely determined by $\Sigma^{(1)}(x,y)$. Defining the dual $\hat\sigma$ of $\sigma$ by
$$\hat\sigma(\rho)=\mathbb{E}\big[\sigma(X)\,\sigma(Y)\big],\qquad (X,Y)\ \text{centered Gaussian with unit variances and covariance }\rho,$$
together with the affine map accounting for the bias term, one can formulate the activation kernels as an alternate composition of these two functions. In particular, this shows that the diagonal values $\Sigma^{(\ell)}(x,x)$ stay constant in $\ell$. Since the activation kernels are obtained by iterating the same map, we first study the fixed points of this composition. When $\sigma$ is a standardized non-linearity, the dual $\hat\sigma$ satisfies the following key properties, proven in Daniely et al. (2016):

  1. $\hat\sigma(1)=1$,

  2. for any $\rho\in[-1,1]$, $|\hat\sigma(\rho)|\le 1$,

  3. $\hat\sigma$ is convex in $[0,1]$,

  4. $\hat\sigma'=\widehat{\dot\sigma}$, where $\hat\sigma'$ denotes the derivative of $\hat\sigma$ and $\widehat{\dot\sigma}$ the dual of $\dot\sigma$,

  5. $\hat\sigma'(1)=\mathbb{E}_{X\sim\mathcal{N}(0,1)}[\dot\sigma(X)^{2}]$.
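The dual can be approximated by Monte Carlo, which gives a quick numerical check of these properties for the standardized ReLU (our own illustration, NumPy):

```python
import numpy as np

def dual(sigma, rho, n=1_000_000, seed=0):
    """Monte Carlo estimate of E[sigma(X) sigma(Y)] for (X, Y) standard Gaussians
    with correlation rho."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal(n)
    Z = rng.standard_normal(n)
    Y = rho * X + np.sqrt(1 - rho**2) * Z
    return np.mean(sigma(X) * sigma(Y))

relu_std = lambda x: np.sqrt(2.0) * np.maximum(x, 0.0)
drelu_std = lambda x: np.sqrt(2.0) * (x > 0)

print(dual(relu_std, 1.0))                          # property 1: ~ 1
print(dual(relu_std, -0.7), dual(relu_std, 0.3))    # property 2: values stay in [-1, 1]
print(dual(drelu_std, 1.0))                         # dual of the derivative at 1 = E[sigma'(X)^2] (cf. 4-5)
```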

By definition, $\hat\sigma(1)=1$, thus $\rho=1$ is a trivial fixed point. This shows that for any $\ell$ and any $x$ on the sphere, the diagonal value $\Sigma^{(\ell)}(x,x)$ stays constant.

It appears that $\rho=-1$ is also a fixed point if and only if the non-linearity is antisymmetric and $\beta=0$. From now on, we will focus on the region $\rho\in[0,1]$. From property 2 of $\hat\sigma$ and since the composed map is non-decreasing, any non-trivial fixed point must lie in $[0,1)$. Since the composed map fixes $1$ and is convex in $\rho$, there exists a non-trivial fixed point if its derivative at $\rho=1$ exceeds $1$, whereas otherwise there is no fixed point in $[0,1)$. This leads to the two regimes shown in Daniely et al. (2016), depending on the value of the characteristic value $r_{\sigma,\beta}$:

  1. “Freeze” when $r_{\sigma,\beta}<1$: the composed map has a unique fixed point, equal to $1$, and the activation kernels become constant at an exponential rate,

  2. “Chaos” when $r_{\sigma,\beta}>1$: the composed map has another fixed point $\rho^{*}<1$ and the activation kernels converge to a kernel equal to $1$ if $x=y$ and to $\rho^{*}$ if $x\neq\pm y$; if the nonlinearity is antisymmetric and $\beta=0$, the limit equals $-1$ if and only if $x=-y$.

To establish the existence of the two regimes for the NTK, we need the following bounds on the rate of convergence of the activation kernels in the “freeze” regime and on their values in the “chaos” regime:

Lemma 2.

Let $\sigma$ be a standardized differentiable non-linearity.

If $r_{\sigma,\beta}<1$, then for any initial correlation $\rho\in[0,1]$, the iterates of the composed map converge to $1$ exponentially fast in the number of iterations.

If $r_{\sigma,\beta}>1$, then there exists a fixed point $\rho^{*}<1$ of the composed map such that, for any initial correlation $\rho\in[0,1)$, the iterates remain bounded away from $1$ and converge to $\rho^{*}$.

Proof.

Let us suppose that $r_{\sigma,\beta}<1$. By Daniely et al. (2016), we know the identities and inequalities satisfied by the dual $\hat\sigma$ and its derivative (properties 1–5 above); from now on, we will omit specifying the distribution assumption on $X$. These equalities and inequalities imply the following bound: