Gradients explode - Deep Networks are shallow - ResNet explained

Whereas it is believed that techniques such as Adam, batch normalization and, more recently, SeLU nonlinearities "solve" the exploding gradient problem, we show that this is not the case in general and that in a range of popular MLP architectures, exploding gradients exist and that they limit the depth to which networks can be effectively trained, both in theory and in practice. We explain why exploding gradients occur and highlight the *collapsing domain problem*, which can arise in architectures that avoid exploding gradients. ResNets have significantly lower gradients and thus can circumvent the exploding gradient problem, enabling the effective training of much deeper networks, which we show is a consequence of a surprising mathematical property. By noticing that *any neural network is a residual network*, we devise the *residual trick*, which reveals that introducing skip connections simplifies the network mathematically, and that this simplicity may be the major cause for their success.


1 Introduction

Arguably, the primary reason for the success of neural networks is their “depth”, i.e. their ability to compose and jointly train nonlinear functions so that they co-adapt. A large body of work has detailed the benefits of depth (e.g. Montúfar et al. (2014); Delalleau and Bengio (2011); Martens et al. (2013); Bianchini and Scarselli (2014); Shamir and Eldan (2015); Telgarsky (2015); Mhaskar and Shamir (2016)).

The exploding gradient problem has been a major challenge for training very deep feedforward neural networks at least since the advent of gradient-based parameter learning (Hochreiter, 1991). In a nutshell, it describes the phenomenon that as the gradient is backpropagated through the network, it may grow exponentially from layer to layer. This can, for example, make the application of vanilla SGD impossible. Either the step size is too large for updates to lower layers to be useful or it is too small for updates to higher layers to be useful. While this intuitive notion is widely understood, there are significant gaps in the understanding of this important phenomenon. In this paper, we take a significant step towards closing those gaps.

Defining exploding gradients

To begin with, there is no well-accepted metric for determining the presence of pathological exploding gradients. Should we care about the length of the gradient vector? Should we care about the size of individual components of the gradient vector? Should we care about the eigenvalues of the Jacobians of individual layers? Depending on the metric used, different strategies arise for combating exploding gradients. For example, manipulating the width of layers as suggested by e.g. Yang and Schoenholz (2018); Han et al. (2017) can greatly impact the size of gradient vector components but leaves the length of the gradient vector relatively unchanged.

The problem is that it is unknown whether exploding gradients, when defined according to any of these metrics, necessarily lead to training difficulties. There is much evidence that gradient explosion when defined according to some metrics is associated with poor results when certain architectures are paired with certain optimization algorithms (e.g. Schoenholz et al. (2017); Glorot and Bengio (2015)). But, can we make general statements about entire classes of algorithms and architectures?

Prevalence

It has become a common notion that techniques such as introducing normalization layers (e.g. Ioffe and Szegedy (2015), Ba et al. (2016), Chunjie et al. (2017), Salimans and Kingma (2016)) or careful initial scaling of weights (e.g. He et al. (2015), Glorot and Bengio (2015), Saxe et al. (2014), Mishkin and Matas (2016)) largely eliminate exploding gradients by stabilizing forward activations. This notion was espoused in landmark papers. The paper that introduced batch normalization (Ioffe and Szegedy, 2015) states:

In traditional deep networks, too-high learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima. Batch Normalization helps address these issues.

The paper that introduced ResNet (He et al., 2016b) states:

Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients, which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization and intermediate normalization layers, …

We argue that these claims are too optimistic. While scaling weights or introducing normalization layers can reduce gradient growth defined according to some metrics in some situations, these techniques are not effective in general and can cause other problems even when they are effective. We intend to add nuance to these ideas which have been widely adopted by the community (e.g. Chunjie et al. (2017); Balduzzi et al. (2017)). In particular, we intend to correct the misconception that stabilizing forward activations is sufficient for avoiding exploding gradients (e.g. Klambauer et al. (2017)).

Impact

Algorithms such as RMSprop (Tieleman and Hinton, 2012), Adam (Kingma and Ba, 2015) or vSGD (Schaul et al., 2013) are light modifications of SGD that rescale different parts of the gradient vector and are known to be able to lead to improved training outcomes. This raises an important unanswered question. Are exploding gradients merely a numerical quirk to be overcome by simply rescaling different parts of the gradient vector or are they reflective of an inherently difficult optimization problem that cannot be easily tackled by simple modifications to a stock algorithm?

Origins and tradeoffs

The exploding gradient problem is often discussed in conjunction with the vanishing gradient problem, and often the implication is that the best networks exist on the edge between the two phenomena and exhibit stable gradients (e.g. Glorot and Bengio (2015), Schoenholz et al. (2017)). But is avoiding gradient pathology simply a matter of designing the network to have a rate of gradient change per layer as close to one as possible, or are there more fundamental reasons why gradient pathology arises in so many popular architectures? Are there tradeoffs that cannot be escaped as easily as the exploding / vanishing gradient dichotomy suggests?

Solutions

ResNet (He et al., 2016b) and other neural network architectures utilizing skip connections (e.g. Huang et al. (2017a), Szegedy et al. (2016)) have been highly successful recently. While the performance of networks without skip connections starts to degrade when depth is increased beyond a certain point, the performance of ResNet continues to improve until a much greater depth is reached. While favorable changes to properties of the gradient brought about by the introduction of skip connections have been demonstrated for specific architectures (e.g. Yang and Schoenholz (2017); Balduzzi et al. (2017)), a general explanation of the power of skip connections has not been given.

Our contributions are as follows:

  1. We introduce the ‘gradient scale coefficient’ (GSC), a novel measurement for assessing the presence of pathological exploding gradients (section 2). It is robust to confounders such as network scaling (section 2) and layer width (section 3) and can be used directly to show that training is difficult (section 4). We propose that this metric can standardize research on gradient pathology.

  2. We demonstrate that exploding gradients are in fact present in a variety of popular MLP architectures, including architectures utilizing techniques that supposedly combat exploding gradients. We show that introducing normalization layers may even exacerbate the exploding gradient problem (section 3).

  3. We introduce the ‘residual trick’ (section 4), which reveals that arbitrary networks can be viewed as residual networks. This enables the application of analysis devised for ResNet, including the popular notion of ‘effective depth’, to arbitrary networks.

  4. We show that exploding gradients as defined by the GSC are not a numerical quirk to be overcome by rescaling different parts of the gradient vector, but are indicative of an inherently complex optimization problem and that they limit the depth to which MLP architectures can be effectively trained, rendering very deep MLPs effectively much shallower (section 4).

  5. For the first time, we explain why exploding gradients are likely to occur in deep networks even when the forward activations do not explode (section 5). We argue that this is a fundamental reason for the difficulty of constructing very deep trainable networks.

  6. We define the ‘collapsing domain problem’ for training very deep feedforward networks. We show how this problem can arise precisely in architectures that avoid exploding gradients and that it can be at least as damaging to the training process (section 6).

  7. For the first time, we show that the introduction of skip connections has a strong gradient-reducing effect on general deep network architectures and that this follows directly from the Pythagorean equation (section 7).

  8. We reveal that ResNets are a mathematically simpler version of networks without skip connections and thus approximately achieve what we term the ‘orthogonal initial state’. This provides, we argue, the major reason for their superior performance at great depths as well as an important criterion for neural network design in general (section 7).

In section 8, we discuss related work. In section 9, we conclude and derive practical recommendations for designing and training deep networks as well as key implications of our work for deep learning research in general. We provide further discussion in section A, including future work (A.4).

2 Exploding gradients defined - the gradient scale coefficient

2.1 Notation and terminology

For the purpose of this paper, we define a neural network f as a succession of layers f_l, 0 ≤ l ≤ L, where each layer is a function that maps a vector of fixed dimensionality to another vector of fixed but potentially different dimensionality. We assume a prediction framework, where the ‘prediction layer’ f_1 is considered to output the prediction of the network and the goal is to minimize the value of the error layer f_0 over the network’s prediction and the true label y, summed over some dataset D:

(1)   \min_\theta \sum_{(x,y)\in D} f_0(f_1(\theta_1, f_2(\theta_2, \dots f_L(\theta_L, x)\dots)),\, y)

Note that in contrast to standard notation, we denote by f_L the lowest layer and by f_0 the highest layer of the network, as we are primarily interested in the direction of gradient flow. Let the dimensionality / width of layer f_l be d_l, with d_0 = 1, and the dimensionality of the data input x be d.

Each layer except f_0 is associated with a parameter sub-vector θ_l; these sub-vectors collectively make up the parameter vector θ. This vector represents the trainable elements of the network. Depending on the type of the layer, the sub-vector might be empty. For example, a layer that computes a componentwise nonlinearity such as tanh of the incoming vector has no trainable elements, so its parameter sub-vector is empty. We call these layers ‘unparametrized’. In contrast, a fully-connected linear layer has trainable weights, which are encompassed in the parameter sub-vector. We call these layers ‘parametrized’.

We say a network that has layers f_0 through f_L has ‘nominal depth’ L. In contrast, we say the ‘compositional depth’ is equal to the number of parametrized layers in the network, which approximately encapsulates what is commonly referred to as “depth”. For example, a network composed of three linear layers, two tanh layers and a softmax layer has nominal depth 6, but compositional depth 3. We also refer to the data input x as the ‘input layer’ or the (L+1)’st layer and write f_{L+1} for x and d_{L+1} for d.

Let J_l^k be the Jacobian of the k’th layer f_k with respect to the l’th layer f_l, evaluated with parameter θ at (x, y), where 0 ≤ k ≤ l ≤ L+1. Similarly, let J_{θ_l}^k be the Jacobian of the k’th layer f_k with respect to the parameter sub-vector θ_l of the l’th layer f_l.

Let the ‘quadratic expectation’ E_qm[X] of a random variable X be defined as \sqrt{\mathbb{E}[X^2]}, i.e. the generalization of the quadratic mean to random variables. Similarly, let the ‘inverse quadratic expectation’ of a random variable X be defined as \mathbb{E}[X^{-2}]^{-1/2}. Further terminology, notation and conventions used only in the appendix are given in section B.

2.2 The colloquial notion of exploding gradients

Colloquially, the exploding gradient problem is understood approximately as follows:

When the error is backpropagated through a neural network, it may increase exponentially from layer to layer. In those cases, the gradient with respect to the parameters in lower layers may be exponentially greater than the gradient with respect to parameters in higher layers. This makes the network hard to train if it is sufficiently deep.

We might take this colloquial notion to mean that if ||J_l^0|| and / or ||J_{θ_l}^0|| grow exponentially in l, according to some to-be-determined norm ||·||, the network is hard to train if it is sufficiently deep. However, this notion is insufficient because we can construct networks that can be trained successfully yet have Jacobians that grow exponentially at arbitrary rates. In a nutshell, all we have to do to construct such a network is to take an arbitrary network of desired depth that can be trained successfully and multiplicatively rescale each layer function and each parameter sub-vector by suitable layer-dependent constants. During training, all we have to do to correct for this change is to rescale the gradient sub-vector corresponding to each layer accordingly.

Proposition 1.

Consider any rate c > 1 and any neural network f which can be trained to some error level in a certain number of steps by some gradient-based algorithm. There exists a network f′ with the same nominal and compositional depth as f that can also be trained to the same error level as f and to make the same predictions as f in the same number of steps using the same algorithm, and that has exponentially growing Jacobians with rate c. (See section D.1 for details.)

Therefore, we need a definition of ‘exploding gradients’ different from ‘exponentially growing Jacobians’ if we hope to derive from it that training is difficult and that exploding gradients are not just a numerical issue to be overcome by gradient rescaling.

Note that all propositions and theorems are stated informally in the main body of the paper, for the purpose of readability and brevity. In the appendix in sections D and E respectively, they are re-stated in rigorous terms, proofs are provided and the conditions are discussed.

2.3 The gradient scale coefficient

In this section, we outline our definition of ‘exploding gradients’ which can be used to show that training is difficult. It does not suffer from the confounding effect outlined in the previous section.

Definition 1.

Let the ‘quadratic mean norm’ or ‘qm norm’ of an m × n matrix A be the quadratic mean of its singular values, where the sum of squares is divided by the number of columns n. If s_1, s_2, .., s_{\min(m,n)} are the singular values of A, we have:

||A||_{qm} = \sqrt{\frac{s_1^2 + s_2^2 + \dots + s_{\min(m,n)}^2}{n}}

Proposition 2.

Let A be an m × n matrix and u a uniformly distributed, n-dimensional unit length vector. Then E_qm[||Au||_2] = ||A||_{qm}. (See section D.2 for the proof.)

In plain language, the qm norm measures the expected impact the matrix has on the length of a vector with uniformly random orientation. The qm norm is also closely related to the L2 norm. We will use ||·||_2 to denote the L2 norm of both vectors and matrices.

Proposition 3.

Let A be an m × n matrix. Then ||A||_{qm} = \frac{1}{\sqrt{n}}\,||A||_2. (See section D.3 for the proof.)
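As a sanity check on these definitions, the following sketch (ours, not from the paper; it assumes NumPy, and all names and dimensions are our choices) computes the qm norm of a random matrix from its singular values and verifies numerically that it coincides with the Frobenius norm divided by √n and with the quadratic expectation of ||Au||_2 over uniformly distributed unit vectors u, as propositions 2 and 3 (as reconstructed here) state.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 50
A = rng.normal(size=(m, n))

# qm norm: quadratic mean of the singular values, with the sum of
# squares divided by the number of columns n.
s = np.linalg.svd(A, compute_uv=False)
qm_norm = np.sqrt((s ** 2).sum() / n)

# Proposition 3 (as reconstructed here): qm norm = Frobenius norm / sqrt(n).
frob_over_sqrt_n = np.linalg.norm(A, "fro") / np.sqrt(n)

# Proposition 2: the quadratic expectation of ||A u||_2 over uniformly
# distributed unit-length vectors u equals the qm norm (Monte Carlo check).
u = rng.normal(size=(n, 100_000))
u /= np.linalg.norm(u, axis=0)          # random points on the unit sphere
qe_Au = np.sqrt(np.mean(np.linalg.norm(A @ u, axis=0) ** 2))

print(qm_norm, frob_over_sqrt_n, qe_Au)  # all three values agree closely
```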

Definition 2.

Let the ‘gradient scale coefficient (GSC)’ for 0 ≤ k ≤ l ≤ L+1 be as follows:

GSC(k, l) = ||J_l^k||_{qm}\,\frac{||f_l||_2}{||f_k||_2}

Definition 3.

We say that the network has ‘exploding gradients with rate c and intercept b’ at some point if for all 0 ≤ k ≤ l ≤ L we have GSC(k, l) ≥ b c^{l-k}, and in particular GSC(0, l) ≥ b c^l.

Of course, under this definition, any network of finite depth has exploding gradients for sufficiently small c and b. There is no objective threshold for c and b beyond which exploding gradients become pathological. Informally, we will say that a network has ‘exploding gradients’ if the GSC can be well-approximated by an exponential function, though we will restrict our attention to the GSC of the error, GSC(0, l), in this paper.

The GSC combines the qm norm of the Jacobian with the ratio of the lengths of the forward activation vectors. In plain language, it measures the size of the gradient flowing backward relative to the size of the activations flowing forward. Equivalently, it measures the relative sensitivity of layer k with respect to small random changes in layer l.

Proposition 4.

Let u be a uniformly distributed, d_l-dimensional unit length vector. Then GSC(k, l) measures the quadratic expectation of the relative size of the change in the value of f_k in response to a change in f_l that is a small multiple of ||f_l||_2 u. (See section D.4 for details.)
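To make this concrete, here is a minimal sketch (ours; it assumes NumPy and uses the GSC formula as reconstructed in definition 2) that estimates the GSC between the output of a small randomly initialized tanh MLP and each layer below it by chaining explicit layer-wise Jacobians. For readability, the code indexes layers from the input upward, i.e. in the opposite order to the paper's convention.

```python
import numpy as np

rng = np.random.default_rng(1)
width, depth = 50, 8
# Random tanh MLP; index 0 is the input and index `depth` the output here.
Ws = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width)) for _ in range(depth)]

def forward(x):
    """Activations of every layer plus layer-wise Jacobians for a tanh MLP."""
    acts, jacs = [x], []
    h = x
    for W in Ws:
        h = np.tanh(W @ h)
        jacs.append((1.0 - h ** 2)[:, None] * W)  # d(layer out) / d(layer in)
        acts.append(h)
    return acts, jacs

def gsc(acts, jacs, low, high):
    """Relative sensitivity of layer `high` to small random changes in layer `low`:
    qm norm of the chained Jacobian times ||f_low|| / ||f_high||."""
    J = np.eye(len(acts[high]))
    for j in range(high - 1, low - 1, -1):
        J = J @ jacs[j]
    qm = np.linalg.norm(J, "fro") / np.sqrt(J.shape[1])
    return qm * np.linalg.norm(acts[low]) / np.linalg.norm(acts[high])

x = rng.normal(size=width)
acts, jacs = forward(x)
# GSC between the network output and every layer below it.
print([round(gsc(acts, jacs, l, depth), 3) for l in range(depth)])
```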

What about the sensitivity of layers with respect to changes in the parameter? For fully-connected linear layers, we obtain a similar relationship.

Proposition 5.

Let u be a uniformly distributed unit length vector of the same dimensionality as θ_l. Assume f_l is a fully-connected linear layer without trainable bias parameters and θ_l contains the entries of the weight matrix. Then GSC(k, θ_l) measures the quadratic expectation of the relative size of the change in the value of f_k in response to a change in θ_l that is a small multiple of ||θ_l||_2 u. Further, if the weight matrix is randomly initialized, GSC(k, θ_l) ≈ GSC(k, l).

(See section D.5 for details.)

For reasons of space and mathematical simplicity, we focus our analysis for now on multi-layer perceptrons (MLPs) which are comprised only of fully-connected linear layers with no trainable bias parameters, and unparametrized layers. Therefore we also do not use trainable bias and variance parameters in the normalization layers. Note that using very deep MLPs with certain architectural constraints as a testbed to advance the study of exploding gradients and related concepts is a well-established practice (e.g. Balduzzi et al. (2017); Yang and Schoenholz (2017); Raghu et al. (2017)). Like Schoenholz et al. (2017), we focus on training error rather than test error in our analysis as we do not consider the issue of generalization. While exploding gradients have important implications for generalization, this goes beyond the scope of this paper.

In section 2.2, we showed that we can construct trainable networks with arbitrarily growing Jacobians by simple multiplicative rescaling of layers, parameters and gradients. Crucially, the GSC is invariant to this rescaling as it affects both the forward activations and the Jacobian equally, so the effects cancel out.

Proposition 6.

GSC(k, l) is invariant under multiplicative rescalings of the network that do not change the predictions or error values of the network. (See section D.6 for details.)

3 The prevalence of exploding gradients - gradients explode despite bounded activations

In this section, we show that exploding gradients exist in a range of popular MLP architectures. Consider the decomposability of the GSC.


Figure 1:

Key metrics for architectures in their randomly initialized state evaluated on Gaussian noise. The x axis in A shows depth in terms of the number of linear layers counted from the input. The x axis in B-F counts nonlinearity layers, starting from the input. Note: The curve for layer-ReLU is shadowed by tanh in A, by ReLU in E and F and by SELU among others in C.

Proposition 7.

Assuming the approximate decomposability of the qm norm of the product of Jacobians, i.e. ||J_l^k||_{qm} ≈ ||J_m^k||_{qm} ||J_l^m||_{qm} for k ≤ m ≤ l, we have GSC(k, l) ≈ GSC(k, m) GSC(m, l). (See section D.7 for the proof.)

This indicates that as long as the GSC of individual layers is approximately equal to some constant c > 1 and as long as the qm norm of the product of layer-wise Jacobians approximately decomposes, we have an exponential growth of GSC(k, l) in l − k. In figure 1A, we show GSC(0, l) for seven MLP architectures. A linear layer is followed by (i) a ReLU nonlinearity (‘ReLU’), (ii) layer normalization (Ba et al., 2016) followed by a ReLU nonlinearity (‘layer-ReLU’), (iii) batch normalization plus ReLU (‘batch-ReLU’), (iv) a tanh nonlinearity, (v) layer norm plus tanh (‘layer-tanh’), (vi) batch norm plus tanh (‘batch-tanh’), (vii) a SELU nonlinearity (Klambauer et al., 2017). SELU was recently introduced by Klambauer et al. (2017) and has since seen practical use (e.g. Jurman et al. (2017); Zhang and Shi (2017); Huang et al. (2017b); Malekzadeh et al. (2017); Bhat and Goldman-Mellor (2017)).

All networks have compositional depth 50 (i.e. 50 linear layers) and each layer has width 100. Both data input and labels are Gaussian noise and the error layer computes the dot product between the label and the prediction. The entries of the weight matrices are drawn independently from Gaussian distributions with mean zero. Weight matrix entries for ReLU architectures are initialized with variance 2 divided by the fan-in as suggested by He et al. (2015), weight matrix entries for tanh architectures with variance 1 divided by the fan-in as suggested by Glorot and Bengio (2015), and weight matrix entries for SELU architectures with variance 1 divided by the fan-in as suggested by Klambauer et al. (2017). For further details about the experimental protocol, architecture composition and normalization / nonlinearity operations used, see section H.
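For concreteness, a minimal sketch (ours; it assumes NumPy) of the three initialization schemes as we read them. The exact variances are our interpretation of the cited references for layers of equal width; they are not quoted from the paper's text.

```python
import numpy as np

def init_weights(fan_in, fan_out, scheme, seed=0):
    """Gaussian weight matrix with mean zero and a scheme-dependent variance.
    The variances are our reading of the cited schemes: He et al. (2015)
    suggest 2/fan_in for ReLU; Glorot & Bengio's 2/(fan_in + fan_out) reduces
    to 1/fan_in for equal widths; Klambauer et al. (2017) suggest 1/fan_in
    for SELU."""
    variance = {"relu": 2.0, "tanh": 1.0, "selu": 1.0}[scheme] / fan_in
    rng = np.random.default_rng(seed)
    return rng.normal(scale=np.sqrt(variance), size=(fan_out, fan_in))

W = init_weights(100, 100, "relu")
print(W.var())   # close to 0.02
```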

We find that in four architectures (batch-ReLU, layer-tanh, batch-tanh and SELU), GSC(0, l) grows almost perfectly linearly in log-space. This corresponds to gradient explosion. We call those architectures ‘exploding architectures’. Among these architectures, a range of techniques that supposedly reduce or eliminate exploding gradients are used: careful initial scaling of weights, normalization layers, SELU nonlinearities. Adding normalization layers may even bring about exploding gradients, as observed when comparing ReLU with batch-ReLU or tanh with layer-tanh or batch-tanh.

In light of proposition 6, it is not surprising that these techniques are not effective in general at combating exploding gradients as defined by the GSC, as this metric is invariant under multiplicative rescaling. Normalization layers scale the forward activations. Carefully choosing the initial scale of weights corresponds to a multiplicative scaling of the parameter. SELU nonlinearities, again, act to scale down large activations and scale up small activations. While these techniques may of course impact the GSC by changing the fundamental mathematical properties of the network, as observed when comparing e.g. ReLU and batch-ReLU, they do not reduce it simply by virtue of controlling the size of forward activations. Note that while we focus on gradient explosion as defined by the GSC in this section, the four exploding architectures would exhibit gradient explosion under any reasonable metric.

In contrast, the other three architectures (ReLU, layer-ReLU and tanh) do not exhibit exploding gradients. However, this apparent advantage comes at a cost, as we further explain in sections 6 and 7.3.

All curves in figure 1A exhibit small jitters. This is because we plotted the value of the GSC at every linear layer, every normalization layer and every nonlinearity layer in this figure and then connected the points corresponding to these values. Layers were placed equispaced on the x axis in the order they occurred in the network. Not every type of layer affects the GSC equally. In fact, we find that as gradients pass through linear layers, they tend to shrink relative to forward activations. In the exploding architectures, this is more than counterbalanced by the relative increase the gradient experiences as it passes through e.g. normalization layers. Despite these differences, it is worth noting that each individual layer used in the architectures studied has only a small impact on the GSC. This impact would be larger for either the forward activations or gradients taken by themselves. For example, passing through a ReLU layer reduces the length of both forward activation and gradient vector by a factor of roughly √2. The relative invariance of the GSC to individual layers suggests that it measures not just a superficial quantity, but a deep property of the network. This hypothesis is confirmed in the following sections.
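The √2 figure quoted above is our own quick calculation rather than a number taken from the paper: for zero-mean, symmetric pre-activations, ReLU zeroes roughly half the coordinates of both the forward activation vector and the backpropagated gradient, halving their squared lengths. A short NumPy check:

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=(10000, 100))        # zero-mean pre-activations
g = rng.normal(size=(10000, 100))        # incoming gradients

h = np.maximum(z, 0)                     # forward pass through ReLU
g_back = g * (z > 0)                     # backward pass: gradient is masked

ratio_fwd = np.mean(np.linalg.norm(z, axis=1) / np.linalg.norm(h, axis=1))
ratio_bwd = np.mean(np.linalg.norm(g, axis=1) / np.linalg.norm(g_back, axis=1))
print(ratio_fwd, ratio_bwd)              # both close to sqrt(2) ~= 1.414
```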

Finally, we note that the GSC is also robust to changes in width and depth. Changing the depth has no impact on the rate of explosion of the four exploding architectures as the layer-wise GSC, i.e. GSC(l, l+1), is itself independent of depth. In figure 1A, we also show the results for the SELU architecture where each layer contains 200 neurons instead of 100 (‘SELU wide’). We found that the rate of gradient explosion decreases slightly when width increases. We also studied networks with exploding architectures where the width oscillated from layer to layer. GSC(0, l) still increased approximately exponentially and at a similar rate to corresponding networks with constant width.

A summary of results can be found in table 1.

4 The impact of exploding gradients - exploding gradients limit depth

4.1 Notation and terminology - ResNet

We denote a ResNet as a sequence of ‘blocks’ b_l, where each block is the sum of a fixed ‘skip block’ s_l and a ‘residual block’ r_l. Residual blocks are composed of a sequence of layers. We define the optimization problem for a ResNet analogously to equation 1.

(2)   \min_\theta \sum_{(x,y)\in D} f_0(b_1(b_2(\dots b_B(x)\dots)),\, y), \qquad b_l(\cdot) = s_l(\cdot) + r_l(\cdot)

4.2 Background: Effective depth

In this section, we introduce the concept of ‘effective depth’ as defined for ResNet architectures by Veit et al. (2016). Let’s assume for the sake of simplicity that the dimensionality of each block in a ResNet is identical. In that case, the skip block is generally chosen to be the identity function. Then, writing the identity matrix as I, the Jacobian obtained by backpropagating through the sequence of blocks decomposes as

\prod_l (I + J_{r_l})

Multiplying out, this becomes the sum of 2^B terms, where B is the number of blocks. Almost all of those terms are the product of approximately B/2 identity matrices and B/2 residual block Jacobians. If the operator norm of the residual block Jacobians is less than some c < 1, the norm of a term decreases exponentially in the number of residual block Jacobians it contains. Let the terms containing k or more residual block Jacobians be called ‘k-residual’ and let res_k be the sum of all k-residual terms. Then:

||res_k|| \le \sum_{j=k}^{B} \binom{B}{j} c^j

Again, if c < 1, the right hand side decreases exponentially in k for sufficiently large k, for example when k is a sufficiently large fraction of B. So the combined size of k-residual terms is exponentially small. Therefore, Veit et al. (2016) argue, the full set of blocks does not jointly co-adapt during training because the information necessary for such co-adaption is contained in gradient terms that contain many or all residual block Jacobians. Only sets of blocks of size at most k where res_k is not negligibly small co-adapt. The largest such k, multiplied by the depth of each block, is called the ‘effective depth’ of the network. Veit et al. (2016) argue that a ResNet is not really as deep as it appears, but rather behaves as an ensemble of relatively shallow networks where each member in that ensemble has depth less than or equal to the effective depth of the ResNet. This argument is bolstered by the success of the stochastic depth (Huang et al., 2016) training technique, where random sets of residual blocks are deleted for each mini-batch update.
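The following toy computation (ours; the block count, dimension and the operator-norm value c are arbitrary choices, and NumPy is assumed) expands the product ∏(I + J_{r_l}) term by term and shows how the combined size of k-residual terms, bounded here by the sum of operator norms, becomes exponentially small once k is large enough.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n_blocks, dim, c = 8, 20, 0.35           # residual Jacobian operator norms = c < 1
J_res = []
for _ in range(n_blocks):
    M = rng.normal(size=(dim, dim))
    J_res.append(c * M / np.linalg.norm(M, 2))   # rescale to operator norm exactly c

# Expand prod_i (I + J_i): one term per subset of blocks; the chosen blocks
# contribute their residual Jacobian, the remaining blocks contribute I.
norm_exactly_k = np.zeros(n_blocks + 1)
for k in range(n_blocks + 1):
    for subset in combinations(range(n_blocks), k):
        term = np.eye(dim)
        for i in subset:
            term = J_res[i] @ term
        norm_exactly_k[k] += np.linalg.norm(term, 2)

# res_k: combined size (here, an upper bound via the sum of operator norms)
# of all terms containing k or more residual Jacobians.
res = np.cumsum(norm_exactly_k[::-1])[::-1]
for k in range(n_blocks + 1):
    print(f"k={k}: per-term norm <= {c**k:.4f}, combined k-residual bound {res[k]:.4f}")
```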

Veit et al. (2016) introduced the concept of effective depth somewhat informally. We give our formal definition in section C. There, we also provide a more detailed discussion of the concept and point out limitations.

4.3 The residual trick

Now we make a crucial observation. Any neural network can be expressed in the framework of equation 2. We can simply cast each layer as a block, choose an arbitrary skip block s_l and define r_l = f_l − s_l. Specifically, if we train a network from some fixed initial parameter value θ^{(0)}, we can set s_l = f_l(θ_l^{(0)}, ·) and thus r_l = f_l − f_l(θ_l^{(0)}, ·). Then training begins with all the r_l being zero functions. Therefore, all analysis devised for ResNet that relies on the small size of the residual block Jacobians can then be brought to bear on arbitrary networks. We term this the ‘residual trick’. Indeed, the analysis by Veit et al. (2016) does not rely on the network having skip connections in the computational sense, but only on the mathematical framework of equation 2. Therefore, as long as the operator norms of the r_l are small, the network is effectively shallow.
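A tiny illustration of the residual trick (ours; the layer, parameter names and numbers are hypothetical, and NumPy is assumed): an ordinary tanh layer is rewritten as a fixed initial function plus a residual function that is exactly zero at the initial parameter and grows only as the parameter moves away from its initialization.

```python
import numpy as np

rng = np.random.default_rng(4)
dim = 30
W0 = rng.normal(scale=1.0 / np.sqrt(dim), size=(dim, dim))   # initial parameter
W = W0.copy()                                                 # current parameter

def layer(x, W):
    return np.tanh(W @ x)

# Residual trick: cast the layer as (fixed initial function) + (residual function).
def initial_fn(x):
    return layer(x, W0)

def residual_fn(x, W):
    return layer(x, W) - layer(x, W0)

x = rng.normal(size=dim)
print(np.linalg.norm(residual_fn(x, W)))          # 0.0 at the initial parameter

# After the parameter moves away from its initialization, the residual
# function grows away from zero; its size relative to the initial function
# tracks how far the layer has departed from its initial state.
W = W0 + 0.05 * rng.normal(scale=1.0 / np.sqrt(dim), size=(dim, dim))
print(np.linalg.norm(residual_fn(x, W)) / np.linalg.norm(initial_fn(x)))
```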

4.4 Notation and terminology - residual network

From now on, we will make a distinction between the terms ‘ResNet’ and ‘residual network’. ‘ResNet’ will be used to refer to networks that have an architecture as in He et al. (2016b) that uses skip connections as outlined in section 4.1. In contrast, networks without skip connections will be referred to as ‘vanilla networks’. Define the fixed ‘initial function’ i_l as f_l(θ_l^{(0)}, ·) and the ‘residual function’ ρ_l as f_l(θ_l, ·) − f_l(θ_l^{(0)}, ·). Then we refer to a ‘residual network’ as an arbitrary network expressed as

(3)   f_l(θ_l, ·) = i_l(·) + ρ_l(θ_l, ·)

The gradient then involves the product of layer-wise Jacobians \prod_l (J_{i_l} + J_{ρ_l}). After multiplying out, a term containing k or more residual Jacobians J_{ρ_l} is called ‘k-residual’.

4.5 Theoretical analysis

In this section, we will show that an exploding gradient as defined by the GSC causes the effective training time of deep MLPs to be exponential in depth and thus limits the effective depth that can be achieved.

The proof is based on the insight that the relative size of a gradient-based update on θ_l is bounded by the inverse of the GSC if that update is to be useful. The basic assumption underlying gradient-based optimization is that the function optimized is locally well-approximated by a linear function as indicated by the gradient. Any update made based on a local gradient computation must be small enough so that the updated value lies in the region around the original value where the linear approximation is sufficiently accurate. Let’s assume we apply a random update to θ_l with relative size ε, i.e. of length ε||θ_l||_2. Then under the local linear approximation, according to proposition 5, this would change the value of the error f_0 approximately by a value with quadratic expectation ε GSC(0, θ_l) f_0. Hence, if ε were much larger than 1/GSC(0, θ_l), with significant probability, the error would become negative. This is not reflective of the true behavior of f_0 in response to changes in θ_l of this magnitude. Since f_0 is even more sensitive to updates in the direction of the gradient than it is to random updates, useful gradient-based updates are even more likely to be bounded in relative magnitude by 1/GSC(0, θ_l).


Figure 2: Illustrations of networks of different architectures as functions of the parameter in a single linear layer. For each network architecture as indicated under (C-K) with 50 linear layers, three random weight configurations are chosen that differ only at a single linear layer as indicated. For each location on the sphere centered on the origin containing those three configurations, the input shown in A from the CIFAR10 dataset is propagated through the network with weights indicated by that location. The length of the 3-dimensional output of the prediction layer is then normalized. Each location on the sphere is colored according to this output as shown in B. Weight configurations where the input is assigned class 1/2/3 are shown in red/green/blue respectively. Discs B through K are azimuthal projections. See section H for details.

Figure 2 illustrates this phenomenon. There, we depict the value of the output of the prediction layer of various network architectures given the input in 2A over a 2-dimensional subspace of the parameter space of a single linear layer. (The prediction layer here is 3-dimensional; see figure 2B.) Different colors correspond to different regions of the output space, which is shown in 2B. Figures 2C-H illustrate that the complexity of a batch-ReLU network as a function of the parameter grows exponentially with depth, as indicated by figure 1A. As this complexity grows, the region of the parameter space that is well-approximated by the local gradient vanishes accordingly. Figures 2I-K compare 3 linear layers in different architectures with comparable GSC values. We can observe that while the visual properties vary, the complexity is comparable.

In a nutshell, if 1/GSC(0, θ_l) decreases exponentially in l, so must the relative size of updates. So for a residual function to reach a certain size relative to the corresponding initial function, an exponential number of updates is required. But to reach a certain effective depth, a certain magnitude of k-residual terms is required and thus a certain magnitude of residual functions, and thus exponentially many updates.

Theorem 1.

Under certain conditions, if an MLP has exploding gradients with explosion rate c and intercept b on some dataset, then there exists a constant c′ such that training this MLP with a gradient-based algorithm to have effective depth λ takes at least c′ c^λ updates. (See section E.1 for details.)

Importantly, this lower bound on the number of updates required to reach a certain effective depth is independent of the nominal depth of the network. While the constant depends on some constants that arise in the conditions of the theorem, as long as those constants do not change when depth is increased, neither does the lower bound on the number of updates.

Corollary 1.

In the scenario of theorem 1, if the number of updates to convergence is bounded, so is effective depth.

Here we simply state that if we reach convergence after a certain number of updates, but theorem 1 indicates that more would be required to attain a greater effective depth, then that greater effective depth is unreachable with that algorithm.

4.6 Experiments

To practically validate our theory of limited effective depth, we train our four exploding architectures (batch-ReLU, layer-tanh, batch-tanh and SELU) on CIFAR10. All networks studied have a compositional depth of 51, i.e. there are 51 linear layers. The width of each layer is 100, except for the input, prediction and error layers. Full experimental details can be found in section H.

First, we determined the approximate best step size for SGD for each individual linear layer. We started by pre-training the highest layers of each network with a small uniform step size until the training classification error was below 85%, but at most for 10 epochs. Then, for each linear layer, we trained only that layer for 1 epoch with various step sizes while freezing the other layers. The step size that achieved the lowest training classification error after that epoch was selected. Note that we only considered step sizes that induce relative update sizes of 0.1 or less, because larger updates often cause weight instability. The full algorithm for step size selection and a justification is given in section H.4.1.


Figure 3: Key metrics for exploding architectures trained on CIFAR10. See main text for explanation.

In figure 3A, we show the relative update size induced on each linear layer by what was selected to be the best step size, as well as 1/GSC(0, θ_l) as a dashed line. In section 4.5, we argued that 1/GSC(0, θ_l) is an upper bound for the relative size of a useful update. We find that this bound holds and is conservative except for a small number of outliers. Even though our algorithm for determining the best step size for each layer gives noisy results, there is a clear trend that lower layers require relatively smaller updates, and that this effect is more pronounced if the gradient explodes with a larger rate. Therefore the foundational assumption underlying theorem 1 holds.

We then smoothed these best step size estimates and trained each network for 500 epochs with those smoothed estimates. Periodically, we scaled all step sizes jointly by a fixed factor. In figure 3B, we show the training classification error of each architecture. Final error values are shown in table 2. There is a trend that architectures with less gradient explosion attain a lower final error. In fact, the final error values are ordered according to the value of GSC(0, L) in the initialized state. Note that, of course, all these error values are still much higher than the state of the art on CIFAR10. This is not a shortcoming of our analysis however, as the goal of this section is to study and understand pathological architectures rather than find optimal ones. Those architectures, by definition, attain high errors.

In figure 3C, we show GSC(0, L) as training progresses. During the initial pre-training phase, this value drops significantly but later regains or even exceeds its original value. In figure 3A, the dashed line indicates the inverse of GSC(0, θ_l) for each layer after pre-training. We find that the GSC actually falls below 1 as the gradient passes through the pre-trained layers, but then resumes explosion once it reaches the layers that were not pre-trained. We find this behavior surprising and unexpected. We conclude that nonstandard training procedures can have a significant impact on the GSC but that there is no evidence that when all layers are trained jointly, which is the norm, the GSC either significantly increases or decreases during training.

We then went on to measure the effective depth of each network. We devised a conservative, computationally tractable estimate of the cumulative size of updates that stem from k-residual terms. See section C.2 for details. The effective depth depicted in figure 3D is the largest value of k such that this estimate exceeds a fixed threshold. As expected, none of the architectures reach an effective depth equal to their compositional depth, and there is a trend that architectures that use relatively smaller updates achieve a lower effective depth. It is worth noting that the effective depth increases most sharply at the beginning of training. Once all step sizes have been scaled down several times, effective depth no longer changes significantly while the error, on the other hand, is still going down. This suggests that, somewhat surprisingly, high-order co-adaption of layers takes place towards the beginning of training and that as the step size is reduced, layers are fine-tuned relatively independently of each other.

SELU and especially batch-tanh reach an effective depth close to their compositional depth according to our estimate. In figure 3E, we show the operator norm of the residual weight matrices after training. All architectures except SELU, whose operator norms are close to 1 already after pre-training, show a clear downward trend in the direction away from the error layer. If this trend were to continue for networks that have a much greater compositional depth, then those networks would not achieve an effective depth significantly greater than our 51-linear layer networks.

Veit et al. (2016) argue that a limited effective depth indicates a lack of high-order co-adaptation. We wanted to verify that our networks, especially layer-tanh and batch-ReLU, indeed lack these high-order co-adaptations by using a strategy independent of the concept of effective depth to measure this effect. We used Taylor expansions to do this. Specifically, we replaced the lowest layers of the fully-trained networks by their first-order Taylor expansion around the initial functions. See section F for how this is done. This correspondingly reduces the compositional depth of the network. In figure 3F, we show the training classification error in response to compositional depth reduction. We find that the compositional depth of layer-tanh and batch-ReLU can be reduced enormously without suffering a significant increase in error. In fact, the resulting layer-tanh network of compositional depth 15 greatly outperforms the original batch-tanh and batch-ReLU networks. This confirms that these networks lack high-order co-adaptations. Note that cutting the depth by using the Taylor expansion not only eliminates high-order co-adaptions among layers, but also co-adaptions of groups of 3 or more layers among the bottom layers. Hence, we expect the increase in error induced by removing only high-order co-adaptions to be even lower than what is shown in figure 3F. Unfortunately, this cannot be tractably computed.

Finally, we trained each of the exploding architectures by using only a single step size for each layer that was determined by grid search, instead of custom layer-wise step sizes. As expected, the final error was higher. The results are found in table 2.

Summary

For the first time, we established a direct link between exploding gradients and severe training difficulties for general gradient-based training algorithms. These difficulties arise in MLPs composed of popular layer types, even if those MLPs utilize techniques that are believed to combat exploding gradients by stabilizing forward activations. The gradient scale coefficient not only underpins this analysis, but is largely invariant to the confounders of network scaling (section 2.3), layer width and individual layers (section 3). Therefore we propose that the GSC can standardize research on gradient pathology.

4.7 A note on batch normalization and other sources of noise

We used minibatches of size 1000 to train all architectures except batch-ReLU, for which we conducted full-batch training. When minibatches were used on batch-ReLU, the training classification error stayed above 89% throughout training. (Random guessing achieves a 90% error.) In essence, no learning took place. This is because of the pathological interplay between exploding gradients and the noise inherent in batch normalization. Under batch normalization, the activations at a neuron are normalized by their mean and standard deviation. These values are estimated using the current batch. Hence, if a minibatch has size B, we expect the noise induced by this process to have relative size around 1/√B. But we know that according to proposition 4, under the local linear approximation, this noise leads to a change in the error layer of relative size around GSC(0, l)/√B, where l is the first batch normalization layer. Hence, if the GSC between the error layer and the first batch normalization layer is larger than √B, learning should be seriously impaired. For the batch-ReLU architecture, this condition was satisfied and consequently, the architecture was untrainable using minibatches. Ironically, the gradient explosion that renders the noise pathological was introduced in the first place by adding batch normalization layers. Note that techniques exist to reduce the dependence of batch normalization on the current minibatch, such as using running averages (Ioffe, 2017). Other prominent techniques that induce noise and thus can cause problems in conjunction with large gradients are dropout (Srivastava et al., 2014), stochastic nonlinearities (e.g. Gulcehre et al. (2016)) and network quantization (e.g. Wu et al. (2018)).
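A quick numerical check (ours; the distribution, batch sizes and NumPy usage are our choices) that the minibatch estimate of a neuron's mean carries relative noise of roughly 1/√B, which the GSC then amplifies on its way to the error layer.

```python
import numpy as np

rng = np.random.default_rng(5)
population = rng.normal(loc=1.0, scale=2.0, size=1_000_000)   # activations at one neuron

for B in (10, 100, 1000):
    batches = rng.choice(population, size=(5000, B))
    batch_means = batches.mean(axis=1)
    # Relative noise of the batch-mean estimate, compared to 1/sqrt(B).
    rel_noise = batch_means.std() / population.std()
    print(B, round(rel_noise, 4), round(1 / np.sqrt(B), 4))

# With a GSC of G between the error layer and the first batch normalization
# layer, this perturbation translates into a relative change in the error of
# roughly G / sqrt(B).
```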

5 The origins of exploding gradients - quadratic vs geometric means

Why do exploding gradients occur? As mentioned in section 3, gradients explode with rate c as long as we have (i) ||J_l^k||_{qm} ≈ ||J_m^k||_{qm} ||J_l^m||_{qm} and (ii) GSC(l, l+1) ≥ c > 1, for all k, m and l. Our results from figure 1A suggest that (i) holds in practical networks. Indeed, we can justify (i) theoretically by viewing the parameters of linear layers as random variables.

Theorem 2.

Under certain conditions, for any neural network composed of layer functions that are parametrized by randomly initialized θ_l, the qm norm of the product of layer-wise Jacobians approximately decomposes, i.e. ||J_l^k||_{qm} ≈ \prod_{m=k}^{l-1} ||J_{m+1}^m||_{qm}. (See section E.2 for details.)

Let’s turn to (ii). The common perception of the exploding gradient problem is that it lies on a continuum with the vanishing gradient problem and that all we need to do to avoid both is to hit a sweet spot by avoiding design mistakes. According to this viewpoint, building a network with GSC(l, l+1) ≈ 1 should not be difficult.

Definition 4.

Assume X is a random variable with a real-valued probability density function p on \mathbb{R}^d. Then we define its ‘exponential entropy’ as \exp\big(-\int p(x)\log p(x)\,dx\big).

In comparison, entropy is defined as -\int p(x)\log p(x)\,dx.

Theorem 3.

Let X be a random variable with a real-valued probability density function on \mathbb{R}^n and let F be an endomorphism on \mathbb{R}^n. Let S be the standard deviation of the absolute singular values of the Jacobian of F and let M be the mean of the absolute singular values. Assume the length of F(X) is fixed with probability 1. Then, if F preserves the entropy of X, the expected layer-wise GSC of F exceeds 1, with the size of the excess governed by how spread out the absolute singular values are, as captured by S and M.

Further, under an additional independence assumption, the analogous statement holds with entropy replaced by exponential entropy, which yields the tighter bound.

(See section E.3 for details.)

Figure 4: Illustration of theorem 3. See main text for details.
Corollary 2.

Let F and X fulfill the same conditions as above, and let the entropy (respectively, the exponential entropy) of X be fixed. Then the lower bound on the expected layer-wise GSC from theorem 3 holds, and under the further assumption from above, so does the tighter bound in terms of exponential entropy. (We present results in terms of both entropy and exponential entropy because the former requires fewer assumptions while the latter provides the tighter bound.)

The assumption of layer functions having a fixed output length is fulfilled approximately in sufficiently wide networks due to the law of large numbers.

Theorem 3 suggests that layer-wise GSCs larger than 1 occur even in networks where forward activations are stable and cannot be eliminated via simple design choices, unless (I) information loss is present or (II) Jacobians are constrained. It turns out that these are exactly the strategies employed by the popular ReLU and ResNet architectures to avoid exploding gradients, as we will show in the next two sections respectively. Both strategies incur drawbacks in practice. While we have not observed a vanishing gradient in any architecture we studied in this paper, we conjecture that an architecture built on popular design principles that exhibits them would suffer those drawbacks to an even larger degree.

The mechanism underlying theorem 3 is that information propagation is governed by the geometric mean of the absolute singular values of the Jacobian, whereas the GSC is governed by the quadratic mean of the absolute singular values. We illustrate this in figure 4. Say a random input variable with domain D is mapped by a nonlinear function onto a domain D′ and say the small red patch is mapped onto the small blue patch. The difference in entropy between the patches is equal to the logarithm of the absolute determinant of the Jacobian at the red patch. This quantity is related to the geometric mean of the absolute singular values. Conversely, the GSC at the red patch is based on the qm norm of the Jacobian, which is itself the quadratic mean of the absolute singular values. As the quadratic mean is larger than the geometric mean, with the size of the difference governed by how “spread out” the absolute singular values are, we obtain the result of the theorem.
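The gap between the two means is easy to see numerically (a sketch of ours, assuming NumPy; the matrix and its scaling are arbitrary): for a random square Jacobian scaled so that it neither grows nor shrinks a typical vector, the quadratic mean of the singular values sits near 1 while the geometric mean is markedly smaller.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
J = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))   # a "stable" random Jacobian
s = np.linalg.svd(J, compute_uv=False)

geometric_mean = np.exp(np.mean(np.log(s)))   # governs volume / entropy propagation
quadratic_mean = np.sqrt(np.mean(s ** 2))     # governs the qm norm and hence the GSC

print(geometric_mean, quadratic_mean)
# The quadratic mean exceeds the geometric mean whenever the singular values
# are spread out; for this Gaussian Jacobian the gap is large (~0.6 vs ~1.0),
# so preserving both scale and information forces a layer-wise GSC above 1.
```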

In current practice, avoiding exploding gradients does not seem to be a matter of simply avoiding design mistakes, but involves tradeoffs with other potentially harmful effects. As our theoretical analysis in this section does rely on several conditions such as a constant width and the presence of a probability density function, we leave open the possibility of designing novel architectures that avoid exploding gradients by exploiting these conditions.

6 Exploding gradient tradeoffs - the collapsing domain problem

In the previous section, we showed how gradients explode if the entropy or exponential entropy is preserved relative to the scale of forward activations. This suggests that we can avoid exploding gradients via a sufficiently large entropy reduction. This corresponds to a contraction of the latent representations of different datapoints, a collapsing domain.

Consider a contraction of the domain around a single point. If we shrink the co-domain of some layer function by a factor α, we reduce the eigenvalues of the Jacobian and hence its qm norm by a factor α. If we also ensure that the length of the output stays the same, the GSC is also reduced by a factor α. Similarly, inflating the co-domain would cause the qm norm to increase. A contraction around a single point would cause the activation values at each individual neuron to be biased. We call this domain bias.

This is precisely what we find. Returning to figure 1, we now turn our attention to graphs B through F. In B, we plot the standard deviation of the activation values in the layers before each nonlinearity layer (‘pre-activations’). Each standard deviation is taken at an individual neuron and across datapoints. The standard deviations of neurons in the same layer are combined by taking their quadratic mean. In C, we plot the quadratic expectation of the pre-activations. The two quantities diverge significantly for 2 architectures: ReLU and layer-ReLU. This divergence implies that activation values become more and more clustered away from zero with increasing depth, which implies domain bias. In D, we plot the fraction of the signal explained by the bias (bias squared divided by quadratic expectation squared). This value increases significantly only for ReLU and layer-ReLU. Hence, we term those two architectures ‘biased architectures’.

But why can domain bias be a problem? There are at least two reasons.

Domain bias can cause pseudo-linearity
Figure 5: The phenomenon of pseudo-linearity in ReLU and tanh nonlinearities. The nonlinearity function is shown in blue; the nonlinearity applied to 50 individual pre-activations drawn from a Gaussian with a given mean and standard deviation is shown as red dots. The closest linear fit to the 50 post-activations is shown as a red line, and it approximates these post-activations very closely. A: ReLU with mostly negative pre-activations. B: ReLU with mostly positive pre-activations. C: tanh with pre-activations concentrated in a narrow range.

If the pre-activations that are fed into a nonlinearity are sufficiently similar, the nonlinearity can be well-approximated by a linear function. In an architecture employing ReLU nonlinearities, if either all or most pre-activations are positive or all or most pre-activations are negative, the nonlinearity can be well-approximated by a linear function. If all or most pre-activations are negative, ReLU can be approximated by the zero function (figure 5A). If all or most pre-activations are positive, ReLU can be approximated by the identity function (figure 5B). But if nonlinearity layers become well-approximated by linear layers, the entire network becomes equivalent to a linear network. We say the network becomes ‘pseudo-linear’. Of course, linear networks of any depth have the representational capacity of a linear network of depth 1 and are unable to model nonlinear functions. Hence, a network that is pseudo-linear beyond a certain compositional depth approximately has the representational capacity of a network of only that compositional depth.
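A small numerical illustration of pseudo-linearity (ours; the means, standard deviations and the exact fitting metric are our choices, loosely following the description of figure 1F): when pre-activations are strongly biased to one side, the best linear fit explains essentially all of the ReLU's output.

```python
import numpy as np

rng = np.random.default_rng(7)

def unexplained_fraction(pre_acts):
    """Fraction of the post-activation signal that the best linear fit to the
    pre-activations cannot explain (0 means perfectly pseudo-linear)."""
    post = np.maximum(pre_acts, 0)
    A = np.stack([pre_acts, np.ones_like(pre_acts)], axis=1)
    coef, *_ = np.linalg.lstsq(A, post, rcond=None)
    fit = A @ coef
    return 1.0 - np.mean(fit ** 2) / np.mean(post ** 2)

unbiased = rng.normal(loc=0.0, scale=1.0, size=10000)    # half positive, half negative
biased = rng.normal(loc=3.0, scale=1.0, size=10000)      # almost all positive
print(unexplained_fraction(unbiased), unexplained_fraction(biased))
# The biased pre-activations are fit almost perfectly by a linear function,
# i.e. the ReLU has become pseudo-linear.
```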

In figure 1E, we plot the proportion of the pre-activations at each neuron that are positive or negative, whichever is smaller for that neuron. Values are averaged over each layer. We call this metric ‘sign diversity’. For ReLU and layer-ReLU, sign diversity decreases rapidly. Because of the properties of ReLU discussed above, this implies pseudo-linearity. Finally, in figure 1F we plot the error incurred by replacing each neuron in a nonlinearity layer by its respective best fit linear function, measured as one minus the ratio of the signal power (squared quadratic expectation) of the approximated post-activations over the signal power of the true post-activation. We find that in the two biased architectures, pseudo-linearity takes hold substantially after layer 10 and completely after layer 25.

Domain bias can mask exploding gradients

In theorem 1, we used the fact that the output of the error layer of the network was positive to bound the size of a useful gradient-based update. In other words, we used the fact that the domain of the error layer is bounded. However, domain bias causes not just a reduction of the size of the domain of the error layer, but of all intermediate layers. This should ultimately have the same effect on the largest useful update size as exploding gradients, that is to reduce them and thus cause a low effective depth.

Figure 6: Illustration of a 50-layer ReLU network as a function of the parameter of the first linear layer, using the same methodology as figure 2. Figure B is a contrast-heightened version of figure A. See section H for details.

We illustrate this effect in figure 6. In figure 6A, we depict the output of a 50-layer ReLU network over a 2-dimensional subspace of the parameter space of the first linear layer, using the same methodology as in figure 2. Domain bias causes the output to be restricted to a narrow range. While at first glance the function looks simple, heightening the contrast (6B) reveals that there are oscillations of small amplitude that were not present in exploding architectures. Because any color shift is confined to a small amplitude, we necessarily obtain oscillations. Those oscillations then lead to local gradient information being uninformative, just as with exploding architectures.


Figure 7: Key metrics for architectures that avoid exploding gradients trained on CIFAR10. The top left graph shows the estimated optimal relative update size in each layer according to the algorithm described in section H.4.1. Remaining graphs show results obtained from training with a single step size as this achieved lower error than training with layer-wise step sizes (see table 2). The top two rows are equivalent to graphs in figure 3. The bottom row shows pre-activation standard deviation and pre-activation sign diversity (see section H.2 for definition) of the highest nonlinearity layer as training progresses.

In table 2, we show the final error values achieved by training ReLU and layer-ReLU on CIFAR10. The errors are substantially higher than those achieved by the exploding architectures, except for batch-ReLU. Also, training with layer-wise step sizes did not help compared to training with a single step size. In figure 7A, we show the estimated best relative update size for each layer. This time, there is no downward trend towards lower layers, which is likely why training with a single step size is “sufficient”. As conjectured, the difference between the bound and the empirical estimate is much larger for the biased architectures than it is for the exploding architectures (see figure 3A), indicating that both suffer from reduced useful update sizes. In figure 7D, we find the effective depth reached by ReLU and layer-ReLU is significantly lower than the compositional depth of the network and is comparable to that of architectures with exploding gradients (see figure 3D).

In figure 7G and H, we plot the pre-activation standard deviation and sign diversity at the highest nonlinearity layer throughout training. Interestingly, sign diversity increases significantly early in training. The networks become less linear through training.

Summary

In neural network design, there is an inherent tension between avoiding exploding gradients and collapsing domains. Avoiding one effect can bring about or exacerbate the other. Both effects are capable of severely hampering training. This tension is brought about by the discrepancy of the geometric and quadratic mean of the singular values of layer-wise Jacobians and is a foundational reason for the difficulty in constructing very deep trainable networks.

Many open questions remain. Is it possible to measure or at least approximate the entropy of latent representations? What about latent representations that have varying dimensionality or lack a probability density function? In what ways other than domain bias can collapsing domains manifest and how would those manifestations hamper training? An example of such a manifestation would be a clustering of latent representations around a small number of principal components as observed in Gaussian initialized linear networks (Saxe et al., 2014; Pennington and Worah, 2017).

7 Exploding gradient solutions - ResNet and the orthogonal initial state

ResNet and related architectures that utilize skip connections have been very successful recently. One reason for this is that they can be successfully trained to much greater depths than corresponding vanilla networks. In this section, we show how skip connections are able to greatly reduce the GSC and thus largely circumvent the exploding gradient problem. Please refer back to section 4.1 for the notation and terminology we employ for ResNets.

Definition 5.

We say a function b is ‘k-diluted’ with respect to a random vector x, a matrix S and a function r if b(x) = Sx + r(x) and the quadratic expectation of ||Sx||_2 is k times the quadratic expectation of ||r(x)||_2.

k-dilution expresses the idea that the kinds of functions that a block can represent are of a certain form if the skip block is restricted to matrix multiplication. (Note that the identity function can be viewed as matrix multiplication with the identity matrix.) The larger the value of k, the more b is “diluted” by a linear function, bringing b itself closer and closer to a linear function. x represents the incoming forward activations to the block.

Theorem 4.

If a block $b$ would cause the GSC to grow with expected rate $r$, $k$-diluting $b$ with an uncorrelated linear transformation reduces this rate to approximately $\frac{r}{k^2+1}$. (See section E.4 for details.)

This reveals the reason why ResNet circumvents the exploding gradient problem. $k$-diluting does not just reduce the growth of the GSC by a factor of $k$, but by a factor of approximately $k^2$. Therefore, what appears to be a relatively mild reduction in representational capacity achieves, surprisingly, a relatively large amount of gradient reduction, and therefore ResNet can be trained successfully to “unreasonably” great depths for general architectures.

Figure 8: Illustration of theorem 4. See main text for details.

The mechanism underlying theorem 4 is illustrated in figure 8. In the upper half of this figure, we show a block function $b$ that causes the GSC to grow with rate $r$. $g$ denotes the incoming error gradient. In the bottom half, we add an identity skip connection to $b$. Assuming the outputs of the skip connection and the residual block, as well as the corresponding backpropagated gradients, are uncorrelated and thus approximately orthogonal, the Pythagorean theorem ensures that the growth of the GSC is severely reduced.
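The following toy experiment (again our illustration, not the experimental code) makes this mechanism tangible with a crude finite-difference proxy for the GSC: a stack of normalization-nonlinearity-linear blocks is evaluated with and without identity skip connections, and the relative output change per relative input change is compared. With these settings, the vanilla stack should yield a far larger proxy value than the skipped version.

```python
import numpy as np

rng = np.random.default_rng(1)
d, depth = 100, 30
Ws = [rng.standard_normal((d, d)) * (2.5 / np.sqrt(d)) for _ in range(depth)]

def normalize(h):
    # Crude stand-in for a normalization layer: fix the length of h.
    return h * (np.sqrt(d) / np.linalg.norm(h))

def net(x, skip):
    for W in Ws:
        block = W @ np.tanh(normalize(x))   # normalization -> nonlinearity -> linear
        x = x + block if skip else block    # with or without an identity skip connection
    return x

def gsc_proxy(f, n_trials=50):
    # Relative output change per relative input change: a rough stand-in for GSC(L, 0).
    vals = []
    for _ in range(n_trials):
        x = rng.standard_normal(d)
        eps = 1e-5 * rng.standard_normal(d)
        y = f(x)
        rel_out = np.linalg.norm(f(x + eps) - y) / np.linalg.norm(y)
        rel_in = np.linalg.norm(eps) / np.linalg.norm(x)
        vals.append(rel_out / rel_in)
    return float(np.mean(vals))

print("vanilla GSC proxy:", gsc_proxy(lambda x: net(x, skip=False)))
print("ResNet  GSC proxy:", gsc_proxy(lambda x: net(x, skip=True)))
```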


Figure 9: Key metrics for ResNet architectures in their randomly initialized state evaluated on Gaussian noise. In A and C, only values obtained between blocks are plotted. The x axis shows depth in terms of the number of linear layers counted from the input. The x axis in B counts nonlinearity layers, starting from the input. In C, batch-tanh shadows layer-tanh and SELU shadows ReLU and layer-ReLU.

To validate our theory, we repeated the experiments in figure 1 with 5 ResNet architectures: layer-ReLU, batch-ReLU, layer-tanh, batch-tanh and layer-SELU. Each residual block is bypassed by an identity skip connection and composed of 2 sub-blocks of 3 layers each: first a normalization layer, then a nonlinearity layer, and then a linear layer, similar to He et al. (2016a); Zagoruyko and Komodakis (2016). For further details, see section H. Comparing figure 9A to figure 1A, we find the gradient growth is indeed much lower for ResNet compared to corresponding vanilla networks, with much of it taking place in the lower layers. In figure 9B we find that the growth of domain bias for layer-ReLU, as measured by pre-activation sign diversity, is also significantly slowed.

We then went on to check whether the gradient reduction experienced is in line with theorem 4. We measured the $k$-dilution level induced by the skip block as well as the growth rate of the GSC at each individual block. We then replaced each growth rate with its dilution-corrected value according to theorem 4, obtaining new GSC curves which are shown in figure 9D. Indeed, the GSC of the exploding architectures now again grows almost linearly in log space, with the exception of batch-ReLU in the lowest few layers. The explosion rates closely track those in figure 1A, being only slightly higher. This confirms that the estimate of the magnitude of gradient reduction from theorem 4 is accurate in practical architectures. The $k$-dilution levels are shown in figure 9C. They grow approximately with the square root of the number of preceding blocks, as the skip block is the accumulation of approximately uncorrelated residual blocks of equal size.

We then repeated the CIFAR10 experiments shown in figure 3 with our 5 ResNet architectures. The results are shown in figure 12. As expected, in general, ResNet enables higher relative update sizes, achieves lower error and higher effective depth, and is less “robust” to Taylor approximation than corresponding vanilla networks. The only exception to this trend is the layer-SELU ResNet when compared to the SELU vanilla network, which already has a relatively slowly exploding gradient to begin with. Note that the severe reduction of the GSC persists throughout training (figure 12C). Also see table 2 to compare final error values. Note that in order to make the effective depth results in figure 12D comparable to those in figure 3D, we applied the residual trick to ResNet. We let the initial function encompass not just the skip block, but also the initial residual block function. In fact, we broke down each layer in the residual block into its initial and residual function. See section C.2 for more details. Note that our effective depth values for ResNet are much higher than those of Veit et al. (2016). This is because we use a much more conservative estimate of this intractable quantity for both ResNet and vanilla networks.

Gradient reduction is achieved not just by identity skip connections but, as theorem 4 suggests, also by skip blocks that multiply the incoming value with e.g. a Gaussian random matrix. Using Gaussian skip connections, the amount of gradient reduction achieved in practice is not quite as great (table 1).

Veit et al. (2016) argues that deep ResNets behave like an ensemble of relatively shallow networks. We argue that comparable vanilla networks often behave like ensembles of even shallower networks. Jastrzebski et al. (2018) argues that deep ResNets are robust to lesioning. We argue that comparable vanilla networks are often even more robust to depth reduction when considering the first order Taylor expansion.

7.1 The limits of dilution

$k$-dilution has its limits. Any $k$-diluted function with large $k$ is close to a linear function. Hence, we can view $k$-dilution as another form of pseudo-linearity that can damage representational capacity. It also turns out that, at least in the randomly initialized state, dilution only disappears slowly as diluted functions are composed. If the diluting linear functions are identity functions, this corresponds to feature refinement as postulated by Jastrzebski et al. (2018).

Theorem 5.

Under certain conditions, the composition of $N$ randomly initialized blocks that are $k_1, \ldots, k_N$-diluted in expectation respectively is approximately $\big(\sum_{n=1}^{N} k_n^{-2}\big)^{-\frac{1}{2}}$-diluted in expectation. (See section E.5 for details.)

More simply, assume all the $k_n$ are equal to some $k$. Ignoring higher-order terms, the composition is then $\frac{k}{\sqrt{N}}$-diluted. Under the conditions of theorem 5, the flipside of an approximately $k^2$-fold reduction in gradient via dilution is thus the requirement of approximately $k^2$ blocks to eliminate that dilution. This indicates that the overall amount of gradient reduction achievable through dilution without incurring catastrophic pseudo-linearity is limited.
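A quick numerical check of the $\frac{k}{\sqrt{N}}$ heuristic (our sketch; dilution is measured here simply as the ratio of the norms of the linear and accumulated nonlinear parts of the composite function, and a zero-mean tanh residual is assumed so that contributions from different blocks are roughly uncorrelated):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, N = 100, 5.0, 16
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(N)]

def residual(W, x):
    # Toy zero-mean nonlinearity, rescaled so its norm matches the identity skip path.
    h = np.tanh(W @ x)
    return h * (np.linalg.norm(x) / np.linalg.norm(h))

x0 = rng.standard_normal(d)
x = x0.copy()
for W in Ws:
    x = x + residual(W, x) / k     # each block is roughly k-diluted w.r.t. the identity skip

nonlinear_part = x - x0            # everything the identity skips alone would not produce
print("measured dilution:", np.linalg.norm(x0) / np.linalg.norm(nonlinear_part))
print("k / sqrt(N)      :", k / np.sqrt(N))   # the two values should roughly agree
```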

7.2 Choosing dilution levels

The power of our theory lies in exposing the GSC-reducing effect of skip connections for general neural network architectures. As far as we know, all comparable previous works (e.g. Yang and Schoenholz (2017); Balduzzi et al. (2017)) demonstrated similar effects only for specific architectures. Our argument is not that certain ResNets achieve a certain level of GSC reduction, but that ResNet users have the power to choose the level of GSC reduction by controlling the amount of dilution. While the level of dilution increases as we go deeper in the style of ResNet architecture we used for experiments in this section, this need not be so.

The skip block and residual block can be scaled with constants to achieve arbitrary, desired levels of dilution (Szegedy et al., 2016; Chang et al., 2018). Alternatively, instead of putting all normalization layers in the residual blocks, we could insert them between blocks. This would keep the dilution level constant.
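As a sketch of this knob (the constants c_skip and c_res below are hypothetical scaling parameters, not values used in our experiments), scaling the two paths of a block directly sets the ratio of their expected lengths, i.e. the dilution level:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 100
W = rng.standard_normal((d, d)) * np.sqrt(2.0 / d)   # keeps E||relu(Wx)|| close to E||x||
x = rng.standard_normal(d)

def scaled_block(x, c_skip, c_res):
    skip_path = c_skip * x
    res_path = c_res * np.maximum(W @ x, 0.0)
    return skip_path + res_path, np.linalg.norm(skip_path) / np.linalg.norm(res_path)

for c_skip, c_res in [(1.0, 1.0), (1.0, 0.25), (4.0, 1.0)]:
    _, dilution = scaled_block(x, c_skip, c_res)
    print((c_skip, c_res), "dilution ~", round(dilution, 2))   # roughly c_skip / c_res
```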

7.3 On the relationship of dilution, linear approximation error and the standard deviation of absolute Jacobian eigenvalues

The insights presented in this section cast a new light on earlier results. For example, the concept of linear approximation error as shown in figure 1F is similar to the concept of dilution. Therefore, we might expect that a low linear approximation error would be associated with low gradient growth. This is precisely what we find. The explosion rates from figure 1A display a similar magnitude to the linear approximation errors in figure 1F. This also explains how the tanh architecture avoids exploding gradients - via extreme pseudo-linearity, as depicted in figure 5C. tanh also does not perform well (table 2 / figure 7).

Conversely, we can interpret dilution in terms of the linear approximation error. If we view the skip block as the signal and the residual block as the “noise”, then increasing the dilution corresponds to an increase in the signal relative to the noise. Specifically, increasing dilution has a squared effect on the signal-to-noise ratio, which suggests a squared increase in the number of blocks is needed to bring the signal-to-noise ratio back to a given level. This leads us back to theorem 5.

In section 5, we pointed to a reduction in the standard deviation of the absolute singular values of the layer-wise Jacobian as a strategy for reducing gradient growth. Dilution with the identity or an orthogonal matrix, and to a lesser extent dilution with a Gaussian random matrix, achieves exactly that. Furthermore, we note that in theorem 3, the growth of the GSC vanishes as the standard deviation of the absolute singular values goes to zero. Of course, $k$-dilution with the identity or an orthogonal matrix leads to an approximately $k$-fold reduction in that standard deviation. So theorem 3 suggests that a $k$-fold reduction in the standard deviation may lead to a $k^2$-fold reduction in gradient growth, which leads us back to theorem 4.
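A small numerical illustration of this point (our sketch): treating a Gaussian matrix as a stand-in for a residual block's Jacobian, dilution with the identity visibly shrinks the spread of the singular values once they are rescaled to unit quadratic mean.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 200, 3.0
J = rng.standard_normal((d, d)) / np.sqrt(d)     # stand-in for a residual block's Jacobian

def normalized_sv_std(A):
    s = np.linalg.svd(A, compute_uv=False)
    s = s / np.sqrt(np.mean(s ** 2))             # rescale to unit quadratic mean
    return s.std()

print("residual Jacobian alone  :", normalized_sv_std(J))
print("k-diluted, i.e. I + J / k:", normalized_sv_std(np.eye(d) + J / k))
```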

7.4 The orthogonal initial state

Applying the residual trick to ResNet reveals several insights. The difference between ResNet and vanilla networks in terms of skip connections is somewhat superficial, because both ResNet and vanilla networks can be expressed as residual networks in the framework of equation 3. Also, both ResNet and vanilla networks have nonlinear initial functions, because the initial functions are nonzero and nonlinear in the initialized state. However, there is one key difference. The initial functions of ResNet are closer to a linear transformation, and indeed closer to an orthogonal transformation, because they are composed of a nonlinear function that is significantly diluted by what is generally chosen to be an orthogonal transformation. Therefore, ResNet, while being conceptually more complex, is mathematically simpler.

We have shown how ResNets achieve a reduced gradient via $k$-diluting the initial function. And just as with effective depth, the residual trick allows us to generalize this notion to arbitrary networks.

Definition 5.

We say a residual network has an ‘orthogonal initial state’ (OIS) if each initial function is a multiplication with an orthogonal matrix or a slice / multiple thereof.

Any network that is trained from an (approximate) OIS can benefit from reduced gradients via dilution to the extent to which initial and residual function are uncorrelated. ResNet is a style of architecture that achieves this, but it is far from being the only one. Balduzzi et al. (2017) introduced the ‘looks-linear initialization’ (LLI) for ReLU networks, where initial weights are set in a clever way to bypass the nonlinear effect of the ReLU layer. We detail this initialization scheme in section G. A plain ReLU network with weights set by LLI not only achieves an approximate OIS, but also outperformed ResNet in the experiments of Balduzzi et al. (2017). In table 2, we show that applying LLI to our ReLU architecture causes it to outperform ResNet in our CIFAR10 experiments as well. In figure 13C, we find that LLI indeed reduces the gradient growth of batch-ReLU drastically, not just in the initialized state but throughout training, even as the residual functions grow beyond the size achieved under Gaussian initialization (compare figure 13E to figures 3E and 7E).
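The following sketch shows the mirrored weight pattern that underlies the looks-linear idea, in the spirit of Balduzzi et al. (2017); the exact scheme we use, including the treatment of the input and output layers, is given in section G, so this should be read as an illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 50
W = np.linalg.qr(rng.standard_normal((d, d)))[0]   # orthogonal base matrix

def lli_layer(h):
    # h holds a mirrored pair [a; -a]. Applying the block matrix [[W, -W], [-W, W]]
    # to [relu(a); relu(-a)] reproduces [W a; -W a], so the ReLU has no effect
    # in the initialized state and the layer acts as an orthogonal linear map.
    big_W = np.block([[W, -W], [-W, W]])
    return big_W @ np.maximum(h, 0.0)

x = rng.standard_normal(d)
h = np.concatenate([x, -x])                        # mirrored input encoding
for _ in range(10):
    h = lli_layer(h)
print(np.allclose(h[:d], np.linalg.matrix_power(W, 10) @ x))   # True: linear at initialization
```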

DiracNet (Zagoruyko and Komodakis, 2017) also achieves an approximate OIS. A simpler but much less powerful strategy is to initialize weight matrices as orthogonal matrices instead of Gaussian matrices. This reduces the gradient growth in the initialized state somewhat (table 1).

7.5 The power of initial vs residual dilution

An orthogonal initial state is not enough to attain high performance. Trivially, an orthogonal linear network without nonlinearity or normalization layers achieves an orthogonal initial state, but does not attain high performance. Clearly, we need to combine orthogonal initial functions with sufficiently non-linear residual functions.

The ensemble view of deep networks detailed in section 4.2 reveals the power of this approach. With high probability, the input to an ensemble member must pass through a significant number of initial functions to reach the prediction layer. Therefore, having non-orthogonal initial functions is akin to taking a shallow network and adding additional, untrainable non-orthogonal layers to it. This has obvious downsides such as a collapsing domain and / or exploding gradient, and an increasingly unfavorable eigenspectrum of the Jacobian (Saxe et al., 2014). One would ordinarily not make the choice to insert such untrainable layers. While there has been some success with convolutional networks where lower layers are not trained (e.g. Saxe et al. (2011); He et al. (2016c)), it is not clear whether such networks are capable of outperforming other networks where such layers are trained.

Conversely, using non-linear residual functions means that the input to an ensemble member passes through a significant number of trainable, composed, nonlinear residual functions to reach the prediction layer. This is precisely what ensures the representational capacity.

While tools that dilute the initial function such as skip connections or LLI do not resolve the tension between exploding gradients and collapsing domains, they reduce the pathology by specifically avoiding untrainable, and thus potentially unnecessary non-orthogonality contained in the initial functions. To further validate this idea, we compared these strategies against another that dilutes both initial and residual functions - leaky ReLU.

Leaky ReLU (Maas et al., 2013) is a generalization of ReLU. Instead of assigning a zero value to all negative pre-activations, leaky ReLU multiplies them with the ‘leakage parameter’ $\alpha$. Maas et al. (2013) set $\alpha$ to 0.01, but this is not a strict requirement. In fact, when $\alpha = 1$, leaky ReLU becomes the identity function. If $\alpha = 0$, we recover ReLU. Therefore, by varying $\alpha$ we can interpolate between the linear identity function and the nonlinear ReLU function. The closer $\alpha$ is to 1, the more ReLU is diluted. When $\alpha = 1$, the leaky ReLU network achieves an initial linear state. In contrast to ResNet, the dilution of the ReLU layer affects the signal that passes through the residual weight matrices.
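A minimal sketch of this interpolation (ours, using the convention that $\alpha = 0$ recovers ReLU and $\alpha = 1$ the identity):

```python
import numpy as np

def leaky_relu(x, alpha):
    # alpha = 0 -> ReLU, alpha = 1 -> identity; values in between dilute the nonlinearity.
    return np.where(x > 0, x, alpha * x)

x = np.linspace(-2.0, 2.0, 5)
for alpha in [0.0, 0.01, 0.5, 1.0]:
    print(alpha, leaky_relu(x, alpha))
```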

We repeated our CIFAR10 experiments with the leaky ReLU and batch-leaky ReLU architectures. The latter is comparable to batch-ReLU ResNet, the former to LLI ReLU. To make the comparison to LLI ReLU even more faithful, we initialized the leaky ReLU network with orthogonal weight matrices. Therefore, when $\alpha = 1$, the leaky ReLU architecture achieves an OIS. The results are shown in figure 10 and table 2. While varying the leakage parameter can have a positive effect on performance, as expected, initial-only dilution schemes perform much better.

Figure 10: Comparing different dilution styles via their performance on CIFAR10.

The big question is now: What is the purpose of not training a network from an orthogonal initial state? We are not aware of such a purpose. Since networks with orthogonal initial functions are mathematically simpler than other networks, we argue they should be the default choice. Using non-orthogonality in the initial function, we believe, is what requires explicit justification.

Balduzzi et al. (2017) asks in the title: If ResNet is the answer, then what is the question? We argue that a better question would be: Is there a question to which vanilla networks are the answer?

8 Related work

In this paper, we have discussed exploding gradients and collapsing domains. In this section, we review related metrics and concepts from literature.

Our work bears similarity to a recent line of research studying deep networks using mean field theory (Poole et al., 2016; Schoenholz et al., 2017; Yang and Schoenholz, 2017, 2018). The authors study infinitely wide and deep networks in their randomly initialized state. They identify two distinct regimes, order and chaos, based on whether the correlation between two forward activation vectors corresponding to two different datapoints converges exponentially to one (‘order’), exponentially to a value less than one (‘chaos’) or sub-exponentially (‘edge of chaos’), as the vectors are propagated towards infinite depth. They show that for MLPs where the forward activation vector length converges, order corresponds to gradient vanishing according to the metric of e.g. gradient vector length. If the network is also a tanh MLP, chaos corresponds to gradient explosion according to the same metrics. They show how to use mean field theory as a powerful and convenient tool for the static analysis of network architectures and obtain a range of interesting results. Our work differs from and extends this line of work in several ways, as we discuss in detail in section A.1.

Recently, Haber et al. (2017); Haber and Ruthotto (2017); Chang et al. (2017, 2018) introduced the concept of stability according to dynamical systems theory for ResNet architectures. A central claim is that in architectures that achieve such stability, both forward activations and gradients (and hence the GSC) are bounded as depth goes to infinity. These papers derive a range of valuable strategies such as deepening a ResNet gradually by duplicating residual blocks and achieving effective regularization by tying weights in consecutive blocks. In our work, we showed how dilution can suppress gradient growth drastically (theorem 4) and how dilution can disappear very slowly with increasing depth (theorem 5). We are not convinced that the strategies these papers introduce offer significant additional benefit over general dilution in terms of reducing gradient growth. We provide experimental results and further discussion in section A.2.

We build on the work of Balduzzi et al. (2017), who introduced the concept of gradient shattering. This states that in deep networks, gradients with respect to nearby points become more and more uncorrelated with depth. This is very similar to saying that the gradient is only informative in a smaller and smaller region around the point at which it is taken. This is precisely what happens when gradients explode and also, as we argue in section 6, under domain bias. Therefore, the exploding gradient problem and domain bias problem can be viewed as a further specification of the shattering gradient problem rather than as a counter-theory or independent phenomenon.

We extend the work of Balduzzi et al. (2017) in several ways. First, they claim that the exploding gradient problem “has been largely overcome”. We show that this is not the case, especially in the context of very deep batch-ReLU MLPs, which Balduzzi et al. (2017) investigate. Second, by using effective depth we make a rigorous argument as to why exploding gradients cause training difficulty. While Balduzzi et al. (2017) point out that shattering gradients interfere with theoretical guarantees that exist for specific optimization algorithms, they do not provide a general argument as to why shattering gradients are in fact a problem. Third, our analysis extends beyond ReLU networks.

We also build on the work of Raghu et al. (2017). They showed that both trajectories and small perturbations, when propagated forward, can increase exponentially in size. However, they do not distinguish two important cases: (i) an explosion that is simply due to an increase in the scale of forward activations and (ii) an explosion that is due to an increase in the gradient relative to forward activations. We are careful to make this distinction and focus only on case (ii). Since this is arguably the more interesting case, we believe the insights generated in our paper are more robust.

Saxe et al. (2014) and Pennington et al. (2017)

investigated another important pathology of very deep networks: the divergence of singular values in multi-layer Jacobians. As layer-wise Jacobians are multiplied, the variances of their singular values compound. This leads to the direction of the gradient being determined by the dominant eigenvectors of the multi-layer Jacobian rather than the label, which slows down training considerably.

In their seminal paper, Ioffe and Szegedy (2015) motivated batch normalization with the argument that changes to the distribution of latent representations, which they term ‘covariate shift’, are pathological and need to be combated. This argument was then picked up by e.g. Salimans and Kingma (2016) and Chunjie et al. (2017) to motivate similar normalization schemes. We are not aware of any rigorous definition of the ‘covariate shift’ concept nor do we understand why it is undesirable. After all, isn’t the very point of training deep networks to have each layer change the function it computes, to which other layers co-adapt, to which then other layers co-adapt and so on? Having each layer fine-tune its weights in response to shifts in other layers seems to us to be the very mechanism by which deep networks achieve high accuracy.

A classical notion of trainability in optimization theory is the conditioning of the Hessian. This can also deteriorate with depth. Recently, Luo (2017) introduced an architecture that combats this pathology in an effective and computationally tractable way via iterative numerical methods and matrix decomposition. Matrix decomposition has also been used by e.g. Arjovsky et al. (2016); Helfrich et al. (2017) to maintain orthogonality of recurrent weight matrices. Maybe such techniques could also be used to reduce the divergence of singular values of the layer-wise Jacobians during training.

9 Conclusion

Summary

In this paper, we demonstrate that contrary to popular belief, many MLP architectures composed of popular layer types exhibit exploding gradients (section 3), and that those that do not instead exhibit collapsing domains (section 6) or extreme pseudo-linearity (section 7.3). This tradeoff is caused by the discrepancy between the geometric and quadratic means of the absolute singular values of layer-wise Jacobians (section 5). Both sides of this tradeoff can cause pathologies. Exploding gradients, when defined by the GSC (section 2), cause low effective depth (section 4). Collapsing domains can cause pseudo-linearity and also low effective depth (section 6). However, both pathologies can be avoided to a surprisingly large degree by eliminating untrainable, and thus potentially unnecessary non-orthogonality contained in the initial functions. Making the initial functions more orthogonal via e.g. skip connections leads to improved outcomes (section 7).

Figure 11: Illustration of the functional complexity of neural networks versus classical nonlinear models as represented by Gaussian kernel machines. Note: This figure does not display experimental results.

The picture of deep learning that emerges throughout this paper is considerably different from classical machine learning. In classical nonlinear models such as Gaussian kernel machines, we experience the curse of dimensionality: the complexity of the function computed by the model is governed by the number of kernels, and the number of kernels required to cover the input domain grows exponentially with the dimensionality of the data. Conversely, the complexity of the function computed by neural networks increases exponentially with depth, independently of the dimensionality of the data, assuming that the network exhibits e.g. a constant rate of gradient explosion. A visual high-level summary of this paper is shown in figure 11.

Practical Recommendations
  • Train from an orthogonal initial state, i.e. initialize the network such that it is a series of orthogonal linear transformations. This can greatly reduce the growth of the GSC and domain collapse not just in the initial state, but also as training progresses. It can prevent the forward activations from having to pass through unnecessary non-orthogonal transformations. Even if a perfectly orthogonal initial state is not achievable, an architecture that approximates this such as ResNet can still confer significant benefit. (A minimal initialization sketch is given after this list.)

  • When not training from an orthogonal initial state, avoid low effective depth. A low effective depth signifies that the network is composed of an ensemble of networks significantly shallower than the full network. If the initial functions are not orthogonal, the values computed by these ensemble members have to pass through what may be unnecessary and harmful untrainable non-orthogonal transformations. Low effective depth may be caused by, for example, exploding gradients or a collapsing domain.

  • Avoid pseudo-linearity. For the representational capacity of a network to grow with depth, linear layers must be separated by nonlinearities. If those nonlinearities can be approximated by linear functions, they are ineffective. Pseudo-linearity can be caused by, for example, a collapsing domain or excessive dilution.

  • Keep in mind that skip connections help in general, but other techniques do not. Diluting a nonlinear residual function with an uncorrelated linear initial function can greatly help with the pathologies described in this paper. Techniques such as normalization layers, careful initialization of weights or SELU nonlinearities can prevent the explosion or vanishing of forward activations. Adam, RMSprop or vSGD can improve performance even if forward activations explode or vanish. While those are important functionalities, these techniques in general neither help address gradient explosion relative to forward activations as indicated by the GSC nor the collapsing domain problem.

  • As the GSC grows, adjust the step size. If it turns out that some amount of growth of the GSC is unavoidable or desirable, weights in lower layers could benefit from experiencing a lower relative change during each update. Optimization algorithms such as RMSprop or Adam may partially address this.

  • Control dilution level to control network properties. Skip connections, normalization layers and scaling constants can be placed in a network to trade off gradient growth and representational capacity. Theorem 4 can be used for a static estimate of the amount of gradient reduction achieved. Similarly, theorem 5 can be used for a static estimate of the overall dilution of the network.

  • Great compositional depth may not be optimal. Networks with more than 1000 layers have recently been trained (He et al., 2016b). Haber and Ruthotto (2017) gave a formalism for training arbitrarily deep networks. However, ever larger amounts of dilution are required to prevent gradient explosion (Szegedy et al., 2016). This may ultimately lead to an effective depth much lower than the compositional depth and individual layers that have a very small impact on learning outcomes, because the functions they represent are very close to linear functions. If there is a fixed parameter budget, it may be better spent on width than extreme depth (Zagoruyko and Komodakis, 2016).
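Below is a minimal sketch of orthogonally initializing a single weight matrix via a QR decomposition, illustrating the first recommendation above (an illustration only; in practice the surrounding nonlinearity and normalization layers also matter, as discussed in section 7.4):

```python
import numpy as np

def orthogonal_init(d_out, d_in, gain=1.0, rng=None):
    # Return gain times a d_out x d_in slice of a random orthogonal matrix.
    rng = rng or np.random.default_rng()
    n = max(d_out, d_in)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return gain * Q[:d_out, :d_in]

W = orthogonal_init(100, 100, rng=np.random.default_rng(6))
print(np.allclose(W.T @ W, np.eye(100)))   # True: columns are orthonormal
```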

Implications for deep learning research
  • Exploding gradients matter. They are not just a numerical quirk to be overcome by rescaling but are indicative of an inherently difficult optimization problem that cannot be solved by a simple modification to a stock algorithm.

  • GSC is an effective benchmark for gradient explosion. For the first time, we established a rigorous link between a metric for exploding gradients and training difficulty. The GSC is also robust to network rescaling, layer width and individual layers.

  • Any neural network is a residual network. The residual trick allows the application of ResNet-specific tools such as the popular theory of effective depth to arbitrary networks.

  • Step size matters when studying the behavior of networks. We found that using different step sizes for different layers had a profound impact on the training success of various architectures. Many studies that investigate fundamental properties of deep networks either do not consider layerwise step sizes (e.g. Schoenholz et al. (2017)) or do not even consider different global step sizes (e.g. Keskar et al. (2017)). This can lead to inaccurate conclusions.

We provide continued discussion in section A.

Nonlinearity Normalization Matrix type Skip type Width GSC St. Dev. Sign Div.
ReLU none Gaussian none 100 1.52 0.22 0.030
ReLU layer Gaussian none 100 1.16 0.096 0.029
ReLU batch Gaussian none 100 5728 1.00 0.41
tanh none Gaussian none 100 1.26 0.096 0.50
tanh layer Gaussian none 100 72.2 1.00 0.50
tanh batch Gaussian none 100 93.6 1.00 0.50
SELU none Gaussian none 100 6.36 0.97 0.42
ReLU batch Gaussian none 200 5556 1.00 0.42
ReLU batch Gaussian none 100/200 5527 1.00 0.41
SELU none Gaussian none 200 5.86 0.99 0.45
SELU none Gaussian none 100/200 6.09 0.98 0.43
ReLU none orthogonal none 100 1.29 0.20 0.03
ReLU layer orthogonal none 100 1.00 0.10 0.03
ReLU batch orthogonal none 100 5014 1.00 0.42
tanh none orthogonal none 100 1.18 0.10 0.50
tanh layer orthogonal none 100 56.3 1.00 0.50
tanh batch orthogonal none 100 54.6 1.00 0.50
SELU none orthogonal none 100 5.47 1.00 0.49
ReLU none looks-linear none 100 1.00 1.00 0.50
ReLU layer looks-linear none 100 1.00 1.00 0.50
ReLU batch looks-linear none 100 1.00 1.00 0.50
ReLU layer Gaussian identity 100 1.08 0.56 0.19
ReLU batch Gaussian identity 100 4.00 1.00 0.48
tanh layer Gaussian identity 100 1.63 1.00 0.50
tanh batch Gaussian identity 100 1.57 1.00 0.50
SELU layer Gaussian identity 100 1.31 0.99 0.48
ReLU layer Gaussian Gaussian 100 1.17 0.56 0.18
ReLU batch Gaussian Gaussian 100 4.50 1.00 0.48
tanh layer Gaussian Gaussian 100 1.97 1.00 0.50
tanh batch Gaussian Gaussian 100 1.71 1.00 0.50
SELU layer Gaussian Gaussian 100 1.53 9.97 0.48
Table 1: Key metrics for architectures in their randomly initialized state evaluated on Gaussian noise. In the ‘Normalization’ column, ‘layer’ refers to layer normalization, ‘batch’ refers to batch normalization and ‘none’ refers to an absence of a normalization layer. In the ‘Matrix type’ column, ‘Gaussian’ refers to matrices where each entry is drawn independently from a Gaussian distribution. ‘orthogonal’ refers to a uniformly random orthogonal matrix and ‘looks-linear’ refers to the initialization scheme proposed by Balduzzi et al. (2017) and expounded in section G. In the ‘Skip type’ column, ‘identity’ refers to identity skip connections and ‘Gaussian’ refers to skip connections that multiply the incoming value with a matrix where each entry is drawn independently from a Gaussian distribution. ‘none’ refers to an absence of skip connections. In the ‘Width’ column, ‘100/200’ refers to linear layers having widths alternating between 100 and 200. ‘St. Dev.’ refers to pre-activation standard deviation at the highest nonlinearity layer. ‘Sign Div.’ refers to pre-activation sign diversity at the highest nonlinearity layer. For further details, see section H. Red values indicate gradient explosion or pseudo-linearity.
Nonlinearity Norm. Matrix type Skip type Error (custom s.s.) Error (single s.s.)
ReLU none Gaussian none 31.48% 19.24%
ReLU layer Gaussian none 42.48% 21.23%
ReLU batch Gaussian none 34.83% 76.65%
tanh none Gaussian none 23.42% 16.22%
tanh layer Gaussian none 1.92% 17.5%
tanh batch Gaussian none 12.31% 23.8%
SELU none Gaussian none 0.24% 1.78%
ReLU none looks-linear none 0.002% 0.008%
ReLU layer looks-linear none 0.77% 1.2%
ReLU batch looks-linear none 0.38% 0.19%
tanh layer Gaussian id 0.35% 0.27%
tanh batch Gaussian id 0.13% 0.24%
ReLU layer Gaussian id 2.09% 1.49%
ReLU batch Gaussian id 0.06% 0.096%
SELU layer Gaussian id 1.55% 1.55%
leaky ReLU () batch Gaussian none - 5.6%
leaky ReLU () batch Gaussian none - 4.64%
leaky ReLU () batch Gaussian none - 3.84%
leaky ReLU () batch Gaussian none - 4.86%
leaky ReLU () batch Gaussian none - 8.41%
leaky ReLU () batch Gaussian none - 23.14%
none batch Gaussian none - 53.55%
ReLU none orthogonal none - 12.51%
leaky ReLU () none orthogonal none - 2.72%
leaky ReLU () none orthogonal none - 2.66%
leaky ReLU () none orthogonal none - 0.85%
leaky ReLU () none orthogonal none - 0.22%
leaky ReLU () none orthogonal none - 0.13%
leaky ReLU () none orthogonal none - 0.11%
leaky ReLU () none orthogonal none - 0.12%
leaky ReLU () none orthogonal none - 0.09%
leaky ReLU () none orthogonal none - 20.51%
none none orthogonal none - 54.64%
Table 2: Training classification error for architectures trained on CIFAR10. In the ‘Normalization’ column, ‘layer’ refers to layer normalization, ‘batch’ refers to batch normalization and ‘none’ refers to an absence of a normalization layer. In the ‘Matrix type’ column, ‘Gaussian’ refers to matrices where each entry is drawn independently from a Gaussian distribution. ‘looks-linear’ refers to the looks-linear initialization scheme proposed by Balduzzi et al. (2017) and expounded in section G. ‘orthogonal’ refers to uniformly random orthogonal matrices. In the ‘Skip type’ column, ‘identity’ refers to identity skip connections and ‘none’ refers to an absence of skip connections. In the two rightmost columns, we show the training classification error achieved when using a single step size and when using a custom step size for each layer, whenever this experiment was conducted. If two error values are given, the lower one is shown in bold. For further details, see section H. For a detailed breakdown of these results, see figures 3, 7, 12 and 13.


Figure 12: Key metrics for ResNet architectures trained on CIFAR10. The top left graph shows the estimated optimal relative update size in each layer according to the algorithm described in section H.4.1. Remaining graphs show results obtained from training with either custom step sizes or a single step size, whichever achieved a lower error (see table 2). The top two rows are equivalent to graphs in figure 3. The bottom row shows pre-activation standard deviation and pre-activation sign diversity (see section H.2 for definition) at the highest nonlinearity layer as training progresses.


Figure 13: Key metrics for ReLU-based architectures with looks-linear initialization trained on CIFAR10. The top left graph shows the estimated optimal relative update size in each layer according to the algorithm described in section H.4.1. Remaining graphs show results obtained from training with either custom step sizes or a single step size, whichever achieved a lower error (see table 2). The top two rows are equivalent to graphs in figure 3. The bottom row shows pre-activation standard deviation and pre-activation sign diversity (see section H.2 for definition) at the highest nonlinearity layer as training progresses.

Appendix A Further discussion

A.1 Mean field theory - exploding gradients / collapsing domain vs order / chaos

Our work bears similarity to a recent line of research studying deep networks using mean field theory (Poole et al., 2016; Schoenholz et al., 2017; Yang and Schoenholz, 2017, 2018). The authors study infinitely wide and deep networks in their randomly initialized state. They identify two distinct regimes, order and chaos, based on whether the correlation between two forward activation vectors corresponding to two different datapoints converges exponentially to one (‘order’), exponentially to a value less than one (‘chaos’) or sub-exponentially (‘edge of chaos’), as the vectors are propagated towards infinite depth. For MLPs where the forward activation vector length converges, order corresponds to gradient vanishing according to the metric of e.g. gradient vector length. If the network is also a tanh MLP, chaos corresponds to gradient explosion according to the same metrics. They show how to use mean field theory as a powerful and convenient tool for the static analysis of network architectures and obtain a range of interesting results.

There are three core similarities between our and their work. Firstly, they discuss the exploding / vanishing gradient dichotomy. Second, the concept of order is very similar to an increasing domain bias. Hence, they show a gradient explosion / domain bias dichotomy for tanh MLPs with stable forward activations. Third, both works rely, at least in part, on the emerging behavior of networks in their randomly initialized state.

We extend their line of work in several ways. Firstly, we argue that the GSC is a better metric for determining the presence of pathological exploding or vanishing gradients than the quadratic mean of gradient vector entries, which is their metric of choice. Using the GSC, we obtain very different regions of gradient explosion, vanishing and stability for popular architectures. For example, for a constant width ReLU MLP with no biases, using the quadratic mean of gradient vector entries, vanishing is obtained for $\sigma_w < \sqrt{2}$, stability for $\sigma_w = \sqrt{2}$ and explosion for $\sigma_w > \sqrt{2}$. ($\sigma_w$ is defined as the standard deviation of weight matrix entries times the square root of the width, as in Poole et al. (2016).) For a constant width ReLU MLP with no biases, using the GSC, stability is inevitable. In such networks, the correlation of two forward activation vectors converges sub-exponentially for all weight scales. Hence, such networks are on the edge of chaos for all weight scales, which matches the gradient behavior when considering the GSC. Therefore, the GSC allows us to discard the assumption of a converging forward activation vector length and still obtain a correspondence between gradient and correlation behavior.
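The following sketch (ours) illustrates this contrast numerically for a constant width, bias-free ReLU MLP: the ratio of gradient vector lengths between the lowest and highest layer explodes or vanishes depending on the weight scale $\sigma_w$, while a crude GSC proxy, which additionally rescales by the forward activation lengths, stays of order one for all three weight scales.

```python
import numpy as np

rng = np.random.default_rng(7)
d, depth = 200, 30

def stats(sigma_w):
    x = rng.standard_normal(d)
    acts, cache = [x], []
    for _ in range(depth):                       # forward pass through a bias-free ReLU MLP
        W = rng.standard_normal((d, d)) * sigma_w / np.sqrt(d)
        pre = W @ acts[-1]
        cache.append((pre > 0, W))
        acts.append(np.maximum(pre, 0.0))
    g = rng.standard_normal(d)                   # error gradient at the output layer
    grads = [g]
    for mask, W in reversed(cache):              # backward pass
        g = W.T @ (g * mask)
        grads.append(g)
    grads = grads[::-1]                          # grads[l] now pairs with acts[l]
    grad_ratio = np.linalg.norm(grads[0]) / np.linalg.norm(grads[-1])
    gsc_proxy = grad_ratio * np.linalg.norm(acts[0]) / np.linalg.norm(acts[-1])
    return grad_ratio, gsc_proxy

for sigma_w in [1.0, np.sqrt(2.0), 2.0]:
    gr, gsc = stats(sigma_w)
    print(f"sigma_w={sigma_w:.3f}  gradient-norm ratio={gr:.3g}  GSC proxy={gsc:.3g}")
```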

Yang and Schoenholz (2018) propose to combat the exploding gradient problem by varying the width of intermediate layers. Such variation can indeed reduce the quadratic mean of gradient vector entries. However, our analysis in section 4.6 suggests that this technique is not effective in reducing the growth of the GSC. While we argue that an exploding GSC must cause training difficulties, to our knowledge, no such argument exists for an exploding quadratic mean of gradient vector entries. In fact, our results suggest that width variation is ineffective at combating gradient pathology.

A second extension is that we show how both gradient explosion and domain bias can directly harm training. Neither is obvious. Gradient explosion might be a numerical quirk to be overcome by rescaling. Correlation information is a rather small part of the information present in the data, so losing that information via domain bias / order might be irrelevant. As a simple example, consider k-means. Performing k-means on an arbitrary dataset yields the same result as first adding a large constant to the data and then performing k-means, even though the addition destroys correlation information.

Thirdly, we demonstrate the importance of using different step sizes for different layers when comparing architectures. While Schoenholz et al. (2017) show experimentally that architectures on the edge of chaos perform best at great depths, we obtain somewhat contrary evidence. Our two best performing vanilla architectures, SELU and layer-tanh, are both inside the chaotic regime whereas ReLU, layer-ReLU and tanh, which are all on the edge of chaos, exhibit a higher training classification error. Our chaotic architectures avoid pseudo-linearity and domain bias. The difference between our experiments and those in Schoenholz et al. (2017) is that we allowed the step size to vary between layers. This had a significant impact, as can be seen in table 2.

Fourthly, Yang and Schoenholz (2017) show that pathologies such as gradient explosion that arise in vanilla networks are reduced in specific ResNet architectures. We extend this finding to general ResNet architectures.

A.2 ResNet from a dynamical systems view

Recently, Haber et al. (2017); Haber and Ruthotto (2017); Chang et al. (2017, 2018) proposed ResNet architectures inspired by dynamical systems and numerical methods for ordinary differential equations. A central claim is that these architectures achieve bounded forward activations and gradients (and hence GSC) as depth goes to infinity. They propose four practical strategies for building ResNet architectures: (a) ensuring that residual and skip blocks compute vectors orthogonal to each other by using e.g. skew-symmetric weight matrices, (b) ensuring that the Jacobian of the residual block has eigenvalues with negative real part by using e.g. weight matrices factorized as $-C^TC$, (c) scaling each residual block by $\frac{1}{N}$, where $N$ is the number of blocks in the network, and (d) regularizing weights in successive blocks to be similar via a fusion penalty.

Architecture GSC (base 10 log) dilution-corrected GSC (base 10 log)
batch-ReLU (i) 0.337 4.23
batch-ReLU (ii) 0.329 4.06
batch-ReLU (iii) 6.164 68.37
batch-ReLU (iv) 0.313 7.22
layer-tanh (i) 0.136 2.17
layer-tanh (ii) 0.114 1.91
layer-tanh (iii) 3.325 5.46
layer-tanh (iv) 0.143 2.31
Table 3: Key metrics for architectures derived from dynamical systems theory.

We evaluated those strategies empirically. In table 3, we show the value of the GSC for 8 different architectures in their initialized state applied to Gaussian noise. All architectures use residual blocks containing a single normalization layer, a single nonlinearity layer and a single fully-connected linear layer. We initialize the linear layer in four different ways: (i) Gaussian initialization, (ii) skew-symmetric initialization, (iii) initialization as $-C^TC$, where $C$ is Gaussian initialized, and (iv) Gaussian initialization where weight matrices in successive blocks have correlation 0.5. Initializations (ii), (iii) and (iv) mimic strategies (a), (b) and (d) respectively. To enable the comparison of the four initialization styles, we normalize each weight matrix to have a unit quadratic mean (qm) norm. We study all four initializations for both batch-ReLU and layer-tanh. See section H for details.
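For reference, here is a sketch of the four initialization styles as we reconstruct them above (the correlated-Gaussian construction for (iv) is one standard way of obtaining entrywise correlation 0.5 between successive matrices and is an assumption on our part):

```python
import numpy as np

rng = np.random.default_rng(8)
d, corr = 100, 0.5

def unit_qm(W):
    return W / np.sqrt(np.mean(W ** 2))      # unit quadratic mean of the entries

A = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))
C = rng.standard_normal((d, d))

W_i   = unit_qm(A)                           # (i)   Gaussian
W_ii  = unit_qm(A - A.T)                     # (ii)  skew-symmetric
W_iii = unit_qm(-C.T @ C)                    # (iii) -C^T C, eigenvalues <= 0
W_iv_a = unit_qm(A)                          # (iv)  successive blocks correlated at `corr`
W_iv_b = unit_qm(corr * A + np.sqrt(1.0 - corr ** 2) * B)

print(np.allclose(W_ii, -W_ii.T))                        # skew-symmetry
print(np.all(np.linalg.eigvalsh(C.T @ C) >= -1e-8))      # so -C^T C is negative semi-definite
```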

Initialization (ii) reduces the gradient slightly relative to initialization (i). This is expected given theorem 4. One of the key assumptions is that skip and residual block are orthogonal in expectation. While initialization (i) achieves this, under (ii), the two functions are orthogonal not just in expectation, but with probability 1.

Initialization (iii) has gradients that grow much faster than initialization (i). On the one hand, this is surprising as Haber and Ruthotto (2017) state that eigenvalues with negative real parts in the residual block Jacobian slow gradient growth. On the other hand, it is not surprising because introducing correlation between the residual and skip blocks breaks the conditions of theorem 4.

Initialization (iv) performs comparably to initialization (i) in reducing gradient growth, but requires a larger amount of dilution to achieve this result. Again, introducing correlation between successive blocks and thus between skip and residual blocks breaks the conditions of theorem 4 and weakens the power of dilution.

While we did not investigate the exact architectures proposed in Haber and Ruthotto (2017); Chang et al. (2017), our results show that more theoretical and empirical evaluation is necessary to determine whether architectures based on (a), (b) and (d) are indeed capable of significantly improved forward activation and gradient stability. Of course, those architectures might still confer benefits in terms of e.g. inductive bias or regularization.

Finally, strategy (c), the scaling of either residual and / or skip blocks with constants, is a technique already widely used in regular ResNets. In fact, our study suggests that in order to bound the GSC at arbitrary depth in a regular ResNet, it is sufficient to downscale each residual block by only $\frac{1}{\sqrt{N}}$ instead of $\frac{1}{N}$ as the dynamical systems papers suggest.

A.3 Exploding and vanishing gradients in RNNs

Exploding and vanishing gradients have been studied more extensively in the context of RNNs (e.g. Pascanu et al. (2013); Bengio et al. (1994)). It is important to note that the problem as it arises in RNNs is similar to, but also different from, the exploding gradient problem in feedforward networks. The goal in RNNs is often to absorb information early on and store that information through many time steps and sometimes indefinitely. In the classical RNN architecture, signals acquired early would be subjected to a non-orthogonal transformation at every time step, which leads to all the negative consequences described in this paper. LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Cho et al., 2014), which are the most popular solutions to exploding / vanishing gradients in RNNs, are capable of simply leaving each neuron that is considered part of the latent state completely unmodified from time step to time step by gating the incoming signal, unless new information is received that is pertinent to that specific neuron. This solution does not apply in feedforward networks, because it is the very goal of each layer to modify the signal productively. Hence, managing exploding gradients in feedforward networks is arguably more difficult.

Nevertheless, there is similarity between LSTM and the orthogonal initial state because both eliminate non-orthogonality “as much as possible”. LSTM can eliminate non-orthogonality completely from time step to time step whereas in the orthogonal initial state, non-orthogonality is eliminated only from the initial function. Again, viewing feedforward networks as ensembles of shallower networks, orthogonal initial functions ensure that information extracted from each ensemble member does not have to pass through non-orthogonal transformations without clear reason. This is precisely what LSTM attempts to achieve.

A.4 Open research questions and future work

Biases, convolutional and recurrent layers

In this paper, we focus our analysis on MLPs without trainable bias and variance parameters. Theorem 1, in its formulation, applies only to such MLPs. The other theorems use conditions that are potentially harder to fulfill, even approximately, in non-MLP architectures. Our experimental evaluation is limited to MLPs.

We think that results very similar to those presented in this paper are achievable for other types of neural networks, such as those containing trainable biases, convolutional layers or recurrent layers, although we suspect the gap between theory and practice may increase.

Analysis of deep gradients has so far focused on MLPs (e.g. Balduzzi et al. (2017); Schoenholz et al. (2017); Yang and Schoenholz (2017); Saxe et al. (2014)), so a principled extension of these results to other network types would break new and important ground.

Understanding collapsing domains

It is difficult to assess or measure the degree to which the domain collapses in a given network. Neither entropy nor exponential entropy can be computed directly. How should we evaluate domains that are composed of sub-spaces that have varying intrinsic dimensionality?

A domain can collapse in many different ways. For example, in a deep linear, Gaussian-initialized network, the domain collapses onto the line through the principal eigenvector of the product of weight matrices, but never onto a single point. In a ReLU network, the domain collapses onto a ray from the origin. In layer-ReLU, the normalization operation then collapses the domain onto a single point. In a tanh network with very large weights, each tanh layer collapses the domain onto the corners of the hypercube. In what other ways can a collapsing domain manifest? How can those manifestations harm training?
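The linear-network example can be reproduced in a few lines (our sketch): propagating two random inputs through a deep Gaussian-initialized linear network drives their normalized representations toward the same line, so their cosine similarity approaches one.

```python
import numpy as np

rng = np.random.default_rng(9)
d, depth = 20, 200
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

def propagate(x):
    for W in Ws:
        x = W @ x
        x = x / np.linalg.norm(x)      # keep only the direction; lengths are irrelevant here
    return x

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
cos = abs(propagate(x1) @ propagate(x2))
print("cosine similarity of two random inputs after", depth, "layers:", cos)
```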

What gradient scale is best?

The GSC between the prediction layer and the input layer indicates the relative responsiveness of the prediction layer with respect to changes in the input layer. Of course, the goal in deep learning, at least within a prediction setting, is to model some ground truth function that maps data inputs to true labels. That function itself has a GSC at each input location that measures the relative responsiveness of the true label to changes in the data input. If the network is to perfectly represent the ground truth function, the GSCs would also have to match up. If, on the other hand, the GSC of the network differs significantly from that of the ground truth function, the network is not fitting it well. This suggests that in fact, the “best” value of the GSC is one that matches that of the ground truth. If the GSC of the network is too low, we may experience underfitting. If the GSC of the network is too high, we may experience overfitting.

How to achieve the “right” gradient?

To model the ground truth function, we may not just want to consider the overall magnitude of the GSC across the dataset, but to enable the network to have gradients of different magnitudes from one data input to the next; or to learn highly structured gradients. For example, given an image of a dog standing in a meadow, we might desire a high gradient with respect to pixels signifying e.g. facial features of the dog but a low gradient with respect to pixels that make up the meadow, and a uniformly low gradient given an image of a meadow. Such gradients would be very valuable not just in modelling real world functions more accurately and improving generalization, but in making the output of neural networks more explainable and avoiding susceptibility to attacks with adversarial inputs.

What is the relationship between compositional depth, effective depth, linear approximation error, dilution, gradient scale and representational capacity?

Throughout this paper, we have discussed various metrics that can influence the performance of deep networks. We proved and discussed many relationships between these metrics. However, there are still many open questions regarding how these concepts interrelate. Is effective depth truly a better tool for measuring “depth” than compositional depth? Does depth provide additional modeling benefits beyond its power to exponentially increase gradient scale? Is there a reason to prefer a deeper network if its gradient scale is the same as a shallower network? Is there a reason to prefer a network with higher linear approximation error if its gradient scale is the same as that of a network with lower linear approximation error? Does dilution bring about harms or benefits independently of its impact on gradient scale?

How far does the orthogonal initial state take us?

An orthogonal initial state reduces gradients via dilution, which allows for relatively larger updates, which enables increased growth of residual functions, which allows for greater effective depth. However, as residual functions grow, dilution decreases, so the gradient increases, so updates must shrink, so the growth of residual functions slows, so the growth of effective depth slows.

In other words, for the network to become deeper, it needs to be shallow.

Therefore, while training from an orthogonal initial state can increase effective depth, we expect this effect to be limited. Additional techniques could be required to learn functions which require a compositional representation beyond this limit.

Appendix B Further terminology, notation and conventions

  • and are generally used to refer to the components of a datapoint. Then, we have .

  • generally refers to a vector of dimensionality , i.e. the same dimensionality as the component of datapoints. Similarly, refers to an element of the domain of possible labels. We refer to and as ‘data input’ .

  • refers to a vector of dimensionality , i.e. the same dimensionality as .

  • We write as a short form of . Sometimes, we omit and / or . In that case, and / or remain implicit. is an analogous short form.

  • We write as a short form of . Sometimes, we omit . In that case, remains implicit. is an analogous short form.

  • We use , and interchangeably with or .

  • We say a random vector is ‘radially symmetric’ if its length is independent of its orientation and its orientation is uniformly distributed.

  • We say a random matrix is ‘Gaussian initialized’ if its entries are drawn independently from a mean zero Gaussian distribution.

  • We say a random matrix is ‘orthogonally initialized’ if it is a fixed multiple of a submatrix of a uniformly random orthogonal matrix.

  • We use parentheses to denote vector and matrix elements, i.e. $A(3,4)$ is the fourth element in the third row of a matrix $A$.

  • Throughout sections D and E, we assume implicitly that the GSC is defined and thus that neural networks are differentiable. All results can be trivially extended to cover networks that are almost surely differentiable and directionally differentiable everywhere, which includes SELU and ReLU networks.

  • We discuss the conditions that arise in theoretical results only in the context of MLPs. Note that several of our theoretical results apply to varying degrees to non-MLPs. We will not discuss the degree of applicability.

Appendix C Effective depth: details

C.1 Formal definition

Let a ‘gradient-based algorithm’ for training a mutable parameter vector $\theta$ from an initial value $\theta^{(0)}$ for a network $f$ be defined as a black box that is able to query the gradient at arbitrary query points, but only at the current value of the mutable parameter vector $\theta$. It is able to generate updates which are added to the mutable parameter vector $\theta$. Let the sequence of updates be denoted as $\delta^{(1)}, \delta^{(2)}, \ldots$. We define the successive states of $\theta$ recursively as $\theta^{(t)} = \theta^{(t-1)} + \delta^{(t)}$. For simplicity, assume the algorithm is deterministic.

In a residual network defined according to equation 3, we can write the gradient with respect to a parameter sub-vector as a product of layer-wise Jacobians, where each layer-wise Jacobian is the sum of the Jacobian of an initial function and the Jacobian of a residual function. Multiplying this out, we obtain a sum of terms. We call a term ‘$\lambda$-residual’ if it contains $\lambda$ or more Jacobians of residual functions, as opposed to Jacobians of initial functions. Let $g_\lambda$ be the sum of all $\lambda$-residual terms in the gradient.

Now consider two scenarios. In scenario (1), when the algorithm queries the gradient, it receives the “regular” gradient. In scenario (2), it receives a version of the gradient from which all $\lambda$-residual terms have been removed. Let the parameter vector attain states $\theta^{(1)}, \theta^{(2)}, \ldots$ in scenario (1) and $\tilde{\theta}^{(1)}, \tilde{\theta}^{(2)}, \ldots$ in scenario (2). Then we say the ‘$\lambda$-contribution’ at time $t$ is $\theta^{(t)} - \tilde{\theta}^{(t)}$. Finally, we say the ‘effective depth at time $t$ with threshold $\epsilon$’ is the largest $\lambda$ such that there exists a layer $l$ at which the $\lambda$-contribution has length at least $\epsilon$, plus one. We add one because we include the residual function at layer $l$, which is co-adapting to the residual Jacobians contained in the gradient term.

There is no objectively correct value for the threshold $\epsilon$. In practice, we find that the $\lambda$-contribution decreases quickly when $\lambda$ is increased beyond a certain point. Hence, the exact value of $\epsilon$ is not important when comparing different networks by effective depth.

The impact that this shift in the parameter value has on the output of the network is influenced by the scale of the parameters as well as of the gradient. If those values vary enormously between layers or architectures, it may be advisable to set different thresholds for different layers or architectures, though we did not find this necessary.

C.2 Computational estimate

Unfortunately, computing the effective depth measure is intractable as it would require computing exponentially many gradient terms. In this section, we explain how we estimate effective depth in our experiments.

In this paper, we train networks only by stochastic gradient descent with either a single step size for all layers or a custom step size for each layer. Our algorithm for computing effective depth assumes this training algorithm.

Vanilla networks

Assume that the network is expressed as a residual network as in equation 3. Let $B^{(t)}$ be the batch size for the $t$’th update, let $\alpha_l^{(t)}$ be the step size used at layer $l$ for the $t$’th update and let $D^{(t)}$ be the batch of query points used to compute the $t$’th update. Then SGD computes for all