## 1 Introduction

Gradient-based optimization enables learning of powerful deep NN models (DarganShaveta2019ASoD; rumelhart1986learning). However, most learning algorithms remain sensitive to the learning rate, the scale of data values, and the choice of activation function, which makes deep NN models hard to train (srivastava2015training; du2019gradient). Stochastic gradient descent with momentum (sutskever2013importance; adam), normalizing data values to zero mean and unit variance (lecun2012efficient), and employing rectified linear units (ReLUs) (lecun2015deep; ramachandran2017searching; nair2010rectified) have emerged as an empirically motivated and popular practice. In this paper, we analyze a specific failure case of this practice, referred to as the "dying" ReLU.

The ReLU is a popular choice of activation function and has shown superior training performance in various domains (glorot2011deep; sun2015deeply). However, a ReLU can collapse to a constant (zero) function for a given set of inputs. Such a ReLU is considered "dead" and no longer contributes to the learned model. ReLUs can be initialized dead (lu2019dying) or die during optimization, the latter being a major obstacle in training deep NNs (trottier2017parametric; agarap2018deep). Once dead, the gradients are zero, making recovery possible only if the input distribution changes. Over time, large parts of a NN can end up dead, which reduces model capacity.
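The failure mode is easy to reproduce in isolation. The following sketch (illustrative values only, not an experiment from this paper) shows a unit whose bias is so negative that every input yields a zero output and a zero gradient:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Subgradient of the ReLU: 1 for positive pre-activations, 0 otherwise.
    return (z > 0).astype(float)

# Hypothetical unit with a large negative bias: every pre-activation in the
# batch is negative, so the unit outputs a constant zero.
w, b = np.array([0.5, -0.3]), -10.0
x = np.random.default_rng(0).normal(size=(100, 2))
z = x @ w + b

dead = bool(np.all(relu(z) == 0.0))            # constant zero output
no_signal = bool(np.all(relu_grad(z) == 0.0))  # zero gradient: no recovery
```

Because the gradient is identically zero, no gradient step can revive such a unit unless the input distribution itself shifts.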

Mitigations to dying ReLU include modifying the ReLU to also activate for negative inputs maas2013rectifier; clevert2016fast; he2015delving, training procedures with normalization steps ba2016layer; ioffe2015batch, and initialization methods lu2019dying. While these approaches have some success in practice, the underlying cause for ReLUs dying during optimization is, to our knowledge, still not understood.

In this paper, we analyze the observation, illustrated in Figure 1, that regression performance degrades with smaller target variances, and that together with momentum optimization this leads to dead ReLUs. Although target normalization is a common pre-processing step, we believe a scientific understanding of *why* it is important is missing, especially in connection with momentum optimization. For our theoretical results, we first show that an affine approximator trained with gradient descent and momentum corresponds to a discrete-time linear autonomous system. Introducing momentum into this system results in complex eigenvalues and parameters that oscillate. We further show that a single-ReLU model has two cones in parameter space: one in which it shares the properties of the linear system, and one that corresponds to a dead ReLU.

We derive analytic gradients for the single-ReLU model to gain further insight and to identify critical points (i.e., global optima and saddle points) in parameter space. By inspecting numerical examples, we also identify regions from which ReLUs tend to converge to the global optimum (without dying) and show how these regions change with momentum. Lastly, we show empirically that the problem of dying ReLUs is aggravated in deeper models, including residual neural networks.

## 2 Related work

In a recent paper (lu2019dying), the authors identify dying ReLUs as a cause of vanishing gradients, a fundamental problem in NNs (poole2016exponential; hanin2018neural). In general, this can be caused by ReLUs being initialized dead or dying during optimization. Theoretical results about initialization and dead ReLU NNs are presented by lu2019dying: growing NN depth towards infinity and initializing parameters from symmetric distributions both lead to dead models, whereas asymmetric initialization can effectively prevent this outcome. Empirical results about ReLUs dying during optimization are presented by wu2018ans. Similar to us, they observe a relationship between dying ReLUs and the scale of target values; in contrast to us, they do not investigate the underlying cause.

### Normalization of layer input values.

The effects of the input value distribution have been studied for a long time, e.g., sola1997importance. Inputs with zero mean have been shown to result in gradients that more closely resemble the natural gradient, which speeds up training (raiko2012deep). In fact, a range of strategies to normalize layer input data exists (ioffe2015batch; ba2016layer; ioffe2017self), along with theoretical analysis of the problem (santurkar2018does). Another approach to maintaining statistics throughout the NN is initialization of the parameters (glorot2010understanding; he2015delving; lu2019dying). However, subsequent optimization steps may change the parameters such that the desired input mean and variance are no longer attained.

### Normalization of target values.

When the training data are available before optimization, target normalization is trivially executed. More challenging is the case where training data are accessed incrementally, e.g., in reinforcement learning or for very large datasets. Here, normalization and scaling of target values are important for the learning outcome (van2016learning; henderson2018deep; wu2018ans). For on-line regression and reinforcement learning, adaptive target normalization improves results and removes the need for gradient clipping (van2016learning). In reinforcement learning, scaling rewards by a positive constant is crucial for learning performance, and is often equivalent to scaling the target values (henderson2018deep). Small reward scales have been observed to increase the risk of dying ReLUs (wu2018ans). All of these works motivate target normalization empirically; a theoretical understanding is still lacking. In this paper, we provide more insight into the relationship between dying ReLUs and target normalization.

## 3 Preliminaries

We consider regression of a target function from training data consisting of pairs of inputs and target values. We analyze different regression models, such as an affine transformation in Sec. 4 and a ReLU-activated model in Sec. 5, both of which are parameterized by a vector of weights and a bias. Below, we provide the definitions, notation, and equalities needed for our analysis.

### Target normalization.

Before regression, we transform the target values according to

$\tilde{y} = \sigma \, \dfrac{y - \mu_y}{\sigma_y} + \mu$ (1)

where $\mu_y$ and $\sigma_y$ are the mean and standard deviation of the target values from the training data. When the parameters of the transform are set to scale $\sigma = 1$ and bias $\mu = 0$, the new target values correspond to z-normalization (goldin1995similarity) with zero mean and unit variance. In our analysis, we are interested in the effects of changing $\sigma$ from $1$ to smaller values closer to $0$.

### Target function.

Similar to douglas2018relu, we consider the case where the inputs are distributed as a standard multivariate normal, $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. For any weight vector of the target function, we can find a unitary matrix that rotates it onto a single coordinate axis, from which the equalities in Eq. (3) follow. Since $\mathbf{x}$ and its rotation $\mathbf{U}\mathbf{x}$ are identically distributed under this input assumption, we can equivalently study the rotated target function

(4)

and assume this axis-aligned form for the remainder of this paper.
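The rotation argument can be made concrete. A minimal sketch (one possible construction; the paper does not specify how the matrix is built) uses a Householder reflection to map a weight vector onto the last coordinate axis; since the inputs are standard-normal, rotating them by this matrix leaves their distribution unchanged:

```python
import numpy as np

def rotation_to_axis(w):
    """Orthogonal (real unitary) U with U @ w = ||w|| * e_n, built from a
    Householder reflection. Because x ~ N(0, I) is rotation invariant,
    U @ x has the same distribution as x."""
    n = len(w)
    target = np.zeros(n)
    target[-1] = np.linalg.norm(w)
    v = w - target
    if np.allclose(v, 0.0):
        return np.eye(n)  # w already lies on the axis
    return np.eye(n) - 2.0 * np.outer(v, v) / (v @ v)

w = np.array([3.0, 4.0, 0.0])
U = rotation_to_axis(w)
```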

### Regression and Objective.

We consider gradient descent (GD) optimization with momentum for the parameters $\theta$. Writing the momentum update in the standard heavy-ball form, the update from step $t$ to $t+1$ is given as

$\mathbf{m}_{t+1} = \beta \, \mathbf{m}_t + \nabla_\theta L(\theta_t)$ (6)

for the momentum variable $\mathbf{m}$ and

$\theta_{t+1} = \theta_t - \alpha \, \mathbf{m}_{t+1}$ (7)

for the parameters, where $L$ is the loss function, $\beta$ is the rate of momentum, and $\alpha$ is the step size.

### Regression Models and Parameterization.

In Sec. 4 we model the respective target function with the affine transform

$f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$ (8)

and in Sec. 5 we consider the nonlinear ReLU model

$f(\mathbf{x}) = \max(0, \mathbf{w}^\top \mathbf{x} + b).$ (9)

In both cases, the parameters $\theta$ consist of the weights $\mathbf{w}$ and the bias $b$.

We optimize the mean squared error (MSE), such that

$L(\theta) = \mathbb{E}\big[ e(\mathbf{x})^2 \big]$ (10)

where $e(\mathbf{x})$ is the signed error between the model output and the target, for the affine target and the ReLU-activated target respectively.

To make the gradient calculation easier and interpretable, we approximate the gradients for the ReLU model by replacing the ReLU-activated target with its linear counterpart. This remains a reasonable approximation, since for any choice of parameters we have

(11)

and

(12)

That is, the error and the gradient are either identical or share the same sign when evaluated at any point.

To make calculating the expected gradients easier, without introducing any further approximations, we define a unitary matrix $\mathbf{U}$ such that the weight vector rotated by $\mathbf{U}$ is mapped onto a single coordinate axis,

(13)

We use this rotated vector throughout the analysis below. Since the inputs are standard-normally distributed, the variables $\mathbf{x}$ and $\mathbf{U}\mathbf{x}$ are identically distributed, and from Eq. (13) we also get

(14)

### Dying ReLU.

A ReLU is dying if its pre-activation is negative for all inputs. For inputs with infinite support, we consider the ReLU as dying if its outputs are non-zero with probability less than some small $\epsilon$,

(15)
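Under the standard-normal input assumption used later in the paper, this dying criterion reduces to a threshold test on a Gaussian tail. The sketch below assumes that form (the function names are illustrative):

```python
import math

def prob_active(w, b):
    """P(w.x + b > 0) for x ~ N(0, I): the pre-activation is distributed
    as N(b, ||w||^2), so the probability equals Phi(b / ||w||)."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return 0.5 * (1.0 + math.erf(b / (norm_w * math.sqrt(2.0))))

def is_dying(w, b, eps=1e-3):
    # Eq. (15): outputs are non-zero with probability below eps.
    return prob_active(w, b) < eps
```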

## 4 Regression with affine model

In this section, we analyze the regression of the target function from Eq. (4) with the affine model from Eq. (8). From the perspective of the input and output space of this model, it is identical to the ReLU model in Eq. (9) for all inputs that map to positive values. On the other hand, from the parameter space perspective, we will show that the parameter evolution is identical in certain regions of the parameter space. The global optimum is also the same for both functions. This allows us to re-use some of the following results later.

To study the evolution of parameters and momentum, we formulate the GD optimization as an equivalent linear autonomous system (goh2017momentum) and analyze its behavior by inspecting the eigenvalues. For this analysis, we assume that the inputs are distributed as in the training data.
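As a concrete reference point, the update that generates this system can be sketched as follows. This assumes the standard heavy-ball form of Eqs. (6) and (7); the function and parameter names are illustrative:

```python
import numpy as np

def gd_momentum_step(theta, m, grad, lr=0.01, beta=0.9):
    # Eq. (6): the momentum variable accumulates gradients.
    m_next = beta * m + grad(theta)
    # Eq. (7): the parameters follow the momentum variable.
    theta_next = theta - lr * m_next
    return theta_next, m_next

# Quadratic toy loss L(theta) = 0.5 * ||theta||^2, so grad(theta) = theta;
# the recursion is then exactly a linear autonomous system in (m, theta).
theta, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    theta, m = gd_momentum_step(theta, m, lambda t: t)
```

On a quadratic loss, each step applies a fixed linear map to the stacked state, which is what the matrix formulation of the system exploits.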

### Analytic gradients.

By inserting the affine model into Eq. (10), the optimization objective can be formulated as

(16)

using shorthand definitions for the parameter residuals. Considering the target function from Eq. (4), the derivatives are

(17)

for the weights and

(18)

for the bias. From these results, we can see that both gradients are zero when the parameters arrive at a critical point, in this case the global optimum.

### Parameter evolution.

The parameter evolution given by Eqs. (6) and (7) can be formulated as a linear autonomous system. The state consists of stacked pairs of momentum and parameters, and the update equations Eq. (6) and Eq. (7) take the form

(19)

where the state evolves according to the constant matrix

(20)

which determines the evolution of each pair of momentum and parameter independently of the other pairs.

Since the pairs evolve independently, we can study their convergence by analyzing the eigenvalues of the system matrix. Given that the step size and momentum are in the ranges defined in Sec. 3, all eigenvalues are strictly inside the right half of the complex unit circle, which guarantees convergence to the ground truth. For larger step sizes, the eigenvalues remain inside the complex unit circle, still guaranteeing convergence, but they can become real and negative. This means that parameters alternate signs at every gradient update, a behavior denoted "ripples" (goh2017momentum). Although this sign-switching can cause dying ReLUs in theory, learning rates used in practice are usually small enough to avoid it.

We plot the eigenvalues of the system matrix in Figure 2 (left) as the momentum increases from 0, i.e., GD without momentum, towards 1. We observe that the eigenvalues eventually become complex, resulting in oscillation (goh2017momentum), as seen on the right side. The plotted ratio, as we will show in Sec. 5, is a good measure of the extent to which the ReLU is dying (smaller means closer to dead), and hence we plot this quantity in particular. Note that the eigenvalues, and thus the behavior, are entirely parameterized by the learning rate and momentum, and are independent of the target scale. Thus, we cannot adjust the learning rate as a function of the target scale to make the system behave as in the unscaled case. We now continue by showing how these properties translate to the case of a ReLU-activated unit.
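The eigenvalue behavior can be reproduced numerically. This sketch assumes the quadratic-loss formulation of momentum dynamics from goh2017momentum; the paper's exact matrix in Eq. (20) may differ, so the construction below is an assumption:

```python
import numpy as np

def momentum_system_matrix(lr, beta, curvature=1.0):
    # One (momentum, parameter) pair evolving under Eqs. (6)-(7) on a
    # quadratic loss with the given curvature (goh2017momentum-style).
    return np.array([[beta, curvature],
                     [-lr * beta, 1.0 - lr * curvature]])

def eigvals(lr, beta):
    return np.linalg.eigvals(momentum_system_matrix(lr, beta))

# Without momentum the eigenvalues are real; with enough momentum they
# become complex (oscillation) while staying inside the unit circle.
e_plain = eigvals(0.1, 0.0)
e_mom = eigvals(0.1, 0.9)
```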

## 5 Regression with single ReLU-unit

We now want to understand the behavior of regressing to the ReLU-activated target in Eq. (5) with the ReLU model in Eq. (9). As discussed in Sec. 3, we approximate the gradients by considering the linear target function in Eq. (4). Although this target cannot be fully recovered by the ReLU model, the optimal solution is the same, and the gradients share similarities, as previously discussed. Again, we consider the evolution and convergence of parameters and momentum during GD optimization, and assume that the inputs are distributed as in the training data.

### Similarity to affine model.

The ReLU outputs non-zero values for any input for which the pre-activation is positive. We can equivalently write this condition as

(21)

We can further simplify the condition from Eq. (21) using the unitary matrix from Sec. 3, which lets us consider the constraint solely in a single dimension

(22)

From our assumption about the distribution of the training inputs, it follows that this coordinate is standard-normally distributed.

With this result, we can compute the probability of a non-zero output from the ReLU as

$P\big(f(\mathbf{x}) > 0\big) = \Phi\!\left( \frac{b}{\lVert \mathbf{w} \rVert} \right)$ (23)

where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution.

Using Eq. (15), we see that a dead ReLU is equivalent to this probability falling below $\epsilon$, which defines a "dead" cone in parameter space. We can also formulate a corresponding "linear" cone, in which the ReLU is almost always active; it is the same cone mirrored along the bias axis. In the linear cone, because of the similarity to the affine model, we know that parameters evolve as described in Sec. 4, with increased oscillations when momentum is used. We will now investigate the analytic gradients to see how these properties translate into that perspective.
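The two cones can be expressed as a simple classification of the ratio between bias and weight norm. This is a sketch under the standard-normal input assumption; the function names and symmetric thresholds are illustrative:

```python
import math

def activation_prob(w, b):
    # Phi(b / ||w||) for x ~ N(0, I), the probability of a non-zero output.
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return 0.5 * (1.0 + math.erf(b / (norm_w * math.sqrt(2.0))))

def cone_region(w, b, eps=1e-3):
    """'dead': almost never active; 'linear': almost always active, where
    the unit behaves like the affine model of Sec. 4; 'mixed' otherwise.
    The two cones are mirror images in the sign of b."""
    p = activation_prob(w, b)
    if p < eps:
        return "dead"
    if p > 1.0 - eps:
        return "linear"
    return "mixed"
```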

### Analytic gradients.

By inserting the ReLU model into Eq. (10), the optimization objective can be formulated as

(24) |

where we use the indicator function to model the ReLU activation. The derivatives are

(25) |

for the weights and

(26) |

for the bias. As in Sec. 4, the optimal fit is given by the same weights and bias as for the affine model, and these parameters set the gradients in Eq. (25) and Eq. (26) to zero.

By changing basis using the unitary matrix, we can compute the derivatives,

(27) |

for the weights in all but the last dimension, and

(28) |

for the last dimension, where we use the density function of the standard normal distribution. For the bias we have

(29) |

Full derivations are listed in Appendix B.

### Critical points in parameter space.

With the gradients from above, we can analyze the parameters (i.e., optima and saddle points) to which GD can converge when the derivatives are zero. First of all, the *global optimum* is easily verified to have zero gradient.

*Saddle points* correspond to dead ReLUs and occur in the limit where the activation probability vanishes, since then all gradient terms tend to zero. This equals the case of a constant zero output, which can be verified by plugging these values into Eq. (24). These saddle points form the center of the dead cone. Note that in practice these limits are reached well before the exact center, since the activation probability decays rapidly with the ratio of bias to weight norm. The implication is that the entire dead cone can be treated as saddle points in practice, and that parameters will converge on the surface of the cone rather than in its interior.
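To see how quickly the near-saddle regime is reached, consider a hypothetical (but representative) ratio of bias to weight norm of -3:

```python
import math

def std_normal_cdf(r):
    return 0.5 * (1.0 + math.erf(r / math.sqrt(2.0)))

# At b / ||w|| = -3 the ReLU is active for only ~0.13% of standard-normal
# inputs, so expected gradients are effectively zero well before the
# exact saddle point at the center of the dead cone.
p = std_normal_cdf(-3.0)
```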

In the linear cone, by contrast, the activation probability approaches one, and the gradients can be verified to equal those of the affine model in Sec. 4, as expected. This verifies that the parameter evolution in the linear cone is approximately identical to that of the affine model.

### Simplification to enable visualization.

To continue our investigation of the parameter evolution, and in particular to focus on how the target variance and momentum drive parameters into the dead cone, we make some simplifications. We assume that the weights have converged in all but the last dimension, which enables us to express the gradients without the unitary matrix as

(30) |

For the weight we can prove, see Appendix C, that

(31) |

where and . For the bias we get

(32) |

This means that if the weights in the remaining dimensions are zero, only the last weight and the bias evolve. We can now plot the vector fields in these two parameters to see how they change with the target scale.

### Influence of the target scale on convergence without momentum.

The first key take-away when decreasing the target scale is that the global optimum moves closer to the origin and eventually lies between the dead and the linear cone. This location of the optimum is particularly sensitive, since in this case the parameters in the linear cone evolve towards the dead cone, and in addition exhibit oscillatory behavior when momentum is large. The color scheme in Figure 3 verifies that, like the probability of non-zero outputs, the gradients also tend to zero in the dead cone. In the lower right quadrant, we can see an attracting line that is shifted towards the dead cone as the target scale decreases, eventually ending up inside the dead cone. In that case, we can follow the flow lines to see that most parameters originating from the right side end up in the dead cone. Parameters originating in and near the linear cone approach the ground truth, and those in the lower left quadrant first evolve towards the linear cone before evolving towards the ground truth.

When adding momentum, recall that the parameter evolution in and near the linear cone exhibits oscillatory behavior. Our hypothesis is that parameters originating from the linear cone can oscillate over into the dead cone and get stuck there. Parameters from the other regions either evolve into the dead cone as before, or first into the linear cone and then into the dead cone by oscillation. We will now evaluate and visualize this.

### Regions that converge to global optimum.

We are interested in distinguishing the regions from which parameters converge to the global optimum and to the dead cone, respectively. For this, we apply the update rules, Eqs. (6) and (7), until the updates are smaller in norm than a small convergence threshold.

Figure 4 shows the results for GD without momentum. We see that the region that converges to the dead cone changes with the target scale, and eventually switches sign when the scale becomes small. The majority of initializations still converge to the ground truth.

Figure 5 shows the results with momentum. The linear autonomous system in Sec. 4 has complex eigenvalues for sufficiently large momentum, which leads to oscillation. This property approximately translates to the ReLU in the linear cone, where we also expect oscillations. Indeed, we observe the same results as without momentum when the momentum is small, but worse results as it grows. Eventually, only small regions of initializations are able to converge to the global optimum.
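A small 1-D simulation makes this setup tangible. This is an illustrative sketch, not the exact setup of Figures 4 and 5: the scaled target sigma * max(0, x), the initialization, and all hyperparameters are assumptions:

```python
import numpy as np

def train_single_relu(sigma, beta, lr=0.05, steps=2000, seed=0):
    """Fit the target sigma * max(0, x) with a single unit max(0, w*x + b)
    by momentum GD on the MSE; returns the final parameters."""
    rng = np.random.default_rng(seed)
    w, b = 1.0, 0.5        # start inside the active region
    mw = mb = 0.0          # momentum variables, Eqs. (6)-(7)
    for _ in range(steps):
        x = rng.normal(size=64)
        z = w * x + b
        err = np.maximum(0.0, z) - sigma * np.maximum(0.0, x)
        active = (z > 0).astype(float)           # ReLU gate
        gw = float(np.mean(2.0 * err * active * x))
        gb = float(np.mean(2.0 * err * active))
        mw = beta * mw + gw
        mb = beta * mb + gb
        w, b = w - lr * mw, b - lr * mb
    return w, b

# Unit targets (sigma = 1) without momentum: converges to w = 1, b = 0.
w_fit, b_fit = train_single_relu(sigma=1.0, beta=0.0)
```

Running the same routine with a small sigma and a large beta lets one probe whether the final parameters end up inside the dead cone, mirroring the regions shown in the figures.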

## 6 Deeper architectures

In this section we address two questions. (1) Does the problem persist in deeper architectures, including residual networks with batch normalization? (2) Since the ReLU is a linear operator for positive coefficients, can we simply scale the weight initialization and learning rate and obtain an equivalent network and learning procedure? For the latter, we show that this is not possible for deeper architectures.

### Relevance for models used in practice.

We performed regression on the same datasets as in Sec. 1, with a small target scale. We confirm the findings of lu2019dying that ReLUs are initialized dead in deeper "vanilla" NNs, but not in residual networks, due to the batch normalization layer that precedes each ReLU. Results are shown in Figure 6. We further find that ReLUs die more often, and faster, the deeper the architecture, even for residual networks with batch normalization. We can also conclude that, in these experiments, stochastic gradient descent does not produce more dead ReLUs during optimization.

### Parameter re-scaling.

For positive scalars, the ReLU is positively homogeneous: scaling its input by a positive constant scales its output by the same constant. By rescaling the weight and bias initializations (not the learning rate), the parameter trajectories during learning will therefore be proportional regardless of the target scale; that is, ReLUs will die independently of it. This is a special case, though: for any architecture with one or more hidden layers, it is not possible to multiply the parameters by a single scalar such that the resulting function is a rescaled version of the original. A proof is provided in Appendix D. Moreover, we still face a problem if the target scale is not known in advance, as in, for example, the reinforcement learning setting.
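The homogeneity property, and why it fails once a hidden layer is present, can be checked directly. The one-hidden-layer network below is a hypothetical minimal example, not the architecture from the experiments:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Positive homogeneity: for c > 0, max(0, c*z) = c * max(0, z).
c = 0.01
z = np.linspace(-3.0, 3.0, 101)
homogeneous = bool(np.allclose(relu(c * z), c * relu(z)))

# With a hidden layer, scaling ALL parameters by c scales the hidden
# pre-activation by c and the output weights by c again, so the output
# is no longer a c-multiple of the original function (cf. Appendix D).
def net(x, w1, b1, w2, b2):
    return w2 * relu(w1 * x + b1) + b2

x = 1.5
f = net(x, 1.0, 0.2, 0.8, -0.1)
f_scaled = net(x, c * 1.0, c * 0.2, c * 0.8, c * -0.1)
not_proportional = bool(not np.isclose(f_scaled, c * f))
```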

## 7 Conclusion and future work

Target normalization is a well-motivated and widely used practice, although we believe a theoretical understanding of *why* it matters is both important and lacking. We take a first stab at the problem by studying the smallest moving part of a neural network, a single ReLU-activated affine transformation. Gradient descent on the parameters of an affine transformation can be expressed as a discrete-time linear system, for which we have tools to explain the behavior. We provide a geometrical understanding of how these properties translate to the ReLU case, and of when the ReLU is considered dead. We further illustrated that weight initializations from large regions in parameter space lead to dying ReLUs as the target variance is decreased in combination with momentum optimization, and showed that the problem is still relevant, and even aggravated, for deeper architectures. Remaining questions include how to extend the analysis from the single ReLU to the full network. There, the implicit target function of a single ReLU is likely neither linear nor piecewise linear, and the inputs are neither Gaussian nor stationary over time.

## References

## Acknowledgements

This work was funded by the *********** through the project ************.

## Appendix A Regression experiment details

The NN used for regression was of the form

(33)

where the number of hidden layers and their widths were fixed per experiment and the input dimensionality matches the data. The weights and biases were initialized from two fixed distributions, one for each of the parameter groups in Eq. (33). The batch size was fixed, and the Adam optimizer was set to use the parameters suggested by adam. Every experiment was run multiple times with different seeds, affecting the random ordering of mini-batches and the initialization of parameters. Each experiment was allowed a fixed budget of gradient updates before evaluation. Evaluation was performed on a subset of the *training set*, as we wanted to investigate the fit to the data rather than generalization.

## Appendix B Single ReLU gradients

For brevity we will use in place of below. Subscript notation on non-bold symbols here represents single dimensions of the vector counterpart. The gradient w.r.t. is

Multiplying by from the left and looking at a dimension , we get:

For we instead have

(34) |

We can calculate the second term with integration by parts

The third term:

By substituting these expressions back into Eq. (34) and simplifying, we obtain Eq. (28). The derivation of the gradient w.r.t. the bias is very similar to that for the weights and is omitted here for brevity.

## Appendix C Gradients after convergence in the first dimensions

If we assume that the first dimensions of the rotated weight vector are zero, we can show that the gradients in those dimensions are also zero. We can then also express the gradient without the unitary matrix that otherwise needs to be calculated for every new weight vector. We present this as a proposition at the end of this section. First, we need to show some preliminary steps.

###### Lemma C.1.

If

then

where is either or .

###### Proof.

As defined in Eq. (13), we have . Since

we must have . Since U is unitary we must have which is solved by . ∎

###### Lemma C.2.

If

then

###### Proof.

We have from Lemma C.1 that and lie in the same one-dimensional linear subspace spanned by . Since is unitary and also is in , then so must . This implies that . ∎

###### Lemma C.3.

If

then for any vector in the linear subspace spanned by we have

where .

###### Proof.

###### Proposition C.1.

If

then

and

for and where and . The gradient w.r.t. the bias is

## Appendix D Scaling of parameters

Before stating the problem, we define an N-hidden layer neural network recursively

(35) |

where

(36) |

and

(37) |

Denote the joint collection of parameters , , and as . We denote to make explicit the dependence on . We now investigate whether a scaling of can be achieved by scaling all parameters by a single number.

###### Theorem D.1.

Given some , , there exists no such that

(38) |

###### Proof.

First, denote and factorize into factors

(39) |

Multiplying by the scalar factor gives:

(40)

(41)

(42)

and similarly

(43)

(44)

We finally define .

If a exists that satisfies Eq. (38), then we must have all . If , then and thus in Eqs. (42) and (44) the weights and biases are not multiplied by the same number . Similarly, this holds for .

∎
