Gradient-based optimization enables learning of powerful deep NN models DarganShaveta2019ASoD; rumelhart1986learning. However, most learning algorithms remain sensitive to the learning rate, the scale of data values, and the choice of activation function, making deep NN models hard to train srivastava2015training; du2019gradient. Stochastic gradient descent with momentum sutskever2013importance; adam, normalizing data values to zero mean and unit variance lecun2012efficient, and employing rectified linear units (ReLUs) in NNs lecun2015deep; ramachandran2017searching; nair2010rectified have emerged as an empirically motivated and popular practice. In this paper, we analyze a specific failure case of this practice, referred to as the “dying” ReLU.
The ReLU activation function is a popular choice of activation function and has been shown to have superior training performance in various domains glorot2011deep; sun2015deeply. However, a ReLU can sometimes collapse to a constant (zero) function for a given set of inputs. Such a ReLU is considered “dead” and no longer contributes to the learned model. ReLUs can be initialized dead lu2019dying or die during optimization, the latter being a major obstacle in training deep NNs trottier2017parametric; agarap2018deep. Once dead, the gradients through a ReLU are zero, so recovery is possible only if the input distribution changes. Over time, large parts of a NN can end up dead, which reduces model capacity.
Mitigations to dying ReLU include modifying the ReLU to also activate for negative inputs maas2013rectifier; clevert2016fast; he2015delving, training procedures with normalization steps ba2016layer; ioffe2015batch, and initialization methods lu2019dying. While these approaches have some success in practice, the underlying cause for ReLUs dying during optimization is, to our knowledge, still not understood.
In this paper, we analyze the observation, illustrated in Figure 1, that regression performance degrades with smaller target variances, and that this, combined with momentum optimization, leads to dead ReLUs. Although target normalization is a common pre-processing step, we believe a scientific understanding of why it is important is missing, especially regarding its connection to momentum optimization. For our theoretical results, we first show that an affine approximator trained with gradient descent and momentum corresponds to a discrete-time linear autonomous system. Introducing momentum into this system results in complex eigenvalues and parameters that oscillate. We further show that a single-ReLU model has two cones in parameter space: one in which the properties of the linear system are shared, and one that corresponds to a dead ReLU.
We derive analytic gradients for the single-ReLU model to further gain insight and to identify critical points (i.e. global optima and saddle points) in parameter space. By inspection of numerical examples, we also identify regions where ReLUs tend to converge to the global optimum (without dying) and how these regions change with momentum. Lastly, we show empirically that the problem of dying ReLU is aggravated in deeper models, including residual neural networks.
2 Related work
In a recent paper lu2019dying, the authors identify dying ReLUs as a cause of vanishing gradients. This is a fundamental problem in NNs poole2016exponential; hanin2018neural. In general, this can be caused by ReLUs being initialized dead or dying during optimization. Theoretical results about initialization and dead ReLU NNs are presented by lu2019dying. Growing NN depth towards infinity and initializing parameters from symmetric distributions both lead to dead models. However, asymmetric initialization can effectively prevent this outcome. Empirical results about ReLUs dying during optimization are presented by wu2018ans. Similar to us, they observe a relationship between dying ReLUs and the scale of target values. In contrast to us, they do not investigate the underlying cause.
Normalization of layer input values.
The effects of the input value distribution have been studied for a long time, e.g. sola1997importance. Inputs with zero mean have been shown to result in gradients that more closely resemble the natural gradient, which speeds up training raiko2012deep. In fact, a range of strategies to normalize layer input data exists (ioffe2015batch; ba2016layer; ioffe2017self), along with theoretical analyses of the problem (santurkar2018does). Another studied approach to maintaining statistics throughout the NN is the initialization of the parameters (glorot2010understanding; he2015delving; lu2019dying). However, subsequent optimization steps may change the parameters such that the desired input mean and variance are no longer maintained.
Normalization of target values.
When the training data are available before optimization, target normalization is trivially executed. More challenging is the case where training data are accessed incrementally, e.g. as in reinforcement learning or for very large training sets. Here, normalization and scaling of target values are important for the learning outcome van2016learning; henderson2018deep; wu2018ans. For on-line regression and reinforcement learning, adaptive target normalization improves results and removes the need for gradient clipping van2016learning. In reinforcement learning, scaling rewards by a positive constant is crucial for learning performance and is often equivalent to scaling the target values henderson2018deep. Small reward scales have been observed to increase the risk of dying ReLUs wu2018ans. All of these works motivate the use of target normalization empirically; a theoretical understanding is still lacking. In this paper, we provide more insight into the relationship between dying ReLUs and target normalization.
We consider regression of a target function from training data consisting of pairs of inputs and target values . We analyze different regression models , such as an affine transformation in Sec. 4 and a ReLU-activated model in Sec. 5, which are both parameterized by a vector. Below, we provide definitions, notation, and equalities needed for our analysis.
Before regression, we transform target values according to
are the mean and standard deviation of the target values from the training data. When the parameters of the transform are set to scale and bias , the new target values correspond to -normalization goldin1995similarity with zero mean and unit variance. In our analysis, we are interested in the effects of changing from to smaller values closer to .
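As a concrete illustration, the transform described above can be sketched in a few lines of Python. The function name and the `sigma_prime`/`mu_prime` parameter names are our own; with scale 1 and bias 0 this reduces to plain z-normalization:

```python
import numpy as np

def normalize_targets(y, sigma_prime=1.0, mu_prime=0.0):
    """z-normalize targets, then rescale to the desired scale and bias.

    With sigma_prime=1 and mu_prime=0 this is plain z-normalization
    (zero mean, unit variance); smaller sigma_prime yields the
    low-variance targets whose effect is studied in this paper.
    """
    mu, sigma = y.mean(), y.std()
    return (y - mu) / sigma * sigma_prime + mu_prime

y = np.random.default_rng(0).normal(5.0, 3.0, size=1000)
y_norm = normalize_targets(y)
print(y_norm.mean(), y_norm.std())  # ~0.0 and ~1.0 (up to float error)
```

Decreasing `sigma_prime` below 1 produces exactly the smaller target variances analyzed in the following sections.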
In Sec. 4 we study regression of target functions of the form
where and .
Similar to douglas2018relu, we consider the case where inputs in are distributed as . For any
, we can find a unitary matrix such that and . From this follow the equalities
Since and are identically distributed due to our assumption on , we can equivalently study the target function
and assume that for the remainder of this paper.
For Sec. 5 we consider a ReLU-activated target function
where is the ReLU activation function .
Regression and Objective.
We consider gradient descent (GD) optimization with momentum for the parameters . The update from step to is given as
for the momentum variable and
for the parameters , where
is the loss function, is the rate of momentum, and is the step size.
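Since the update equations themselves are not reproduced here, a sketch may help: GD with momentum in its common heavy-ball form can be written as follows. The variable names and the exact placement of the step size are assumptions about the form of Eqs. (6) and (7):

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    """One heavy-ball update: accumulate the gradient into the momentum
    variable v, then step the parameters theta (assumed form of the
    paper's Eqs. (6) and (7))."""
    v_new = beta * v + grad(theta)
    theta_new = theta - alpha * v_new
    return theta_new, v_new

# minimize f(theta) = 0.5 * theta^2, whose gradient is simply theta
theta, v = np.array([1.0]), np.array([0.0])
for _ in range(500):
    theta, v = momentum_step(theta, v, lambda t: t)
# |theta| shrinks toward the optimum at 0, with damped oscillations
print(float(theta[0]))
```

The damped oscillations visible in the iterates are exactly the behavior formalized via the eigenvalue analysis in Sec. 4.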
Regression Models and Parameterization.
In Sec. 4 we model the respective target function with an affine transform
and in Sec. 5 we consider the nonlinear ReLU model
In both cases, the parameters are weights and bias .
We optimize the mean squared error (MSE), such that
where is the signed error given by for the affine target and for the ReLU-activated target.
To make the gradient calculation easier and interpretable, we approximate the gradients for the ReLU model by replacing with . This is still a reasonable approximation since, for any choice of , we have that if
That is, the error and the gradient are either identical or have the same sign when evaluated at any point and .
To make calculating expected gradients of easier, without introducing any further approximations, we define a unitary matrix (abbreviated ) such that the vector rotated by is mapped to
We refer to the rotated vector as . Again, if , the variables and are identically distributed and for , we also get from Eq. (13)
A ReLU is considered dying if all inputs are negative. For inputs with infinite support, we consider the ReLU as dying if its outputs are non-zero with probability less than some (small) .
4 Regression with affine model
In this section, we analyze the regression of the target function from Eq. (4) with the affine model from Eq. (8). From the perspective of the input and output space, this model is identical to the ReLU model in Eq. (9) for all inputs that map to positive values. From the parameter-space perspective, we will show that the parameter evolution is identical in certain regions. The global optimum is also the same for both functions. This allows us to re-use some of the following results later.
To study the evolution of parameters and momentum during GD optimization, we formulate the GD optimization as an equivalent linear autonomous system (goh2017momentum) and analyze the behavior by inspection of the eigenvalues. For this analysis, we assume that the inputs are distributed as in the training data.
By inserting into Eq. (10), the optimization objective can be formulated as
using new shorthand definitions and . Considering that from Eq. (4), the derivatives are
for the weights and
for the bias . From these results, we can see that both and are zero when arrives at a critical point—in this case the global optimum.
The parameter evolution as given by Eqs. (6) and (7) can be formulated as a linear autonomous system in terms of , , and . The state consists of stacked pairs of momentum and parameters. We write the update equations Eq. (6) and Eq. (7) in the form
where the state evolves according to the constant matrix
which determines the evolution of each pair of momentum and parameter independently of other pairs.
Since the pairs evolve independently, we can study their convergence by analyzing the eigenvalues of . Given that and are from the ranges defined in Sec. 3, all eigenvalues are strictly inside the right half of the complex unit circle, which guarantees convergence to the ground truth. For step sizes , the eigenvalues will still be inside the complex unit circle, still guaranteeing convergence, but they can be real and negative. This means that parameters will alternate sign at every gradient update, a behavior denoted “ripples” goh2017momentum. Although this sign-switching can cause dying ReLUs in theory, learning rates in practice are usually .
We plot the eigenvalues of in Figure 2 (left) for as increases from , i.e. GD without momentum, towards . We observe that the eigenvalues eventually become complex (), resulting in oscillation goh2017momentum (seen on the right side). As we will show in Sec. 5, the fraction is a good measure of the extent to which the ReLU is dying (smaller means dying), and hence we plot this quantity in particular. Note that the eigenvalues, and thus the behavior, are entirely parameterized by the learning rate and momentum, and are independent of . Thus, we cannot adjust as a function of to make the system behave as in the case . We now continue by showing how these properties translate to the case of a ReLU-activated unit.
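The eigenvalue behavior described above can be checked numerically. The sketch below assumes the standard state-transition matrix of heavy-ball GD on a one-dimensional quadratic with curvature `lam`; the specific values of `alpha` and `lam` are illustrative, not taken from the paper:

```python
import numpy as np

def system_matrix(alpha, beta, lam=1.0):
    """State-transition matrix of one (momentum, parameter) pair under
    heavy-ball GD on a quadratic loss with curvature lam (assumed form):
    m' = beta*m + lam*w,  w' = w - alpha*m'."""
    return np.array([[beta, lam],
                     [-alpha * beta, 1.0 - alpha * lam]])

for beta in (0.0, 0.5, 0.9):
    eig = np.linalg.eigvals(system_matrix(alpha=0.1, beta=beta))
    print(f"beta={beta}: |eig|_max={np.abs(eig).max():.3f}, "
          f"complex={bool(np.any(np.iscomplex(eig)))}")
```

As momentum grows, the eigenvalues become complex (oscillation) while their modulus stays below one (convergence), in line with Figure 2.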
5 Regression with single ReLU-unit
We now want to understand the behavior of regressing to Eq. (5) with the ReLU model in Eq. (9). As discussed in Sec. 3, we approximate the gradients by considering the linear target function in Eq. (4). Although this target cannot be fully recovered, the optimal solution is the same and the gradients share similarities, as previously discussed. Again, we consider the evolution and convergence of parameters and momentum during GD optimization, and assume that the inputs are distributed as in the training data.
Similarity to affine model.
The ReLU will output non-zero values for any that satisfies . We can equivalently write this condition as
From our assumption about the distribution of the training inputs, it follows that .
With this result, we can compute the probability of a non-zero output from the ReLU as
Using Eq. (15), we see that a dead ReLU is equivalent to . This is equivalent to , which defines a “dead” cone in parameter space. We can also formulate a corresponding “linear” cone. In this case we get , which is the same cone mirrored along the -axis. In the linear cone, because of the similarity to the affine model, we know that the parameters evolve as described in Sec. 4, with increased oscillations as momentum is used. We will now investigate the analytic gradients to see how these properties translate into that perspective.
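Under the Gaussian input assumption, the probability of a non-zero output has a simple closed form, which makes the two cones easy to probe numerically. A small sketch (the function name `prob_active` is ours):

```python
import math
import numpy as np

def prob_active(w, b):
    """P(w.x + b > 0) for x ~ N(0, I): the preactivation w.x is
    N(0, ||w||^2), so the probability is Phi(b / ||w||), where Phi is
    the standard normal CDF (written here via the error function)."""
    return 0.5 * (1.0 + math.erf(b / (np.linalg.norm(w) * math.sqrt(2.0))))

w = np.array([1.0, 1.0])
print(prob_active(w, 0.0))           # 0.5: decision boundary through the origin
print(prob_active(w, -10.0) < 1e-3)  # True: deep inside the "dead" cone
```

Large negative values of the ratio between bias and weight norm put the unit in the dead cone; large positive values put it in the linear cone.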
By inserting into Eq. (10), the optimization objective can be formulated as
where we use the indicator function to model the ReLU activation. The derivatives are
for the weights and
By changing base using , we can compute the derivatives,
for the weights in dimensions , and
for dimension , where we used as density function of the standard normal distribution. For the bias we have
Full derivations are listed in Appendix B.
Critical points in parameter space.
With the gradients from above, we can analyze the possible parameters (i.e. optima and saddle points) to which GD can converge, namely those where the derivatives are zero. First of all, the global optimum is easily verified to have zero gradient when and .
Saddle points correspond to dead ReLUs and occur when , since then , , and . This equals the case that for any , which can be verified by plugging these values into Eq. (24). These saddle points form the center of the dead cone. Note that in practice these limits already occur at, for example, , since then . The implication is that the entire dead cone can be considered as saddle points in practice, and that parameters will converge on the surface of the cone rather than in its interior.
For , we instead have and thus and , and the gradients can be verified to equal those of the affine model in Sec. 4, as expected. This verifies that the parameter evolution in the linear cone is approximately identical to that of the affine model.
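The claim that gradients vanish inside the dead cone can also be checked by Monte Carlo estimation rather than analytically. The sketch below uses an illustrative affine target (the choice of `w_star`, `b_star` is ours):

```python
import numpy as np

def mse_grads(w, b, w_star, b_star, n=100_000, seed=0):
    """Monte Carlo estimate of the MSE gradients of a single ReLU unit
    relu(w.x + b) regressed onto targets w_star.x + b_star, x ~ N(0, I)."""
    x = np.random.default_rng(seed).standard_normal((n, w.size))
    pre = x @ w + b
    err = np.maximum(pre, 0.0) - (x @ w_star + b_star)
    active = (pre > 0).astype(float)  # ReLU derivative is 0 when inactive
    gw = 2.0 * (err * active)[:, None] * x
    gb = 2.0 * err * active
    return gw.mean(axis=0), gb.mean()

w_star, b_star = np.array([1.0, 0.0]), 0.0
# deep inside the dead cone: the unit never fires on the sample,
# so both gradient estimates are exactly zero
gw, gb = mse_grads(np.array([0.1, 0.0]), -5.0, w_star, b_star)
print(np.allclose(gw, 0.0), gb == 0.0)  # True True
```

Even though the error itself is large in the dead cone, the zero ReLU derivative blocks any gradient signal, which is exactly why these points act as saddle points in practice.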
Simplification to enable visualization.
To continue our investigation of the parameter evolution and, in particular, to focus on how the target variance and momentum drive parameters into the dead cone, we make some simplifications. We assume that , which enables us to express the gradients without as
For the weight we can prove, see Appendix C, that
where and . For the bias we get
This means that if for all , then only the weight and bias evolve. We can now plot the vector fields in these parameters to see how they change w.r.t. .
Influence of on convergence without momentum.
The first key take-away when decreasing is that the global optimum moves closer to the origin and eventually lies between the dead and the linear cone. This location of the optimum is particularly sensitive, since in this case the parameters in the linear cone evolve towards the dead cone and, in addition, exhibit oscillatory behavior for large . The color scheme in Figure 3 verifies that, like the probability of non-zero outputs, the gradients also tend to zero in the dead cone. In the lower right quadrant, we can see an attracting line that is shifted towards the dead cone as decreases, eventually ending up inside the dead cone. In this case, when , we can follow the lines to see that most parameters originating from the right side end up in the dead cone. Parameters originating in and near the linear cone approach the ground truth, and those in the lower left quadrant evolve first towards the linear cone and then towards the ground truth.
When adding momentum, recall that the parameter evolution in and near the linear cone exhibits oscillatory behavior. Our hypothesis at this point is that parameters originating from the linear cone can oscillate over into the dead cone and get stuck there. Parameters from the other regions either evolve directly into the dead cone as before, or first into the linear cone and then into the dead cone by oscillation. We will now evaluate and visualize this.
Regions that converge to global optimum.
We are interested in distinguishing the regions from which parameters converge to the global optimum and to the dead cone, respectively. For this, we apply the update rules, Eqs. (6) and (7), with , until the updates are smaller (in norm) than .
Figure 4 shows the results for GD without momentum (). We see that the region that converges to the dead cone changes with , and eventually switches sign, when becomes small. The majority of initializations still converges at the ground truth.
Figure 5 shows the results with momentum. The linear autonomous system in Sec. 4 with has complex eigenvalues for , which lead to oscillation. This property approximately translates to the ReLU in the linear cone, where we expect oscillations for . Indeed, we observe the same results as without momentum for , but worse results for larger . Eventually, only small regions of initializations converge to the global optimum.
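This experiment can be reproduced in miniature. The sketch below regresses a single ReLU unit onto down-scaled targets with and without momentum; all hyperparameters, the target form, and the initialization are illustrative choices of ours, so this is a qualitative demonstration only:

```python
import numpy as np

def train_relu(w0, b0, sigma_p, alpha=0.1, beta=0.0, steps=2000, seed=1):
    """GD with momentum on relu(w*x + b) regressed onto scaled targets
    sigma_p * x, with x ~ N(0, 1) (fixed sample batch, heavy-ball form)."""
    x = np.random.default_rng(seed).standard_normal(10_000)
    w, b, vw, vb = w0, b0, 0.0, 0.0
    for _ in range(steps):
        pre = w * x + b
        err = np.maximum(pre, 0.0) - sigma_p * x
        act = pre > 0
        gw, gb = 2 * np.mean(err * act * x), 2 * np.mean(err * act)
        vw, vb = beta * vw + gw, beta * vb + gb
        w, b = w - alpha * vw, b - alpha * vb
    return w, b

# final (w, b) without momentum, then with heavy momentum; with a small
# target scale, momentum tends to push the unit towards the dead cone
print(train_relu(1.0, 0.5, sigma_p=0.01, beta=0.0))
print(train_relu(1.0, 0.5, sigma_p=0.01, beta=0.95))
```

Comparing the final bias-to-weight ratio of the two runs gives a rough, single-trajectory analogue of the region plots in Figures 4 and 5.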
6 Deeper architectures
In this section we address two questions. (1) Does the problem persist in deeper architectures, including residual networks with batch normalization? (2) Since the ReLU is a linear operator for positive coefficients, can we simply scale the weight initialization and learning rate to obtain an equivalent network and learning procedure? For the latter, we will show that this is not possible for deeper architectures.
Relevance for models used in practice.
We performed regression on the same datasets as in Sec. 1, with . We confirm the findings of lu2019dying that ReLUs are initialized dead in deeper “vanilla” NNs, but not in residual networks, due to the batch normalization layer that precedes the ReLU. Results are shown in Figure 6. We further find that ReLUs die more often, and faster, the deeper the architecture, even for residual networks with batch normalization. We can also conclude that, in these experiments, stochastic gradient descent does not produce more dead ReLUs during optimization.
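A rough version of the dead-at-initialization measurement can be sketched as follows. The architecture, width, and He-style initialization below are our own illustrative choices, not the paper's exact experimental setup:

```python
import numpy as np

def frac_dead_relus(depth, width=32, n_inputs=1000, seed=0):
    """Fraction of ReLUs that never activate on a random input batch in
    a randomly initialized 'vanilla' net with He-style init (sketch)."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_inputs, width))
    dead = total = 0
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        pre = h @ W  # zero bias at init
        h = np.maximum(pre, 0.0)
        # a unit is dead on this batch if its preactivation never exceeds 0
        dead += int(np.sum(pre.max(axis=0) <= 0))
        total += width
    return dead / total

# the dead fraction tends to grow with depth
print(frac_dead_relus(3), frac_dead_relus(30))
```

This mirrors the qualitative trend reported above: the deeper the vanilla network, the larger the fraction of ReLUs that start out dead.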
For , the ReLU function has the property and thus . By rescaling the weight and bias initializations (but not the learning rate) by , the parameter trajectories during learning will be proportional regardless of the choice of . That is, ReLUs will die independently of . This is a special case, however: for any architecture with one or more hidden layers, it is not possible to multiply the parameters by a single scalar such that the function is identical to . A proof is provided in Appendix D. Moreover, we still have a problem if we do not know in advance, as is the case, for example, in the reinforcement learning setting.
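Both halves of this argument can be verified numerically. The sketch below first checks the positive homogeneity of the ReLU, and then shows for a one-hidden-layer network (with our own illustrative shapes and initialization) that scaling every parameter by a single scalar does not rescale the function:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(0)
x = rng.standard_normal(5)

# positive homogeneity: relu(a*z) = a*relu(z) for any a >= 0
a, z = 0.01, rng.standard_normal(5)
print(np.allclose(relu(a * z), a * relu(z)))  # True

# one hidden layer: scaling every parameter by c scales the hidden-layer
# contribution by c^2 but the output bias only by c, so no single scalar
# c reproduces a uniform rescaling of the function
W1, b1 = rng.standard_normal((8, 5)), np.ones(8)
w2, b2 = rng.standard_normal(8), rng.standard_normal()
f = lambda c: (c * w2) @ relu(c * (W1 @ x + b1)) + c * b2
print(np.allclose(f(0.1), 0.1 * f(1.0)))  # False in general
```

The mismatch comes from the two weight layers each contributing a factor of the scalar, while the output bias contributes only one, which is the essence of the claim proved in Appendix D.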
7 Conclusion and future work
Target normalization is indeed a well-motivated and widely used practice, although we believe a theoretical understanding of why it is needed is both important and lacking. We take a first stab at the problem by studying the smallest moving part of a neural network: a single ReLU-activated affine transformation. Gradient descent on the parameters of an affine transformation can be expressed as a discrete-time linear system, for which we have tools to explain the behavior. We provide a geometrical understanding of how these properties translate to the ReLU case, and of when the ReLU is considered dead. We further illustrated that weight initializations from large regions in parameter space lead to dying ReLUs as the target variance is decreased in combination with momentum optimization, and showed that the problem is still relevant, and even aggravated, for deeper architectures. Remaining questions include how to extend the analysis from the single ReLU to the full network. There, the implicit target function of a single ReLU is likely neither linear nor piecewise linear, and the inputs are not Gaussian distributed and vary over time.
This work was funded by the *********** through the project ************.
Appendix A Regression experiment details
NN used for regression was of the form
where and where is the dimensionality of the input data. All elements in and were initialized from . The elements of and were initialized from . The batch size was and the Adam optimizer was used with the parameters suggested by adam. Every experiment was run times with different seeds, affecting the random ordering of mini-batches and the initialization of parameters. Each experiment was allowed gradient updates before evaluation. Evaluation was performed on a subset of the training set, as we wanted to investigate the fit to the data rather than generalization.
Appendix B Single ReLU gradients
For brevity we will use in place of below. Subscript notation on non-bold symbols here represents single dimensions of the vector counterpart. The gradient w.r.t. is
Multiplying by from the left and looking at a dimension , we get:
For we instead have
We can calculate the second term with integration by parts
The third term:
By substituting these expressions back into Eq. (34) and simplifying, we obtain Eq. (28). The derivation of the gradients w.r.t. the bias is very similar and is omitted here for brevity.
Appendix C Gradients after convergence in the first dimensions
If we assume the first dimensions of are zero, we can show that the first dimensions of all have gradient zero. We can then also express the gradient without the unitary matrix that otherwise needs to be calculated for every new . We present this as a proposition at the end of this section. First, we need to show some preliminary steps.
where is either or .
As defined in Eq. (13), we have . Since
we must have . Since U is unitary we must have which is solved by . ∎
We have from Lemma C.1 that and lie in the same one-dimensional linear subspace spanned by . Since is unitary and also is in , then so must . This implies that . ∎
then for any vector in the linear subspace spanned by we have
for and where and . The gradient w.r.t. the bias is
Appendix D Scaling of parameters
Before stating the problem, we define an N-hidden layer neural network recursively
Denote the joint collection of parameters , , and as . We write to make the dependence on explicit. We now investigate whether a scaling of can be achieved by scaling all parameters by a single number.
Given some , , there exists no such that
First, denote and factorize into factors
Multiplying by gives:
We finally define .