. Many variant activation functions have been designed to preclude neuron death, yet the humble ReLU continues to be preferred in practice, suggesting these effects are not completely understood Cheng:2020:PDeLU; Djork:2016:ELU; Goodfellow:2013:MaxOut; He:2015:HeInitialization; Ramachandran:2018:SwishNonlinearity.
A pair of preprints from Lu et al. and Shin et al. have argued that, as a ReLU network architecture grows deeper, the probability that it is initialized dead goes to one lu2019:dyingRelu; shin2019:dyingRelu2. If a network is initialized dead, the partial derivatives of its output are all zero, and thus it is not amenable to gradient-based training methods such as stochastic gradient descent LeCun:1998:EfficientBackprop. This means that, for a bounded width, very deep ReLU networks would be nearly impossible to train. Both works propose upper bounds for the probability of living initialization, and suggest new initialization schemes to improve this probability. However, their bounds are complex and the derivations include some non-trivial assumptions.
In this work, we derive simple upper and lower bounds on the probability that a random ReLU network is alive. Our upper bound rigorously proves the result of Lu et al., while our lower bound establishes a new positive result, that a network can grow infinitely deep so long as it grows wider as well. Then, we show that our lower bound is asymptotically tight. That is, the probability of living initialization converges to the lower bound along any path through hyperparameter space such that neither the width nor depth is bounded. Our proof is based on the observation that very deep rectifier networks concentrate all outputs towards a single eigenvalue, stemming from the fact that networks lose information when individual neurons die. We confirm this result by numerical simulation, and sketch a derivation from first principles.
Interestingly, information loss by neuron death furnishes a compelling interpretation of various network architectures, such as residual layers (ResNets), batch normalization, and skip connections He:2016:ResNet; Ioffe:2015:batchNorm; Shelhamer:2017:FCN. This analysis provides a priori means of evaluating various model architectures, and could inform future designs of very deep networks, as well as their biological plausibility.
Based on this information, we propose a simple sign-flipping initialization scheme which guarantees with probability one that the ratio of living training data points is at least $2^{-k}$, where $k$ is the number of layers in the network. Our scheme preserves the marginal distribution of each parameter, while modifying the joint distribution based on the training data. We confirm our results with numerical simulations, suggesting the actual improvement far exceeds the theoretical minimum. We also compare this scheme to batch normalization, which offers similar guarantees.
2 Preliminary definitions
Given an input dimension , with weights and , and input data , a ReLU neuron is defined as
A ReLU layer of with is just the vector concatenation of neurons, which can be written with parameters and , where the maximum is taken element-wise. A ReLU network with layers is , the composition of the layers. The parameters of a network are denoted , and we can write that where is the total number of parameters in the network. To simplify the proofs and notation, we assume that the width of each layer is the same throughout a network, always denoted .
In practice, neural network parameters are often initialized from some random probability distribution at the start of training. This distribution is important to our results. In this work, as in practice, we always assume that follows a symmetric, zero-mean probability density function (PDF). That is, the density of a parameter vector is not altered by flipping the sign of any component. Furthermore, all components of are assumed to be statistically independent and identically distributed (IID), except where explicitly stated otherwise.
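The definitions above can be sketched in a few lines of NumPy. This is only an illustration: the names `relu_layer`, `relu_network` and `init_params`, the symbols `n` (width) and `k` (depth), and the choice of a standard normal as the symmetric, zero-mean distribution are all assumptions made here, not notation taken from the text.

```python
import numpy as np

def relu_layer(x, W, b):
    """One ReLU layer: the element-wise maximum of W x + b and zero."""
    return np.maximum(W @ x + b, 0.0)

def relu_network(x, params):
    """Compose the layers in order; params is a list of (W, b) pairs."""
    for W, b in params:
        x = relu_layer(x, W, b)
    return x

def init_params(n, k, rng):
    """Draw IID standard-normal parameters for k layers of constant width n.

    The standard normal is one example of a symmetric, zero-mean density;
    any other such density would satisfy the stated assumption equally well.
    """
    return [(rng.standard_normal((n, n)), rng.standard_normal(n)) for _ in range(k)]

rng = np.random.default_rng(0)
params = init_params(n=4, k=3, rng=rng)
y = relu_network(rng.standard_normal(4), params)
print(y)
```

Because every layer ends in a rectifier, the output is always element-wise nonnegative, which is the property the upper bound relies on.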
Sometimes we wish to refer to the response of a network layer before the rectifying nonlinearity. The pre-ReLU response of the layer is denoted as , and consists of the first layers composed with the affine part of the layer, without the final maximum. As shorthand, the response of the neuron in the layer is denoted , while the pre-ReLU response is .
Let be the input domain of the network. Let , the image of under the layer, which gives the domain of . Let the random variable denote the input datum, the precise distribution of which is not relevant for this paper. We say that is dead at layer if the Jacobian is the zero matrix, taken with respect to the parameters . (For convenience, assume the convention that a partial derivative is zero whenever , in which case it would normally be undefined.) By the chain rule, is then dead in any layer after as well.
The dead set of a ReLU neuron is
which is a half-space in . The dead set of a layer is just the intersection of the half-spaces of its neurons, a (possibly empty) convex polyhedron. For convenience, the dead set of the neuron in the layer is denoted , while the dead set of the whole layer is . If , and is the least layer in which this occurs, we say that is killed by that layer. A dead neuron is one for which all of is dead. In turn, a dead layer is one in which all neurons are dead. A dead network is one for which the final layer is dead, which is guaranteed if any intermediate layer dies.
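A small sketch may help make the notion of a killed data point concrete. It assumes a point lies in a layer's dead set exactly when every pre-ReLU response of that layer is nonpositive, and that a point stays dead once killed; the function name `living_mask_per_layer` and the row-per-datum layout of `X` are illustrative choices.

```python
import numpy as np

def living_mask_per_layer(X, params):
    """Track which rows of X are still alive after each layer.

    X is an (N, n) array of data points and params is a list of (W, b) pairs.
    Returns one boolean mask per layer.
    """
    alive = np.ones(X.shape[0], dtype=bool)
    masks = []
    H = X
    for W, b in params:
        pre = H @ W.T + b  # pre-ReLU responses, one row per data point
        # A point is killed when no neuron responds positively to it;
        # once dead, it remains dead in every later layer.
        alive = alive & np.any(pre > 0, axis=1)
        masks.append(alive.copy())
        H = np.maximum(pre, 0.0)
    return masks

# Hand-built example: a single layer with W = -I and b = 0.
X = np.array([[1.0, 2.0], [0.5, -1.0]])
params = [(-np.eye(2), np.zeros(2))]
print(living_mask_per_layer(X, params)[0])
```

In this example the first point has only positive features, so negating them leaves no positive response and the point is killed, while the second point survives through its negative feature.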
Finally, let be the probability that an entire network of width features and depth layers is alive, the estimation of which is the central study of this work. For convenience in some proofs, let denote the event that the layer of the network is alive, a function of the parameters . Under this definition, .
3 Upper bound
First, we derive a conservative upper bound on . We know that for all , due to the ReLU nonlinearity. Then, if layer kills , that layer dies independent of the actual value of . Now, if and only if none of the parameters are positive. Since the parameters follow symmetric independent distributions, this implies .
Next, note that layer is alive only if is, or more formally, . The law of total probability yields the recursive relationship
Then, since is independent of , we have the upper bound
For fixed , this is a geometric sequence in , with limit zero. This verifies the claim of Lu et al. lu2019:dyingRelu.
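The geometric decay can be tabulated with a short sketch. It rests on one reading of the argument above, and each part of that reading is an assumption rather than the stated formula: a layer of width `n` has `n * n + n` independent parameters, each is nonpositive with probability 1/2, and the recursion starts at the second layer because only post-ReLU inputs are guaranteed nonnegative. The names `n`, `k`, `layer_kill_probability` and `upper_bound` are illustrative.

```python
def layer_kill_probability(n):
    """Probability that all n*n weights and n biases of a layer are nonpositive.

    Assumes independent, symmetric parameters, so each is nonpositive with
    probability 1/2.
    """
    return 2.0 ** -(n * n + n)

def upper_bound(n, k):
    """Assumed form of the bound: every layer after the first must avoid the kill event."""
    return (1.0 - layer_kill_probability(n)) ** (k - 1)

for k in (1, 10, 100, 1000):
    print(k, upper_bound(2, k))
```

For a fixed width the printed values shrink toward zero as the depth grows, which is the qualitative behavior described above; widening the network slows the decay sharply because the kill probability falls off with the number of parameters.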
4 Lower bound
The good news is, we can bring our networks back to life again if they grow deeper and wider simultaneously. Our insight is to show that, while a deeper network has a higher chance of dead initialization, a wider network has a lower chance. We derive a lower bound for this probability from which we compute the optimal width for a given network depth. The basic idea is this: in a living network, for all . Recall from equation 1 that the dead set of a neuron is a sub-level set. Since each neuron is continuous, we know that its dead set is closed, and its complement is open. Thus, given any point , implies that there exists some neighborhood around which is also not in . This means that a layer is alive so long as it contains a single living point, thus a lower bound for the probability of a living layer is the probability of a single point being alive.
Define the quantity , where is the dead set of some neuron. Surprisingly, does not depend on the value of . Given some symmetric distribution of the parameters, we have
Now, the surface can be rewritten as , a hyperplane in . Since this surface has Lebesgue measure zero, it has negligible contribution to the integral. Combining this with the definition of a PDF, we have
Then, change variables to , with Jacobian determinant , and apply the symmetry of to yield
Thus , so . The previous calculation can be understood in a simple way devoid of any formulas: for any half-space containing , the closure of its complement is equally likely, and with probability one, exactly one of these sets contains . Thus, the probability of drawing a half-space containing is the same as the probability of drawing one that does not, so each must have probability 1/2.
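The symmetry argument is easy to check numerically. The sketch below draws random neurons with standard-normal weights and bias (one assumed example of a symmetric distribution) and estimates how often a fixed point falls in the dead half-space; the function name `dead_probability` and the test points are illustrative.

```python
import numpy as np

def dead_probability(x, num_samples, rng):
    """Monte Carlo estimate of the chance that a random neuron is dead at x."""
    n = x.shape[0]
    a = rng.standard_normal((num_samples, n))  # one weight vector per sample
    b = rng.standard_normal(num_samples)       # one bias per sample
    return float(np.mean(a @ x + b <= 0))

rng = np.random.default_rng(0)
for x in (np.zeros(3), np.array([10.0, -3.0, 0.5])):
    print(x, dead_probability(x, 200_000, rng))
```

Both estimates should sit close to one half regardless of where the point lies, matching the claim that the quantity does not depend on the chosen point.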
Finally, if any of the neurons in layer is alive at , then the whole layer is alive. Since these are independent events with probability , . It follows from equation 2 that
From this inequality, we can compute the width required to achieve as
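As a worked illustration, the lower bound and the width it implies can be computed directly. The sketch assumes the bound takes the form `(1 - 2**-n)**k`, which follows from each of the `n` neurons in a layer being dead at a given point independently with probability 1/2, applied over `k` layers; the symbols `n`, `k`, `p` and both function names are chosen here for illustration.

```python
import math

def lower_bound(n, k):
    """Probability that one fixed point survives k layers of width n."""
    return (1.0 - 2.0 ** -n) ** k

def required_width(k, p):
    """Smallest integer width whose lower bound is at least p, for 0 < p < 1.

    Solving (1 - 2**-n)**k >= p for n gives n >= -log2(1 - p**(1/k)).
    """
    n = math.ceil(-math.log2(1.0 - p ** (1.0 / k)))
    return max(n, 1)

for k in (10, 100, 1000):
    n = required_width(k, 0.9)
    print(k, n, lower_bound(n, k))
```

The required width grows only logarithmically with the depth, which is why a network can keep getting deeper as long as it also widens slowly.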
5 Tightness of the bounds
Combining the results of the previous sections, we have the bounds
Let denote the lower bound, and the upper. We now show that these bounds are asymptotically tight, as is later verified by our experiments. More precisely, for any , with . Furthermore, along any path such that . This means that these bounds are accurate for the extreme cases of a single-layer network, or a very deep network.
Tightness of the upper bound for
First we show that , assuming .
If is full-rank, then we can solve for some such that . Thus Since this is a polynomial, the set has Lebesgue measure zero. Integrating over the PDF of the parameter matrix, we have that .
Next, we consider general . For any , we know , which depends on the geometry of . In general, as expands to fill , converges monotonically to . For example, if is a ball of radius , then .
Tightness of the lower bound as
We now turn our attention to the lower bound. It is vacuous to study with fixed , as both bounds converge to zero in this case. Similarly, both converge to one as for fixed . Instead, we consider any path through hyperparameter space, which essentially states that both and grow without limit. For example, consider equation 8 for any , which holds .
First, we require an assumption about the data and parameter distributions.
Let denote the vector of standard deviations of each element of , which varies with respect to the input data , conditioned on parameters . Let . We assume the sum of normalized conditional variances, taken over all living layers as
This basically states that the normalized variance decays rapidly with the number of layers , and it does not grow too quickly with the number of features . The assumption holds in the numerical simulations of section 8. We also provide a sketch of its derivation from some common assumptions about the data and parameters.
Assume the input features have zero mean and identity covariance matrix, i.e. the data have undergone a whitening transformation. Furthermore, assume that the network parameters have variance , as in the Xavier initialization scheme Glorot:2010:XavierInit. Let denote the output of some neuron in the first layer, with , and parameters stripped of their indices accordingly. Then we have
This shows that the pre-ReLU variance does not increase with each layer. A similar computation can be applied to later layers, by canceling terms involving uncorrelated features.
To show that the variance actually decreases rapidly with , we must factor in the ReLU nonlinearity. Let denote the event that the input data is alive after some neuron. The basic intuition is that, as data progresses through a network, information is lost when neurons die, which occurs with probability . For a deep enough network, the response of each data point is essentially just , which we call the eigenvalue of the network.
The law of total variance gives
Now, . Also, , as removing the negative half of the output space can only decrease the variance. Thus
Also, and , since outputs of zero do not contribute to the mean. Then
Putting this all together,
This tells us that the variance decays in each layer by a factor of approximately , the ratio of living data. Furthermore, we know from section 4 that , and experiments suggest that the other term is bounded in expectation. This recalls the factor of 2 used in He initialization He:2015:HeInitialization. We defer a full proof for later work.
Now, we show the tightness of the lower bound based on assumption 5.1. To simplify the proof, we switch to the perspective of a discrete IID dataset . Let denote the number of living data points at layer , and let . Then, we show that as .
Partition the parameter space into the following events, conditioned on :
: All remaining data lives at layer :
: Only some of the remaining data lives at layer :
: All the remaining data is killed: .
Now, as the events are disjoint, we can write
Now, conditioning on the complement of means that either all the remaining data is alive or none of it. If none of it, then flipping the sign of any neuron brings all the remaining data to life with probability one. As in section 4, symmetry of the parameter distribution ensures all sign flips are equally likely. Thus . Since and partition , we have
Recall the lower bound Expanding terms in the recursion, and letting be the analogous event for layer , we have that
Now, implies that for some while . But, if and only if . Applying the union bound across data points followed by Chebyshev’s inequality, we have that
This is a geometric series in . By assumption 5.1, with .
6 Sign-flipping initialization scheme and batch normalization
In this section, we propose a slight deviation from the usual IID initialization which partially circumvents the issue of dead neurons. We switch to a discrete point of view, assuming some finite training dataset . Notice that negating the parameters of a layer revives any data points killed by it. Thus, starting from the first layer, we can count how many data points are killed, and if more than half of the previously living points are killed, negate the parameters of that layer. This alters the joint parameter distribution based on the available data, while preserving the marginal distribution of each parameter. Furthermore, the scheme is practical, as the cost is essentially the same as a forward pass over the training set.
For a -layer network, this scheme guarantees that at least data points live. The one caveat is that the output before the ReLU activation function at layer cannot be all zeros. However, this edge case has probability zero since depends on a polynomial in the parameters of layer , the roots of which have measure zero. Batch normalization provides a similar guarantee: if for all , then with probability one there is at least one living data point Ioffe:2015:batchNorm. Similar to our sign-flipping scheme, this prevents total network death but still permits individual neurons to die. The two schemes are simulated and compared in section 8.
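The sign-flipping procedure described in this section can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `sign_flip_init`, the normal distribution with variance `2 / n` for both weights and biases, and the constant width equal to the number of input features are all assumptions.

```python
import numpy as np

def sign_flip_init(X, k, rng):
    """Initialize k layers, negating any layer that kills more than half
    of the data points that were alive on entry.

    X is an (N, n) array of training data; returns (params, alive mask).
    """
    n = X.shape[1]
    alive = np.ones(X.shape[0], dtype=bool)
    params, H = [], X
    for _ in range(k):
        # Symmetric, zero-mean draws; the scale 2/n is an assumed choice.
        W = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)
        b = rng.standard_normal(n) * np.sqrt(2.0 / n)
        pre = H @ W.T + b
        survives = alive & np.any(pre > 0, axis=1)
        killed = alive.sum() - survives.sum()
        if 2 * killed > alive.sum():
            # Negating the layer revives every point it killed.
            W, b, pre = -W, -b, -pre
            survives = alive & np.any(pre > 0, axis=1)
        params.append((W, b))
        alive = survives
        H = np.maximum(pre, 0.0)
    return params, alive

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 4))
params, alive = sign_flip_init(X, k=6, rng=rng)
print(alive.mean())
```

Since each layer keeps at least half of the points that reach it, at least one of the 64 points here must survive six layers. The cost is one extra pass over the data per flipped layer, and the flip leaves each parameter's marginal distribution unchanged because that distribution is symmetric.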
7 Other network architectures
Up to this point we focused on fully-connected ReLU networks. In this section we briefly discuss how our results can be generalized to other network types. Interestingly, we see that many empirical design choices have a theoretical explanation in terms of preventing neuron death.
7.1 Convolutional Networks
Convolutional neural networks (CNNs) are perhaps the most popular type of artificial neural network. Specialized for image and signal processing, these networks have most of their ReLU neurons constrained to perform convolutions. Keeping with our previous conventions, we say that a convolutional neuron takes in feature maps and outputs
where the maximum is again taken element-wise. In the two-dimensional case, and . By the Riesz Representation Theorem, discrete convolution can be represented by multiplication with some matrix . Since is a function of the much smaller matrix , we need to rework our previous bounds in terms of the dimensions and . Although we omit the details for space, it is not difficult to show by similar arguments that
Compare this to inequality 9. As with fully-connected networks, the lower bound depends on the number of neurons, while the upper bound depends on the total number of parameters.
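The matrix representation of discrete convolution mentioned above can be illustrated in one dimension. The sketch builds the banded matrix explicitly and checks it against `np.convolve`; the function name `convolution_matrix` and the "valid" boundary handling are choices made for this example only.

```python
import numpy as np

def convolution_matrix(w, n):
    """Matrix M with M @ x == np.convolve(x, w, mode="valid") for len(x) == n."""
    w = np.asarray(w, dtype=float)
    m = len(w)
    M = np.zeros((n - m + 1, n))
    for i in range(n - m + 1):
        # np.convolve flips the kernel, so row i holds w reversed, shifted by i.
        M[i, i:i + m] = w[::-1]
    return M

rng = np.random.default_rng(0)
x = rng.standard_normal(6)
w = np.array([1.0, -2.0, 0.5])
M = convolution_matrix(w, len(x))
print(np.allclose(M @ x, np.convolve(x, w, mode="valid")))
```

The matrix has many entries but only as many free parameters as the kernel, which is why the bounds for convolutional layers need to be restated in terms of the kernel dimensions.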
7.2 Residual networks and skip connections
Residual networks (ResNets) are composed of layers which add their input features to the output He:2016:ResNet. Residual connections do not prevent neuron death, as attested by other recent work arnekvist:2020:dyingReluMomentum. However, they do prevent a dead layer from automatically killing any later ones, by creating a path around a dead neuron. This could explain how residual connections allow deeper networks to be trained He:2016:ResNet. It may also affect the output variance and prevent information loss.
A related design is the use of skip connections, as in fully-convolutional networks Shelhamer:2017:FCN. For these architectures, the probability of network death is a function of the shortest path through the network. In the popular U-Net architecture, with chunk length , the shortest path has a depth of only Ronneberger:2015:Unet.
7.3 Other nonlinearities
Besides the ReLU, other nonlinearities including leaky ReLU, “swish” and hyperbolic tangent are sometimes used Djork:2016:ELU; Ramachandran:2018:SwishNonlinearity. For all of these functions, part of the input domain is much less sensitive than others. Our theory can be extended to all the ReLU variants by replacing dead data points with weak ones, i.e. those with small gradients. Given that gradients are averaged over a mini-batch, weak data points are likely equivalent to dead ones in practice Goodfellow:2016:DLBook.
8 Numerical simulations
In this section, we report numerical simulations verifying our analytic bounds. First, for each number of features , we randomly generated data points from standard normal distributions, which resemble a real dataset after whitening. Then, we randomly initialized fully-connected ReLU networks having layers. Network parameters were normally distributed, with variance as in He initialization He:2015:HeInitialization. We counted the number of living data points for each network, and repeated this experiment with batch normalization and sign flipping. Finally, we plotted these results in figure 1 alongside the upper and lower bounds from inequality 9. Then, to show asymptotic tightness of the lower bound, we repeated the experiment using the hyperparameters defined by equation 8, with IID parameters, for various values of , shown in figure 2.
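A scaled-down version of this experiment can be sketched as follows. The sample sizes, the width `n = 3`, the depth `k = 5`, the normal draws with variance `2 / n` for both weights and biases, and the function name `living_fraction` are assumptions chosen so the example runs quickly; they are not the settings behind the figures. The comparison value assumes the lower bound has the form `(1 - 2**-n)**k`.

```python
import numpy as np

def living_fraction(n, k, num_points, num_networks, rng):
    """Average fraction of data points still alive after k layers of width n."""
    fractions = []
    for _ in range(num_networks):
        X = rng.standard_normal((num_points, n))  # whitened-looking data
        alive = np.ones(num_points, dtype=bool)
        H = X
        for _ in range(k):
            W = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)
            b = rng.standard_normal(n) * np.sqrt(2.0 / n)
            pre = H @ W.T + b
            # A point dies when no neuron in the layer responds positively.
            alive &= np.any(pre > 0, axis=1)
            H = np.maximum(pre, 0.0)
        fractions.append(alive.mean())
    return float(np.mean(fractions))

rng = np.random.default_rng(0)
n, k = 3, 5
estimate = living_fraction(n, k, num_points=100, num_networks=500, rng=rng)
bound = (1.0 - 2.0 ** -n) ** k
print(estimate, bound)
```

With IID parameters the estimated fraction of living data should land near the assumed lower bound, since that bound comes from the survival probability of a single point.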
We draw two main conclusions from these graphs. First, the simulations agree with our theoretical bounds. Note that in all cases, and with increasing and . Furthermore, the conditional variance decreases exponentially in figure 2, confirming assumption 5.1 and yielding a tight lower bound. Second, sign-flipping significantly increases the proportion of living data, as seen in figure 1. In section 6 we stated that with sign-flipping, the fraction of living training data points is at least . Note that for IID data , as that bound comes from the probability that any random data point is alive. Sign flipping significantly improves on both lower bounds, while batch normalization hardly exceeds the IID case.
This work aims to contribute to the theoretical understanding of artificial neural networks, and could lead to practical improvements in future deep learning models. However, it does not propose any specific applications of these technologies. Therefore, in our estimation, it raises no new ethical considerations.