The rectified linear unit (ReLU) is widely used as an activation function in deep learning (Ramachandran et al., 2017; Nair and Hinton, 2010). The success of ReLU is based on its superior training performance (Glorot et al., 2011; Sun et al., 2015) over other activation functions such as the logistic sigmoid and the hyperbolic tangent (Glorot and Bengio, 2010; LeCun et al., 1998). The ReLU has been used in various applications including image classification (Krizhevsky et al., 2012; Szegedy et al., 2015), speech recognition (Hinton et al., 2012), and game intelligence (Silver et al., 2016), to name a few.
The use of gradient-based optimization is inevitable in training deep neural networks. It is widely known that the deeper a neural network is, the harder it is to train (Srivastava et al., 2015; Du et al., 2018a). A fundamental difficulty in training deep neural networks is the vanishing and exploding gradient problem (Poole et al., 2016; Hanin, 2018; Chen et al., 2018). The dying ReLU is a kind of vanishing gradient, in which ReLU neurons become inactive and output only 0 for any input. It is known to be one of the obstacles in training deep ReLU neural networks (Trottier et al., 2017; Agarap, 2018). To overcome this problem, a number of methods have been proposed. Broadly speaking, these can be categorized into three general approaches.
One approach modifies the network architecture. This includes, but is not limited to, changes in the number of layers, the number of neurons, network connections, and activation functions. In particular, many activation functions have been proposed to replace the ReLU (Maas et al., 2013; He et al., 2015; Clevert et al., 2015; Klambauer et al., 2017). However, the performance of other activation functions varies across tasks and data sets (Ramachandran et al., 2017), and they typically require a parameter to be tuned. Thus, the ReLU remains one of the popular activation functions due to its simplicity and reliability. Another approach introduces additional training steps. This includes several normalization techniques (Ioffe and Szegedy, 2015; Salimans and Kingma, 2016; Ulyanov et al., 2016; Ba et al., 2016; Wu and He, 2018) and dropout (Srivastava et al., 2014). One of the most successful normalization techniques is batch normalization (Ioffe and Szegedy, 2015). It inserts layers into the deep neural network that transform the output for the batch to have zero mean and unit variance. However, batch normalization increases the computational overhead of each iteration by 30% (Mishkin and Matas, 2016). The third approach modifies only the initialization procedure for weights and biases, without changing the network architecture or introducing additional training steps (LeCun et al., 1998; Glorot and Bengio, 2010; He et al., 2015; Saxe et al., 2014; Mishkin and Matas, 2016). The third approach is the topic of the work presented in this paper.
The intriguing ability of gradient-based optimization is perhaps one of the major contributors to the success of deep learning. Training deep neural networks using gradient-based optimization falls into the category of nonconvex nonsmooth optimization. Since a gradient-based method is either a first- or a second-order method, once it has converged, the point it returns is either a local minimum or a saddle point. The authors of (Fukumizu and Amari, 2000) proved that the existence of local minima poses a serious problem in training neural networks. Many researchers have put immense effort into mathematically understanding the gradient method and its ability to solve nonconvex nonsmooth problems. Under various assumptions, especially on the landscape, many results claim that the gradient method can find a global minimum, can escape saddle points, and can avoid spurious local minima (Lee et al., 2016; Amari et al., 2006; Ge et al., 2015, 2016; Zhou and Liang, 2017; Wu et al., 2018; Yun et al., 2018; Du et al., 2017, 2018b, 2018a; Jin et al., 2017). However, these assumptions do not always hold and are provably false for deep neural networks (Safran and Shamir, 2018; Kawaguchi, 2016; Arora et al., 2018). This further limits our understanding of what contributes to the success of deep neural networks. Often, the theoretical conditions are impossible to meet in practice.
Where the optimization process starts plays a critical role in training and has a significant effect on the trained result (Nesterov, 2013). This paper focuses on a particular kind of bad local minimum caused by a bad initialization. Such a bad local minimum causes the dying ReLU. We focus on the worst case of dying ReLU, where ReLU neurons at a certain layer are all dead, i.e., the entire network dies. We refer to this as the dying ReLU neural network (NN). This phenomenon is well illustrated by a simple example. Suppose $f(x) = |x|$ is a target function we want to approximate using a ReLU network. Since $|x| = \text{ReLU}(x) + \text{ReLU}(-x)$, a 2-layer ReLU network of width 2 can exactly represent $f$. However, when we train a deep ReLU network, we frequently observe that the network collapses. This trained result is shown in Fig. 1. Our 1,000 independent simulations show that there is a high probability (more than 90%) that the deep ReLU network collapses to a constant function. In this example, we employ a 10-layer ReLU network of width 2, which should be able to recover $f$ perfectly.
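To make the representability claim concrete, the following minimal sketch assumes the target is $|x|$ (one function that a 2-layer ReLU network of width 2 represents exactly) and writes it as such a network with hand-picked weights. The function name `abs_via_relu` is hypothetical and used only for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def abs_via_relu(x):
    """|x| written as a 2-layer (one hidden layer) ReLU network of width 2."""
    x = np.asarray(x, dtype=float).reshape(1, -1)  # shape (d_in = 1, n_points)
    W1 = np.array([[1.0], [-1.0]])                 # hidden layer maps x to (x, -x)
    b1 = np.zeros((2, 1))
    W2 = np.array([[1.0, 1.0]])                    # output layer sums both hidden units
    b2 = np.zeros((1, 1))
    return (W2 @ relu(W1 @ x + b1) + b2).ravel()

xs = np.linspace(-2.0, 2.0, 9)
print(np.allclose(abs_via_relu(xs), np.abs(xs)))
```

The point of the example in the text is that this exact representation exists, yet gradient-based training of a much deeper network of the same width often fails to find anything better than a constant.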
Almost all common initialization schemes in training deep neural networks use symmetric probability distributions around 0. For example, zero-mean uniform distributions and zero-mean normal distributions were proposed and used in (LeCun et al., 1998; Glorot and Bengio, 2010; He et al., 2015). We show that when weights and biases are initialized from symmetric probability distributions around 0, the dying ReLU NN occurs in probability as the depth goes to infinity. To the best of our knowledge, this is the first theoretical work on the dying ReLU. This result explains why the deeper a network is, the harder it is to train. Furthermore, it says that the dying ReLU is inevitable as long as the network is deep enough. Our result also implies that it is the network architecture that decides whether an initialization procedure is good or bad. Our analysis reveals that a specific network architecture can avoid the dying ReLU NN with high probability. That is, for any $\delta > 0$, when a symmetric initialization is used and $N = \Omega(\log_2(L/\delta))$ is satisfied, where $L$ is the depth and $N$ is the width of each layer, the dying ReLU NN will not occur with probability at least $1 - \delta$.
Although there are other approaches to avoid the dying ReLU, we aim to overcome it without changing the network architecture or introducing additional training steps such as normalization. Changing the initialization procedure is perhaps one of the simplest treatments. We thus propose a new initialization procedure, namely, a randomized asymmetric initialization. The new initialization is designed to directly overcome the dying ReLU. We show that our initialization has a smaller upper bound on the probability of the dying ReLU NN. All parameters used in our initialization are theoretically chosen to avoid the exploding gradient problem. This is done by a second moment analysis in which we derive the expected length map relations between layers (He et al., 2015; Hanin, 2018; Poole et al., 2016).
The rest of the paper is organized as follows. After setting up notation and terminology in Section 2, we present the main theoretical results in Section 3. In Section 4, upon introducing a randomized asymmetric initialization, we discuss its theoretical properties. Numerical examples are provided in Section 5, before the conclusion in Section 6.
2 Mathematical Setup
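Let $\mathcal{N}^L: \mathbb{R}^{d_{\text{in}}} \to \mathbb{R}^{d_{\text{out}}}$ be a feed-forward neural network with $L$ layers and $N_l$ neurons in the $l$-th layer ($N_0 = d_{\text{in}}$, $N_L = d_{\text{out}}$). Let us denote the weight matrix and bias vector in the $l$-th layer by $W^l \in \mathbb{R}^{N_l \times N_{l-1}}$ and $b^l \in \mathbb{R}^{N_l}$, respectively. Given an activation function $\phi$ which is applied element-wise, the feed-forward neural network is recursively defined as follows: $\mathcal{N}^1(x) = W^1 x + b^1$ and
$$\mathcal{N}^l(x) = W^l \phi(\mathcal{N}^{l-1}(x)) + b^l, \quad 2 \le l \le L.$$
Here $\mathcal{N}^L$ is called an $L$-layer neural network or an $(L-1)$-hidden layer neural network. In this paper, the rectified linear unit (ReLU) activation function is employed, i.e.,
$$\phi(x) = \text{ReLU}(x) := \max(x, 0).$$
Let $\theta = \{W^l, b^l\}_{1 \le l \le L}$ be the set of all weight matrices and bias vectors. Given a set of training data $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{m}$, in order to train $\mathcal{N}^L$, we consider the standard loss function:
$$\mathcal{L}(\theta, \mathcal{T}) = \sum_{i=1}^{m} \ell(\mathcal{N}^L(x_i; \theta), y_i),$$
where $\ell(\cdot, \cdot)$ is a loss criterion. In training neural networks, gradient-based optimization is typically employed to minimize the loss $\mathcal{L}$. The first step of training is to initialize the weight matrices and bias vectors. Typically, they are initialized according to certain probability distributions. For example, uniform distributions around 0 or zero-mean normal distributions are common choices.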
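The recursive definition and the loss can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the first layer is affine and every later layer applies ReLU to the previous output before its own affine map. The squared error is used as one possible loss criterion, and the names `forward` and `squared_error_loss` are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(weights, biases, x):
    """Evaluate the network: N^1(x) = W^1 x + b^1, then N^l = W^l relu(N^{l-1}) + b^l."""
    out = weights[0] @ x + biases[0]
    for W, b in zip(weights[1:], biases[1:]):
        out = W @ relu(out) + b
    return out

def squared_error_loss(weights, biases, xs, ys):
    """Sum of per-sample losses, with the squared error as the loss criterion."""
    return sum(float(np.sum((forward(weights, biases, x) - y) ** 2))
               for x, y in zip(xs, ys))

# Usage: a 2-layer network of width 2 that reproduces |x| exactly, so the loss is zero.
weights = [np.array([[1.0], [-1.0]]), np.array([[1.0, 1.0]])]
biases = [np.zeros(2), np.zeros(1)]
xs = [np.array([t]) for t in (-1.0, 0.5, 2.0)]
ys = [np.abs(x) for x in xs]
print(squared_error_loss(weights, biases, xs, ys))
```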
The dying ReLU refers to the problem in which some ReLU neurons become inactive. In this paper, we focus on the worst case of dying ReLU, where ReLU neurons at a certain layer are all dead, i.e., the entire network dies. We refer to this as the dying ReLU neural network. We then define two phases: (1) a network is dead before training, and (2) a network is dead after training. Phase 1 implies phase 2, but not vice versa. Phase 1 happens only when the ReLU neural network is initialized to be a constant function. When phase 1 happens, we say the network is born dead (BD).
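As a rough diagnostic for the born-dead condition, one can look for a hidden layer whose ReLU output is zero for every input in a sample. The sketch below assumes this layer-wise characterization and checks it only on finitely many sample points, so a reported dead layer is evidence on the sample rather than a proof on the whole domain. The function name `dead_layer_on_samples` is hypothetical.

```python
import numpy as np

def dead_layer_on_samples(weights, biases, xs):
    """Return the 1-based index of the first hidden layer whose ReLU output is
    zero for every sample column of xs, or None if no such layer is found."""
    h = np.asarray(xs, dtype=float)                # shape (d_in, n_samples)
    for layer, (W, b) in enumerate(zip(weights[:-1], biases[:-1]), start=1):
        h = np.maximum(W @ h + b[:, None], 0.0)    # hidden-layer activations
        if not h.any():
            return layer
    return None

xs = np.linspace(-1.0, 1.0, 11).reshape(1, -1)
first = (np.array([[1.0], [-1.0]]), np.zeros(2))
last = (np.array([[1.0, 1.0]]), np.array([0.5]))

# Second hidden layer has only negative weights and biases, so it is always off.
dead_weights = [first[0], np.array([[-1.0, -1.0], [-2.0, -1.0]]), last[0]]
dead_biases = [first[1], np.array([-0.1, -0.1]), last[1]]
print(dead_layer_on_samples(dead_weights, dead_biases, xs))

# Replacing that layer by the identity keeps the network alive on the sample.
alive_weights = [first[0], np.eye(2), last[0]]
alive_biases = [first[1], np.zeros(2), last[1]]
print(dead_layer_on_samples(alive_weights, alive_biases, xs))
```

Once a hidden layer outputs zero for every input, everything downstream depends only on the biases, which is why the network reduces to a constant function.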
3 Theoretical analysis
In this section, we present a theoretical analysis of the dying ReLU neural networks. We show that a deep ReLU network will eventually be BD in probability as the depth goes to infinity.
Let be a ReLU neural network with layers, each having neurons. Suppose that all weights and biases are randomly initialized from probability distributions, which satisfy
for some constant . Then
The proof can be found in Appendix A.
We remark that Equation 3 is a very mild condition and it can be satisfied in many cases. For example, when symmetric probability distributions around 0 are employed, the condition is met with . Theorem 3 implies that the fully connected ReLU network will be dead at the initialization as long as the network is deep enough. This explains theoretically why training a deep network is hard.
Theorem 3 shows that the ReLU network asymptotically will be dead. Thus, we are now concerned with the convergence behavior of the probability of NNs being BD. Since almost all common initialization procedures use symmetric probability distributions around 0, we derive an upper bound of the born dead probability (BDP) for symmetric initialization.
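Let $\mathcal{N}^L(x)$ be a ReLU neural network with $L$ layers, each having $N_1, \dots, N_L$ neurons. Suppose that all weights are independently initialized from symmetric probability distributions around 0 and all biases are either drawn from a symmetric distribution or set to zero. Then
$$\Pr\left(\mathcal{N}^L(x) \text{ is born dead}\right) \le 1 - \prod_{l=1}^{L-1} \left(1 - (1/2)^{N_l}\right). \tag{4}$$
Furthermore, assuming $N_l = N$ for all $l$,
$$\lim_{L \to \infty} \Pr\left(\mathcal{N}^L(x) \text{ is born dead}\right) = 1, \qquad \lim_{N \to \infty} \Pr\left(\mathcal{N}^L(x) \text{ is born dead}\right) = 0.$$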
The proof can be found in Appendix B.
The theorem above provides an upper bound of the BDP. It shows that at a fixed depth $L$, the network will not be BD in probability as the width goes to infinity. In order to understand how this probability behaves with respect to the width and depth, a lower bound is needed. We thus provide a lower bound of the BDP of ReLU NNs at $d_{\text{in}} = 1$.
Let $\mathcal{N}^L(x)$ be a bias-free ReLU neural network with $L$ layers, each having $N$ neurons at $d_{\text{in}} = 1$. Suppose that all weights are independently initialized from continuous symmetric probability distributions around 0. Then
where , , and
The proof can be found in Appendix C.
The lower-bound theorem reveals that the BDP behavior depends on the network architecture. In Fig. 2, we plot the BDP against the number of layers at several widths. A bias-free ReLU feed-forward NN with $d_{\text{in}} = 1$ is employed, with weights randomly initialized from symmetric distributions. Each probability estimate is computed from one million independent simulations. Numerical estimates are shown as symbols. The upper and lower bounds from the two preceding theorems are plotted with dashed and dash-dotted lines, respectively. We see that as the NN gets narrower, the probability of the NN being BD grows faster as the depth increases. Also, at a fixed width $N$, the BDP grows as the number of layers increases. This is expected from the upper- and lower-bound theorems.
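A small-scale version of this experiment can be sketched as follows. The sketch assumes a bias-free network with $d_{\text{in}} = 1$, zero-mean Gaussian weights with He-style variance, and an upper bound of the form $1 - \prod_{l=1}^{L-1}(1 - 2^{-N_l})$. The function names and the trial counts are hypothetical and far smaller than the one million simulations reported in the text.

```python
import numpy as np

def bdp_upper_bound(hidden_widths):
    """Assumed form of the upper bound: 1 - prod_l (1 - 2^{-N_l})."""
    widths = np.asarray(hidden_widths, dtype=float)
    return 1.0 - np.prod(1.0 - 0.5 ** widths)

def estimate_bdp(depth, width, trials=5000, seed=0):
    """Monte Carlo born-dead frequency for a bias-free network with d_in = 1."""
    rng = np.random.default_rng(seed)
    dead = 0
    for _ in range(trials):
        # Without biases the network is positively homogeneous, so the two
        # probes x = +1 and x = -1 decide the behavior on the whole real line.
        h = np.array([[1.0, -1.0]])
        fan_in = 1
        for _ in range(depth - 1):                # hidden layers 1, ..., L-1
            W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(width, fan_in))
            h = np.maximum(W @ h, 0.0)
            fan_in = width
            if not h.any():                       # every neuron is off for both probes
                dead += 1
                break
    return dead / trials

for width in (2, 3, 5):
    print(width, estimate_bdp(10, width, trials=2000), bdp_upper_bound([width] * 9))
```

Because death only depends on the signs of the pre-activations, the weight scale does not affect the estimate. Any symmetric continuous distribution could be swapped in for the Gaussian.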
Once the network is BD, we have no hope of training it successfully. Here we provide a formal statement of the consequence of the network being BD.
Suppose that the feed-forward ReLU neural network is BD. Then, for any loss function $\mathcal{L}$ and for any gradient-based method, the ReLU network is optimized to be a constant function, which minimizes the loss. The proof can be found in Appendix D.
The theorem above implies that no matter which gradient-based optimizer is employed, including stochastic gradient descent (SGD), SGD-Nesterov (Sutskever et al., 2013), AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), RMSProp (Hinton, 2014), Adam (Kingma and Ba, 2015), BFGS (Nocedal and Wright, 2006), and L-BFGS (Byrd et al., 1995), the network is trained to be a constant function which minimizes the loss.
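If online learning or the stochastic gradient method is employed, where the training data are independently drawn from a probability distribution $P_{\mathcal{D}}$, the optimized network is
$$\mathcal{N}^L(x; \theta^*) = \operatorname*{argmin}_{c \in \mathbb{R}^{d_{\text{out}}}} \mathbb{E}\left[\ell(c, y)\right],$$
where the expectation is taken with respect to $(x, y) \sim P_{\mathcal{D}}$. For example, if the $L^2$-loss is employed, i.e., $\ell(\hat{y}, y) = (\hat{y} - y)^2$, the resulting network is $\mathbb{E}[y]$. If the $L^1$-loss is employed, i.e., $\ell(\hat{y}, y) = |\hat{y} - y|$, the resulting network is the median of $y$ with respect to $(x, y) \sim P_{\mathcal{D}}$. Note that the mean absolute error (MAE) and the mean squared error (MSE) used in practice are discrete versions of the $L^1$ and $L^2$ losses, respectively, if the minibatch size is large.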
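The statement about the constant minimizers can be checked numerically. The sketch below uses synthetic, skewed targets (hypothetical data) and scans candidate constants on a grid, since a born-dead network can only output a constant. The best constant under the squared error should sit at the sample mean and the best constant under the absolute error at the sample median, up to the grid resolution.

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed synthetic targets (hypothetical data) so that mean and median differ.
y = rng.exponential(scale=1.0, size=501)

# A born-dead network can only output a constant c, so scan candidate constants.
candidates = np.linspace(y.min(), y.max(), 2001)
residuals = candidates[:, None] - y[None, :]
mse = (residuals ** 2).mean(axis=1)
mae = np.abs(residuals).mean(axis=1)

best_for_mse = candidates[mse.argmin()]
best_for_mae = candidates[mae.argmin()]
print(best_for_mse, y.mean())        # grid minimizer of MSE vs. sample mean
print(best_for_mae, np.median(y))    # grid minimizer of MAE vs. sample median
```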
When we design a neural network, we want the BDP to be small, say, less than 1% or 10%. Then, the upper bound (Equation 4) of Theorem 3 can be used for designing a specific network architecture, which has a small probability of NNs being born dead.
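Suppose $N_l = N$ for all $l$. For fixed depth $L$ and $\delta > 0$, if the width is $N = \Omega(\log_2(L/\delta))$, then with probability exceeding $1 - \delta$, the ReLU neural network will not be initialized to be dead. This readily follows from requiring the upper bound of Equation 4 to be below $\delta$:
$$1 - \left(1 - 2^{-N}\right)^{L-1} \le (L-1)\, 2^{-N} < \delta.$$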
As a practical guide, we constructed the diagram shown in Fig. 3, which includes both theoretical predictions and our numerical tests. We see that as the number of layers increases, the numerical tests match the theoretical results more closely. It is clear from the diagram that a 10-layer NN of width 10 has a probability of dying of less than 1%, whereas a 10-layer NN of width 5 has a probability of dying greater than 10%; for a width of three the probability is about 60%. Note that the maximum number of layers grows exponentially with the width, as expected from the corollary above.
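For design purposes, the smallest width that pushes the upper bound below a target probability can be found by direct search. The sketch below assumes equal widths and an upper bound of the form $1 - (1 - 2^{-N})^{L-1}$. The function names are hypothetical.

```python
def bdp_upper_bound(depth, width):
    """Assumed upper bound on the born-dead probability for equal widths."""
    return 1.0 - (1.0 - 2.0 ** (-width)) ** (depth - 1)

def min_width(depth, delta):
    """Smallest width N for which the assumed bound drops below delta."""
    width = 1
    while bdp_upper_bound(depth, width) >= delta:
        width += 1
    return width

for depth in (10, 20, 50):
    print(depth, min_width(depth, 0.01), min_width(depth, 0.10))
```

Since this works with the bound rather than the true probability, the returned width is sufficient but may be larger than strictly necessary.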
4 Randomized Asymmetric Initialization
In order to alleviate the dying ReLU neural networks, we propose a new initialization procedure, namely, a randomized asymmetric initialization. For ease of discussion, we introduce some notation. For any vector $v \in \mathbb{R}^{n+1}$ and $k \in \{1, \dots, n+1\}$, we define
$$v_{-k} = [v_1, \dots, v_{k-1}, v_{k+1}, \dots, v_{n+1}]^T \in \mathbb{R}^{n}.$$
In order to train an $L$-layer neural network, we need to initialize $\theta = \{W^l, b^l\}_{1 \le l \le L}$. At each layer, let $V^l = [W^l, b^l] \in \mathbb{R}^{N_l \times (N_{l-1}+1)}$. We denote the $j$-th row of $V^l$ by $V_j^l = [W_j^l, b_j^l] \in \mathbb{R}^{N_{l-1}+1}$, where $j = 1, \dots, N_l$ and $l = 1, \dots, L$.
4.1 Proposed initialization
We propose to initialize $V^l$ as follows. Let $P$ be a probability distribution defined on $(0, M)$ for some $M > 0$ or on $(0, \infty)$. Note that $P$ is asymmetric around 0. At the first layer, $l = 1$, we employ the so-called ‘He initialization’ (He et al., 2015), i.e., $W^1_{ij} \sim N(0, 2/N_0)$ and $b^1 = 0$. For $l \ge 2$ and each $1 \le j \le N_l$, we initialize $V_j^l$ as follows:
1. Randomly choose $k_j^l$ in $\{1, 2, \dots, N_{l-1}, N_{l-1}+1\}$.
2. Initialize $(V_j^l)_{-k_j^l} \sim N(0, \sigma^2 I)$ and $(V_j^l)_{k_j^l} \sim P$.
Since an index is randomly chosen at each $l$ and $j$, and a positive number drawn from a probability distribution that is asymmetric around 0 is assigned to it, we name this new initialization a randomized asymmetric initialization. The He initialization is employed only for the first layer. This is because an input could have a negative value, so if the first layer were initialized from $P$, it could cause the dying ReLU. We note that the new initialization requires us to choose $\sigma^2$ and $P$. In Subsection 4.2, these will be theoretically determined. One could choose multiple indices in step 1 of the new initialization. However, for simplicity, we restrict ourselves to the single-index case.
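The procedure can be sketched in NumPy as follows. This is an illustrative reading of the description, assuming that for every row of layers 2 and above one randomly chosen entry of the combined weight-and-bias row is drawn from a positive distribution while the remaining entries are zero-mean Gaussian with standard deviation `sigma`. The function name, the `sample_positive` callback, and the Beta(2, 1) sampler in the usage lines are hypothetical choices, and `sigma` is left as a free parameter rather than the theoretically determined value.

```python
import numpy as np

def randomized_asymmetric_init(layer_sizes, sigma, sample_positive, seed=0):
    """layer_sizes = [N_0, N_1, ..., N_L]. Returns (weights, biases, chosen indices)."""
    rng = np.random.default_rng(seed)
    weights, biases, chosen = [], [], []

    # First layer: He initialization with zero bias.
    n0, n1 = layer_sizes[0], layer_sizes[1]
    weights.append(rng.normal(0.0, np.sqrt(2.0 / n0), size=(n1, n0)))
    biases.append(np.zeros(n1))

    # Layers 2, ..., L: each row V_j = [W_j, b_j] gets one positive entry.
    for fan_in, fan_out in zip(layer_sizes[1:-1], layer_sizes[2:]):
        V = rng.normal(0.0, sigma, size=(fan_out, fan_in + 1))
        k = rng.integers(0, fan_in + 1, size=fan_out)      # step 1: pick one index per row
        V[np.arange(fan_out), k] = sample_positive(rng, fan_out)  # step 2: positive draw
        weights.append(V[:, :fan_in])
        biases.append(V[:, fan_in])
        chosen.append(k)
    return weights, biases, chosen

# Usage with an arbitrary positive sampler (an assumption for illustration only).
weights, biases, chosen = randomized_asymmetric_init(
    [1, 4, 4, 2], sigma=0.5,
    sample_positive=lambda rng, n: rng.beta(2.0, 1.0, size=n), seed=3)
print([W.shape for W in weights])
```

Passing the positive sampler as a callback keeps the sketch independent of any particular choice of distribution.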
We now show that this new initialization procedure results in a smaller upper bound of the BDP. If a ReLU feed-forward neural network with layers, each having width , is initialized by the randomized asymmetric initialization, then
where ’s are some numbers in and . The proof can be found in Appendix E.
When a symmetric initialization is employed, for all , which results in Equation 4 of Theorem 3. Although the new initialization has a smaller upper bound compared to those by symmetric initialization, as Theorem 3 suggests, it also asymptotically suffers from the dying ReLU.
Assuming the same conditions in Theorem 4.1, and for all . Then, there exists , which depends on such that
For fixed depth and , if the width is , with probability exceeding , the ReLU neural network will not be initialized to be dead. Furthermore,
4.2 Second moment analysis
The proposed randomized asymmetric initialization described in Subsection 4.1 requires us to determine and . Similar to the He initialization (He et al. (2016)), we aim to properly choose initialization parameters from the length map analysis. Following the work of Poole et al. (2016), we present the analysis of a single input propagation through the deep ReLU network. To be more precise, we track the expectation of the normalized squared length of the input vector at each layer,
The expectation is taken with respect to all weights and biases.
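To illustrate what a length map tracks, the sketch below estimates the normalized squared length by Monte Carlo for the He initialization with zero biases, assuming the quantity is $q^l(x) = \mathbb{E}\left[\|\mathcal{N}^l(x)\|^2\right] / N_l$. Under these assumptions the value should stay close to $2\|x\|^2 / N_0$ at every layer, which is the behavior an initialization is usually tuned to achieve. The function name and the parameter values are hypothetical.

```python
import numpy as np

def length_map_he(x, width, depth, trials=3000, seed=0):
    """Monte Carlo estimate of E||N^l(x)||^2 / N_l for l = 1, ..., depth
    under He initialization with zero biases."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    q = np.zeros(depth)
    for _ in range(trials):
        h, fan_in = x, x.size
        for layer in range(depth):
            W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(width, fan_in))
            pre = W @ h                       # pre-activation N^l(x)
            q[layer] += pre @ pre / width     # normalized squared length
            h, fan_in = np.maximum(pre, 0.0), width
    return q / trials

x = np.array([1.0, -2.0])
print(length_map_he(x, width=30, depth=4))    # compare with 2 * ||x||^2 / N_0 = 5
```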
Let be a probability distribution whose support is . Let have finite first and second moments, i.e., , for , and . Suppose the -th layer weights and biases are initialized by the randomized asymmetric initialization described in Subsection 4.1. Then for any input , we have
The proof can be found in Appendix F.
Under the same conditions as Theorem 4.2, if , , , for all , and , we have
Since , cannot be zero. In order for , the initialization parameters have to be chosen to satisfy . Assuming , if is chosen to be
we have which satisfies the condition.
4.3 Comparison against other initialization procedures
In Fig. 4, we demonstrate the probability that the network is BD by the proposed randomized asymmetric initialization method. Here we employ and from Equation 6. To compare against other procedures, we present the results by the He initialization (He et al., 2015), one of the most popular symmetric initializations. We also present the results of existing asymmetric initialization procedures; the orthogonal (Saxe et al., 2014) and the layer-sequential unit-variance (LSUV) (Mishkin and Matas, 2016) initializations. The LSUV is the orthogonal initialization combined with rescaling of weights such that the output of each layer has unit variance. Because weight rescaling cannot make the output escape from the negative part of ReLU, it is sufficient to consider the orthogonal initialization. We see that the BDPs by the orthogonal initialization are very close to and a little lower than those by the He initialization. This implies that the orthogonal initialization cannot prevent the dying ReLU network. However, it is clearly observed that our proposed initialization can drastically drop the BDPs compared to other initialization procedures. This is expected by Theorem 4.1.
5 Numerical examples
We demonstrate the effectiveness of the proposed randomized asymmetric initialization procedure in training deep ReLU networks.
Test functions include one- and two-dimensional functions of different regularities. The following test functions are employed as unknown target functions. For one dimensional cases,
For two dimensional case,
The network architecture is set to have width $d_{\text{in}} + d_{\text{out}}$ at all layers. Here $d_{\text{in}}$ and $d_{\text{out}}$ are the dimensions of the input and output, respectively. We choose this specific network architecture because it is theoretically guaranteed to be able to approximate any continuous function: it was shown in (Hanin and Sellke, 2017) that the minimum width required for universal approximation is less than or equal to $d_{\text{in}} + d_{\text{out}}$. We present the ensemble of 1,000 independent training simulations. In all numerical examples, we employ one of the most popular first-order gradient-based optimizers, Adam (Kingma and Ba, 2015), with the default parameters. The minibatch size is chosen to be either 64 or 128. The standard -loss function is used on 3,000 training data. The training data are randomly uniformly drawn from . Without changing any of the setup described above, we present the approximation results based on different initialization procedures. The results by our proposed randomized asymmetric initialization are referred to as ‘Rand. Asymmetric’. Specifically, we use with defined in Equation 6. To compare against other methods, we also show the results by the He initialization (He et al., 2015).
In the one-dimensional examples, we employ a 10-layer ReLU network of width 2. From Fig. 4, we expect at least 88% of the training results by the symmetric initialization and 22% of those by our proposed initialization to collapse. In the two-dimensional example, we employ a 20-layer ReLU network of width 4. According to Fig. 4, we expect at least 63% of the training results by the symmetric initialization and 3.7% of those by our proposed method to collapse.
Fig. 5 shows all training outcomes of our test for approximating and its corresponding empirical probabilities by different initialization procedures. For this specific test function, we observe only three types of trained results, shown in (A), (B), and (C). We employ a 10-layer ReLU network with width 2 to approximate . In fact, can be represented exactly by a 2-layer ReLU network with width 2, . It can clearly be seen that the He initialization results in collapse with probability more than 90%. However, this probability is drastically reduced to about 40% by the proposed randomized asymmetric initialization. These probabilities are different from the probability that the network is BD. This implies that even when the network is not BD, it can still die during training. In this example, 5.6% and 18.3% of the results by the symmetric initialization and our method, respectively, are not dead at initialization but end up collapsing after training. 37.3% of the training results by the proposed initialization perfectly recover the target function, whereas only 2.2% of the results by the He initialization achieve this success. Also, 22.4% of the results by the proposed initialization and 4.2% of those by the He initialization are half-trained, which corresponds to Fig. 5 (B). We remark that the only difference in training is the initialization procedure. This implies that our new initialization not only prevents the dying ReLU network but also improves the quality of the training in this case.
| | Collapse (A) | Half-Trained (B) | Success (C) |
| Symmetric (He init.) | 93.6% | 4.2% | 2.2% |
| Rand. Asymmetric | 40.3% | 22.4% | 37.3% |
The approximation results for are shown in Fig. 6. Note that is a function. We again employ a 10-layer ReLU network of width 2. It can be seen that 91.9% of the training results by the symmetric initialization and 29.2% of those by our proposed initialization collapse, which corresponds to Fig. 6 (A). This indicates that the randomized asymmetric initialization can effectively alleviate the dying ReLU. In this example, 3.9% and 7.2% of the results by the symmetric initialization and our method, respectively, are not dead at initialization but end up collapsing after training. Apart from the collapse, the other training results are not easy to classify. Fig. 6 (B,C,D) shows three training results among many others. We observe that the behavior and result of training are not easily predictable in general. However, we consistently observe partially collapsed results after training. Such partial collapses are also observed in Fig. 6 (B,C,D). We believe that this requires more attention and postpone the study of this partial collapse to future work.
| | Collapsed (A) | Not collapsed (B,C,D) |
| Symmetric (He init.) | 91.9% | 8.1% |
| Rand. Asymmetric | 29.2% | 70.8% |
Similar behavior can be observed for approximating a discontinuous function . The approximation results for and its corresponding empirical probabilities are shown in Fig. 7. It can be seen that 93.8% of the training results by the symmetric initialization and 32.6% of those by our proposed initialization collapse, which corresponds to Fig. 7 (A). In this example, the new initialization drops the probability of collapsing by 60.3 percentage points. Again, this implies that the new initialization can effectively avoid the dying ReLU, especially when deep and narrow ReLU networks are employed. Fig. 7 (B,C,D) shows three trained results among many others. Again, we observe partially collapsed results.
| | Collapsed (A) | Not collapsed (B,C,D) |
| Symmetric (He init.) | 93.8% | 6.2% |
| Rand. Asymmetric | 32.6% | 67.4% |
As a last example, we show the approximation result for a function with multi-dimensional inputs and outputs, defined in Equation 8. We observe similar behavior. Fig. 8 shows some of the approximation results for and its corresponding probabilities. We employ a 20-layer ReLU network with width 4. For training, we use 3,000 training data sampled from a uniform distribution on , and the minibatch size is 128. Among 1,000 independent simulations, collapsed results are obtained with probability 76.8% by the symmetric initialization and with probability 9.6% by our method. From Fig. 4, we expect at least 63% and 3.7% of the results by the symmetric initialization and our initialization, respectively, to collapse. Thus, in this example, 13.8% and 5.9% of the results by the symmetric initialization and our method, respectively, are not dead at initialization but end up as in Fig. 8 (A) after training. This indicates that the proposed initialization can also effectively overcome the dying ReLU in tasks with multi-dimensional inputs and outputs.
| | Collapsed (A) | Not collapsed (B) |
| Symmetric (He init.) | 76.8% | 23.2% |
| Rand. Asymmetric | 9.6% | 90.4% |
6 Conclusion
In this paper, we establish, to the best of our knowledge, the first theoretical analysis of the dying ReLU. By focusing on the worst case of dying ReLU, we define ‘the dying ReLU network’, which refers to the problem in which the ReLU neural network is dead. We categorize the dying process into two phases. One phase is the event where the ReLU network is initialized to be a constant function. We refer to this event as ‘the network is born dead’. The other phase is the event where the ReLU network has collapsed after training. Certainly, the first phase implies the second, but not vice versa. We show that the probability that the network is born dead goes to 1 as the depth goes to infinity. Also, we provide an upper and a lower bound of the dying probability in $d_{\text{in}} = 1$ when the standard symmetric initialization is used.
Furthermore, in order to overcome the dying ReLU networks, we propose a new initialization procedure, namely, a randomized asymmetric initialization. We show that the new initialization has a smaller upper bound on the probability of NNs being born dead. By establishing the expected length map relation (second moment analysis), all parameters needed for the new method are theoretically designed. Numerical examples are provided to demonstrate the performance of our method. We observe that the new initialization not only overcomes the dying ReLU but also improves the training performance.
This work was supported by the DARPA EQUiPS grant N66001-15-2-4055, the AFOSR grant FA9550-17-1-0013, and the DARPA AIRA grant HR00111990025. The research of the third author was partially supported by the NSF of China 11771083 and the NSF of Fujian 2017J01556, 2016J01013.
Appendix A Proof of Theorem 3
The proof starts with the following lemma. Let be a -layer ReLU neural network with neurons at the -th layer. Suppose all weights are randomly independently generated from probability distributions satisfying for any nonzero vector and any -th row of . Then
where for any . Suppose for all . Then for all . If , we are done as . If it is not the case, there exists in such that for all ,
Thus we have for all . Let us consider the following events:
Note that . Also, since for any nonzero vector , we have . Therefore,
Thus we can focus on . If , we are done as . If it is not the case, it follows from the similar procedure that in . By repeating these, we conclude that
Let be a training domain where is any positive real number. We consider a probability space on which all random weight matrices and bias vectors are defined. For every , let be a sub-$\sigma$-algebra of generated by . Since for , is a filtration. Let us define the events of our interest where
where the second equality is from Lemma A. Note that is measurable in . Here could be either or random vectors. Since , . To calculate for , let us consider another event on which exactly -components of are zero on . For notational completeness, we set for and . Then since , we have
We want to show . Since is a partition of , by the law of total probability, we have
where , and . Since and are independently initialized, we have
where the second and the third inequalities hold from the assumption. Here does not depend on . If are randomly initialized from symmetric distributions around 0,
Let us define a transition matrix of size such that the -component is defined to be
Suppose for all . Note that the first row of is for all . Thus we have the following strictly increasing sequence :
Since , it converges, say, . Suppose , i.e., and let . Then since , we have
Taking the limit on both sides, we have
which leads to a contradiction. Therefore, . It then follows from Equation 10 that
which completes the proof.
Appendix B Proof of Theorem 3
Based on Lemma A, let us consider
Then if , . Thus it suffices to compute as
For , let us consider
Note that and . Since for all , we have
Note that . Also, note that since the rows of and are independent,
Since the weights and biases are randomly drawn from symmetric distributions around 0 and , we obtain