1 Introduction
Neural networks have been successfully applied in many fields. These include image classification in computer vision (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), natural language translation (Wu et al., 2016), and superhuman performance in the game of Go (Silver et al., 2016). Modern neural networks are often severely overparameterized or overspecified. Overparameterization means that the number of parameters is much larger than the number of training data. Overspecification means that the number of neurons in a network is much larger than needed. It has been reported that the wider a neural network is, the easier it is to train (Livni et al., 2014; Safran and Shamir, 2016; Nguyen and Hein, 2017). In general, neural networks are trained from random initialization by first- or second-order gradient-based optimization methods. Almost all gradient-based optimization methods stem from backpropagation
(Rumelhart et al., 1985) and the stochastic gradient descent (SGD) method
(Robbins and Monro, 1951). Many variants of vanilla SGD have been proposed, for example, AdaGrad (Duchi et al., 2011), RMSProp (Hinton, 2014), Adam (Kingma and Ba, 2015), AMSGrad (Reddi et al., 2019), and L-BFGS (Byrd et al., 1995), to name just a few. Different optimization methods have different convergence properties, and it is still far from clear how different optimization methods affect the performance of trained neural networks. Nonetheless, how the optimization process is started plays a crucial role in the success of training. A properly chosen weight initialization can drastically improve the training performance and enable the training of deep neural networks; see, for example, (LeCun et al., 1998; Glorot and Bengio, 2010; Saxe et al., 2014; He et al., 2015; Mishkin and Matas, 2016), and see (Lu et al., 2019) for more recent work. Among them, for rectified linear unit (ReLU) neural networks, the 'He initialization'
(He et al., 2015) is one of the most commonly used initialization methods, due to its success in a visual recognition challenge. There are several theoretical works showing that, under various assumptions, overparameterized neural networks can perfectly interpolate the training data. For the shallow neural network setting, see
(Oymak and Soltanolkotabi, 2019; Soltanolkotabi et al., 2019; Du et al., 2018b; Li and Liang, 2018). For the deep neural network setting, see (Du et al., 2018a; Zou et al., 2018; Allen-Zhu et al., 2018). Hence, overparameterization can be viewed as a sufficient condition for minimizing the training loss. Despite this theoretical progress, there still exists a huge gap between existing theories and empirical observations in terms of the level of overparameterization. To illustrate this gap, let us consider approximating a simple target function with a shallow ReLU network using 10 training data points, drawn uniformly at random. To interpolate all 10 data points, the best existing theoretical condition (Oymak and Soltanolkotabi, 2019) would require a width of 100 in this case. Figure 1 shows the convergence of the root mean square errors (RMSE) on the training data with respect to the number of epochs for five independent simulations. On the left, the results for width 10 are shown; we observe that all five training losses converge to zero as the number of epochs increases. Bridging this gap in the degree of overparameterization remains an ongoing challenge.
On the other hand, we know that the target function can be exactly represented by only two ReLU neurons. Thus, we show the results of width 2 on the right of Figure 1. In contrast to the theoretical guarantee, we observe that only one out of five simulations achieves zero training error. It turns out that with probability greater than 0.43, the network of width 2 fails to be trained successfully (Theorem 4.1); see also (Lu et al., 2019).
In this paper, we study the trainability of ReLU networks, a necessary condition for successful training. Suppose a learning task requires a ReLU network to have at least a certain number of active neurons. Given the learning task, the training is said to be successful if the trained network has at least that many active neurons and produces a small training loss. Also, a network is said to be trainable if its number of active neurons is greater than or equal to the required number. We first show that, in order to achieve a successful training, an initialized network must already have at least the required number of active neurons. This implies that a network being trainable is a necessary condition for successful training. If an initialized ReLU network is not trainable, the training will not be successful, regardless of which gradient-based optimization method is selected. Due to the random initialization of weights and biases, however, it is unlikely that all neurons are active at the beginning of the training. We thus study the probability distribution of the number of active neurons at the initialization. Then, we introduce the notion of trainability of ReLU networks: we refer to the probability of a network being trainable as its trainability. The trainability serves as an upper bound on the probability of successful training, and it can be calculated from the probability distribution of the number of active neurons. Furthermore, by showing that overparameterization can be understood within the frame of overspecification, we prove that overparameterization is both a necessary and a sufficient condition for minimizing the training loss, i.e., interpolating all training data.
In practice, it is important to maintain a high trainability for successful training. To secure a high trainability, overspecification is inevitable. From this perspective, the zero-bias initialization should be preferred over the random-bias initialization. However, the zero-bias initialization locates all neurons at the origin. This can make the training slower and often leads to a spurious local minimum, especially for a learning task which requires neurons to be evenly distributed; see, for example, Figure 3 in Section 5. On the other hand, if the random-bias initialization is used, all neurons are randomly distributed over the entire domain and some neurons will never be activated. Thus, in order to maintain a high trainability, the network needs to be severely overparameterized. If not, the trained network loses accuracy in the parts of the domain where neurons are dead. In order to overcome these difficulties, we propose a new data-dependent initialization method for the overparameterized setting, where the width is greater than or equal to the number of training data. By adopting the trainability perspective, the proposed method is designed to avoid both the clustered neuron problem and the dying ReLU neuron problem at the same time. We remark that the idea of data-dependent initialization is not new; see (Ioffe and Szegedy, 2015; Krähenbühl et al., 2015; Salimans and Kingma, 2016). However, our method is specialized to the overparameterized setting.
The rest of this paper is organized as follows. Upon presenting the mathematical setup in Section 2, we discuss the probability distribution of the number of active neurons in Section 3. In Section 4, we present the trainability of ReLU networks. A new data-dependent initialization is introduced in Section 5. Numerical examples are provided in Section 6, before the conclusion in Section 7.
2 Mathematical Setup
Let $\mathcal{N}: \mathbb{R}^{d} \to \mathbb{R}^{n_L}$ be a feedforward neural network with $L$ layers and $n_\ell$ neurons in the $\ell$-th layer ($n_0 = d$, $1 \le \ell \le L$). For $1 \le \ell \le L$, the weight matrix and the bias vector in the $\ell$-th layer are denoted by $W^\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ and $b^\ell \in \mathbb{R}^{n_\ell}$, respectively; $n_\ell$ is called the width of the $\ell$-th layer. We also denote the input by $x$ and the output at the $\ell$-th layer by $z^\ell(x)$. Given an activation function $\phi$ which is applied elementwise, the feedforward neural network is defined by

$$z^\ell(x) = W^\ell \phi\big(z^{\ell-1}(x)\big) + b^\ell, \qquad 2 \le \ell \le L, \tag{1}$$

and $z^1(x) = W^1 x + b^1$. Note that $\mathcal{N}$ is called an $(L-1)$-hidden-layer neural network or an $L$-layer neural network. Also, $\phi(z^\ell_j(x))$, $1 \le j \le n_\ell$, is called a neuron or a unit in the $\ell$-th hidden layer. We use $\vec{n} = (n_0, n_1, \dots, n_L)$ to describe a network architecture.
Let $\theta$ be the collection of all weight matrices and bias vectors, i.e., $\theta = \{(W^\ell, b^\ell)\}_{\ell=1}^{L}$. To emphasize the dependency on $\theta$, we often denote the neural network by $\mathcal{N}(x;\theta)$. In this paper, the rectified linear unit (ReLU) is employed as the activation function, i.e., $\phi(x) = \max\{x, 0\}$, applied elementwise.
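The definitions above can be made concrete with a short sketch. The following minimal NumPy implementation (our own illustration, not the paper's code) evaluates a feedforward ReLU network and, as a sanity check, builds the classic two-neuron representation of the absolute value, $|x| = \phi(x) + \phi(-x)$.

```python
import numpy as np

def relu(z):
    """Elementwise ReLU: phi(z) = max(z, 0)."""
    return np.maximum(z, 0.0)

def forward(x, weights, biases):
    """Evaluate a feedforward ReLU network: ReLU is applied between
    layers but not after the last (output) layer, as in (1)."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = relu(W @ z + b)
    return weights[-1] @ z + biases[-1]

# Sanity check: a one-hidden-layer network with two neurons computing
# |x| = relu(x) + relu(-x).
W1 = np.array([[1.0], [-1.0]])   # 2 hidden neurons, input dimension 1
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])      # output layer sums the two neurons
b2 = np.zeros(1)

y = forward(np.array([-3.0]), [W1, W2], [b1, b2])   # y[0] == 3.0
```

In the notation above, this network has architecture $(1, 2, 1)$ and every hidden neuron is active, since neither neuron is constant on any interval containing the origin.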
In many machine learning applications, the goal is to train a neural network using a set of training data $\{(x_i, y_i)\}_{i=1}^{m}$. Each datum is a pair of an input $x_i \in \mathcal{X}$ and an output $y_i \in \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the output space. In order to measure the discrepancy between a prediction and an output, we introduce a loss metric $\ell(\cdot,\cdot)$ to define a loss function $\mathcal{L}(\theta)$:

$$\mathcal{L}(\theta) = \frac{1}{m}\sum_{i=1}^{m} \ell\big(\mathcal{N}(x_i;\theta),\, y_i\big). \tag{2}$$

For example, the squared, logistic, hinge, or cross-entropy loss is commonly employed. We then seek to find $\theta^*$ which minimizes the loss function $\mathcal{L}(\theta)$. In general, a gradient-based optimization method is employed for the training. In its most basic form, given an initial value $\theta^{(0)}$, the parameters are updated according to

$$\theta^{(t+1)} = \theta^{(t)} - \eta\, \nabla_\theta \mathcal{L}\big(\theta^{(t)}\big),$$

where $\eta$ is the learning rate.
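As an illustration of this update rule, the sketch below (our own toy example, not one of the paper's experiments) trains a shallow ReLU network on a small 1-D regression problem with plain full-batch gradient descent, using manually derived gradients of the squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shallow ReLU model f(x) = sum_j a_j * relu(w_j * x + b_j), width 8.
n = 8
w, b, a = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)

X = np.linspace(-1.0, 1.0, 20)   # toy 1-D training inputs
Y = np.abs(X)                    # toy targets

def loss(w, b, a):
    pred = np.maximum(np.outer(X, w) + b, 0.0) @ a
    return np.mean((pred - Y) ** 2)

lr = 0.05
loss_start = loss(w, b, a)
for _ in range(500):
    pre = np.outer(X, w) + b                     # (20, n) pre-activations
    act = np.maximum(pre, 0.0)
    err = act @ a - Y                            # residuals
    grad_a = 2.0 * act.T @ err / len(X)
    back = 2.0 * (err[:, None] * a) * (pre > 0)  # chain rule through ReLU
    grad_w = (back * X[:, None]).mean(axis=0)
    grad_b = back.mean(axis=0)
    a -= lr * grad_a
    w -= lr * grad_w
    b -= lr * grad_b

loss_end = loss(w, b, a)         # should be smaller than loss_start
```

Note that the gradient of the ReLU is taken to be zero at negative pre-activations, which is why neurons whose pre-activations are negative on every training input receive no gradient signal at all.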
2.1 Weights and Biases Initialization and Data Normalization
Gradientbased optimization is a popular choice for training a neural network and requires weights and biases to be initialized in the first place. How to initialize the network plays a crucial role in the success of the training. Typically, the weight matrices are randomly initialized from probability distributions.
In this paper, we consider the following weight and bias initialization schemes. One is the normal initialization: all weights and/or biases in the $\ell$-th layer are independently initialized from zero-mean normal distributions,

$$W^\ell_{j,:} \sim N\big(0,\, \sigma_\ell^2 I_{n_{\ell-1}}\big), \qquad b^\ell_j \sim N\big(0,\, \sigma_\ell^2\big), \tag{3}$$

where $I_{n_{\ell-1}}$ is the identity matrix of size $n_{\ell-1}$. When $\sigma_\ell^2 = 2/n_{\ell-1}$, it is known as the 'He initialization' (He et al., 2015), one of the most popular initialization methods for ReLU networks. The other is the uniform distribution on the unit hypersphere: each row of either $W^\ell$ (without bias) or the augmented matrix $[W^\ell\ b^\ell]$ (with bias) is independently initialized from the uniform distribution on the corresponding unit hypersphere,

$$\big[W^\ell_{j,:}\ \ b^\ell_j\big] \sim \mathrm{Unif}\big(\mathbb{S}^{n_{\ell-1}}\big). \tag{4}$$
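The two schemes can be sampled as follows. This is a hedged sketch: the variance $2/\text{fan-in}$ for the He-style normal scheme and the Gaussian-normalization trick for the hypersphere scheme are standard choices, though the paper's exact parameterization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_init(fan_in, fan_out, sigma2, bias=True):
    """Normal initialization (3): each entry of W (and of b, when used)
    is an independent zero-mean Gaussian.  sigma2 = 2 / fan_in gives a
    He-style scheme."""
    W = rng.normal(0.0, np.sqrt(sigma2), size=(fan_out, fan_in))
    b = rng.normal(0.0, np.sqrt(sigma2), size=fan_out) if bias else np.zeros(fan_out)
    return W, b

def sphere_init(fan_in, fan_out, bias=True):
    """Unit-hypersphere initialization (4): each row of [W, b] (or of W
    alone, without bias) is uniform on the unit sphere, obtained by
    normalizing a standard Gaussian vector."""
    dim = fan_in + 1 if bias else fan_in
    V = rng.normal(size=(fan_out, dim))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    return (V[:, :-1], V[:, -1]) if bias else (V, np.zeros(fan_out))

W, b = sphere_init(3, 5)
rows = np.hstack([W, b[:, None]])   # each row has unit Euclidean norm
```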
Throughout this paper, we assume that the training input domain is the closed ball $B^d(r) = \{x \in \mathbb{R}^d : \|x\| \le r\}$ for some $r > 0$. In practice, the training data are often normalized to have mean zero and variance 1. Given a training data set, the normalization makes $\|x_i\| \le 1$ for all $i$, which corresponds to setting $r = 1$. The output of the 1st layer can be written as $z^1(x) = \tilde{W}^1 \tilde{x}$, where $\tilde{W}^1 = [W^1\ b^1]$ and $\tilde{x} = [x^\top, 1]^\top$. Many theoretical works (Allen-Zhu et al., 2018; Du et al., 2018a; Li and Liang, 2018; Soltanolkotabi et al., 2019; Zou et al., 2018) assume the inputs have unit norm and set the last entry of $\tilde{x}$ to be a positive constant.

2.2 Dying ReLU and Born Dead Probability
The dying ReLU refers to the problem in which ReLU neurons become inactive and output only a constant for any input. We say that a ReLU neuron in the $\ell$-th hidden layer is dead on the domain if it is a constant function there, i.e., there exists a constant $c$ such that the neuron's output equals $c$ for every input in the domain. Also, a ReLU neuron is said to be born dead (BD) if it is dead at the initialization. In contrast, a ReLU neuron is said to be active if it is not a constant function on the domain. The notion of born death was introduced in (Lu et al., 2019), where a ReLU network is said to be BD if there exists a layer in which all neurons are BD. We refer to the probability that a ReLU neuron is BD as the born dead probability (BDP) of a ReLU neuron.
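The BDP of a first-layer neuron can be estimated numerically. The sketch below assumes the training input domain is the closed unit ball, so that a neuron $\phi(\langle w, x\rangle + b)$ is constant on the domain exactly when $\|w\| + b \le 0$; under a He-style initialization with random bias, this yields a BDP of about $1/4$ in one dimension. This setup is our own illustration and need not match the paper's exact constants.

```python
import numpy as np

rng = np.random.default_rng(0)

def bdp_first_layer(d, n_trials=200_000):
    """Monte Carlo estimate of the BDP of a single first-layer neuron,
    assuming the input domain is the closed unit ball in R^d and a
    He-style init with random bias: w, b ~ N(0, 2/d).  The neuron
    relu(<w, x> + b) is constant on the ball iff
    max_x (<w, x> + b) = ||w|| + b <= 0."""
    w = rng.normal(0.0, np.sqrt(2.0 / d), size=(n_trials, d))
    b = rng.normal(0.0, np.sqrt(2.0 / d), size=n_trials)
    return np.mean(np.linalg.norm(w, axis=1) + b <= 0.0)

p1 = bdp_first_layer(1)   # close to 1/4 by a rotational-symmetry argument
```

For $d = 1$ the event $\{|w| + b \le 0\}$ is a quarter of the rotation-invariant Gaussian plane, which is why the estimate concentrates near $1/4$ in this assumed setup.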
In the 1st hidden layer, once a ReLU neuron is dead, it cannot be revived during the training; the dead neurons in a shallow ReLU network cannot be revived through gradient-based training. The proof can be found in Appendix A. In the $\ell$-th hidden layer with $\ell \ge 2$, the constant output of a dead neuron is either a strictly positive number or zero. In the former case, it readily follows from Lemma 10 of Lu et al. (2019) that, with probability 1, there exists a hidden layer in which all neurons are dead. For the reader's convenience, we present a variant of Lemma 10 of Lu et al. (2019) below. Given a deep ReLU network, suppose the weight matrices and/or the bias vectors are initialized from probability distributions satisfying a mild nondegeneracy condition with respect to any fixed nonzero vector. If either (i) there exists a hidden layer in which all neurons are dead, or (ii) there exists a dead neuron in the $\ell$-th hidden layer whose value is positive, then with probability 1, the dead neurons cannot be revived through gradient-based training.
3 Probability Distribution of the Number of Active ReLU Neurons
Understanding how many neurons will be active at the initialization is not only directly related to the trainability of a ReLU network, but also suggests how much overspecification or overparameterization would be needed. In this section, we investigate the probability distribution of the number of active ReLU neurons for an initialized ReLU network.
Given a ReLU network having architecture $\vec{n}$, let $N_\ell$ denote the number of active neurons at the $\ell$-th hidden layer; the randomness of $N_\ell$ comes from the random initialization of the weights and biases. The probability distribution of the number of active neurons at the $\ell$-th hidden layer is given by (5), where the layer-to-layer transitions of the number of active neurons are described by stochastic matrices. For each $\ell$ and $k$, let $A^\ell_k$ be the event that exactly $k$ neurons are active at the $\ell$-th layer, and let $p^\ell_k$ be the BDP of a single ReLU neuron at the $\ell$-th hidden layer given $A^{\ell-1}_k$. Then the stochastic matrix is expressed as in (6), where the expectation is taken with respect to the conditional randomness of the previous layers. Thus, $p^\ell_k$ is a fundamental quantity for the complete understanding of the distribution of $N_\ell$.
As a first step towards understanding $N_\ell$, we calculate the exact probability distribution of the number of active neurons in the 1st hidden layer. Given a ReLU network having architecture $\vec{n}$, suppose the training input domain is $B^d(r)$. If either the 'normal' (3) or the 'unit hypersphere' (4) initialization without bias is used in the 1st hidden layer, then with probability 1 no neuron in the 1st hidden layer is born dead, i.e., $N_1 = n_1$. If either the 'normal' (3) or the 'unit hypersphere' (4) initialization with bias is used in the 1st hidden layer, $N_1$ follows a binomial distribution with parameters $n_1$ and $1 - p_d$, where the BDP $p_d$ of a single neuron is given by (7) in terms of the Gamma function. The proof can be found in Appendix B.
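The binomial behavior of $N_1$ under random-bias initialization can be checked by simulation. The sketch below (unit-ball domain and He-style initialization assumed, as in our earlier illustration) samples the number of active first-layer neurons of a width-10 network in one dimension; under these assumptions the empirical mean should be close to $n_1(1 - p_d) = 10 \times 0.75$.

```python
import numpy as np

rng = np.random.default_rng(1)

def active_counts(d, n, trials=100_000):
    """Sample N_1, the number of active first-layer neurons of a width-n
    network, assuming the closed unit ball domain and He-style init with
    random bias: a neuron is active iff ||w|| + b > 0."""
    w = rng.normal(0.0, np.sqrt(2.0 / d), size=(trials, n, d))
    b = rng.normal(0.0, np.sqrt(2.0 / d), size=(trials, n))
    return np.sum(np.linalg.norm(w, axis=2) + b > 0.0, axis=1)

counts = active_counts(1, 10)
mean_active = counts.mean()   # near 10 * (1 - 1/4) = 7.5 in this setup
var_active = counts.var()     # near 10 * 0.25 * 0.75 = 1.875 (binomial)
```

Since the neurons are initialized independently, both the mean and the variance of the simulated counts match the binomial values, consistent with the theorem above.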
We now calculate the BDP for a ReLU neuron in the 2nd hidden layer. Since the bias in each layer can be initialized in different ways, we consider several combinations. Given a ReLU network having architecture $\vec{n}$, suppose the training input domain is $B^d(r)$, and consider two cases: (i) the 'unit hypersphere' (4) initialization with bias is used in the 1st hidden layer, or (ii) the 'unit hypersphere' (4) initialization without bias is used in the 1st hidden layer. In each case, the BDP of a neuron in the 2nd hidden layer can be expressed in terms of the quantity defined in the theorem above. The proof can be found in Appendix D.
Further characterization of the BDPs for deeper layers is deferred to a future study. The two theorems above indicate that the bias initialization can drastically change the active-neuron distributions. Since the distribution of each layer depends on the layers below it, the behavior of the first hidden layers affects the distributions of all higher layers. In Figure 2, we consider a deep ReLU network and plot the empirical distributions of the number of active neurons from independent simulations. On the left and in the middle, the 'unit hypersphere' (4) initialization without and with bias, respectively, is employed in all layers. On the right, the 'unit hypersphere' initialization without bias is employed in the 1st hidden layer, and the 'normal' (3) initialization with bias is employed in all other layers. The theoretically derived distributions are also plotted as references, and all empirical results match our theoretical derivations well. When the 1st hidden layer is initialized with bias, at least one neuron in the 1st hidden layer is dead with probability 0.8. On the other hand, if the 1st hidden layer is initialized without bias, with probability 1, no neuron is dead. It is clear that the distributions obtained by the three initialization schemes behave differently.
4 Trainability and Overspecification of ReLU Networks
We are now in a position to introduce the trainability of ReLU neural networks.
4.1 Trainability of Shallow ReLU Networks
For pedagogical reasons, we first confine ourselves to shallow ReLU networks. Let $\mathcal{F}_n$ be the class of shallow ReLU neural networks of width $n$ on a compact domain $\Omega$;

$$\mathcal{F}_n = \Big\{\, f : f(x) = \sum_{j=1}^{n} a_j\, \phi\big(\langle w_j, x\rangle + b_j\big) \,\Big\}, \tag{8}$$

where $a_j \in \mathbb{R}$, $w_j \in \mathbb{R}^{d}$, and $b_j \in \mathbb{R}$. We note that $\mathcal{F}_n \subset \mathcal{F}_{n+1}$ for all $n$. However, a function in $\mathcal{F}_n$ can admit different representations in larger function classes on the compact domain $\Omega$. We say $\mathcal{F}_s$ is the minimal function class for $f$ if there exists $g \in \mathcal{F}_s$ such that $g = f$ on $\Omega$ and $s$ is the smallest integer for which this holds. We remark that two networks realizing the same function on $\Omega$ need not be the same function on all of $\mathbb{R}^d$. We extend the notion of the minimal function class and define the approximated minimal function class as follows.
Given a continuous function $f$ and $\epsilon \ge 0$, a function class $\mathcal{F}_s$ is said to be the approximated minimal function class for $f$ if $s$ is the smallest number such that there exists $g \in \mathcal{F}_s$ with $\|f - g\| < \epsilon$ on $\Omega$. If $\epsilon = 0$, we say $\mathcal{F}_s$ is the minimal function class for $f$.
Note that the existence of such $s$ is guaranteed by the universal approximation theorem for shallow neural networks (Hornik, 1991; Cybenko, 1989). Any ReLU network of width greater than $s$ is then said to be overspecified for approximating $f$.
A network is said to be overparameterized if the number of parameters is larger than the number of training data. In this paper, we consider the overparameterized setting in which the width is greater than or equal to the number of training data $m$. Overparameterization can then be understood within the frame of overspecification by the following theorem. For any nondegenerate training data of size $m$, there exists a shallow ReLU network of width $m$ which interpolates all the training data. Furthermore, there exists nondegenerate training data of size $m$ such that no shallow ReLU network of width less than $m$ can interpolate all the training data. In this sense, $m$ is the minimal width. The proof can be found in Appendix E.
Theorem 4.1 shows that any network of width greater than or equal to $m$ is overspecified for interpolating $m$ training data. Thus, we can regard overparameterization as a kind of overspecification.
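For one-dimensional data, a width-$m$ interpolating network can be realized by an explicit triangular construction. The sketch below is one such construction (not necessarily the one used in Appendix E): it places one always-active neuron to the left of the data and one kink at each remaining datum, then solves the resulting lower-triangular linear system for the output weights.

```python
import numpy as np

def interpolate_relu(xs, ys):
    """Width-m shallow ReLU network interpolating m distinct 1-D points.
    Neuron 0 has its kink to the left of all data (so it is affine, and
    active, on the data range); neuron i >= 1 has its kink at the i-th
    smallest datum.  Evaluating at the sorted data yields a lower-
    triangular system with positive diagonal, solved directly."""
    order = np.argsort(xs)
    xs = np.asarray(xs, float)[order]
    ys = np.asarray(ys, float)[order]
    kinks = np.concatenate([[xs[0] - 1.0], xs[:-1]])
    H = np.maximum(xs[:, None] - kinks[None, :], 0.0)  # lower triangular
    c = np.linalg.solve(H, ys)
    def f(x):
        return np.maximum(np.asarray(x, float)[..., None] - kinks, 0.0) @ c
    return f

xs = np.array([-0.5, 0.0, 0.3, 0.9])
ys = np.array([1.0, -1.0, 0.5, 2.0])
f = interpolate_relu(xs, ys)   # f(xs) reproduces ys exactly
```

Between consecutive data points the network is linear, so it reduces to piecewise-linear interpolation on the data range; each neuron beyond the first contributes exactly one kink, which is the intuition for why width $m$ suffices.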
We are now in a position to define the trainability of a neural network. For a learning task which requires at least $k$ active neurons, a shallow ReLU network of width $n \ge k$ is said to be trainable if the number of active neurons is greater than or equal to $k$. We refer to the probability that a network is trainable at the initialization as its trainability.
Note that approximating a function whose minimal function class has width $k$ and interpolating nondegenerate training data of size $k$ can both be regarded as learning tasks which require at least $k$ active neurons.
It was shown in Theorem 2.2 that once a ReLU neuron of the 1st hidden layer is dead, it will never be revived during the training. Thus, given a learning task which requires at least $k$ active neurons, for the training to be successful, an initialized network should have at least $k$ active neurons in the first place. If the number of active neurons is less than $k$, there is no hope of training the network successfully. Therefore, a network being trainable is a necessary condition for successful training. We remark that this condition is independent of the choice of the loss metric in (2).
Given a learning task which requires a shallow ReLU network having at least $k$ active neurons, suppose the training input domain is $B^d(r)$ and a shallow ReLU network of width $n \ge k$ is employed. The probability of the network being trainable is $P(N_1 \ge k)$, where $N_1$ is the number of active neurons in the 1st hidden layer at the initialization. If the 'He initialization' without bias is used in the 1st hidden layer, then by Theorem 3 we have $N_1 = n$ with probability 1, so the network is trainable with probability 1. If the 'He initialization' with bias is used in the 1st hidden layer, then by Theorem 3 $N_1$ follows a binomial distribution, and the trainability is the tail probability $P(N_1 \ge k)$. Thus, the proof is completed.
Theorem 4.1 explains why overspecification is a necessary condition for training a shallow ReLU network when the bias is randomly initialized. It also suggests a degree of overspecification whenever one has a specific width in mind for a learning task. That is, if it is known (either theoretically or empirically) that a shallow network of width $k$ can achieve a good performance, one should employ a network whose width is $k$ divided by the probability that a single neuron is active at the initialization, in order to guarantee (on average) that $k$ neurons are active at the initialization. The example of Figure 1 in Section 1 can also be understood in this manner: by Theorem 4.1, with probability at least 0.43, the network of width 2 fails to be trained successfully for a learning task which requires at least 2 active neurons.
In the shallow ReLU network setting, suppose either the 'normal' (3) or the 'unit hypersphere' (4) initialization with bias is employed in the first hidden layer, and suppose the training input domain is $B^d(r)$. For any nondegenerate training data of size $m$, which requires a network to have at least $m$ active neurons for the interpolation, suppose the width $n$ and the input dimension $d$ satisfy the condition in (9). Then, overparameterization is both a necessary and a sufficient condition for interpolating all the training data, with the probability stated in (9), over the random initialization by the (stochastic) gradient-descent method. The proof can be found in Appendix F.
We remark that Theorem 4.1 assumes that the biases are randomly initialized. If the biases are initialized to zero, overparameterization or overspecification is not needed from this perspective. Also, we note that, to the best of our knowledge, all existing theoretical results assume that the biases are randomly initialized; see, e.g., Du et al. (2018b); Oymak and Soltanolkotabi (2019); Li and Liang (2018).
4.2 Trainability of Deep ReLU Networks
We now extend the notion of trainability to deep ReLU networks. Unlike dead ReLU neurons in the 1st hidden layer, a dead neuron in the $\ell$-th hidden layer with $\ell \ge 2$ can be revived during the training if two conditions are satisfied. One is that every layer contains at least one active neuron; this condition follows directly from Theorem 2.2. The other is that the dead neuron in the $\ell$-th hidden layer is only tentatively dead. We remark that these two conditions are necessary conditions for the revival of a dead neuron. We give a precise meaning to tentative death as follows.
Let us consider a neuron in the $\ell$-th hidden layer, $\phi\big(\langle w, \phi(z^{\ell-1}(x))\rangle + b\big)$, and suppose the neuron is dead. If the neuron remains dead for any changes in the parameters of the previous layers, but not in its own weight $w$ and bias $b$, we say the neuron is dead permanently. For example, if $w \le 0$ componentwise and $b \le 0$, then since $\phi(z^{\ell-1}(x)) \ge 0$, the neuron will never be activated, regardless of the previous layers. Hence, in this case, there is no hope that the neuron can be revived during the training. Otherwise, we say the neuron is dead tentatively.
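The permanent/tentative distinction reduces to a simple sign check. In the sketch below (our own illustration), a dead neuron in layer $\ell \ge 2$ receives only nonnegative activations from the previous layer, so if all of its incoming weights and its bias are nonpositive, no change to the earlier layers can ever activate it.

```python
import numpy as np

def classify_dead(w, b, prev_acts):
    """Classify a neuron relu(<w, a> + b) in layer l >= 2, where the rows
    of prev_acts are the (nonnegative) previous-layer activations on the
    training inputs.  If w <= 0 componentwise and b <= 0, then <w, a> + b
    <= 0 for every nonnegative a, so no change to earlier layers can
    activate the neuron: its death is permanent."""
    if np.any(prev_acts @ w + b > 0.0):
        return "active"
    if np.all(w <= 0.0) and b <= 0.0:
        return "permanent"
    return "tentative"

acts = np.array([[0.5, 0.2],
                 [0.1, 0.0]])     # activations on two training inputs
status = classify_dead(np.array([-1.0, -2.0]), -0.5, acts)   # "permanent"
```

A tentatively dead neuron (e.g., one with a positive incoming weight but a very negative bias) could still be activated if the previous layers move enough probability mass onto that weight's input.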
Given the event that exactly $k$ neurons are active in the $(\ell-1)$-th hidden layer, let $p^{\ell}_{\mathrm{p}}$ and $p^{\ell}_{\mathrm{t}}$ be the conditional probabilities that a neuron in the $\ell$-th hidden layer is born dead permanently and born dead tentatively, respectively. Let $T_\ell$ and $D_\ell$ be the numbers of tentatively dead and permanently dead neurons at the $\ell$-th hidden layer. The joint distribution of $(T_\ell, D_\ell)$ can then be written with a multinomial coefficient in terms of $p^{\ell}_{\mathrm{p}}$ and $p^{\ell}_{\mathrm{t}}$, where the expectation is taken with respect to the randomness of the previous layers.
With the notion of permanent death of a neuron, the above derivation gives the following trainability theorem for deep ReLU networks. Given a learning task which requires an $L$-hidden-layer ReLU network having at least $k_\ell$ active neurons at the $\ell$-th layer, suppose we employ an $L$-hidden-layer ReLU network having $n_\ell$ neurons at the $\ell$-th layer such that $n_\ell \ge k_\ell$ for all $\ell$. Then the network is trainable at the initialization with probability given by (10). Note that there is no tentatively dead neuron in the first hidden layer. In the one-dimensional case where the biases are initialized to zero, an upper bound on the trainability can be derived. Suppose that all weights are independently initialized from continuous probability distributions symmetric around 0, all biases are zero, and $d = 1$. Then the trainability of an $L$-hidden-layer ReLU network is bounded above by an explicit product over the layers; the exact expression and the proof can be found in Appendix G.
In principle, a single active neuron in the highest layer could potentially revive tentatively dead neurons through backpropagation. In practice, however, it is better to have at least $k_\ell$ active neurons in the $\ell$-th hidden layer at the initialization, for both faster training and robustness. If we employ an $L$-hidden-layer ReLU network having $n_\ell$ neurons in the $\ell$-th hidden layer, an initialized ReLU network has at least $k_\ell$ active neurons in every hidden layer with probability given by (11). Therefore, it is suggested to use a ReLU network with a sufficiently large width at each layer to secure a high probability in (11).
Remark: A trainable network is not guaranteed to be trained successfully. However, if a network is not trainable, we have no hope for the successful training. Thus, a network being trainable is a necessary condition for the successful training. Also, the probability that a network is trainable, i.e., the trainability, serves as an upper bound of the probability of the successful training.
5 Data-dependent Bias Initialization
In the previous section, we discussed the trainability of ReLU networks. In terms of trainability, Theorem 4.1 indicates that the zero-bias initialization is preferable to the random-bias initialization. In practice, however, the zero-bias initialization often finds a spurious local minimum or gets stuck on a flat plateau. To illustrate this difficulty, we consider the problem of approximating a sum of two sine functions. For this task, we use a shallow ReLU network of width 500 with the 'He initialization' without bias. In order to reduce extra randomness in the experiment, 100 equidistant points on the domain are used as the training data set. Adam (Kingma and Ba, 2015), one of the most popular gradient-based optimization methods, is employed with its default parameters, and we use the full batch. The trained network is plotted in Figure 3. It is clear that the trained network is stuck at a local minimum. Similar behavior is repeatedly observed in all of our multiple independent simulations.
This phenomenon can be understood as follows. Since the biases are zero, all initialized neurons are clustered at the origin. Consequently, it takes a long time for the gradient updates to distribute neurons over the training domain to achieve a small training loss. In the worst case, along the way of distributing neurons, the optimizer finds a spurious local minimum. We refer to this problem as the clustered neuron problem. Indeed, this is observed in Figure 3: the trained network approximates the target function well on a small region containing the origin, but it loses accuracy on the parts of the domain far from the origin.
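The clustering is easy to see from the kink locations. In one dimension, a neuron $\phi(wx + b)$ changes slope at $x = -b/w$; with zero biases every kink sits at the origin, while random biases spread the kinks over the line. The snippet below is a minimal illustration of this point.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 500
w = rng.normal(0.0, np.sqrt(2.0), size=n)   # He-style weights, d = 1

# A 1-D neuron relu(w * x + b) changes slope at the kink x = -b / w.
kinks_zero_bias = -np.zeros(n) / w          # b = 0: every kink at x = 0
b = rng.normal(0.0, np.sqrt(2.0), size=n)
kinks_random_bias = -b / w                  # kinks spread over the line
```

With zero bias, the network is piecewise linear with a single breakpoint at the origin at initialization, so all the expressive power must be created by moving kinks during training.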
On the other hand, if we randomly initialize the biases, as shown in Theorem 4.1, overspecification is inevitable to guarantee, with high probability, a certain number of active neurons. In this setting, only 375 of the 500 neurons will be active on average at the initialization. In Figure 3, we also show the result trained with the 'He initialization' with bias. Due to the nonzero biases, neurons are randomly distributed. Accordingly, the trained network approximates the target function quite well. However, due to dead neurons, it loses accuracy in some parts of the domain.
In order to overcome these difficulties, we propose a new data-dependent initialization method for the overparameterized setting, where the width is greater than or equal to the number of training data. By adopting the trainability perspective, the method is designed to alleviate both the clustered neuron problem and the dying ReLU neuron problem at the same time.
5.1 Data-dependent Bias Initialization
Let $m$ be the number of training data and $n$ be the width of a shallow ReLU network, and suppose the network is overparameterized so that $n \ge m$. We then propose to initialize each bias so that the corresponding neuron is centered at one of the training inputs, up to an iid mean-zero perturbation. We note that this mimics the explicit construction for the data interpolation in Theorem 4.1. By doing so, each neuron is initialized to be located near its assigned training input.
The precise values of the parameters are determined as follows. Consider the expectation of the normalized squared norm of the network output, where the expectation is taken over the weights and biases of a shallow ReLU network, and consider its average over the training inputs. We then choose our parameters so that this quantity under the proposed data-dependent initialization matches the one under a standard initialization method. For example, when the 'normal' (3) initialization without bias is used, the quantity has a closed-form expression in terms of the Frobenius norms of the weight matrices; when the 'He initialization' without bias is used, it reduces to a constant.
Suppose $n \ge m$ and let $X$ be the set of training input data. If the weights are drawn as above and the biases are initialized by the proposed method, then the expected normalized squared norm of the network output over the training inputs admits a closed-form expression in terms of the initialization parameters. The proof can be found in Appendix H.
For example, if the parameters are set as in (12), the expected normalized squared norm under the data-dependent initialization is equal to the one under the 'He initialization' without bias.
The proposed initialization ensures that the neurons are evenly distributed over the training data points. Also, it makes it likely that at least one neuron is activated at each training datum. By doing so, it effectively avoids both the clustered neuron problem and the dying ReLU neuron problem.
In Figure 4, we demonstrate the performance of the proposed method in approximating the sum of two sine functions. On the left, the trained neural network is plotted; on the right, the root mean square errors (RMSE) of the training loss are shown with respect to the number of epochs for three different initialization methods. We remark that since the training set is deterministic and the full batch is used, the only randomness in the training process comes from the weight and bias initialization. It can be seen that the proposed method not only results in the fastest convergence but also achieves the smallest approximation error. The number of dead neurons in the trained network is 127 (He with bias), 3 (He without bias), and 17 (data-dependent).
6 Numerical Examples
We present numerical examples to demonstrate our theoretical findings and the effectiveness of our new data-dependent initialization method.
6.1 Trainability of Shallow ReLU Networks
We present two examples to demonstrate the trainability of a ReLU neural network and to justify our theoretical results. Here, all weights and biases are initialized according to the 'He initialization' (3) with bias. We consider the two univariate test target functions given in (13). Each target belongs to a minimal function class of known width; that is, theoretically, each target should be exactly recovered by a shallow ReLU network of its minimal width. For the training, we use a training set of 600 data points and a test set of 1,000 data points, both generated uniformly from the training domain. The standard stochastic gradient descent method with minibatches of size 128 and a constant learning rate is employed, and we fix the maximum number of epochs. Also, the standard squared loss is used.
In Figure 5, we show the results for approximating the first target function. On the left, we plot the empirical probability of successful training with respect to the width. The empirical probabilities are obtained from 1,000 independent simulations, and a single simulation is regarded as a success if the test error falls below a fixed threshold. We also plot the trainability from Theorem 4.1. As expected, it provides an upper bound for the probability of successful training. It is clear that the more the network is overspecified, the higher the trainability. Also, it can be seen that as the width grows, the empirical training success rate increases. This suggests that a successful training can be achieved with high probability by securing a very high trainability. However, since trainability is only a necessary condition, even an initialized network that is trainable can still end up at one of the spurious minima shown in the middle and on the right of Figure 5.
Similar behavior is observed when approximating . In Figure 6, we show the approximation results for . On the left, the empirical probability of successful training and the trainability (Theorem 4.1) are plotted with respect to the width. Again, the trainability provides an upper bound on the probability of successful training, and the empirical training success rate increases as the width grows. In the middle and on the right, we plot two of the local minima in which a trainable network could end up. We remark that the choice of gradient-based optimization method, a well-tuned learning rate, and/or other tunable optimization parameters could affect the empirical training success probability; however, the maximum probability one can hope for is bounded by the trainability. In all of our simulations, we did not tune any optimization parameters.
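The trainability bound compared against in these figures can be illustrated by a small Monte Carlo check. This is a hedged sketch for the one-dimensional case on the domain [-1, 1], assuming weights and biases are i.i.d. zero-mean Gaussians of equal variance; the general statement is the paper's Theorem 4.1. Under these assumptions, a neuron relu(w*x + b) is born dead iff b <= -|w|, which has probability 1/4, so the number of active neurons is Binomial(width, 3/4).

```python
import numpy as np
from math import comb

def trainability_mc(width, k, trials=20000, seed=0):
    """Monte Carlo estimate of P(at least k active neurons at initialization)
    for a 1-D shallow ReLU network on [-1, 1] with i.i.d. Gaussian (w, b).
    relu(w*x + b) is active somewhere on [-1, 1] iff |w| + b > 0."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 1.0, (trials, width))
    b = rng.normal(0.0, 1.0, (trials, width))
    active = (b > -np.abs(w)).sum(axis=1)
    return (active >= k).mean()

def trainability_binomial(width, k, p_active=0.75):
    """Closed form under the same assumptions: each neuron is independently
    active with probability 3/4, so the active count is Binomial(width, 3/4)."""
    return sum(comb(width, j) * p_active**j * (1 - p_active)**(width - j)
               for j in range(k, width + 1))
```

For example, a width-10 network asked to supply at least 8 active neurons is trainable only about half the time under this initialization, matching the qualitative message of Figures 5 and 6 that heavy overspecification is needed for trainability close to 1.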
6.2 Data-dependent Bias Initialization
Next, we compare the training performance of three initialization methods. The first is the ‘He initialization’ (He et al., 2015) without bias, which corresponds to . The second is the ‘He initialization’ with bias (3), which corresponds to . Here is the width of the first hidden layer. The last is our data-dependent initialization described in the previous section, with the parameters from (12). All results are generated under the same conditions except for the weight and bias initialization.
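The two ‘He initialization’ variants can be sketched as follows. The weight scale N(0, 2/fan_in) is the scheme of He et al. (2015); drawing the bias from the same Gaussian is one common convention for the ‘with bias’ variant, and the paper's exact bias variance is the one specified in its Eq. (3).

```python
import numpy as np

def he_init(fan_in, width, with_bias, rng=None):
    """'He initialization': weights ~ N(0, 2/fan_in).
    'Without bias' sets b = 0; 'with bias' draws b from the same Gaussian
    (an assumed convention; see Eq. (3) of the paper for the exact variance)."""
    rng = rng or np.random.default_rng()
    std = np.sqrt(2.0 / fan_in)
    W = rng.normal(0.0, std, (width, fan_in))
    b = rng.normal(0.0, std, width) if with_bias else np.zeros(width)
    return W, b
```

The zero-bias variant places every neuron's breakpoint at the origin, while the random-bias variant scatters breakpoints but can produce born-dead neurons, which is the trade-off the data-dependent method is designed to resolve.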
We consider the following test functions on :
(14) 
In all tests, we employ a shallow ReLU network of width 100, trained on 25 points drawn uniformly at random from . The standard loss and the gradient-descent method with momentum are employed. The learning rate is set to a constant of and the momentum term is set to 0.9.
Figure 7 shows the mean of the RMSE on the training data from 10 independent simulations with respect to the number of epochs for the three initialization methods. The shaded area covers plus or minus one standard deviation from the mean. On the left and right, the results for approximating and are shown, respectively. We see that the data-dependent initialization not only yields the fastest loss convergence but also achieves the smallest training loss. Also, the average number of dead neurons in the trained network is 11 (He with bias), 0 (He without bias), and 0 (data-dependent) for , and 12 (He with bias), 0 (He without bias), and 0 (data-dependent) for .
7 Conclusion
In this paper, we establish the trainability of ReLU neural networks, a necessary condition for successful training. Given a learning task that requires a ReLU network with at least active neurons, a network is said to be trainable if it has at least active neurons. We refer to the probability of a network being trainable as its trainability. In order to calculate the trainability, we first study the probability distribution of the number of active neurons in the th layer. We completely characterize , and by focusing on the one-dimensional case, we derive for four different combinations of initialization schemes. With , we derive the trainability of ReLU networks, which serves as an upper bound on the probability of successful training. Also, by showing that overparameterization can be understood as a kind of overspecification, we prove that overparameterization is both a necessary and a sufficient condition for interpolating all training data, i.e., minimizing the loss.
Although the zero-bias initialization seems preferable to the random-bias initialization from the trainability perspective, the zero-bias initialization places all neurons at the origin at initialization. This often deteriorates the training performance, especially for a task that requires neurons located in regions away from the origin. On the other hand, the random-bias initialization suffers from the dying ReLU neuron problem and typically requires a severely overparameterized or overspecified network. To alleviate these difficulties, we propose a new data-dependent initialization method in the overparameterized setting. The proposed method is designed to avoid both the dying ReLU neuron problem and the clustered ReLU neuron problem at the same time. Numerical examples are provided to demonstrate the performance of our method. We found that the data-dependent initialization method outperforms the ‘He initialization’ both with and without bias in all of our tests.
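The general idea of avoiding both failure modes at once can be sketched as follows for a shallow 1-D network. This is a loose illustration only: the paper's actual scheme with the parameters of (12) is not reproduced here, and the anchor-point construction below is an assumed, simplified stand-in that merely demonstrates the principle of tying breakpoints to the data.

```python
import numpy as np

def data_dependent_bias_sketch(x_train, width, rng=None):
    """Illustrative (not the paper's) data-dependent bias initialization:
    draw weights He-style, then choose each bias so that the neuron's
    breakpoint -b_j / w_j coincides with a randomly chosen training point.
    Breakpoints then spread over the data range instead of clustering at
    the origin, and (up to boundary cases) neurons activate on part of
    the data rather than being born dead."""
    rng = rng or np.random.default_rng()
    w = rng.normal(0.0, np.sqrt(2.0), width)     # fan_in = 1
    anchors = rng.choice(x_train, size=width)    # one training point per neuron
    b = -w * anchors                             # breakpoint placed at the anchor
    return w, b
```

By construction, every breakpoint lies inside the training-data range, which is precisely what the zero-bias scheme (all breakpoints at the origin) and the random-bias scheme (breakpoints possibly outside the data, with the neuron dead on it) fail to guarantee.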
This work is supported by the DOE PhILMs project (No. DE-SC0019453), the AFOSR grant FA9550-17-1-0013, and the DARPA AIRA grant HR00111990025.
Appendix A Proof of Theorem 2.2
Suppose a ReLU neural network of width is initialized to be