Trainability and Data-dependent Initialization of Over-parameterized ReLU Neural Networks

07/23/2019 ∙ by Yeonjong Shin, et al. ∙ Brown University

A neural network is said to be over-specified if its representational power is more than needed, and is said to be over-parameterized if the number of parameters is larger than the number of training data. In both cases, the number of neurons is larger than necessary. In many applications, over-specified or over-parameterized neural networks are successfully employed and shown to be trained effectively. In this paper, we study the trainability of ReLU networks, a necessary condition for successful training. We show that over-parameterization is both a necessary and a sufficient condition for minimizing the training loss. Specifically, we study the probability distribution of the number of active neurons at the initialization. We say a network is trainable if the number of active neurons is sufficiently large for a learning task. With this notion, we derive an upper bound on the probability of successful training. Furthermore, we propose a data-dependent initialization method in the over-parameterized setting. Numerical examples are provided to demonstrate the effectiveness of the method and our theoretical findings.


1 Introduction

Neural networks have been successfully used in various fields of application, including image classification in computer vision (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), natural language translation (Wu et al., 2016), and superhuman performance in the game of Go (Silver et al., 2016). Modern neural networks are often severely over-parameterized or over-specified. Over-parameterization means that the number of parameters is much larger than the number of training data. Over-specification means that the number of neurons in a network is much larger than needed. It has been reported that the wider a neural network is, the easier it is to train (Livni et al., 2014; Safran and Shamir, 2016; Nguyen and Hein, 2017).

In general, neural networks are trained by first- or second-order gradient-based optimization methods starting from a random initialization. Almost all gradient-based optimization methods stem from backpropagation (Rumelhart et al., 1985) and the stochastic gradient descent (SGD) method (Robbins and Monro, 1951). Many variants of vanilla SGD have been proposed, for example, AdaGrad (Duchi et al., 2011), RMSProp (Hinton, 2014), Adam (Kingma and Ba, 2015), AMSGrad (Reddi et al., 2019), and L-BFGS (Byrd et al., 1995), to name just a few. Different optimization methods have different convergence properties, and it is still far from clear how different optimization methods affect the performance of trained neural networks. Nonetheless, how the optimization process is started plays a crucial role in the success of training. A properly chosen weight initialization can drastically improve the training performance and enable the training of deep neural networks; for example, see (LeCun et al., 1998; Glorot and Bengio, 2010; Saxe et al., 2014; He et al., 2015; Mishkin and Matas, 2016), and for more recent work see (Lu et al., 2019). Among them, when it comes to rectified linear unit (ReLU) neural networks, the 'He initialization' (He et al., 2015) is one of the most commonly used initialization methods, due to its success in a visual recognition challenge.

There are several theoretical works showing that, under various assumptions, over-parameterized neural networks can perfectly interpolate the training data. For the shallow neural network setting, see (Oymak and Soltanolkotabi, 2019; Soltanolkotabi et al., 2019; Du et al., 2018b; Li and Liang, 2018). For the deep neural network setting, see (Du et al., 2018a; Zou et al., 2018; Allen-Zhu et al., 2018). Hence, over-parameterization can be viewed as a sufficient condition for minimizing the training loss. Despite the current theoretical progress, there still exists a huge gap between existing theories and empirical observations in terms of the required level of over-parameterization. To illustrate this gap, let us consider the problem of approximating a simple target function with a shallow ReLU network using 10 training data points, drawn uniformly at random from the training domain. To interpolate all 10 data points, the best existing theoretical condition requires a width on the order of the square of the number of training data (Oymak and Soltanolkotabi, 2019); in this case, a width of 100 would be needed. Figure 1 shows the convergence of the root mean square errors (RMSE) on the training data with respect to the number of epochs for five independent simulations. On the left, the results for width 10 are shown. We observe that all five training losses converge to zero as the number of epochs increases. Bridging this gap in the degree of over-parameterization remains an ongoing challenge.

Figure 1: The root mean square errors on the training data of five independent simulations with respect to the number of epochs. The standard $\ell^2$ loss is employed. (Left) Width 10 and depth 2. (Right) Width 2 and depth 2.

On the other hand, we know that the target function can be exactly represented by only two ReLU neurons. Thus, we show the results for width 2 on the right of Figure 1. In contrast to the theoretical guarantee, we observe that only one out of five simulations achieves zero training error. It turns out that there is a probability greater than 0.43 that a network of width 2 fails to be trained successfully (Theorem 4.1); see also (Lu et al., 2019).
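The failure probability quoted above can be checked with a short Monte Carlo experiment. The sketch below is minimal and assumes a one-dimensional input domain [-1, 1] and a zero-mean normal initialization with equal weight and bias scales (an illustrative stand-in for the 'He initialization' with bias); under these assumptions a first-layer neuron phi(w x + b) is born dead exactly when b <= -|w|.

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_born_dead_network(width=2, n_trials=200_000, sigma_w=1.0, sigma_b=1.0):
    """Monte Carlo estimate of P(at least one first-layer neuron is born dead)
    for neurons phi(w*x + b) on the 1-D domain [-1, 1].  A neuron is dead,
    i.e. identically zero, iff w*x + b <= 0 for every x in [-1, 1], which is
    equivalent to b <= -|w|."""
    w = rng.normal(0.0, sigma_w, size=(n_trials, width))
    b = rng.normal(0.0, sigma_b, size=(n_trials, width))
    dead = b <= -np.abs(w)            # per-neuron born-dead indicator
    return float(np.mean(dead.any(axis=1)))

print(prob_born_dead_network())        # approximately 7/16 = 0.4375
```

With two neurons and equal scales, each neuron is born dead with probability 1/4, so the network loses at least one neuron with probability 1 - (3/4)^2 = 7/16, consistent with the 0.43 figure above.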

In this paper, we study the trainability of ReLU networks, a necessary condition for successful training. Suppose a learning task requires a ReLU network to have at least a certain number of active neurons, say $k$. Given the learning task, the training is said to be successful if the trained network has at least $k$ active neurons and produces a small training loss. Also, a network is said to be trainable if its number of active neurons is greater than or equal to $k$. We first show that, in order to achieve successful training, an initialized network should have at least $k$ active neurons. This implies that a network being trainable is a necessary condition for successful training. If an initialized ReLU network is not trainable, the training will not be successful regardless of which gradient-based optimization method is selected. Due to the random initialization of weights and biases, however, it is unlikely that all neurons are active at the beginning of the training. We thus study the probability distribution of the number of active neurons at the initialization and introduce the notion of trainability of ReLU networks. We refer to the probability of a network being trainable as its trainability. The trainability serves as an upper bound of the probability of successful training and can be calculated from the probability distribution of the number of active neurons. Furthermore, by showing that over-parameterization can be understood within the frame of over-specification, we prove that over-parameterization is both a necessary and a sufficient condition for minimizing the training loss, i.e., interpolating all training data.

In practice, it is important to maintain a high trainability for successful training. To secure a high trainability, over-specification is inevitable. From this perspective, the zero-bias initialization should be preferred over the random-bias initialization. However, the zero-bias initialization locates all neurons at the origin. This could make the training slower and often leads to a spurious local minimum, especially for a learning task which requires neurons to be evenly distributed; for example, see Figure 3 in Section 5. On the other hand, if the random-bias initialization is used, all neurons are randomly distributed over the entire domain and some neurons will never be activated. Thus, in order to maintain a high trainability, the network needs to be severely over-parameterized. If not, the trained network loses its accuracy in the parts of the domain where neurons are dead. In order to overcome these difficulties, we propose a new data-dependent initialization method for the over-parameterized setting, where the width is greater than or equal to the number of training data. By adopting the trainability perspective, the proposed method is designed to avoid both the clustered neuron problem and the dying ReLU neuron problem at the same time. We remark that the idea of data-dependent initialization is not new; see (Ioffe and Szegedy, 2015; Krähenbühl et al., 2015; Salimans and Kingma, 2016). However, our method is specialized to the over-parameterized setting.

The rest of this paper is organized as follows. Upon presenting the mathematical setup in Section 2, we discuss the probability distribution of the number of active neurons in Section 3. In Section 4, we present the trainability of ReLU networks. A new data-dependent initialization is introduced in Section 5. Numerical examples are provided in Section 6, before the conclusion in Section 7.

2 Mathematical Setup

Let $\mathcal{N}^L : \mathbb{R}^{d_{\text{in}}} \to \mathbb{R}^{d_{\text{out}}}$ be a feed-forward neural network with $L$ layers and $n_\ell$ neurons in the $\ell$-th layer ($n_0 = d_{\text{in}}$, $n_L = d_{\text{out}}$). For $1 \le \ell \le L$, the weight matrix and the bias vector in the $\ell$-th layer are denoted by $W^\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ and $b^\ell \in \mathbb{R}^{n_\ell}$, respectively; $n_\ell$ is called the width of the $\ell$-th layer. We also denote the input by $x$ and the output at the $\ell$-th layer by $\mathcal{N}^\ell(x)$. Given an activation function $\phi$ which is applied element-wise, the feed-forward neural network is defined by

(1)  $\mathcal{N}^\ell(x) = W^\ell \phi\bigl(\mathcal{N}^{\ell-1}(x)\bigr) + b^\ell, \qquad 2 \le \ell \le L,$

and $\mathcal{N}^1(x) = W^1 x + b^1$. Note that $\mathcal{N}^L$ is called an $(L-1)$-hidden-layer neural network or an $L$-layer neural network. Also, $\phi(\mathcal{N}^\ell_j(x))$, $1 \le j \le n_\ell$, is called a neuron or a unit in the $\ell$-th hidden layer. We use $\vec{n} = (n_0, n_1, \ldots, n_L)$ to describe a network architecture.

Let $\theta$ be the collection of all weight matrices and bias vectors, i.e., $\theta = \{(W^\ell, b^\ell)\}_{\ell=1}^{L}$. To emphasize the dependency on $\theta$, we often denote the neural network by $\mathcal{N}^L(x; \theta)$. In this paper, the rectified linear unit (ReLU) is employed as the activation function, i.e., $\phi(x) = \max\{x, 0\}$.

In many machine learning applications, the goal is to train a neural network using a set of training data $\mathcal{T}_m = \{(x_i, y_i)\}_{i=1}^{m}$. Each datum is a pair of an input and an output, $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$. Here $\mathcal{X} \subset \mathbb{R}^{d_{\text{in}}}$ is the input space and $\mathcal{Y} \subset \mathbb{R}^{d_{\text{out}}}$ is the output space. In order to measure the discrepancy between a prediction and an output, we introduce a loss metric $\ell(\cdot, \cdot)$ to define a loss function $\mathcal{L}(\theta)$:

(2)  $\mathcal{L}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \ell\bigl(\mathcal{N}^L(x_i; \theta),\, y_i\bigr).$

For example, the squared loss, the logistic loss, the hinge loss, or the cross-entropy loss is commonly employed. We then seek a $\theta$ which minimizes the loss function $\mathcal{L}(\theta)$. In general, a gradient-based optimization method is employed for the training. In its most basic form, given an initial value $\theta^{(0)}$, the parameters are updated according to

$\theta^{(t+1)} = \theta^{(t)} - \eta\, \nabla_\theta \mathcal{L}\bigl(\theta^{(t)}\bigr),$

where $\eta$ is the learning rate.
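To make the setup concrete, the following is a minimal numpy sketch of a one-hidden-layer ReLU network trained by plain gradient descent on the squared loss, in the spirit of (1), (2), and the update rule above. The target function, width, learning rate, and iteration count are illustrative choices only, not the settings used in the experiments of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = lambda z: np.maximum(z, 0.0)                       # ReLU activation

# one-hidden-layer network N(x) = c^T phi(W x + b), cf. (1)
d_in, width, m = 1, 10, 10
W = rng.normal(0.0, np.sqrt(2.0 / d_in), (width, d_in))  # He-style weight scale
b = np.zeros(width)                                      # zero-bias initialization
c = rng.normal(0.0, np.sqrt(2.0 / width), width)

x = rng.uniform(-1.0, 1.0, (m, d_in))                    # training inputs
y = np.abs(x[:, 0])                                      # an illustrative target

def forward(x):
    return phi(x @ W.T + b) @ c

eta = 1e-2                                               # learning rate
for _ in range(5000):                                    # theta <- theta - eta * grad L(theta)
    h = phi(x @ W.T + b)                                 # hidden activations, (m, width)
    r = h @ c - y                                        # residuals
    grad_c = h.T @ r / m
    grad_pre = np.outer(r, c) * (h > 0)                  # back-propagate through the ReLU
    grad_W = grad_pre.T @ x / m
    grad_b = grad_pre.mean(axis=0)
    c -= eta * grad_c
    W -= eta * grad_W
    b -= eta * grad_b

print("training RMSE:", np.sqrt(np.mean((forward(x) - y) ** 2)))
```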

2.1 Weights and Biases Initialization and Data Normalization

Gradient-based optimization is a popular choice for training a neural network and requires weights and biases to be initialized in the first place. How to initialize the network plays a crucial role in the success of the training. Typically, the weight matrices are randomly initialized from probability distributions.

In this paper, we consider the following weight and bias initialization schemes. One is the 'normal' initialization: all weights and/or biases in the $\ell$-th layer are independently initialized from zero-mean normal distributions,

(3)  $W^\ell_{j,:} \sim N\bigl(0, \sigma_\ell^2 I_{n_{\ell-1}}\bigr), \qquad b^\ell \sim N\bigl(0, \sigma_\ell^2 I_{n_\ell}\bigr),$

where $I_{n}$ is the identity matrix of size $n$. When the variances are set to $2/n_{\ell-1}$, it is known as the 'He initialization' (He et al., 2015), one of the most popular initialization methods for ReLU networks. The other is the uniform distribution on the unit hypersphere: each row of either $W^\ell$ or the augmented matrix $[W^\ell, b^\ell]$ is independently initialized from the uniform distribution on the corresponding unit hypersphere,

(4)  $W^\ell_{j,:} \sim \mathrm{Unif}\bigl(\mathbb{S}^{n_{\ell-1}-1}\bigr) \quad \text{or} \quad \bigl(W^\ell_{j,:},\, b^\ell_j\bigr) \sim \mathrm{Unif}\bigl(\mathbb{S}^{n_{\ell-1}}\bigr).$

Throughout this paper, we assume that the training input domain is a bounded domain $\Omega$ centered at the origin. In practice, the training data are often normalized to have mean zero and variance 1. Given a training data set, the normalization bounds the norm of every training input, which corresponds to a particular choice of the domain. The output of the 1st hidden layer can be written as $z = \tilde{W}^1 \tilde{x}$, where $\tilde{W}^1 = [W^1, b^1]$ and $\tilde{x} = [x^\top, 1]^\top$. Many theoretical works (Allen-Zhu et al., 2018; Du et al., 2018a; Li and Liang, 2018; Soltanolkotabi et al., 2019; Zou et al., 2018) assume $\|x_i\| = 1$ for all $i$ and set the last entry of $\tilde{x}$ to a positive constant, which corresponds to a particular choice of the bias initialization.
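The two schemes can be summarized in a few lines of code. The sketch below is a hedged reading of (3) and (4): the default scales (in particular, using the same standard deviation for the bias as for the weights) are illustrative assumptions, and a uniform point on the unit hypersphere is obtained by normalizing a Gaussian vector.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_init(n_in, n_out, with_bias=True, sigma_w=None, sigma_b=None):
    """'Normal' initialization (3): zero-mean Gaussian entries.
    sigma_w = sqrt(2 / n_in) recovers the He scaling for the weights;
    sigma_b defaults to sigma_w, an illustrative choice."""
    sigma_w = np.sqrt(2.0 / n_in) if sigma_w is None else sigma_w
    sigma_b = sigma_w if sigma_b is None else sigma_b
    W = rng.normal(0.0, sigma_w, (n_out, n_in))
    b = rng.normal(0.0, sigma_b, n_out) if with_bias else np.zeros(n_out)
    return W, b

def hypersphere_init(n_in, n_out, with_bias=True):
    """'Unit hypersphere' initialization (4): each row of W (or of [W, b])
    is uniform on the unit sphere, generated by normalizing a Gaussian row."""
    dim = n_in + 1 if with_bias else n_in
    rows = rng.normal(size=(n_out, dim))
    rows /= np.linalg.norm(rows, axis=1, keepdims=True)
    if with_bias:
        return rows[:, :n_in], rows[:, n_in]
    return rows, np.zeros(n_out)
```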

2.2 Dying ReLU and Born Dead Probability

The dying ReLU refers to the problem when ReLU neurons become inactive and only output a constant for any input. We say that a ReLU neuron in the $\ell$-th hidden layer is dead on $\Omega$ if it is a constant function on $\Omega$; that is, there exists a constant $c$ such that the neuron outputs $c$ for every input in $\Omega$. Also, a ReLU neuron is said to be born dead (BD) if it is dead at the initialization. In contrast, a ReLU neuron is said to be active on $\Omega$ if it is not a constant function on $\Omega$. The notion of born death was introduced in (Lu et al., 2019), where a ReLU network is said to be BD if there exists a layer in which all neurons are BD. We refer to the probability that a ReLU neuron is BD as the born dead probability (BDP) of a ReLU neuron.

In the 1st hidden layer, once a ReLU neuron is dead, it cannot be revived during the training: the dead neurons in a shallow ReLU network cannot be revived through gradient-based training. The proof can be found in Appendix A. In the $\ell$-th hidden layer with $\ell \ge 2$, a dead neuron outputs either a strictly positive constant or zero. If the former is the case, it readily follows from Lemma 10 of Lu et al. (2019) that, with probability 1, there exists a hidden layer such that all neurons are dead. For the reader's convenience, we present a variant of Lemma 10 of Lu et al. (2019) below. Given a deep ReLU network, suppose the weight matrices and/or the bias vectors are initialized from probability distributions which satisfy, for any fixed nonzero vector, the condition of Lemma 10 of Lu et al. (2019). If either

  • there exists a hidden layer such that all neurons are dead, or

  • there exists a dead neuron in the $\ell$-th hidden layer whose value is positive,

then, with probability 1, all dead neurons cannot be revived through gradient-based training.
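As a concrete check, born-dead first-layer neurons can be flagged numerically by testing whether the pre-activation is non-positive on a sample of the training domain. The sketch below is illustrative; the domain, width, and scales are arbitrary choices.

```python
import numpy as np

def dead_first_layer_neurons(W, b, x_samples):
    """Flags first-layer ReLU neurons that are dead (constant) on the sampled
    domain.  On a domain with nonempty interior, a first-layer ReLU neuron
    phi(w^T x + b) is constant essentially only when w^T x + b <= 0 everywhere,
    in which case its output is identically zero."""
    pre = x_samples @ W.T + b           # (n_samples, n_neurons) pre-activations
    return np.all(pre <= 0.0, axis=0)   # boolean mask of born-dead neurons

# usage: count active first-layer neurons at initialization on [-1, 1]^d
rng = np.random.default_rng(0)
d, width = 2, 50
W = rng.normal(0.0, np.sqrt(2.0 / d), (width, d))
b = rng.normal(0.0, np.sqrt(2.0 / d), width)
x = rng.uniform(-1.0, 1.0, (10_000, d))
print("active neurons:", width - int(dead_first_layer_neurons(W, b, x).sum()))
```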

3 Probability Distribution of the Number of Active ReLU Neurons

Understanding how many neurons will be active at the initialization is not only directly related to the trainability of a ReLU network, but also suggests how much over-specification or over-parameterization would be needed. In this section, we investigate the probability distribution of the number of active ReLU neurons for an initialized ReLU network.

Given a ReLU network with architecture $\vec{n} = (n_0, n_1, \ldots, n_L)$, equip the underlying probability space with the filtration $\{\mathcal{F}_\ell\}$, where $\mathcal{F}_\ell$ is the $\sigma$-algebra generated by the weights and biases of the first $\ell$ layers. Let $M_\ell$ be the number of active neurons at the $\ell$-th hidden layer. Then, the distribution of $M_\ell$ is defined as follows. The probability distribution of the number of active neurons at the $\ell$-th hidden layer is

(5)  $\pi_\ell = \pi_1 P_2 P_3 \cdots P_\ell,$

where $\pi_1 = \bigl(\Pr(M_1 = 0), \ldots, \Pr(M_1 = n_1)\bigr)$ and $P_k$ is the stochastic matrix of size $(n_{k-1}+1) \times (n_k+1)$ whose $(i,j)$-entry is $\Pr(M_k = j \mid M_{k-1} = i)$.

For each $k \ge 2$ and $0 \le i \le n_{k-1}$, let $A_{k-1}^{i}$ be the event where exactly $i$ neurons are active at the $(k-1)$-th hidden layer. Also, let $p_k$ be the BDP of a single ReLU neuron at the $k$-th hidden layer given $A_{k-1}^{i}$; it depends on the realized parameters of the first $k-1$ layers. Then, the stochastic matrix is expressed as

(6)  $[P_k]_{ij} = \mathbb{E}\Bigl[\tbinom{n_k}{j}\,(1 - p_k)^{j}\, p_k^{\,n_k - j} \,\Big|\, A_{k-1}^{i}\Bigr],$

where the expectation is taken with respect to the parameters of the first $k-1$ layers, conditioned on $A_{k-1}^{i}$. Thus, it can be seen that the BDP is a fundamental quantity for the complete understanding of $\pi_\ell$.
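The recursion (5)-(6) can be turned into a small computational routine once conditional BDPs are available. The sketch below is schematic: `cond_bdp` is a user-supplied placeholder that returns a single representative BDP value given the number of active neurons in the previous layer, rather than the exact conditional expectation appearing in (6).

```python
import numpy as np
from scipy.stats import binom

def stochastic_matrix(n_prev, n_cur, cond_bdp):
    """Builds the matrix P whose (i, j)-entry approximates P(M_l = j | M_{l-1} = i)
    via a binomial law with a conditional born-dead probability, cf. (6).
    cond_bdp(i) is a placeholder for the BDP of one neuron given that exactly
    i neurons are active in the previous layer."""
    P = np.zeros((n_prev + 1, n_cur + 1))
    for i in range(n_prev + 1):
        p = cond_bdp(i)
        P[i, :] = binom.pmf(np.arange(n_cur + 1), n_cur, 1.0 - p)
    return P

def propagate(pi1, widths, cond_bdps):
    """Propagates the 1st-layer distribution pi1 through higher layers as in (5):
    pi_l = pi_1 P_2 ... P_l.  `widths` lists the hidden-layer widths and
    `cond_bdps` lists one conditional-BDP function per transition."""
    pis = [np.asarray(pi1, dtype=float)]
    for l in range(1, len(widths)):
        P = stochastic_matrix(widths[l - 1], widths[l], cond_bdps[l - 1])
        pis.append(pis[-1] @ P)
    return pis
```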

As a first step towards understanding $\pi_\ell$, we calculate the exact probability distribution of the number of active neurons in the 1st hidden layer. Given a ReLU network with architecture $\vec{n}$, suppose the training input domain is $\Omega$. If either the 'normal' (3) or the 'unit hypersphere' (4) initialization without bias is used in the 1st hidden layer, we have $\Pr(M_1 = n_1) = 1$, i.e., every neuron in the 1st hidden layer is active with probability 1. If either the 'normal' (3) or the 'unit hypersphere' (4) initialization with bias is used in the 1st hidden layer, $M_1$ follows a binomial distribution with parameters $n_1$ and $1 - p_1$, where the single-neuron BDP $p_1$ is given by

(7)

and $\Gamma(\cdot)$ is the Gamma function. The proof can be found in Appendix B.
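A quick Monte Carlo experiment illustrates the binomial claim: estimate the single-neuron BDP empirically and compare the empirical distribution of the number of active first-layer neurons with the corresponding binomial law. The domain, width, variances, and sample sizes below are illustrative assumptions, and the closed-form expression (7) is not used.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
d, n1, trials = 1, 5, 20_000
x = rng.uniform(-1.0, 1.0, (1_000, d))            # samples from the input domain

def active_count():
    """Number of active 1st-hidden-layer neurons under normal init with bias."""
    W = rng.normal(0.0, 1.0, (n1, d))
    b = rng.normal(0.0, 1.0, n1)
    pre = x @ W.T + b
    return int(np.sum(~np.all(pre <= 0.0, axis=0)))

counts = np.array([active_count() for _ in range(trials)])
p_hat = 1.0 - counts.mean() / n1                  # empirical single-neuron BDP
emp = np.bincount(counts, minlength=n1 + 1) / trials
thy = binom.pmf(np.arange(n1 + 1), n1, 1.0 - p_hat)
print("empirical:", np.round(emp, 3))
print("binomial :", np.round(thy, 3))             # the two rows should roughly agree
```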

We now calculate the distribution at the 2nd hidden layer for a ReLU network with one-dimensional input. Since the bias in each layer can be initialized in different ways, we consider several combinations of schemes. Given a ReLU network with architecture $\vec{n}$, suppose the training input domain is $\Omega$.

  • Suppose the 'unit hypersphere' (4) initialization with bias is used in the 1st hidden layer.

    1. If the 'normal' (3) initialization without bias is used in the 2nd hidden layer, the entries of the stochastic matrix $P_2$ admit an explicit closed form.

    2. If the 'normal' (3) initialization with bias is used in the 2nd hidden layer, the entries of $P_2$ admit an explicit closed form in terms of explicitly defined auxiliary quantities.

  • Suppose the 'unit hypersphere' (4) initialization without bias is used in the 1st hidden layer.

    1. If the 'normal' (3) initialization without bias is used in the 2nd hidden layer, the entries of $P_2$ admit an explicit closed form.

    2. If the 'normal' (3) initialization with bias is used in the 2nd hidden layer, the entries of $P_2$ admit an explicit closed form in terms of explicitly defined auxiliary quantities.

Then $\pi_2 = \pi_1 P_2$, where $\pi_1$ is the 1st-layer distribution given in Theorem 3. The proof can be found in Appendix D.

Further characterization of $\pi_\ell$ for higher layers will be deferred to a future study. The results above indicate that the bias initialization can drastically change the active-neuron distributions. Since $\pi_\ell$ depends on the lower layers through (5), the behaviors of $\pi_1$ and $\pi_2$ affect the higher layers' distributions. In Figure 2, we consider a deep ReLU network and plot the empirical distributions of the number of active neurons at each hidden layer, obtained from a large number of independent simulations. On the left and in the middle, the 'unit hypersphere' (4) initialization without and with bias is employed, respectively, in all layers. On the right, the 'unit hypersphere' initialization without bias is employed in the 1st hidden layer, and the 'normal' (3) initialization with bias is employed in all other layers. The theoretically derived distributions are also plotted as references. We see that all empirical results match our theoretical derivations well. When the 1st hidden layer is initialized with bias, with probability 0.8, at least one neuron in the 1st hidden layer will be dead. On the other hand, if the 1st hidden layer is initialized without bias, with probability 1, no neuron will be dead. It is clear that the distributions obtained by the three initialization schemes show different behavior.

Figure 2: The probability distributions of the number of active neurons at different layers for a deep ReLU network. (Left) All layers are initialized by the 'unit hypersphere' with bias. (Middle) All layers are initialized by the 'unit hypersphere' without bias. (Right) The first hidden layer is initialized by the 'unit hypersphere' without bias; all other layers are initialized by the 'normal' with bias.
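Empirical distributions of the kind shown in Figure 2 can be reproduced in a few lines. The sketch below uses an illustrative architecture and one particular combination of schemes (unit hypersphere without bias in the first layer, normal with bias above); a neuron is counted as active if its output varies over a sample of inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def layerwise_active_counts(widths, d_in=1, bias_first=False, n_inputs=1000):
    """One random initialization ('unit hypersphere' in layer 1, 'normal' with
    bias above); returns the number of active (non-constant) neurons per layer."""
    x = rng.uniform(-1.0, 1.0, (n_inputs, d_in))
    h, counts = x, []
    for l, n_out in enumerate(widths):
        n_in = h.shape[1]
        if l == 0:                                   # unit-hypersphere rows
            dim = n_in + 1 if bias_first else n_in
            rows = rng.normal(size=(n_out, dim))
            rows /= np.linalg.norm(rows, axis=1, keepdims=True)
            W = rows[:, :n_in]
            b = rows[:, n_in] if bias_first else np.zeros(n_out)
        else:                                        # normal init with bias
            W = rng.normal(0.0, np.sqrt(2.0 / n_in), (n_out, n_in))
            b = rng.normal(0.0, np.sqrt(2.0 / n_in), n_out)
        h = np.maximum(h @ W.T + b, 0.0)
        # a neuron is active if its output is not constant over the sampled inputs
        counts.append(int(np.sum(h.max(axis=0) - h.min(axis=0) > 1e-12)))
    return counts

samples = np.array([layerwise_active_counts([5, 5, 5]) for _ in range(2000)])
for l in range(samples.shape[1]):
    print(f"layer {l + 1}: mean active neurons = {samples[:, l].mean():.2f}")
```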

4 Trainability and Over-specification of ReLU Networks

We are now in a position to introduce the trainability of ReLU neural networks.

4.1 Trainability of Shallow ReLU Networks

For pedagogical reasons, we first confine ourselves to shallow ReLU networks. Let $\mathcal{F}_n$ be the class of shallow ReLU neural networks of width $n$ on the training input domain $\Omega$;

(8)  $\mathcal{F}_n = \Bigl\{\, x \mapsto \textstyle\sum_{j=1}^{n} c_j\, \phi\bigl(w_j^\top x + b_j\bigr) \Bigr\},$

where $w_j \in \mathbb{R}^{d_{\text{in}}}$, $b_j \in \mathbb{R}$, and $c_j \in \mathbb{R}$. We note that $\mathcal{F}_n \subset \mathcal{F}_{n+1}$ for every $n$. However, a function in $\mathcal{F}_n$ could admit different representations in the larger classes $\mathcal{F}_{n'}$, $n' > n$, on a compact domain $\Omega$. We say $\mathcal{F}_m$ is the minimal function class for $f$ if there exists $g \in \mathcal{F}_m$ such that $g = f$ on $\Omega$ and $m$ is the smallest such integer. For example, a function whose minimal width is two can be written with two ReLU neurons, or equivalently with three or more neurons, but it cannot be represented by a single neuron; thus $\mathcal{F}_2$ is its minimal function class. We remark that two such representations need not be the same function on all of $\mathbb{R}^{d_{\text{in}}}$; however, they coincide on $\Omega$.

Given a continuous function $f$ and $\epsilon > 0$, a function class $\mathcal{F}_{m_\epsilon}$ is said to be the $\epsilon$-approximated minimal function class for $f$ if $m_\epsilon$ is the smallest number such that there exists $g \in \mathcal{F}_{m_\epsilon}$ with $\|f - g\| \le \epsilon$ on $\Omega$. If $\epsilon = 0$, we say $\mathcal{F}_{m_0}$ is the minimal function class for $f$.

Note that the existence of such an $m_\epsilon$ is guaranteed by the universal approximation theorem for shallow neural networks (Hornik, 1991; Cybenko, 1989). Any ReLU network of width greater than $m_\epsilon$ is then said to be over-specified for approximating $f$.

A network is said to be over-parameterized if the number of parameters is larger than the number of training data. In this paper, we consider over-parameterization in the sense that the width is greater than or equal to the number of training data $m$. Over-parameterization can then be understood within the frame of over-specification by the following theorem. For any non-degenerate training data of size $m$, there exists a shallow ReLU network of width $m$ which interpolates all the training data. Furthermore, there exists non-degenerate training data of size $m$ such that no shallow ReLU network of width less than $m$ can interpolate all the training data. In this sense, $m$ is the minimal width. The proof can be found in Appendix E.
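For one-dimensional inputs, the existence claim is easy to make explicit. The sketch below gives one standard construction (piecewise-linear interpolation with one ReLU kink per interior data point plus a constant shift), assuming distinct, i.e., non-degenerate, inputs; it is offered only as an illustration and is not necessarily the construction used in Appendix E.

```python
import numpy as np

def interpolate_1d(xs, ys):
    """Builds a shallow ReLU network g(x) = c0 + sum_j c_j * relu(x - t_j)
    that interpolates m one-dimensional data points using m - 1 neurons plus
    a constant shift, by matching the slope of g on each interval."""
    order = np.argsort(xs)
    xs, ys = np.asarray(xs, dtype=float)[order], np.asarray(ys, dtype=float)[order]
    slopes = np.diff(ys) / np.diff(xs)                 # target slope per interval
    coeffs = np.diff(np.concatenate(([0.0], slopes)))  # c_j = slope_j - slope_{j-1}
    knots = xs[:-1]                                    # kink locations t_j = x_j
    def g(x):
        x = np.asarray(x, dtype=float)
        return ys[0] + np.maximum(x[..., None] - knots, 0.0) @ coeffs
    return g

xs = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])
ys = np.array([0.5, -1.0, 0.7, 0.2, 1.1])
g = interpolate_1d(xs, ys)
print(np.allclose(g(xs), ys))                          # True: all data interpolated
```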

Theorem 4.1 shows that any network of width greater than or equal to $m$ is over-specified for interpolating $m$ training data. Thus, we can regard over-parameterization as a kind of over-specification.

We are now in a position to define the trainability of a neural network. For a learning task which requires at least $k$ active neurons, a shallow ReLU network of width $n_1 \ge k$ is said to be trainable if its number of active neurons is greater than or equal to $k$. We refer to the probability that a network is trainable at the initialization as its trainability.

Note that approximating a function whose minimal function class is $\mathcal{F}_k$, as well as interpolating non-degenerate training data of size $k$, can be regarded as a learning task which requires at least $k$ active neurons.

It was shown in Theorem 2.2 that once a ReLU neuron of the 1st hidden layer is dead, it will never be revived during the training. Thus, given a learning task which requires at least $k$ active neurons, in order for the training to be successful, an initialized network should have at least $k$ active neurons in the first place. If the number of active neurons is less than $k$, there is no hope of training the network successfully. Therefore, a network being trainable is a necessary condition for successful training. We remark that this condition is independent of the choice of the loss metric in (2).

Given a learning task which requires a shallow ReLU network to have at least $k$ active neurons, suppose the training input domain is $\Omega$ and a shallow ReLU network of width $n_1 \ge k$ is employed.

  • If either the 'normal' (3) or the 'unit hypersphere' (4) initialization without bias is used in the 1st hidden layer, then with probability 1 the network is trainable.

  • If either the 'normal' (3) or the 'unit hypersphere' (4) initialization with bias is used in the 1st hidden layer, then the network is trainable with probability

    $\sum_{j=k}^{n_1} \binom{n_1}{j}\,(1 - p_1)^{j}\, p_1^{\,n_1 - j},$

    where $p_1$ is the single-neuron BDP defined in Theorem 3. Furthermore, on average, $(1 - p_1)\, n_1$ neurons will be active at the initialization.

The probability of a network being trainable is $\Pr(M_1 \ge k)$, where $M_1$ is the number of active neurons in the 1st hidden layer at the initialization. If the 'He initialization' without bias is used in the 1st hidden layer, by Theorem 3 we have $\Pr(M_1 = n_1) = 1$ and thus $\Pr(M_1 \ge k) = 1$. If the 'He initialization' with bias is used in the 1st hidden layer, by Theorem 3, $M_1$ follows a binomial distribution with parameters $n_1$ and $1 - p_1$. Thus, the proof is completed.

Theorem 4.1 explains why over-specification is necessary for training a shallow ReLU network when the biases are randomly initialized. It also suggests the degree of over-specification whenever one has a specific width in mind for a learning task. That is, if it is known (either theoretically or empirically) that a shallow network with $k$ active neurons can achieve a good performance, one should start with a network of width roughly $k/(1 - p_1)$ so that, on average, $k$ neurons are active at the initialization. The example of Figure 1 in Section 1 can also be understood in this manner: by Theorem 4.1, with probability at least 0.43, the network of width 2 fails to be trained successfully for a learning task which requires at least 2 active neurons.
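The trainability bound and the suggested degree of over-specification reduce to a binomial tail computation. In the sketch below, the single-neuron BDP is passed in as a plain number; the value 1/4 used in the example corresponds to a one-dimensional domain with equal weight and bias scales, as in the width-2 example of the introduction, rather than the closed-form expression (7).

```python
import math
from scipy.stats import binom

def trainability(width, k_required, bdp):
    """P(at least k_required of `width` iid neurons are active at initialization),
    where each neuron is born dead independently with probability `bdp`.
    This upper-bounds the probability of successful training (Theorem 4.1)."""
    return binom.sf(k_required - 1, width, 1.0 - bdp)

def suggested_width(k_required, bdp):
    """Smallest width whose expected number of active neurons at initialization
    is at least k_required, i.e. (1 - bdp) * width >= k_required."""
    return math.ceil(k_required / (1.0 - bdp))

print(trainability(width=2, k_required=2, bdp=0.25))   # 0.5625, so failure prob. 0.4375
print(suggested_width(k_required=2, bdp=0.25))         # 3
```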

In the shallow ReLU network, suppose either the 'normal' (3) or the 'unit hypersphere' (4) initialization with bias is employed in the first hidden layer, and suppose the training input domain is $\Omega$. For any non-degenerate training data of size $m$, which requires a network to have at least $m$ active neurons for the interpolation, suppose the width $n_1$ and the input dimension satisfy the condition

(9)

Then, over-parameterization is both a necessary and a sufficient condition for interpolating all the training data with high probability over the random initialization by the (stochastic) gradient-descent method. The proof can be found in Appendix F.

We remark that Theorem 4.1 assumes that the biases are randomly initialized. If the biases are initialized to zero, over-parameterization or over-specification is not needed from this perspective. Also, we note that, to the best of our knowledge, all existing theoretical results assume that the biases are randomly initialized, e.g. Du et al. (2018b); Oymak and Soltanolkotabi (2019); Li and Liang (2018).

4.2 Trainability of Deep ReLU Networks

We now extend the notion of trainability to deep ReLU networks. Unlike dead neurons in the 1st hidden layer, a dead neuron in the $\ell$-th hidden layer for $\ell \ge 2$ could be revived during the training if two conditions are satisfied. One is that every layer has at least one active neuron; this condition follows directly from Theorem 2.2. The other is that the dead neuron in the $\ell$-th hidden layer is only tentatively dead. We remark that these two conditions are necessary conditions for the revival of a dead neuron. We give a precise meaning of tentative death as follows.

Let us consider a neuron $\phi(w^\top h + b)$ in the $\ell$-th hidden layer, where $h = \phi(z^{\ell-1}) \ge 0$ is the output of the previous layer. Suppose the neuron is dead. If the neuron remains dead for any changes in the parameters of the previous layers, but not in $w$ and $b$, we say the neuron is dead permanently. For example, if all entries of $w$ and the bias $b$ are non-positive, then, since $h \ge 0$, the pre-activation is non-positive regardless of the previous layers, and the neuron will never be activated. Hence, in this case, there is no hope that the neuron can be revived during the training. Otherwise, we say the neuron is dead tentatively.
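The permanent-death criterion described above can be checked from the incoming parameters alone. The sketch below implements this sufficient condition, which relies only on the fact that ReLU outputs (the inputs to any layer with l >= 2) are non-negative.

```python
import numpy as np

def classify_dead_neuron(w, b):
    """Classifies a dead neuron phi(w^T h + b) in a hidden layer l >= 2, where
    h = phi(...) >= 0 is the previous layer's (non-negative) output.  If every
    incoming weight and the bias are non-positive, the pre-activation is <= 0
    for every admissible h, so no change in the earlier layers can revive the
    neuron: it is permanently dead.  Otherwise it is only tentatively dead."""
    w = np.asarray(w, dtype=float)
    if np.all(w <= 0.0) and b <= 0.0:
        return "permanently dead"
    return "tentatively dead"

print(classify_dead_neuron([-0.3, -1.2], -0.1))   # permanently dead
print(classify_dead_neuron([0.4, -1.2], -2.0))    # tentatively dead
```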

Given the event $A_{\ell-1}^{i}$ that exactly $i$ neurons are active in the $(\ell-1)$-th hidden layer, let $p_\ell^{\mathrm{perm}}$ and $p_\ell^{\mathrm{tent}}$ be the conditional probabilities that a neuron in the $\ell$-th hidden layer is born dead permanently and born dead tentatively, respectively. Then the conditional BDP decomposes as $p_\ell = p_\ell^{\mathrm{perm}} + p_\ell^{\mathrm{tent}}$. Let $T_\ell$ and $D_\ell$ be the numbers of tentatively dead and permanently dead neurons at the $\ell$-th hidden layer, so that $M_\ell + T_\ell + D_\ell = n_\ell$. It can then be checked that, conditioned on the previous layers, $(M_\ell, T_\ell, D_\ell)$ follows a multinomial distribution with parameters $n_\ell$ and $(1 - p_\ell,\, p_\ell^{\mathrm{tent}},\, p_\ell^{\mathrm{perm}})$; the unconditional distribution is obtained by taking the expectation with respect to the previous layers, and the corresponding coefficient is a multinomial coefficient.

With the notion of permanent death of a neuron, the above derivation gives the following trainability theorem for deep ReLU networks.

Given a learning task which requires an $(L-1)$-hidden-layer ReLU network to have at least $k_\ell$ active neurons at the $\ell$-th hidden layer, suppose we employ an $(L-1)$-hidden-layer ReLU network having $n_\ell$ neurons at the $\ell$-th hidden layer such that $n_\ell \ge k_\ell$ for all $\ell$. Then, with probability

(10)

the network is trainable at the initialization. Note that there are no tentatively dead neurons in the first hidden layer, so every dead neuron there is permanently dead. In the one-dimensional case where the biases are initialized to zero, an upper bound on the trainability can be derived. Suppose that all weights are independently initialized from continuous probability distributions symmetric around 0, all biases are zero, and the input dimension is one. Then, the trainability of an $(L-1)$-hidden-layer ReLU network having $n$ neurons at each hidden layer is bounded above by an explicit expression depending only on $n$ and $L$. The proof can be found in Appendix G.

In principle, a single active neuron in the highest hidden layer could potentially revive tentatively dead neurons through the back-propagated gradient. In practice, however, it is preferable to have at least $k_\ell$ active neurons in the $\ell$-th hidden layer at the initialization, for both faster training and robustness. If we employ an $(L-1)$-hidden-layer ReLU network having $n_\ell$ neurons in the $\ell$-th hidden layer, then with probability

(11)

an initialized ReLU network has at least $k_\ell$ active neurons in the $\ell$-th hidden layer for every $\ell$. Therefore, it is suggested to use a ReLU network with sufficiently large width at each layer to make the probability (11) high.

Remark: A trainable network is not guaranteed to be trained successfully. However, if a network is not trainable, there is no hope of successful training. Thus, a network being trainable is a necessary condition for successful training, and the probability that a network is trainable, i.e., the trainability, serves as an upper bound of the probability of successful training.

5 Data-dependent Bias Initialization

In the previous section, we discussed the trainability of ReLU networks. In terms of trainability, Theorem 4.1 indicates that the zero-bias initialization would be preferred over the random-bias initialization. In practice, however, the zero-bias initialization often finds a spurious local minimum or gets stuck on a flat plateau. To illustrate this difficulty, we consider the problem of approximating a sum of two sine functions on a bounded interval. For this task, we use a shallow ReLU network of width 500 with the 'He initialization' without bias. In order to reduce extra randomness in the experiment, 100 equidistant points on the interval are used as the training data set. One of the most popular gradient-based optimization methods, Adam (Kingma and Ba, 2015), is employed with its default parameters, and we use full-batch training. The trained network is plotted in Figure 3. It is clear that the trained network is stuck at a local minimum. A similar behavior is repeatedly observed in all of our multiple independent simulations.

Figure 3: The trained networks for approximating the target function (a sum of two sine functions) by the 'He initialization' without bias and with bias. A shallow ReLU network of width 500 is employed. The target function is also plotted.

This phenomenon can be understood as follows. Since the biases are zero, all initialized neurons are clustered at the origin. Consequently, it takes a long time for the gradient updates to distribute the neurons over the training domain so as to achieve a small training loss. In the worst case, along the way of distributing the neurons, the training finds a spurious local minimum. We refer to this problem as the clustered neuron problem. Indeed, this is observed in Figure 3: the trained network approximates the target function well on a small region containing the origin, but it loses its accuracy on the part of the domain far from the origin.

On the other hand, if we randomly initialize the biases, as shown in Theorem 4.1, over-specification is inevitable to guarantee, with high probability, a certain number of active neurons. In this setting, only 375 of the 500 neurons will be active at the initialization on average. In Figure 3, we also show the result trained with the 'He initialization' with bias. Due to the non-zero biases, the neurons are randomly distributed, and accordingly the trained network approximates the target function quite well. However, due to dead neurons, it loses its accuracy in some parts of the domain.

In order to overcome such difficulties, we propose a new data-dependent initialization method for the over-parameterized setting, where the width is greater than or equal to the number of training data. By adopting the trainability perspective, the method is designed to alleviate both the clustered neuron problem and the dying ReLU neuron problem at the same time.

5.1 Data-dependent Bias Initialization

Let $m$ be the number of training data and $n_1$ be the width of a shallow ReLU network. Suppose the network is over-parameterized so that $n_1 \ge m$. We then propose to initialize the biases in a data-dependent manner: the weights are drawn at random as usual, each neuron is assigned to one of the training inputs so that the inputs are (roughly) evenly covered, and the bias of each neuron is chosen so that the neuron is located at its assigned training input. We note that this mimics the explicit construction for the data interpolation in Theorem 4.1. In this way, the $j$-th neuron is initialized with its kink located at its assigned training input.

The precise scaling of the weights is determined as follows. Let $E(x)$ be the expectation of the normalized squared norm of the network output at $x$, where the expectation is taken over the weights and biases of a shallow ReLU network with the given architecture. Given a set of training input data $\{x_i\}_{i=1}^{m}$, we define the average of $E$ over the training inputs as $\bar{E} = \frac{1}{m}\sum_{i=1}^{m} E(x_i)$. We then choose our parameters so that $\bar{E}$ under the data-dependent initialization matches the value obtained under the standard initialization method. For example, when the 'normal' (3) initialization without bias is used, $\bar{E}$ admits a closed-form expression in terms of the variances and the norms of the training inputs; when the 'He initialization' without bias is used, the expression simplifies further.

Suppose the network is over-parameterized as above and let $\{x_i\}_{i=1}^{m}$ be the set of training input data. If the weights are drawn as in (3) and the biases are initialized by the proposed method, then $\bar{E}$ admits a closed-form expression in terms of the variances and the training inputs. The proof can be found in Appendix H.

For example, if we set the weight variances to be

(12)

then $\bar{E}$ under the data-dependent initialization is equal to that under the 'He initialization' without bias.

The proposed initialization ensures that the neurons are evenly distributed over the training data points and that at least one neuron is activated at each training datum. In this way, it effectively alleviates both the clustered neuron problem and the dying ReLU neuron problem.
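The following sketch illustrates the spirit of the proposed initialization as described above: weights are drawn at random, each neuron is assigned to a training input in round-robin fashion, and its bias is chosen so that the neuron's kink sits on that input. The assignment rule and the weight scale are illustrative assumptions, not the exact scheme analyzed in Appendix H.

```python
import numpy as np

rng = np.random.default_rng(0)

def data_dependent_init(x_train, width, sigma_w=None):
    """Data-dependent bias initialization for a shallow ReLU network in the
    over-parameterized regime (width >= number of training data).  Weights are
    Gaussian; neuron j is assigned a training input x_{pi(j)} (round-robin) and
    its bias is b_j = -w_j^T x_{pi(j)}, so that phi(w_j^T x + b_j) has its kink
    at x_{pi(j)}.  The scale sigma_w is an illustrative choice."""
    x_train = np.atleast_2d(x_train)
    m, d = x_train.shape
    assert width >= m, "over-parameterized setting: width >= number of data"
    sigma_w = np.sqrt(2.0 / d) if sigma_w is None else sigma_w
    W = rng.normal(0.0, sigma_w, (width, d))
    assign = np.arange(width) % m                    # round-robin assignment pi(j)
    b = -np.einsum("jd,jd->j", W, x_train[assign])
    return W, b, assign

x_train = np.linspace(-np.pi, np.pi, 100).reshape(-1, 1)
W, b, assign = data_dependent_init(x_train, width=500)
# every training point receives at least one neuron whose kink sits on it
print(np.unique(assign).size)                        # 100
```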

In Figure 4, we demonstrate the performance of the proposed method in approximating the sum of two sine functions. On the left, the trained neural network is plotted, and on the right, the root mean square error (RMSE) of the training loss is shown with respect to the number of epochs for three different initialization methods. We remark that, since the training set is deterministic and the full batch is used, the only randomness in the training process comes from the weights and biases initialization. It can be seen that the proposed method not only results in the fastest convergence but also achieves the smallest approximation error among the three methods. The numbers of dead neurons in the trained networks are 127 (He with bias), 3 (He without bias), and 17 (data-dependent).

Figure 4: (Left) The trained network for approximating the target function (a sum of two sine functions) by the proposed data-dependent initialization. A shallow ReLU network of width 500 is employed. (Right) The root mean square error of the training loss with respect to the number of epochs of Adam (Kingma and Ba, 2015).

6 Numerical Examples

We present numerical examples to demonstrate our theoretical findings and the effectiveness of the new data-dependent initialization method.

6.1 Trainability of Shallow ReLU Networks

We present two examples to demonstrate the trainability of a ReLU neural network and to justify our theoretical results. Here all the weights and biases are initialized according to the 'He initialization' (3) with bias. We consider two univariate test target functions, $f_1$ and $f_2$:

(13)

We note that $\mathcal{F}_{m_1}$ is the minimal function class for $f_1$ and $\mathcal{F}_{m_2}$ is the minimal function class for $f_2$. That is, theoretically, $f_1$ and $f_2$ should be exactly recovered by shallow ReLU networks of width $m_1$ and $m_2$, respectively. For the training, we use a training set of 600 data points uniformly generated from the training domain and a test set of 1,000 data points generated in the same way. The standard stochastic gradient descent method with mini-batches of size 128 and a constant learning rate is employed, and the maximum number of epochs is fixed. Also, the standard $\ell^2$ loss is used.

In Figure 5, we show the results for approximating $f_1$. On the left, we plot the empirical probability of successful training with respect to the width. The empirical probabilities are obtained from 1,000 independent simulations, and a single simulation is regarded as a success if the test error is below a prescribed tolerance. We also plot the trainability from Theorem 4.1. As expected, it provides an upper bound for the probability of successful training. It is clear that the more the network is over-specified, the higher the trainability. Also, it can be seen that the empirical training success rate increases as the width grows. This suggests that successful training can be achieved (with high probability) by securing a very high trainability. However, since trainability is only a necessary condition, even when an initialized network is trainable, the trained network could still end up in a function class smaller than the minimal one, as shown in the middle and on the right of Figure 5.

Figure 5: (Left) The empirical probability that a network approximates $f_1$ successfully and the probability that a network is trainable (Theorem 4.1), with respect to the width. (Middle, Right) Two trained networks which fall in function classes smaller than the minimal one.

Similar behavior is observed for approximating $f_2$. In Figure 6, we show the corresponding results. On the left, the empirical probability of successful training and the trainability (Theorem 4.1) are plotted with respect to the width. Again, the trainability provides an upper bound for the probability of successful training, and the empirical training success rate increases as the width grows. In the middle and on the right, we plot two of the local minima in which a trainable network could end up. We remark that the choice of gradient-based optimization method, a well-tuned learning rate, and/or other tunable optimization parameters could affect the empirical training success probability. However, the maximum probability one can hope for is bounded by the trainability. In all of our simulations, we did not tune any optimization parameters.

Figure 6: (Left) The empirical probability that a network approximates $f_2$ successfully and the probability that a network is trainable (Theorem 4.1), with respect to the width. (Middle, Right) Two trained networks which fall in function classes smaller than the minimal one.

6.2 Data-dependent Bias Initialization

Next, we compare the training performance of three initialization methods. The first one is the 'He initialization' (He et al., 2015) without bias, i.e., the biases are initialized to zero. The second one is the 'He initialization' with bias (3), i.e., the biases are also randomly initialized from a zero-mean normal distribution whose variance is determined by the width of the 1st hidden layer. The last one is our data-dependent initialization described in the previous section, with the parameters from (12). All results are generated under the same conditions except for the weights and biases initialization.

We consider two test functions, $f_3$ and $f_4$, on the training domain:

(14)

In all tests, we employ a shallow ReLU network of width 100, trained over 25 points drawn uniformly at random from the training domain. The standard $\ell^2$ loss and the gradient-descent method with momentum are employed. The learning rate is set to a constant and the momentum term is set to 0.9.

Figure 7 shows the mean of the RMSE on the training data from 10 independent simulations with respect to the number of epochs for the three initialization methods. The shaded area covers plus or minus one standard deviation from the mean. The results for approximating $f_3$ and $f_4$ are shown on the left and right, respectively. We see that the data-dependent initialization not only results in faster loss convergence but also achieves the smallest training loss. Also, the average number of dead neurons in the trained network is 11 (He with bias), 0 (He without bias), and 0 (data-dependent) for $f_3$, and 12 (He with bias), 0 (He without bias), and 0 (data-dependent) for $f_4$.

Figure 7: The convergence of the root mean square error on the training data for approximating $f_3$ (left) and $f_4$ (right), with respect to the number of epochs of gradient descent with momentum, for the three initialization methods. A shallow ReLU network of width 100 is employed. The shaded area covers plus or minus one standard deviation from the mean.

7 Conclusion

In this paper, we establish the trainability of ReLU neural networks, a necessary condition for successful training. Given a learning task which requires a ReLU network with at least $k$ active neurons, a network is said to be trainable if it has at least $k$ active neurons. We refer to the probability of a network being trainable as its trainability. In order to calculate the trainability, we first study the probability distribution of the number of active neurons at each hidden layer. We completely characterize the distribution at the 1st hidden layer and, by focusing on the one-dimensional case, derive the distribution at the 2nd hidden layer for four different combinations of initialization schemes. With these distributions, we derive the trainability of ReLU networks. The trainability serves as an upper bound of the probability of successful training. Also, by showing that over-parameterization can be understood as a kind of over-specification, we prove that over-parameterization is both a necessary and a sufficient condition for interpolating all training data, i.e., minimizing the training loss.

Although the zero-bias initialization seems to be preferred over the random-bias initialization from the trainability perspective, the zero-bias initialization locates all neurons at the origin. This often deteriorates the training performance, especially for a task which requires neurons to be located away from the origin. On the other hand, the random-bias initialization suffers from the dying ReLU neuron problem and typically requires a severely over-parameterized or over-specified network. To alleviate these difficulties, we propose a new data-dependent initialization method for the over-parameterized setting. The proposed method is designed to avoid both the dying ReLU neuron problem and the clustered neuron problem at the same time. Numerical examples are provided to demonstrate the performance of our method. We found that the data-dependent initialization method outperforms the 'He initialization' both with and without bias in all of our tests.

This work is supported by the DOE PhILMs project (No. DE-SC0019453), the AFOSR grant FA9550-17-1-0013, and the DARPA AIRA grant HR00111990025.

Appendix A Proof of Theorem 2.2

Suppose a ReLU neural network of width is initialized to be