# The Benefits of Over-parameterization at Initialization in Deep ReLU Networks

It has been noted in existing literature that over-parameterization in ReLU networks generally leads to better performance. While there could be several reasons for this, we investigate desirable network properties at initialization which may be enjoyed by ReLU networks. Without making any assumption, we derive a lower bound on the layer width of deep ReLU networks whose weights are initialized from a certain distribution, such that with high probability, i) the norms of the hidden activations of all layers are roughly equal to the norm of the input, and ii) the norms of the parameter gradients of all layers are roughly the same. In this way, sufficiently wide deep ReLU nets with appropriate initialization can inherently preserve the forward flow of information and also avoid the gradient exploding/vanishing problem. We further show that these results hold for an infinite number of data samples, in which case the finite lower bound depends on the input dimensionality and the depth of the network. In the case of deep ReLU networks with weight vectors normalized by their norm, we derive an initialization required to tap the aforementioned benefits from over-parameterization, without which the network fails to learn for large depth.

## Code Repositories

### overparametrization_benefits

Code for the paper: The Benefits of Over-parameterization at Initialization in Deep ReLU Networks (https://arxiv.org/abs/1901.03611)

## 1 Introduction

Parameter initialization is an important aspect of deep network optimization and plays a crucial role in determining the quality of the final model. A parameter scale that is too large or too small leads to the exploding or vanishing gradient problem across hidden layers. As a result, some parameters can get initialized in plateaus and others along steep valleys, and optimization becomes unstable. We specifically focus on this problem for deep ReLU networks because of their popularity and success in various applications of deep learning. A number of papers have previously studied initialization in deep ReLU networks. We contrast our contribution with existing work in section 2.

Our analysis of the initialization aspect for deep ReLU networks centers around the claim that over-parameterization in terms of network width avoids the exploding and vanishing gradient problem in the backward pass, and forms a norm (and distance) preserving mapping across hidden layers in the forward pass. Our findings/contributions are as follows:

1. At appropriate initialization, deep ReLU networks are norm (and distance) preserving maps, i.e., the norm of (distance between) hidden activations (s) is approximately equal to the norm of (distance between) the input vector (s).

2. The same initialization also guarantees that the exploding and vanishing gradient problem does not happen, in the sense that the norms of the parameter gradients are roughly equal across all layers.

3. We do not make any assumption on the data distribution as done in a number of previous papers that study initialization. In fact, our results hold for an infinite stream of data.

4. We derive a finite lower bound on the width of the hidden layers for which the above results hold (i.e., the network needs to be sufficiently over-parameterized) in contrast to a number of previous papers that assume infinitely wide layers.

5. We show how to initialize deep ReLU networks whose weight vectors are normalized by their norm (as done in (Salimans & Kingma, 2016; Arpit et al., 2016)) so that properties 1 and 2 above hold given the network is wide enough. To the best of our knowledge, we are the first to study initialization conditions for weight normalized deep ReLU networks.

6. Finally, we derive the initialization conditions for residual networks that ensure the activation norms are preserved when the network is sufficiently wide (see appendix A).

## 2 Relation with Existing Work

The seminal work of Glorot & Bengio (2010) studied for the first time a principled way to initialize deep networks to avoid gradient explosion/vanishing problem. Their analysis however is done for deep linear networks. The analysis by He et al. (2015) follows the derivation strategy of Glorot & Bengio (2010) except they tailor their derivation for deep ReLU networks. However, both these papers make a strong assumption that the dimensions of the input are statistically independent and that the network width is infinite. Our results do not make these assumptions.

Saxe et al. (2013) introduce the notion of dynamical isometry, which is achieved when all the singular values of the input-output Jacobian of the network are 1. They show that deep linear networks achieve dynamical isometry when initialized using orthogonal weights and that this property allows fast learning in such networks.

Pennington et al. (2017, 2018) study the exploding and vanishing gradient problem in deep ReLU networks using tools from free probability theory. Under the assumption of an infinitely wide network, they show that the average squared singular value of the input-output Jacobian of a deep ReLU network is 1 when initialized appropriately. Our paper on the other hand shows that there exists a finite lower bound on the width of the network for which the squared Frobenius norm of the hidden layer-output Jacobian (equivalently the sum of its squared singular values) is roughly equal across all hidden layers.

Hanin & Rolnick (2018) show that for a fixed input, the variance of the squared norm of hidden layer activations is bounded from above and below for deep ReLU networks, keeping it near the squared norm of the input, where the bound depends on the sum of reciprocals of the layer widths of the network. Our paper shows a similar result in a PAC bound sense but, as an important difference, we show that these results hold even for an infinite stream of data by making the bound depend on the dimensionality of the input.

Hanin (2018) shows that sufficiently wide deep ReLU networks with appropriately initialized weights prevent the exploding/vanishing gradient problem (EVGP) in the sense that the fluctuation between the elements of the input-output Jacobian matrix of the network is small. This avoids EVGP because a large fluctuation between the elements of the input-output Jacobian implies a large variation in its singular values. Our paper shows that sufficiently wide deep ReLU networks avoid EVGP in the sense that the norm of the gradient for the weights of each layer is roughly equal to a fixed quantity that depends on the input and target.

Arpit et al. (2016) introduced weight normalized deep networks in which the pre- and post-activations are scaled/summed with constants depending on the activation function, which ensures that the hidden activations have 0 mean and unit variance, especially at initialization. Their work makes assumptions on the distribution of the input and pre-activations of the hidden layers. Weight normalization (Salimans & Kingma, 2016) offers a simpler alternative to Arpit et al. (2016) in which no scaling/summing constants are used aside from normalizing the weights by their norm. Our work focuses on the initialization conditions for normalized deep ReLU networks and shows that over-parameterization along with appropriate initialization ensures the activation and parameter gradient norms are preserved without making any assumption on the distribution of data or hidden activations.

Over-parameterization in deep networks has previously been shown to have advantages. Neyshabur et al. (2014) and Arpit et al. (2017) show empirically that wider networks train faster (in number of epochs) and have better generalization performance. From a theoretical viewpoint, Neyshabur et al. (2018) derive a generalization bound for a two layer ReLU network where they show that a wider network has a lower complexity. Lee et al. (2017) show that infinitely wide deep networks act as a Gaussian process. Arora et al. (2018) show that over-parameterization in deep linear networks acts as a conditioning on the gradient leading to faster convergence, although in this case over-parameterization in terms of depth is studied. Our analysis complements this line of work by showing another advantage of over-parameterization in deep ReLU networks.

Random projection is a popular method for dimensionality reduction based on the Johnson-Lindenstrauss (JL) lemma (Johnson & Lindenstrauss, 1984) that involves projecting a vector onto a properly constructed random matrix. This lemma states that the $\ell^2$ norm of a randomly projected vector is approximately equal to the $\ell^2$ norm of the original vector. In this work, we show that a linear transformation followed by a point-wise ReLU also preserves the norm of the input vector in the following cases: i) each element of the projection matrix is sampled i.i.d. from an isotropic distribution; ii) each row vector is i.i.d. sampled from an isotropic distribution and has unit length.

An et al. (2015) claim deep ReLU networks cannot be distance preserving transforms based on the argument that applying ReLU on a vector discards information in the negative direction, and can thus make two different vectors collapse to the same point. As we show, ReLU networks are capable of preserving vector norms. The subtlety that allows ReLU networks to preserve norms is over-parameterization. A single rectifier unit discards information in the negative direction, but when an over-complete ReLU transformation is used, there is a high likelihood that all the rectifier units together capture all the information of the input while non-linearly transforming it at the same time.

## 3 Un-normalized Deep ReLU Networks

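Let $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$ be $N$ training sample pairs of input vectors $\mathbf{x}_i \in \mathbb{R}^{n_0}$ and target vectors $\mathbf{y}_i$. Define an $L$ layer deep ReLU network $f_\theta(\cdot)$ with the $l^{th}$ hidden layer's activation given by,

$$\mathbf{h}^l := \mathrm{ReLU}(\mathbf{a}^l), \qquad \mathbf{a}^l := \mathbf{W}^l \mathbf{h}^{l-1} + \mathbf{b}^l, \qquad l \in \{1, 2, \cdots, L\} \tag{1}$$

where $\mathbf{h}^l \in \mathbb{R}^{n_l}$ are the hidden activations, $\mathbf{h}^0$ is the input to the network and can be one of the input vectors $\mathbf{x}_i$, $\mathbf{W}^l \in \mathbb{R}^{n_l \times n_{l-1}}$ are the weight matrices, $\mathbf{b}^l \in \mathbb{R}^{n_l}$ are the bias vectors which are initialized as 0s, $\mathbf{a}^l$ are the pre-activations and $f_\theta(\mathbf{x}) = \mathbf{h}^L$.

Define a loss on the deep network function for any given training data sample $(\mathbf{x}, \mathbf{y})$ as,

$$\ell(f_\theta(\mathbf{x}), \mathbf{y}) \tag{2}$$

where $\ell(\cdot)$ is any desired loss function. For instance, $\ell(\cdot)$ can be log loss for a classification problem, in which case $f_\theta(\mathbf{x})$ is transformed using a weight matrix to have dimensions equal to the number of classes and the softmax activation is applied to yield class probabilities (i.e., a logistic regression like model on top of $f_\theta(\mathbf{x})$). However, for our purpose we do not need to restrict $\ell(\cdot)$ to a specific choice; we only need it to be differentiable. We will make use of the notation,

$$\delta(\mathbf{x}, \mathbf{y}) := \frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^L} \tag{3}$$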

We first derive the norm preservation property for finite datasets and then extend these results to an infinite data stream.

### 3.1 Activation Norm Preservation

Consider an $L$ layer deep ReLU network and $N$ data samples from a fixed dataset $\mathcal{D}$. We show in this section that for a sufficiently wide ReLU network, the norm of the hidden layer activation of any layer is roughly equal to the norm of the input at initialization if the network weights are initialized appropriately. Specifically we show,

$$\|\mathbf{h}^l\|_2 \approx \|\mathbf{x}\|_2 \qquad \forall l, \; \forall \mathbf{x} \in \mathcal{D} \tag{4}$$

An important step to achieve this goal is to show that in expectation, the non-linear transformation in each layer preserves the norm of its corresponding input. Evaluating this expectation also helps determining the scale of the random initialization that leads to norm preservation.

###### Lemma 1

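Let $\mathbf{v} = \mathrm{ReLU}(\mathbf{R}\mathbf{u})$, where $\mathbf{u} \in \mathbb{R}^n$, $\mathbf{R} \in \mathbb{R}^{m \times n}$. If $R_{ij} \overset{i.i.d.}{\sim} \mathcal{N}(0, \frac{2}{m})$, then for any fixed vector $\mathbf{u}$, $\mathbb{E}[\|\mathbf{v}\|^2] = \|\mathbf{u}\|^2$.

The above result shows that for each layer, initializing its weights from an i.i.d. Gaussian distribution with 0 mean and $2/m$ variance preserves the norm of its input in expectation. We now show that this property holds with high probability for a finite width network.

###### Lemma 2

Let $\mathbf{v} = \mathrm{ReLU}(\mathbf{R}\mathbf{u})$, where $\mathbf{u} \in \mathbb{R}^n$, $\mathbf{R} \in \mathbb{R}^{m \times n}$. If $R_{ij} \overset{i.i.d.}{\sim} \mathcal{N}(0, \frac{2}{m})$, and $\epsilon \in [0, 1)$, then for any fixed vector $\mathbf{u}$,

$$\Pr\left(\left|\|\mathbf{v}\|^2 - \|\mathbf{u}\|^2\right| \le \epsilon \|\mathbf{u}\|^2\right) \ge 1 - 2\exp\left(-m\left(\frac{\epsilon}{4} + \log\frac{2}{1 + \sqrt{1+\epsilon}}\right)\right) \tag{5}$$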

We note that (He et al., 2015) also find this initialization appropriate for ReLU networks (We contrast with their findings in section 2). We finally extend this result to show that all hidden layers and the output of a deep ReLU network also preserve the norm of the input for a finite width network and a finite dataset.
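The single-layer statement can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: it assumes the layer maps $\mathbf{u} \in \mathbb{R}^n$ to $\mathrm{ReLU}(\mathbf{R}\mathbf{u})$ with entries of $\mathbf{R}$ drawn from a zero-mean Gaussian of variance $2/m$ (the variance that makes the expected squared norm equal to $\|\mathbf{u}\|^2$), and the function name and default sizes are illustrative choices.

```python
import numpy as np

def relu_layer_sq_norm_ratio(n=64, m=4096, seed=0):
    """Return ||ReLU(R u)||^2 / ||u||^2 for one random draw of R."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)                     # arbitrary fixed input
    R = rng.normal(0.0, np.sqrt(2.0 / m), (m, n))  # entries ~ N(0, 2/m), assumed
    v = np.maximum(R @ u, 0.0)                     # v = ReLU(R u)
    return float(v @ v / (u @ u))

# The ratio concentrates around 1 as the width m grows.
for m in (16, 256, 4096):
    print(m, relu_layer_sq_norm_ratio(m=m))
```

For narrow layers the ratio fluctuates noticeably from draw to draw, which is the finite-width effect that the probability bound quantifies.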

###### Theorem 1

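Let $\mathcal{D}$ be a fixed dataset with $N$ samples and define an $L$ layer ReLU network as shown in Eq. (1) such that each weight matrix $\mathbf{W}^l \in \mathbb{R}^{n_l \times n_{l-1}}$ has its elements sampled as $W^l_{ij} \overset{i.i.d.}{\sim} \mathcal{N}(0, \frac{2}{n_l})$ and biases $\mathbf{b}^l$ are set to zeros. Then for any sample $(\mathbf{x}, \mathbf{y}) \in \mathcal{D}$ and $\epsilon \in [0, 1)$, we have that,

$$\Pr\left((1-\epsilon)^L \|\mathbf{x}\|^2 \le \|f_\theta(\mathbf{x})\|^2 \le (1+\epsilon)^L \|\mathbf{x}\|^2\right) \ge 1 - \sum_{l'=1}^{L} 2N \exp\left(-n_{l'}\left(\frac{\epsilon}{4} + \log\frac{2}{1 + \sqrt{1+\epsilon}}\right)\right) \tag{6}$$

If the weights are sampled from a Gaussian with variance larger (smaller) than $2/n_l$, the hidden activation norm will explode (vanish) exponentially with depth.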
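To illustrate the effect of the variance scale on depth, here is a small NumPy sketch under stated assumptions: a uniform-width network with zero biases, weights drawn from a zero-mean Gaussian with variance `var_scale / width`, and a random Gaussian input. The helper name and the defaults are hypothetical. Setting `var_scale=2.0` corresponds to the norm-preserving choice, while other values show the exponential drift.

```python
import numpy as np

def activation_norm_ratios(depth=10, width=1000, var_scale=2.0, seed=0):
    """Per-layer ||h^l||_2 / ||x||_2 with W^l_ij ~ N(0, var_scale / width)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    h, ratios = x, []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(var_scale / width), (width, h.size))
        h = np.maximum(W @ h, 0.0)  # biases are zero at initialization
        ratios.append(float(np.linalg.norm(h) / np.linalg.norm(x)))
    return ratios

print(activation_norm_ratios(var_scale=2.0)[-1])  # stays near 1
print(activation_norm_ratios(var_scale=1.0)[-1])  # shrinks with depth
print(activation_norm_ratios(var_scale=4.0)[-1])  # grows with depth
```

With a uniform width, `var_scale=1.0` matches the Glorot variance $2/(\text{fan-in} + \text{fan-out})$, so the second call also hints at why that scheme loses activation norm in deep ReLU networks.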

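### 3.2 Gradient Norm Preservation

Consider any given loss function $\ell(\cdot)$ and a data sample $(\mathbf{x}, \mathbf{y})$. We will show in this section that the norms of the gradients for the parameters $\mathbf{W}^l$ of the different layers will be roughly equal to each other at initialization if the network weights are initialized appropriately. More specifically, we will show that for a wide enough network, the following holds at initialization for all $(\mathbf{x}, \mathbf{y})$ and $l$,

$$\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{W}^l}\right\|_F \approx \|\delta(\mathbf{x}, \mathbf{y})\|_2 \cdot \|\mathbf{x}\|_2 \qquad \forall l \tag{7}$$

As a first step, we note that the gradient for a parameter $\mathbf{W}^l$ for a sample $(\mathbf{x}, \mathbf{y})$ is given by,

$$\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{W}^l} = \mathrm{diag}\left(\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^l}\right) \cdot \mathbf{M}_{n_l}(\mathbf{h}^{l-1}) \tag{8}$$

where $\mathbf{M}_{n_l}(\mathbf{h}^{l-1})$ is a matrix of size $n_l \times n_{l-1}$ such that each row is the vector $\mathbf{h}^{l-1}$. Therefore, a simple algebraic manipulation shows that,

$$\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{W}^l}\right\|_F = \left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^l}\right\|_2 \cdot \|\mathbf{h}^{l-1}\|_2 \tag{9}$$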
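The factorization of the Frobenius norm follows because the weight gradient is a rank-one outer product. A short NumPy check, using random stand-in vectors for the pre-activation gradient and the previous layer's activation (both are illustrative placeholders, not values from a trained network):

```python
import numpy as np

rng = np.random.default_rng(0)
n_l, n_prev = 5, 7
grad_a = rng.standard_normal(n_l)     # stands in for dl/da^l
h_prev = rng.standard_normal(n_prev)  # stands in for h^{l-1}

M = np.tile(h_prev, (n_l, 1))         # every row equals h^{l-1}
grad_W = np.diag(grad_a) @ M          # diag(dl/da^l) times M, as in Eq. (8)

lhs = np.linalg.norm(grad_W, "fro")
rhs = np.linalg.norm(grad_a) * np.linalg.norm(h_prev)
print(lhs, rhs)                       # the two values agree up to rounding
```

Because `grad_W` equals the outer product of `grad_a` and `h_prev`, its only nonzero singular value is the product of the two vector norms.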

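In the previous section, we showed that for a sufficiently wide network, $\|\mathbf{h}^l\|_2 \approx \|\mathbf{x}\|_2$ $\forall l$ with high probability. To show that gradient norms of parameters are preserved in the sense shown in Eq. (7), we essentially show that $\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^l}\right\|_2 \approx \|\delta(\mathbf{x}, \mathbf{y})\|_2$ $\forall l$ with high probability for sufficiently wide networks.

Note that $\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^L} = \delta(\mathbf{x}, \mathbf{y})$ by definition. To show the norm is preserved for all layers, we begin by noting that,

$$\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^l} = \frac{\partial \mathbf{h}^l}{\partial \mathbf{a}^l} \odot \left(\frac{\partial \mathbf{a}^{l+1}}{\partial \mathbf{h}^l}^T \frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^{l+1}}\right) = \mathbb{1}(\mathbf{a}^l) \odot \left({\mathbf{W}^{l+1}}^T \frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^{l+1}}\right) \tag{10}$$

where $\odot$ is the point-wise product (or Hadamard product) and $\mathbb{1}(\cdot)$ is the Heaviside step function. The following proposition shows that $\mathbb{1}(\mathbf{a}^l)$ follows a Bernoulli distribution w.r.t. the weights given any fixed input.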

###### Proposition 1

If network weights are sampled i.i.d. from a random distribution with mean 0 and biases are 0 at initialization, and $n_{l-1}$ is sufficiently large, then each dimension of $\mathbb{1}(\mathbf{a}^l)$ follows an i.i.d. Bernoulli distribution with probability 0.5 at initialization.

Given this property of $\mathbb{1}(\mathbf{a}^l)$, we show below that the transformation of the type shown in Eq. (10) is norm preserving in expectation.

###### Lemma 3

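Let $\mathbf{v} = (\mathbf{R}^T \mathbf{u}) \odot \mathbf{z}$, where $\mathbf{u} \in \mathbb{R}^n$, $\mathbf{z} \in \mathbb{R}^m$ and $\mathbf{R} \in \mathbb{R}^{n \times m}$. If $R_{ij} \overset{i.i.d.}{\sim} \mathcal{N}(0, \frac{1}{pm})$ and $z_i \overset{i.i.d.}{\sim} \mathrm{Bernoulli}(p)$, then for any fixed vector $\mathbf{u}$, $\mathbb{E}[\|\mathbf{v}\|^2] = \|\mathbf{u}\|^2$.

The above lemma reveals the variance of the 0 mean Gaussian from which the weights must be sampled in order for the vector norm to be preserved in expectation. Since $\mathbb{1}(\mathbf{a}^l)$ is sampled from a 0.5 probability Bernoulli, we have that the weights must be sampled from a Gaussian with variance $2/m$. We now show this property holds with high probability for a finite width network.

###### Lemma 4

Let $\mathbf{v} = (\mathbf{R}^T \mathbf{u}) \odot \mathbf{z}$, where $\mathbf{u} \in \mathbb{R}^n$, $\mathbf{z} \in \mathbb{R}^m$, and $\mathbf{R} \in \mathbb{R}^{n \times m}$. If $R_{ij} \overset{i.i.d.}{\sim} \mathcal{N}(0, \frac{2}{m})$, $z_i \overset{i.i.d.}{\sim} \mathrm{Bernoulli}(0.5)$ and $\epsilon \in [0, 1)$, then for any fixed vector $\mathbf{u}$,

$$\Pr\left(\left|\|\mathbf{v}\|^2 - \|\mathbf{u}\|^2\right| \le \epsilon \|\mathbf{u}\|^2\right) \ge 1 - 2\exp\left(-m\left(\frac{\epsilon}{4} + \log\frac{2}{1 + \sqrt{1+\epsilon}}\right)\right) \tag{11}$$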

Having shown that the finite width case holds, we now apply this result to Eq. (10). In this case, we must substitute the matrix $\mathbf{R}$ in the above lemma with the network's weight matrix $\mathbf{W}^{l+1}$. In the previous subsection, we showed that each element of the matrix $\mathbf{W}^{l+1}$ must be sampled from $\mathcal{N}(0, \frac{2}{n_{l+1}})$ in order for the norm of the input vector to be preserved. However, in order for the Jacobian norm to be preserved, we require $\mathbf{W}^{l+1}$ to be sampled from $\mathcal{N}(0, \frac{2}{n_l})$ as per the above lemma. This suggests that if we want the norms to be preserved in the forward and backward pass for a single layer simultaneously, it is beneficial for the width of the network to be close to uniform. We want both to hold simultaneously because, as shown in Eq. (9), for the parameter gradient norm to be the same for all layers, the norms of both the Jacobian and the hidden activation must be preserved throughout the hidden layers. Therefore, assuming the network has a uniform width, we now prove that deep ReLU networks with the mentioned initialization prevent the exploding/vanishing gradient problem.

###### Theorem 2
111We have assumed that in independent from and at initialization. We get rid of this assumption in the next subsection.

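Let $\mathcal{D}$ be a fixed dataset with $N$ samples and define an $L$ layer ReLU network of uniform width $n$ as shown in Eq. (1) such that each weight matrix $\mathbf{W}^l$ has its elements sampled as $W^l_{ij} \overset{i.i.d.}{\sim} \mathcal{N}(0, \frac{2}{n})$ and biases $\mathbf{b}^l$ are set to zeros. Then for any sample $(\mathbf{x}, \mathbf{y}) \in \mathcal{D}$, $\epsilon \in [0, 1)$, and for all $l \in \{1, 2, \cdots, L\}$, with probability at least,

$$1 - 4NL\exp\left(-n\left(\frac{\epsilon}{4} + \log\frac{2}{1 + \sqrt{1+\epsilon}}\right)\right) \tag{12}$$

the following hold true,

$$(1-\epsilon)^L \|\mathbf{x}\|^2 \cdot \|\delta(\mathbf{x}, \mathbf{y})\|^2 \le \left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{W}^l}\right\|_F^2 \le (1+\epsilon)^L \|\mathbf{x}\|^2 \cdot \|\delta(\mathbf{x}, \mathbf{y})\|^2 \tag{13}$$

and

$$(1-\epsilon)^l \|\mathbf{x}\|^2 \le \|\mathbf{h}^l\|^2 \le (1+\epsilon)^l \|\mathbf{x}\|^2 \tag{14}$$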
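The joint forward and backward behaviour can be sketched numerically. The NumPy code below is an illustrative simulation, not the authors' experiment: it assumes a uniform-width network with zero biases and weight variance $2/n$, uses a random Gaussian vector as a hypothetical stand-in for $\delta(\mathbf{x}, \mathbf{y})$ (drawn independently of the weights), back-propagates it with the step-function recursion of the backward pass, and reports the ratio of each layer's weight-gradient Frobenius norm to $\|\delta\|_2 \cdot \|\mathbf{x}\|_2$.

```python
import numpy as np

def gradient_norm_ratios(depth=10, width=800, seed=0):
    """Return ||dl/dW^l||_F / (||delta||_2 ||x||_2) for every layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    Ws, hs, pre = [], [x], []
    for _ in range(depth):                      # forward pass
        W = rng.normal(0.0, np.sqrt(2.0 / width), (width, width))
        a = W @ hs[-1]
        Ws.append(W)
        pre.append(a)
        hs.append(np.maximum(a, 0.0))
    delta = rng.standard_normal(width)          # hypothetical dl/da^L
    scale = np.linalg.norm(delta) * np.linalg.norm(x)
    g = delta                                   # gradient w.r.t. pre-activation
    ratios = [0.0] * depth
    for l in range(depth - 1, -1, -1):          # backward pass
        # weight-gradient norm = ||dl/da^l|| * ||h^{l-1}|| (rank-one gradient)
        ratios[l] = float(np.linalg.norm(g) * np.linalg.norm(hs[l]) / scale)
        if l > 0:
            g = (pre[l - 1] > 0) * (Ws[l].T @ g)
    return ratios

print(gradient_norm_ratios())
```

In this sketch all ratios stay close to 1, and the spread tightens as `width` increases, in line with the width dependence of the probability bound.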

The corollary below shows a lower bound on the network width which simultaneously ensures parameter gradients do not explode/vanish and activation norms are preserved.

###### Corollary 1

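Suppose all the hidden layers of the $L$ layer deep ReLU network have the same width $n$, and let the following hold with probability at least $1 - \delta$,

$$\left|\|f_\theta(\mathbf{x})\|^2 - \|\mathbf{x}\|^2\right| \le \epsilon \|\mathbf{x}\|^2$$

and,

$$\left|\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{W}^l}\right\|_F^2 - \|\mathbf{x}\|^2 \cdot \|\delta(\mathbf{x}, \mathbf{y})\|^2\right| \le \epsilon \|\mathbf{x}\|^2 \cdot \|\delta(\mathbf{x}, \mathbf{y})\|^2$$

for $N$ fixed training samples, then,

$$n \ge \frac{1}{0.25\epsilon' - \log\left(0.5\left(1 + \sqrt{1+\epsilon'}\right)\right)} \log\frac{4NL}{\delta} \tag{15}$$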

where .
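The width bound of Eq. (15) is straightforward to evaluate. The helper below is a sketch with a hypothetical name; it takes the per-layer tolerance $\epsilon'$ directly as an argument, since its relation to the overall tolerance $\epsilon$ is not restated here, and rounds the bound up to an integer width.

```python
import math

def min_width_finite(eps_prime, n_samples, depth, delta):
    """Smallest integer width n satisfying the bound of Eq. (15)."""
    # rate = eps'/4 + log(2 / (1 + sqrt(1 + eps'))), positive for eps' > 0
    rate = 0.25 * eps_prime - math.log(0.5 * (1.0 + math.sqrt(1.0 + eps_prime)))
    return math.ceil(math.log(4.0 * n_samples * depth / delta) / rate)

# Illustrative inputs only; these are not the settings used in the experiments.
print(min_width_finite(eps_prime=0.2, n_samples=2000, depth=10, delta=0.05))
```

The bound grows only logarithmically in the number of samples and the depth, but roughly like $1/\epsilon'^2$ as the tolerance shrinks, because the rate term behaves like $3\epsilon'^2/32$ for small $\epsilon'$.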

### 3.3 Infinite Data Stream

So far, we have shown for sufficiently wide deep ReLU network with appropriate initialization that they avoid the exploding/vanishing gradient problem and preserve the norm of hidden activations for, i) a finite dataset, ii) the assumption shown in theorem 2. We now show that the norm preserving property of a ReLU layer shown in lemma 2 can hold for an infinite stream of data if the layer width is larger than the lower bound shown below which depends on the input dimensionality.

###### Theorem 3

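Let $\mathcal{X}$ be a $d$ dimensional subspace of $\mathbb{R}^n$ and $\mathbf{R} \in \mathbb{R}^{m \times n}$. If $R_{ij} \overset{i.i.d.}{\sim} \mathcal{N}(0, \frac{2}{m})$, $\epsilon \in [0, 1)$, and,

$$m \ge \frac{1}{\epsilon/12 - \log\left(0.5\left(1 + \sqrt{1 + \epsilon/3}\right)\right)} \cdot \left(d \log\frac{2}{\Delta} + \log\frac{4}{\delta}\right) \tag{16}$$

then with probability at least $1 - \delta$,

$$\left|\|\mathrm{ReLU}(\mathbf{R}\mathbf{u})\| - \|\mathbf{u}\|\right| \le \epsilon \|\mathbf{u}\| \qquad \forall \mathbf{u} \in \mathcal{X} \tag{17}$$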

where .
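The infinite-stream bound of Eq. (16) can be evaluated in the same way. In this sketch the function name is hypothetical, and the quantity $\Delta$ is passed in as a plain number (`cover_scale`) because its definition is not reproduced here; the example values are placeholders.

```python
import math

def min_width_stream(eps, subspace_dim, cover_scale, delta):
    """Smallest integer width m satisfying the bound of Eq. (16).

    cover_scale plays the role of Delta in Eq. (16); it must lie in (0, 2)
    for the log(2 / Delta) term to be positive.
    """
    rate = eps / 12.0 - math.log(0.5 * (1.0 + math.sqrt(1.0 + eps / 3.0)))
    total = subspace_dim * math.log(2.0 / cover_scale) + math.log(4.0 / delta)
    return math.ceil(total / rate)

# Illustrative inputs only.
print(min_width_stream(eps=0.5, subspace_dim=10, cover_scale=0.1, delta=0.05))
```

Unlike the finite-dataset bound, the required width here scales linearly with the subspace dimension $d$ and does not involve the number of samples.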

A similar result, which extends lemma 4 to show norm preservation for the backward pass in the infinite data stream case, is derived in theorem 6 in the appendix.

Since the proof of theorem 2 essentially uses lemmas 2 and 4, and the two lemmas extend to the infinite data case via the above two theorems, the statement of theorem 2 also holds for an infinite stream of data if the network is wide enough, with the required width depending on the input manifold dimensionality and on the depth (of course the lower bound on width in that case needs to be re-calculated using the above two theorems). Finally, note that the assumption made in theorem 2 becomes irrelevant once the network width is larger than the above mentioned bound, because this assumption is required only in order to apply lemma 4 to Eq. (10). Since the infinite data version of lemma 4, viz. theorem 6, applies to all possible vectors, this assumption is not necessary.

### 3.4 Distance Preservation

We say that a transformation is distance preserving if the pairwise distance between all the samples in a given dataset is preserved after the transformation. In order to see that sufficiently wide deep ReLU networks are distance preserving transformations, we simply need to apply the activation norm preservation result discussed in the previous sub-sections to the vectors formed by the pairwise difference between data samples. Hence for $N$ data samples, if the network is wide enough to preserve the norm for the $N(N-1)/2$ pairwise difference vectors, then that transformation preserves the pairwise distance between samples. Alternatively, if we choose the network to be wide enough such that it is norm preserving for an infinite stream of vectors (as shown in the previous sub-section), then we automatically get the distance preserving property.

## 4 Weight Normalized Deep ReLU Networks

We now analyze deep ReLU networks whose weight vectors are normalized at every layer. We define an $L$ layer normalized deep ReLU network $f_\theta(\cdot)$ with the $l^{th}$ hidden layer's activation given by,

$$\mathbf{h}^l := \mathrm{ReLU}(\mathbf{a}^l), \qquad \mathbf{a}^l := c \cdot \hat{\mathbf{W}}^l \mathbf{h}^{l-1} + \mathbf{b}^l, \qquad l \in \{1, 2, \cdots, L\} \tag{18}$$

where $c$ is a multiplicative factor (which we will show is important), and the notation $\hat{\mathbf{W}}^l$ implies that each row vector of $\hat{\mathbf{W}}^l$ has unit norm, i.e.,

$$\hat{\mathbf{W}}^l_i = \frac{\mathbf{W}^l_i}{\|\mathbf{W}^l_i\|_2} \qquad \forall i \tag{19}$$

The rest of the symbols in Eq. (18) have the same definition as in Eq. (1). For any vector $\mathbf{v}$, we also define the notation $\hat{\mathbf{v}} = \mathbf{v} / \|\mathbf{v}\|_2$. The definition of $\delta(\mathbf{x}, \mathbf{y})$ is the same as that in Eq. (3).

We find that with an appropriate initialization and sufficiently wide layers, both the activation norms and parameter gradient norms are preserved in such deep ReLU networks. In contrast with the analysis in the previous section, we will only study normalized ReLU networks in terms of expectation and not extend the results to derive PAC bounds. We instead appeal to the law of large numbers, due to which one can expect that if the layers are wide enough, the results for the finite width case approach the results that hold in expectation. Throughout this section, take note of the distinction between the notations $\hat{\mathbf{W}}$ and $\mathbf{W}$ for any matrix $\mathbf{W}$.

### 4.1 Activation Norm Preservation

The theorem below shows that in expectation, the transformation of any hidden layer of a normalized ReLU network preserves the norm of the input.

###### Theorem 4

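Let $\mathbf{v} = \mathrm{ReLU}\left(\sqrt{\frac{2n}{m}} \cdot \hat{\mathbf{R}} \mathbf{u}\right)$, where $\mathbf{u} \in \mathbb{R}^n$ and $\hat{\mathbf{R}} \in \mathbb{R}^{m \times n}$. If $\mathbf{R}_i \overset{i.i.d.}{\sim} P(\mathbf{R})$ where $P(\mathbf{R})$ is any isotropic distribution in $\mathbb{R}^n$, then for any fixed vector $\mathbf{u}$, $\mathbb{E}[\|\mathbf{v}\|^2] = K_n \|\mathbf{u}\|^2$ where,

$$K_n = \begin{cases} \dfrac{2 S_{n-1}}{S_n} \cdot \left(\dfrac{2}{3} \cdot \dfrac{4}{5} \cdots \dfrac{n-2}{n-1}\right) & \text{if } n \text{ is even} \\[2ex] \dfrac{2 S_{n-1}}{S_n} \cdot \left(\dfrac{1}{2} \cdot \dfrac{3}{4} \cdots \dfrac{n-2}{n-1}\right) \cdot \dfrac{\pi}{2} & \text{otherwise} \end{cases} \tag{20}$$

and $S_n$ is the surface area of a unit $n$-dimensional sphere.

The constant $K_n$ seems hard to evaluate analytically, but remarkably, we empirically find that $K_n = 1$ for $n > 1$. This implies that in expectation, the non-linear transformation shown in theorem 4 is norm preserving for practical cases where the input dimension is larger than 1. Hence, we can extend this argument to a normalized deep ReLU network and have that the norm of the output of the network is approximately equal to the norm of the input for wide networks. We summarize this statement in the following informal corollary.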
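The empirical observation about the constant can be reproduced by evaluating the two-case formula directly. The sketch below makes one assumption that should be read carefully: it interprets the surface area $S_n$ as that of the unit $n$-sphere embedded in $\mathbb{R}^{n+1}$, i.e. $S_n = 2\pi^{(n+1)/2} / \Gamma((n+1)/2)$. Under the other common convention (the sphere in $\mathbb{R}^n$) the numbers differ. The function names are illustrative.

```python
import math

def sphere_area(n):
    """Surface area of the unit n-sphere in R^(n+1) (assumed convention)."""
    return 2.0 * math.pi ** ((n + 1) / 2.0) / math.gamma((n + 1) / 2.0)

def K(n):
    """Evaluate the two-case formula for K_n, for n >= 2."""
    ratio = 2.0 * sphere_area(n - 1) / sphere_area(n)
    prod = 1.0
    if n % 2 == 0:
        for k in range(2, n - 1, 2):   # 2/3, 4/5, ..., (n-2)/(n-1)
            prod *= k / (k + 1)
        return ratio * prod
    for k in range(1, n - 1, 2):       # 1/2, 3/4, ..., (n-2)/(n-1)
        prod *= k / (k + 1)
    return ratio * prod * math.pi / 2.0

print([round(K(n), 6) for n in range(2, 12)])
```

Under the stated convention the evaluated values agree with 1 up to floating-point rounding over the tested range, which matches the empirical finding reported in the text. `math.gamma` overflows for large arguments, so this direct evaluation is only suitable for moderate `n`.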

###### Corollary 2

(informal) For a sufficiently wide normalized deep ReLU network defined in Eq. (18) with $c = \sqrt{2 n_{l-1} / n_l}$, the following holds for any fixed input $\mathbf{x}$ at initialization,

$$\|f_\theta(\mathbf{x})\| \approx \|\mathbf{x}\| \tag{21}$$
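A quick simulation can illustrate the role of the multiplicative factor in a weight normalized network. The NumPy sketch below is written under explicit assumptions: rows are drawn from an isotropic Gaussian and rescaled to unit norm, biases are zero, the factor (named `c` in the code) is taken to be $\sqrt{2 \cdot \text{fan-in} / \text{fan-out}}$ as in the recommended layer design, and `use_scale=False` sets it to 1 as a stand-in for normalization without the extra scaling. All names and sizes are illustrative.

```python
import numpy as np

def normalized_net_norm_ratio(depth=10, width=1000, use_scale=True, seed=0):
    """Return ||f(x)||_2 / ||x||_2 for a weight normalized ReLU network."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    h = x
    for _ in range(depth):
        W = rng.standard_normal((width, h.size))               # isotropic rows
        W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)   # unit-norm rows
        c = np.sqrt(2.0 * h.size / width) if use_scale else 1.0
        h = np.maximum(c * (W_hat @ h), 0.0)                   # zero biases
    return float(np.linalg.norm(h) / np.linalg.norm(x))

print(normalized_net_norm_ratio(use_scale=True))   # close to 1
print(normalized_net_norm_ratio(use_scale=False))  # shrinks with depth
```

Without the factor, each layer roughly halves the squared norm when the width is uniform, so the output norm decays geometrically with depth.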

### 4.2 Gradient Norm Preservation

When analyzing the parameter gradients of a normalized deep ReLU network, there is a slight ambiguity. In such a network (described by Eq. (18)), the output is invariant to the scale of the weights. Hence, it is only the direction of the weight vectors that matters, in which case we may study the gradient w.r.t. the normalized weights. However, in practice, we update the un-normalized weights, in which case the gradient back-propagates through the normalization term as well. Therefore, we study the gradient norm for both these cases at the time of initialization.

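Gradient w.r.t. $\hat{\mathbf{W}}^l$: In this case, we show that for a sufficiently wide network, $\forall l$,

$$\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \hat{\mathbf{W}}^l}\right\|_F \approx \sqrt{\frac{2 n_{l-1}}{n_l}} \cdot \|\delta(\mathbf{x}, \mathbf{y})\|_2 \cdot \|\mathbf{x}\|_2 \tag{22}$$

The steps for showing this are very similar to those in section 3.2 with minor differences. We note that,

$$\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \hat{\mathbf{W}}^l} = \sqrt{\frac{2 n_{l-1}}{n_l}} \cdot \mathrm{diag}\left(\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^l}\right) \cdot \mathbf{M}_{n_l}(\mathbf{h}^{l-1}) \tag{23}$$

where $\mathbf{M}_{n_l}(\mathbf{h}^{l-1})$ is a matrix of size $n_l \times n_{l-1}$ such that each row is the vector $\mathbf{h}^{l-1}$. Therefore for all $l$,

$$\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \hat{\mathbf{W}}^l}\right\|_F = \sqrt{\frac{2 n_{l-1}}{n_l}} \cdot \left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^l}\right\|_2 \cdot \|\mathbf{h}^{l-1}\|_2 \tag{24}$$

Since we have already stated in corollary 2 that for a sufficiently wide network the norm of any hidden layer's activations is approximately equal to the norm of the input, we only need to show that $\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^l}\right\|_2 \approx \|\delta(\mathbf{x}, \mathbf{y})\|_2$ $\forall l$. To this end we write,

$$\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^l} = \sqrt{\frac{2 n_l}{n_{l+1}}} \cdot \mathbb{1}(\mathbf{a}^l) \odot \left(\hat{\mathbf{W}}^{l+1\,T} \frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^{l+1}}\right) \tag{25}$$

To avoid redundancy, we note that, similar to proposition 1, each dimension of $\mathbb{1}(\mathbf{a}^l)$ in the above equation follows an i.i.d. Bernoulli distribution with probability 0.5 at initialization. The following proposition then shows that transformations of the type shown in the above equation are norm preserving.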

###### Proposition 2

Let , where , and . If where is any isotropic distribution in and , then for any fixed vector , , where is same as defined in theorem 4.

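The above proposition when applied to Eq. (25) shows that $\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^l}\right\|_2 \approx \|\delta(\mathbf{x}, \mathbf{y})\|_2$ $\forall l$ for a sufficiently wide network due to the law of large numbers. Substituting this result along with the statement of corollary 2 into Eq. (24) leads us to the conclusion that Eq. (22) holds true.

Gradient w.r.t. $\mathbf{W}^l$: Now we consider the gradient norm of the un-normalized parameters $\mathbf{W}^l$, which are the ones that are updated in practice instead of $\hat{\mathbf{W}}^l$. In this case, we show for a sufficiently wide network, $\forall l$,

$$\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{W}^l}\right\|_F \approx \sqrt{\frac{2 n_{l-1}}{n_l \|\mathbf{W}^l_i\|^2}} \cdot \|\delta(\mathbf{x}, \mathbf{y})\|_2 \cdot \|\mathbf{x}\|_2 \tag{26}$$

where we have assumed that for each $l$, $\|\mathbf{W}^l_i\|_2$ is the same for all $i$. To begin, notice,

$$\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{W}^l} = \sqrt{\frac{2 n_{l-1}}{n_l}} \cdot \mathrm{diag}\left(\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^l}\right) \cdot \mathbf{M} \tag{27}$$

where the $i^{th}$ row of the matrix $\mathbf{M}$ is,

$$\mathbf{M}_i = \frac{1}{\|\mathbf{W}^l_i\|_2} \cdot \mathbf{h}^{l-1\,T} \left(\mathbf{I} - \hat{\mathbf{W}}^{l\,T}_i \hat{\mathbf{W}}^l_i\right) \tag{28}$$

In the previous two cases (Eq. (7) and Eq. (22)), we were able to show that the parameter gradient norm decomposes into two separate terms: $\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^l}\right\|_2$ and $\|\mathbf{h}^{l-1}\|_2$. However, this decomposition is not possible directly in this case. So we jointly compute the expectation of the squared gradient norm $\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{W}^l}\right\|_F^2$.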

###### Proposition 3

Consider a matrix such that the row , where , and . If where is any isotropic distribution in such that for some fixed and follows any distribution independent of , then for any fixed vector , , where is same as defined in theorem 4.

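The proposition applied to Eq. (27) shows that,

$$\mathbb{E}\left[\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{W}^l}\right\|_F^2\right] = \frac{2 n_{l-1}}{n_l \|\mathbf{W}^l_i\|^2} \cdot \left(1 - \frac{K_{n_{l-1}}}{n_{l-1}}\right) \cdot \mathbb{E}\left[\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^l}\right\|_2^2\right] \cdot \|\mathbf{h}^{l-1}\|_2^2 \tag{29}$$

For a wide enough network, $\left(1 - \frac{K_{n_{l-1}}}{n_{l-1}}\right) \approx 1$. Further, proposition 2 applied to Eq. (25) already shows that $\left\|\frac{\partial \ell(f_\theta(\mathbf{x}), \mathbf{y})}{\partial \mathbf{a}^l}\right\|_2 \approx \|\delta(\mathbf{x}, \mathbf{y})\|_2$. Combining these arguments, we have shown that Eq. (26) holds true.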

### 4.3 Batch Normalization

Batch normalization (Ioffe & Szegedy, 2015) is designed in a way that the distribution of pre-activations in each mini-batch has zero mean and unit variance for each feature dimension. Therefore, when feed-forwarding a batch of inputs through hidden layers, the norm of activations does not vanish or explode. Hence the scaling factor for pre-activations that we show is needed when using weight normalization may not be required in the case of batch normalization. The properties of gradients and the loss curvature have been studied more thoroughly by Santurkar et al. (2018), and we point the reader to this reference for a detailed analysis. We leave a thorough analysis of initialization conditions for batch norm as future work.

## 5 Experiments

We show three experiments in the main paper (see appendix for additional experiments). In the first set of experiments, we verify that the hidden activations have the same norm as the input (Eq. 4), and that the parameter gradient norm approximately equals the product of the input norm and the output error norm (Eq. 7) for all layer indices, for sufficiently wide un-normalized deep ReLU networks. For this experiment we choose a 10 layer network with 2000 randomly generated input samples, randomly generated target labels, and cross-entropy loss. We add a linear layer along with softmax activation to the ReLU network's outputs to produce the final output. According to corollary 1, the network width should be at least 4060 to preserve the activation norm and gradient norm simultaneously for the chosen maximum error margin $\epsilon$ and failure probability $\delta$. We plot the aforementioned ratios for width 4060 and smaller widths (100, 500, 2000) for comparison. We show results for both He initialization (He et al., 2015), which we theoretically show is optimal, as well as Glorot initialization (Glorot & Bengio, 2010), which is not optimal for deep ReLU nets. As can be seen in figure 1, the mean ratio of hidden activation norm to the input norm over the dataset is roughly 1 with a small standard deviation for He initialization. This approximation becomes better with larger width. On the other hand, Glorot initialization fails at preserving activation norm for deep ReLU nets. A similar result can be seen for parameter gradient norms (figure 2), where we find He initialization circumvents the gradient exploding/vanishing problem for all layers.

In the second experiment, we make the same evaluations as above for weight normalized deep ReLU networks. We compute the gradient w.r.t. the un-normalized weights (see section 4.2), in which case, as we theoretically showed, the gradient norm follows Eq. 26. Here we sample the rows of each layer's weight matrix from an isotropic Gaussian distribution, and then normalize each row to have unit norm. We run experiments with the proposed initialization (see corollary 2), as well as traditional weight normalization (Salimans & Kingma, 2016), for network widths 100, 500 and 1000. As can be seen in figures 3 and 4, our proposed initialization preserves activation norm and prevents the gradient explosion/vanishing problem as we showed theoretically, while the traditional weight normalization does not. Further, these approximations get better with larger width.

Finally, we show results on the MNIST dataset (Lecun & Cortes) with fully connected deep ReLU networks with width 500. The network is trained using SGD with batch-size 100, momentum 0.9, and weight decay 0.0005. The learning rate is dropped by a factor of 0.5 every 10 epochs. We tried several base learning rates for both traditional weight normalization and our proposed initialization and finally use the learning rate which yields the best validation accuracy. The convergence plots are shown in figure 5. For depth 2 and 5 both methods converge, but for depth 10, the traditional weight normalization is unable to escape the bad initialization while our proposal trains normally.

## 6 Conclusion and Recommendations

We rigorously derived the conditions needed for the preservation of activation norm and prevention of gradient explosion/vanishing problems in both un-normalized and weight normalized deep ReLU networks without making any assumption and verified our predictions empirically. In general we showed that over-parameterization in terms of width is crucial for avoiding these problems. Another useful recommendation is to keep network width as uniform as possible, especially when the width is small; for large widths the uniformity is less important up to small additive differences in layer widths. Other practical recommendations that help avoid these problems are as follows,

1. Un-normalized deep ReLU networks: Initialize each element of each layer's weight matrix from $\mathcal{N}(0, \frac{2}{\text{fan-out}})$, and biases to 0s.

2. Weight normalized deep ReLU networks: The network layers should be designed as,

$$h^l_i := \mathrm{ReLU}\left(\sqrt{\frac{2\,\text{fan-in}}{\text{fan-out}}} \cdot \frac{\mathbf{W}^l_i}{\|\mathbf{W}^l_i\|_2} \mathbf{h}^{l-1} + b^l_i\right) \tag{30}$$

where $l$ denotes the layer index and $i$ denotes the $i^{th}$ unit. Each row of the weight matrix can be sampled from any isotropic distribution, but its norm should be re-scaled to 1, or alternatively, each element of the weights can be sampled from $\mathcal{N}(0, \frac{1}{\text{fan-in}})$ as it ensures the row norms are 1 in expectation. The biases should be initialized to 0s. We refer to the proposed scaling and way of sampling weights collectively as our initialization strategy. If using the $g$ and $v$ parameters of weight normalization (Salimans & Kingma, 2016), each element of $g$ can be initialized to $\sqrt{2\,\text{fan-in}/\text{fan-out}}$ instead of using the above parameterization.

3. Residual networks: See appendix A.
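As a rough illustration of recommendation 2, the following NumPy sketch builds a stack of weight normalized ReLU layers following Eq. 30, with rows re-scaled to unit norm and biases set to 0, and reports how well the activation norm is preserved. The function names, the Gaussian input, and the sizes (input dimension 784, width and depth as given) are assumptions made for this example only.

```python
import numpy as np


def init_unit_rows(fan_out, fan_in, rng):
    """Sample rows from an isotropic Gaussian and re-scale each row to unit norm."""
    W = rng.standard_normal((fan_out, fan_in))
    return W / np.linalg.norm(W, axis=1, keepdims=True)


def wn_relu_layer(W, b, h):
    """One layer of Eq. 30: ReLU(sqrt(2 * fan_in / fan_out) * (W_i / ||W_i||) h + b_i)."""
    fan_out, fan_in = W.shape
    scale = np.sqrt(2.0 * fan_in / fan_out)
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)
    return np.maximum(scale * (W_hat @ h) + b, 0.0)


def norm_ratio(depth=10, width=1000, in_dim=784, seed=0):
    """Return ||h_depth|| / ||x|| for a randomly initialized network."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(in_dim)  # hypothetical input
    h = x
    for _ in range(depth):
        W = init_unit_rows(width, h.shape[0], rng)
        h = wn_relu_layer(W, np.zeros(width), h)  # biases initialized to 0
    return np.linalg.norm(h) / np.linalg.norm(x)


print(norm_ratio())
```

For wide layers the printed ratio should stay close to 1, and it typically drifts further from 1 as the width is reduced.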

## Acknowledgments

We would like to thank Giancarlo Kerg for proofreading the paper. DA was supported by IVADO.

## Appendix A Residual Networks

Residual networks (He et al., 2016) are a popular choice of architecture in various deep learning applications. Here we derive an initialization strategy for residual networks that preserves activation norm when the network is sufficiently wide. We define a residual network $f_\theta(\cdot)$ with $B$ residual blocks as,

$$h^{b+1} := h^b + \gamma F_b(h^b), \qquad b \in \{1, 2, \cdots, B\} \tag{31}$$

where $h^1 := x$ is the input, $h^b$ denotes the hidden representation after applying the preceding residual blocks, each $F_b(\cdot)$ is a residual block which can be a fully connected feed-forward deep ReLU network as defined in Eq. 3 or Eq. 30, and $\gamma$ is a scalar that scales the output of the residual blocks. In practice when using residual networks, $\gamma$ is usually set to 1. The theorem below states an initialization strategy for residual networks (with and without weight normalization) for which network activations at all layers are preserved in expectation, and hence would also be preserved for a particular instance of an initialized network if it is sufficiently wide.

###### Theorem 5

Let be a residual network as defined in Eq. 31. If the network weights are un-normalized, let each residual block be of the form shown in Eq. 3 and the element of weight matrices sampled from . If the network weights are weight normalized, let each residual block be of the form shown in Eq. 30 such that for any weight matrix, each row of the weight matrix can be sampled from any isotropic distribution, but its norm should be re-scaled to 1. Finally, set and assume for . Then in the limit of and an infinitely wide network,

$$\frac{\|x\|^2}{e^2} \le \|f_\theta(x)\|^2 \le e^2 \cdot \|x\|^2 \tag{32}$$

Note in the theorem we have assumed for in the case of normalized network which (as discussed in the main text) we empirically found to be true but did not prove analytically in theorem 4. Also, the exponential symbol in the inequality of the theorem arises due to the fact that

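\begin{align}
\lim_{B \to \infty} \left(1 - \frac{1}{B}\right)^B &= \frac{1}{e} \tag{33} \\
\lim_{B \to \infty} \left(1 + \frac{1}{B}\right)^B &= e \tag{34}
\end{align}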

In summary, the initialization for ReLU residual networks can be done in the same way as that for fully connected deep ReLU networks, depending on whether or not the weights are normalized. The only additional requirement for residual networks during initialization is to scale the output of each residual block by $1/B$, where $B$ denotes the total number of residual blocks in the network. Note that our result also holds when a linear layer exists after the input and/or "shortcut connections" are used in the residual network. These changes only require minor changes in the proof, which we do not consider for the sake of simplicity.
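The bound in Eq. 32 can be checked numerically with a small sketch. The sketch makes several assumptions: the block output scale is $1/B$ (inferred from the limits in Eq. 33 and Eq. 34), each residual block is a single un-normalized ReLU layer with weights drawn from a zero-mean Gaussian of variance 2/width, and the input is Gaussian. The function name and sizes are hypothetical.

```python
import numpy as np


def residual_forward(x, num_blocks, rng):
    """Apply h <- h + gamma * ReLU(W h) for num_blocks blocks, following Eq. 31."""
    width = x.shape[0]
    gamma = 1.0 / num_blocks  # assumed block output scale
    h = x
    for _ in range(num_blocks):
        # Assumed un-normalized initialization: N(0, 2 / width) per element.
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        h = h + gamma * np.maximum(W @ h, 0.0)
    return h


rng = np.random.default_rng(0)
x = rng.standard_normal(500)
out = residual_forward(x, num_blocks=50, rng=rng)
ratio = np.linalg.norm(out) / np.linalg.norm(x)

# Eq. 32 in terms of norms: 1/e <= ||f(x)|| / ||x|| <= e.
print(ratio, 1.0 / np.e <= ratio <= np.e)
```

Setting `gamma = 1.0` instead lets the norm grow rapidly with the number of blocks, which is the behavior the scaling is meant to avoid.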

In the following experiment we verify the tightness of the bound in lemma 2. To do so, we vary the network width of a one hidden layer ReLU transformation from 500 to 4000, and feed 2000 randomly sampled inputs through it. For each sample we measure the distortion defined as,

$$\epsilon := \left|1 - \frac{\|h\|}{\|x\|}\right| \tag{35}$$

where $h$ is the output of the one hidden layer ReLU transformation. We compute the mean value of $\epsilon$ for the 2000 examples and plot it against the network width used. We call this the empirical estimate. We simultaneously plot the values of $\epsilon$ predicted by lemma 2 for failure probability . We call this the theoretical value. The plots are shown in figure 6 (left). As can be seen, our lower bound on width is an over-estimation but becomes tighter for smaller values of . A similar result can be seen for lemma 4 in figure 6 (right). Thus our proposed bounds can be improved, and we leave that as future work.
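The empirical estimate described above can be reproduced approximately with the following sketch. The input dimension (100), the Gaussian inputs, the intermediate widths, and the weight variance 2/width are assumptions for this example; only the range of widths (500 to 4000) and the number of inputs (2000) come from the text.

```python
import numpy as np


def mean_distortion(width, in_dim=100, num_inputs=2000, seed=0):
    """Mean of |1 - ||h|| / ||x||| (Eq. 35) over random inputs for one ReLU layer."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: each weight drawn from N(0, 2 / width).
    R = rng.standard_normal((width, in_dim)) * np.sqrt(2.0 / width)
    X = rng.standard_normal((num_inputs, in_dim))  # hypothetical inputs
    H = np.maximum(X @ R.T, 0.0)
    eps = np.abs(1.0 - np.linalg.norm(H, axis=1) / np.linalg.norm(X, axis=1))
    return eps.mean()


widths = [500, 1000, 2000, 4000]
empirical = [mean_distortion(m) for m in widths]
for m, e in zip(widths, empirical):
    print(m, e)
```

The mean distortion should shrink as the width grows, which is the trend that the empirical curve in figure 6 is described as showing.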

## Appendix C Proofs

###### Lemma 1

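Let $v = \mathrm{ReLU}(Ru)$, where $u \in \mathbb{R}^n$ and $R \in \mathbb{R}^{m \times n}$. If $R_{ij} \overset{i.i.d.}{\sim} \mathcal{N}(0, \frac{2}{m})$, then for any fixed vector $u$, $E[\|v\|^2] = \|u\|^2$.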

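Proof: Define $a_i = R_i^T u$, where $R_i$ denotes the $i^{th}$ row of $R$. Since each element $R_{ij}$ is an independent sample from a Gaussian distribution, each $a_i$ is essentially a weighted sum of these independent random variables. Thus, each $a_i \sim \mathcal{N}(0, \frac{2}{m}\|u\|^2)$, and they are independent of one another. Thus each element $v_i = \mathrm{ReLU}(a_i) \sim \mathcal{N}^R(0, \frac{2}{m}\|u\|^2)$, where $\mathcal{N}^R$ denotes the rectified Normal distribution. Our goal is to compute,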

\begin{align}
E[\|v\|^2] &= E\left[\sum_{i=1}^{m} v_i^2\right] \tag{36} \\
&= m\,E[v_i^2] \tag{37}
\end{align}

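From the definition of $v_i$,

$$E[v_i] = \frac{1}{2} \cdot 0 + \frac{1}{2} E[Z] \tag{38}$$

where $Z$ follows a half-Normal distribution corresponding to the Normal distribution $\mathcal{N}(0, \frac{2}{m}\|u\|^2)$. Thus $E[Z] = 2\sqrt{\frac{\|u\|^2}{m\pi}}$. Similarly,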

\begin{align}
E[v_i^2] &= 0.5\,E[Z^2] \tag{39} \\
&= 0.5\left(\mathrm{var}(Z) + E[Z]^2\right) \tag{40}
\end{align}

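Since $\mathrm{var}(Z) = \frac{2}{m}\|u\|^2\left(1 - \frac{2}{\pi}\right)$, we get,

\begin{align}
E[v_i^2] &= 0.5\left(\frac{2}{m}\|u\|^2\left(1 - \frac{2}{\pi}\right) + \left(2\sqrt{\frac{\|u\|^2}{m\pi}}\right)^2\right) \tag{41} \\
&= \frac{\|u\|^2}{m} \tag{42}
\end{align}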

Thus,

$$m\,E[v_i^2] = \|u\|^2 \tag{43}$$

which proves the claim.
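The claim of lemma 1 can be sanity-checked by Monte Carlo simulation. The sketch below assumes the weights are drawn i.i.d. from a zero-mean Gaussian with variance $2/m$, which is the variance consistent with Eq. 41; the function name, dimensions, and number of trials are hypothetical.

```python
import numpy as np


def mean_sq_norm_ratio(m=100, n=10, trials=5000, seed=0):
    """Estimate E[||ReLU(R u)||^2] / ||u||^2 over independent draws of R."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)  # any fixed vector
    # Assumed weight distribution: N(0, 2 / m) for every element of R.
    R = rng.standard_normal((trials, m, n)) * np.sqrt(2.0 / m)
    v = np.maximum(R @ u, 0.0)  # shape (trials, m)
    return np.mean(np.sum(v ** 2, axis=1)) / np.sum(u ** 2)


print(mean_sq_norm_ratio())
```

The printed ratio should be close to 1, with a Monte Carlo error that shrinks as `trials` increases.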

###### Lemma 2

Let , where , . If , and , then for any fixed vector ,

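$$\Pr\left(\left|\|v\|^2 - \|u\|^2\right| \le \epsilon \|u\|^2\right) \ge 1 - 2\exp\left(-m\left(\frac{\epsilon}{4} + \log\frac{2}{1 + \sqrt{1 + \epsilon}}\right)\right) \tag{44}$$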
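To get a feel for the bound in Eq. 44, the sketch below evaluates its failure probability and inverts it to obtain the smallest width that guarantees a target failure probability. The function names and the symbol `delta` for the target failure probability are assumptions for this example; the inversion is simply the bound solved for $m$.

```python
import math


def lemma2_exponent(eps):
    """Per-unit exponent in Eq. 44: eps / 4 + log(2 / (1 + sqrt(1 + eps)))."""
    return eps / 4.0 + math.log(2.0 / (1.0 + math.sqrt(1.0 + eps)))


def lemma2_failure_bound(m, eps):
    """Upper bound on the failure probability, 2 * exp(-m * exponent)."""
    return 2.0 * math.exp(-m * lemma2_exponent(eps))


def min_width(eps, delta):
    """Smallest integer m for which the failure bound is at most delta."""
    return math.ceil(math.log(2.0 / delta) / lemma2_exponent(eps))


for eps in (0.1, 0.25, 0.5):
    m = min_width(eps, delta=0.05)
    print(eps, m, lemma2_failure_bound(m, eps))
```

Because the exponent is small for small $\epsilon$, the required width grows quickly as the tolerated distortion shrinks.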

Proof: Define . Then we have that each element and independent from one another since where denotes the rectified Normal distribution. Thus to bound the probability of failure for the R.H.S.,

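\begin{align}
\Pr\left(\|v\|^2 \ge (1+\epsilon)\|u\|^2\right) &= \Pr\left(\frac{\|u\|^2}{0.5m}\|\tilde{v}\|^2 \ge (1+\epsilon)\|u\|^2\right) \tag{45} \\
&= \Pr\left(\|\tilde{v}\|^2 \ge 0.5m(1+\epsilon)\right) \tag{46}
\end{align}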

Using Chernoff’s bound, we get for any $\lambda > 0$,

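\begin{align}
\Pr\left(\|\tilde{v}\|^2 \ge 0.5m(1+\epsilon)\right) &= \Pr\left(\exp(\lambda\|\tilde{v}\|^2) \ge \exp(\lambda\, 0.5m(1+\epsilon))\right) \tag{47} \\
&\le \frac{E[\exp(\lambda\|\tilde{v}\|^2)]}{\exp(0.5m\lambda(1+\epsilon))} \tag{48} \\
&= \frac{E\left[\exp\left(\sum_{i=1}^{m}\lambda\tilde{v}_i^2\right)\right]}{\exp(0.5m\lambda(1+\epsilon))} \tag{49} \\
&= \frac{\prod_{i=1}^{m} E[\exp(\lambda\tilde{v}_i^2)]}{\exp(0.5m\lambda(1+\epsilon))} \tag{50} \\
&= \left(\frac{E[\exp(\lambda\tilde{v}_i^2)]}{\exp(0.5\lambda(1+\epsilon))}\right)^m \tag{51}
\end{align}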

Denote $p(\tilde{v}_i)$ as the probability distribution of the rectified Normal random variable $\tilde{v}_i$. Then,

$$E[\exp(\lambda\tilde{v}_i^2)] = \int_{-\infty}^{\infty} \exp(\lambda\tilde{v}_i^2)\, p(\tilde{v}_i)\, d\tilde{v}_i \tag{52}$$

We know that the mass at $\tilde{v}_i = 0$ is 0.5 and the density between $\tilde{v}_i = 0$ and $\tilde{v}_i = \infty$ follows the Normal distribution. Thus,

 E[exp(λ~vi2)] =0.5exp(0)+1√2π∫∞0exp(λ~vi2−~vi2/2) (53) =0.5+12√(1−2λ)√2√π