1 Introduction
The best-known universal approximation theorems of neural networks (NNs) were obtained almost three decades ago by Cybenko (1989) and Hornik et al. (1989), stating that every measurable function can be approximated accurately by a single-hidden-layer neural network, i.e., a shallow neural network. Although powerful, these results do not provide any information on the required size of a neural network to achieve a prespecified accuracy. In Barron (1993), the author analyzed the size of a neural network to approximate functions using Fourier transforms. Subsequently, in Mhaskar (1996), the author considered optimal approximations of smooth and analytic functions in shallow networks, and demonstrated that neurons can uniformly approximate any function on a compact set in with error . This is an interesting result and it shows that to approximate a three-dimensional function with accuracy we need to design a NN with neurons for a function, but for a very smooth function, e.g., , we only need 1000 neurons.
In the last 15 years, deep neural networks (i.e., networks with a large number of layers) have been used very effectively in diverse applications. After some initial debate, at the present time, it seems that deep NNs perform better than shallow NNs of comparable size, e.g., a 3-layer NN with 10 neurons per layer may be a better approximator than a 1-layer NN with 30 neurons. From the approximation point of view, there are several theoretical results to explain this superior performance. In Eldan & Shamir (2016), the authors showed that a simple approximately radial function can be approximated by a small 3-layer feedforward NN, but it cannot be approximated by any 2-layer network with the same accuracy irrespective of the activation function, unless its width is exponential in the dimension (see Mhaskar et al. (2017); Mhaskar & Poggio (2016); Delalleau & Bengio (2011); Poggio et al. (2017) for further discussions). In Liang & Srikant (2017) (see also Yarotsky (2017)), the authors claimed that for approximation of a large class of piecewise smooth functions using the rectified linear unit (ReLU) activation function, a multilayer NN using layers only needs neurons, while neurons are required by NNs with layers. That is, the number of neurons required by a shallow network to approximate a function is exponentially larger than the corresponding number of neurons needed by a deep network for a given accuracy level of function approximation. In Petersen & Voigtlaender (2018), the authors studied approximation theory of a class of (possibly discontinuous) piecewise functions for ReLU NN, and they found that no more than nonzero weights are required to approximate the function in the sense, which proves to be optimal. Under this optimality condition, they also show that a minimum depth (up to a multiplicative constant) is given by to achieve optimal approximation rates. As for the expressive power of NNs in terms of the width, Lu et al. (2017) showed that any Lebesgue integrable function from to can be approximated by a ReLU forward NN of width with respect to distance, and cannot be approximated by any ReLU NN whose width is no more than . Hanin & Sellke (2017) showed that any continuous function can be approximated by a ReLU forward NN of width , and they also give a quantitative estimate of the depth of the NN; here and are the dimensions of the input and output, respectively.
For classification problems, networks with a pyramidal structure and a certain class of activation functions need to have width larger than the input dimension in order to produce disconnected decision regions (Nguyen et al., 2018).
With regard to the optimal activation function employed in the NN approximation, before 2010 the two commonly used nonlinear activation functions were the logistic sigmoid and the hyperbolic tangent ($\tanh$); they are essentially the same function up to a simple rescaling, i.e., $\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$. Deep neural networks with these two activations are difficult to train (Glorot & Bengio, 2010). The nonzero mean of the sigmoid induces important singular values in the Hessian (LeCun et al., 1998), and they both suffer from the vanishing gradient problem, especially through neurons near saturation (Glorot & Bengio, 2010). In 2011, ReLU was proposed, which avoids the vanishing gradient problem because of its linearity, and also results in highly sparse NNs (Glorot et al., 2011). Since then, ReLU and its variants, including leaky ReLU (LReLU) (Maas et al., 2013), parametric ReLU (PReLU) (He et al., 2015) and ELU (Clevert et al., 2015), are favored in almost all deep learning models. Thus, in this study, we focus on the ReLU activation.
While the aforementioned theoretical results are very powerful, they do not necessarily coincide with the results of training NNs in practice, which is NP-hard (Šíma, 2002). For example, while the theory may suggest that the approximation of a multidimensional smooth function is accurate for a NN with 10 layers and 5 neurons per layer, it may not be possible to realize this NN approximation in practice. Fukumizu & Amari (2000) first proved that the existence of local minima poses a serious problem in learning of NNs. After that, more work has been done to understand bad local minima under different assumptions (Zhou & Liang, 2017; Du et al., 2017; Safran & Shamir, 2017; Wu et al., 2018; Yun et al., 2018). Besides local minima, singularity (Amari et al., 2006) and bad saddle points (Kawaguchi, 2016) also affect training of NNs. Our paper focuses on a particular kind of bad local minima, i.e., those encountered in deep and narrow neural networks, which collapse with high probability. Our results are summarized in Fig. 6, which shows a diagram of the safe region of training to achieve the theoretically expected accuracy. As we show in the next section through numerical simulations as well as in subsequent sections through theoretical results, there is a very high probability that deep and narrow ReLU NNs will converge to an erroneous state, which may be the mean value of the function or its partial mean value. However, if the NN is trained with proper normalization techniques, such as batch normalization (Ioffe & Szegedy, 2015), the collapse can be avoided. Not every normalization technique is effective; for example, weight normalization (Salimans & Kingma, 2016) leads to the collapse of the NN.
2 Collapse of deep and narrow neural networks
In this section, we will present several numerical tests for one- and two-dimensional functions of different regularity to demonstrate that deep and narrow NNs collapse to the mean value or partial mean value of the function.
It is well known that it is hard to train deep neural networks. Here we show through numerical simulations that the situation gets even worse if the neural network is narrow. First, we use a 10-layer ReLU network with width 2 to approximate , and choose the mean squared error (MSE) as the loss. In fact, can be represented exactly by a 2-layer ReLU NN with width 2, . However, our numerical tests show that there is a high probability () for the NN to collapse to the mean value of (Fig. 1), no matter what kernel initializers (He normal (He et al., 2015), LeCun normal (LeCun et al., 1998; Klambauer et al., 2017), Glorot uniform (Glorot & Bengio, 2010)) or optimizers (first order or second order, including SGD, SGDNesterov (Sutskever et al., 2013), AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), RMSProp (Hinton, 2014), Adam (Kingma & Ba, 2015), BFGS (Nocedal & Wright, 2006), L-BFGS (Byrd et al., 1995)) are employed. The training data were sampled from a uniform distribution on , and the minibatch size was chosen as 128 during training. We find that when this happens, in most cases the bias in the last layer is the mean value of the function , and the composition of all the previous layers is equivalent to a zero function. It can be proved that under these conditions, the gradient vanishes, i.e., the optimization stops (Corollary 5). For functions of different regularity, we observed the same collapse problem, see Fig. 2 for the function and Fig. 3 for the function .
For multidimensional inputs and outputs, this collapse phenomenon is also observed in our simulations. Here, we test the target function with and , which can be represented by a 2-layer neural network with width 4, . When training a 10-layer ReLU network with width 4, there is a very high probability for the NN to collapse to the mean value or, with low probability, to the partial mean value of (Fig. 4).
We also observed the same collapse problem for other losses, such as the mean absolute error (MAE); the results are summarized in Fig. 5 for three different functions with varying regularity. Furthermore, we find that for MSE loss, the constant is the mean value of the target function, while for MAE it is the median value.
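The link between the loss and the collapsed constant can be checked numerically. The following is a minimal sketch, not the setup used in the experiments: it assumes a hypothetical target $y = x^2$ sampled on $[-1, 1]$ (chosen only because its mean and median differ) and a hypothetical helper `best_constant` that grid-searches the constant minimizing each loss.

```python
import numpy as np

def best_constant(y, loss, num_candidates=1001):
    """Grid-search the constant c that minimizes the chosen loss over samples y."""
    candidates = np.linspace(y.min(), y.max(), num_candidates)
    residuals = y[None, :] - candidates[:, None]  # shape: (candidates, samples)
    if loss == "mse":
        losses = np.mean(residuals ** 2, axis=1)
    elif loss == "mae":
        losses = np.mean(np.abs(residuals), axis=1)
    else:
        raise ValueError(f"unknown loss: {loss}")
    return candidates[np.argmin(losses)]

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=4000)
y = x ** 2  # hypothetical target, used only for illustration

c_mse = best_constant(y, "mse")  # expected to sit next to the sample mean
c_mae = best_constant(y, "mae")  # expected to sit next to the sample median
print(c_mse, y.mean())
print(c_mae, np.median(y))
```

A collapsed network can only output one constant, so the best it can do under MSE is the sample mean and under MAE the sample median, which is the behavior reported above.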
3 Initialization of ReLU nets
As we demonstrated above, when the weights of the ReLU NN are randomly initialized from a symmetric distribution, the deep and narrow NN will collapse with high probability. This type of initialization is widely used in real applications. Here, we demonstrate that this initialization avoids the problem of exploding/vanishing mean activation length, therefore this is beneficial for training neural networks.
We study a feedforward neural network $\mathcal{N}: \mathbb{R}^{d_{in}} \to \mathbb{R}^{d_{out}}$ with $L$ layers and $N_l$ neurons in layer $l$ ($N_0 = d_{in}$, $N_L = d_{out}$). The weights and biases in layer $l$ are an $N_l \times N_{l-1}$ weight matrix $W^l$ and $b^l \in \mathbb{R}^{N_l}$, respectively. The input is $x^0 \in \mathbb{R}^{d_{in}}$, and the neural activity in layer $l$ is $x^l \in \mathbb{R}^{N_l}$. The feedforward dynamics is given by
$$h^l = W^l x^{l-1} + b^l, \qquad x^l = \phi(h^l),$$
where $\phi$ is a component-wise activation function.
Following the work in Poole et al. (2016), we investigate how the length of the input propagates through neural networks. The normalized squared length of the vector before activation at each layer is defined as
$$q^l = \frac{1}{N_l} \sum_{i=1}^{N_l} \left(h_i^l\right)^2, \tag{1}$$
where $h_i^l$ denotes the $i$-th entry of the vector $h^l$. If the weights and biases are drawn i.i.d. from zero-mean Gaussians with variances $\sigma_w^2 / N_{l-1}$ and $\sigma_b^2$, respectively, then the length at layer $l$ can be obtained from that of the previous layer (Poole et al., 2016; Long & Sedghi, 2019):
$$q^l = \sigma_w^2 \int \mathcal{D}z \, \phi\left(\sqrt{q^{l-1}}\, z\right)^2 + \sigma_b^2, \qquad l = 2, \dots, L, \tag{2}$$
where $\mathcal{D}z = \frac{dz}{\sqrt{2\pi}} e^{-z^2/2}$ is the standard Gaussian measure, and the initial condition is $q^1 = \sigma_w^2 q^0 + \sigma_b^2$, $q^0 = \frac{1}{N_0} x^0 \cdot x^0$. It is worth pointing out that Eq. 2 is true for ReLU, but it requires that the widths of the NN tend to infinity for other activation functions. When $\phi$ is ReLU, the recursion simplifies to
$$q^l = \frac{\sigma_w^2}{2} q^{l-1} + \sigma_b^2. \tag{3}$$
For ReLU, He normal (He et al., 2015), i.e., $\sigma_w^2 = 2$ and $\sigma_b = 0$, is widely used. This choice guarantees that $q^l = q^{l-1}$, which neither shrinks nor expands the inputs. In fact, this result explains the success of He normal in applications. A parallel work by Hanin & Rolnick (2018) shows that initializing weights from a symmetric distribution with variance 2/fan-in (fan-in is the dimension of the input of each layer) avoids the problem of exploding/vanishing mean activation length. Here we arrived at the same conclusion but with much less work.
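The length-preservation claim can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the authors' code: it assumes the ReLU recursion has the standard form $q^l = \frac{\sigma_w^2}{2} q^{l-1} + \sigma_b^2$ with $q^1 = \sigma_w^2 q^0 + \sigma_b^2$ (as in Poole et al., 2016), uses bias-free layers with weights drawn from $N(0, \sigma_w^2/\text{fan-in})$, and averages over independent networks of a hypothetical width 256. The function names are illustrative.

```python
import numpy as np

def relu_length_recursion(q1, sigma_w2, sigma_b2, num_layers):
    """Iterate q^l = (sigma_w^2 / 2) * q^{l-1} + sigma_b^2 for l = 2..num_layers."""
    q = [q1]
    for _ in range(num_layers - 1):
        q.append(0.5 * sigma_w2 * q[-1] + sigma_b2)
    return np.array(q)

def simulate_lengths(x0, width, num_layers, sigma_w2, rng):
    """Return q^1..q^L for one random bias-free ReLU network."""
    x, q = x0, []
    for _ in range(num_layers):
        fan_in = x.shape[0]
        w = rng.normal(0.0, np.sqrt(sigma_w2 / fan_in), size=(width, fan_in))
        h = w @ x                  # pre-activation of this layer
        q.append(np.mean(h ** 2))  # normalized squared length
        x = np.maximum(h, 0.0)     # ReLU
    return np.array(q)

rng = np.random.default_rng(0)
x0 = rng.normal(size=64)
num_layers, sigma_w2 = 8, 2.0      # sigma_w^2 = 2 corresponds to He normal

# Average over independent networks to reduce finite-width fluctuations.
trials = np.array([simulate_lengths(x0, 256, num_layers, sigma_w2, rng)
                   for _ in range(100)])
q_mean = trials.mean(axis=0)
q_theory = relu_length_recursion(sigma_w2 * np.mean(x0 ** 2), sigma_w2, 0.0, num_layers)
print(np.round(q_mean / q_theory, 2))
```

With $\sigma_w^2 = 2$ the recursion is constant in $l$, and the simulated averages should stay close to it; choosing $\sigma_w^2$ above or below 2 makes the theoretical length grow or shrink geometrically with depth.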
4 Theoretical analysis of the collapse problem
In this section, we present the theoretical analysis of the collapse behavior observed in Section 2, and we also derive an estimate of the probability of this collapse. We start by stating the following assumptions for a ReLU feedforward neural network with $L$ layers and $N_l$ neurons in layer $l$ ($N_0 = d_{in}$, $N_L = d_{out}$):

A1. The domain for is a connected space with at least two points;
A2. The weight matrix of any layer is a random matrix, where the joint distribution of is absolutely continuous with respect to Lebesgue measure for .
Remark: We point out here that the connectedness in assumption A1 is a very weak requirement for the input space. The weights in a neural network are usually sampled independently from continuous distributions in real applications, and thus the assumption A2 is satisfied at the NN initialization stage; during training, the assumption A2 is usually maintained due to stochastic gradients of minibatch.
Lemma 1.
With assumptions A1 and A2, if is a constant function, then there exists a layer such that and , with probability 1 (wp1). (Footnote 1: denotes for any index , i.e., componentwise. Similarly for , and .)
Corollary 2.
With assumptions A1 and A2, if is bias-free and a constant function, then there exists a layer such that for any , it holds and wp1.
Lemma 3.
With assumptions A1 and A2, if is a constant function, then any order gradients of the loss function with respect to the weights and biases in layers vanish, where is the layer obtained in Lemma 1.
Theorem 4.
For a ReLU feedforward neural network with assumption A1, if the assumption A2 is satisfied during the initialization, and there exists a layer such that for any input , then for any function and , is eventually optimized to a constant function when trained by a gradient-based optimizer. If using loss and exists, then the resulting constant is , which we write as if no confusion arises; if using loss and the median of the distribution of exists, then the resulting constant is the median.
Remark: See Appendices A, B, C and D for the proofs of Lemma 1, Corollary 2, Lemma 3 and Theorem 4, respectively. MAE and MSE loss used in practice are discrete versions of and loss, respectively, if the size of minibatch is large.
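The mechanism behind Lemma 3 and Theorem 4 can be reproduced on a toy network. The sketch below is illustrative only and makes several assumptions that are not in the text: a hypothetical architecture with widths `[1, 2, 2, 2, 1]`, a hypothetical target $|x|$, and a hidden layer that is forced to be dead by giving it strictly negative weights and biases. Gradients are estimated with central finite differences via the hypothetical helper `finite_difference_grads`.

```python
import numpy as np

def forward(params, x):
    """ReLU MLP: ReLU after every layer except the last (linear) one."""
    a = x
    for i, (w, b) in enumerate(params):
        a = a @ w.T + b
        if i < len(params) - 1:
            a = np.maximum(a, 0.0)
    return a

def mse(params, x, y):
    return float(np.mean((forward(params, x) - y) ** 2))

def finite_difference_grads(params, x, y, layer, eps=1e-4):
    """Central finite differences of the MSE w.r.t. all parameters of one layer."""
    grads = []
    for arr in params[layer]:   # weight matrix, then bias vector
        flat = arr.reshape(-1)  # a view: writes go through to params
        for k in range(flat.size):
            old = flat[k]
            flat[k] = old + eps
            up = mse(params, x, y)
            flat[k] = old - eps
            down = mse(params, x, y)
            flat[k] = old
            grads.append((up - down) / (2.0 * eps))
    return np.array(grads)

rng = np.random.default_rng(0)
widths = [1, 2, 2, 2, 1]        # input, three hidden layers, output
params = [(rng.normal(size=(n_out, n_in)), rng.normal(size=n_out))
          for n_in, n_out in zip(widths[:-1], widths[1:])]

# Force the second hidden layer to be dead: its inputs are ReLU outputs (>= 0),
# so strictly negative weights and biases keep every pre-activation negative.
params[1] = (-np.abs(params[1][0]) - 0.1, -np.abs(params[1][1]) - 0.1)

x = rng.uniform(-1.0, 1.0, size=(256, 1))
y = np.abs(x)                   # hypothetical target function

outputs = forward(params, x)
grads_up_to_dead = [finite_difference_grads(params, x, y, k) for k in (0, 1)]
grads_last = finite_difference_grads(params, x, y, len(params) - 1)
print(np.ptp(outputs), [np.abs(g).max() for g in grads_up_to_dead], np.abs(grads_last).max())
```

In this construction the output is the same for every input, the finite-difference gradients of the dead layer and the layer before it are zero, and only parameters after the dead layer (here the last bias) still receive a signal, which is why training can only move the network toward a constant.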
Corollary 5.
With assumptions A1 and A2, for a ReLU feedforward neural network and any function , , if is a constant function with the value , then the gradients of the loss function with respect to any weight or bias vanish when using the loss.
Corollary 5 can be generalized to the following corollary including more general converged mean states.
Corollary 6.
With assumptions A1 and A2, for a ReLU feedforward neural network and any function , , if and each is a connected domain with at least two points, such that
then the gradients of the loss function with respect to any weight or bias vanish when using the loss. Here is the random variable of restricted to .
See Appendices E and F for the proofs of Corollaries 5 and 6. We can see that Corollary 5 is a special case of Corollary 6 with .
Lemma 7.
Let us assume that a one-layer ReLU feedforward neural network is initialized independently by symmetric nonzero distributions, i.e., any weight or bias of is initialized by a symmetric nonzero distribution, which can be different for different parameters. Then, for any fixed input the corresponding output is zero with probability , except the special case where all biases and the input are zero yielding that the output is always zero.
Theorem 8.
If a ReLU feedforward neural network with layers assembled width is initialized randomly by symmetric nonzero distributions for weights and zero biases, then for any fixed nonzero input, the corresponding output is zero with probability if the last layer also employs ReLU activation, otherwise with the probability .
See Appendices G and H for the proofs of Lemma 7 and Theorem 8. Although biases are initialized to 0 in most applications, for the sake of completeness, we also consider the case where biases are not initialized to 0.
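The fixed-input probability can be estimated by direct sampling. The sketch below does not quote the formula of Theorem 8; instead it assumes the closed form $1 - \prod_{l}(1 - 2^{-N_l})$, which follows from a short argument for bias-free layers: for a nonzero input each neuron is negative with probability $1/2$ independently, so a layer of width $N_l$ outputs zero with probability $2^{-N_l}$, and once a layer outputs zero all later layers do too. The widths `[2, 2, 2]`, the uniform weight distribution, and the function names are hypothetical choices.

```python
import numpy as np

def zero_output_probability(widths):
    """Assumed closed form: 1 - prod_l (1 - 2^{-N_l}), with ReLU on every layer."""
    widths = np.asarray(widths, dtype=float)
    return 1.0 - np.prod(1.0 - 0.5 ** widths)

def simulate_zero_output(x0, widths, trials, rng):
    """Fraction of random bias-free ReLU networks that map the fixed input x0 to zero."""
    dead = 0
    for _ in range(trials):
        x = x0
        for n in widths:
            # Any symmetric distribution without mass at zero should behave the same.
            w = rng.uniform(-1.0, 1.0, size=(n, x.shape[0]))
            x = np.maximum(w @ x, 0.0)
            if not x.any():
                break  # without biases, a zero vector stays zero
        dead += int(not x.any())
    return dead / trials

rng = np.random.default_rng(0)
widths = [2, 2, 2]
x0 = np.array([0.7])  # any fixed nonzero input
p_theory = zero_output_probability(widths)  # 1 - (3/4)^3
p_sim = simulate_zero_output(x0, widths, trials=20000, rng=rng)
print(p_theory, p_sim)
```

Replacing `rng.uniform` with another symmetric sampler such as `rng.normal` is a quick way to check that the estimate does not depend on the particular symmetric distribution.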
Proposition 9.
If a ReLU feedforward neural network with layers assembled width is initialized randomly by symmetric nonzero distributions (weights and biases), then for any fixed nonzero input, the corresponding output is zero with probability if the last layer also employs ReLU activation, otherwise the output is equal to the last bias with probability .
See Appendix I for the proof of Proposition 9. We note that Theorem 8 provides the probability for any given input, whereas Theorem 4 requires that the entire neural network is a zero function. Hence, the probability in Theorem 8 is an upper bound. In the following proposition, we give a theoretical formula of the probability for the NN with width 2.
Proposition 10.
Suppose the origin is an interior point of . Consider a bias-free ReLU neural network with , width 2 and layers, and weights are initialized randomly by symmetric nonzero distributions. Then for this neural network, the probability of being initialized to a constant function is the last component of , where
(4) 
with and
being the probability distribution after the first layer and the probability transition matrix when one more layer is added, respectively. Here every layer employs the ReLU activation.
See Appendix J for the derivation of and . For general cases, we found that it is hard to obtain an explicit expression for the probability, so we used numerical simulations instead, where 1 million samples of random initialization are used to calculate each probability estimation. We show both theoretically (Theorem 8, Propositions 9 and 10) and numerically that NN has the same probability to collapse no matter what symmetric distributions are used, even if different distributions are used for different weights. On the other hand, to keep the collapse probability less than , because the probability obtained in Theorem 8 is an upper bound, which corresponds to a safer maximum number of layers, we have that , which implies the upper bound of the depth of NN
(5) 
Theorem 8 shows that when the NN gets deeper and narrower, the probability of the NN being initialized to a zero function is higher (Fig. 6A). Hence, we have a higher probability of vanishing gradient in almost all the layers, rather than just some neurons. In our experiments, we also found that there is a very high probability that the gradient is 0 for all parameters except in the last layer, because ReLU is not used in the last layer. During the optimization, the neural network thus can only optimize the parameters in the last layer (Theorem 4). When we design a neural network, we should keep the probability less than 1% or 10%. As a practical guide, we constructed a diagram shown in Fig. 6B that includes both theoretical predictions and our numerical tests. We see that as the number of layers increases, the numerical tests match the theoretical results more closely. It is clear from the diagram that a 10-layer NN of width 10 has a probability of only 1% to collapse whereas a 10-layer NN of width 5 has a probability greater than 10% to collapse; for a width of three the probability is greater than 60%.
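The design guideline can be turned into a small calculator. This is a sketch under assumptions, not the paper's Eq. 5 verbatim: it assumes every layer has the same width $N$, counts every layer as a ReLU layer, and uses the fixed-input bound $1 - (1 - 2^{-N})^{L}$; requiring this to stay below a target $p$ gives $L \le \ln(1-p) / \ln(1 - 2^{-N})$. The function names are hypothetical.

```python
import math

def collapse_upper_bound(width, depth):
    """Assumed bound 1 - (1 - 2^{-N})^L for equal width N and L ReLU layers."""
    return 1.0 - (1.0 - 2.0 ** (-width)) ** depth

def max_safe_depth(width, p):
    """Largest depth L whose assumed bound stays at or below the target probability p."""
    return math.floor(math.log(1.0 - p) / math.log(1.0 - 2.0 ** (-width)))

for width in (10, 5, 3):
    print(width, round(collapse_upper_bound(width, depth=10), 4))
print(max_safe_depth(width=10, p=0.01))
```

Under these assumptions the bound for 10 layers is roughly 0.0097 at width 10, 0.27 at width 5, and 0.74 at width 3, which agrees with the 1%, greater than 10%, and greater than 60% figures read off the diagram.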
5 Training deep and narrow neural networks
In this section, we present some training techniques and examine which ones do not suffer from the collapse problem.
5.1 Asymmetric weight initialization
Our analysis applies to any symmetric initialization, so it is natural to consider asymmetric initializations. The asymmetric initializations proposed in the literature include orthogonal initialization (Saxe et al., 2014) and layer-sequential unit-variance (LSUV) initialization (Mishkin & Matas, 2016). LSUV is the orthogonal initialization combined with rescaling of weights such that the output of each layer has unit variance. Because weight rescaling cannot make the output escape from the negative part of ReLU, it is sufficient to consider the orthogonal initialization. The probability of collapse when using orthogonal initialization is very close to, and only a little lower than, that when using symmetric distributions (Fig. 7). Therefore, orthogonal initialization does not resolve the collapse problem.
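A fixed-input proxy gives some intuition for why orthogonal initialization does not help. The sketch below is an illustration under assumptions that go beyond the text: all layers are square and bias-free, weights are Haar-distributed orthogonal matrices sampled by sign-corrected QR, and the quantity estimated is the probability that one fixed input is mapped to zero, not the whole-function collapse probability reported in Fig. 7. Since a Haar orthogonal matrix sends a fixed vector to a uniformly random direction, each layer should output zero with probability $2^{-N}$, the same as for symmetric i.i.d. weights.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Sample an n x n orthogonal matrix from the Haar measure (sign-corrected QR)."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))  # flip column signs to remove the QR sign bias

def simulate_zero_output_orthogonal(x0, depth, trials, rng):
    """Fraction of orthogonally initialized bias-free ReLU nets mapping x0 to zero."""
    width = x0.shape[0]
    dead = 0
    for _trial in range(trials):
        x = x0
        for _layer in range(depth):
            x = np.maximum(haar_orthogonal(width, rng) @ x, 0.0)
            if not x.any():
                break
        dead += int(not x.any())
    return dead / trials

rng = np.random.default_rng(0)
width, depth = 3, 4
x0 = rng.normal(size=width)  # fixed nonzero input with as many entries as the width
p_sim = simulate_zero_output_orthogonal(x0, depth, trials=5000, rng=rng)
p_iid = 1.0 - (1.0 - 2.0 ** (-width)) ** depth  # assumed value for symmetric i.i.d. weights
print(p_sim, p_iid)
```

This proxy cannot show the small gap between the two initializations seen for the full collapse probability; it only indicates that orthogonality by itself does not keep activations out of the negative part of ReLU.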
5.2 Normalization and dropout
As we have shown in the previous section, deep and narrow neural networks cannot be trained well directly with gradient-based optimizers. Here, we employ several widely used normalization techniques to train this kind of network. We do not consider some methods, such as Highway (Srivastava et al., 2015) and ResNet (He et al., 2016), because in these architectures the neural nets are no longer the standard feedforward neural networks. Current normalization methods mainly include batch normalization (BN) (Ioffe & Szegedy, 2015), layer normalization (LN) (Ba et al., 2016), weight normalization (WN) (Salimans & Kingma, 2016), instance normalization (IN) (Ulyanov et al., 2016), group normalization (GN) (Wu & He, 2018), and scaled exponential linear units (SELU) (Klambauer et al., 2017). BN, LN, IN and GN are similar techniques and follow the same formulation, see Wu & He (2018) for the comparison.
Because we focus on the performance of these normalization methods on narrow nets and the width of the neural network must be larger than the dimension of the input to achieve a good approximation, we only test the normalization methods on low-dimensional inputs. However, LN, IN and GN perform normalization on each training sample individually, and hence they cannot be used in our low-dimensional situations. Hence, we only examine BN, WN and SELU. BN is applied before activations, while for SELU the LeCun normal initialization is used (Klambauer et al., 2017). Our simulations show that the neural network can successfully escape from the collapsed areas and approximate the target function with a small error when BN or SELU is employed. BN changes the weights and biases based on more than the gradients alone, and, unlike ReLU, SELU does not map negative values to zero. However, WN failed because it is only a simple reparameterization of the weight vectors.
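A toy batch of pre-activations shows why BN and SELU leave a dead ReLU layer with a usable signal. This is a minimal sketch with assumptions not taken from the text: the batch is synthetic with every pre-activation forced negative, `batch_norm` omits the learnable scale and shift, and `selu` uses the standard constants from Klambauer et al. (2017).

```python
import numpy as np

def relu(h):
    return np.maximum(h, 0.0)

def batch_norm(h, eps=1e-5):
    """Normalize each neuron (column) over the batch; learnable scale/shift omitted."""
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def selu(h, alpha=1.6732632423543772, scale=1.0507009873554805):
    """Scaled exponential linear unit: negative inputs map to negative, nonzero outputs."""
    return scale * np.where(h > 0.0, h, alpha * np.expm1(np.minimum(h, 0.0)))

rng = np.random.default_rng(0)
# Synthetic batch (128 samples, width 2) in which every pre-activation is negative.
h = -np.abs(rng.normal(size=(128, 2))) - 0.1

dead_relu = relu(h)             # all zeros: the layer is dead
alive_bn = relu(batch_norm(h))  # centering makes part of each column positive
alive_selu = selu(h)            # negative but nonzero, so information still flows
print(dead_relu.any(), (alive_bn > 0).mean(axis=0), alive_selu.min(), alive_selu.max())
```

Because BN recenters each neuron over the batch, a column can only be entirely non-positive after normalization if it is constant, while SELU never outputs exactly zero for a negative input; plain ReLU has neither property.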
Moreover, our simulations show that the issue of collapse cannot be solved by dropout, which induces sparsity and more zero activations (Srivastava et al., 2014).
6 Conclusion
We consider here ReLU neural networks for approximating multidimensional functions of different regularity, and in particular we focus on deep and narrow NNs due to their reportedly good approximation properties. However, we found that training such NNs is problematic because they converge to erroneous means or partial means or medians of the target function. We demonstrated this collapse problem numerically using one- and two-dimensional functions with , and regularity. These numerical results are independent of the optimizers we used; the converged state depends on the loss but changing the loss function does not lead to correct answers. In particular, we have observed that the NN with MSE loss converges to the mean or partial mean values while the NN with MAE loss converges to the median values. This collapse phenomenon is induced by the symmetric random initialization, which is popular in practice because it maintains the length of the outputs of each layer as we show theoretically in Section 3.
We analyze theoretically the collapse phenomenon by first proving that if a NN is a constant function then there must exist a layer with output 0 and the gradients of weights and biases in all the previous layers vanish (Lemma 1, Corollary 2, and Lemma 3). Subsequently, we prove that if such conditions are met, then the NN will converge to a constant value depending on the loss function (Theorem 4). Furthermore, if the output of NN is equal to the mean value of the target function, the gradients of weights and biases vanish (Corollaries 5 and 6). In Lemma 7 and Theorem 8 and Proposition 9, we derive estimates of the probability of collapse for general cases, and in Proposition 10, we derive a more precise estimate for deep NNs with width 2. These theoretical estimates are verified numerically by tests using NNs with different layers and widths. Based on these results, we construct a diagram which can be used as a practical guideline in designing deep and narrow NNs that do not suffer from the collapse phenomenon.
Finally, we examine different methods of preventing deep and narrow NNs from converging to erroneous states. In particular, we find that asymmetric initializations including orthogonal initialization and LSUV cannot be used to avoid this collapse. However, some normalization techniques such as batch normalization and SELU can be used successfully to prevent the collapse of deep and narrow NNs; on the other hand, weight normalization fails. Similarly, we examine the effect of dropout which, however, also fails.
Acknowledgments
This work was supported by the DARPA EQUiPS grant N660011524055, the NSF grant DMS1736088, and the AFOSR grant FA95501710013. The research of the second author was partially supported by the NSF of China 11771083 and the NSF of Fujian 2017J01556, 2016J01013.
References
 Amari et al. (2006) S. Amari, H. Park, and T. Ozeki. Singularities affect dynamics of learning in neuromanifolds. Neural computation, 18(5):1007–1065, 2006.
 Ba et al. (2016) J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

 Barron (1993) A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
 Byrd et al. (1995) R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
 Clevert et al. (2015) D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
 Cybenko (1989) G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
 Delalleau & Bengio (2011) O. Delalleau and Y. Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pp. 666–674, 2011.
 Du et al. (2017) S. Du, J. Lee, Y. Tian, B. Poczos, and A. Singh. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.

 Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 Eldan & Shamir (2016) R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pp. 907–940, 2016.

 Fukumizu & Amari (2000) K. Fukumizu and S. Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural networks, 13(3):317–327, 2000.
 Glorot & Bengio (2010) X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
 Glorot et al. (2011) X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 315–323, 2011.
 Hanin & Rolnick (2018) B. Hanin and D. Rolnick. How to start training: The effect of initialization and architecture. arXiv preprint arXiv:1803.01719, 2018.
 Hanin & Sellke (2017) B. Hanin and M. Sellke. Approximating continuous functions by relu nets of minimal width. arXiv preprint arXiv:1710.11278, 2017.

 He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.
 He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 Hinton (2014) G. Hinton. Overview of minibatch gradient descent. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf, 2014.
 Hornik et al. (1989) K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
 Ioffe & Szegedy (2015) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
 Kawaguchi (2016) K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.
 Kingma & Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 Klambauer et al. (2017) G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pp. 972–981, 2017.
 LeCun et al. (1998) Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–50. Springer, 1998.
 Liang & Srikant (2017) S. Liang and R. Srikant. Why deep neural networks for function approximation? In International Conference on Learning Representations, 2017.
 Long & Sedghi (2019) P. M. Long and H. Sedghi. On the effect of the activation function on the distribution of hidden nodes in a deep network, 2019. URL https://openreview.net/forum?id=HJej3s09Km.
 Lu et al. (2017) Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang. The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, pp. 6231–6239, 2017.
 Maas et al. (2013) A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, volume 30, pp. 3, 2013.
 Mhaskar (1996) H. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8(1):164–177, 1996.
 Mhaskar et al. (2017) H. Mhaskar, Q. Liao, and T. A. Poggio. When and why are deep networks better than shallow ones? In Association for the Advancement of Artificial Intelligence, pp. 2343–2349, 2017.
 Mhaskar & Poggio (2016) H. N. Mhaskar and T. Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848, 2016.
 Mishkin & Matas (2016) D. Mishkin and J. Matas. All you need is a good init. In International Conference on Learning Representations, 2016.
 Nguyen et al. (2018) Q. Nguyen, M. Mukkamala, and M. Hein. Neural networks should be wide enough to learn disconnected decision regions. In International Conference on Machine Learning, 2018.
 Nocedal & Wright (2006) J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
 Petersen & Voigtlaender (2018) P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep relu neural networks. In Conference on Learning Theory, 2018.

 Poggio et al. (2017) T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5):503–519, 2017.
 Poole et al. (2016) B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pp. 3360–3368, 2016.
 Safran & Shamir (2017) I. Safran and O. Shamir. Spurious local minima are common in two-layer relu neural networks. arXiv preprint arXiv:1712.08968, 2017.
 Salimans & Kingma (2016) T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.
 Saxe et al. (2014) A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, 2014.
 Šíma (2002) J. Šíma. Training a single sigmoidal neuron is hard. Neural computation, 14(11):2709–2728, 2002.
 Srivastava et al. (2014) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Srivastava et al. (2015) R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems, pp. 2377–2385, 2015.
 Sutskever et al. (2013) I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147, 2013.
 Ulyanov et al. (2016) D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance Normalization: The Missing Ingredient for Fast Stylization. ArXiv eprints, July 2016.
 Wu et al. (2018) C. Wu, J. Luo, and J. Lee. No spurious local minima in a two hidden unit relu network. In International Conference on Learning Representations Workshop, 2018.
 Wu & He (2018) Y. Wu and K. He. Group normalization. arXiv preprint arXiv:1803.08494, 2018.
 Yarotsky (2017) D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
 Yun et al. (2018) C. Yun, S. Sra, and A. Jadbabaie. Small nonlinearities in activation functions create bad local minima in neural networks. arXiv preprint arXiv:1802.03487, 2018.
 Zeiler (2012) M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
 Zhou & Liang (2017) Y. Zhou and Y. Liang. Critical points of neural networks: Analytical forms and landscape properties. arXiv preprint arXiv:1710.11205, 2017.
Appendix A Proof of Lemma 1
Lemma 11.
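Let $W \in \mathbb{R}^{n \times m}$ be a random matrix, where $\{W_{ij}\}$ are random variables, and the joint distribution of $(W_{i1}, \dots, W_{im})$ is absolutely continuous for $i = 1, \dots, n$. If $x \in \mathbb{R}^{m}$ is a nonzero column vector, then $P(Wx = 0) = 0$.
Proof.
Let us consider the first value of $Wx$, i.e., $\sum_{j=1}^{m} W_{1j} x_j$. Because $x \neq 0$, we have
$\{(W_{11}, \dots, W_{1m}) : \sum_{j=1}^{m} W_{1j} x_j = 0\}$ is a hyperplane in $\mathbb{R}^{m}$
whose coordinates are $W_{1j}$, $j = 1, \dots, m$. Because the joint distribution of $(W_{11}, \dots, W_{1m})$ is absolutely continuous, $P\big(\sum_{j=1}^{m} W_{1j} x_j = 0\big) = 0$. Hence, $P(Wx = 0) \le P\big(\sum_{j=1}^{m} W_{1j} x_j = 0\big) = 0$. Therefore, $P(Wx = 0) = 0$. ∎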
Now let us go back to the proof of Lemma 1.
Proof.
By assumption A2 and Lemma 11, is a constant function, iff is a constant function with respect to . So we can assume that there is ReLU in the last layer, and prove that there exists a layer , s.t., and wp1 for every . We proceed in two steps.
i) For , we have is a constant. If is not always , then there exists and , s.t., . Because is a connected space with at least two points, then has no isolated points, which implies is not an isolated point. Since the neural network is a continuous map, is connected. So there exists in the neighborhood of , s.t., and wp1, because of by Lemma 11. Hence, , which contradicts the fact that is a constant function. Therefore, and .
ii) Assume the theorem is true for . Then for , if , choose and we are done; otherwise, consider the NN without the first layer with as the input, denoted . By i, is a connected space with at least two points. Because is a constant function of and has layers, by induction, there exists a layer whose output is zero. Therefore, for the original neural network , the output of such layer is also zero.
By i and ii, the statement is true for any . ∎
Appendix B Proof of Corollary 2
Proof.
By Lemma 1, there exists a layer , s.t. and wp1. Because is bias-free, and wp1. By induction, for any , and wp1. ∎
Appendix C Proof of Lemma 3
Proof.
Because
, it is then obvious by backpropagation. ∎
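As an informal illustration of the backpropagation argument, the following minimal Python sketch uses a hypothetical scalar, bias-free network with two ReLU layers and a linear output. The names `loss` and `finite_difference` and all weight values are assumptions chosen for illustration and do not come from the paper. When the second ReLU layer outputs zero, central finite differences of the loss with respect to the weights at or before that layer are exactly zero.

```python
def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)


def loss(w1, w2, w3, x=1.5, y=2.0):
    """Squared error of a scalar bias-free net: x -> relu -> relu -> linear."""
    h1 = relu(w1 * x)
    h2 = relu(w2 * h1)   # this layer is dead whenever w2 * h1 <= 0
    out = w3 * h2
    return (out - y) ** 2


def finite_difference(f, args, index, eps=1e-6):
    """Central finite-difference estimate of d f / d args[index]."""
    up = list(args)
    down = list(args)
    up[index] += eps
    down[index] -= eps
    return (f(*up) - f(*down)) / (2 * eps)


dead = (0.8, -1.2, 0.5)   # negative w2 kills the second layer (illustrative values)
live = (0.8, 1.2, 0.5)    # same net with the second layer active

grad_dead_w1 = finite_difference(loss, dead, 0)
grad_dead_w2 = finite_difference(loss, dead, 1)
grad_live_w1 = finite_difference(loss, live, 0)
```

In the dead configuration the loss does not react to small changes in `w1` or `w2`, so a gradient-based optimizer leaves them unchanged; in the live configuration the same derivative is clearly nonzero.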
Appendix D Proof of Theorem 4
Proof.
Because , is a constant function, and then by Lemma 3, gradients of the loss function w.r.t. the weights and biases in layers vanish. Hence, the weights and biases in layers will not change when using a gradient based optimizer, which implies is always a constant function depending on the weights and biases in layers . Therefore, will be optimized to a constant function, which has the smallest loss. For loss, this constant with the smallest loss is . For loss, this constant with the smallest loss is its median. ∎
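The last two sentences of the proof rest on a standard fact: among constant predictors, mean squared error is minimized by the sample mean and mean absolute error by the sample median. The short check below is only a sketch on made-up data; the data values, the search grid, and the helper names are all hypothetical.

```python
import statistics


def l2_loss(c, ys):
    """Mean squared error of the constant predictor c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)


def l1_loss(c, ys):
    """Mean absolute error of the constant predictor c."""
    return sum(abs(y - c) for y in ys) / len(ys)


def best_constant(loss_fn, ys, lo=0.0, hi=10.0, steps=1000):
    """Brute-force search for the best constant on a uniform grid."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda c: loss_fn(c, ys))


ys = [0.0, 1.0, 2.0, 10.0, 3.0]            # illustrative targets only
best_l2 = best_constant(l2_loss, ys)        # compare with statistics.mean(ys)
best_l1 = best_constant(l1_loss, ys)        # compare with statistics.median(ys)
```

With these targets the grid search returns the mean for the squared loss and the median for the absolute loss, which is the constant a collapsed network would be driven toward under each loss.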
Appendix E Proof of Corollary 5
Appendix F Proof of Corollary 6
Proof.
It suffices to show that gradients vanish for , and .
i) When is restricted on , is a constant function with value . Similar to Corollary 5, gradients vanish when using the loss.
ii) For , the loss at is 0, so gradients vanish.
By i and ii, gradients vanish when using the (MSE) loss. ∎
Appendix G Proof of Lemma 7
Proof.
Let be any input, and be the corresponding output. For ,
Because is a dim vector initialized by a symmetric distribution, then
So , and then . Here denotes the probability. ∎
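The symmetry argument can be checked numerically. The sketch below assumes the event of interest is that every neuron of a bias-free ReLU layer is inactive for a fixed nonzero input, and that the weights are independent standard Gaussians (one symmetric choice among many). Under those assumptions each pre-activation is negative with probability 1/2, so a layer of width N outputs zero with probability 2^(-N). The function names and the test input are hypothetical.

```python
import random


def layer_dies(width, x, rng):
    """True if a bias-free ReLU layer with Gaussian weights outputs all zeros on x."""
    for _ in range(width):
        pre_activation = sum(rng.gauss(0.0, 1.0) * xi for xi in x)
        if pre_activation > 0.0:
            return False   # at least one neuron is active
    return True


def estimate_dead_probability(width, x, trials=20000, seed=0):
    """Monte Carlo estimate of P(layer output is zero) for a fixed input x."""
    rng = random.Random(seed)
    dead = sum(layer_dies(width, x, rng) for _ in range(trials))
    return dead / trials


estimate = estimate_dead_probability(width=2, x=[1.0, -0.5, 2.0])
expected = 0.5 ** 2   # symmetry argument: (1/2)^width
```

The estimate depends only on the signs of the pre-activations, so rescaling the input or the weight variance leaves it unchanged.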
Appendix H Proof of Theorem 8
Proof.
If the last layer also employs ReLU activation, by Lemma 7, for . Then, for any fixed input ,
The last equality holds because .
If in the last layer we do not apply ReLU activation, then . ∎
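For a fixed input, the layer-by-layer argument can be simulated directly. The sketch below is an illustration under stated assumptions, not the paper's experiment: weights are independent standard Gaussians, all biases are zero, every hidden layer has the same width, and the comparison value 1 - (1 - 2^(-N))^H for H hidden ReLU layers of width N is the product formula that follows from treating each layer as dying independently with probability 2^(-N) given a nonzero input. All function names are hypothetical.

```python
import random


def relu_layer(x, width, rng):
    """Apply one bias-free ReLU layer with fresh Gaussian weights to x."""
    return [max(0.0, sum(rng.gauss(0.0, 1.0) * xi for xi in x))
            for _ in range(width)]


def dies_on_input(x, width, hidden_layers, rng):
    """True if some hidden layer outputs the zero vector for input x."""
    h = x
    for _ in range(hidden_layers):
        h = relu_layer(h, width, rng)
        if all(v == 0.0 for v in h):
            return True
    return False


def estimate(x, width, hidden_layers, trials=20000, seed=0):
    """Monte Carlo estimate over random initializations."""
    rng = random.Random(seed)
    hits = sum(dies_on_input(x, width, hidden_layers, rng) for _ in range(trials))
    return hits / trials


def closed_form(width, hidden_layers):
    """1 - prod_l (1 - 2^-width) for equal-width hidden layers."""
    return 1.0 - (1.0 - 0.5 ** width) ** hidden_layers


simulated = estimate([1.0], width=2, hidden_layers=5)
predicted = closed_form(width=2, hidden_layers=5)
```

For width 2 and five hidden layers the predicted value is about 0.76, which shows how quickly narrow bias-free networks collapse on a given input as depth grows.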
Appendix I Proof of Proposition 9
Proof.
If the last layer also has ReLU activation, by Lemma 7,
If the last layer does not have ReLU activation, and , then
For , is a single layer perceptron, which is a trivial case. ∎
Appendix J Proof of Proposition 10
Proof.
We consider a ReLU neural network with and each hidden layer with width 2. Because all biases are zero, the following holds: when the input is 0, the output of any neuron in any layer is 0; when the input is negative, the output of any neuron in any layer is a linear function of the input; when the input is positive, the output of any neuron in any layer is also a linear function of the input. Because the origin is an interior point of , it suffices to consider a subset with . The output of each hidden layer has 16 possible cases:
where , , , are some coefficients.
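The linearity on each side of the origin follows from positive homogeneity of bias-free ReLU networks: scaling a scalar input by a positive factor scales every neuron's output by the same factor. The sketch below checks this for a randomly initialized width-2 network with scalar input. The Gaussian initialization, the depth, and the helper names are assumptions made for illustration. Reading the 16 cases as the zero or nonzero patterns of the four one-sided slopes (two neurons, two sides) is likewise an interpretation, not a statement taken from the text.

```python
import random


def make_network(depth, width=2, seed=0):
    """Random bias-free weights for a net with scalar input and `depth` hidden layers."""
    rng = random.Random(seed)
    sizes = [1] + [width] * depth
    return [[[rng.gauss(0.0, 1.0) for _ in range(sizes[i])]
             for _ in range(sizes[i + 1])]
            for i in range(depth)]


def forward(weights, x):
    """Output of the last hidden layer for scalar input x."""
    h = [x]
    for layer in weights:
        h = [max(0.0, sum(w * v for w, v in zip(row, h))) for row in layer]
    return h


def side_slopes(weights):
    """Slopes on x > 0 and on x < 0: the output is x * pos or |x| * neg."""
    return forward(weights, 1.0), forward(weights, -1.0)


net = make_network(depth=4, seed=1)
pos_slopes, neg_slopes = side_slopes(net)
```

Each of the four slopes is either zero or positive, so the pair `(pos_slopes, neg_slopes)` identifies which pattern a given layer falls into.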
Each case in the th hidden layer may also induce all 16 cases in the th layer. For any given case in the th hidden layer, we will compute the probabilities of these 16 cases for the th layer as follows.
i) Case (1)
Note that lies in the first quadrant, and lies in the third quadrant. Then the output of the next layer is
Since the matrix is random, for fixed and , the probability of case (1) is . Without loss of generality, we can assume that , and hence we can assume that , and , . It is easy to see that
Since are random, the probability of case (1) is
Similarly, the probabilities of cases (6), (11) and (16) in the th layer are also . For cases (2), (3), (5), (8), (9), (12), (14) and (15), the probability is
For cases (4), (7), (10) and (13), the probability is
ii) Case (2) (the same method can be applied for cases (3), (5) and (9))
Note that in this case we can assume that , and is a constant vector. It is easy to see that , and hence the probabilities of cases (1), (6), (11) and (16) are
Similarly, the probabilities of cases (2), (3), (5), (8), (9), (12), (14) and (15) are
and the probabilities of cases (4), (7), (10) and (13) are
iii) Case (4) (the same method can be applied for cases (8) and (12))
The output of the next layer is
It is easy to see that the probabilities of cases (4), (8), (12) and (16) are , and the probabilities of all other cases are 0.
iv) Case (6) (the same method can be applied for case (11))
The output of the next layer is
Note that in this case, and , and thus it is not hard to see that the probabilities of cases (1), (6), (11) and (16) are , and the probabilities of all the other cases are 0.
v) Case (7) (the same method can be applied for case (10))
The output of the next layer is