One of the mysteries in deep learning is that randomly initialized first-order methods like gradient descent achieve zero training loss, even if the labels are arbitrary (Zhang et al., 2016). Over-parameterization is widely believed to be the main reason for this phenomenon: only if the neural network has sufficiently large capacity can it fit all the training data. In practice, many neural network architectures are highly over-parameterized. For example, Wide Residual Networks have 100x more parameters than the number of training data points (Zagoruyko and Komodakis, 2016).
The second mysterious phenomenon in training deep neural networks is that "deeper networks are harder to train." To solve this problem, He et al. (2016) proposed the deep residual network (ResNet) architecture, which enables randomly initialized first-order methods to train neural networks with an order of magnitude more layers. Theoretically, Hardt and Ma (2016) showed that residual links in linear networks prevent gradient vanishing in a large neighborhood of zero, but for neural networks with non-linear activations, the advantages of using residual connections are not well understood.
In this paper, we demystify these two mysterious phenomena. We consider the setting where there are n data points and the neural network has H layers of width m. We focus on the least-squares loss and assume the activation function is Lipschitz and smooth. This assumption holds for many activation functions, including the soft-plus. Our contributions are summarized below.
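As a quick numerical check of this assumption for the guiding soft-plus example, σ(z) = log(1 + exp(z)): its first derivative is the sigmoid, bounded by 1, and its second derivative is bounded by 1/4, so both the Lipschitz and smoothness constants are bounded (a minimal sketch; the helper names are ours):

```python
import numpy as np

def softplus(z):
    """Soft-plus activation sigma(z) = log(1 + exp(z))."""
    return np.logaddexp(0.0, z)  # numerically stable form

def softplus_grad(z):
    """sigma'(z) = 1 / (1 + exp(-z)), the sigmoid; bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softplus_curv(z):
    """sigma''(z) = sigmoid(z) * (1 - sigmoid(z)); bounded by 1/4."""
    s = softplus_grad(z)
    return s * (1.0 - s)

z = np.linspace(-40.0, 40.0, 200_001)
print(softplus_grad(z).max())  # Lipschitz constant of sigma: at most 1
print(softplus_curv(z).max())  # smoothness constant of sigma: at most 1/4
```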
Next, we consider the ResNet architecture. We show that as long as the width m per layer is polynomial in the number of data points n and the number of layers H, randomly initialized gradient descent converges to zero training loss at a linear rate. Compared with the first result, the dependence on the number of layers improves exponentially for ResNet. This theory demonstrates the advantage of using residual connections.
Lastly, we apply the same technique to analyze convolutional ResNet. We show that if the width is polynomial in n, H, and p, where p is the number of patches, then randomly initialized gradient descent achieves zero training loss.
Our proof builds on two ideas from previous work on gradient descent for two-layer neural networks. First, we use the observation by Li and Liang (2018) that if the neural network is over-parameterized, every weight matrix stays close to its initialization. Second, following Du et al. (2018b), we analyze the dynamics of the predictions, whose convergence is determined by the least eigenvalue of the Gram matrix induced by the neural network architecture; to lower bound this least eigenvalue, it suffices to bound the distance of each weight matrix from its initialization.
Different from these two works, in analyzing deep neural networks, we need to exploit more structural properties of deep neural networks and develop new techniques. See Section 6 and the Appendix for more details.
This paper is organized as follows. In Section 2, we formally state the problem setup. In Section 3, we give our main result for the deep fully-connected neural network. In Section 4, we give our main result for the ResNet. In Section 5, we give our main result for the convolutional ResNet. In Section 6, we present a unified proof strategy for these three architectures. We conclude in Section 7 and defer all proofs to the appendix.
1.2 Related Works
Recently, many works have tried to study the optimization problem in deep learning. Since optimizing a neural network is a non-convex problem, one approach is first to develop a general theory for a class of non-convex problems satisfying desired geometric properties, and then to show that the neural network optimization problem belongs to this class. One promising candidate class is the set of functions in which all local minima are global and every saddle point has a direction of negative curvature. For this function class, researchers have shown that gradient descent (Jin et al., 2017; Ge et al., 2015; Lee et al., 2016; Du et al., 2017a) can find a global minimum. Many previous works thus try to study the optimization landscape of neural networks with different activation functions (Soudry and Hoffer, 2017; Safran and Shamir, 2018, 2016; Zhou and Liang, 2017; Freeman and Bruna, 2016; Hardt and Ma, 2016; Nguyen and Hein, 2017; Kawaguchi, 2016; Venturi et al., 2018; Soudry and Carmon, 2016; Du and Lee, 2018; Soltanolkotabi et al., 2018; Haeffele and Vidal, 2015). However, even for a deep linear network, there exists a saddle point without a direction of negative curvature (Kawaguchi, 2016), so it is unclear whether this approach can be used to obtain the global convergence guarantee of first-order methods.
Another way to attack this problem is to study the dynamics of a specific algorithm for a specific neural network architecture. Our paper also belongs to this category. Many previous works place assumptions on the input distribution and assume the labels are generated by a planted neural network. Based on these assumptions, one can obtain global convergence of gradient descent for some shallow neural networks (Tian, 2017; Soltanolkotabi, 2017; Brutzkus and Globerson, 2017; Du et al., 2018a; Li and Yuan, 2017; Du et al., 2017b). Some local convergence results have also been proved (Zhong et al., 2017a, b; Zhang et al., 2018). In comparison, our paper does not try to recover the underlying neural network. Instead, we focus on the empirical loss minimization problem and rigorously prove that randomly initialized gradient descent can achieve zero training loss.
The most related papers are Li and Liang (2018) and Du et al. (2018b), who observed that when training a two-layer fully connected neural network, most of the activation patterns do not change over iterations, an observation we also use to show the stability of the Gram matrix. They used this observation to obtain convergence rates of gradient descent on two-layer over-parameterized neural networks for the cross-entropy and least-squares losses. More recently, Allen-Zhu et al. (2018a) generalized ideas from Li and Liang (2018) to derive convergence rates for training recurrent neural networks. Our work extends these previous results in several ways: (a) we consider deep networks, (b) we generalize to ResNet architectures, and (c) we generalize to convolutional networks. To improve the width dependence on the sample size n, we utilize a smooth activation (e.g., smooth ReLU). For example, our results specialized to depth one improve upon Du et al. (2018b) in the required amount of over-parameterization. See Theorem 3.1 for the precise statement.
Chizat and Bach (2018); Wei et al. (2018); Mei et al. (2018) used optimal transport theory to analyze gradient descent on over-parameterized models. However, their results are limited to two-layer neural networks and may require an exponential amount of over-parametrization.
Daniely (2017) developed the connection between deep neural networks and kernel methods, and showed that stochastic gradient descent can learn a function that is competitive with the best function in the conjugate kernel space of the network. Andoni et al. (2014) showed that gradient descent can learn networks that are competitive with polynomial classifiers. However, these results do not imply that gradient descent can find a global minimum of the empirical loss minimization problem.
Finally, in concurrent work, Allen-Zhu et al. (2018b) also analyze gradient descent on deep neural networks. The primary difference between the two papers is that we analyze general smooth activations, whereas Allen-Zhu et al. (2018b) develop a specific analysis for the ReLU activation. The two papers also differ significantly in their data assumptions. We wish to emphasize that a fair comparison is not possible due to the difference in settings and data assumptions. We view the two papers as complementary since they address different neural network architectures.
For ResNet, the primary focus of this manuscript, we compare the required width per layer. (In all comparisons, we ignore the polynomial dependency on data-dependent parameters, which only depend on the input data and the activation function; the two papers use different such measures, so the bounds are not directly comparable.) Our width bound for Theorem 4.1 does not depend on the desired accuracy, and the relevant least eigenvalue does not depend on the depth H. As a consequence, Theorem 4.1 guarantees the convergence of gradient descent to a global minimizer. The iteration complexities of Allen-Zhu et al. (2018b) and of Theorem 4.1 also differ; see Theorem 4.1 for the precise statement.
For fully-connected networks, Allen-Zhu et al. (2018b) and Theorem 3.1 require different widths and iteration complexities. The primary difference is that for very deep fully-connected networks, Allen-Zhu et al. (2018b) require less width and fewer iterations, but we have a much milder dependence on the sample size n. Commonly used fully-connected networks such as VGG are not deep, yet dataset sizes such as that of ImageNet are very large.
In a second concurrent work, Zou et al. (2018) also analyzed the convergence of gradient descent on fully-connected networks with ReLU activation. Their emphasis is on different loss functions (e.g., hinge loss), so the results are not directly comparable. Both Zou et al. (2018) and Allen-Zhu et al. (2018b) also analyze stochastic gradient descent.
2.1 Notation

We let [n] = {1, 2, ..., n}. Given a set S, we use unif{S} to denote the uniform distribution over S. We use N(0, 1) to denote the standard Gaussian distribution. For a matrix A, we use A_ij to denote its (i, j)-th entry. For a vector v, we use ||v||_2 to denote the Euclidean norm. For a matrix A, we use ||A||_F to denote the Frobenius norm and ||A||_2 to denote the operator norm. If a matrix A is positive semi-definite, we use λ_min(A) to denote its smallest eigenvalue. We use <·, ·> to denote the standard Euclidean inner product between two vectors or matrices. We use σ(·) to denote the activation function, which in this paper we assume is Lipschitz continuous and smooth (Lipschitz gradient). The guiding example is the soft-plus, σ(z) = log(1 + exp(z)), whose Lipschitz and smoothness constants are bounded by 1. We will use c and C (e.g., c_σ) to denote universal constants that depend only on the activation. Lastly, let O(·) and Ω(·) denote standard Big-O and Big-Omega notation, hiding only absolute constants.
2.2 Problem Setup
In this paper, we focus on the empirical risk minimization problem with the quadratic loss function

L(θ) = (1/2) Σ_{i=1}^{n} (f(θ, x_i) − y_i)²,    (1)

where x_1, ..., x_n are the training inputs, y_1, ..., y_n are the labels, θ is the parameter we optimize over, and f is the prediction function, which in our case is a neural network. We consider the following architectures.
Multilayer fully-connected neural networks: Let x ∈ R^d be the input, W^(1) ∈ R^{m×d} the first weight matrix, W^(h) ∈ R^{m×m} the weight matrix at the h-th layer for 2 ≤ h ≤ H, a ∈ R^m the output layer, and σ(·) the activation function. (We assume intermediate layers are square matrices for simplicity; it is not difficult to generalize our analysis to rectangular weight matrices.) The prediction function is defined recursively as

x^(0) = x,   x^(h) = sqrt(c_σ / m) · σ(W^(h) x^(h−1)) for 1 ≤ h ≤ H,   f(x, θ) = a^T x^(H),

where c_σ > 0 is a scaling factor that normalizes the input in the initialization phase.
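A minimal numerical sketch of this forward pass, assuming the soft-plus activation and taking c_σ = 1 / E_{z∼N(0,1)}[σ(z)²] (our reading of the normalizing constant, estimated here by Monte Carlo; the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(z):
    return np.logaddexp(0.0, z)

# Normalizing constant c_sigma = 1 / E_{z ~ N(0,1)}[sigma(z)^2],
# estimated by Monte Carlo, so that each scaled layer approximately
# preserves the norm of a unit-norm input at initialization.
c_sigma = 1.0 / np.mean(softplus(rng.standard_normal(1_000_000)) ** 2)

def fc_forward(x, Ws, a):
    """f(x, theta) = a^T x^(H), x^(h) = sqrt(c_sigma/m) sigma(W^(h) x^(h-1))."""
    h = x
    for W in Ws:
        m = W.shape[0]
        h = np.sqrt(c_sigma / m) * softplus(W @ h)
    return a @ h

d, m, H = 10, 1024, 5                           # hypothetical toy sizes
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                          # unit-norm input
Ws = [rng.standard_normal((m, d))] + \
     [rng.standard_normal((m, m)) for _ in range(H - 1)]  # W_ij ~ N(0, 1)
a = rng.choice([-1.0, 1.0], size=m)             # Rademacher output layer
print(fc_forward(x, Ws, a))
```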
ResNet: We use the same notation as for the multilayer fully-connected neural network and define the prediction recursively:

x^(1) = sqrt(c_σ / m) · σ(W^(1) x),   x^(h) = x^(h−1) + (c_res / (H · sqrt(m))) · σ(W^(h) x^(h−1)) for 2 ≤ h ≤ H,   f(x, θ) = a^T x^(H),

where c_res is a small constant that depends only on the activation; c_σ and c_res are constants specified in Section 4. (We will refer to this architecture as ResNet, although it differs from the standard ResNet architecture in that it has skip-connections at every layer instead of every two layers. This architecture was previously studied in Hardt and Ma (2016); we study it for ease of presentation and analysis, and it is not hard to generalize our analysis to architectures with skip-connections every two or more layers.) Note here we use a c_res / (H · sqrt(m)) scaling. This scaling plays an important role in guaranteeing that the width per layer only needs to scale polynomially with H. In practice, the small scaling is enforced by a small initialization of the residual connection (Hardt and Ma, 2016; Anonymous, 2018), which obtains state-of-the-art performance for deep residual networks. We choose an explicit scaling instead of altering the initialization scheme for notational convenience.
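A corresponding sketch of the ResNet forward pass. The value c_res = 0.1 is an arbitrary illustrative choice, not the paper's constant, and c_σ is again estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(z):
    return np.logaddexp(0.0, z)

c_sigma = 1.0 / np.mean(softplus(rng.standard_normal(1_000_000)) ** 2)

def resnet_forward(x, W1, Ws, a, c_res=0.1):
    """x^(1) = sqrt(c_sigma/m) sigma(W^(1) x); then for h = 2..H
    x^(h) = x^(h-1) + (c_res / (H sqrt(m))) sigma(W^(h) x^(h-1));
    prediction f = a^T x^(H)."""
    m = W1.shape[0]
    H = len(Ws) + 1
    h = np.sqrt(c_sigma / m) * softplus(W1 @ x)
    for W in Ws:
        h = h + (c_res / (H * np.sqrt(m))) * softplus(W @ h)
    return h, a @ h

d, m, H = 10, 256, 50                   # depth 50 is fine under this scaling
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
W1 = rng.standard_normal((m, d))
Ws = [rng.standard_normal((m, m)) for _ in range(H - 1)]
a = rng.choice([-1.0, 1.0], size=m)

hidden, pred = resnet_forward(x, W1, Ws, a)
print(np.linalg.norm(hidden), pred)
```

Because each residual branch is scaled by c_res / (H sqrt(m)), the hidden representation's norm stays O(1) for any depth, which is the stability property the analysis exploits.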
Convolutional ResNet: Lastly, we consider the convolutional ResNet architecture. Again we define the prediction function recursively. Let x ∈ R^{d×p} be the input, where d is the number of input channels and p is the number of pixels. For h ∈ [H], we let the number of channels be m and the number of pixels be p. Given x^(h−1), we first use an operator φ(·) to divide x^(h−1) into p patches, each consisting of q consecutive pixels per channel, with stride one and zero-padding at the boundary (pixels beyond the boundary are treated as zero). Note this operator has the property

||x||_F ≤ ||φ(x)||_F ≤ sqrt(q) · ||x||_F,

because each element of x appears at least once and at most q times in φ(x). In practice, q is often small (e.g., a 3 × 3 patch), so throughout the paper we treat q as a constant in our theoretical analysis. The intermediate layers are then defined recursively as

x^(1) = sqrt(c_σ / m) · σ(W^(1) φ(x)),   x^(h) = x^(h−1) + (c_res / (H · sqrt(m))) · σ(W^(h) φ(x^(h−1))) for 2 ≤ h ≤ H,

where c_res is a small constant depending only on σ, and c_σ, c_res are constants we will specify in Section 5. Finally, the output is defined as the inner product of the output weights with x^(H). Note here we use a scaling similar to that of ResNet.
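The patch operator φ can be sketched as an im2col-style transformation (stride one, zero padding; the helper name `phi` is ours):

```python
import numpy as np

def phi(x, q):
    """Divide x in R^{d x p} into p patches of q consecutive pixels
    (stride one, zero padding), stacking channels: output is (d*q, p).
    An im2col-style sketch of the patch operator."""
    d, p = x.shape
    pad = q // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))       # zero-pad the pixel axis
    cols = [xp[:, j:j + q].reshape(-1) for j in range(p)]
    return np.stack(cols, axis=1)

x = np.arange(6.0).reshape(2, 3)               # d = 2 channels, p = 3 pixels
out = phi(x, q=3)
print(out.shape)                               # (q*d rows, p patches)
# Each entry of x appears at least once and at most q times in phi(x),
# hence ||x||_F <= ||phi(x)||_F <= sqrt(q) * ||x||_F:
print(np.linalg.norm(x), np.linalg.norm(out))
```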
To learn the deep neural network, we consider a randomly initialized gradient descent algorithm to find the global minimizer of the empirical loss (1). Specifically, we use the following random initialization scheme. For every layer h ∈ [H], each entry of W^(h) is sampled independently from a standard Gaussian distribution, W^(h)_ij ~ N(0, 1). For multilayer fully-connected neural networks and ResNet, each entry of the output layer a is sampled from a Rademacher distribution, a_i ~ unif{−1, +1}. For convolutional ResNet, the output weights are tied across pixels so that each channel has the same corresponding output weight. In this paper, we fix the output layer and train the lower layers by gradient descent: for k = 0, 1, ... and h ∈ [H],

W^(h)(k+1) = W^(h)(k) − η · ∂L(θ(k)) / ∂W^(h)(k),

where η > 0 is the step size. Hoffer et al. (2018) showed that fixing the last layer results in little to no loss in test accuracy.
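For the shallowest case (one hidden layer, fixed Rademacher output weights), the whole algorithm fits in a few lines. A sketch with hypothetical toy sizes, fitting arbitrary labels:

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy instance of the algorithm with one hidden layer:
# W_ij ~ N(0, 1), a_i ~ Rademacher (kept fixed), gradient descent on W.
n, d, m, eta = 20, 10, 4096, 0.2               # hypothetical toy sizes
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
y = rng.standard_normal(n)                     # arbitrary labels

W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

def predict(W):
    return softplus(X @ W.T) @ a / np.sqrt(m)

init_loss = 0.5 * np.sum((predict(W) - y) ** 2)
for _ in range(3000):
    Z = X @ W.T                                # pre-activations, shape (n, m)
    u = softplus(Z) @ a / np.sqrt(m)           # predictions u(k)
    # dL/dw_r = sum_i (u_i - y_i) a_r sigma'(w_r . x_i) x_i / sqrt(m)
    W -= eta * (sigmoid(Z) * (a * (u - y)[:, None])).T @ X / np.sqrt(m)

final_loss = 0.5 * np.sum((predict(W) - y) ** 2)
print(init_loss, final_loss)                   # loss is driven toward zero
```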
3 Warm Up: Deep Fully-connected Neural Networks
In this section, we show that gradient descent with a constant positive step size converges to the global minimum at a linear rate. To formally state our assumption, we first define the following population Gram matrices recursively. For h ∈ [H] and (i, j) ∈ [n] × [n], let

K^(0)_ij = x_i^T x_j,   K^(h)_ij = c_σ · E_{(u,v) ~ N(0, Λ^(h−1)_ij)} [σ(u) σ(v)],   where Λ^(h−1)_ij = [[K^(h−1)_ii, K^(h−1)_ij], [K^(h−1)_ij, K^(h−1)_jj]].

Intuitively, K^(h) represents the Gram matrix obtained after composing the kernel induced by the activation function h times. As will be apparent in the proof, this Gram matrix naturally arises as the width m goes to infinity. Based on these definitions, we make the following assumption.
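The recursion above can be estimated numerically. A sketch assuming the soft-plus activation and the normalizing constant c_σ = 1 / E[σ(z)²] (our reading, estimated by Monte Carlo; `next_gram` is our helper name):

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(z):
    return np.logaddexp(0.0, z)

c_sigma = 1.0 / np.mean(softplus(rng.standard_normal(1_000_000)) ** 2)

def next_gram(K, n_mc=200_000):
    """One recursion step: K'_ij = c_sigma * E[sigma(u) sigma(v)] with
    (u, v) ~ N(0, [[K_ii, K_ij], [K_ij, K_jj]]), by Monte Carlo."""
    n = K.shape[0]
    Knew = np.empty_like(K)
    for i in range(n):
        for j in range(i + 1):
            cov = np.array([[K[i, i], K[i, j]], [K[i, j], K[j, j]]])
            w, V = np.linalg.eigh(cov)
            A = V * np.sqrt(np.clip(w, 0.0, None))   # cov ~= A A^T
            uv = rng.standard_normal((n_mc, 2)) @ A.T
            Knew[i, j] = Knew[j, i] = c_sigma * np.mean(
                softplus(uv[:, 0]) * softplus(uv[:, 1]))
    return Knew

X = rng.standard_normal((4, 6))
X /= np.linalg.norm(X, axis=1, keepdims=True)
K = X @ X.T                                    # K^(0): Gram of the inputs
for h in range(3):
    K = next_gram(K)
    print(h + 1, np.linalg.eigvalsh(K).min())  # least eigenvalue per depth
```

With the c_σ normalization the diagonal stays near one while repeated composition pushes the off-diagonal entries toward one, so the least eigenvalue shrinks with depth, which is consistent with the exponential depth dependence discussed below for fully-connected networks.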
The first part of the assumption states that the Gram matrix at the last layer is strictly positive definite; its least eigenvalue λ₀ determines the convergence rate of the gradient descent algorithm. Note this assumption is a generalization of the non-degeneracy assumption used in Du et al. (2018b), who considered a two-layer neural network and assumed the Gram matrix from the second layer is strictly positive definite. The second part of the assumption states that every 2 × 2 principal sub-matrix of every layer's population Gram matrix has a lower-bounded least eigenvalue. The inverse of this quantity can be viewed as a measure of the stability of the population Gram matrix. As will be apparent in the proof, this condition guarantees that if the width m is large, the Gram matrix at the initialization phase is close to the population Gram matrix.
Now we are ready to state our main theorem for deep multilayer fully-connected neural networks.
Theorem 3.1 (Convergence Rate of Gradient Descent for Deep Fully Connected Neural Networks).
Assume for all i ∈ [n], ||x_i||_2 = 1 and |y_i| ≤ C for some constant C, and that the number of hidden nodes per layer m is sufficiently large. If we set the step size η appropriately, then with probability at least 1 − δ over the random initialization, we have for k = 0, 1, 2, ...

L(θ(k)) ≤ (1 − ηλ₀/2)^k · L(θ(0)),

where λ₀ is the least eigenvalue of the last layer's population Gram matrix.
This theorem states that if the width m is large enough and we set the step size appropriately, then gradient descent converges to the global minimum with zero loss at a linear rate. The required width depends on n, H, and λ₀. The dependency on the number of samples n and on the least eigenvalue λ₀ is only polynomial, which is the same as in previous work on shallow neural networks (Du et al., 2018b; Li and Liang, 2018; Allen-Zhu et al., 2018a). However, the dependency on the number of layers H and on the stability parameter of the Gram matrices is exponential. As will be clear in Section 6 and the proofs, this exponential dependency results from the amplification factor of the multilayer fully-connected architecture.
4 ResNet

In this section we consider the convergence of gradient descent for training a ResNet. We focus on how much over-parameterization is needed to ensure the global convergence of gradient descent. Similar to the previous section, we define the population Gram matrices recursively for each pair of examples (i, j) ∈ [n] × [n]. These are the asymptotic Gram matrices as the width m goes to infinity. We make the following assumption, which determines the convergence rate and the amount of over-parameterization.
Note the quantity λ₀ defined here is different from that of the deep fully-connected neural network because here it does not depend on the depth H. In general, unless there are two data points that are parallel, λ₀ is always positive here. Now we are ready to state our main theorem for ResNet.
Theorem 4.1 (Convergence Rate of Gradient Descent for ResNet).
Assume for all i ∈ [n], ||x_i||_2 = 1 and |y_i| ≤ C for some constant C, and that the number of hidden nodes per layer m is sufficiently large (polynomial in n, H, and 1/λ₀). If we set the step size η appropriately, then with probability at least 1 − δ over the random initialization, we have for k = 0, 1, 2, ...

L(θ(k)) ≤ (1 − ηλ₀/2)^k · L(θ(0)).
In sharp contrast to Theorem 3.1, this theorem is fully polynomial in the sense that both the number of neurons and the convergence rate depend only polynomially on n and H. Note the amount of over-parameterization depends on λ₀, the smallest eigenvalue of the H-th layer's population Gram matrix. In Section D.1, we show this quantity is lower bounded and does not depend inverse-exponentially on H. The main reason that we do not have any exponential factor here is that the skip-connection block makes the overall architecture more stable in both the initialization phase and the training phase. See Section B for details.
5 Convolutional ResNet
In this section we present the convergence result of gradient descent for convolutional ResNet. Again, we focus on how much over-parameterization is needed to ensure the global convergence of gradient descent. Similar to previous sections, we first define the population Gram matrices, the asymptotic Gram matrices as the width m goes to infinity, in order to formally state our assumptions; the definition mirrors that of ResNet with the patch operator φ(·) applied at each layer. We make the following assumption, which determines the convergence rate and the amount of over-parameterization.
Note this assumption is basically the same as Assumption 4.1. Now we state our main convergence theorem for the convolutional ResNet.
Theorem 5.1 (Convergence Rate of Gradient Descent for Convolutional ResNet).
Assume for all i ∈ [n], ||x_i|| = 1 and |y_i| ≤ C for some constant C, and that the number of hidden nodes per layer m is sufficiently large (polynomial in n, H, p, and 1/λ₀). If we set the step size η appropriately, then with probability at least 1 − δ over the random initialization, we have for k = 0, 1, 2, ...

L(θ(k)) ≤ (1 − ηλ₀/2)^k · L(θ(0)).
This theorem is similar to that of ResNet. The number of neurons required per layer is only polynomial in the depth H and the number of data points n, and the step size is only polynomially small. The analysis is similar to that of ResNet, and we refer readers to Section C for details.
6 Proof Sketch
We first introduce some notation for the proof sketch: let u_i(k) = f(θ(k), x_i) denote the prediction for the i-th example at iteration k, and denote u(k) = (u_1(k), ..., u_n(k)) ∈ R^n. Note with this notation, we can write the loss as L(θ(k)) = (1/2) ||y − u(k)||₂².
Our induction hypothesis is just the following convergence rate of the empirical loss.

Condition 6.1. At the k-th iteration, we have ||y − u(k)||₂² ≤ (1 − ηλ₀/2)^k · ||y − u(0)||₂².

Note this condition implies the conclusions we want to prove. To prove Condition 6.1, we consider the effect of one iteration on the loss function.
This equation shows that if the inner product between the residual y − u(k) and the prediction change u(k+1) − u(k) is sufficiently positive, the loss decreases. Note both terms involve u(k+1) − u(k), which we now analyze more carefully. To simplify notation, we define shorthand for the per-example change and look at one coordinate of u(k+1) − u(k).
Using Taylor expansion, we have
Denote the first-order term by I₁ and the remainder by I₂, so that each coordinate of u(k+1) − u(k) equals I₁ + I₂. We will show that the I₁ term, which is linear in the step size η, drives the loss function to decrease, while the I₂ term is a perturbation term proportional to η², so it is small. We further unpack the I₁ term,
where H(k) is the n × n matrix whose (i, j)-th entry is the inner product between the gradients of the predictions u_i and u_j at iteration k. Now we unpack this matrix; recall the gradient formula. Define the Gram matrix of the last layer's features. By definition it is a Gram matrix and thus positive semi-definite, and therefore it contributes a non-negative amount to the decrease of the loss.
Now we analyze the I₂ term, which we can write in a more compact matrix form and then bound directly.
Now recall the progress of loss function in Equation (7):
For the perturbation terms, through standard calculations, we can show that both the I₂ term and the quadratic term are proportional to η², so if we set η sufficiently small, these terms are dominated by the first-order decrease and thus the loss function decreases at a linear rate. See Lemmas A.6 and A.7.
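Putting the pieces together, the one-step progress can be summarized by the following recursion (a hedged reconstruction in the notation above, with u(k) the prediction vector and λ₀ the least eigenvalue of the limiting Gram matrix):

```latex
\|\mathbf{y}-\mathbf{u}(k+1)\|_2^2
  \le \left(1-\frac{\eta\lambda_0}{2}\right)\|\mathbf{y}-\mathbf{u}(k)\|_2^2
  \quad\Longrightarrow\quad
  \|\mathbf{y}-\mathbf{u}(k)\|_2^2
  \le \left(1-\frac{\eta\lambda_0}{2}\right)^{k}\|\mathbf{y}-\mathbf{u}(0)\|_2^2 ,
```

which is exactly the linear convergence rate in Condition 6.1.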
Therefore, to prove the induction hypothesis, it suffices to prove λ_min(H(k)) ≥ λ₀/2 for k = 0, 1, ..., where λ₀ is independent of k. To analyze the least eigenvalue, we first look at the initialization. Using our assumptions on the population Gram matrix and concentration inequalities, we can show that at initialization H(0) is close to its population counterpart, which implies λ_min(H(0)) ≥ (3/4)λ₀.
Now for the k-th iteration, by matrix perturbation analysis, it is sufficient to show that ||H(k) − H(0)|| is small. To do this, we use a similar approach as in Du et al. (2018b). We show that as long as m is large enough, every weight matrix stays close to its initialization in a relative-error sense: ignoring all other parameters except m, the total movement ||W^(h)(k) − W^(h)(0)||_F stays bounded independently of m, and thus the average per-neuron distance from initialization scales like 1/sqrt(m), which tends to zero as m increases. See Lemma A.5 for precise statements with all the dependencies.
This fact in turn shows that ||H(k) − H(0)|| is small (see Lemma A.4). The main difference from Du et al. (2018b) is that we are considering deep neural networks: when translating the small deviation of the weight matrices into a deviation of the Gram matrix, there is an amplification factor that depends on the neural network architecture.
For deep fully-connected neural networks, we show this amplification factor is exponential in H. On the other hand, for ResNet and convolutional ResNet, we show this amplification factor is only polynomial in H. We further show that the required width is proportional to this amplification factor.
7 Conclusion

In this paper, we show that gradient descent on deep over-parameterized networks can obtain zero training loss. The key technique is to show that the Gram matrix is increasingly stable under over-parameterization, so that every step of gradient descent decreases the loss at a geometric rate.
We list some directions for future research:
The current paper focuses on the training loss, but does not address the test loss. It would be an important problem to show that gradient descent can also find solutions of low test loss. In particular, existing work only demonstrates that gradient descent works under the same situations as kernel methods and random feature methods (Daniely, 2017; Li and Liang, 2018).
The width of the layers is polynomial in all the parameters for the ResNet architecture, but still very large. Realistic networks have a number of parameters, not a width, that is a large constant multiple of n. We consider improving the analysis to cover commonly utilized networks an important open problem.
The current analysis is for gradient descent, instead of stochastic gradient descent. We believe the analysis can be extended to stochastic gradient descent while maintaining the linear convergence rate.
We thank Lijie Chen and Ruosong Wang for useful discussions.
- Allen-Zhu et al. (2018a) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. arXiv preprint arXiv:1810.12065, 2018a.
- Allen-Zhu et al. (2018b) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018b.
- Andoni et al. (2014) Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning polynomials with neural networks. In International Conference on Machine Learning, pages 1908–1916, 2014.
- Anonymous (2018) Anonymous. The unreasonable effectiveness of (zero) initialization in deep residual learning. Openreview https://openreview.net/pdf?id=H1gsz30cKX, 2018.
- Brutzkus and Globerson (2017) Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a ConvNet with gaussian inputs. In International Conference on Machine Learning, pages 605–614, 2017.
- Chizat and Bach (2018) Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. arXiv preprint arXiv:1805.09545, 2018.
- Daniely (2017) Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.
- Du and Lee (2018) Simon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. Proceedings of the 35th International Conference on Machine Learning, pages 1329–1338, 2018.
- Du et al. (2017a) Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067–1077, 2017a.
- Du et al. (2017b) Simon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017b.
- Du et al. (2018a) Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. Proceedings of the 35th International Conference on Machine Learning, pages 1339–1348, 2018a.
- Du et al. (2018b) Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018b.
- Freeman and Bruna (2016) C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.
- Ge et al. (2015) Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
- Haeffele and Vidal (2015) Benjamin D Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
- Hardt and Ma (2016) Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Hoffer et al. (2018) Elad Hoffer, Itay Hubara, and Daniel Soudry. Fix your classifier: the marginal value of training the last weight layer. arXiv preprint arXiv:1801.04540, 2018.
- Jin et al. (2017) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732, 2017.
- Kawaguchi (2016) Kenji Kawaguchi. Deep learning without poor local minima. In Advances In Neural Information Processing Systems, pages 586–594, 2016.
- Lee et al. (2016) Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
- Li and Liang (2018) Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204, 2018.
- Li and Yuan (2017) Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
- Mei et al. (2018) Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layers neural networks. Proceedings of the National Academy of Sciences, pages E7665–E7671, 2018.
- Nguyen and Hein (2017) Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In International Conference on Machine Learning, pages 2603–2612, 2017.
- Safran and Shamir (2016) Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.
- Safran and Shamir (2018) Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. In International Conference on Machine Learning, pages 4433–4441, 2018.
- Soltanolkotabi (2017) Mahdi Soltanolkotabi. Learning ReLus via gradient descent. In Advances in Neural Information Processing Systems, pages 2007–2017, 2017.
- Soltanolkotabi et al. (2018) Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 2018.
- Soudry and Carmon (2016) Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
- Soudry and Hoffer (2017) Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.
- Tian (2017) Yuandong Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In International Conference on Machine Learning, pages 3404–3413, 2017.
- Venturi et al. (2018) Luca Venturi, Afonso Bandeira, and Joan Bruna. Neural networks with finite intrinsic dimension have no spurious valleys. arXiv preprint arXiv:1802.06384, 2018.
- Vershynin (2010) Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
- Wei et al. (2018) Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks. arXiv preprint arXiv:1810.05369, 2018.
- Zagoruyko and Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
- Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
- Zhang et al. (2018) Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer relu networks via gradient descent. arXiv preprint arXiv:1806.07808, 2018.
- Zhong et al. (2017a) Kai Zhong, Zhao Song, and Inderjit S Dhillon. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017a.
- Zhong et al. (2017b) Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017b.
- Zhou and Liang (2017) Yi Zhou and Yingbin Liang. Critical points of neural networks: Analytical forms and landscape properties. arXiv preprint arXiv:1710.11205, 2017.
- Zou et al. (2018) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.
Appendix A Proofs for Section 3
We first derive the formula of the gradient for the multilayer fully-connected neural network. In the expression below, the matrices D^(h) are the diagonal matrices of activation derivatives induced by the activation function, and x^(h) is the output of the h-th layer.
Through standard calculation, we can get the expression of the gradient in the following form
We first present a lemma showing that, with high probability, the features at each layer are approximately normalized.
Lemma A.1 (Lemma on Initialization Norms).
If σ(·) is Lipschitz and the width m is sufficiently large, then with probability at least 1 − δ over the random initialization, for every h ∈ [H] and i ∈ [n], we have
We follow the proof sketch described in Section 6. We first analyze the spectral property of the Gram matrix at the initialization phase. The following lemma lower bounds its least eigenvalue; it is a direct consequence of Theorem D.1 and Remark D.4.
Lemma A.2 (Least Eigenvalue at the Initialization).
If m is sufficiently large, we have
Now we proceed to analyze the training process. We prove the following lemma which characterizes how the perturbation from weight matrices propagates to the input of each layer. This Lemma is used to prove the subsequent lemmas.
Suppose for every , , and for some constant and . If is Lipschitz, we have
Here the assumption on the weight matrices can be established using Lemma E.4 and taking a union bound over h ∈ [H]. Next, we show that with high probability over the random initialization, a small perturbation in the weight matrices leads to a small perturbation in the Gram matrix.
Suppose is Lipschitz and smooth. Suppose for , , , if