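Modern machine learning models often consist of multiple layers. For example, consider a feed-forward deep neural network that defines a prediction function
$$\mathbf{x} \mapsto f(\mathbf{x}; \mathbf{W}^{(1)}, \ldots, \mathbf{W}^{(N)}) = \mathbf{W}^{(N)} \phi(\mathbf{W}^{(N-1)} \cdots \mathbf{W}^{(2)} \phi(\mathbf{W}^{(1)} \mathbf{x}) \cdots),$$
where $\mathbf{W}^{(1)}, \ldots, \mathbf{W}^{(N)}$ are weight matrices in $N$ layers, and $\phi(\cdot)$ is a point-wise homogeneous activation function such as Rectified Linear Unit (ReLU) $\phi(x) = \max\{x, 0\}$. A simple observation is that this model is homogeneous: if we multiply a layer by a positive scalar $c$ and divide another layer by $c$, the prediction function remains the same, e.g. $f(\mathbf{x}; c\mathbf{W}^{(1)}, \ldots, \frac{1}{c}\mathbf{W}^{(N)}) = f(\mathbf{x}; \mathbf{W}^{(1)}, \ldots, \mathbf{W}^{(N)})$.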
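The rescaling invariance can be checked numerically. The following is a minimal sketch for a two-layer ReLU network; the helper names (`two_layer_relu`, `scale`) and the particular weights are illustrative assumptions, not part of the paper.

```python
def relu(z):
    """Point-wise ReLU on a list of floats."""
    return [max(t, 0.0) for t in z]

def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * t for w, t in zip(row, x)) for row in W]

def scale(W, c):
    """Multiply every entry of W by the scalar c."""
    return [[c * w for w in row] for row in W]

def two_layer_relu(W1, W2, x):
    """f(x; W1, W2) = W2 relu(W1 x)."""
    return matvec(W2, relu(matvec(W1, x)))

# Hypothetical weights and input, chosen only for illustration.
W1 = [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.5]]   # 3 x 2
W2 = [[1.0, -2.0, 0.5]]                         # 1 x 3
x = [1.0, 2.0]
c = 1000.0

original = two_layer_relu(W1, W2, x)
rescaled = two_layer_relu(scale(W1, c), scale(W2, 1.0 / c), x)
print(original, rescaled)  # the two predictions agree up to rounding
```

The rescaled parameters have very different norms from the original ones, yet they define the same prediction function.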
A direct consequence of homogeneity is that a solution can produce a small function value while being unbounded, because one can always multiply one layer by a huge number and divide another layer by that number. Theoretically, this possible unbalancedness poses significant difficulty in analyzing first-order optimization methods like gradient descent/stochastic gradient descent (GD/SGD), because when parameters are not a priori constrained to a compact set via either coerciveness of the loss (a function $f$ is coercive if $\|\mathbf{x}\| \to \infty$ implies $f(\mathbf{x}) \to \infty$) or an explicit constraint, GD and SGD are not even guaranteed to converge (Lee et al., 2016, Proposition 4.11). In the context of deep learning, Shamir (2018) determined that the primary barrier to providing algorithmic results is that the sequence of parameter iterates is possibly unbounded.
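Now we take a closer look at asymmetric matrix factorization, which is a simple two-layer homogeneous model. Consider the following formulation for factorizing a low-rank matrix:
$$\min_{\mathbf{U} \in \mathbb{R}^{d_1 \times r},\, \mathbf{V} \in \mathbb{R}^{d_2 \times r}} f(\mathbf{U}, \mathbf{V}) = \frac{1}{2} \left\| \mathbf{U}\mathbf{V}^\top - \mathbf{M}^* \right\|_F^2, \tag{1}$$
where $\mathbf{M}^* \in \mathbb{R}^{d_1 \times d_2}$ is a matrix we want to factorize. We observe that due to the homogeneity of $f$, it is not smooth (a function is said to be smooth if its gradient is $\beta$-Lipschitz continuous for some finite $\beta > 0$) even in the neighborhood of a globally optimal point. To see this, we compute the gradient of $f$:
$$\frac{\partial f(\mathbf{U}, \mathbf{V})}{\partial \mathbf{U}} = \left(\mathbf{U}\mathbf{V}^\top - \mathbf{M}^*\right)\mathbf{V}, \qquad \frac{\partial f(\mathbf{U}, \mathbf{V})}{\partial \mathbf{V}} = \left(\mathbf{U}\mathbf{V}^\top - \mathbf{M}^*\right)^\top \mathbf{U}. \tag{2}$$
Notice that the gradient of $f$ is not homogeneous anymore. Further, consider a globally optimal solution $(\mathbf{U}, \mathbf{V})$ such that $\|\mathbf{U}\|_F$ is of order $\epsilon$ and $\|\mathbf{V}\|_F$ is of order $1/\epsilon$ ($\epsilon$ being very small). A small perturbation of $\mathbf{U}$ can lead to a dramatic change in the gradient with respect to $\mathbf{U}$. This phenomenon can happen for all homogeneous functions when the layers are unbalanced. The lack of nice geometric properties of homogeneous functions due to unbalancedness makes first-order optimization methods difficult to analyze.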
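The loss of smoothness at an unbalanced optimum can be seen already in the scalar case. The sketch below assumes $d_1 = d_2 = r = 1$ and target value $1$, so that $f(u, v) = \frac{1}{2}(uv - 1)^2$; the function names and the values of `eps` and `delta` are hypothetical choices made for illustration. Perturbing $u$ by $\delta$ at a global optimum changes $\partial f / \partial u$ by $\delta v^2$, which equals $\delta / \epsilon^2$ when $u = \epsilon$ and $v = 1/\epsilon$.

```python
def grad_u(u, v, m=1.0):
    """Partial derivative of f(u, v) = 0.5 * (u * v - m) ** 2 with respect to u."""
    return (u * v - m) * v

def grad_change(u, v, delta):
    """Change in df/du caused by perturbing u by delta while v is held fixed."""
    return abs(grad_u(u + delta, v) - grad_u(u, v))

delta = 1e-6   # size of the perturbation (illustrative)
eps = 1e-3     # degree of unbalancedness (illustrative)

balanced = grad_change(1.0, 1.0, delta)          # global optimum with u = v = 1
unbalanced = grad_change(eps, 1.0 / eps, delta)  # global optimum with u = eps, v = 1/eps

print(balanced, unbalanced)  # roughly delta versus delta / eps**2
```

Both points are global minimizers, but the local Lipschitz constant of the gradient differs by a factor of about $1/\epsilon^2$.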
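A common theoretical workaround is to artificially modify the natural objective function (1) in order to prove convergence. In (Tu et al., 2015; Ge et al., 2017a), a regularization term for balancing the two layers is added to (1):
$$\min_{\mathbf{U} \in \mathbb{R}^{d_1 \times r},\, \mathbf{V} \in \mathbb{R}^{d_2 \times r}} \frac{1}{2} \left\| \mathbf{U}\mathbf{V}^\top - \mathbf{M}^* \right\|_F^2 + \frac{1}{8} \left\| \mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V} \right\|_F^2. \tag{3}$$
For problem (3), the regularizer removes the homogeneity issue and the optimal solution becomes unique (up to rotation).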
Ge et al. (2017a) showed that the modified objective (3) satisfies (i) every local minimum is a global minimum, (ii) all saddle points are strict (a saddle point of a function is strict if the Hessian at that point has a negative eigenvalue), and (iii) the objective is smooth. These imply that (noisy) GD finds a global minimum (Ge et al., 2015; Lee et al., 2016; Panageas and Piliouras, 2016).
On the other hand, empirically, removing the homogeneity is not necessary. We use GD with random initialization to solve the optimization problem (1). Figure 0(a) shows that even without a regularization term like the one in the modified objective (3), GD with random initialization converges to a global minimum, and the convergence rate is also competitive. A more interesting phenomenon is shown in Figure 0(b), in which we track the Frobenius norms of $\mathbf{U}$ and $\mathbf{V}$ in all iterations. The plot shows that the ratio between the norms remains constant in all iterations. Thus the unbalancedness does not occur at all! In many practical applications, models also admit the homogeneous property (like deep neural networks) and first-order methods often converge to a balanced solution. A natural question arises:
Why does GD balance multiple layers and converge in learning homogeneous functions?
In this paper, we take an important step towards answering this question. Our key finding is that the gradient descent algorithm provides an implicit regularization on the target homogeneous function. First, we show that on the gradient flow (gradient descent with infinitesimal step size) trajectory induced by any differentiable loss function, for a large class of homogeneous models, including fully connected and convolutional neural networks with linear, ReLU and Leaky ReLU activations, the differences between squared norms across layers remain invariant. Thus, as long as the differences are small at the beginning, they remain small at all times. Note that small differences arise in commonly used initialization schemes such as Gaussian initialization or Xavier/Kaiming initialization schemes (Glorot and Bengio, 2010; He et al., 2016). Our result thus explains why using ReLU activation is a better choice than sigmoid from the optimization point of view. For linear activation, we prove an even stronger invariance for gradient flow: we show that $\mathbf{W}^{(h)}(\mathbf{W}^{(h)})^\top - (\mathbf{W}^{(h+1)})^\top\mathbf{W}^{(h+1)}$ stays invariant over time, where $\mathbf{W}^{(h)}$ and $\mathbf{W}^{(h+1)}$ are weight matrices in consecutive layers with linear activation in between.
Next, we go beyond gradient flow and consider gradient descent with positive step size. We focus on the asymmetric matrix factorization problem (1). Our invariance result for linear activation indicates that $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$ stays unchanged for gradient flow. For gradient descent, $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$ can change over iterations. Nevertheless we show that if the step size decreases appropriately over the iterations, $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$ will remain small in all iterations. In the set where $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$ is small, the loss is coercive, and gradient descent thus ensures that all the iterates are bounded. Using these properties, we then show that gradient descent converges to a globally optimal solution. Furthermore, for rank-$1$ asymmetric matrix factorization, we give a finer analysis and show that randomly initialized gradient descent with constant step size converges to the global minimum at a globally linear rate.
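The behaviour described above for problem (1) can be reproduced on a toy instance. The sketch below runs plain gradient descent on a $2 \times 2$ rank-$1$ target without any regularizer and records the loss together with the squared norms of the two factors. The target matrix, the fixed small initialization (used instead of a random one so that the run is reproducible), the constant step size and the function name `gd_rank_one` are all assumptions made for illustration.

```python
import math

def gd_rank_one(M, u, v, step, iters):
    """Plain gradient descent on f(u, v) = 0.5 * ||u v^T - M||_F^2 (no regularizer)."""
    d1, d2 = len(u), len(v)
    history = []
    for _ in range(iters):
        # Residual R = u v^T - M.
        R = [[u[i] * v[j] - M[i][j] for j in range(d2)] for i in range(d1)]
        loss = 0.5 * sum(R[i][j] ** 2 for i in range(d1) for j in range(d2))
        history.append((loss, sum(a * a for a in u), sum(b * b for b in v)))
        # Gradients: df/du = R v and df/dv = R^T u.
        grad_u = [sum(R[i][j] * v[j] for j in range(d2)) for i in range(d1)]
        grad_v = [sum(R[i][j] * u[i] for i in range(d1)) for j in range(d2)]
        u = [a - step * g for a, g in zip(u, grad_u)]
        v = [b - step * g for b, g in zip(v, grad_v)]
    return u, v, history

# Hypothetical rank-1 target and small initialization.
M = [[2.0, 0.0], [0.0, 0.0]]
u, v, history = gd_rank_one(M, [0.1, 0.05], [0.12, -0.03], step=0.01, iters=3000)

first_loss, first_u2, first_v2 = history[0]
last_loss, last_u2, last_v2 = history[-1]
gap_drift = abs((last_u2 - last_v2) - (first_u2 - first_v2))
norm_ratio = math.sqrt(last_u2 / last_v2)
print(last_loss, gap_drift, norm_ratio)
```

In this run the loss goes to zero while the gap between the squared norms barely moves, so the two factors end up with nearly equal norms even though nothing in the objective asks for it.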
1.1 Related Work
The homogeneity issue has been previously discussed by Neyshabur et al. (2015a, b). The authors proposed a variant of stochastic gradient descent that regularizes paths in a neural network, which is related to the max-norm. The algorithm outperforms gradient descent and AdaGrad on several classification tasks.
A line of research focused on analyzing gradient descent dynamics for (convolutional) neural networks with one or two unknown layers (Tian, 2017; Brutzkus and Globerson, 2017; Du et al., 2017a, b; Zhong et al., 2017; Li and Yuan, 2017; Ma et al., 2017; Brutzkus et al., 2017). For one unknown layer, there is no homogeneity issue. For two unknown layers, existing work either requires learning two layers separately (Zhong et al., 2017; Ge et al., 2017b) or uses re-parametrization like weight normalization to remove the homogeneity issue (Du et al., 2017b). To our knowledge, there is no rigorous analysis for optimizing multi-layer homogeneous functions.
For a general (non-convex) optimization problem, it is known that if the objective function satisfies (i) gradient changes smoothly if the parameters are perturbed, (ii) all saddle points and local maxima are strict (i.e., there exists a direction with negative curvature), and (iii) all local minima are global (no spurious local minimum), then gradient descent (Lee et al., 2016; Panageas and Piliouras, 2016) converges to a global minimum. There have been many studies on the optimization landscapes of neural networks (Kawaguchi, 2016; Choromanska et al., 2015; Du and Lee, 2018; Hardt and Ma, 2016; Bartlett et al., 2018; Haeffele and Vidal, 2015; Freeman and Bruna, 2016; Vidal et al., 2017; Safran and Shamir, 2016; Zhou and Feng, 2017; Nguyen and Hein, 2017a, b; Zhou and Feng, 2017; Safran and Shamir, 2017), showing that the objective functions have properties (ii) and (iii). Nevertheless, the objective function is in general not smooth as we discussed before. Our paper complements these results by showing that the magnitudes of all layers are balanced and in many cases, this implies smoothness.
1.2 Paper Organization
The rest of the paper is organized as follows. In Section 2, we present our main theoretical result on the implicit regularization property of gradient flow for optimizing neural networks. In Section 3, we analyze the dynamics of randomly initialized gradient descent for asymmetric matrix factorization problem with unregularized objective function (1). In Section 4, we empirically verify the theoretical result in Section 2. We conclude and list future directions in Section 5. Some technical proofs are deferred to the appendix.
We use bold-faced letters for vectors and matrices. For a vector $\mathbf{x}$, denote by $\mathbf{x}[i]$ its $i$-th coordinate. For a matrix $\mathbf{A}$, we use $\mathbf{A}[i,j]$ to denote its $(i,j)$-th entry, and use $\mathbf{A}[i,:]$ and $\mathbf{A}[:,j]$ to denote its $i$-th row and $j$-th column, respectively (both as column vectors). We use $\|\cdot\|_2$ or $\|\cdot\|$ to denote the Euclidean norm of a vector, and use $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. We use $\langle \cdot, \cdot \rangle$ to denote the standard Euclidean inner product between two vectors or two matrices. Let $[n] = \{1, 2, \ldots, n\}$.
2 The Auto-Balancing Properties in Deep Neural Networks
In this section we study the implicit regularization imposed by gradient descent with infinitesimal step size (gradient flow) in training deep neural networks. In Section 2.1 we consider fully connected neural networks, and our main result (Theorem 2.1) shows that gradient flow automatically balances the incoming and outgoing weights at every neuron. This directly implies that the weights between different layers are balanced (Corollary 2.1). For linear activation, we derive a stronger auto-balancing property (Theorem 2.2). In Section 2.2 we generalize our result from fully connected neural networks to convolutional neural networks. In Section 2.3 we present the proof of Theorem 2.1. The proofs of other theorems in this section follow similar ideas and are deferred to Appendix A.
2.1 Fully Connected Neural Networks
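We first formally define a fully connected feed-forward neural network with $N$ ($N \ge 2$) layers. Let $\mathbf{W}^{(h)} \in \mathbb{R}^{n_h \times n_{h-1}}$ be the weight matrix in the $h$-th layer, and define $\mathbf{w} = (\mathbf{W}^{(h)})_{h=1}^{N}$ as a shorthand of the collection of all the weights. Then the function $f_{\mathbf{w}}: \mathbb{R}^d \to \mathbb{R}^p$ ($d = n_0$, $p = n_N$) computed by this network can be defined recursively: $f^{(1)}_{\mathbf{w}}(\mathbf{x}) = \mathbf{W}^{(1)}\mathbf{x}$, $f^{(h)}_{\mathbf{w}}(\mathbf{x}) = \mathbf{W}^{(h)} \phi_{h-1}(f^{(h-1)}_{\mathbf{w}}(\mathbf{x}))$ ($h = 2, \ldots, N$), and $f_{\mathbf{w}}(\mathbf{x}) = f^{(N)}_{\mathbf{w}}(\mathbf{x})$, where each $\phi_h$ is an activation function that acts coordinate-wise on vectors. (We omit the trainable bias weights in the network for simplicity, but our results can be directly generalized to allow bias weights.) We assume that each $\phi_h$ ($h \in [N-1]$) is homogeneous, namely, $\phi_h(x) = \phi_h'(x) \cdot x$ for all $x$ and all elements of the sub-differential $\phi_h'(\cdot)$ when $\phi_h$ is non-differentiable at $x$. This property is satisfied by functions like ReLU $\phi(x) = \max\{x, 0\}$, Leaky ReLU $\phi(x) = \max\{x, \alpha x\}$ ($0 < \alpha < 1$), and linear function $\phi(x) = x$.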
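The homogeneity condition on the activation, namely that the activation equals its derivative times its argument, is easy to test point-wise. The sketch below checks it for ReLU, Leaky ReLU and the linear activation, and shows that the sigmoid fails it. The function names, the Leaky ReLU slope `alpha=0.1`, the choice of sub-gradient at zero and the test points are illustrative assumptions.

```python
import math

def relu(x):
    return max(x, 0.0)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0          # one valid sub-gradient choice at x = 0

def leaky_relu(x, alpha=0.1):
    return max(x, alpha * x)

def leaky_relu_grad(x, alpha=0.1):
    return 1.0 if x > 0 else alpha        # one valid sub-gradient choice at x = 0

def linear(x):
    return x

def linear_grad(x):
    return 1.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def is_homogeneous(phi, dphi, points, tol=1e-12):
    """Check phi(x) == dphi(x) * x on a finite set of points."""
    return all(abs(phi(x) - dphi(x) * x) <= tol for x in points)

points = [-2.0, -0.5, 0.0, 0.5, 3.0]
results = {
    "relu": is_homogeneous(relu, relu_grad, points),
    "leaky_relu": is_homogeneous(leaky_relu, leaky_relu_grad, points),
    "linear": is_homogeneous(linear, linear_grad, points),
    "sigmoid": is_homogeneous(sigmoid, sigmoid_grad, points),
}
print(results)
```

A check on finitely many points is of course not a proof; it only illustrates which activations fall inside the assumption.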
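Let $\ell: \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}_{\ge 0}$ be a differentiable loss function. Given a training dataset $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{m} \subset \mathbb{R}^d \times \mathbb{R}^q$, the training loss as a function of the network parameters $\mathbf{w}$ is defined as
$$L(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} \ell\left(f_{\mathbf{w}}(\mathbf{x}_i), \mathbf{y}_i\right). \tag{4}$$
We consider gradient descent with infinitesimal step size (also known as gradient flow) applied on $L(\mathbf{w})$, which is captured by the differential inclusion:
$$\frac{d\mathbf{W}^{(h)}}{dt} \in -\frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h)}}, \qquad h = 1, \ldots, N, \tag{5}$$
where $t$ is a continuous time index, and $\frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h)}}$ is the Clarke sub-differential (Clarke et al., 2008). If curves $\mathbf{W}^{(h)} = \mathbf{W}^{(h)}(t)$ ($h \in [N]$) evolve with time according to (5) they are said to be a solution of the gradient flow differential inclusion.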
Our main result in this section is the following invariance imposed by gradient flow.
Theorem 2.1 (Balanced incoming and outgoing weights at every neuron).
For any $h \in [N-1]$ and $i \in [n_h]$, we have
$$\frac{d}{dt}\left( \left\|\mathbf{W}^{(h)}[i,:]\right\|^2 - \left\|\mathbf{W}^{(h+1)}[:,i]\right\|^2 \right) = 0. \tag{6}$$
Note that $\mathbf{W}^{(h)}[i,:]$ is a vector consisting of network weights coming into the $i$-th neuron in the $h$-th hidden layer, and $\mathbf{W}^{(h+1)}[:,i]$ is the vector of weights going out from the same neuron. Therefore, Theorem 2.1 shows that gradient flow exactly preserves the difference between the squared $\ell_2$-norms of incoming weights and outgoing weights at any neuron.
Taking the sum of (6) over $i \in [n_h]$, we obtain the following corollary, which says gradient flow preserves the difference between the squares of Frobenius norms of weight matrices.
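Under gradient flow, the time derivative of the squared norm of a weight vector is $-2$ times the inner product of that vector with its gradient, so the statement above amounts to the incoming and outgoing inner products agreeing at every hidden neuron. The sketch below verifies this at one parameter point for a two-layer ReLU network with squared loss on a single sample. The network size, the loss, the data and the function name are assumptions made for illustration.

```python
def per_neuron_inner_products(W1, W2, x, y):
    """For f(x) = W2 relu(W1 x) and loss 0.5 * ||f(x) - y||^2, return for each
    hidden neuron i the pair (<W1[i,:], dL/dW1[i,:]>, <W2[:,i], dL/dW2[:,i]>)."""
    z = [sum(w * t for w, t in zip(row, x)) for row in W1]   # pre-activations
    a = [max(t, 0.0) for t in z]                             # ReLU outputs
    out = [sum(w * t for w, t in zip(row, a)) for row in W2]
    r = [o - t for o, t in zip(out, y)]                      # dL/d(out)
    pairs = []
    for i in range(len(W1)):
        # Signal back-propagated to neuron i, gated by the ReLU derivative.
        gate = 1.0 if z[i] > 0 else 0.0
        back = gate * sum(r[k] * W2[k][i] for k in range(len(W2)))
        g1_row = [back * t for t in x]                       # dL/dW1[i, :]
        g2_col = [r[k] * a[i] for k in range(len(W2))]       # dL/dW2[:, i]
        incoming = sum(w * g for w, g in zip(W1[i], g1_row))
        outgoing = sum(W2[k][i] * g2_col[k] for k in range(len(W2)))
        pairs.append((incoming, outgoing))
    return pairs

# Hypothetical parameters and a single training sample.
W1 = [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.5]]   # 3 hidden neurons, 2 inputs
W2 = [[1.0, -2.0, 0.5], [0.3, 0.7, -1.1]]       # 2 outputs
pairs = per_neuron_inner_products(W1, W2, x=[1.0, 2.0], y=[0.5, -1.0])
for incoming, outgoing in pairs:
    print(incoming, outgoing)  # the two numbers in each row agree up to rounding
```

The first neuron is inactive on this sample, so both of its inner products are zero; for the active neurons the equality relies on the homogeneity of ReLU.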
Corollary 2.1 (Balanced weights across layers).
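For any $h \in [N-1]$, we have
$$\frac{d}{dt}\left( \left\|\mathbf{W}^{(h)}\right\|_F^2 - \left\|\mathbf{W}^{(h+1)}\right\|_F^2 \right) = 0.$$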
Corollary 2.1 explains why in practice, trained multi-layer models usually have similar magnitudes on all the layers: if we use a small initialization, $\|\mathbf{W}^{(h)}\|_F^2 - \|\mathbf{W}^{(h+1)}\|_F^2$ is very small at the beginning, and Corollary 2.1 implies this difference remains small at all times. This finding also partially explains why gradient descent converges. Although an objective function like (4) may not be smooth over the entire parameter space, given that $\|\mathbf{W}^{(h)}\|_F^2 - \|\mathbf{W}^{(h+1)}\|_F^2$ is small for all $h$, the objective function may be smooth on the region the iterates visit. Under this condition, standard theory shows that gradient descent converges. We believe this finding serves as a key building block for understanding first-order methods for training deep neural networks.
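The conservation of the layer-wise gap can be observed approximately with a small positive step size. The sketch below trains a two-layer ReLU network with scalar output by full-batch gradient descent on a mean squared loss and records the gap between the squared Frobenius norms of the two layers at every iteration. The data, the initial weights, the step size and the helper names are hypothetical; gradient descent only approximates gradient flow, so the gap is expected to drift slightly rather than stay exactly constant.

```python
def loss_and_grads(W1, W2, X, Y):
    """Mean of 0.5 * (W2 relu(W1 x) - y)^2 over the data, with manual back-propagation."""
    m = len(X)
    G1 = [[0.0] * len(W1[0]) for _ in W1]
    G2 = [[0.0] * len(W2[0])]
    loss = 0.0
    for x, y in zip(X, Y):
        z = [sum(w * t for w, t in zip(row, x)) for row in W1]
        a = [max(t, 0.0) for t in z]
        r = sum(w * t for w, t in zip(W2[0], a)) - y           # residual
        loss += 0.5 * r * r / m
        for i in range(len(W1)):
            G2[0][i] += r * a[i] / m
            if z[i] > 0:                                       # ReLU derivative
                for j in range(len(x)):
                    G1[i][j] += r * W2[0][i] * x[j] / m
    return loss, G1, G2

def sq_fro(W):
    """Squared Frobenius norm."""
    return sum(w * w for row in W for w in row)

def run_gd(W1, W2, X, Y, step, iters):
    losses, gaps = [], []
    for _ in range(iters):
        loss, G1, G2 = loss_and_grads(W1, W2, X, Y)
        losses.append(loss)
        gaps.append(sq_fro(W1) - sq_fro(W2))
        W1 = [[w - step * g for w, g in zip(rw, rg)] for rw, rg in zip(W1, G1)]
        W2 = [[w - step * g for w, g in zip(rw, rg)] for rw, rg in zip(W2, G2)]
    return losses, gaps

# Hypothetical data and initialization.
X = [[1.0, 0.5], [0.5, -1.0], [-1.0, 0.8], [0.2, 1.0]]
Y = [1.0, -0.5, 0.3, 0.8]
W1 = [[0.3, 0.2], [-0.2, 0.3], [0.1, -0.3]]
W2 = [[0.3, -0.2, 0.25]]

losses, gaps = run_gd(W1, W2, X, Y, step=1e-3, iters=2000)
max_gap_drift = max(abs(g - gaps[0]) for g in gaps)
print(losses[0], losses[-1], gaps[0], max_gap_drift)
```

For plain gradient descent the first-order change of the gap cancels exactly, so the drift per step is of order the step size squared; shrinking the step size makes the recorded gap flatter.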
For linear activation, we have the following stronger invariance than Theorem 2.1:
Theorem 2.2 (Stronger balancedness property for linear activation).
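If for some $h \in [N-1]$ we have $\phi_h(x) = x$, then
$$\frac{d}{dt}\left( \mathbf{W}^{(h)}(\mathbf{W}^{(h)})^\top - (\mathbf{W}^{(h+1)})^\top\mathbf{W}^{(h+1)} \right) = \mathbf{0}.$$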
This result was known for linear networks (Arora et al., 2018), but the proof there relies on the entire network being linear while Theorem 2.2 only needs two consecutive layers to have no nonlinear activations in between.
While Theorem 2.1 shows the invariance in a node-wise manner, Theorem 2.2 shows for linear activation, we can derive a layer-wise invariance. Inspired by this strong invariance, in Section 3 we prove gradient descent with positive step sizes preserves this invariance approximately for matrix factorization.
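The layer-wise invariance for linear activation can be illustrated in the same spirit. The sketch below runs gradient descent with a small step size on a two-layer linear model and compares the matrix $\mathbf{W}^{(1)}(\mathbf{W}^{(1)})^\top - (\mathbf{W}^{(2)})^\top\mathbf{W}^{(2)}$ before and after training. The data, the weights, the step size and the helper names are assumptions, and the small residual drift comes from using a positive step size instead of gradient flow.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def balance_matrix(W1, W2):
    """W1 W1^T - W2^T W2 for two consecutive layers with linear activation in between."""
    P = matmul(W1, transpose(W1))
    Q = matmul(transpose(W2), W2)
    return [[p - q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

def loss_and_grads(W1, W2, X, Y):
    """Mean squared loss of x -> W2 W1 x and its gradients with respect to W1 and W2."""
    m = len(X)
    G1 = [[0.0] * len(W1[0]) for _ in W1]
    G2 = [[0.0] * len(W2[0]) for _ in W2]
    loss = 0.0
    for x, y in zip(X, Y):
        hidden = [sum(w * t for w, t in zip(row, x)) for row in W1]     # W1 x
        out = [sum(w * t for w, t in zip(row, hidden)) for row in W2]   # W2 W1 x
        r = [o - t for o, t in zip(out, y)]                             # residual
        loss += 0.5 * sum(e * e for e in r) / m
        back = [sum(W2[k][i] * r[k] for k in range(len(W2))) for i in range(len(W1))]
        for i in range(len(W1)):
            for j in range(len(x)):
                G1[i][j] += back[i] * x[j] / m        # (W2^T r) x^T
        for k in range(len(W2)):
            for i in range(len(W1)):
                G2[k][i] += r[k] * hidden[i] / m      # r (W1 x)^T
    return loss, G1, G2

# Hypothetical parameters and data.
W1 = [[0.3, -0.1], [0.2, 0.4]]
W2 = [[0.5, -0.2]]
X = [[1.0, 0.5], [-0.5, 1.0]]
Y = [[1.0], [0.5]]
step, iters = 1e-3, 2000

start = balance_matrix(W1, W2)
first_loss = loss_and_grads(W1, W2, X, Y)[0]
for _ in range(iters):
    _, G1, G2 = loss_and_grads(W1, W2, X, Y)
    W1 = [[w - step * g for w, g in zip(rw, rg)] for rw, rg in zip(W1, G1)]
    W2 = [[w - step * g for w, g in zip(rw, rg)] for rw, rg in zip(W2, G2)]
end = balance_matrix(W1, W2)
last_loss = loss_and_grads(W1, W2, X, Y)[0]
max_drift = max(abs(e - s) for re, rs in zip(end, start) for e, s in zip(re, rs))
print(first_loss, last_loss, max_drift)
```

Here every entry of the $2 \times 2$ matrix is nearly preserved, including the off-diagonal ones, which is more than the node-wise or Frobenius-norm invariance alone would give.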
2.2 Convolutional Neural Networks
Now we show that the conservation property in Corollary 2.1 can be generalized to convolutional neural networks. In fact, we can allow arbitrary sparsity pattern and weight sharing structure within a layer; convolutional layers are a special case.
Neural networks with sparse connections and shared weights.
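We use the same notation as in Section 2.1, with the difference that some weights in a layer can be missing or shared. Formally, the weight matrix $\mathbf{W}^{(h)} \in \mathbb{R}^{n_h \times n_{h-1}}$ in layer $h$ ($h \in [N]$) can be described by a vector $\mathbf{v}^{(h)} \in \mathbb{R}^{d_h}$ and a function $g_h: [n_h] \times [n_{h-1}] \to [d_h] \cup \{0\}$. Here $\mathbf{v}^{(h)}$ consists of the actual free parameters in this layer and $d_h$ is the number of free parameters (e.g. if there are $k$ convolutional filters in layer $h$ each with size $r$, we have $d_h = r \cdot k$). The map $g_h$ represents the sparsity and weight sharing pattern:
$$\mathbf{W}^{(h)}[i,j] = \begin{cases} 0, & g_h(i,j) = 0, \\ \mathbf{v}^{(h)}[k], & g_h(i,j) = k > 0. \end{cases}$$
Denote by $\mathbf{v} = (\mathbf{v}^{(h)})_{h=1}^{N}$ the collection of all the parameters in this network, and we consider gradient flow to learn the parameters:
$$\frac{d\mathbf{v}^{(h)}}{dt} \in -\frac{\partial L(\mathbf{v})}{\partial \mathbf{v}^{(h)}}, \qquad h = 1, \ldots, N.$$
The following theorem generalizes Corollary 2.1 to neural networks with sparse connections and shared weights:
Theorem 2.3.
For any $h \in [N-1]$, we have
$$\frac{d}{dt}\left( \left\|\mathbf{v}^{(h)}\right\|^2 - \left\|\mathbf{v}^{(h+1)}\right\|^2 \right) = 0.$$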
Therefore, for a neural network with arbitrary sparsity pattern and weight sharing structure, gradient flow still balances the magnitudes of all layers.
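The description of a layer by a vector of free parameters plus a sparsity and weight sharing pattern can be made concrete with a small sketch. It assumes the convention that the pattern maps an entry position to 0 when the weight is missing and to a positive index into the free-parameter vector otherwise; the helper names and the one-dimensional convolution pattern are hypothetical examples.

```python
def build_weight_matrix(v, g, n_out, n_in):
    """Expand free parameters v into a dense n_out x n_in matrix.

    g(i, j) returns 0 for a missing weight and k > 0 when entry (i, j)
    shares the free parameter v[k - 1] (indices in g are 1-based).
    """
    W = [[0.0] * n_in for _ in range(n_out)]
    for i in range(n_out):
        for j in range(n_in):
            k = g(i, j)
            if k > 0:
                W[i][j] = v[k - 1]
    return W

def conv1d_pattern(filter_size):
    """Pattern of a single 1-D convolution filter with stride 1 and no padding."""
    def g(i, j):
        offset = j - i
        return offset + 1 if 0 <= offset < filter_size else 0
    return g

# One filter of size 2 sliding over an input of length 4 gives 3 outputs.
v = [0.5, -1.0]
W = build_weight_matrix(v, conv1d_pattern(2), n_out=3, n_in=4)
for row in W:
    print(row)
```

The dense matrix has twelve entries but only two free parameters, and it is the norm of the free-parameter vector, not of the dense matrix, that enters the balancedness statement for such layers.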
2.3 Proof of Theorem 2.1
The proofs of all theorems in this section are similar. They are based on the use of the chain rule (i.e. back-propagation) and the property of homogeneous activations. Below we provide the proof of Theorem 2.1 and defer the proofs of other theorems to Appendix A.
Proof of Theorem 2.1.
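First we note that we can without loss of generality assume $L$ is the loss associated with one data sample $(\mathbf{x}, \mathbf{y}) \in \mathbb{R}^d \times \mathbb{R}^q$, i.e., $L(\mathbf{w}) = \ell(f_{\mathbf{w}}(\mathbf{x}), \mathbf{y})$. In fact, for $L(\mathbf{w}) = \frac{1}{m}\sum_{k=1}^{m} L_k(\mathbf{w})$ where $L_k(\mathbf{w}) = \ell(f_{\mathbf{w}}(\mathbf{x}_k), \mathbf{y}_k)$, for any single weight $\mathbf{W}^{(h)}[i,j]$ in the network we can compute
$$\frac{d}{dt}\left(\mathbf{W}^{(h)}[i,j]\right)^2 = 2\,\mathbf{W}^{(h)}[i,j] \cdot \frac{d\mathbf{W}^{(h)}[i,j]}{dt} = -2\,\mathbf{W}^{(h)}[i,j] \cdot \frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h)}[i,j]} = -2\,\mathbf{W}^{(h)}[i,j] \cdot \frac{1}{m}\sum_{k=1}^{m} \frac{\partial L_k(\mathbf{w})}{\partial \mathbf{W}^{(h)}[i,j]},$$
using the sharp chain rule of differential inclusions for tame functions (Drusvyatskiy et al., 2015; Davis et al., 2018). Thus, if we can prove the theorem for every individual loss $L_k$, we can prove the theorem for $L$ by taking the average over $k \in [m]$.
Therefore in the rest of the proof we assume $L(\mathbf{w}) = \ell(f_{\mathbf{w}}(\mathbf{x}), \mathbf{y})$. For convenience, we denote $\mathbf{x}^{(h)} = f^{(h)}_{\mathbf{w}}(\mathbf{x})$ ($h \in [N]$), which is the input to the $h$-th hidden layer of neurons for $h \in [N-1]$ and is the output of the network for $h = N$. We also denote $\mathbf{x}^{(0)} = \mathbf{x}$ and $\phi_0(x) = x$ ($\forall x$).
Now we prove (6). Since $\mathbf{W}^{(h+1)}[k,i]$ ($k \in [n_{h+1}]$) can only affect $L(\mathbf{w})$ through $\mathbf{x}^{(h+1)}[k]$, we have for $k \in [n_{h+1}]$,
$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h+1)}[k,i]} = \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}[k]} \cdot \frac{\partial \mathbf{x}^{(h+1)}[k]}{\partial \mathbf{W}^{(h+1)}[k,i]} = \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}[k]} \cdot \phi_h\left(\mathbf{x}^{(h)}[i]\right),$$
which can be rewritten as
$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h+1)}[:,i]} = \phi_h\left(\mathbf{x}^{(h)}[i]\right) \cdot \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}.$$
It follows that
$$\frac{d}{dt}\left\|\mathbf{W}^{(h+1)}[:,i]\right\|^2 = 2\left\langle \mathbf{W}^{(h+1)}[:,i], \frac{d}{dt}\mathbf{W}^{(h+1)}[:,i] \right\rangle = -2\left\langle \mathbf{W}^{(h+1)}[:,i], \frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h+1)}[:,i]} \right\rangle = -2\,\phi_h\left(\mathbf{x}^{(h)}[i]\right) \cdot \left\langle \mathbf{W}^{(h+1)}[:,i], \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}} \right\rangle. \tag{7}$$
On the other hand, $\mathbf{W}^{(h)}[i,:]$ only affects $L(\mathbf{w})$ through $\mathbf{x}^{(h)}[i]$. Using the chain rule, we get
$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h)}[i,:]} = \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h)}[i]} \cdot \phi_{h-1}\left(\mathbf{x}^{(h-1)}\right) = \left\langle \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}, \mathbf{W}^{(h+1)}[:,i] \right\rangle \cdot \phi_h'\left(\mathbf{x}^{(h)}[i]\right) \cdot \phi_{h-1}\left(\mathbf{x}^{(h-1)}\right),$$
where $\phi_h'$ is interpreted as a set-valued mapping whenever it is applied at a non-differentiable point. (More precisely, the equalities should be inclusions whenever there is a sub-differential, but as we see in the next display the ambiguity in the choice of sub-differential does not affect later calculations.)
It follows that
$$\frac{d}{dt}\left\|\mathbf{W}^{(h)}[i,:]\right\|^2 = -2\left\langle \mathbf{W}^{(h)}[i,:], \frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h)}[i,:]} \right\rangle = -2\left\langle \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}, \mathbf{W}^{(h+1)}[:,i] \right\rangle \cdot \phi_h'\left(\mathbf{x}^{(h)}[i]\right) \cdot \left\langle \mathbf{W}^{(h)}[i,:], \phi_{h-1}\left(\mathbf{x}^{(h-1)}\right) \right\rangle$$
$$= -2\left\langle \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}, \mathbf{W}^{(h+1)}[:,i] \right\rangle \cdot \phi_h'\left(\mathbf{x}^{(h)}[i]\right) \cdot \mathbf{x}^{(h)}[i] = -2\left\langle \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}, \mathbf{W}^{(h+1)}[:,i] \right\rangle \cdot \phi_h\left(\mathbf{x}^{(h)}[i]\right).$$
(This holds for any choice of element of the sub-differential, since $\phi_h'(x) \cdot x = \phi_h(x)$ holds at $x = 0$ for any choice of sub-differential.)
Comparing the above expression to (7), we finish the proof. ∎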
3 Gradient Descent Converges to Global Minimum for Asymmetric Matrix Factorization
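In this section we restrict ourselves to the asymmetric matrix factorization problem and analyze the gradient descent algorithm with random initialization. Our analysis is inspired by the auto-balancing properties presented in Section 2. We extend these properties from gradient flow to gradient descent with positive step size.
Formally, we study the following non-convex optimization problem:
$$\min_{\mathbf{U} \in \mathbb{R}^{d_1 \times r},\, \mathbf{V} \in \mathbb{R}^{d_2 \times r}} f(\mathbf{U}, \mathbf{V}) = \frac{1}{2} \left\| \mathbf{U}\mathbf{V}^\top - \mathbf{M}^* \right\|_F^2, \tag{8}$$
where $\mathbf{M}^* \in \mathbb{R}^{d_1 \times d_2}$ has rank $r$. Gradient descent on (8) with step sizes $\eta_t$ takes the form
$$\mathbf{U}_{t+1} = \mathbf{U}_t - \eta_t \left(\mathbf{U}_t\mathbf{V}_t^\top - \mathbf{M}^*\right)\mathbf{V}_t, \qquad \mathbf{V}_{t+1} = \mathbf{V}_t - \eta_t \left(\mathbf{U}_t\mathbf{V}_t^\top - \mathbf{M}^*\right)^\top \mathbf{U}_t. \tag{9}$$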
3.1 The General Rank-$r$ Case
First we consider the general case of $r \ge 1$. Our main theorem below says that if we use a random small initialization $(\mathbf{U}_0, \mathbf{V}_0)$, and set step sizes $\eta_t$ to be appropriately small, then gradient descent (9) will converge to a solution close to the global minimum of (8). To our knowledge, this is the first result showing that gradient descent with random initialization directly solves the un-regularized asymmetric matrix factorization problem (8).
Proof sketch of Theorem 3.1.
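First let us imagine that we are using infinitesimal step size in GD. Then according to Theorem 2.2 (viewing problem (8) as learning a two-layer linear network where the inputs are all the standard unit vectors in $\mathbb{R}^{d_2}$), we know that $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$ will stay invariant throughout the algorithm. Hence when $\mathbf{U}$ and $\mathbf{V}$ are initialized to be small, $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$ will stay small forever. Combined with the fact that the objective $f(\mathbf{U}, \mathbf{V})$ is decreasing over time (which means $\mathbf{U}\mathbf{V}^\top$ cannot be too far from $\mathbf{M}^*$), we can show that $\mathbf{U}$ and $\mathbf{V}$ will always stay bounded.
Now we are using positive step sizes $\eta_t$, so we no longer have the invariance of $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$. Nevertheless, by a careful analysis of the updates, we can still prove that $\mathbf{U}_t^\top\mathbf{U}_t - \mathbf{V}_t^\top\mathbf{V}_t$ is small, the objective $f(\mathbf{U}_t, \mathbf{V}_t)$ decreases, and $\mathbf{U}_t$ and $\mathbf{V}_t$ stay bounded. Formally, we have the following lemma: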
With high probability over the initialization , for all we have:
Decreasing objective: ;
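To see why a positive step size only mildly perturbs the balancedness, one can expand a single gradient descent step directly. The display below is a worked calculation, not taken from the paper's proof. It assumes the plain update $\mathbf{U}_{t+1} = \mathbf{U}_t - \eta_t \mathbf{R}_t \mathbf{V}_t$, $\mathbf{V}_{t+1} = \mathbf{V}_t - \eta_t \mathbf{R}_t^\top \mathbf{U}_t$ on the unregularized objective, where the shorthand $\mathbf{R}_t = \mathbf{U}_t\mathbf{V}_t^\top - \mathbf{M}^*$ is introduced here for convenience.

```latex
\begin{aligned}
\mathbf{U}_{t+1}^\top \mathbf{U}_{t+1}
  &= \mathbf{U}_t^\top \mathbf{U}_t
     - \eta_t \left( \mathbf{U}_t^\top \mathbf{R}_t \mathbf{V}_t + \mathbf{V}_t^\top \mathbf{R}_t^\top \mathbf{U}_t \right)
     + \eta_t^2\, \mathbf{V}_t^\top \mathbf{R}_t^\top \mathbf{R}_t \mathbf{V}_t, \\
\mathbf{V}_{t+1}^\top \mathbf{V}_{t+1}
  &= \mathbf{V}_t^\top \mathbf{V}_t
     - \eta_t \left( \mathbf{V}_t^\top \mathbf{R}_t^\top \mathbf{U}_t + \mathbf{U}_t^\top \mathbf{R}_t \mathbf{V}_t \right)
     + \eta_t^2\, \mathbf{U}_t^\top \mathbf{R}_t \mathbf{R}_t^\top \mathbf{U}_t, \\
\mathbf{U}_{t+1}^\top \mathbf{U}_{t+1} - \mathbf{V}_{t+1}^\top \mathbf{V}_{t+1}
  &= \mathbf{U}_t^\top \mathbf{U}_t - \mathbf{V}_t^\top \mathbf{V}_t
     + \eta_t^2 \left( \mathbf{V}_t^\top \mathbf{R}_t^\top \mathbf{R}_t \mathbf{V}_t
                       - \mathbf{U}_t^\top \mathbf{R}_t \mathbf{R}_t^\top \mathbf{U}_t \right).
\end{aligned}
```

The terms linear in $\eta_t$ cancel, which is the discrete counterpart of the gradient flow invariance, and the remaining change is of order $\eta_t^2$. This suggests why controlling the sum of squared step sizes, together with bounds on the iterates and the residual, is a natural route to keeping the difference small over all iterations.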
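Now that we know the GD algorithm automatically constrains $(\mathbf{U}_t, \mathbf{V}_t)$ in a bounded region, we can use the smoothness of $f$ in this region and a standard analysis of GD to show that $(\mathbf{U}_t, \mathbf{V}_t)$ converges to a stationary point $(\bar{\mathbf{U}}, \bar{\mathbf{V}})$ of $f$ (Lemma B.2). Furthermore, using the results of (Lee et al., 2016; Panageas and Piliouras, 2016) we know that $(\bar{\mathbf{U}}, \bar{\mathbf{V}})$ is almost surely not a strict saddle point. Then the following lemma implies that $(\bar{\mathbf{U}}, \bar{\mathbf{V}})$ has to be close to a global optimum, since we know $\bar{\mathbf{U}}^\top\bar{\mathbf{U}} - \bar{\mathbf{V}}^\top\bar{\mathbf{V}}$ is small from Lemma 3.1 (i). This would complete the proof of Theorem 3.1.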
Suppose is a stationary point of such that . Then either , or is a strict saddle point of .
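3.2 The Rank-$1$ Case
We have shown in Theorem 3.1 that GD with small and diminishing step sizes converges to a global minimum for matrix factorization. Empirically, it is observed that a constant step size is enough for GD to converge quickly to a global minimum. Therefore, some natural questions are how to prove convergence of GD with a constant step size, how fast it converges, and how the discretization affects the invariance we derived in Section 2.
While these questions remain challenging for the general rank-$r$ matrix factorization, we resolve them for the case of $r = 1$. Our main finding is that with constant step size, the norms of two layers are always within a constant factor of each other (although we may no longer have the stronger balancedness property as in Lemma 3.1), and we utilize this property to prove the linear convergence of GD to a global minimum.
When $r = 1$, the asymmetric matrix factorization problem and its GD dynamics become
$$\min_{\mathbf{u} \in \mathbb{R}^{d_1},\, \mathbf{v} \in \mathbb{R}^{d_2}} \frac{1}{2} \left\| \mathbf{u}\mathbf{v}^\top - \mathbf{M}^* \right\|_F^2$$
and
$$\mathbf{u}_{t+1} = \mathbf{u}_t - \eta \left(\mathbf{u}_t\mathbf{v}_t^\top - \mathbf{M}^*\right)\mathbf{v}_t, \qquad \mathbf{v}_{t+1} = \mathbf{v}_t - \eta \left(\mathbf{v}_t\mathbf{u}_t^\top - \mathbf{M}^{*\top}\right)\mathbf{u}_t.$$
Here we assume $\mathbf{M}^*$ has rank $1$, i.e., it can be factorized as $\mathbf{M}^* = \sigma_1 \mathbf{u}^* \mathbf{v}^{*\top}$ where $\mathbf{u}^*$ and $\mathbf{v}^*$ are unit vectors and $\sigma_1 > 0$.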
Our main theoretical result is the following.
Theorem 3.2 (Approximate balancedness and linear convergence of GD for rank-$1$ matrix factorization).
Suppose , with () for some sufficiently small constant , and for some sufficiently small constant . Then with constant probability over the initialization, for all we have for some universal constants . Furthermore, for any , after iterations, we have .
4 Empirical Verification
We perform experiments to verify the auto-balancing properties of gradient descent in neural networks with ReLU activation. Our results below show that for GD with small step size and small initialization: (1) the difference between the squared Frobenius norms of any two layers remains small in all iterations, and (2) the ratio between the squared Frobenius norms of any two layers becomes close to $1$. Notice that our theorems in Section 2 hold for gradient flow (step size $\to 0$) but in practice we can only choose a (small) positive step size, so we cannot hope for the difference between the squared Frobenius norms to remain exactly the same, and can only hope to observe that the differences remain small.
We consider a 3-layer fully connected network of the form where is the input, , , , and is ReLU activation. We use 1,000 data points and the quadratic loss function, and run GD. We first test a balanced initialization: , and , which ensures . After 10,000 iterations we have , and . Figure 1(a) shows that in all iterations and are bounded by which is much smaller than the magnitude of each . Figures 1(b) shows that the ratios between norms approach . We then test an unbalanced initialization: , and . After 10,000 iterations we have , and . Figure 1(c) shows that and are bounded by (and indeed change very little throughout the process), and Figures 1(d) shows that the ratios become close to after about 1,000 iterations.
5 Conclusion and Future Work
In this paper we take a step towards characterizing the invariance imposed by first order algorithms. We show that gradient flow automatically balances the magnitudes of all layers in a deep neural network with homogeneous activations. For the concrete model of asymmetric matrix factorization, we further use the balancedness property to show that gradient descent converges to global minimum. We believe our findings on the invariance in deep models could serve as a fundamental building block for understanding optimization in deep learning. Below we list some future directions.
Other first-order methods.
In this paper we focus on the invariance induced by gradient descent. In practice, different acceleration and adaptive methods are also used. A natural future direction is how to characterize the invariance properties of these algorithms.
From gradient flow to gradient descent: a generic analysis?
As discussed in Section 3, while strong invariance properties hold for gradient flow, in practice one uses gradient descent with positive step sizes and the invariance may only hold approximately because positive step sizes discretize the dynamics. We use specialized techniques for analyzing asymmetric matrix factorization. It would be very interesting to develop a generic approach to analyze the discretization. Recent findings on the connection between optimization and ordinary differential equations (Su et al., 2014; Zhang et al., 2018) might be useful for this purpose.
We thank Phil Long for his helpful comments on an earlier draft of this paper. JDL acknowledges support from ARO W911NF-11-1-0303.
- Absil et al. (2005) Pierre-Antoine Absil, Robert Mahony, and Benjamin Andrews. Convergence of the iterates of descent methods for analytic cost functions. SIAM Journal on Optimization, 16(2):531–547, 2005.
- Arora et al. (2018) Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.
- Bartlett et al. (2018) Peter L Bartlett, David P Helmbold, and Philip M Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. arXiv preprint arXiv:1802.06093, 2018.
- Brutzkus and Globerson (2017) Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.
- Brutzkus et al. (2017) Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.
- Choromanska et al. (2015) Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.
- Clarke et al. (2008) Francis H Clarke, Yuri S Ledyaev, Ronald J Stern, and Peter R Wolenski. Nonsmooth analysis and control theory, volume 178. Springer Science & Business Media, 2008.
- Davis et al. (2018) Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, and Jason D Lee. Stochastic subgradient method converges on tame functions. arXiv preprint arXiv:1804.07795, 2018.
- Drusvyatskiy et al. (2015) Dmitriy Drusvyatskiy, Alexander D Ioffe, and Adrian S Lewis. Curves of descent. SIAM Journal on Control and Optimization, 53(1):114–138, 2015.
- Du and Lee (2018) Simon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.
- Du et al. (2017a) Simon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017a.
- Du et al. (2017b) Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017b.
- Freeman and Bruna (2016) C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.
- Ge et al. (2015) Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
- Ge et al. (2017a) Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, pages 1233–1242, 2017a.
- Ge et al. (2017b) Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017b.
- Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
- Haeffele and Vidal (2015) Benjamin D Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
- Hardt and Ma (2016) Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
- Kawaguchi (2016) Kenji Kawaguchi. Deep learning without poor local minima. In Advances In Neural Information Processing Systems, pages 586–594, 2016.
- Lee et al. (2016) Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
- Li and Yuan (2017) Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. arXiv preprint arXiv:1705.09886, 2017.
- Ma et al. (2017) Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. arXiv preprint arXiv:1712.06559, 2017.
- Neyshabur et al. (2015a) Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015a.
- Neyshabur et al. (2015b) Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Data-dependent path normalization in neural networks. arXiv preprint arXiv:1511.06747, 2015b.
- Nguyen and Hein (2017a) Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017a.
- Nguyen and Hein (2017b) Quynh Nguyen and Matthias Hein. The loss surface and expressivity of deep convolutional neural networks. arXiv preprint arXiv:1710.10928, 2017b.
- Panageas and Piliouras (2016) Ioannis Panageas and Georgios Piliouras. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. arXiv preprint arXiv:1605.00405, 2016.
- Safran and Shamir (2016) Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.
- Safran and Shamir (2017) Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer relu neural networks. arXiv preprint arXiv:1712.08968, 2017.
- Shamir (2018) O. Shamir. Are resnets provably better than linear predictors? arXiv preprint arXiv:1804.06739, 2018.
- Su et al. (2014) Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
- Tian (2017) Yuandong Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560, 2017.
- Tu et al. (2015) Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
- Vidal et al. (2017) Rene Vidal, Joan Bruna, Raja Giryes, and Stefano Soatto. Mathematics of deep learning. arXiv preprint arXiv:1712.04741, 2017.
- Zhang et al. (2018) Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct runge-kutta discretization achieves acceleration. arXiv preprint arXiv:1805.00521, 2018.
- Zhong et al. (2017) Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017.
- Zhou and Feng (2017) Pan Zhou and Jiashi Feng. The landscape of deep learning algorithms. arXiv preprint arXiv:1705.07038, 2017.
Appendix A Proofs for Section 2
Proof of Theorem 2.2.
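As in the proof of Theorem 2.1, we assume without loss of generality that $L(\mathbf{w}) = \ell(f_{\mathbf{w}}(\mathbf{x}), \mathbf{y})$ for some $(\mathbf{x}, \mathbf{y}) \in \mathbb{R}^d \times \mathbb{R}^q$. We also denote $\mathbf{x}^{(h)} = f^{(h)}_{\mathbf{w}}(\mathbf{x})$ ($\forall h \in [N]$), $\mathbf{x}^{(0)} = \mathbf{x}$ and $\phi_0(x) = x$ ($\forall x$).
Now we suppose $\phi_h(x) = x$ for some $h \in [N-1]$. Denote $\mathbf{u} = \phi_{h-1}(\mathbf{x}^{(h-1)})$. Then we have $\mathbf{x}^{(h+1)} = \mathbf{W}^{(h+1)}\mathbf{W}^{(h)}\mathbf{u}$. Using the chain rule, we can directly compute
$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h)}} = (\mathbf{W}^{(h+1)})^\top \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}} \mathbf{u}^\top, \qquad \frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h+1)}} = \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}} \left(\mathbf{W}^{(h)}\mathbf{u}\right)^\top.$$
Then we have
$$\frac{d}{dt}\left(\mathbf{W}^{(h)}(\mathbf{W}^{(h)})^\top\right) = -\mathbf{W}^{(h)}\mathbf{u} \left(\frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}\right)^\top \mathbf{W}^{(h+1)} - (\mathbf{W}^{(h+1)})^\top \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}} \left(\mathbf{W}^{(h)}\mathbf{u}\right)^\top,$$
$$\frac{d}{dt}\left((\mathbf{W}^{(h+1)})^\top\mathbf{W}^{(h+1)}\right) = -(\mathbf{W}^{(h+1)})^\top \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}} \left(\mathbf{W}^{(h)}\mathbf{u}\right)^\top - \mathbf{W}^{(h)}\mathbf{u} \left(\frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}\right)^\top \mathbf{W}^{(h+1)}.$$
Comparing the above two equations we know $\frac{d}{dt}\left(\mathbf{W}^{(h)}(\mathbf{W}^{(h)})^\top - (\mathbf{W}^{(h+1)})^\top\mathbf{W}^{(h+1)}\right) = \mathbf{0}$. ∎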