and reinforcement learning applied in classic games like Go(Silver et al., 2016). Despite the highly non-convex nature of the objective function, simple first-order algorithms like stochastic gradient descent and its variants often train such networks successfully. On the other hand, the success of convolutional neural network remains elusive from an optimization perspective.
When the input distribution is not constrained, existing results are mostly negative, such as hardness of learning a 3-node neural network (Blum & Rivest, 1989) or a non-overlap convolutional filter (Brutzkus & Globerson, 2017). Recently, Shamir (2016) showed learning a simple one-layer fully connected neural network is hard for some specific input distributions.
These negative results suggest that, in order to explain the empirical success of SGD for learning neural networks, stronger assumptions on the input distribution are needed. Recently, a line of research (Tian, 2017; Brutzkus & Globerson, 2017; Li & Yuan, 2017; Soltanolkotabi, 2017; Zhong et al., 2017) assumed the input distribution be standard Gaussian and showed (stochastic) gradient descent is able to recover neural networks with ReLU activation in polynomial time.
One major issue of these analysis is that they rely on specialized analytic properties of the Gaussian distribution (c.f. Section1.1) and thus cannot be generalized to the non-Gaussian case, in which real-world distributions fall into. For general input distributions, new techniques are needed.
In this paper we consider a simple architecture: a convolution layer, followed by a ReLU activation function, and then average pooling. Formally, we let be an input sample, e.g., an image, we generate patches from , each with size : where the -th column is the -th patch generated by some known function . For a filter with size
and stride, is the -th and -th pixels. Since for convolutional filters, we only need to focus on the patches instead of the input, in the following definitions and theorems, we will refer as input and let as the distribution of : ( is the ReLU activation function)
We address the realizable case, where training data are generated from (1) with some unknown teacher parameter under input distribution . Consider the loss . We learn by (stochastic) gradient descent, i.e.,
where is the step size which may change over time and is a random function where its expectation equals to the population gradient The goal of our analysis is to understand the conditions where , if is optimized under (stochastic) gradient descent.
In this setup, our main contributions are as follows:
Learnability of Filters: We show if the input patches are highly correlated (Section 3), i.e., for some small , then gradient descent and stochastic gradient descent with random initialization recovers the filter in polynomial time.111Note since in this paper we focus on continuous distribution over , our results do not conflict with previous negative results(Blum & Rivest, 1989; Brutzkus & Globerson, 2017) whose constructions rely on discrete distributions.
Furthermore, strong correlations imply faster convergence. To the best of our knowledge, this is the first recovery guarantee of randomly initialized gradient-based algorithms for learning filters (even for the simplest one-layer one-neuron network) on non-Gaussian input distribution, answering an open problem in(Tian, 2017).
Distribution-Aware Convergence Rate
. We formally establish the connection between the smoothness of the input distribution and the convergence rate for filter weights recovery where the smoothness in our paper is defined as the ratio between the largest and the least eigenvalues of the second moment of the activation region (Section2). We show that a smoother input distribution leads to faster convergence, and Gaussian distribution is a special case that leads to the tightest bound. This theoretical finding also justifies the two-stage learning rate strategy proposed by (He et al., 2016; Szegedy et al., 2017) if the step size is allowed to change over time.
1.1 Related Works
In recent years, theorists have tried to explain the success of deep learning from different perspectives. From optimization point of view, optimizing neural network is a non-convex optimization problem. Pioneered byGe et al. (2015), a class of non-convex optimization problems that satisfy strict saddle property can be optimized by perturbed (stochastic) gradient descent in polynomial time (Jin et al., 2017).222Gradient descent is not guaranteed to converge to a local minima in polynomial time (Du et al., 2017; Lee et al., 2016). This motivates the research of studying the landscape of neural networks (Soltanolkotabi et al., 2017; Kawaguchi, 2016; Choromanska et al., 2015; Hardt & Ma, 2016; Haeffele & Vidal, 2015; Mei et al., 2016; Freeman & Bruna, 2016; Safran & Shamir, 2016; Zhou & Feng, 2017; Nguyen & Hein, 2017) However, these results cannot be directly applied to analyzing the convergence of gradient-based methods for ReLU activated neural networks.
From learning theory point of view, it is well known that training a neural network is hard in the worst cases (Blum & Rivest, 1989; Livni et al., 2014; Šíma, 2002; Shalev-Shwartz et al., 2017a, b) and recently, Shamir (2016) showed either “niceness” of the target function or of the input distribution alone is sufficient for optimization algorithms used in practice to succeed. With some additional assumptions, many works tried to design algorithms that provably learn a neural network with polynomial time and sample complexity (Goel et al., 2016; Zhang et al., 2016, 2015; Sedghi & Anandkumar, 2014; Janzamin et al., 2015; Gautier et al., 2016; Goel & Klivans, 2017). However, these algorithms are tailored for certain architecture and cannot explain why (stochastic) gradient based optimization algorithms work well in practice.
Focusing on gradient-based algorithms, a line of research analyzed the behavior of (stochastic) gradient descent for Gaussian input distribution. Tian (2017) showed population gradient descent is able to find the true weight vector with random initialization for one-layer one-neuron model. Brutzkus & Globerson (2017) showed population gradient descent recovers the true weights of a convolution filter with non-overlapping input in polynomial time. Li & Yuan (2017) showed SGD can recover the true weights of a one-layer ResNet model with ReLU activation under the assumption that the spectral norm of the true weights is bounded by a small constant. All the methods use explicit formulas for Gaussian input, which enable them to apply trigonometric inequalities to derive the convergence. With the same Gaussian assumption, Soltanolkotabi (2017) shows that the true weights can be exactly recovered by projected gradient descent with enough samples in linear time, if the number of inputs is less than the dimension of the weights.
Other approaches combine tensor approaches with assumptions of input distribution.Zhong et al. (2017) proved that with sufficiently good initialization, which can be implemented by tensor method, gradient descent can find the true weights of a 3-layer fully connected neural network. However, their approach works with known input distributions. Soltanolkotabi (2017) used Gaussian width (c.f. Definition 2.2 of (Soltanolkotabi, 2017)) for concentrations and his approach cannot be directly extended to learning a convolutional filter.
In this paper, we adopt a different approach that only relies on the definition of ReLU. We show as long as the input distribution satisfies weak smoothness assumptions, we are able to find the true weights by SGD in polynomial time. Using our conclusions, we can justify the effectiveness of large amounts of data (which may eliminate saddle points), two-stage and adaptive learning rates used by He et al. (2016); Szegedy et al. (2017), etc.
This paper is organized as follows. In Section 2, we analyze the simplest one-layer one-neuron model where we state our key observation and establish the connection between smoothness and convergence rate. In Section 3, we discuss the performance of (stochastic) gradient descent for learning a convolutional filter. We provide empirical illustrations in Section 4 and conclude in Section 5. We place most of our detailed proofs in the Appendix.
Let denote the Euclidean norm of a finite-dimensional vector. For a matrix , we use
to denote its largest singular value andits smallest singular value. Note if is a positive semidefinite matrix, and represent the largest and smallest eigenvalues of , respectively. Let and denote the standard Big-O and Big-Theta notations that hide absolute constants. We assume the gradient function is uniformly bounded, i.e., There exists such that . This condition is satisfied as long as patches, and noise are all bounded.
2 Warm Up: Analyzing One-Layer One-Neuron Model
Before diving into the convolutional filter, we first analyze the special case for , which is equivalent to the one-layer one-neuron architecture. The analysis in this simple case will give us insights for the fully general case. For the ease of presentation, we define following two events and corresponding second moments
where is the indicator function. Intuitively, is the joint activation region of and and is the joint activation region of and . See Figure 2 (a) for the graphical illustration. With some simple algebra we can derive the population gradient.
Suppose for any with , and the initialization satisfies then gradient descent algorithm recovers .
The first assumption is about the non-degeneracy of input distribution. For , one case that the assumption fails is that the input distribution is supported on a low-dimensional space, or degenerated. The second assumption on the initialization is to ensure that gradient descent does not converge to , at which the gradient is undefined. This is a general convergence theorem that holds for a wide class of input distribution and initialization points. In particular, it includes Theorem 6 of (Tian, 2017) as a special case. If the input distribution is degenerate, i.e., there are holes in the input space, the gradient descent may stuck around saddle points and we believe more data are needed to facilitate the optimization procedure This is also consistent with empirical evidence in which more data are helpful for optimization.
2.1 Convergence Rate of One-Layer One-Neuron Model
In the previous section we showed if the distribution is regular and the weights are initialized appropriately, gradient descent recovers the true weights when it converges. In practice we also want to know how many iterations are needed. To characterize the convergence rate, we need some quantitative assumptions. We note that different set of assumptions will lead to a different rate and ours is only one possible choice. In this paper we use the following quantities.
Definition 2.1 (The Largest/Smallest eigenvalue Values of the Second Moment on Intersection of two Half Spaces).
For , define
These two conditions quantitatively characterize the angular smoothness of the input distribution. For a given angle , if the difference between and
is large then there is one direction has large probability mass and one direction has small probability mass, meaning the input distribution is not smooth. On the other hand, ifand are close, then all directions have similar probability mass, which means the input distribution is smooth. The smoothest input distributions are rotationally invariant distributions (e.g. standard Gaussian) which have . For analogy, we can think of as Lipschitz constant of the gradient and as the strong convexity parameter in the optimization literature but here we also allow they change with the angle. Also observe that when , because the intersection has measure and both and are monotonically decreasing.
Our next assumption is on the growth of . Note that when , then because the intersection between and has measure. Also, grows as the angle between and becomes larger.
In the following, we assume the operator norm of increases smoothly with respect to the angle. The intuition is that as long as input distribution bounded probability density with respect to the angle, the operator norm of is bounded. We show in Theorem A.1 that for rotational invariant distribution and in Theorem A.2 that for standard Gaussian distribution.
We assume there exists that for , .
Now we are ready to state the convergence rate.
Suppose the initialization satisfies . Denote then if step size is set as , we have for
Note both and increases as decreases so we can choose a constant step size . This theorem implies that we can find the -close solution of in iterations. It also suggests a direct relation between the smoothness of the distribution and the convergence rate. For smooth distribution where and are close and is small then is relatively small and we need fewer iterations. On the other hand, if or is much larger than , we will need more iterations. We verify this intuition in Section 4.
If we are able to choose the step sizes adaptively , like using methods proposed by Lin & Xiao (2014), we may improve the computational complexity to . This justifies the use of two-stage learning rate strategy proposed by He et al. (2016); Szegedy et al. (2017) where at the beginning we need to choose learning to be small because is small and later we can choose a large learning rate because as the angle between and becomes smaller, becomes bigger.
The theorem requires the initialization satisfying , which can be achieved by random initialization with constant success probability. See Section 3.2 for a detailed discussion.
3 Main Results for Learning a Convolutional Filter
In this section we generalize ideas from the previous section to analyze the convolutional filter. First, for given and we define four events that divide the input space of each patch . Each event corresponds to a different activation region induced by and , similar to (3).
Please check Figure 2 (a) again for illustration. For the ease of presentation we also define the average over all patches in each region
For , define
We assume for all , for some .
The main difference between the simple one-layer one-neuron network and the convolution filter is two patches may appear in different regions. For a given sample, there may exists patch and such that and and their interaction plays an important role in the convergence of (stochastic) gradient descent. Here we assume the second moment of this interaction, i.e., cross-covariance, also grows smoothly with respect to the angle.
We assume there exists such that
First note if , then and has measure and this assumption models the growth of cross-covariance. Next note this represents the closeness of patches. If and are very similar, then the joint probability density of and is small which implies is small. In the extreme setting, , we have because in this case the events , and all have measure .
Now we are ready to present our result on learning a convolutional filter by gradient descent.
If the initialization satisfies and denote which satisfies . Then if we choose , we have for and
Our theorem suggests if the initialization satisfies , we obtain linear convergence rate. In Section 3.1, we give a concrete example showing closeness of patches implies large and small . Similar to Theorem 2.2, if the step size is chosen so that , in iterations, we can find the -close solution of and the proof is also similar to that of Theorem 3.1.
In practice,we never get a true population gradient but only stochastic gradient (c.f. Equation (2)). The following theorem shows SGD also recovers the underlying filter.
Let . Denote , and . For sufficiently small, if , then we have in iterations, with probability at least we have
Unlike the vanilla gradient descent case, here the convergence rate depends on instead of . This is because of the randomness in SGD and we need a more robust initialization. We choose to be the average of and for the ease of presentation. As will be apparent in the proof we only require not very close to . The proof relies on constructing a martingale and use Azuma-Hoeffding inequality and this idea has been previously used by Ge et al. (2015).
3.1 What distribution is easy for SGD to learn a convolutional filter?
Different from One-Layer One-Neuron model, here we also requires the Lipschitz constant for closeness to be relatively small and to be relatively large. A natural question is: What input distributions satisfy this condition?
Here we give an example. We show if (1) patches are close to each other (2) the input distribution has small probability mass around the decision boundary then the assumption in Theorem 3.1 is satisfied. See Figure 1 (b)-(c) for the graphical illustrations.
Denote . Suppose all patches have unit norm 333This is condition can be relaxed to the norm and the angle of each patch are independent and the norm of each pair is independent of others. and for all for all , . Further assume there exists such that for any and for all
then we have
where , analogue to Definition 2.1.
Several comments are in sequel. We view as a quantitative measure of the closeness between different patches, i.e., small means they are similar.
This lower bound is monotonically decreasing as a function of and note when , which recovers Definition 2.1.
For the upper bond on , represents the upper bound of the probability density around the decision boundary. For example if , then for in a small neighborhood around , say radius , we have
. This assumption is usually satisfied in real world examples like images because the image patches are not usually close to the decision boundary. For example, in computer vision, the local image patches often form clusters and is not evenly distributed over the appearance space. Therefore, if we use linear classifier to separate their cluster centers from the rest of the clusters, near the decision boundary the probability mass should be very low.
3.2 The Power of Random Initialization
For one-layer one-neuron model, we need initialization and for the convolution filter, we need a stronger initialization . The following theorem shows with uniformly random initialization we have constant probability to obtain a good initialization. Note with this theorem at hand, we can boost the success probability to arbitrary close to by random restarts. The proof is similar to (Tian, 2017).
If we uniformly sample from a -dimensional ball with radius so that , then with probability at least , we have .
To apply this general initialization theorem to our convolution filter case, we can choose . Therefore, with some simple algebra we have the following corollary.
Suppose , then if is uniformly sampled from a ball with center and radius , we have with probability at least
The assumption of this corollary is satisfied if the patches are close to each other as discussed in the previous section.
In this section we use simulations to verify our theoretical findings. We first test how the smoothness affect the convergence rate in one-layer one-neuron model described in Section 2 To construct input distribution with different , and (c.f. Definition 2.1 and Assumption 2.1), we fix the patch to have unit norm and use a mixture of truncated Gaussian distribution to model on the angle around and around the Specifically, the probability density of is sampled from Note by definitions of and if the probability mass is centered around , so the distribution is very spiky and and will be large. On the other hand, if , then input distribution is close to the rotation invariant distribution and and will be small. Figure 3a verifies our prediction where we fix the initialization and step size.
Next we test how the closeness of patches affect the convergence rate in the convolution setting. We first generate a single patch using the above model with , then generate each unit norm whose angle with , is sampled from . Figure 3
b shows as variance between patches becomes smaller, we obtain faster convergence rate, which coincides with Theorem3.1.
We also test whether SGD can learn a filter on real world data. Here we choose MNIST data and generate labels using two filters. One is random filter where each entry is sampled from a standard Gaussian distribution (Figure 3(a)) and the other is a Gabor filter (Figure 3(b)). Figure 3a and Figure 3
c show convergence rates of SGD with different initializations. Here, better initializations give faster rates, which coincides our theory. Note that here we report the relative loss, logarithm of squared error divided by the square of mean of data points instead of the difference between learned filter and true filter because we found SGD often cannot converge to the exact filter but rather a filter with near zero loss. We believe this is because the data are approximately lying in a low dimensional manifold in which the learned filter and the true filter are equivalent. To justify this conjecture, we try to interpolate the learned filter and the true filter linearly and the result filter has similar low loss (c.f. Figure5). Lastly, we visualize the true filters and the learned filters in Figure 4 and we can see that the they have similar patterns.
5 Conclusions and Future Works
In this paper we provide the first recovery guarantee of (stochastic) gradient descent algorithm with random initialization for learning a convolution filter when the input distribution is not Gaussian. Our analyses only used the definition of ReLU and some mild structural assumptions on the input distribution. Here we list some future directions.
One possibility is to extend our result to deeper and wider architectures. Even for two-layer fully-connected network, the convergence of (stochastic) gradient descent with random initialization is not known. Existing results either requires sufficiently good initialization (Zhong et al., 2017) or relies on special architecture (Li & Yuan, 2017). However, we believe the insights from this paper is helpful to understand the behaviors of gradient-based algorithms in these settings.
Another direction is to consider the agnostic setting, where the label is not equal to the output of a neural network. This will lead to different dynamics of (stochastic) gradient descent and we may need to analyze the robustness of the optimization procedures. This problem is also related to the expressiveness of the neural network (Raghu et al., 2016) where if the underlying function is not equal bot is close to a neural network. We believe our analysis can be extend to this setting.
The authors would like to thank Hanzhang Hu, Tengyu Ma, Yuanzhi Li, Jialei Wang and Kai Zhong for useful discussions.
- Blum & Rivest (1989) Avrim Blum and Ronald L Rivest. Training a 3-node neural network is np-complete. In Advances in neural information processing systems, pp. 494–501, 1989.
- Brutzkus & Globerson (2017) Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.
- Choromanska et al. (2015) Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192–204, 2015.
- Dauphin et al. (2016) Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083, 2016.
- Du et al. (2017) Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Barnabas Poczos, and Aarti Singh. Gradient descent can take exponential time to escape saddle points. arXiv preprint arXiv:1705.10412, 2017.
- Freeman & Bruna (2016) C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.
- Gautier et al. (2016) Antoine Gautier, Quynh N Nguyen, and Matthias Hein. Globally optimal training of generalized polynomial neural networks with nonlinear spectral methods. In Advances in Neural Information Processing Systems, pp. 1687–1695, 2016.
- Ge et al. (2015) Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pp. 797–842, 2015.
- Goel & Klivans (2017) Surbhi Goel and Adam Klivans. Learning depth-three neural networks in polynomial time. arXiv preprint arXiv:1709.06010, 2017.
- Goel et al. (2016) Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the relu in polynomial time. arXiv preprint arXiv:1611.10258, 2016.
- Haeffele & Vidal (2015) Benjamin D Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
- Hardt & Ma (2016) Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
He et al. (2016)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition.
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Janzamin et al. (2015) Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
- Jin et al. (2017) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.
- Kawaguchi (2016) Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
- Lee et al. (2016) Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pp. 1246–1257, 2016.
- Li & Yuan (2017) Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. arXiv preprint arXiv:1705.09886, 2017.
- Lin et al. (2013) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
Lin & Xiao (2014)
Qihang Lin and Lin Xiao.
An adaptive accelerated proximal gradient method and its homotopy
continuation for sparse optimization.
International Conference on Machine Learning, pp. 73–81, 2014.
- Livni et al. (2014) Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pp. 855–863, 2014.
- Mei et al. (2016) Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.
- Milletari et al. (2016) Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 3D Vision (3DV), 2016 Fourth International Conference on, pp. 565–571. IEEE, 2016.
- Nguyen & Hein (2017) Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.
- Raghu et al. (2016) Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.
- Safran & Shamir (2016) Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.
- Sedghi & Anandkumar (2014) Hanie Sedghi and Anima Anandkumar. Provable methods for training neural networks with sparse connectivity. arXiv preprint arXiv:1412.2693, 2014.
- Shalev-Shwartz et al. (2017a) Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In International Conference on Machine Learning, pp. 3067–3075, 2017a.
- Shalev-Shwartz et al. (2017b) Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Weight sharing is crucial to succesful optimization. arXiv preprint arXiv:1706.00687, 2017b.
- Shamir (2016) Ohad Shamir. Distribution-specific hardness of learning neural networks. arXiv preprint arXiv:1609.01037, 2016.
- Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Šíma (2002) Jiří Šíma. Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709–2728, 2002.
- Soltanolkotabi (2017) Mahdi Soltanolkotabi. Learning relus via gradient descent. arXiv preprint arXiv:1705.04591, 2017.
- Soltanolkotabi et al. (2017) Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.
Szegedy et al. (2017)
Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi.
Inception-v4, inception-resnet and the impact of residual connections on learning.In AAAI, pp. 4278–4284, 2017.
- Tian (2017) Yuandong Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560, 2017.
- Zhang et al. (2015) Yuchen Zhang, Jason D Lee, Martin J Wainwright, and Michael I Jordan. Learning halfspaces and neural networks with random initialization. arXiv preprint arXiv:1511.07948, 2015.
- Zhang et al. (2016) Yuchen Zhang, Jason D Lee, and Michael I Jordan. l1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, pp. 993–1001, 2016.
- Zhong et al. (2017) Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017.
- Zhou & Feng (2017) Pan Zhou and Jiashi Feng. The landscape of deep learning algorithms. arXiv preprint arXiv:1705.07038, 2017.
Appendix A Proofs and Additional Theorems
a.1 Proofs of the Theorem in Section 2
and both terms are non-negative.
Since and (positive-semidefinite), both the first term and one part of the second term are non-negative. The other part of the second term is
Proof of Theorem 2.1.
The assumption on the input distribution ensures when , and when , . Now when gradient descent converges we have . We have the following theorem. By assumption, since and gradient descent only decreases function value, we will not converge to . Note that at any critical points, , from Lemma A.1, we have:
Suppose we are converging to a critical point . There are two cases:
If , then we have, which contradicts with Eqn. 6.
If , without loss of generality, let for some . By the assumption we know . Now the second equation becomes , which contradicts with Eqn. 7.
Therefore we have . ∎
Proof of Theorem 2.2.
Our proof relies on the following simple but crucial observation: if , then
We denote and by the observation we have . Recall the gradient descent dynamics,
Consider the squared distance to the optimal weight
By our analysis in the previous section, the second term is smaller than
where we have used our assumption on the angle. For the third term, we expand it as
Therefore, in summary,
where the first inequality is by our assumption of the step size and second is because and is monotonically decreasing. ∎
Theorem A.1 (Rotational Invariant Distribution).
For any unit norm rotational invariant input distribution, we have .
Proof of Theorem a.1.
Without loss of generality, we only need to focus on the plane spanned by and and suppose . Then
It has two eigenvalues
Therefore, for . ∎
If , then
Note in previous theorem we can integrate angle and radius separately then multiply them together. For Gaussian distribution, we have . The result follows. ∎