In recent years, much effort has been devoted to understanding why neural networks are successfully trained with simple, gradient-based methods, despite the inherent non-convexity of the learning problem. However, our understanding of this is still partial at best.
In this paper, we focus on the simplest possible nonlinear neural network, composed of a single neuron, of the form $\mathbf{x} \mapsto \sigma(\langle \mathbf{w}, \mathbf{x}\rangle)$, where
$\mathbf{w}$ is the parameter vector and $\sigma$ is some fixed non-linear activation function. Moreover, we consider a realizable setting, where the inputs $\mathbf{x}$ are sampled from some distribution $\mathcal{D}$, the target values are generated by some unknown target neuron $\mathbf{x} \mapsto \sigma(\langle \mathbf{v}, \mathbf{x}\rangle)$ (possibly corrupted by independent zero-mean noise), and we wish to train our neuron with respect to the squared loss. Mathematically, this boils down to minimizing the following objective function:
$$F(\mathbf{w}) \;=\; \frac{1}{2}\,\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\left(\sigma(\langle\mathbf{w},\mathbf{x}\rangle)-\sigma(\langle\mathbf{v},\mathbf{x}\rangle)\right)^2\right]. \tag{1}$$
For this problem, we are interested in the performance of gradient-based methods, which are the workhorse of modern machine learning systems. These methods initialize $\mathbf{w}$ randomly, and proceed by taking (generally stochastic) gradient steps w.r.t. $F(\mathbf{w})$. If we hope to explain the success of such methods on complicated neural networks, it seems reasonable to expect a satisfying explanation for their convergence on single neurons.
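To make the setup concrete, here is a minimal Monte Carlo sketch of the objective in Eq. (1). The Gaussian input distribution, the $1/2$ factor, and all numerical values are illustrative assumptions, not choices made in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def population_loss(w, v, n_samples=200_000):
    """Monte Carlo estimate of F(w) = (1/2) E[(relu(<w,x>) - relu(<v,x>))^2],
    with x ~ N(0, I) standing in for the distribution D (an assumption)."""
    X = rng.standard_normal((n_samples, len(w)))
    return 0.5 * np.mean((relu(X @ w) - relu(X @ v)) ** 2)

v = np.array([1.0, 0.0, 0.0])            # unknown target neuron (unit norm)
w = np.array([0.5, 0.5, 0.0])            # our current parameter vector
loss_at_target = population_loss(v, v)   # exactly 0: the target is a global minimum
loss_elsewhere = population_loss(w, v)   # strictly positive away from the target
print(loss_at_target, loss_elsewhere)
```

Gradient methods then attempt to drive this population loss to zero by following its (stochastic) gradient.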
Although the learning of single neurons was studied in a number of papers (see the related work section below for more details), the existing analyses all suffer from one or more limitations: Either they apply to a specific distribution
$\mathcal{D}$, which is convenient to analyze but not very practical (such as a standard Gaussian distribution); or they apply to gradient methods only with a specific initialization (rather than a standard random one); or they require smoothness and strict monotonicity conditions on the activation function $\sigma$
(which excludes, for example, the common ReLU function $\sigma(z) = \max\{0, z\}$
). However, a bit of experimentation strongly suggests that none of these assumptions is really necessary for standard gradient methods to succeed on this simple problem. Thus, our understanding of this problem is probably still incomplete.
The goal of this paper is to study to what extent the limitations above can be relaxed, with the following contributions:
We begin by asking whether positive results are possible without any explicit assumptions on the distribution or the activation (other than, say, bounded support and Lipschitz continuity). Although this seems reasonable at first glance, we show in Sec. 3 that unfortunately, this is not the case: For the ReLU activation function, there are bounded distributions on which gradient descent will fail to optimize Eq. (1) with probability exponentially close to $1$. Moreover, even when $\mathcal{D}$ is a standard Gaussian, there are Lipschitz activation functions on which gradient methods will likely fail.
Motivated by the above, we ask whether it is possible to prove positive results with mild assumptions on the distribution and activation function, which do not exclude the ReLU function and go beyond a standard Gaussian distribution. In Sec. 4, we prove a key technical result, which implies that if the distribution is sufficiently “spread” and the activation function satisfies a weak monotonicity condition (satisfied by ReLU and all standard activation functions), then $\langle \nabla F(\mathbf{w}), \mathbf{w} - \mathbf{v}\rangle$ is positive in most of the domain. This implies that an exact gradient step with a sufficiently small step size will bring us closer to $\mathbf{v}$ in “most” places. Building on this result, we prove in Sec. 5
a constant-probability convergence guarantee for several variants of gradient methods (gradient descent, stochastic gradient descent, and gradient flow) with random initialization.
In Sec. 6, we consider more specifically the case where $\mathcal{D}$ is any spherically symmetric distribution (which includes the standard Gaussian as a special case), and $\sigma$ is the ReLU activation function, and show that the convergence results can be made to hold with high probability. As we discuss later on, the case of the ReLU function and a standard Gaussian distribution was also considered in [20, 21], but that analysis crucially relied on initialization at the origin and a Gaussian distribution, whereas our results apply to more generic initialization schemes and distributions.
A natural question arising from these results is whether a high-probability result can be proved for non-spherically symmetric distributions. We study this empirically in Subsection 6.2, and show that perhaps surprisingly, this cannot be done with standard potential-based methods (involving the angle or distance to the target $\mathbf{v}$), already when we consider unit-variance Gaussian distributions with a non-zero mean.
Overall, we hope our work contributes to a better understanding of the dynamics of gradient methods on simple neural networks, and suggests some natural avenues for future research.
1.1 Related Work
We begin by emphasizing that the problem of learning a single target neuron is not inherently hard: Indeed, it can be efficiently performed with minimal assumptions, using the Isotron algorithm and its variants (Kalai and Sastry , Kakade et al. ). Also, other algorithms exist for even more complicated networks or more general settings, under certain assumptions (e.g., Goel et al. , Janzamin et al. ). However, these are non-standard algorithms, whereas our focus here is on standard gradient methods.
For this setting, an important positive result was provided in Mei et al. , showing that gradient descent on the empirical risk function (with examples sampled i.i.d. from $\mathcal{D}$, and the sample size sufficiently large) successfully yields a good approximation of $\mathbf{v}$. However, the analysis requires $\sigma$ to be strictly monotonic, and to have uniformly bounded derivatives up to the third order. This excludes standard activation functions such as the ReLU, which are neither strictly monotonic nor everywhere differentiable. Indeed, assuming that the activation is strictly monotonic makes the analysis much easier, as we show later on in Thm. 3.2. A related analysis under strict monotonicity conditions is provided in Oymak and Soltanolkotabi .
In the landmark papers Soltanolkotabi  and Soltanolkotabi et al. , the authors studied the setting where $\sigma$ is the ReLU function, and gradient descent or stochastic gradient descent is performed on the empirical risk function, where the inputs are sampled from a standard Gaussian distribution. However, that analysis is specific to the Gaussian distribution, and crucially relied on initialization at precisely $\mathbf{w}_0 = \mathbf{0}$, as well as a certain assumption on how the derivative of the ReLU function is computed at $0$. In more detail, one imposes the convention that even though the ReLU function is not differentiable at $0$, $\sigma'(0)$ is taken to be some fixed positive number, so that the gradient of the population objective at $\mathbf{w} = \mathbf{0}$ equals
Assuming $\sigma'(0) > 0$, we get that the gradient is non-zero and proportional to $\mathbb{E}_{\mathbf{x}}\left[\sigma(\langle \mathbf{v}, \mathbf{x}\rangle)\mathbf{x}\right]$. For a Gaussian distribution (and more generally, spherically symmetric distributions), this turns out to be proportional to $\mathbf{v}$, so that an exact gradient step from $\mathbf{w}_0 = \mathbf{0}$ will lead us precisely in the direction of the target parameter vector $\mathbf{v}$. As a result, if we calculate a sufficiently precise approximation of this direction from a random sample, we can get arbitrarily close to $\mathbf{v}$ in a single iteration (see Soltanolkotabi et al. [21, Remark 3.2] for a discussion of this). Unfortunately, this unique behavior is specific to initialization at $\mathbf{0}$ with a certain convention about $\sigma'(0)$ (note that even locally around $\mathbf{0}$, the gradient may not approximate this direction, since it is generally discontinuous around $\mathbf{0}$). Thus, although the analysis is important and insightful, it is difficult to apply more generally.
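The claim that an exact gradient step from the origin points at the target is easy to probe numerically. The sketch below (with an assumed convention $\sigma'(0) = 1$ and Gaussian inputs; all values illustrative) estimates the population gradient at $\mathbf{w} = \mathbf{0}$ by Monte Carlo and checks that its negation is almost perfectly aligned with $\mathbf{v}$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 500_000
v = np.zeros(d)
v[0] = 1.0                                   # unit-norm target vector

# At w = 0, the prediction relu(<w, x>) is identically 0, so with the
# convention relu'(0) = c > 0, the population gradient becomes
#   grad F(0) = -c * E[ relu(<v, x>) * x ].
c = 1.0
X = rng.standard_normal((n, d))              # x ~ N(0, I) (assumed distribution)
grad0 = -c * np.mean(np.maximum(X @ v, 0.0)[:, None] * X, axis=0)

# Cosine of the angle between -grad F(0) and v: should be essentially 1.
cos = (-grad0) @ v / np.linalg.norm(grad0)
print(cos)
```

For spherically symmetric inputs, the orthogonal components of the estimate vanish in expectation, which is exactly why the negative gradient at the origin singles out the direction of $\mathbf{v}$.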
A line of recent works established the effectiveness of gradient methods in solving non-convex optimization problems with a strict saddle property, which implies that all near-stationary points with nearly positive definite Hessians are close to global minima (see Jin et al. , Ge et al. , Sun et al. ). A relevant example is phase retrieval, which actually fits our setting with $\sigma$ being the quadratic function $\sigma(z) = z^2$ (Sun et al. ). However, these results can only be applied to smooth problems, where the objective function is twice differentiable with Lipschitz-continuous Hessians (excluding, for example, problems involving the ReLU activation function). An interesting recent exception is the work of Tan and Vershynin , which considered a particular non-smooth activation. However, their results are specific to that activation, and assume a specific input distribution (uniform on a scaled origin-centered sphere). In contrast, our focus here is on more general families of distributions and activations.
Brutzkus and Globerson  show that gradient descent learns a simple convolutional network with non-overlapping patches, when the inputs have a standard Gaussian distribution. Similar to the analysis in our paper, they rely on showing that the angle between the learned parameter vector and a target parameter vector monotonically decreases with gradient methods. However, the network architecture studied is different than ours, and their proof heavily relies on the symmetry of the Gaussian distribution.
Less directly related to our setting, a popular line of recent works showed how gradient methods on highly over-parameterized neural networks can learn various target functions in polynomial time (e.g., Allen-Zhu et al. , Daniely , Arora et al. , Cao and Gu ). However, as pointed out in Yehudai and Shamir , this type of analysis cannot be used to explain learnability of single neurons.
We use bold-faced letters to denote vectors. For a vector $\mathbf{w}$, we let $w_i$ denote its $i$-th coordinate, and $\|\mathbf{w}\|$ its Euclidean norm. We denote the ReLU function by $[z]_+ := \max\{0, z\}$, and by $\mathbf{1}$ we denote the all-ones vector $(1, \ldots, 1)$. Given vectors $\mathbf{w}, \mathbf{v}$, we let $\theta(\mathbf{w}, \mathbf{v})$ denote the angle between $\mathbf{w}$ and $\mathbf{v}$. We use $\Pr$ to denote probability, and $\mathbb{1}\{A\}$ to denote the indicator function, which equals $1$ if $A$ holds and $0$ otherwise.
Unless stated otherwise, we assume that the target vector $\mathbf{v}$ in Eq. (1) has unit norm, $\|\mathbf{v}\| = 1$.
When $\sigma$ is differentiable, the gradient of the objective function in Eq. (1) is
$$\nabla F(\mathbf{w}) \;=\; \mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\left(\sigma(\langle\mathbf{w},\mathbf{x}\rangle)-\sigma(\langle\mathbf{v},\mathbf{x}\rangle)\right)\sigma'(\langle\mathbf{w},\mathbf{x}\rangle)\,\mathbf{x}\right]. \tag{2}$$
When $\sigma$ is not differentiable, we will still assume that it is differentiable almost everywhere (up to a finite number of points), and that at every point of non-differentiability, there are well-defined left and right derivatives. In that case, practical implementations of gradient methods fix $\sigma'$ at such a point to be some number between its left and right derivatives (for example, for the ReLU function, $\sigma'(0)$ is defined as some number in $[0, 1]$). Following that convention, the expected gradient used by these methods still corresponds to Eq. (2), and we will follow the same convention here.
In our paper, we focus on the following three standard gradient methods:
Gradient Flow: We initialize at some point $\mathbf{w}(0)$, and for every $t \geq 0$, we set $\mathbf{w}(t)$ to be the solution of the differential equation:
$$\frac{d\mathbf{w}(t)}{dt} \;=\; -\nabla F(\mathbf{w}(t)).$$
This can be thought of as a continuous form of gradient descent, where we consider an infinitesimal learning rate.
Gradient Descent: We initialize at some point $\mathbf{w}_0$ and set a fixed learning rate $\eta > 0$. At each iteration $t$, we do a single step in the negative direction of the gradient:
$$\mathbf{w}_{t+1} \;=\; \mathbf{w}_t - \eta \nabla F(\mathbf{w}_t).$$
Stochastic Gradient Descent (SGD): We initialize at some point $\mathbf{w}_0$ and set a fixed learning rate $\eta > 0$. At each iteration $t$, we sample an input $\mathbf{x}_t \sim \mathcal{D}$, and calculate a stochastic gradient:
$$\hat{\nabla} F(\mathbf{w}_t) \;=\; \left(\sigma(\langle \mathbf{w}_t, \mathbf{x}_t\rangle) - \sigma(\langle \mathbf{v}, \mathbf{x}_t\rangle)\right) \sigma'(\langle \mathbf{w}_t, \mathbf{x}_t\rangle)\, \mathbf{x}_t, \tag{3}$$
which satisfies $\mathbb{E}_{\mathbf{x}_t}[\hat{\nabla} F(\mathbf{w}_t)] = \nabla F(\mathbf{w}_t)$, and do a single step in the negative direction of the stochastic gradient:
$$\mathbf{w}_{t+1} \;=\; \mathbf{w}_t - \eta \hat{\nabla} F(\mathbf{w}_t).$$
Note that here we consider SGD on the population loss, which is different from SGD on a fixed training set. We also note that our proof techniques easily extend to mini-batch SGD, where the stochastic gradient is taken to be the average of several stochastic gradients w.r.t. inputs sampled i.i.d. from $\mathcal{D}$. However, for simplicity we will focus on the single-sample case.
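As an illustration of the SGD variant above, the following sketch runs single-sample SGD on the population loss of a ReLU neuron with Gaussian inputs. All choices here (the Gaussian distribution, the initialization scale, the derivative convention relu'(0) = 0, the step size) are illustrative assumptions. In the realizable noiseless setting, the stochastic gradient vanishes exactly at $\mathbf{w} = \mathbf{v}$, so the iterates can converge to the target itself:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
v = np.zeros(d)
v[0] = 1.0                               # unit-norm target
w = 0.1 * rng.standard_normal(d)         # small random initialization (illustrative)

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return float(z > 0)                  # convention: relu'(0) = 0

eta, T = 0.05, 20_000
dist0 = np.linalg.norm(w - v)
for _ in range(T):
    x = rng.standard_normal(d)           # fresh sample from D at every step
    # stochastic gradient as in Eq. (3)
    g = (relu(w @ x) - relu(v @ x)) * relu_grad(w @ x) * x
    w = w - eta * g
print(dist0, "->", np.linalg.norm(w - v))
```

With a fixed training set instead, SGD would only converge to a minimizer of the empirical risk, which is why the distinction above matters.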
3 Assumptions on the Distribution and Activation are Necessary
The main concern of this paper is under what assumptions can a single neuron be provably learned. In this section, we show that learning even a single neuron can be hopeless, unless we make non-trivial assumptions on both the input distribution and the activation function.
3.1 Assumptions on the Input Distribution are Necessary
We begin by asking whether Eq. (1) can be minimized by gradient methods in a distribution-free manner (with no assumptions beyond, say, bounded support), as in learning problems where the population objective is convex. Perhaps surprisingly, we show that the answer is negative, even if we consider specifically the ReLU activation, and a distribution supported on the unit Euclidean ball. This is based on the following key result:
Suppose that is the ReLU function, and assume that is sampled from a product distribution (namely, each is sampled independently from some distribution ). Then there exists a distribution over the inputs, supported on , and with such that the following holds: With probability at least over the initialization point sampled from , if we run gradient flow, gradient descent or stochastic gradient descent, then for every we have (for gradient flow ).
For each distribution , let . We define the following dataset:
where is the standard -th unit vector, and if and
otherwise. Denote the random variable and . We have that are independent, , and . Using Hoeffding’s inequality, we get that w.p. it holds that , which means that there are at least indices such that .
We take $\mathcal{D}$ to be the uniform distribution on this dataset. Using Eq. (2) and the fact that $\sigma$ is the ReLU function, we get
In particular, for every index for which we have that .
Next, we define with (note that ). We condition on the event above – namely, that there are indices for which – and let these indices be . Under this event, for at initialization we have that
We will now show that for every index , using gradient methods will not change the -th coordinate of from its initial value. Let be such a coordinate. For gradient flow we have that , hence . For gradient descent we have that , hence . For stochastic gradient descent, at each iteration we sample from the distribution defined in Thm. 3.1, and define the stochastic gradient as in Eq. (3). If then hence , otherwise, if then hence . In both cases the -th coordinate of the stochastic gradient is zero, hence . Thus, we have shown that for every iteration for gradient descent or SGD we have that (and for gradient flow, for every time , we have ).
We end by noting that although the distribution defined here is discrete over a finite dataset, the same argument can also be made for a non-discrete distribution, by considering a mixture of smooth distributions concentrated around the support points of the discrete distribution above. ∎
The theorem above applies to any product initialization scheme, which includes most standard initializations used in practice (e.g., the standard Xavier initialization ). The theorem implies that it is impossible to prove positive guarantees in our setting without distributional assumptions on the inputs. Inspecting the construction, the source of the problem (at least for the ReLU neuron) appears to be the fact that the distribution is supported on a small number of well-separated regions. Thus, in our positive results, we will assume that the distribution is sufficiently “spread”, as formalized later on in Sec. 4.
3.2 Assumptions on the Activation Function
We now turn to discuss the activation function, explaining why even if the activation is Lipschitz and the input distribution is a standard Gaussian, this is likely insufficient for positive guarantees in our setting.
In particular, let us consider the case that is a -Lipschitz periodic function. Then Theorem in  implies that for a large family of input distributions on (including a standard Gaussian), if we assume that the vector in the target neuron is a uniformly distributed unit vector, then for any fixed ,
This implies that the gradient at $\mathbf{w}$ is virtually independent of the underlying target vector $\mathbf{v}$: In fact, it is extremely concentrated around a fixed value which does not depend on $\mathbf{v}$. Theorem 4 from  goes further and shows that for any gradient method, even an exponentially small amount of arbitrary noise will be enough to make its trajectory (after a bounded number of iterations) independent of $\mathbf{v}$, in which case it cannot possibly succeed in this setting. We note that their result is even more general, as they consider a more general class of target functions, so our setting can be seen as a special case.
When considering a standard Gaussian distribution, the above argument can be easily extended to activations which are periodic only in a segment of length around the origin. This can be seen by extending the activation to which is periodic on , applying the above argument to it, and noting that the probability mass outside of a ball of radius is exponentially small (for example, see  Proposition 4.2, where they consider an activation which is a finite sum of ReLU functions and periodic in a segment of length ).
The above discussion motivates us to impose some condition on the activation function which excludes periodic functions. One such mild assumption, which we will adopt in the rest of the paper (and which corresponds to virtually all activations used in practice), is that the activation is monotonically increasing. Before continuing, we remark that under a slight strengthening of this assumption, namely that the activation is strictly monotonically increasing, it is easy to prove a positive guarantee, as evidenced by Thm. 3.2. However, this excludes popular activations such as the ReLU function.
Assume for some , and the following for some :
$\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\mathbf{x}\mathbf{x}^\top\right]$ is positive definite with minimal eigenvalue $\lambda > 0$.
Then starting from any point , after doing iterations of gradient descent with learning rate , we have that:
The proof can be found in Appendix A, and can be easily generalized to apply also to gradient flow and SGD. The above shows that if we assume strict monotonicity of the activation, then under very mild assumptions on the data, gradient descent will converge exponentially fast to $\mathbf{v}$. In the rest of the paper, however, we focus on results which only require weak monotonicity.
4 Under Mild Assumptions the Gradient Points in a Good Direction
Motivated by the results in Sec. 3, we use the following assumptions on the distribution and activation:
The following holds for some fixed $\alpha, \beta, R > 0$:
The distribution $\mathcal{D}$ satisfies the following: For any two linearly independent vectors $\mathbf{w}, \mathbf{v}$, let $\mathcal{D}_{\mathbf{w},\mathbf{v}}$ denote the marginal distribution of $\mathcal{D}$ on the subspace spanned by $\mathbf{w}, \mathbf{v}$ (as a distribution over $\mathbb{R}^2$). Then any such marginal distribution has a density function $p$ satisfying $p(\mathbf{z}) \geq \alpha$ for all $\mathbf{z}$ with $\|\mathbf{z}\| \leq R$.
$\sigma$ is monotonically increasing, and satisfies $\sigma'(z) \geq \beta$ for all $z \in (0, R]$.
The distributional assumption requires that in every $2$-dimensional subspace, the marginal distribution is sufficiently “spread” in any direction close to the origin. For example, for a standard Gaussian distribution, this is true for suitable constants regardless of the dimension (as the marginal distribution of a standard Gaussian on a $2$-dimensional subspace is a standard $2$-dimensional Gaussian). Also, for any distribution, the assumption can be made to hold by mixing in a bit of a Gaussian or uniform distribution, when this is possible. The activation assumption covers ReLU or ReLU-like activations (e.g., leaky-ReLU, Softplus). It also covers sigmoid and tanh activations, for which the derivative in any bounded interval is lower bounded by a positive constant.
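As a quick sanity check of the distributional assumption (with assumed, illustrative constants), the sketch below verifies empirically that the $2$-dimensional marginal of a standard Gaussian is itself a standard $2$-d Gaussian, and computes an explicit density lower bound over a ball of radius $R$ around the origin:

```python
import numpy as np

rng = np.random.default_rng(3)

# 2-d marginal of N(0, I_10) on a random 2-d subspace: empirically a standard Gaussian.
X = rng.standard_normal((200_000, 10))
Q, _ = np.linalg.qr(rng.standard_normal((10, 2)))   # orthonormal basis of the subspace
Z = X @ Q
print(np.cov(Z.T))                                  # approximately the 2x2 identity

# The standard 2-d Gaussian density at radius r is exp(-r^2/2) / (2*pi), so on the
# ball of radius R it is lower bounded by its value at the boundary:
R = 1.0
alpha = np.exp(-R**2 / 2) / (2 * np.pi)
print(alpha)                                        # one valid "spread" constant for this R
```

The same computation works for any ambient dimension, which is why the assumption holds for a standard Gaussian independently of the dimension.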
With these assumptions, we prove the following key technical result, which implies that the gradient of the objective has a positive correlation with the direction of the global minimum (at $\mathbf{w} = \mathbf{v}$), if the angle between $\mathbf{w}$ and $\mathbf{v}$ and the norm of $\mathbf{w}$ are not too large:
Under Assumptions 4.1, for any $\mathbf{w}$ such that $\theta(\mathbf{w}, \mathbf{v}) \leq \pi - \delta$ and $\|\mathbf{w}\| \leq R$ for some $\delta > 0$, it holds that
$$\langle \nabla F(\mathbf{w}), \mathbf{w} - \mathbf{v}\rangle \;\geq\; \lambda \|\mathbf{w} - \mathbf{v}\|^2,$$
where $\lambda > 0$ depends on $\alpha, \beta, R$ and $\delta$.
The theorem implies that for suitable values of the step size, gradient methods (which move in the negative gradient direction) will decrease the distance from $\mathbf{v}$. When this behavior occurs, it is easy to show that gradient methods succeed in learning the target neuron, as in the previous Thm. 3.2 for the strictly monotonic case. The main challenge is to guarantee that the trajectory of the algorithm will indeed never violate the theorem’s conditions, in particular that the angle between $\mathbf{w}$ and $\mathbf{v}$ indeed remains bounded away from $\pi$ (and in fact, later on we will show that such a guarantee is not always possible).
The formal proof of the theorem can be found in Appendix B, but its intuition can be described as follows: We want to lower bound the term
$$\langle \nabla F(\mathbf{w}), \mathbf{w} - \mathbf{v}\rangle \;=\; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\left(\sigma(\langle \mathbf{w}, \mathbf{x}\rangle) - \sigma(\langle \mathbf{v}, \mathbf{x}\rangle)\right) \sigma'(\langle \mathbf{w}, \mathbf{x}\rangle)\, \langle \mathbf{w} - \mathbf{v}, \mathbf{x}\rangle\right].$$
Using the assumption on $\sigma$, the term inside the above expectation is nonnegative for every $\mathbf{x}$. This is because $\sigma' \geq 0$, and for any monotonic function $\sigma$ we have $(\sigma(a) - \sigma(b))(a - b) \geq 0$. Thus, viewing the expectation as an integral over a nonnegative function, we can lower bound it by taking the integral over a suitable smaller set.
The resulting integral depends only on dot products of with and . Thus, it is enough to consider the marginal distribution on the -dimensional plane spanned by and .
By the assumption on the distribution, the density function of this marginal distribution is at least $\alpha$ on any $\mathbf{z}$ such that $\|\mathbf{z}\| \leq R$. This means we can lower bound the integral above by integrating over a uniform distribution on this set and multiplying by $\alpha$.
In total, the expression above is lower bounded by a $2$-dimensional integral with uniform measure and with no distribution-dependent terms, over the set $\{\mathbf{z} : \|\mathbf{z}\| \leq R\}$:
where $\tilde{\mathbf{w}}, \tilde{\mathbf{v}}$ are the $2$-dimensional vectors representing $\mathbf{w}, \mathbf{v}$ on the $2$-dimensional plane spanned by them. We lower bound this integral by a term that scales with the angle $\theta(\mathbf{w}, \mathbf{v})$.
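The conclusion of this argument is easy to probe numerically. The sketch below (ReLU activation, Gaussian inputs, and an arbitrarily chosen point $\mathbf{w}$: all illustrative assumptions) estimates $\langle \nabla F(\mathbf{w}), \mathbf{w} - \mathbf{v}\rangle$ by Monte Carlo and checks that it is positive:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 4, 400_000
v = np.zeros(d)
v[0] = 1.0

def relu(z):
    return np.maximum(z, 0.0)

def grad_F(w, X):
    """Monte Carlo estimate of grad F(w) for the ReLU activation, as in Eq. (2)."""
    s = X @ w
    return np.mean(((relu(s) - relu(X @ v)) * (s > 0))[:, None] * X, axis=0)

X = rng.standard_normal((n, d))
w = np.array([0.6, 0.8, 0.0, 0.0])    # moderate angle to v, comparable norm
corr = grad_F(w, X) @ (w - v)
print(corr)                            # positive: a gradient step decreases the distance to v
```

Repeating this for other choices of $\mathbf{w}$ with bounded norm and angle gives a numerical feel for the region where the theorem's correlation inequality holds.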
Remark 4.3 (Implication on Optimization Landscape).
The proof of the theorem can be shown to imply that for the ReLU activation, under the theorem’s conditions, the only stationary point other than the global minimum must be at the origin. In particular, the proof implies that any stationary point (with $\mathbf{w} \neq \mathbf{v}$) must lie on the ray $\{-c\mathbf{v} : c \geq 0\}$. For the ReLU activation (which satisfies $\sigma(cz) = c\,\sigma(z)$ for any $c \geq 0$ and $z$), the gradient at such points equals
This implies that the gradient might be zero only if either $\mathbf{w} = \mathbf{0}$ (i.e., the origin), or $\langle \mathbf{v}, \mathbf{x}\rangle \geq 0$ with probability $1$, which cannot happen under Assumptions 4.1.
5 Convergence with Constant Probability Under Mild Assumptions
In this section, we use Thm. 4.2 in order to show that under some assumption on the initialization of $\mathbf{w}_0$, gradient methods will be able to learn a single neuron with at least constant probability. Note that the loss surface of $F$ is not convex, and as explained in Remark 4.3, there may be a stationary point at the origin. This stationary point can cause difficulties, as it is not obvious how to control the angle between $\mathbf{w}$ and $\mathbf{v}$ close to the origin (which is required for Thm. 4.2 to apply). But, if we assume a suitable condition on $\mathbf{w}_0$ at initialization, then we are not close to this stationary point, and we can ensure that it will remain that way throughout the optimization process. One such initialization, which guarantees this with at least constant probability, is a zero-mean Gaussian initialization with small enough variance:
If we sample $\mathbf{w}_0$ from a zero-mean Gaussian distribution with small enough variance, then with at least constant probability we have that
In order to bound each gradient step we will need these additional assumptions:
The following holds for some positive $B$:
$\|\mathbf{x}\| \leq B$ almost surely over $\mathbf{x} \sim \mathcal{D}$
With these assumptions, we show convergence for gradient flow, gradient descent and stochastic gradient descent:
(Gradient Flow) Let , and assume that . Running gradient flow, then for every time we have
(Gradient Descent) Let , and assume that . Let for and . Running gradient descent with step size , we have that for every , after iterations:
(Stochastic Gradient Descent) Let , and assume that . Let where and . Then w.p , after iterations we have that:
Combined with Lemma 5.1, Thm. 5.3 shows that with proper initialization, gradient flow, gradient descent, and stochastic gradient descent successfully minimize Eq. (1) with at least constant probability, and for the first two algorithms, the convergence rate is exponential.
The full proof of the theorem can be found in Appendix C, and its intuition for gradient flow and gradient descent is as described above (namely, that if the angle $\theta(\mathbf{w}, \mathbf{v})$ starts bounded away from $\pi$, it will stay that way, and $\|\mathbf{w} - \mathbf{v}\|$ will just continue to shrink over time, using Thm. 4.2). The proof for stochastic gradient descent is much more delicate. This is because the update at each iteration is noisy, so we need to ensure we remain in the region where Thm. 4.2 is applicable. Here we give a short proof intuition:
Assume that at initialization, $\theta(\mathbf{w}_0, \mathbf{v})$ is bounded away from $\pi$. In order for the analysis to work, we need this to remain true throughout the algorithm’s run. Otherwise, we will not be able to use Thm. 4.2 with a constant angle, and we may also get close to the stationary point at the origin. Thus, we show (using a maximal version of Azuma’s inequality) that if the step size is small enough, and we take at most a bounded number of gradient steps, then w.h.p. for every iteration $t$:
The next step is to show that if the conditions of Thm. 4.2 hold at iteration $t$, then the distance $\|\mathbf{w}_t - \mathbf{v}\|$ shrinks by an appropriate factor. This is done using Thm. 4.2, as in the gradient descent case, but note that here this only holds in expectation over the sample selected at iteration $t$.
Next, we use Azuma’s inequality again on a block of iterations, with a small enough step size, to show that w.h.p. the iterate does not move too far away from its expected trajectory. This shows that w.h.p., after a single epoch of iterations, $\|\mathbf{w}_t - \mathbf{v}\|$ shrinks by a constant factor.
We then repeat this analysis across epochs (each consisting of iterations), and use a union bound. Overall, we get that after sufficiently many iterations, with high probability, the iterates get as close as we want to zero.
We note that the optimization analysis for stochastic gradient descent is inspired by the analysis in 
for the different non-convex problem of principal component analysis (PCA), which also attempts to avoid a problematic stationary point. An interesting question for future research is to understand to what extent the polynomial dependencies in the problem parameters can be improved.
Our assumption that the data is bounded ($\|\mathbf{x}\| \leq B$ almost surely) is made for simplicity. For the gradient descent case, it is easy to verify that the proof only requires the fourth moment of the data to be bounded by some constant, which ensures that the gradients of the objective function used by the algorithm are bounded. For SGD, it is enough to assume that the input distribution is sub-Gaussian. The proof proceeds in the same manner, by using a variant of Azuma’s inequality for martingales with sub-Gaussian tails.
6 High-Probability Convergence
The results in the previous section hold under mild conditions, but unfortunately only guarantee a constant probability of success. In this section, we consider the possibility of proving guarantees which hold with high probability (arbitrarily close to $1$). On the one hand, in Subsection 6.1, we provide such a result for the ReLU activation, assuming the input distribution is spherically symmetric. On the other hand, in Subsection 6.2, we point out non-trivial obstacles to extending such a result to non-spherically symmetric distributions. Overall, we believe that getting high-probability convergence guarantees for non-spherically symmetric distributions is an interesting avenue for future research.
6.1 Convergence for Spherically Symmetric Distributions
In this subsection, we make the following assumptions:
$\mathbf{x}$ has a spherically symmetric distribution. That is, for every orthogonal matrix $M$, $M\mathbf{x}$ has the same distribution as $\mathbf{x}$.
The activation function $\sigma$ is the standard ReLU function, $\sigma(z) = \max\{0, z\}$.
These assumptions are significantly stronger than Assumptions 4.1, but allow us to prove a stronger high-probability convergence result. Note that even with these assumptions the loss surface is still not convex, and may contain a spurious stationary point (see Remark 4.3). For simplicity, we will focus on proving the result for gradient flow. The result can then be extended to gradient descent and stochastic gradient descent, along similar lines as in the proof of Thm. 5.3.
The proof strategy in this case is quite different from that of the constant-probability guarantee, and relies on the following key technical result:
If $\mathbf{w}(t) \neq \mathbf{0}$ and $\theta(\mathbf{w}(t), \mathbf{v}) \in (0, \pi)$, then under gradient flow, $\frac{d}{dt}\,\theta(\mathbf{w}(t), \mathbf{v}) < 0$.
The lemma (which relies on the spherical symmetry of the distribution) implies that if we initialize at any point $\mathbf{w}(0)$ with $\theta(\mathbf{w}(0), \mathbf{v}) < \pi$, then the angle between $\mathbf{w}(t)$ and $\mathbf{v}$ is strictly less than $\pi$, and will remain so as long as $\mathbf{w}(t) \neq \mathbf{0}$. As a result, we can apply Thm. 4.2 to prove that $\|\mathbf{w}(t) - \mathbf{v}\|$ decays at an exponential rate. The only potential difficulty is that $\mathbf{w}(t)$ may converge to the potential stationary point at the origin (at which the angle is not well-defined), but fortunately this cannot happen due to the following lemma:
Let and assume that . If then
The lemma can be shown to imply that as long as $\theta(\mathbf{w}(t), \mathbf{v})$ remains bounded away from $\pi$, then $\|\mathbf{w}(t)\|$ cannot decrease below some positive number (as its derivative is positive close enough to zero, and is a continuous function of $t$). The proof idea of both lemmas is based on a technical calculation, where we project the spherically symmetric distribution on the $2$-dimensional subspace spanned by $\mathbf{w}$ and $\mathbf{v}$.
Using the lemmas above, we can get the following convergence guarantee:
Assume we initialize such that , for some and that Assumption 4.1(1) holds. Then running gradient flow, we have for all
We now note that the assumption of the theorem holds with exponentially high probability under standard initialization schemes. For example, if we use a standard Gaussian initialization, then by standard concentration of measure arguments, the initial angle and norm conditions of the theorem hold with probability exponentially close to $1$. As a result, by Thm. 6.4, with probability exponentially close to $1$ over the initialization, $\|\mathbf{w}(t) - \mathbf{v}\|$ decays exponentially for all $t$. The full proof of the theorem can be found in Appendix D.
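The angle decrease underlying this analysis can be observed numerically. The sketch below runs a crude Euler discretization of gradient flow (an illustrative stand-in for the exact ODE; all numerical values are assumptions), with Gaussian (hence spherically symmetric) inputs and an initialization at an obtuse angle to the target, and tracks the angle to $\mathbf{v}$:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3
v = np.zeros(d)
v[0] = 1.0

def relu(z):
    return np.maximum(z, 0.0)

def grad_F(w, n=100_000):
    """Monte Carlo estimate of grad F(w) under spherically symmetric (Gaussian) inputs."""
    X = rng.standard_normal((n, d))
    s = X @ w
    return np.mean(((relu(s) - relu(X @ v)) * (s > 0))[:, None] * X, axis=0)

def angle(a, b):
    c = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(c, -1.0, 1.0))

w = np.array([-0.5, 1.0, 0.5])         # initial angle to v is about 114 degrees
angles = [angle(w, v)]
for _ in range(200):                   # Euler steps approximating gradient flow
    w = w - 0.05 * grad_F(w)
    angles.append(angle(w, v))
print(np.degrees(angles[0]), "->", np.degrees(angles[-1]))
```

The trajectory starts at an obtuse angle yet ends with a much smaller one, consistent with the monotone angle decrease predicted for spherically symmetric inputs.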
If we further assume that the distribution is a standard Gaussian, then it is possible to prove Lemma 6.2 and Lemma 6.3 in a much easier fashion. The reason is that specifically for a standard Gaussian distribution there is a closed-form expression (without the expectation) for the loss and the gradient, see , . We provide the relevant versions of the lemmas, as well as their proofs, in Subsection D.1.
6.2 Non-monotonic Angle Behavior
The results in the previous subsection crucially relied on the fact that at almost any point , the angle decreases. This type of analysis was also utilized in works on related settings (e.g. Brutzkus and Globerson ).
Based on this, it might be tempting to conjecture that this monotonically decreasing angle property (and as a result, high-probability guarantees) can be shown to hold more generally, not just for spherically symmetric distributions. Perhaps surprisingly, we show empirically that this may not be the case, already in the simple setting of a unit-variance Gaussian with a non-zero mean. We emphasize that this does not necessarily mean that gradient methods will not succeed, only that an analysis based on showing monotonic behavior of the relevant geometric quantities will not work in general.
In particular, in Figure 1 we report the result of running gradient descent (with a constant step size) on our objective function in $\mathbb{R}^2$, where the input distribution is a unit-variance Gaussian with a non-zero mean, and we initialize at three different locations. Although the algorithm eventually reaches the global minimum $\mathbf{v}$, the angle between the iterate and $\mathbf{v}$ is clearly non-monotonic, and is actually initially increasing rather than decreasing. Even worse, the angle appears to attain every value in $(0, \pi)$, so it appears that any analysis using angle-based “safe regions” is bound to fail.
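The exact mean, target, and initializations behind Figure 1 are not recoverable from the text, but the experiment is easy to reproduce in spirit. The sketch below (all numerical values are assumed, illustrative choices) runs gradient descent with a non-zero-mean, unit-variance Gaussian input distribution and records the angle to the target along the trajectory; scanning over initializations lets one probe for the non-monotonic angle behavior discussed above:

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([2.0, 0.0])             # assumed non-zero mean of the input distribution
v = np.array([0.0, 1.0])              # assumed unit-norm target

def relu(z):
    return np.maximum(z, 0.0)

def grad_F(w, n=50_000):
    X = mu + rng.standard_normal((n, 2))          # unit-variance Gaussian, mean mu
    s = X @ w
    return np.mean(((relu(s) - relu(X @ v)) * (s > 0))[:, None] * X, axis=0)

w = np.array([1.0, -0.1])             # one assumed initialization (try others as well)
angles = []
for _ in range(800):
    angles.append(np.arccos(np.clip(w @ v / np.linalg.norm(w), -1.0, 1.0)))
    w = w - 0.05 * grad_F(w)
print("final distance to v:", np.linalg.norm(w - v))
print("angle range (deg):", np.degrees(min(angles)), "-", np.degrees(max(angles)))
```

Even when the angle trajectory misbehaves, the iterates still reach the global minimum, which matches the point made above: the failure is of the angle-based proof technique, not of gradient descent itself.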
Overall, we conclude that proving a high-probability convergence guarantee for gradient methods appears to be an interesting open problem, already in the case of unit-variance, non-zero-mean Gaussian input distributions. We leave tackling this problem to future work.
Acknowledgements. This research is supported in part by European Research Council (ERC) grant 754705.
- Allen-Zhu et al.  Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, 2019.
- Arora et al.  S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
- Brutzkus and Globerson  A. Brutzkus and A. Globerson. Globally optimal gradient descent for a convnet with Gaussian inputs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.
- Cao and Gu  Y. Cao and Q. Gu. A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384, 2019.
- Daniely  A. Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.
- Ge et al.  R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
- Glorot and Bengio  X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
- Goel et al.  S. Goel, V. Kanade, A. Klivans, and J. Thaler. Reliably learning the ReLU in polynomial time. arXiv preprint arXiv:1611.10258, 2016.
- Hoeffding  W. Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer, 1994.
- Janzamin et al.  M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
- Jin et al.  C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1724–1732. JMLR.org, 2017.
- Kakade et al.  S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, pages 927–935, 2011.
- Kalai and Sastry  A. T. Kalai and R. Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT. Citeseer, 2009.
- Mei et al.  S. Mei, Y. Bai, and A. Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.
- Oymak and Soltanolkotabi  S. Oymak and M. Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? arXiv preprint arXiv:1812.10004, 2018.
- Safran and Shamir  I. Safran and O. Shamir. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.
- Shamir  O. Shamir. A variant of Azuma’s inequality for martingales with subgaussian tails. arXiv preprint arXiv:1110.2392, 2011.
- Shamir  O. Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In International Conference on Machine Learning, pages 144–152, 2015.
- Shamir  O. Shamir. Distribution-specific hardness of learning neural networks. The Journal of Machine Learning Research, 19(1):1135–1163, 2018.
- Soltanolkotabi  M. Soltanolkotabi. Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems, pages 2007–2017, 2017.
- Soltanolkotabi et al.  M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2019.
- Sun et al.  J. Sun, Q. Qu, and J. Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.
- Sun et al.  J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.
- Tan and Vershynin  Y. S. Tan and R. Vershynin. Online stochastic gradient descent with arbitrary initialization solves non-smooth, non-convex phase retrieval. arXiv preprint arXiv:1910.12837, 2019.
- Yehudai and Shamir  G. Yehudai and O. Shamir. On the power and limitations of random features for understanding neural networks. In Advances in Neural Information Processing Systems, 2019.
Appendix A Proofs from Sec. 3
Proof of Thm. 3.2.
We have that:
where is by monotonicity of (hence always), and is by the assumption that . Next, we bound the gradient :
At iteration we have that:
Using induction over the above proves the theorem.
Appendix B Proofs from Sec. 4
We will first need the following lemma:
Fix some , and let be two vectors in such that for some . Then
It is enough to lower bound
The inner infimum is attained at some such that . This is because does not depend on and , and the volume for which the indicator function inside the integral is non-zero is smallest when the angle is largest. Setting this and switching the order of the infima, we get
When , we note that the set is simply a “pie slice” of radial width out of a ball of radius . Since the expression is invariant to rotating the coordinates, we will consider without loss of generality the set , and the expression above reduces to
where is from the fact that is symmetric around the -axis (namely, if and only if ).
We now note that the set contains the two (disjoint and equally-sized) rectangular sets
where we used the fact that  and therefore . The integral is simply the volume of , and since  and  are disjoint and equally-sized rectangles, this equals twice the volume of , namely . Plugging into the above, we get
where again we used the fact that .
We now turn to prove the theorem:
Proof of Thm. 4.2.
We note that since is monotonically increasing, then for any , and