Learning a Single Neuron with Gradient Methods

01/15/2020 ∙ by Gilad Yehudai, et al. ∙ Weizmann Institute of Science

We consider the fundamental problem of learning a single neuron $\mathbf{x} \mapsto \sigma(\mathbf{w}^\top \mathbf{x})$ using standard gradient methods. As opposed to previous works, which considered specific (and not always realistic) input distributions and activation functions $\sigma(\cdot)$, we ask whether a more general result is attainable, under milder assumptions. On the one hand, we show that some assumptions on the distribution and the activation function are necessary. On the other hand, we prove positive guarantees under mild assumptions, which go beyond those studied in the literature so far. We also point out and study the challenges in further strengthening and generalizing our results.



1 Introduction

In recent years, much effort has been devoted to understanding why neural networks are successfully trained with simple, gradient-based methods, despite the inherent non-convexity of the learning problem. However, our understanding of this is still partial at best.

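In this paper, we focus on the simplest possible nonlinear neural network, composed of a single neuron, of the form $\mathbf{x} \mapsto \sigma(\mathbf{w}^\top \mathbf{x})$, where $\mathbf{w}$ is the parameter vector and $\sigma$ is some fixed non-linear activation function. Moreover, we consider a realizable setting, where the inputs are sampled from some distribution $\mathcal{D}$, the target values are generated by some unknown target neuron $\mathbf{x} \mapsto \sigma(\mathbf{v}^\top \mathbf{x})$ (possibly corrupted by independent zero-mean noise), and we wish to train our neuron with respect to the squared loss. Mathematically, this boils down to minimizing the following objective function:

$$F(\mathbf{w}) := \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\tfrac{1}{2}\left(\sigma(\mathbf{w}^\top \mathbf{x}) - \sigma(\mathbf{v}^\top \mathbf{x})\right)^2\right] \qquad (1)$$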

For this problem, we are interested in the performance of gradient-based methods, which are the workhorse of modern machine learning systems. These methods initialize $\mathbf{w}$ randomly, and proceed by taking (generally stochastic) gradient steps w.r.t. the objective $F$ in Eq. (1). If we hope to explain the success of such methods on complicated neural networks, it seems reasonable to expect a satisfying explanation for their convergence on single neurons.
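As a concrete illustration of the objective, the following minimal sketch estimates the squared-loss objective of Eq. (1) by Monte Carlo sampling. The standard Gaussian inputs, the ReLU activation, the factor of one half, and the names `population_loss`, `v` and `w` are assumptions made for this example rather than choices fixed by the text.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def population_loss(w, v, X, sigma=relu):
    """Monte Carlo estimate of E[0.5 * (sigma(w.x) - sigma(v.x))^2] over the rows of X."""
    residual = sigma(X @ w) - sigma(X @ v)
    return 0.5 * np.mean(residual ** 2)

rng = np.random.default_rng(0)
d, n = 5, 20_000
X = rng.standard_normal((n, d))   # assumed input distribution: standard Gaussian
v = np.ones(d) / np.sqrt(d)       # hypothetical unit-norm target vector
w = rng.standard_normal(d)        # a random candidate parameter vector

loss_at_w = population_loss(w, v, X)
loss_at_v = population_loss(v, v, X)   # zero: the setting is realizable
```

In the noiseless realizable setting the estimate vanishes exactly at the target vector, which is why the global minimum value is zero.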

Although the learning of single neurons was studied in a number of papers (see the related work section below for more details), the existing analyses all suffer from one or several limitations: either they apply to a specific distribution $\mathcal{D}$, which is convenient to analyze but not very practical (such as a standard Gaussian distribution); or they apply to gradient methods only with a specific initialization (rather than a standard random one); or they require smoothness and strict monotonicity conditions on the activation function $\sigma$ (which excludes, for example, the common ReLU function $\sigma(z) = \max\{0, z\}$). However, a bit of experimentation strongly suggests that none of these assumptions is really necessary for standard gradient methods to succeed on this simple problem. Thus, our understanding of this problem is probably still incomplete.

The goal of this paper is to study to what extent the limitations above can be relaxed, with the following contributions:

  • We begin by asking whether positive results are possible without any explicit assumptions on the distribution or the activation (other than, say, bounded support and Lipschitz continuity). Although this seems reasonable at first glance, we show in Sec. 3 that unfortunately, this is not the case: For the ReLU activation function, there are bounded distributions on which gradient descent will fail to optimize Eq. (1) with probability exponentially close to $1$. Moreover, even when $\mathcal{D}$ is a standard Gaussian, there are Lipschitz activation functions on which gradient methods will likely fail.

  • Motivated by the above, we ask whether it is possible to prove positive results with mild assumptions on the distribution and activation function, which do not exclude the ReLU function and go beyond a standard Gaussian distribution. In Sec. 4, we prove a key technical result, which implies that if the distribution is sufficiently “spread” and the activation function satisfies a weak monotonicity condition (satisfied by ReLU and all standard activation functions), then $\langle \nabla F(\mathbf{w}), \mathbf{w} - \mathbf{v} \rangle$ is positive in most of the domain. This implies that an exact gradient step with sufficiently small step size will bring us closer to $\mathbf{v}$ in “most” places. Building on this result, we prove in Sec. 5 a constant-probability convergence guarantee for several variants of gradient methods (gradient descent, stochastic gradient descent, and gradient flow) with random initialization.

  • In Sec. 6, we consider more specifically the case where $\mathcal{D}$ is any spherically symmetric distribution (which includes the standard Gaussian as a special case), and the ReLU activation function, and show that the convergence results can be made to hold with high probability. As we discuss later on, the case of the ReLU function and a standard Gaussian distribution was also considered in [20, 21], but that analysis crucially relied on initialization at the origin and a Gaussian distribution, whereas our results apply to more generic initialization schemes and distributions.

  • A natural question arising from these results is whether a high-probability result can be proved for non-spherically symmetric distributions. We study this empirically in Subsection 6.2, and show that perhaps surprisingly, this cannot be done with standard potential-based methods (involving the angle or distance to the target $\mathbf{v}$), already when we consider unit-variance Gaussian distributions with a non-zero mean.

Overall, we hope our work contributes to a better understanding of the dynamics of gradient methods on simple neural networks, and suggests some natural avenues for future research.

1.1 Related Work

We begin by emphasizing that the problem of learning a single target neuron is not inherently hard: Indeed, it can be efficiently performed with minimal assumptions, using the Isotron algorithm and its variants (Kalai and Sastry [13], Kakade et al. [12]). Also, other algorithms exist for even more complicated networks or more general settings, under certain assumptions (e.g., Goel et al. [8], Janzamin et al. [10]). However, these are non-standard algorithms, whereas our focus here is on standard gradient methods.

For this setting, an important positive result was provided in Mei et al. [14], showing that gradient descent on the empirical risk function (with the inputs $\mathbf{x}_i$ sampled i.i.d. from $\mathcal{D}$ and the sample size sufficiently large) successfully yields a good approximation of $\mathbf{v}$. However, the analysis requires $\sigma$ to be strictly monotonic, and to have uniformly bounded derivatives up to the third order. This excludes standard activation functions such as the ReLU, which are neither strictly monotonic nor differentiable. Indeed, assuming that the activation is strictly monotonic makes the analysis much easier, as we show later on in Thm. 3.2. A related analysis under strict monotonicity conditions is provided in Oymak and Soltanolkotabi [15].

In the landmark papers Soltanolkotabi [20] and Soltanolkotabi et al. [21], the authors studied the setting where $\sigma$ is the ReLU function, and gradient descent or stochastic gradient descent is performed on the empirical risk function, where the inputs $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are sampled from a standard Gaussian distribution. However, that analysis is specific to the Gaussian distribution, and crucially relied on initialization at precisely $\mathbf{w} = \mathbf{0}$, as well as a certain assumption on how the derivative of the ReLU function is computed at $0$. In more detail, we impose the convention that even though the ReLU function is not differentiable at $0$, we take $\sigma'(0)$ to be some fixed positive number, and the gradient of the population objective at $\mathbf{w} = \mathbf{0}$ to be

$$\nabla F(\mathbf{0}) = \mathbb{E}_{\mathbf{x}}\left[\left(\sigma(0) - \sigma(\mathbf{v}^\top \mathbf{x})\right) \sigma'(0)\, \mathbf{x}\right] = -\sigma'(0)\, \mathbb{E}_{\mathbf{x}}\left[\sigma(\mathbf{v}^\top \mathbf{x})\, \mathbf{x}\right].$$

Assuming $\sigma'(0) > 0$, we get that the gradient is non-zero and proportional to $-\mathbb{E}_{\mathbf{x}}[\sigma(\mathbf{v}^\top \mathbf{x})\, \mathbf{x}]$. For a Gaussian distribution (and more generally, spherically symmetric distributions), this turns out to be proportional to $-\mathbf{v}$, so that an exact gradient step from $\mathbf{0}$ will lead us precisely in the direction of the target parameter vector $\mathbf{v}$. As a result, if we calculate a sufficiently precise approximation of this direction from a random sample, we can get arbitrarily close to $\mathbf{v}$ in a single iteration (see Soltanolkotabi et al. [21, Remark 3.2] for a discussion of this). Unfortunately, this unique behavior is specific to initialization at $\mathbf{0}$ with a certain convention about $\sigma'(0)$ (note that even locally around $\mathbf{0}$, the gradient may not approximate $\nabla F(\mathbf{0})$, since it is generally discontinuous around $\mathbf{0}$). Thus, although the analysis is important and insightful, it is difficult to apply more generally.
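The alignment of the gradient at the origin with the target can be checked numerically. The sketch below assumes standard Gaussian inputs and a hypothetical unit-norm target `v`; it estimates the expectation of $\sigma(\mathbf{v}^\top \mathbf{x})\,\mathbf{x}$ for the ReLU and compares its direction with `v`. For this particular distribution the expectation equals $\mathbf{v}/2$, so the negative gradient at the origin points exactly toward the target.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 200_000
v = np.array([0.6, 0.0, -0.8, 0.0])   # hypothetical unit-norm target
X = rng.standard_normal((n, d))       # assumed standard Gaussian inputs

# Sample mean of relu(v.x) * x, an estimate of E[sigma(v.x) x]
direction = (np.maximum(X @ v, 0.0)[:, None] * X).mean(axis=0)

# Cosine of the angle between the estimated direction and the target
cosine = direction @ v / (np.linalg.norm(direction) * np.linalg.norm(v))
```

For a distribution that is not spherically symmetric, the same estimate would generally not be parallel to `v`, which is one reason the origin-based argument does not transfer.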

A line of recent works established the effectiveness of gradient methods in solving non-convex optimization problems with a strict saddle property, which implies that all near-stationary points with nearly positive definite Hessians are close to global minima (see Jin et al. [11], Ge et al. [6], Sun et al. [22]). A relevant example is phase retrieval, which actually fits our setting with $\sigma$ being the quadratic function (Sun et al. [23]). However, these results can only be applied to smooth problems, where the objective function is twice differentiable with Lipschitz-continuous Hessians (excluding, for example, problems involving the ReLU activation function). An interesting recent exception is the work of Tan and Vershynin [24], which considered the case $\sigma(z) = |z|$. However, their results are specific to that activation, and assume a specific input distribution (uniform on a scaled origin-centered sphere). In contrast, our focus here is on more general families of distributions and activations.

Brutzkus and Globerson [3] show that gradient descent learns a simple convolutional network with non-overlapping patches, when the inputs have a standard Gaussian distribution. Similar to the analysis in our paper, they rely on showing that the angle between the learned parameter vector and a target parameter vector monotonically decreases with gradient methods. However, the network architecture studied is different from ours, and their proof heavily relies on the symmetry of the Gaussian distribution.

Less directly related to our setting, a popular line of recent works showed how gradient methods on highly over-parameterized neural networks can learn various target functions in polynomial time (e.g., Allen-Zhu et al. [1], Daniely [5], Arora et al. [2], Cao and Gu [4]). However, as pointed out in Yehudai and Shamir [25], this type of analysis cannot be used to explain learnability of single neurons.

2 Preliminaries

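We use bold-faced letters to denote vectors. For a vector $\mathbf{x}$, we let $x_i$ denote its $i$-th coordinate. We denote by $[z]_+$ the ReLU function, i.e. $[z]_+ = \max\{0, z\}$. For a vector $\mathbf{x}$, we let $\bar{\mathbf{x}} = \mathbf{x} / \|\mathbf{x}\|$, and by $\mathbf{1}$ we denote the all-ones vector $(1, \ldots, 1)$. Given vectors $\mathbf{w}, \mathbf{v}$, we let $\theta(\mathbf{w}, \mathbf{v})$ denote the angle between $\mathbf{w}$ and $\mathbf{v}$. We use $\Pr$ to denote probability. Denote by $\mathbb{1}(A)$ the indicator function, which equals $1$ if $A$ holds and $0$ otherwise.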

Unless stated otherwise, we assume that the target vector $\mathbf{v}$ in Eq. (1) is unit norm, $\|\mathbf{v}\| = 1$.

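When $\sigma$ is differentiable, the gradient of the objective function in Eq. (1) is

$$\nabla F(\mathbf{w}) = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\left(\sigma(\mathbf{w}^\top \mathbf{x}) - \sigma(\mathbf{v}^\top \mathbf{x})\right) \sigma'(\mathbf{w}^\top \mathbf{x})\, \mathbf{x}\right] \qquad (2)$$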

When $\sigma$ is not differentiable, we will still assume that it is differentiable almost everywhere (up to a finite number of points), and that at every point of non-differentiability $z$, there are well-defined left and right derivatives. In that case, practical implementations of gradient methods fix $\sigma'(z)$ to be some number between its left and right derivatives (for example, for the ReLU function, $\sigma'(0)$ is defined as some number in $[0, 1]$). Following that convention, the expected gradient used by these methods still corresponds to Eq. (2), and we will follow the same convention here.
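To make the derivative convention concrete, the sketch below implements a sample version of the gradient in Eq. (2) for the ReLU, with the value at zero exposed as a parameter, and compares it with a central finite-difference approximation of the sample loss at a generic point. The function names, the parameter `at_zero`, the Gaussian inputs and the specific vectors are all assumptions of this example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_prime(z, at_zero=0.0):
    """ReLU derivative; `at_zero` is the chosen convention for sigma'(0), any value in [0, 1]."""
    return np.where(z > 0, 1.0, np.where(z < 0, 0.0, at_zero))

def loss(w, v, X):
    return 0.5 * np.mean((relu(X @ w) - relu(X @ v)) ** 2)

def gradient(w, v, X, at_zero=0.0):
    """Sample mean of (sigma(w.x) - sigma(v.x)) * sigma'(w.x) * x."""
    z = X @ w
    return ((relu(z) - relu(X @ v)) * relu_prime(z, at_zero)) @ X / len(X)

rng = np.random.default_rng(1)
d, n = 3, 20_000
X = rng.standard_normal((n, d))   # assumed input distribution
v = np.array([1.0, 0.0, 0.0])     # hypothetical target
w = np.array([0.3, -0.5, 0.8])    # generic evaluation point

g = gradient(w, v, X)
eps = 1e-5
# Central finite differences of the sample loss along each coordinate
fd = np.array([(loss(w + eps * e, v, X) - loss(w - eps * e, v, X)) / (2 * eps)
               for e in np.eye(d)])
```

Because the inputs have a density, the event that an input lands exactly on the kink has probability zero, so the choice of `at_zero` does not affect the expected gradient away from degenerate points such as the origin.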

In our paper, we focus on the following three standard gradient methods:

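  • Gradient Flow: We initialize at some $\mathbf{w}(0)$, and for every $t > 0$, we set $\mathbf{w}(t)$ to be the solution of the differential equation:

    $$\dot{\mathbf{w}}(t) = -\nabla F(\mathbf{w}(t)).$$

    This can be thought of as a continuous form of gradient descent, where we consider an infinitesimal learning rate.

  • Gradient Descent: We initialize at some $\mathbf{w}_0$ and set a fixed learning rate $\eta > 0$. At each iteration $t \geq 0$, we do a single step in the negative direction of the gradient:

    $$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla F(\mathbf{w}_t).$$

  • Stochastic Gradient Descent (SGD): We initialize at some $\mathbf{w}_0$ and set a fixed learning rate $\eta > 0$. At each iteration $t \geq 0$, we sample an input $\mathbf{x}_t \sim \mathcal{D}$, and calculate a stochastic gradient:

    $$\mathbf{g}_t = \left(\sigma(\mathbf{w}_t^\top \mathbf{x}_t) - \sigma(\mathbf{v}^\top \mathbf{x}_t)\right) \sigma'(\mathbf{w}_t^\top \mathbf{x}_t)\, \mathbf{x}_t \qquad (3)$$

    and do a single step in the negative direction of the stochastic gradient:

    $$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\, \mathbf{g}_t.$$

    Note that here we consider SGD on the population loss, which is different from SGD on a fixed training set. We also note that our proof techniques easily extend to mini-batch SGD, where $\mathbf{g}_t$ is taken to be the average of several stochastic gradients w.r.t. inputs sampled i.i.d. from $\mathcal{D}$. However, for simplicity we will focus on a single sample per iteration.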
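The discrete-time methods above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the setup used for any result in the paper: the inputs are standard Gaussian, the activation is the ReLU with the convention $\sigma'(0) = 0$, the population gradient for gradient descent is approximated by a large fixed sample, and the step sizes, iteration counts and names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def grad(w, v, X):
    """Average of (sigma(w.x) - sigma(v.x)) * sigma'(w.x) * x over the rows of X."""
    z = X @ w
    return ((relu(z) - relu(X @ v)) * (z > 0)) @ X / len(X)

rng = np.random.default_rng(0)
d = 5
v = np.ones(d) / np.sqrt(d)        # hypothetical unit-norm target
w0 = 0.1 * rng.standard_normal(d)  # small random initialization

# Gradient descent: population gradient approximated by a large fixed sample
X_big = rng.standard_normal((50_000, d))
w_gd = w0.copy()
for _ in range(300):
    w_gd = w_gd - 0.5 * grad(w_gd, v, X_big)

# SGD on the population loss: a fresh input is drawn at every iteration
w_sgd = w0.copy()
for _ in range(5_000):
    x_t = rng.standard_normal((1, d))
    w_sgd = w_sgd - 0.02 * grad(w_sgd, v, x_t)

err_gd = np.linalg.norm(w_gd - v)
err_sgd = np.linalg.norm(w_sgd - v)
```

Gradient flow has no exact discrete implementation; gradient descent with a very small step size serves as its usual numerical proxy.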

3 Assumptions on the Distribution and Activation are Necessary

The main concern of this paper is under what assumptions a single neuron can be provably learned. In this section, we show that learning even a single neuron can be hopeless, unless we make non-trivial assumptions on both the input distribution and the activation function.

3.1 Assumptions on the Input Distribution are Necessary

We begin by asking whether Eq. (1) can be minimized by gradient methods in a distribution-free manner (with no assumptions beyond, say, bounded support), as in learning problems where the population objective is convex. Perhaps surprisingly, we show that the answer is negative, even if we consider specifically the ReLU activation, and a distribution supported on the unit Euclidean ball. This is based on the following key result:

Theorem 3.1.

Suppose that is the ReLU function, and assume that is sampled from a product distribution (namely, each is sampled independently from some distribution ). Then there exists a distribution over the inputs, supported on , and with such that the following holds: With probability at least over the initialization point sampled from , if we run gradient flow, gradient descent or stochastic gradient descent, then for every we have (for gradient flow ).


For each distribution , let . We define the following dataset:

where is the standard -th unit vector, and if and

otherwise. Denote the random variable

and . We have that are independent, , and . Using Hoeffding’s inequality, we get that w.p it holds that , which means that there are at least indices such that .


to be uniform distribution on

. Using Eq. (2) and the fact that is the ReLU function, we get

In particular, for every index for which we have that .

Next, we define with (note that ). We condition on the event above – namely, that there are indices for which – and let these indices be . Under this event, for at initialization we have that


We will now show that for every index , using gradient methods will not change the -th coordinate of from its initial value. Let be such a coordinate. For gradient flow we have that , hence . For gradient descent we have that , hence . For stochastic gradient descent, at each iteration we sample from the distribution defined in Thm. 3.1, and define the stochastic gradient as in Eq. (3). If then hence , otherwise, if then hence . In both cases the -th coordinate of the stochastic gradient is zero, hence . Thus, we have shown that for every iteration for gradient descent or SGD we have that (and for gradient flow, for every time , we have ).

We end by noting that although the distribution defined here is discrete over a finite dataset, the same argument can also be made for a non-discrete distribution, by considering a mixture of smooth distributions concentrated around the support points of the discrete distribution above. ∎

The theorem above applies to any product initialization scheme, which includes most standard initializations used in practice (e.g., the standard Xavier initialization [7]). The theorem implies that it is impossible to prove positive guarantees in our setting without distributional assumptions on the inputs. Inspecting the construction, the source of the problem (at least for the ReLU neuron) appears to be the fact that the distribution is supported on a small number of well-separated regions. Thus, in our positive results, we will assume that the distribution is sufficiently “spread”, as formalized later on in Sec. 4.
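The failure mode can be illustrated with a simplified variant of the construction (an assumption of this sketch, not the exact distribution from the proof): inputs are uniform over the standard unit vectors, the hypothetical target has all-positive coordinates, and the initialization is a symmetric Gaussian. With the ReLU, each coordinate of the gradient depends only on the matching coordinate of the parameter vector, and it vanishes whenever that coordinate is non-positive, so such coordinates never move.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
d = 20
v = np.ones(d) / np.sqrt(d)                # hypothetical target, all coordinates positive
w0 = rng.standard_normal(d) / np.sqrt(d)   # product (coordinate-wise independent) initialization

def grad(w):
    # Inputs uniform over e_1, ..., e_d: coordinate i of the gradient only sees w_i and v_i,
    # and it is exactly zero when w_i <= 0 (using the convention sigma'(0) = 0).
    return (relu(w) - relu(v)) * (w > 0) / d

w = w0.copy()
for _ in range(2_000):
    w = w - 1.0 * grad(w)

stuck = w0 <= 0                                    # coordinates that can never be updated
final_loss = 0.5 * np.mean((relu(w) - relu(v)) ** 2)
```

Every coordinate that starts non-positive contributes a fixed amount to the final loss, so the objective stays bounded away from zero no matter how long the method runs.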

3.2 Assumptions on the Activation Function

We now turn to discuss the activation function, explaining why even if the activation is Lipschitz and the input distribution is a standard Gaussian, this is likely insufficient for positive guarantees in our setting.

In particular, let us consider the case that is a -Lipschitz periodic function. Then Theorem in [19] implies that for a large family of input distributions on (including a standard Gaussian), if we assume that the vector in the target neuron is a uniformly distributed unit vector, then for any fixed ,

This implies that the gradient at is virtually independent of the underlying target vector : In fact, it is extremely concentrated around a fixed value which does not depend on . Theorem 4 from [19] goes further and shows that for any gradient method, even an exponentially small amount of arbitrary noise will be enough to make its trajectory (after at most iterations) independent of , in which case it cannot possibly succeed in this setting. We note that their result is even more general as they consider a general function instead of , so our setting can be seen as a special case.

When considering a standard Gaussian distribution, the above argument can be easily extended to activations which are periodic only in a segment of length around the origin. This can be seen by extending the activation to which is periodic on , applying the above argument to it, and noting that the probability mass outside of a ball of radius is exponentially small (for example, see [25] Proposition 4.2, where they consider an activation which is a finite sum of ReLU functions and periodic in a segment of length ).

The above discussion motivates us to impose some condition on the activation function which excludes periodic functions. One such mild assumption, which we will adopt in the rest of the paper (and which corresponds to virtually all activations used in practice), is that the activation is monotonically increasing. Before continuing, we remark that by assuming a slight strengthening of this assumption, namely that the function is strictly monotonically increasing, it is easy to prove a positive guarantee, as evidenced by Thm. 3.2. However, this excludes popular activations such as the ReLU function.

Theorem 3.2.

Assume for some , and the following for some :

  • is positive definite with minimal eigenvalue

  •  .

Then starting from any point , after doing iterations of gradient descent with learning rate , we have that:

The proof can be found in Appendix A, and can be easily generalized to apply also to gradient flow and SGD. The above shows that if we assume strict monotonicity of the activation, then under very mild assumptions on the data, $\mathbf{w}_t$ will converge exponentially fast to $\mathbf{v}$. In the rest of the paper, however, we focus on results which only require weak monotonicity.
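The effect of strict monotonicity can be seen in a small experiment. The sketch below uses a leaky ReLU with a hypothetical slope of 0.5 (so the derivative is bounded below by a positive constant), Gaussian inputs represented by a fixed sample, and an initialization pointing directly away from the target. The slope, step size and starting point are assumptions chosen for illustration.

```python
import numpy as np

SLOPE = 0.5  # hypothetical leaky-ReLU slope; the derivative is at least SLOPE everywhere

def leaky(z):
    return np.where(z > 0, z, SLOPE * z)

def leaky_prime(z):
    return np.where(z > 0, 1.0, SLOPE)

def grad(w, v, X):
    """Sample mean of (sigma(w.x) - sigma(v.x)) * sigma'(w.x) * x."""
    z = X @ w
    return ((leaky(z) - leaky(X @ v)) * leaky_prime(z)) @ X / len(X)

rng = np.random.default_rng(0)
d = 5
X = rng.standard_normal((20_000, d))   # fixed sample standing in for the input distribution
v = np.ones(d) / np.sqrt(d)            # hypothetical unit-norm target
w = -2.0 * v                           # start in the direction opposite to the target

dist = [np.linalg.norm(w - v)]
for _ in range(300):
    w = w - 0.05 * grad(w, v, X)
    dist.append(np.linalg.norm(w - v))
dist = np.array(dist)
```

With the plain ReLU the same starting point is problematic, since the gradient can vanish on the half-space where the neuron is inactive; the strictly positive slope removes that obstacle.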

4 Under Mild Assumptions the Gradient Points in a Good Direction

Motivated by the results in Sec. 3, we use the following assumptions on the distribution and activation:

Assumption 4.1.

The following holds for some fixed :

  1. The distribution satisfies the following: For any vector , let denote the marginal distribution of on the subspace spanned by (as a distribution over ). Then any such distribution has a density function such that .

  2. is monotonically increasing; and satisfies .

The distributional assumption is such that in every -dimensional subspace, the marginal distribution is sufficiently “spread” in any direction close to the origin. For example, for a standard Gaussian distribution, this is true for regardless of the dimension (as the marginal distribution of a standard Gaussian on the subspace is a standard -dimensional Gaussian). Also, for any distribution, it can be made to hold by mixing it with a bit of a Gaussian or uniform distribution if possible. The activation assumption covers ReLU or ReLU-like activations (e.g. leaky-ReLU, Softplus). It also covers sigmoid and tanh activations, for which the gradient in any bounded interval is lower bounded by a positive constant.

With these assumptions, we prove the following key technical result, which implies that the gradient of the objective has a positive correlation with the direction of the global minimum (at $\mathbf{v}$), if the angle between $\mathbf{w}$ and $\mathbf{v}$ and the norm of $\mathbf{w}$ are not too large:

Theorem 4.2.

Under Assumptions 4.1, for any such that and for some , it holds that

The theorem implies that for suitable values of $\mathbf{w}$, gradient methods (which move in the negative gradient direction) will decrease the distance from $\mathbf{v}$. When this behavior occurs, it is easy to show that gradient methods succeed in learning the target neuron, like in the previous Thm. 3.2 for the strictly monotonic case. The main challenge is to guarantee that the trajectory of the algorithm will indeed never violate the theorem’s conditions, in particular that the angle between $\mathbf{w}$ and $\mathbf{v}$ indeed remains bounded away from $\pi$ (and in fact, later on we will show that such a guarantee is not always possible).

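The formal proof of the theorem can be found in Appendix B, but its intuition can be described as follows: we want to bound below the term

$$\langle \nabla F(\mathbf{w}), \mathbf{w} - \mathbf{v} \rangle = \mathbb{E}_{\mathbf{x}}\left[\left(\sigma(\mathbf{w}^\top \mathbf{x}) - \sigma(\mathbf{v}^\top \mathbf{x})\right) \sigma'(\mathbf{w}^\top \mathbf{x}) \left(\mathbf{w}^\top \mathbf{x} - \mathbf{v}^\top \mathbf{x}\right)\right].$$

Note that: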

  1. Using the assumption on , the term inside the above expectation is nonnegative for every . This is because and for any monotonic function we have . Thus, viewing the expectation as an integral over a nonnegative function, we can lower bound it by taking the integral over the smaller set .

  2. The resulting integral depends only on dot products of with and . Thus, it is enough to consider the marginal distribution on the -dimensional plane spanned by and .

  3. By the assumption on the distribution, the density function of this marginal distribution is always at least on any such that . This means we can lower bound the integral above by integrating over with a uniform distribution on this set and multiplying by .

In total, the expression above is lower bounded by a -dimensional integral with uniform measure and with no terms on the set:

where are the -dimensional vectors representing on the -dimensional plane spanned by them. We lower bound this integral by a term that scales with the angle .
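The first step of this intuition, that the term inside the expectation is nonnegative for every input, is easy to verify numerically. The sketch below evaluates the per-sample term for the ReLU on Gaussian inputs and averages it to estimate the inner product between the gradient and the direction away from the target. The Gaussian inputs, the target and the evaluation point are assumptions of this example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
d, n = 5, 50_000
X = rng.standard_normal((n, d))   # assumed input distribution
v = np.ones(d) / np.sqrt(d)       # hypothetical unit-norm target
w = rng.standard_normal(d)        # arbitrary evaluation point

a, b = X @ w, X @ v
# Per-sample term (sigma(a) - sigma(b)) * sigma'(a) * (a - b), with sigma'(0) = 0
terms = (relu(a) - relu(b)) * (a > 0) * (a - b)

correlation = terms.mean()                          # estimate of <grad F(w), w - v>
ratio = correlation / np.linalg.norm(w - v) ** 2    # normalized by the squared distance to v
```

The pointwise bound holds for any monotonically increasing activation; what the distributional assumption adds is a quantitative lower bound on the average.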

Remark 4.3 (Implication on Optimization Landscape).

The proof of the theorem can be shown to imply that for the ReLU activation, under the theorem’s conditions, the only stationary point that is not the global minimum must be at the origin. In particular, the proof implies that any stationary point (with ) must be along the ray . For the ReLU activation (which satisfies for any and ), the gradient at such points equals

In particular,

This implies that might be zero only if either (i.e., the origin), or with probability , which cannot happen according to Assumptions 4.1.

5 Convergence with Constant Probability Under Mild Assumptions

In this section, we use Thm. 4.2 in order to show that under some assumption on the initialization of , gradient methods will be able to learn a single neuron with probability at least (close to) . Note that the loss surface of is not convex, and as explained in Remark 4.3, there may be a stationary point at . This stationary point can cause difficulties, as it is not obvious how to control the angle between and close to the origin (which is required for Thm. 4.2 to apply). But, if we assume at initialization for some small , then we are not close to this stationary point and we can ensure that it will remain that way throughout the optimization process. One such initialization, which guarantees this with at least constant probability, is a zero-mean Gaussian initialization with small enough variance:

Lemma 5.1.

Assume . If we sample for then w.p we have that

In order to bound each gradient step we will need these additional assumptions:

Assumption 5.2.

The following holds for some positive :

  1. almost surely over

  2. for all

With these assumptions, we show convergence for gradient flow, gradient descent and stochastic gradient descent:

Theorem 5.3.

Under assumptions 4.1 and 5.2 we have:

  1. (Gradient Flow) Let , and assume that . Running gradient flow, then for every time we have

    where .

  2. (Gradient Descent) Let , and assume that . Let for and . Running gradient descent with step size , we have that for every , after iterations:

  3. (Stochastic Gradient Descent) Let , and assume that . Let where and . Then w.p , after iterations we have that:

Combined with Lemma 5.1, Thm. 5.3 shows that with proper initialization, gradient flow, gradient descent as well as stochastic gradient descent successfully minimize Eq. (1) with probability (close to) , and for the first two algorithms, the convergence rate is exponential.

The full proof of the theorem can be found in Appendix C, and its intuition for gradient flow and gradient descent is as described above (namely, that if , it will stay that way and will just continue to shrink over time, using Thm. 4.2). The proof for stochastic gradient descent is much more delicate. This is because the update at each iteration is noisy, so we need to ensure we remain in the region where Thm. 4.2 is applicable. Here we give a short proof intuition:

  1. Assume we initialized with . In order for the analysis to work we need that throughout the algorithm’s run. Otherwise, if we won’t be able to use Thm. 4.2 with a constant angle , and also we may be close to the stationary point at . Thus, we show (using a maximal version of Azuma’s inequality) that if is small enough, and we take at most gradient steps then w.h.p for every :

  2. The next step is to show that if , then for an appropriate . This is done using Thm. 4.2, as in the gradient descent case, but note that here this only holds in expectation over the sample selected at iteration .

  3. Next, we use Azuma’s inequality again on iterations for a small enough , to show that w.h.p does not move too far away from where the expectation is taken over . Also, we show that after iterations for a constant smaller than . This shows that w.h.p., after a single epoch of iterations, shrinks by a constant factor.

  4. We then repeat this analysis across epochs (each consisting of iterations), and use a union bound. Overall, we get that after sufficiently many iterations, with high probability, the iterates get as close as we want to zero.

We note that the optimization analysis for stochastic gradient descent is inspired by the analysis in [18] for the different non-convex problem of principal component analysis (PCA), which also attempts to avoid a problematic stationary point. An interesting question for future research is to understand to what extent the polynomial dependencies in the problem parameters can be improved.

Remark 5.4.

Our assumption on the data (the first item of Assumption 5.2) is made for simplicity. For the gradient descent case, it is easy to verify that the proof only requires that the fourth moment of the data is bounded by some constant, which ensures that the gradients of the objective function used by the algorithm are bounded. For SGD it is enough to assume that the input distribution is sub-Gaussian. The proof proceeds in the same manner, by using a variant of Azuma’s inequality for martingales with sub-Gaussian tail, e.g.


6 High-Probability Convergence

The results in the previous section hold under mild conditions, but unfortunately only guarantee a constant probability of success. In this section, we consider the possibility of proving guarantees which hold with high probability (arbitrarily close to $1$). On the one hand, in Subsection 6.1, we provide such a result for the ReLU activation, assuming the input distribution is spherically symmetric. On the other hand, in Subsection 6.2, we point out non-trivial obstacles to extending such a result to non-spherically symmetric distributions. Overall, we believe that getting high-probability convergence guarantees for non-spherically symmetric distributions is an interesting avenue for future research.

6.1 Convergence for Spherically Symmetric Distributions

In this subsection, we make the following assumptions:

Assumption 6.1.

Assume that:

  1. has a spherically symmetric distribution. That is, for every orthogonal matrix


  2. The activation function is the standard ReLU function .

These assumptions are significantly stronger than Assumptions 4.1, but allow us to prove a stronger high-probability convergence result. Note that even with these assumptions the loss surface is still not convex, and may contain a spurious stationary point (see Remark 4.3). For simplicity, we will focus on proving the result for gradient flow. The result can then be extended to gradient descent and stochastic gradient descent, along similar lines as in the proof of Thm. 5.3.

The proof strategy in this case is quite different from that of the constant-probability guarantee, and relies on the following key technical result:

Lemma 6.2.

If , then

The lemma (which relies on the spherical symmetry of the distribution) implies that if we initialize at any point , then the angle between and is strictly less than , and will remain so as long as . As a result, we can apply Thm. 4.2 to prove that decays at an exponential rate. The only potential difficulty is that may converge to the potential stationary point at the origin (at which the angle is not well-defined), but fortunately this cannot happen due to the following lemma:

Lemma 6.3.

Let and assume that . If then

The lemma can be shown to imply that as long as remains bounded away from , then cannot decrease below some positive number (as its derivative is positive close enough to zero, and is a continuous function of ). The proof idea of both lemmas is based on a technical calculation, where we project the spherically symmetric distribution on the -dimensional subspace spanned by and .

Using the lemmas above, we can get the following convergence guarantee:

Theorem 6.4.

Assume we initialize such that , for some and that Assumption 4.1(1) holds. Then running gradient flow, we have for all

where .

We now note that the assumption of the theorem holds with exponentially high probability under standard initialization schemes. For example, if we use a Gaussian initialization , then by standard concentration of measure arguments, it holds w.p that is at most (say) , and w.p that . As a result, by Thm. 6.4, w.p over the initialization we have for all . The full proof of the theorem can be found in Appendix D.

Remark 6.5.

If we further assume that the distribution is a standard Gaussian, then it is possible to prove Lemma 6.2 and Lemma 6.3 in a much easier fashion. The reason is that specifically for a standard Gaussian distribution there is a closed-form expression (without the expectation) for the loss and the gradient, see [3], [16]. We provide the relevant versions of the lemmas, as well as their proofs, in Subsection D.1.

6.2 Non-monotonic Angle Behavior

Figure 1: Gradient descent for 2-dimensional data (best viewed in color). The left figure represents the trajectory of gradient descent over the loss surface. The red "x" marker represents the global minimum at $\mathbf{v}$. The right figure shows the angle between $\mathbf{w}$ and $\mathbf{v}$ as a function of the number of iterations; the angle ranges from $0$ to $\pi$. The plot colors in the right figure correspond to the trajectory colors in the left figure.

The results in the previous subsection crucially relied on the fact that at almost any point , the angle decreases. This type of analysis was also utilized in works on related settings (e.g. Brutzkus and Globerson [3]).

Based on this, it might be tempting to conjecture that this monotonically decreasing angle property (and as a result, high-probability guarantees) can be shown to hold more generally, not just for spherically symmetric distributions. Perhaps surprisingly, we show empirically that this may not be the case, already in the simple setting of a unit-variance Gaussian with a non-zero mean. We emphasize that this does not necessarily mean that gradient methods will not succeed, only that an analysis based on showing monotonic behavior of the relevant geometric quantities will not work in general.

In particular, in Figure 1 we report the result of running gradient descent (with constant step size ) on our objective function in , where the input distribution is a unit-variance Gaussian with mean at , and our target vector is . We initialize at three different locations: . Although the algorithm eventually reaches the global minimum , the angle between them is clearly non-monotonic, and actually is initially increasing rather than decreasing. Even worse, the angle appears to attain every value in , so it appears that any analysis using angle-based “safe regions” is bound to fail.

Overall, we conclude that proving a high-probability convergence guarantee for gradient methods appears to be an interesting open problem, already in the case of unit-variance, non-zero-mean Gaussian input distributions. We leave tackling this problem to future work.
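An experiment of this kind can be sketched as follows. The mean of the Gaussian, the target, the initial point, the step size and the number of iterations below are hypothetical placeholders (the values used for Figure 1 are not recoverable from the text), and the population gradient is approximated by a fixed sample. The script records the angle between the iterate and the target at every step, so that its monotonicity can be inspected.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def angle(w, v):
    """Angle between w and v in [0, pi]; the cosine is clipped to guard against rounding."""
    cos = w @ v / (np.linalg.norm(w) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

rng = np.random.default_rng(0)
mean = np.array([0.0, 3.0])                     # hypothetical non-zero mean
X = rng.standard_normal((20_000, 2)) + mean     # unit-variance Gaussian inputs in the plane
v = np.array([1.0, 0.0])                        # hypothetical target
w = np.array([-0.5, 0.5])                       # hypothetical initial point
eta, steps = 0.05, 500                          # hypothetical step size and horizon

angles = [angle(w, v)]
for _ in range(steps):
    z = X @ w
    g = ((relu(z) - relu(X @ v)) * (z > 0)) @ X / len(X)
    w = w - eta * g
    angles.append(angle(w, v))
angles = np.array(angles)

# Number of iterations at which the angle to the target went up rather than down
increases = int(np.sum(np.diff(angles) > 0))
```

A positive value of `increases` for a given configuration indicates that the angle is not monotonically decreasing along that trajectory; plotting `angles` against the iteration index gives the analogue of the right panel of Figure 1.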

Acknowledgements. This research is supported in part by European Research Council (ERC) grant 754705.


Appendix A Proofs from Sec. 3

Proof of Thm. 3.2.

We have that:

where is by monotonicity of (hence always), and is by the assumption that . Next, we bound the gradient :

At iteration we have that:

Using induction over the above proves the theorem.

Appendix B Proofs from Sec. 4

We will first need the following lemma:

Lemma B.1.

Fix some , and let be two vectors in such that for some . Then


It is enough to lower bound

The inner infimum is attained at some such that . This is because does not depend on and , and the volume for which the indicator function inside the integral is non-zero is smallest when the angle is largest. Setting this and switching the order of the infima, we get

When , we note that the set is simply a “pie slice” of radial width out of a ball of radius . Since the expression is invariant to rotating the coordinates, we will consider without loss of generality the set , and the expression above reduces to


where is from the fact that is symmetric around the -axis (namely, if and only if ).

We now note that the set contains the two (disjoint and equally-sized) rectangular sets


Figure 2: An illustration of the sets for the case of . The set , colored in gray, is a "pie slice" and the rectangles are contained in .

(see Figure 2 for an illustration). Therefore, we can lower bound Eq. (5) by

where we used the fact that and therefore . The integral is simply the volume of , and since and are disjoint and equally sized rectangles, this equals twice the volume of , namely . Plugging into the above, we get

where again we used the fact that .

We now turn to prove the theorem:

Proof of Thm. 4.2.

We have:


We note that since is monotonically increasing, then for any , and