# When is a Convolutional Filter Easy To Learn?

We analyze the convergence of (stochastic) gradient descent algorithm for learning a convolutional filter with Rectified Linear Unit (ReLU) activation function. Our analysis does not rely on any specific form of the input distribution and our proofs only use the definition of ReLU, in contrast with previous works that are restricted to standard Gaussian input. We show that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time and the convergence rate depends on the smoothness of the input distribution and the closeness of patches. To the best of our knowledge, this is the first recovery guarantee of gradient-based algorithms for convolutional filter on non-Gaussian input distributions. Our theory also justifies the two-stage learning rate strategy in deep neural networks. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.

## 1 Introduction

Deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in many applications such as computer vision (Krizhevsky et al., 2012; Dauphin et al., 2016) and reinforcement learning applied in classic games like Go (Silver et al., 2016). Despite the highly non-convex nature of the objective function, simple first-order algorithms like stochastic gradient descent and its variants often train such networks successfully. On the other hand, the success of convolutional neural networks remains elusive from an optimization perspective.

When the input distribution is not constrained, existing results are mostly negative, such as the hardness of learning a 3-node neural network (Blum & Rivest, 1989) or a non-overlapping convolutional filter (Brutzkus & Globerson, 2017). Recently, Shamir (2016) showed that learning a simple one-layer fully connected neural network is hard for some specific input distributions.

These negative results suggest that, in order to explain the empirical success of SGD for learning neural networks, stronger assumptions on the input distribution are needed. Recently, a line of research (Tian, 2017; Brutzkus & Globerson, 2017; Li & Yuan, 2017; Soltanolkotabi, 2017; Zhong et al., 2017) assumed the input distribution to be standard Gaussian and showed that (stochastic) gradient descent is able to recover neural networks with ReLU activation in polynomial time.

One major issue with these analyses is that they rely on specialized analytic properties of the Gaussian distribution (c.f. Section 1.1) and thus cannot be generalized to the non-Gaussian case, into which real-world distributions fall. For general input distributions, new techniques are needed.

In this paper we consider a simple architecture: a convolution layer, followed by a ReLU activation function, and then average pooling. Formally, let $x \in \mathbb{R}^d$ be an input sample, e.g., an image. We generate $k$ patches from $x$, each of size $p$: $Z \in \mathbb{R}^{p \times k}$, where the $i$-th column $Z_i$ is the $i$-th patch generated by some known function $Z_i = Z_i(x)$. For a filter with size 2 and stride 1, $Z_i(x)$ consists of the $i$-th and $(i+1)$-th pixels. Since for convolutional filters we only need to focus on the patches instead of the input, in the following definitions and theorems we will refer to $Z$ as the input and let $\mathcal{Z}$ denote the distribution of $Z$ (here $\sigma(x) = \max(x, 0)$ is the ReLU activation function):

$$f(w, Z) = \frac{1}{k}\sum_{i=1}^{k} \sigma(w^\top Z_i). \tag{1}$$

See Figure 1 (a) for a graphical illustration. Such architectures have been used as the first layer of many works in computer vision (Lin et al., 2013; Milletari et al., 2016).

We address the realizable case, where training data are generated from (1) with some unknown teacher parameter $w_*$ under input distribution $\mathcal{Z}$. Consider the $\ell_2$ loss $\ell(w, Z) = \frac{1}{2}\left(f(w, Z) - f(w_*, Z)\right)^2$. We learn by (stochastic) gradient descent, i.e.,

$$w_{t+1} = w_t - \eta_t\, g(w_t) \tag{2}$$

where $\eta_t$ is the step size, which may change over time, and $g(w_t)$ is a random function whose expectation equals the population gradient, $\mathbb{E}[g(w)] = \mathbb{E}_{Z \sim \mathcal{Z}}[\nabla \ell(w, Z)]$. The goal of our analysis is to understand the conditions under which $w \to w_*$ when $w$ is optimized by (stochastic) gradient descent.
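To make the setup concrete, the following is a minimal sketch of model (1), the squared loss, and the SGD update (2) in plain Python. The input distribution in `sample_patches` (one Gaussian base patch plus small per-patch noise), the teacher `w_star`, the initial point `w0`, and the step size are all illustrative assumptions, not choices taken from the paper.

```python
import math
import random

def relu(x):
    return x if x > 0.0 else 0.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def f(w, patches):
    """Average-pooled ReLU response of Eq. (1); `patches` is a list of k vectors Z_i."""
    return sum(relu(dot(w, z)) for z in patches) / len(patches)

def stochastic_grad(w, w_star, patches):
    """Gradient in w of 0.5 * (f(w, Z) - f(w_star, Z))**2 for a single sample Z."""
    k = len(patches)
    residual = f(w, patches) - f(w_star, patches)
    grad = [0.0] * len(w)
    for z in patches:
        if dot(w, z) > 0.0:  # the ReLU derivative is the indicator of the active region
            for j, zj in enumerate(z):
                grad[j] += residual * zj / k
    return grad

def sample_patches(rng, k, p, noise=0.1):
    """Hypothetical input distribution: k noisy copies of one Gaussian base patch."""
    base = [rng.gauss(0.0, 1.0) for _ in range(p)]
    return [[b + noise * rng.gauss(0.0, 1.0) for b in base] for _ in range(k)]

def sgd(w0, w_star, steps, eta, rng, k=4):
    """Run the update (2) with a constant step size and one fresh sample per step."""
    w = list(w0)
    for _ in range(steps):
        g = stochastic_grad(w, w_star, sample_patches(rng, k, len(w)))
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(0)
w_star = [1.0, -0.5, 0.25]   # illustrative teacher filter
w0 = [0.8, -0.2, 0.5]        # chosen so that ||w0 - w_star|| < ||w_star||
w_hat = sgd(w0, w_star, steps=5000, eta=0.05, rng=rng)
print(dist(w0, w_star), dist(w_hat, w_star))
```

Because the setting is realizable, the stochastic gradient vanishes exactly at `w_star`, which is why a constant step size suffices in this toy run.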

In this setup, our main contributions are as follows:

• Learnability of Filters: We show that if the input patches are highly correlated (Section 3), i.e., $\theta(Z_i, Z_j) \le \rho$ for some small $\rho > 0$, then gradient descent and stochastic gradient descent with random initialization recover the filter in polynomial time. [Footnote 1: Since in this paper we focus on continuous distributions over $Z$, our results do not conflict with previous negative results (Blum & Rivest, 1989; Brutzkus & Globerson, 2017), whose constructions rely on discrete distributions.] Furthermore, strong correlations imply faster convergence. To the best of our knowledge, this is the first recovery guarantee of randomly initialized gradient-based algorithms for learning filters (even for the simplest one-layer one-neuron network) on non-Gaussian input distributions, answering an open problem in (Tian, 2017).

• Distribution-Aware Convergence Rate: We formally establish the connection between the smoothness of the input distribution and the convergence rate for filter weight recovery, where smoothness in our paper is defined as the ratio between the largest and the smallest eigenvalues of the second moment of the activation region (Section 2). We show that a smoother input distribution leads to faster convergence, and the Gaussian distribution is a special case that leads to the tightest bound. This theoretical finding also justifies the two-stage learning rate strategy proposed by (He et al., 2016; Szegedy et al., 2017) if the step size is allowed to change over time.

### 1.1 Related Works

In recent years, theorists have tried to explain the success of deep learning from different perspectives. From the optimization point of view, optimizing a neural network is a non-convex optimization problem. Following the pioneering work of Ge et al. (2015), it is known that a class of non-convex optimization problems that satisfy the strict saddle property can be optimized by perturbed (stochastic) gradient descent in polynomial time (Jin et al., 2017). [Footnote 2: Gradient descent is not guaranteed to converge to a local minimum in polynomial time (Du et al., 2017; Lee et al., 2016).] This motivates research on the landscape of neural networks (Soltanolkotabi et al., 2017; Kawaguchi, 2016; Choromanska et al., 2015; Hardt & Ma, 2016; Haeffele & Vidal, 2015; Mei et al., 2016; Freeman & Bruna, 2016; Safran & Shamir, 2016; Zhou & Feng, 2017; Nguyen & Hein, 2017). However, these results cannot be directly applied to analyzing the convergence of gradient-based methods for ReLU activated neural networks.

From the learning theory point of view, it is well known that training a neural network is hard in the worst case (Blum & Rivest, 1989; Livni et al., 2014; Šíma, 2002; Shalev-Shwartz et al., 2017a, b) and recently, Shamir (2016) showed that neither “niceness” of the target function nor of the input distribution alone is sufficient for optimization algorithms used in practice to succeed. With some additional assumptions, many works tried to design algorithms that provably learn a neural network with polynomial time and sample complexity (Goel et al., 2016; Zhang et al., 2016, 2015; Sedghi & Anandkumar, 2014; Janzamin et al., 2015; Gautier et al., 2016; Goel & Klivans, 2017). However, these algorithms are tailored to certain architectures and cannot explain why (stochastic) gradient based optimization algorithms work well in practice.

Focusing on gradient-based algorithms, a line of research analyzed the behavior of (stochastic) gradient descent for Gaussian input distribution. Tian (2017) showed population gradient descent is able to find the true weight vector with random initialization for one-layer one-neuron model. Brutzkus & Globerson (2017) showed population gradient descent recovers the true weights of a convolution filter with non-overlapping input in polynomial time. Li & Yuan (2017) showed SGD can recover the true weights of a one-layer ResNet model with ReLU activation under the assumption that the spectral norm of the true weights is bounded by a small constant. All the methods use explicit formulas for Gaussian input, which enable them to apply trigonometric inequalities to derive the convergence. With the same Gaussian assumption, Soltanolkotabi (2017) shows that the true weights can be exactly recovered by projected gradient descent with enough samples in linear time, if the number of inputs is less than the dimension of the weights.

Other approaches combine tensor approaches with assumptions on the input distribution. Zhong et al. (2017) proved that with sufficiently good initialization, which can be implemented by a tensor method, gradient descent can find the true weights of a 3-layer fully connected neural network. However, their approach works with known input distributions. Soltanolkotabi (2017) used Gaussian width (c.f. Definition 2.2 of (Soltanolkotabi, 2017)) for concentrations and his approach cannot be directly extended to learning a convolutional filter.

In this paper, we adopt a different approach that only relies on the definition of ReLU. We show as long as the input distribution satisfies weak smoothness assumptions, we are able to find the true weights by SGD in polynomial time. Using our conclusions, we can justify the effectiveness of large amounts of data (which may eliminate saddle points), two-stage and adaptive learning rates used by He et al. (2016); Szegedy et al. (2017), etc.

### 1.2 Organization

This paper is organized as follows. In Section 2, we analyze the simplest one-layer one-neuron model where we state our key observation and establish the connection between smoothness and convergence rate. In Section 3, we discuss the performance of (stochastic) gradient descent for learning a convolutional filter. We provide empirical illustrations in Section 4 and conclude in Section 5. We place most of our detailed proofs in the Appendix.

### 1.3 Notations

Let $\|\cdot\|_2$ denote the Euclidean norm of a finite-dimensional vector. For a matrix $A$, we use $\lambda_{\max}(A)$ to denote its largest singular value and $\lambda_{\min}(A)$ its smallest singular value. Note that if $A$ is a positive semidefinite matrix, $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ represent the largest and smallest eigenvalues of $A$, respectively. Let $O(\cdot)$ and $\Theta(\cdot)$ denote the standard Big-O and Big-Theta notations that hide absolute constants. We assume the gradient function is uniformly bounded, i.e., there exists $B > 0$ such that $\|g(w)\|_2 \le B$. This condition is satisfied as long as the patches, $w$ and the noise are all bounded.

## 2 Warm Up: Analyzing One-Layer One-Neuron Model

Before diving into the convolutional filter, we first analyze the special case $k = 1$, which is equivalent to the one-layer one-neuron architecture. The analysis in this simple case will give us insights for the fully general case. For ease of presentation, we define the following two events and the corresponding second moments

$$S(w, w_*) = \{Z : w^\top Z \ge 0,\ w_*^\top Z \ge 0\}, \qquad S(w, -w_*) = \{Z : w^\top Z \ge 0,\ w_*^\top Z \le 0\}, \tag{3}$$

$$A_{w,w_*} = \mathbb{E}\left[ZZ^\top \mathbb{I}\{S(w, w_*)\}\right], \qquad A_{w,-w_*} = \mathbb{E}\left[ZZ^\top \mathbb{I}\{S(w, -w_*)\}\right],$$

where $\mathbb{I}\{\cdot\}$ is the indicator function. Intuitively, $S(w, w_*)$ is the joint activation region of $w$ and $w_*$, and $S(w, -w_*)$ is the joint activation region of $w$ and $-w_*$. See Figure 2 (a) for the graphical illustration. With some simple algebra we can derive the population gradient:

$$\mathbb{E}[\nabla \ell(w, Z)] = A_{w,w_*}(w - w_*) + A_{w,-w_*}\, w.$$

One key observation is that we can write the inner product $\langle \nabla_w \ell(w), w - w_* \rangle$ as the sum of two non-negative terms (c.f. Lemma A.1). This observation directly leads to the following Theorem 2.1.
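The decomposition behind this observation can be checked numerically. The sketch below assumes a standard Gaussian input in three dimensions and two arbitrary vectors `w` and `w_star` (all illustrative choices). It averages the per-sample inner product between the gradient and $w - w_*$, and separately accumulates the contribution of the region $S(w, w_*)$ and of the region $S(w, -w_*)$.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lemma_a1_terms(w, w_star, samples):
    """Return (<grad, w - w*>, term on S(w, w*), term on S(w, -w*)), sample averages."""
    diff = [a - b for a, b in zip(w, w_star)]
    inner = term1 = term2 = 0.0
    for z in samples:
        a, b, d = dot(w, z), dot(w_star, z), dot(diff, z)
        if a > 0.0:
            # per-sample gradient of 0.5 * (relu(a) - relu(b))**2 is (a - relu(b)) * z
            inner += (a - max(b, 0.0)) * d
            if b >= 0.0:
                term1 += d * d   # region S(w, w*): ((w - w*)^T z)^2
            else:
                term2 += d * a   # region S(w, -w*): ((w - w*)^T z) * (w^T z)
    n = len(samples)
    return inner / n, term1 / n, term2 / n

rng = random.Random(0)
samples = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(20000)]
inner, term1, term2 = lemma_a1_terms([0.5, 1.0, -0.3], [1.0, 0.2, 0.4], samples)
print(inner, term1, term2)
```

On $S(w, -w_*)$ both $w^\top Z$ and $(w - w_*)^\top Z$ are non-negative, so the second term is non-negative sample by sample, not only in expectation.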

###### Theorem 2.1.

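Suppose that for any $w_1, w_2$ with $\theta(w_1, w_2) < \pi$ we have $\mathbb{E}\left[ZZ^\top \mathbb{I}\{S(w_1, w_2)\}\right] \succ 0$, and that the initialization $w_0$ satisfies $\ell(w_0) < \ell(0)$. Then the gradient descent algorithm recovers $w_*$.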

The first assumption is about the non-degeneracy of the input distribution. One case in which the assumption fails is when the input distribution is supported on a low-dimensional space, i.e., is degenerate. The second assumption on the initialization ensures that gradient descent does not converge to $w = 0$, at which the gradient is undefined. This is a general convergence theorem that holds for a wide class of input distributions and initialization points. In particular, it includes Theorem 6 of (Tian, 2017) as a special case. If the input distribution is degenerate, i.e., there are holes in the input space, gradient descent may get stuck around saddle points and we believe more data are needed to facilitate the optimization procedure. This is also consistent with empirical evidence that more data are helpful for optimization.

### 2.1 Convergence Rate of One-Layer One-Neuron Model

In the previous section we showed if the distribution is regular and the weights are initialized appropriately, gradient descent recovers the true weights when it converges. In practice we also want to know how many iterations are needed. To characterize the convergence rate, we need some quantitative assumptions. We note that different set of assumptions will lead to a different rate and ours is only one possible choice. In this paper we use the following quantities.

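###### Definition 2.1 (The Largest/Smallest Eigenvalues of the Second Moment on the Intersection of Two Half Spaces).

For $\phi \in [0, \pi]$, define

$$\gamma(\phi) = \min_{w:\, \theta(w, w_*) = \phi} \lambda_{\min}(A_{w,w_*}), \qquad L(\phi) = \max_{w:\, \theta(w, w_*) = \phi} \lambda_{\max}(A_{w,w_*}).$$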

These two conditions quantitatively characterize the angular smoothness of the input distribution. For a given angle $\phi$, if the difference between $\gamma(\phi)$ and $L(\phi)$ is large, then there is one direction that has large probability mass and one direction that has small probability mass, meaning the input distribution is not smooth. On the other hand, if $\gamma(\phi)$ and $L(\phi)$ are close, then all directions have similar probability mass, which means the input distribution is smooth. The smoothest input distributions are rotationally invariant distributions (e.g. standard Gaussian), which have $\gamma(\phi) = L(\phi)$. By analogy, we can think of $L(\phi)$ as the Lipschitz constant of the gradient and $\gamma(\phi)$ as the strong convexity parameter in the optimization literature, but here we also allow them to change with the angle. Also observe that when $\phi = \pi$, $\gamma(\phi) = L(\phi) = 0$ because the intersection has measure $0$, and both $\gamma(\phi)$ and $L(\phi)$ are monotonically decreasing.

Our next assumption is on the growth of $A_{w,-w_*}$. Note that when $\theta(w, w_*) = 0$, then $A_{w,-w_*} = 0$ because the intersection between the activation regions of $w$ and $-w_*$ has measure $0$. Also, $A_{w,-w_*}$ grows as the angle between $w$ and $w_*$ becomes larger.

In the following, we assume the operator norm of $A_{w,-w_*}$ increases smoothly with respect to the angle. The intuition is that as long as the input distribution has bounded probability density with respect to the angle, the operator norm of $A_{w,-w_*}$ is bounded. We show in Theorem A.1 that $\beta = 1$ for rotationally invariant distributions and in Theorem A.2 that $\beta = p$ for the standard Gaussian distribution.

###### Assumption 2.1.

We assume there exists $\beta > 0$ such that for $0 \le \phi \le \pi/2$, $\max_{w:\, \theta(w, w_*) \le \phi} \lambda_{\max}(A_{w,-w_*}) \le \beta\phi$.

Now we are ready to state the convergence rate.

###### Theorem 2.2.

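Suppose the initialization $w_0$ satisfies $\|w_0 - w_*\|_2 < \|w_*\|_2$. Denote $\phi_t = \arcsin\left(\frac{\|w_t - w_*\|_2}{\|w_*\|_2}\right)$. Then if the step size is set as $0 \le \eta_t \le \min_{0 \le \phi \le \phi_t} \frac{\gamma(\phi)}{2\left(L(\phi) + 4\beta\right)^2}$, we have for $t = 1, 2, \ldots$

$$\|w_{t+1} - w_*\|_2^2 \le \left(1 - \frac{\eta_t\, \gamma(\phi_t)}{2}\right)\|w_t - w_*\|_2^2.$$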

Note that the admissible step size only grows as $\phi_t$ decreases, so we can choose a constant step size $\eta_t = \eta$. This theorem implies that we can find an $\epsilon$-close solution of $w_*$ in $O\left(\frac{1}{\eta\, \gamma(\phi_0)} \log\frac{1}{\epsilon}\right)$ iterations. It also suggests a direct relation between the smoothness of the distribution and the convergence rate. For a smooth distribution, where $\gamma(\phi)$ and $L(\phi)$ are close and $\beta$ is small, the admissible step size is relatively large and we need fewer iterations. On the other hand, if $L(\phi)$ or $\beta$ is much larger than $\gamma(\phi)$, we will need more iterations. We verify this intuition in Section 4.
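The iteration count can be read off from the contraction in Theorem 2.2. As a sketch, assume a constant step size $\eta$ and write $\gamma_0 = \gamma(\phi_0)$; since $\phi_t$ is non-increasing and $\gamma$ is monotonically decreasing, $\gamma(\phi_t) \ge \gamma_0$ along the trajectory. The symbols $\gamma_0$ and $T$ are introduced here only for illustration. Unrolling the recursion gives

$$\|w_T - w_*\|_2^2 \le \left(1 - \frac{\eta\gamma_0}{2}\right)^{T} \|w_0 - w_*\|_2^2 \le \exp\left(-\frac{\eta\gamma_0 T}{2}\right)\|w_0 - w_*\|_2^2,$$

so $\|w_T - w_*\|_2 \le \epsilon$ holds as soon as

$$T \ge \frac{4}{\eta\gamma_0}\,\log\frac{\|w_0 - w_*\|_2}{\epsilon}.$$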

If we are able to choose the step sizes adaptively, for example using methods proposed by Lin & Xiao (2014), we may improve the computational complexity. This justifies the use of the two-stage learning rate strategy proposed by He et al. (2016); Szegedy et al. (2017), where at the beginning we need to choose the learning rate to be small because the admissible step size is small, and later we can choose a large learning rate because, as the angle between $w_t$ and $w_*$ becomes smaller, the admissible step size becomes bigger.

The theorem requires the initialization to satisfy $\|w_0 - w_*\|_2 < \|w_*\|_2$, which can be achieved by random initialization with constant success probability. See Section 3.2 for a detailed discussion.

## 3 Main Results for Learning a Convolutional Filter

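In this section we generalize ideas from the previous section to analyze the convolutional filter. First, for given $w$ and $w_*$ we define four events that divide the input space of each patch $Z_i$. Each event corresponds to a different activation region induced by $w$ and $w_*$, similar to (3).

$$S(w, w_*)_i = \{Z_i : w^\top Z_i \ge 0,\ w_*^\top Z_i \ge 0\}, \qquad S(w, -w_*)_i = \{Z_i : w^\top Z_i \ge 0,\ w_*^\top Z_i \le 0\},$$

$$S(-w, w_*)_i = \{Z_i : w^\top Z_i \le 0,\ w_*^\top Z_i \ge 0\}, \qquad S(-w, -w_*)_i = \{Z_i : w^\top Z_i \le 0,\ w_*^\top Z_i \le 0\}.$$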

Please check Figure 2 (a) again for illustration. For the ease of presentation we also define the average over all patches in each region

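$$Z_{S(w,w_*)} = \frac{1}{k}\sum_{i=1}^{k} Z_i\, \mathbb{I}\{S(w, w_*)_i\}, \qquad Z_{S(w,-w_*)} = \frac{1}{k}\sum_{i=1}^{k} Z_i\, \mathbb{I}\{S(w, -w_*)_i\}, \qquad Z_{S(-w,w_*)} = \frac{1}{k}\sum_{i=1}^{k} Z_i\, \mathbb{I}\{S(-w, w_*)_i\}.$$

Next, we generalize the smoothness conditions, analogous to Definition 2.1 and Assumption 2.1. Here the smoothness is defined over the average of patches.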

###### Assumption 3.1.

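For $\phi \in [0, \pi]$, define

$$\gamma(\phi) = \min_{w:\, \theta(w, w_*) = \phi} \lambda_{\min}\left(\mathbb{E}\left[Z_{S(w,w_*)} Z_{S(w,w_*)}^\top\right]\right), \qquad L(\phi) = \max_{w:\, \theta(w, w_*) = \phi} \lambda_{\max}\left(\mathbb{E}\left[Z_{S(w,w_*)} Z_{S(w,w_*)}^\top\right]\right). \tag{4}$$

We assume for all $0 \le \phi \le \pi/2$, $\max_{w:\, \theta(w, w_*) \le \phi} \lambda_{\max}\left(\mathbb{E}\left[Z_{S(w,-w_*)} Z_{S(w,-w_*)}^\top\right]\right) \le \beta\phi$ for some $\beta > 0$.

The main difference between the simple one-layer one-neuron network and the convolution filter is that two patches may appear in different regions. For a given sample, there may exist patches $Z_i$ and $Z_j$ such that $Z_i \in S(w, w_*)_i$ and $Z_j \in S(w, -w_*)_j$, and their interaction plays an important role in the convergence of (stochastic) gradient descent. Here we assume the second moment of this interaction, i.e., the cross-covariance, also grows smoothly with respect to the angle.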

###### Assumption 3.2.

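We assume there exists $L_{cross} > 0$ such that

$$\max_{w:\, \theta(w, w_*) \le \phi} \lambda_{\max}\left(\mathbb{E}\left[Z_{S(w,w_*)} Z_{S(w,-w_*)}^\top\right]\right) + \lambda_{\max}\left(\mathbb{E}\left[Z_{S(w,w_*)} Z_{S(-w,w_*)}^\top\right]\right) + \lambda_{\max}\left(\mathbb{E}\left[Z_{S(w,-w_*)} Z_{S(-w,w_*)}^\top\right]\right) \le L_{cross}\, \phi.$$

First note that if $\phi = 0$, then $Z_{S(w,-w_*)}$ and $Z_{S(-w,w_*)}$ are supported on sets of measure $0$, and this assumption models the growth of the cross-covariance. Next note that $L_{cross}$ represents the closeness of patches. If $Z_i$ and $Z_j$ are very similar, then the joint probability density of $Z_i \in S(w, w_*)_i$ and $Z_j \in S(w, -w_*)_j$ is small, which implies $L_{cross}$ is small. In the extreme setting $Z_1 = \ldots = Z_k$, we have $L_{cross} = 0$ because in this case the events $S(w, w_*)_i \cap S(w, -w_*)_j$, $S(w, w_*)_i \cap S(-w, w_*)_j$ and $S(w, -w_*)_i \cap S(-w, w_*)_j$ all have measure $0$.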

Now we are ready to present our result on learning a convolutional filter by gradient descent.

###### Theorem 3.1.

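Suppose the initialization $w_0$ satisfies $\|w_0 - w_*\|_2 < \|w_*\|_2$, and denote $\phi_t = \arcsin\left(\frac{\|w_t - w_*\|_2}{\|w_*\|_2}\right)$, where $\phi_0$ satisfies $\gamma(\phi_0) > 6L_{cross}$. Then, if the step size $\eta$ is chosen sufficiently small, we have for $t = 1, 2, \ldots$

$$\|w_{t+1} - w_*\|_2^2 \le \left(1 - \frac{\eta\left(\gamma(\phi_t) - 6L_{cross}\right)}{2}\right)\|w_t - w_*\|_2^2.$$

Our theorem suggests that if the initialization satisfies $\gamma(\phi_0) > 6L_{cross}$, we obtain a linear convergence rate. In Section 3.1, we give a concrete example showing that closeness of patches implies large $\gamma(\phi_0)$ and small $L_{cross}$. Similar to Theorem 2.2, with an appropriately chosen constant step size we can find an $\epsilon$-close solution of $w_*$ in a number of iterations that scales with $\log(1/\epsilon)$, and the proof is also similar to that of Theorem 3.1.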

In practice, we never get the true population gradient but only a stochastic gradient (c.f. Equation (2)). The following theorem shows that SGD also recovers the underlying filter.

###### Theorem 3.2.

Let . Denote , and . For sufficiently small, if , then we have in iterations, with probability at least we have

Unlike the vanilla gradient descent case, here the convergence rate depends on instead of . This is because of the randomness in SGD and we need a more robust initialization. We choose to be the average of and for the ease of presentation. As will be apparent in the proof we only require not very close to . The proof relies on constructing a martingale and use Azuma-Hoeffding inequality and this idea has been previously used by Ge et al. (2015).

### 3.1 What distribution is easy for SGD to learn a convolutional filter?

Unlike the one-layer one-neuron model, here we also require the Lipschitz constant for closeness, $L_{cross}$, to be relatively small and $\gamma(\phi_0)$ to be relatively large. A natural question is: what input distributions satisfy this condition?

Here we give an example. We show if (1) patches are close to each other (2) the input distribution has small probability mass around the decision boundary then the assumption in Theorem 3.1 is satisfied. See Figure 1 (b)-(c) for the graphical illustrations.

###### Theorem 3.3.

Denote . Suppose all patches have unit norm [Footnote 3: This condition can be relaxed to the case where the norm and the angle of each patch are independent and the norm of each pair is independent of the others.] and for all for all , . Further assume there exists such that for any and for all

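$$\mathbb{P}\left[\theta(Z_i, w_*) \in \left[\tfrac{\pi}{2} - \phi,\ \tfrac{\pi}{2} + \phi\right]\right] \le \mu\phi, \qquad \mathbb{P}\left[\theta(Z_i, w_*) \in \left[-\tfrac{\pi}{2} - \phi,\ -\tfrac{\pi}{2} + \phi\right]\right] \le \mu\phi,$$

then we have

$$\gamma(\phi_0) \ge \gamma_{avg}(\phi_0) - 4(1 - \cos\rho) \quad \text{and} \quad L_{cross} \le 3\mu.$$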

where , analogue to Definition 2.1.

Several comments are in order. We view $\rho$ as a quantitative measure of the closeness between different patches, i.e., small $\rho$ means they are similar.

This lower bound is monotonically decreasing as a function of $\rho$, and note that when $\rho = 0$ it reduces to $\gamma(\phi_0) \ge \gamma_{avg}(\phi_0)$, which recovers Definition 2.1.

For the upper bound on $L_{cross}$, $\mu$ represents the upper bound of the probability density around the decision boundary. For example if , then for in a small neighborhood around , say radius , we have . This assumption is usually satisfied in real-world examples like images because image patches are not usually close to the decision boundary. For example, in computer vision, local image patches often form clusters and are not evenly distributed over the appearance space. Therefore, if we use a linear classifier to separate their cluster centers from the rest of the clusters, the probability mass near the decision boundary should be very low.

### 3.2 The Power of Random Initialization

For the one-layer one-neuron model, we need the initialization to satisfy $\|w_0 - w_*\|_2 < \|w_*\|_2$, and for the convolution filter we need the stronger initialization condition of Theorem 3.1. The following theorem shows that with uniformly random initialization we have constant probability of obtaining a good initialization. Note that with this theorem at hand, we can boost the success probability arbitrarily close to $1$ by random restarts. The proof is similar to (Tian, 2017).

###### Theorem 3.4.

If we uniformly sample from a -dimensional ball with radius so that , then with probability at least , we have .

To apply this general initialization theorem to our convolution filter case, we can choose . Therefore, with some simple algebra we have the following corollary.

###### Corollary 3.1.

Suppose , then if is uniformly sampled from a ball with center and radius , we have with probability at least

The assumption of this corollary is satisfied if the patches are close to each other as discussed in the previous section.
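The flavor of this initialization result can be illustrated with a small Monte Carlo experiment. The sketch below samples $w_0$ uniformly from a ball centered at the origin and estimates how often the one-neuron condition $\|w_0 - w_*\|_2 < \|w_*\|_2$ holds. The dimension, the teacher vector, and the relative radius `rel_radius = 0.1` are illustrative assumptions and are not the constants of Theorem 3.4.

```python
import math
import random

def sample_ball(rng, p, radius):
    """Draw a point uniformly from the p-dimensional ball of the given radius."""
    g = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(x * x for x in g))
    r = radius * rng.random() ** (1.0 / p)  # radial law of the uniform distribution
    return [r * x / norm for x in g]

def good_init_rate(w_star, rel_radius, trials, seed=0):
    """Estimate P(||w0 - w*|| < ||w*||) for w0 uniform in a ball of radius rel_radius * ||w*||."""
    rng = random.Random(seed)
    p = len(w_star)
    norm_star = math.sqrt(sum(x * x for x in w_star))
    hits = 0
    for _ in range(trials):
        w0 = sample_ball(rng, p, rel_radius * norm_star)
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(w0, w_star)))
        hits += d < norm_star
    return hits / trials

rate = good_init_rate([1.0] + [0.0] * 9, rel_radius=0.1, trials=20000)
print(rate)
```

The condition is equivalent to $w_0^\top w_* > \|w_0\|_2^2 / 2$, so for a small ball the success rate sits a little below one half, and a handful of random restarts makes a good initialization very likely.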

## 4 Experiments

In this section we use simulations to verify our theoretical findings. We first test how the smoothness affects the convergence rate in the one-layer one-neuron model described in Section 2. To construct input distributions with different $\gamma(\phi)$, $L(\phi)$ and $\beta$ (c.f. Definition 2.1 and Assumption 2.1), we fix the patch to have unit norm and use a mixture of truncated Gaussian distribution to model on the angle around and around the . Specifically, the probability density of is sampled from . Note by definitions of and if the probability mass is centered around , so the distribution is very spiky and and will be large. On the other hand, if , then the input distribution is close to the rotation invariant distribution and and will be small. Figure 3a verifies our prediction where we fix the initialization and step size.

Next we test how the closeness of patches affects the convergence rate in the convolution setting. We first generate a single patch using the above model with , then generate each unit norm whose angle with , is sampled from . Figure 3b shows that as the variance between patches becomes smaller, we obtain a faster convergence rate, which coincides with Theorem 3.1.

We also test whether SGD can learn a filter on real-world data. Here we choose MNIST data and generate labels using two filters. One is a random filter where each entry is sampled from a standard Gaussian distribution (Figure 3(a)) and the other is a Gabor filter (Figure 3(b)). Figure 3a and Figure 3c show convergence rates of SGD with different initializations. Here, better initializations give faster rates, which coincides with our theory. Note that here we report the relative loss, i.e., the logarithm of the squared error divided by the square of the mean of the data points, instead of the difference between the learned filter and the true filter, because we found SGD often cannot converge to the exact filter but rather to a filter with near zero loss. We believe this is because the data approximately lie on a low-dimensional manifold on which the learned filter and the true filter are equivalent. To justify this conjecture, we linearly interpolate between the learned filter and the true filter, and the resulting filter has similarly low loss (c.f. Figure 5). Lastly, we visualize the true filters and the learned filters in Figure 4 and we can see that they have similar patterns.

## 5 Conclusions and Future Works

In this paper we provide the first recovery guarantee of (stochastic) gradient descent algorithm with random initialization for learning a convolution filter when the input distribution is not Gaussian. Our analyses only used the definition of ReLU and some mild structural assumptions on the input distribution. Here we list some future directions.

One possibility is to extend our result to deeper and wider architectures. Even for a two-layer fully-connected network, the convergence of (stochastic) gradient descent with random initialization is not known. Existing results either require sufficiently good initialization (Zhong et al., 2017) or rely on a special architecture (Li & Yuan, 2017). However, we believe the insights from this paper are helpful for understanding the behavior of gradient-based algorithms in these settings.

Another direction is to consider the agnostic setting, where the label is not equal to the output of a neural network. This will lead to different dynamics of (stochastic) gradient descent and we may need to analyze the robustness of the optimization procedures. This problem is also related to the expressiveness of the neural network (Raghu et al., 2016), where the underlying function is not equal to but is close to a neural network. We believe our analysis can be extended to this setting.

#### Acknowledgment

The authors would like to thank Hanzhang Hu, Tengyu Ma, Yuanzhi Li, Jialei Wang and Kai Zhong for useful discussions.

## Appendix A Proofs and Additional Theorems

### A.1 Proofs of the Theorems in Section 2

###### Lemma A.1.
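$$\langle \nabla_w \ell(w), w - w_* \rangle = (w - w_*)^\top A_{w,w_*}(w - w_*) + (w - w_*)^\top A_{w,-w_*}\, w, \tag{5}$$

and both terms are non-negative.

###### Proof.

Since $A_{w,w_*} \succeq 0$ and $A_{w,-w_*} \succeq 0$ (positive-semidefinite), both the first term and one part of the second term, $w^\top A_{w,-w_*} w$, are non-negative. The other part of the second term is

$$-w_*^\top A_{w,-w_*}\, w = -\mathbb{E}\left[(w_*^\top Z)(w^\top Z)\, \mathbb{I}\{w^\top Z \ge 0,\ w_*^\top Z \le 0\}\right] \ge 0.$$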

###### Proof of Theorem 2.1.

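The assumption on the input distribution ensures that when $\theta(w, w_*) \neq \pi$, $A_{w,w_*} \succ 0$, and when $\theta(w, w_*) \neq 0$, $A_{w,-w_*} \succ 0$. Now when gradient descent converges we have $\nabla_w \ell(w) = 0$. By assumption, since $\ell(w_0) < \ell(0)$ and gradient descent only decreases the function value, we will not converge to $w = 0$. Note that at any critical point, $\langle \nabla_w \ell(w), w - w_* \rangle = 0$, so from Lemma A.1 we have:

$$(w - w_*)^\top A_{w,w_*}(w - w_*) = 0, \tag{6}$$

$$(w - w_*)^\top A_{w,-w_*}\, w = 0. \tag{7}$$

Suppose we are converging to a critical point $w \neq w_*$. There are two cases:

• If $\theta(w, w_*) \neq \pi$, then we have $(w - w_*)^\top A_{w,w_*}(w - w_*) > 0$, which contradicts Eqn. 6.

• If $\theta(w, w_*) = \pi$, without loss of generality, let $w = -\alpha w_*$ for some $\alpha > 0$. By the assumption we know $A_{w,-w_*} \succ 0$. Now the second equation becomes $(1 + \alpha)\alpha\, w_*^\top A_{w,-w_*} w_* > 0$, which contradicts Eqn. 7.

Therefore we have $w = w_*$. ∎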

###### Proof of Theorem 2.2.

Our proof relies on the following simple but crucial observation: if $\|w - w_*\|_2 < \|w_*\|_2$, then

$$\theta(w, w_*) \le \arcsin\left(\frac{\|w - w_*\|_2}{\|w_*\|_2}\right).$$
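Geometrically, the observation says that every point of the ball of radius $\|w - w_*\|_2$ around $w_*$ lies inside the cone tangent to that ball. A quick randomized check is sketched below; the dimension `p = 5`, the Gaussian draws, and the range of perturbation sizes are illustrative assumptions.

```python
import math
import random

def norm(u):
    return math.sqrt(sum(a * a for a in u))

def angle(u, v):
    """Angle between u and v in radians, with the cosine clamped against round-off."""
    c = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, c)))

def worst_gap(trials, p=5, seed=0):
    """Largest observed value of theta(w, w*) - arcsin(||w - w*|| / ||w*||)."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        w_star = [rng.gauss(0.0, 1.0) for _ in range(p)]
        d = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # perturbation whose norm is strictly below ||w_star||
        scale = rng.uniform(0.05, 0.99) * norm(w_star) / norm(d)
        w = [a + scale * b for a, b in zip(w_star, d)]
        ratio = norm([a - b for a, b in zip(w, w_star)]) / norm(w_star)
        worst = max(worst, angle(w, w_star) - math.asin(ratio))
    return worst

print(worst_gap(5000))
```

A non-positive result (up to floating-point error) is what the observation predicts; equality is approached only when $w - w_*$ is orthogonal to $w$.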

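We denote $\theta_t = \theta(w_t, w_*)$, and by the observation we have $\theta_t \le \phi_t$. Recall the gradient descent dynamics,

$$w_{t+1} = w_t - \eta \nabla_{w_t} \ell(w_t) = w_t - \eta\left(\mathbb{E}\left[ZZ^\top \mathbb{I}\{w_t^\top Z \ge 0,\ w_*^\top Z \ge 0\}\right](w_t - w_*) + \mathbb{E}\left[ZZ^\top \mathbb{I}\{w_t^\top Z \ge 0,\ w_*^\top Z \le 0\}\right] w_t\right).$$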

Consider the squared distance to the optimal weight

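$$\|w_{t+1} - w_*\|_2^2 = \|w_t - w_*\|_2^2 - 2\eta\, (w_t - w_*)^\top \nabla_{w_t}\ell(w_t) + \eta^2 \|\nabla_{w_t}\ell(w_t)\|_2^2,$$

where $\nabla_{w_t}\ell(w_t) = A_{w_t,w_*}(w_t - w_*) + A_{w_t,-w_*}\, w_t$ is the population gradient.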

By our analysis in the previous section, the second term is smaller than

where we have used our assumption on the angle. For the third term, we expand it as

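$$\|\nabla_{w_t}\ell(w_t)\|_2^2 = \left\|A_{w_t,w_*}(w_t - w_*) + A_{w_t,-w_*}\, w_t\right\|_2^2 \le L^2(\theta_t)\|w_t - w_*\|_2^2 + 2L(\theta_t)\|w_t - w_*\|_2 \cdot \frac{2\beta\|w_t - w_*\|_2}{\|w_*\|_2}\|w_t\|_2 + \left(\frac{2\beta\|w_t - w_*\|_2}{\|w_*\|_2}\right)^2\|w_t\|_2^2 \le \left(L^2(\theta_t) + 8L(\theta_t)\beta + 16\beta^2\right)\|w_t - w_*\|_2^2,$$

where the last step uses $\|w_t\|_2 \le 2\|w_*\|_2$, which follows from $\|w_t - w_*\|_2 < \|w_*\|_2$.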

Therefore, in summary,

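$$\|w_{t+1} - w_*\|_2^2 \le \left(1 - \eta\gamma(\theta_t) + \eta^2\left(L(\theta_t) + 4\beta\right)^2\right)\|w_t - w_*\|_2^2 \le \left(1 - \frac{\eta\gamma(\theta_t)}{2}\right)\|w_t - w_*\|_2^2 \le \left(1 - \frac{\eta\gamma(\phi_t)}{2}\right)\|w_t - w_*\|_2^2,$$

where the second inequality is by our assumption on the step size and the third is because $\theta_t \le \phi_t$ and $\gamma(\cdot)$ is monotonically decreasing. ∎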

###### Theorem A.1 (Rotational Invariant Distribution).

For any unit norm rotational invariant input distribution, we have .

###### Proof of Theorem A.1.

Without loss of generality, we only need to focus on the plane spanned by and and suppose . Then

It has two eigenvalues

$$\lambda_1(\phi) = \frac{\phi + \sin\phi}{2} \quad \text{and} \quad \lambda_2(\phi) = \frac{\phi - \sin\phi}{2}.$$

Therefore, for . ∎
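The two eigenvalues can be checked by simulation in the plane. The sketch below assumes $Z$ is uniform on the unit circle and takes $w = (1, 0)$ with $w_*$ at angle $\phi$ from it (illustrative choices). Integrating $ZZ^\top$ over a wedge of angular width $\phi$ gives eigenvalues $(\phi \pm \sin\phi)/2$; with the uniform density $1/(2\pi)$ included, the second moment $A_{w,-w_*}$ therefore has eigenvalues $(\phi \pm \sin\phi)/(4\pi)$, which is what the Monte Carlo estimate should approach.

```python
import math
import random

def wedge_second_moment_eigs(phi, n, seed=0):
    """Monte Carlo eigenvalues of E[Z Z^T 1{w^T Z >= 0, w*^T Z <= 0}], Z uniform on the circle."""
    rng = random.Random(seed)
    w = (1.0, 0.0)
    w_star = (math.cos(phi), math.sin(phi))
    sxx = sxy = syy = 0.0
    for _ in range(n):
        t = rng.uniform(0.0, 2.0 * math.pi)
        x, y = math.cos(t), math.sin(t)
        if w[0] * x + w[1] * y >= 0.0 and w_star[0] * x + w_star[1] * y <= 0.0:
            sxx += x * x
            sxy += x * y
            syy += y * y
    sxx, sxy, syy = sxx / n, sxy / n, syy / n
    # closed-form eigenvalues of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]]
    mean = 0.5 * (sxx + syy)
    rad = math.hypot(0.5 * (sxx - syy), sxy)
    return mean + rad, mean - rad

phi = 1.0
lam_max, lam_min = wedge_second_moment_eigs(phi, 200000)
print(lam_max, (phi + math.sin(phi)) / (4.0 * math.pi))
print(lam_min, (phi - math.sin(phi)) / (4.0 * math.pi))
```

Since $\sin\phi \le \phi$, the larger eigenvalue is at most linear in $\phi$, which is the growth behavior that Assumption 2.1 asks for.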

If , then

###### Proof.

Note in previous theorem we can integrate angle and radius separately then multiply them together. For Gaussian distribution, we have . The result follows. ∎

### A.2 Proofs of Theorems in Section 3

###### Proof of Theorem 3.1.

The proof is very similar to Theorem 2.2. Notation-wise, for two events we use as a shorthand for and as a shorthand for . Denote First note with some routine algebra, we can write the gradient as

 ∇wtℓ(wt) = E⎡⎣(d,d)∑(i,j)=(1,1)ZiZ⊤jI{S(w,w∗)iS(w,w∗)j}⎤⎦(w−w∗) +E⎡⎣(d,d)∑(i,j)=(1,1)ZiZ⊤jI{S(w,−w∗)iS(w,−w∗)j}⎤⎦w

We first examine the inner product between the gradient and .

 =(w−w∗)⊤E⎡⎣(d,d)∑(i,j)=(1,1)ZiZjI{S(w,w∗)iS(w,w∗)j}⎤⎦(w−w∗) +(w−w∗)⊤E⎡⎣(d,d)∑(i,j)=(1,1)ZiZjI{S(w,w∗)iS(w,−w∗)j+S(w,−w∗)iS(w,w∗)j+S(w,−w∗)iS(w,−w∗)j}⎤⎦w −(w−w∗)⊤E⎡⎣(d,d)∑(i,j)=(1,1)ZiZjI{S(w,w∗)iS(−w,w∗)j+S(w,−w∗)iS(w,w∗)j+S(w,−w∗)iS(−w,w∗)j}⎤⎦w∗ ≥(w−w∗)⊤E⎡⎣(d,d)∑(i,j)=(1,1)ZiZj