Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima

12/03/2017, by Simon S. Du, et al.

We consider the problem of learning a one-hidden-layer neural network with a non-overlapping convolutional layer and ReLU activation function, i.e., $f(\mathbf{Z}; \mathbf{w}, \mathbf{a}) = \sum_j a_j \sigma(\mathbf{w}^\top \mathbf{Z}_j)$, in which both the convolutional weights $\mathbf{w}$ and the output weights $\mathbf{a}$ are parameters to be learned. We prove that with Gaussian input $\mathbf{Z}$, there is a spurious local minimum that is not a global minimum. Surprisingly, in the presence of this local minimum, starting from randomly initialized weights, gradient descent with weight normalization can still be proven to recover the true parameters with constant probability (which can be boosted to arbitrarily high probability with multiple restarts). We also show that with constant probability, the same procedure can instead converge to the spurious local minimum, showing that the local minimum plays a non-trivial role in the dynamics of gradient descent. Furthermore, a quantitative analysis shows that the gradient descent dynamics has two phases: it starts off slow, but converges much faster after several iterations.


1 Introduction

Deep convolutional neural networks (DCNN) have achieved state-of-the-art performance in many applications such as computer vision (Krizhevsky et al., 2012), natural language processing (Dauphin et al., 2016) and reinforcement learning applied to classic games such as Go (Silver et al., 2016). Despite the highly non-convex nature of the objective function, simple first-order algorithms like stochastic gradient descent and its variants often train such networks successfully. Why these simple methods succeed in learning DCNNs remains elusive from the optimization perspective.

Recently, a line of research (Tian, 2017; Brutzkus & Globerson, 2017; Li & Yuan, 2017; Soltanolkotabi, 2017; Shalev-Shwartz et al., 2017b) assumed the input distribution is Gaussian and showed that stochastic gradient descent with random or zero initialization is able to train a neural network with ReLU activation in polynomial time. However, these results all assume there is only one unknown layer $\mathbf{w}$, while $\mathbf{a}$ is a fixed vector. A natural question thus arises:

Does randomly initialized (stochastic) gradient descent learn neural networks with multiple layers?

(a) Convolutional neural network with an unknown non-overlapping filter and an unknown output layer. In the first (hidden) layer, a filter $\mathbf{w}$ is applied to non-overlapping parts of the input $\mathbf{x}$, which then pass through a ReLU activation function. The final output is the inner product between an output weight vector $\mathbf{a}$ and the hidden-layer outputs.
(b) The convergence of gradient descent for learning the CNN described in Figure 1(a) with Gaussian input, using different initializations. The success case and the failure case correspond to convergence to the global minimum and the spurious local minimum, respectively. In the early iterations the convergence is slow; after that, gradient descent converges at a fast linear rate.
Figure 1: Network architecture that we consider in this paper and convergence of gradient descent for learning the parameters of this network.

In this paper, we take an important step by showing that randomly initialized gradient descent learns a non-linear convolutional neural network with two unknown layers $\mathbf{w}$ and $\mathbf{a}$. To our knowledge, our work is the first of its kind.

Formally, we consider the convolutional case in which a filter is shared among different hidden nodes. Let $\mathbf{x} \in \mathbb{R}^d$ be an input sample, e.g., an image. We generate $k$ patches from $\mathbf{x}$, each of size $p$, collected in a matrix $\mathbf{Z} \in \mathbb{R}^{p \times k}$ whose $i$-th column is the $i$-th patch, generated by selecting some coordinates of $\mathbf{x}$. We further assume there is no overlap between patches. Thus, the neural network function has the following form:

$$f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_{i=1}^{k} a_i \sigma\left(\mathbf{w}^\top \mathbf{Z}_i\right).$$

We focus on the realizable case, i.e., the label is generated according to $y = f(\mathbf{Z}, \mathbf{w}^*, \mathbf{a}^*)$ for some true parameters $\mathbf{w}^*$ and $\mathbf{a}^*$, and we use the $\ell_2$ loss to learn the parameters:

$$\min_{\mathbf{w}, \mathbf{a}} \; \frac{1}{2}\left(f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) - f(\mathbf{Z}, \mathbf{w}^*, \mathbf{a}^*)\right)^2.$$
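To make the setup concrete, here is a minimal NumPy sketch of this forward pass and loss; the dimensions ($p = 8$, $k = 6$) are illustrative choices, not values from the paper.

```python
import numpy as np

def forward(Z, w, a):
    """f(Z; w, a) = sum_i a_i * relu(w^T Z_i), where the columns of Z
    (shape (p, k)) are the k non-overlapping patches of size p."""
    return a @ np.maximum(Z.T @ w, 0.0)

def l2_loss(Z, w, a, w_star, a_star):
    """Realizable l2 loss: 0.5 * (f(Z; w, a) - f(Z; w*, a*))^2."""
    return 0.5 * (forward(Z, w, a) - forward(Z, w_star, a_star)) ** 2

# Illustrative sizes: patch size p = 8, number of patches k = 6.
rng = np.random.default_rng(0)
p, k = 8, 6
x = rng.standard_normal(p * k)   # input vector
Z = x.reshape(k, p).T            # non-overlapping patches as columns
w, a = rng.standard_normal(p), rng.standard_normal(k)
print(forward(Z, w, a))
```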

We assume $\mathbf{x}$ is sampled from a Gaussian distribution and there is no overlap between patches. This assumption is equivalent to assuming that each entry of $\mathbf{Z}$ is sampled i.i.d. from a Gaussian distribution (Brutzkus & Globerson, 2017; Zhong et al., 2017b). Following (Zhong et al., 2017a, b; Li & Yuan, 2017; Tian, 2017; Brutzkus & Globerson, 2017; Shalev-Shwartz et al., 2017b), in this paper we mainly focus on the population loss:

$$\ell(\mathbf{w}, \mathbf{a}) = \mathbb{E}_{\mathbf{Z}}\left[\frac{1}{2}\left(f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) - f(\mathbf{Z}, \mathbf{w}^*, \mathbf{a}^*)\right)^2\right].$$
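Because the population loss is an expectation over Gaussian patches, it can be approximated by simple Monte Carlo; a small sketch, assuming only NumPy, with an arbitrarily chosen sample size:

```python
import numpy as np

def population_loss_mc(w, a, w_star, a_star, n=50_000, seed=0):
    """Monte Carlo estimate of E_Z[0.5 * (f(Z;w,a) - f(Z;w*,a*))^2]
    with every entry of Z i.i.d. standard Gaussian."""
    rng = np.random.default_rng(seed)
    p, k = len(w), len(a)
    Z = rng.standard_normal((n, p, k))
    h = np.maximum(np.einsum('npk,p->nk', Z, w), 0.0)            # relu(w^T Z_i)
    h_star = np.maximum(np.einsum('npk,p->nk', Z, w_star), 0.0)  # relu(w*^T Z_i)
    return 0.5 * np.mean((h @ a - h_star @ a_star) ** 2)
```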

We study whether global convergence, $\mathbf{w} \to \mathbf{w}^*$ and $\mathbf{a} \to \mathbf{a}^*$, can be achieved when minimizing $\ell(\mathbf{w}, \mathbf{a})$ using randomly initialized gradient descent.

A crucial difference between our two-layer network and previous one-layer models is a positive-homogeneity issue. That is, for any $c > 0$, $f(\mathbf{Z}, c\mathbf{w}, \mathbf{a}/c) = f(\mathbf{Z}, \mathbf{w}, \mathbf{a})$. This interesting property allows the network to be rescaled without changing the function it computes. As reported by Neyshabur et al. (2015), it is desirable to have a scaling-invariant learning algorithm to stabilize the training process.
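This homogeneity is easy to verify numerically; a short check with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
Z, w, a, c = rng.standard_normal((8, 6)), rng.standard_normal(8), rng.standard_normal(6), 3.7

f = lambda w, a: a @ np.maximum(Z.T @ w, 0.0)
print(np.isclose(f(c * w, a / c), f(w, a)))   # True: f(Z; c w, a/c) = f(Z; w, a) for c > 0
```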

One commonly used technique to achieve stability is weight normalization, introduced by Salimans & Kingma (2016). As reported in (Salimans & Kingma, 2016), this re-parametrization improves the conditioning of the gradient because it decouples the magnitude of the weight vector from its direction, and it empirically accelerates stochastic gradient descent optimization.

In our setting, we re-parametrize the first layer as $\mathbf{w} = \mathbf{v} / \|\mathbf{v}\|_2$ and the prediction function becomes

$$f(\mathbf{Z}, \mathbf{v}, \mathbf{a}) = \sum_{i=1}^{k} a_i \frac{\sigma\left(\mathbf{v}^\top \mathbf{Z}_i\right)}{\|\mathbf{v}\|_2}. \qquad (1)$$

The loss function is

$$\ell(\mathbf{v}, \mathbf{a}) = \mathbb{E}_{\mathbf{Z}}\left[\frac{1}{2}\left(f(\mathbf{Z}, \mathbf{v}, \mathbf{a}) - f(\mathbf{Z}, \mathbf{w}^*, \mathbf{a}^*)\right)^2\right]. \qquad (2)$$

In this paper we focus on using randomly initialized gradient descent for learning this convolutional neural network. The pseudo-code is listed in Algorithm 1. (Footnote 1: With some simple calculations, we can see that the optimal solution for $\mathbf{a}$ is unique, which we denote as $\mathbf{a}^*$, whereas the optimal solution for $\mathbf{v}$ is not unique, because for every optimal $\mathbf{v}$, $c\mathbf{v}$ with $c > 0$ is also optimal. In this paper, with a slight abuse of notation, we use $\mathbf{v}^*$ to denote this equivalence class of optimal solutions.)

Main Contributions. Our paper makes three contributions. First, we show that if $(\mathbf{v}_0, \mathbf{a}_0)$ is initialized by a specific random initialization scheme, then with constant probability, gradient descent from $(\mathbf{v}_0, \mathbf{a}_0)$ converges to the teacher's parameters $(\mathbf{w}^*, \mathbf{a}^*)$. We can further boost the success rate with more trials.

Second, perhaps surprisingly, we prove that the objective function (Equation (2)) does have a spurious local minimum: using the same random initialization scheme, there exists a pair $(\mathbf{v}_0, \mathbf{a}_0)$ such that gradient descent from $(\mathbf{v}_0, \mathbf{a}_0)$ converges to this bad local minimum. In contrast to previous works on guarantees for non-convex objective functions whose landscape satisfies the "no spurious local minima" property (Li et al., 2016; Ge et al., 2017a, 2016; Bhojanapalli et al., 2016; Ge et al., 2017b; Kawaguchi, 2016), our result provides a concrete counter-example and highlights a conceptually surprising phenomenon:

Randomly initialized local search can find a global minimum in the presence of spurious local minima.

Finally, we conduct a quantitative study of the dynamics of gradient descent. We show that the dynamics of Algorithm 1 have two phases. At the beginning (around the first 50 iterations in Figure 1(b)), because the magnitude of the initial signal (governed by the angle between $\mathbf{v}_0$ and $\mathbf{w}^*$) is small, the prediction error drops slowly. After that, when the signal becomes stronger, gradient descent converges at a much faster rate and the prediction error drops quickly.

1:  Input: Initialization $\mathbf{v}_0$, $\mathbf{a}_0$, learning rate $\eta$.
2:  for $t = 0, 1, 2, \ldots$ do
3:     $\mathbf{v}_{t+1} \leftarrow \mathbf{v}_t - \eta \, \partial \ell(\mathbf{v}_t, \mathbf{a}_t) / \partial \mathbf{v}_t$,
4:     $\mathbf{a}_{t+1} \leftarrow \mathbf{a}_t - \eta \, \partial \ell(\mathbf{v}_t, \mathbf{a}_t) / \partial \mathbf{a}_t$.
5:  end for
Algorithm 1 Gradient Descent for Learning a One-Hidden-Layer CNN with Weight Normalization
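As a concrete companion to the pseudo-code, the following NumPy sketch implements one version of Algorithm 1, estimating the population gradient by Monte Carlo with fresh Gaussian patches at every step. The dimensions, learning rate, batch size, and iteration count are illustrative assumptions rather than the paper's exact choices, and the labels assume $\|\mathbf{w}^*\|_2 = 1$ as in Section 6.

```python
import numpy as np

def forward_wn(Z, v, a):
    """Weight-normalized prediction: f(Z; v, a) = sum_i a_i relu(v^T Z_i) / ||v||_2."""
    return a @ np.maximum(Z.T @ v, 0.0) / np.linalg.norm(v)

def grads(Z, v, a, y):
    """Gradients of 0.5 * (f(Z; v, a) - y)^2 with respect to v and a."""
    nv = np.linalg.norm(v)
    u = v / nv
    pre = Z.T @ v                        # pre-activations v^T Z_i, shape (k,)
    h = np.maximum(pre, 0.0) / nv        # relu(u^T Z_i)
    err = a @ h - y
    ga = err * h
    # df/dv = (1/||v||) sum_i a_i (1[pre_i > 0] Z_i - relu(u^T Z_i) u)
    gv = (err / nv) * (Z @ (a * (pre > 0)) - (a @ h) * u)
    return gv, ga

def algorithm1(v0, a0, w_star, a_star, eta=0.1, iters=3000, batch=128, seed=0):
    """Gradient descent (Algorithm 1) on a Monte Carlo estimate of the population loss."""
    rng = np.random.default_rng(seed)
    p, k = len(v0), len(a0)
    v, a = v0.astype(float), a0.astype(float)
    for _ in range(iters):
        gv, ga = np.zeros(p), np.zeros(k)
        for _ in range(batch):
            Z = rng.standard_normal((p, k))
            y = forward_wn(Z, w_star, a_star)   # realizable labels (||w*|| = 1 assumed)
            dv, da = grads(Z, v, a, y)
            gv += dv / batch
            ga += da / batch
        v -= eta * gv
        a -= eta * ga
    return v, a
```

Whether a run has reached the global minimum or the spurious local minimum can then be diagnosed from the final angle between $\mathbf{v}$ and $\mathbf{w}^*$.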

Technical Insights. The main difficulty in analyzing the convergence is the presence of the spurious local minimum. Note that the local minimum and the global minimum are disjoint (c.f. Figure 1(b)). The key technique we adopt is to characterize the attraction basin of each minimum. Consider the sequence $\{(\mathbf{v}_t, \mathbf{a}_t)\}_{t=0}^{\infty}$ generated by Algorithm 1 with step size $\eta$ from the initialization point $(\mathbf{v}_0, \mathbf{a}_0)$. The attraction basin of a minimum $(\mathbf{v}, \mathbf{a})$ is defined as the set

$$B(\mathbf{v}, \mathbf{a}) = \left\{ (\mathbf{v}_0, \mathbf{a}_0) : \lim_{t \to \infty} (\mathbf{v}_t, \mathbf{a}_t) = (\mathbf{v}, \mathbf{a}) \right\}.$$

The goal is to find a distribution $\mathcal{D}$ for weight initialization so that the probability that the initial weights lie in the attraction basin of the global minimum is bounded below:

$$\mathbb{P}_{(\mathbf{v}_0, \mathbf{a}_0) \sim \mathcal{D}}\left[ (\mathbf{v}_0, \mathbf{a}_0) \in B(\mathbf{v}^*, \mathbf{a}^*) \right] \ge c$$

for some absolute constant $c > 0$.

While it is hard to characterize $B(\mathbf{v}^*, \mathbf{a}^*)$ exactly, we identify a set of initial points, described by the invariance conditions below, that is a subset of $B(\mathbf{v}^*, \mathbf{a}^*)$ (c.f. Lemma 5.2–Lemma 5.4). Furthermore, when the learning rate $\eta$ is sufficiently small, we can design a specific distribution $\mathcal{D}$ under which the initialization lands in this subset with constant probability.

This analysis emphasizes that for non-convex optimization problems, we need to carefully characterize both the trajectory of the algorithm and the initialization. We believe that this idea is applicable to other non-convex problems.

To obtain the convergence rate, we propose a potential function (also called a Lyapunov function in the literature). For this problem we consider the quantity $\sin^2\theta_t$, where $\theta_t$ is the angle between $\mathbf{v}_t$ and $\mathbf{w}^*$, and we show that it shrinks at a geometric rate (c.f. Lemma 5.5).
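In code, this potential is straightforward to track along the iterates; a small helper, assuming only NumPy:

```python
import numpy as np

def potential(v_t, w_star):
    """Lyapunov quantity sin^2(theta_t), where theta_t is the angle
    between v_t and w*; equals 0 exactly when v_t aligns with w*."""
    cos = v_t @ w_star / (np.linalg.norm(v_t) * np.linalg.norm(w_star))
    return 1.0 - np.clip(cos, -1.0, 1.0) ** 2   # sin^2 = 1 - cos^2
```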

Organization. This paper is organized as follows. In Section 3 we introduce the necessary notation and the analytical formulas for the gradient updates in Algorithm 1. In Section 4 we present our main theorems on the performance of the algorithm and their implications. In Section 5 we give a proof sketch of our main theorem. In Section 6 we use simulations to verify our theory. We conclude and list future directions in Section 7. Most detailed proofs are deferred to the appendix.

2 Related Works

From the point of view of learning theory, it is well known that training a neural network is hard in the worst case (Blum & Rivest, 1989; Livni et al., 2014; Šíma, 2002; Shalev-Shwartz et al., 2017a, b), and recently Shamir (2016) showed that assumptions on both the target function and the input distribution are needed for optimization algorithms used in practice to succeed.

Learning neural networks without gradient descent. With some additional assumptions, many works have designed algorithms that provably learn a neural network with polynomial time and sample complexity (Goel et al., 2016; Zhang et al., 2015; Sedghi & Anandkumar, 2014; Janzamin et al., 2015; Goel & Klivans, 2017a, b). However, these algorithms are specially designed for certain architectures and cannot explain why (stochastic) gradient-based optimization algorithms work well in practice.

Gradient-based optimization with Gaussian input. Focusing on gradient-based algorithms, a line of research analyzed the behavior of (stochastic) gradient descent under a Gaussian input distribution. Tian (2017) showed that population gradient descent is able to find the true weight vector with random initialization for a one-layer, one-neuron model. Soltanolkotabi (2017) later improved this result by showing that the true weights can be exactly recovered by empirical projected gradient descent with enough samples in linear time. Brutzkus & Globerson (2017) showed that population gradient descent recovers the true weights of a convolution filter with non-overlapping input in polynomial time. Zhong et al. (2017b, a) proved that with sufficiently good initialization, which can be implemented by a tensor method, gradient descent can find the true weights of one-hidden-layer fully connected and convolutional neural networks. Li & Yuan (2017) showed that SGD can recover the true weights of a one-layer ResNet model with ReLU activation under the assumption that the spectral norm of the true weights is within a small constant of the identity mapping. Panigrahy et al. (2018) also analyzed gradient descent for learning a two-layer neural network, but with different activation functions. This paper follows this line of research, studying the behavior of the gradient descent algorithm with Gaussian inputs.

Local minimum and global minimum. Finding the optimal weights of a neural network is a non-convex problem. Recently, researchers found that if the objective function satisfies two key properties, (1) all saddle points and local maxima are strict (i.e., there exists a direction with negative curvature), and (2) all local minima are global (no spurious local minimum), then perturbed (stochastic) gradient descent (Ge et al., 2015) or methods with second-order information (Carmon et al., 2016; Agarwal et al., 2017) can find a global minimum in polynomial time. (Footnote 2: Lee et al. (2016) showed that vanilla gradient descent converges only to minimizers, with no convergence rate guarantees. Recently, Du et al. (2017a) gave an exponential time lower bound for vanilla gradient descent. In this paper, we give a polynomial convergence guarantee for vanilla gradient descent.) Combined with geometric analyses, these algorithmic results have shown that a large number of problems, including tensor decomposition (Ge et al., 2015), dictionary learning (Sun et al., 2017), matrix sensing (Bhojanapalli et al., 2016; Park et al., 2017), matrix completion (Ge et al., 2017a, 2016) and matrix factorization (Li et al., 2016), can be solved in polynomial time with local search algorithms.

This motivates the study of the landscape of neural networks (Kawaguchi, 2016; Choromanska et al., 2015; Hardt & Ma, 2016; Haeffele & Vidal, 2015; Mei et al., 2016; Freeman & Bruna, 2016; Safran & Shamir, 2016; Zhou & Feng, 2017; Nguyen & Hein, 2017a, b; Ge et al., 2017b; Safran & Shamir, 2017). In particular, Kawaguchi (2016); Hardt & Ma (2016); Zhou & Feng (2017); Nguyen & Hein (2017a, b); Feizi et al. (2017) showed that under some conditions, all local minima are global. Recently, Ge et al. (2017b) showed that using a modified objective function satisfying the two properties above, a one-hidden-layer neural network can be learned by noisy perturbed gradient descent. However, for a nonlinear activation function, in the regime where the number of samples is larger than the number of nodes at every layer, which is usually the case in most deep neural networks, and for natural objective functions like the $\ell_2$ loss, it is still unclear whether the strict saddle and "all local minima are global" properties are satisfied. In this paper, we show that even for a one-hidden-layer neural network with ReLU activation, there exists a spurious local minimum. We further show that, nevertheless, randomly initialized local search can achieve the global minimum with constant probability.

3 Preliminaries

We use bold-faced letters for vectors and matrices, and $\|\cdot\|_2$ to denote the Euclidean norm of a finite-dimensional vector. We let $\mathbf{v}_t$ and $\mathbf{a}_t$ be the parameters at the $t$-th iteration, and $\mathbf{w}^*$ and $\mathbf{a}^*$ be the optimal weights. For two vectors $\mathbf{u}$ and $\mathbf{v}$, we use $\theta(\mathbf{u}, \mathbf{v})$ to denote the angle between them. $a_i$ is the $i$-th coordinate of $\mathbf{a}$, and $\mathbf{Z}_i$ is the transpose of the $i$-th row of $\mathbf{Z}$ (thus a column vector). We denote by $\mathbb{S}^{p-1}$ the unit sphere and by $\mathbb{B}(\mathbf{x}, r)$ the ball centered at $\mathbf{x}$ with radius $r$.

In this paper we assume every patch $\mathbf{Z}_i$ is a vector of i.i.d. Gaussian random variables. The following theorem gives an explicit formula for the population loss. The proof uses the basic rotational invariance and the polar decomposition of Gaussian random variables. See Section A for details.

Theorem 3.1.

If every entry of $\mathbf{Z}$ is i.i.d. sampled from a Gaussian distribution with mean $0$ and variance $1$, then the population loss $\ell(\mathbf{v}, \mathbf{a})$ admits the closed form stated in Equation (3).

Using similar techniques, we can show the gradient also has an analytical form.

Theorem 3.2.

Suppose every entry of $\mathbf{Z}$ is i.i.d. sampled from a Gaussian distribution with mean $0$ and variance $1$, and denote $\phi = \theta(\mathbf{v}, \mathbf{w}^*)$. Then the expected gradients with respect to $\mathbf{v}$ and $\mathbf{a}$ can both be written in closed form.

As a remark, if the second layer $\mathbf{a}$ is fixed, then upon proper scaling, the formulas for the population loss and the gradient with respect to $\mathbf{v}$ are equivalent to the corresponding formulas derived in (Brutzkus & Globerson, 2017; Cho & Saul, 2009). However, when the second layer is not fixed, the gradient with respect to $\mathbf{v}$ depends on the inner product $\mathbf{a}^\top \mathbf{a}^*$, which plays an important role in determining whether gradient descent converges to the global or the local minimum.

4 Main Result

We begin with our main theorem about the convergence of gradient descent.

Theorem 4.1.

Suppose the initialization satisfies $\theta(\mathbf{v}_0, \mathbf{w}^*) < \pi/2$, $\mathbf{a}_0^\top \mathbf{a}^* > 0$, and $\mathbf{1}^\top \mathbf{a}^* - \mathbf{1}^\top \mathbf{a}_0 \ge 0$, and that the step size $\eta$ is sufficiently small. Then the convergence of gradient descent has two phases.
(Phase I: Slow Initial Rate) There exists an iteration count $T_1$, inversely proportional to the magnitude of the initial signal, after which the angle $\theta(\mathbf{v}_{T_1}, \mathbf{w}^*)$ and the second-layer error are small.
(Phase II: Fast Rate) Once the iterates satisfy the conditions entering this phase, there exists $T_2 = \tilde{O}(\log(1/\epsilon))$ (Footnote 3: $\tilde{O}(\cdot)$ hides logarithmic factors) such that $\ell(\mathbf{v}_{T_1+T_2}, \mathbf{a}_{T_1+T_2}) \le \epsilon$.

Theorem 4.1 shows that under certain conditions on the initialization, gradient descent converges to the global minimum. The convergence has two phases: at the beginning, because the initial signal is small, the convergence is quite slow; after $T_1$ iterations, the signal becomes stronger and we enter a regime with a faster convergence rate. See Lemma 5.5 for technical details.

Initialization plays an important role in the convergence. First, Theorem 4.1 requires the initialization to satisfy the three conditions above. Second, the step size and the convergence rate in the first phase also depend on the initialization. If the initial signal is very small, for example when $\theta(\mathbf{v}_0, \mathbf{w}^*)$ is close to $\pi/2$, we can only choose a very small step size, and because $T_1$ depends on the inverse of the initial signal, we need a large number of iterations to enter Phase II. We provide the following initialization scheme, which ensures both the conditions required by Theorem 4.1 and a large enough initial signal.

Theorem 4.2.

Let $\mathbf{v}_0$ be sampled uniformly from a sphere and $\mathbf{a}_0$ be sampled uniformly from a ball, with appropriately chosen radii. Then there exists a sign combination $(\epsilon_1 \mathbf{v}_0, \epsilon_2 \mathbf{a}_0)$ with $\epsilon_1, \epsilon_2 \in \{\pm 1\}$ that satisfies the initialization conditions of Theorem 4.1. Further, with high probability, this initialization also carries a large enough initial signal.

Theorem 4.2 shows that after generating a pair of random vectors $(\mathbf{v}_0, \mathbf{a}_0)$ and trying out all sign combinations $(\pm\mathbf{v}_0, \pm\mathbf{a}_0)$, we can find the global minimum by gradient descent. Further, because the initial signal is not too small, we only need to set the step size to a moderate value, and the number of iterations in Phase I is polynomially bounded. Therefore, Theorem 4.1 and Theorem 4.2 together show that randomly initialized gradient descent learns a one-hidden-layer convolutional neural network in polynomial time. The proof of the first part of Theorem 4.2 uses the symmetry of the unit sphere and ball, and the second part is a standard application of random vectors in high-dimensional spaces. See Lemma 2.5 of (Hardt & Price, 2014) for an example.
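The following sketch mirrors this scheme: sample one random pair and enumerate the four sign combinations. The scaling constants are illustrative assumptions, since the exact radii in Theorem 4.2 involve quantities of $\mathbf{w}^*$ and $\mathbf{a}^*$ that are not reproduced here.

```python
import numpy as np

def random_init_candidates(p, k, v_scale=1.0, a_scale=None, seed=0):
    """Sample (v0, a0) once, then return the four sign combinations
    (+-v0, +-a0) suggested by Theorem 4.2. Scales are illustrative."""
    rng = np.random.default_rng(seed)
    v0 = rng.standard_normal(p)
    v0 *= v_scale / np.linalg.norm(v0)   # uniform direction on the sphere
    if a_scale is None:
        a_scale = 1.0 / np.sqrt(k)       # 1/sqrt(k)-type scale (c.f. Remark 1)
    a0 = a_scale * rng.standard_normal(k)
    return [(sv * v0, sa * a0) for sv in (1, -1) for sa in (1, -1)]
```

Running gradient descent from each candidate and keeping the best run implements the boosting-by-restarts idea from the main contributions.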

Remark 1: For the second layer we use a $1/\sqrt{k}$-type initialization, corroborating common initialization techniques (Glorot & Bengio, 2010; He et al., 2015; LeCun et al., 1998).

Remark 2: The Gaussian input assumption does not necessarily hold in practice, although it is a common assumption in previous papers (Brutzkus & Globerson, 2017; Li & Yuan, 2017; Zhong et al., 2017a, b; Tian, 2017; Xie et al., 2017; Shalev-Shwartz et al., 2017b) and is also considered plausible in (Choromanska et al., 2015). Our result can be easily generalized to rotation-invariant distributions. However, extending to more general distributional assumptions, e.g., the structural conditions used in (Du et al., 2017b), remains a challenging open problem.

Remark 3: We only require the initialization to be smaller than certain quantities of $\mathbf{w}^*$ and $\mathbf{a}^*$. In practice, if the optimization fails, i.e., the initialization is too large, one can halve the initialization size, and eventually these conditions will be met.

4.1 Gradient Descent Can Converge to the Spurious Local Minimum

Theorem 4.2 shows that among $(\pm\mathbf{v}_0, \pm\mathbf{a}_0)$, there is a pair that enables gradient descent to converge to the global minimum. Perhaps surprisingly, the next theorem shows that under some conditions on the underlying truth, there is also a pair that makes gradient descent converge to the spurious local minimum.

Theorem 4.3.

Without loss of generality, let $\|\mathbf{w}^*\|_2 = 1$. Suppose $|\mathbf{1}^\top \mathbf{a}^*|$ is sufficiently small relative to $\|\mathbf{a}^*\|_2$. Let $\mathbf{v}_0$ and $\mathbf{a}_0$ be sampled as in Theorem 4.2. Then, with high probability, there exists a sign combination $(\epsilon_1 \mathbf{v}_0, \epsilon_2 \mathbf{a}_0)$ such that, if it is used as the initialization, then when Algorithm 1 converges, we arrive at the spurious local minimum rather than the global minimum.

Unlike Theorem 4.1, which requires no assumption on the underlying truth $(\mathbf{w}^*, \mathbf{a}^*)$, Theorem 4.3 assumes $|\mathbf{1}^\top \mathbf{a}^*|$ is relatively small. This technical condition comes from the proof, which requires an invariance to hold for all iterations. To ensure such an initialization exists, we need $|\mathbf{1}^\top \mathbf{a}^*|$ to be relatively small. See Section E for more technical insights.

A natural question is whether, as the ratio $(\mathbf{1}^\top \mathbf{a}^*)^2 / \|\mathbf{a}^*\|_2^2$ becomes larger, the probability that randomly initialized gradient descent converges to the global minimum becomes larger as well. We verify this phenomenon empirically in Section 6.

5 Proof Sketch

In Section 5.1, we give a qualitative, high-level intuition for why the initial conditions are sufficient for gradient descent to converge to the global minimum. In Section 5.2, we explain why gradient descent has two phases.

5.1 Qualitative Analysis of Convergence

The convergence to the global optimum relies on a geometric characterization of stationary points and a series of invariants that hold throughout the gradient descent dynamics. The next lemma gives the analysis of stationary points. The main step is to check the first-order condition for stationary points using Theorem 3.2.

Lemma 5.1 (Stationary Point Analysis).

When gradient descent converges, if $\mathbf{v}$ and $\mathbf{w}^*$ are not orthogonal, then the iterates have arrived at either the global minimum or the spurious local minimum.

This lemma shows that when the algorithm converges, and $\mathbf{v}$ and $\mathbf{w}^*$ are not orthogonal, we arrive at either a globally optimal point or the spurious local minimum. Now recall the gradient formula for $\mathbf{v}$: the update direction is the projection of $\mathbf{w}^*$ onto the orthogonal complement of $\mathbf{v}$, scaled by a factor involving $\mathbf{a}^\top \mathbf{a}^*$, where $\left(\mathbf{I} - \frac{\mathbf{v}\mathbf{v}^\top}{\|\mathbf{v}\|_2^2}\right)$ is just the projection matrix onto the complement of $\mathbf{v}$. Therefore, the sign of the inner product between $\mathbf{a}$ and $\mathbf{a}^*$ plays a crucial role in the dynamics of Algorithm 1: if the inner product is positive, the gradient update decreases the angle between $\mathbf{v}$ and $\mathbf{w}^*$, and if it is negative, the angle increases. This observation is formalized in Lemma 5.2 below, after a small numerical illustration of the projection geometry.
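As a sanity check on this geometry, the following finite-difference sketch (arbitrary sizes) verifies that the gradient of the weight-normalized prediction with respect to $\mathbf{v}$ is orthogonal to $\mathbf{v}$, i.e., lies in the orthogonal complement, as expected from the scale invariance of Equation (1):

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 8, 6
Z, v, a = rng.standard_normal((p, k)), rng.standard_normal(p), rng.standard_normal(k)

f = lambda v: a @ np.maximum(Z.T @ v, 0.0) / np.linalg.norm(v)

# Finite-difference gradient of the weight-normalized prediction w.r.t. v.
eps = 1e-6
grad = np.array([(f(v + eps * e) - f(v - eps * e)) / (2 * eps) for e in np.eye(p)])
print(abs(grad @ v))   # ~0: the gradient lies in the orthogonal complement of v
```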

Lemma 5.2 (Invariance I: The Angle between $\mathbf{v}_t$ and $\mathbf{w}^*$ Always Decreases).

If $\mathbf{a}_t^\top \mathbf{a}^* > 0$, then $\theta(\mathbf{v}_{t+1}, \mathbf{w}^*) \le \theta(\mathbf{v}_t, \mathbf{w}^*)$.

This lemma shows that when $\mathbf{a}_t^\top \mathbf{a}^* > 0$ for all $t$, gradient descent converges to the global minimum. Thus, we need to study the dynamics of $\mathbf{a}_t^\top \mathbf{a}^*$. For ease of presentation, without loss of generality, we assume $\|\mathbf{w}^*\|_2 = 1$. By the gradient formula for $\mathbf{a}$, we have

(4)

We can use induction to prove the invariance. If $\mathbf{a}_t^\top \mathbf{a}^* > 0$, then the first term of Equation (4) is non-negative. For the second term, notice that if $\mathbf{1}^\top \mathbf{a}^* - \mathbf{1}^\top \mathbf{a}_t \ge 0$, the second term is non-negative as well. Therefore, as long as $\mathbf{1}^\top \mathbf{a}^* - \mathbf{1}^\top \mathbf{a}_t$ is also non-negative, we have the desired invariance. The next lemma summarizes the above analysis.

Lemma 5.3 (Invariance II: Positive Signal from the Second Layer).

If $\theta(\mathbf{v}_t, \mathbf{w}^*) < \pi/2$, $\mathbf{a}_t^\top \mathbf{a}^* > 0$, and $\mathbf{1}^\top \mathbf{a}^* - \mathbf{1}^\top \mathbf{a}_t \ge 0$, then $\mathbf{a}_{t+1}^\top \mathbf{a}^* > 0$.

It remains to prove $\mathbf{1}^\top \mathbf{a}^* - \mathbf{1}^\top \mathbf{a}_t \ge 0$. Again, we study the dynamics of this quantity. Using the gradient formula and some algebra, one can show that the sign of $\mathbf{1}^\top \mathbf{a}^* - \mathbf{1}^\top \mathbf{a}_t$ is preserved from one iteration to the next, where we have used the fact that $\theta(\mathbf{v}_t, \mathbf{w}^*) < \pi/2$ for all $t$. This implies the third invariance.

Lemma 5.4 (Invariance III: Summation of Second Layer Always Small).

If $\mathbf{1}^\top \mathbf{a}^* - \mathbf{1}^\top \mathbf{a}_t \ge 0$ and the step size $\eta$ is sufficiently small, then $\mathbf{1}^\top \mathbf{a}^* - \mathbf{1}^\top \mathbf{a}_{t+1} \ge 0$.

To sum up, if the initialization satisfies (1) $\theta(\mathbf{v}_0, \mathbf{w}^*) < \pi/2$, (2) $\mathbf{a}_0^\top \mathbf{a}^* > 0$, and (3) $\mathbf{1}^\top \mathbf{a}^* - \mathbf{1}^\top \mathbf{a}_0 \ge 0$, then with Lemmas 5.2, 5.3 and 5.4, by induction we can show convergence to the global minimum. Further, Theorem 4.2 shows these three conditions hold with constant probability under random initialization. A small checker for these conditions follows.
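These conditions are easy to test for a candidate initialization; a minimal sketch, with condition (3) stated as reconstructed above:

```python
import numpy as np

def in_good_region(v0, a0, w_star, a_star):
    """Check the three conditions of Section 5.1:
    (1) theta(v0, w*) < pi/2, (2) a0^T a* > 0, (3) 1^T a* - 1^T a0 >= 0."""
    cos = v0 @ w_star / (np.linalg.norm(v0) * np.linalg.norm(w_star))
    return (cos > 0) and (a0 @ a_star > 0) and (a_star.sum() >= a0.sum())
```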

5.2 Quantitative Analysis of Two Phase Phenomenon

In this section we explain why there is a two-phase phenomenon. Throughout this section, we assume the conditions in Section 5.1 hold. We first consider the convergence of the first layer. Because we use weight normalization, only the angle between $\mathbf{v}_t$ and $\mathbf{w}^*$ affects the prediction. Therefore, we study the dynamics of this angle. The following lemma quantitatively characterizes the shrinkage of this quantity in one iteration.

Lemma 5.5 (Convergence of the Angle between $\mathbf{v}_t$ and $\mathbf{w}^*$).

Under the same assumptions as in Theorem 4.1, let $\theta_t = \theta(\mathbf{v}_t, \mathbf{w}^*)$. If the step size $\eta$ is sufficiently small, we have

$$\sin^2\theta_{t+1} \le (1 - \lambda_t)\sin^2\theta_t,$$

where $\lambda_t > 0$ is a contraction factor that grows with the strength of the signal at iteration $t$.

This lemma shows the convergence rate depends on two crucial quantities: the angle $\theta_t$ and the second-layer signal $\mathbf{a}_t^\top \mathbf{a}^*$. At the beginning, both are small. Nevertheless, Lemma C.3 shows the contraction factor is universally lower bounded, so after a number of iterations inversely proportional to the initial signal, the angle shrinks below a constant. Once this happens, Lemma C.2 shows that after a further bounded number of iterations, the second-layer signal also grows to a constant level (c.f. Lemma C.3). Now we enter Phase II.

In Phase II, Lemma 5.5 shows that $\sin^2\theta_{t+1} \le (1 - c\eta)\sin^2\theta_t$ for some positive absolute constant $c$. Therefore, the convergence rate is much faster than in Phase I: after only $O(\log(1/\epsilon))$ further iterations, the angle satisfies $\sin^2\theta_t \le \epsilon$.

Once we have this, we can use Lemma C.4 to show that one measure of the second-layer error becomes small after a further number of iterations. Next, using Lemma C.5, we can show the remaining second-layer error also becomes small. Lastly, Lemma C.6 shows that once all of these quantities are small, the prediction error satisfies $\ell(\mathbf{v}_t, \mathbf{a}_t) \le \epsilon$.

6 Experiments

In this section, we illustrate our theoretical results with numerical experiments. Again, without loss of generality, we assume $\|\mathbf{w}^*\|_2 = 1$ throughout this section.

6.1 Multi-phase Phenomenon

In Figure 2, we fix the problem dimensions and consider four key quantities that arise in proving Theorem 4.1: the angle between $\mathbf{v}_t$ and $\mathbf{w}^*$ (c.f. Lemma 5.5), two measures of the second-layer error (c.f. Lemma C.5 and Lemma C.4), and the prediction error (c.f. Lemma C.6).

When we achieve the global minimum, all these quantities are $0$. At the beginning (the first few iterations), the second-layer error and the prediction error drop quickly. This is because, in the gradient of $\mathbf{a}$, the term involving $\mathbf{a}^*$ dominates and quickly drives the second layer closer to the truth.

After that, for a long stretch of iterations, all quantities decrease at a slow rate. This corresponds to the Phase I stage of Theorem 4.1; the rate is slow because the initial signal is small.

Later, all quantities drop at a much faster rate. This is because the signal has become very strong, and since the convergence rate is proportional to this signal, we have a much faster convergence rate (c.f. Phase II of Theorem 4.1).

Figure 2: Convergence of the different measures we considered in proving Theorem 4.1. In the early iterations, all quantities drop slowly. After that, these quantities converge at much faster linear rates.

6.2 Probability of Converging to the Global Minimum

In this section we test the probability of converging to the global minimum using the random initialization scheme described in Theorem 4.2. We vary the second-layer dimension $k$ and the ratio $(\mathbf{1}^\top \mathbf{a}^*)^2 / \|\mathbf{a}^*\|_2^2$. For each configuration we run 5000 random initializations and compute the probability of converging to the global minimum.
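A Table-1-style experiment can be sketched as follows. It reuses random_init_candidates and algorithm1 from the earlier sketches; the trial count, iteration budget, success test, and construction of $\mathbf{a}^*$ with a prescribed ratio are all illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np
# Reuses random_init_candidates() and algorithm1() from the sketches above.

def a_star_with_ratio(k, r, seed=0):
    """Unit-norm a* with (1^T a*)^2 / ||a*||_2^2 = r (requires 0 <= r <= k)."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(k)
    q -= q.mean()                        # make q orthogonal to the all-ones vector
    q /= np.linalg.norm(q)
    return (np.sqrt(r) / k) * np.ones(k) + np.sqrt(1 - r / k) * q

def success_probability(p, k, r, trials=200):
    w_star = np.eye(p)[0]                # unit-norm teacher filter, ||w*||_2 = 1
    a_star = a_star_with_ratio(k, r)
    hits = 0
    for s in range(trials):
        v0, a0 = random_init_candidates(p, k, seed=s)[0]   # one sign combination
        v, _ = algorithm1(v0, a0, w_star, a_star, iters=1000, seed=s)
        cos = v @ w_star / np.linalg.norm(v)
        hits += cos > 0.99               # success: v aligned with w* (angle ~ 0)
    return hits / trials
```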

In Theorem 4.3, we showed that if $|\mathbf{1}^\top \mathbf{a}^*|$ is sufficiently small, randomly initialized gradient descent converges to the spurious local minimum with constant probability. Table 1 empirically verifies the importance of this assumption: for every fixed $k$, as the ratio $(\mathbf{1}^\top \mathbf{a}^*)^2 / \|\mathbf{a}^*\|_2^2$ becomes larger, the probability of converging to the global minimum becomes larger.

An interesting phenomenon is that for every fixed ratio, when $k$ becomes larger, the probability of converging to the global minimum becomes smaller. How to quantitatively characterize the relationship between the success probability and the dimension of the second layer is an open problem.

7 Conclusion and Future Works

In this paper we proved the first polynomial convergence guarantee for the randomly initialized gradient descent algorithm learning a one-hidden-layer convolutional neural network. Our result reveals an interesting phenomenon: a randomly initialized local search algorithm can converge to either a global minimum or a spurious local minimum. We gave a quantitative characterization of the gradient descent dynamics to explain the two-phase convergence phenomenon. Experimental results also verify our theoretical findings. Here we list some future directions.

Our analysis focused on the population loss with Gaussian input. In practice one uses (stochastic) gradient descent on the empirical loss. Concentration results as in (Mei et al., 2016; Soltanolkotabi, 2017) would be useful for generalizing our results to the empirical version. A more challenging question is how to extend the analysis of the gradient dynamics beyond rotationally invariant input distributions. Du et al. (2017b) proved the convergence of gradient descent under some structural input distribution assumptions for a one-layer convolutional neural network. It would be interesting to bring their insights to our setting.

Another interesting direction is to generalize our result to deeper and wider architectures. Specifically, an open problem is under what conditions randomly initialized gradient descent algorithms can learn a one-hidden-layer fully connected neural network or a convolutional neural network with multiple kernels. Existing results often require a sufficiently good initialization (Zhong et al., 2017a, b). We believe the insights from this paper, especially the invariance principles in Section 5.1, are helpful for understanding the behavior of gradient-based algorithms in these settings.

k      0     1     4     9     16    25
25     0.50  0.55  0.73  1     1     1
36     0.50  0.53  0.66  0.89  1     1
49     0.50  0.53  0.61  0.78  1     1
64     0.50  0.51  0.59  0.71  0.89  1
81     0.50  0.53  0.57  0.66  0.81  0.97
100    0.50  0.50  0.57  0.63  0.75  0.90
Table 1: Probability of converging to the global minimum for different $k$ (rows) and ratios $(\mathbf{1}^\top \mathbf{a}^*)^2 / \|\mathbf{a}^*\|_2^2$ (columns). For every fixed $k$, when the ratio becomes larger, the probability of converging to the global minimum becomes larger; for every fixed ratio, when $k$ becomes larger, the probability of converging to the global minimum becomes smaller.

8 Acknowledgment

This research was partly funded by NSF grant IIS1563887, AFRL grant FA8750-17-2-0212, and DARPA D17AP00001. J.D.L. acknowledges support of the ARO under MURI Award W911NF-11-1-0303. This is part of the collaboration between US DOD, UK MOD and the UK Engineering and Physical Sciences Research Council (EPSRC) under the Multidisciplinary University Research Initiative. The authors thank Xiaolong Wang and Kai Zhong for useful discussions.

References

Appendix A Proofs of Section 3

Proof of Theorem 3.1.

We first expand the loss function directly, where for simplicity we denote the two resulting squared terms by Equations (5) and (6).

For the quantity in Equation (5), using the second identity of Lemma A.1, we can compute it directly. For the quantity in Equation (6), using the second-moment formula of the half-Gaussian distribution, we can compute it as well.

Now let us compute the cross term. For its first part, similarly to the above, using the independence property of Gaussians, we obtain its value. Next, using the fourth identity of Lemma A.1, we obtain the remaining part. Therefore, we can also write the cross term in a compact form.

Plugging in the formulas for all of these quantities, we obtain the desired result. ∎

Proof of Theorem 3.2.

We first compute the expected gradient with respect to $\mathbf{v}$. From (Salimans & Kingma, 2016), we know the gradient of a weight-normalized layer lies in the orthogonal complement of $\mathbf{v}$. Recall the gradient formula, split into the two terms in Equations (7) and (8).

Now we calculate the expectations of Equations (7) and (8) separately. For Equation (7), by the first two formulas of Lemma A.1, we have