On the Power of Over-parametrization in Neural Networks with Quadratic Activation

03/03/2018, by Simon S. Du et al.

We provide new theoretical insights into why over-parametrization is effective in learning neural networks. For a shallow network with k hidden nodes, quadratic activation, and n training data points, we show that as long as k ≥ √(2n), over-parametrization enables local search algorithms to find a globally optimal solution for general smooth and convex loss functions. Further, even though the number of parameters may exceed the sample size, we use the theory of Rademacher complexity to show that, with weight decay, the solution also generalizes well if the data is sampled from a regular distribution such as the Gaussian. To prove that the loss function has benign landscape properties when k ≥ √(2n), we adopt an idea from smoothed analysis, which may have other applications in studying loss surfaces of neural networks.


1 Introduction

Neural networks have achieved a remarkable impact on many applications such as computer vision, reinforcement learning, and natural language processing. Although neural networks are successful in practice, their theoretical properties are not yet well understood. In particular, there are two intriguing empirical observations that existing theories cannot explain.

  • Optimization: Despite the highly non-convex nature of the objective function, simple first-order algorithms like stochastic gradient descent are able to minimize the training loss of neural networks. Researchers have conjectured that the use of over-parametrization (Livni et al., 2014; Safran and Shamir, 2017) is the primary reason why local search algorithms can achieve low training error. The intuition is that over-parametrization alters the loss function so that it has a large manifold of globally optimal solutions, which in turn allows local search algorithms to find a globally optimal solution more easily.

  • Generalization: From the statistical point of view, over-parametrization may hinder effective generalization, since it greatly increases the number of parameters, to the point where the number of parameters exceeds the sample size. To address this, practitioners often use explicit forms of regularization such as weight decay, dropout, or early stopping to improve generalization. In the non-convex setting, however, we do not yet have a good quantitative theoretical understanding of how these forms of regularization help generalization for neural network models.

In this paper, we provide new theoretical insights into the optimization landscape and generalization ability of over-parametrized neural networks. Specifically, we consider a neural network of the following form:

f(x; W, a) = Σ_{j=1}^k a_j σ(w_j^⊤ x).   (1)

In the above, x ∈ R^d is the input vector, W ∈ R^{k×d} is the weight matrix of the first layer with w_j^⊤ denoting the j-th row of W, and the a_j's are the weights in the second layer. Finally, σ(·) denotes the activation function applied to each hidden node. When the neural network is over-parameterized, the number of hidden nodes k can be very large compared with the input dimension d or the number of training samples n.

In our setting, we fix the second layer a, so the output is simply a fixed weighted sum of the hidden units; we therefore write f(x; W) for brevity. Although this is simpler than the case where the second layer is also trained, the effect of over-parameterization can be studied in this setting as well because we place no restriction on the number of hidden nodes.

We focus on the quadratic activation function σ(z) = z². Though quadratic activations are rarely used in practice, stacking multiple such two-layer blocks can be used to simulate higher-order polynomial neural networks and sigmoidal neural networks (Livni et al., 2014; Soltani and Hegde, 2017).
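
To make the setup concrete, here is a minimal NumPy sketch of the forward pass (our own illustration, not the authors' code), assuming for concreteness that the fixed second layer simply sums the hidden units; the sanity check at the end verifies that the output depends on W only through the positive semidefinite matrix W^⊤W.

```python
import numpy as np

def quadratic_net(W, x):
    """Forward pass of Eq. (1) with sigma(z) = z^2 and a fixed all-ones second layer.

    W: (k, d) first-layer weights; x: (d,) input vector.
    Output: sum_j (w_j^T x)^2 = ||W x||^2 = x^T (W^T W) x.
    """
    hidden = W @ x                      # pre-activations w_j^T x, shape (k,)
    return float(np.sum(hidden ** 2))   # quadratic activation, summed by the fixed second layer

# Sanity check: the network output is a quadratic form in x with matrix W^T W.
rng = np.random.default_rng(0)
d, k = 5, 12                            # over-parameterized: k > d
W = rng.standard_normal((k, d))
x = rng.standard_normal(d)
assert np.isclose(quadratic_net(W, x), x @ (W.T @ W) @ x)
```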

In practice, we have n training samples {(x_i, y_i)}_{i=1}^n and solve the following optimization problem to learn a neural network:

min_W  L̂(W) := (1/n) Σ_{i=1}^n ℓ( f(x_i; W), y_i ),

where ℓ is some loss function such as the ℓ2 or the logistic loss. For gradient descent we use the following update:

W_{t+1} = W_t − η ∇L̂(W_t),

where η is the step size.

To improve the generalization ability, we often add explicit regularization. In this paper, we focus on a particular regularization technique, weight decay, for which we slightly change the gradient descent update to

W_{t+1} = (1 − ηλ) W_t − η ∇L̂(W_t),

where λ is the decay rate. Note this algorithm is equivalent to applying gradient descent to the regularized loss

L(W) := L̂(W) + (λ/2) ‖W‖_F².   (2)
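
As a hedged illustration (a sketch under the assumption of squared loss, not the authors' implementation), the following NumPy snippet implements the weight-decay update above; the comments note why it coincides with a plain gradient step on the regularized loss (2).

```python
import numpy as np

def grad_empirical_loss(W, X, y):
    """Gradient of L_hat(W) = (1/n) sum_i (f(x_i; W) - y_i)^2 with f(x; W) = ||W x||^2.

    Uses d f(x; W)/dW = 2 (W x) x^T, so dL_hat/dW = (4/n) sum_i (f_i - y_i) (W x_i) x_i^T.
    """
    n = X.shape[0]
    pre = X @ W.T                                   # (n, k): rows are W x_i
    f = np.sum(pre ** 2, axis=1)                    # (n,): network outputs
    residual = f - y
    return (4.0 / n) * np.einsum('n,nk,nd->kd', residual, pre, X)

def weight_decay_step(W, X, y, eta=1e-3, lam=1e-2):
    """One step of gradient descent with weight decay.

    (1 - eta*lam) * W - eta * grad  ==  W - eta * (grad + lam * W),
    i.e. a gradient step on L_hat(W) + (lam/2) * ||W||_F^2, the loss in Eq. (2).
    """
    return (1.0 - eta * lam) * W - eta * grad_empirical_loss(W, X, y)

# Usage: a few steps on random data.
rng = np.random.default_rng(0)
n, d, k = 50, 5, 12
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
W = 0.1 * rng.standard_normal((k, d))
for _ in range(100):
    W = weight_decay_step(W, X, y)
```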

In this setup, we make the following theoretical contributions to explain why over-parametrization helps optimization and still allows for generalization.

1.1 Main Contributions

Over-parametrization Helps Optimization.

We analyze two kinds of over-parameterization. First, we show that when k ≥ d, all local minima of Problem (2) are global and all saddle points are strict. These properties, together with recent algorithmic advances in non-convex optimization (Lee et al., 2016), imply that gradient descent with random initialization can find a globally optimal solution. This is a minor generalization of results in (Soltanolkotabi et al., 2017), which only covers the ℓ2 loss, and in (Haeffele and Vidal, 2015; Haeffele et al., 2014), which consider more restricted objectives.

Second, we consider another form of over-parametrization, k ≥ √(2n). This condition on the amount of over-parameterization is much milder than k ≥ n, a condition used in many previous papers (Nguyen and Hein, 2017a, b). Further, in practice k ≥ √(2n) is often a much milder requirement than k ≥ d, since when n and d are of the same order, √(2n) is far smaller than d. In this setting, we consider the perturbed version of Problem (2):

min_W  L̃(W) := L̂(W) + (λ/2) ‖W‖_F² + ⟨M, W^⊤W⟩,   (3)

where M is a random positive semidefinite matrix with arbitrarily small Frobenius norm. We show that if k ≥ √(2n), Problem (3) also has the desired properties that all local minima are global and all saddle points are strict, with probability 1. Since M has small Frobenius norm, the optimal value of Problem (3) is very close to that of Problem (2). See Section 3 for the precise statement.

To prove this perhaps surprising fact, we bring in ideas from smoothed analysis to construct the perturbed loss function (3); we believe this technique will be useful more broadly for analyzing the landscape of non-convex losses.
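
For intuition, here is a small NumPy sketch (our own, written with the perturbation as the inner product ⟨M, W^⊤W⟩, which is one natural reading of Problem (3)) showing how a random positive semidefinite perturbation with arbitrarily small Frobenius norm can be drawn and added to the regularized objective.

```python
import numpy as np

def random_psd(d, frob_norm, rng):
    """Random PSD matrix, absolutely continuous in distribution, rescaled to a
    prescribed (arbitrarily small) Frobenius norm."""
    A = rng.standard_normal((d, d))
    M = A @ A.T                                   # PSD with probability one
    return M * (frob_norm / np.linalg.norm(M, 'fro'))

def perturbed_loss(W, X, y, M, lam):
    """Squared-loss instance of the perturbed objective:
    empirical loss + (lam/2) ||W||_F^2 + <M, W^T W>."""
    f = np.sum((X @ W.T) ** 2, axis=1)            # network outputs ||W x_i||^2
    emp = np.mean((f - y) ** 2)
    reg = 0.5 * lam * np.linalg.norm(W, 'fro') ** 2
    smooth = np.sum(M * (W.T @ W))                # <M, W^T W>, nonnegative here
    return emp + reg + smooth

rng = np.random.default_rng(0)
n, d, k = 40, 6, 10
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
W = rng.standard_normal((k, d))
M = random_psd(d, frob_norm=1e-6, rng=rng)        # tiny perturbation
print(perturbed_loss(W, X, y, M, lam=1e-2))
```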

Weight-decay Helps Generalization.

We show that, because of weight decay, the optimal solution of Problem (2) also generalizes well. The key observation is that weight decay ensures the solution of Problem (2) has low Frobenius norm, which is equivalent to the matrix W^⊤W having low nuclear norm (Srebro et al., 2005). This observation allows us to use the theory of Rademacher complexity to directly obtain quantitative generalization bounds. Our theory applies to a wide range of data distributions and, in particular, does not need to assume the model is realizable. Further, the generalization bound depends neither on the number of epochs SGD runs nor on the number of hidden nodes.
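
The norm equivalence invoked above is a one-line calculation; a minimal worked version (our own, using only that W^⊤W is positive semidefinite) reads:

```latex
\|W\|_F^2 \;=\; \operatorname{tr}\!\left(W^\top W\right)
          \;=\; \sum_{i} \lambda_i\!\left(W^\top W\right)
          \;=\; \sum_{i} \sigma_i\!\left(W^\top W\right)
          \;=\; \left\| W^\top W \right\|_{*}.
```

So a Frobenius-norm constraint on the first-layer weights W is exactly a nuclear-norm constraint on the matrix W^⊤W through which the quadratic-activation network depends on its weights.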

To sum up, in this paper we justify the following folklore.

Over-parametrization allows us to find global optima, and with weight decay, the solution also generalizes well.

1.2 Organization

This paper is organized as follows. In Section 2 we introduce the necessary background and definitions. In Section 3 we present our main theorems on why over-parametrization helps optimization when k ≥ d or k ≥ √(2n). In Section 4, we give quantitative generalization bounds to explain why weight decay helps generalization in the presence of over-parametrization. In Section 5, we prove our main theorems. We conclude and list future work in Section 6.

1.3 Related Works

Neural networks have enjoyed great success in many practical applications (Krizhevsky et al., 2012; Dauphin et al., 2016; Silver et al., 2016). To explain this success, many works have studied the expressiveness of neural networks. The study of the expressive power of shallow neural networks dates back to the 1990s (Barron, 1994), and recent results give more refined analyses of deeper models (Bölcskei et al., 2017; Telgarsky, 2016; Wiatowski et al., 2017).

However, from the point of view of learning theory, it is well known that training a neural network is hard in the worst case (Blum and Rivest, 1989). Despite the worst-case pessimism, local search algorithms such as gradient descent are very successful in practice. Under additional assumptions, many works have designed algorithms that provably learn a neural network (Goel et al., 2016; Sedghi and Anandkumar, 2014; Janzamin et al., 2015). However, these algorithms are not gradient-based and do not explain why local search algorithms work well.

Focusing on gradient-based algorithms, a line of research (Tian, 2017; Brutzkus and Globerson, 2017; Zhong et al., 2017a, b; Li and Yuan, 2017; Du et al., 2017b, c) analyzed the behavior of (stochastic) gradient descent under structural assumptions on the input distribution. The major drawback of these papers is that they all focus on the regression setting with least-squares loss and further assume the model is realizable, meaning the label is the output of a ground-truth neural network plus zero-mean noise, which is often unrealistic. In the case of more than one hidden unit, the papers of Li and Yuan (2017) and Zhong et al. (2017b) further require a stringent initialization condition to recover the true parameters.

Finding the optimal weights of a neural network is a non-convex problem. Recently, researchers have found that if the objective function satisfies two key properties, (1) all local minima are global and (2) all saddle points and local maxima are strict, then first-order methods such as (perturbed) gradient descent (Ge et al., 2015; Jin et al., 2017; Levy, 2016; Du et al., 2017a; Lee et al., 2016) can find a global minimum.

This motivates the study of the landscape of neural networks (Kawaguchi, 2016; Choromanska et al., 2015; Freeman and Bruna, 2016; Zhou and Feng, 2017; Nguyen and Hein, 2017a, b; Ge et al., 2017; Safran and Shamir, 2017; Soltanolkotabi et al., 2017; Poston et al., 1991; Haeffele and Vidal, 2015; Haeffele et al., 2014; Soudry and Hoffer, 2017). In particular, Haeffele and Vidal (2015); Poston et al. (1991); Nguyen and Hein (2017a, b) studied the effect of over-parameterization on training neural networks. These results require a large amount of over-parameterization: the width of one of the hidden layers has to exceed the number of training examples, which is unrealistic for commonly used neural networks. Recently, Soltanolkotabi et al. (2017) showed that for shallow neural networks with the ℓ2 loss, the number of hidden nodes is only required to be larger than or equal to the input dimension. In comparison, our theorems work for general loss functions with regularization under the same assumption. Further, we also propose a new form of over-parameterization, showing that as long as k ≥ √(2n), the loss function also admits a benign landscape.

We now turn our attention to the generalization ability of learned neural networks. It is well known that classical learning theory cannot explain this generalization ability because the VC-dimension of neural networks is large (Harvey et al., 2017; Zhang et al., 2016). One line of research tries to explain this phenomenon by studying the implicit regularization of the stochastic gradient descent algorithm (Hardt et al., 2015; Pensia et al., 2018; Mou et al., 2017; Brutzkus et al., 2017; Li et al., 2017). However, the generalization bounds in these papers often depend on the number of epochs SGD runs, which is large in practice. Another direction is to study generalization based on the norms of the weight matrices in neural networks (Neyshabur et al., 2015, 2017a, 2017b; Bartlett et al., 2017; Liang et al., 2017; Golowich et al., 2017; Dziugaite and Roy, 2017; Wu et al., 2017). Our theorem on generalization also uses this idea but is more specialized to the network architecture (1).

After the initial submission of this manuscript, we became aware of the concurrent work of Bhojanapalli et al. (2018), which also uses the smoothed analysis technique, in their case to solve semidefinite programs in penalty form. The mathematical techniques in our work and in (Bhojanapalli et al., 2018) are similar, but the two works focus on distinct problems: solving semidefinite programs and learning quadratic-activation neural networks, respectively.

2 Preliminaries

We use bold-faced letters for vectors and matrices. For a vector v, we use ‖v‖ to denote the Euclidean norm. For a matrix A, we denote by ‖A‖ the spectral norm and by ‖A‖_F the Frobenius norm. We let null(A) denote the left null space of A, i.e.,

null(A) = { v : v^⊤ A = 0 }.

We use W_B := { W : ‖W‖_F² ≤ B } to denote the set of matrices whose squared Frobenius norm is bounded by B, and { vv^⊤ : ‖v‖ ≤ 1 } for the set of rank-one matrices with spectral norm bounded by 1. We also let S_+^d denote the set of d × d symmetric positive semidefinite matrices.

In this paper, we characterize the landscape of over-parameterized neural networks. More specifically, we study the properties of critical points of the empirical loss. For a loss function L, a critical point W satisfies ∇L(W) = 0. A critical point can be a local minimum or a saddle point. (We do not differentiate between saddle points and local maxima in this paper.) If W is a local minimum, then there is a neighborhood of W such that L(W) ≤ L(W′) for all W′ in that neighborhood. If W is a saddle point, then every neighborhood of W contains a W′ such that L(W′) < L(W).

Ideally, we would like a loss function that satisfies the following two geometric properties.

Property 2.1 (All local minima are global).

If W is a local minimum of L, it is also a global minimum, i.e., L(W) = min_{W′} L(W′).

Property 2.2 (All saddles are strict).

At a saddle point W_s, there is a direction Δ along which the second-order directional derivative is strictly negative, i.e., ∇²L(W_s)[Δ, Δ] < 0.

If a loss function satisfies Property 2.1 and Property 2.2, recent algorithmic advances in non-convex optimization show that randomly initialized gradient descent or perturbed gradient descent can find a global minimum (Lee et al., 2016; Ge et al., 2015; Jin et al., 2017; Du et al., 2017a).

Lastly, standard applications of Rademacher complexity theory will be used to derive generalization bounds.

Definition 2.1 (Definition of Rademacher Complexity).

Given a sample S = {x_1, …, x_n}, the empirical Rademacher complexity of a function class F is defined as

R_S(F) := E_ε [ sup_{f ∈ F} (1/n) Σ_{i=1}^n ε_i f(x_i) ],

where ε_1, …, ε_n are independent random variables drawn from the Rademacher distribution, i.e., P(ε_i = +1) = P(ε_i = −1) = 1/2 for i = 1, …, n.
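
As an illustration of this definition (ours, not from the paper), the snippet below Monte-Carlo estimates the empirical Rademacher complexity of the linear class { x ↦ w^⊤x : ‖w‖ ≤ B }, for which the supremum inside the expectation has the closed form (B/n) ‖Σ_i ε_i x_i‖.

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, num_draws=2000, seed=0):
    """Monte Carlo estimate of R_S for {x -> w^T x : ||w|| <= B}.

    For fixed signs eps, sup_{||w|| <= B} (1/n) sum_i eps_i w^T x_i = (B/n) ||sum_i eps_i x_i||.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    draws = []
    for _ in range(num_draws):
        eps = rng.choice([-1.0, 1.0], size=n)        # Rademacher signs
        draws.append(B / n * np.linalg.norm(eps @ X))
    return float(np.mean(draws))

X = np.random.default_rng(1).standard_normal((200, 10))
print(empirical_rademacher_linear(X))   # roughly B * sqrt(d / n) for this data
```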

3 Over-parametrization Helps Optimization

In this section we present our main results explaining why over-parametrization helps local search algorithms find a globally optimal solution. We consider two kinds of over-parameterization, k ≥ d and k ≥ √(2n). We begin with the simpler case, k ≥ d.

Theorem 3.1.

Assume we have an arbitrary data set of input/label pairs x_i ∈ R^d and y_i ∈ R for i = 1, …, n, and a convex loss ℓ. Consider a neural network of the form (1) with quadratic activation σ(z) = z² and with W ∈ R^{k×d} denoting the weights connecting the input to the hidden layer. Suppose k ≥ d. Then the regularized training loss (2), viewed as a function of the hidden-layer weights W, obeys Property 2.1 and Property 2.2.

The above result states that, given an arbitrary data set, the optimization landscape has benign properties that facilitate finding globally optimal neural networks. In particular, with the last layer fixed (e.g., as an average pooling layer), all local minima are global minima and all saddle points have a direction of negative curvature. This in turn implies that gradient descent on the first-layer weights, when initialized at random, converges to a global optimum. These desired properties hold as long as the hidden layer is wide enough, namely k ≥ d.

An interesting and perhaps surprising aspect of Theorem 3.1 is its generality: it applies to an arbitrary data set of any size and to any convex differentiable loss function.
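
One way to see why the k ≥ d regime is so benign (an informal numerical illustration of ours, not part of the proof) is that with quadratic activations the network output is x^⊤(W^⊤W)x, and when k ≥ d every positive semidefinite d × d matrix arises as W^⊤W for some W ∈ R^{k×d}; the non-convex parametrization therefore sweeps out exactly the model class of a convex problem in M = W^⊤W.

```python
import numpy as np

def factor_psd(M, k):
    """Return W of shape (k, d) with W^T W = M, assuming M is PSD and k >= d."""
    d = M.shape[0]
    assert k >= d
    eigvals, eigvecs = np.linalg.eigh(M)
    eigvals = np.clip(eigvals, 0.0, None)                    # guard tiny round-off
    root = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T   # symmetric square root of M
    return np.vstack([root, np.zeros((k - d, d))])           # pad with zero rows up to k

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
M_target = A @ A.T                    # an arbitrary PSD "model" matrix
W = factor_psd(M_target, k=7)         # realizable because k = 7 >= d = 4
assert np.allclose(W.T @ W, M_target)
```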

Now we consider the second case, k ≥ √(2n). As mentioned earlier, in practice this is often a milder requirement than k ≥ d, and it constitutes one of the main novelties of this paper.

Theorem 3.2.

Assume we have an arbitrary data set of input/label pairs x_i ∈ R^d and y_i ∈ R for i = 1, …, n, and a convex loss ℓ. Consider a neural network of the form (1) with quadratic activation and with W ∈ R^{k×d} denoting the weights connecting the input to the hidden layer. Suppose k ≥ √(2n), and let M be a random positive semidefinite matrix whose distribution is absolutely continuous with respect to the Lebesgue measure. Then the perturbed training loss as a function of the hidden-layer weights,

L̃(W) := L̂(W) + (λ/2) ‖W‖_F² + ⟨M, W^⊤W⟩,   (4)

obeys Property 2.1 and Property 2.2 with probability 1. Further, any globally optimal solution W̃ of Problem (4) satisfies

L(W̃) ≤ L(W*) + ‖M‖_F · ‖W*‖_F²,

where L is the regularized loss of Problem (2) and W* is a global minimizer of Problem (2).

Similar to Theorem 3.1, Theorem 3.2 states that if k ≥ √(2n), then for an arbitrary data set the perturbed objective (3) has the desired properties that enable local search heuristics to find a globally optimal solution for a general class of loss functions. Further, we can choose the perturbation to be arbitrarily small, so the minimum value of (3) is close to that of (2).

The proof of this theorem is inspired by a line of work started by Pataki (1998, 2000); Burer and Monteiro (2003); Boumal et al. (2016). In summary, Boumal et al. (2016) showed that for "almost all" semidefinite programs, every local minimum of the low-rank (Burer-Monteiro) factorized formulation of an SDP is a global minimum of the original SDP. However, this theorem carries the important caveat of only applying to semidefinite programs that do not fall into a certain measure-zero set. Our primary contribution is a procedure that exploits this result by (a) constructing a perturbed objective that avoids the measure-zero set, (b) proving that the perturbed objective has Properties 2.1 and 2.2, and (c) showing that the optimal value of the perturbed objective is close to that of the original objective. Further, note that the analysis of Boumal et al. (2016) does not apply directly, since our loss functions, such as the logistic loss, are not semidefinite representable. We refer readers to Section 5.2 for more technical insights.

4 Weight-decay Helps Generalization

In this section we switch our focus to the generalization ability of the learned neural network. Since we use weight decay, or equivalently ℓ2 regularization, in (2), the Frobenius norm of the learned weight matrix is bounded. Therefore, in this section we restrict attention to weight matrices with bounded squared Frobenius norm, i.e., W ∈ W_B = { W : ‖W‖_F² ≤ B }.

To derive the generalization bound, we first recall the classical generalization bound based on Rademacher complexity (cf. Theorem 2 of Koltchinskii and Panchenko (2002)).

Theorem 4.1.

Assume each data point is sampled i.i.d. from some distribution D, i.e., (x_i, y_i) ∼ D for i = 1, …, n. Denote by L_D(f) := E_{(x,y)∼D}[ ℓ(f(x), y) ] the population risk, by L̂_S(f) := (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i) the empirical risk, and by F the function class under consideration. Suppose the loss function ℓ is L_ℓ-Lipschitz in its first argument. Then, with high probability, for all f ∈ F,

L_D(f) ≤ L̂_S(f) + c · L_ℓ · R_S(F) + c √( log n / n ),

where c is an absolute constant and R_S(F) is the empirical Rademacher complexity of F.

With Theorem 4.1 at hand, we only need to bound the Rademacher complexity of the induced function class F_B := { x ↦ f(x; W) : ‖W‖_F² ≤ B }. Note that the Rademacher complexity is a distribution-dependent quantity: if the data distribution is arbitrary, we cannot hope for a useful bound. We begin with a theorem for a bounded input domain.

Theorem 4.2.

Suppose the input x is sampled from a bounded ball in R^d, i.e., ‖x‖ ≤ R for some R > 0, and consider the class F_B. Then the Rademacher complexity satisfies

R_S(F_B) ≤ c B R² √( log d / n )

for some absolute constant c.

Combining Theorem 4.1 and Theorem 4.2, we obtain the following generalization bound.

Theorem 4.3.

Under the same assumptions as Theorem 4.1 and Theorem 4.2, with high probability, for all W ∈ W_B,

L_D(f(·; W)) ≤ L̂_S(f(·; W)) + c L_ℓ B R² √( log d / n ) + c √( log n / n )

for some absolute constant c.

While Theorem 4.2 is a valid bound, it is rather pessimistic because we only assume the input is bounded. Consider the scenario in which each input is sampled from a standard Gaussian distribution, x_i ∼ N(0, I_d). Then, using a standard Gaussian concentration bound, with high probability ‖x_i‖² = Õ(d) for every sample, where Õ(·) hides logarithmic factors. Plugging this bound into Theorem 4.3, we have, with high probability,

L_D(f(·; W)) ≤ L̂_S(f(·; W)) + Õ( L_ℓ B √( d² / n ) ).   (5)

Note this bound has a quadratic dependence on the dimension inside the square root, so we need n to grow at least quadratically in d to obtain a meaningful bound.
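
A quick numerical sanity check of the Gaussian concentration step used above (our own, not from the paper):

```python
import numpy as np

# For x_i ~ N(0, I_d), ||x_i||^2 concentrates around d, so over n samples the
# largest squared norm exceeds d only by a modest (logarithmic) factor.
rng = np.random.default_rng(0)
d, n = 100, 10_000
X = rng.standard_normal((n, d))
sq_norms = np.sum(X ** 2, axis=1)
print(sq_norms.mean() / d)   # approximately 1
print(sq_norms.max() / d)    # a small factor above 1 (the hidden log factor)
```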

In fact, for specific distributions such as the Gaussian, we can use Theorem 5.2 to derive a stronger generalization bound.

Corollary 4.1.

Suppose x_i ∼ N(0, I_d) for i = 1, …, n. If the number of samples satisfies n = Ω̃(d), i.e., n exceeds d up to logarithmic factors, then with high probability the Rademacher complexity satisfies

R_S(F_B) ≤ c B √( d log d / n )

for some absolute constant c.

Again, combining Theorem 4.1 and Corollary 4.1, we obtain the following generalization bound for Gaussian input distributions.

Theorem 4.4.

Under the same assumptions as Theorem 4.1 and Corollary 4.1, with high probability, for all W ∈ W_B,

L_D(f(·; W)) ≤ L̂_S(f(·; W)) + c L_ℓ B √( d log d / n ) + c √( log n / n )

for some absolute constant c.

Comparing Theorem 4.4 with the generalization bound (5), Theorem 4.4 has a clear advantage: its dependence on the dimension and sample size is √(d/n) up to logarithmic factors, which is the usual parametric rate. Further, in practice the number of training samples and the input dimension are often of the same order for common datasets and architectures (Zhang et al., 2016).

Corollary 4.1 is a special case of the more general Theorem 5.2, which only requires a bound on the fourth moment of the input distribution. In general, our theorems suggest that if the Frobenius norm of the weight matrix is small and the input is sampled from a benign distribution with controlled fourth moments, then the learned network generalizes well.

As a concrete scenario, consider a favorable setting where the true data can be correctly classified by a small network using only r ≪ k hidden units, so that the weights are non-zero only in the first r rows of W and the Frobenius norm of W scales with r rather than with k. From Theorem 4.4, the sample complexity required to reach a given generalization gap then depends on the effective number of hidden units r rather than on k. The same conclusion holds for more general input distributions by using Theorem 5.2 in place of Theorem 4.4.

5 Proofs

5.1 Proof of Theorem 3.1 and Theorem 3.2

Our proofs that over-parametrization helps optimization build upon existing geometric characterizations of matrix factorization. We first cite a useful theorem by Haeffele et al. (2014). (Theorem 2 of (Haeffele et al., 2014) assumes W is a local minimum, but scrutinizing its proof, one sees that this assumption can be relaxed to W being a critical point at which the Hessian is positive semidefinite.)

Theorem 5.1 (Theorem 2 of (Haeffele et al., 2014) adapted to our setting).

Let ℓ be a twice differentiable function that is convex in its first argument. If the function L defined in (2) satisfies, at a rank-deficient matrix W (i.e., rank(W) < k),

∇L(W) = 0 and ∇²L(W) ⪰ 0,

then W is a global minimum.

Proof of Theorem 3.1.

We prove Property 2.1 and Property 2.2 simultaneously by showing that if a W satisfies

∇L(W) = 0 and ∇²L(W) ⪰ 0,

then it is a global minimum.

If rank(W) < k, we can directly apply Theorem 5.1. Thus it remains to consider the case rank(W) = k. We first notice that ∇L(W) = 0 is equivalent to

W G = 0,   (6)

where G := (2/n) Σ_{i=1}^n ℓ′( f(x_i; W), y_i ) x_i x_i^⊤ + λ I ∈ R^{d×d} and ℓ′ denotes the derivative of ℓ with respect to its first argument. Since rank(W) = k and k ≥ d, W has full column rank, so it has a left pseudo-inverse, i.e., there exists W^† such that W^† W = I. Multiplying by W^† on the left in Equation (6), we have

G = 0.   (7)

To prove Theorem 3.1, the key idea is to consider the following reference optimization problem over positive semidefinite matrices:

min_{M ⪰ 0}  g(M) := (1/n) Σ_{i=1}^n ℓ( x_i^⊤ M x_i, y_i ) + (λ/2) tr(M).   (8)

Problem (8) is a convex optimization problem in M and, because every M ⪰ 0 can be written as W^⊤W when k ≥ d, it has the same global minimum value as the original problem. Now denote M := W^⊤W. Since g is convex and M is positive definite (hence in the interior of the positive semidefinite cone), the first-order optimality condition ∇g(M) = 0 certifies that M is a global minimum.

Using Equation (7), which says precisely that ∇g(W^⊤W) = 0 (up to a factor of two), we know W^⊤W achieves the global minimum in Problem (8). The proof is thus complete. ∎

5.2 Proof of Theorem 3.2

Proof.

We first prove that the perturbed loss L̃ defined in (4) satisfies Property 2.1 and Property 2.2. Similar to the proof of Theorem 3.1, we prove these two properties simultaneously by showing that if a W satisfies

∇L̃(W) = 0 and ∇²L̃(W) ⪰ 0,   (9)

then it is a global minimum. Because of Theorem 5.1, we only need to show that if W satisfies condition (9), then it is rank-deficient, i.e., rank(W) < k.

For the gradient condition, the same computation as in the proof of Theorem 3.1 gives

∇L̃(W) = W ( D + 2M ),

where D ∈ R^{d×d} is a symmetric matrix determined by the data, the loss derivatives at W, and the weight decay. For simplicity, write G̃ := D + 2M. Using the first-order condition, every row of W lies in the null space of G̃. Thus, we can bound the rank of W by

rank(W) ≤ d − rank(G̃).

We prove rank-deficiency by contradiction. Assume rank(W) = k; then we must have

rank(G̃) ≤ d − k.

The key idea is to use these two conditions to upper bound the dimension of the set of perturbation matrices M for which such a configuration can occur. To this end, we consider the set of symmetric matrices M that can be written as the difference between a matrix of rank at most d − k and a data-dependent matrix of the form of D, and we count its degrees of freedom: the dimension of this set is the sum of the number of free parameters in D, the dimension of the admissible null spaces, and the dimension of the set of matrices of rank at most d − k.

Because we assume k ≥ √(2n), this total is strictly smaller than the dimension of the space of symmetric d × d matrices. However, by definition M must lie in this set, so M would have to fall on a lower-dimensional manifold, which has Lebesgue measure 0. Since M is sampled from a distribution that is absolutely continuous with respect to the Lebesgue measure, this event happens with probability 0. Therefore, with probability 1, rank(W) < k. The proof of the first part of Theorem 3.2 is complete.

For the second part, let W* be a global minimizer of Problem (2) and W̃ a global minimizer of Problem (4). Then we have

L̃(W̃) ≤ L̃(W*) = L(W*) + ⟨M, (W*)^⊤ W*⟩ ≤ L(W*) + ‖M‖_F · ‖W*‖_F².

Note that because M and W̃^⊤W̃ are both positive semidefinite, we have ⟨M, W̃^⊤W̃⟩ ≥ 0. Thus L(W̃) ≤ L̃(W̃) ≤ L(W*) + ‖M‖_F · ‖W*‖_F². ∎

5.3 Proof of Theorem 4.2 and Theorem 4.4

Our proof is inspired by (Srebro and Shraibman, 2005), which exploits the structure of nuclear-norm-bounded matrix classes. We first prove a general theorem that depends only on the fourth moment of the input vectors.

Theorem 5.2.

Suppose the inputs satisfy ‖ (1/n) Σ_{i=1}^n ‖x_i‖² x_i x_i^⊤ ‖ ≤ C. Then the Rademacher complexity of F_B is bounded by

R_S(F_B) ≤ B √( 2 C log(2d) / n ).

Proof.

For a given set of inputs x_1, …, x_n, in our context we can write the Rademacher complexity as

R_S(F_B) = E_ε [ sup_{‖W‖_F² ≤ B} (1/n) Σ_{i=1}^n ε_i x_i^⊤ W^⊤ W x_i ].

Since the Rademacher complexity does not change when taking convex hulls, we can first bound the Rademacher complexity of the class induced by rank-one positive semidefinite matrices with spectral norm bounded by 1, and then take the convex hull and scale by B. Note that such a matrix can be written as w w^⊤ with ‖w‖ ≤ 1, so the corresponding function is x ↦ (w^⊤ x)². Using this expression, we can upper bound the Rademacher complexity of this base class by the expected spectral norm of a random matrix:

E_ε [ sup_{‖w‖ ≤ 1} (1/n) Σ_{i=1}^n ε_i (w^⊤ x_i)² ] ≤ (1/n) E_ε ‖ Σ_{i=1}^n ε_i x_i x_i^⊤ ‖.

Now, to bound E_ε ‖ Σ_i ε_i x_i x_i^⊤ ‖, we can use results from random matrix theory on Rademacher series. Recall that we assume ‖(1/n) Σ_i ‖x_i‖² x_i x_i^⊤‖ ≤ C, and notice that

Σ_{i=1}^n ( x_i x_i^⊤ )² = Σ_{i=1}^n ‖x_i‖² x_i x_i^⊤.

Applying the Rademacher matrix series expectation bound (Theorem 4.6.1 of (Tropp et al., 2015)), we have

(1/n) E_ε ‖ Σ_i ε_i x_i x_i^⊤ ‖ ≤ (1/n) √( 2 log(2d) ) · ‖ Σ_i ‖x_i‖² x_i x_i^⊤ ‖^{1/2} ≤ √( 2 C log(2d) / n ).

Now taking the convex hull and scaling by B, we obtain the desired result. ∎

With Theorem 5.2 at hand, for each input distribution we only need to bound the fourth-moment quantity ‖(1/n) Σ_i ‖x_i‖² x_i x_i^⊤‖.

Proof of Theorem 4.3.

Since we assume ‖x_i‖ ≤ R for every sample, we directly have

‖ (1/n) Σ_{i=1}^n ‖x_i‖² x_i x_i^⊤ ‖ ≤ R² · ‖ (1/n) Σ_{i=1}^n x_i x_i^⊤ ‖ ≤ R⁴.

Plugging this bound into Theorem 5.2 establishes Theorem 4.2, and combining with Theorem 4.1 gives the desired inequality. ∎

Proof of Corollary 4.1 and Theorem 4.4.

To prove Corollary 4.1, we use Theorem 5.2 together with Lemma 4.7 of (Soltanolkotabi et al., 2017) to upper bound ‖(1/n) Σ_i ‖x_i‖² x_i x_i^⊤‖. Choosing the parameters in Lemma 4.7 appropriately, we find that

‖ (1/n) Σ_{i=1}^n ‖x_i‖² x_i x_i^⊤ ‖ ≤ Õ(d)

with high probability. This completes the proof of Corollary 4.1.

Using this bound in Theorem 5.2, together with Theorem 4.1, completes the proof of Theorem 4.4. ∎

6 Conclusion and Future Works

In this paper we provided new theoretical results on over-parameterized neural networks. Using smoothed analysis, we showed that as long as the number of hidden nodes exceeds the input dimension or the square root of the number of training samples, the loss surface has benign properties that enable local search algorithms to find global minima. We further used the theory of Rademacher complexity to show that the learned neural network generalizes well.

Our next step is to consider neural networks with other activation functions and to understand how over-parametrization allows efficient local-search algorithms to find near-global minimizers. Another interesting direction is to extend our results to deeper models.

7 Acknowledgment

S.S.D. was supported by NSF grant IIS1563887, AFRL grant FA8750-17-2-0212 and DARPA D17AP00001. J.D.L. acknowledges support of the ARO under MURI Award W911NF-11-1-0303. This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical Research Council (EPSRC) under the Multidisciplinary University Research Initiative.

References