Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced

06/04/2018 ∙ by Simon S. Du, et al. ∙ University of Southern California ∙ Princeton University ∙ Carnegie Mellon University

We study the implicit regularization imposed by gradient descent for learning multi-layer homogeneous functions including feed-forward fully connected and convolutional deep neural networks with linear, ReLU or Leaky ReLU activation. We rigorously prove that gradient flow (i.e. gradient descent with infinitesimal step size) effectively enforces the differences between squared norms across different layers to remain invariant without any explicit regularization. This result implies that if the weights are initially small, gradient flow automatically balances the magnitudes of all layers. Using a discretization argument, we analyze gradient descent with positive step size for the non-convex low-rank asymmetric matrix factorization problem without any regularization. Inspired by our findings for gradient flow, we prove that gradient descent with step sizes $\eta_t = O(t^{-(1/2+\delta)})$ ($0 < \delta < 1/2$) automatically balances two low-rank factors and converges to a bounded global optimum. Furthermore, for rank-1 asymmetric matrix factorization we give a finer analysis showing gradient descent with constant step size converges to the global minimum at a globally linear rate. We believe that the idea of examining the invariance imposed by first order algorithms in learning homogeneous models could serve as a fundamental building block for studying optimization for learning deep models.

1 Introduction

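Modern machine learning models often consist of multiple layers. For example, consider a feed-forward deep neural network that defines a prediction function

$$\mathbf{x} \mapsto f(\mathbf{x}; \mathbf{W}^{(1)}, \ldots, \mathbf{W}^{(N)}) = \mathbf{W}^{(N)} \phi\big(\mathbf{W}^{(N-1)} \cdots \mathbf{W}^{(2)} \phi(\mathbf{W}^{(1)} \mathbf{x}) \cdots\big),$$

where $\mathbf{W}^{(1)}, \ldots, \mathbf{W}^{(N)}$ are weight matrices in $N$ layers, and $\phi(\cdot)$ is a point-wise homogeneous activation function such as Rectified Linear Unit (ReLU) $\phi(x) = \max\{x, 0\}$. A simple observation is that this model is homogeneous: if we multiply a layer by a positive scalar $c$ and divide another layer by $c$, the prediction function remains the same, e.g. $f(\mathbf{x}; c\mathbf{W}^{(1)}, \ldots, \frac{1}{c}\mathbf{W}^{(N)}) = f(\mathbf{x}; \mathbf{W}^{(1)}, \ldots, \mathbf{W}^{(N)})$.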
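The rescaling invariance described above can be checked numerically. The following is a minimal sketch, assuming a small three-layer ReLU network; the layer widths, the scalar `c`, and the helper names `relu` and `forward` are illustrative choices, not taken from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(weights, x):
    """Compute W_N relu(W_{N-1} ... relu(W_1 x)) for a list of weight matrices."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

rng = np.random.default_rng(0)
sizes = [5, 8, 6, 3]  # hypothetical widths: input 5, two hidden layers, output 3
weights = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(3)]
x = rng.standard_normal(5)

# Multiply the first layer by c and divide the last layer by c.
c = 1000.0
rescaled = [c * weights[0], weights[1], weights[2] / c]

out_original = forward(weights, x)
out_rescaled = forward(rescaled, x)
max_gap = float(np.max(np.abs(out_original - out_rescaled)))
```

The two outputs agree up to floating-point rounding even though the first layer of `rescaled` is a thousand times larger, which is exactly why the parameter norms alone say nothing about the function value.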

A direct consequence of homogeneity is that a solution can produce small function value while being unbounded, because one can always multiply one layer by a huge number and divide another layer by that number. Theoretically, this possible unbalancedness poses significant difficulty in analyzing first order optimization methods like gradient descent/stochastic gradient descent (GD/SGD), because when parameters are not a priori constrained to a compact set via either coerciveness of the loss or an explicit constraint, GD and SGD are not even guaranteed to converge (Lee et al., 2016, Proposition 4.11). In the context of deep learning, Shamir (2018) determined that the primary barrier to providing algorithmic results is that the sequence of parameter iterates is possibly unbounded.

Footnote 1: A function $f$ is coercive if $\|\mathbf{x}\| \to \infty$ implies $f(\mathbf{x}) \to \infty$.

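Now we take a closer look at asymmetric matrix factorization, which is a simple two-layer homogeneous model. Consider the following formulation for factorizing a low-rank matrix:

$$\min_{\mathbf{U} \in \mathbb{R}^{d_1 \times r},\ \mathbf{V} \in \mathbb{R}^{d_2 \times r}} f(\mathbf{U}, \mathbf{V}) = \frac{1}{2} \left\| \mathbf{U}\mathbf{V}^\top - \mathbf{M}^* \right\|_F^2, \tag{1}$$

where $\mathbf{M}^* \in \mathbb{R}^{d_1 \times d_2}$ is a matrix we want to factorize. We observe that due to the homogeneity of $f$, it is not smooth even in the neighborhood of a globally optimum point. (Footnote 2: A function is said to be smooth if its gradient is $\beta$-Lipschitz continuous for some finite $\beta > 0$.) To see this, we compute the gradient of $f$:

$$\frac{\partial f(\mathbf{U}, \mathbf{V})}{\partial \mathbf{U}} = \left(\mathbf{U}\mathbf{V}^\top - \mathbf{M}^*\right)\mathbf{V}, \qquad \frac{\partial f(\mathbf{U}, \mathbf{V})}{\partial \mathbf{V}} = \left(\mathbf{U}\mathbf{V}^\top - \mathbf{M}^*\right)^\top \mathbf{U}. \tag{2}$$

Notice that the gradient of $f$ is not homogeneous anymore. Further, consider a globally optimal solution $(\mathbf{U}, \mathbf{V})$ such that $\|\mathbf{U}\|_F$ is of order $\epsilon$ and $\|\mathbf{V}\|_F$ is of order $1/\epsilon$ ($\epsilon$ being very small). A small perturbation on $\mathbf{U}$ can lead to dramatic change to the gradient of $\mathbf{U}$. This phenomenon can happen for all homogeneous functions when the layers are unbalanced. The lack of nice geometric properties of homogeneous functions due to unbalancedness makes first-order optimization methods difficult to analyze.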
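To make the loss of smoothness concrete, the sketch below compares how much the gradient with respect to the first factor moves under the same small perturbation, once at a balanced global optimum and once at an unbalanced one. The factor names `U`, `V`, the target `M`, the dimensions, and the scale `eps` are assumptions chosen for illustration.

```python
import numpy as np

def grad_U(U, V, M):
    """Gradient of 0.5 * ||U V^T - M||_F^2 with respect to U."""
    return (U @ V.T - M) @ V

rng = np.random.default_rng(0)
d1, d2, r = 6, 5, 2                      # hypothetical dimensions
A = rng.standard_normal((d1, r))
B = rng.standard_normal((d2, r))
M = A @ B.T                              # (A, B) is a global optimum by construction

delta = 1e-3 * rng.standard_normal((d1, r))   # one fixed small perturbation of U

def gradient_change(eps):
    """Change of the U-gradient at the optimum (eps * A, B / eps) when U moves by delta."""
    U, V = eps * A, B / eps              # U V^T = M for every eps > 0
    return np.linalg.norm(grad_U(U + delta, V, M) - grad_U(U, V, M))

change_balanced = gradient_change(1.0)
change_unbalanced = gradient_change(1e-3)
```

Since the change equals `delta @ (V.T @ V)`, it scales like `1 / eps**2`: the same perturbation moves the gradient roughly a million times more at the unbalanced optimum than at the balanced one.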

A common theoretical workaround is to artificially modify the natural objective function as in (1) in order to prove convergence. In (Tu et al., 2015; Ge et al., 2017a), a regularization term for balancing the two layers is added to (1):

$$\min_{\mathbf{U} \in \mathbb{R}^{d_1 \times r},\ \mathbf{V} \in \mathbb{R}^{d_2 \times r}} \frac{1}{2} \left\| \mathbf{U}\mathbf{V}^\top - \mathbf{M}^* \right\|_F^2 + \frac{1}{8} \left\| \mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V} \right\|_F^2. \tag{3}$$

For problem (3), the regularizer removes the homogeneity issue and the optimal solution becomes unique (up to rotation). Ge et al. (2017a) showed that the modified objective (3) satisfies (i) every local minimum is a global minimum, (ii) all saddle points are strict, and (iii) the objective is smooth. (Footnote 3: A saddle point of a function $f$ is strict if the Hessian at that point has a negative eigenvalue.) These imply that (noisy) GD finds a global minimum (Ge et al., 2015; Lee et al., 2016; Panageas and Piliouras, 2016).
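The effect of the balancing regularizer can be seen by evaluating both objectives before and after rescaling the two factors. This is a rough sketch: the weight `lam` on the balancing term, the function names, and the dimensions are assumptions made for illustration.

```python
import numpy as np

def loss_plain(U, V, M):
    """Unregularized objective: 0.5 * ||U V^T - M||_F^2."""
    return 0.5 * np.linalg.norm(U @ V.T - M) ** 2

def loss_regularized(U, V, M, lam=0.125):
    """Unregularized objective plus lam * ||U^T U - V^T V||_F^2 (lam is an assumed weight)."""
    return loss_plain(U, V, M) + lam * np.linalg.norm(U.T @ U - V.T @ V) ** 2

rng = np.random.default_rng(1)
U = rng.standard_normal((6, 2))
V = rng.standard_normal((5, 2))
M = rng.standard_normal((6, 5))

c = 10.0   # rescale U by c and V by 1/c, which leaves U V^T unchanged
plain_before = loss_plain(U, V, M)
plain_after = loss_plain(c * U, V / c, M)
reg_before = loss_regularized(U, V, M)
reg_after = loss_regularized(c * U, V / c, M)
```

The plain objective is unchanged by the rescaling, while the regularized one grows sharply, so unbalanced rescalings of a solution are no longer equally good.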

(a) Comparison of convergence rates of GD for objective functions (1) and (3).
(b) Comparison of the quantity $\|\mathbf{U}\|_F^2 / \|\mathbf{V}\|_F^2$ when running GD for objective functions (1) and (3).
Figure 1: Experiments on the matrix factorization problem with objective functions (1) and (3). Red lines correspond to running GD on the objective function (1), and blue lines correspond to running GD on the objective function (3).

On the other hand, empirically, removing the homogeneity is not necessary. We use GD with random initialization to solve the optimization problem (1). Figure 1(a) shows that even without a regularization term like the one in the modified objective (3), GD with random initialization converges to a global minimum and the convergence rate is also competitive. A more interesting phenomenon is shown in Figure 1(b), in which we track the Frobenius norms of $\mathbf{U}$ and $\mathbf{V}$ in all iterations. The plot shows that the ratio between norms remains a constant in all iterations. Thus the unbalancedness does not occur at all! In many practical applications, many models also admit the homogeneous property (like deep neural networks) and first order methods often converge to a balanced solution. A natural question arises:

Why does GD balance multiple layers and converge in learning homogeneous functions?

In this paper, we take an important step towards answering this question. Our key finding is that the gradient descent algorithm provides an implicit regularization on the target homogeneous function. First, we show that on the gradient flow (gradient descent with infinitesimal step size) trajectory induced by any differentiable loss function, for a large class of homogeneous models, including fully connected and convolutional neural networks with linear, ReLU and Leaky ReLU activations, the differences between squared norms across layers remain invariant. Thus, as long as at the beginning the differences are small, they remain small at all time. Note that small differences arise in commonly used initialization schemes such as Gaussian initialization or Xavier/Kaiming initialization schemes (Glorot and Bengio, 2010; He et al., 2016). Our result thus explains why using ReLU activation is a better choice than sigmoid from the optimization point of view. For linear activation, we prove an even stronger invariance for gradient flow: we show that $\mathbf{W}^{(h)}(\mathbf{W}^{(h)})^\top - (\mathbf{W}^{(h+1)})^\top \mathbf{W}^{(h+1)}$ stays invariant over time, where $\mathbf{W}^{(h)}$ and $\mathbf{W}^{(h+1)}$ are weight matrices in consecutive layers with linear activation in between.

Next, we go beyond gradient flow and consider gradient descent with positive step size. We focus on the asymmetric matrix factorization problem (1). Our invariance result for linear activation indicates that $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$ stays unchanged for gradient flow. For gradient descent, $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$ can change over iterations. Nevertheless we show that if the step size decreases like $\eta_t = O(t^{-(1/2+\delta)})$ ($0 < \delta < 1/2$), $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$ will remain small in all iterations. In the set where $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$ is small, the loss is coercive, and gradient descent thus ensures that all the iterates are bounded. Using these properties, we then show that gradient descent converges to a globally optimal solution. Furthermore, for rank-1 asymmetric matrix factorization, we give a finer analysis and show that randomly initialized gradient descent with constant step size converges to the global minimum at a globally linear rate.
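A scalar worked example may help to see why the invariance is exact for gradient flow but only approximate for gradient descent. The symbols $u$, $v$, $m$, $e$ and the constant step size $\eta$ below are chosen for this illustration only; the computation is a sketch of the one-dimensional case, not the general argument.

```latex
% Scalar instance of matrix factorization: f(u, v) = \tfrac12 (uv - m)^2,
% with residual e = uv - m.
\begin{align*}
\text{Gradient flow:}\quad
  & \dot u = -e\,v, \qquad \dot v = -e\,u, \\
  & \frac{d}{dt}\,(u^2 - v^2) = 2u\dot u - 2v\dot v = -2e\,uv + 2e\,uv = 0. \\[4pt]
\text{Gradient descent:}\quad
  & u^{+} = u - \eta e v, \qquad v^{+} = v - \eta e u, \\
  & (u^{+})^2 - (v^{+})^2 = (1 - \eta^2 e^2)\,(u^2 - v^2).
\end{align*}
```

The first-order terms in $\eta$ cancel, so one step of gradient descent changes the imbalance $u^2 - v^2$ only at order $\eta^2$, which is why sufficiently small step sizes keep it under control.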

1.1 Related Work

The homogeneity issue has been previously discussed by Neyshabur et al. (2015a, b). The authors proposed a variant of stochastic gradient descent that regularizes paths in a neural network, which is related to the max-norm. The algorithm outperforms gradient descent and AdaGrad on several classification tasks.

A line of research focused on analyzing gradient descent dynamics for (convolutional) neural networks with one or two unknown layers (Tian, 2017; Brutzkus and Globerson, 2017; Du et al., 2017a, b; Zhong et al., 2017; Li and Yuan, 2017; Ma et al., 2017; Brutzkus et al., 2017). For one unknown layer, there is no homogeneity issue, while for two unknown layers, existing work either requires learning two layers separately (Zhong et al., 2017; Ge et al., 2017b) or uses re-parametrization like weight normalization to remove the homogeneity issue (Du et al., 2017b). To our knowledge, there is no rigorous analysis for optimizing multi-layer homogeneous functions.

For a general (non-convex) optimization problem, it is known that if the objective function satisfies (i) the gradient changes smoothly if the parameters are perturbed, (ii) all saddle points and local maxima are strict (i.e., there exists a direction with negative curvature), and (iii) all local minima are global (no spurious local minimum), then gradient descent (Lee et al., 2016; Panageas and Piliouras, 2016) converges to a global minimum. There have been many studies on the optimization landscapes of neural networks (Kawaguchi, 2016; Choromanska et al., 2015; Du and Lee, 2018; Hardt and Ma, 2016; Bartlett et al., 2018; Haeffele and Vidal, 2015; Freeman and Bruna, 2016; Vidal et al., 2017; Safran and Shamir, 2016; Zhou and Feng, 2017; Nguyen and Hein, 2017a, b; Safran and Shamir, 2017), showing that the objective functions have properties (ii) and (iii). Nevertheless, the objective function is in general not smooth as we discussed before. Our paper complements these results by showing that the magnitudes of all layers are balanced and in many cases, this implies smoothness.

1.2 Paper Organization

The rest of the paper is organized as follows. In Section 2, we present our main theoretical result on the implicit regularization property of gradient flow for optimizing neural networks. In Section 3, we analyze the dynamics of randomly initialized gradient descent for asymmetric matrix factorization problem with unregularized objective function (1). In Section 4, we empirically verify the theoretical result in Section 2. We conclude and list future directions in Section 5. Some technical proofs are deferred to the appendix.

1.3 Notation

We use bold-faced letters for vectors and matrices. For a vector $\mathbf{x}$, denote by $\mathbf{x}[i]$ its $i$-th coordinate. For a matrix $\mathbf{A}$, we use $\mathbf{A}[i, j]$ to denote its $(i, j)$-th entry, and use $\mathbf{A}[i, :]$ and $\mathbf{A}[:, j]$ to denote its $i$-th row and $j$-th column, respectively (both as column vectors). We use $\|\cdot\|_2$ or $\|\cdot\|$ to denote the Euclidean norm of a vector, and use $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. We use $\langle \cdot, \cdot \rangle$ to denote the standard Euclidean inner product between two vectors or two matrices. Let $[n] = \{1, 2, \ldots, n\}$.

2 The Auto-Balancing Properties in Deep Neural Networks

In this section we study the implicit regularization imposed by gradient descent with infinitesimal step size (gradient flow) in training deep neural networks. In Section 2.1 we consider fully connected neural networks, and our main result (Theorem 2.1) shows that gradient flow automatically balances the incoming and outgoing weights at every neuron. This directly implies that the weights between different layers are balanced (Corollary 2.1). For linear activation, we derive a stronger auto-balancing property (Theorem 2.2). In Section 2.2 we generalize our result from fully connected neural networks to convolutional neural networks. In Section 2.3 we present the proof of Theorem 2.1. The proofs of other theorems in this section follow similar ideas and are deferred to Appendix A.

2.1 Fully Connected Neural Networks

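We first formally define a fully connected feed-forward neural network with $N$ ($N \ge 2$) layers. Let $\mathbf{W}^{(h)} \in \mathbb{R}^{n_h \times n_{h-1}}$ be the weight matrix in the $h$-th layer, and define $\mathbf{w} = (\mathbf{W}^{(h)})_{h=1}^N$ as a shorthand of the collection of all the weights. Then the function $f_{\mathbf{w}}: \mathbb{R}^d \to \mathbb{R}^p$ ($d = n_0$, $p = n_N$) computed by this network can be defined recursively: $f_{\mathbf{w}}^{(1)}(\mathbf{x}) = \mathbf{W}^{(1)}\mathbf{x}$, $f_{\mathbf{w}}^{(h)}(\mathbf{x}) = \mathbf{W}^{(h)} \phi_{h-1}(f_{\mathbf{w}}^{(h-1)}(\mathbf{x}))$ ($h = 2, \ldots, N$), and $f_{\mathbf{w}}(\mathbf{x}) = f_{\mathbf{w}}^{(N)}(\mathbf{x})$, where each $\phi_h$ is an activation function that acts coordinate-wise on vectors. (Footnote 4: We omit the trainable bias weights in the network for simplicity, but our results can be directly generalized to allow bias weights.) We assume that each $\phi_h$ ($h \in [N-1]$) is homogeneous, namely, $\phi_h(x) = \phi_h'(x) \cdot x$ for all $x$ and all elements of the sub-differential $\phi_h'(\cdot)$ when $\phi_h$ is non-differentiable at $x$. This property is satisfied by functions like ReLU $\phi(x) = \max\{x, 0\}$, Leaky ReLU $\phi(x) = \max\{x, \alpha x\}$ ($0 < \alpha < 1$), and linear function $\phi(x) = x$.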
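The homogeneity assumption on the activations, namely that the activation equals its derivative times its argument, is easy to test on a grid. The sketch below does so for ReLU, Leaky ReLU, the linear function, and, as a contrast, the sigmoid; the slope `ALPHA`, the grid, and the dictionary layout are assumptions of this illustration.

```python
import numpy as np

ALPHA = 0.1  # hypothetical Leaky ReLU slope; any value in (0, 1) behaves the same way

# Each entry maps a name to (activation, one valid choice of derivative or sub-gradient).
activations = {
    "relu": (lambda x: np.maximum(x, 0.0),
             lambda x: (x > 0).astype(float)),
    "leaky_relu": (lambda x: np.maximum(x, ALPHA * x),
                   lambda x: np.where(x > 0, 1.0, ALPHA)),
    "linear": (lambda x: x,
               lambda x: np.ones_like(x)),
    "sigmoid": (lambda x: 1.0 / (1.0 + np.exp(-x)),
                lambda x: np.exp(-x) / (1.0 + np.exp(-x)) ** 2),
}

x = np.linspace(-3.0, 3.0, 13)  # grid that includes x = 0, where ReLU has a kink

# Check the identity phi(x) == phi'(x) * x on the whole grid.
is_homogeneous = {
    name: bool(np.allclose(phi(x), dphi(x) * x))
    for name, (phi, dphi) in activations.items()
}
```

The three piecewise-linear activations satisfy the identity on the whole grid, while the sigmoid already fails at zero, where it takes the value one half but its derivative times zero vanishes.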

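Let $\ell: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ be a differentiable loss function. Given a training dataset $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^m \subset \mathbb{R}^d \times \mathbb{R}^p$, the training loss as a function of the network parameters $\mathbf{w}$ is defined as

$$L(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^m \ell\big(f_{\mathbf{w}}(\mathbf{x}_i), \mathbf{y}_i\big). \tag{4}$$

We consider gradient descent with infinitesimal step size (also known as gradient flow) applied on $L(\mathbf{w})$, which is captured by the differential inclusion:

$$\frac{d\mathbf{W}^{(h)}}{dt} \in -\frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h)}}, \qquad h = 1, \ldots, N, \tag{5}$$

where $t$ is a continuous time index, and $\frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h)}}$ is the Clarke sub-differential (Clarke et al., 2008). If curves $\mathbf{W}^{(h)} = \mathbf{W}^{(h)}(t)$ ($h \in [N]$) evolve with time according to (5) they are said to be a solution of the gradient flow differential inclusion.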

Our main result in this section is the following invariance imposed by gradient flow.

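Theorem 2.1 (Balanced incoming and outgoing weights at every neuron).

For any $h \in [N-1]$ and $i \in [n_h]$, we have

$$\frac{d}{dt}\left( \left\|\mathbf{W}^{(h)}[i, :]\right\|^2 - \left\|\mathbf{W}^{(h+1)}[:, i]\right\|^2 \right) = 0. \tag{6}$$

Note that $\mathbf{W}^{(h)}[i, :]$ is a vector consisting of network weights coming into the $i$-th neuron in the $h$-th hidden layer, and $\mathbf{W}^{(h+1)}[:, i]$ is the vector of weights going out from the same neuron. Therefore, Theorem 2.1 shows that gradient flow exactly preserves the difference between the squared $\ell_2$-norms of incoming weights and outgoing weights at any neuron.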
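The per-neuron statement can be checked at a single parameter point: under gradient flow the time derivative of a squared weight norm is minus twice the inner product of the weights with their gradient, so it suffices to compare these inner products for incoming and outgoing weights. The sketch below does this with hand-written back-propagation for a two-layer ReLU network with squared loss; the sizes, variable names, and the choice of loss are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n1, p = 8, 5, 7, 3            # hypothetical sizes: samples, input, hidden, output
X = rng.standard_normal((m, d))
Y = rng.standard_normal((m, p))
W1 = rng.standard_normal((n1, d))   # incoming weights of hidden neuron i: W1[i, :]
W2 = rng.standard_normal((p, n1))   # outgoing weights of hidden neuron i: W2[:, i]

# Forward pass of f(x) = W2 relu(W1 x) on all samples at once.
Z = X @ W1.T
A = np.maximum(Z, 0.0)
residual = A @ W2.T - Y

# Back-propagation for the loss (1 / 2m) * sum_k ||f(x_k) - y_k||^2.
G_out = residual / m
G2 = G_out.T @ A                       # gradient with respect to W2
G1 = ((G_out @ W2) * (Z > 0)).T @ X    # gradient with respect to W1, relu'(z) = 1[z > 0]

# Instantaneous rates of change of the squared norms under gradient flow.
rate_incoming = -2.0 * np.sum(W1 * G1, axis=1)   # one value per hidden neuron
rate_outgoing = -2.0 * np.sum(W2 * G2, axis=0)   # one value per hidden neuron
```

The two rate vectors coincide entry by entry, so the difference of squared norms at every hidden neuron has zero time derivative at this point, whatever the data and weights are.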

Taking sum of (6) over $i \in [n_h]$, we obtain the following corollary which says gradient flow preserves the difference between the squares of Frobenius norms of weight matrices.

Corollary 2.1 (Balanced weights across layers).

For any $h \in [N-1]$, we have

$$\frac{d}{dt}\left( \left\|\mathbf{W}^{(h)}\right\|_F^2 - \left\|\mathbf{W}^{(h+1)}\right\|_F^2 \right) = 0.$$

Corollary 2.1 explains why in practice, trained multi-layer models usually have similar magnitudes on all the layers: if we use a small initialization, $\|\mathbf{W}^{(h)}\|_F^2 - \|\mathbf{W}^{(h+1)}\|_F^2$ is very small at the beginning, and Corollary 2.1 implies this difference remains small at all time. This finding also partially explains why gradient descent converges. Although the objective function like (4) may not be smooth over the entire parameter space, given that $\big| \|\mathbf{W}^{(h)}\|_F^2 - \|\mathbf{W}^{(h+1)}\|_F^2 \big|$ is small for all $h$, the objective function may have smoothness. Under this condition, standard theory shows that gradient descent converges. We believe this finding serves as a key building block for understanding first order methods for training deep neural networks.

For linear activation, we have the following stronger invariance than Theorem 2.1:

Theorem 2.2 (Stronger balancedness property for linear activation).

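If for some $h \in [N-1]$ we have $\phi_h(x) = x$, then

$$\frac{d}{dt}\left( \mathbf{W}^{(h)} (\mathbf{W}^{(h)})^\top - (\mathbf{W}^{(h+1)})^\top \mathbf{W}^{(h+1)} \right) = \mathbf{0}.$$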

This result was known for linear networks (Arora et al., 2018), but the proof there relies on the entire network being linear while Theorem 2.2 only needs two consecutive layers to have no nonlinear activations in between.

While Theorem 2.1 shows the invariance in a node-wise manner, Theorem 2.2 shows for linear activation, we can derive a layer-wise invariance. Inspired by this strong invariance, in Section 3 we prove gradient descent with positive step sizes preserves this invariance approximately for matrix factorization.
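The layer-wise matrix invariance for linear activation can also be checked at a single point, by comparing the instantaneous rates of change of the two Gram matrices under gradient flow. The following sketch uses a two-layer linear network with squared loss; the sizes and variable names are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n1, p = 8, 5, 4, 3            # hypothetical sizes: samples, input, hidden, output
X = rng.standard_normal((m, d))
Y = rng.standard_normal((m, p))
W1 = rng.standard_normal((n1, d))
W2 = rng.standard_normal((p, n1))

# Forward and backward pass of f(x) = W2 W1 x with loss (1 / 2m) * sum_k ||f(x_k) - y_k||^2.
H = X @ W1.T                        # hidden layer, no nonlinearity in between
G_out = (H @ W2.T - Y) / m
G1 = W2.T @ G_out.T @ X             # gradient with respect to W1
G2 = G_out.T @ H                    # gradient with respect to W2

# Under gradient flow dW/dt = -G, the product rule gives the rates of the Gram matrices.
rate_first = -(G1 @ W1.T + W1 @ G1.T)     # d/dt of W1 W1^T
rate_second = -(W2.T @ G2 + G2.T @ W2)    # d/dt of W2^T W2
```

Both rates are the same `n1 x n1` matrix, so every entry of the difference of the two Gram matrices is conserved, not only its trace as in the node-wise and layer-wise norm statements.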

2.2 Convolutional Neural Networks

Now we show that the conservation property in Corollary 2.1 can be generalized to convolutional neural networks. In fact, we can allow arbitrary sparsity pattern and weight sharing structure within a layer; convolutional layers are a special case.

Neural networks with sparse connections and shared weights.

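We use the same notation as in Section 2.1, with the difference that some weights in a layer can be missing or shared. Formally, the weight matrix $\mathbf{W}^{(h)} \in \mathbb{R}^{n_h \times n_{h-1}}$ in layer $h$ ($h \in [N]$) can be described by a vector $\mathbf{v}^{(h)} \in \mathbb{R}^{d_h}$ and a function $g_h: [n_h] \times [n_{h-1}] \to [d_h] \cup \{0\}$. Here $\mathbf{v}^{(h)}$ consists of the actual free parameters in this layer and $d_h$ is the number of free parameters (e.g. if there are $k$ convolutional filters in layer $h$ each with size $r$, we have $d_h = r \cdot k$). The map $g_h$ represents the sparsity and weight sharing pattern:

$$\mathbf{W}^{(h)}[i, j] = \begin{cases} 0, & g_h(i, j) = 0, \\ \mathbf{v}^{(h)}[k], & g_h(i, j) = k > 0. \end{cases}$$

Denote by $\mathbf{v} = (\mathbf{v}^{(h)})_{h=1}^N$ the collection of all the parameters in this network, and we consider gradient flow to learn the parameters:

$$\frac{d\mathbf{v}^{(h)}}{dt} \in -\frac{\partial L(\mathbf{v})}{\partial \mathbf{v}^{(h)}}, \qquad h = 1, \ldots, N.$$

The following theorem generalizes Corollary 2.1 to neural networks with sparse connections and shared weights:

Theorem 2.3.

For any $h \in [N-1]$, we have

$$\frac{d}{dt}\left( \left\|\mathbf{v}^{(h)}\right\|^2 - \left\|\mathbf{v}^{(h+1)}\right\|^2 \right) = 0.$$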

Therefore, for a neural network with arbitrary sparsity pattern and weight sharing structure, gradient flow still balances the magnitudes of all layers.
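As a small illustration of the weight-sharing case, the sketch below builds a first layer that is a one-dimensional convolution (a single shared filter) followed by ReLU and a dense second layer, and compares the instantaneous rates of change of the squared parameter norms of the two layers. The filter size, the dimensions, the squared loss, and the helper `conv_matrix` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, p, m = 10, 3, 4, 7            # hypothetical sizes: input, filter, output, samples
n1 = d - k + 1                      # number of positions of the sliding filter

v1 = rng.standard_normal(k)         # free parameters of layer 1 (one shared filter)
W2 = rng.standard_normal((p, n1))   # layer 2 is dense, so its free parameters are W2 itself
X = rng.standard_normal((m, d))
Y = rng.standard_normal((m, p))

def conv_matrix(v):
    """Build the sparse, weight-shared matrix W1 with W1[i, i + j] = v[j]."""
    W = np.zeros((n1, d))
    for i in range(n1):
        W[i, i:i + k] = v
    return W

W1 = conv_matrix(v1)
Z = X @ W1.T
A = np.maximum(Z, 0.0)
G_out = (A @ W2.T - Y) / m          # loss (1 / 2m) * sum_k ||f(x_k) - y_k||^2
G2 = G_out.T @ A                    # gradient with respect to W2
G1 = ((G_out @ W2) * (Z > 0)).T @ X # gradient with respect to the dense matrix W1

# Gradient of a shared parameter: sum over all entries of W1 that it occupies.
grad_v1 = np.array([sum(G1[i, i + j] for i in range(n1)) for j in range(k)])

rate_layer1 = -2.0 * float(v1 @ grad_v1)          # d/dt ||v1||^2 under gradient flow
rate_layer2 = -2.0 * float(np.sum(W2 * G2))       # d/dt ||W2||_F^2 under gradient flow
```

The two rates agree, so the difference of squared parameter norms is stationary even though the first layer has only three free parameters while the second has thirty-two.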

2.3 Proof of Theorem 2.1

The proofs of all theorems in this section are similar. They are based on the use of the chain rule (i.e. back-propagation) and the property of homogeneous activations. Below we provide the proof of Theorem 2.1 and defer the proofs of other theorems to Appendix A.

Proof of Theorem 2.1.

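First we note that we can without loss of generality assume $L$ is the loss associated with one data sample $(\mathbf{x}, \mathbf{y}) \in \mathbb{R}^d \times \mathbb{R}^p$, i.e., $L(\mathbf{w}) = \ell(f_{\mathbf{w}}(\mathbf{x}), \mathbf{y})$. In fact, for $L(\mathbf{w}) = \frac{1}{m}\sum_{k=1}^m L_k(\mathbf{w})$ where $L_k(\mathbf{w}) = \ell(f_{\mathbf{w}}(\mathbf{x}_k), \mathbf{y}_k)$, for any single weight $\mathbf{W}^{(h)}[i, j]$ in the network we can compute

$$\frac{d}{dt}\big(\mathbf{W}^{(h)}[i, j]\big)^2 = 2\,\mathbf{W}^{(h)}[i, j] \cdot \frac{d\,\mathbf{W}^{(h)}[i, j]}{dt} = -2\,\mathbf{W}^{(h)}[i, j] \cdot \frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h)}[i, j]} = -2\,\mathbf{W}^{(h)}[i, j] \cdot \frac{1}{m}\sum_{k=1}^m \frac{\partial L_k(\mathbf{w})}{\partial \mathbf{W}^{(h)}[i, j]},$$

using the sharp chain rule of differential inclusions for tame functions (Drusvyatskiy et al., 2015; Davis et al., 2018). Thus, if we can prove the theorem for every individual loss $L_k$, we can prove the theorem for $L$ by taking average over $k \in [m]$.

Therefore in the rest of proof we assume $L(\mathbf{w}) = \ell(f_{\mathbf{w}}(\mathbf{x}), \mathbf{y})$. For convenience, we denote $\mathbf{x}^{(h)} = f_{\mathbf{w}}^{(h)}(\mathbf{x})$ ($h \in [N]$), which is the input to the $h$-th hidden layer of neurons for $h \in [N-1]$ and is the output of the network for $h = N$. We also denote $\mathbf{x}^{(0)} = \mathbf{x}$ and $\phi_0(x) = x$ ($\forall x$).

Now we prove (6). Since $\mathbf{W}^{(h+1)}[k, i]$ ($k \in [n_{h+1}]$) can only affect $L(\mathbf{w})$ through $\mathbf{x}^{(h+1)}[k]$, we have for $k \in [n_{h+1}]$,

$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h+1)}[k, i]} = \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}[k]} \cdot \frac{\partial \mathbf{x}^{(h+1)}[k]}{\partial \mathbf{W}^{(h+1)}[k, i]} = \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}[k]} \cdot \phi_h\big(\mathbf{x}^{(h)}[i]\big),$$

which can be rewritten as

$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h+1)}[:, i]} = \phi_h\big(\mathbf{x}^{(h)}[i]\big) \cdot \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}.$$

It follows that

$$\frac{d}{dt}\left\|\mathbf{W}^{(h+1)}[:, i]\right\|^2 = 2\left\langle \mathbf{W}^{(h+1)}[:, i], \frac{d}{dt}\mathbf{W}^{(h+1)}[:, i] \right\rangle = -2\left\langle \mathbf{W}^{(h+1)}[:, i], \frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h+1)}[:, i]} \right\rangle = -2\,\phi_h\big(\mathbf{x}^{(h)}[i]\big) \left\langle \mathbf{W}^{(h+1)}[:, i], \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}} \right\rangle. \tag{7}$$

On the other hand, $\mathbf{W}^{(h)}[i, :]$ only affects $L(\mathbf{w})$ through $\mathbf{x}^{(h)}[i]$. Using the chain rule, we get

$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h)}[i, :]} = \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h)}[i]} \cdot \phi_{h-1}\big(\mathbf{x}^{(h-1)}\big) = \left\langle \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}, \mathbf{W}^{(h+1)}[:, i] \right\rangle \cdot \phi_h'\big(\mathbf{x}^{(h)}[i]\big) \cdot \phi_{h-1}\big(\mathbf{x}^{(h-1)}\big),$$

where $\phi_h'$ is interpreted as a set-valued mapping whenever it is applied at a non-differentiable point. (Footnote 5: More precisely, the equalities should be an inclusion whenever there is a sub-differential, but as we see in the next display the ambiguity in the choice of sub-differential does not affect later calculations.)

It follows that

$$\begin{aligned} \frac{d}{dt}\left\|\mathbf{W}^{(h)}[i, :]\right\|^2 &= -2\left\langle \mathbf{W}^{(h)}[i, :], \frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h)}[i, :]} \right\rangle \\ &= -2\left\langle \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}, \mathbf{W}^{(h+1)}[:, i] \right\rangle \cdot \phi_h'\big(\mathbf{x}^{(h)}[i]\big) \cdot \left\langle \mathbf{W}^{(h)}[i, :], \phi_{h-1}\big(\mathbf{x}^{(h-1)}\big) \right\rangle \\ &= -2\left\langle \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}, \mathbf{W}^{(h+1)}[:, i] \right\rangle \cdot \phi_h'\big(\mathbf{x}^{(h)}[i]\big) \cdot \mathbf{x}^{(h)}[i] \\ &= -2\left\langle \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}, \mathbf{W}^{(h+1)}[:, i] \right\rangle \cdot \phi_h\big(\mathbf{x}^{(h)}[i]\big). \end{aligned}$$

(Footnote 6: This holds for any choice of element of the sub-differential, since $\phi_h'(x)\,x = 0$ holds at $x = 0$ for any choice of sub-differential.)

Comparing the above expression to (7), we finish the proof. ∎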

3 Gradient Descent Converges to Global Minimum for Asymmetric Matrix Factorization

In this section we constrain ourselves to the asymmetric matrix factorization problem and analyze the gradient descent algorithm with random initialization. Our analysis is inspired by the auto-balancing properties presented in Section 2. We extend these properties from gradient flow to gradient descent with positive step size.

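Formally, we study the following non-convex optimization problem:

$$\min_{\mathbf{U} \in \mathbb{R}^{d_1 \times r},\ \mathbf{V} \in \mathbb{R}^{d_2 \times r}} f(\mathbf{U}, \mathbf{V}) = \frac{1}{2}\left\|\mathbf{U}\mathbf{V}^\top - \mathbf{M}^*\right\|_F^2, \tag{8}$$

where $\mathbf{M}^* \in \mathbb{R}^{d_1 \times d_2}$ has rank $r$. Note that we do not have any explicit regularization in (8). The gradient descent dynamics for (8) have the following form:

$$\mathbf{U}_{t+1} = \mathbf{U}_t - \eta_t \left(\mathbf{U}_t\mathbf{V}_t^\top - \mathbf{M}^*\right)\mathbf{V}_t, \qquad \mathbf{V}_{t+1} = \mathbf{V}_t - \eta_t \left(\mathbf{U}_t\mathbf{V}_t^\top - \mathbf{M}^*\right)^\top \mathbf{U}_t. \tag{9}$$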
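The gradient descent dynamics (9) are short enough to simulate directly. The sketch below runs them on a small synthetic rank-2 problem with a small random initialization and a polynomially decaying step size, and records the largest imbalance between the two factors seen along the way. All numerical choices (dimensions, singular values, initialization scale, step-size constants, number of iterations) are assumptions for illustration and are not the constants used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 10, 8, 2                        # hypothetical problem size
P, _ = np.linalg.qr(rng.standard_normal((d1, r)))
Q, _ = np.linalg.qr(rng.standard_normal((d2, r)))
M = P @ np.diag([3.0, 2.0]) @ Q.T           # rank-2 target with singular values 3 and 2

U = 1e-2 * rng.standard_normal((d1, r))     # small random initialization
V = 1e-2 * rng.standard_normal((d2, r))

def loss(U, V):
    return 0.5 * np.linalg.norm(U @ V.T - M) ** 2

def imbalance(U, V):
    """Frobenius norm of U^T U - V^T V, which gradient flow keeps constant."""
    return np.linalg.norm(U.T @ U - V.T @ V)

initial_loss = loss(U, V)
max_imbalance = imbalance(U, V)
for t in range(20000):
    eta = 0.1 / (t + 1) ** 0.55             # decaying step size (assumed constants)
    E = U @ V.T - M                         # residual at the current iterate
    U, V = U - eta * E @ V, V - eta * E.T @ U   # simultaneous update of both factors
    max_imbalance = max(max_imbalance, imbalance(U, V))

final_loss = loss(U, V)
```

In this setup the loss is driven essentially to zero without any regularizer, while the imbalance stays tiny compared with the squared norms of the factors, which are of the order of the singular values of `M`.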

3.1 The General Rank-$r$ Case

First we consider the general case of $r \ge 1$. Our main theorem below says that if we use a random small initialization $(\mathbf{U}_0, \mathbf{V}_0)$, and set step sizes $\eta_t$ to be appropriately small, then gradient descent (9) will converge to a solution close to the global minimum of (8). To our knowledge, this is the first result showing that gradient descent with random initialization directly solves the un-regularized asymmetric matrix factorization problem (8).

Theorem 3.1.

Let . Suppose we initialize the entries in and i.i.d. from (), and run (9) with step sizes ().777The dependency of on can be for any constant .

Then with high probability over the initialization,

exists and satisfies .

Proof sketch of Theorem 3.1.

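First let's imagine that we are using infinitesimal step size in GD. Then according to Theorem 2.2 (viewing problem (8) as learning a two-layer linear network where the inputs are all the standard unit vectors in $\mathbb{R}^{d_2}$), we know that $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$ will stay invariant throughout the algorithm. Hence when $\mathbf{U}$ and $\mathbf{V}$ are initialized to be small, $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$ will stay small forever. Combined with the fact that the objective $f(\mathbf{U}, \mathbf{V})$ is decreasing over time (which means $\mathbf{U}\mathbf{V}^\top$ cannot be too far from $\mathbf{M}^*$), we can show that $\mathbf{U}$ and $\mathbf{V}$ will always stay bounded.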

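Now we are using positive step sizes $\eta_t$, so we no longer have the invariance of $\mathbf{U}^\top\mathbf{U} - \mathbf{V}^\top\mathbf{V}$. Nevertheless, by a careful analysis of the updates, we can still prove that $\mathbf{U}_t^\top\mathbf{U}_t - \mathbf{V}_t^\top\mathbf{V}_t$ is small, the objective $f(\mathbf{U}_t, \mathbf{V}_t)$ decreases, and $\mathbf{U}_t$ and $\mathbf{V}_t$ stay bounded. Formally, we have the following lemma: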

Lemma 3.1.

With high probability over the initialization $(\mathbf{U}_0, \mathbf{V}_0)$, for all $t$ we have:

  1. Balancedness: ;

  2. Decreasing objective: ;

  3. Boundedness: .

Now that we know the GD algorithm automatically constrains $(\mathbf{U}_t, \mathbf{V}_t)$ in a bounded region, we can use the smoothness of $f$ in this region and a standard analysis of GD to show that $(\mathbf{U}_t, \mathbf{V}_t)$ converges to a stationary point $(\bar{\mathbf{U}}, \bar{\mathbf{V}})$ of $f$ (Lemma B.2). Furthermore, using the results of (Lee et al., 2016; Panageas and Piliouras, 2016) we know that $(\bar{\mathbf{U}}, \bar{\mathbf{V}})$ is almost surely not a strict saddle point. Then the following lemma implies that $(\bar{\mathbf{U}}, \bar{\mathbf{V}})$ has to be close to a global optimum, since Lemma 3.1 (i) guarantees that $(\bar{\mathbf{U}}, \bar{\mathbf{V}})$ is balanced. This would complete the proof of Theorem 3.1.

Lemma 3.2.

Suppose is a stationary point of such that . Then either , or is a strict saddle point of .

The full proof of Theorem 3.1 and the proofs of Lemmas 3.1 and 3.2 are given in Appendix B.

3.2 The Rank-1 Case

We have shown in Theorem 3.1 that GD with small and diminishing step sizes converges to a global minimum for matrix factorization. Empirically, it is observed that a constant step size is enough for GD to converge quickly to global minimum. Therefore, some natural questions are how to prove convergence of GD with a constant step size, how fast it converges, and how the discretization affects the invariance we derived in Section 2.

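While these questions remain challenging for the general rank-$r$ matrix factorization, we resolve them for the case of $r = 1$. Our main finding is that with constant step size, the norms of two layers are always within a constant factor of each other (although we may no longer have the stronger balancedness property as in Lemma 3.1), and we utilize this property to prove the linear convergence of GD to a global minimum.

When $r = 1$, the asymmetric matrix factorization problem and its GD dynamics become

$$\min_{\mathbf{u} \in \mathbb{R}^{d_1},\ \mathbf{v} \in \mathbb{R}^{d_2}} \frac{1}{2}\left\|\mathbf{u}\mathbf{v}^\top - \mathbf{M}^*\right\|_F^2$$

and

$$\mathbf{u}_{t+1} = \mathbf{u}_t - \eta\left(\mathbf{u}_t\mathbf{v}_t^\top - \mathbf{M}^*\right)\mathbf{v}_t, \qquad \mathbf{v}_{t+1} = \mathbf{v}_t - \eta\left(\mathbf{v}_t\mathbf{u}_t^\top - \mathbf{M}^{*\top}\right)\mathbf{u}_t.$$

Here we assume $\mathbf{M}^*$ has rank $1$, i.e., it can be factorized as $\mathbf{M}^* = \sigma_1 \mathbf{u}^* \mathbf{v}^{*\top}$ where $\mathbf{u}^*$ and $\mathbf{v}^*$ are unit vectors and $\sigma_1 > 0$.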
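A quick simulation of the rank-1 dynamics with a constant step size illustrates both the fast convergence and the approximate balancedness in the signal directions. In this sketch the dimensions, the signal strength `sigma`, the initialization scale, the constant `0.1` in the step size, and the iteration count are all assumed values, chosen only so that the run is short and stable.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, sigma = 20, 15, 2.0                 # hypothetical sizes and signal strength
u_star = rng.standard_normal(d1)
u_star /= np.linalg.norm(u_star)
v_star = rng.standard_normal(d2)
v_star /= np.linalg.norm(v_star)
M = sigma * np.outer(u_star, v_star)        # rank-1 target

u = 1e-2 * rng.standard_normal(d1)          # small random initialization
v = 1e-2 * rng.standard_normal(d2)
eta = 0.1 / sigma                           # constant step size (0.1 is an assumed constant)

errors = []
for t in range(1000):
    E = np.outer(u, v) - M
    errors.append(np.linalg.norm(E))        # Frobenius error before the update
    u, v = u - eta * E @ v, v - eta * E.T @ u   # simultaneous rank-1 update

final_error = np.linalg.norm(np.outer(u, v) - M)
final_ratio = abs(u @ u_star) / abs(v @ v_star)   # relative strength in the signal directions
```

Plotting `errors` on a logarithmic axis should show a short plateau while the iterates escape the neighborhood of the origin, followed by a straight line, which is the signature of a linear rate; `final_ratio` ends up close to one because the two factors share the signal strength almost equally.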

Our main theoretical result is the following.

Theorem 3.2 (Approximate balancedness and linear convergence of GD for rank-1 matrix factorization).

Suppose , with () for some sufficiently small constant , and for some sufficiently small constant . Then with constant probability over the initialization, for all we have for some universal constants . Furthermore, for any , after iterations, we have .

Theorem 3.2 shows that for $\mathbf{u}_t$ and $\mathbf{v}_t$, their strengths in the signal space, $|\mathbf{u}_t^\top\mathbf{u}^*|$ and $|\mathbf{v}_t^\top\mathbf{v}^*|$, are of the same order. This approximate balancedness helps us prove the linear convergence of GD. We refer readers to Appendix C for the proof of Theorem 3.2.

4 Empirical Verification

We perform experiments to verify the auto-balancing properties of gradient descent in neural networks with ReLU activation. Our results below show that for GD with small step size and small initialization: (1) the difference between the squared Frobenius norms of any two layers remains small in all iterations, and (2) the ratio between the squared Frobenius norms of any two layers becomes close to $1$. Notice that our theorems in Section 2 hold for gradient flow (step size $\to 0$) but in practice we can only choose a (small) positive step size, so we cannot hope the difference between the squared Frobenius norms to remain exactly the same but can only hope to observe that the differences remain small.

We consider a 3-layer fully connected network of the form where is the input, , , , and is ReLU activation. We use 1,000 data points and the quadratic loss function, and run GD. We first test a balanced initialization: , and , which ensures . After 10,000 iterations we have , and . Figure 2(a) shows that in all iterations and are bounded by which is much smaller than the magnitude of each . Figure 2(b) shows that the ratios between norms approach . We then test an unbalanced initialization: , and . After 10,000 iterations we have , and . Figure 2(c) shows that and are bounded by (and indeed change very little throughout the process), and Figure 2(d) shows that the ratios become close to after about 1,000 iterations.

(a) Balanced initialization, squared norm differences.
(b) Balanced initialization, squared norm ratios.
(c) Unbalanced Initialization, squared norm differences.
(d) Unbalanced initialization, squared norm ratios.
Figure 2: Balancedness of a 3-layer neural network.
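An experiment of this kind can be reproduced in a few lines. The sketch below trains a 3-layer fully connected ReLU network with plain GD on random data and measures how far the differences between squared Frobenius norms of consecutive layers drift from their initial values. The sizes, the data, the initialization, the step size, and the number of iterations are assumptions of this sketch and differ from the setup reported above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n1, n2, p = 100, 10, 20, 20, 1       # hypothetical sizes, not those of the paper
X = rng.standard_normal((m, d))
Y = rng.standard_normal((m, p))
W1 = rng.standard_normal((n1, d)) / np.sqrt(d)
W2 = rng.standard_normal((n2, n1)) / np.sqrt(n1)
W3 = rng.standard_normal((p, n2)) / np.sqrt(n2)

def sq_norm_gaps(W1, W2, W3):
    """Differences of squared Frobenius norms between consecutive layers."""
    sq = [np.sum(W ** 2) for W in (W1, W2, W3)]
    return np.array([sq[0] - sq[1], sq[1] - sq[2]])

eta = 1e-3                                  # small constant step size (assumed)
gaps_start = sq_norm_gaps(W1, W2, W3)
losses = []
for step in range(2000):
    # Forward pass.
    Z1 = X @ W1.T
    A1 = np.maximum(Z1, 0.0)
    Z2 = A1 @ W2.T
    A2 = np.maximum(Z2, 0.0)
    R = A2 @ W3.T - Y
    losses.append(0.5 * np.mean(np.sum(R ** 2, axis=1)))
    # Backward pass for the quadratic loss, using relu'(z) = 1[z > 0].
    G_out = R / m
    G3 = G_out.T @ A2
    G_Z2 = (G_out @ W3) * (Z2 > 0)
    G2 = G_Z2.T @ A1
    G_Z1 = (G_Z2 @ W2) * (Z1 > 0)
    G1 = G_Z1.T @ X
    # Gradient descent step on all three layers.
    W1, W2, W3 = W1 - eta * G1, W2 - eta * G2, W3 - eta * G3

drift = np.abs(sq_norm_gaps(W1, W2, W3) - gaps_start)
```

With this initialization the layers start far from balanced, yet the two gaps barely move during training: each GD step changes them only at second order in the step size, in line with the gradient flow invariance.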

5 Conclusion and Future Work

In this paper we take a step towards characterizing the invariance imposed by first order algorithms. We show that gradient flow automatically balances the magnitudes of all layers in a deep neural network with homogeneous activations. For the concrete model of asymmetric matrix factorization, we further use the balancedness property to show that gradient descent converges to global minimum. We believe our findings on the invariance in deep models could serve as a fundamental building block for understanding optimization in deep learning. Below we list some future directions.

Other first-order methods.

In this paper we focus on the invariance induced by gradient descent. In practice, different acceleration and adaptive methods are also used. A natural future direction is how to characterize the invariance properties of these algorithms.

From gradient flow to gradient descent: a generic analysis?

As discussed in Section 3, while strong invariance properties hold for gradient flow, in practice one uses gradient descent with positive step sizes and the invariance may only hold approximately because positive step sizes discretize the dynamics. We use specialized techniques for analyzing asymmetric matrix factorization. It would be very interesting to develop a generic approach to analyze the discretization. Recent findings on the connection between optimization and ordinary differential equations (Su et al., 2014; Zhang et al., 2018) might be useful for this purpose.

Acknowledgements

We thank Phil Long for his helpful comments on an earlier draft of this paper. JDL acknowledges support from ARO W911NF-11-1-0303.

References

Appendix

Appendix A Proofs for Section 2

Proof of Theorem 2.2.

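Same as the proof of Theorem 2.1, we assume without loss of generality that $L(\mathbf{w}) = \ell(f_{\mathbf{w}}(\mathbf{x}), \mathbf{y})$ for some $(\mathbf{x}, \mathbf{y}) \in \mathbb{R}^d \times \mathbb{R}^p$. We also denote $\mathbf{x}^{(h)} = f_{\mathbf{w}}^{(h)}(\mathbf{x})$ ($\forall h \in [N]$), $\mathbf{x}^{(0)} = \mathbf{x}$ and $\phi_0(x) = x$.

Now we suppose $\phi_h(x) = x$ for some $h \in [N-1]$. Denote $\mathbf{u} = \phi_{h-1}(\mathbf{x}^{(h-1)})$. Then we have $\mathbf{x}^{(h+1)} = \mathbf{W}^{(h+1)}\mathbf{W}^{(h)}\mathbf{u}$. Using the chain rule, we can directly compute

$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h)}} = (\mathbf{W}^{(h+1)})^\top \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}\, \mathbf{u}^\top, \qquad \frac{\partial L(\mathbf{w})}{\partial \mathbf{W}^{(h+1)}} = \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}} \big(\mathbf{W}^{(h)}\mathbf{u}\big)^\top.$$

Then we have

$$\begin{aligned} \frac{d}{dt}\Big(\mathbf{W}^{(h)}(\mathbf{W}^{(h)})^\top\Big) &= \mathbf{W}^{(h)}\Big(\frac{d\mathbf{W}^{(h)}}{dt}\Big)^\top + \frac{d\mathbf{W}^{(h)}}{dt}(\mathbf{W}^{(h)})^\top \\ &= -\mathbf{W}^{(h)}\mathbf{u}\Big(\frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}\Big)^\top \mathbf{W}^{(h+1)} - (\mathbf{W}^{(h+1)})^\top \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}} \big(\mathbf{W}^{(h)}\mathbf{u}\big)^\top, \\ \frac{d}{dt}\Big((\mathbf{W}^{(h+1)})^\top\mathbf{W}^{(h+1)}\Big) &= (\mathbf{W}^{(h+1)})^\top \frac{d\mathbf{W}^{(h+1)}}{dt} + \Big(\frac{d\mathbf{W}^{(h+1)}}{dt}\Big)^\top \mathbf{W}^{(h+1)} \\ &= -(\mathbf{W}^{(h+1)})^\top \frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}} \big(\mathbf{W}^{(h)}\mathbf{u}\big)^\top - \mathbf{W}^{(h)}\mathbf{u}\Big(\frac{\partial L(\mathbf{w})}{\partial \mathbf{x}^{(h+1)}}\Big)^\top \mathbf{W}^{(h+1)}. \end{aligned}$$

Comparing the above two equations we know $\frac{d}{dt}\big(\mathbf{W}^{(h)}(\mathbf{W}^{(h)})^\top - (\mathbf{W}^{(h+1)})^\top\mathbf{W}^{(h+1)}\big) = \mathbf{0}$. ∎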

Proof of Theorem 2.3.

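Same as the proof of Theorem 2.1, we assume without loss of generality that $L(\mathbf{v}) = \ell(f_{\mathbf{v}}(\mathbf{x}), \mathbf{y})$ for $(\mathbf{x}, \mathbf{y}) \in \mathbb{R}^d \times \mathbb{R}^p$, and denote $\mathbf{x}^{(h)} = f_{\mathbf{v}}^{(h)}(\mathbf{x})$ ($\forall h \in [N]$), $\mathbf{x}^{(0)} = \mathbf{x}$ and $\phi_0(x) = x$.

Using the chain rule, we have

$$\frac{\partial L(\mathbf{v})}{\partial \mathbf{v}^{(h+1)}[l]} = \sum_{(k, i):\, g_{h+1}(k, i) = l} \frac{\partial L(\mathbf{v})}{\partial \mathbf{x}^{(h+1)}[k]} \cdot \phi_h\big(\mathbf{x}^{(h)}[i]\big).$$

Then we have using the sharp chain rule,

$$\begin{aligned} \frac{d}{dt}\left\|\mathbf{v}^{(h+1)}\right\|^2 &= -2\left\langle \mathbf{v}^{(h+1)}, \frac{\partial L(\mathbf{v})}{\partial \mathbf{v}^{(h+1)}} \right\rangle = -2\sum_{l} \mathbf{v}^{(h+1)}[l] \sum_{(k, i):\, g_{h+1}(k, i) = l} \frac{\partial L(\mathbf{v})}{\partial \mathbf{x}^{(h+1)}[k]} \cdot \phi_h\big(\mathbf{x}^{(h)}[i]\big) \\ &= -2\sum_{k, i} \mathbf{W}^{(h+1)}[k, i] \cdot \frac{\partial L(\mathbf{v})}{\partial \mathbf{x}^{(h+1)}[k]} \cdot \phi_h\big(\mathbf{x}^{(h)}[i]\big) = -2\left\langle \frac{\partial L(\mathbf{v})}{\partial \mathbf{x}^{(h+1)}}, \mathbf{x}^{(h+1)} \right\rangle. \end{aligned} \tag{10}$$

Substituting $h$ with $h-1$ in (10) gives $\frac{d}{dt}\|\mathbf{v}^{(h)}\|^2 = -2\left\langle \frac{\partial L(\mathbf{v})}{\partial \mathbf{x}^{(h)}}, \mathbf{x}^{(h)} \right\rangle$, which further implies

$$\begin{aligned} \frac{d}{dt}\left\|\mathbf{v}^{(h)}\right\|^2 &= -2\sum_{i} \frac{\partial L(\mathbf{v})}{\partial \mathbf{x}^{(h)}[i]} \cdot \mathbf{x}^{(h)}[i] = -2\sum_{i} \left\langle \frac{\partial L(\mathbf{v})}{\partial \mathbf{x}^{(h+1)}}, \mathbf{W}^{(h+1)}[:, i] \right\rangle \cdot \phi_h'\big(\mathbf{x}^{(h)}[i]\big) \cdot \mathbf{x}^{(h)}[i] \\ &= -2\sum_{i} \left\langle \frac{\partial L(\mathbf{v})}{\partial \mathbf{x}^{(h+1)}}, \mathbf{W}^{(h+1)}[:, i] \right\rangle \cdot \phi_h\big(\mathbf{x}^{(h)}[i]\big) = -2\left\langle \frac{\partial L(\mathbf{v})}{\partial \mathbf{x}^{(h+1)}}, \mathbf{x}^{(h+1)} \right\rangle. \end{aligned} \tag{11}$$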