The Implicit Bias of Depth: How Incremental Learning Drives Generalization

09/26/2019 ∙ by Daniel Gissin, et al.

A leading hypothesis for the surprising generalization of neural networks is that the dynamics of gradient descent bias the model towards simple solutions, by searching through the solution space in an incremental order of complexity. We formally define the notion of incremental learning dynamics and derive the conditions on depth and initialization for which this phenomenon arises in deep linear models. Our main theoretical contribution is a dynamical depth separation result, proving that while shallow models can exhibit incremental learning dynamics, they require the initialization to be exponentially small for these dynamics to present themselves. However, once the model becomes deeper, the dependence becomes polynomial and incremental learning can arise in more natural settings. We complement our theoretical findings by experimenting with deep matrix sensing, quadratic neural networks and with binary classification using diagonal and convolutional linear networks, showing all of these models exhibit incremental learning.


1 Introduction

Neural networks have led to a breakthrough in modern machine learning, allowing us to efficiently learn highly expressive models that still generalize to unseen data. The theoretical reasons for this success are still unclear, as the generalization capabilities of neural networks defy the classic statistical learning theory bounds. Since these bounds, which depend solely on the capacity of the learned model, are unable to account for the success of neural networks, we must examine additional properties of the learning process. One such property is the optimization algorithm - while neural networks can express a multitude of possible ERM solutions for a given training set, gradient-based methods with the right initialization may be implicitly biased towards certain solutions which generalize.

A possible way such an implicit bias may present itself is if gradient-based methods were to search the hypothesis space for possible solutions of gradually increasing complexity. This would suggest that while the hypothesis space itself is extremely complex, our search strategy favors the simplest solutions and thus generalizes. One of the leading results along these lines has been by saxe2013exact, deriving an analytical solution for the gradient flow dynamics of deep linear networks and showing that for such models, the singular values converge at different rates, with larger values converging first. At the limit of infinitesimal initialization of the deep linear network, gidel2019implicit show these dynamics exhibit a behavior of “incremental learning”: the singular values of the model are learned separately, one at a time. Our work generalizes these results to small but finite initialization scales.

Incremental learning dynamics have also been explored in gradient descent applied to matrix completion and sensing with a factorized parameterization (gunasekar2017implicit, arora2018optimization, woodworth2019kernel). When initialized with small Gaussian weights and trained with a small learning rate, such a model is able to successfully recover the low-rank matrix which labeled the data, even if the problem is highly under-determined and no additional regularization is applied. In their proof of low-rank recovery for such models, li2017algorithmic show that the model remains low-rank throughout the optimization process, leading to the successful generalization. Additionally, arora2019implicit explore the dynamics of such models, showing the singular values are learned at different rates and that deeper models exhibit stronger incremental learning dynamics. Our work deals with a simpler setting, allowing us to determine explicitly under which conditions depth leads to this dynamical phenomenon.

Finally, the learning dynamics of nonlinear models have been studied as well. combes2018learning and williams2019gradient study the gradient flow dynamics of shallow ReLU networks under restrictive distributional assumptions, basri2019convergence show that shallow networks learn functions of gradually increasing frequencies, and nakkiran2019sgd show how deep ReLU networks correlate with linear classifiers in the early stages of training.

These findings, along with others, suggest that the generalization ability of deep networks is at least in part due to the incremental learning dynamics of gradient descent. Following this line of work, we begin by explicitly defining the notion of incremental learning for a toy model which exhibits this sort of behavior. Analyzing the dynamics of the model for gradient flow and gradient descent, we characterize the effect of the model’s depth and initialization scale on incremental learning, showing how deeper models allow for incremental learning in larger (realistic) initialization scales. Specifically, we show that a depth-2 model requires exponentially small initialization for incremental learning to occur, while deeper models only require the initialization to be polynomially small.

Once incremental learning has been defined and characterized for the toy model, we generalize our results theoretically and empirically for larger models. Examples of incremental learning in these models can be seen in figure 1, which we discuss further in section 4.

Figure 1: Incremental learning dynamics in deep models; the panels are (a) Matrix Sensing, (b) Quadratic Nets, (c) Diagonal Nets and (d) Convolutional Nets. Each panel shows the evolution of the five largest values of $\sigma$, the parameters of the induced model. All models were trained using gradient descent with a small initialization and learning rate, on a small training set such that there are multiple possible solutions. In all cases, the deep parameterization of the models leads to “incremental learning”, where the values are learned at different rates (larger values are learned first), leading to sparse solutions. (a) Depth 4 matrix sensing, $\sigma$ denotes singular values (see section 4.1). (b) Quadratic networks, $\sigma$ denotes singular values (see section 4.2). (c) Depth 3 diagonal networks, $\sigma$ denotes feature weights (see section 4.3). (d) Depth 3 circular-convolutional networks, $\sigma$ denotes amplitudes in the frequency domain of the feature weights (see appendix G).

2 Dynamical Analysis of a Toy Model

We begin by analyzing incremental learning for a simple model. This will allow us to gain a clear understanding of the phenomenon and the conditions for it, which we will later be able to apply to a variety of other models in which incremental learning is present.

2.1 Preliminaries

Our simple linear model will be similar to the toy model analyzed by woodworth2019kernel. Our input space will be $\mathcal{X} = \mathbb{R}^d$ and the hypothesis space will be linear models with non-negative weights, such that:

$$f_\sigma(x) = \langle \sigma, x \rangle, \qquad \sigma \in \mathbb{R}^d_{\geq 0} \tag{1}$$

We will introduce depth into our model by parameterizing $\sigma$ using $w \in \mathbb{R}^d_{\geq 0}$ in the following way:

$$\forall i: \quad \sigma_i = w_i^N \tag{2}$$

where $N$ represents the depth of the model. Since we restrict the model to having non-negative weights, this parameterization doesn’t change the expressiveness, but it does radically change its optimization dynamics.

Assuming the data is labeled by some $\sigma^* \in \mathbb{R}^d_{\geq 0}$, we will study the dynamics of this model for general $N$ under a depth-normalized [1] squared loss over Gaussian inputs, which will allow us to derive our analytical solution:

$$\ell_N(w) = \frac{1}{2N^2}\,\mathbb{E}_x\left[\left(\langle \sigma^*, x \rangle - \langle w^N, x \rangle\right)^2\right] = \frac{1}{2N^2}\left\| w^N - \sigma^* \right\|^2 \tag{3}$$

[1] This normalization is used for mathematical convenience, to have solutions of different depths exhibit similar time scales in their dynamics. Equivalently, we can derive the solutions for the regular square loss and then use different time scalings in the dynamical analysis.

We will assume that our model is initialized uniformly with a tunable scaling factor $\sigma_0$, such that:

$$\forall i: \quad w_i(0) = \sqrt[N]{\sigma_0} \tag{4}$$

2.2 Gradient Flow Analytical Solutions

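Analyzing our toy model using gradient flow allows us to obtain an analytical solution for the dynamics of $\sigma(t)$ along with the dynamics of the loss function for a general $N$. For brevity, the following theorem refers only to $N = 1, 2$ and $N \to \infty$; the solutions for $2 < N < \infty$ are similar in structure to $N \to \infty$, but more complicated. We also assume $\sigma_i^* > 0$ for brevity, although we can derive the solutions for $\sigma_i^* = 0$ as well:

Theorem 1.

Minimizing the toy linear model described in equation 1 with gradient flow over the depth-normalized squared loss (equation 3), with Gaussian inputs and weights initialized as in equation 4, and assuming $\sigma_i^* > 0$, leads to the following analytical solutions for different values of $N$:

$$N = 1: \quad \sigma_i(t) = \sigma_i^* + \left(\sigma_0 - \sigma_i^*\right)e^{-t} \tag{5}$$

$$N = 2: \quad \sigma_i(t) = \frac{\sigma_0\,\sigma_i^*}{\sigma_0 + \left(\sigma_i^* - \sigma_0\right)e^{-\sigma_i^* t}} \tag{6}$$

$$N \to \infty: \quad t = \frac{1}{(\sigma_i^*)^2}\log\left(\frac{\sigma_i(t)\left(\sigma_i^* - \sigma_0\right)}{\sigma_0\left(\sigma_i^* - \sigma_i(t)\right)}\right) - \frac{1}{\sigma_i^*}\left(\frac{1}{\sigma_i(t)} - \frac{1}{\sigma_0}\right) \tag{7}$$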
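Proof.

The gradient flow equations for our model are the following:

$$\dot{w}_i = -\frac{\partial \ell_N}{\partial w_i} = \frac{1}{N}\, w_i^{N-1}\left(\sigma_i^* - w_i^N\right)$$

Given the dynamics of the $w$ parameters, we may use the chain rule to derive the dynamics of the induced model, $\sigma$:

$$\dot{\sigma}_i = N w_i^{N-1}\,\dot{w}_i = w_i^{2N-2}\left(\sigma_i^* - w_i^N\right) = \sigma_i^{2-\frac{2}{N}}\left(\sigma_i^* - \sigma_i\right) \tag{8}$$

This differential equation is separable and solvable for all $N$, leading to the solutions in the theorem.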
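The closed forms can be sanity-checked numerically. The sketch below is an illustration, not the authors' code: it assumes the induced dynamics take the form $\dot{\sigma} = \sigma^{2-2/N}(\sigma^* - \sigma)$ with $\sigma(0) = \sigma_0$, integrates them with a plain forward-Euler scheme, and compares against the depth-1 exponential and depth-2 logistic solutions. All function names are hypothetical.

```python
import math

def sigma_dot(sigma, sigma_star, depth):
    """Assumed induced dynamics: d(sigma)/dt = sigma^(2 - 2/N) * (sigma* - sigma)."""
    return sigma ** (2.0 - 2.0 / depth) * (sigma_star - sigma)

def integrate(sigma0, sigma_star, depth, t_end, dt=1e-4):
    """Forward-Euler approximation of the gradient flow up to time t_end."""
    sigma = sigma0
    for _ in range(int(round(t_end / dt))):
        sigma += dt * sigma_dot(sigma, sigma_star, depth)
    return sigma

def closed_form_depth1(sigma0, sigma_star, t):
    """Depth 1: exponential convergence at a rate independent of sigma*."""
    return sigma_star + (sigma0 - sigma_star) * math.exp(-t)

def closed_form_depth2(sigma0, sigma_star, t):
    """Depth 2: logistic curve whose rate scales with sigma*."""
    decay = math.exp(-sigma_star * t)
    return sigma0 * sigma_star / (sigma0 + (sigma_star - sigma0) * decay)
```

Calling `integrate(1e-3, 2.0, 2, 3.0)` and `closed_form_depth2(1e-3, 2.0, 3.0)` should give nearly identical values, up to the Euler discretization error.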

Analyzing these solutions, we see how even in such a simple model depth causes different factors of the model to be learned at different rates. Specifically, values corresponding to larger optimal values converge faster, suggesting a form of incremental learning. This is most clear for $N = 2$, where the solution isn’t implicit, but is also the case for $N \geq 3$, as we will see in the next subsection.

These dynamics are depicted in figure 2, where we see the dynamics of the different values of $\sigma(t)$ as learning progresses. When $N = 1$, all values are learned at the same rate regardless of the initialization, while the deeper models are clearly biased towards learning the larger singular values first, especially at small initialization scales.

Our model has only one optimal solution due to the population loss, but it is clear how this sort of dynamic can induce sparse solutions: if the model is able to fit the data after a small number of learning phases, then its obtained result will be sparse. Alternatively, if $N = 1$, we know that the dynamics will lead to the minimal norm solution, which is dense. We explore the sparsity-inducing bias of our toy model by comparing it empirically to a greedy sparse approximation algorithm in appendix D, and give our theoretical results in the next section. (The code for reproducing all of our experiments can be found at https://github.com/dsgissin/Incremental-Learning.)

Figure 2: Incremental learning dynamics in the toy model. Each panel shows the evolution of $\sigma(t)$ for several optimal values $\sigma^*$ according to the analytical solutions in theorem 1, under different depths and initializations. The first column has all values converging at the same rate. Notice how the deep parameterization with small initialization leads to distinct phases of learning, where values are learned incrementally (bottom-right). The shallow model’s much weaker incremental learning, even at small initialization scales (second column), is explained in theorem 2.

3 Incremental Learning

Equipped with analytical solutions for the dynamics of our model for every depth, we turn to study how the depth and initialization affect incremental learning. While gidel2019implicit focuses on incremental learning in depth-2 models at the limit of $\sigma_0 \to 0$, we will study the phenomenon for a general depth and for $\sigma_0 > 0$.

First, we will define the notion of incremental learning. Since all values of $\sigma$ are learned in parallel, we can’t expect one value to converge before the other moves at all (which happens for infinitesimal initialization as shown by gidel2019implicit). We will need a more relaxed definition for incremental learning in finite initialization scales.

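Definition 1.

Given two values $\sigma_i, \sigma_j$ such that $\sigma_i^* > \sigma_j^* > 0$ and both are initialized as $\sigma_i(0) = \sigma_j(0) = \sigma_0 < \sigma_j^*$, and given two scalars $s \in (0, \frac{1}{4})$ and $f \in (\frac{3}{4}, 1)$, we call the learning of the values $(s, f)$-incremental if there exists a time $t$ such that:

$$\sigma_j(t) \leq s\,\sigma_j^* < f\,\sigma_i^* \leq \sigma_i(t)$$

In words, two values have distinct learning phases if the first almost converges ($f \approx 1$) before the second changes by much ($s \ll 1$). Given this definition of incremental learning, we turn to study the conditions that facilitate incremental learning in our toy model.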
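The definition can be checked directly on trajectories. The sketch below is illustrative only: it reads the definition as "there is a time at which the smaller value is still at most $s$ times its optimum while the larger value has already reached $f$ times its optimum", and it assumes depth-2 trajectories follow the logistic curve $\sigma_0\sigma^* / (\sigma_0 + (\sigma^* - \sigma_0)e^{-\sigma^* t})$. The function names, the time grid and the example values are hypothetical choices.

```python
import math

def sigma_depth2(sigma0, sigma_star, t):
    """Assumed depth-2 trajectory: logistic solution of d(sigma)/dt = sigma * (sigma* - sigma)."""
    decay = math.exp(-sigma_star * t)
    return sigma0 * sigma_star / (sigma0 + (sigma_star - sigma0) * decay)

def is_incremental(sigma0, star_i, star_j, s, f, t_max=50.0, dt=0.01):
    """True if some grid time t has sigma_j(t) <= s*star_j and sigma_i(t) >= f*star_i."""
    for k in range(int(t_max / dt) + 1):
        t = k * dt
        slow_still_small = sigma_depth2(sigma0, star_j, t) <= s * star_j
        fast_almost_done = sigma_depth2(sigma0, star_i, t) >= f * star_i
        if slow_still_small and fast_almost_done:
            return True
    return False
```

With optimal values 2 and 1, $s = 0.25$ and $f = 0.75$, an initialization of $10^{-3}$ passes the check while an initialization of $0.1$ does not, which shows how sensitive the shallow model is to the initialization scale.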

Our main result is a dynamical depth separation result, showing that incremental learning is dependent on the ratio $\sigma_i^* / \sigma_j^*$ in different ways for different values of $N$. The largest difference in dependence happens between $N = 2$ and $N = 3$, where the dependence changes from exponential to polynomial:

Theorem 2.

Given two values of a toy linear model as in equation 1, where and the model is initialized as in equation 4, and given two scalars and , then the largest initialization value for which the learning phases of the values are -incremental, denoted , is bounded in the following way:

(9)
(10)
Proof sketch (the full proof is given in appendix A).

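Rewriting the separable differential equation in equation 8, $\dot{\sigma} = \sigma^{2-\frac{2}{N}}(\sigma^* - \sigma)$, to calculate the time $t_\alpha$ until $\sigma(t) = \alpha\sigma^*$, we get the following:

$$t_\alpha(\sigma) = \int_{\sigma_0}^{\alpha\sigma^*} \frac{d\sigma}{\sigma^{2-\frac{2}{N}}\left(\sigma^* - \sigma\right)}$$

The condition for incremental learning is then the requirement that $t_f(\sigma_i) \leq t_s(\sigma_j)$, resulting in:

$$\int_{\sigma_0}^{f\sigma_i^*} \frac{d\sigma}{\sigma^{2-\frac{2}{N}}\left(\sigma_i^* - \sigma\right)} \leq \int_{\sigma_0}^{s\sigma_j^*} \frac{d\sigma}{\sigma^{2-\frac{2}{N}}\left(\sigma_j^* - \sigma\right)}$$

We then relax/restrict the above condition to get a necessary/sufficient condition on $\sigma_0$, leading to a lower and upper bound on the threshold value of $\sigma_0$.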
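The threshold can also be located numerically rather than bounded. The following sketch is an illustration under stated assumptions, not the paper's procedure: it assumes the time for a value to reach a fraction $\alpha$ of its optimum is the integral of $1 / (\sigma^{2-2/N}(\sigma^* - \sigma))$ from $\sigma_0$ to $\alpha\sigma^*$, evaluates that integral with a midpoint rule in $\log\sigma$, and bisects on $\log\sigma_0$ for the largest initialization at which the larger value reaches $f$ before the smaller value reaches $s$. The function names, the search range and the example values are hypothetical.

```python
import math

def time_to_fraction(sigma0, sigma_star, alpha, depth, n=2000):
    """Midpoint-rule estimate of the time for sigma to go from sigma0 to alpha*sigma_star."""
    lo, hi = math.log(sigma0), math.log(alpha * sigma_star)
    if hi <= lo:
        return 0.0  # already at or past the target fraction
    du = (hi - lo) / n
    power = 1.0 - 2.0 / depth
    total = 0.0
    for k in range(n):
        u = lo + (k + 0.5) * du
        # With sigma = exp(u): d(sigma) / (sigma^(2-2/N) (sigma* - sigma))
        #                    = exp(-power * u) / (sigma* - exp(u)) du
        total += math.exp(-power * u) / (sigma_star - math.exp(u))
    return total * du

def largest_incremental_init(star_i, star_j, s, f, depth, iters=60):
    """Bisection on log(sigma0) for the largest init with t_f(sigma_i) <= t_s(sigma_j)."""
    lo, hi = math.log(1e-9), math.log(s * star_j)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        sigma0 = math.exp(mid)
        fast = time_to_fraction(sigma0, star_i, f, depth)
        slow = time_to_fraction(sigma0, star_j, s, depth)
        if fast <= slow:
            lo = mid   # still incremental, try a larger initialization
        else:
            hi = mid
    return math.exp(lo)
```

For optimal values 1.5 and 1 with $s = 0.25$, $f = 0.75$, the depth-3 threshold found this way is roughly an order of magnitude larger than the depth-2 threshold, in line with the depth separation described in the text.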

Note that the value determining the condition for incremental learning is the ratio $\sigma_i^* / \sigma_j^*$: if two values are in the same order of magnitude, then their ratio will be close to $1$ and we will need a small initialization to obtain incremental learning. The dependence on the ratio changes with depth, and is exponential for $N = 2$. This means that incremental learning, while possible for shallow models, is difficult to see in practice. This result explains why changing the initialization scale in figure 2 changes the dynamics of the $N \geq 3$ models, while not changing the dynamics for $N = 2$ noticeably.

The next theorem extends part of our analysis to gradient descent, a more realistic setting than the infinitesimal learning rate of gradient flow:

Theorem 3.

Given two values of a depth-2 toy linear model as in equation 1, such that and the model is initialized as in equation 4, and given two scalars and , and assuming , and assuming we optimize with gradient descent with a learning rate , then the largest initialization value for which the learning phases of the values are -incremental, denoted , is lower and upper bounded in the following way:

Where and are defined as:

We defer the proof to appendix B.

Note that this result, while less elegant than the bounds of the gradient flow analysis, is similar in nature. Both and simplify to when we take their first order approximation around , giving us similar bounds and showing that the condition on for is exponential in gradient descent as well.

While similar gradient descent results are harder to obtain for deeper models, we discuss the general effect of depth on the gradient descent dynamics in appendix C.
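A discrete-time version of the toy model is easy to simulate. The sketch below is a minimal illustration: it assumes the depth-normalized loss $\frac{1}{2N^2}(w^N - \sigma^*)^2$ per coordinate, so that the gradient with respect to $w$ is $\frac{1}{N}w^{N-1}(w^N - \sigma^*)$, and applies plain gradient descent. The learning rate and step count in the example are arbitrary choices, not values from the paper.

```python
def gd_step(w, sigma_star, depth, lr):
    """One gradient descent step on the assumed loss (w^N - sigma*)^2 / (2 N^2)."""
    grad = (w ** depth - sigma_star) * w ** (depth - 1) / depth
    return w - lr * grad

def run_gd(sigma0, sigma_star, depth, lr, steps):
    """Return the trajectory of sigma = w^N, starting from w(0) = sigma0^(1/N)."""
    w = sigma0 ** (1.0 / depth)
    trajectory = [w ** depth]
    for _ in range(steps):
        w = gd_step(w, sigma_star, depth, lr)
        trajectory.append(w ** depth)
    return trajectory
```

For example, `run_gd(1e-3, 2.0, 2, 0.1, 2000)` produces a trajectory that rises monotonically towards 2 without overshooting, which is the regime the gradient descent analysis restricts itself to.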

4 Incremental Learning in Larger Models

So far, we have only shown interesting properties of incremental learning caused by depth for a toy model. In this section, we will relate several deep models to our toy model and show how incremental learning presents itself in larger models as well.

4.1 Matrix Sensing

The task of matrix sensing is a generalization of matrix completion, where our input space is and our model is a matrix , such that:

(11)

Following arora2019implicit, we introduce depth by parameterizing the model using a product of matrices and the following initialization scheme ():

(12)
(13)

Note that when , the deep matrix sensing model reduces to our toy model without weight sharing. We study the dynamics of the model under gradient flow over a depth-normalized squared loss, assuming the data is labeled by a matrix sensing model parameterized by a PSD :

(14)

The following theorem relates the deep matrix sensing model to our toy model, showing the two have the same dynamical equations:

Theorem 4.

Optimizing the deep matrix sensing model described in equation 12 with gradient flow over the depth normalized squared loss (equation 14), with weights initialized as in equation 12 leads to the following dynamical equations for different values of :

(15)

where $\sigma_i$ and $\sigma_i^*$ are the $i$th singular values of $W$ and $W^*$, respectively, corresponding to the same singular vector.

The proof follows that of saxe2013exact and gidel2019implicit and is deferred to appendix E.

Theorem 4 shows us that the bias towards sparse solutions introduced by depth in the toy model is equivalent to the bias for low-rank solutions in the matrix sensing task. This bias was studied in a more general setting in arora2019implicit, with empirical results supporting the effect of depth on the obtainment of low-rank solutions under a more natural loss and initialization scheme. We recreate and discuss these experiments and their connection to our analysis in appendix E, and an example of these dynamics in deep matrix sensing can also be seen in panel (a) of figure 1.
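The correspondence between singular values and the toy model can be illustrated with a small simulation. This is a sketch under several assumptions that are not taken from the paper: it minimizes the population objective $\frac{1}{2N}\|W_N \cdots W_1 - W^*\|_F^2$ (the $\frac{1}{2N}$ normalization is a choice made here), initializes every layer to $\sigma_0^{1/N} I$, uses a rank-2 PSD target with eigenvalues 3, 1 and 0, and runs plain gradient descent. All names and hyperparameters are hypothetical.

```python
import numpy as np

def chain(mats, d):
    """Multiply layers given in application order [A_1, ..., A_k] into A_k @ ... @ A_1."""
    out = np.eye(d)
    for m in mats:
        out = m @ out
    return out

def train_deep_factorization(target, depth=3, sigma0=1e-2, lr=0.02, steps=2000):
    d = target.shape[0]
    layers = [sigma0 ** (1.0 / depth) * np.eye(d) for _ in range(depth)]
    history = []
    for _ in range(steps):
        product = chain(layers, d)
        # Track the singular values of the induced (end-to-end) matrix.
        history.append(np.linalg.svd(product, compute_uv=False))
        residual = product - target
        grads = []
        for i in range(depth):
            left = chain(layers[i + 1:], d)   # layers applied after layer i
            right = chain(layers[:i], d)      # layers applied before layer i
            grads.append(left.T @ residual @ right.T / depth)
        layers = [w - lr * g for w, g in zip(layers, grads)]
    return chain(layers, d), np.array(history)

rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((3, 3)))
target = basis @ np.diag([3.0, 1.0, 0.0]) @ basis.T
final, history = train_deep_factorization(target)
```

Plotting the columns of `history` against the step index should show the largest singular value rising well before the second one, while the third stays near its initial value, so that intermediate iterates are approximately low-rank.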

4.2 Quadratic Neural Networks

By drawing connections between quadratic networks and matrix sensing (as in soltanolkotabi2018theoretical), we can extend our results to these nonlinear models. We will study a simplified quadratic network, where our input space is $\mathbb{R}^d$ and the first layer is parameterized by a weight matrix $W$ and followed by a quadratic activation function. The final layer will be a summation layer. We assume, like before, that the labeling function is a quadratic network parameterized by $W^*$. Our model can be written in the following way, using the following orthogonal initialization scheme:

(16)
(17)

Immediately, we see the similarity of the quadratic network to the deep matrix sensing model with $N = 2$, where the input space is made up of rank-1 matrices. However, the change in input space forces us to optimize over a different loss function to reproduce the same dynamics:

Definition 2.

Given an input distribution over an input space, with a labeling function and a hypothesis, the variance loss is defined in the following way:

Note that minimizing this loss function amounts to minimizing the variance of the error, while the squared loss minimizes the second moment of the error. We note that both loss functions have the same minimum for our problem, and the dynamics of the squared loss can be approximated in certain cases by the dynamics of the variance loss. For a complete discussion of the two losses, including the cases where the two losses have similar dynamics, we refer the reader to appendix F.
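The relation between the two losses is the usual decomposition of a second moment into a variance plus a squared mean. The sketch below illustrates it on a finite sample of errors, ignoring any normalization constant the losses may carry; the function names are hypothetical.

```python
def second_moment(errors):
    """Empirical squared loss: the mean of the squared errors."""
    return sum(e * e for e in errors) / len(errors)

def variance(errors):
    """Empirical variance loss: the mean squared deviation of the errors from their mean."""
    mean = sum(errors) / len(errors)
    return sum((e - mean) ** 2 for e in errors) / len(errors)

errors = [1.0, 2.0, 3.0]
mean_error = sum(errors) / len(errors)
# second moment = variance + (mean error)^2
gap = second_moment(errors) - variance(errors) - mean_error ** 2
```

The two losses therefore coincide whenever the mean error is zero, and differ by the squared mean error otherwise.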

Theorem 5.

Minimizing the quadratic network described and initialized as in equation 16 with gradient flow over the variance loss defined in definition 2 leads to the following dynamical equations:

(18)

where $\sigma_i$ and $\sigma_i^*$ are the $i$th singular values of $W$ and $W^*$, respectively, corresponding to the same singular vector.

We defer the proof to appendix F and note that these dynamics are the same as our depth-2 toy model, showing that shallow quadratic networks can exhibit incremental learning (albeit requiring a small initialization).

4.3 Diagonal/Convolutional Linear Networks

While incremental learning has been described for deep linear networks in the past, it has been restricted to regression tasks. Here, we illustrate how incremental learning presents itself in binary classification, where implicit bias results have so far focused on convergence at $t \to \infty$ (soudry2018implicit, nacson2018convergence, ji2019implicit). Deep linear networks with diagonal weight matrices have been shown to be biased towards sparse solutions when $N > 1$ in gunasekar2018implicit, and biased towards the max-margin solution for $N = 1$. Instead of analyzing convergence at $t \to \infty$, we intend to show that the model favors sparse solutions for the entire duration of optimization, and that this is due to the dynamics of incremental learning.

Our theoretical illustration will use our toy model as in equation 1 (initialized as in equation 4) as a special weight-shared case of deep networks with diagonal weight matrices, and we will then show empirical results for the more general setting. We analyze the optimization dynamics of this model over a separable dataset where . We use the exponential loss () for the theoretical illustration and experiment on the exponential and logistic losses.

Computing the gradient for the model over , the gradient flow dynamics for become:

(19)

We see the same dynamical attenuation of small values of that is seen in the regression model, caused by the multiplication by . From this, we can expect the same type of incremental learning to occur - weights of will be learned incrementally until the dataset can be separated by the current support of . Then, the dynamics strengthen the growth of the current support while relatively attenuating that of the other values. Since the data is separated, increasing the values of the current support reduces the loss and the magnitude of subsequent gradients, and so we should expect the support to remain the same and the model to converge to a sparse solution.

Granted, the above description is just intuition, but panel (c) of figure 1 shows how it is borne out in practice (similar results are obtained for the logistic loss). In appendix G we further explore this model, showing deeper networks have a stronger bias for sparsity. We also observe that the initialization scale plays a similar role as before: deep models are less biased towards sparsity when $\sigma_0$ is large.
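The intuition can be tried out on a tiny separable problem. The sketch below is illustrative and makes its own assumptions: the model is $\langle w^N, x \rangle$ with non-negative weights initialized at $\sigma_0^{1/N}$, the loss is the unnormalized exponential loss $\sum_i \exp(-y_i \langle \sigma, x_i \rangle)$, and the four data points (given directly as $y_i x_i$) are made up so that the first feature alone separates them. The function name and hyperparameters are hypothetical.

```python
import numpy as np

def train_diagonal_classifier(yx, depth=3, sigma0=1e-2, lr=0.05, steps=2000):
    """Gradient descent on sum_i exp(-<w^N, y_i x_i>); rows of yx are y_i * x_i."""
    w = np.full(yx.shape[1], sigma0 ** (1.0 / depth))
    losses = []
    for _ in range(steps):
        sigma = w ** depth
        per_example = np.exp(-yx @ sigma)
        losses.append(per_example.sum())
        grad_sigma = -(per_example[:, None] * yx).sum(axis=0)
        # Chain rule through sigma = w^N: small weights receive attenuated gradients.
        grad_w = depth * w ** (depth - 1) * grad_sigma
        w = w - lr * grad_w
    return w ** depth, np.array(losses)

yx = np.array([
    [1.0, 0.2, -0.3],
    [1.0, -0.4, 0.5],
    [0.8, 0.3, 0.1],
    [1.2, -0.1, -0.2],
])
sigma, losses = train_diagonal_classifier(yx)
```

In this run the first coordinate of `sigma` grows to dominate the other two, which stay close to their initial value, so the learned separator is effectively supported on a single feature.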

In their work, gunasekar2018implicit show an equivalence between the diagonal network and the circular-convolutional network in the frequency domain. According to their results, we should expect to see the same sparsity bias of diagonal networks in convolutional networks, when looking at the Fourier coefficients of $\sigma$. An example of this can be seen in panel (d) of figure 1, and we refer the reader to appendix G for a full discussion of their convolutional model and its incremental learning dynamics.

5 Conclusion

Gradient-based optimization for deep linear models has an implicit bias towards simple (sparse) solutions, caused by an incremental search strategy over the hypothesis space. Deeper models have a stronger tendency for incremental learning, exhibiting it in more realistic initialization scales.

This dynamical phenomenon exists for the entire optimization process for regression as well as classification tasks, and for many types of models - diagonal networks, convolutional networks, matrix completion and even the nonlinear quadratic network. We believe this kind of dynamical analysis may be able to shed light on the generalization of deeper nonlinear neural networks as well, with shallow quadratic networks being only a first step towards that goal.

Acknowledgments

This research is supported by the European Research Council (TheoryDL project).

References

Appendix A Proof of Theorem 2

Theorem.

Given two values of a toy linear model as in equation 1, such that and the model is initialized as in equation 4, and given two scalars and , then the largest initialization value for which the learning phases of the values are -incremental, denoted , is lower and upper bounded in the following way:

Proof.

Our strategy will be to define the time $t_\alpha$ for which a value reaches a fraction $\alpha$ of its optimal value, and then require that $t_f(\sigma_i) \leq t_s(\sigma_j)$. We begin with recalling the differential equation which determines the dynamics of the model:

Since the solution for is implicit and difficult to manage in a general form, we will define using the integral of the differential equation. The equation is separable, and under initialization of we can describe in the following way:

(20)

Incremental learning takes place when happens before . We can write this condition in the following way:

Plugging in and rearranging, we get the following necessary and sufficient condition for incremental learning:

(21)

Our last step before relaxing and restricting our condition will be to split the integral on the left-hand side into two integrals:

(22)

At this point, we cannot solve this equation and isolate to obtain a clear threshold condition on it for incremental learning. Instead, we will relax/restrict the above condition to get a necessary/sufficient condition on , leading to a lower and upper bound on the threshold value of .

Sufficient Condition

To obtain a sufficient (but not necessary) condition on , we may make the condition stricter either by increasing the left-hand side or decreasing the right-hand side. We can increase the left-hand side by removing from the left-most integral’s denominator () and then combine the left-most and right-most integrals:

Next, we note that the integration bounds give us a bound on for either integral. This means we can replace with on the right-hand side, and replace with on the left-hand side:

We may now solve these integrals for every and isolate , obtaining the lower bound on . We start with the case where :

Rearranging to isolate , we obtain our result:

(23)

For the case, we have the following after solving the integrals:

For simplicity we may further restrict the condition by removing the term . Solving for gives us the following:

(24)

Necessary Condition

To obtain a necessary (but not sufficient) condition on , we may relax the condition in equation 22 either by decreasing the left-hand side or increasing the right-hand side. We begin by rearranging the equation:

Like before, we may use the integration bounds to bound . Plugging in for all integrals decreases the left-hand side and increases the right-hand side, leading us to the following:

Rearranging, we get the following inequality:

We now solve the integrals for the different cases. For , we have:

Rearranging to isolate , we get our condition:

(25)

Finally, for , we solve the integrals to give us:

Rearranging to isolate , we get our condition:

(26)

Summary

For a given , we derived a sufficient condition and a necessary condition on for ()-incremental learning. The necessary and sufficient condition on , which is the largest initialization value for which we see incremental learning (denoted ), is between the two derived bounds.

The precise bounds can possibly be improved a bit, but the asymptotic dependence on the ratio $\sigma_i^* / \sigma_j^*$ is the crux of the matter, showing that the dependence on this ratio changes with depth, with a substantial difference when we move from shallow models ($N = 2$) to deeper ones ($N \geq 3$).

Appendix B Proof of Theorem 3

Theorem.

Given two values of a depth-2 toy linear model as in equation 1, such that and the model is initialized as in equation 4, and given two scalars and , and assuming , and assuming we optimize with gradient descent with a learning rate , then the largest initialization value for which the learning phases of the values are -incremental, denoted , is lower and upper bounded in the following way:

Where and are defined as:

Proof.

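To show our result for gradient descent and $N = 2$, we build on the proof techniques of theorem 3 of gidel2019implicit. We start by deriving the recurrence relation for the values $\sigma(t)$ for general depth, where $t$ now stands for the iteration. Remembering that $w_i^N = \sigma_i$, we write down the gradient update for $w_i(t)$ with learning rate $\eta$:

$$w_i(t+1) = w_i(t) + \frac{\eta}{N}\, w_i(t)^{N-1}\left(\sigma_i^* - w_i(t)^N\right)$$

Raising $w_i(t+1)$ to the $N$th power, we get the gradient update for the $\sigma$ values:

$$\sigma_i(t+1) = \sigma_i(t)\left(1 + \frac{\eta}{N}\,\sigma_i(t)^{1-\frac{2}{N}}\left(\sigma_i^* - \sigma_i(t)\right)\right)^N \tag{27}$$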

Next, we will prove a simple lemma which gives us the maximal learning rate we will consider for the analysis, for which there is no overshooting (the values don’t grow larger than the optimal values).

Lemma 1.

For the gradient update in equation 27, assuming , if and , then:

Proof.

Plugging in for , we have:

Defining and dividing both sides by , we have:

It is enough to show that for any , we have that , as over-shooting occurs when . Indeed, this function is monotonically increasing in (since the exponent is non-negative), and equals when . Since is a fixed point and no iteration that starts at can cross , then for any . This concludes our proof. ∎

Under this choice of learning rate, we can now obtain our incremental learning results for gradient descent when . Our strategy will be bounding from below and above, which will give us a lower and upper bound for . Once we have these bounds, we will be able to describe either a necessary or a sufficient condition on for incremental learning, similar to theorem 2.

The update rule for is:

Next, we plug in for and denote and to get:

(28)

Following theorem 3 of gidel2019implicit, we bound :

Where in the fourth line we use the inequality . We may now subtract from both sides to obtain:

We may now obtain a bound on by plugging in and taking the log:

Rearranging (note that and that our choice of keeps the argument of the log positive), we get:

(29)

Next, we follow the same procedure for an upper bound. Starting with our update step:

Where in the last line we use the inequality . Subtracting from both sides, we get: