Kernel and Deep Regimes in Overparametrized Models

06/13/2019 ∙ by Blake Woodworth, et al.

A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach, we show how the scale of the initialization controls the transition between the "kernel" (aka lazy) and "deep" (aka active) regimes and affects generalization properties in multilayer homogeneous models. We provide a complete and detailed analysis for a simple two-layer model that already exhibits an interesting and meaningful transition between the kernel and deep regimes, and we demonstrate the transition for more complex matrix factorization models.


1 Introduction

A string of recent papers study neural networks trained with gradient descent in the “kernel regime.” The main observation is that, in a certain regime, networks trained with gradient descent behave as kernel methods, and so can be studied as such Jacot et al. (2018); Daniely et al. (2016); Daniely (2017). This allows one to prove convergence to zero error solutions in overparametrized settings Du et al. (2018, 2019); Allen-Zhu et al. (2018), but it also implies gradient descent will converge to the minimum norm solution (in the corresponding RKHS) Chizat and Bach (2018); Arora et al. (2019); Mei et al. (2019), and more generally that models will inherit the inductive bias and generalization behaviour of the RKHS. This would suggest that deep models can be effectively replaced by kernel methods with the “right” kernel (a fixed kernel determined by the architecture and initialization), and thus that deep learning boils down to a kernel method and can only learn problems learnable by some kernel.

This contrasts with other recent results that show how, in deep models, including infinitely overparametrized networks, training with gradient descent induces an inductive bias that cannot be represented as an RKHS norm. For example, analytic and/or empirical results suggest that gradient descent on deep linear convolutional networks implicitly biases toward minimizing the $\ell_p$ bridge penalty, for $p \leq 1$, in the frequency domain Gunasekar et al. (2018b); weight decay on an infinite width, single input ReLU network implicitly biases towards minimizing the second order total variation of the learned function Savarese et al. (2019); and gradient descent on an overparametrized matrix factorization, which can be thought of as a two layer linear network, induces nuclear norm minimization of the learned matrix Gunasekar et al. (2017) and can ensure low rank matrix recovery Li et al. (2018). All these natural inductive biases (the $\ell_p$ bridge penalty for $p \leq 1$, total variation norm, nuclear norm) are not Hilbert norms, and therefore cannot be captured by any kernel. This suggests that training deep models with gradient descent can behave very differently from kernel methods, and have much richer inductive biases.

One might ask whether the kernel approximation indeed captures the behavior of deep learning in a relevant and interesting regime, or whether the success of deep learning comes when learning escapes this regime. In order to understand this, we must first carefully understand when each of these regimes holds, and how the transition between the “kernel” regime and the “deep” regime happens.

Some investigations of the kernel regime emphasized the number of parameters (“width”) going to infinity as leading to this regime. However, Chizat and Bach (2018) identified the scale of the model as a quantity controlling entry into the kernel regime. Their results suggest that for any number of parameters (any width), a model can be approximated by a kernelized linear model when its scale at initialization goes to infinity (see details in Section 3). Considering models with increasing (or infinite) width, the relevant regime (kernel or deep) is determined by how the scaling at initialization behaves as the width goes to infinity. In this paper we elaborate and expand on this view, carefully studying how the scale of initialization affects the model behaviour for $D$-homogeneous models.

In Section 4 we provide a complete and detailed study of a simple 2-homogeneous model that can be viewed as linear regression with a squared parametrization, or as a “diagonal” linear neural network. For this model we can exactly characterize the implicit bias of training with gradient descent, as a function of the scale $\alpha$ of the initialization, and see how this implicit bias becomes the L2 norm in the kernel regime, but the L1 norm in the deep regime. We can therefore understand how, e.g. for a high dimensional problem with underlying sparse structure, we can get good generalization when the initialization scale $\alpha$ is small, but not when $\alpha$ is large. In Section 5 we demonstrate a similar transition in matrix factorization.

2 Setup and preliminaries

We consider models $f(w, x)$ which map parameters $w \in \mathbb{R}^p$ and examples $x$ to predictions $f(w, x) \in \mathbb{R}$. We denote the predictor implemented by the parameters $w$ as $F(w)$, such that $F(w)(x) = f(w, x)$. Much of our focus will be on models, such as linear networks, which are linear in $x$ (but not in the parameters $w$!), in which case $F(w)$ is a linear predictor and can be represented as a vector $\beta_w \in \mathbb{R}^d$ with $f(w, x) = \langle \beta_w, x \rangle$. Such models are essentially alternate parametrizations of linear models, but as we shall see this change of parametrization is crucial.

We consider models that are $D$-positive homogeneous in the parameters $w$, for some integer $D$, meaning that for any $c > 0$, $f(c \cdot w, x) = c^D f(w, x)$, and so $F(c \cdot w) = c^D F(w)$. We will refer to such models simply as $D$-homogeneous. Such homogeneity is satisfied by many interesting model classes, including multi-layer ReLU networks with fully connected and convolutional layers, layered linear neural networks, and matrix factorization, where $D$ corresponds to the depth of the network.

Consider a training set $\{(x_i, y_i)\}_{i=1}^n$ consisting of $n$ examples of input–label pairs. For a given loss function $\ell$, the loss of the model parametrized by $w$ is $L(w) = \sum_{i=1}^n \ell(f(w, x_i), y_i)$. We will focus on the squared loss $\ell(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$. We slightly abuse notation and use $f(w, X) \in \mathbb{R}^n$ to denote the vector of predictions on the training inputs, and so we can write $L(w) = \frac{1}{2}\|f(w, X) - y\|_2^2$, where $y \in \mathbb{R}^n$ is the vector of target labels.

Minimizing the loss using gradient descent amounts to iteratively updating the parameters

$$w_{t+1} = w_t - \eta \nabla L(w_t). \qquad (1)$$

We will consider gradient descent with infinitesimally small stepsize $\eta$, which is captured by the gradient flow dynamics

$$\dot{w}(t) = -\nabla L(w(t)). \qquad (2)$$

We will be particularly interested in the scale of the initialization, and will capture this through a scalar parameter $\alpha > 0$. For each scale $\alpha$, we will denote by $w_\alpha(t)$ the dynamics obtained by the gradient flow dynamics (2) with the initial condition $w_\alpha(0) = \alpha w_0$ for some fixed $w_0$. We will also denote by $F_\alpha(t) = F(w_\alpha(t))$, or in the case of linear predictors $\beta_\alpha(t) = \beta_{w_\alpha(t)}$, the dynamics on the predictor induced by the gradient flow dynamics on $w$.

In many cases we expect the gradient flow dynamics to converge to a minimizer of $L$, though establishing that this happens will not be our focus. Rather, we are interested in the underdetermined case, where $n$ is smaller than the dimensionality of the predictor, and in general there are multiple minimizers of $L$, all with $L(w) = 0$ and so $f(w, X) = y$. The question we will mostly be concerned with is which of these many minimizers gradient flow converges to. That is, we would like to characterize $w_\alpha(\infty) := \lim_{t \to \infty} w_\alpha(t)$, or more importantly the predictor $F_\alpha(\infty)$ or $\beta_\alpha(\infty)$ we converge to, and how these depend on the scale $\alpha$. In underdetermined problems, where there are many zero error solutions, simply fitting the data using the model does not provide enough inductive bias to ensure generalization. But in many cases, the specific solution reached by gradient flow (or some other optimization procedure) has special structure, or minimizes some implicit regularizer, and this structure or regularizer provides the needed inductive bias Gunasekar et al. (2018b, a); Soudry et al. (2018); Ji and Telgarsky (2018).
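As a concrete illustration of these dynamics (ours, not an experiment from the paper), the following sketch numerically integrates the gradient flow (2) from the scaled initialization $\alpha w_0$ for an arbitrary model, using finite-difference gradients and a generic ODE solver; the model `f_diag` anticipates the squared parametrization of Section 4, and all sizes and tolerances are arbitrary choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

def squared_loss(f, w, X, y):
    """L(w) = 1/2 ||f(w, X) - y||^2, with f returning the vector of predictions."""
    return 0.5 * np.sum((f(w, X) - y) ** 2)

def numerical_grad(L, w, eps=1e-6):
    """Central finite-difference gradient of a scalar function L at w."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (L(w + e) - L(w - e)) / (2 * eps)
    return g

def gradient_flow(f, X, y, w0, alpha, t_final=1e3):
    """Integrate w'(t) = -grad L(w(t)) from w(0) = alpha * w0 (eq. (2))."""
    L = lambda w: squared_loss(f, w, X, y)
    rhs = lambda t, w: -numerical_grad(L, w)
    sol = solve_ivp(rhs, (0, t_final), alpha * w0, method="RK45", rtol=1e-6, atol=1e-9)
    return sol.y[:, -1]

# Example: the "diagonal" squared parametrization studied in Section 4.
def f_diag(w, X):
    d = X.shape[1]
    return X @ (w[:d] ** 2 - w[d:] ** 2)

rng = np.random.default_rng(0)
n, d = 5, 8                                   # underdetermined: many zero-error solutions
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
w_inf = gradient_flow(f_diag, X, y, np.ones(2 * d), alpha=0.1)
beta_inf = w_inf[:d] ** 2 - w_inf[d:] ** 2
print(np.linalg.norm(X @ beta_inf - y))       # ~0 if the flow reached an interpolating solution
```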

3 The Kernel Regime

Gradient descent and gradient flow only consider the first order approximation of the model w.r.t. $w$ about the current iterate $w_0$:

$$f(w, x) \approx f(w_0, x) + \langle \nabla_w f(w_0, x),\, w - w_0 \rangle. \qquad (3)$$

That is, locally around any $w_0$, gradient flow operates on the model as if it were an affine model with feature map $\phi_{w_0}(x) = \nabla_w f(w_0, x)$, corresponding to the tangent kernel $K_{w_0}(x, x') = \langle \nabla_w f(w_0, x), \nabla_w f(w_0, x') \rangle$ Jacot et al. (2018); Yang (2019); Lee et al. (2019). Of particular interest is the tangent kernel at initialization, which we denote $K_0 = K_{w(0)}$.
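To make this concrete, here is a minimal sketch (ours, not the paper's) that evaluates the tangent kernel at initialization by forming $\nabla_w f(w_0, x)$ with finite differences; the example model and the closed-form check at the end assume the squared parametrization of Section 4 with the unbiased initialization $w_+(0) = w_-(0) = \alpha\mathbf{1}$.

```python
import numpy as np

def grad_w(f, w0, x, eps=1e-6):
    """Finite-difference gradient of f(w, x) with respect to w, at w = w0."""
    g = np.zeros_like(w0)
    for i in range(len(w0)):
        e = np.zeros_like(w0)
        e[i] = eps
        g[i] = (f(w0 + e, x) - f(w0 - e, x)) / (2 * eps)
    return g

def tangent_kernel(f, w0, x, x_prime):
    """K_{w0}(x, x') = <grad_w f(w0, x), grad_w f(w0, x')>."""
    return grad_w(f, w0, x) @ grad_w(f, w0, x_prime)

# Example model: the squared parametrization of Section 4, f(w, x) = <w_+^2 - w_-^2, x>.
def f_squared(w, x):
    d = len(x)
    return (w[:d] ** 2 - w[d:] ** 2) @ x

d, alpha = 5, 2.0
rng = np.random.default_rng(0)
x, x_prime = rng.standard_normal(d), rng.standard_normal(d)
w0 = alpha * np.ones(2 * d)                   # unbiased initialization: w_+ = w_- = alpha * 1

# For this model the tangent kernel at initialization is proportional to <x, x'>.
print(tangent_kernel(f_squared, w0, x, x_prime))
print(8 * alpha ** 2 * (x @ x_prime))         # closed form for this parametrization
```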

The “kernel regime” refers to a limit in which the tangent kernel does not change over the course of optimization, and less formally to the regime in which it does not change significantly, i.e. where $K_{w(t)} \approx K_0$ throughout training. In this case, training the model is completely equivalent to training the affine model $f(w(0), x) + \langle \nabla_w f(w(0), x),\, w - w(0) \rangle$, or in other words to kernelized gradient descent (or gradient flow) with the kernel $K_0$ and a “bias term” of $f(w(0), x)$. In order to not have to worry about this bias term, and in particular its scaling, Chizat and Bach (2018) suggest considering “unbiased” initializations such that $f(w(0), x) = 0$, so that this bias term vanishes. This can be achieved in many cases by replicating units or components with opposite signs at initialization, and is the approach we take here (see Sections 4 and 5 for examples and details).

For underdetermined problems with multiple zero-error solutions, unbiased kernel gradient flow (or gradient descent) converges to the minimum norm solution $\arg\min_{\beta : X\beta = y} \|\beta\|_{K_0}$, where $\|\cdot\|_{K_0}$ is the RKHS norm corresponding to the kernel. (With a bias term, convergence is instead to the solution minimizing the RKHS norm of the change relative to the predictor at initialization.) And so, in the kernel regime, we will have that $\beta_\alpha(\infty) = \arg\min_{X\beta = y} \|\beta\|_{K_0}$, and the implicit bias of training is precisely given by the kernel.

When does the “kernel regime” happen? Chizat and Bach (2018) showed that for any homogeneous model satisfying certain technical conditions, the kernel regime is reached as $\alpha \to \infty$. That is, as we increase the scale of initialization, the dynamics converge to the kernel gradient flow dynamics with the kernel $K_0$, and we have $\lim_{\alpha \to \infty} \beta_\alpha(\infty) = \arg\min_{X\beta = y} \|\beta\|_{K_0}$. In Section 4 we prove this limit directly for our specific model, and we also demonstrate it empirically for matrix factorization in Section 5. (Two remarks on Chizat and Bach (2018): first, they did not consider only homogeneous models, and instead of studying the scale of the initialization they studied scaling the output of the model; for homogeneous models the dynamics obtained by scaling the initialization are equivalent to those obtained by scaling the output, and so here we focus on homogeneous models and on scaling the initialization. Second, a technical problem with their main result, Theorem 3.2, is that for models obtained by the symmetric initialization of duplicating units and negating their signs, the Jacobian of the model is degenerate at initialization, invalidating an assumption of the theorem; on the other hand, without such symmetric initialization, and for finite width models, the scale of the prediction at initialization explodes as $\alpha \to \infty$ and violates their assumptions. For this reason, we cannot rely on their result, and instead establish the kernel regime specifically for the model we study in Section 4.)

In contrast, and as we shall see in Sections 4 and 5, the small initialization limit often leads to a very different and rich inductive bias, e.g. inducing sparsity or low-rank structure Gunasekar et al. (2017); Li et al. (2018); Gunasekar et al. (2018b), that allows for generalization in many settings where kernel methods would not generalize. We refer to the limit reached as $\alpha \to 0$ as the “deep regime.” This regime is also referred to as the “active” or “adaptive” regime (Chizat and Bach, 2018), since the tangent kernel changes over the course of training, in a sense adapting to the data. We argue that this regime is the regime that truly allows us to exploit the power of depth, and thus is the relevant regime for understanding the success of deep learning.

4 Detailed Study of a Simple Depth-2 Model

We study in detail a simple 2-homogeneous model. Consider the class of linear functions over $\mathbb{R}^d$, with the squared parametrization

$$f(w, x) = \langle \beta_w, x \rangle, \qquad \beta_w = w_+^2 - w_-^2, \qquad w = (w_+, w_-) \in \mathbb{R}^{2d}, \qquad (4)$$

where we use the notation $u^2$ for $u \in \mathbb{R}^d$ to denote element-wise squaring. We will consider initializing all weights equally, i.e. using initializations that are scalings of $w_0 = \mathbf{1}$, so that $w_+(0) = w_-(0) = \alpha \mathbf{1}$.

This is nothing but a linear regression model, except with an unconventional over-parametrization. The model can also be thought of as a “diagonal” linear neural network (i.e. one in which the weight matrix has diagonal structure) with $2d$ units. A standard diagonal linear network would have $d$ units, with each unit $i$ connected to just a single input coordinate with weight $u_i$ and to the output with weight $v_i$, thus implementing the model $f(x) = \sum_i u_i v_i x_i$. But if $|u_i| = |v_i|$ at initialization, their magnitudes will remain equal and their signs will not flip throughout training, and so we can equivalently replace both with a single weight $w_i$, yielding the model $f(x) = \sum_i \pm w_i^2 x_i$. Using two such units per input coordinate, one with $u_i = v_i = w_{+,i}$ and one with $u_i = -v_i = w_{-,i}$, recovers the parametrization (4).

The reason for using both $w_+$ and $w_-$ (i.e. $2d$ units) is twofold: first, it ensures that the image of $w \mapsto \beta_w$ is all (signed) linear functions, and thus the model is truly equivalent to standard linear regression. Second, it allows for initialization with $\beta_w = 0$ without $w$ itself being zero, which would be a saddle point from which gradient flow never escapes. (Our results can be generalized to non-uniform initialization, “biased initialization” (i.e. where $\beta_w \neq 0$ at initialization), or an asymmetric parametrization; however, this complicates the presentation without adding much insight.)

The model (4) is perhaps the simplest non-trivial 2-homogeneous model but, as we shall see, it already exhibits distinct and meaningful kernel and deep regimes. Furthermore, we can completely understand the implicit regularization driving this model analytically, and precisely characterize the transition between the kernel and deep regimes.

Let us consider the behavior of the limit of gradient flow (eq. (2)) as a function of the initialization scale, in the under-determined case where there are many solutions $X\beta = y$. It is straightforward to compute the tangent kernel at initialization and confirm that $K_0(x, x') \propto \langle x, x' \rangle$, i.e. it is the standard inner product kernel (up to scaling), and so $\|\beta\|_{K_0} \propto \|\beta\|_2$. Therefore, in the kernel regime, gradient flow would take us to the minimum L2 norm solution $\beta^*_{L2} = \arg\min_{X\beta = y} \|\beta\|_2$. Following Chizat and Bach (2018) and the discussion in Section 3, we would therefore expect that $\lim_{\alpha \to \infty} \beta_\alpha(\infty) = \beta^*_{L2}$.

In contrast, Gunasekar et al. (2017, Corollary 2) shows that when $\alpha \to 0$, gradient flow will lead instead to the minimum L1 norm solution $\beta^*_{L1} = \arg\min_{X\beta = y} \|\beta\|_1$. This is the “deep regime” in this case. We already see two very distinct behaviors and, in high dimensions, two very different inductive biases, with the deep regime inducing a bias that is not an RKHS norm for any choice of kernel. Can we characterize and understand the transition between the two regimes as $\alpha$ transitions from very small to very large? The following theorem does just that.

Figure 1: (a) Generalization, (b) Norms of solution, (c) Sample complexity. In Figure 1(a), the population error of the gradient flow solution $\beta_\alpha(\infty)$ is shown as a function of the initialization scale $\alpha$; the data are generated by a sparse predictor. In Figure 1(b), the excess L1 norm of the gradient flow solution, $\|\beta_\alpha(\infty)\|_1 - \|\beta^*_{L1}\|_1$, is shown as a function of $\alpha$ in blue; in red is the same for the excess L2 norm, $\|\beta_\alpha(\infty)\|_2 - \|\beta^*_{L2}\|_2$. In Figure 1(c), for the same sparse regression problem, the largest $\alpha$ such that $\beta_\alpha(\infty)$ achieves population error below a fixed threshold is shown in black as a function of the number of samples; the blue dashed line indicates the minimum number of samples required for $\beta^*_{L1}$ to achieve this error.
Theorem 1.

For any $0 < \alpha < \infty$,

$$\beta_\alpha(\infty) = \arg\min_{\beta : X\beta = y} Q_\alpha(\beta), \qquad (5)$$

where $Q_\alpha(\beta) = \alpha^2 \sum_{i=1}^d q\!\left(\beta_i / \alpha^2\right)$ and $q(z) = 2 - \sqrt{4 + z^2} + z \,\mathrm{arcsinh}(z / 2)$.

Proof sketch

The proof in Appendix A proceeds by showing that the gradient flow dynamics on $w$ lead to a solution of the form

$$\beta_\alpha(\infty) = 2\alpha^2 \sinh\!\big(4 X^\top s_\infty\big), \qquad s_\infty = -\int_0^\infty r(t)\, dt, \qquad (6)$$

where $r(t) = X\beta_\alpha(t) - y$ is the residual at time $t$. While evaluating the integral would be very difficult, the fact that

$$\beta_\alpha(\infty) \in \big\{\, 2\alpha^2 \sinh(X^\top \nu) : \nu \in \mathbb{R}^n \,\big\} \qquad (7)$$

already provides a dual certificate for the KKT conditions of the minimization problem in (5).

The function $Q_\alpha$, also known as the “hypentropy” regularizer Ghai et al. (2019), can thus be understood as an implicit regularizer which biases the gradient flow solution towards a particular zero-error solution out of the many possibilities. As $\alpha$ ranges from $0$ to $\infty$, the regularizer $Q_\alpha$ interpolates between the L1 and L2 norms, as illustrated in Figure 2, which shows the single coordinate function $q$. As $\alpha \to \infty$ we have that $|\beta_i| / \alpha^2 \to 0$, and so the behaviour of $Q_\alpha$ is controlled by the behaviour of $q(z)$ around $z = 0$. In this regime $q$ is quadratic, and so $Q_\alpha(\beta) \propto \|\beta\|_2^2$ up to scaling. On the other hand, when $\alpha \to 0$, we have that $|\beta_i| / \alpha^2 \to \infty$ and the behaviour is governed by the asymptotic behaviour of $q(z)$ as $|z| \to \infty$. In this regime $q(z) \approx |z|\log|z|$, and so $Q_\alpha(\beta) \propto \|\beta\|_1$ up to scaling and lower order terms. But for any intermediate initialization scale $\alpha$, $Q_\alpha$ describes exactly how training will interpolate between the kernel and deep regimes.
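The interpolation can be checked numerically; the sketch below evaluates $q$ and $Q_\alpha$ as reconstructed in Theorem 1 above and verifies the quadratic behaviour at large $\alpha$ and the (rescaled) L1 behaviour at small $\alpha$. It is a quick illustration rather than anything from the paper, and the test vectors are arbitrary.

```python
import numpy as np

def q(z):
    """Per-coordinate hypentropy penalty: the antiderivative of arcsinh(z / 2)."""
    return 2 - np.sqrt(4 + z ** 2) + z * np.arcsinh(z / 2)

def Q(beta, alpha):
    """Implicit regularizer Q_alpha(beta) = alpha^2 * sum_i q(beta_i / alpha^2)."""
    return alpha ** 2 * np.sum(q(beta / alpha ** 2))

beta = np.array([0.3, -1.2, 0.0, 2.5])
beta2 = np.array([1.0, 1.0, -1.0, 1.0])

# Large alpha (kernel regime): q is quadratic near 0, so Q_alpha(beta) ~ ||beta||_2^2 / (4 alpha^2).
alpha = 100.0
print(4 * alpha ** 2 * Q(beta, alpha), np.linalg.norm(beta, 2) ** 2)

# Small alpha (deep regime): Q_alpha(beta) ~ ||beta||_1 * log(1 / alpha^2), so ratios of Q_alpha
# between different vectors approach ratios of their L1 norms.
alpha = 1e-8
print(Q(beta, alpha) / Q(beta2, alpha), np.linalg.norm(beta, 1) / np.linalg.norm(beta2, 1))
```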

The following theorems, proven in Appendix B, provide a quantitative statement of how the L1 and L2 norms are approached as $\alpha \to 0$ and $\alpha \to \infty$, respectively:

Theorem 2.

For any ,

Theorem 3.

For any

Theorems 2 and 3 and Figure 1(b) indicate a certain asymmetry between reaching the deep and kernel regimes: a relatively moderate value of $\alpha$ (polynomial in the desired accuracy) suffices to approximate the minimum L2 norm solution to a very high degree of accuracy. On the other hand, $\alpha$ needs to be exponentially small in order for the minimum $Q_\alpha$ solution to approximate the minimum L1 norm solution. From an optimization perspective this is unfortunate, because $w = 0$ is a saddle point, so taking $\alpha \to 0$ will quickly create numerical difficulties, since the time needed to escape the vicinity of the saddle point grows drastically.
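One way to see this asymmetry numerically, without simulating the dynamics at all, is to minimize $Q_\alpha$ directly over the affine solution set and track the excess L1 and L2 norms across scales, in the spirit of Figure 1(b). The sketch below does this with a generic quasi-Newton solver on a small arbitrary random problem, so the small-$\alpha$ numbers are only approximate; it assumes the form of $Q_\alpha$ reconstructed in Theorem 1.

```python
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import minimize, linprog

rng = np.random.default_rng(1)
n, d = 8, 30
X = rng.standard_normal((n, d))
y = X @ np.concatenate([np.ones(2), np.zeros(d - 2)])     # labels from a 2-sparse predictor

beta0 = np.linalg.pinv(X) @ y        # particular solution (also the minimum L2 norm solution)
N = null_space(X)                    # basis of {z : Xz = 0}; every solution is beta0 + N @ c

def q(z):
    return 2 - np.sqrt(4 + z ** 2) + z * np.arcsinh(z / 2)

def Q_minimizer(alpha):
    """Minimize Q_alpha(beta) = alpha^2 * sum_i q(beta_i / alpha^2) over {beta : X beta = y}."""
    obj = lambda c: alpha ** 2 * np.sum(q((beta0 + N @ c) / alpha ** 2))
    c = minimize(obj, np.zeros(N.shape[1]), method="L-BFGS-B").x
    return beta0 + N @ c

lp = linprog(np.ones(2 * d), A_eq=np.hstack([X, -X]), b_eq=y, bounds=(0, None))
l1_min, l2_min = np.sum(lp.x), np.linalg.norm(beta0)

for alpha in [1e-6, 1e-3, 1e-1, 1.0, 10.0]:
    b = Q_minimizer(alpha)
    print(f"alpha={alpha:g}: excess L1 = {np.linalg.norm(b, 1) - l1_min:8.4f}, "
          f"excess L2 = {np.linalg.norm(b) - l2_min:8.4f}")
```

One should expect the excess L2 norm to be negligible already at moderate $\alpha$, while the excess L1 norm decays only slowly as $\alpha$ shrinks, matching the asymmetry described above.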

Generalization

In order to understand the effects of initialization on generalization, and how we might need to be in the deep regime in order to generalize well, consider a simple sparse regression problem, where the labels are generated by an $r$-sparse predictor $\beta^*$, i.e. $y_i \approx \langle \beta^*, x_i \rangle$ with $r \ll d$. When $n < d$, gradient flow will reach a zero training error solution; however, not all of these solutions will generalize equally well. With $n = \Omega(r \log d)$ samples, the deep regime, i.e. the minimum L1 norm solution, will generalize well, but even though we can fit the training data perfectly, we should not expect any generalization in the kernel regime with this sample size ($n = \Omega(d)$ samples would be required in that regime). This is demonstrated in Figure 1(c).

We see that in order to generalize well, we might need to use a small initialization, and generalization improves as we decrease the scale of initialization $\alpha$. There is a tension here between generalization and optimization: a smaller $\alpha$ might improve generalization, but, as discussed above, it makes optimization trickier, since we start closer to a saddle point. This suggests that in practice we would want to compromise, and operate just at the edge of the deep regime, using the largest $\alpha$ that still allows for generalization. The tension between optimization and generalization can also be seen through a tradeoff between the sample size and the largest $\alpha$ we can use and still generalize. This is illustrated in Figure 1(c), where for each sample size $n$ we plot the largest $\alpha$ for which the gradient flow solution achieves population risk below some threshold. As $n$ approaches the minimum number of samples needed for the minimum L1 norm solution to generalize (the vertical dashed line), the initialization indeed must become extremely small. However, generalization is much easier when the number of samples is only slightly larger, and we can then use a much more moderate initialization.
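As a rough, self-contained illustration of this sample-size gap (with arbitrary problem sizes, not those used for Figure 1), one can compare the population error of the two limiting interpolants directly: the minimum L2 norm solution (the $\alpha \to \infty$ limit) and the minimum L1 norm solution (the $\alpha \to 0$ limit).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, r, n, n_test = 200, 5, 60, 5000
beta_star = np.zeros(d)
beta_star[:r] = 1.0                                   # r-sparse ground truth (illustrative values)

X = rng.standard_normal((n, d))
y = X @ beta_star                                     # noiseless labels, n << d

# Minimum L2 norm interpolant: the kernel-regime (alpha -> infinity) limit.
beta_l2 = np.linalg.pinv(X) @ y

# Minimum L1 norm interpolant: the deep-regime (alpha -> 0) limit, computed as a linear program.
lp = linprog(np.ones(2 * d), A_eq=np.hstack([X, -X]), b_eq=y, bounds=(0, None))
beta_l1 = lp.x[:d] - lp.x[d:]

X_test = rng.standard_normal((n_test, d))
for name, b in [("kernel regime (min L2)", beta_l2), ("deep regime (min L1)", beta_l1)]:
    pop_err = np.mean((X_test @ (b - beta_star)) ** 2)
    print(f"{name}: population squared error ~ {pop_err:.3f}")
```

With $n$ on the order of $r \log d$, the L1 interpolant typically recovers $\beta^*$ and its population error is essentially zero, while the L2 interpolant fits the training data equally well but generalizes poorly until $n$ approaches $d$.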

Figure 2: The per-coordinate implicit penalty $q$ from Theorem 1 and the per-coordinate distance-to-initialization penalty from (10) (scaled).

The situation we describe here is similar to one studied by Mei et al. (2019), who considered one-pass stochastic gradient descent (i.e. SGD on the population objective) and analyzed the number of steps, and so also the number of samples, required for generalization. Mei et al. showed that even with large initialization one can achieve generalization by optimizing with more one-pass SGD steps. Our analysis suggests that the issue here is not one of optimizing longer or more accurately, but rather of requiring a larger sample size: in studying one-pass SGD this distinction is blurred, but our analysis separates the two.

Explicit Regularization

It is tempting to imagine that the effect of implicit regularization through gradient descent corresponds to selecting the solution closest to the initialization in Euclidean norm:

$$\hat{\beta}_\alpha = \arg\min_{\beta : X\beta = y} R_\alpha(\beta), \qquad (8)$$

where

$$R_\alpha(\beta) = \min_{w : \beta_w = \beta} \|w - \alpha w_0\|_2^2. \qquad (9)$$

It is certainly the case for standard linear regression, $f(w, x) = \langle w, x \rangle$, that $\beta_\alpha(\infty) = \hat{\beta}_\alpha$, and the implicit bias is fully captured by this view. Is the implicit bias of gradient flow indeed captured by this minimum Euclidean distance solution also for our 2-homogeneous (depth 2) model, and perhaps more generally? Can the behavior discussed above also be explained by $R_\alpha$?

Indeed, it is easy to verify that for our squared parametrization the limiting behaviors of the two approaches match when $\alpha \to 0$ and when $\alpha \to \infty$, i.e. $\lim_{\alpha \to 0} \hat{\beta}_\alpha = \beta^*_{L1}$ and $\lim_{\alpha \to \infty} \hat{\beta}_\alpha = \beta^*_{L2}$. To check whether the complete behaviour and the transition between the regimes are also captured by (8), we can calculate $R_\alpha(\beta)$, which decomposes over the coordinates as

$$R_\alpha(\beta) = \sum_{i=1}^d r_\alpha(\beta_i), \qquad (10)$$

where the per-coordinate penalty $r_\alpha$ is given in terms of the unique real root of a polynomial equation. (Substituting the constraint $\beta_i = w_{+,i}^2 - w_{-,i}^2$ and equating the gradient with respect to the parameters to zero leads to a quadratic equation, the solution of which can be substituted back to evaluate $R_\alpha$.)

As depicted in Figure 2, the per-coordinate penalty $r_\alpha$ is quadratic around $0$ and asymptotically linear as $|\beta_i| \to \infty$, yielding L2-like regularization when $\alpha \to \infty$ and L1-like regularization as $\alpha \to 0$, similarly to $Q_\alpha$. However, $R_\alpha$ and $Q_\alpha$ are very different: $r_\alpha$ is algebraic (indeed expressible in radicals), while $q$ is transcendental. This implies $R_\alpha$ and $Q_\alpha$ are substantially different, are not simple rescalings of each other, and hence lead to different sets or “paths” of solutions, $\{\hat{\beta}_\alpha\}_\alpha$ and $\{\beta_\alpha(\infty)\}_\alpha$. In particular, while $\alpha$ needed to be exponentially small in order for $Q_\alpha$ to approximate the L1 norm, and so for the limit of the gradient flow path to approximate the minimum L1 norm solution, $R_\alpha$, being algebraic, converges to the L1 norm polynomially (that is, $\alpha$ only needs to scale polynomially with the accuracy). We see that the implicit regularization effect of gradient descent (or gradient flow), and the transition from the kernel to the deep regime, is more complex and subtle than what is captured simply by distances in parameter space.
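The per-coordinate distance penalty of (9)–(10) is easy to evaluate numerically without deriving its closed form; the following sketch (ours) compares it, coordinate-wise, to the implicit penalty $\alpha^2 q(\beta_i / \alpha^2)$ from Theorem 1, making it apparent that the two are not rescalings of one another.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def q(z):
    """Per-coordinate implicit penalty from Theorem 1 (hypentropy)."""
    return 2 - np.sqrt(4 + z ** 2) + z * np.arcsinh(z / 2)

def dist_to_init(beta_i, alpha):
    """Per-coordinate explicit penalty: min (w_+ - alpha)^2 + (w_- - alpha)^2
    subject to w_+^2 - w_-^2 = beta_i, parametrized by w_- with w_+ = sqrt(w_-^2 + beta_i)
    (for beta_i >= 0 the positive branch is optimal when alpha > 0)."""
    lo = np.sqrt(max(-beta_i, 0.0))                 # keep w_-^2 + beta_i >= 0
    hi = lo + 10.0 * (alpha + abs(beta_i)) + 10.0
    obj = lambda w_m: (np.sqrt(w_m ** 2 + beta_i) - alpha) ** 2 + (w_m - alpha) ** 2
    return minimize_scalar(obj, bounds=(lo, hi), method="bounded").fun

alpha = 0.1
for z in [0.01, 0.1, 1.0, 10.0, 100.0]:
    beta_i = z * alpha ** 2
    implicit = alpha ** 2 * q(z)                    # contribution of this coordinate to Q_alpha
    explicit = dist_to_init(beta_i, alpha)          # contribution of this coordinate to R_alpha
    print(f"beta_i={beta_i:9.4f}:  alpha^2 q(beta_i/alpha^2) = {implicit:10.6f}   "
          f"min dist^2 to init = {explicit:10.6f}")
```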

5 Demonstration in Matrix Completion

We now turn to a more complex depth-two model, namely a matrix factorization model, and demonstrate similar transitions empirically. Specifically, we consider the model over matrix inputs $X \in \mathbb{R}^{d \times d}$ defined by $f(w, X) = \langle \beta_w, X \rangle$, where $w = (U, V)$ with $U, V \in \mathbb{R}^{d \times k}$ and $\beta_w = UU^\top - VV^\top$. This corresponds to linear predictors over matrix arguments specified by $\beta_w$. For generic inputs this can be thought of as a matrix sensing problem, where $X_1, \ldots, X_n$ are measurement matrices. We consider here a matrix completion problem, where each $X_i$ represents an observation of a single entry $(a_i, b_i)$, i.e. $X_i = e_{a_i} e_{b_i}^\top$, and we observe some subset of the entries of the matrix and would like to complete the unobserved entries.

In the overparametrized regime $k \geq d$, the model itself does not impose any constraints on the linear predictor $\beta_w$, and so for learning with $n < d^2$ samples (as would always be the case for matrix completion), we need to rely on the implicit bias of gradient descent. In particular, consider matrix completion with $n \ll d^2$ observations of a planted low rank matrix. For such underdetermined problems, there are many trivial global minimizers of the loss, most of which are not low rank and hence will not guarantee recovery, and we must rely on some other inductive bias. Indeed, previous work Gunasekar et al. (2017); Li et al. (2018) demonstrated a rich implicit bias when $\alpha \to 0$, showing (theoretically and/or empirically) that in this regime we would converge to the minimum nuclear norm solution and would be able to generalize (or reconstruct) a low rank model. Crucially, these analyses depend on initialization with scale $\alpha \to 0$. Here we consider what happens with larger scale unbiased initialization (i.e. when $\alpha$ is not small even though $\beta_w(0) = 0$).

Similar to Section 4, in order to get an unbiased initialization, we take $U(0) = V(0)$, i.e. initialization of the form $U(0) = \alpha U_0$ and $V(0) = \alpha U_0$, where $U_0 \in \mathbb{R}^{d \times k}$, so that $\beta_w(0) = 0$. We will study the implicit bias of gradient flow over the factorized parametrization with the above initialization.

We will focus on matrix completion problems, where the inputs are of the form $X_i = e_{a_i} e_{b_i}^\top$. The tangent kernel at initialization is determined by $U_0$. It defaults to the trivial delta kernel over entries in two special cases: (a) $U_0$ has orthogonal columns (e.g. $U_0 = I$), or (b) $U_0$ has independent Gaussian entries and $k \to \infty$. In these cases, minimizing the RKHS norm of the tangent kernel corresponds to returning a zero-imputed matrix (the minimum Frobenius norm solution). Said differently, in the “kernel” regime training is truly lazy: the unobserved entries do not change at all during training, and instead we just adjust the observed entries to fit the observations. We cannot expect any generalization in this regime, no matter what we assume about the observed matrix. In contrast, in the “deep” regime, as was previously observed by Gunasekar et al. (2017), training leads to the minimum nuclear norm solution, a rich inductive bias that allows for generalization Candès and Recht (2009); Recht et al. (2010). Figure 3 demonstrates the transition between the two regimes, and how recovery deteriorates as we move away from the “deep” regime and into the “kernel” regime, changing the unobserved entries less and less.
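The following sketch is our own small-scale version of such an experiment, assuming the symmetric unbiased factorization $\beta_w = UU^\top - VV^\top$ with $U(0) = V(0) = \alpha U_0$ described above; the dimensions, number of observations and scales are arbitrary choices. With a finite number of factors the tangent kernel is only approximately the delta kernel, so at large $\alpha$ the unobserved entries still move a little, but the qualitative contrast with small $\alpha$ should be visible.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
d, k, n_obs = 15, 15, 80
u = rng.standard_normal((d, 1))
M_star = u @ u.T / np.sqrt(d)                        # rank-one ground truth
mask = np.zeros(d * d, dtype=bool)
mask[rng.choice(d * d, size=n_obs, replace=False)] = True   # observed entries (flattened)

def predict(w):
    U, V = w[:d * k].reshape(d, k), w[d * k:].reshape(d, k)
    return U @ U.T - V @ V.T

def rhs(t, w):
    """Gradient flow on (U, V) for the squared loss on the observed entries only."""
    U, V = w[:d * k].reshape(d, k), w[d * k:].reshape(d, k)
    R = (predict(w) - M_star).reshape(-1)
    R[~mask] = 0.0                                   # residual is zero on unobserved entries
    R = R.reshape(d, d)
    G = R + R.T                                      # gradient w.r.t. the matrix, symmetrized
    return np.concatenate([-(G @ U).reshape(-1), (G @ V).reshape(-1)])

for alpha in [1e-4, 10.0]:
    U0 = rng.standard_normal((d, k)) / np.sqrt(k)
    w0 = alpha * np.concatenate([U0.reshape(-1), U0.reshape(-1)])   # U(0) = V(0): beta_w(0) = 0
    sol = solve_ivp(rhs, (0, 5e3), w0, method="BDF", rtol=1e-6, atol=1e-9)
    M = predict(sol.y[:, -1])
    unobs_err = np.linalg.norm((M - M_star).reshape(-1)[~mask])
    unobs_move = np.linalg.norm(M.reshape(-1)[~mask])               # entries started at 0
    print(f"alpha={alpha:g}: error on unobserved entries = {unobs_err:.3f}, "
          f"movement of unobserved entries = {unobs_move:.3f}")
```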

Figure 3: Regimes in Matrix Completion. We generated a rank-one matrix completion problem with a planted ground truth matrix by drawing a random factor with i.i.d. entries and observing a random subset of entries. We fit the observed entries by minimizing the squared loss on the matrix factorization model with $k$ factors. For different scalings $\alpha$, we examine the matrix $\beta_\alpha(\infty)$ reached by gradient flow (solved using python ODE solvers) and plot (i) the reconstruction error on the unobserved entries, and (ii) the amount by which the unobserved entries changed during optimization. In (a) we initialized $U(0)$ and $V(0)$ to scalings $\alpha I$ of the identity. In Appendix C we also plot the nuclear and Frobenius norms of $\beta_\alpha(\infty)$, and observe an almost identical figure. In (b), for varying $k$, we initialized $U(0)$ and $V(0)$ to $\alpha U_0$ with $U_0$ having i.i.d. Gaussian entries. For large $k$, the tangent kernel converges to the kernel corresponding to the Frobenius norm, and so as $\alpha \to \infty$ we again see that the unobserved entries do not change. The scale $\alpha$ required to transition between the deep regime (reconstruction) and the kernel regime changes with $k$.

6 Discussion

The main point of this paper is to emphasize the distinction between the “kernel” regime and the “deep” (rich, active, adaptive) regime in training overparametrized multi-layered networks, to show how the scaling of the initialization can transition between them, and to understand this transition in detail. We argue that the rich inductive bias that enables generalization may arise in the deep regime, but that focusing on the kernel regime restricts us to only what can be done with an RKHS. By studying the transition we also see a tension between generalization and optimization, which suggests we would tend to operate just on the edge of the deep regime, and so understanding this transition, rather than just the extremes, is important. Furthermore, we see that at this operating point, at the edge of the deep regime, the implicit bias of gradient descent differs substantively from that of explicit regularization. Although in our detailed study we focused on a simple model, so that we could carry out a complete and exact analysis analytically, we expect this to be representative of the behaviour of other homogeneous models as well, and to serve as a basis for a more general understanding.

Effect of Width

Our treatment focused on the effect of scale on the transition between the regimes, and we saw that, as pointed out by Chizat and Bach (2018), we can observe a very meaningful transition between a kernel and a deep regime even for finite width parametric models. The transition becomes even more interesting if the width of the model (the number of units per layer, and so also the number of parameters) increases towards infinity. In this case, we must be careful as to how the initialization of each individual unit scales as the total number of units increases, and which regime we fall into is controlled by the relative scaling of the width and the scale of the individual units at initialization. This is demonstrated, for example, in Figure 3(a)-(b), which shows the regime change in matrix factorization problems, from minimum Frobenius norm recovery (the kernel regime) to minimum nuclear norm recovery (the deep regime), as a function of both the number of factors $k$ and the scale $\alpha$ of the initialization of each factor. As expected, the scale at which we see the transition decreases as the model becomes wider, but further study is necessary in order to obtain a complete understanding of this scaling.

A particularly interesting aspect of infinite width networks is that, unlike for fixed-width networks, it may be possible to scale $\alpha$ relative to the width such that in the infinite-width limit we would have an (asymptotically) unbiased predictor at initialization, or at least a non-exploding one, even with random initialization (without a doubling trick leading to an artificially unbiased initialization), while still being in the kernel regime. For two-layer networks with ReLU activation, this has been established in Arora et al. (2019), which showed that with sufficiently large (polynomial) width the gradient dynamics stay in the kernel regime forever.

Another interesting question is whether as the width increases, the transition between the kernel and deep regimes becomes sharp, or perhaps for infinite width models there is a wide intermediate regime with distinct and interesting behaviour.

Acknowledgements

BW is supported by the NSF Graduate Research Fellowship under award 1754881. JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303. This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical Research Council (EPSRC) under the Multidisciplinary University Research Initiative. The work of DS was supported by the Israel Science Foundation (grant No. 31/1031), and by the Taub Foundation. NS is supported by NSF Medium (grant No. NSF-102:1764032) and by NSF BIGDATA (grant No. NSF-104:1546500).

References

Appendix A Proof of Theorem 1

Theorem 1 (restated).

Proof.

The proof involves relating the set of points reachable by gradient flow on $w$ to the KKT conditions of the minimization problem in (5). While it may not be obvious from the expression, $q$ is the integral of an increasing function ($\mathrm{arcsinh}(z/2)$) and is thus convex, and $Q_\alpha$ is a (scaled) sum of $q$ applied to the individual coordinates of $\beta / \alpha^2$, and is therefore also convex.

The linear predictor $\beta_\alpha(\infty)$ is given by $\beta_w$ applied to the limit of the gradient flow dynamics on $w$. Recalling that $\beta_w = w_+^2 - w_-^2$ and $L(w) = \frac{1}{2}\|X\beta_w - y\|^2$, the gradient flow dynamics are

$$\dot{w}_+(t) = -2\, w_+(t) \odot \big(X^\top r(t)\big), \qquad \dot{w}_-(t) = 2\, w_-(t) \odot \big(X^\top r(t)\big), \qquad (11)$$

where the residual is $r(t) = X\beta_\alpha(t) - y$, and $\odot$ denotes the element-wise product. It is easily confirmed that these dynamics have a solution

$$w_+(t) = \alpha \exp\!\big(-2 X^\top s(t)\big), \qquad w_-(t) = \alpha \exp\!\big(2 X^\top s(t)\big), \qquad s(t) = \int_0^t r(\tau)\, d\tau, \qquad (12)$$

with the exponential applied element-wise. This immediately gives an expression for $\beta_\alpha(t)$:

$$\beta_\alpha(t) = w_+(t)^2 - w_-(t)^2 = \alpha^2\Big(\exp\!\big(-4X^\top s(t)\big) - \exp\!\big(4X^\top s(t)\big)\Big) \qquad (13)$$
$$= -2\alpha^2 \sinh\!\big(4 X^\top s(t)\big). \qquad (14)$$

Understanding the limit exactly requires calculating $\lim_{t \to \infty} s(t) = \int_0^\infty r(t)\, dt$, which would be a difficult task. However, for our purposes, it is sufficient to know that there is some $\nu \in \mathbb{R}^n$ such that $\beta_\alpha(\infty) = 2\alpha^2 \sinh(X^\top \nu)$. In other words, the vector $\beta_\alpha(\infty)$ is contained in the non-linear manifold

$$\mathcal{M}_\alpha = \big\{\, 2\alpha^2 \sinh(X^\top \nu) \,:\, \nu \in \mathbb{R}^n \,\big\}. \qquad (15)$$

(15)

Setting this aside for a moment, consider the KKT conditions of the convex program

$$\min_{\beta \in \mathbb{R}^d} Q_\alpha(\beta) \quad \textrm{s.t.} \quad X\beta = y, \qquad (16)$$

which are

$$\exists \nu \in \mathbb{R}^n \ \textrm{such that}\ \nabla Q_\alpha(\beta) = X^\top \nu, \qquad (17)$$
$$X\beta = y. \qquad (18)$$

Expanding $\nabla Q_\alpha$ in (17), there must exist $\nu$ such that, for every coordinate $i$,

$$\mathrm{arcsinh}\!\big(\beta_i / (2\alpha^2)\big) = (X^\top \nu)_i, \qquad (19)$$
$$\textrm{i.e.}\quad \beta_i = 2\alpha^2 \sinh\!\big((X^\top \nu)_i\big), \qquad (20)$$
$$\textrm{i.e.}\quad \beta \in \mathcal{M}_\alpha. \qquad (21)$$

Since we already know that the gradient flow solution $\beta_\alpha(\infty)$ lies in $\mathcal{M}_\alpha$, there is some $\nu$ for which it is a certificate of (17). Furthermore, this problem satisfies the strict saddle property Ge et al. [2015] [Zhao et al., 2019, Lemma 2.1], therefore gradient flow will converge to a zero-error solution, i.e. $X\beta_\alpha(\infty) = y$ and (18) is also satisfied. Thus, we conclude that $\beta_\alpha(\infty)$ is a solution to (16). ∎

Appendix B Proofs of Theorems 2 and 3

Lemma 1.

For any ,

guarantees that

Proof.

First, we show that . Observe that is even because and

are odd. Therefore,

(22)
(23)
(24)
(25)

Therefore, we can rewrite

(26)
(27)
(28)
(29)

Using the fact that

(30)

we can bound for

(31)
(32)
(33)
(34)

So, for any , then

(36)
(37)
(38)

On the other hand, using (29) and (30) again,

(39)
(40)

Using the inequality , this can be further lower bounded by

(41)
(42)

Therefore, for any then

(43)

We conclude that for that

(44)

See 2

Proof.

First, we will prove that . By Lemma 1, since , for all with we have

(45)

Let be such that and . Then

(46)
(47)
(48)
(49)
(50)
(51)

Therefore, . Furthermore, let be any solution with