1 Introduction
A string of recent papers study neural networks trained with gradient descent in the “kernel regime.” The main observation is that, in a certain regime, networks trained with gradient descent behave as kernel methods, and so can be studied as such Jacot et al. (2018); Daniely et al. (2016); Daniely (2017). This allows one to prove convergence to zero error solutions in overparametrized settings Du et al. (2018, 2019); Allen-Zhu et al. (2018), but it also implies gradient descent will converge to the minimum norm solution (in the corresponding RKHS) Chizat and Bach (2018); Arora et al. (2019); Mei et al. (2019), and more generally that models will inherit the inductive bias and generalization behaviour of the RKHS. This would suggest that deep models can be effectively replaced by kernel methods with the “right” kernel, that deep learning boils down to a kernel method (with a fixed kernel determined by the architecture and initialization), and thus that it can only learn problems learnable by some kernel.
This contrasts with other recent results that show how in deep models, including infinitely overparametrized networks, training with gradient descent induces an inductive bias that cannot be represented as an RKHS norm. For example, analytic and/or empirical results suggest that gradient descent on deep linear convolutional networks implicitly biases toward minimizing the $\ell_p$ bridge penalty, for $p = 2/\text{depth} \leq 1$, in the frequency domain Gunasekar et al. (2018b); weight decay on an infinite width single input ReLU implicitly biases towards minimizing the second order total variation of the learned function Savarese et al. (2019); and gradient descent on an overparametrized matrix factorization, which can be thought of as a two layer linear network, induces nuclear norm minimization of the learned matrix Gunasekar et al. (2017) and can ensure low rank matrix recovery Li et al. (2018). All these natural inductive biases ($\ell_p$ bridge penalty for $p \leq 1$, total variation norm, nuclear norm) are not Hilbert norms, and therefore cannot be captured by a kernel. This suggests that training deep models with gradient descent can behave very differently from kernel methods, and have much richer inductive biases.
One might ask whether the kernel approximation indeed captures the behavior of deep learning in a relevant and interesting regime, or does the success of deep learning come when learning escapes this regime? In order to understand this, we must first carefully understand when each of these regimes holds, and how the transition between the “kernel” regime and the “deep” regime happens.
Some investigations of the kernel regime emphasized the number of parameters (“width”) going to infinity as leading to this regime. However, Chizat and Bach (2018) identified the scale of the model as a quantity controlling entry into the kernel regime. Their results suggest that for any number of parameters (any width), a model can be approximated by a kernelized linear model when its scale at initialization goes to infinity (see details in Section 3). Considering models with increasing (or infinite) width, the relevant regime (kernel or deep) is determined by how the scaling at initialization behaves as the width goes to infinity. In this paper we elaborate and expand on this view, carefully studying how the scale of initialization affects the model behaviour for homogeneous models.
In Section 4 we provide a complete and detailed study for a simple 2-homogeneous model that can be viewed as linear regression with squared parametrization, or as a “diagonal” linear neural network. For this model we can exactly characterize the implicit bias of training with gradient descent, as a function of the scale $\alpha$ of initialization, and see how this implicit bias becomes the L2 norm in the kernel regime, but the L1 norm in the deep regime. We can therefore understand how, e.g. for a high dimensional problem with underlying sparse structure, we can get good generalization when the initialization scale $\alpha$ is small, but not when $\alpha$ is large. In Section 5 we demonstrate a similar transition in matrix factorization.
2 Setup and preliminaries
We consider models $f : \mathbb{R}^p \times \mathcal{X} \to \mathbb{R}$ which map parameters $w \in \mathbb{R}^p$ and examples $x \in \mathcal{X}$ to predictions $f(w, x) \in \mathbb{R}$. We denote the predictor implemented by the parameters $w$ as $F(w)$, such that $F(w)(x) = f(w, x)$. Much of our focus will be on models, such as linear networks, which are linear in $x$ (but not in the parameters $w$!), in which case $F(w)$ is a linear predictor and can be represented as a vector $\beta_w$ with $f(w, x) = \langle \beta_w, x \rangle$. Such models are essentially alternate parametrizations of linear models, but as we shall see this change of parametrization is crucial.
We consider models that are $D$-positive homogeneous in the parameters $w$, for some integer $D \geq 1$, meaning that for any $c \in \mathbb{R}_+$, $F(c \cdot w) = c^D F(w)$, and so $f(c \cdot w, x) = c^D f(w, x)$. We will refer to such models simply as $D$-homogeneous. Such homogeneity is satisfied by many interesting model classes including multilayer ReLU networks with fully connected and convolutional layers, layered linear neural networks, and matrix factorization, where $D$ corresponds to the depth of the network.
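As a quick numerical illustration of positive homogeneity, the sketch below estimates the degree $D$ from the identity $f(c \cdot w, x) = c^D f(w, x)$ for a squared-parametrization linear model of the kind studied in Section 4. The function names `diag_net` and `homogeneity_degree`, the dimension, and the scaling factor are illustrative assumptions, not part of the original text.

```python
import numpy as np

def diag_net(w, x):
    """Squared-parametrization predictor <w_plus**2 - w_minus**2, x> (assumed form)."""
    d = x.shape[0]
    w_plus, w_minus = w[:d], w[d:]
    return float((w_plus**2 - w_minus**2) @ x)

def homogeneity_degree(f, w, x, c=3.0):
    """Estimate D from f(c*w, x) = c**D * f(w, x) at a single point (w, x)."""
    return float(np.log(abs(f(c * w, x) / f(w, x))) / np.log(c))

rng = np.random.default_rng(0)
x = rng.normal(size=5)        # one example
w = rng.normal(size=10)       # parameters [w_plus; w_minus]
degree = homogeneity_degree(diag_net, w, x)
print("estimated degree:", degree)
```

The estimate relies on a single random point, which is enough here because the identity holds exactly for every $w$; for a model that is only approximately homogeneous one would want to check several points and scales.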
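Consider a training set $\{(x_n, y_n)\}_{n=1}^N$ consisting of $N$ examples of input label pairs. For a given loss function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, the loss of the model parametrized by $w$ is $L(w) = \sum_{n=1}^N \ell(f(w, x_n), y_n)$. We will focus on the squared loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$. We slightly abuse notation and use $F(w) \in \mathbb{R}^N$ to denote the vector of predictions, and so we can write $L(w) = \|F(w) - y\|_2^2$, where $y \in \mathbb{R}^N$ is the vector of target labels.
Minimizing the loss using gradient descent amounts to iteratively updating the parameters
(1) $w_{k+1} = w_k - \eta \nabla L(w_k)$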
We will consider gradient descent with infinitesimally small stepsize $\eta$, which is captured by the gradient flow dynamics
(2) $\dot{w}(t) = -\nabla L(w(t))$
We will be particularly interested in the scale of initialization and will capture this through a scalar parameter $\alpha \in \mathbb{R}_+$. For each scale $\alpha$, we will denote by $w_\alpha(t)$ the dynamics obtained by the gradient flow dynamics (2) with the initial condition $w_\alpha(0) = \alpha w_0$ for some fixed $w_0$. We will also denote $F_\alpha(t) = F(w_\alpha(t))$, or in the case of linear predictors $\beta_\alpha(t) = \beta_{w_\alpha(t)}$, the dynamics on the predictor induced by the gradient flow dynamics on $w$.
In many cases we expect the gradient flow dynamics to converge to a minimizer of $L(w)$, though establishing that this happens will not be our focus. Rather, we are interested in the underdetermined case, where $N \ll p$, and in general there are multiple minimizers of $L(w)$, all with $L(w) = 0$ and so $F(w) = y$. The question we will mostly be concerned with is which of these many minimizers gradient flow converges to. That is, we would like to characterize $w_\alpha^\infty = \lim_{t \to \infty} w_\alpha(t)$, or more importantly the predictor $F_\alpha^\infty$ or $\beta_\alpha^\infty$ we converge to, and how these depend on the scale $\alpha$. In underdetermined problems, where there are many zero error solutions, simply fitting the data using the model does not provide enough inductive bias to ensure generalization. But in many cases, the specific solution reached by gradient flow (or some other optimization procedure) has special structure, or minimizes some implicit regularizer, and this structure or regularizer provides the needed inductive bias Gunasekar et al. (2018b, a); Soudry et al. (2018); Ji and Telgarsky (2018).
3 The Kernel Regime
Gradient descent and gradient flow only consider the first order approximation of a model w.r.t. $w$ about the current iterate:
(3) $f(w, x) = f(w(t), x) + \langle w - w(t), \nabla_w f(w(t), x) \rangle + O(\|w - w(t)\|^2)$
That is, locally around any $w(t)$, gradient flow operates on the model as if it were an affine model with feature map $\phi_t(x) = \nabla_w f(w(t), x)$, corresponding to the tangent kernel $K_{w(t)}(x, x') = \langle \nabla_w f(w(t), x), \nabla_w f(w(t), x') \rangle$ Jacot et al. (2018); Yang (2019); Lee et al. (2019). Of particular interest is the tangent kernel at initialization, where we denote $K_0 = K_{w(0)}$.
The “kernel regime” refers to a limit in which the tangent kernel does not change over the course of optimization, and less formally to the regime in which it does not change significantly, i.e. where $K_{w(t)} \approx K_0$ for all $t$. In this case, training the model is completely equivalent to training the affine model $f(w(0), x) + \langle w - w(0), \nabla_w f(w(0), x) \rangle$, or in other words to kernelized gradient descent (or gradient flow) with the kernel $K_0$ and a “bias term” of $f(w(0), x)$. In order to not have to worry about this bias term, and in particular its scaling, Chizat and Bach (2018) suggest considering “unbiased” initializations such that $F(w(0)) = 0$, and so this bias term vanishes. This can be achieved in many cases by replicating units or components with opposite signs at initialization, and is the approach we take here (see Sections 4 and 5 for examples and details).
For underdetermined problems with multiple solutions $F(w) = y$, unbiased kernel gradient flow (or gradient descent) converges to the minimum norm solution $\hat{F}_K = \arg\min_{h(X) = y} \|h\|_K$, where $\|h\|_K$ is the RKHS norm corresponding to the kernel. And so, in the kernel regime, we will have that $F_\alpha^\infty = \hat{F}_{K_0}$, and the implicit bias of training is precisely given by the kernel. [Footnote 1: With a bias term, convergence is to $\arg\min_{h(X) = y} \|h - F(w(0))\|_K$, where $F(w(0))$ is the predictor at initialization.]
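The simplest instance of this statement is a linear model $f(w, x) = \langle w, x \rangle$, whose tangent kernel is the standard inner product and never changes. The sketch below, under that assumption and with synthetic data chosen only for illustration, runs small-step gradient descent on the squared loss from a zero (unbiased) initialization and compares the result with the minimum Euclidean norm interpolant computed via the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 3, 8                        # underdetermined: fewer examples than parameters
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

# The Hessian of ||X beta - y||^2 is 2 X^T X, so this step size is stable.
lr = 0.5 / np.linalg.eigvalsh(X @ X.T).max()

beta = np.zeros(d)                 # unbiased initialization: the predictor starts at zero
for _ in range(5000):
    beta = beta - lr * 2.0 * X.T @ (X @ beta - y)

beta_min_norm = np.linalg.pinv(X) @ y   # minimum L2 norm solution of X beta = y
print("max deviation from min-norm solution:", np.abs(beta - beta_min_norm).max())
```

The iterates never leave the row space of $X$, which is why the limit is the minimum norm interpolant rather than an arbitrary zero-error solution.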
When does the “kernel regime” happen? Chizat and Bach (2018) showed that for any homogeneous model satisfying some technical conditions, the kernel regime is reached as $\alpha \to \infty$. That is, as we increase the scale of initialization, the dynamics converge to the kernel gradient flow dynamics with the kernel $K_0$, and the limit of gradient flow approaches the minimum RKHS norm solution for $K_0$. In Section 4 we prove this limit directly for our specific model, and we also demonstrate it empirically for matrix factorization in Section 5.
[Footnote 2: Chizat and Bach did not consider only homogeneous models, and instead of studying the scale of initialization they studied scaling the output of the model. For homogeneous models, the dynamics obtained by scaling the initialization are equivalent to those obtained by scaling the output, and so here we focus on homogeneous models and on scaling the initialization.]
[Footnote 3: A technical problem with the main result of Chizat and Bach (2018), their Theorem 3.2, is that for models obtained by the symmetric initialization of duplicating units and negating their signs, the Jacobian of the model is degenerate at initialization, or in their notation , invalidating the assumption in the Theorem. On the other hand, without such symmetric initialization, and for finite width model (i.e. when is finite), the scale of the prediction at initialization explodes as $\alpha \to \infty$ and violates their assumptions. For this reason, we cannot rely on their result, and instead establish the kernel regime specifically for the model we study in Section 4.]
In contrast, and as we shall see in Sections 4 and 5, the small initialization limit $\alpha \to 0$ often leads to a very different and rich inductive bias, e.g. inducing sparsity or low-rank structure Gunasekar et al. (2017); Li et al. (2018); Gunasekar et al. (2018b), that allows for generalization in many settings where kernel methods would not. We refer to this limit reached as $\alpha \to 0$ as the “deep regime.” This regime is also referred to as the “active” or “adaptive” regime (Chizat and Bach, 2018) since the tangent kernel changes over the course of training, in a sense adapting to the data. We argue that this regime is the regime that truly allows us to exploit the power of depth, and thus is the relevant regime for understanding the success of deep learning.
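4 Detailed Study of a Simple Depth-2 Model
We study in detail a simple 2-homogeneous model. Consider the class of linear functions over $\mathbb{R}^d$, with squared parametrization as follows:
(4) $f(w, x) = \sum_{i=1}^d (w_{+,i}^2 - w_{-,i}^2)\, x_i = \langle \beta_w, x \rangle, \quad w = [w_+; w_-] \in \mathbb{R}^{2d}, \quad \beta_w = w_+^2 - w_-^2$
where we use the notation $z^2$ for $z \in \mathbb{R}^d$ to denote elementwise squaring. We will consider initializing all weights equally, i.e. using scalings of $w_0 = \mathbf{1}$.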
This is nothing but a linear regression model, except with unconventional overparametrization. The model can also be thought of as a “diagonal” linear neural network (i.e. where the weight matrix has diagonal structure) with $2d$ units. A standard diagonal linear network would have $d$ units, with each unit connected to just a single input unit with weight $u_i$ and to the output with weight $v_i$, thus implementing the model $f((u, v), x) = \sum_i u_i v_i x_i$. But if at initialization $|u_i| = |v_i|$, their magnitudes will remain equal and their signs will not flip throughout training, and so we can equivalently replace both with a single weight $w_i$, yielding the model $f(w, x) = \langle w^2, x \rangle$.
The reason for using both $w_+$ and $w_-$ (or $2d$ units) is two-fold: first, it ensures that the image of $F(w)$ is all (signed) linear functions, and thus the model is truly equivalent to standard linear regression. Second, it allows for initialization at $F(w) = 0$ without this being a saddle point from which gradient flow will never escape. [Footnote 4: Our results can be generalized to nonuniform initialization, “biased initialization” (i.e. where $w_+ \neq w_-$ at initialization), or the asymmetric parametrization $f((u, v), x) = \sum_i u_i v_i x_i$, however this complicates the presentation without adding much insight.]
The model (4) is perhaps the simplest nontrivial $D$-homogeneous model for $D > 1$, but, as we shall see, it already exhibits distinct and meaningful kernel and deep regimes. Furthermore, we can completely understand the implicit regularization driving this model analytically, and precisely characterize the transition between the kernel and deep regimes.
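Let us consider the behavior of the limit of gradient flow (eq. (2)) as a function of the initialization, in the underdetermined case $N \ll d$ where there are many solutions $X\beta = y$. It is straightforward to compute the tangent kernel at initialization and confirm that $K_0(x, x') = 8\alpha^2 \langle x, x' \rangle$, i.e. the standard inner product kernel (with some scaling), and so the corresponding RKHS norm is proportional to $\|\beta\|_2$. Therefore, in the kernel regime, gradient flow would take us to the minimum L2 norm solution, $\beta^*_{L2} = \arg\min_{X\beta = y} \|\beta\|_2$. Following Chizat and Bach (2018) and the discussion in Section 3, we would therefore expect that the gradient flow limit $\beta_\alpha^\infty$ approaches $\beta^*_{L2}$ as $\alpha \to \infty$.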
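The tangent kernel computation can be checked directly. The sketch below assumes the squared parametrization $\beta_w = w_+^2 - w_-^2$ with the uniform initialization $w_+ = w_- = \alpha \mathbf{1}$; the gradient with respect to $w$ is then $[2 w_+ \circ x;\, -2 w_- \circ x]$, and the inner product of two such gradients at initialization works out to $8\alpha^2 \langle x, x' \rangle$. The helper names are hypothetical.

```python
import numpy as np

def grad_f(w_plus, w_minus, x):
    """Gradient of f(w, x) = <w_plus**2 - w_minus**2, x> with respect to [w_plus; w_minus]."""
    return np.concatenate([2.0 * w_plus * x, -2.0 * w_minus * x])

def tangent_kernel(w_plus, w_minus, x1, x2):
    """K_w(x1, x2) = <grad_w f(w, x1), grad_w f(w, x2)>."""
    return float(grad_f(w_plus, w_minus, x1) @ grad_f(w_plus, w_minus, x2))

alpha, d = 3.0, 4
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=d), rng.normal(size=d)

w0 = np.full(d, alpha)                      # w_plus = w_minus = alpha * 1
k0 = tangent_kernel(w0, w0, x1, x2)
expected = 8.0 * alpha**2 * float(x1 @ x2)  # scaled inner product kernel
print(k0, expected)
```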
In contrast, Gunasekar et al. (2017, Corollary 2) shows that when $\alpha \to 0$, gradient flow will lead instead to the minimum L1 norm solution $\beta^*_{L1} = \arg\min_{X\beta = y} \|\beta\|_1$. This is the “deep regime” in this case. We already see two very distinct behaviors and, in high dimensions, two very different inductive biases, with the deep regime inducing a bias that is not an RKHS norm for any choice of kernel. Can we characterize and understand the transition between the two regimes as $\alpha$ transitions from very small to very large? The following theorem does just that.
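Theorem 1.
For any $0 < \alpha < \infty$,
(5) $\beta_\alpha^\infty = \arg\min_{\beta} Q_\alpha(\beta) \quad \text{s.t.} \quad X\beta = y$
where $Q_\alpha(\beta) = \alpha^2 \sum_{i=1}^d q(\beta_i / \alpha^2)$ and $q(z) = \int_0^z \operatorname{arcsinh}(u/2)\, du = 2 - \sqrt{4 + z^2} + z \operatorname{arcsinh}(z/2)$.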
Proof sketch
The proof in Appendix A proceeds by showing the gradient flow dynamics on lead to a solution of the form
(6) 
where . While evaluating the integral would be very difficult, the fact that
(7) 
already provides a dual certificate for the KKT conditions for .
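The two endpoints of the transition can be observed on a tiny instance. The sketch below is a small-step gradient descent stand-in for gradient flow on the squared parametrization $\beta_w = w_+^2 - w_-^2$ with $w_+ = w_- = \alpha \mathbf{1}$ at initialization and loss $\|X\beta_w - y\|_2^2$. The single-example problem, the step-size heuristic, and the function name `gd_limit` are assumptions made for illustration; this is not the authors' experimental setup.

```python
import numpy as np

def gd_limit(X, y, alpha, steps=5000, base_lr=0.01):
    """Small-step gradient descent on L(w) = ||X beta_w - y||^2,
    beta_w = w_plus**2 - w_minus**2, started at w_plus = w_minus = alpha."""
    d = X.shape[1]
    w_plus = np.full(d, float(alpha))
    w_minus = np.full(d, float(alpha))
    lr = base_lr / max(1.0, alpha**2)   # heuristic: the kernel scale grows like alpha^2
    for _ in range(steps):
        g = X.T @ (X @ (w_plus**2 - w_minus**2) - y)   # X^T r, with r the residual
        # dL/dw_plus = 4 g * w_plus and dL/dw_minus = -4 g * w_minus (elementwise)
        w_plus, w_minus = w_plus - lr * 4.0 * g * w_plus, w_minus + lr * 4.0 * g * w_minus
    return w_plus**2 - w_minus**2

X = np.array([[1.0, 2.0]])   # one example, two features: underdetermined
y = np.array([1.0])

beta_large = gd_limit(X, y, alpha=10.0)   # expected to be close to the minimum L2 solution
beta_small = gd_limit(X, y, alpha=1e-3)   # expected to be close to the minimum L1 solution

beta_l2 = np.linalg.pinv(X) @ y           # (0.2, 0.4)
beta_l1 = np.array([0.0, 0.5])            # all mass on the feature with the largest |x_i|
print(beta_large, beta_small)
```

Both runs fit the single observation, but they pick different interpolants: the large-scale run spreads weight proportionally to $x$, while the small-scale run concentrates on one coordinate.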
The function $Q_\alpha$, also known as the “hypentropy” regularizer Ghai et al. (2019), can thus be understood as an implicit regularizer which biases the gradient flow solution towards a particular zero-error solution out of the many possibilities. As $\alpha$ ranges from $0$ to $\infty$, the $Q_\alpha$ regularizer interpolates between the L1 and L2 norms, as illustrated in Figure 2, which shows a single coordinate function $q$. As $\alpha \to \infty$ we have that $\beta_i / \alpha^2 \to 0$, and so the behaviour of $Q_\alpha(\beta)$ is controlled by the behaviour of $q(z)$ around $z = 0$. In this regime $q(z)$ is quadratic, and so $Q_\alpha(\beta) \propto \|\beta\|_2^2$. On the other hand when $\alpha \to 0$, we have that $|\beta_i| / \alpha^2 \to \infty$ and the behaviour is governed by the asymptotic behaviour as $|z| \to \infty$. In this regime $q(z) \approx |z| \log |z|$, so that $Q_\alpha(\beta)$ is, to leading order, proportional to $\|\beta\|_1$. But for any initialization scale $\alpha$, $Q_\alpha$ describes exactly how training will interpolate between the kernel and deep regimes.
The following Theorems, proven in Appendix B, provide a quantitative statement of how the L1 and L2 norms are approached as $\alpha \to 0$ and $\alpha \to \infty$ respectively:
Theorem 2.
For any ,
Theorem 3.
For any
Theorems 2 and 3 and Figure 0(b) indicate a certain asymmetry between reaching the deep and kernel regimes: a relatively small value of $\alpha$ (polynomial in the accuracy) suffices to approximate the minimum L2 norm solution to a very high degree of accuracy. On the other hand, $\alpha$ needs to be exponentially small in order for the minimum $Q_\alpha$ solution to approximate the minimum L1 norm solution. From an optimization perspective this is unfortunate because $w = 0$ is a saddle point, so taking $\alpha \to 0$ will quickly create numerical difficulties since the time needed to escape the vicinity of the saddle point will grow drastically.
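This asymmetry can be seen by evaluating the regularizer itself. The sketch below assumes the closed form $q(z) = 2 - \sqrt{4 + z^2} + z \operatorname{arcsinh}(z/2)$ and $Q_\alpha(\beta) = \alpha^2 \sum_i q(\beta_i / \alpha^2)$, with a test vector chosen arbitrarily. It compares $Q_\alpha$ with its two limiting forms: $\|\beta\|_2^2 / (4\alpha^2)$ for large $\alpha$ (from $q(z) \approx z^2/4$ near zero) and $\log(1/\alpha^2)\, \|\beta\|_1$ for small $\alpha$ (from $q(z) \approx |z| \log |z|$ at infinity).

```python
import numpy as np

def q(z):
    """Single-coordinate function: the integral of arcsinh(u / 2) from 0 to z."""
    z = np.asarray(z, dtype=float)
    return 2.0 - np.sqrt(4.0 + z**2) + z * np.arcsinh(z / 2.0)

def Q(beta, alpha):
    """Q_alpha(beta) = alpha^2 * sum_i q(beta_i / alpha^2)."""
    beta = np.asarray(beta, dtype=float)
    return float(alpha**2 * np.sum(q(beta / alpha**2)))

beta = np.array([0.5, -1.0, 2.0])   # arbitrary test vector

# Kernel side: a moderate alpha already matches the scaled squared L2 norm closely.
alpha_large = 10.0
ratio_l2 = Q(beta, alpha_large) / (np.sum(beta**2) / (4.0 * alpha_large**2))

# Deep side: even a tiny alpha leaves a visible gap to the scaled L1 norm,
# because the correction only decays like 1 / log(1 / alpha^2).
alpha_small = 1e-8
ratio_l1 = Q(beta, alpha_small) / (np.log(1.0 / alpha_small**2) * np.sum(np.abs(beta)))

print("ratio to L2 limit:", ratio_l2)
print("ratio to L1 limit:", ratio_l1)
```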
Generalization
In order to understand the effects of initialization on generalization, and how we might need to be in the deep regime in order to generalize well, consider a simple sparse regression problem, where and where is sparse and its nonzero entries are . When , gradient flow will reach a zero training error solution, however, not all of these solutions will generalize the same. With samples, the deep regime, i.e. the minimum norm solution will generalize well, but even though we can fit the training data perfectly well, we should not expect any generalization in the kernel regime with this sample size ( samples will be required in that regime). This is demonstrated in Figure 0(c).
We see that in order to generalize well, we might need to use small initialization, and generalization improves as we decrease the scale of initialization $\alpha$. There is a tension here between generalization and optimization: a smaller $\alpha$ might improve generalization, but as discussed above makes optimization trickier as we are starting closer to a saddle point. This suggests that in practice we would want to compromise, and operate just at the edge of the deep regime, using the largest $\alpha$ that still allows for generalization. The tension between optimization and generalization can also be seen through a tradeoff between the sample size $N$ and the largest $\alpha$ we can use and still generalize. This is illustrated in Figure 0(c), where for each sample size $N$, we plot the largest $\alpha$ for which the gradient flow solution achieves population risk below some threshold. As $N$ approaches the number of samples needed for the minimum L1 solution to generalize (the vertical dashed line), the initialization indeed must become extremely small. However, generalization is much easier when the number of samples is only slightly larger, and we can use much more moderate initialization.
The situation we describe here is similar to a situation studied by Mei et al. (2019), who considered one-pass stochastic gradient descent (i.e. SGD on the population objective) and analyzed the number of steps, and so also number of samples, required for generalization. Mei et al. showed that even with large initialization one can achieve generalization by optimizing with more one-pass SGD steps. Our analysis suggests that the issue here is not that of optimizing longer or more accurately, but rather of requiring a larger sample size; in studying one-pass SGD this distinction is blurred, but our analysis separates between the two.
Explicit Regularization
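It is tempting to imagine that the effect of implicit regularization through gradient descent corresponds to selecting the solution closest to initialization in Euclidean norm:
(8) $\beta_\alpha^R = \arg\min_{\beta} R_\alpha(\beta) \quad \text{s.t.} \quad X\beta = y$
where
(9) $R_\alpha(\beta) = \min_{w} \|w - \alpha w_0\|_2^2 \quad \text{s.t.} \quad \beta_w = \beta$
It is certainly the case for standard linear regression $f(w, x) = \langle w, x \rangle$ that the gradient flow limit $\beta_\alpha^\infty$ equals $\beta_\alpha^R$, and the implicit bias is fully captured by this view. Is the implicit bias of $\beta_\alpha^\infty$ indeed captured by this minimum Euclidean distance solution also for our 2-homogeneous (depth 2) model, and perhaps more generally? Can the behavior discussed above also be explained by $R_\alpha$?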
Indeed, it is easy to verify that for our square parametrization, the limiting behavior when $\alpha \to 0$ and $\alpha \to \infty$ of the two approaches match, i.e. both select the minimum L1 norm solution as $\alpha \to 0$ and the minimum L2 norm solution as $\alpha \to \infty$. To check whether the complete behaviour and transition are also captured by (8), we can calculate $R_\alpha(\beta)$, which decomposes over the coordinates, as [Footnote 5: Substituting and equating the gradient w.r.t. to zero leads to a quadratic equation, the solution of which can be substituted back to evaluate ]:
(10) 
Where is the unique real root of w.r.t. .
As depicted in Figure 2, the single-coordinate function $r$ of $R_\alpha$ is quadratic around $0$ and asymptotically linear as $|z| \to \infty$, yielding L2 regularization when $\alpha \to \infty$ and L1 regularization as $\alpha \to 0$, similarly to $Q_\alpha$. However, the coordinate functions $q$ (of $Q_\alpha$) and $r$ are very different: $r$ is algebraic (even radical), while $q$ is transcendental. This implies $Q_\alpha$ and $R_\alpha$ are substantially different, are not simple rescalings of each other, and hence will lead to different sets or “paths” of solutions, $\beta_\alpha^\infty$ and $\beta_\alpha^R$. In particular, while $\alpha$ needed to be exponentially small in order for $Q_\alpha$ to approximate the L1 norm, and so for the limit of the gradient flow path to approximate the minimum L1 norm solution, $R_\alpha$ being algebraic converges to $\|\beta\|_1$ polynomially (that is, $\alpha$ only needs to scale polynomially with the accuracy). We see that the implicit regularization effect of gradient descent (or gradient flow), and the transition from the kernel to the deep regime, is more complex and subtle than what is captured simply by distances in parameter space.
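The contrast between the two regularizers can be checked numerically on a single coordinate. The sketch below assumes the squared parametrization with initialization $\alpha \mathbf{1}$, so that the explicit regularizer per coordinate is $\min (w_+ - \alpha)^2 + (w_- - \alpha)^2$ subject to $w_+^2 - w_-^2 = \beta_i$, and evaluates it by a plain grid search over $w_- \in [0, \alpha]$ rather than through the closed form of equation (10). The implicit regularizer uses the assumed closed form $q(z) = 2 - \sqrt{4 + z^2} + z \operatorname{arcsinh}(z/2)$. Function names and the grid size are illustrative.

```python
import numpy as np

def q(z):
    """Single-coordinate implicit regularizer (assumed closed form)."""
    return 2.0 - np.sqrt(4.0 + z**2) + z * np.arcsinh(z / 2.0)

def implicit_coord(beta_i, alpha):
    """One coordinate of Q_alpha: alpha^2 * q(beta_i / alpha^2)."""
    return float(alpha**2 * q(beta_i / alpha**2))

def explicit_coord(beta_i, alpha, grid=200001):
    """One coordinate of R_alpha: squared distance from (alpha, alpha) to the set
    {w_plus^2 - w_minus^2 = beta_i}, found by grid search over w_minus in [0, alpha]."""
    b = abs(beta_i)                        # symmetric under swapping w_plus and w_minus
    w_minus = np.linspace(0.0, alpha, grid)
    w_plus = np.sqrt(b + w_minus**2)       # enforce the constraint exactly
    return float(np.min((w_plus - alpha)**2 + (w_minus - alpha)**2))

# Small alpha: compare each regularizer with its L1 limit at beta_i = 1.
alpha = 1e-3
explicit_ratio = explicit_coord(1.0, alpha) / 1.0
implicit_ratio = implicit_coord(1.0, alpha) / np.log(1.0 / alpha**2)

# Large alpha: the explicit regularizer is close to beta_i^2 / (8 alpha^2).
explicit_large_ratio = explicit_coord(1.0, 10.0) / (1.0 / (8.0 * 10.0**2))

print(explicit_ratio, implicit_ratio, explicit_large_ratio)
```

At the same small scale, the explicit regularizer is already within a fraction of a percent of the L1 value, while the normalized implicit regularizer is still several percent away, which is the polynomial versus logarithmic gap described above.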
5 Demonstration in Matrix Completion
We now turn to a more complex depth two model, namely a matrix factorization model, and demonstrate similar transitions empirically. Specifically, we consider the model over matrix inputs $X \in \mathbb{R}^{d \times d}$ defined by $f((U, V), X) = \langle UV^\top, X \rangle$, where $U, V \in \mathbb{R}^{d \times k}$. This corresponds to linear predictors over matrix arguments specified by $M_{U,V} = UV^\top$. For generic inputs this can be thought of as a matrix sensing problem, where $X_n$ are measurement matrices. We consider here a matrix completion problem where $X_n = e_{i_n} e_{j_n}^\top$ represents an observation of entry $(i_n, j_n)$: $\langle UV^\top, X_n \rangle = (UV^\top)_{i_n j_n}$, and we observe some subset of the entries of the matrix and would like to complete the unobserved entries.
In the overparametrized regime $k \geq d$, the model itself does not impose any constraints on the linear predictor $M_{U,V}$, and so for learning with fewer than $d^2$ samples (as would always be the case for matrix completion), we need to rely on the implicit bias of gradient descent. In particular, consider matrix completion with observations of a planted rank $r$ matrix, with $r \ll d$. For such underdetermined problems, there are many trivial global minimizers of the loss, most of which are not low rank and hence will not guarantee recovery, and we must rely on some other inductive bias. Indeed, previous work Gunasekar et al. (2017); Li et al. (2018) demonstrated rich implicit bias when $\alpha \to 0$, showing (theoretically and/or analytically) that in this regime we would converge to the minimum nuclear norm solution and would be able to generalize (or reconstruct) a low rank model. Crucially, these analyses depend on initialization with scale $\alpha \to 0$. Here we consider what happens with larger scale unbiased initialization (i.e. when the predictor is zero at initialization even though the factors are not).
Similar to Section 4, in order to get unbiased initialization, we consider and initialization of the form and , where . We will study implicit bias of gradient flow over the factorized parametrization with above initialization.
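A toy version of this setup can be run in a few lines. The exact form of the unbiased initialization is not legible in the text above, so the sketch assumes a doubling trick of its own: $M = U_1 V_1^\top - U_2 V_2^\top$ with all four factors initialized to $\alpha I$, which gives $M = 0$ at initialization. The $2 \times 2$ all-ones target with one hidden entry, the step-size heuristic, and the function name `complete` are likewise illustrative assumptions. The hidden entry is $1$ in the rank-one (minimum nuclear norm) completion and $0$ in the zero-imputed (minimum Frobenius norm) completion.

```python
import numpy as np

def complete(Y, mask, alpha, steps=20000, base_lr=0.01):
    """Small-step gradient descent on sum over observed (i, j) of (M_ij - Y_ij)^2,
    with M = U1 V1^T - U2 V2^T and all factors initialized to alpha * I."""
    d = Y.shape[0]
    U1, V1 = alpha * np.eye(d), alpha * np.eye(d)
    U2, V2 = alpha * np.eye(d), alpha * np.eye(d)
    lr = base_lr / max(1.0, alpha**2)       # heuristic: the kernel scale grows like alpha^2
    for _ in range(steps):
        M = U1 @ V1.T - U2 @ V2.T
        G = 2.0 * mask * (M - Y)            # gradient of the loss with respect to M
        U1, V1, U2, V2 = (U1 - lr * G @ V1, V1 - lr * G.T @ U1,
                          U2 + lr * G @ V2, V2 + lr * G.T @ U2)
    return U1 @ V1.T - U2 @ V2.T

Y = np.ones((2, 2))                         # planted rank-one matrix
mask = np.array([[1.0, 1.0],
                 [1.0, 0.0]])               # entry (1, 1) is unobserved

M_large = complete(Y, mask, alpha=10.0)     # large scale: hidden entry stays near 0
M_small = complete(Y, mask, alpha=1e-3)     # small scale: hidden entry moves towards 1
print(M_large[1, 1], M_small[1, 1])
```

With the large scale, only the observed entries move appreciably, so nothing is learned about the hidden entry; with the small scale, the factors grow along a shared direction and the hidden entry is filled in consistently with a low rank matrix.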
We will focus on matrix completion problems where inputs are of the form . The tangent kernel at initialization is given by . This defaults to the trivial delta kernel for the two special cases (a) have orthogonal columns (e.g. ), or (b) have independent Gaussian entries and . In these cases, minimizing the RKHS norm of the tangent kernel corresponds to returning a zero imputed matrix (minimum Frobenius norm solution). Said differently, in the “kernel” regime training is truly lazy: the unobserved entries do not change at all during training, and instead we just adjust the observed entries to fit the observations. We cannot expect any generalization in this regime, no matter what we assume about the observed matrix. In contrast, in the “deep” regime, as was previously observed by Gunasekar et al., training leads to the minimum nuclear norm solution, a rich inductive bias that allows for generalization Candès and Recht (2009); Recht et al. (2010). Figure 3 demonstrates the transition between the two regimes, and how recovery deteriorates as we move away from the “deep” regime and into the “kernel” regime, changing the unobserved entries less and less.
6 Discussion
The main point of this paper is to emphasize the distinction between the “kernel” regime in training overparametrized multilayered networks, and the “deep” (rich, active, adaptive) regime, show how the scaling of the initialization can transition between them, and understand this transition in detail. We argue that rich inductive bias that enables generalization may arise in the deep regime, but that focusing on the kernel regime restricts us to only what can be done with an RKHS. By studying the transition we also see a tension between generalization and optimization, which suggests we would tend to operate just on the edge of the deep regime, and so understanding this transition, rather than just the extremes, is important. Furthermore, we see that at this operating regime, at the edge of the deep regime, the implicit bias of gradient descent differs substantively from that of explicit regularization. Although in our detailed study we focused on a simple model, so that we can carry out a complete and exact analysis analytically, we expect this to be representative of the behaviour also in other homogeneous models, and to serve as a basis for a more general understanding.
Effect of Width
Our treatment focused on the effect of scale on the transition between the regimes, and we saw that, as pointed out by Chizat and Bach, we can observe a very meaningful transition between a kernel and deep regime even for finite width parametric models. The transition becomes even more interesting if the width of the model (the number of units per layer, and so also the number of parameters) increases towards infinity. In this case, we must be careful as to how the initialization of each individual unit scales when the total number of units increases, and which regime we fall into is controlled by the relative scaling of the width and the scale of individual units at initialization. This is demonstrated, for example, in Figures 3(a) and 3(b), which show the regime change in matrix factorization problems, from minimum Frobenius norm recovery (the kernel regime) to minimum nuclear norm recovery (the deep regime), as a function of both the number of factors $k$ and the scale of initialization $\alpha$ of each factor. As is expected, the scale at which we see the transition decreases as the model becomes wider, but further study is necessary in order to obtain a complete understanding of this scaling.
A particularly interesting aspect of infinite width networks is that, unlike for fixed-width networks, it may be possible to scale $\alpha$ relative to the width such that at the infinite-width limit we would have an (asymptotically) unbiased predictor at initialization, or at least a nonexploding initialization, even with random initialization (without a doubling trick leading to artificially unbiased initialization), while still being in the kernel regime. For two-layer networks with ReLU activation, this has been established in Arora et al. (2019), which showed that with sufficiently large width the gradient dynamics stay in the kernel regime forever.
Another interesting question is whether as the width increases, the transition between the kernel and deep regimes becomes sharp, or perhaps for infinite width models there is a wide intermediate regime with distinct and interesting behaviour.
Acknowledgements
BW is supported by the NSF Graduate Research Fellowship under award 1754881. JDL acknowledges support of the ARO under MURI Award W911NF1110303. This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical Research Council (EPSRC) under the Multidisciplinary University Research Initiative. The work of DS was supported by the Israel Science Foundation (grant No. 31/1031), and by the Taub Foundation. NS is supported by NSF Medium (grant No. NSF102:1764032) and by NSF BIGDATA (grant No. NSF104:1546500).
References
 Allen-Zhu et al. [2018] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
 Arora et al. [2019] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
 Candès and Recht [2009] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717–772, 2009.
 Chizat and Bach [2018] Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
 Daniely [2017] Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.
 Daniely et al. [2016] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pages 2253–2261, 2016.
 Du et al. [2018] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
 Du et al. [2019] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes overparameterized neural networks. In International Conference on Learning Representations, 2019.

 Ge et al. [2015] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
 Ghai et al. [2019] Udaya Ghai, Elad Hazan, and Yoram Singer. Exponentiated gradient meets gradient descent. arXiv preprint arXiv:1902.01903, 2019.
 Gunasekar et al. [2017] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.
 Gunasekar et al. [2018a] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018a.
 Gunasekar et al. [2018b] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit bias of gradient descent on linear convolutional networks. arXiv preprint arXiv:1806.00468, 2018b.
 Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
 Ji and Telgarsky [2018] Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. arXiv preprint arXiv:1810.02032, 2018.
 Lee et al. [2019] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
 Li et al. [2018] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pages 2–47, 2018.
 Mei et al. [2018] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layers neural networks. In Proceedings of the National Academy of Sciences, volume 115, pages E7665–E7671, 2018.
 Mei et al. [2019] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. arXiv preprint arXiv:1902.06015, 2019.
 Recht et al. [2010] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.
 Savarese et al. [2019] Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? arXiv preprint arXiv:1902.05040, 2019.

 Soudry et al. [2018] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70), 2018.
 Yang [2019] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
 Zhao et al. [2019] Peng Zhao, Yun Yang, and Qiao-Chu He. Implicit regularization via hadamard product over-parametrization in high-dimensional linear regression. 03 2019.
Appendix A Proof of Theorem 1
See Theorem 1.
Proof.
The proof involves relating the set of points reachable by gradient flow on $w$ to the KKT conditions of the minimization problem. While it may not be obvious from the expression, the single-coordinate function $q$ of Theorem 1 is the integral of an increasing function and is thus convex, and the regularizer $Q_\alpha$ is the sum of $q$ applied to individual coordinates of $\beta / \alpha^2$, and is therefore also convex.
The linear predictor is given by applied to the limit of the gradient flow dynamics on . Recalling that ,
(11) 
where the residual $r(t) = X\beta(t) - y$, and $a \circ b$ denotes the elementwise product of $a$ and $b$. It is easily confirmed that these dynamics have a solution:
(12) 
This immediately gives an expression for :
(13)  
(14) 
Understanding the limit exactly requires calculating , which would be a difficult task. However, for our purposes, it is sufficient to know that there is some such that . In other words, the vector is contained in the nonlinear manifold
(15) 
Setting this aside for a moment, consider the KKT conditions of the convex program
(16) 
which are
(17)  
(18) 
Expanding in (17), there must exist such that
(19)  
(20)  
(21) 
Since we already know that the gradient flow solution , there is some such that is a certificate of (17). Furthermore, this problem satisfies the strict saddle property Ge et al. [2015] [Zhao et al., 2019, Lemma 2.1], therefore gradient flow will converge to a zeroerror solution, i.e. . Thus, we conclude that is a solution to (16). ∎
Appendix B Proofs of Theorems 2 and 3
Lemma 1.
For any ,
guarantees that
Proof.
First, we show that . Observe that is even because and
are odd. Therefore,
(22)  
(23)  
(24)  
(25) 
Therefore, we can rewrite
(26)  
(27)  
(28)  
(29) 
Using the fact that
(30) 
we can bound for
(31)  
(32)  
(33)  
(34) 
So, for any , then
(36)  
(37)  
(38) 
See Theorem 2.
Proof.
First, we will prove that . By Lemma 1, since , for all with we have
(45) 
Let be such that and . Then
(46)  
(47)  
(48)  
(49)  
(50)  
(51) 
Therefore, . Furthermore, let be any solution with . It is easily confirmed that there exists such that the point