1 Introduction
Deep learning has been highly successful in recent years and has led to dramatic improvements in multiple domains. Deep-learning algorithms often generalize
quite well in practice: given access to labeled training data, they return neural networks that correctly label unobserved test data. However, despite much research, our theoretical understanding of generalization in deep learning is still limited.
Neural networks used in practice often have far more learnable parameters than training examples. In such overparameterized settings, one might expect overfitting to occur: the learned network might perform well on the training dataset but poorly on test data. Indeed, in overparameterized settings there are many solutions that perform well on the training data, but most of them do not generalize well. Surprisingly, it seems that gradient-based deep-learning algorithms prefer the solutions that generalize well [66]. (Neural networks are trained using gradient-based algorithms: the network's parameters are randomly initialized, and then adjusted in many stages in order to fit the training dataset, using information based on the gradient of a loss function w.r.t. the network parameters.)

Decades of research in learning theory suggest that in order to avoid overfitting one should use a model which is "not more expressive than necessary": the model should be able to perform well on the training data, but should be as "simple" as possible. This idea goes back to Occam's razor, the philosophical principle that we should prefer simple explanations over complicated ones. For example, in Figure 1 the data points are labeled according to a low-degree polynomial plus a small amount of random noise, and we fit them with a low-degree polynomial (green) and with a high-degree polynomial (red). The high-degree polynomial achieves better accuracy on the training data, but it overfits and will not generalize well.
Simplicity in neural networks may stem from having a small number of parameters (which is often not the case in modern deep learning), but may also be achieved by adding a regularization term
during the training, which encourages networks that minimize a certain measure of complexity. For example, we may add a regularizer that penalizes models where the Euclidean norm of the parameters (viewed as a vector) is large. However, in practice neural networks seem to generalize well even when trained without such an explicit regularization
[66], and hence the success of deep learning cannot be attributed to explicit regularization. Therefore, it is believed that gradient-based algorithms induce an implicit bias (or implicit regularization) [43] which prefers solutions that generalize well, and characterizing this bias has been a subject of extensive research.

In this review article, we discuss the implicit bias in training neural networks using gradient-based methods. The literature on this subject has rapidly expanded in recent years, and we aim to provide a high-level overview of some fundamental results. This article is not a comprehensive survey, and there are certainly important results which are not discussed here.
2 The double-descent phenomenon
An important implication of the implicit bias in deep learning is the double-descent phenomenon, observed by Belkin et al. [7]. As we already discussed, conventional wisdom in machine learning suggests that the number of parameters in the neural network should not be too large, in order to avoid overfitting (when training without explicit regularization). It should also not be too small, in which case the model is not expressive enough and hence performs poorly even on the training data, a situation called underfitting. This classical thinking is captured by the U-shaped risk (i.e., error or loss) curve in Figure 2(A): as we increase the number of parameters, the training risk decreases, and the test risk initially decreases and then increases. Hence, there is a "sweet spot" where the test risk is minimized. This classical U-shaped curve suggests that we should not use a model that perfectly fits the training data.

However, it turns out that the U-shaped curve only provides a partial picture. If we continue to increase the number of parameters in the model after the training set is already perfectly labeled, then there might be a second descent in the test risk (hence the name "double descent"). As can be seen in Figure 2(B), by increasing the number of parameters beyond what is required to perfectly fit the training set, the generalization improves. Hence, in contrast to the classical approach, which seeks a sweet spot, we may achieve generalization by using overparameterized neural networks. Modern deep learning indeed relies on highly overparameterized networks.
The double-descent phenomenon is believed to be a consequence of the implicit bias in deep-learning algorithms. When using overparameterized models, gradient methods converge to networks that generalize well by implicitly minimizing a certain measure of the network's complexity. As we will see next, characterizing this complexity measure in different settings is a challenging puzzle.
3 Implicit bias in classification
We start with the implicit bias in classification tasks, namely, where the labels are in $\{-1, +1\}$. Neural networks are trained in practice using the gradient-descent algorithm and its variants, such as stochastic gradient descent (SGD). In gradient descent, we start from some random initialization $\theta_0$ of the network's parameters, and then in each iteration we update $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$, where $\eta > 0$ is the step size and $L$ is some empirical loss that we wish to minimize. Here, we will focus on gradient flow, which is a continuous version of gradient descent. That is, we train the parameters $\theta$ of the considered neural network such that for all time $t \geq 0$ we have $\frac{d\theta(t)}{dt} = -\nabla L(\theta(t))$, where $L$ is the empirical loss and $\theta(0)$ is some random initialization. (If $L$ is non-differentiable, e.g., in ReLU networks, then we use the Clarke subdifferential, which is a generalization of the derivative for non-differentiable functions.) We focus on the logistic loss function (a.k.a. binary cross entropy): for a training dataset $\{(x_i, y_i)\}_{i=1}^n \subseteq \mathbb{R}^d \times \{-1, +1\}$ and a neural network $N_\theta$ parameterized by $\theta$, the empirical loss is defined by $L(\theta) = \sum_{i=1}^n \log\left(1 + e^{-y_i N_\theta(x_i)}\right)$. We note that gradient flow corresponds to gradient descent with an infinitesimally small step size, and many implicit-bias results are shown for gradient flow since it is often easier to analyze.

3.1 Logistic regression
Logistic regression is perhaps the simplest classification setting where the implicit bias of neural networks can be considered. It corresponds to a neural network with a single neuron that does not have an activation function. That is, the model here is a linear predictor $x \mapsto \langle w, x \rangle$, where $w \in \mathbb{R}^d$ are the learned parameters.
We assume that the data is linearly separable, i.e., there exists some $w^*$ such that for all $i$ we have $y_i \langle w^*, x_i \rangle > 0$. Note that the loss $L(w)$ is strictly positive for all $w$, but it may tend to zero as $\|w\|$ tends to infinity. For example, for $w_\alpha = \alpha w^*$ (where $w^*$ is the vector defined above) we have $\lim_{\alpha \to \infty} L(w_\alpha) = 0$. However, there might be infinitely many directions that drive the loss to zero, i.e., that correctly classify the training data. Interestingly, Soudry et al. [55] showed that gradient flow converges in direction to the maximum-margin predictor. That is, $\lim_{t \to \infty} \frac{w(t)}{\|w(t)\|}$ exists, and points in the direction of the maximum-margin predictor, defined by

$$\hat{w} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \|w\|^2 \quad \text{s.t.} \quad y_i \langle w, x_i \rangle \geq 1 \text{ for all } i \in [n].$$
The above characterization of the implicit bias in logistic regression can be used to explain generalization. Consider the case where $d > n$; thus, the input dimension is larger than the size of the dataset, and hence there are infinitely many vectors $w$ with $\|w\| = 1$ such that for all $i$ we have $y_i \langle w, x_i \rangle > 0$, i.e., there are infinitely many directions that correctly label the training data. Some of the directions which fit the training data generalize well, and others do not. The above result guarantees that gradient flow converges to the maximum-margin direction. Consequently, we can explain generalization by using standard margin-based generalization bounds (cf. [52]), which imply that predictors achieving a large margin generalize well.
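The result of Soudry et al. [55] can be checked on a toy dataset. In the sketch below (an illustration with an arbitrarily chosen dataset, not from the paper), both points are labeled $+1$ and are mirror images across the first axis, so the maximum-margin direction is $(1, 0)$; gradient descent on the logistic loss slowly rotates toward it while $\|w_t\|$ grows.

```python
import numpy as np

# Gradient descent on the logistic loss with a linearly separable dataset.
# The normalized iterate w_t / ||w_t|| converges to the max-margin direction,
# which for this symmetric toy dataset is (1, 0).
X = np.array([[1.0, 1.0], [1.0, -1.0]])   # two points, both labeled +1
y = np.array([1.0, 1.0])

w = np.array([0.3, 0.9])                  # arbitrary initialization
lr = 0.1
for _ in range(200_000):
    margins = y * (X @ w)
    # gradient of sum_i log(1 + exp(-margins_i)) w.r.t. w
    grad = -(X.T * y) @ (1.0 / (1.0 + np.exp(margins)))
    w -= lr * grad

direction = w / np.linalg.norm(w)
print(direction)   # close to (1, 0); convergence in direction is slow
```

Note how many iterations are needed even in two dimensions, which previews the slow logarithmic convergence rate discussed later in this section.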
3.2 Linear networks
We turn to deep neural networks where the activation is the identity function. Such neural networks are called linear networks. These networks compute linear functions, but the network architectures induce significantly different dynamics of gradient flow compared to the case of linear predictors that we already discussed. Hence, the implicit bias in linear networks has been extensively studied, as a first step towards understanding implicit bias in deep nonlinear networks. It turns out that understanding implicit bias in linear networks is highly nontrivial, and its study reveals valuable insights.
A linear fully-connected network of depth $k$ computes a function $x \mapsto W_k W_{k-1} \cdots W_1 x$, where $W_1, \ldots, W_k$ are weight matrices ($W_k$ is a row vector, so that the output is a scalar). The trainable parameters are the collection of weight matrices. By [25] (similar results under stronger assumptions were previously established in [17, 23]), if gradient flow converges to zero loss, namely, $\lim_{t \to \infty} L(\theta(t)) = 0$, then the vector $W_k W_{k-1} \cdots W_1$ converges in direction to the maximum-margin predictor. That is, although the dynamics of gradient flow in a deep fully-connected linear network are significantly different from those for a linear predictor, in both cases gradient flow is biased towards the maximum-margin predictor. We also note that by [23], the weight matrices $W_1, \ldots, W_k$ converge in direction to matrices of rank 1. Thus, gradient flow on fully-connected linear networks is also biased towards rank minimization of the weight matrices.
Once we consider linear networks which are not fully connected, gradient flow no longer maximizes the margin. For example, Gunasekar et al. [17] showed that in linear diagonal networks (i.e., where the weight matrices are diagonal) of depth $L$, gradient flow encourages predictors that maximize the margin w.r.t. the $\ell_{2/L}$ quasi-norm. As a result, diagonal networks are biased towards sparse linear predictors. In linear convolutional networks, they proved a bias towards sparsity of the linear predictors in the frequency domain (see also [65]).

3.3 Homogeneous neural networks
Neural networks of practical interest have nonlinear activations and compute nonlinear predictors. The notion of margin maximization in nonlinear predictors is generally not well-defined. Nevertheless, it has been established that for certain neural networks gradient flow maximizes the margin in parameter space. Thus, while in linear networks we considered margin maximization in predictor space (a.k.a. function space), here we consider margin maximization w.r.t. the network's parameters.
Consider a neural network with parameters $\theta$, denoted by $N_\theta$. We say that $N_\theta$ is homogeneous if there exists $\alpha > 0$ such that for every $\theta$, every input $x$, and every $c > 0$ we have $N_{c\theta}(x) = c^\alpha N_\theta(x)$. Here we think of $\theta$ as a vector obtained by concatenating all the parameters. That is, in homogeneous networks, scaling the parameters by any factor $c > 0$ scales the predictions by $c^\alpha$. We note that a feedforward neural network with the ReLU activation (namely, $z \mapsto \max\{0, z\}$) is homogeneous if it does not contain skip-connections (i.e., residual connections), and does not have bias terms, except maybe for the first hidden layer. Also, we note that a homogeneous network may include convolutional layers. In papers by Lyu and Li [33] and by Ji and Telgarsky [25] (a similar result under stronger assumptions was previously established in [38]), it was shown that if gradient flow on homogeneous networks reaches a sufficiently small loss (at some time $t_0$), then as $t \to \infty$ it converges to zero loss, the parameters converge in direction, i.e., $\lim_{t \to \infty} \frac{\theta(t)}{\|\theta(t)\|}$ exists, and gradient flow is biased towards margin maximization in the following sense. Consider the following margin-maximization problem in parameter space:

$$\min_\theta \frac{1}{2} \|\theta\|^2 \quad \text{s.t.} \quad y_i N_\theta(x_i) \geq 1 \text{ for all } i \in [n].$$

Then, $\lim_{t \to \infty} \frac{\theta(t)}{\|\theta(t)\|}$ points in the direction of a first-order stationary point of the above optimization problem, also called a Karush–Kuhn–Tucker point, or KKT point for short. The KKT approach allows inequality constraints, and is a generalization of the method of Lagrange multipliers, which allows only equality constraints.
A KKT point satisfies several conditions (called KKT conditions), and it was proved in [33] that in the case of homogeneous neural networks these conditions are necessary for optimality. However, they are not sufficient even for local optimality (cf. [59]). Thus, a KKT point may not be an actual optimum of the maximummargin problem. It is analogous to showing that some unconstrained optimization problem converges to a point where the gradient is zero, without proving that it is a global/local minimum. Intuitively, convergence to a KKT point of the maximummargin problem implies a certain bias towards margin maximization in parameter space, although it does not guarantee convergence to a maximummargin solution.
As we already discussed, for linear classifiers and linear neural networks, margin maximization (in predictor space) can explain generalization using well-known margin-based generalization bounds. Can margin maximization in parameter space explain generalization in nonlinear neural networks? In recent years, several works showed margin-based generalization bounds for neural networks (e.g., [42, 5, 14, 60]). Hence, generalization in neural networks might be established by combining these results with results on the implicit bias towards margin maximization in parameter space. On the flip side, we note that it is still unclear how tight the known margin-based generalization bounds for neural networks are, and whether the sample complexity implied by such results (i.e., the required size of the training dataset) may be small enough to capture the situation in practice.

Several results on implicit bias were shown for specific cases of nonlinear homogeneous networks. For example: [10] showed a bias towards margin maximization w.r.t. a certain function norm (known as the variation norm) in infinite-width two-layer homogeneous networks; [34] proved margin maximization in two-layer leaky-ReLU networks trained on linearly separable and symmetric data, and [50] proved convergence to a linear classifier in two-layer leaky-ReLU networks under different assumptions; and [49] showed a bias towards minimizing the number of linear regions in univariate two-layer ReLU networks (see also [13, 51]).
Finally, the implicit bias in non-homogeneous neural networks (such as ResNets) is currently not well-understood. (We remark that a result of [38] suggests that for a sum of homogeneous networks of different orders, which is itself non-homogeneous, the implicit bias may encourage solutions that discard the networks with the smallest order.) Improving our knowledge of non-homogeneous networks is an important challenge on the path towards understanding implicit bias in deep learning.
3.4 Extensions
For simplicity, we considered so far only gradient flow in binary classification. We note that many of the above results can also be extended to other gradient methods (such as gradient descent, steepest descent, and SGD), and to multiclass classification with the cross-entropy loss (see, e.g., [55, 33, 16, 40]).
The margin-maximization guarantees that we discussed for gradient flow hold in an asymptotic sense, and an important question is how fast the convergence is. It turns out that the convergence rate might be extremely slow [55, 39, 24, 26, 40, 28]. For example, when learning a linear predictor on a linearly separable training dataset using gradient descent, after $t$ iterations the distance between the normalized predictor $\frac{w_t}{\|w_t\|}$ and the maximum-margin direction $\hat{w}$ generally scales as $\frac{1}{\log t}$ (see [55] for a more precise statement), and hence to reach distance at most $\epsilon$ for some small $\epsilon > 0$, the number of iterations must be exponential in $1/\epsilon$. We remark that Shamir [53] showed that the slow convergence rate to the maximum-margin predictor does not imply that $t$ must be extremely large in order to avoid overfitting. Namely, he proved that also for much smaller values of $t$, the predictor achieves a sufficiently large margin on a sufficiently large portion of the dataset, which implies good generalization properties.
Moreover, so far we considered only training with the logistic loss, which is a common loss function for binary classification. The results can generally be extended to loss functions that have an exponential tail, but the implicit bias is different when using other loss functions (e.g., losses with a polynomial tail) [27, 26, 39].
Finally, all the results described so far do not depend on the initialization of gradient flow. For example, when training homogeneous networks, as the time $t$ tends to infinity, gradient flow converges to a KKT point of the maximum-margin problem (assuming that the loss reaches a sufficiently small value), regardless of the initialization. However, such an asymptotic analysis provides only a partial picture. The authors of [35] considered certain linear diagonal networks, and showed that if we train the network only until we reach some fixed accuracy, e.g., until the loss reaches some small fixed value (instead of considering $t \to \infty$), then the initialization plays a crucial role: if the initialization scale is large w.r.t. the considered accuracy (a setting which corresponds to the so-called kernel regime), then the implicit bias is given by the $\ell_2$ norm, rather than by the sparsity-inducing quasi-norm bias in the setting studied so far. Thus, to understand implicit bias in practice, the asymptotic results may not suffice, and we may need to take the initialization of gradient flow into account.

4 Implicit bias in regression
We turn to consider the implicit bias of gradient methods in regression tasks, namely, where the labels are in $\mathbb{R}$. We focus on gradient flow and on the square-loss function. Thus, we have $\frac{d\theta(t)}{dt} = -\nabla L(\theta(t))$, where the empirical loss is defined such that given a training dataset $\{(x_i, y_i)\}_{i=1}^n \subseteq \mathbb{R}^d \times \mathbb{R}$ and a neural network $N_\theta$, we have $L(\theta) = \sum_{i=1}^n \left(N_\theta(x_i) - y_i\right)^2$. In an overparameterized setting, there might be many possible choices of $\theta$ such that $L(\theta) = 0$; thus there are many global minima. Ideally, we want to find some implicit regularization function $R(\theta)$ such that gradient flow prefers global minima which minimize $R$. That is, if gradient flow converges to some $\theta^*$ with $L(\theta^*) = 0$, then $\theta^*$ minimizes $R(\theta)$ s.t. $L(\theta) = 0$.
4.1 Linear regression
We start with linear regression, which is perhaps the simplest regression setting where the implicit bias of neural networks can be considered. Here, the model is a linear predictor $x \mapsto \langle w, x \rangle$, where $w \in \mathbb{R}^d$ are the learned parameters. In this case, it is not hard to show that gradient flow converges to the global minimum of $L$ which is closest to the initialization $w(0)$ in $\ell_2$ distance [66, 16]. Thus, the implicit regularization function is $R(w) = \|w - w(0)\|_2$. Minimizing this norm allows us to explain generalization using standard norm-based generalization bounds [52].

We note that the results on linear regression can be extended to optimization methods other than gradient flow, such as gradient descent, SGD, and mirror descent [16, 2]. (For mirror descent with a potential function $\psi$, the implicit bias is given by $R(w) = D_\psi(w, w(0))$, where $D_\psi$ is the Bregman divergence w.r.t. $\psi$. In particular, if we start at $w(0) = \operatorname*{argmin}_w \psi(w)$, then $R(w) = \psi(w)$.)
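This characterization is easy to verify numerically. In the sketch below (a toy instance constructed for illustration), gradient descent from $w(0) = 0$ on an underdetermined least-squares problem lands on the minimum-norm interpolating solution, i.e., the pseudoinverse solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined least squares: 3 equations, 10 unknowns, so there are
# infinitely many zero-loss solutions. Gradient descent from w(0) converges
# to the zero-loss solution closest to w(0) in Euclidean distance; starting
# from w(0) = 0 this is the minimum-norm (pseudoinverse) solution.
A = rng.standard_normal((3, 10))
b = rng.standard_normal(3)

w = np.zeros(10)
lr = 0.01
for _ in range(50_000):
    w -= lr * 2 * A.T @ (A @ w - b)   # gradient of ||Aw - b||^2

w_min_norm = np.linalg.pinv(A) @ b
print(np.linalg.norm(w - w_min_norm))   # close to 0
```

The key observation is that the gradient always lies in the row space of $A$, so iterates starting from zero never acquire a component outside it.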
4.2 Linear networks
A linear neural network has the identity activation function and computes a linear predictor $x \mapsto \langle w_\theta, x \rangle$, where $w_\theta$ is the vector obtained by multiplying the weight matrices in $\theta$. Hence, instead of seeking an implicit regularization function $R(\theta)$ in parameter space, we may aim to find such a function $R(w)$ in predictor space. Similarly to our discussion in the context of classification, the network architectures in deep linear networks induce significantly different dynamics of gradient flow compared to the case of linear regression. Hence, the implicit bias in such networks has been studied as a first step towards understanding implicit bias in deep nonlinear networks.
Exact expressions for $R(w)$ have been obtained for linear diagonal networks and linear fully-connected networks [3, 65, 63]. The expressions for $R(w)$ are rather complicated, and depend on the initialization scale, i.e., the norm of $\theta(0)$, and on the initialization "shape", namely, the relative magnitudes of different weights and layers in the network. Below we discuss a few special cases.
Recall that diagonal networks are neural networks where the weight matrices are diagonal. The regularization function $R(w)$ has been obtained for a few variants of diagonal networks. In [63], the authors considered networks of the form $x \mapsto \langle u \odot u - v \odot v, x \rangle$, where $u, v \in \mathbb{R}^d$ and the operator $\odot$ denotes the Hadamard (entrywise) product. Thus, in this network, which can be viewed as a variant of a diagonal network, the weights in the two layers are shared. For an initialization $u(0) = v(0) = \alpha \mathbf{1}$ (for a scalar $\alpha > 0$), the predictor at initialization is zero, so the scale of the initialization does not affect the initial predictor. However, the initialization scale does affect the implicit bias: if $\alpha \to 0$ then $R(w)$ tends to the $\ell_1$ norm, and if $\alpha \to \infty$ (which corresponds to the so-called kernel regime) then $R(w)$ tends to the $\ell_2$ norm. Thus, the implicit bias transitions between the $\ell_1$ norm and the $\ell_2$ norm as we increase the initialization scale. A similar result holds also for deeper networks. In two-layer "diagonal networks" with a similar structure, but where the weights are not shared, [3] showed that the implicit bias is affected by both the initialization scale and "shape". In [65], the authors showed a transition between the $\ell_1$ and $\ell_2$ norms both in diagonal networks and in certain convolutional networks (in the frequency domain).
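The transition between the two regimes can be observed directly. The sketch below (a minimal one-equation instance for illustration, not the setting of [63] verbatim) trains the shared-weight parameterization $w = u \odot u - v \odot v$ with gradient descent from two initialization scales: the small scale ends near the sparse solution $(0, 1)$, while the large scale ends near the minimum-$\ell_2$ solution $(0.4, 0.8)$.

```python
import numpy as np

# One linear equation, two unknowns: w1 + 2*w2 = 2.
# Min-l2 zero-loss solution: (0.4, 0.8).  Sparsest (min-l1) solution: (0, 1).
A = np.array([1.0, 2.0])
b = 2.0

def train(alpha, lr=2e-4, steps=200_000):
    """GD on the shared-weight 'diagonal network' w = u*u - v*v."""
    u = np.full(2, alpha)
    v = np.full(2, alpha)
    for _ in range(steps):
        r = A @ (u * u - v * v) - b   # residual; loss is r**2 / 2
        u -= lr * r * A * 2 * u       # chain rule through u ⊙ u
        v -= lr * -r * A * 2 * v      # chain rule through -v ⊙ v
    return u * u - v * v

w_small = train(alpha=1e-3)   # rich regime: near-sparse solution ~ (0, 1)
w_large = train(alpha=10.0)   # kernel regime: near the min-l2 solution
print(w_small, w_large)
```

In the small-scale run, each coordinate grows multiplicatively at a rate proportional to its data coefficient, so the largest coefficient wins and the solution comes out sparse; in the large-scale run the dynamics are nearly linear in the parameters, recovering the $\ell_2$-flavored bias of linear regression.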
4.3 Matrix factorization as a testbed for neural networks
Towards the goal of understanding implicit bias in neural networks, much effort has been directed at the matrix factorization (and, more specifically, matrix sensing) problem. In this problem, we are given a set of observations about an unknown matrix $W^*$, and we find a matrix $W = W_2 W_1$, with $W_1, W_2 \in \mathbb{R}^{d \times d}$, that is compatible with the given observations by running gradient flow. More precisely, consider the observations $\{(A_i, y_i)\}_{i=1}^n$ such that $y_i = \langle A_i, W^* \rangle$, where in the inner product we view $A_i$ and $W^*$ as vectors (namely, $\langle A_i, W^* \rangle = \sum_{j,k} (A_i)_{j,k} W^*_{j,k}$). Then, we find $W = W_2 W_1$ using gradient flow on the objective $L(W_1, W_2) = \sum_{i=1}^n \left(\langle A_i, W_2 W_1 \rangle - y_i\right)^2$. Since $W_1$ and $W_2$ are full-dimensional $d \times d$ matrices, the rank of $W$ is not limited by the parameterization. However, the parameterization has a crucial effect on the implicit bias. Assuming that there are many matrices that are compatible with the observations, the implicit bias controls which matrix gradient flow finds. A notable special case of matrix sensing is matrix completion. Here, every observation has the form $A_i = e_{j_i} e_{k_i}^\top$, where $e_1, \ldots, e_d$ is the standard basis. Thus, each such observation reveals a single entry of the matrix $W^*$.
The matrix factorization problem is analogous to training a two-layer linear network. Furthermore, we may consider deep matrix factorization, where $W = W_k W_{k-1} \cdots W_1$, which is analogous to training a deep linear network. Therefore, this problem is considered a natural testbed for investigating implicit bias in neural networks, and it has been studied extensively in recent years (e.g., [15, 30, 1, 46, 6, 31]).
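A minimal instance makes the testbed concrete. In the sketch below (a toy example in the spirit of the matrix-completion experiments, not taken from the cited papers), we complete a $2 \times 2$ matrix from three observed entries using a symmetric factorized parameterization $W = U U^\top$; although $U$ is full-dimensional, gradient descent from a small initialization fills in the missing entry so that $W$ is approximately the rank-1 completion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed entries of the unknown matrix: [[1, 1], [1, ?]].
# The rank-1 (and minimal-nuclear-norm PSD) completion sets ? = 1.
M = np.array([[1.0, 1.0], [1.0, 0.0]])      # target values on observed entries
mask = np.array([[1.0, 1.0], [1.0, 0.0]])   # 1 marks an observed entry

U = 1e-3 * rng.standard_normal((2, 2))      # small initialization
lr = 0.01
for _ in range(100_000):
    D = 2 * mask * (U @ U.T - M)   # gradient of the observed squared error w.r.t. W
    U -= lr * (D + D.T) @ U        # chain rule through W = U @ U.T

W = U @ U.T
print(W)   # the unobserved (1,1) entry comes out close to 1
```

Filling the unobserved entry with 0 would also fit the observations but yields a rank-2 matrix; the factorized dynamics from small initialization instead grow a single dominant direction first, which is the greedy low-rank behavior discussed below.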
Gunasekar et al. [15] conjectured that in matrix factorization, the implicit bias of gradient flow starting from a small initialization is given by the nuclear norm of $W$ (a.k.a. the trace norm). That is, $R(W) = \|W\|_*$. They also proved this conjecture in the restricted case where the matrices $A_i$ commute. The conjecture was further studied in a line of works (e.g., [30, 6, 1, 46]) providing positive and negative evidence, and was formally refuted by [31]. In [46], the authors showed that gradient flow in matrix completion might approach a global minimum at infinity rather than converging to a global minimum with a finite norm. This result suggests that the implicit bias in matrix factorization may not be expressible by any norm or quasi-norm. They conjectured that the implicit bias can be explained by rank minimization rather than norm minimization. In [31], the authors provided theoretical and empirical evidence that gradient flow with infinitesimally small initialization in the matrix factorization problem is mathematically equivalent to a simple heuristic rank-minimization algorithm called Greedy Low-Rank Learning (see also [21]). This result was generalized to tensor factorization in [47, 48].

Overall, although an explicit expression for the function $R$ is not known in the case of matrix factorization, we can view the implicit bias as a heuristic for rank minimization. It is still unclear what the implications of these results are for more practical nonlinear neural networks.
4.4 Nonlinear networks
Once we consider nonlinear networks, the situation is even less clear. For a single-neuron network, namely, $x \mapsto \sigma(\langle w, x \rangle)$ where $\sigma$ is strictly monotonic, gradient flow converges to the closest global minimum in $\ell_2$ distance, similarly to the case of linear regression. Thus, we have $R(w) = \|w - w(0)\|_2$. However, if $\sigma$ is the ReLU activation, then the implicit bias is not expressible by any nontrivial function of $w$. A bit more precisely, suppose that $R: \mathbb{R}^d \to \mathbb{R}$ is such that, if gradient flow starting from $w(0)$ converges to a zero-loss solution $w^*$, then $w^*$ is a zero-loss solution that minimizes $R$. Then, such a function $R$ must be constant. Hence, the approach of precisely specifying the implicit bias of gradient flow via a regularization function is not feasible in single-neuron ReLU networks [58]. It suggests that this approach may not be useful for understanding implicit bias in the general case of ReLU networks.
On the positive side, in single-neuron ReLU networks the implicit bias can be expressed approximately, within a constant factor, by the $\ell_2$ norm. Namely, let $\bar{w}$ be a zero-loss solution with a minimal $\ell_2$ norm, and assume that gradient flow converges to some zero-loss solution $w^*$; then $\|w^*\|$ exceeds $\|\bar{w}\|$ by at most a constant factor [58].
For two-layer leaky-ReLU networks with a single hidden neuron, an explicit expression for $R$ was obtained in [3]. However, if we replace the leaky-ReLU activation with ReLU, then the implicit bias is not expressible by any nontrivial function, similarly to the case of a single-neuron ReLU network [58].
The results on implicit bias in matrix factorization might suggest that there is a certain tendency towards rank minimization in deep learning with the square loss. However, the authors of [57] showed that gradient flow is not biased towards rank minimization of the weight matrices in ReLU networks, at least in the simple case of two-layer networks and small datasets. (We remark that [57] did show a certain tendency towards rank minimization in deep networks trained with the square loss using weight decay, or with the logistic loss.)
Overall, our understanding of implicit bias in nonlinear networks with regression losses is very limited. While in classification we have a useful characterization of the implicit bias in nonlinear homogeneous models, here we do not understand even extremely simple models. Improving our understanding of this question is a major challenge for the upcoming years.
5 Additional implicit biases
In the previous sections, we mostly focused on implicit bias in gradient flow. We also discussed cases where the results on gradient flow can be extended to other gradient methods. However, when considering gradient descent or SGD with a finite step size (rather than infinitesimal), the discrete nature of the algorithms and the stochasticity induce additional implicit biases that do not exist in gradient flow. These biases are crucial for understanding the behavior of gradient methods in practice, as empirical results suggest that increasing the step size and decreasing the batch size may improve generalization (cf. [29, 22]).
It is well known that gradient descent and SGD cannot stably converge to minima that are too sharp relative to the step size [11, 37, 64, 36, 41]. Namely, at a stable minimum, the maximal eigenvalue of the Hessian of the loss is at most $2/\eta$, where $\eta$ is the step size. As a result, when using an appropriate step size we can rule out stable convergence to certain (sharp) minima, and thus encourage convergence to flat minima, which are believed to generalize better [29, 64]. Similarly, small batch sizes in SGD also encourage flat minima [29, 64].

Another direction for understanding the implicit bias of gradient descent and SGD is to relate them to gradient flow on a modified loss. In [4, 54], the authors showed that the discrete iterates of gradient descent and SGD with a small step size lie close to the path of gradient flow on certain modified losses, obtained by adding explicit regularizers (see also [44]).
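The stability threshold is already visible in one dimension. The sketch below runs gradient descent on the quadratic $f(x) = \frac{a}{2} x^2$, whose Hessian is the constant $a$: the iterates contract exactly when $a$ is below 2 divided by the step size, and blow up above it.

```python
# Gradient descent on f(x) = 0.5 * a * x**2 follows x <- (1 - lr * a) * x,
# so it converges iff |1 - lr * a| < 1, i.e. iff a < 2 / lr. Minima sharper
# than 2/lr cannot be approached stably.
def run_gd(a, lr=0.1, steps=100, x0=1.0):
    x = x0
    for _ in range(steps):
        x -= lr * a * x   # gradient step on f(x) = 0.5 * a * x**2
    return abs(x)

print(run_gd(a=19.0))   # 19 < 2/0.1 = 20: shrinks toward 0
print(run_gd(a=21.0))   # 21 > 20: blows up
```

The same calculation applied eigenvalue-by-eigenvalue near a minimum of a general loss gives the $2/\eta$ sharpness bound quoted above.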
The implicit bias of SGD is also affected when training with label noise, namely, when the training labels are perturbed by independent noise at each iteration [9, 19, 12, 32, 45]. For example, in [19, 45] it is shown that when training a linear diagonal network using SGD with label noise, there is a bias towards sparse solutions even when the initialization scale is large. This is in contrast to noise-free training, where such large initializations correspond to the kernel regime and lead to an $\ell_2$-flavored implicit regularization.
Finally, we note that in this article we mostly considered implicit bias in the rich regime, rather than in the NTK regime [20] (a.k.a. kernel regime). The NTK regime does not capture feature learning, which seems to be a crucial element in the success of deep learning. The implicit bias in the NTK regime has been studied in several works (see, e.g., [62, 8]), but we leave the discussion on these results outside the scope of this review.
6 Implications beyond generalization
While the primary motivation for studying implicit bias is to better understand generalization in deep learning, it may also have other significant implications. Indeed, various phenomena may stem from the tendency of gradientbased algorithms to prefer specific solutions over others. We believe that exploring such implications is an exciting research direction, and demonstrate it below with two examples.
First, neural networks are known to be extremely vulnerable to adversarial examples [56], namely, small perturbations of the inputs that change the network's predictions. This phenomenon has been widely studied, but it is still not well-understood. Specifically, it is unclear why gradient methods tend to learn non-robust networks, namely, networks that are susceptible to adversarial examples, even in cases where robust networks exist. The tendency of gradient methods to learn non-robust networks can be viewed as an implication of the implicit bias. See [61] for some results in this direction.
Second, the implicit bias might shed light on the hidden representations learned by neural networks, and on the extent to which neural networks are susceptible to privacy attacks. In [18], the authors used the characterization of the implicit bias in homogeneous networks due to [33, 25], and showed that training data can be reconstructed from trained networks. Thus, neural networks "memorize" training data, and by using known properties of the implicit bias this data may be reconstructed, which might have negative implications for privacy.

7 Conclusion
Deep-learning algorithms exhibit remarkable performance, but it is not well-understood why they are able to generalize despite having many more parameters than training examples. Exploring generalization in deep learning is an intriguing puzzle with both theoretical and practical implications. It is believed that implicit bias is a main ingredient in the ability of deep-learning algorithms to generalize, and hence it has been widely studied. Moreover, investigating the implicit bias might have important implications beyond generalization.
Our understanding of implicit bias improved dramatically in recent years, but is still far from satisfactory. We believe that further progress in the study of implicit bias will eventually shed light on the mystery of generalization in deep learning.
Acknowledgements
I thank Nadav Cohen, Noam Razin, Ohad Shamir and Daniel Soudry for valuable comments and discussions.
References
 Arora et al. [2019] S. Arora, N. Cohen, W. Hu, and Y. Luo. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems, pages 7413–7424, 2019.
 Azizan and Hassibi [2018] N. Azizan and B. Hassibi. Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. arXiv preprint arXiv:1806.00952, 2018.
 Azulay et al. [2021] S. Azulay, E. Moroshko, M. S. Nacson, B. Woodworth, N. Srebro, A. Globerson, and D. Soudry. On the implicit bias of initialization shape: Beyond infinitesimal mirror descent. In International Conference on Machine Learning, pages 468–477, 2021.
 Barrett and Dherin [2020] D. G. Barrett and B. Dherin. Implicit gradient regularization. arXiv preprint arXiv:2009.11162, 2020.
 Bartlett et al. [2017] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30:6240–6249, 2017.
 Belabbas [2020] M. A. Belabbas. On implicit regularization: Morse functions and applications to matrix factorization. arXiv preprint arXiv:2001.04264, 2020.
 Belkin et al. [2019] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
 Bietti and Mairal [2019] A. Bietti and J. Mairal. On the inductive bias of neural tangent kernels. Advances in Neural Information Processing Systems, 32, 2019.
 Blanc et al. [2020] G. Blanc, N. Gupta, G. Valiant, and P. Valiant. Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck-like process. In Conference on Learning Theory, pages 483–513. PMLR, 2020.
 Chizat and Bach [2020] L. Chizat and F. Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pages 1305–1338. PMLR, 2020.
 Cohen et al. [2021] J. M. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.
 Damian et al. [2021] A. Damian, T. Ma, and J. D. Lee. Label noise sgd provably prefers flat global minimizers. Advances in Neural Information Processing Systems, 34:27449–27461, 2021.
 Ergen and Pilanci [2021] T. Ergen and M. Pilanci. Convex geometry and duality of overparameterized neural networks. Journal of Machine Learning Research, 2021.
 Golowich et al. [2018] N. Golowich, A. Rakhlin, and O. Shamir. Size-independent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299. PMLR, 2018.
 Gunasekar et al. [2017] S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro. Implicit regularization in matrix factorization. Advances in Neural Information Processing Systems, 30, 2017.
 Gunasekar et al. [2018a] S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, pages 1832–1841. PMLR, 2018a.
 Gunasekar et al. [2018b] S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018b.
 Haim et al. [2022] N. Haim, G. Vardi, G. Yehudai, O. Shamir, and M. Irani. Reconstructing training data from trained neural networks. arXiv preprint arXiv:2206.07758, 2022.
 HaoChen et al. [2021] J. Z. HaoChen, C. Wei, J. Lee, and T. Ma. Shape matters: Understanding the implicit bias of the noise covariance. In Conference on Learning Theory, pages 2315–2357. PMLR, 2021.
 Jacot et al. [2018] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
 Jacot et al. [2021] A. Jacot, F. Ged, F. Gabriel, B. Şimşek, and C. Hongler. Deep linear networks dynamics: Low-rank biases induced by initialization scale and l2 regularization. arXiv preprint arXiv:2106.15933, 2021.
 Jastrzębski et al. [2017] S. Jastrzębski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.
 Ji and Telgarsky [2018a] Z. Ji and M. Telgarsky. Gradient descent aligns the layers of deep linear networks. In International Conference on Learning Representations, 2018a.
 Ji and Telgarsky [2018b] Z. Ji and M. Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018b.
 Ji and Telgarsky [2020] Z. Ji and M. Telgarsky. Directional convergence and alignment in deep learning. Advances in Neural Information Processing Systems, 2020.
 Ji and Telgarsky [2021] Z. Ji and M. Telgarsky. Characterizing the implicit bias via a primaldual analysis. In Algorithmic Learning Theory, pages 772–804. PMLR, 2021.
 Ji et al. [2020] Z. Ji, M. Dudík, R. E. Schapire, and M. Telgarsky. Gradient descent follows the regularization path for general losses. In Conference on Learning Theory, pages 2109–2136. PMLR, 2020.
 Ji et al. [2021] Z. Ji, N. Srebro, and M. Telgarsky. Fast margin maximization via dual acceleration. In International Conference on Machine Learning, pages 4860–4869. PMLR, 2021.
 Keskar et al. [2016] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
 Li et al. [2018] Y. Li, T. Ma, and H. Zhang. Algorithmic regularization in overparameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pages 2–47. PMLR, 2018.
 Li et al. [2020] Z. Li, Y. Luo, and K. Lyu. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. In International Conference on Learning Representations, 2020.
 Li et al. [2021] Z. Li, T. Wang, and S. Arora. What happens after sgd reaches zero loss?–a mathematical framework. arXiv preprint arXiv:2110.06914, 2021.
 Lyu and Li [2019] K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2019.
 Lyu et al. [2021] K. Lyu, Z. Li, R. Wang, and S. Arora. Gradient descent on twolayer nets: Margin maximization and simplicity bias. Advances in Neural Information Processing Systems, 34, 2021.
 Moroshko et al. [2020] E. Moroshko, B. E. Woodworth, S. Gunasekar, J. D. Lee, N. Srebro, and D. Soudry. Implicit bias in deep linear classification: Initialization scale vs training accuracy. Advances in Neural Information Processing Systems, 2020.
 Mulayoff and Michaeli [2020] R. Mulayoff and T. Michaeli. Unique properties of flat minima in deep networks. In International Conference on Machine Learning, pages 7108–7118. PMLR, 2020.
 Mulayoff et al. [2021] R. Mulayoff, T. Michaeli, and D. Soudry. The implicit bias of minima stability: A view from function space. Advances in Neural Information Processing Systems, 34:17749–17761, 2021.
 Nacson et al. [2019a] M. S. Nacson, S. Gunasekar, J. Lee, N. Srebro, and D. Soudry. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In International Conference on Machine Learning, pages 4683–4692. PMLR, 2019a.
 Nacson et al. [2019b] M. S. Nacson, J. Lee, S. Gunasekar, P. H. P. Savarese, N. Srebro, and D. Soudry. Convergence of gradient descent on separable data. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3420–3428. PMLR, 2019b.
 Nacson et al. [2019c] M. S. Nacson, N. Srebro, and D. Soudry. Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3051–3059. PMLR, 2019c.
 Nacson et al. [2022] M. S. Nacson, K. Ravichandran, N. Srebro, and D. Soudry. Implicit bias of the step size in linear diagonal neural networks. In International Conference on Machine Learning, pages 16270–16295. PMLR, 2022.
 Neyshabur et al. [2015] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401. PMLR, 2015.
 Neyshabur et al. [2017] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
 Pesme et al. [2021] S. Pesme, L. Pillaud-Vivien, and N. Flammarion. Implicit bias of sgd for diagonal linear networks: a provable benefit of stochasticity. Advances in Neural Information Processing Systems, 34:29218–29230, 2021.
 Pillaud-Vivien et al. [2022] L. Pillaud-Vivien, J. Reygner, and N. Flammarion. Label noise (stochastic) gradient descent implicitly solves the lasso for quadratic parametrisation. arXiv preprint arXiv:2206.09841, 2022.
 Razin and Cohen [2020] N. Razin and N. Cohen. Implicit regularization in deep learning may not be explainable by norms. Advances in Neural Information Processing Systems, 2020.
 Razin et al. [2021] N. Razin, A. Maman, and N. Cohen. Implicit regularization in tensor factorization. In International Conference on Machine Learning, pages 8913–8924. PMLR, 2021.
 Razin et al. [2022] N. Razin, A. Maman, and N. Cohen. Implicit regularization in hierarchical tensor factorization and deep convolutional neural networks. arXiv preprint arXiv:2201.11729, 2022.
 Safran et al. [2022] I. Safran, G. Vardi, and J. D. Lee. On the effective number of linear regions in shallow univariate relu networks: Convergence guarantees and implicit bias. arXiv preprint arXiv:2205.09072, 2022.
 Sarussi et al. [2021] R. Sarussi, A. Brutzkus, and A. Globerson. Towards understanding learning in neural networks with linear teachers. In International Conference on Machine Learning, pages 9313–9322. PMLR, 2021.
 Savarese et al. [2019] P. Savarese, I. Evron, D. Soudry, and N. Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory, pages 2667–2690. PMLR, 2019.
 Shalev-Shwartz and Ben-David [2014] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
 Shamir [2021] O. Shamir. Gradient methods never overfit on separable data. Journal of Machine Learning Research, 22(85):1–20, 2021.
 Smith et al. [2021] S. L. Smith, B. Dherin, D. G. Barrett, and S. De. On the origin of implicit regularization in stochastic gradient descent. arXiv preprint arXiv:2101.12176, 2021.
 Soudry et al. [2018] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
 Szegedy et al. [2013] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 Timor et al. [2022] N. Timor, G. Vardi, and O. Shamir. Implicit regularization towards rank minimization in relu networks. arXiv preprint arXiv:2201.12760, 2022.
 Vardi and Shamir [2021] G. Vardi and O. Shamir. Implicit regularization in relu networks with the square loss. In Conference on Learning Theory, pages 4224–4258. PMLR, 2021.
 Vardi et al. [2021] G. Vardi, O. Shamir, and N. Srebro. On margin maximization in linear and relu networks. arXiv preprint arXiv:2110.02732, 2021.
 Vardi et al. [2022a] G. Vardi, O. Shamir, and N. Srebro. The sample complexity of one-hidden-layer neural networks. arXiv preprint arXiv:2202.06233, 2022a.
 Vardi et al. [2022b] G. Vardi, G. Yehudai, and O. Shamir. Gradient methods provably converge to non-robust networks. arXiv preprint arXiv:2202.04347, 2022b.
 Williams et al. [2019] F. Williams, M. Trager, D. Panozzo, C. Silva, D. Zorin, and J. Bruna. Gradient dynamics of shallow univariate relu networks. In Advances in Neural Information Processing Systems, pages 8378–8387, 2019.
 Woodworth et al. [2020] B. Woodworth, S. Gunasekar, J. D. Lee, E. Moroshko, P. Savarese, I. Golan, D. Soudry, and N. Srebro. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pages 3635–3673. PMLR, 2020.
 Wu et al. [2018] L. Wu, C. Ma, et al. How sgd selects the global minima in overparameterized learning: A dynamical stability perspective. Advances in Neural Information Processing Systems, 31, 2018.
 Yun et al. [2020] C. Yun, S. Krishnan, and H. Mobahi. A unifying view on implicit bias in training linear neural networks. arXiv preprint arXiv:2010.02501, 2020.
 Zhang et al. [2021] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.