
On the Implicit Bias in Deep-Learning Algorithms

Gradient-based deep-learning algorithms exhibit remarkable performance in practice, but it is not well-understood why they are able to generalize despite having more parameters than training examples. It is believed that implicit bias is a key factor in their ability to generalize, and hence it has been widely studied in recent years. In this short survey, we explain the notion of implicit bias, review main results and discuss their implications.





1 Introduction

Deep learning has been highly successful in recent years and has led to dramatic improvements in multiple domains. Deep-learning algorithms often generalize quite well in practice, namely, given access to labeled training data, they return neural networks that correctly label unobserved test data. However, despite much research, our theoretical understanding of generalization in deep learning is still limited.

Neural networks used in practice often have far more learnable parameters than training examples. In such overparameterized settings, one might expect overfitting to occur, that is, the learned network might perform well on the training dataset and perform poorly on test data. Indeed, in overparameterized settings, there are many solutions that perform well on the training data, but most of them do not generalize well. Surprisingly, it seems that gradient-based deep-learning algorithms prefer the solutions that generalize well [66]. (Neural networks are trained using gradient-based algorithms, where the network’s parameters are randomly initialized, and then adjusted in many stages in order to fit the training dataset by using information based on the gradient of a loss function w.r.t. the network parameters.)

Decades of research in learning theory suggest that in order to avoid overfitting one should use a model which is “not more expressive than necessary”. Namely, the model should be able to perform well on the training data, but should be as “simple” as possible. This idea goes back to the philosophical principle of Occam’s Razor, which argues that we should prefer simple explanations over complicated ones. For example, in Figure 1 the data points are labeled according to a low-degree polynomial plus a small random noise, and we fit them with a low-degree polynomial (green) and with a higher-degree polynomial (red). The higher-degree polynomial achieves better accuracy on the training data, but it overfits and will not generalize well.

Figure 1: Fitting training data with a low-degree polynomial (green) and a higher-degree polynomial (red). The latter achieves better accuracy on the training data but it overfits.
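The effect described around Figure 1 can be reproduced with a minimal sketch. The ground-truth cubic, the noise level, the sample size and the two fitted degrees (3 and 9) are all assumptions chosen for illustration; they are not taken from the figure.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed ground truth: a cubic polynomial (illustrative choice).
    return x**3 - 0.5 * x

# Noisy training labels and a noise-free test grid.
x_train = np.linspace(-1.0, 1.0, 12)
y_train = true_fn(x_train) + 0.05 * rng.standard_normal(x_train.shape)
x_test = np.linspace(-1.0, 1.0, 200)
y_test = true_fn(x_test)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

# Least-squares fits with a low and a high degree (assumed degrees).
low = Polynomial.fit(x_train, y_train, deg=3)
high = Polynomial.fit(x_train, y_train, deg=9)

train_low, train_high = mse(low, x_train, y_train), mse(high, x_train, y_train)
test_low, test_high = mse(low, x_test, y_test), mse(high, x_test, y_test)
print(f"train MSE: low={train_low:.2e}, high={train_high:.2e}")
print(f"test  MSE: low={test_low:.2e}, high={test_high:.2e}")
```

Because the degree-9 model class contains the degree-3 class, its least-squares training error can never be larger; whether its test error is worse depends on the noise draw, which is why the sketch prints both rather than asserting an outcome.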

Simplicity in neural networks may stem from having a small number of parameters (which is often not the case in modern deep learning), but may also be achieved by adding a regularization term during the training, which encourages networks that minimize a certain measure of complexity. For example, we may add a regularizer that penalizes models where the Euclidean norm of the parameters (viewed as a vector) is large. However, in practice neural networks seem to generalize well even when trained without such an explicit regularization [66], and hence the success of deep learning cannot be attributed to explicit regularization. Therefore, it is believed that gradient-based algorithms induce an implicit bias (or implicit regularization [43]) which prefers solutions that generalize well, and characterizing this bias has been a subject of extensive research.

In this review article, we discuss the implicit bias in training neural networks using gradient-based methods. The literature on this subject has rapidly expanded in recent years, and we aim to provide a high-level overview of some fundamental results. This article is not a comprehensive survey, and there are certainly important results which are not discussed here.

2 The double-descent phenomenon

An important implication of the implicit bias in deep learning is the double-descent phenomenon, observed by Belkin et al. [7]. As we already discussed, conventional wisdom in machine learning suggests that the number of parameters in the neural network should not be too large in order to avoid overfitting (when training without explicit regularization). Also, it should not be too small, in which case the model is not expressive enough and hence it performs poorly even on the training data, a situation called underfitting. This classical thinking can be captured by the U-shaped risk (i.e., error or loss) curve from Figure 2(A). Thus, as we increase the number of parameters, the training risk decreases, and the test risk initially decreases and then increases. Hence, there is a “sweet spot” where the test risk is minimized. This classical U-shaped curve suggests that we should not use a model that perfectly fits the training data.

Figure 2: The double-descent phenomenon. Dashed lines denote the training risk and solid lines denote the test risk. (A) The classical U-shaped curve. (B) The double-descent curve, which extends the U-shaped curve with a second descent. Beyond the “interpolation threshold” the model fits the training data perfectly, but the test risk decreases. The figure is from


However, it turns out that the U-shaped curve only provides a partial picture. If we continue to increase the number of parameters in the model after the training set is already perfectly labeled, then there might be a second descent in the test risk (hence the name “double descent”). As can be seen in Figure 2(B), by increasing the number of parameters beyond what is required to perfectly fit the training set, the generalization improves. Hence, in contrast to the classical approach which seeks a sweet spot, we may achieve generalization by using overparameterized neural networks. Modern deep learning indeed relies on using highly overparameterized networks.

The double-descent phenomenon is believed to be a consequence of the implicit bias in deep-learning algorithms. When using overparameterized models, gradient methods converge to networks that generalize well by implicitly minimizing a certain measure of the network’s complexity. As we will see next, characterizing this complexity measure in different settings is a challenging puzzle.

3 Implicit bias in classification

We start with the implicit bias in classification tasks, namely, where the labels are in $\{-1, 1\}$. Neural networks are trained in practice using the gradient-descent algorithm and its variants, such as stochastic gradient descent (SGD). In gradient descent, we start from some random initialization $\theta_0$ of the network’s parameters, and then in each iteration $t \geq 1$ we update $\theta_t = \theta_{t-1} - \eta \nabla \mathcal{L}(\theta_{t-1})$, where $\eta > 0$ is the step size and $\mathcal{L}$ is some empirical loss that we wish to minimize. Here, we will focus on gradient flow, which is a continuous version of gradient descent. That is, we train the parameters $\theta$ of the considered neural network, such that for all time $t \geq 0$ we have $\frac{d\theta(t)}{dt} = -\nabla \mathcal{L}(\theta(t))$, where $\mathcal{L}$ is the empirical loss, and $\theta(0)$ is some random initialization. (If $\mathcal{L}$ is non-differentiable, e.g., in ReLU networks, then we use the Clarke subdifferential, which is a generalization of the derivative for non-differentiable functions.) We focus on the logistic loss function (a.k.a. binary cross entropy): For a training dataset $\{(x_i, y_i)\}_{i=1}^n \subseteq \mathbb{R}^d \times \{-1, 1\}$ and a neural network $N_\theta : \mathbb{R}^d \to \mathbb{R}$ parameterized by $\theta$, the empirical loss is defined by $\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp(-y_i N_\theta(x_i))\right)$. We note that gradient flow corresponds to gradient descent with an infinitesimally small step size, and many implicit-bias results are shown for gradient flow since it is often easier to analyze.
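To make the relation between gradient descent and gradient flow concrete, here is a small sketch on a toy one-dimensional loss $\theta^2/2$ (an assumption made purely for illustration, not a setting from the survey). For this loss gradient flow has the closed-form solution $\theta(t) = \theta(0) e^{-t}$, and running gradient descent for $t/\eta$ steps with step size $\eta$ approaches it as $\eta$ shrinks.

```python
import math

def gradient_descent(theta0, grad, step_size, total_time):
    """Run gradient descent for total_time / step_size iterations."""
    theta = theta0
    for _ in range(round(total_time / step_size)):
        theta = theta - step_size * grad(theta)
    return theta

def grad(theta):
    # Gradient of the toy loss theta**2 / 2.
    return theta

theta0, total_time = 1.0, 2.0
# Closed-form gradient-flow solution of d(theta)/dt = -theta at time total_time.
flow_value = theta0 * math.exp(-total_time)

step_sizes = (0.1, 0.01, 0.001)
errors = [abs(gradient_descent(theta0, grad, eta, total_time) - flow_value)
          for eta in step_sizes]
for eta, err in zip(step_sizes, errors):
    print(f"step size {eta}: |GD - flow| = {err:.2e}")
```

The gap to the gradient-flow value shrinks roughly in proportion to the step size, which is the sense in which gradient flow is gradient descent with an infinitesimally small step.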

3.1 Logistic regression

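Logistic regression is perhaps the simplest classification setting where implicit bias of neural networks can be considered. It corresponds to a neural network with a single neuron that does not have an activation function. That is, the model here is a linear predictor $x \mapsto w^\top x$, where $w \in \mathbb{R}^d$ are the learned parameters.

We assume that the data is linearly separable, i.e., there exists some $\hat{w} \in \mathbb{R}^d$ such that for all $i \in \{1, \ldots, n\}$ we have $y_i \hat{w}^\top x_i > 0$. Note that $\mathcal{L}(w)$ is strictly positive for all $w$, but it may tend to zero as $\|w\|$ tends to infinity. For example, for $\alpha > 0$ and $w = \alpha \hat{w}$ (where $\hat{w}$ is the vector defined above) we have $\mathcal{L}(\alpha \hat{w}) \to 0$ as $\alpha \to \infty$. However, there might be infinitely many directions $\hat{w}$ that satisfy this condition. Interestingly, Soudry et al. [55] showed that gradient flow converges in direction to the maximum-margin predictor. That is, $\tilde{w} = \lim_{t \to \infty} \frac{w(t)}{\|w(t)\|}$ exists, and $\tilde{w}$ points in the direction of the maximum-margin predictor, defined by

$$\arg\min_{w \in \mathbb{R}^d} \|w\| \quad \text{s.t.} \quad \forall i \in \{1, \ldots, n\}: \; y_i w^\top x_i \geq 1.$$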

The above characterization of the implicit bias in logistic regression can be used to explain generalization. Consider the case where $d > n$, thus, the input dimension is larger than the size of the dataset, and hence there are infinitely many vectors $w$ with $\|w\| = 1$ such that for all $i$ we have $y_i w^\top x_i > 0$, i.e., there are infinitely many directions that correctly label the training data. Some of the directions which fit the training data generalize well, and others do not. The above result guarantees that gradient flow converges to the maximum-margin direction. Consequently, we can explain generalization by using standard margin-based generalization bounds (cf. [52]), which imply that predictors achieving large margin generalize well.
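The convergence in direction can be observed numerically. The sketch below runs plain gradient descent (standing in for gradient flow) on the logistic loss over a hand-built, linearly separable toy dataset; the dataset, step size and iteration count are assumptions for illustration. For this dataset the minimum-norm $w$ satisfying $y_i w^\top x_i \geq 1$ for all $i$ is $(1, 0)$: the first two constraints sum to $2 w_1 \geq 2$, and $(1, 0)$ satisfies all three.

```python
import numpy as np

# Toy separable dataset (illustrative). The third point is not a support vector.
X = np.array([[1.0, 1.0], [-1.0, 1.0], [3.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])
w_max_margin = np.array([1.0, 0.0])  # derived by hand for this dataset

def logistic_loss_grad(w):
    """Gradient of (1/n) * sum_i log(1 + exp(-y_i * w.x_i))."""
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))
    return (coeffs[:, None] * X).mean(axis=0)

w = np.zeros(2)
for _ in range(5000):
    w -= 0.5 * logistic_loss_grad(w)

direction = w / np.linalg.norm(w)
cosine = float(direction @ w_max_margin)
print("norm of w:", np.linalg.norm(w))
print("cosine with max-margin direction:", cosine)
```

The norm of `w` keeps growing while its direction aligns with `w_max_margin`, even though the very first gradient step points along $(5, 2)$ rather than $(1, 0)$.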

3.2 Linear networks

We turn to deep neural networks where the activation is the identity function. Such neural networks are called linear networks. These networks compute linear functions, but the network architectures induce significantly different dynamics of gradient flow compared to the case of linear predictors that we already discussed. Hence, the implicit bias in linear networks has been extensively studied, as a first step towards understanding implicit bias in deep nonlinear networks. It turns out that understanding implicit bias in linear networks is highly non-trivial, and its study reveals valuable insights.

A linear fully-connected network of depth $k$ computes a function $x \mapsto W_k \cdots W_1 x$, where $W_1, \ldots, W_k$ are weight matrices (where $W_k$ is a row vector). The trainable parameters $\theta$ are a collection of the weight matrices. By [25] (similar results under stronger assumptions were previously established in [17, 23]), if gradient flow converges to zero loss, namely, $\lim_{t \to \infty} \mathcal{L}(\theta(t)) = 0$, then the vector $W_k \cdots W_1$ converges in direction to the maximum-margin predictor. That is, although the dynamics of gradient flow in the case of a deep fully-connected linear network is significantly different compared to the case of a linear predictor, in both cases gradient flow is biased towards the maximum-margin predictor. We also note that by [23], the weight matrices converge to matrices of rank $1$. Thus, gradient flow on fully-connected linear networks is also biased towards rank minimization of the weight matrices.

Once we consider linear networks which are not fully connected, gradient flow no longer maximizes the margin. For example, Gunasekar et al. [17] showed that in linear diagonal networks of depth $k$ (i.e., where the weight matrices are diagonal) gradient flow encourages predictors that maximize the margin w.r.t. the $\ell_{2/k}$ quasi-norm. As a result, diagonal networks are biased towards sparse linear predictors. In linear convolutional networks they proved bias towards sparsity of the linear predictors in the frequency domain (see also


3.3 Homogeneous neural networks

Neural networks of practical interest have nonlinear activations and compute nonlinear predictors. The notion of margin maximization in nonlinear predictors is generally not well-defined. Nevertheless, it has been established that for certain neural networks gradient flow maximizes the margin in parameter space. Thus, while in linear networks we considered margin maximization in predictor space (a.k.a. function space), here we consider margin maximization w.r.t. the network’s parameters.

Consider a neural network with parameters $\theta$, denoted by $N_\theta : \mathbb{R}^d \to \mathbb{R}$. We say that $N_\theta$ is homogeneous if there exists $L > 0$ such that for every $\alpha > 0$ and every $\theta, x$ we have $N_{\alpha \theta}(x) = \alpha^L N_\theta(x)$. Here we think about $\theta$ as a vector obtained by concatenating all the parameters. That is, in homogeneous networks, scaling the parameters by any factor $\alpha > 0$ scales the predictions by $\alpha^L$. We note that a feedforward neural network with the ReLU activation (namely, $\sigma(z) = \max\{0, z\}$) is homogeneous if it does not contain skip-connections (i.e., residual connections), and does not have bias terms, except maybe for the first hidden layer. Also, we note that a homogeneous network might include convolutional layers. In papers by Lyu and Li [33] and by Ji and Telgarsky [25] (a similar result under stronger assumptions was previously established in [38]), it was shown that if gradient flow on homogeneous networks reaches a sufficiently small loss (at some time $t_0$), then as $t \to \infty$ it converges to zero loss, the parameters converge in direction, i.e., $\tilde{\theta} = \lim_{t \to \infty} \frac{\theta(t)}{\|\theta(t)\|}$ exists, and $\tilde{\theta}$ is biased towards margin maximization in the following sense. Consider the following margin maximization problem in parameter space:

$$\min_{\theta} \frac{1}{2} \|\theta\|^2 \quad \text{s.t.} \quad \forall i \in \{1, \ldots, n\}: \; y_i N_\theta(x_i) \geq 1.$$

Then, $\tilde{\theta}$ points in the direction of a first-order stationary point of the above optimization problem, which is also called a Karush–Kuhn–Tucker point, or KKT point for short. The KKT approach allows inequality constraints and is a generalization of the method of Lagrange multipliers, which allows only equality constraints.

A KKT point satisfies several conditions (called KKT conditions), and it was proved in [33] that in the case of homogeneous neural networks these conditions are necessary for optimality. However, they are not sufficient even for local optimality (cf. [59]). Thus, a KKT point may not be an actual optimum of the maximum-margin problem. It is analogous to showing that some unconstrained optimization problem converges to a point where the gradient is zero, without proving that it is a global/local minimum. Intuitively, convergence to a KKT point of the maximum-margin problem implies a certain bias towards margin maximization in parameter space, although it does not guarantee convergence to a maximum-margin solution.
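The homogeneity property used in this subsection is easy to check numerically. The sketch below builds a small bias-free fully-connected ReLU network (layer sizes and the scaling factor are arbitrary illustrative choices) and verifies that scaling all parameters by a factor scales the output by that factor raised to the number of layers.

```python
import numpy as np

def relu_network(params, x):
    """Bias-free feedforward ReLU network; the last entry of params is the output row vector."""
    h = x
    for W in params[:-1]:
        h = np.maximum(W @ h, 0.0)
    return float(params[-1] @ h)

rng = np.random.default_rng(0)
# Three weight layers, so the network is homogeneous of order 3.
params = [rng.standard_normal((5, 3)),
          rng.standard_normal((4, 5)),
          rng.standard_normal(4)]
x = rng.standard_normal(3)

alpha = 1.7
order = len(params)
original = relu_network(params, x)
scaled = relu_network([alpha * W for W in params], x)
print(scaled, alpha**order * original)
```

The identity relies on $\max\{0, \alpha z\} = \alpha \max\{0, z\}$ for $\alpha > 0$; adding bias terms in later layers or skip connections would break it.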

As we already discussed, in linear classifiers and linear neural networks, margin maximization (in predictor space) can explain generalization using well-known margin-based generalization bounds. Can margin maximization in parameter space explain generalization in nonlinear neural networks? In recent years several works showed margin-based generalization bounds for neural networks (e.g., [42, 5, 14, 60]). Hence, generalization in neural networks might be established by combining these results with results on the implicit bias towards margin maximization in parameter space. On the flip side, we note that it is still unclear how tight the known margin-based generalization bounds for neural networks are, and whether the sample complexity implied by such results (i.e., the required size of the training dataset) may be small enough to capture the situation in practice.

Several results on implicit bias were shown for some specific cases of nonlinear homogeneous networks. For example: [10] showed bias towards margin maximization w.r.t. a certain function norm (known as the variation norm) in infinite-width two-layer homogeneous networks; [34] proved margin maximization in two-layer Leaky-ReLU networks trained on linearly separable and symmetric data, and [50] proved convergence to a linear classifier in two-layer Leaky-ReLU networks under different assumptions; [49] showed bias towards minimizing the number of linear regions in univariate two-layer ReLU networks (see also [13, 51]).

Finally, the implicit bias in non-homogeneous neural networks (such as ResNets) is currently not well-understood. (We remark that a result by [38] suggests that for a sum of homogeneous networks of different orders, which is itself non-homogeneous, the implicit bias may encourage solutions that discard the networks with the smallest order.) Improving our knowledge of non-homogeneous networks is an important challenge in the path towards understanding implicit bias in deep learning.

3.4 Extensions

For simplicity, we considered so far only gradient flow in binary classification. We note that many of the above results can also be extended to other gradient methods (such as gradient descent, steepest descent and SGD), and to multiclass classification with the cross-entropy loss (see, e.g., [55, 33, 16, 40]).

The margin-maximization guarantees that we discussed for gradient flow hold in an asymptotic sense, and an important question is how fast the convergence is. It turns out that the convergence rate might be extremely slow [55, 39, 24, 26, 40, 28]. For example, when learning a linear predictor on a linearly separable training dataset using gradient descent, after $t$ iterations the distance between the normalized predictor and the maximum-margin predictor generally decays only at a rate of order $1/\log t$ (see [55] for a more precise statement), and hence to reach distance $\epsilon$ for some $\epsilon > 0$, the number of iterations must be exponential in $1/\epsilon$. We remark that Shamir [53] showed that the slow convergence rate to the maximum-margin predictor does not imply that $t$ must be extremely large in order to avoid overfitting. Namely, he proved that also for much smaller values of $t$, the predictor achieves a sufficiently large margin on a sufficiently large portion of the dataset, which implies good generalization properties.
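As a back-of-the-envelope illustration of why a logarithmic rate forces exponentially many iterations, suppose (hypothetically, with an unspecified constant $c > 0$ that is not taken from the cited works) that the distance after $t$ iterations behaves like $c / \log t$. Then:

```latex
\frac{c}{\log t} \le \epsilon
\quad\Longleftrightarrow\quad
\log t \ge \frac{c}{\epsilon}
\quad\Longleftrightarrow\quad
t \ge \exp\!\left(\frac{c}{\epsilon}\right).
```

For instance, with the illustrative values $c = 1$ and $\epsilon = 0.01$ this gives $t \geq e^{100} \approx 2.7 \times 10^{43}$ iterations.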

Moreover, so far we considered only training with the logistic loss, which is a common loss function for binary classification. The results can generally be extended to loss functions that have an exponential tail, but the implicit bias is different when using other loss functions (e.g., losses with a polynomial tail) [27, 26, 39].

Finally, all the results described so far do not depend on the initialization of gradient flow. For example, when training homogeneous networks, as the time $t$ tends to infinity, gradient flow converges to a KKT point of the maximum-margin problem (assuming that the loss reaches a sufficiently small value), regardless of the initialization. However, such an asymptotic analysis only provides a partial picture. The authors in [35] considered certain linear diagonal networks, and showed that if we train the network only until we reach some fixed accuracy, e.g., until the loss reaches some small fixed value (instead of considering $t \to \infty$), then the initialization plays a crucial role: If the initialization scale is large w.r.t. the considered accuracy (a setting which corresponds to the so-called kernel regime), then the implicit bias is given by the $\ell_2$ norm, rather than the quasi-norm that arises in the asymptotic setting studied so far. Thus, to understand implicit bias in practice, the asymptotic results may not suffice, and we may need to take the initialization of gradient flow into account.

4 Implicit bias in regression

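We turn to consider the implicit bias of gradient methods in regression tasks, namely, where the labels are in $\mathbb{R}$. We focus on gradient flow and on the square-loss function. Thus, we have $\frac{d\theta(t)}{dt} = -\nabla \mathcal{L}(\theta(t))$, where the empirical loss $\mathcal{L}$ is defined such that given a training dataset $\{(x_i, y_i)\}_{i=1}^n \subseteq \mathbb{R}^d \times \mathbb{R}$ and a neural network $N_\theta : \mathbb{R}^d \to \mathbb{R}$, we have $\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^n (N_\theta(x_i) - y_i)^2$. In an overparameterized setting, there might be many possible choices of $\theta$ such that $\mathcal{L}(\theta) = 0$, thus there are many global minima. Ideally, we want to find some implicit regularization function $\mathcal{R}$ such that gradient flow prefers global minima which minimize $\mathcal{R}$. That is, if gradient flow converges to some $\theta^*$ with $\mathcal{L}(\theta^*) = 0$, then we have $\theta^* \in \arg\min_\theta \mathcal{R}(\theta)$ s.t. $\mathcal{L}(\theta) = 0$.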

4.1 Linear regression

We start with linear regression, which is perhaps the simplest regression setting where the implicit bias of neural networks can be considered. Here, the model is a linear predictor $x \mapsto w^\top x$, where $w \in \mathbb{R}^d$ are the learned parameters. In this case, it is not hard to show that gradient flow converges to the global minimum of $\mathcal{L}(w)$ which is closest to the initialization $w(0)$ in $\ell_2$ distance [66, 16]. Thus, the implicit regularization function is $\mathcal{R}(w) = \|w - w(0)\|_2$. Minimizing this norm allows us to explain generalization using standard norm-based generalization bounds [52].

We note that the results on linear regression can be extended to optimization methods other than gradient flow, such as gradient descent, SGD, and mirror descent [16, 2]. (For mirror descent with a potential function $\psi$, the implicit bias is given by $\mathcal{R}(w) = D_\psi(w, w(0))$, where $D_\psi$ is the Bregman divergence w.r.t. $\psi$. In particular, if we start at $w(0) = \arg\min_w \psi(w)$ then we have $\mathcal{R}(w) = \psi(w)$.)
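The linear-regression result can be checked directly. The sketch below uses gradient descent with a small step (as a stand-in for gradient flow) on an underdetermined least-squares problem and compares the limit with the interpolating solution closest to the initialization, which has the closed form $w(0) + X^{+}(y - X w(0))$ with $X^{+}$ the pseudoinverse. The problem sizes and random data are illustrative, and the loss is scaled as $\frac{1}{2}\|Xw - y\|^2$, which only rescales time.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # fewer examples than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w0 = rng.standard_normal(d)       # arbitrary initialization

w = w0.copy()
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
for _ in range(20000):
    w -= step * X.T @ (X @ w - y)        # gradient of 0.5 * ||Xw - y||^2

# Projection of w0 onto the affine set {w : Xw = y}.
closest = w0 + np.linalg.pinv(X) @ (y - X @ w0)

print("training residual:", np.linalg.norm(X @ w - y))
print("distance to closest interpolant:", np.linalg.norm(w - closest))
```

The reason is that every gradient lies in the row space of `X`, so the component of `w - w0` orthogonal to that row space never changes.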

4.2 Linear networks

A linear neural network has the identity activation function and computes a linear predictor $x \mapsto w^\top x$, where $w$ is the vector obtained by multiplying the weight matrices in $\theta$. Hence, instead of seeking an implicit regularization function $\mathcal{R}(\theta)$ we may aim to find $\mathcal{R}(w)$. Similarly to our discussion in the context of classification, the network architectures in deep linear networks induce significantly different dynamics of gradient flow compared to the case of linear regression. Hence, the implicit bias in such networks has been studied as a first step towards understanding implicit bias in deep nonlinear networks.

Exact expressions for $\mathcal{R}(w)$ have been obtained for linear diagonal networks and linear fully-connected networks [3, 65, 63]. The expressions for $\mathcal{R}(w)$ are rather complicated, and depend on the initialization scale, i.e., the norm of $w(0)$, and the initialization “shape”, namely, the relative magnitudes of different weights and layers in the network. Below we discuss a few special cases.

Recall that diagonal networks are neural networks where the weight matrices are diagonal. The regularization function $\mathcal{R}(w)$ has been obtained for a few variants of diagonal networks. In [63] the authors considered networks of the form $x \mapsto (u_+ \circ u_+ - u_- \circ u_-)^\top x$, where $u_+, u_- \in \mathbb{R}^d$ and the operator $\circ$ denotes the Hadamard (entrywise) product. Thus, in this network, which can be viewed as a variant of a diagonal network, the weights in the two layers are shared. For an initialization $u_+(0) = u_-(0) = \alpha \cdot \mathbf{1}$, the scale $\alpha$ of the initialization does not affect $w(0) = 0$. However, the initialization scale does affect the implicit bias: if $\alpha \to 0$ then $\mathcal{R}(w) = \|w\|_1$ and if $\alpha \to \infty$ (which corresponds to the so-called kernel regime) then $\mathcal{R}(w) = \|w\|_2$. Thus, the implicit bias transitions between the $\ell_1$ norm and the $\ell_2$ norm as we increase the initialization scale. A similar result holds also for deeper networks. In two-layer “diagonal networks” with a similar structure, but where the weights are not shared, [3] showed that the implicit bias is affected by both the initialization scale and “shape”. In [65] the authors showed a transition between the $\ell_1$ and $\ell_2$ norms both in diagonal networks and in certain convolutional networks (in the frequency domain).
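The dependence on the initialization scale can be seen in a tiny example. The sketch below trains the shared-weight diagonal model with predictor $w = u_+ \circ u_+ - u_- \circ u_-$ using gradient descent on a single training example $x = (1, 2)$, $y = 1$. The example, step sizes and iteration count are assumptions chosen so that the two candidate solutions are easy to compute by hand: among all $w$ with $w^\top x = 1$, the minimum-$\ell_2$ solution is $(0.2, 0.4)$ and the minimum-$\ell_1$ solution is $(0, 0.5)$.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = 1.0

def train(alpha, steps=5000):
    """Gradient descent on 0.5 * (w.x - y)^2 with w = u_pos**2 - u_neg**2."""
    u_pos = alpha * np.ones(2)
    u_neg = alpha * np.ones(2)
    lr = 0.01 / max(1.0, alpha**2)   # smaller step for large scales, for stability
    for _ in range(steps):
        w = u_pos**2 - u_neg**2
        residual = w @ x - y
        grad_pos = 2.0 * residual * u_pos * x
        grad_neg = -2.0 * residual * u_neg * x
        u_pos, u_neg = u_pos - lr * grad_pos, u_neg - lr * grad_neg
    return u_pos**2 - u_neg**2

w_small = train(alpha=1e-3)   # small initialization scale
w_large = train(alpha=10.0)   # large initialization scale (kernel regime)
print("small init:", w_small)
print("large init:", w_large)
```

Both runs fit the training example, but the small-scale run ends near the sparse solution while the large-scale run ends near the minimum-$\ell_2$ solution, in line with the transition described above.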

In two-layer fully-connected linear networks with infinitesimally small initialization, gradient flow minimizes the $\ell_2$ norm, namely, $\mathcal{R}(w) = \|w\|_2$ [3, 65]. This result was shown also for deep fully-connected linear networks, under certain assumptions on the initialization [65].

4.3 Matrix factorization as a test-bed for neural networks

Towards the goal of understanding implicit bias in neural networks, much effort was directed at the matrix factorization (and more specifically, matrix sensing) problem. In this problem, we are given a set of observations about an unknown matrix $W^* \in \mathbb{R}^{d \times d}$ and we find a matrix $W = W_2 W_1$ with $W_1, W_2 \in \mathbb{R}^{d \times d}$ that is compatible with the given observations by running gradient flow. More precisely, consider the observations $\{(X_i, y_i)\}_{i=1}^n$ such that $y_i = \langle W^*, X_i \rangle$, where in the inner product we view $W^*$ and $X_i$ as vectors (namely, $\langle W^*, X_i \rangle = \mathrm{trace}((W^*)^\top X_i)$). Then, we find $W = W_2 W_1$ using gradient flow on the objective $\mathcal{L}(W_1, W_2) = \frac{1}{n} \sum_{i=1}^n (\langle W_2 W_1, X_i \rangle - y_i)^2$. Since $W_1, W_2 \in \mathbb{R}^{d \times d}$, the rank of $W$ is not limited by the parameterization. However, the parameterization has a crucial effect on the implicit bias. Assuming that there are many matrices $W$ that are compatible with the observations, the implicit bias controls which matrix gradient flow finds. A notable special case of matrix sensing is matrix completion. Here, every observation has the form $X_i = e_p e_q^\top$, where $e_1, \ldots, e_d$ is the standard basis. Thus, each observation reveals a single entry from the matrix $W^*$.

The matrix factorization problem is analogous to training a two-layer linear network. Furthermore, we may consider deep matrix factorization where we have $W = W_k W_{k-1} \cdots W_1$, which is analogous to training a deep linear network. Therefore, this problem is considered a natural test-bed to investigate implicit bias in neural networks, and has been studied extensively in recent years (e.g., [15, 30, 1, 46, 6, 31]).
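The setup of matrix sensing and its matrix-completion special case can be written out in a few lines. This is only a sketch of the problem definition: the function names are hypothetical, the squared-error objective is averaged over observations (the normalization is an assumption), and no training is performed.

```python
import numpy as np

def observe(W, X):
    """Inner product <W, X> with both matrices viewed as vectors, i.e. trace(W^T X)."""
    return float(np.sum(W * X))

def completion_measurement(d, p, q):
    """Measurement matrix e_p e_q^T that reveals entry (p, q)."""
    X = np.zeros((d, d))
    X[p, q] = 1.0
    return X

def factorization_loss(W1, W2, measurements, targets):
    """Mean squared error of the product W2 @ W1 on the observations."""
    W = W2 @ W1
    predictions = np.array([observe(W, X) for X in measurements])
    return float(np.mean((predictions - targets) ** 2))

rng = np.random.default_rng(0)
d = 4
W_star = rng.standard_normal((d, d))      # unknown matrix (illustrative)
entries = [(0, 1), (2, 3), (3, 0)]        # observed positions (illustrative)
measurements = [completion_measurement(d, p, q) for p, q in entries]
targets = np.array([observe(W_star, X) for X in measurements])

# Any factorization of W_star attains zero loss, e.g. W1 = I, W2 = W_star.
loss_at_truth = factorization_loss(np.eye(d), W_star, measurements, targets)
print(targets, loss_at_truth)
```

With only three observed entries, many other products `W2 @ W1` also reach zero loss, which is exactly why the implicit bias of the training algorithm decides which matrix is returned.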

Gunasekar et al. [15] conjectured that in matrix factorization, the implicit bias of gradient flow starting from small initialization is given by the nuclear norm of $W = W_2 W_1$ (a.k.a. the trace norm). That is, $\mathcal{R}(W) = \|W\|_*$. They also proved this conjecture in the restricted case where the matrices $\{X_i\}_{i=1}^n$ commute. The conjecture was further studied in a line of works (e.g., [30, 6, 1, 46]) providing positive and negative evidence, and was formally refuted by [31]. In [46] the authors showed that gradient flow in matrix completion might approach a global minimum at infinity rather than converging to a global minimum with a finite norm. This result suggests that the implicit bias in matrix factorization may not be expressible by any norm or quasi-norm. They conjectured that the implicit bias can be explained by rank minimization rather than norm minimization. In [31], the authors provided theoretical and empirical evidence that gradient flow with infinitesimally small initialization in the matrix-factorization problem is mathematically equivalent to a simple heuristic rank-minimization algorithm called Greedy Low-Rank Learning (see also [21]). This result was generalized to tensor factorization in [47, 48].

Overall, although an explicit expression for the function $\mathcal{R}(W)$ is not known in the case of matrix factorization, we can view the implicit bias as a heuristic for rank minimization. It is still unclear what the implications of these results are for more practical nonlinear neural networks.

4.4 Nonlinear networks

Once we consider nonlinear networks the situation is even less clear. For a single-neuron network, namely, $x \mapsto \sigma(w^\top x)$, where $\sigma : \mathbb{R} \to \mathbb{R}$ is strictly monotonic, gradient flow converges to the closest global minimum in $\ell_2$ distance, similarly to the case of linear regression. Thus, we have $\mathcal{R}(w) = \|w - w(0)\|_2$. However, if $\sigma$ is the ReLU activation then the implicit bias is not expressible by any non-trivial function of $w$. A bit more precisely, suppose that $\mathcal{R} : \mathbb{R}^d \to \mathbb{R}$ is such that if gradient flow starting from $w(0) = 0$ converges to a zero-loss solution $w^*$, then $w^*$ is a zero-loss solution that minimizes $\mathcal{R}$. Then, such a function $\mathcal{R}$ must be constant in $\mathbb{R}^d \setminus \{0\}$. Hence, the approach of precisely specifying the implicit bias of gradient flow via a regularization function is not feasible in single-neuron ReLU networks [58]. It suggests that this approach may not be useful for understanding implicit bias in the general case of ReLU networks.

On the positive side, in single-neuron ReLU networks the implicit bias can be expressed approximately, within a constant factor, by the $\ell_2$ norm. Namely, let $w^*$ be a zero-loss solution with a minimal $\ell_2$ norm, and assume that gradient flow converges to some zero-loss solution $w(\infty)$; then $\|w(\infty)\|$ exceeds $\|w^*\|$ by at most that constant factor [58].

For two-layer Leaky-ReLU networks with a single hidden neuron, an explicit expression for $\mathcal{R}(\theta)$ was obtained in [3]. However, if we replace the Leaky-ReLU activation with ReLU, then the implicit bias is not expressible by any non-trivial function, similarly to the case of a single-neuron ReLU network [58].

The results on implicit bias in matrix factorization might suggest that there is a certain tendency towards rank minimization in deep learning with the square loss. However, the authors in [57] showed that gradient flow is not biased towards rank minimization of the weight matrices in ReLU networks, at least in the simple case of two-layer networks and small datasets. (We remark that [57] showed a certain tendency towards rank minimization in deep networks trained with the square loss using weight decay, or with the logistic loss.)

Overall, our understanding of implicit bias in nonlinear networks with regression losses is very limited. While in classification we have a useful characterization of the implicit bias in nonlinear homogeneous models, here we do not understand even extremely simple models. Improving our understanding of this question is a major challenge for the upcoming years.

5 Additional implicit biases

In the previous sections, we mostly focused on implicit bias in gradient flow. We also discussed cases where the results on gradient flow can be extended to other gradient methods. However, when considering gradient descent or SGD with a finite step size (rather than infinitesimal), the discrete nature of the algorithms and the stochasticity induce additional implicit biases that do not exist in gradient flow. These biases are crucial for understanding the behavior of gradient methods in practice, as empirical results suggest that increasing the step size and decreasing the batch size may improve generalization (cf. [29, 22]).

It is well known that gradient descent and SGD cannot stably converge to minima that are too sharp relative to the step size [11, 37, 64, 36, 41]. Namely, in a stable minimum, the maximal eigenvalue of the Hessian is at most $2/\eta$, where $\eta$ is the step size. As a result, when using an appropriate step size we can rule out stable convergence to certain (sharp) minima, and thus encourage convergence to flat minima, which are believed to generalize better [29, 64]. Similarly, small batch sizes in SGD also encourage flat minima [29, 64].
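The stability threshold can be seen on a one-dimensional quadratic loss $\frac{\lambda}{2}\theta^2$, whose Hessian is the scalar $\lambda$. Gradient descent with step size $\eta$ multiplies $\theta$ by $1 - \eta\lambda$ at every step, so it contracts only when $\lambda < 2/\eta$. The sketch below uses illustrative values for the step size and the two curvatures.

```python
def run_gd(curvature, step_size, theta0=1.0, steps=200):
    """Gradient descent on 0.5 * curvature * theta**2; returns |theta| at the end."""
    theta = theta0
    for _ in range(steps):
        theta -= step_size * curvature * theta
    return abs(theta)

step_size = 0.1
threshold = 2.0 / step_size                         # sharpness limit for this step size
flat = run_gd(curvature=15.0, step_size=step_size)   # curvature below the threshold
sharp = run_gd(curvature=25.0, step_size=step_size)  # curvature above the threshold
print(f"threshold={threshold}, flat minimum: {flat:.2e}, sharp minimum: {sharp:.2e}")
```

The iterate converges at the flatter minimum and diverges at the sharper one, even though both are minima of their respective losses.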

Another direction for understanding implicit bias in gradient descent and SGD, is by transforming it into gradient flow on a modified loss. In [4, 54], the authors showed that the discrete iterates of gradient descent and SGD with small step size lie close to the path of gradient flow on certain modified losses, obtained by adding explicit regularizers (see also [44]).

The implicit bias of SGD is also affected when training with label noise, namely, when the training labels are perturbed by independent noise at each iteration [9, 19, 12, 32, 45]. For example, in [19, 45] it is shown that when training a linear diagonal network using SGD with label noise, there is a bias towards sparse solutions even when the initialization scale is large. This is in contrast to noise-free training, where such initializations lead to implicit $\ell_2$ regularization.

Finally, we note that in this article we mostly considered implicit bias in the rich regime, rather than in the NTK regime [20] (a.k.a. kernel regime). The NTK regime does not capture feature learning, which seems to be a crucial element in the success of deep learning. The implicit bias in the NTK regime has been studied in several works (see, e.g., [62, 8]), but we leave the discussion on these results outside the scope of this review.

6 Implications beyond generalization

While the primary motivation for studying implicit bias is to better understand generalization in deep learning, it may also have other significant implications. Indeed, various phenomena may stem from the tendency of gradient-based algorithms to prefer specific solutions over others. We believe that exploring such implications is an exciting research direction, and demonstrate it below with two examples.

First, neural networks are known to be extremely vulnerable to adversarial examples [56], namely, small perturbations to the inputs might change the network’s predictions. This phenomenon has been widely studied, but it is still not well-understood. Specifically, it is unclear why gradient methods tend to learn non-robust networks, namely, networks that are susceptible to adversarial examples, even in cases where robust networks exist. The tendency of gradient methods to learn non-robust networks can be viewed as an implication of the implicit bias. See [61] for some results in this direction.

Second, the implicit bias might shed light on the hidden representations learned by neural networks, and on the extent to which neural networks are susceptible to privacy attacks. In [18], the authors used the characterization of the implicit bias in homogeneous networks due to [33, 25], and showed that training data can be reconstructed from trained networks. Thus, neural networks “memorize” training data, and by using known properties of the implicit bias the data may be reconstructed, which might have negative implications on privacy.

7 Conclusion

Deep-learning algorithms exhibit remarkable performance, but it is not well-understood why they are able to generalize despite having many more parameters than training examples. Exploring generalization in deep learning is an intriguing puzzle with both theoretical and practical implications. It is believed that implicit bias is a main ingredient in the ability of deep-learning algorithms to generalize, and hence it has been widely studied. Moreover, investigating the implicit bias might have important implications beyond generalization.

Our understanding of implicit bias improved dramatically in recent years, but is still far from satisfactory. We believe that further progress in the study of implicit bias will eventually shed light on the mystery of generalization in deep learning.


I thank Nadav Cohen, Noam Razin, Ohad Shamir and Daniel Soudry for valuable comments and discussions.