1 Introduction
A central puzzle in the theory of deep learning is how neural networks generalize even when trained without any explicit regularization, and when there are far more learnable parameters than training examples. In such an underdetermined optimization problem, there are many global minima with zero training loss, and gradient descent seems to prefer solutions that generalize well (see
zhang2017understanding). Hence, it is believed that gradient descent induces an implicit regularization (or implicit bias) (neyshabur2015search; neyshabur2017exploring), and characterizing this regularization/bias has been a subject of extensive research.

Several works in recent years studied the relationship between the implicit regularization in linear neural networks and rank minimization. A main focus is on the matrix factorization problem, which corresponds to training a depth-2 linear neural network with multiple outputs w.r.t. the square loss, and is considered a standard test-bed for studying implicit regularization in deep learning. gunasekar2018implicit initially conjectured that the implicit regularization in matrix factorization can be characterized by the nuclear norm of the corresponding linear predictor. This conjecture was further studied in a string of works (e.g., belabbas2020implicit; arora2019implicit; razin2020implicit) and was formally refuted by li2020towards. razin2020implicit conjectured that the implicit regularization in matrix factorization can be explained by rank minimization, and also hypothesized that some notion of rank minimization may be key to explaining generalization in deep learning. li2020towards established evidence that the implicit regularization in matrix factorization is a heuristic for rank minimization. razin2021implicit
studied implicit regularization in tensor factorization (a generalization of matrix factorization). They demonstrated, both theoretically and empirically, implicit bias towards low-rank tensors. Going beyond factorization problems,
ji2018gradient; ji2020directional showed that in linear networks of output dimension $1$, gradient flow (GF) w.r.t. exponentially-tailed classification losses converges to networks where the weight matrix of every layer is of rank $1$.

However, once we move to nonlinear neural networks (which are by far more common in practice), things are less clear. Empirically, a series of works studying neural network compression (cf. denton2014exploiting; yu2017compressing; alvarez2017compression; arora2018stronger; tukan2020compressed) showed that replacing the weight matrices by low-rank approximations results in only a small drop in accuracy. This suggests that the weight matrices in practice are not too far from being low-rank. However, whether they provably behave this way remains unclear.
In this work we consider fully-connected nonlinear networks employing the popular ReLU activation function, and study whether GF is biased towards networks where the weight matrices have low ranks. On the negative side, we show that already for small (depth-2 and width-2) ReLU networks, there is no rank-minimization bias in a rather strong sense. On the positive side, for deeper and possibly wider overparameterized networks, we identify reasonable settings where GF is biased towards low-rank solutions. In more detail, our contributions are as follows:
- We begin by considering depth-2, width-2 ReLU networks with multiple outputs, trained with the square loss. li2020towards gave evidence that in linear networks with the same architecture, the implicit bias of GF can be characterized as a heuristic for rank minimization. In contrast, we show that in ReLU networks the situation is quite different: GF does not converge to a low-rank solution already for the simple case of datasets of size $2$, $\{(\mathbf{x}_1,\mathbf{y}_1),(\mathbf{x}_2,\mathbf{y}_2)\}$, whenever the angle between $\mathbf{x}_1$ and $\mathbf{x}_2$ is in $(\pi/2,\pi)$ and $\mathbf{y}_1,\mathbf{y}_2$ are linearly independent. Thus, rank minimization does not occur even if we just consider “most” datasets of this size. Moreover, we show that with at least constant probability, the solutions that GF converges to are not even close to having low rank, under any reasonable approximate-rank metric. We also demonstrate these results empirically.
- Next, for ReLU networks that are overparameterized in terms of depth (and possibly also in width), we identify interesting settings in which GF is biased towards low ranks:
  - First, we consider ReLU networks trained w.r.t. the square loss. We show that for sufficiently deep networks, if GF converges to a network that attains zero loss and minimizes the $\ell_2$ norm of the parameters, then the average ratio between the spectral and the Frobenius norms of the weight matrices is close to $1$. Since the squared inverse of this ratio is the stable rank (a continuous approximation of the rank, which equals $1$ if and only if the matrix has rank $1$), the result implies a bias towards low ranks. While GF in ReLU networks w.r.t. the square loss is not known to be biased towards solutions that minimize the $\ell_2$ norm, in practice it is common to use explicit regularization, which encourages norm minimization. Thus, our result suggests that GF in deep networks trained with the square loss and explicit regularization encourages rank minimization.
  - Shifting our attention to binary classification problems, we consider ReLU networks trained with exponentially-tailed classification losses. By lyu2019gradient, GF in such networks is biased towards networks that maximize the margin. We show that for sufficiently deep networks, maximizing the margin implies rank minimization, where the rank is measured by the ratio between the spectral and the Frobenius norms, as in the former case.
Additional Related Work
The implicit regularization in matrix factorization and linear neural networks with the square loss was extensively studied, as a first step toward understanding implicit regularization in more complex models (see, e.g., gunasekar2018implicit; razin2020implicit; arora2019implicit; belabbas2020implicit; eftekhari2020implicit; li2018algorithmic; ma2018implicit; woodworth2020kernel; gidel2019implicit; li2020towards; yun2020unifying; azulay2021implicit; razin2021implicit). As we already discussed, some of these works showed bias toward low ranks.
The implicit regularization in nonlinear neural networks with the square loss was studied in several works. oymak2019overparameterized showed that under some assumptions, gradient descent in certain nonlinear models is guaranteed to converge to a zero-loss solution with a bounded norm. williams2019gradient and jin2020implicit studied the dynamics and implicit bias of gradient descent in wide depth-2 ReLU networks with input dimension $1$. vardi2021implicit and azulay2021implicit
studied the implicit regularization in single-neuron networks. In particular,
vardi2021implicit showed that in single-neuron networks and single-hidden-neuron networks with the ReLU activation, the implicit regularization cannot be expressed by any non-trivial regularization function. Namely, there is no non-trivial function $R(\theta)$ of the parameters $\theta$ such that if GF with the square loss converges to a global minimum, then it is a global minimum that minimizes $R$. However, this negative result does not imply that GF is not implicitly biased towards low-rank solutions, for two reasons. First, bias toward low ranks would not have implications in the case of the width-$1$ networks that these authors studied, and hence it would not contradict their negative result. Second, their result rules out the existence of a non-trivial regularization function which expresses the implicit bias for all possible datasets and initializations, but it does not rule out the possibility that GF acts as a heuristic for rank minimization, in the sense that it minimizes the ranks for “most” datasets and initializations.

The implicit bias of neural networks in classification tasks was also widely studied in recent years. soudry2018implicit showed that gradient descent on linearly-separable binary classification problems with exponentially-tailed losses converges to the maximum-margin direction. This analysis was extended to other loss functions, tighter convergence rates, non-separable data, and variants of gradient-based optimization algorithms (nacson2019convergence; ji2018risk; ji2020gradient; gunasekar2018characterizing; shamir2021gradient; ji2021characterizing). lyu2019gradient and ji2020directional showed that GF on homogeneous neural networks with exponentially-tailed losses converges in direction to a KKT point of the maximum-margin problem in the parameter space. Similar results under stronger assumptions were previously obtained in nacson2019lexicographic; gunasekar2018bimplicit. vardi2021margin studied in which settings this KKT point is guaranteed to be a global/local optimum of the maximum-margin problem. The implicit bias in fully-connected linear networks was studied by ji2020directional; ji2018gradient; gunasekar2018bimplicit. As already mentioned, these results imply that GF minimizes the ranks of the weight matrices in linear fully-connected networks. The implicit bias in diagonal and convolutional linear networks was studied in gunasekar2018bimplicit; moroshko2020implicit; yun2020unifying. The implicit bias in infinitely-wide two-layer homogeneous neural networks was studied in chizat2020implicit.

Organization. In Sec. 2 we provide necessary notations and definitions. In Sec. 3 we state our negative results for depth-2 networks. In Sec. 4 and 5 we state our positive results for deep ReLU networks. In Sec. 6 we describe the ideas for the proofs of the main theorems, with all formal proofs deferred to the appendix.
2 Preliminaries
Notations.
We use boldface letters to denote vectors. For a vector $\mathbf{x}$ we denote by $\|\mathbf{x}\|$ its Euclidean norm. For a matrix $W$ we denote by $\|W\|_F$ its Frobenius norm and by $\|W\|_2$ its spectral norm. For an integer $k \ge 1$ we denote $[k] = \{1,\dots,k\}$. The angle between a pair of nonzero vectors $\mathbf{u},\mathbf{v}$ is $\theta(\mathbf{u},\mathbf{v}) = \arccos\big(\langle\mathbf{u},\mathbf{v}\rangle / (\|\mathbf{u}\|\cdot\|\mathbf{v}\|)\big)$. The unit sphere in $\mathbb{R}^d$ is $\mathbb{S}^{d-1} = \{\mathbf{x}\in\mathbb{R}^d : \|\mathbf{x}\|=1\}$. An open ball of radius $r>0$ centered at the origin is denoted by $B_r$. The closure of a set $A$, denoted $\mathrm{cl}(A)$, is the intersection of all closed sets containing $A$. The boundary of $A$ is $\partial A = \mathrm{cl}(A)\setminus\mathrm{int}(A)$.
Neural networks.
A fully-connected neural network of depth $k$ is parameterized by a collection $\theta = [W_1,\dots,W_k]$ of weight matrices, such that for every layer $j \in [k]$ we have $W_j \in \mathbb{R}^{d_j \times d_{j-1}}$. Thus, $d_j$ denotes the number of neurons in the $j$-th layer, i.e., the width of the layer. We denote by $d = d_0$ and $d_k$ the input and output dimensions. The neurons in layers $1,\dots,k-1$ are called hidden neurons. A fully-connected network computes a function $N_{\theta} : \mathbb{R}^{d} \to \mathbb{R}^{d_k}$ defined recursively as follows. For an input $\mathbf{x}$ we set $\mathbf{h}_0 = \mathbf{x}$, and define for every $j \in [k-1]$ the input to the $j$-th layer as $\mathbf{z}_j = W_j \mathbf{h}_{j-1}$, and the output of the $j$-th layer as $\mathbf{h}_j = \sigma(\mathbf{z}_j)$, where $\sigma$ is an activation function that acts coordinate-wise on vectors. In this work we focus on the ReLU activation $\sigma(z) = \max\{0,z\}$. Finally, we define $N_{\theta}(\mathbf{x}) = W_k \mathbf{h}_{k-1}$. Thus, there is no activation in the output neurons. The width of the network is the maximal width of its layers. We sometimes apply the activation function also on matrices, in which case it acts entry-wise. The parameters $\theta$ of the neural network are given by a collection of matrices, but we often view $\theta$ as the vector obtained by concatenating the matrices in the collection. Thus, $\|\theta\|$ denotes the $\ell_2$ norm of this vector.
We often consider depth-2 networks. For matrices $W_1 \in \mathbb{R}^{m\times d}$ and $W_2 \in \mathbb{R}^{d'\times m}$ we denote by $N_{W_1,W_2}$ the depth-2 ReLU network $N_{W_1,W_2}(\mathbf{x}) = W_2\,\sigma(W_1\mathbf{x})$. We denote by $\mathbf{w}_1,\dots,\mathbf{w}_m$ the rows of $W_1$, namely, the incoming weight vectors to the neurons in the hidden layer, and by $\mathbf{u}_1,\dots,\mathbf{u}_m$ the columns of $W_2$, namely, the outgoing weight vectors from the neurons in the hidden layer.
Let $\mathbf{x}_1,\dots,\mathbf{x}_n \in \mathbb{R}^{d}$ be inputs, and let $X \in \mathbb{R}^{d\times n}$ be a matrix whose columns are $\mathbf{x}_1,\dots,\mathbf{x}_n$. We denote by $N_{W_1,W_2}(X)$ the matrix whose $i$-th column is $N_{W_1,W_2}(\mathbf{x}_i)$.
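For concreteness, the following minimal sketch (illustrative code of our own, with our own function and variable names) implements the forward pass defined above; ReLU is applied after every layer except the last.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(weights, x):
    """Forward pass of a fully-connected ReLU network.

    weights: list [W_1, ..., W_k] with W_j of shape (d_j, d_{j-1}).
    x: input vector of shape (d_0,).
    ReLU is applied after every layer except the last one.
    """
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h  # no activation at the output layer

# Example: a depth-2 network of width 2 with 2 outputs (values are arbitrary).
W1 = np.array([[1.0, -0.5], [0.3, 0.8]])
W2 = np.array([[0.2, -1.0], [0.7, 0.4]])
x = np.array([0.6, -0.8])
print(forward([W1, W2], x))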
Optimization problem and gradient flow (GF).
Let $\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^{n} \subseteq \mathbb{R}^{d}\times\mathbb{R}^{d'}$ be a training dataset. We often represent the dataset by matrices $X$ and $Y$, whose columns are the inputs and the labels, respectively. For a neural network $N_{\theta}$ we consider empirical-loss minimization w.r.t. the square loss. Thus, the objective is given by:
$\mathcal{L}(\theta) \;=\; \frac{1}{2}\sum_{i=1}^{n} \big\| N_{\theta}(\mathbf{x}_i) - \mathbf{y}_i \big\|^2~. \qquad (1)$
We assume that the data is realizable, that is, $\min_{\theta} \mathcal{L}(\theta) = 0$. Moreover, we focus on settings where the network is overparameterized, in the sense that $\mathcal{L}$ has multiple (or even infinitely many) global minima.
We consider gradient flow (GF) on the objective given in Eq. (1). This setting captures the behavior of gradient descent with an infinitesimally small step size. Let $\theta(t)$ be the trajectory of GF. Starting from an initial point $\theta(0)$, the dynamics of $\theta(t)$ is given by the differential equation $\frac{d\theta(t)}{dt} = -\nabla\mathcal{L}(\theta(t))$. Note that the ReLU function is not differentiable at $0$. Practical implementations of gradient methods define the derivative $\sigma'(0)$ to be some constant in $[0,1]$. In this work we assume for convenience that $\sigma'(0) = 0$. We say that GF converges if $\lim_{t\to\infty}\theta(t)$ exists. In this case, we denote $\theta(\infty) = \lim_{t\to\infty}\theta(t)$.
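As a concrete illustration of this setup, GF can be simulated by full-batch gradient descent with a small step size. The following sketch (ours; the data, initialization, step size, and number of steps are arbitrary placeholders) does this for a depth-2 ReLU network with the square loss and the convention $\sigma'(0)=0$.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def square_loss_and_grads(W1, W2, X, Y):
    """Square loss of the depth-2 ReLU net x -> W2 relu(W1 x) and its gradients.
    X has inputs as columns, Y has targets as columns; sigma'(0) is taken to be 0."""
    Z = W1 @ X
    H = relu(Z)
    R = W2 @ H - Y                     # residuals
    loss = 0.5 * np.sum(R ** 2)
    G2 = R @ H.T                       # dL/dW2
    G1 = ((W2.T @ R) * (Z > 0)) @ X.T  # dL/dW1, derivative 0 where z <= 0
    return loss, G1, G2

# Tiny illustration with arbitrary data and a small step size (placeholders).
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 2)); Y = rng.standard_normal((2, 2))
W1 = 0.1 * rng.standard_normal((2, 2)); W2 = 0.1 * rng.standard_normal((2, 2))
eta = 1e-2                             # small step size to approximate gradient flow
for _ in range(10000):
    loss, G1, G2 = square_loss_and_grads(W1, W2, X, Y)
    W1 -= eta * G1
    W2 -= eta * G2
print(loss)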
3 Gradient flow does not even approximately minimize ranks
In this section we consider rank minimization in depth-2 networks trained with the square loss. We show that even for the simple case of size-2 datasets, under mild assumptions, GF does not converge to a minimum-rank solution, even approximately.
In what follows, we consider ReLU networks with vector-valued outputs, since for linear networks with the same architecture it was shown that GF can be viewed as a heuristic for rank minimization (cf. li2020towards; razin2020implicit). Specifically, let $\{(\mathbf{x}_1,\mathbf{y}_1),(\mathbf{x}_2,\mathbf{y}_2)\}$ be a training dataset, and let $W_1, W_2$ be weight matrices such that $N(\mathbf{x}) = W_2\,\sigma(W_1\mathbf{x})$ is a zero-loss solution. Note that if $\mathbf{y}_1, \mathbf{y}_2$ are linearly independent then we must have $\mathrm{rank}(W_2) = 2$: Indeed, by the definition of a zero-loss solution, we necessarily have $W_2\,\sigma(W_1 X) = Y$, and $Y$ has rank $2$. Therefore, to understand rank minimization in this simple setting, we consider the rank of $W_1$ in a zero-loss solution. Trivially, $\mathrm{rank}(W_1) \le 2$, so $W_1$ can be considered low-rank only if $\mathrm{rank}(W_1) = 1$.
To make the setting non-trivial, we need to show that such low-rank zero-loss solutions exist at all. The following theorem shows that this is true for almost all size-2 datasets:
Theorem 1.
Given any labeled dataset $\{(\mathbf{x}_1,\mathbf{y}_1),(\mathbf{x}_2,\mathbf{y}_2)\}$ of two inputs with a strictly positive angle between them, i.e., $\theta(\mathbf{x}_1,\mathbf{x}_2) > 0$, there exists a zero-loss solution $N(\mathbf{x}) = W_2\,\sigma(W_1\mathbf{x})$ with the architecture above, such that $\mathrm{rank}(W_1) = 1$.
The theorem follows by constructing a network where the weight vectors of the neurons in the first layer have opposite directions (and hence the weight matrix is of rank $1$), such that each neuron is active on exactly one of the two inputs. Then, it is possible to show that for an appropriate choice of the weights in the second layer the network achieves zero loss. See Appendix A for the formal proof.
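The following sketch (ours) checks one concrete instantiation of such a construction numerically; the particular choice of the first-layer weight vector is an assumption on our part, made so that each hidden neuron is active on exactly one input, and the second layer is then obtained via a pseudo-inverse as in Lemma 2 of Appendix A.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Two unit-norm inputs with angle in (pi/2, pi), and unit-norm targets (example values).
x1 = np.array([1.0, 0.0])
x2 = np.array([-0.6, 0.8])
y1 = np.array([0.8, 0.6]); y2 = np.array([0.0, 1.0])

w = x1 / np.linalg.norm(x1) - x2 / np.linalg.norm(x2)
W1 = np.stack([w, -w])               # opposite rows, hence rank(W1) = 1
X = np.stack([x1, x2], axis=1)
Y = np.stack([y1, y2], axis=1)

H = relu(W1 @ X)                     # each neuron is active on exactly one input
W2 = Y @ np.linalg.pinv(H)           # second layer solving W2 H = Y

print(np.linalg.matrix_rank(W1))                  # 1
print(np.linalg.norm(W2 @ relu(W1 @ X) - Y))      # ~0: zero loss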
Thm. 1 implies that zero-loss solutions in which the first layer has rank $1$ exist. However, we now show that GF does not converge to such solutions. We prove this result under the following assumptions:
Assumption 1.
The two target vectors $\mathbf{y}_1, \mathbf{y}_2$ are on the unit sphere and are linearly independent.
Assumption 2.
The two inputs $\mathbf{x}_1, \mathbf{x}_2$ are on the unit sphere, and satisfy $\theta(\mathbf{x}_1,\mathbf{x}_2) \in \left(\frac{\pi}{2}, \pi\right)$.
The assumptions that $\mathbf{x}_1, \mathbf{x}_2, \mathbf{y}_1, \mathbf{y}_2$ are of unit norm are mostly for technical convenience, and we believe that they are not essential.
Then, we have:
Theorem 2.
By the above theorem, GF does not minimize the rank even in a very simple setting where the dataset contains two inputs with angle larger than $\pi/2$ (as long as the initialization point is sufficiently close to $0$). In particular, if the dataset is drawn from the uniform distribution on the sphere, then this condition holds with probability $1/2$.

While Thm. 2 shows that GF does not minimize the rank, it does not rule out the possibility that it converges to a solution which is close to a low-rank solution. There are many ways to define such closeness, such as the ratio of the Frobenius and spectral norms, the Frobenius distance from a low-rank solution, or the exponential of the entropy of the singular values (cf. rudelson2007sampling; sanyal2019stable; razin2020implicit; roy2007effective). However, for matrices with two rows they all boil down to either having the two rows of the matrix nearly aligned, or having at least one of them very small (at least compared to the other). In the following theorem, we show that under the assumptions stated above, for any fixed dataset, with at least constant probability, GF converges to a zero-loss solution where the two row vectors are bounded away from $\mathbf{0}$, the ratio of their norms is bounded, and the angle between them is bounded away from $0$ and from $\pi$ (all by explicit constants that depend just on the dataset and are large in general). Thus, with at least constant probability, GF does not minimize any reasonable approximate notion of rank.

Theorem 3.
Let $\{(\mathbf{x}_1,\mathbf{y}_1),(\mathbf{x}_2,\mathbf{y}_2)\}$ be a labeled dataset that satisfies Assumptions 1 and 2. Consider GF w.r.t. the loss function from Eq. (1). Suppose that the outgoing vectors are initialized to zero, i.e., $\mathbf{u}_i(0) = \mathbf{0}$ for all $i$, and that the rows $\mathbf{w}_1(0), \mathbf{w}_2(0)$ of $W_1$ are drawn from a spherically symmetric distribution with a sufficiently small norm.

Let $E$ be the event that GF converges to a zero-loss solution such that:

- the norms $\|\mathbf{w}_1\|, \|\mathbf{w}_2\|$ are bounded away from $0$ and the ratio between them is bounded, and
- the angle $\theta(\mathbf{w}_1,\mathbf{w}_2)$ is bounded away from $0$ and from $\pi$,

where all bounds are explicit constants that depend only on the dataset. Then, the event $E$ occurs with at least constant probability.
We note that in Thm. 3 the weights in the second layer are initialized to zero, while in Thm. 2 the assumption on the initialization is weaker. This difference is for technical convenience, and we believe that Thm. 3 should hold also under weaker assumptions on the initialization, as the next empirical result demonstrates.
3.1 An empirical result
Our theorems imply that for standard initialization schemes, with some positive probability GF will not converge to (even approximately) low-rank solutions. We now present a simple experiment that corroborates this and suggests that, furthermore, this holds with high probability.
Specifically, we trained ReLU networks in the same setup as in the previous section (i.e., with two weight matrices $W_1, W_2$) on two fixed data points in $\mathbb{R}^2$, constructed from the standard basis vectors and normalized to have unit norm. At initialization, every row of $W_1$ and every column of $W_2$ is sampled uniformly at random from a sphere of small radius around the origin. To simulate GF, we performed many epochs of full-batch gradient descent with a small step size, w.r.t. the square loss. Among repeated runs of this experiment, we consider the runs that converged to negligible loss. In Fig. 1, we plot a histogram of the stable (numerical) ranks of the resulting weight matrices, i.e., the ratio $\|W_j\|_F^2 / \|W_j\|_2^2$ for each layer $j$. The figure clearly suggests that whenever convergence to zero loss occurs, the solutions are all of full rank $2$, and none are even close to being low-rank (in terms of the stable rank).
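A minimal version of this experiment can be sketched as follows (illustrative code of ours; the data, initialization scale, step size, number of steps, number of repeats, and convergence threshold are placeholders and need not match the values used for Fig. 1).

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def stable_rank(W):
    # Frobenius norm squared over spectral norm squared.
    return np.linalg.norm(W, 'fro') ** 2 / np.linalg.norm(W, 2) ** 2

rng = np.random.default_rng(0)
X = np.eye(2)                                   # inputs (placeholder)
Y = np.array([[1.0, 2.0], [2.0, 1.0]])
Y /= np.linalg.norm(Y, axis=0)                  # unit-norm targets (placeholder)

ranks = []
for _ in range(50):                             # number of repeats (placeholder)
    W1 = rng.standard_normal((2, 2)); W1 /= np.linalg.norm(W1, axis=1, keepdims=True)
    W2 = rng.standard_normal((2, 2)); W2 /= np.linalg.norm(W2, axis=0, keepdims=True)
    W1 *= 0.1; W2 *= 0.1                        # small initialization radius (placeholder)
    for _ in range(20000):                      # full-batch GD steps (placeholder)
        Z = W1 @ X; H = relu(Z); R = W2 @ H - Y
        G2 = R @ H.T
        G1 = ((W2.T @ R) * (Z > 0)) @ X.T
        W1 -= 1e-2 * G1; W2 -= 1e-2 * G2        # step size (placeholder)
    if 0.5 * np.sum((W2 @ relu(W1 @ X) - Y) ** 2) < 1e-6:   # negligible-loss threshold
        ranks.append((stable_rank(W1), stable_rank(W2)))
print(ranks)                                    # stable ranks of the converged solutions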
4 Rank minimization in deep networks with small norm
When training neural networks with gradient descent, it is common to use explicit regularization on the parameters. In this case, gradient descent is biased towards solutions that minimize the $\ell_2$ norm of the parameters. We now show that in deep overparameterized ReLU networks, if GF converges to a zero-loss solution that minimizes the $\ell_2$ norm, then the ratios between the Frobenius and the spectral norms of the weight matrices tend to be small (we use here the ratio between these norms as a continuous surrogate for the exact rank, as discussed in the previous section). Formally, we have the following:
Theorem 4.
Let $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n} \subseteq \mathbb{R}^{d}\times\mathbb{R}$ be a dataset, and assume that there is $i \in [n]$ with $\|\mathbf{x}_i\| \le 1$ and $y_i \ge 1$. Assume that there is a fully-connected neural network $N$ of width $m$ and depth $k$, such that for all $i \in [n]$ we have $N(\mathbf{x}_i) = y_i$, and the weight matrices $W_1,\dots,W_k$ of $N$ satisfy $\|W_j\|_F \le B$ for all $j$, for some $B > 0$. Let $N'$ be a fully-connected neural network of width $m' \ge m$ and depth $k' > k$ parameterized by $\theta'$. Let $\theta^* = [W^*_1,\dots,W^*_{k'}]$ be a global optimum of the following problem:
$\min_{\theta}\; \|\theta\|^2 \quad \text{s.t.} \quad N'_{\theta}(\mathbf{x}_i) = y_i \;\;\text{for all } i \in [n]~. \qquad (2)$
Then,
(3)
Equivalently, we have the following upper bound on the harmonic mean of the ratios $\|W^*_j\|_F / \|W^*_j\|_2$:
(4)
By the above theorem, if $k'$ is much larger than $k$, then the average ratio between the spectral and the Frobenius norms (Eq. (3)) is at least roughly $1$. Likewise, the harmonic mean of the ratios between the Frobenius and the spectral norms (Eq. (4)), namely, the square root of the stable rank, is at most roughly $1$. Noting that both these ratios equal $1$ if and only if the matrix is of rank $1$, we see that there is a bias towards low-rank solutions as the depth of the trained network increases. Note that the result does not depend on the width of the networks. Thus, even if the width is large, the average ratio is close to $1$. Also, note that the network of depth $k$ in the theorem might have high ranks (e.g., rank $m$ for each weight matrix), but once we consider networks of a large depth $k'$, the dataset becomes realizable by a network of small average rank, and GF converges to such a network.
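The quantities discussed above can be computed directly from the weight matrices of a trained network, as in the following sketch (ours; weights is a hypothetical list of weight matrices).

import numpy as np

def spectral_to_frobenius_ratios(weights):
    """For each weight matrix, the ratio ||W||_2 / ||W||_F (always in (0, 1],
    equal to 1 iff W has rank 1)."""
    return [np.linalg.norm(W, 2) / np.linalg.norm(W, 'fro') for W in weights]

def average_ratio(weights):                    # the quantity bounded in Eq. (3)
    r = spectral_to_frobenius_ratios(weights)
    return sum(r) / len(r)

def harmonic_mean_inverse_ratios(weights):     # the quantity bounded in Eq. (4):
    r = spectral_to_frobenius_ratios(weights)  # harmonic mean of ||W||_F / ||W||_2,
    return len(r) / sum(r)                     # i.e., of the square roots of stable ranks

# Example with random matrices (placeholder weights).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((10, 10)) for _ in range(6)]
print(average_ratio(weights), harmonic_mean_inverse_ratios(weights))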
5 Rank minimization in deep networks with exponentially-tailed losses
In this section, we turn to consider GF in classification tasks with exponentially-tailed losses, namely, the exponential loss or the logistic loss.
Let us first formally define the setting. We consider neural networks of output dimension $1$, i.e., $N_{\theta}:\mathbb{R}^{d}\to\mathbb{R}$. Let $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n} \subseteq \mathbb{R}^{d}\times\{-1,1\}$ be a binary classification training dataset. Let $X$ and $\mathbf{y}$ be the data matrix and the vector of labels that correspond to the dataset. Let $N_{\theta}$ be a neural network parameterized by $\theta$. For a loss function $\ell:\mathbb{R}\to\mathbb{R}$, the empirical loss of $N_{\theta}$ on the dataset is
$\mathcal{L}(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i\, N_{\theta}(\mathbf{x}_i)\big)~. \qquad (5)$
We focus on the exponential loss $\ell(q) = e^{-q}$ and the logistic loss $\ell(q) = \log\left(1 + e^{-q}\right)$.
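In code, these losses and the empirical objective of Eq. (5) can be written as follows (an illustrative sketch of ours; net is a hypothetical predictor returning a scalar).

import numpy as np

def exp_loss(q):       # exponential loss
    return np.exp(-q)

def logistic_loss(q):  # logistic loss
    return np.log1p(np.exp(-q))

def empirical_loss(net, X, y, loss=logistic_loss):
    """Average loss over the dataset, as in Eq. (5); q_i = y_i * net(x_i).
    X has inputs as columns, y holds labels in {-1, +1}."""
    margins = np.array([y[i] * net(X[:, i]) for i in range(X.shape[1])])
    return np.mean(loss(margins))

# Example with a hypothetical linear predictor.
w = np.array([1.0, -2.0])
net = lambda x: float(w @ x)
X = np.array([[1.0, -1.0], [0.5, 0.2]]); y = np.array([1.0, -1.0])
print(empirical_loss(net, X, y))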
We say that the dataset is correctly classified by $N_{\theta}$ if $y_i N_{\theta}(\mathbf{x}_i) > 0$ for all $i \in [n]$.
The following well-known result characterizes the implicit bias in homogeneous neural networks trained with the logistic or the exponential loss:
Lemma 1 (Paraphrased from lyu2019gradient and ji2020directional).
Let $N_{\theta}$ be a homogeneous ReLU neural network. Consider minimizing the average of either the exponential or the logistic loss over a binary classification dataset using GF. Suppose that the average loss converges to zero as $t \to \infty$. Then, GF converges in direction to a first order stationary point (KKT point) of the following maximum margin problem in parameter space:
$\min_{\theta}\; \frac{1}{2}\|\theta\|^2 \quad \text{s.t.} \quad y_i\, N_{\theta}(\mathbf{x}_i) \ge 1 \;\;\text{for all } i \in [n]~. \qquad (6)$
The above lemma suggests that GF tends to converge in direction to a network that classifies the dataset correctly with margin $1$ while having a small norm. In the following theorem we show that in deep overparameterized ReLU networks, if GF converges in direction to an optimal solution of Problem (6) (from the above lemma), then the ratios between the Frobenius and the spectral norms of the weight matrices tend to be small. Formally, we have the following:
Theorem 5.
Let $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n} \subseteq \mathbb{R}^{d}\times\{-1,1\}$ be a binary classification dataset, and assume that there is $i \in [n]$ with $\|\mathbf{x}_i\| \le 1$. Assume that there is a fully-connected neural network $N$ of width $m$ and depth $k$, such that for all $i \in [n]$ we have $y_i N(\mathbf{x}_i) \ge 1$, and the weight matrices $W_1,\dots,W_k$ of $N$ satisfy $\|W_j\|_F \le B$ for all $j$, for some $B > 0$. Let $N'$ be a fully-connected neural network of width $m' \ge m$ and depth $k' > k$ parameterized by $\theta'$. Let $\theta^* = [W^*_1,\dots,W^*_{k'}]$ be a global optimum of Problem (6). Namely, $\theta^*$ parameterizes a minimum-norm fully-connected network of width $m'$ and depth $k'$ that labels the dataset correctly with margin $1$. Then, we have
(7)
Equivalently, we have the following upper bound on the harmonic mean of the ratios $\|W^*_j\|_F / \|W^*_j\|_2$:
(8)
By the above theorem, if $k'$ is much larger than $k$, then the average ratio between the spectral and the Frobenius norms (Eq. (7)) is at least roughly $1$. Likewise, the harmonic mean of the ratios between the Frobenius and the spectral norms (Eq. (8)), i.e., the square root of the stable rank, is at most roughly $1$. Note that the result does not depend on the width of the networks. Thus, it holds even if the width is very large. Similarly to the case of Thm. 4, we note that the network of depth $k$ might have high ranks (e.g., rank $m$ for each weight matrix), but once we consider networks of a large depth $k'$, the dataset becomes realizable by a network of small average rank, and GF converges to such a network.
The combination of the above result with Lemma 1 suggests that, in overparameterized deep fully-connected networks, GF tends to converge in direction to neural networks with low ranks. Note that we consider the exponential and the logistic losses, and hence if the loss tends to zero as $t \to \infty$, then we have $\|\theta(t)\| \to \infty$. To conclude, in our case the parameters tend to have an infinite norm and to converge in direction to a low-rank solution. Moreover, note that the ratio between the spectral and the Frobenius norms is invariant to scaling, and hence it suggests that after a sufficiently long time, GF tends to reach a network with low ranks.
6 Proof ideas
In this section we describe the main ideas for the proofs of Theorems 2, 3, 4 and 5. The full proofs are given in the appendix.
6.1 Theorem 2
We define several regions of weight space, illustrated in Fig. 2, according to the signs of the inner products of a weight vector with the two inputs. Intuitively, the “dead” region consists of the weight vectors $\mathbf{w}$ with $\langle\mathbf{w},\mathbf{x}_1\rangle \le 0$ and $\langle\mathbf{w},\mathbf{x}_2\rangle \le 0$, where the relevant neuron outputs $0$ on both $\mathbf{x}_1$ and $\mathbf{x}_2$; the “active” region consists of the weight vectors with strictly positive inner products with both inputs, where the relevant neuron outputs a positive value on both; and the two “partially active” regions are those where the relevant neuron outputs a positive value on one input and $0$ on the other.
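The following sketch (ours, based on the sign-based description above; the region names are ours as well) classifies a hidden neuron's incoming weight vector into these regions.

import numpy as np

def region(w, x1, x2):
    """Classify a hidden neuron's incoming weight vector w according to the signs
    of its inner products with the two inputs (sigma'(0)=0 convention, so an inner
    product of 0 counts as inactive)."""
    a1, a2 = float(w @ x1) > 0, float(w @ x2) > 0
    if a1 and a2:
        return "active on both"
    if not a1 and not a2:
        return "dead"
    return "active only on x1" if a1 else "active only on x2"

x1 = np.array([1.0, 0.0]); x2 = np.array([-0.6, 0.8])   # angle in (pi/2, pi)
print(region(np.array([1.0, -1.0]), x1, x2))            # active only on x1
print(region(np.array([0.2, 1.0]), x1, x2))             # active on both
print(region(np.array([-1.0, -1.0]), x1, x2))           # dead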
Assume towards contradiction that GF converges to some zero-loss network $N(\mathbf{x}) = W_2\,\sigma(W_1\mathbf{x})$ with $\mathrm{rank}(W_1) = 1$. Since $N$ attains zero loss, then $W_2\,\sigma(W_1 X) = Y$, and since $\mathbf{y}_1, \mathbf{y}_2$ are linearly independent, hence

$\mathrm{rank}\big(\sigma(W_1 X)\big) = 2~. \qquad (9)$
Therefore, the weight vectors $\mathbf{w}_1$ and $\mathbf{w}_2$ are not in the dead region. Indeed, if $\mathbf{w}_1$ or $\mathbf{w}_2$ were in the dead region, then at least one of the rows of $\sigma(W_1 X)$ would be zero, in contradiction to Eq. (9). In particular, it implies that $\mathbf{w}_1$ and $\mathbf{w}_2$ are non-zero. Since by our assumption we have $\mathrm{rank}(W_1) = 1$, we conclude that $\mathbf{w}_2 = \alpha\mathbf{w}_1$ for some $\alpha \ne 0$. Note that if $\alpha > 0$, then the rows of $\sigma(W_1 X)$ are linearly dependent, in contradiction to Eq. (9). Thus, $\alpha < 0$. Moreover, if one of the weight vectors were in the active region, then the other one, which points in the opposite direction, would be in the dead region. Hence, one of these weight vectors is active only on $\mathbf{x}_1$ and the other is active only on $\mathbf{x}_2$ (as can be seen from Fig. 2). Assume w.l.o.g. that $\mathbf{w}_1$ is active only on $\mathbf{x}_1$ and $\mathbf{w}_2$ is active only on $\mathbf{x}_2$.
By observing the gradients of the loss w.r.t. $\mathbf{w}_1, \mathbf{w}_2$, the following facts follow. First, if some $\mathbf{w}_i$ is in the dead region at a time $t_0$, then its gradient vanishes there, hence $\mathbf{w}_i$ remains in the dead region indefinitely, in contradiction to the fact that $\mathbf{w}_i(\infty)$ lies in a partially active region. Thus, the trajectories do not visit the dead region. Second, the gradients constrain how the weight vectors can move between the remaining regions. Since $\mathbf{w}_1(\infty)$ lies in the partially active region of $\mathbf{x}_1$, we can consider the last time that $\mathbf{w}_1$ enters this region, which can be either at the initialization or when moving from the active region, and for all later times $\mathbf{w}_1$ stays in this region. This allows us to conclude that $\mathbf{w}_1(\infty)$ must be in a region which is illustrated in Fig. 3 (by the union of the orange and green regions).
Furthermore, we show that $\|\mathbf{w}_i\|$ cannot be too small, namely, we obtain a lower bound on $\|\mathbf{w}_i\|$. First, a theorem from du2018algorithmic implies that the quantity $\|\mathbf{w}_i\|^2 - \|\mathbf{u}_i\|^2$ remains constant throughout the training. Since at the initialization both $\|\mathbf{w}_i\|$ and $\|\mathbf{u}_i\|$ are small, the consequence is that $\|\mathbf{u}_i\|$ is small if $\|\mathbf{w}_i\|$ is small. Also, since $N$ attains zero loss and $\sigma(\langle\mathbf{w}_j,\mathbf{x}_i\rangle) = 0$ for all $j \ne i$, we have $\sigma(\langle\mathbf{w}_i,\mathbf{x}_i\rangle)\,\mathbf{u}_i = \mathbf{y}_i$, namely, only the $i$-th hidden neuron contributes to the output of $N$ for the input $\mathbf{x}_i$. Since $\|\mathbf{y}_i\| = 1$, it is impossible that both $\|\mathbf{w}_i\|$ and $\|\mathbf{u}_i\|$ are small. Hence, we are able to obtain a lower bound on $\|\mathbf{w}_i\|$, which implies that $\mathbf{w}_i(\infty)$ is in a region which is illustrated in Fig. 3.
Finally, we show that since $\mathbf{w}_1(\infty)$ and $\mathbf{w}_2(\infty)$ lie in these regions, the angle between $\mathbf{w}_1$ and $\mathbf{w}_2$ is smaller than $\pi$, in contradiction to $\mathbf{w}_2 = \alpha\mathbf{w}_1$ with $\alpha < 0$ (which implies an angle of exactly $\pi$).
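The conservation property used above can be checked numerically: along a small-step-size discretization of GF, the quantity $\|\mathbf{w}_i\|^2 - \|\mathbf{u}_i\|^2$ stays approximately constant for each hidden neuron. A sketch of ours, with placeholder data:

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def balance(W1, W2):
    # ||w_i||^2 - ||u_i||^2 for each hidden neuron i (rows of W1, columns of W2).
    return np.sum(W1 ** 2, axis=1) - np.sum(W2 ** 2, axis=0)

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 2)); Y = rng.standard_normal((2, 2))
W1 = 0.1 * rng.standard_normal((2, 2)); W2 = 0.1 * rng.standard_normal((2, 2))

print(balance(W1, W2))
for _ in range(5000):
    Z = W1 @ X; H = relu(Z); R = W2 @ H - Y
    G2 = R @ H.T
    G1 = ((W2.T @ R) * (Z > 0)) @ X.T
    W1 -= 1e-3 * G1; W2 -= 1e-3 * G2
print(balance(W1, W2))   # approximately unchanged along the (discretized) flow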
6.2 Theorem 3
We show that if the initialization is such that $\mathbf{w}_1(0)$ lies in the partially active region of $\mathbf{x}_1$ and $\mathbf{w}_2(0)$ lies in the partially active region of $\mathbf{x}_2$ (or, equivalently, that $\langle\mathbf{w}_1(0),\mathbf{x}_1\rangle > 0 \ge \langle\mathbf{w}_1(0),\mathbf{x}_2\rangle$ and $\langle\mathbf{w}_2(0),\mathbf{x}_2\rangle > 0 \ge \langle\mathbf{w}_2(0),\mathbf{x}_1\rangle$), then GF converges to a zero-loss network, and the norms of $\mathbf{w}_1, \mathbf{w}_2$ and the angle between them are in the required intervals. Since by simple geometric arguments we can show that the initialization satisfies this requirement with at least constant probability, the theorem follows.
Indeed, suppose that the initialization satisfies this condition. We argue that GF converges to a zero-loss network and that the norms and the angle are in the required intervals, as follows. By analyzing the dynamics of GF for such an initialization, we show that for all $t$ and $i \in \{1,2\}$ the outgoing vector $\mathbf{u}_i(t)$ is a non-negative multiple of $\mathbf{y}_i$. Thus, $\mathbf{u}_i$ moves only in the direction of $\mathbf{y}_i$, and $\mathbf{w}_i(t)$ remains active only on $\mathbf{x}_i$ for all $t$. Moreover, we are able to prove that these properties of the trajectories $\mathbf{w}_i(t)$ and $\mathbf{u}_i(t)$ imply that GF converges to a zero-loss network. Then, by similar arguments to the proof of Thm. 2, $\mathbf{w}_i(\infty)$ lies in the regions from Fig. 3 for $i = 1,2$, which allows us to obtain the required bounds on the norms of $\mathbf{w}_1,\mathbf{w}_2$ and on the angle between them.
6.3 Theorems 4 and 5
The intuition for the proofs of both theorems can be roughly described as follows. If the dataset is realizable by a shallow network where the Frobenius norm of each layer is at most $B$, then it is also realizable by a deep network where the Frobenius norm of each layer is some $\rho$, where $\rho$ is much smaller than $B$. Moreover, if the network is sufficiently deep then $\rho$ is not much larger than $1$. On the other hand, since for an input $\mathbf{x}_i$ with $\|\mathbf{x}_i\| \le 1$ the output of the network is of size at least $1$, the average spectral norm of the layers is at least $1$. Hence, the average ratio between the spectral and the Frobenius norms cannot be too small.
We now describe the proof ideas in a bit more detail, starting with Thm. 4. We use the network $N$ of width $m$ and depth $k$ to construct a network $\tilde{N}$ of width $m'$ and depth $k'$ as follows. The first $k$ layers of $\tilde{N}$ are obtained by scaling the layers of $N$ by an appropriate factor. Since the output dimension of $N$ is $1$, the $k$-th hidden layer of $\tilde{N}$ has width $1$. Then, the network $\tilde{N}$ has $k' - k$ additional layers of width $1$, such that the weight in each of these layers is an appropriate scalar. Overall, the scaling factors are chosen such that $\tilde{N}(\mathbf{x}_i) = N(\mathbf{x}_i) = y_i$ for every data point $\mathbf{x}_i$.
We denote by $\tilde{\theta}$ the parameters of the network $\tilde{N}$.
Let $\theta^* = [W^*_1,\dots,W^*_{k'}]$ be a global optimum of Problem (2). From the optimality of $\theta^*$ it is possible to show that the layers in $\theta^*$ must be balanced, namely, all the weight matrices $W^*_j$ have the same Frobenius norm, which we denote by $\rho$. From the global optimality of $\theta^*$ we also have $\|\theta^*\| \le \|\tilde{\theta}\|$. Hence, by a calculation, we can upper bound $\rho$ by a quantity that approaches $1$ when $k'$ is sufficiently large. Moreover, we show that since there is $i$ with $\|\mathbf{x}_i\| \le 1$ and $y_i \ge 1$, the product of the spectral norms $\prod_{j\in[k']}\|W^*_j\|_2$ is at least $1$. Combining the two bounds, the average ratio between the spectral and the Frobenius norms of the layers of $\theta^*$ is at least roughly $1$, as required.
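The following sketch (ours) illustrates one such deepening construction, under the simplifying assumption that the shallow network has a single non-negative output so that the appended width-1 ReLU layers pass the value through unchanged; the particular choice of scaling (normalizing each original layer and compensating in the appended layers) is an assumption of ours, not necessarily the one used in the formal proof.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(weights, x):
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

def deepen(shallow, k_prime):
    """Turn a shallow net (scalar, non-negative outputs assumed) of depth k into a
    functionally equivalent net of depth k_prime > k: rescale each original layer to
    unit Frobenius norm and append width-1 layers with a common weight c chosen so
    that the product of all scalings is 1."""
    k = len(shallow)
    norms = [np.linalg.norm(W, 'fro') for W in shallow]
    c = np.prod(norms) ** (1.0 / (k_prime - k))
    deep = [W / n for W, n in zip(shallow, norms)]
    deep += [np.array([[c]]) for _ in range(k_prime - k)]
    return deep

# Example: a random shallow net with non-negative output weights (our assumption).
rng = np.random.default_rng(0)
shallow = [rng.standard_normal((4, 3)), np.abs(rng.standard_normal((1, 4)))]
x = rng.standard_normal(3)
deep = deepen(shallow, k_prime=12)
print(forward(shallow, x), forward(deep, x))              # identical outputs (up to rounding)
print([round(np.linalg.norm(W, 'fro'), 3) for W in deep]) # extra-layer norms tend to 1 as k_prime grows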
Note that the arguments above do not depend on the ranks of the layers in $N$. Thus, even if the weight matrices in $N$ have high ranks, once we consider deep networks which are optimal solutions to Problem (2), the ratios between the spectral and the Frobenius norms are close to $1$.
We now turn to Thm. 5. The proof follows a similar approach to the proof of Thm. 4. However, here the outputs of the network can be either positive or negative. Hence, when constructing the network $\tilde{N}$ as above, we cannot have width $1$ in the additional layers, since the ReLU activation will not allow us to pass both positive and negative values. Still, we show that we can define a network $\tilde{N}$ such that the width in these additional layers is $2$, and we have $\tilde{N}(\mathbf{x}_i) = N(\mathbf{x}_i)$ for all $i \in [n]$. Then, the theorem follows by arguments similar to the proof of Thm. 4, with the required modifications.
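One standard way to realize this (a sketch of ours, not necessarily the exact construction used in the proof) is to carry the positive and negative parts of the scalar separately through width-2 ReLU layers and recombine them at the output.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def passthrough_layers(num_layers):
    """Width-2 ReLU layers that carry a scalar z as (relu(z), relu(-z)) and
    recombine it at the end, so the composition is the identity on scalars."""
    split = np.array([[1.0], [-1.0]])      # scalar -> (z, -z); ReLU keeps the two parts
    keep = [np.eye(2) for _ in range(num_layers - 2)]
    merge = np.array([[1.0, -1.0]])        # (relu(z), relu(-z)) -> z
    return [split] + keep + [merge]

def forward(weights, x):
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

for z in (-1.7, 0.0, 2.5):
    print(forward(passthrough_layers(5), np.array([z])))   # recovers z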
Funding Acknowledgements
This research is supported in part by European Research Council (ERC) grant 754705.
References
Appendix A Proof of Thm. 1
Consider a matrix $W_1 \in \mathbb{R}^{2\times d}$ whose rows satisfy $\mathbf{w}_2 = -\mathbf{w}_1$, where $\mathbf{w}_1 = \frac{\mathbf{x}_1}{\|\mathbf{x}_1\|} - \frac{\mathbf{x}_2}{\|\mathbf{x}_2\|}$.

The matrix $W_1$ has rank $1$. To complete the proof, we need to show that we can choose a matrix $W_2$ such that the network $N(\mathbf{x}) = W_2\,\sigma(W_1\mathbf{x})$ attains zero loss. According to Lemma 2 below, it is enough to show that $\langle\mathbf{w}_1,\mathbf{x}_1\rangle > 0 > \langle\mathbf{w}_1,\mathbf{x}_2\rangle$ and $\langle\mathbf{w}_2,\mathbf{x}_2\rangle > 0 > \langle\mathbf{w}_2,\mathbf{x}_1\rangle$. Since the angle between the inputs is strictly positive, namely, $\theta(\mathbf{x}_1,\mathbf{x}_2) > 0$, it holds that $\left\langle \frac{\mathbf{x}_1}{\|\mathbf{x}_1\|}, \frac{\mathbf{x}_2}{\|\mathbf{x}_2\|} \right\rangle < 1$. Then,

$\langle\mathbf{w}_1,\mathbf{x}_1\rangle = \|\mathbf{x}_1\|\left(1 - \left\langle \tfrac{\mathbf{x}_1}{\|\mathbf{x}_1\|}, \tfrac{\mathbf{x}_2}{\|\mathbf{x}_2\|} \right\rangle\right) > 0 \quad\text{and}\quad \langle\mathbf{w}_2,\mathbf{x}_2\rangle = \|\mathbf{x}_2\|\left(1 - \left\langle \tfrac{\mathbf{x}_1}{\|\mathbf{x}_1\|}, \tfrac{\mathbf{x}_2}{\|\mathbf{x}_2\|} \right\rangle\right) > 0~,$

while

$\langle\mathbf{w}_1,\mathbf{x}_2\rangle = -\|\mathbf{x}_2\|\left(1 - \left\langle \tfrac{\mathbf{x}_1}{\|\mathbf{x}_1\|}, \tfrac{\mathbf{x}_2}{\|\mathbf{x}_2\|} \right\rangle\right) < 0 \quad\text{and}\quad \langle\mathbf{w}_2,\mathbf{x}_1\rangle = -\|\mathbf{x}_1\|\left(1 - \left\langle \tfrac{\mathbf{x}_1}{\|\mathbf{x}_1\|}, \tfrac{\mathbf{x}_2}{\|\mathbf{x}_2\|} \right\rangle\right) < 0~.$

∎
Lemma 2.
Let $\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^{n}$ be a labeled dataset. Let $W_1 \in \mathbb{R}^{m\times d}$. Suppose that for every data point $\mathbf{x}_i$ there is at least one row $\mathbf{w}_j$ in $W_1$ such that $\langle\mathbf{w}_j,\mathbf{x}_i\rangle > 0$, and $\langle\mathbf{w}_j,\mathbf{x}_l\rangle \le 0$ for all $l \ne i$. Then, there exists $W_2$ such that the network $N(\mathbf{x}) = W_2\,\sigma(W_1\mathbf{x})$ satisfies $N(\mathbf{x}_i) = \mathbf{y}_i$ for all $i \in [n]$.
Proof.
Consider the matrix $A := \sigma(W_1 X)$ of size $m \times n$, where $\sigma$ acts entrywise. Note that our assumption on $W_1$ implies that $\mathrm{rank}(A) = n$. Thus, the matrix satisfies $A^{+} A = I_n$, where $A^{+}$ denotes the Moore–Penrose inverse of the matrix $A$, and $I_n$ is the identity matrix. Hence, the matrix $W_2 := Y A^{+}$ yields $W_2 A = Y A^{+} A = Y$. By setting the second layer to $W_2$, the network achieves zero loss. Namely, $W_2\,\sigma(W_1 X) = Y$. ∎
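In code, the construction in the proof reads as follows (an illustrative sketch with placeholder data satisfying the assumption of the lemma).

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Placeholder data: each input has a dedicated row of W1 that is active on it
# and inactive on the other input (the assumption of Lemma 2).
X = np.array([[1.0, -1.0],
              [0.0,  0.5]])          # inputs as columns
Y = np.array([[1.0,  0.0],
              [0.5, -2.0]])          # targets as columns
W1 = np.array([[ 2.0, 0.0],
               [-2.0, 0.0]])         # row 1 active only on x_1, row 2 only on x_2

A = relu(W1 @ X)                     # full column rank by the assumption
W2 = Y @ np.linalg.pinv(A)           # Moore-Penrose inverse, so W2 A = Y
print(np.linalg.norm(W2 @ relu(W1 @ X) - Y))   # ~0: zero loss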
Appendix B Proof of Thm. 2
Definition 1.
We define the following regions of interest:
Also, for we define