# Implicit Regularization Towards Rank Minimization in ReLU Networks

We study the conjectured relationship between the implicit regularization in neural networks, trained with gradient-based methods, and rank minimization of their weight matrices. Previously, it was proved that for linear networks (of depth 2 and vector-valued outputs), gradient flow (GF) w.r.t. the square loss acts as a rank minimization heuristic. However, understanding to what extent this generalizes to nonlinear networks is an open problem. In this paper, we focus on nonlinear ReLU networks, providing several new positive and negative results. On the negative side, we prove (and demonstrate empirically) that, unlike the linear case, GF on ReLU networks may no longer tend to minimize ranks, in a rather strong sense (even approximately, for "most" datasets of size 2). On the positive side, we reveal that ReLU networks of sufficient depth are provably biased towards low-rank solutions in several reasonable settings.

## 1 Introduction

A central puzzle in the theory of deep learning is how neural networks generalize even when trained without any explicit regularization, and when there are far more learnable parameters than training examples. In such an underdetermined optimization problem, there are many global minima with zero training loss, and gradient descent seems to prefer solutions that generalize well (see zhang2017understanding). Hence, it is believed that gradient descent induces an implicit regularization (or implicit bias) (neyshabur2015search; neyshabur2017exploring), and characterizing this regularization/bias has been a subject of extensive research.

Several works in recent years studied the relationship between the implicit regularization in linear neural networks and rank minimization. A main focus is on the matrix factorization problem, which corresponds to training a depth-2 linear neural network with multiple outputs w.r.t. the square loss, and is considered a well-studied test-bed for studying implicit regularization in deep learning. gunasekar2018implicit initially conjectured that the implicit regularization in matrix factorization can be characterized by the nuclear norm of the corresponding linear predictor. This conjecture was further studied in a string of works (e.g., belabbas2020implicit; arora2019implicit; razin2020implicit) and was formally refuted by li2020towards. razin2020implicit conjectured that the implicit regularization in matrix factorization can be explained by rank minimization, and also hypothesized that some notion of rank minimization may be key to explaining generalization in deep learning. li2020towards established evidence that the implicit regularization in matrix factorization is a heuristic for rank minimization. razin2021implicit studied implicit regularization in tensor factorization (a generalization of matrix factorization). They demonstrated, both theoretically and empirically, implicit bias towards low-rank tensors. Going beyond factorization problems, ji2018gradient; ji2020directional showed that in linear networks of output dimension $1$, gradient flow (GF) w.r.t. exponentially-tailed classification losses converges to networks where the weight matrix of every layer is of rank $1$.

However, once we move to nonlinear neural networks (which are by far the more common in practice), things are less clear. Empirically, a series of works studying neural network compression (cf. denton2014exploiting; yu2017compressing; alvarez2017compression; arora2018stronger; tukan2020compressed) showed that replacing the weight matrices by low-rank approximations results in only a small drop in accuracy. This suggests that the weight matrices in practice are not too far from being low-rank. However, whether they provably behave this way remains unclear.

In this work we consider fully-connected nonlinear networks employing the popular ReLU activation function, and study whether GF is biased towards networks where the weight matrices have low ranks. On the negative side, we show that already for small (depth $2$ and width $2$) ReLU networks, there is no rank-minimization bias in a rather strong sense. On the positive side, for deeper and possibly wider overparameterized networks, we identify reasonable settings where GF is biased towards low-rank solutions. In more detail, our contributions are as follows:


• We begin by considering depth-$2$ width-$2$ ReLU networks with multiple outputs, trained with the square loss. li2020towards gave evidence that in linear networks with the same architecture, the implicit bias of GF can be characterized as a heuristic for rank minimization. In contrast, we show that in ReLU networks the situation is quite different: GF does not converge to a low-rank solution, already for the simple case of datasets of size $2$, $\{(\mathbf{x}_1,\mathbf{y}_1),(\mathbf{x}_2,\mathbf{y}_2)\}$, whenever the angle between $\mathbf{x}_1$ and $\mathbf{x}_2$ is in $(\pi/2,\pi)$ and $\mathbf{y}_1,\mathbf{y}_2$ are linearly independent. Thus, rank minimization does not occur even if we just consider “most” datasets of this size. Moreover, we show that with at least constant probability, the solutions that GF converges to are not even close to having low rank, under any reasonable approximate rank metric. We also demonstrate these results empirically.

• Next, for ReLU networks that are overparameterized in terms of depth and have width , we identify interesting settings in which GF is biased towards low ranks:


• First, we consider ReLU networks trained w.r.t. the square loss. We show that for sufficiently deep networks, if GF converges to a network that attains zero loss and minimizes the $\ell_2$ norm of the parameters, then the average ratio between the spectral and the Frobenius norms of the weight matrices is close to $1$. Since the squared inverse of this ratio is the stable rank (which is a continuous approximation of the rank, and equals $1$ if and only if the matrix has rank $1$), the result implies a bias towards low ranks. While GF in ReLU networks w.r.t. the square loss is not known to be biased towards solutions that minimize the $\ell_2$ norm, in practice it is common to use explicit $\ell_2$ regularization, which encourages norm minimization. Thus, our result suggests that GF in deep networks trained with the square loss and explicit $\ell_2$ regularization encourages rank minimization.

• Shifting our attention to binary classification problems, we consider ReLU networks trained with exponentially-tailed classification losses. By lyu2019gradient, GF in such networks is biased towards networks that maximize the margin. We show that for sufficiently deep networks, maximizing the margin implies rank minimization, where the rank is measured by the ratio between the norms as in the former case.
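The norm ratio that serves as the rank surrogate in both positive results can be computed directly. The following is a minimal sketch, assuming NumPy; the helper names `norm_ratio` and `stable_rank` are ours and do not come from the paper.

```python
import numpy as np

def norm_ratio(W):
    """Spectral norm divided by Frobenius norm; lies in (0, 1] for W != 0."""
    return np.linalg.norm(W, 2) / np.linalg.norm(W, "fro")

def stable_rank(W):
    """Squared inverse of norm_ratio, i.e. ||W||_F^2 / ||W||_sigma^2."""
    return 1.0 / norm_ratio(W) ** 2

# Illustrative matrices (our choice): a rank-1 outer product and the identity.
rank_one = np.outer([1.0, 2.0], [3.0, -1.0])
identity = np.eye(2)
print(stable_rank(rank_one), stable_rank(identity))
```

A ratio of $1$ (equivalently, a stable rank of $1$) is attained exactly by rank-$1$ matrices, while the $2 \times 2$ identity has stable rank $2$.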

The implicit regularization in matrix factorization and linear neural networks with the square loss was extensively studied, as a first step toward understanding implicit regularization in more complex models (see, e.g., gunasekar2018implicit; razin2020implicit; arora2019implicit; belabbas2020implicit; eftekhari2020implicit; li2018algorithmic; ma2018implicit; woodworth2020kernel; gidel2019implicit; li2020towards; yun2020unifying; azulay2021implicit; razin2021implicit). As we already discussed, some of these works showed bias toward low ranks.

The implicit regularization in nonlinear neural networks with the square loss was studied in several works. oymak2019overparameterized showed that under some assumptions, gradient descent in certain nonlinear models is guaranteed to converge to a zero-loss solution with a bounded $\ell_2$ norm. williams2019gradient and jin2020implicit studied the dynamics and implicit bias of gradient descent in wide depth-$2$ ReLU networks with input dimension $1$. vardi2021implicit and azulay2021implicit studied the implicit regularization in single-neuron networks. In particular, vardi2021implicit showed that in single-neuron networks and single-hidden-neuron networks with the ReLU activation, the implicit regularization cannot be expressed by any non-trivial regularization function. Namely, there is no non-trivial regularization function $\mathcal{R}(\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ are the parameters of the model, such that if GF with the square loss converges to a global minimum, then it is a global minimum that minimizes $\mathcal{R}$. However, this negative result does not imply that GF is not implicitly biased towards low-rank solutions, for two reasons. First, bias toward low ranks would not have implications in the case of networks of width $1$ that these authors studied, and hence it would not contradict their negative result. Second, their result rules out the existence of a non-trivial regularization function which expresses the implicit bias for all possible datasets and initializations, but it does not rule out the possibility that GF acts as a heuristic for rank minimization, in the sense that it minimizes the ranks for “most” datasets and initializations.

The implicit bias of neural networks in classification tasks was also widely studied in recent years. soudry2018implicit showed that gradient descent on linearly-separable binary classification problems with exponentially-tailed losses converges to the maximum $\ell_2$-margin direction. This analysis was extended to other loss functions, tighter convergence rates, non-separable data, and variants of gradient-based optimization algorithms (nacson2019convergence; ji2018risk; ji2020gradient; gunasekar2018characterizing; shamir2021gradient; ji2021characterizing). lyu2019gradient and ji2020directional showed that GF on homogeneous neural networks, with exponentially-tailed losses, converges in direction to a KKT point of the maximum-margin problem in the parameter space. Similar results under stronger assumptions were previously obtained in nacson2019lexicographic; gunasekar2018bimplicit. vardi2021margin studied in which settings this KKT point is guaranteed to be a global/local optimum of the maximum-margin problem. The implicit bias in fully-connected linear networks was studied by ji2020directional; ji2018gradient; gunasekar2018bimplicit. As already mentioned, these results imply that GF minimizes the ranks of the weight matrices in linear fully-connected networks. The implicit bias in diagonal and convolutional linear networks was studied in gunasekar2018bimplicit; moroshko2020implicit; yun2020unifying. The implicit bias in infinitely-wide two-layer homogeneous neural networks was studied in chizat2020implicit.

Organization. In Sec. 2 we provide necessary notations and definitions. In Sec. 3 we state our negative results for depth-$2$ networks. In Sec. 4 and 5 we state our positive results for deep ReLU networks. In Sec. 6 we describe the ideas for the proofs of the main theorems, with all formal proofs deferred to the appendix.

## 2 Preliminaries

#### Notations.

We use boldface letters to denote vectors. For  we denote by  the Euclidean norm. For a matrix  we denote by  the Frobenius norm and by  the spectral norm. We denote . For an integer  we denote . The angle between a pair of vectors  is . The unit -sphere is . An open -ball that is centered at the origin is denoted by  for some . The closure of a set , denoted as , is the intersection of all closed sets containing . The boundary of  is .

#### Neural networks.

A fully-connected neural network of depth  is parameterized by a collection  of weight matrices, such that for every layer  we have . Thus, denotes the number of neurons in the -th layer, i.e., the width of the layer. We denote by , the input and output dimensions. The neurons in layers  are called hidden neurons. A fully-connected network computes a function  defined recursively as follows. For an input  we set , and define for every  the input to the -th layer as , and the output of the -th layer as , where is an activation function that acts coordinate-wise on vectors. In this work we focus on the ReLU activation . Finally, we define . Thus, there is no activation in the output neurons. The width of the network  is the maximal width of its layers, i.e., . We sometimes apply the activation function also on matrices, in which case it acts entry-wise. The parameters  of the neural network are given by a collection of matrices, but we often view  as the vector obtained by concatenating the matrices in the collection. Thus,  denotes the  norm of the vector .
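The recursive definition above can be sketched in code. This is a minimal illustration assuming NumPy, with the weight matrices stored as a Python list; the function names `forward` and `param_norm` are hypothetical and chosen only for this example.

```python
import numpy as np

def relu(z):
    """ReLU activation, applied coordinate-wise."""
    return np.maximum(z, 0.0)

def forward(weights, x):
    """Evaluate a fully-connected ReLU network.

    weights: list of weight matrices, one per layer, first layer first.
    The activation is applied after every layer except the last one.
    """
    h = np.asarray(x, dtype=float)
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

def param_norm(weights):
    """l2 norm of the parameters viewed as one concatenated vector."""
    return np.sqrt(sum(np.sum(W ** 2) for W in weights))

# Small illustrative depth-2 network (values are our choice).
example = [np.array([[1.0, -1.0], [0.0, 2.0]]), np.array([[1.0, 1.0]])]
print(forward(example, [1.0, 2.0]), param_norm(example))
```

Viewing the parameters as one concatenated vector means that the squared parameter norm is the sum of the squared Frobenius norms of the layers.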

We often consider depth-$2$ networks. For matrices $W$ and $V$ we denote by $N_{W,V}$ the depth-$2$ ReLU network whose first-layer and second-layer weight matrices are $W$ and $V$, respectively. We denote by $\mathbf{w}_i$ the rows of $W$, namely, the incoming weight vectors to the neurons in the hidden layer, and by $\mathbf{v}_i$ the columns of $V$, namely, the outgoing weight vectors from the neurons in the hidden layer.

Let be inputs, and let be a matrix whose columns are . We denote by the matrix whose -th column is .

#### Optimization problem and gradient flow (GF).

Let be a training dataset. We often represent the dataset by matrices . For a neural network  we consider empirical-loss minimization w.r.t. the square loss. Thus, the objective is given by:

 (1)

We assume that the data is realizable, that is, . Moreover, we focus on settings where the network is overparameterized, in the sense that has multiple (or even infinitely many) global minima.

We consider gradient flow (GF) on the objective given in Eq. (1). This setting captures the behavior of gradient descent with an infinitesimally small step size. Let  be the trajectory of GF. Starting from an initial point , the dynamics of  is given by the differential equation . Note that the ReLU function is not differentiable at . Practical implementations of gradient methods define the derivative  to be some constant in . In this work we assume for convenience that . We say that GF converges if  exists. In this case, we denote .
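For intuition, GF on a depth-$2$ ReLU network can be approximated by explicit Euler steps, which is gradient descent with a small step size. The sketch below assumes NumPy and assumes that the objective is half the squared Frobenius norm of the residual $V\sigma(WX) - Y$; the factor $\frac{1}{2}$ and the function names are our assumptions. The derivative of the ReLU at $0$ is taken to be $0$, in line with the convention stated above.

```python
import numpy as np

def loss_and_grads(W, V, X, Y):
    """Square loss of the depth-2 ReLU network V*relu(W*X) and its gradients.

    Assumes loss = 0.5 * ||V relu(W X) - Y||_F^2 (scaling is an assumption).
    """
    Z = W @ X                      # pre-activations of the hidden layer
    H = np.maximum(Z, 0.0)         # hidden-layer outputs
    R = V @ H - Y                  # residual on the training set
    loss = 0.5 * np.sum(R ** 2)
    grad_V = R @ H.T
    # (Z > 0) encodes the ReLU derivative, with derivative 0 at 0.
    grad_W = ((V.T @ R) * (Z > 0)) @ X.T
    return loss, grad_W, grad_V

def euler_step(W, V, X, Y, step):
    """One explicit Euler step of the gradient-flow ODE."""
    _, grad_W, grad_V = loss_and_grads(W, V, X, Y)
    return W - step * grad_W, V - step * grad_V
```

Smaller values of `step` follow the continuous-time trajectory more closely, at the cost of more iterations.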

## 3 Gradient flow does not even approximately minimize ranks

In this section we consider rank minimization in depth-$2$ networks trained with the square loss. We show that even for the simple case of size-$2$ datasets, under mild assumptions, GF does not converge to a minimum-rank solution even approximately.

In what follows, we consider ReLU networks with vector-valued outputs, since for linear networks with the same architecture it was shown that GF can be viewed as a heuristic for rank minimization (cf. li2020towards; razin2020implicit). Specifically, let be a training dataset, and let be weight matrices such that is a zero-loss solution. Note that if then we must have : Indeed, by definition of , we necessarily have . Therefore, to understand rank minimization in this simple setting, we consider the rank of in a zero-loss solution. Trivially, , so can be considered low-rank only if .

To make the setting non-trivial, we need to show that such low-rank zero-loss solutions exist at all. The following theorem shows that this is true for almost all size-$2$ datasets:

###### Theorem 1.

Given any labeled dataset of two inputs with a strictly positive angle between them, i.e., , there exists a zero-loss solution with , such that .

The theorem follows by constructing a network where the weight vectors of the neurons in the first layer have opposite directions (and hence the weight matrix is of rank $1$), such that each neuron is active for exactly one input. Then, it is possible to show that for an appropriate choice of the weights in the second layer the network achieves zero loss. See Appendix A for the formal proof.

Thm. 1 implies that zero-loss solutions of rank $1$ exist. However, we now show that GF does not converge to such solutions. We prove this result under the following assumptions:

###### Assumption 1.

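The two target vectors $\mathbf{y}_1, \mathbf{y}_2$ are on the unit sphere and are linearly independent.

###### Assumption 2.

The two inputs $\mathbf{x}_1, \mathbf{x}_2$ are on the unit sphere and satisfy $\frac{\pi}{2} < \measuredangle(\mathbf{x}_1,\mathbf{x}_2) < \pi$.

The assumptions that the inputs and the targets are of unit norm are mostly for technical convenience, and we believe that they are not essential.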

Then, we have:

###### Theorem 2.

Let be a labeled dataset that satisfies Assumptions 1 and 2. Consider GF w.r.t. the loss function  from Eq. (1). Suppose that are initialized such that

 ∥wi(0)∥

and for all . If GF converges to a zero-loss solution , then .

By the above theorem, GF does not minimize the rank even in a very simple setting where the dataset contains two inputs with angle larger than $\pi/2$ (as long as the initialization point is sufficiently close to the origin). In particular, if the dataset is drawn from the uniform distribution on the sphere then this condition holds with probability $1/2$.

While Thm. 2 shows that GF does not minimize the rank, it does not rule out the possibility that it converges to a solution which is close to a low-rank solution. There are many ways to define such closeness, such as the ratio of the Frobenius and spectral norms, the Frobenius distance from a low-rank solution, or the exponential of the entropy of the singular values (cf. rudelson2007sampling; sanyal2019stable; razin2020implicit; roy2007effective). However, for $2 \times 2$ matrices they all boil down to either having the two rows of the matrix being nearly aligned, or having at least one of them very small (at least compared to the other). In the following theorem, we show that under the assumptions stated above, for any fixed dataset, with at least constant probability, GF converges to a zero-loss solution, where the two row vectors are bounded away from $\mathbf{0}$, the ratio of their norms is bounded, and the angle between them is bounded away from $0$ and from $\pi$ (all by explicit constants that depend just on the dataset and are large in general). Thus, with at least constant probability, GF does not minimize any reasonable approximate notion of rank.

###### Theorem 3.

Let be a labeled dataset that satisfies Assumptions 1 and 2. Consider GF w.r.t. the loss function from Eq. (1). Suppose that are initialized such that for all we have , and is drawn from a spherically symmetric distribution with

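$$\|\mathbf{w}_i(0)\| \le \frac{\sqrt{3}}{2}\min\left\{\sin\left(\frac{\pi-\measuredangle(\mathbf{x}_1,\mathbf{x}_2)}{4}\right),\ \sin\left(\measuredangle(\mathbf{x}_1,\mathbf{x}_2)-\frac{\pi}{2}\right)\right\} .$$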

Let be the event that GF converges to a zero-loss solution  such that

• ,

• for all .

Then, .

We note that in Thm. 3 the weights in the second layer are initialized to zero, while in Thm. 2 the assumption on the initialization is weaker. This difference is for technical convenience, and we believe that Thm. 3 should hold also under weaker assumptions on the initialization, as the next empirical result demonstrates.

### 3.1 An empirical result

Our theorems imply that for standard initialization schemes, GF will not converge close to low-rank solutions, with some positive probability. We now present a simple experiment that corroborates this and suggests that, furthermore, this holds with high probability.

Specifically, we trained ReLU networks in the same setup as in the previous section (w.r.t. two  weight matrices ) on the two data points  where  are the standard basis vectors in , and  are  and  normalized to have unit norm. At initialization, every row of  and every column of  is sampled uniformly at random from the sphere of radius  around the origin. To simulate GF, we performed epochs of full-batch gradient descent of step size , w.r.t. the square loss. Of  repeats of this experiment,  converged to negligible loss (defined as ). In Fig. 1, we plot a histogram of the stable (numerical) ranks of the resulting weight matrices, i.e. the ratio  of layer . The figure clearly suggests that whenever convergence to zero loss occurs, the solutions are all of rank , and none are even close to being low-rank (in terms of the stable rank).
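An experiment of this kind can be sketched as follows. This is not the code used for Fig. 1: the inputs, the targets, the initialization radius, the step size, the number of epochs, and the factor $\frac{1}{2}$ in the loss are all hypothetical values chosen for illustration, and the function names are ours. The sketch assumes NumPy.

```python
import numpy as np

def sample_on_circle(rng, radius):
    """Uniform sample from the circle of the given radius in R^2."""
    g = rng.standard_normal(2)
    return radius * g / np.linalg.norm(g)

def stable_rank(W):
    """||W||_F^2 / ||W||_sigma^2."""
    return np.linalg.norm(W, "fro") ** 2 / np.linalg.norm(W, 2) ** 2

def run_once(seed, radius=0.1, step=0.01, epochs=20000):
    """Full-batch gradient descent on a 2x2 depth-2 ReLU network.

    All default hyperparameters are assumptions made for this sketch.
    Returns the final loss and the stable rank of the first-layer matrix.
    """
    rng = np.random.default_rng(seed)
    # Assumed inputs: unit vectors at angle 2*pi/3, which lies in (pi/2, pi).
    X = np.array([[1.0, np.cos(2 * np.pi / 3)],
                  [0.0, np.sin(2 * np.pi / 3)]])  # columns are x_1, x_2
    Y = np.eye(2)                                  # assumed targets e_1, e_2
    W = np.vstack([sample_on_circle(rng, radius) for _ in range(2)])
    V = np.column_stack([sample_on_circle(rng, radius) for _ in range(2)])
    for _ in range(epochs):
        Z = W @ X
        H = np.maximum(Z, 0.0)
        R = V @ H - Y
        grad_V = R @ H.T
        grad_W = ((V.T @ R) * (Z > 0)) @ X.T       # ReLU derivative 0 at 0
        W, V = W - step * grad_W, V - step * grad_V
    loss = 0.5 * np.sum((V @ np.maximum(W @ X, 0.0) - Y) ** 2)
    return loss, stable_rank(W)
```

Calling `run_once` over many seeds and keeping only the runs with negligible final loss gives a sample of stable ranks that can be plotted as a histogram. For a nonzero $2 \times 2$ matrix the stable rank always lies in $[1, 2]$.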

## 4 Rank minimization in deep networks with small ℓ2 norm

When training neural networks with gradient descent, it is common to use explicit $\ell_2$ regularization on the parameters. In this case, gradient descent is biased towards solutions that minimize the $\ell_2$ norm of the parameters. We now show that in deep overparameterized ReLU networks, if GF converges to a zero-loss solution that minimizes the $\ell_2$ norm, then the ratios between the Frobenius and the spectral norms in the weight matrices tend to be small (we use here the ratio between these norms as a continuous surrogate for the exact rank, as discussed in the previous section). Formally, we have the following:

###### Theorem 4.

Let  be a dataset, and assume that there is  with  and . Assume that there is a fully-connected neural network  of width  and depth , such that for all  we have , and the weight matrices  of satisfy for some . Let  be a fully-connected neural network of width  and depth  parameterized by . Let  be a global optimum of the following problem:

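$$\min_{\boldsymbol{\theta}} \|\boldsymbol{\theta}\| \quad \text{s.t.} \quad \forall i \in [n]:\ N_{\boldsymbol{\theta}}(\mathbf{x}_i) = y_i . \tag{2}$$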

Then,

$$\frac{1}{k'}\sum_{i=1}^{k'} \frac{\|W_i^*\|_\sigma}{\|W_i^*\|_F} \ge \left(\frac{1}{B}\right)^{k/k'} . \tag{3}$$

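Equivalently, we have the following upper bound on the harmonic mean of the ratios $\|W_i^*\|_F / \|W_i^*\|_\sigma$:

$$\frac{k'}{\sum_{i=1}^{k'} \left(\frac{\|W_i^*\|_F}{\|W_i^*\|_\sigma}\right)^{-1}} \le B^{k/k'} . \tag{4}$$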

By the above theorem, if $k'$ is much larger than $k$, then the average ratio between the spectral and the Frobenius norms (Eq. (3)) is at least roughly $1$. Likewise, the harmonic mean of the ratio between the Frobenius and the spectral norms (Eq. (4)), namely, the square root of the stable rank, is at most roughly $1$. Noting that both these ratios equal $1$ if and only if the matrix is of rank $1$, we see that there is a bias towards low-rank solutions as the depth $k'$ of the trained network increases. Note that the result does not depend on the width of the networks. Thus, even if the width of the trained network is large, the average ratio is close to $1$. Also, note that the network $N$ of depth $k$ in the theorem might have high ranks (e.g., full rank for each weight matrix), but once we consider networks of a large depth $k'$ then the dataset becomes realizable by a network of small average rank, and GF converges to such a network.
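To see how quickly the lower bound in Eq. (3) approaches $1$ as the depth grows, one can evaluate its right-hand side numerically. The values $B = 10$ and $k = 2$ below are hypothetical and chosen only for illustration, as is the function name.

```python
def thm4_lower_bound(B, k, k_prime):
    """Right-hand side of Eq. (3): (1/B) ** (k / k')."""
    return (1.0 / B) ** (k / k_prime)

# Illustrative values (assumptions): shallow depth k = 2 and norm bound B = 10.
for k_prime in (2, 8, 32, 128, 512):
    print(k_prime, thm4_lower_bound(B=10.0, k=2, k_prime=k_prime))
```

For $B \ge 1$ the bound is non-decreasing in $k'$ and tends to $1$ as $k' \to \infty$, while the average ratio itself can never exceed $1$.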

## 5 Rank minimization in deep networks with exponentially-tailed losses

In this section, we turn to consider GF in classification tasks with exponentially-tailed losses, namely, the exponential loss or the logistic loss.

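Let us first formally define the setting. We consider neural networks of output dimension $1$. Let $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ be a binary classification training dataset with labels $y_i \in \{-1, 1\}$. Let $X$ and $\mathbf{y}$ be the data matrix and labels that correspond to the dataset. Let $N_{\boldsymbol{\theta}}$ be a neural network parameterized by $\boldsymbol{\theta}$. For a loss function $\ell: \mathbb{R} \to \mathbb{R}$, the empirical loss of $N_{\boldsymbol{\theta}}$ on the dataset is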

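$$\mathcal{L}_{X,\mathbf{y}}(\boldsymbol{\theta}) := \sum_{i=1}^{n} \ell\left(y_i N_{\boldsymbol{\theta}}(\mathbf{x}_i)\right) . \tag{5}$$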

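We focus on the exponential loss $\ell(q) = e^{-q}$ and the logistic loss $\ell(q) = \log(1 + e^{-q})$. We say that the dataset is correctly classified by the network $N_{\boldsymbol{\theta}}$ if for all $i \in [n]$ we have $y_i N_{\boldsymbol{\theta}}(\mathbf{x}_i) > 0$. We consider GF on the objective given in Eq. (5). We say that a network $N_{\boldsymbol{\theta}}$ is homogeneous if there exists $L > 0$ such that for every $\alpha > 0$ and $\boldsymbol{\theta}, \mathbf{x}$ we have $N_{\alpha\boldsymbol{\theta}}(\mathbf{x}) = \alpha^L N_{\boldsymbol{\theta}}(\mathbf{x})$. Note that fully-connected ReLU networks are homogeneous. We say that a trajectory $\boldsymbol{\theta}(t)$ of GF converges in direction to $\tilde{\boldsymbol{\theta}}$ if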

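$$\lim_{t \to \infty} \frac{\boldsymbol{\theta}(t)}{\|\boldsymbol{\theta}(t)\|} = \frac{\tilde{\boldsymbol{\theta}}}{\|\tilde{\boldsymbol{\theta}}\|} .$$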
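The homogeneity of fully-connected ReLU networks noted above is easy to check numerically: scaling all parameters of a bias-free depth-$k$ network by $\alpha > 0$ scales the output by $\alpha^k$. The following sketch assumes NumPy, and the function names are hypothetical.

```python
import numpy as np

def forward(weights, x):
    """Bias-free fully-connected ReLU network, no activation on the output."""
    h = np.asarray(x, dtype=float)
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

def homogeneity_gap(weights, x, alpha):
    """Largest absolute difference between N_{alpha*theta}(x) and alpha^k * N_theta(x)."""
    depth = len(weights)
    scaled = forward([alpha * W for W in weights], x)
    return float(np.max(np.abs(scaled - alpha ** depth * forward(weights, x))))
```

The identity holds because the ReLU is positively homogeneous, i.e., $\sigma(\alpha z) = \alpha\sigma(z)$ for every $\alpha > 0$, so each layer contributes one factor of $\alpha$.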

The following well-known result characterizes the implicit bias in homogeneous neural networks trained with the logistic or the exponential loss:

###### Lemma 1 (Paraphrased from lyu2019gradient and ji2020directional).

Let $N_{\boldsymbol{\theta}}$ be a homogeneous ReLU neural network. Consider minimizing the average of either the exponential or the logistic loss over a binary classification dataset using GF. Suppose that the average loss converges to zero as $t \to \infty$. Then, GF converges in direction to a first order stationary point (KKT point) of the following maximum margin problem in parameter space:

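$$\min_{\boldsymbol{\theta}} \frac{1}{2}\|\boldsymbol{\theta}\|^2 \quad \text{s.t.} \quad \forall i \in [n]:\ y_i N_{\boldsymbol{\theta}}(\mathbf{x}_i) \ge 1 . \tag{6}$$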

The above lemma suggests that GF tends to converge in direction to a network with margin $1$ and small $\ell_2$ norm. In the following theorem we show that in deep overparameterized ReLU networks, if GF converges in direction to an optimal solution of Problem (6) (from the above lemma), then the ratios between the Frobenius and the spectral norms in the weight matrices tend to be small. Formally, we have the following:

###### Theorem 5.

Let  be a binary classification dataset, and assume that there is  with . Assume that there is a fully-connected neural network  of width  and depth , such that for all  we have , and the weight matrices  of  satisfy  for some . Let  be a fully-connected neural network of width  and depth  parameterized by . Let  be a global optimum of Problem 6. Namely,  parameterizes a minimum-norm fully-connected network of width  and depth  that labels the dataset correctly with margin . Then, we have

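$$\frac{1}{k'}\sum_{i=1}^{k'} \frac{\|W_i^*\|_\sigma}{\|W_i^*\|_F} \ge \frac{1}{\sqrt{2}} \cdot \left(\frac{\sqrt{2}}{B}\right)^{k/k'} \cdot \sqrt{\frac{k'}{k'+1}} . \tag{7}$$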

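Equivalently, we have the following upper bound on the harmonic mean of the ratios $\|W_i^*\|_F / \|W_i^*\|_\sigma$:

$$\frac{k'}{\sum_{i=1}^{k'} \left(\frac{\|W_i^*\|_F}{\|W_i^*\|_\sigma}\right)^{-1}} \le \sqrt{2} \cdot \left(\frac{B}{\sqrt{2}}\right)^{k/k'} \cdot \sqrt{\frac{k'+1}{k'}} . \tag{8}$$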

By the above theorem, if $k'$ is much larger than $k$, then the average ratio between the spectral and the Frobenius norms (Eq. (7)) is at least roughly $1/\sqrt{2}$. Likewise, the harmonic mean of the ratio between the Frobenius and the spectral norms (Eq. (8)), i.e., the square root of the stable rank, is at most roughly $\sqrt{2}$. Note that the result does not depend on the width of the networks. Thus, it holds even if the width of the trained network is very large. Similarly to the case of Thm. 4, we note that the network $N$ of depth $k$ might have high ranks (e.g., full rank for each weight matrix), but once we consider networks of a large depth $k'$, then the dataset becomes realizable by a network of small average rank, and GF converges to such a network.

The combination of the above result with Lemma 1 suggests that, in overparameterized deep fully-connected networks, GF tends to converge in direction to neural networks with low ranks. Note that we consider the exponential and the logistic losses, and hence if the loss tends to zero as $t \to \infty$, then we have $\|\boldsymbol{\theta}(t)\| \to \infty$. To conclude, in our case, the parameters tend to have an infinite norm and to converge in direction to a low-rank solution. Moreover, note that the ratio between the spectral and the Frobenius norms is invariant to scaling, and hence it suggests that after a sufficiently long time, GF tends to reach a network with low ranks.

## 6 Proof ideas

In this section we describe the main ideas for the proofs of Theorems 2, 3, 4 and 5. The full proofs are given in the appendix.

### 6.1 Theorem 2

We define the following regions (see Fig. 2):

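$$
\begin{aligned}
D &:= \{\mathbf{w} \in \mathbb{R}^2 \mid \forall i \in \{1,2\},\ \sigma(\mathbf{w}^\top \mathbf{x}_i) \le 0\} ,\\
S &:= \{\mathbf{w} \in \mathbb{R}^2 \mid \forall i \in \{1,2\},\ \sigma(\mathbf{w}^\top \mathbf{x}_i) > 0\} ,\\
S_1 &:= \{\mathbf{w} \in \mathbb{R}^2 \mid \sigma(\mathbf{w}^\top \mathbf{x}_1) > 0,\ \sigma(\mathbf{w}^\top \mathbf{x}_2) \le 0\} ,\\
S_2 &:= \{\mathbf{w} \in \mathbb{R}^2 \mid \sigma(\mathbf{w}^\top \mathbf{x}_2) > 0,\ \sigma(\mathbf{w}^\top \mathbf{x}_1) \le 0\} .
\end{aligned}
$$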

Intuitively, $D$ defines the “dead” region where the relevant neuron will output $0$ on both $\mathbf{x}_1, \mathbf{x}_2$; $S$ is the “active” region where the relevant neuron will output a positive output on both $\mathbf{x}_1, \mathbf{x}_2$; and $S_1, S_2$ are the “partially active” regions, where the relevant neuron will output a positive output on one point, and $0$ on the other.
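The four regions partition the plane according to the signs of the two inner products, since $\sigma(z) > 0$ exactly when $z > 0$. A minimal sketch of this classification follows; the function name `region` and the string labels are our own.

```python
def region(w, x1, x2):
    """Return 'D', 'S', 'S1' or 'S2' for a weight vector w in R^2."""
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    active_on_x1 = dot(w, x1) > 0   # relu(w.x1) > 0 iff w.x1 > 0
    active_on_x2 = dot(w, x2) > 0
    if active_on_x1 and active_on_x2:
        return "S"
    if active_on_x1:
        return "S1"
    if active_on_x2:
        return "S2"
    return "D"
```

For example, with unit inputs at an angle larger than $\pi/2$, the vector $\mathbf{w} = \mathbf{x}_1$ lies in $S_1$ and $\mathbf{w} = \mathbf{x}_2$ lies in $S_2$, because the inner product $\mathbf{x}_1^\top \mathbf{x}_2$ is negative.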

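Assume towards contradiction that GF converges to some zero-loss network $N_{W(\infty),V(\infty)}$ with $\operatorname{rank}(W(\infty)) < 2$. Since this network attains zero loss, then $Y = V(\infty)\sigma(W(\infty)X)$, and hence

$$2 = \operatorname{rank}(Y) = \operatorname{rank}\left(V(\infty)\sigma(W(\infty)X)\right) \le \operatorname{rank}\left(\sigma(W(\infty)X)\right) . \tag{9}$$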

Therefore, the weight vectors  and  are not in the region . Indeed, if or  are in , then at least one of the rows of  is zero, in contradiction to Eq. (9). In particular, it implies that  and  are non-zero. Since by our assumption we have , then we conclude that . We denote  where . Note that if , then  for all , in contradiction to Eq. (9). Thus, . Since we also have , then one of these weight vectors is in  and the other is in  (as can be seen from Fig. 2). Assume w.l.o.g. that  and .

By observing the gradients of  w.r.t.  for , the following facts follow. First, if  at some time , then , hence  remains at  indefinitely, in contradiction to . Thus, the trajectory  does not visit . Second, if  at time , then . Since , we can consider the last time  that  enters , which can be either at the initialization (i.e., ) or when moving from  (i.e., ). For all time  we have . It allows us to conclude that  must be in a region  which is illustrated in Fig. 3 (by the union of the orange and green regions).

Furthermore, we show that  cannot be too small, namely, obtaining a lower bound on . First, a theorem from du2018algorithmic implies that  remains constant throughout the training. Since at the initialization both  and  are small, the consequence is that  is small if  is small. Also, since  attains zero loss and  for all , then we have , namely, only the -th hidden neuron contributes to the output of  for the input . Since , it is impossible that both  and  are small. Hence, we are able to obtain a lower bound on , which implies that  is in a region  which is illustrated in Fig. 3.

Finally, we show that since and then the angle between and is smaller than , in contradiction to .

### 6.2 Theorem 3

We show that if the initialization is such that  and  (or, equivalently, that  and ), then GF converges to a zero-loss network, and , are in the required intervals. Since by simple geometric arguments we can show that the initialization satisfies this requirement with probability at least , the theorem follows.

Indeed, suppose that  and . We argue that GF converges to a zero-loss network and  are in the required intervals, as follows. By analyzing the dynamics of GF for such an initialization, we show that for all and we have  for some . Thus,  moves only in the direction of , and  for all . Moreover, we are able to prove that these properties of the trajectories  and  imply that GF converges to a zero-loss network . Then, by similar arguments to the proof of Thm. 2 we have  for all , where  are the regions from Fig. 3, and it allows us to obtain the required bounds on , and .

### 6.3 Theorems 4 and 5

The intuition for the proofs of both theorems can be roughly described as follows. If the dataset is realizable by a shallow network where the Frobenius norm of each layer is , then it is also realizable by a deep network where the Frobenius norm of each layer is , where  is much smaller than . Moreover, if the network is sufficiently deep then  is not much larger than . On the other hand, since for the input  with  the output of the network is of size at least , then the average spectral norm of the layers is at least . Hence, the average ratio between the spectral and the Frobenius norms cannot be too small.

We now describe the proof ideas in a bit more detail, starting with Thm. 4. We use the network  of width  and depth  to construct a network  of width  and depth  as follows. The first  layers of  are obtained by scaling the layers of  by a factor . Since the output dimension of  is , then the -th hidden layer of  has width . Then, the network  has  additional layers of width , such that the weight in each of these layers is . Overall, given input , we have

$$N'(\mathbf{x}_i) = N(\mathbf{x}_i) \cdot \alpha^{k} \cdot \beta^{k'-k} = N(\mathbf{x}_i) .$$
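The construction can be checked numerically. The sketch below assumes NumPy and uses one choice of constants that satisfies $\alpha^k \beta^{k'-k} = 1$, namely $\alpha = B^{-(k'-k)/k'}$ and $\beta = B^{k/k'}$; these values, like the function names, are our assumption and are not taken from the text. The shallow network is built with a nonnegative last layer so that its outputs are nonnegative and pass through the ReLU unchanged.

```python
import numpy as np

def forward(weights, x):
    """Bias-free fully-connected ReLU network, no activation on the output."""
    h = np.asarray(x, dtype=float)
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

def deepen(weights, B, k_prime):
    """Extend a depth-k network with scalar output to depth k_prime.

    The first k layers are scaled by alpha and followed by k_prime - k
    layers of width 1 with weight beta, where alpha^k * beta^(k_prime-k) = 1.
    The constants are an assumed choice, valid for nonnegative outputs.
    """
    k = len(weights)
    alpha = B ** (-(k_prime - k) / k_prime)
    beta = B ** (k / k_prime)
    scaled = [alpha * W for W in weights]
    extra = [np.array([[beta]]) for _ in range(k_prime - k)]
    return scaled + extra
```

With this choice, if every layer of the shallow network has Frobenius norm at most $B$, then every layer of the deep network has Frobenius norm at most $B^{k/k'}$, which is the quantity appearing in the bounds of Theorem 4.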

We denote by the parameters of the network .

Let  be a global optimum of Problem 2. From the optimality of  it is possible to show that the layers in  must be balanced, namely,  for all . We denote by  the Frobenius norm of the layers. From the global optimality of  we also have . Hence, by a calculation we can obtain

$$B^* \le B^{k/k'} .$$

Moreover, we show that since there is  with  and , then

$$\frac{1}{k'}\sum_{i \in [k']} \|W_i^*\|_\sigma \ge 1 .$$

Combining the last two displayed equations we get

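$$\frac{1}{k'}\sum_{i \in [k']} \frac{\|W_i^*\|_\sigma}{\|W_i^*\|_F} = \frac{1}{B^*} \cdot \frac{1}{k'}\sum_{i \in [k']} \|W_i^*\|_\sigma \ge \left(\frac{1}{B}\right)^{k/k'} ,$$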

as required.

Note that the arguments above do not depend on the ranks of the layers in $N$. Thus, even if the weight matrices in $N$ have high ranks, once we consider deep networks which are optimal solutions to Problem (2), the ratios between the spectral and the Frobenius norms are close to $1$.

We now turn to Thm. 5. The proof follows a similar approach to the proof of Thm. 4. However, here the outputs of the network  can be either positive or negative. Hence, when constructing the network  as above, we cannot have width  in layers , since the ReLU activation will not allow us to pass both positive and negative values. Still, we show that we can define a network  such that the width in layers  is and we have  for all . Then, the theorem follows by arguments similar to the proof of Thm. 4, with the required modifications.

### Funding Acknowledgements

This research is supported in part by European Research Council (ERC) grant 754705.

## Appendix A Proof of Thm. 1

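Consider a matrix $W$ whose rows satisfy

$$\mathbf{w}_1 = \frac{\mathbf{x}_1}{\|\mathbf{x}_1\|} - \frac{\mathbf{x}_2}{\|\mathbf{x}_2\|}, \qquad \mathbf{w}_2 = -\mathbf{w}_1 .$$

The matrix $W$ has rank $1$. To complete the proof, we need to show that we can choose a matrix $V$ such that $N_{W,V}$ attains zero loss. According to Lemma 2 below, it is enough to show that $\mathbf{w}_1^\top \mathbf{x}_1 > 0$ and $\mathbf{w}_1^\top \mathbf{x}_2 < 0$. Since the angle between the inputs is strictly positive, namely, $\measuredangle(\mathbf{x}_1,\mathbf{x}_2) > 0$, it holds that $\mathbf{x}_1^\top \mathbf{x}_2 < \|\mathbf{x}_1\| \cdot \|\mathbf{x}_2\|$. Thus,

$$\|\mathbf{x}_1\| \cdot \|\mathbf{x}_2\| - \mathbf{x}_1^\top \mathbf{x}_2 > 0 .$$

Then,

$$\mathbf{w}_1^\top \mathbf{x}_1 = \frac{\mathbf{x}_1^\top \mathbf{x}_1}{\|\mathbf{x}_1\|} - \frac{\mathbf{x}_1^\top \mathbf{x}_2}{\|\mathbf{x}_2\|} = \|\mathbf{x}_1\| - \frac{\mathbf{x}_1^\top \mathbf{x}_2}{\|\mathbf{x}_2\|} = \frac{\|\mathbf{x}_1\| \cdot \|\mathbf{x}_2\| - \mathbf{x}_1^\top \mathbf{x}_2}{\|\mathbf{x}_2\|} > 0 ,$$

while

$$\mathbf{w}_1^\top \mathbf{x}_2 = \frac{\mathbf{x}_1^\top \mathbf{x}_2}{\|\mathbf{x}_1\|} - \frac{\mathbf{x}_2^\top \mathbf{x}_2}{\|\mathbf{x}_2\|} = \frac{\mathbf{x}_1^\top \mathbf{x}_2}{\|\mathbf{x}_1\|} - \|\mathbf{x}_2\| = \frac{\mathbf{x}_1^\top \mathbf{x}_2 - \|\mathbf{x}_1\| \cdot \|\mathbf{x}_2\|}{\|\mathbf{x}_1\|} < 0 .$$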

###### Lemma 2.

Let be a labeled dataset. Let . Suppose that for every data point there is at least one row in such that , and for all . Then, there exists such that .

###### Proof.

Consider the matrix of size , where acts entrywise. Note that our assumption on implies that . Thus, the matrix satisfies , where denotes the Moore-Penrose inverse of a matrix , and is the identity matrix. Hence, the matrix of dimensions yields . By setting , the network  achieves zero loss. Namely, . ∎
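The rank-$1$ construction of this appendix, combined with the pseudo-inverse argument of Lemma 2, can be verified numerically. The following is a minimal sketch assuming NumPy; the function name `rank_one_solution` is hypothetical, and the data in the usage lines are arbitrary illustrative values.

```python
import numpy as np

def rank_one_solution(X, Y):
    """Build (W, V) with rank(W) = 1 such that V relu(W X) = Y.

    X, Y: matrices whose two columns are the inputs and the targets.
    Assumes the two inputs are nonzero with a strictly positive angle.
    """
    x1, x2 = X[:, 0], X[:, 1]
    w1 = x1 / np.linalg.norm(x1) - x2 / np.linalg.norm(x2)
    W = np.vstack([w1, -w1])             # rows w_1 and w_2 = -w_1
    H = np.maximum(W @ X, 0.0)           # diagonal with positive entries
    V = Y @ np.linalg.pinv(H)            # second layer via the pseudo-inverse
    return W, V

# Arbitrary illustrative dataset (our choice).
X_demo = np.array([[1.0, 0.0], [0.0, 2.0]])
Y_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
W_demo, V_demo = rank_one_solution(X_demo, Y_demo)
print(np.linalg.matrix_rank(W_demo))
```

Each hidden neuron is active on exactly one input, so $\sigma(WX)$ is a diagonal matrix with positive diagonal entries and is therefore invertible.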

## Appendix B Proof of Thm. 2

###### Definition 1.

We define the following regions of interest:

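$$
\begin{aligned}
D &:= \{\mathbf{w} \in \mathbb{R}^2 \mid \forall i \in \{1,2\},\ \sigma(\mathbf{w}^\top \mathbf{x}_i) \le 0\} ,\\
S &:= \{\mathbf{w} \in \mathbb{R}^2 \mid \forall i \in \{1,2\},\ \sigma(\mathbf{w}^\top \mathbf{x}_i) > 0\} .
\end{aligned}
$$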

Also, for we define