
# Exact Solutions of a Deep Linear Network

This work finds the exact solutions to a deep linear network with weight decay and stochastic neurons, a fundamental model for understanding the landscape of neural networks. Our result implies that weight decay strongly interacts with the model architecture and can create bad minima in a network with more than 1 hidden layer, which is qualitatively different from a network with only 1 hidden layer. As an application, we also analyze stochastic nets and show that their prediction variance vanishes to zero as the stochasticity, the width, or the depth tends to infinity.


## 1 Introduction

Applications of neural networks have achieved great success in various fields. One central open theoretical question is why neural networks, being nonlinear and containing many saddle points and local minima, can sometimes be optimized easily to a global minimum (Choromanska et al., 2015a), while they become almost impossible to train in some other scenarios (Glorot and Bengio, 2010) and may require many tricks to succeed (Gotmare et al., 2018). One established approach is to study the landscape of deep linear nets (Choromanska et al., 2015b), which are believed to approximate the landscape of a nonlinear net well. A series of works proved the famous result that for a deep linear net, all local minima are global (Kawaguchi, 2016; Lu and Kawaguchi, 2017; Laurent and Brecht, 2018), which is regarded as having successfully explained why deep neural networks are so easy to train, because it implies that initialization in any attractive basin can reach the global minimum without much effort (Kawaguchi, 2016).

In this work, we theoretically study a deep linear net with weight decay and stochastic neurons, whose loss function takes the following form in general:

$$
\mathbb{E}_x\mathbb{E}_{\epsilon^{(1)},\epsilon^{(2)},\dots,\epsilon^{(D)}}\underbrace{\left(\sum_{i,i_1,i_2,\dots,i_D}^{d,d_1,d_2,\dots,d_D}U_{i_D}\epsilon^{(D)}_{i_D}\cdots\epsilon^{(2)}_{i_2}W^{(2)}_{i_2i_1}\epsilon^{(1)}_{i_1}W^{(1)}_{i_1i}x_i-y\right)^2}_{L_0}+\underbrace{\gamma_u\|U\|_2^2+\sum_{i=1}^{D}\gamma_i\|W^{(i)}\|_F^2}_{L_2\ \mathrm{reg.}}, \tag{1}
$$

where $U$ and $W^{(i)}$ are the model parameters, $D$ is the depth of the network (in this work, we use “depth” to refer to the number of hidden layers; for example, a linear regressor has depth $0$), $\epsilon^{(i)}$ is the noise in the hidden layer (e.g., due to dropout), $d_i$ is the width of the $i$-th layer, and $\gamma$ is the strength of the weight decay. Previous works have studied special cases of this loss function. For example, Kawaguchi (2016) and Lu and Kawaguchi (2017) study the landscape of $L_0$ when $\epsilon$ is a constant (namely, when there is no noise). Mehta et al. (2021) study $L_0$ with (a more complicated type of) weight decay but without stochasticity and prove that all the stationary points are isolated. Another line of works studies $L_0$ when the noise is caused by dropout (Mianjy and Arora, 2019; Cavazza et al., 2018). Our setting is more general than the previous works in two respects. First, apart from the mean square error (MSE) loss $L_0$, an $L_2$ regularization term (weight decay) with arbitrary strength is included; second, the noise $\epsilon$ is arbitrary. Thus, our setting is arguably closer to the actual deep learning practice, where the injection of noise into latent layers is common and the use of weight decay is virtually ubiquitous (Krogh and Hertz, 1992; Loshchilov and Hutter, 2017). One major theoretical limitation of our work is that we assume the label $y$ to be $1$-dimensional, and it can be an important future problem to prove whether an exact solution exists or not when $y$ is high-dimensional.

Our foremost contribution is to give the exact solution for all the global minima of an arbitrarily deep and wide linear net on the MSE loss plus a weight decay term and with a general type of stochasticity in the hidden layer. In other words, we identify in closed form the global minima of Eq. (1). We then show that they have nontrivial properties that can explain many phenomena in deep learning. In particular, the implications of our result include (but are not limited to):

1. Weight decay makes the landscape of neural nets more complicated;

• we show that bad minima emerge as weight decay is applied, whereas there is no bad minimum when there is no weight decay. (Unless otherwise specified, we use the term “bad minimum” to mean a local minimum that is not a global minimum; namely, one would have to overcome some nontrivial barrier to reach the global minimum. This usage is consistent with the previous literature (Kawaguchi, 2016).) This highlights the need to escape bad local minima in deep learning with weight decay.

2. Deeper nets are harder to optimize than shallower ones;

• we show that a 3-layer linear net contains a bad minimum at zero, whereas a 2-layer net does not. This partially explains why deep networks are much harder to optimize than shallower ones in deep learning practice.

3. Stochastic networks in a few asymptotic limits can become deterministic;

• we show that the prediction variance of a stochastic net scales towards $0$ on the MSE loss as (a) the width tends to infinity, or (b) the variance of the latent randomness tends to infinity, or (c) the depth tends to infinity.

In summary, our result deepens our understanding of the loss landscape of neural networks and stochastic networks. Organization: In the next section, we discuss the related works. In Section 3, we derive the exact solution for a two-layer net. Section 4 extends the result to an arbitrary depth. In Section 5, we apply our result to study stochastic networks. The last section concludes the work and discusses unresolved open problems. The technical, lengthy, or less essential proofs are deferred to Appendix A.

Notation. For a matrix $W$, we use $W_{i:}$ to denote the $i$-th row vector of $W$. $\|\cdot\|$ denotes the $L_2$ norm if its argument is a vector and the Frobenius norm if its argument is a matrix. The notation $*$ signals an optimized quantity. Additionally, we use the superscript $^*$ and subscript $_*$ interchangeably, whichever leads to a simpler expression. For example, $b_*^2$ and $(b^*)^2$ denote the same quantity, while the former is “simpler.”

## 2 Related Works

Linear Nets. In many ways, linear networks have been used to help understand nonlinear networks. For example, even at depth $0$, where the linear net is nothing but a linear regressor, linear nets are shown to be relevant for understanding the generalization behavior of modern overparametrized networks (Hastie et al., 2019). Saxe et al. (2013) study the training dynamics of a linear network with hidden layers and use it to understand the dynamics of learning of nonlinear networks. These networks are the same as a linear regression model in terms of expressivity. However, the loss landscape is highly complicated due to the existence of more than one layer, and linear nets are widely believed to approximate the loss landscape of a nonlinear net (Kawaguchi, 2016; Hardt and Ma, 2016; Laurent and Brecht, 2018). In particular, the landscape of linear nets has been studied as early as 1989 in Baldi and Hornik (1989), which proposed the well-known conjecture that all local minima of a deep linear net are global. This conjecture was first proved in Kawaguchi (2016), and extended to other loss functions and deeper depths in Lu and Kawaguchi (2017) and Laurent and Brecht (2018).

Stochastic Net Theory. A major extension of the standard neural networks is to make them stochastic, namely, to make the output a random function of the input. In a broad sense, stochastic neural networks include neural networks trained with dropout (Srivastava et al., 2014; Gal and Ghahramani, 2016), Bayesian networks (Mackay, 1992), variational autoencoders (VAE) (Kingma and Welling, 2013), and generative adversarial networks (Goodfellow et al., 2014). Stochastic networks are thus of both practical and theoretical importance to study. Theoretically, while a unified approach is lacking, some previous works exist to separately study different stochastic techniques in deep learning. A series of recent works approach the VAE loss theoretically (Dai and Wipf, 2019). Another line of recent works analyzes linear models trained with VAE to study the commonly encountered mode collapse problem of VAE (Lucas et al., 2019; Koehler et al., 2021). Another series of works that extensively studied the dropout technique with a linear network (Cavazza et al., 2018; Mianjy and Arora, 2019; Arora et al., 2020) showed that dropout effectively controls the rank of the learned solution. Lastly, it is worth noting that in the original works of Kawaguchi (2016) and Choromanska et al. (2015a), a linear net with a special type of noise that is equivalent to dropout has been used to model the effect of the ReLU nonlinearity in actual neural networks. Our work considers an arbitrary type of noise whose second moment exists, not just limited to dropout.

## 3 Two-layer Linear Net

This section finds the exact solutions of a two-layer linear net. The data point $x$ is a $d$-dimensional vector drawn from a data distribution, and the labels $y$ are generated through an arbitrary function of $x$. As is common in deep learning practice, a weight decay term is also added. For generality, the two different layers have different strengths of weight decay even though they often take the same value in practice.

In particular, we are interested in finding the global minimum of the following objective:

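$$
L_{d,d_1}(U,W)=\mathbb{E}_x\mathbb{E}_{\epsilon}\left(\sum_{j}^{d_1}U_j\epsilon_j\sum_{i}^{d}W_{ji}x_i-y\right)^2+\gamma_w\|W\|^2+\gamma_u\|U\|^2, \tag{2}
$$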

where $d_1$ is the width of the hidden layer and $\epsilon_j$ are independent random variables that are often used to stochastify a neural network, dropout (Srivastava et al., 2014) being one well-known example. $\gamma_w > 0$ and $\gamma_u > 0$ are the weight decay parameters. Here, we consider a general type of independent noise with $\mathbb{E}[\epsilon_i] = 1$ and $\mathbb{E}[\epsilon_i\epsilon_j] = \delta_{ij}\sigma^2 + 1$, where $\delta_{ij}$ is the Kronecker delta, and $\sigma^2 > 0$. For shorthand, we use the notation $A_0 := \mathbb{E}[xx^T]$, and the largest and the smallest eigenvalues of $A_0$ are denoted as $a_{\max}$ and $a_{\min}$ respectively. $a_i$ denotes the $i$-th eigenvalue of $A_0$ viewed in any order. For now, it is sufficient for us to assume that the global minimum of Eq. (2) always exists. We will prove a more general result in Proposition 1, when we deal with multilayer nets.
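As an illustration of the objective in Eq. (2), the expectation over the noise can be carried out in closed form when the noise has mean 1 and variance `sigma2` and is independent across hidden units. The sketch below is a minimal NumPy version under that assumption; the function name `two_layer_loss` and its argument names are hypothetical, and the data enter only through the second moments `A0` (for $\mathbb{E}[xx^T]$), `Exy` (for $\mathbb{E}[xy]$), and `Ey2` (for $\mathbb{E}[y^2]$).

```python
import numpy as np

def two_layer_loss(U, W, A0, Exy, Ey2, sigma2, gamma_u, gamma_w):
    """Value of the objective in Eq. (2), computed from data moments.

    U: (d1,) output weights; W: (d1, d) first-layer weights.
    Assumption: E[eps_j] = 1, Var[eps_j] = sigma2, independent across j, so
    E_eps (sum_j eps_j h_j - y)^2 = (sum_j h_j - y)^2 + sigma2 * sum_j h_j^2.
    """
    m = U @ W                                         # mean linear map x -> U^T W x
    fit = m @ A0 @ m - 2.0 * (m @ Exy) + Ey2          # E_x (U^T W x - y)^2
    row_energy = np.einsum("ji,ik,jk->j", W, A0, W)   # E_x (W_j: x)^2 per hidden unit
    noise = sigma2 * np.sum(U**2 * row_energy)        # variance injected by the noise
    reg = gamma_w * np.sum(W**2) + gamma_u * np.sum(U**2)
    return fit + noise + reg
```

With all weights set to zero the function returns `Ey2`, which is the loss of the trivial solution discussed later in this section.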

### 3.1 Main Result

We first present two lemmas showing that the global minimum can only lie on a rather restrictive subspace of all possible parameter settings due to invariances (symmetries) in the objective.

###### Lemma 1.

At the global minimum of Eq. (2), $U_j^2 = \frac{\gamma_w}{\gamma_u}\sum_i W_{ji}^2$ for all $j$.

Proof Sketch. We use the fact that the first term of Eq. (2) is invariant to a simultaneous rescaling of rows of the weight matrix to find the optimal rescaling, which implies the lemma statement.

This lemma implies that for all $j$, $|U_j|$ must be proportional to the norm of its corresponding row vector in $W$. The following lemma further shows that, at the global minimum, all elements of $U$ must be equal in magnitude.

###### Lemma 2.

At the global minimum, for all $i$ and $j$, we have

$$
\begin{cases}U_i^2=U_j^2;\\ U_iW_{i:}=U_jW_{j:}.\end{cases} \tag{3}
$$

Proof Sketch. We show that if the condition is not satisfied, then, an “averaging” transformation will strictly decrease the objective.

The second lemma imposes strong conditions on the solution of the problem, and the essence of this lemma is the reduction of the original problem to a lower dimension.

We are now ready to prove our first main result.

###### Theorem 1.

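The global minimum $U_*$ and $W_*$ of Eq. (2) is $U_* = 0$ and $W_* = 0$ if and only if

$$
\|\mathbb{E}[xy]\|^2\le\gamma_u\gamma_w. \tag{4}
$$

When $\|\mathbb{E}[xy]\|^2 > \gamma_u\gamma_w$, there exists $b = b_* > 0$ such that the global minima are

$$
\begin{cases}U_*=br;\\ W_*=r\,\mathbb{E}[xy]^T\,b\left[b^2(\sigma^2+d_1)A_0+\gamma_wI\right]^{-1},\end{cases} \tag{5}
$$

where $r = (\pm1,\dots,\pm1)$ is an arbitrary vertex of a $d_1$-dimensional hypercube.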

Proof. By Lemma 2, at any global minimum, we can write $U_* = br$ for some $b \in \mathbb{R}$. We can also write $W_* = rv^T$ for a general vector $v \in \mathbb{R}^d$. Without loss of generality, we assume that $b \geq 0$ (because the sign of $b$ can be absorbed into $r$).

The original problem in Eq. (2) is now equivalently reduced to the following problem because $r^Tr = d_1$:

$$
\min_{b,v}\ \mathbb{E}_x\left[\left(bd_1\sum_jv_jx_j-y\right)^2+b^2d_1\sigma^2\left(\sum_kv_kx_k\right)^2\right]+\gamma_ud_1b^2+\gamma_wd_1\|v\|_2^2. \tag{6}
$$

For any fixed $b$, the global minimum of $v$ is well known (namely, it is the solution of a ridge linear regression problem):

$$
v=b\,\mathbb{E}[xy]^T\left[b^2(\sigma^2+d_1)A_0+\gamma_wI\right]^{-1}. \tag{7}
$$

By Lemma 1, at a global minimum, $b$ also satisfies the following condition:

$$
b^2=\frac{\gamma_w}{\gamma_u}\|v\|^2. \tag{8}
$$

One solution to this equation is $b = 0$, and we are interested in whether solutions with $b \neq 0$ exist. If there is no other solution, then $b = 0$ must be the unique global minimum; otherwise, we need to identify which of the solutions are actual global minima and which are just saddles (we do not discern between saddles and maxima). When $b \neq 0$, substituting Eq. (7) into Eq. (8) gives

$$
\left\|\left[b^2(\sigma^2+d_1)A_0+\gamma_wI\right]^{-1}\mathbb{E}[xy]\right\|^2=\frac{\gamma_u}{\gamma_w}. \tag{9}
$$

Note that the left-hand side is monotonically decreasing in $b^2$, and is equal to $\|\mathbb{E}[xy]\|^2/\gamma_w^2$ when $b = 0$. When $b \to \infty$, the left-hand side tends to $0$. Because the left-hand side is a continuous and monotonic function of $b$, a unique solution $b_* > 0$ that satisfies Eq. (9) exists if and only if $\|\mathbb{E}[xy]\|^2/\gamma_w^2 > \gamma_u/\gamma_w$, or,

$$
\|\mathbb{E}[xy]\|^2>\gamma_u\gamma_w. \tag{10}
$$

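Therefore, at most three candidates for global minima of the loss function exist:

$$
\begin{cases}b=0,\ v=0 & \text{if } \|\mathbb{E}[xy]\|^2\le\gamma_u\gamma_w;\\ b=\pm b_*,\ v=b\left[b^2(\sigma^2+d_1)A_0+\gamma_wI\right]^{-1}\mathbb{E}[xy], & \text{if } \|\mathbb{E}[xy]\|^2>\gamma_u\gamma_w,\end{cases} \tag{11}
$$

where $b_* > 0$.

In the second case, one needs to discern the saddle points from the global minima. Using the expression of $v$, one finds the expression of the loss function as a function of $b$:

$$
d_1(d_1+\sigma^2)b^4\sum_i\frac{\mathbb{E}[x'y]_i^2\,a_i}{[b^2(\sigma^2+d_1)a_i+\gamma_w]^2}-2b^2d_1\sum_i\frac{\mathbb{E}[x'y]_i^2}{b^2(\sigma^2+d_1)a_i+\gamma_w}+\mathbb{E}[y^2]+\gamma_ud_1b^2+\gamma_wd_1\sum_i\frac{\mathbb{E}[x'y]_i^2\,b^2}{[b^2(\sigma^2+d_1)a_i+\gamma_w]^2}, \tag{12}
$$

where $x' = Rx$ such that $RA_0R^{-1}$ is a diagonal matrix. We now show that condition (10) is sufficient to guarantee that $b = 0$ is not the global minimum.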

At $b = 0$, the first nonvanishing derivative of the loss with respect to $b$ is the second-order derivative. The second-order derivative at $b = 0$ is

$$
-2d_1\|\mathbb{E}[xy]\|^2/\gamma_w+2\gamma_ud_1, \tag{13}
$$

which is negative if and only if $\|\mathbb{E}[xy]\|^2 > \gamma_u\gamma_w$. If the second derivative at $b = 0$ is negative, $b = 0$ cannot be a minimum. It then follows that for $\|\mathbb{E}[xy]\|^2 > \gamma_u\gamma_w$, the points $b = \pm b_*$, with $v$ given by Eq. (11), are the two global minima (because the loss is invariant to the sign flip of $b$). For the same reason, when $\|\mathbb{E}[xy]\|^2 \le \gamma_u\gamma_w$, $b = 0$ gives the unique global minimum. This finishes the proof.

Apparently, $b = 0$ is the trivial solution that has not learned any feature due to overregularization. Henceforth, we refer to this solution (and similar solutions for deeper nets) as the “trivial” solution. We now analyze the properties of the nontrivial solution when it exists.

The condition for the solution to become nontrivial is interesting: $\|\mathbb{E}[xy]\|^2 > \gamma_u\gamma_w$. The term $\|\mathbb{E}[xy]\|$ can be seen as the effective strength of the signal, and $\gamma_u\gamma_w$ is the strength of regularization. This precise condition means that the learning of a two-layer net can be divided into two qualitatively different regimes: an “overregularized regime,” where the global minimum is trivial, and a “feature learning regime,” where the global minimum involves actual learning.
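The two regimes can be made concrete numerically. The sketch below finds the scale $b_*$ of Theorem 1 by bisection on the condition in Eq. (9) and then assembles one global minimum in the form of Eq. (5). It is an illustration rather than the authors' procedure: the function names are hypothetical, `A0` is assumed positive definite so that the left-hand side of Eq. (9) decays to zero, and the sign vector is fixed to all ones although any vertex of the hypercube would do.

```python
import numpy as np

def solve_b_star(A0, Exy, sigma2, d1, gamma_u, gamma_w, iters=200):
    """Return b* > 0 solving Eq. (9), or 0.0 in the overregularized regime."""
    if Exy @ Exy <= gamma_u * gamma_w:        # condition (4): trivial solution
        return 0.0
    eye = np.eye(len(Exy))

    def lhs(b2):                              # left-hand side of Eq. (9) as a function of b^2
        z = np.linalg.solve(b2 * (sigma2 + d1) * A0 + gamma_w * eye, Exy)
        return z @ z

    target = gamma_u / gamma_w
    lo, hi = 0.0, 1.0
    while lhs(hi) > target:                   # lhs is decreasing in b^2; grow the bracket
        hi *= 2.0
    for _ in range(iters):                    # plain bisection on b^2
        mid = 0.5 * (lo + hi)
        if lhs(mid) > target:
            lo = mid
        else:
            hi = mid
    return np.sqrt(0.5 * (lo + hi))

def two_layer_global_minimum(A0, Exy, sigma2, d1, gamma_u, gamma_w):
    """One global minimum (U, W) in the form of Eq. (5), using r = (1, ..., 1)."""
    b = solve_b_star(A0, Exy, sigma2, d1, gamma_u, gamma_w)
    M = b**2 * (sigma2 + d1) * A0 + gamma_w * np.eye(len(Exy))
    v = b * np.linalg.solve(M, Exy)           # Eq. (7)
    r = np.ones(d1)
    return b * r, np.outer(r, v)
```

In the overregularized regime the function returns all-zero weights; otherwise the returned weights satisfy the balance condition of Lemma 1 up to the bisection tolerance.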

### 3.2 Exact Form of $b_*$

Note that our main result does not specify the exact value of $b_*$. This is because $b_*$ must satisfy the condition in Eq. (9), which is equivalent to a high-order polynomial in $b$ with coefficients being general functions of the eigenvalues of $A_0$, whose solutions are generally not analytical by Galois theory.

However, when $A_0 = \sigma_x^2I$, the exact form exists. Practically, this can be achieved for any (full-rank) dataset if we disentangle and rescale the data by the whitening transformation: $x \to \sigma_xA_0^{-1/2}x$. In this case, we have

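$$
b_*^2=\frac{\sqrt{\frac{\gamma_w}{\gamma_u}}\,\|\mathbb{E}[xy]\|-\gamma_w}{(\sigma^2+d_1)\sigma_x^2}, \tag{14}
$$

and

$$
v=\pm\sqrt{\frac{\sqrt{\frac{\gamma_u}{\gamma_w}}\,\|\mathbb{E}[xy]\|-\gamma_u}{\sigma_x^2(\sigma^2+d_1)}}\ \frac{\mathbb{E}[xy]}{\|\mathbb{E}[xy]\|}. \tag{15}
$$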
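The closed forms in Eqs. (14) and (15) follow from Eqs. (8) and (9) in a few lines. The worked step below is a sketch under the assumption that the data covariance is isotropic, written here as $A_0 = \sigma_x^2 I$; no quantities beyond those already appearing in Eqs. (8), (9), (14), and (15) are introduced.

```latex
\begin{aligned}
\left\|\left[b^2(\sigma^2+d_1)\sigma_x^2+\gamma_w\right]^{-1}\mathbb{E}[xy]\right\|^2 &= \frac{\gamma_u}{\gamma_w}
  && \text{(Eq. (9) with } A_0=\sigma_x^2 I\text{)}\\
b^2(\sigma^2+d_1)\sigma_x^2+\gamma_w &= \sqrt{\frac{\gamma_w}{\gamma_u}}\,\|\mathbb{E}[xy]\|
  && \text{(take the positive root)}\\
b_*^2 &= \frac{\sqrt{\gamma_w/\gamma_u}\,\|\mathbb{E}[xy]\|-\gamma_w}{(\sigma^2+d_1)\sigma_x^2}
  && \text{(Eq. (14))}\\
\|v\|^2=\frac{\gamma_u}{\gamma_w}\,b_*^2 &= \frac{\sqrt{\gamma_u/\gamma_w}\,\|\mathbb{E}[xy]\|-\gamma_u}{\sigma_x^2(\sigma^2+d_1)}
  && \text{(Eq. (8), matching Eq. (15))}
\end{aligned}
```

The right-hand side of the third line is positive exactly when $\|\mathbb{E}[xy]\|^2 > \gamma_u\gamma_w$, which agrees with the existence condition of Theorem 1.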

### 3.3 Bounding the General Solution

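While the solution to $b_*$ does not admit an analytical form for a general $A_0$, one can find meaningful lower and upper bounds to $b_*$ such that we can perform asymptotic analysis of $b_*$. At the global minimum, the following inequality holds:

$$
\left\|\left[b^2(\sigma^2+d_1)a_{\max}I+\gamma_wI\right]^{-1}\mathbb{E}[xy]\right\|^2\le\left\|\left[b^2(\sigma^2+d_1)A_0+\gamma_wI\right]^{-1}\mathbb{E}[xy]\right\|^2\le\left\|\left[b^2(\sigma^2+d_1)a_{\min}I+\gamma_wI\right]^{-1}\mathbb{E}[xy]\right\|^2, \tag{16}
$$

where $a_{\min}$ and $a_{\max}$ are the smallest and largest eigenvalues of $A_0$, respectively. The middle term is equal to $\gamma_u/\gamma_w$ by the global minimum condition in (9), and so, assuming $a_{\min} > 0$, this inequality is equivalent to the following two inequalities of $b_*$:

$$
\frac{\sqrt{\frac{\gamma_w}{\gamma_u}}\,\|\mathbb{E}[xy]\|-\gamma_w}{(\sigma^2+d_1)a_{\max}}\le b_*^2\le\frac{\sqrt{\frac{\gamma_w}{\gamma_u}}\,\|\mathbb{E}[xy]\|-\gamma_w}{(\sigma^2+d_1)a_{\min}}. \tag{17}
$$

Namely, the general solution $b_*$ should scale similarly to the homogeneous solution in Eq. (14) if we treat the eigenvalues of $A_0$ as constants. (We use the word “homogeneous” to mean that all the eigenvalues of $A_0$ take the same value.)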

## 4 Exact Solution for An Arbitrary-Depth Linear Net

This section extends our result to multiple layers, each with independent stochasticity. We first prove the analytical formula for the global minimum of a general arbitrary-depth model. We then show that the landscape for a deeper network is highly nontrivial.

### 4.1 General Solution

For this problem, the loss function becomes

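$$
\mathbb{E}_x\mathbb{E}_{\epsilon^{(1)},\epsilon^{(2)},\dots,\epsilon^{(D)}}\left(\sum_{i,i_1,i_2,\dots,i_D}^{d,d_1,d_2,\dots,d_D}U_{i_D}\epsilon^{(D)}_{i_D}\cdots\epsilon^{(2)}_{i_2}W^{(2)}_{i_2i_1}\epsilon^{(1)}_{i_1}W^{(1)}_{i_1i}x_i-y\right)^2+\gamma_u\|U\|^2+\sum_{i=1}^{D}\gamma_i\|W^{(i)}\|^2, \tag{18}
$$

where all the noises $\epsilon^{(i)}$ are independent from each other, and for all $i$ and $j$, $\mathbb{E}[\epsilon^{(i)}_j] = 1$ and $\mathbb{E}[(\epsilon^{(i)}_j)^2] = \sigma_i^2 + 1$. We first show that for general $D$, the global minimum exists for this objective.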

###### Proposition 1.

For $D \geq 1$ and strictly positive $\gamma_u$ and $\gamma_i$, the global minimum for Eq. (18) exists.

Note that the positivity of the regularization strength is crucial. If one of the $\gamma_i$ is zero, the global minimum may not exist. The following theorem is our second main result.

###### Theorem 2.

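Any global minimum of Eq. (18) is of the form

$$
\begin{cases}U=b_ur_D;\\ W^{(i)}=b_ir_ir_{i-1}^T;\\ W^{(1)}=r_1\,\mathbb{E}[xy]^T\left(b_u\prod_{i=2}^{D}b_i\right)\mu\left[\left(b_u\prod_{i=2}^{D}b_i\right)^2s^2(\sigma^2+d_1)A_0+\gamma_wI\right]^{-1},\end{cases} \tag{19}
$$

where $\mu$ and $s^2$ are constants determined by the widths and noise strengths of layers $2$ to $D$, and $r_i$ is an arbitrary vertex of a $d_i$-dimensional hypercube for all $i$.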

Proof Sketch. We prove by induction on the depth $D$. The base case is proved in Theorem 1. We then show that for a general depth, the objective involves optimizing two subproblems, one of which is a shallower problem that follows by the induction assumption, and the other is a two-layer problem that has been solved in Theorem 1. Putting these two subproblems together, one obtains Eq. (19).

Similar to a two-layer network, the scaling factors $b_i$ are not independent of one another. The following lemma is a generalization of Lemma 1 to a multilayer setting and shows that there is only one degree of freedom (instead of $D$) in the form of solutions in Eq. (19).

###### Lemma 3.

At any global minimum of Eq. (18), let and ,

$$
\gamma_{k+1}d_{k+1}b_{k+1}^2=\gamma_kd_{k-1}b_k^2. \tag{20}
$$

Proof Sketch. The proof is similar to Lemma 1.

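The lemma implies that the product of all the $b_i$ can be written in terms of one of the $b_i$:

$$
b_u\prod_{i=2}^{D}b_i=c_0\,\mathrm{sgn}\left(b_u\prod_{i=2}^{D}b_i\right)|b_2^D|:=c_0\,\mathrm{sgn}\left(b_u\prod_{i=2}^{D}b_i\right)b^D, \tag{21}
$$

where $c_0$ is a constant determined by the architecture and $b := |b_2| \geq 0$. Applying Lemma 3 to the first layer of Theorem 2 shows that the global minimum must satisfy the following equation, which is equivalent to a high-order polynomial in $b$ that does not have an analytical solution in general:

$$
\left\|\mathbb{E}[xy]^T\,c_0b^D\mu\left[c_0^2b^{2D}s^2(\sigma^2+d_1)A_0+\gamma_wI\right]^{-1}\right\|^2=d_2b^2. \tag{22}
$$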

At this point, it pays to clearly define the word “solution,” especially given that it has a special meaning in this work because it now becomes highly nontrivial to differentiate between the two types of solutions.

###### Definition 1.

We say that a non-negative real $b$ is a solution if it satisfies Eq. (22). A solution is trivial if $b = 0$ and nontrivial otherwise.

Namely, a global minimum must be a solution, but a solution is not necessarily a global minimum. We have seen that even in the two-layer case, the global minimum can be the trivial one when the strength of the signal is too weak or when the strength of regularization is too strong. It is thus natural to expect $b = 0$ to be the global minimum under a similar condition, and one is interested in whether the condition becomes stronger or weaker as the depth of the model is increased. However, it turns out this naive expectation is not true. In fact, when the depth of the model is larger than $1$, the condition for the trivial global minimum becomes highly nontrivial.

The following proposition shows why the problem becomes more complicated. In particular, we have seen that in the case of a two-layer net, some elementary argument has helped us show that the trivial solution is either a saddle or the global minimum. However, the proposition below shows that with $D \geq 2$, the landscape becomes more complicated in the sense that the trivial solution is always a local minimum, and it becomes difficult to compare the loss value of the trivial solution with the nontrivial solution because the value of $b_*$ is unknown in general.

###### Proposition 2.

Let $D \geq 2$ in Eq. (18). Then, the solution $U = 0$, $W^{(D)} = 0$, …, $W^{(1)} = 0$ is a local minimum with a diagonal positive-definite Hessian.

Proof Sketch. This is a technical proof that directly computes the Hessian.

Comparing the Hessian of $D \geq 2$ and $D = 1$, one notices a qualitative difference: for $D \geq 2$, the Hessian is always diagonal (at $0$); for $D = 1$, in sharp contrast, the off-diagonal terms are nonzero in general, and it is these off-diagonal terms that can break the positive-definiteness of the Hessian. This offers a different perspective about why there is a qualitative difference between $D = 1$ and $D \geq 2$.

Lastly, note that, unlike the depth-$1$ case, one can no longer find a precise condition such that a $b \neq 0$ solution exists for a general $A_0$. The reason is that the condition for the existence of the solution is now a high-order polynomial with quite arbitrary intermediate terms. The following proposition gives a sufficient but stronger-than-necessary condition for the existence of a nontrivial solution, when all the $\sigma_i$, intermediate widths, and regularization strengths are the same. (This is equivalent to setting $c_0 = \sqrt{d_0}$. The result is qualitatively similar but involves additional factors of $c_0$ if $\sigma_i$, $d_i$, and $\gamma_i$ all take different values. We thus only present the case when $\sigma_i$, $d_i$, and $\gamma_i$ are the same for notational concision and for emphasizing the most relevant terms. Also, note that this proposition gives a sufficient and necessary condition if $A_0$ is proportional to the identity.)

###### Proposition 3.

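Let $\sigma_i^2 = \sigma^2 > 0$, $d_i = d_0 > 0$ and $\gamma_i = \gamma > 0$ for all $i$. Assuming $a_{\min} > 0$, the only solution is trivial if

$$
\frac{D+1}{2D}\,\|\mathbb{E}[xy]\|\,d_0^{D-1}\left(\frac{(D-1)\|\mathbb{E}[xy]\|}{2Dd_0(\sigma^2+d_0)^Da_{\min}}\right)^{\frac{D-1}{D+1}}<\gamma. \tag{23}
$$

Nontrivial solutions exist if

$$
\frac{D+1}{2D}\,\|\mathbb{E}[xy]\|\,d_0^{D-1}\left(\frac{(D-1)\|\mathbb{E}[xy]\|}{2Dd_0(\sigma^2+d_0)^Da_{\max}}\right)^{\frac{D-1}{D+1}}\ge\gamma. \tag{24}
$$

Moreover, the nontrivial solutions are both lower and upper-bounded:

$$
\frac{1}{d_0}\left[\frac{\gamma}{\|\mathbb{E}[xy]\|}\right]^{\frac{1}{D-1}}\le b_*\le\left[\frac{\|\mathbb{E}[xy]\|}{d_0(\sigma^2+d_0)^Da_{\max}}\right]^{\frac{1}{D+1}}. \tag{25}
$$

(For $D = 1$, we define the lower bound as its limit $D \to 1^+$, which equals zero if $\gamma < \|\mathbb{E}[xy]\|$, and $\infty$ if $\gamma > \|\mathbb{E}[xy]\|$. With this definition, this proposition applies to a two-layer net as well.)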

Proof Sketch. The proof follows from the observation that the l.h.s. of Eq. (22) is a continuous function and must cross the r.h.s. under certain sufficient conditions.

One should compare the general condition here with the special condition for $D = 1$. One sees that for $D \geq 2$, many other factors (such as the width, the depth, and the spectrum of the data covariance $A_0$) come into play to determine the existence of a solution, apart from the signal strength $\mathbb{E}[xy]$ and the regularization strength $\gamma$.
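To see how the architecture enters the existence condition, the following sketch evaluates the left-hand side of Eqs. (23) and (24) and compares it with a direct scan over candidate scales. The scan relies on a reduction that is derived here rather than quoted from the text: for equal widths `d0`, equal noise strength `sigma2`, equal weight decay, and an isotropic covariance $A_0 = aI$, the solution condition in Eq. (22) reduces for $b > 0$ to $\gamma = d_0^{D-1}\|\mathbb{E}[xy]\|b^{D-1} - d_0^D(\sigma^2+d_0)^D a\,b^{2D}$. The function and argument names are hypothetical.

```python
import numpy as np

def existence_threshold(D, d0, sigma2, a, norm_Exy):
    """Left-hand side of Eqs. (23)/(24); pass a = a_min or a = a_max."""
    inner = (D - 1) * norm_Exy / (2 * D * d0 * (sigma2 + d0) ** D * a)
    return (D + 1) / (2 * D) * norm_Exy * d0 ** (D - 1) * inner ** ((D - 1) / (D + 1))

def gamma_of_b(b, D, d0, sigma2, a, norm_Exy):
    """Weight decay for which b > 0 is a solution, assuming A0 = a * I (derived reduction)."""
    return (d0 ** (D - 1) * norm_Exy * b ** (D - 1)
            - d0 ** D * (sigma2 + d0) ** D * a * b ** (2 * D))

# The largest weight decay that still admits a nontrivial solution is the
# maximum of gamma_of_b over b, which should coincide with the threshold.
D, d0, sigma2, a, norm_Exy = 3, 4, 1.0, 1.0, 1.0
b_grid = np.linspace(1e-6, 1.0, 200001)
scanned = gamma_of_b(b_grid, D, d0, sigma2, a, norm_Exy).max()
print(scanned, existence_threshold(D, d0, sigma2, a, norm_Exy))
```

For $D = 1$ the threshold reduces to `norm_Exy`, which recovers the two-layer condition when both layers share the same weight decay.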

### 4.2 Which Solution is the Global Minimum?

Again, we set $\gamma_i = \gamma > 0$, $\sigma_i^2 = \sigma^2 > 0$ and $d_i = d_0 > 0$ for all $i$ for notational concision. Using this condition and applying Lemma 3 to Theorem 2, the solution now takes the following form, where $b \geq 0$,

$$
\begin{cases}U=\sqrt{d_0}\,b\,r_D;\\ W^{(i)}=b\,r_ir_{i-1}^T;\\ W^{(1)}=r_1\,\mathbb{E}[xy]^T\,d_0^{D-\frac{1}{2}}b^D\left[d_0^D(\sigma^2+d_0)^Db^{2D}A_0+\gamma I\right]^{-1}.\end{cases} \tag{26}
$$

Note that the signs in the solution are arbitrary due to the invariances of the original objective. The following theorem gives a sufficient condition for the global minimum to be nontrivial. It also shows that the landscape of the linear net becomes complicated and can contain more than one local minimum when a certain condition is satisfied.

###### Theorem 3.

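Let $\sigma_i^2 = \sigma^2 > 0$, $d_i = d_0 > 0$ and $\gamma_i = \gamma > 0$ for all $i$ and assuming $a_{\min} > 0$. Then, there exists a constant $C$ such that for any

$$
0<\gamma<\left[\frac{1}{Dd_0^2}\,C^{1/D}\,\frac{1}{4d_0^D(\sigma^2+d)^D}\,\|\mathbb{E}[xy]\|^2_{A_0^{-1}}\right]^{\frac{D}{D+1}}, \tag{27}
$$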

the global minimum of Eq. (18) is one of the nontrivial solutions.

Proof Sketch. We find an easy-to-solve upper bound of the objective, which is simplified to the above condition.

While there are various ways this bound can be improved, it is general enough for our purpose. In particular, one sees that, for a general depth, the condition for having a nontrivial global minimum depends not only on $\|\mathbb{E}[xy]\|$ and $\gamma$ but also on the model architecture in general. For a more general architecture with different widths etc., the architectural constant $c_0$ from Eq. (22) will also enter the equation.

The combination of Theorem 3 and Proposition 2 shows that the landscape of a deep neural network becomes highly nontrivial when there is a weight decay and when the depth of the model is larger than $1$. This gives an incomplete but meaningful picture of a network’s complicated but interesting landscape beyond two layers (see Figure 1 for an incomplete summary of our results). In particular, even when the nontrivial solution is the global minimum, the trivial solution is still a local minimum that needs to be escaped. Our result suggests that the previous understanding that all local minima of a deep linear net are global cannot generalize to many practical settings where deep learning is found to work well. For example, a series of works attribute the existence of bad (non-global) minima to the use of nonlinearities in nonlinear nets (Kawaguchi, 2016) or the use of a non-regular (non-differentiable) loss function (Laurent and Brecht, 2018). Our result, in contrast, shows that the use of a simple weight decay is sufficient to create a bad minimum. (Some previous works do suggest the existence of bad minima when weight decay is present, but no direct proof exists yet. For example, Taghvaei et al. (2017) show that when the model is approximated by a linear dynamical system, regularization can cause bad local minima. Mehta et al. (2021) show the existence of bad local minima in deep linear networks with weight decay through numerical simulations.) Moreover, the problem with such a minimum is two-fold: (1) (optimization) it is not global and so needs to be “overcome” (this becomes more of a problem because we have shown that this bad minimum is located at the origin, where neural networks are commonly initialized) and (2) (generalization) it is a minimum that has not learned any feature at all because the model constantly outputs zero.

To the best of our knowledge, prior to our work, there has not been any proof that a bad minimum can generically exist in a rather arbitrary network without any restriction on the data. (In the case of non-linear networks without regularization, a few works proved the existence of bad minima. However, the previous results strongly depend on the data and are rather independent of architecture. For example, one major assumption is that the data cannot be perfectly fitted by a linear model (Yun et al., 2018; Liu, 2021; He et al., 2020). Some other works explicitly construct the data distribution (Safran and Shamir, 2018; Venturi et al., 2019). Our result, in contrast, is independent of the data.) Thus, our result offers direct and solid theoretical justification for the widely believed importance of escaping local minima in the field of deep learning (Kleinberg et al., 2018; Liu et al., 2021; Mori et al., 2022). In particular, previous works on escaping local minima often hypothesize landscapes that are of unknown relevance to an actual neural network. With our result, this line of research can now be established with respect to landscapes that are actually deep-learning-relevant.

Previous works also argue that having a deeper depth does not create a bad minimum (Lu and Kawaguchi, 2017). While this remains true, its generality and applicability to practical settings now also seem low. Our result shows that as long as weight decay is used, and as long as $D \geq 2$, there is indeed a bad local minimum at $0$. In contrast, there is no bad minimum at $0$ for a depth-$1$ network: the point $b = 0$ is either a saddle or the global minimum. (Of course, in practice, the model trained with SGD can still converge to the trivial solution even if it is a saddle point (Ziyin et al., 2021), because SGD with a finite learning rate is in general not a good estimator of the local minima.) Having a deeper depth thus alters the qualitative nature of the landscape, and our results agree better with the common observation that a deeper network is harder, if not impossible, to optimize.

### 4.3 Asymptotic Analysis

Now we analyze the solution when $D$ tends to infinity. We first note that the existence condition bound in (24) becomes exponentially harder to satisfy as $D$ becomes large:

$$
\|\mathbb{E}[xy]\|^2\ge4d_0^2a_{\max}\gamma\,e^{D\log[(\sigma^2+d_0)/d_0]+O(1)}. \tag{28}
$$

Recall that for a two-layer net, the existence condition is nothing but $\|\mathbb{E}[xy]\|^2 > \gamma_u\gamma_w$, independent of the depth, width, or the stochasticity in the model. For a deeper network, however, every factor comes into play, and the architecture of the model has a strong (and dominant) influence on the condition. In particular, a factor that increases polynomially in the model width and exponentially in the model depth appears.

There are many ways to interpret this condition. For example, it can be seen as an upper bound for the model depth. Alternatively, it is also an upper bound for $\gamma$ and a lower bound for $\|\mathbb{E}[xy]\|$. This condition means that learning becomes difficult and even impossible if we increase the depth of the model while fixing the weight decay or the dataset, also agreeing with the common observation that deep networks are very hard to train.
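The exponential dependence on depth can be read off from the existence condition in Eq. (24) by taking logarithms. The step below is a sketch of that calculation; it writes $m := \|\mathbb{E}[xy]\|$ as a shorthand introduced only here, and $O(1)$ collects the terms that stay bounded as $D$ grows with everything else held fixed.

```latex
\begin{aligned}
\log\gamma &\le \log\frac{D+1}{2D} + \log m + (D-1)\log d_0
  + \frac{D-1}{D+1}\left[\log\frac{(D-1)\,m}{2D\,d_0\,a_{\max}} - D\log(\sigma^2+d_0)\right]\\
&= (D-1)\log d_0 - \left(D-2+\frac{2}{D+1}\right)\log(\sigma^2+d_0) + O(1)
  \qquad\text{since } \frac{D(D-1)}{D+1} = D-2+\frac{2}{D+1}\\
&= -D\log\frac{\sigma^2+d_0}{d_0} + O(1).
\end{aligned}
```

Hence, for fixed data, the admissible weight decay shrinks like $e^{-D\log[(\sigma^2+d_0)/d_0]}$, which is the same rate that appears in Eq. (28).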

## 5 An Application to Stochastic Nets

As discussed, the property of stochastic nets is an important topic to study in deep learning. In this section, we present a simple application of our general solution to analyze the properties of a stochastic net. The following theorem summarizes our technical results.

###### Theorem 4.

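Let $\sigma_i^2 = \sigma^2 > 0$, $d_i = d_0 > 0$ and $\gamma_i = \gamma > 0$ for all $i$. Let $A_0 = \sigma_x^2I$. Then, at any global minimum of Eq. (18), in the limit of large $d_0$,

$$
\mathrm{Var}[f(x)]=O\left(d_0^{-1}\right). \tag{29}
$$

In the limit of large $\sigma^2$,

$$
\mathrm{Var}[f(x)]=O\left(\frac{1}{(\sigma^2)^D}\right). \tag{30}
$$

In the limit of large $D$,

$$
\mathrm{Var}[f(x)]=O\left(e^{-2D\log[(\sigma^2+d_0)/d_0]}\right). \tag{31}
$$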

Proof Sketch. The result follows by applying the bound in Proposition 3 to bound the model prediction.

Interestingly, the scaling of prediction variance in asymptotic $\sigma^2$ is different for different widths. The third result shows that the prediction variance decreases exponentially fast in $D$. In particular, this result answers a question recently proposed in Ziyin et al. (2022): does a stochastic net trained on MSE have a prediction variance that scales towards $0$? Ziyin et al. (2022) show that for a generic nonlinear model, the prediction variance at the global minimum scales as the inverse of the width. Ziyin et al. (2022) also hypothesize that the prediction variance scales towards $0$ as the strength of latent variance increases to infinity. We improve on their result in the case of a deep linear net by (a) showing that the $d_0^{-1}$ scaling is tight in general, independent of the depth or other factors of the model, (b) proving a power-law bound for asymptotic $\sigma^2$ that has been conjectured, and (c) proving a novel bound showing that the variance also scales towards zero as depth increases.
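The quantity bounded in Theorem 4 can be written down explicitly for any weights of the form in Eq. (26). The sketch below is an illustration under assumptions stated here rather than taken from the proof: the first layer is written as $W^{(1)} = r_1v^T$ for some vector `v`, every noise variable has mean 1 and variance `sigma2`, and the network output is then $\sqrt{d_0}\,b^D\prod_i\big(\sum_j\epsilon^{(i)}_j\big)(v\cdot x)$. The function name is hypothetical.

```python
import numpy as np

def prediction_variance(b, v, x, d0, D, sigma2):
    """Variance over the noise of f(x) for weights of the form in Eq. (26).

    Each factor sum_j eps^(i)_j has mean d0 and second moment d0 * (d0 + sigma2),
    and the D factors are independent, so the moments of the product factorize.
    """
    scale = d0 * b ** (2 * D) * (v @ x) ** 2
    return scale * ((d0 * (d0 + sigma2)) ** D - d0 ** (2 * D))
```

The expression vanishes when `sigma2` is zero, as expected for a deterministic net; how it behaves in the limits of Theorem 4 depends on how $b$ and $v$ themselves scale at the global minimum, which is what the bounds of Proposition 3 control.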

## 6 Conclusion

In this work, we derived the exact solution of a deep linear net with arbitrary depth and with stochasticity. The global minimum is shown to take exact forms with a hard-to-determine constant. As argued, we expect our work to shed more light on the highly nonlinear landscape of a deep neural network. Compared to the previous works that mostly focus on the qualitative understanding of the linear net, our result offers a more precise quantitative understanding of deep linear nets. The quantitative understanding is one major benefit of knowing the exact solution, whose usefulness we have also demonstrated with the application to stochastic nets. In particular, our result directly implies that a line of previous theoretical results is insufficient to explain why neural networks can often be efficiently optimized by the gradient descent method, especially in realistic settings when a weight decay term is present. One application of our main results also provides insight regarding how stochastic networks work, an understudied but important topic.

Restricting to the specific problem setting we studied, there are also many interesting unresolved problems. The following is an incomplete list:

1. In general, more than one nontrivial solution of Eq. (22) can exist. When this happens, is every solution a local minimum? If not, is every nontrivial solution a stationary point?

2. Is there any local minimum that is not a solution?

3. We have seen that a network with one hidden layer is qualitatively different from a network with more than one hidden layer. Are there any qualitative changes happening for even deeper networks?

4. Additionally, is an infinite-depth network qualitatively different from a finite-depth network?

Answering these questions can significantly deepen our understanding of the landscape of a neural network.

Our mathematical setting is also limited in two ways. First, it is unclear whether the derived results are unique to linear systems or also relevant for nonlinear networks. Second, we only gave the exact formula for the global minimum when the output is one-dimensional, and it is not clear whether an analytical solution exists for higher output dimensions. Generalizing beyond these two limitations is also an important future step.

## Appendix A Proofs

### A.1 Proof of Lemma 1

Proof. Note that the first term in the loss function is invariant to the following rescaling for any :

$$\begin{cases} U_i \to a U_i; \\ W_{ij} \to W_{ij}/a; \end{cases} \tag{32}$$

meanwhile, the regularization term changes as changes. Therefore, the global minimum must have a minimized with respect to any and .

One can easily find the solution:

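$$a^* = \arg\min_a \left( \gamma_u a^2 U_i^2 + \gamma_w \sum_j \frac{W_{ij}^2}{a^2} \right) = \left( \frac{\gamma_w \sum_j W_{ij}^2}{\gamma_u U_i^2} \right)^{1/4}. \tag{33}$$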

Therefore, at the global minimum, we must have , so that

$$(U_i^*)^2 = (a^* U_i)^2 = \frac{\gamma_w}{\gamma_u} \sum_j (W_{ij}^*)^2, \tag{34}$$

which completes the proof.
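The rescaling argument can be checked numerically. The following is a minimal sketch, not part of the original proof: the weight decay strengths and the weights of a single hidden neuron are arbitrary illustrative values, and the names `reg`, `a_grid`, `U_star`, and `W_star` are introduced here only for the example. It compares the closed form in Eq. (33) with a brute-force search over the scale, and then confirms the balance condition in Eq. (34).

```python
import numpy as np

# Illustrative (assumed) values; any positive weight decay strengths work.
gamma_u, gamma_w = 0.3, 0.7
U_i = 0.5                          # output weight of hidden neuron i
W_i = np.array([1.0, -2.0, 0.5])   # incoming weights W_ij of the same neuron

def reg(a):
    """Regularization on neuron i after U_i -> a U_i, W_ij -> W_ij / a."""
    return gamma_u * a**2 * U_i**2 + gamma_w * np.sum(W_i**2) / a**2

# Closed-form minimizer from Eq. (33).
a_star = (gamma_w * np.sum(W_i**2) / (gamma_u * U_i**2)) ** 0.25

# Brute-force minimizer over a fine grid of positive scales.
grid = np.linspace(0.05, 20.0, 200001)
a_grid = grid[np.argmin(reg(grid))]

# Rescaled weights: the product U_i * W_ij, and hence the data term, is unchanged.
U_star = a_star * U_i
W_star = W_i / a_star

print("closed form a*:", a_star, " grid search a*:", a_grid)
print("gamma_u U*^2      :", gamma_u * U_star**2)
print("gamma_w sum W*^2  :", gamma_w * np.sum(W_star**2))
```

The two printed regularization contributions agree, which is Eq. (34) rearranged as $\gamma_u (U_i^*)^2 = \gamma_w \sum_j (W_{ij}^*)^2$.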

### A.2 Proof of Lemma 2

Proof. By Lemma 1, we can write as and as where is a unit vector, and finding the global minimizer of Eq. (2) is equivalent to finding the minimizer of the following objective,

$$\mathbb{E}_{x,\epsilon}\left[ \left( \sum_{i,j} b_i^2 \epsilon_i w_{ij} x_j - y \right)^2 \right] + (\gamma_u + \gamma_w) \|b\|_2^2, \tag{36}$$

The lemma statement is equivalent to for all and .

We prove by contradiction. Suppose there exist and such that , we can choose to be the index of with maximum , and let be the index of with minimum . Now, we can construct a different solution by the following replacement of and :

$$\begin{cases} b_i^2 w_{i:} \to c^2 v; \\ b_j^2 w_{j:} \to c^2 v, \end{cases} \tag{37}$$

where $c$ is a positive scalar and $v$ is a unit vector such that $2c^2 v = b_i^2 w_{i:} + b_j^2 w_{j:}$. Note that, by the triangle inequality, $2c^2 \le b_i^2 + b_j^2$. Meanwhile, the terms for all other indices are left unchanged. This transformation leaves the first term in the loss function (36) unchanged, and we now show that it decreases the other terms.

The change in the second term is

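$$\left( b_i^2 \sum_k w_{ik} x_k \right)^2 + \left( b_j^2 \sum_k w_{jk} x_k \right)^2 \to 2 \left( c^2 \sum_k v_k x_k \right)^2 = \frac{1}{2} \left( b_i^2 \sum_k w_{ik} x_k + b_j^2 \sum_k w_{jk} x_k \right)^2. \tag{38}$$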

By the inequality $a^2 + b^2 \ge (a+b)^2/2$, we see that the left-hand side is larger than the right-hand side.

We now consider the regularization term. The change is

$$(\gamma_u + \gamma_w)(b_i^2 + b_j^2) \to 2(\gamma_u + \gamma_w) c^2, \tag{39}$$

and the left-hand side is again larger than the right-hand side by the inequality mentioned above: $2c^2 \le b_i^2 + b_j^2$. Therefore, we have constructed a solution whose loss is strictly smaller than that of the global minimum: a contradiction. Thus, the global minimum must satisfy

$$U_i^2 = U_j^2 \tag{40}$$

for all and .

Likewise, we can show that for all and . This is because the triangle inequality holds with equality only if . If , following the same argument as above, we arrive at another contradiction.
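The merging step in Eqs. (37)-(39) can also be checked numerically. The sketch below is illustrative only: it assumes the merged direction is defined by $2c^2 v = b_i^2 w_{i:} + b_j^2 w_{j:}$ (the relation implied by Eq. (38)), and the dimension, the values of `b_i` and `b_j`, and all variable names are hypothetical choices made for the example. It verifies that the summed output of the two neurons is preserved while the quadratic term and the regularization term do not increase.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                          # input dimension (illustrative)
b_i, b_j = 1.3, 0.6            # two unequal neuron scales (illustrative)
gamma_u, gamma_w = 0.3, 0.7    # weight decay strengths (illustrative)

# Random unit vectors playing the role of the rows w_{i:} and w_{j:}.
w_i = rng.normal(size=d)
w_i /= np.linalg.norm(w_i)
w_j = rng.normal(size=d)
w_j /= np.linalg.norm(w_j)

# Assumed merge rule: 2 c^2 v = b_i^2 w_i + b_j^2 w_j, with v a unit vector.
merged = b_i**2 * w_i + b_j**2 * w_j
two_c2 = np.linalg.norm(merged)
c2 = two_c2 / 2.0
v = merged / two_c2

xs = rng.normal(size=(1000, d))            # random test inputs
A = b_i**2 * (xs @ w_i)
B = b_j**2 * (xs @ w_j)

before = A**2 + B**2                       # left-hand side of Eq. (38)
after = 2.0 * (c2 * (xs @ v))**2           # right-hand side of Eq. (38)

reg_before = (gamma_u + gamma_w) * (b_i**2 + b_j**2)   # left of Eq. (39)
reg_after = 2.0 * (gamma_u + gamma_w) * c2             # right of Eq. (39)

print("summed output preserved:", np.allclose(2.0 * c2 * (xs @ v), A + B))
print("quadratic term never increases:", bool(np.all(before >= after - 1e-12)))
print("regularization:", reg_before, "->", reg_after)
```

Because `b_i` and `b_j` differ here, the regularization strictly decreases, which is the source of the contradiction in the proof.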

### A.3 Proof of Proposition 1

Proof. We first show that there exists a constant such that the global minimum must be confined within a (closed) -Ball around the origin. The objective (18) can be lower-bounded by

$$\text{Eq. (18)} \ge \gamma_u \|U\|^2 + \sum_{i=1}^{D} \gamma_i \|W^{(i)}\|^2 \ge \gamma_{\min} \left( \|U\|^2 + \sum_i \|W^{(i)}\|^2 \right), \tag{41}$$

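where $\gamma_{\min}$ is the smallest of the weight decay strengths $\gamma_u, \gamma_1, \dots, \gamma_D$. Now, let $w$ denote the union of all the parameters, viewed as a single vector. We see that the above inequality is equivalent to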

$$\text{Eq. (18)} \ge \gamma_{\min} \|w\|^2. \tag{42}$$

Now, note that the loss value at the origin is , which means that for any , whose norm , the loss value must be larger than the loss value of the origin. Therefore, let , we have proved that the global minimum must lie in a closed -Ball around the origin.

As the last step, because the objective is a continuous function of and the -Ball is a compact set, the minimum of the objective in this -Ball is achievable. This completes the proof.
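For concreteness, the radius in this argument can be made explicit. The notation below is introduced here for illustration only: $L(w)$ denotes the objective (18), $L(0)$ its value at the origin, and $\gamma_{\min} > 0$ is assumed. Under these assumptions, the bound (42) gives

```latex
L(w) \;\ge\; \gamma_{\min} \|w\|^2 \;>\; L(0)
\quad \text{whenever} \quad
\|w\| \;>\; r := \sqrt{\frac{L(0)}{\gamma_{\min}}},
```

so no point outside the closed ball of radius $r$ can be a global minimum, and the search can be restricted to that compact set.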

### A.4 Proof of Theorem 2

Proof. Note that the trivial solution is also a special case of this solution with . We thus focus on deriving the form of the nontrivial solution.

We prove by induction on . The base case with depth is proved in Theorem 1. We now assume that the same holds for depth and prove that it also holds for depth .

For any fixed , the loss function can be equivalently written as

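$$\mathbb{E}_{\tilde{x}} \mathbb{E}_{\epsilon^{(2)}, \dots, \epsilon^{(D)}} \left( \sum_{i_1, i_2, \dots, i_D}^{d_1, d_2, \dots, d_D} U_{i_D} \epsilon^{(D)}_{i_D} \cdots \epsilon^{(2)}_{i_2} W^{(2)}_{i_2 i_1} \tilde{x}_{i_1} - y \right)^2 + \gamma_u \|U\|^2 + \sum_{i=2}^{D} \gamma_i \|W^{(i)}\|^2 + \text{const.}, \tag{43}$$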

where . Namely, we have reduced the problem to a problem involving only a depth linear net with a transformed input .

By the induction assumption, the global minimum of this problem takes the form in Eq. (19), which means that the loss function can be written as the following form:

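$$\mathbb{E}_{\tilde{x}} \mathbb{E}_{\epsilon^{(2)}, \dots, \epsilon^{(D)}} \left( b_u b_D \cdots b_3 \sum_{i_1, i_2, \dots, i_D}^{d_1, d_2, \dots, d_D} \epsilon^{(D)}_{i_D} \cdots \epsilon^{(2)}_{i_2} v_{i_1} \tilde{x}_{i_1} - y \right)^2 + L_2\ \text{reg.}, \tag{44}$$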

for an arbitrary optimizable vector . The term