1 Introduction
Applications of neural networks have achieved great success in various fields. A central open theoretical question is why neural networks, being nonlinear and containing many saddle points and local minima, can sometimes be optimized easily to a global minimum (Choromanska et al., 2015a), while in other scenarios they become almost impossible to train (Glorot and Bengio, 2010) and may require many tricks to succeed (Gotmare et al., 2018). One established approach is to study the landscape of deep linear nets (Choromanska et al., 2015b), which are believed to approximate the landscape of a nonlinear net well. A series of works proved the famous result that for a deep linear net, all local minima are global (Kawaguchi, 2016; Lu and Kawaguchi, 2017; Laurent and Brecht, 2018), which is widely regarded as explaining why deep neural networks are easy to train, because it implies that initialization in any attractive basin can reach the global minimum without much effort (Kawaguchi, 2016).
In this work, we theoretically study a deep linear net with weight decay and stochastic neurons, whose loss function takes the following form in general:
(1) 
where and are the model parameters, is the depth of the network,^{1}^{1}1In this work, we use "depth" to refer to the number of hidden layers. For example, a linear regressor has depth 0. is the noise in the hidden layer (e.g., due to dropout), is the width of the th layer, and is the strength of the weight decay. Previous works have studied special cases of this loss function. For example, Kawaguchi (2016) and Lu and Kawaguchi (2017) study the landscape of when is a constant (namely, when there is no noise). Mehta et al. (2021) studies with (a more complicated type of) weight decay but without stochasticity and proves that all the stationary points are isolated. Another line of works studies the case where the noise is caused by dropout (Mianjy and Arora, 2019; Cavazza et al., 2018). Our setting is more general than the previous works in two respects. First, apart from the mean-square-error (MSE) loss, a regularization term (weight decay) with arbitrary strength is included; second, the noise is arbitrary. Thus, our setting is arguably closer to actual deep learning practice, where the injection of noise into latent layers is common and the use of weight decay is virtually ubiquitous (Krogh and Hertz, 1992; Loshchilov and Hutter, 2017). One major theoretical limitation of our work is that we assume the label to be one-dimensional, and it is an important future problem to prove whether an exact solution exists when the label is high-dimensional.

Our foremost contribution is to give the exact solution for all the global minima of an arbitrarily deep and wide linear net on the MSE loss plus a weight decay term, with a general type of stochasticity in the hidden layers. In other words, we identify in closed form the global minima of Eq. (1). We then show that they have nontrivial properties that can explain many phenomena in deep learning. In particular, the implications of our result include (but are not limited to):

Weight decay makes the landscape of neural nets more complicated;

we show that bad minima^{2}^{2}2Unless otherwise specified, we use the word “bad minimum” to mean a local minimum that is not a global minimum. Namely, one would have to overcome some nontrivial barrier to reach the global minimum. This usage is consistent with the previous literature (Kawaguchi, 2016). emerge as weight decay is applied, whereas there is no bad minimum when there is no weight decay. This highlights the need to escape bad local minima in deep learning with weight decay.


Deeper nets are harder to optimize than shallower ones;

we show that a 3-layer linear net contains a bad minimum at zero, whereas a 2-layer net does not. This partially explains why deeper networks are much harder to optimize than shallower ones in deep learning practice.


Stochastic networks in a few asymptotic limits can become deterministic;

we show that the prediction variance of a stochastic net on the MSE loss scales towards zero as (a) the width tends to infinity, or (b) the variance of the latent randomness tends to infinity, or (c) the depth tends to infinity.

In summary, our result deepens our understanding of the loss landscape of neural networks and stochastic networks. Organization: In the next section, we discuss the related works. In Section 3, we derive the exact solution for a two-layer net. Section 4 extends the result to an arbitrary depth. In Section 5, we apply our result to study stochastic networks. The last section concludes the work and discusses unresolved open problems. The technical, lengthy, or less essential proofs are deferred to Section A.
Notation. For a matrix , we use to denote the th row vector of . denotes the norm if the argument is a vector and the Frobenius norm if it is a matrix. The notation signals an optimized quantity. Additionally, we use the superscript and the subscript interchangeably, whichever leads to a simpler expression. For example, and denote the same quantity, while the former is "simpler."

2 Related Works
Linear Nets. Linear networks have been used in many ways to help understand nonlinear networks. For example, even at depth 0, where the linear net is nothing but a linear regressor, linear nets are shown to be relevant for understanding the generalization behavior of modern overparametrized networks (Hastie et al., 2019). Saxe et al. (2013) studies the training dynamics of a deep linear network and uses it to understand the learning dynamics of nonlinear networks. These networks are the same as a linear regression model in terms of expressivity. However, the loss landscape is highly complicated due to the existence of more than one layer, and linear nets are widely believed to approximate the loss landscape of a nonlinear net (Kawaguchi, 2016; Hardt and Ma, 2016; Laurent and Brecht, 2018). In particular, the landscape of linear nets has been studied as early as 1989 in Baldi and Hornik (1989), which proposed the well-known conjecture that all local minima of a deep linear net are global. This conjecture was first proved in Kawaguchi (2016) and extended to other loss functions and greater depths in Lu and Kawaguchi (2017) and Laurent and Brecht (2018).

Stochastic Net Theory. A major extension of the standard neural network is to make it stochastic, namely, to make the output a random function of the input. In a broad sense, stochastic neural networks include networks trained with dropout (Srivastava et al., 2014; Gal and Ghahramani, 2016), Bayesian networks (Mackay, 1992), variational autoencoders (VAE) (Kingma and Welling, 2013), and generative adversarial networks (Goodfellow et al., 2014). Stochastic networks are thus of both practical and theoretical importance to study. Theoretically, while a unified approach is lacking, some previous works separately study different stochastic techniques in deep learning. A series of recent works approaches the VAE loss theoretically (Dai and Wipf, 2019). Another line of recent works analyzes linear models trained with the VAE objective to study the commonly encountered mode-collapse problem of VAEs (Lucas et al., 2019; Koehler et al., 2021). A further series of works extensively studied the dropout technique with a linear network (Cavazza et al., 2018; Mianjy and Arora, 2019; Arora et al., 2020) and showed that dropout effectively controls the rank of the learned solution. Lastly, it is worth noting that in the original works of Kawaguchi (2016) and Choromanska et al. (2015a), a linear net with a special type of noise equivalent to dropout was used to model the effect of the ReLU nonlinearity in actual neural networks. Our work considers an arbitrary type of noise whose second moment exists, not just dropout.
3 Two-layer Linear Net
This section finds the exact solutions of a two-layer linear net. The data point is a dimensional vector drawn from a data distribution and the labels are generated through an arbitrary function . As is common in deep learning practice, a weight decay term is also added. For generality, the two layers have different strengths of weight decay, even though the strengths often take the same value in practice.
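To make the setup concrete, the following sketch simulates a two-layer stochastic linear net under dropout noise and checks that the noise-averaged loss equals a deterministic objective: the mean network's squared error plus a variance penalty on the per-unit contributions. All sizes, the keep probability `p`, and the variable names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6                    # input dim and hidden width (illustrative)
W = rng.normal(size=(n, d))    # first-layer weights
u = rng.normal(size=n)         # second-layer weights (scalar output)
x = rng.normal(size=d)         # a single data point
y = 1.3                        # its label

p = 0.5                        # dropout keep probability (assumed noise model)
h = W @ x

# Monte Carlo estimate of the stochastic loss E_eps[(u^T diag(eps) W x - y)^2]
# with eps_i in {0, 1/p}, so that E[eps_i] = 1.
eps = rng.binomial(1, p, size=(200_000, n)) / p
mc_loss = np.mean(((u * eps) @ h - y) ** 2)

# Averaging over eps analytically: squared error of the mean network plus a
# variance penalty on the per-unit contributions u_i * h_i.
analytic = (u @ h - y) ** 2 + (1 / p - 1) * np.sum((u * h) ** 2)

assert abs(mc_loss - analytic) < 0.05 * (1 + abs(analytic))
```

The variance penalty term is what makes the stochastic objective differ from plain linear regression, and it is the reason the noise interacts nontrivially with weight decay in the analysis that follows.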
In particular, we are interested in finding the global minimum of the following objective:
(2) 
where is the width of the hidden layer and
are independent random variables that are often used to make a neural network stochastic, dropout (Srivastava et al., 2014) being one well-known example. and are the weight decay parameters. Here, we consider a general type of independent noise with and , where is the Kronecker delta, and . For shorthand, we use the notation , and the largest and the smallest eigenvalues of are denoted as and , respectively. denotes the th eigenvalue of , viewed in any order. For now, it suffices to assume that the global minimum of Eq. (2) always exists; we will prove a more general existence result in Proposition 1 when we deal with multilayer nets.

3.1 Main Result
We first present two lemmas showing that the global minimum can only lie on a rather restrictive subspace of all possible parameter settings due to invariances (symmetries) in the objective.
Lemma 1.
At the global minimum of Eq. (2), for all .
Proof Sketch. We use the fact that the first term of Eq. (2) is invariant to a simultaneous rescaling of rows of the weight matrix to find the optimal rescaling, which implies the lemma statement.
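The rescaling argument can be checked numerically. The sketch below (with hypothetical weight-decay strengths `gu`, `gw` and illustrative shapes) rescales each row of the first layer and the matching output weight so that the data term is unchanged while the penalty becomes balanced across the two layers.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 4
W = rng.normal(size=(n, d))
u = rng.normal(size=n)
gu, gw = 0.1, 0.2              # hypothetical weight-decay strengths

def reg(u, W):                 # weight-decay part of the objective
    return gu * np.sum(u ** 2) + gw * np.sum(W ** 2)

# Scaling row i of W by 1/a_i and u_i by a_i leaves u^T diag(eps) W x
# unchanged. The optimal a_i balances the two penalty terms of row i:
a = (gw * np.sum(W ** 2, axis=1) / (gu * u ** 2)) ** 0.25
u_bal, W_bal = u * a, W / a[:, None]

# The data term is preserved...
x = rng.normal(size=d)
assert np.allclose(u @ (W @ x), u_bal @ (W_bal @ x))
# ...while the penalty can only decrease, and at the balanced point the
# per-row penalties of the two layers are equal.
assert reg(u_bal, W_bal) <= reg(u, W) + 1e-12
assert np.allclose(gu * u_bal ** 2, gw * np.sum(W_bal ** 2, axis=1))
```

The balanced point is the unique minimizer of the penalty along each rescaling direction, which is why the global minimum must satisfy the stated per-row condition.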
This lemma implies that for all , must be proportional to the norm of its corresponding row vector in . The following lemma further shows that, at the global minimum, all elements of must be equal.
Lemma 2.
At the global minimum, for all and , we have
(3) 
Proof Sketch. We show that if the condition is not satisfied, then an "averaging" transformation will strictly decrease the objective.
The second lemma imposes strong conditions on the solution of the problem, and the essence of this lemma is the reduction of the original problem to a lower dimension.
We are now ready to prove our first main result.
Theorem 1.
The global minimum and of Eq. (2) is and if and only if
(4) 
When , there exists such that the global minima are
(5) 
where is an arbitrary vertex of a dimensional hypercube.
Proof. By Lemma 2, at any global minimum, we can write for some . We can also write for a general vector . Without loss of generality, we assume that (because the sign of can be absorbed into ).
The original problem in Eq. (2) is now equivalently reduced to the following problem because :
(6) 
For any fixed , the global minimum of is well known:^{3}^{3}3Namely, it is the solution of a ridgeless linear regression problem.
(7) 
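As a sanity check on this step, the inner problem for a fixed scaling is a ridgeless (ordinary least-squares) regression, whose closed form can be verified numerically; the data, shapes, and variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 500, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

A = X.T @ X / N                  # empirical second moment E[x x^T]
b = X.T @ y / N                  # empirical E[x y]
v_star = np.linalg.solve(A, b)   # ridgeless solution A^{-1} E[x y]

# Stationarity: the gradient of the empirical squared loss vanishes here.
grad = 2 * (A @ v_star - b)
assert np.allclose(grad, 0.0, atol=1e-10)
# And it matches numpy's least-squares solver.
assert np.allclose(v_star, np.linalg.lstsq(X, y, rcond=None)[0])
```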
By Lemma 1, at a global minimum, also satisfies the following condition:
(8) 
One solution to this equation is , and we are interested in whether solutions with exist. If there is no other solution, then must be the unique global minimum; otherwise, we need to identify which of the solutions are actual global minima and which are just saddles.^{4}^{4}4We do not discern between saddles and maxima. When ,
(9) 
Note that the left-hand side is monotonically decreasing in and is equal to when . When , the left-hand side tends to . Because the left-hand side is a continuous and monotonic function of , a unique solution satisfying Eq. (9) exists if and only if , or,
(10) 
Therefore, at most three candidates for global minima of the loss function exist:
(11) 
where .
In the second case, one needs to discern the saddle points from the global minima. Using the expression of , one finds the expression of the loss function as a function of
(12) 
where such that is a diagonal matrix. We now show that condition (10) is sufficient to guarantee that is not the global minimum.
At , the first nonvanishing derivative of is the second-order derivative. The second-order derivative at is
(13) 
which is negative if and only if . If the second derivative at is negative, cannot be a minimum. It then follows that for , , are the two global minima (because the loss is invariant to the sign flip of ). For the same reason, when , gives the unique global minimum. This finishes the proof.
Evidently, is the trivial solution that has not learned any feature due to over-regularization. Henceforth, we refer to this solution (and similar solutions for deeper nets) as the "trivial" solution. We now analyze the properties of the nontrivial solution when it exists.
The condition for the solution to become nontrivial is interesting: . The term can be seen as the effective strength of the signal, and is the strength of regularization. This precise condition means that the learning of a two-layer net can be divided into two qualitatively different regimes: an "over-regularized regime," where the global minimum is trivial, and a "feature-learning regime," where the global minimum involves actual learning.
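The two regimes can be observed directly in a one-dimensional caricature of the model. Assuming the deterministic special case with unit input variance and target map `c` (all constants illustrative), the loss along the balanced direction reduces to a function of the product of the two weights; a grid search then shows a trivial minimizer above the threshold and a nontrivial one below it.

```python
import numpy as np

# One-dimensional caricature: f(x) = u*w*x, y = c*x, E[x^2] = 1, no noise
# (an assumed deterministic special case). With the balanced substitution
# u^2 = w^2 = |t| for t = u*w, the population loss is (t - c)^2 + 2*gamma*|t|.
def best_product(c, gamma):
    ts = np.linspace(-2 * abs(c), 2 * abs(c), 400001)
    losses = (ts - c) ** 2 + 2 * gamma * np.abs(ts)
    return ts[np.argmin(losses)]

c = 1.0
# Over-regularized regime (gamma > c): only the trivial solution survives.
assert abs(best_product(c, gamma=1.5)) < 1e-4
# Feature-learning regime (gamma < c): the minimizer sits near c - gamma.
assert abs(best_product(c, gamma=0.2) - 0.8) < 1e-3
```

In this caricature the threshold between the two regimes is exactly where the regularization strength matches the signal strength, mirroring the qualitative statement above.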
3.2 Exact Form of
Note that our main result does not specify the exact value of . This is because must satisfy the condition in Eq. (9), which is equivalent to a high-order polynomial in with coefficients that are general functions of the eigenvalues of ; by Galois theory, such polynomials generally admit no solution in radicals.
However, when , the exact form exists. In practice, this can be achieved for any (full-rank) dataset if we disentangle and rescale the data by the whitening transformation: . In this case, we have
(14) 
and
(15) 
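The whitening transformation used here is standard; a minimal sketch (with synthetic correlated data) verifies that mapping the data through the inverse square root of its second-moment matrix makes that matrix the identity.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 2000, 4
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))  # correlated synthetic data

A = X.T @ X / N                             # empirical second moment of the data
evals, evecs = np.linalg.eigh(A)
A_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

Xw = X @ A_inv_sqrt                         # whitening: x -> A^{-1/2} x
assert np.allclose(Xw.T @ Xw / N, np.eye(d), atol=1e-6)
```

After this transformation all eigenvalues of the data second moment are equal, which is exactly the condition under which the closed form above applies.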
3.3 Bounding the General Solution
While the solution to does not admit an analytical form for a general , one can find meaningful lower and upper bounds to such that we can perform an asymptotic analysis of . At the global minimum, the following inequality holds:
(16)
where and are the smallest and largest eigenvalues of , respectively. The middle term is equal to by the global-minimum condition in (9), and so, assuming , this inequality is equivalent to the following two inequalities for :
(17) 
Namely, the general solution should scale similarly to the homogeneous solution^{5}^{5}5We use the word “homogeneous” to mean that all the eigenvalues of take the same value. in Eq. (14) if we treat the eigenvalues of as constants.
4 Exact Solution for an Arbitrary-Depth Linear Net
This section extends our result to multiple layers, each with independent stochasticity. We first prove the analytical formula for the global minimum of a general arbitrary-depth model. We then show that the landscape of a deeper network is highly nontrivial.
4.1 General Solution
For this problem, the loss function becomes
(18) 
where all the noises are independent of each other, and for all and , and . We first show that for general , the global minimum of this objective exists.
Proposition 1.
For and strictly positive , the global minimum of Eq. (18) exists.
Note that the positivity of the regularization strength is crucial. If one of the is zero, the global minimum may not exist. The following theorem is our second main result.
Theorem 2.
Any global minimum of Eq. (18) is of the form
(19) 
where , , and , and is an arbitrary vertex of a dimensional hypercube for all .
Proof Sketch. We prove by induction on the depth . The base case is proved in Theorem 1. We then show that, for a general depth, the objective involves optimizing two subproblems: one is a -layer problem handled by the inductive hypothesis, and the other is a two-layer problem that has been solved in Theorem 1. Putting these two subproblems together, one obtains Eq. (19).
Similar to the two-layer case, the scaling factors are not independent of one another. The following lemma generalizes Lemma 1 to the multilayer setting and shows that there is only one degree of freedom (instead of one per layer) in the form of the solutions in Eq. (19).

Lemma 3.
At any global minimum of Eq. (18), let and ,
(20) 
Proof Sketch. The proof is similar to Lemma 1.
The lemma implies that the product of all the can be written in terms of any one of the :
(21) 
where and . Applying Lemma 3 to the first layer in Theorem 2 shows that the global minimum must satisfy the following equation, which is equivalent to a high-order polynomial in that does not have an analytical solution in general:
(22) 
At this point, it pays to define the word "solution" precisely, especially given that it has a special meaning in this work: it now becomes highly nontrivial to differentiate between the two types of solutions.
Definition 1.
We say that a nonnegative real is a solution if it satisfies Eq. (22). A solution is trivial if and nontrivial otherwise.
Namely, a global minimum must be a solution, but a solution is not necessarily a global minimum. We have seen that even in the two-layer case, the global minimum can be the trivial one when the strength of the signal is too weak or the strength of regularization is too strong. It is thus natural to expect to be the global minimum under a similar condition, and one is interested in whether the condition becomes stronger or weaker as the depth of the model increases. However, it turns out this naive expectation is not true. In fact, when the depth of the model is larger than , the condition for the trivial global minimum becomes highly nontrivial.
The following proposition shows why the problem becomes more complicated. In the case of a two-layer net, an elementary argument allowed us to show that the trivial solution is either a saddle or the global minimum. The proposition below shows that with , the landscape becomes more complicated in the sense that the trivial solution is always a local minimum, and it becomes difficult to compare the loss value of the trivial solution with that of the nontrivial solution because the value of is unknown in general.
Proposition 2.
Let in Eq. (18). Then, the solution , , …, is a local minimum with a diagonal positive-definite Hessian.
Proof Sketch. This is a technical proof that directly computes the Hessian.
Comparing the Hessians of and , one notices a qualitative difference: for , the Hessian is always diagonal (at ); for , in sharp contrast, the off-diagonal terms are nonzero in general, and it is these off-diagonal terms that can break the positive-definiteness of the Hessian. This offers a different perspective on why there is a qualitative difference between and .
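A scalar caricature makes this qualitative difference easy to verify. Below, the loss is built from a product of `num_layers` scalar weights with weight decay `gamma` (all constants illustrative); a finite-difference Hessian at the origin exhibits a negative eigenvalue with two factors and the diagonal Hessian `2*gamma*I` with three.

```python
import numpy as np

gamma, c = 0.1, 1.0

def loss(w):
    # Scalar deep linear chain with weight decay (hypothetical 1-d case):
    # (prod_i w_i - c)^2 + gamma * sum_i w_i^2
    return (np.prod(w) - c) ** 2 + gamma * np.sum(w ** 2)

def hessian_at_zero(num_layers, h=1e-4):
    # Central finite differences for all second derivatives at the origin.
    H = np.zeros((num_layers, num_layers))
    for i in range(num_layers):
        for j in range(num_layers):
            e_i = np.eye(num_layers)[i] * h
            e_j = np.eye(num_layers)[j] * h
            H[i, j] = (loss(e_i + e_j) - loss(e_i - e_j)
                       - loss(-e_i + e_j) + loss(-e_i - e_j)) / (4 * h * h)
    return H

# Two factors: an off-diagonal term -2c appears, breaking positive-definiteness
# when c > gamma, so the origin is a saddle.
H2 = hessian_at_zero(2)
assert np.min(np.linalg.eigvalsh(H2)) < 0
# Three factors: the data term contributes nothing quadratic at the origin,
# so the Hessian is 2*gamma*I, a strict local minimum regardless of the data.
H3 = hessian_at_zero(3)
assert np.allclose(H3, 2 * gamma * np.eye(3), atol=1e-4)
```

With three or more factors, the product term has no quadratic part at the origin, so only the weight-decay term survives in the Hessian, which is exactly the mechanism behind the proposition.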
Lastly, note that, unlike the depth case, one can no longer find a precise condition for the existence of a solution for a general . The reason is that the condition for the existence of a solution is now a high-order polynomial with quite arbitrary intermediate terms. The following proposition gives a sufficient but stronger-than-necessary condition for the existence of a nontrivial solution when all the , the intermediate widths, and the regularization strengths are the same.^{6}^{6}6This is equivalent to setting . The result is qualitatively similar but involves additional factors of if , , and take different values. We thus only present the case when , , and are the same for notational concision and to emphasize the most relevant terms. Also, note that this proposition gives a necessary and sufficient condition if is proportional to the identity.
Proposition 3.
Let , and for all . Assuming , the only solution is trivial if
(23) 
Nontrivial solutions exist if
(24) 
Moreover, the nontrivial solutions are both lower- and upper-bounded:^{7}^{7}7For , we define the lower bound as , which equals zero if , and if . With this definition, this proposition applies to a two-layer net as well.
(25) 
Proof Sketch. The proof follows from the observation that the l.h.s. of Eq. (22) is a continuous function and must cross the r.h.s. under certain sufficient conditions.
One should compare the general condition here with the special condition for . One sees that for , many other factors (such as the width, the depth, and the spectrum of the data covariance ) come into play to determine the existence of a solution, apart from the signal strength and the regularization strength .
4.2 Which Solution is the Global Minimum?
Again, we set , and for all for notational concision. Using this condition and applying Lemma 3 to Theorem 2, the solution now takes the following form, where ,
(26) 
Note that the signs of and are arbitrary due to the invariances of the original objective. The following theorem gives a sufficient condition for the global minimum to be nontrivial. It also shows that the landscape of the linear net becomes complicated and can contain more than one local minimum when a certain condition is satisfied.
Theorem 3.
Let , and for all , and assume . Then, there exists a constant such that for any
(27) 
the global minimum of Eq. (18) is one of the nontrivial solutions.
Proof Sketch. We find an easy-to-solve upper bound on the objective, which simplifies to the above condition.
While there are various ways this bound can be proved, it is general enough for our purpose. In particular, one sees that, for a general depth, the condition for having a nontrivial global minimum depends not only on and but also on the model architecture in general. For a more general architecture with different widths, etc., the architectural constant from Eq. (22) will also enter the equation.
The combination of Theorem 3 and Proposition 2 shows that the landscape of a deep neural network becomes highly nontrivial when there is weight decay and when the depth of the model is larger than . This gives an incomplete but meaningful picture of a network's complicated and interesting landscape beyond two layers (see Figure 1 for a partial summary of our results). In particular, even when the nontrivial solution is the global minimum, the trivial solution is still a local minimum that needs to be escaped. Our result suggests that the previous understanding that all local minima of a deep linear net are global cannot generalize to many practical settings where deep learning is found to work well. For example, a series of works attributes the existence of bad (non-global) minima to the use of nonlinearities in nonlinear nets (Kawaguchi, 2016) or the use of a non-regular (non-differentiable) loss function (Laurent and Brecht, 2018). Our result, in contrast, shows that the use of a simple weight decay is sufficient to create a bad minimum.^{8}^{8}8Some previous works do suggest the existence of bad minima when weight decay is present, but no direct proof exists yet. For example, Taghvaei et al. (2017) shows that when the model is approximated by a linear dynamical system, regularization can cause bad local minima. Mehta et al. (2021) shows the existence of bad local minima in deep linear networks with weight decay through numerical simulations. Moreover, the problem with such a minimum is twofold: (1) (optimization) it is not global and so needs to be "overcome"^{9}^{9}9This is all the more problematic because we have shown that this bad minimum is located at the origin, where neural networks are commonly initialized. and (2) (generalization) it is a minimum that has not learned any feature at all because the model constantly outputs zero.
To the best of our knowledge, prior to our work, there was no proof that a bad minimum can generically exist in a rather arbitrary network without any restriction on the data.^{10}^{10}10In the case of nonlinear networks without regularization, a few works proved the existence of bad minima. However, the previous results strongly depend on the data and are rather independent of the architecture. For example, one major assumption is that the data cannot be perfectly fitted by a linear model (Yun et al., 2018; Liu, 2021; He et al., 2020). Some other works explicitly construct data distributions (Safran and Shamir, 2018; Venturi et al., 2019). Our result, in contrast, is independent of the data. Thus, our result offers direct and solid theoretical justification for the widely believed importance of escaping local minima in the field of deep learning (Kleinberg et al., 2018; Liu et al., 2021; Mori et al., 2022). In particular, previous works on escaping local minima often hypothesize landscapes whose relevance to an actual neural network is unknown. With our result, this line of research can now be grounded in landscapes that are actually relevant to deep learning.
Previous works also argue that having a greater depth does not create a bad minimum (Lu and Kawaguchi, 2017). While this remains true, its generality and applicability to practical settings now also seem limited. Our result shows that as long as weight decay is used, and as long as , there is indeed a bad local minimum at . In contrast, there is no bad minimum at for a depth network: the point is either a saddle or the global minimum.^{11}^{11}11Of course, in practice, a model trained with SGD can still converge to the trivial solution even if it is a saddle point (Ziyin et al., 2021) because SGD with a finite learning rate is in general not a good estimator of the local minima. Having a greater depth thus alters the qualitative nature of the landscape, and our results agree better with the common observation that a deeper network is harder, if not impossible, to optimize.

4.3 Asymptotic Analysis
Now we analyze the solution when tends to infinity. We first note that the existence condition in (24) becomes exponentially harder to satisfy as becomes large:
(28) 
Recall that for a two-layer net, the existence condition is nothing but , independent of the depth, width, or the stochasticity in the model. For a deeper network, however, every factor comes into play, and the architecture of the model has a strong (and dominant) influence on the condition. In particular, a factor appears that increases polynomially in the model width and exponentially in the model depth.
There are many ways to interpret this condition. For example, it can be seen as an upper bound for the model depth. Alternatively, it is also an upper bound for and a lower bound for . This condition means that learning becomes difficult and even impossible if we increase the depth of the model while fixing the weight decay or the dataset, again agreeing with the common observation that deep networks are very hard to train.
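The depth dependence of the existence condition can also be probed numerically in a scalar caricature of the model. Restricting to the balanced slice where all weights share a magnitude `r`, the sketch below bisects for the largest weight decay at which the nontrivial solution still beats the trivial one, and confirms that this threshold shrinks as the depth grows (all constants are illustrative).

```python
import numpy as np

def nontrivial_gain(depth, gamma, c=1.0):
    # Balanced slice |w_i| = r for all layers of a scalar chain (illustrative):
    # loss along the slice is (r^depth - c)^2 + gamma*depth*r^2,
    # while the trivial solution r = 0 attains c^2.
    r = np.linspace(0.0, 2.0, 20001)
    return c ** 2 - np.min((r ** depth - c) ** 2 + gamma * depth * r ** 2)

def critical_gamma(depth):
    lo, hi = 0.0, 2.0
    for _ in range(40):                    # bisect the largest workable gamma
        mid = (lo + hi) / 2
        if nontrivial_gain(depth, mid) > 1e-6:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

thresholds = [critical_gamma(D) for D in (2, 3, 4, 5)]
# The admissible weight decay shrinks monotonically with depth.
assert all(a > b for a, b in zip(thresholds, thresholds[1:]))
```

This matches the qualitative message of the condition above: for a fixed dataset and weight decay, making the chain deeper eventually pushes the model into the regime where only the trivial solution survives.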
5 An Application to Stochastic Nets
As discussed, the properties of stochastic nets are an important topic in deep learning. In this section, we present a simple application of our general solution to analyze the properties of a stochastic net. The following theorem summarizes our technical results.
Theorem 4.
Let , and for all . Let . Then, at any global minimum of Eq. (18), in the limit of large ,
(29) 
In the limit of large ,
(30) 
In the limit of large ,
(31) 
Proof Sketch. The result follows by applying the bound in Proposition 3 to bound the model prediction.
Interestingly, the asymptotic scaling of the prediction variance differs across these limits. The third result shows that the prediction variance decreases exponentially fast in the depth. In particular, this result answers a question recently posed in Ziyin et al. (2022): does a stochastic net trained on the MSE loss have a prediction variance that scales towards zero? Ziyin et al. (2022) shows that for a generic nonlinear model, the prediction variance at the global minimum scales as . Ziyin et al. (2022) also hypothesizes that the prediction variance scales towards zero as the strength of the latent variance increases to infinity. We improve on their result in the case of a deep linear net by (a) showing that the bound is tight in general, independent of the depth or other factors of the model, (b) proving the conjectured power-law bound in the asymptotic limit, and (c) proving a novel bound showing that the variance also scales towards zero as the depth increases.
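A minimal dropout example illustrates the width scaling of the prediction variance. Assuming a balanced solution that spreads the learned map equally over the hidden units (keep probability `p`, target map `c`, and probe input `x` are all illustrative), the output variance decays as one over the width; the closed form is checked against a Monte Carlo estimate.

```python
import numpy as np

p, c, x = 0.5, 1.0, 1.0   # keep probability, target map, probe input (assumed)

def prediction_variance(width):
    # Balanced solution spreading the learned map equally: u_i * w_i = c / width.
    # Dropout noise eps_i in {0, 1/p} is independent across units, so the
    # output variance is the sum of the per-unit variances.
    contrib = np.full(width, c / width) * x
    return (1 / p - 1) * np.sum(contrib ** 2)

v = [prediction_variance(d) for d in (1, 10, 100, 1000)]
# Exact 1/width decay of the variance.
assert np.allclose(np.array(v) * np.array([1, 10, 100, 1000]), v[0])

# Monte Carlo check at width 10.
rng = np.random.default_rng(4)
eps = rng.binomial(1, p, size=(200_000, 10)) / p
out = (np.full(10, c / 10) * x * eps).sum(axis=1)
assert abs(out.var() - prediction_variance(10)) < 0.01
```

The same bookkeeping, applied layer by layer, is what drives the exponential decay of the variance with depth in the theorem above.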
6 Conclusion
In this work, we derived the exact solution of a deep linear net with arbitrary depth and with stochasticity. The global minimum is shown to take an exact form with a hard-to-determine constant. As argued, we expect our work to shed more light on the highly nonlinear landscape of deep neural networks. Compared to previous works that mostly focus on a qualitative understanding of the linear net, our result offers a more precise quantitative understanding of deep linear nets. Such quantitative understanding is one major benefit of knowing the exact solution, whose usefulness we have also demonstrated with the application to stochastic nets. In particular, our result directly implies that a line of previous theoretical results is insufficient to explain why neural networks can often be efficiently optimized by gradient descent, especially in realistic settings where a weight decay term is present. An application of our main results also provides insight into how stochastic networks work, an understudied but important topic.
Restricting to the specific problem setting we studied, there are also many interesting unresolved problems. The following is an incomplete list:

In general, more than one nontrivial solution of Eq. (22) can exist. When they do, is every solution a local minimum? If not, is every nontrivial solution a stationary point?

Is there any local minimum that is not a solution?

We have seen that a depth network is qualitatively different from a depth network. Are there qualitative changes for even deeper networks?

Additionally, is an infinite-depth network qualitatively different from a finite-depth network?
Answering these questions can significantly deepen our understanding of the landscape of a neural network.
Our mathematical setting is also limited in two ways. First, it is unclear whether the derived results are unique to linear systems or also relevant for nonlinear networks. Second, we only gave the exact formula for the global minimum when the output neuron has dimension one, and it is not clear whether an analytical solution exists for higher output dimensions. Generalizing beyond these two limitations is an important future step.
References
 Arora et al. (2020) Arora, R., Bartlett, P., Mianjy, P., and Srebro, N. (2020). Dropout: Explicit forms and capacity control.

Baldi and Hornik (1989)
Baldi, P. and Hornik, K. (1989).
Neural networks and principal component analysis: Learning from examples without local minima.
Neural networks, 2(1):53–58. 
Cavazza et al. (2018)
Cavazza, J., Morerio, P., Haeffele, B., Lane, C., Murino, V., and Vidal, R.
(2018).
Dropout as a lowrank regularizer for matrix factorization.
In
International Conference on Artificial Intelligence and Statistics
, pages 435–444. PMLR.  Choromanska et al. (2015a) Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015a). The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204.
 Choromanska et al. (2015b) Choromanska, A., LeCun, Y., and Arous, G. B. (2015b). Open problem: The landscape of the loss surfaces of multilayer networks. In Conference on Learning Theory, pages 1756–1760. PMLR.
 Dai and Wipf (2019) Dai, B. and Wipf, D. (2019). Diagnosing and enhancing vae models. arXiv preprint arXiv:1903.05789.

Gal and Ghahramani (2016)
Gal, Y. and Ghahramani, Z. (2016).
Dropout as a bayesian approximation: Representing model uncertainty
in deep learning.
In
international conference on machine learning
, pages 1050–1059. PMLR.  Glorot and Bengio (2010) Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings.
 Goodfellow et al. (2014) Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
 Gotmare et al. (2018) Gotmare, A., Keskar, N. S., Xiong, C., and Socher, R. (2018). A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. arXiv preprint arXiv:1810.13243.
 Hardt and Ma (2016) Hardt, M. and Ma, T. (2016). Identity matters in deep learning. arXiv preprint arXiv:1611.04231.
 Hastie et al. (2019) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2019). Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560.
 He et al. (2020) He, F., Wang, B., and Tao, D. (2020). Piecewise linear activations substantially shape the loss surfaces of neural networks. arXiv preprint arXiv:2003.12236.
 Kawaguchi (2016) Kawaguchi, K. (2016). Deep learning without poor local minima. Advances in Neural Information Processing Systems, 29:586–594.
 Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
 Kleinberg et al. (2018) Kleinberg, B., Li, Y., and Yuan, Y. (2018). An alternative view: When does SGD escape local minima? In International Conference on Machine Learning, pages 2698–2707. PMLR.
 Koehler et al. (2021) Koehler, F., Mehta, V., Risteski, A., and Zhou, C. (2021). Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias. arXiv preprint arXiv:2112.06868.
 Krogh and Hertz (1992) Krogh, A. and Hertz, J. A. (1992). A simple weight decay can improve generalization. In Advances in neural information processing systems, pages 950–957.
 Laurent and Brecht (2018) Laurent, T. and Brecht, J. (2018). Deep linear networks with arbitrary loss: All local minima are global. In International conference on machine learning, pages 2902–2907. PMLR.
 Liu (2021) Liu, B. (2021). Spurious local minima are common for deep neural networks with piecewise linear activations. arXiv preprint arXiv:2102.13233.

Liu et al. (2021) Liu, K., Ziyin, L., and Ueda, M. (2021). Noise and fluctuation of finite learning rate stochastic gradient descent.
 Loshchilov and Hutter (2017) Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
 Lu and Kawaguchi (2017) Lu, H. and Kawaguchi, K. (2017). Depth creates no bad local minima. arXiv preprint arXiv:1702.08580.
 Lucas et al. (2019) Lucas, J., Tucker, G., Grosse, R., and Norouzi, M. (2019). Don't blame the ELBO! A linear VAE perspective on posterior collapse.
 Mackay (1992) Mackay, D. J. C. (1992). Bayesian methods for adaptive models. PhD thesis, California Institute of Technology.
 Mehta et al. (2021) Mehta, D., Chen, T., Tang, T., and Hauenstein, J. (2021). The loss surface of deep linear networks viewed through the algebraic geometry lens. IEEE Transactions on Pattern Analysis and Machine Intelligence.
 Mianjy and Arora (2019) Mianjy, P. and Arora, R. (2019). On dropout and nuclear norm regularization. In International Conference on Machine Learning, pages 4575–4584. PMLR.
 Mori et al. (2022) Mori, T., Ziyin, L., Liu, K., and Ueda, M. (2022). Power-law escape rate of SGD.
 Safran and Shamir (2018) Safran, I. and Shamir, O. (2018). Spurious local minima are common in two-layer ReLU neural networks. In International Conference on Machine Learning, pages 4433–4441. PMLR.
 Saxe et al. (2013) Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958.
 Taghvaei et al. (2017) Taghvaei, A., Kim, J. W., and Mehta, P. (2017). How regularization affects the critical points in linear networks. Advances in neural information processing systems, 30.
 Venturi et al. (2019) Venturi, L., Bandeira, A. S., and Bruna, J. (2019). Spurious valleys in onehiddenlayer neural network optimization landscapes. Journal of Machine Learning Research, 20:133.
 Yun et al. (2018) Yun, C., Sra, S., and Jadbabaie, A. (2018). Small nonlinearities in activation functions create bad local minima in neural networks. arXiv preprint arXiv:1802.03487.
 Ziyin et al. (2021) Ziyin, L., Li, B., Simon, J. B., and Ueda, M. (2021). SGD may never escape saddle points.
 Ziyin et al. (2022) Ziyin, L., Zhang, H., Meng, X., Lu, Y., Xing, E., and Ueda, M. (2022). Stochastic neural networks with infinite width are deterministic.
Appendix A Proofs
A.1 Proof of Lemma 1
Proof. Note that the first term in the loss function is invariant to the following rescaling for any :
(32) 
meanwhile, the regularization term does change under this rescaling. Therefore, at the global minimum, the regularization term must be minimized with respect to any such rescaling.
One can easily find the solution:
(33) 
Therefore, at the global minimum, we must have , so that
(34) 
which completes the proof.
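The rescaling argument can be checked numerically. Below is a minimal sketch assuming, for illustration only, a single-factor rescaling of the form u → λu, w → w/λ with an L2 penalty γ(u² + w²); the symbols u, w, γ are ours and need not match the paper's notation:

```python
import numpy as np

# Toy single-neuron model: the fit term depends only on the product u*w,
# which is invariant under u -> lam*u, w -> w/lam.
u, w, gamma = 2.0, 8.0, 0.1

def penalty(lam):
    # Weight decay term after rescaling by lam.
    return gamma * ((lam * u) ** 2 + (w / lam) ** 2)

# Minimize the penalty over the rescaling factor on a fine grid.
lams = np.linspace(0.1, 10.0, 100001)
best = lams[np.argmin(penalty(lams))]

# The penalty is minimized when the two factors are balanced:
# |lam*u| == |w/lam|, i.e. lam = sqrt(|w|/|u|), with value 2*gamma*|u*w|.
assert abs(best - np.sqrt(w / u)) < 1e-3
assert abs(penalty(best) - 2 * gamma * abs(u * w)) < 1e-6
```

This is the standard AM–GM balancedness effect of weight decay: since the fit term cannot distinguish rescaled solutions, the global minimum must balance the factor norms.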
A.2 Proof of Lemma 2
Proof. By Lemma 1, we can write as and as where is a unit vector, and finding the global minimizer of Eq. (2) is equivalent to finding the minimizer of the following objective,
(35)  
(36) 
The lemma statement is equivalent to for all and .
We prove by contradiction. Suppose there exist and such that . We can then choose to be the index of with maximum , and let be the index of with minimum . Now, we can construct a different solution by the following replacement of and :
(37) 
where is a positive scalar and is a unit vector such that . Note that, by the triangle inequality, . Meanwhile, all the other terms, for and , are left unchanged. This transformation leaves the first term in the loss function (36) unchanged, and we now show that it decreases the other terms.
The change in the second term is
(38) 
By the inequality , we see that the left-hand side is larger than the right-hand side.
We now consider the regularization term. The change is
(39) 
and the left-hand side is again larger than the right-hand side by the inequality mentioned above: . Therefore, we have constructed a solution whose loss is strictly smaller than that of the global minimum: a contradiction. Thus, the global minimum must satisfy
(40) 
for all and .
Likewise, we can show that for all and . This is because the triangle inequality holds with equality only if . If , then following the same argument as above, we arrive at another contradiction.
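The equality case of the triangle inequality invoked above can be illustrated numerically: ‖a + b‖ = ‖a‖ + ‖b‖ holds only when the two vectors point in the same direction, and the strict inequality for non-aligned vectors is what produces the contradiction. A small check (our own toy vectors, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)

def gap(a, b):
    # Triangle-inequality slack: zero iff a and b point in the same direction.
    return np.linalg.norm(a) + np.linalg.norm(b) - np.linalg.norm(a + b)

a = rng.standard_normal(5)

# Same direction: the inequality is tight (slack is zero).
assert abs(gap(a, 3.0 * a)) < 1e-12

# Generic directions: the inequality is strict.
for _ in range(100):
    b = rng.standard_normal(5)
    assert gap(a, b) > 1e-9
```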
A.3 Proof of Proposition 1
Proof. We first show that there exists a constant such that the global minimum must be confined within a (closed) ball around the origin. The objective (18) can be lower-bounded by
(41) 
where . Now, view the union of all the parameters as a single vector. We see that the above inequality is equivalent to
(42) 
Now, note that the loss value at the origin is finite, which means that for any parameter whose norm is sufficiently large, the loss value must be larger than the loss value at the origin. Therefore, choosing the radius of the ball accordingly, we have proved that the global minimum must lie in a closed ball around the origin.
As the last step, because the objective is a continuous function of the parameters and the ball is a compact set, the minimum of the objective in this ball is achievable by the extreme value theorem. This completes the proof.
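The coercivity argument can be sketched numerically: if the loss is bounded below by the weight-decay term, any parameter whose penalty alone exceeds the loss at the origin cannot be a minimizer, so the search can be restricted to a compact ball. A toy one-dimensional ridge-regression check (our own notation and data, for illustration only):

```python
import numpy as np

# One-dimensional ridge regression: L(w) = (w*x - y)^2 + gamma*w^2.
x, y, gamma = 1.5, 4.0, 0.3

def loss(w):
    return (w * x - y) ** 2 + gamma * w ** 2

# The loss at the origin upper-bounds the minimum, and L(w) >= gamma*w^2
# lower-bounds the loss, so the minimizer satisfies |w| <= sqrt(L(0)/gamma).
radius = np.sqrt(loss(0.0) / gamma)

# Closed-form ridge minimizer for this toy problem.
w_star = x * y / (x ** 2 + gamma)
assert abs(w_star) <= radius
assert loss(w_star) <= loss(0.0)
```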
A.4 Proof of Theorem 2
Proof. Note that the trivial solution is also a special case of this solution with . We thus focus on deriving the form of the nontrivial solution.
We prove the theorem by induction on the depth. The base case is proved in Theorem 1. We now assume that the statement holds for depth and prove that it also holds for depth .
For any fixed , the loss function can be equivalently written as
(43) 
where . Namely, we have reduced the problem to one involving only a lower-depth linear net with a transformed input.
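This reduction step can be illustrated with a small numerical sketch: holding the innermost weight matrix fixed, a deeper linear net on input x has the same fit term as a shallower net on the transformed input x' = W1 @ x (the variable names and shapes are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random data and weights for a linear net y ≈ W3 @ W2 @ W1 @ x.
x = rng.standard_normal((4, 10))   # 10 samples of dimension 4
W1 = rng.standard_normal((3, 4))   # innermost layer, held fixed
W2 = rng.standard_normal((3, 3))
W3 = rng.standard_normal((2, 3))
y = rng.standard_normal((2, 10))

# Fit term of the deeper net on the raw input x ...
loss_deep = np.sum((W3 @ W2 @ (W1 @ x) - y) ** 2)

# ... equals the fit term of a shallower net on the transformed input.
x_prime = W1 @ x
loss_shallow = np.sum((W3 @ W2 @ x_prime - y) ** 2)

assert np.isclose(loss_deep, loss_shallow)
```

The equality is just associativity of matrix products, which is what lets the induction peel off one layer at a time.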
By the induction assumption, the global minimum of this problem takes the form in Eq. (19), which means that the loss function can be written as the following form:
(44) 
for an arbitrary optimizable vector . The term