# Characterization of Gradient Dominance and Regularity Conditions for Neural Networks

The past decade has witnessed the successful application of deep learning to many challenging problems in machine learning and artificial intelligence. However, the loss functions of deep neural networks (especially nonlinear networks) are still far from being well understood from a theoretical perspective. In this paper, we enrich the current understanding of the landscape of the square loss functions for three types of neural networks. Specifically, when the parameter matrices are square, we provide an explicit characterization of the global minimizers for linear networks, linear residual networks, and nonlinear networks with one hidden layer. Then, we establish two quadratic types of landscape properties for the square loss of these neural networks, i.e., the gradient dominance condition within the neighborhood of their full-rank global minimizers, and the regularity condition along certain directions and within the neighborhood of their global minimizers. These two landscape properties are desirable for optimization around the global minimizers of the loss function for these neural networks.


## 1 Introduction

The significant success of deep learning (see, e.g., Goodfellow et al. (2016)) has influenced many fields such as machine learning, artificial intelligence, computer vision, natural language processing, etc. Consequently, there is a rising interest in understanding the fundamental properties of deep neural networks. Among them, the landscape (also referred to as geometry) of the loss functions of neural networks is an important aspect, since it is central to determining the performance of optimization algorithms that are designed to minimize these functions. The loss functions of neural networks are typically nonconvex, and hence understanding these functions requires substantially new insights and analysis techniques.

There has been a growing literature recently that contributed towards understanding the landscape properties of loss functions of neural networks. For example, Baldi and Hornik (1989) showed that any local minimum is a global minimum for the square loss function for linear networks with one hidden layer, and more recently Kawaguchi (2016); Yun et al. (2017) showed that such a result continues to hold for deep linear networks. Choromanska et al. (2015a, b) characterized the distribution properties of the local minimizers for deep nonlinear networks, and Kawaguchi (2016) further eliminated some assumptions in Choromanska et al. (2015a), and established the equivalence between the local minimum and the global minimum. More results on this topic are further discussed in Section 1.2.

The main focus of this paper is on two landscape properties that have been shown to be important for determining the convergence of first-order algorithms for nonconvex optimization. The first property is referred to as the gradient dominance condition, as we describe below. Consider a global minimizer x∗ of a generic function f, and a neighborhood Bx∗(δ) around x∗. The (local) gradient dominance condition with regard to x∗ is given by

 ∀x∈Bx∗(δ), f(x)−f(x∗)≤λ∥∇f(x)∥22,

where λ>0 and Bx∗(δ) is a neighborhood of x∗ with radius δ. This condition is a special case of the Łojasiewicz gradient inequality Łojasiewicz (1965) (with exponent 1/2), and has been shown to hold for a variety of machine learning problems, e.g., the square loss function for phase retrieval Zhou et al. (2016) and blind deconvolution Li et al. (2016b). If the algorithm iterates in the neighborhood Bx∗(δ), then the gradient dominance condition, together with a Lipschitz property of the gradient of an objective function, guarantees a linear convergence of the function value residual Karimi et al. (2016); Reddi et al. (2016).
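As a quick numerical illustration (ours, not from the paper), the least-squares objective f(x) = ½∥Ax−b∥²₂ with a full-column-rank A satisfies the gradient dominance condition globally with λ = 1/(2σ²min(A)); the sketch below checks the inequality at random points:

```python
import numpy as np

# Illustrative check of gradient dominance (PL inequality)
# f(x) - f(x*) <= lam * ||grad f(x)||^2 for a least-squares objective.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))        # full column rank almost surely
b = rng.standard_normal(5)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # a global minimizer
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

# lam = 1 / (2 * sigma_min(A)^2); svd returns singular values descending
lam = 1.0 / (2.0 * np.linalg.svd(A, compute_uv=False)[-1] ** 2)
for _ in range(100):
    x = x_star + rng.standard_normal(3)          # arbitrary point
    assert f(x) - f(x_star) <= lam * np.linalg.norm(grad(x)) ** 2 + 1e-6
```

The inequality is tight only along the singular direction associated with σmin(A), which mirrors how the parameter in the theorems below is governed by minimum singular values.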

The second property is referred to as the regularity condition, with the (local) regularity condition given by

 ∀x∈Bx∗(δ), ⟨∇f(x),x−x∗⟩≥α∥x−x∗∥22+β∥∇f(x)∥22,

where α,β>0. This condition can be viewed as a restricted version of strong convexity, and it has been shown to guarantee a linear convergence of the iterate residual in this local neighborhood Nesterov (2014); Candès et al. (2015). Problems such as phase retrieval Candès et al. (2015), affine rank minimization Zheng and Lafferty (2015); Tu et al. (2016) and matrix completion Zheng and Lafferty (2016) have been shown to satisfy the local regularity condition.
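As with gradient dominance, the regularity condition can be illustrated numerically (our sketch, not from the paper) on a least-squares objective, for which it holds globally with the illustrative constants α = σ²min(A)/2 and β = 1/(2σ²max(A)):

```python
import numpy as np

# Illustrative check of the regularity condition
# <grad f(x), x - x*> >= alpha*||x - x*||^2 + beta*||grad f(x)||^2
# for f(x) = 0.5 * ||A x - b||^2 with A of full column rank.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
grad = lambda x: A.T @ (A @ x - b)

s = np.linalg.svd(A, compute_uv=False)           # descending singular values
alpha, beta = s[-1] ** 2 / 2.0, 1.0 / (2.0 * s[0] ** 2)
for _ in range(100):
    x = x_star + rng.standard_normal(3)
    d, g = x - x_star, grad(x)
    assert g @ d >= alpha * d @ d + beta * g @ g - 1e-6
```

Here ⟨∇f(x), x−x∗⟩ = ∥A(x−x∗)∥², and each of the two right-hand terms is bounded by half of it, which is exactly the "restricted strong convexity" reading of the condition.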

However, these two properties have not been explored thoroughly for the loss functions of neural networks, with only very few exceptions. Hardt and Ma (2017) established the gradient dominance condition for a linear residual network (in which each residual unit has only one linear layer) within a local neighborhood of the origin. The goal of this paper is to explore these two geometric conditions for a much broader class of neural networks. In particular, we focus on three types of neural networks: feedforward linear neural networks Lippmann (1988), linear residual neural networks He et al. (2016), and nonlinear neural networks with one hidden layer.

### 1.1 Our Contributions

We study the square loss function of linear, linear residual, and one-hidden-layer nonlinear neural networks. We focus on the scenario in which all parameter matrices of the neural networks are square, so that the global minimizers, gradient and Hessian of the loss functions can be expressed in a tractable form for analysis. We first characterize the form of the global minimizers of these loss functions, and then establish local gradient dominance and regularity conditions for these loss functions.

Characterization of global minimizers: For deep linear neural networks, we show that global minimizers can be uniquely characterized in an explicit form up to an equivalence class. Furthermore, all the global minimizers correspond to parameter matrices that are full rank. We then extend such a result to further characterize the full-rank global minimizers of deep linear residual networks and one-hidden-layer nonlinear neural networks. Our results generalize the characterization of global minimizers of shallow linear networks in Baldi and Hornik (1989) to deep linear, residual and one-hidden-layer nonlinear neural networks.

Gradient dominance condition: For deep linear networks, we show that the gradient dominance condition holds within the neighborhood of any global minimizer, and hence any critical point within such a neighborhood is also a global minimizer. We further show that the same result also holds in parallel for deep linear residual networks within the neighborhood of any full-rank global minimizer, and for nonlinear networks with one hidden layer within the neighborhood of any global minimizer. Moreover, comparing the gradient dominance condition of the two types of linear networks, the identity shortcut in the residual networks helps to regularize the constant of the gradient dominance condition in the neighborhood of the origin to be more amenable for optimization. Our results generalize that in Hardt and Ma (2017) within the neighborhood of the origin for residual networks with shortcut depth 1 to the neighborhood of any full-rank global minimizer for residual networks with arbitrary shortcut depth.

Regularity condition: For deep linear networks, we establish the local regularity condition within the neighborhood of any global minimizer along certain directions. We further show that the same result also holds in parallel for deep linear residual networks and one-hidden-layer nonlinear neural networks. Comparing the local regularity condition of the two types of linear networks, the identity shortcut in residual networks broadens the range of directions along which the regularity condition holds in the neighborhood of the origin. Hence, the global minimizers of the linear residual networks near the origin open a larger aperture of attraction for optimization paths than that of the global minimizer of the linear networks.

### 1.2 Related Work

Gradient dominance condition and regularity condition for nonconvex problems: As we discussed above, the gradient dominance condition has recently been exploited to characterize the linear convergence of first-order algorithms for nonconvex optimization Karimi et al. (2016); Reddi et al. (2016). This condition was established for problems such as phase retrieval Zhou et al. (2016), blind deconvolution Li et al. (2016b), and linear residual neural networks Hardt and Ma (2017). The regularity condition has also been exploited to characterize the linear convergence of first-order algorithms for nonconvex optimization Candès et al. (2015). This condition was established for phase retrieval Candès et al. (2015); Chen and Candès (2015); Zhang and Liang (2016); Wang and Giannakis (2016), for affine rank minimization Zheng and Lafferty (2015); Tu et al. (2016); White et al. (2015), and for matrix completion problems Chen and Wainwright (2015); Zheng and Lafferty (2016).

Other landscape properties of linear networks: The study of the landscape of the square loss function for linear neural networks dates back to the pioneering work Baldi and Hornik (1989); Baldi (1989). There, the authors studied the autoencoder with one hidden layer and showed the equivalence between the local minimum and the global minimum, with a characterization of the form of the global minimum points. Baldi and Lu (2012) further generalized these results to the complex-valued autoencoder setting. The equivalence between the local minimum and the global minimum of deep linear networks was established in Kawaguchi (2016); Lu and Kawaguchi (2017); Yun et al. (2017) under different assumptions. In particular, Yun et al. (2017) established necessary and sufficient conditions for a critical point of the deep linear network to be a global minimum. The same result was shown in Freeman and Bruna (2017) for deep linear networks with intermediate layers wider than the input and output layers. Taghvaei et al. (2017) studied the effect of regularization on the critical points for a two-layer linear network. Li et al. (2016a) studied the property of the Hessian matrix for deep linear residual networks.

Other landscape properties of nonlinear networks: There have also been studies on understanding the landscape of nonlinear neural networks from theoretical perspectives. Yu and Chen (1995) considered a one-hidden-layer nonlinear neural network with sigmoid activation and showed that all local minima are also global minima provided that the number of input units equals the number of data samples. Gori and Tesi (1992) studied a class of multi-layer nonlinear neural networks, and showed that all critical points of full column rank achieve the global minimum with zero loss, if the sample size is less than the input dimension and the widths of the layers form a pyramidal structure. Nguyen and Hein (2017) further generalized the results in Gori and Tesi (1992) to a larger class of nonlinear networks and showed that critical points with non-degenerate Hessian are global minima. Choromanska et al. (2015a, b) connected the loss function of deep nonlinear networks with the Hamiltonian of the spin-glass model under certain assumptions and characterized the distribution properties of the local minimizers. Then, Kawaguchi (2016) further eliminated some of the assumptions in Choromanska et al. (2015a), and established the equivalence between the local minimum and the global minimum by reducing the loss function of the deep nonlinear network to that of the deep linear network. Soltanolkotabi et al. (2017) established the local strong convexity of overparameterized nonlinear networks with one hidden layer and quadratic activation functions. Furthermore, Zhong et al. (2017) established the local strong convexity of a class of nonlinear networks with one hidden layer with Gaussian input data, and established the local linear convergence of the gradient descent method with tensor initialization.

Soudry and Hoffer (2017) studied a one-hidden-layer nonlinear neural network with piecewise linear activation function and a single output, and showed that the volume of differentiable regions of the empirical loss containing sub-optimal differentiable local minima is exponentially vanishing in comparison with the volume of regions containing global minima, as the number of data samples tends to infinity. Xie et al. (2016) studied the nonlinear neural network with one hidden layer, and showed that diversified weights can lead to good generalization performance. Dauphin et al. (2014) investigated the saddle point issues in deep neural networks, motivated by results from statistical physics and random matrix theory. Recently, Feizi et al. (2017) studied a one-hidden-layer nonlinear neural network with the parameters constrained in a finite set of lines, and showed that most local optima are global optima.

## 2 Preliminaries of Three Neural Networks

In this section, we describe the square loss functions that we consider for three types of neural networks, and characterize the forms of the global minimizers of these loss functions, which further help to establish our main results on landscape properties for these loss functions in Sections 3 and 4.

Throughout, (X,Y) denotes the input and output data matrix pair. We assume that X,Y∈ℝd×m, i.e., there are m data samples. We denote ΣXX:=XX⊤ and ΣXY:=XY⊤. We assume that X and Y are full rank, and assume that Σ:=Σ⊤XYΣ−1XXΣXY has distinct eigenvalues. Note that these are standard assumptions adopted in Baldi and Hornik (1989); Kawaguchi (2016).

We also adopt the following notations. For a matrix M, we denote vec(M) as the vertical stack of the columns of M. The Kronecker product between matrices M and N is denoted as M⊗N. For a matrix M, the spectral norm is denoted by ∥M∥2, the smallest nonzero singular value is denoted by σmin(M), and the trace is denoted by tr(M). We also denote the collection of natural numbers [n]:={1,…,n}.

### 2.1 Linear Neural Networks

Consider a feedforward linear neural network with l layers. Each layer k is parameterized by a matrix Wk, and we use W to denote the collection of all model parameter matrices. In particular, we consider the setting where all parameter matrices are square, i.e., Wk∈ℝd×d for all k∈[l]. We are interested in understanding the properties of the following square loss function in training the network as adopted by Baldi and Hornik (1989); Kawaguchi (2016):

 h(W):=12∥WlWl−1…W1X−Y∥2F, (1)

where ∥⋅∥F denotes the matrix Frobenius norm.
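As a minimal sketch (our illustrative code, not from the paper), the loss in eq. 1 can be computed directly; the helper name `linear_net_loss` is ours:

```python
import numpy as np

# h(W) = 0.5 * ||W_l ... W_1 X - Y||_F^2 for square layer matrices.
def linear_net_loss(Ws, X, Y):
    """Ws = [W_1, ..., W_2, ...]; layers are applied in order W_1 first."""
    out = X
    for W in Ws:
        out = W @ out
    return 0.5 * np.linalg.norm(out - Y, ord="fro") ** 2

rng = np.random.default_rng(2)
d, m, l = 3, 5, 4
X = rng.standard_normal((d, m))
Ws = [rng.standard_normal((d, d)) for _ in range(l)]
Y = np.linalg.multi_dot(Ws[::-1] + [X])   # realizable target -> zero loss
assert abs(linear_net_loss(Ws, X, Y)) < 1e-8
```

With a realizable target the loss is zero, matching the zero-global-minimum setting used later for the regularity analysis.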

It can be observed that the set of the global minimizers of h is invariant under invertible transformations. Namely, if (W∗1,…,W∗l) is a global minimizer, then (C1W∗1, C2W∗2C−11, …, W∗lC−1l−1) is also a global minimizer, where C1,…,Cl−1 are arbitrary invertible square matrices. Thus, we treat all global minimizers up to such invertible matrix transformations as an equivalence class. The following result states that under certain conditions, the global minimizers of h can be uniquely characterized up to an equivalence class.

###### Proposition 1.

Consider h of a linear neural network with square parameter matrices. Then the global minimizers can be uniquely (up to an equivalence class) characterized by

 W∗l =UCl, ⋯ W∗k =C−1k+1Ck, W∗1 =C−12U⊤Σ⊤XYΣ−1XX, (2)

where C2,…,Cl are arbitrary invertible matrices and U is the matrix formed by the eigenvectors corresponding to the top d eigenvalues of Σ:=Σ⊤XYΣ−1XXΣXY. In such a case, the global minimal value of h is determined by the top d eigenvalues of Σ.

Proposition 1 generalizes the characterization of the global minimizers of shallow linear networks in Baldi and Hornik (1989) to deep linear networks. It states the not-so-obvious fact that any global minimizer in such a case must take the form in eq. 2, although it is easy to observe that the form given in eq. 2 achieves the global minimum. Moreover, the following corollary follows as an immediate observation from eq. 2.

###### Corollary 1.

Any global minimizer of h of a linear neural network with square parameter matrices must be full rank.

### 2.2 Linear Residual Neural Networks

Consider the linear residual neural network, which further introduces the residual structure to the linear neural network. That is, one adds a shortcut (identity map) for every r hidden layers. Assuming we have in total l residual units, we consider the square loss of a linear residual neural network given as follows:

 f(A):=12∥(I+Alr…Al1)⋯(I+Akr…Ak1)⋯(I+A1r…A11)X−Y∥2F, (3)

where the model parameters of each layer are denoted by Akq. Again, we consider the case when all parameter matrices are square, i.e., Akq∈ℝd×d for all k∈[l],q∈[r]. The following property of the linear residual neural network follows directly from Proposition 1 for the linear neural network.
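The residual loss in eq. 3 can be sketched as follows (our illustrative code; the helper name `residual_net_loss` and the nested-list layout of the parameters are assumptions for the example):

```python
import numpy as np

# f(A) per eq. (3): l residual units, each with r linear layers
# behind an identity shortcut, i.e. Z -> (I + A_kr ... A_k1) Z.
def residual_net_loss(A, X, Y):
    """A[k][q] holds A_{(k+1)(q+1)} as a d x d matrix."""
    d = X.shape[0]
    out = X
    for unit in A:                      # units k = 1, ..., l
        M = np.eye(d)
        for layer in unit:              # layers q = 1, ..., r
            M = layer @ M               # builds A_kr ... A_k1
        out = (np.eye(d) + M) @ out
    return 0.5 * np.linalg.norm(out - Y, ord="fro") ** 2

rng = np.random.default_rng(6)
d, m, l, r = 3, 5, 2, 2
X = rng.standard_normal((d, m))
A = [[np.zeros((d, d)) for _ in range(r)] for _ in range(l)]
assert abs(residual_net_loss(A, X, X)) < 1e-12   # zero units = identity map
```

Note that the all-zero parameter point makes the network the identity map, which is why the landscape near the origin plays a special role for residual networks.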

###### Proposition 2.

Consider f of the linear residual network with square parameter matrices. Then a full-rank global minimizer is fully characterized as, for all k∈[l],

 A∗kr =UkCkr, ⋯ A∗kq =C−1k(q+1)Ckq, ⋯ A∗k1 =C−1k2U⊤k(W∗k−I), (4)

where W∗k:=I+A∗kr…A∗k1 for k∈[l] is characterized as

 W∗l =ˆUˆCl, ⋯ W∗k =ˆC−1k+1ˆCk, ⋯ W∗1 =ˆC−12ˆU⊤Σ⊤XYΣ−1XX. (5)

Here, Ckq for all k∈[l],q∈[r] and ˆC2,…,ˆCl are arbitrary invertible matrices, Uk is an orthonormal matrix formed by eigenvectors associated with W∗k, and ˆU is the matrix formed by all the eigenvectors of Σ.

The above result characterizes the full-rank global minimizers via the form given by Proposition 2. In particular, the characterization in Proposition 2 implies that all residual units are also full rank. We note that when the shortcut depth r=1, the above characterization is consistent with the construction of the global minimizer in Hardt and Ma (2017).

### 2.3 Nonlinear Neural Network with One-hidden-layer

Consider a nonlinear neural network with one hidden layer, where the layer parameters are square, i.e., W1,W2∈ℝd×d, and each hidden neuron adopts a differentiable nonlinear activation function σ. We consider the following square loss function

 g(W):=12∥W2σ(W1X)−Y∥2F, (6)

where σ acts on its matrix argument entrywise. In particular, we consider a class of activation functions that are strictly monotone, so that σ is invertible. A typical example of such activation functions is the class of parametric ReLU activation functions, i.e., σ(x)=max{x,γx}, where γ∈(0,1).
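A minimal sketch (ours, not from the paper) of the parametric ReLU and its entrywise inverse, which exists precisely because the activation is strictly increasing:

```python
import numpy as np

# Parametric (leaky) ReLU sigma(x) = max(x, gamma*x) with gamma in (0, 1)
# is strictly increasing, so its entrywise inverse is well defined.
def prelu(x, gamma=0.1):
    return np.maximum(x, gamma * x)

def prelu_inv(y, gamma=0.1):
    # positive part unchanged; negative part divided back by gamma
    return np.minimum(y, y / gamma)

x = np.linspace(-3.0, 3.0, 101)
assert np.allclose(prelu_inv(prelu(x)), x)
```

This invertibility is what makes the characterization of the minimizers of g via σ−1 in the next result well defined.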

The following result characterizes the form of the global minimizers of .

###### Proposition 3.

Consider g of the one-hidden-layer nonlinear neural network with m=d and invertible X. Then any global minimizer can be characterized as

 W∗2=˜W∗2, W∗1=σ−1(˜W∗1X)X−1, (7)

where (˜W∗1,˜W∗2) is a global minimizer of the corresponding linear network, and is fully characterized by Proposition 1.

We note that the inverse function σ−1 should be understood entrywise, and is well defined since σ is strictly monotone. It can be seen from eq. 7 that the global minimizers of g satisfy W∗2=˜W∗2, which is full rank by Corollary 1.

Based on the results in this section, we observe that for the scenario where all parameter matrices are square, the global minimizers (or full-rank global minimizers, for residual networks) of all three neural networks studied above consist of full-rank parameter matrices. Such a property further assists the establishment of the gradient dominance condition and the regularity condition for the loss functions of these types of neural networks, as we present in Sections 3 and 4.

## 3 Gradient Dominance Condition

The gradient dominance condition is generally regarded as a useful property that can be exploited for analyzing the convergence performance of optimization methods. In particular, this condition, together with a Lipschitz property of the gradient of an objective function, guarantees the linear convergence of the function value sequence generated by the gradient descent method. In this section, we establish the gradient dominance condition for the three types of neural networks of interest in this paper.

### 3.1 Linear Neural Networks

For the linear network, we denote e:=Wl⋯W1X−Y as the error matrix. We start our analysis by exploring the gradient of h. We present it as follows in the denominator layout (i.e., column layout), and all calculations are provided in the supplemental material for the reader's convenience.

The k-th block of the gradient of h can be characterized as

 ∀k∈[l],∇vec(Wk)h(W)=Gk(W)⊤vec(e), (8)

where vec(e) is the vectorized error matrix, and

 Gk(W):=(Wk−1…W1X)⊤⊗(Wl…Wk+1). (9)

Note that for the boundary cases k=1 and k=l, the products Wk−1…W1 and Wl…Wk+1 should be understood as the identity matrix.
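As a sanity check (our illustrative code, not from the paper), the block-gradient formula in eq. 8–eq. 9 can be verified against finite differences; the column-major vec convention corresponds to numpy's `order="F"`:

```python
import numpy as np

# Verify grad_vec(Wk) h = G_k(W)^T vec(e), with
# G_k = (W_{k-1}...W_1 X)^T kron (W_l ... W_{k+1}).
rng = np.random.default_rng(3)
d, m, l = 3, 4, 3
X = rng.standard_normal((d, m))
Y = rng.standard_normal((d, m))
Ws = [rng.standard_normal((d, d)) for _ in range(l)]

def prod(mats):                      # left-to-right product, identity if empty
    out = np.eye(d)
    for M in mats:
        out = out @ M
    return out

def loss(Ws):
    return 0.5 * np.linalg.norm(prod(Ws[::-1]) @ X - Y) ** 2

e = prod(Ws[::-1]) @ X - Y
for k in range(l):                   # 0-based index for layer W_{k+1}
    left = prod(Ws[:k][::-1]) @ X    # W_{k-1} ... W_1 X (just X for first layer)
    right = prod(Ws[k + 1:][::-1])   # W_l ... W_{k+1}   (I for last layer)
    analytic = np.kron(left.T, right).T @ e.flatten(order="F")

    eps, numeric = 1e-6, np.zeros(d * d)
    for i in range(d):
        for j in range(d):
            Wp = [W.copy() for W in Ws]; Wp[k][i, j] += eps
            Wm = [W.copy() for W in Ws]; Wm[k][i, j] -= eps
            numeric[j * d + i] = (loss(Wp) - loss(Wm)) / (2 * eps)
    assert np.allclose(analytic, numeric, atol=1e-5)
```

The loop also covers the boundary cases k=1 and k=l, where the empty products reduce to the identity as noted above.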

We next establish the gradient dominance condition in the neighborhood of any global minimizer of h for the linear networks.

###### Theorem 1.

Consider h of the linear neural network with square parameter matrices. Consider a global minimizer W∗. Then any point W in a sufficiently small neighborhood of W∗ satisfies

 h(W)−h(W∗)≤λh∥∥∇vec(W)h(W)∥∥22, (10)

where λh>0 depends on the minimum singular values of the parameter matrices at W∗. Consequently, any critical point in this neighborhood is a global minimizer.

We note that Corollary 1 guarantees that any global minimizer of h is full rank, and hence the minimum-singular-value parameter in Theorem 1 is strictly positive. The gradient dominance condition implies a linear convergence of the function value to the global minimum via a gradient descent algorithm, if the iterations of the algorithm stay in this neighborhood. In particular, a larger minimum singular value implies a smaller λh, which yields a faster convergence of the function value to the global minimum via a gradient descent algorithm.

### 3.2 Linear Residual Neural Networks

For the linear residual network, we define A as the collection of all parameter matrices Akq. For all k∈[l], we denote Wk:=I+Akr…Ak1, and denote e:=Wl⋯W1X−Y as the error matrix. Please note that the Wk here are only notations for convenience, and the real parametrization is by the matrices Akq.

We derive the following first-order derivatives of f with details given in the supplemental material. The (k,q)-th block of the first-order derivative of f can be characterized as

 ∀k∈[l], q∈[r], ∇vec(Akq)f(A)=Qkq(A)⊤vec(e), (11)

where the matrix Qkq(A) takes the form

 Qkq (A):=[(Wk−1…W1X)⊤⊗(Wl…Wk+1)] [(Ak(q−1)…Ak1)⊤⊗(Akr…Ak(q+1))]. (12)

We then obtain the following gradient dominance condition within the neighborhood of a full-rank global minimizer.

###### Theorem 2.

Consider f of the linear residual neural network with square parameter matrices. Consider a full-rank global minimizer A∗, and pick a sufficiently small neighborhood of A∗ such that Wk:=I+Akr…Ak1 remains full rank for all k∈[l] at every point in the neighborhood. Then any point A in this neighborhood satisfies

 f(A)−f(A∗)≤λf∥∥∇vec(A)f(A)∥∥22, (13)

where λf>0. Consequently, any critical point in this neighborhood is a global minimizer.

We note that the above theorem establishes the gradient dominance condition around full-rank global minimizers. This is not too restrictive, as W∗k is guaranteed to be full rank for all k∈[l] by Proposition 2. Moreover, each A∗kq is full rank if and only if W∗k−I is full rank. Also, such a neighborhood exists by continuity, since Wk is full rank at A∗ for all k∈[l].

As a comparison to the gradient dominance condition obtained in Hardt and Ma (2017), which is applicable to the neighborhood of the origin for the residual network with shortcut depth 1, the above result characterizes the gradient dominance condition for a broader range of the parameter space, which is applicable to the neighborhood of any full-rank minimizer and to more general residual networks with arbitrary shortcut depth r.

We note that the parameter λf in Theorem 2 depends on both the Wk and the Akq, where Wk captures the overall property of each residual unit and Akq captures the property of an individual linear unit in each residual unit. Hence, in general, the λf in Theorem 2 for linear residual networks is very different from the λh in the gradient dominance condition in Theorem 1 for linear networks. When the shortcut depth r becomes large, the parameter λf involves more variables that are not regularized by the identity map, and hence becomes more similar to the parameter of linear networks.

To further compare the λf in Theorem 2 and the λh in Theorem 1, consider a simplified setting of the linear residual network with the shortcut depth r=1. Then λf is better regularized, although it takes the same expression as λh in Theorem 1 for the linear network. The reason is that for residual networks, all Wk are further parameterized as Wk=I+Ak1. When ∥Ak1∥2 is small (in particular, less than one), σmin(Wk) (and hence the parameter) is regularized away from zero by the identity map, which was also observed by Hardt and Ma (2017). Consequently, the identity shortcut leads to a smaller λf compared to λh when the parameters of linear networks have small spectral norm. Such a smaller parameter is more desirable for optimization, because the function value approaches closer to the global minimum after one iteration of a gradient descent algorithm.

### 3.3 Nonlinear Neural Network with One Hidden layer

For a nonlinear network with one hidden layer, we define W:=(W1,W2) and denote e:=W2σ(W1X)−Y as the error matrix.

For such nonlinear networks, the structures of the gradient and Hessian of the loss function are much more complicated than those of linear networks. Specifically, the gradient of g can be characterized as follows (see the supplemental material for the derivation)

 ∇vec(W2)g(W) =(σ(W1X)⊗I)vec(e), (14)
 ∇vec(W1)g(W) =(X⊗I)vec(σ′(W1X)∘(W⊤2e)), (15)

where "∘" denotes the entrywise (Hadamard) product, and σ′ denotes the derivative of σ. The following result establishes the gradient dominance condition for one-hidden-layer nonlinear neural networks.
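The gradient formulas in eq. 14–eq. 15 can be sanity-checked against finite differences (our illustrative sketch, with a parametric ReLU as the example activation):

```python
import numpy as np

# Check grad_vec(W2) g = (sigma(W1 X) kron I) vec(e) and
# grad_vec(W1) g = (X kron I) vec(sigma'(W1 X) o (W2^T e)).
gamma = 0.1
sigma = lambda x: np.maximum(x, gamma * x)        # parametric ReLU
dsigma = lambda x: np.where(x > 0, 1.0, gamma)    # its entrywise derivative

rng = np.random.default_rng(4)
d, m = 3, 5
X, Y = rng.standard_normal((d, m)), rng.standard_normal((d, m))
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))

e = W2 @ sigma(W1 @ X) - Y
grad_W2 = np.kron(sigma(W1 @ X), np.eye(d)) @ e.flatten(order="F")
grad_W1 = np.kron(X, np.eye(d)) @ (dsigma(W1 @ X) * (W2.T @ e)).flatten(order="F")

def g(W1, W2):
    return 0.5 * np.linalg.norm(W2 @ sigma(W1 @ X) - Y) ** 2

eps = 1e-6
num1, num2 = np.zeros(d * d), np.zeros(d * d)
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d)); E[i, j] = eps
        num1[j * d + i] = (g(W1 + E, W2) - g(W1 - E, W2)) / (2 * eps)
        num2[j * d + i] = (g(W1, W2 + E) - g(W1, W2 - E)) / (2 * eps)
assert np.allclose(grad_W1, num1, atol=1e-5)
assert np.allclose(grad_W2, num2, atol=1e-5)
```

The Hadamard product appears as ordinary elementwise multiplication (`*`) in the code, mirroring "∘" in eq. 15.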

###### Theorem 3.

Consider the loss function g of one-hidden-layer nonlinear neural networks with m=d and invertible X. Consider a global minimizer W∗. Then any point W in a sufficiently small neighborhood of W∗ satisfies

 g(W)≤λg∥∥∇vec(W)g(W)∥∥22, (16)

where λg>0.

We note that the characterization in Proposition 3 guarantees that W∗2 is full rank, and hence λg is well defined. Differently from linear networks, the gradient dominance condition for nonlinear networks holds in a nonlinear neighborhood that involves the activation function σ. This is naturally due to the nonlinearity of the network. Furthermore, the parameter λg depends on the nonlinear term σ(W∗1X), whereas the λh in Theorem 1 of linear networks depends on the individual parameters W∗k.

## 4 Regularity Condition

The regularity condition is an important landscape property in optimization theory, which has been shown to guarantee the linear convergence of a gradient descent iteration sequence to a global minimizer in various nonconvex problems as we discuss in Section 1. In this section, we establish the regularity condition for the loss functions of the three neural networks of interest here.

### 4.1 Linear Neural Networks

Consider any global minimizer W∗ of h. We focus on the more amenable case where the global minimal value is zero (i.e., h(W∗)=0), in which the Hessian at the global minimizers can be expressed in a tractable form. More specifically, the (k,k′)-th block of the Hessian of h can be characterized as

 ∀k,k′∈[l], ∇vec(Wk′)(∇vec(Wk)h(W∗))=Gk′(W∗)⊤Gk(W∗), (17)

where the matrix Gk(W∗) is given in eq. 9. By further denoting G(W∗):=[G1(W∗),…,Gl(W∗)], the entire Hessian matrix at any global minimizer can be written as

 ∇2vec(W)h(W∗)=G(W∗)⊤G(W∗). (18)
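The factored Hessian structure in eq. 18 can be checked numerically at a zero-loss minimizer (our illustrative sketch on a two-layer example; the construction of the realizable target is ours):

```python
import numpy as np

# At a zero-loss global minimizer, the Hessian of h equals G^T G,
# where G = [G_1, G_2] stacks the blocks of eq. (9) horizontally.
rng = np.random.default_rng(5)
d, m = 2, 3
X = rng.standard_normal((d, m))
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Y = W2 @ W1 @ X                        # realizable target: h(W*) = 0

def loss(v):                           # v = [vec(W1); vec(W2)], column-major
    A = v[:d * d].reshape((d, d), order="F")
    B = v[d * d:].reshape((d, d), order="F")
    return 0.5 * np.linalg.norm(B @ A @ X - Y) ** 2

# G_1 = X^T kron W_2, G_2 = (W_1 X)^T kron I
G = np.hstack([np.kron(X.T, W2), np.kron((W1 @ X).T, np.eye(d))])

v0 = np.concatenate([W1.flatten(order="F"), W2.flatten(order="F")])
n, eps = 2 * d * d, 1e-4
H_num = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei = np.zeros(n); ei[i] = eps
        ej = np.zeros(n); ej[j] = eps
        H_num[i, j] = (loss(v0 + ei + ej) - loss(v0 + ei - ej)
                       - loss(v0 - ei + ej) + loss(v0 - ei - ej)) / (4 * eps ** 2)
assert np.allclose(H_num, G.T @ G, atol=1e-3)
```

Away from a minimizer the Hessian picks up extra terms linear in the error matrix e, which is why the clean factored form holds only at zero loss.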

The following result establishes the regularity condition for the square loss function for linear networks.

###### Theorem 4.

Consider h of linear neural networks with zero global minimal value. Further consider a global minimizer W∗. Then for any δ>0, there exists a sufficiently small neighborhood of W∗ such that any point W that satisfies

 ∥G(W∗)vec(W−W∗)∥2≥δ∥vec(W−W∗)∥2 (19)

and lies within this neighborhood of W∗ also satisfies

 ⟨∇vec(W)h(W),vec(W−W∗)⟩≥α∥vec(W−W∗)∥22+β∥∥∇vec(W)h(W)∥∥22, (20)

where α,β>0 depend on the choice of δ.

We note that the regularity condition as in eq. 20 has been established and exploited for the convergence analysis in various nonconvex problems such as phase retrieval and rank minimization (see the references in Section 1.2). There, the regularity condition was shown to hold within the entire neighborhood of any global minimizer. In comparison, Theorem 4 guarantees the regularity condition for linear neural networks within a neighborhood of W∗ with the further constraint in eq. 19. It can be observed that the condition in eq. 19 does not depend on the norm of W−W∗, and hence it is a condition only on the direction of W−W∗, along which the regularity condition can be satisfied. Furthermore, the parameter δ in eq. 19 determines the range of directions that satisfy eq. 19. For example, if we set δ=σmin(G(W∗)), then all W−W∗ with vec(W−W∗) lying in the row space of G(W∗) satisfy the condition in eq. 19.

For all W with directions that satisfy eq. 19 and hence satisfy the regularity condition, it can be shown that one gradient descent iteration yields an update that is closer to the global minimizer Candès et al. (2015). Hence, W∗ serves as an attractive point along those directions of W−W∗ that satisfy eq. 19. Furthermore, the value of δ in eq. 19 affects the parameters in the regularity condition: a larger δ results in larger parameters, which further implies that one gradient descent iteration yields an update even closer to the global minimizer Candès et al. (2015).

### 4.2 Linear Residual Neural Networks

Consider any global minimizer A∗ of f of a linear residual network. Suppose that the global minimal value is zero, which implies that the error matrix e vanishes at A∗. Then the (kq,k′q′)-th block of the Hessian of f can be characterized as

 ∀k,k′∈[l],q,q′∈[r] ∇vec(Ak′q′)(∇vec(Akq)f(A∗))=Qk′q′(A∗)⊤Qkq(A∗), (21)

where the matrix Qkq(A∗) is given in Section 3.2. We note that the above Hessian is evaluated at a global minimizer A∗, and is hence different from the Hessian evaluated at the origin (i.e., A=0) in Li et al. (2016a).

Denote Q(A∗):=[Q11(A∗),…,Qlr(A∗)]. Then the entire Hessian matrix at any global minimizer can be characterized as

 ∇2vec(A)f(A∗)=Q(A∗)⊤Q(A∗). (22)

The following theorem characterizes the regularity condition for the square loss function of linear residual networks.

###### Theorem 5.

Consider f of linear residual neural networks with zero global minimal value. Further consider a full-rank global minimizer A∗. Then, for any constant δ>0, there exists a sufficiently small neighborhood of A∗ such that any point A that satisfies

 ∥∥Q(A∗)vec(A−A∗)∥∥2≥δ∥∥vec(A−A∗)∥∥2

and lies within this neighborhood of A∗ also satisfies

 ⟨∇vec(A)f(A),vec(A−A∗)⟩≥α∥vec(A−A∗)∥22+β∥∥∇vec(A)f(A)∥∥22, (23)

where α,β>0 depend on the choice of δ.

Similarly to the regularity condition for linear networks, the regularity condition for linear residual networks holds only along the directions of A−A∗ that satisfy the constraint in Theorem 5. However, the parametrization of Q(A∗) of linear residual networks is different from that of G(W∗) of linear networks. To illustrate, consider a simplified setting of the linear residual network where the shortcut depth is r=1. Then, one has

 Qk(A∗)=(W∗k−1…W∗1X)⊤⊗(W∗l…W∗k+1).

Although it takes the same form as Gk(W∗) of the linear network, the reparameterization W∗k=I+A∗k1 keeps Qk(A∗) away from zero when ∥A∗k1∥2 is small. This enlarges σmin(Q(A∗)) so that the constraint in Theorem 5 can be satisfied along a wider range of directions. In this way, A∗ attracts the optimization iteration path to converge along a wider range of directions in the neighborhood of the origin.

### 4.3 Nonlinear Neural Networks with One Hidden Layer

The Hessian of general nonlinear networks can take complicated forms, which are typically not tractable to analyze. Here, we consider nonlinear neural networks with one hidden layer and focus on a simplified setting with square parameter matrices. In this case, the network output can realize any square matrix and hence the global minimal value of the loss $g$ is zero. Consequently, the Hessian of $g$ at a global minimizer $W^*$ takes the following form

$$\nabla^2_{\mathrm{vec}(W)}g(W^*)=H(W^*)^\top H(W^*), \tag{24}$$

where

$$H^\top=\Big[(X\otimes I)\,\mathrm{diag}\big(\sigma'(\mathrm{vec}(W^*_1X))\big)\,\big(I\otimes (W^*_2)^\top\big)\qquad \sigma(W^*_1X)\otimes I\Big].$$
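Since eq. 24 expresses the Hessian at a zero-loss minimizer as the Gram matrix $H(W^*)^\top H(W^*)$, it must be positive semidefinite. The following sketch checks this on a tiny one-hidden-layer $\tanh$ network via a finite-difference Hessian (the dimensions and the construction $W_2=Y\,\sigma(W_1X)^{-1}$ are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
X = rng.standard_normal((d, d))
W1 = rng.standard_normal((d, d))
Y = rng.standard_normal((d, d))
W2 = Y @ np.linalg.inv(np.tanh(W1 @ X))   # forces W2 tanh(W1 X) = Y, i.e. zero loss

def g(w):
    V1, V2 = w[:d*d].reshape(d, d), w[d*d:].reshape(d, d)
    return 0.5 * np.sum((Y - V2 @ np.tanh(V1 @ X)) ** 2)

w0 = np.concatenate([W1.ravel(), W2.ravel()])

# Central finite-difference Hessian at the constructed global minimizer.
p, eps = w0.size, 1e-5
I = np.eye(p)
H = np.empty((p, p))
for i in range(p):
    for j in range(p):
        H[i, j] = (g(w0 + eps*(I[i] + I[j])) - g(w0 + eps*(I[i] - I[j]))
                   - g(w0 - eps*(I[i] - I[j])) + g(w0 - eps*(I[i] + I[j]))) / (4 * eps**2)

eig_min = np.linalg.eigvalsh(0.5 * (H + H.T)).min()
print(g(w0), eig_min)   # loss ~ 0; smallest Hessian eigenvalue >= 0 up to FD error
```

Note that the Gram form only guarantees positive *semi*-definiteness: here the number of parameters exceeds the output dimension, so the Hessian necessarily has zero eigenvalues.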

The following theorem characterizes the regularity condition for the square loss function of nonlinear neural networks.

###### Theorem 6.

Consider the square loss $g$ of one-hidden-layer nonlinear neural networks with square parameter matrices. Further consider a global minimizer $W^*$ of $g$ and fix a constant $\delta>0$. Then there exists a sufficiently small $\epsilon>0$ such that any point $W$ that satisfies

$$\big\|H(W^*)\,\mathrm{vec}(W-W^*)\big\|_2\ \ge\ \delta\,\big\|\mathrm{vec}(W-W^*)\big\|_2$$

and lies within the neighborhood of $W^*$ defined as $\{W:\|\mathrm{vec}(W-W^*)\|_2\le\epsilon\}$ satisfies

$$\big\langle\nabla_{\mathrm{vec}(W)}g(W),\ \mathrm{vec}(W-W^*)\big\rangle\ \ge\ \frac{1}{\alpha}\big\|\mathrm{vec}(W-W^*)\big\|_2^2+\frac{1}{\beta}\big\|\nabla_{\mathrm{vec}(W)}g(W)\big\|_2^2, \tag{25}$$

where $\alpha,\beta>0$ are constants that depend on $\delta$ and the network parameters.

Thus, nonlinear neural networks with one hidden layer also have an amenable landscape near their global minimizers, which attracts gradient iterates to converge along the directions restricted by $\|H(W^*)\mathrm{vec}(W-W^*)\|_2\ge\delta\|\mathrm{vec}(W-W^*)\|_2$.

## 5 Conclusion

In this paper, we explored two landscape properties for the square losses of three types of neural networks: linear, linear residual, and one-hidden-layer nonlinear networks. We focused on the scenario with square parameter matrices, which allows us to characterize the explicit form of the global minimizers for these networks up to an equivalence class. Moreover, the characterization of global minimizers helps to establish the gradient dominance condition and the regularity condition for these networks in the neighborhood of their global minimizers under certain conditions. Along this direction, many interesting questions can be asked and are worth exploring in the future. For example, can the existing results for deep linear and shallow nonlinear networks be generalized to deep nonlinear networks? How can the information in higher-order derivatives of loss functions be further exploited to understand their landscapes? Furthermore, it is interesting to exploit the gradient dominance and regularity conditions in the convergence analysis of optimization algorithms applied to deep learning. The ultimate goal is to develop theory that effectively exploits the properties of loss functions to guide the design of optimization algorithms for deep neural networks in practice.

## Proof of Proposition 1

We first consider the two-layer linear network with output $ABX$ and square loss. Then the global minimizers are fully (sufficiently and necessarily) characterized by (Baldi and Hornik, 1989, Fact 4) as

$$A=UC,\qquad B=C^{-1}U^\top\Sigma_{XY}^\top\Sigma_{XX}^{-1}, \tag{26}$$

where $C$ is an arbitrary invertible matrix and $U$ is the matrix formed by the eigenvectors that correspond to the top eigenvalues of $\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$. Moreover, the global minimum value is given by $\mathrm{tr}(\Sigma_{YY})-\sum_i\lambda_i$, where $\lambda_i$ are the top eigenvalues of $\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$.
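As a sanity check on this characterization, one can build a minimizer from eq. 26 with empirical second-moment matrices ($\Sigma_{XX}=XX^\top$, $\Sigma_{YX}=YX^\top$, and so on, as a working assumption) and verify the stated minimum value numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, p = 4, 20, 2                 # output dim, samples, hidden width p < d
X, Y = rng.standard_normal((d, n)), rng.standard_normal((d, n))

Sxx, Syx, Syy = X @ X.T, Y @ X.T, Y @ Y.T
Sigma = Syx @ np.linalg.inv(Sxx) @ Syx.T        # Sigma_YX Sigma_XX^{-1} Sigma_XY

lam, V = np.linalg.eigh(Sigma)                  # eigenvalues in ascending order
U = V[:, -p:]                                   # top-p eigenvectors
C = rng.standard_normal((p, p))                 # arbitrary (a.s. invertible) matrix

A = U @ C                                       # eq. 26
B = np.linalg.inv(C) @ U.T @ Syx @ np.linalg.inv(Sxx)

loss = np.sum((Y - A @ B @ X) ** 2)
print(loss, np.trace(Syy) - lam[-p:].sum())     # the two values coincide
```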

Now consider the deep linear neural network with $l$ layers. Since all parameter matrices are square, the partial products $W_l\cdots W_k$ and $W_{k-1}\cdots W_1$ are also square. Thus, for any $k\in\{2,\ldots,l\}$, we can treat $W_l\cdots W_k$ as $A$ and $W_{k-1}\cdots W_1$ as $B$, and apply the characterization in eq. 26 (due to its uniqueness). We then obtain that any $(W_1,\ldots,W_l)$ that achieves the global optimum of $f$ is fully (sufficiently and necessarily) characterized by

$$\forall\, k=2,\ldots,l:\qquad W_l\cdots W_k=UC_k, \tag{27}$$

$$W_{k-1}\cdots W_1=C_k^{-1}U^\top\Sigma_{XY}^\top\Sigma_{XX}^{-1}, \tag{28}$$

where the $C_k$ are arbitrary invertible matrices. Note that eq. 27 with $k=l$ and $k=l-1$ gives

$$W_l=UC_l,\qquad W_lW_{l-1}=UC_{l-1}, \tag{29}$$

which further imply

$$UC_lW_{l-1}=UC_{l-1}. \tag{30}$$

Multiplying both sides by $U^\top$ and noting that $U^\top U=I$, one can solve for $W_{l-1}$ as $W_{l-1}=C_l^{-1}C_{l-1}$. We then apply eq. 27 inductively and obtain that

$$W_l=UC_l,\ \ \ldots,\ \ W_k=C_{k+1}^{-1}C_k,\ \ \ldots,\ \ W_1=C_2^{-1}U^\top\Sigma_{XY}^\top\Sigma_{XX}^{-1}. \tag{31}$$

One can verify that such a solution also satisfies the conditions in eq. 28 for all $k$, and hence eq. 31 fully characterizes the global minimizers. Clearly, the global minimizers characterized by eq. 31 belong to a unique equivalence class.
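The equivalence class in eq. 31 can also be seen numerically: for any choice of the invertible matrices $C_2,\ldots,C_l$, the factors telescope to the same end-to-end product $W_l\cdots W_1=UU^\top\Sigma_{XY}^\top\Sigma_{XX}^{-1}$. A sketch with illustrative dimensions and empirical moments ($\Sigma_{XX}=XX^\top$, $\Sigma_{YX}=YX^\top$, as working assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, l = 3, 10, 4
X, Y = rng.standard_normal((d, n)), rng.standard_normal((d, n))

Sxx, Syx = X @ X.T, Y @ X.T
Sigma = Syx @ np.linalg.inv(Sxx) @ Syx.T
_, U = np.linalg.eigh(Sigma)              # all eigenvectors (square case)

def end_to_end(Cs):
    # eq. 31 with Cs = [C_2, ..., C_l]:
    # W_l = U C_l,  W_k = C_{k+1}^{-1} C_k,  W_1 = C_2^{-1} U^T Syx Sxx^{-1}
    Ws = [np.linalg.inv(Cs[0]) @ U.T @ Syx @ np.linalg.inv(Sxx)]   # W_1
    for k in range(1, l - 1):                                      # W_2 .. W_{l-1}
        Ws.append(np.linalg.inv(Cs[k]) @ Cs[k - 1])
    Ws.append(U @ Cs[-1])                                          # W_l
    P = np.eye(d)
    for W in Ws:
        P = W @ P                                                  # W_l ... W_1
    return P

Cs1 = [rng.standard_normal((d, d)) for _ in range(l - 1)]
Cs2 = [rng.standard_normal((d, d)) for _ in range(l - 1)]
assert np.allclose(end_to_end(Cs1), end_to_end(Cs2))   # same end-to-end map
```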

## Proof of Proposition 2

Consider any full-rank global minimizer $A^*$. Since all parameter matrices are square, the global minimal value of $f$ is zero, and the fact that $A^*$ minimizes $f$ implies that the corresponding $W^*$ with $W^*_k=I+A^*_k$ minimizes the square loss of a linear network. Thus, by the characterization in Proposition 1, we conclude that $W^*$ is of full rank and must be characterized as

$$W^*_l=\hat U\hat C_l,\ \ \ldots,\ \ W^*_k=\hat C_{k+1}^{-1}\hat C_k,\ \ \ldots,\ \ W^*_1=\hat C_2^{-1}\hat U^\top\Sigma_{XY}^\top\Sigma_{XX}^{-1}, \tag{32}$$

where the $\hat C_k$ are arbitrary invertible matrices and $\hat U$ is the matrix formed by all the eigenvectors of $\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$.

Since by definition $W^*_k=I+A^*_k$, $W^*$ minimizes the loss of a linear network, and the product $W^*_l\cdots W^*_1$ is of full rank since all of its factors are of full rank. We want to apply Proposition 1 again to characterize the form of $W^*_k$ for each $k$. Note that $\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$ may not have distinct eigenvalues. Hence, by the remark in the proof of (Baldi and Hornik, 1989, Fact 4), the matrix formed by the eigenvectors should be generalized to $\hat UD$, where $D$ is block diagonal with each block being an orthogonal matrix (whose dimension is determined by the multiplicity of the corresponding eigenvalue). Then we can characterize $W^*$ as