# Globally Optimal Training of Generalized Polynomial Neural Networks with Nonlinear Spectral Methods

The optimization problem behind neural networks is highly non-convex. Training with stochastic gradient descent and variants requires careful parameter tuning and provides no guarantee to achieve the global optimum. In contrast we show under quite weak assumptions on the data that a particular class of feedforward neural networks can be trained globally optimal with a linear convergence rate with our nonlinear spectral method. Up to our knowledge this is the first practically feasible method which achieves such a guarantee. While the method can in principle be applied to deep networks, we restrict ourselves for simplicity in this paper to one and two hidden layer networks. Our experiments confirm that these models are rich enough to achieve good performance on a series of real-world datasets.



## 1 Introduction

Deep learning is currently the state-of-the-art machine learning technique in many application areas such as computer vision or natural language processing. While the theoretical foundations of neural networks have been explored in depth, see e.g. BarAnt1998, the understanding of the success of training deep neural networks is a currently very active research area ChoEtAl2015 ; DanFroSin2016 ; HarRecSin2015. On the other hand, the parameter search for stochastic gradient descent and variants such as Adagrad and Adam can be quite tedious and there is no guarantee that one converges to the global optimum. In particular, the problem is in general NP-hard even for a single hidden layer, see Sim2002 and references therein. This implies that to achieve global optimality efficiently one has to impose certain conditions on the problem.

A recent line of research has directly tackled the optimization problem of neural networks and provided either certain guarantees AroEtAl2014 ; LivShaSha2014 in terms of the global optimum or proved directly convergence to the global optimum HaeVid2015 ; JanSedAna2015 . The latter two papers are, up to our knowledge, the first results which provide a globally optimal algorithm for training neural networks. While providing a lot of interesting insights on the relationship of structured matrix factorization and training of neural networks, Haeffele and Vidal admit themselves in their paper HaeVid2015 that their results are “challenging to apply in practice”. In the work of Janzamin et al. JanSedAna2015, a tensor approach is used to propose a globally optimal algorithm for a feedforward neural network with one hidden layer and squared loss. Their approach, however, requires the computation of the score function tensor, which uses the density of the data-generating measure. This measure is unknown in practice and difficult to estimate for high-dimensional feature spaces. Moreover, one has to check certain non-degeneracy conditions of the tensor decomposition to get the global optimality guarantee.

In contrast, our nonlinear spectral method requires only that the data is nonnegative, which is true for all sorts of count data such as images, word frequencies, etc. The condition which guarantees global optimality depends only on the parameters of the architecture of the network and boils down to the computation of the spectral radius of a small nonnegative matrix; it can be checked without running the algorithm. Moreover, the nonlinear spectral method has a linear convergence rate, so the globally optimal training of the network is very fast. The two main changes compared to the standard setting are that we require nonnegativity of the weights of the network and that we minimize a modified objective function, namely the sum of the loss and the negative total sum of the outputs. While this model is non-standard, we show in some first experimental results that the resulting classifier is still expressive enough to create complex decision boundaries. As well, we achieve competitive performance on some UCI datasets. As the nonlinear spectral method requires some non-standard techniques, we use the main part of the paper to develop the key steps necessary for the proof; some proofs of the intermediate results are deferred to the supplementary material.

## 2 Main result

In this section we present the algorithm together with the main theorem providing the convergence guarantee. We limit the presentation to one hidden layer networks to improve the readability of the paper. Our approach can be generalized to feedforward networks of arbitrary depth. In particular, we present in Section 4.1 results for two hidden layers.

We consider in this paper multi-class classification where $d$ is the dimension of the feature space and $K$ is the number of classes. We use the cross-entropy loss, defined for label $y \in [K]$ and classifier $f : \mathbb{R}^d \to \mathbb{R}^K$ as

$$L(y, f(x)) = -\log\left(\frac{e^{f_y(x)}}{\sum_{j=1}^{K} e^{f_j(x)}}\right) = -f_y(x) + \log\Big(\sum_{j=1}^{K} e^{f_j(x)}\Big).$$
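To make the definition concrete, here is a minimal numeric sketch of this loss in plain Python (the function name is ours, not the paper's):

```python
import math

def cross_entropy(y, f):
    # L(y, f(x)) = -f_y(x) + log(sum_j exp(f_j(x)))
    return -f[y] + math.log(sum(math.exp(fj) for fj in f))

# The loss shrinks as the correct output dominates the others.
confident = cross_entropy(0, [5.0, 0.0, 0.0])  # correct class has the largest output
uncertain = cross_entropy(0, [1.0, 1.0, 1.0])  # all outputs equal -> loss = log(K)
```

Note that when all $K$ outputs are equal, the loss reduces to $\log(K)$, the loss of a uniform prediction.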

The function class we are using is a feedforward neural network with one hidden layer and $n_1$ hidden units. As activation functions we use real powers in the form of a generalized polynomial, that is, for $r \in [K]$, $x \in \mathbb{R}^d_+$, and $\alpha \in \mathbb{R}^{n_1}$ with $\alpha_l \geq 1$, we define:

$$f_r(x) = f_r^{(w,u)}(x) = \sum_{l=1}^{n_1} w_{rl} \Big( \sum_{m=1}^{d} u_{lm} x_m \Big)^{\alpha_l}, \qquad (1)$$

where $w \in \mathbb{R}^{K \times n_1}_+$ and $u \in \mathbb{R}^{n_1 \times d}_+$ are the parameters of the network which we optimize. The function class in (1) can be seen as a generalized polynomial in the sense that the powers do not have to be integers. Polynomial neural networks have been recently analyzed in LivShaSha2014. Please note that a ReLU activation function makes no sense in our setting as we require the data as well as the weights to be nonnegative. Even though nonnegativity of the weights is a strong constraint, one can model quite complex decision boundaries (see Figure 1, where we show the outcome of our method for a toy dataset in $\mathbb{R}^2$).
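As a sanity check of (1), the forward pass can be sketched in a few lines of NumPy; the sizes and weight values below are arbitrary toy examples, and nonnegativity of data and weights is enforced by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n1, d = 3, 4, 5                       # classes, hidden units, input dimension (toy sizes)
w = rng.uniform(0.1, 1.0, size=(K, n1))  # nonnegative output-layer weights
u = rng.uniform(0.1, 1.0, size=(n1, d))  # nonnegative hidden-layer weights
alpha = np.array([1.0, 1.5, 2.0, 2.5])   # distinct real exponents alpha_l >= 1

def forward(x, w, u, alpha):
    # f_r(x) = sum_l w_{rl} * (sum_m u_{lm} x_m)^{alpha_l}, cf. eq. (1)
    hidden = (u @ x) ** alpha            # n1 generalized-polynomial activations
    return w @ hidden                    # K outputs

x = rng.uniform(0.0, 1.0, size=d)        # nonnegative input
out = forward(x, w, u, alpha)
```

Since data and weights are nonnegative and $\alpha_l \geq 1$, all hidden activations and outputs are nonnegative as well.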

In order to simplify the notation we write $f = f^{(w,u)}$ for the output units $f_r$, $r \in [K]$. All output units and the hidden layer are normalized. We optimize over the set

$$S_+ = \Big\{ (w,u) \in \mathbb{R}^{K \times n_1}_+ \times \mathbb{R}^{n_1 \times d}_+ \;\Big|\; \|u\|_{p_u} = \rho_u,\ \|w_i\|_{p_w} = \rho_w,\ \forall i = 1, \dots, K \Big\}.$$

We also introduce $S_{++}$, where one replaces $\mathbb{R}_+$ with $\mathbb{R}_{++}$. The final optimization problem we are going to solve is given as

$$\max_{(w,u) \in S_+} \Phi(w,u), \qquad (2)$$
where, up to a technical term weighted by the arbitrarily small $\epsilon > 0$ discussed below,
$$\Phi(w,u) = \sum_{i=1}^{n} \Big( -L\big(y_i, f(x_i)\big) + \sum_{j=1}^{K} f_j(x_i) \Big),$$

where $(x_i, y_i)$, $i = 1, \dots, n$, is the training data. Note that this is a maximization problem and thus we use minus the loss in the objective so that we are effectively minimizing the loss. The reason to write this as a maximization problem is that our nonlinear spectral method is inspired by the theory of (sub)-homogeneous nonlinear eigenproblems on convex cones NB, which has its origin in the Perron-Frobenius theory for nonnegative matrices. In fact our work is motivated by the closely related Perron-Frobenius theory for multi-homogeneous problems developed in GTH. This is also the reason why we have nonnegative weights, as we work on the positive orthant, which is a convex cone. Note that $\epsilon$ in the objective can be chosen arbitrarily small and is added out of technical reasons.

In order to state our main theorem we need some additional notation. For $p \in (1,\infty)$, we let $p'$ denote the Hölder conjugate of $p$, i.e. $1/p + 1/p' = 1$, and set $\psi_p(v) = v^{p-1}$ for $v \geq 0$. We apply $\psi_p$ to scalars and vectors, in which case the function is applied componentwise. For a square matrix $A$ we denote its spectral radius by $\rho(A)$. Finally, we write $\nabla_{w_i}\Phi(w,u)$ (resp. $\nabla_u\Phi(w,u)$) to denote the gradient of $\Phi$ with respect to $w_i$ (resp. $u$) at $(w,u)$. The mapping

$$G^\Phi(w,u) = \left( \frac{\rho_w\, \psi_{p'_w}\big(\nabla_{w_1}\Phi(w,u)\big)}{\big\|\psi_{p'_w}\big(\nabla_{w_1}\Phi(w,u)\big)\big\|_{p_w}},\ \dots,\ \frac{\rho_w\, \psi_{p'_w}\big(\nabla_{w_K}\Phi(w,u)\big)}{\big\|\psi_{p'_w}\big(\nabla_{w_K}\Phi(w,u)\big)\big\|_{p_w}},\ \frac{\rho_u\, \psi_{p'_u}\big(\nabla_{u}\Phi(w,u)\big)}{\big\|\psi_{p'_u}\big(\nabla_{u}\Phi(w,u)\big)\big\|_{p_u}} \right), \qquad (3)$$

defines a sequence converging to the global optimum of (2). Indeed, we prove:

###### Theorem 1.

Let $p_w, p_u \in (1,\infty)$, $\rho_w, \rho_u > 0$, and $\alpha \in \mathbb{R}^{n_1}$ with $\alpha_l \geq 1$ for every $l \in [n_1]$. Define the constants $\xi_1, \xi_2 > 0$, and let $A \in \mathbb{R}^{(K+1)\times(K+1)}$ be defined as

$$A_{l,m} = 4(p'_w - 1)\,\xi_1, \quad A_{l,K+1} = 2(p'_w - 1)\big(2\xi_2 + \|\alpha\|_\infty\big), \quad A_{K+1,m} = 2(p'_u - 1)\big(2\xi_1 + 1\big), \quad A_{K+1,K+1} = 2(p'_u - 1)\big(2\xi_2 + \|\alpha\|_\infty - 1\big), \qquad \forall\, m, l \in [K].$$

If the spectral radius $\rho(A)$ of $A$ satisfies $\rho(A) < 1$, then (2) has a unique global maximizer $(w^*, u^*)$. Moreover, for every $(w^0, u^0) \in S_{++}$, there exists $R > 0$ such that

$$\lim_{k \to \infty} (w^k, u^k) = (w^*, u^*) \quad\text{and}\quad \big\|(w^k, u^k) - (w^*, u^*)\big\|_\infty \leq R\, \rho(A)^k \quad \forall k \in \mathbb{N},$$

where $(w^{k+1}, u^{k+1}) = G^\Phi(w^k, u^k)$ for every $k \in \mathbb{N}$.
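For intuition, one step of the update just stated can be sketched as follows, given precomputed strictly positive gradients; the helper names and shapes are our own illustration, not the paper's code:

```python
import numpy as np

def psi(v, p):
    # psi_p applied componentwise to positive vectors: v -> v**(p-1)
    return v ** (p - 1.0)

def spectral_step(grad_w, grad_u, pw, pu, rho_w, rho_u):
    # One step of the map in (3). grad_w[i] is the positive gradient of Phi
    # w.r.t. w_i (array of shape (K, n1)); grad_u is the gradient w.r.t. u.
    ppw = pw / (pw - 1.0)                # Hoelder conjugate p'_w
    ppu = pu / (pu - 1.0)                # Hoelder conjugate p'_u
    new_w = np.stack([rho_w * psi(g, ppw) / np.linalg.norm(psi(g, ppw), ord=pw)
                      for g in grad_w])
    gu = psi(grad_u.ravel(), ppu)
    new_u = (rho_u * gu / np.linalg.norm(gu, ord=pu)).reshape(grad_u.shape)
    return new_w, new_u                  # each block lands back on its sphere
```

By construction every $w_i$-block of the result has $p_w$-norm $\rho_w$ and the $u$-block has $p_u$-norm $\rho_u$, so the iteration never leaves the feasible set $S_{++}$.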

Note that one can easily check for a given model (number of hidden units $n_1$, choice of $p_w$, $p_u$, $\rho_w$, $\rho_u$, $\alpha$) whether the convergence guarantee to the global optimum holds, by computing the spectral radius of a square matrix of size $(K+1) \times (K+1)$. As our bounds for the entries of the matrix $A$ are very conservative, the “effective” spectral radius is typically much smaller, so that we observe very fast convergence in only a few iterations; see Section 5 for a discussion. Up to our knowledge this is the first practically feasible algorithm to achieve global optimality for a non-trivial neural network model. Additionally, compared to stochastic gradient descent, there is no free parameter in the algorithm, so no careful tuning of the learning rate is required. The reader might wonder why we add the second term in the objective, where we sum over all outputs. The reason is that we need the gradient of $\Phi$ to be strictly positive on the feasible set; this is also why we have to add the third term with arbitrarily small $\epsilon$. In Section 5 we show that this model achieves competitive results on a few UCI datasets.
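Checking the condition is a one-liner once the matrix of Theorem 1 is assembled; in the sketch below, `xi1` and `xi2` are placeholders for the constants $\xi_1, \xi_2$ of the theorem (whose exact formulas are not reproduced here):

```python
import numpy as np

def spectral_radius(A):
    return max(abs(np.linalg.eigvals(A)))

def theorem1_matrix(K, pw, pu, xi1, xi2, alpha_inf):
    # Entries of A as stated in Theorem 1 (p' denotes the Hoelder conjugate).
    ppw, ppu = pw / (pw - 1.0), pu / (pu - 1.0)
    A = np.empty((K + 1, K + 1))
    A[:K, :K] = 4.0 * (ppw - 1.0) * xi1
    A[:K, K] = 2.0 * (ppw - 1.0) * (2.0 * xi2 + alpha_inf)
    A[K, :K] = 2.0 * (ppu - 1.0) * (2.0 * xi1 + 1.0)
    A[K, K] = 2.0 * (ppu - 1.0) * (2.0 * xi2 + alpha_inf - 1.0)
    return A

# Larger p_w, p_u shrink every entry (p' -> 1), so rho(A) can be pushed below 1.
ok = spectral_radius(theorem1_matrix(K=2, pw=100.0, pu=100.0, xi1=1.0, xi2=1.0, alpha_inf=2.0))
```

The check costs only the eigenvalues of a $(K+1)\times(K+1)$ matrix and can be run before any training.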

#### Choice of α:

It turns out that in order to get a non-trivial classifier one has to choose the exponents so that $\alpha_l \neq \alpha_m$ for every $l \neq m$. The reason for this lies in certain invariance properties of the network. Suppose that we use a permutation invariant componentwise activation function $\sigma$, that is $\sigma(Px) = P\sigma(x)$ for any permutation matrix $P$, and suppose that $A, B$ are globally optimal weight matrices for a one hidden layer architecture. Then for any permutation matrix $P$,

$$A\sigma(Bx) = AP^T P\sigma(Bx) = AP^T \sigma(PBx),$$

which implies that $(AP^T, PB)$ and $(A, B)$ yield the same function and thus are also globally optimal. In our setting we know that the global optimum is unique, and thus it has to hold that $A = AP^T$ and $B = PB$ for all permutation matrices $P$. This implies that both $A$ and $B$ have rank one and thus lead to trivial classifiers. This is the reason why one has to use a different exponent $\alpha_l$ for every unit.

#### Dependence of ρ(A) on the model parameters:

Let $A, B \in \mathbb{R}^{n \times n}_+$ and assume $A_{ij} \leq B_{ij}$ for every $i, j$; then $\rho(A) \leq \rho(B)$, see Corollary 3.30 in Plemmons. It follows that $\rho(A)$ in Theorem 1 is increasing w.r.t. $\xi_1, \xi_2$ and the number of hidden units. Moreover, $\rho(A)$ is decreasing w.r.t. $p_w$ and $p_u$, and in particular, we note that for any fixed architecture it is always possible to choose $p_w, p_u$ large enough so that $\rho(A) < 1$. Indeed, we know from the Collatz-Wielandt formula (Theorem 8.1.26 in Horn) that $\rho(A) \leq \max_{i} (A\gamma)_i / \gamma_i$ for any $\gamma \in \mathbb{R}^{K+1}_{++}$. We use this to derive lower bounds on $p_w, p_u$ that ensure $\rho(A) < 1$. Let $\gamma = (1, \dots, 1)$; then $(A\gamma)_i < 1$ for every $i$ guarantees $\rho(A) < 1$ and is equivalent to

$$p_w > 4(K+1)\,\xi_1 + 3 \qquad\text{and}\qquad p_u > 2(K+1)\big(\|\alpha\|_\infty + 2\xi_2\big) - 1, \qquad (4)$$

where $\xi_1, \xi_2$ are defined as in Theorem 1. However, we think that our current bounds are sub-optimal, so that this choice is quite conservative. Finally, we note that the constant $R$ in Theorem 1 can be explicitly computed when running the algorithm (see Theorem 3).
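The bounds in (4) are cheap to evaluate. The sketch below computes them and verifies numerically, for one arbitrary choice of the constants (`xi1`, `xi2` and the model sizes are illustrative placeholders), that exceeding them indeed yields $\rho(A) < 1$:

```python
import numpy as np

def p_lower_bounds(K, xi1, xi2, alpha_inf):
    # Conservative bounds (4): p_w > 4(K+1) xi1 + 3,
    #                          p_u > 2(K+1)(||alpha||_inf + 2 xi2) - 1.
    return 4 * (K + 1) * xi1 + 3, 2 * (K + 1) * (alpha_inf + 2 * xi2) - 1

def rho_of_A(K, pw, pu, xi1, xi2, alpha_inf):
    # Spectral radius of the matrix A from Theorem 1.
    ppw, ppu = pw / (pw - 1.0), pu / (pu - 1.0)
    A = np.empty((K + 1, K + 1))
    A[:K, :K] = 4.0 * (ppw - 1.0) * xi1
    A[:K, K] = 2.0 * (ppw - 1.0) * (2.0 * xi2 + alpha_inf)
    A[K, :K] = 2.0 * (ppu - 1.0) * (2.0 * xi1 + 1.0)
    A[K, K] = 2.0 * (ppu - 1.0) * (2.0 * xi2 + alpha_inf - 1.0)
    return max(abs(np.linalg.eigvals(A)))

K, xi1, xi2, a_inf = 2, 1.0, 1.0, 2.0
pw_min, pu_min = p_lower_bounds(K, xi1, xi2, a_inf)   # 15 and 23 for these constants
rho = rho_of_A(K, pw_min + 1, pu_min + 1, xi1, xi2, a_inf)
```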

#### Proof Strategy:

The following main part of the paper is devoted to the proof of the theorem. For that we need some further notation. We introduce the sets

$$V_+ = \mathbb{R}^{K \times n_1}_+ \times \mathbb{R}^{n_1 \times d}_+, \qquad V_{++} = \mathbb{R}^{K \times n_1}_{++} \times \mathbb{R}^{n_1 \times d}_{++},$$
$$B_+ = \big\{ (w,u) \in V_+ \;\big|\; \|u\|_{p_u} \leq \rho_u,\ \|w_i\|_{p_w} \leq \rho_w,\ \forall i = 1, \dots, K \big\},$$

and similarly we define $B_{++}$ by replacing $V_+$ with $V_{++}$ in the definition. The high-level idea of the proof is as follows. We first show that the global maximum of our optimization problem in (2) is attained in the “interior” of $B_+$, that is in $S_{++}$. Moreover, we prove that any critical point of (2) in $S_{++}$ is a fixed point of the mapping $G^\Phi$. Then we proceed to show that there exists a unique fixed point of $G^\Phi$ in $S_{++}$ and thus there is a unique critical point of (2) in $S_{++}$. As the global maximizer of (2) exists and is attained in the interior, this fixed point has to be the global maximizer.

Finally, the proof of the fact that $G^\Phi$ has a unique fixed point follows by noting that $G^\Phi$ maps $S_{++}$ into $S_{++}$ and the fact that $S_{++}$ is a complete metric space with respect to the Thompson metric. We provide a characterization of the Lipschitz constant of $G^\Phi$ and in turn derive conditions under which $G^\Phi$ is a contraction. Finally, the application of the Banach fixed point theorem yields the uniqueness of the fixed point of $G^\Phi$ and the linear convergence rate to the global optimum of (2). In Section 4 we show the application of the established framework to our neural networks.

## 3 From the optimization problem to fixed point theory

###### Lemma 1.

Let $\Phi : V_+ \to \mathbb{R}$ be differentiable. If $\nabla\Phi(w,u)$ has strictly positive entries for every $(w,u) \in B_+$, then the global maximum of $\Phi$ on $B_+$ is attained in $S_{++}$.

###### Proof.

First note that, as $\Phi$ is a continuous function on the compact set $B_+$, the global minimum and maximum are attained. A boundary point of $S_+$ is characterized by the fact that at least one of the variables $w_1, \dots, w_K, u$ has a zero component. Suppose w.l.o.g. that the subset $J \subsetneq [n_1]$ of components of $w_i$ are zero, that is $(w_i)_j = 0$ for $j \in J$. The normal vector of the $p_w$-sphere at $w_i$ is given by $\nu = \psi_{p_w}(w_i)$. The set of tangent directions is thus given by

$$T = \{ v \in \mathbb{R}^{n_1} \mid \langle \nu, v \rangle = 0 \}.$$

Note that if $w_i$ is a local maximum, then

$$\langle \nabla_{w_i}\Phi(w,u), v \rangle \leq 0 \quad \text{for all } v \in T_+, \qquad (5)$$

where $T_+$ is the set of “positive” tangent directions, that is, those pointing inside the set $B_+$. Otherwise there would exist a direction of ascent which leads to a feasible point. Now note that $\nu$ has non-negative components as $w_i \in \mathbb{R}^{n_1}_+$. Thus

$$T_+ = \{ v \in \mathbb{R}^{n_1}_+ \mid v_i = 0 \text{ if } i \notin J \}.$$

However, by assumption $\nabla_{w_i}\Phi(w,u)$ is a vector with strictly positive components, and thus (5) can never be fulfilled: $T_+$ contains only vectors with non-negative components, and for any nonzero $v \in T_+$ at least one of the components is strictly positive as $J \neq \emptyset$. Finally, as the global maximum is attained in $B_+$ and no local maximum exists at the boundary, the global maximum has to be attained in $S_{++}$. ∎

We now identify the critical points of the objective $\Phi$ in $S_{++}$ with the fixed points of $G^\Phi$ in $S_{++}$.

###### Lemma 2.

Let $\Phi : V_+ \to \mathbb{R}$ be differentiable. If $\nabla\Phi(w,u)$ has strictly positive entries for all $(w,u) \in S_{++}$, then $(w,u)$ is a critical point of (2) in $S_{++}$ if and only if it is a fixed point of $G^\Phi$.

###### Proof.

The Lagrangian of $\Phi$ constrained to the spheres is given by

$$\mathcal{L}(w,u,\lambda) = \Phi(w,u) - \lambda_{K+1}\big(\|u\|_{p_u} - \rho_u\big) - \sum_{j=1}^{K} \lambda_j \big(\|w_j\|_{p_w} - \rho_w\big).$$

A necessary and sufficient condition Ber1999 for $(w,u)$ being a critical point of (2) is the existence of $\lambda \in \mathbb{R}^{K+1}$ with

$$\nabla_{w_j}\Phi(w,u) = \lambda_j\, \psi_{p_w}(w_j) \quad \forall j \in [K] \qquad\text{and}\qquad \nabla_u \Phi(w,u) = \lambda_{K+1}\, \psi_{p_u}(u). \qquad (6)$$

Note that, as $(w,u) \in S_{++}$ and the gradients are strictly positive in $S_{++}$, the multipliers $\lambda_1, \dots, \lambda_{K+1}$ have to be strictly positive. Noting that $\psi_{p'}(\psi_p(v)) = v$, we get

$$\psi_{p'_w}\big(\nabla_{w_j}\Phi(w,u)\big) = \lambda_j^{p'_w - 1}\, w_j \quad \forall j \in [K] \qquad\text{and}\qquad \psi_{p'_u}\big(\nabla_u \Phi(w,u)\big) = \lambda_{K+1}^{p'_u - 1}\, u. \qquad (7)$$

In particular, $(w,u)$ is a critical point of (2) in $S_{++}$ if and only if it satisfies (7). Finally, note that the denominators in (3) never vanish, as the gradient is strictly positive on $S_{++}$, and thus $G^\Phi$ defined in (3) is well-defined; if $(w,u)$ is a critical point, then by (7) it holds that $G^\Phi(w,u) = (w,u)$. On the other hand, if $G^\Phi(w^*,u^*) = (w^*,u^*)$, then

$$\frac{\rho_w\, \psi_{p'_w}\big(\nabla_{w_j}\Phi(w^*,u^*)\big)}{\big\|\psi_{p'_w}\big(\nabla_{w_j}\Phi(w^*,u^*)\big)\big\|_{p_w}} = w_j^* \quad \forall j \in [K],$$

and similarly for $u^*$; thus there exist $\lambda_j > 0$, $j \in [K]$, and $\lambda_{K+1} > 0$ such that (7) holds, and thus $(w^*,u^*)$ is a critical point of (2) in $S_{++}$. ∎

Our goal is to apply the Banach fixed point theorem to . We recall this theorem for the convenience of the reader.

###### Theorem 2 (Banach fixed point theorem e.g. KirKha2001 ).

Let $(X, d)$ be a complete metric space with a mapping $G : X \to X$ such that $d(G(x), G(y)) \leq q\, d(x,y)$ for some $q \in [0,1)$ and all $x, y \in X$. Then $G$ has a unique fixed point $x^*$ in $X$, that is $G(x^*) = x^*$, and the sequence defined as $x_{n+1} = G(x_n)$ with arbitrary $x_0 \in X$ converges to $x^*$ with linear convergence rate

$$d(x_n, x^*) \leq \frac{q^n}{1 - q}\, d(x_1, x_0).$$
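The theorem is constructive: iterating any contraction converges linearly to its fixed point. A tiny illustration with a classical contraction, unrelated to the paper's map:

```python
import math

def iterate(G, x0, n):
    # x_{k+1} = G(x_k); for a q-contraction, d(x_n, x*) <= q^n/(1-q) * d(x_1, x_0).
    x = x0
    for _ in range(n):
        x = G(x)
    return x

# cos maps [0, 1] into itself and |cos'| = |sin| <= sin(1) < 1 there, so it is a
# contraction; the iteration converges to the unique solution of x = cos(x).
x_star = iterate(math.cos, 1.0, 100)
```

With $q = \sin(1) \approx 0.84$, one hundred iterations already drive the error far below double precision.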

So, we need to endow $S_{++}$ with a metric $\mu$ so that $(S_{++}, \mu)$ is a complete metric space. A popular metric for the study of nonlinear eigenvalue problems on the positive orthant is the so-called Thompson metric Thompson, defined as

$$d(z, \tilde{z}) = \|\ln(z) - \ln(\tilde{z})\|_\infty \qquad\text{where}\qquad \ln(z) = \big(\ln(z_1), \dots, \ln(z_m)\big).$$
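The Thompson metric is straightforward to compute on the open positive orthant; a minimal sketch (the function name is ours):

```python
import math

def thompson(z, z_tilde):
    # d(z, z~) = max_i |ln(z_i) - ln(z~_i)|, defined for strictly positive vectors
    return max(abs(math.log(a) - math.log(b)) for a, b in zip(z, z_tilde))

# Scaling one argument by a constant c > 0 moves it a distance of exactly |ln c|:
dist = thompson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # = ln(2)
```

Note that the metric blows up as any component approaches zero, which is precisely why it is the natural distance on $\mathbb{R}^m_{++}$ rather than on its closure.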

Using the known facts that $(\mathbb{R}^m_{++}, d)$ is a complete metric space and that its topology coincides with the norm topology (see e.g. Corollary 2.5.6 and Proposition 2.5.2 in NB), we prove:

###### Lemma 3.

For $p \in (1,\infty)$ and $\rho > 0$, the sphere $\{ x \in \mathbb{R}^m_{++} \mid \|x\|_p = \rho \}$ endowed with the Thompson metric $d$ is a complete metric space.

###### Proof.

Let $(x_k)_k$ be a Cauchy sequence in the sphere w.r.t. the metric $d$. We know from Proposition 2.5.2 in NB that $(\mathbb{R}^m_{++}, d)$ is a complete metric space and thus there exists $x^* \in \mathbb{R}^m_{++}$ such that $x_k$ converges to $x^*$ w.r.t. $d$. Corollary 2.5.6 in NB implies that the topology of $(\mathbb{R}^m_{++}, d)$ coincides with the norm topology, implying that $x_k \to x^*$ w.r.t. the norm topology. Finally, since $x \mapsto \|x\|_p$ is a continuous function, we get $\lim_{k\to\infty} \|x_k\|_p = \|x^*\|_p = \rho$, i.e. $x^*$ lies on the sphere, which proves our claim. ∎

Now, the idea is to see $S_{++}$ as a product of such metric spaces, with the metric on each factor scaled by a positive weight $\gamma_i$. It follows that $(S_{++}, \mu)$ is a complete metric space with $\mu$ defined as

$$\mu\big((w,u), (\tilde{w},\tilde{u})\big) = \sum_{i=1}^{K} \gamma_i \|\ln(w_i) - \ln(\tilde{w}_i)\|_\infty + \gamma_{K+1} \|\ln(u) - \ln(\tilde{u})\|_\infty.$$

The motivation for introducing the weights $\gamma_i$ is given by the next result. We provide a characterization of the Lipschitz constant of a mapping $F : S_{++} \to S_{++}$ with respect to $\mu$. Moreover, this Lipschitz constant can be minimized by a smart choice of $\gamma$. For $(w,u) \in S_{++}$, we write $F^{w_i}(w,u)$ and $F^u(w,u)$ to denote the components of $F$ such that $F(w,u) = \big(F^{w_1}(w,u), \dots, F^{w_K}(w,u), F^u(w,u)\big)$.

###### Lemma 4.

Suppose that $F : S_{++} \to S_{++}$ and $A \in \mathbb{R}^{(K+1)\times(K+1)}_+$ satisfy

$$\big\langle \big|\nabla_{w_k} F^{w_i}_j(w,u)\big|,\, w_k \big\rangle \leq A_{i,k}\, F^{w_i}_j(w,u), \qquad \big\langle \big|\nabla_u F^{w_i}_j(w,u)\big|,\, u \big\rangle \leq A_{i,K+1}\, F^{w_i}_j(w,u)$$

and

$$\big\langle \big|\nabla_{w_k} F^{u}_{ab}(w,u)\big|,\, w_k \big\rangle \leq A_{K+1,k}\, F^{u}_{ab}(w,u), \qquad \big\langle \big|\nabla_u F^{u}_{ab}(w,u)\big|,\, u \big\rangle \leq A_{K+1,K+1}\, F^{u}_{ab}(w,u)$$

for all $(w,u) \in S_{++}$, $i, k \in [K]$, $j \in [n_1]$, $a \in [n_1]$ and $b \in [d]$. Then, for every $\gamma \in \mathbb{R}^{K+1}_{++}$ it holds

$$\mu\big(F(w,u), F(\tilde{w},\tilde{u})\big) \leq U\, \mu\big((w,u), (\tilde{w},\tilde{u})\big) \qquad\text{with}\qquad U = \max_{k \in [K+1]} \frac{(A^T\gamma)_k}{\gamma_k}.$$
###### Proof.

Let $(w,u), (\tilde{w},\tilde{u}) \in S_{++}$ and fix $i \in [K]$, $j \in [n_1]$. First, we show that one has

$$\big|\ln\big(F^{w_i}_j(w,u)\big) - \ln\big(F^{w_i}_j(\tilde{w},\tilde{u})\big)\big| \leq \sum_{s=1}^{K} A_{i,s} \|\ln(w_s) - \ln(\tilde{w}_s)\|_\infty + A_{i,K+1} \|\ln(u) - \ln(\tilde{u})\|_\infty.$$

If $F^{w_i}_j(w,u) = F^{w_i}_j(\tilde{w},\tilde{u})$, there is nothing to prove. So, suppose that $F^{w_i}_j(w,u) \neq F^{w_i}_j(\tilde{w},\tilde{u})$, and consider the map $g$ defined as

$$g(\bar{w}, \bar{u}) = \ln\big(F^{w_i}_j(\exp(\bar{w}, \bar{u}))\big) \qquad\text{where}\qquad \exp(\bar{w}, \bar{u}) = \big(e^{\bar{w}_{1,1}}, \dots, e^{\bar{w}_{K,n_1}}, e^{\bar{u}_{11}}, \dots, e^{\bar{u}_{n_1 d}}\big).$$

Note that $g$ is continuously differentiable because it is the composition of continuously differentiable mappings. Set $(v,x) = (\ln(w), \ln(u))$ and $(y,z) = (\ln(\tilde{w}), \ln(\tilde{u}))$. By the mean value theorem, there exists $t \in [0,1]$ such that $(\hat{w}, \hat{u}) = t(v,x) + (1-t)(y,z)$ satisfies

$$g(v,x) - g(y,z) = \big\langle \nabla g(\hat{w}, \hat{u}),\, (v,x) - (y,z) \big\rangle.$$

Let $(\bar{w}, \bar{u}) = \exp(\hat{w}, \hat{u})$. It follows that

$$\ln\big(F^{w_i}_j(w,u)\big) - \ln\big(F^{w_i}_j(\tilde{w},\tilde{u})\big) = \big\langle \nabla g(\hat{w}, \hat{u}),\, (v,x) - (y,z) \big\rangle = \frac{\big\langle \big(\nabla F^{w_i}_j(\bar{w}, \bar{u})\big) \circ (\bar{w}, \bar{u}),\, (v,x) - (y,z) \big\rangle}{F^{w_i}_j(\bar{w}, \bar{u})},$$

where $\circ$ denotes the Hadamard product. Note that, by convexity of the exponential and $(w,u), (\tilde{w},\tilde{u}) \in S_{++}$, we have $(\bar{w}, \bar{u}) \in B_{++}$, as

$$\|\bar{w}\|_{p_w} = \big\|e^{tv + (1-t)y}\big\|_{p_w} \leq \big\|t e^{v} + (1-t) e^{y}\big\|_{p_w} \leq t \|e^{v}\|_{p_w} + (1-t) \|e^{y}\|_{p_w} = t \|w\|_{p_w} + (1-t) \|\tilde{w}\|_{p_w} \leq \rho_w,$$

where we have used in the second step convexity of the exponential and that the $p$-norms are order preserving, i.e. $\|a\|_p \leq \|b\|_p$ if $0 \leq a_i \leq b_i$ for all $i$. A similar argument shows $\|\bar{u}\|_{p_u} \leq \rho_u$. Hence

$$\begin{aligned} \big|\ln\big(F^{w_i}_j(w,u)\big) - \ln\big(F^{w_i}_j(\tilde{w},\tilde{u})\big)\big| &\leq \sum_{k=1}^{K} \frac{\big|\big\langle \big(\nabla_{w_k} F^{w_i}_j(\bar{w},\bar{u})\big) \circ \bar{w}_k,\, v_k - y_k \big\rangle\big|}{F^{w_i}_j(\bar{w},\bar{u})} + \frac{\big|\big\langle \big(\nabla_u F^{w_i}_j(\bar{w},\bar{u})\big) \circ \bar{u},\, x - z \big\rangle\big|}{F^{w_i}_j(\bar{w},\bar{u})} \\ &\leq \sum_{k=1}^{K} \frac{\big\|\big(\nabla_{w_k} F^{w_i}_j(\bar{w},\bar{u})\big) \circ \bar{w}_k\big\|_1}{F^{w_i}_j(\bar{w},\bar{u})} \|v_k - y_k\|_\infty + \frac{\big\|\big(\nabla_u F^{w_i}_j(\bar{w},\bar{u})\big) \circ \bar{u}\big\|_1}{F^{w_i}_j(\bar{w},\bar{u})} \|x - z\|_\infty \\ &\leq \sum_{k=1}^{K} A_{i,k} \|\ln(w_k) - \ln(\tilde{w}_k)\|_\infty + A_{i,K+1} \|\ln(u) - \ln(\tilde{u})\|_\infty. \end{aligned}$$

In particular, taking the maximum over $j \in [n_1]$ shows that the same bound holds for $\|\ln(F^{w_i}(w,u)) - \ln(F^{w_i}(\tilde{w},\tilde{u}))\|_\infty$ for every $i \in [K]$. A similar argument shows that $\|\ln(F^{u}(w,u)) - \ln(F^{u}(\tilde{w},\tilde{u}))\|_\infty$ is upper bounded by

$$\sum_{s=1}^{K} A_{K+1,s} \|\ln(w_s) - \ln(\tilde{w}_s)\|_\infty + A_{K+1,K+1} \|\ln(u) - \ln(\tilde{u})\|_\infty.$$

So, we finally get

$$\begin{aligned} \mu\big(F(w,u), F(\tilde{w},\tilde{u})\big) &= \sum_{i=1}^{K} \gamma_i \big\|\ln\big(F^{w_i}(w,u)\big) - \ln\big(F^{w_i}(\tilde{w},\tilde{u})\big)\big\|_\infty + \gamma_{K+1} \big\|\ln\big(F^{u}(w,u)\big) - \ln\big(F^{u}(\tilde{w},\tilde{u})\big)\big\|_\infty \\ &\leq \sum_{s=1}^{K} (A^T\gamma)_s \|\ln(w_s) - \ln(\tilde{w}_s)\|_\infty + (A^T\gamma)_{K+1} \|\ln(u) - \ln(\tilde{u})\|_\infty \leq U\, \mu\big((w,u), (\tilde{w},\tilde{u})\big). \end{aligned}$$
∎

Note that, from the Collatz-Wielandt ratio for nonnegative matrices, we know that the constant $U = \max_{k} (A^T\gamma)_k / \gamma_k$ in Lemma 4 is lower bounded by the spectral radius $\rho(A)$ of $A$. Indeed, by Theorem 8.1.31 in Horn, we know that if $A^T$ has a positive eigenvector $\gamma$, then

$$\max_{i \in [K+1]} \frac{(A^T\gamma)_i}{\gamma_i} = \rho(A) = \min_{\tilde{\gamma} \in \mathbb{R}^{K+1}_{++}} \max_{i \in [K+1]} \frac{(A^T\tilde{\gamma})_i}{\tilde{\gamma}_i}. \qquad (8)$$

Therefore, in order to obtain the minimal Lipschitz constant in Lemma 4, we choose the weights $\gamma$ of the metric $\mu$ to be the components of the positive eigenvector of $A^T$. A combination of Theorem 2, Lemma 4 and this observation implies the following result.
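The identity (8) can be checked numerically: at a positive Perron eigenvector of $A^T$ the Collatz-Wielandt ratio equals $\rho(A)$, while any other positive $\gamma$ only gives an upper bound. A small sketch with an arbitrary example matrix (not one arising from Theorem 1):

```python
import numpy as np

A = np.array([[0.2, 0.5, 0.1],
              [0.3, 0.1, 0.4],
              [0.2, 0.2, 0.3]])              # arbitrary positive (hence irreducible) example

vals, vecs = np.linalg.eig(A.T)
k = np.argmax(vals.real)                      # Perron eigenvalue of A^T (= rho(A))
rho = vals.real[k]
gamma = np.abs(vecs[:, k].real)               # positive Perron eigenvector of A^T

best = max((A.T @ gamma) / gamma)             # ratio at the Perron vector: equals rho(A)
other = max((A.T @ np.ones(3)) / np.ones(3))  # any positive gamma only upper-bounds rho(A)
```

This is exactly why the Perron eigenvector of $A^T$ is the right choice of weights for the metric $\mu$.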

###### Theorem 3.

Let $\Phi : V_+ \to \mathbb{R}$ be differentiable with $\nabla\Phi(w,u)$ strictly positive for all $(w,u) \in S_{++}$. Let $G^\Phi$ be defined as in (3). Suppose that there exists a matrix $A \in \mathbb{R}^{(K+1)\times(K+1)}_+$ such that $\rho(A) < 1$ and $G^\Phi$ satisfies the assumptions of Lemma 4 and