is currently the state-of-the-art machine learning technique in many application areas such as computer vision or natural language processing. While the theoretical foundations of neural networks have been explored in depth, see e.g. BarAnt1998, understanding the success of training deep neural networks is currently a very active research area ChoEtAl2015; DanFroSin2016; HarRecSin2015. On the other hand, the parameter search for stochastic gradient descent and variants such as Adagrad and Adam can be quite tedious, and there is no guarantee that one converges to the global optimum. In particular, the training problem is in general NP-hard even for a single hidden layer, see Sim2002 and references therein. This implies that to achieve global optimality efficiently one has to impose certain conditions on the problem.
A recent line of research has directly tackled the optimization problem of neural networks and either provided certain guarantees in terms of the global optimum AroEtAl2014; LivShaSha2014 or directly proved convergence to the global optimum HaeVid2015; JanSedAna2015. To the best of our knowledge, the latter two papers are the first results which provide a globally optimal algorithm for training neural networks. While providing a lot of interesting insights on the relationship between structured matrix factorization and training of neural networks, Haeffele and Vidal admit in their paper HaeVid2015 that their results are “challenging to apply in practice”. Janzamin et al. JanSedAna2015 use a tensor approach and propose a globally optimal algorithm for a feedforward neural network with one hidden layer and squared loss. However, their approach requires the computation of the score function tensor, which uses the density of the data-generating measure. This measure is unknown and difficult to estimate for high-dimensional feature spaces. Moreover, one has to check certain non-degeneracy conditions of the tensor decomposition to get the global optimality guarantee.
In contrast, our nonlinear spectral method just requires that the data is nonnegative, which is true for all sorts of count data such as images, word frequencies etc. The condition which guarantees global optimality depends only on the parameters of the network architecture and boils down to the computation of the spectral radius of a small nonnegative matrix. The condition can be checked without running the algorithm. Moreover, the nonlinear spectral method has a linear convergence rate and thus the globally optimal training of the network is very fast. The two main changes compared to the standard setting are that we require nonnegativity of the weights of the network and that we minimize a modified objective function, which is the sum of the loss and the negative total sum of the outputs. While this model is non-standard, we show in some first experimental results that the resulting classifier is still expressive enough to create complex decision boundaries. We also achieve competitive performance on some UCI datasets. As the nonlinear spectral method requires some non-standard techniques, we use the main part of the paper to develop the key steps necessary for the proof. However, some proofs of the intermediate results are moved to the supplementary material.
2 Main result
In this section we present the algorithm together with the main theorem providing the convergence guarantee. We limit the presentation to one hidden layer networks to improve the readability of the paper. Our approach can be generalized to feedforward networks of arbitrary depth. In particular, we present in Section 4.1 results for two hidden layers.
We consider in this paper multi-class classification where is the dimension of the feature space and is the number of classes. We use the negative cross-entropy loss defined for label and classifier as
The function class we are using is a feedforward neural network with one hidden layer with hidden units. As activation functions we use real powers of the form of a generalized polynomial, that is, for with , , we define:
where and , are the parameters of the network which we optimize. The function class in (1) can be seen as a generalized polynomial in the sense that the powers do not have to be integers. Polynomial neural networks have been recently analyzed in LivShaSha2014. Please note that a ReLU activation function makes no sense in our setting as we require the data as well as the weights to be nonnegative, so that the ReLU would act as the identity on the nonnegative inputs of the hidden units. Even though nonnegativity of the weights is a strong constraint, one can model quite complex decision boundaries (see Figure 1, where we show the outcome of our method for a toy dataset in ).
In order to simplify the notation we use for the output units , . All output units and the hidden layer are normalized. We optimize over the set
We also introduce where one replaces with . The final optimization problem we are going to solve is given as
where , is the training data. Note that this is a maximization problem and thus we use minus the loss in the objective so that we are effectively minimizing the loss. The reason to write this as a maximization problem is that our nonlinear spectral method is inspired by the theory of (sub)-homogeneous nonlinear eigenproblems on convex cones NB, which has its origin in the Perron-Frobenius theory for nonnegative matrices. In fact, our work is motivated by the closely related Perron-Frobenius theory for multi-homogeneous problems developed in GTH. This is also the reason why we have nonnegative weights, as we work on the positive orthant, which is a convex cone. Note that in the objective can be chosen arbitrarily small and is added for technical reasons.
In order to state our main theorem we need some additional notation. For , we let be the Hölder conjugate of , and . We apply to scalars and vectors in which case the function is applied componentwise. For a square matrix we denote its spectral radius by . Finally, we write (resp. ) to denote the gradient of with respect to (resp. ) at . The mapping
defines a sequence converging to the global optimum of (2). Indeed, we prove:
Theorem 1.
Let , , , and with for every . Define as , , and let be defined as
If the spectral radius of satisfies , then (2) has a unique global maximizer . Moreover, for every , there exists such that
where for every .
Note that for a given model (number of hidden units , choice of , , , , ) one can easily check whether the convergence guarantee to the global optimum holds by computing the spectral radius of a square matrix of size . As our bounds for the matrix are very conservative, the “effective” spectral radius is typically much smaller, so that we have very fast convergence in only a few iterations, see Section 5 for a discussion. To the best of our knowledge, this is the first practically feasible algorithm to achieve global optimality for a non-trivial neural network model. Additionally, in contrast to stochastic gradient descent, there is no free parameter in the algorithm, so no careful tuning of the learning rate is required. The reader might wonder why we add the second term in the objective, where we sum over all outputs. The reason is that we need the gradient of to be strictly positive in ; this is also why we have to add the third term for arbitrarily small . In Section 5 we show that this model achieves competitive results on a few UCI datasets.
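The check described above amounts to a single eigenvalue computation. The following is a minimal sketch, assuming the nonnegative matrix of Theorem 1 has already been assembled for the chosen architecture. The matrix `A_example` is a hypothetical placeholder and is not the matrix of the theorem; the function names are illustrative as well.

```python
import numpy as np


def spectral_radius(A):
    """Return the largest absolute value among the eigenvalues of A."""
    A = np.asarray(A, dtype=float)
    if A.ndim != 2 or A.shape[0] != A.shape[1]:
        raise ValueError("A must be a square matrix")
    return float(np.max(np.abs(np.linalg.eigvals(A))))


def guarantee_holds(A):
    """Check the condition 'spectral radius < 1' for a nonnegative matrix."""
    A = np.asarray(A, dtype=float)
    if np.any(A < 0):
        raise ValueError("A must have nonnegative entries")
    return spectral_radius(A) < 1.0


# Hypothetical placeholder matrix; its eigenvalues are 0.5 and 0.1.
A_example = np.array([[0.2, 0.3],
                      [0.1, 0.4]])
rho = spectral_radius(A_example)
ok = guarantee_holds(A_example)
```

Since the matrix is small, this check is cheap compared with training, and it can be run before any data is processed.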
Choice of :
It turns out that in order to get a non-trivial classifier one has to choose so that for every with . The reason for this lies in certain invariance properties of the network. Suppose that we use a permutation invariant componentwise activation function , that is for any permutation matrix and suppose that are globally optimal weight matrices for a one hidden layer architecture, then for any permutation matrix ,
which implies that and yield the same function and thus are also globally optimal. In our setting we know that the global optimum is unique and thus it has to hold that and for all permutation matrices . This implies that both and have rank one and thus lead to trivial classifiers. This is the reason why one has to use different for every unit.
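The invariance argument can be checked numerically on a tiny example. The sketch below assumes a simplified bias-free network with componentwise power activations; the names `W1`, `W2`, `alpha` and `network` are illustrative and the exact function class in (1) is not reproduced here. With a common power, permuting the hidden units leaves the output unchanged; with distinct powers per unit, the symmetry is broken.

```python
import numpy as np


def network(W2, W1, x, alpha):
    """Bias-free one-hidden-layer network with componentwise powers alpha."""
    hidden = (W1 @ x) ** alpha
    return W2 @ hidden


def permute_hidden_units(W2, W1, P):
    """Relabel the hidden units with the permutation matrix P."""
    return W2 @ P.T, P @ W1


# Nonnegative data and weights (illustrative values).
x = np.array([1.0, 2.0])
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
W2 = np.array([[1.0, 2.0, 3.0]])
P = np.eye(3)[[1, 2, 0]]  # cyclic permutation of the three hidden units

same_power = np.full(3, 2.0)
distinct_powers = np.array([1.0, 2.0, 3.0])

W2_p, W1_p = permute_hidden_units(W2, W1, P)
out_same = network(W2, W1, x, same_power)
out_same_perm = network(W2_p, W1_p, x, same_power)
out_distinct = network(W2, W1, x, distinct_powers)
out_distinct_perm = network(W2_p, W1_p, x, distinct_powers)
```

Here `out_same` and `out_same_perm` coincide, whereas `out_distinct` and `out_distinct_perm` differ, so the permuted weights no longer represent the same function once every unit has its own power.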
Dependence of on the model parameters:
Let and assume for every , then , see Corollary 3.30 in Plemmons. It follows that in Theorem 1 is increasing w.r.t. and the number of hidden units . Moreover, is decreasing w.r.t. and in particular, we note that for any fixed architecture it is always possible to find large enough so that . Indeed, we know from the Collatz-Wielandt formula (Theorem 8.1.26 in Horn) that for any . We use this to derive lower bounds on that ensure . Let , then for every guarantees and is equivalent to
where are defined as in Theorem 1. However, we think that our current bounds are sub-optimal so that this choice is quite conservative. Finally, we note that the constant in Theorem 1 can be explicitly computed when running the algorithm (see Theorem 3).
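The Collatz-Wielandt bounds used in this argument are easy to evaluate. The sketch below assumes the standard statement for a nonnegative matrix and a strictly positive vector u: the spectral radius lies between the smallest and the largest ratio (Au)_i / u_i. The matrix is a hypothetical placeholder, not the matrix of Theorem 1, and the helper names are illustrative.

```python
import numpy as np


def collatz_wielandt_bounds(A, u):
    """Return (min_i, max_i) of (A u)_i / u_i, which bracket the spectral radius."""
    A = np.asarray(A, dtype=float)
    u = np.asarray(u, dtype=float)
    if np.any(A < 0) or np.any(u <= 0):
        raise ValueError("need a nonnegative matrix and a strictly positive vector")
    ratios = (A @ u) / u
    return float(ratios.min()), float(ratios.max())


def perron_vector(A):
    """Positive eigenvector of an entrywise positive matrix, scaled to sum to 1."""
    A = np.asarray(A, dtype=float)
    if np.any(A <= 0):
        raise ValueError("this helper assumes an entrywise positive matrix")
    eigvals, eigvecs = np.linalg.eig(A)
    k = int(np.argmax(np.abs(eigvals)))
    v = np.real(eigvecs[:, k])
    return v / v.sum()  # dividing by the sum also fixes the sign


# Hypothetical placeholder matrix with spectral radius 0.5.
A_example = np.array([[0.2, 0.3],
                      [0.1, 0.4]])
lower, upper = collatz_wielandt_bounds(A_example, [1.0, 2.0])
tight_lower, tight_upper = collatz_wielandt_bounds(A_example, perron_vector(A_example))
```

Any positive test vector gives a valid upper bound, so showing that the upper ratio is below one for a single convenient vector already certifies the condition. The bound becomes tight when the test vector is the positive eigenvector.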
The following main part of the paper is devoted to the proof of the convergence guarantee of the algorithm. For that we need some further notation. We introduce the sets
and similarly we define replacing by in the definition. The high-level idea of the proof is that we first show that the global maximum of our optimization problem in (2) is attained in the “interior” of , that is . Moreover, we prove that any critical point of (2) in is a fixed point of the mapping . Then we proceed to show that there exists a unique fixed point of in and thus there is a unique critical point of (2) in . As the global maximizer of (2) exists and is attained in the interior, this fixed point has to be the global maximizer.
The proof of the fact that has a unique fixed point follows by noting that maps into and that is a complete metric space with respect to the Thompson metric. We provide a characterization of the Lipschitz constant of and in turn derive conditions under which is a contraction. Finally, the application of the Banach fixed point theorem yields the uniqueness of the fixed point of and the linear convergence rate to the global optimum of (2). In Section 4 we show the application of the established framework for our neural networks.
3 From the optimization problem to fixed point theory
Lemma 1.
Let be differentiable. If for every , then the global maximum of on is attained in .
First note that as is a continuous function on the compact set the global minimum and maximum are attained. A boundary point of is characterized by the fact that at least one of the variables has a zero component. Suppose w.l.o.g. that the subset of components of are zero, that is . The normal vector of the -sphere at is given by . The set of tangent directions is thus given by
Note that if is a local maximum, then
where is the set of “positive” tangent directions, that are pointing inside the set . Otherwise there would exist a direction of ascent which leads to a feasible point. Now note that has non-negative components as . Thus
However, by assumption is a vector with strictly positive components and thus (5) can never be fulfilled as contains only vectors with non-negative components and at least one of the components is strictly positive as . Finally, as the global maximum is attained in and no local maximum exists at the boundary, the global maximum has to be attained in . ∎
We now identify critical points of the objective in with fixed points of in .
Lemma 2.
Let be differentiable. If for all , then is a critical point of in if and only if it is a fixed point of .
The Lagrangian of constrained to the unit sphere is given by
A necessary and sufficient condition Ber1999 for being a critical point of is the existence of with
Note that as and the gradients are strictly positive in the , have to be strictly positive. Noting that , we get
In particular, is a critical point of in if and only if it satisfies (7). Finally, note that
and thus there exists , and such that (7) holds and thus is a critical point of in . ∎
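The correspondence between critical points and fixed points can be illustrated for a single block of variables. The sketch below assumes one vector constrained to the positive part of a unit p-sphere with p > 1; the mapping of the paper acts on several such blocks and its exact form is not reproduced here. The stationarity condition, gradient proportional to x^(p-1), is solved for x using the Hölder conjugate p' = p / (p - 1), since p' - 1 = 1 / (p - 1). The function name and the linear test objective are illustrative.

```python
import numpy as np


def sphere_fixed_point_map(grad_f, x, p):
    """Map x to the normalized componentwise power of the gradient at x."""
    if p <= 1:
        raise ValueError("p must be larger than 1")
    g = np.asarray(grad_f(x), dtype=float)
    if np.any(g <= 0):
        raise ValueError("the gradient must be strictly positive")
    p_conj = p / (p - 1.0)          # Hölder conjugate of p
    z = g ** (p_conj - 1.0)         # solves grad = const * x**(p - 1) for x
    return z / np.linalg.norm(z, ord=p)


# Illustrative linear objective f(x) = <a, x> with positive a, so grad f = a.
a = np.array([3.0, 4.0])
x0 = np.array([1.0, 1.0]) / np.sqrt(2.0)
x_star = sphere_fixed_point_map(lambda x: a, x0, p=2.0)
x_again = sphere_fixed_point_map(lambda x: a, x_star, p=2.0)
x_p3 = sphere_fixed_point_map(lambda x: a, x0, p=3.0)
```

For the linear objective the map reaches the maximizer a / ||a||_2 in one step and leaves it unchanged afterwards, which is the fixed point property in its simplest form.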
Our goal is to apply the Banach fixed point theorem to . We recall this theorem for the convenience of the reader.
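Theorem 2 (Banach fixed point theorem, see e.g. KirKha2001).
Let $(X, d)$ be a complete metric space with a mapping $T: X \to X$ such that $d(T(x), T(y)) \le q\, d(x, y)$ for some $q \in [0, 1)$ and all $x, y \in X$. Then $T$ has a unique fixed point $x^*$ in $X$, that is $T(x^*) = x^*$, and the sequence defined as $x^{k+1} = T(x^k)$ with $x^0 \in X$ converges to $x^*$ with linear convergence rate
$$d(x^k, x^*) \le \frac{q^k}{1-q}\, d(x^1, x^0).$$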
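The theorem can be illustrated with a toy contraction on the real line. The sketch below uses the standard a priori estimate of the Banach fixed point theorem, namely that the distance of the k-th iterate to the fixed point is at most q^k / (1 - q) times the length of the first step, where q is the contraction constant. The map `T` and all names are illustrative and unrelated to the mapping of the paper.

```python
def fixed_point_iteration(T, x0, num_steps):
    """Return the iterates x0, T(x0), T(T(x0)), ... as a list."""
    iterates = [x0]
    for _ in range(num_steps):
        iterates.append(T(iterates[-1]))
    return iterates


def a_priori_bound(q, k, first_step):
    """Bound q**k / (1 - q) * d(x1, x0) on the distance of x_k to the fixed point."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must lie in [0, 1)")
    return q ** k / (1.0 - q) * first_step


# Toy contraction with constant q = 0.5 and unique fixed point 2.
q = 0.5


def T(x):
    return q * x + 1.0


xs = fixed_point_iteration(T, x0=0.0, num_steps=30)
first_step = abs(xs[1] - xs[0])
errors = [abs(x - 2.0) for x in xs]
bounds = [a_priori_bound(q, k, first_step) for k in range(len(xs))]
```

Each entry of `errors` stays below the corresponding entry of `bounds`, and the error shrinks by the factor q in every step, which is what linear convergence means here.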
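So, we need to endow with a metric so that is a complete metric space. A popular metric for the study of nonlinear eigenvalue problems on the positive orthant is the so-called Thompson metric Thompson, defined for vectors $z, \tilde{z}$ with strictly positive components as
$$d(z, \tilde{z}) = \|\ln(z) - \ln(\tilde{z})\|_\infty,$$
where the logarithm is applied componentwise.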
Using the known facts that is a complete metric space and its topology coincides with the norm topology (see e.g. Corollary 2.5.6 and Proposition 2.5.2 in NB), we prove:
Lemma 3.
For and , is a complete metric space.
Let be a Cauchy sequence w.r.t. the metric . We know from Proposition 2.5.2 in NB that is a complete metric space and thus there exists such that converges to w.r.t. . Corollary 2.5.6 in NB implies that the topology of coincides with the norm topology, implying that w.r.t. the norm topology. Finally, since is a continuous function, we get , i.e. , which proves our claim. ∎
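For intuition, the Thompson metric on the positive orthant can be computed in a few lines. The sketch below assumes its standard form, the sup-norm distance between componentwise logarithms; the function name and test vectors are illustrative.

```python
import numpy as np


def thompson_metric(x, y):
    """Sup-norm distance between the componentwise logarithms of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same shape")
    if np.any(x <= 0) or np.any(y <= 0):
        raise ValueError("both vectors must be strictly positive")
    return float(np.max(np.abs(np.log(x) - np.log(y))))


x = np.array([1.0, 2.0])
y = np.array([2.0, 1.0])
dist = thompson_metric(x, y)
dist_scaled = thompson_metric(10.0 * x, 10.0 * y)
dist_near_boundary = thompson_metric([1e-9, 1.0], [1.0, 1.0])
```

The distance is invariant under a common positive rescaling of both arguments, and it grows without bound as a component approaches zero. The latter is why sets of strictly positive vectors, which are not closed in the norm topology, can still be complete with respect to this metric.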
Now, the idea is to see as a product of such metric spaces. For , let and for some constant . Furthermore, let and . Then is a complete metric space for every and . It follows that is a complete metric space with defined as
The motivation for introducing the weights is given by the next theorem. We provide a characterization of the Lipschitz constant of a mapping with respect to . Moreover, this Lipschitz constant can be minimized by a smart choice of . For , we write and to denote the components of such that .
Lemma 4.
Suppose that and satisfies
for all , , and . Then, for every it holds
Let and . First, we show that for , one has
Let . If , there is nothing to prove. So, suppose that . Let and with
Note that is continuously differentiable because it is the composition of continuously differentiable mappings. Set and . By the mean value theorem, there exists , such that satisfies
Let . It follows that
where denotes the Hadamard product. Note that by convexity of the exponential and we have as
where we have used in the second step convexity of the exponential and that the -norms are order preserving, if for all . A similar argument shows . Hence
In particular, taking the maximum over shows that for every we have
A similar argument shows that is upper bounded by
So, we finally get
Note that, from the Collatz-Wielandt ratio for nonnegative matrices, we know that the constant in Lemma 4 is lower bounded by the spectral radius of . Indeed, by Theorem 8.1.31 in Horn , we know that if
has a positive eigenvector, then
Therefore, in order to obtain the minimal Lipschitz constant in Lemma 4, we choose the weights of the metric to be the components of . A combination of Theorem 2, Lemma 4 and this observation implies the following result.