1 Introduction
We consider the problem of finding a structured saddle point of a smooth objective (throughout this work, we aim to find saddles that satisfy a particular (local) min-max structure in the input parameters), namely solving an optimization problem of the form

(1) $\min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}^m} f(x, y)$.
Here, we assume that $f$ is smooth in $x$ and $y$ but not necessarily convex in $x$ or concave in $y$. This particular problem arises in many applications, such as generative adversarial networks (GANs) [15], robust optimization [4], and game theory [37, 23]. Solving the saddle point problem in Eq. (1) is equivalent to finding a point $z^* = (x^*, y^*)$ such that

(2) $f(x^*, y) \le f(x^*, y^*) \le f(x, y^*)$

holds for all $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$.
holds for all and . For a non convexconcave function , finding such a saddle point is computationally infeasible. Instead of finding a global saddle point for Eq. (1), we aim for a more modest goal: finding a locally optimal saddle point, i.e. a point for which the condition in Eq. (2) holds true in a local neighbourhood around .
There is a rich literature on saddle point optimization for the particular class of convex-concave functions, i.e. when $f$ is convex in $x$ and concave in $y$. Although this type of objective function is commonly encountered in applications such as constrained convex minimization, many saddle point problems of interest do not satisfy the convex-concave assumption. Two popular examples that recently emerged in machine learning are distributionally robust optimization [12, 38] and the training of generative adversarial networks [15]. These applications can be framed as saddle point optimization problems which, due to the complex functional representation of the neural networks used as models, do not fulfill the convexity-concavity condition.
First-order methods are commonly used to solve problem (1) as they have a cheap per-iteration cost and are therefore easily scalable. One particular method of choice is simultaneous gradient descent/ascent, which performs the following iterative updates:

(3) $x_{t+1} = x_t - \eta \nabla_x f(x_t, y_t), \qquad y_{t+1} = y_t + \eta \nabla_y f(x_t, y_t)$,

where $\eta > 0$ is a chosen step size which can, e.g., decrease with time or be a bounded constant. The convergence analysis of the above iterate sequence is typically tied to a strong/strict convexity-concavity property of the objective function defining the dynamics. Under such conditions, the gradient method is guaranteed to converge to a desired saddle point [3]. These conditions can also be relaxed to some extent, as will be further discussed in Section 2.
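As a minimal concrete illustration of these simultaneous updates, the scheme can be sketched as follows; the helper name and the toy strongly convex-concave objective are our own choices, not part of the original setup:

```python
import numpy as np

def simultaneous_gda(grad_x, grad_y, x0, y0, eta=0.1, steps=1000):
    """Simultaneous gradient descent/ascent, Eq. (3):
    descend in x and ascend in y, both evaluated at the same iterate."""
    x, y = np.asarray(x0, float), np.asarray(y0, float)
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)  # gradients at (x_t, y_t)
        x, y = x - eta * gx, y + eta * gy
    return x, y

# Toy strongly convex-concave objective f(x, y) = x^2 - y^2 (saddle at the origin).
x, y = simultaneous_gda(lambda x, y: 2 * x, lambda x, y: -2 * y, 1.0, 1.0)
```

On this convex-concave instance both coordinates contract toward the saddle; the non-convex-concave case discussed next is precisely where such guarantees break down.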
It is known that the gradient method is locally asymptotically stable [26]; but stability alone is not sufficient to guarantee convergence to a locally optimal saddle point. Through an example, we will later illustrate that the gradient method is indeed stable at some undesired stationary points at which the structural min-max property (namely, that the function is a local minimum in $x$ and a local maximum in $y$) is not met. This is in clear contrast to minimization problems, where all stable stationary points of the gradient dynamics are local minima. The stability of these undesired stationary points is therefore an additional difficulty that one has to consider when escaping from such saddles. While a standard trick for escaping saddles in minimization problems consists of adding a small perturbation to the gradient, we will demonstrate that this does not guarantee avoiding undesired stationary points.
Throughout the paper, we will refer to a desired local saddle point as a point that is a local minimum in $x$ and a local maximum in $y$. This characterization implies that the Hessian matrix at $(x^*, y^*)$ does not have a negative curvature direction in $x$ (which would correspond to an eigenvector of $\nabla^2_{xx} f(x^*, y^*)$ with a negative associated eigenvalue) nor a positive curvature direction in $y$ (which would correspond to an eigenvector of $\nabla^2_{yy} f(x^*, y^*)$ with a positive associated eigenvalue). In that regard, curvature information can be used to certify whether the desired min-max structure is met.

In this work, we propose the first saddle point optimizer that exploits curvature to guide the gradient trajectory towards the desired saddle points that respect the min-max structure. Since our approach only makes use of the eigenvectors corresponding to the maximum and minimum eigenvalues (rather than the whole eigenspace), we will refer to it as extreme curvature exploitation. We will prove that this type of curvature exploitation avoids convergence to undesired saddles, albeit without guaranteeing convergence on a general non-convex-concave saddle point problem. Our contribution is linked to the recent research area of stability analysis for gradient-based optimization in general saddle point problems. Nagarajan et al. [27] have shown that the gradient method is stable at locally optimal saddles. Here, we complete the picture by showing that this method is unfavourably stable at some points that are not locally optimal. Our empirical results also confirm the advantage of curvature exploitation in saddle point optimization.

2 Related Work
Asymptotic Convergence
In the context of optimizing a Lagrangian, the pioneering works of [20, 3] popularized the use of the primal-dual dynamics to arrive at the saddle points of the objective. The work of [3] analyzed the stability of this method in continuous time, proving global stability results under strict convex-concave assumptions. This result was extended in [39] to a discrete-time version of the subgradient method with a constant step size rule, proving that the iterates converge to a neighborhood of a saddle point. Results for a decreasing step size were provided in [14, 25], while [32] analyzed an adaptive step size rule with averaged parameters. The work of [7] has shown that the conditions on the objective can be relaxed, proving that asymptotic stability to the set of saddle points is guaranteed if either the convexity or concavity property is strict, and that convergence is pointwise. They also proved that the strictness assumption can be dropped under other linearity assumptions or for strongly jointly quasiconvex-quasiconcave saddle functions.
However, for problems where the function considered is not strictly convex-concave, convergence to a saddle point is not guaranteed, with the gradient dynamics leading instead to oscillatory solutions [16]. These oscillations can be addressed by averaging the iterates [32] or by using the extragradient method (a perturbed version of the gradient method) [19, 13].
There are also instances of saddle point problems that do not satisfy the various conditions required for convergence. A notable example is generative adversarial networks (GANs), for which the work of [27] proved local asymptotic stability under certain suitable conditions on the representational power of the two players (called the discriminator and the generator). Despite these recent advances, the convergence properties of GANs are still not well understood.
Non-asymptotic Convergence
An explicit convergence rate for the subgradient method with a constant step size was proved in [30] for reaching an approximate saddle point, as opposed to asymptotically exact solutions. Assuming the function is convex-concave, they proved a sublinear rate of convergence. Rates of convergence have also been derived for the extragradient method [19] as well as for mirror descent [31].
In the context of GANs, [34] showed that a single-step gradient method converges to a saddle point in a neighborhood in which the function is strongly convex-concave. The work of [24] studied the theory of non-asymptotic convergence to a local Nash equilibrium. They prove that, assuming local strong convexity-concavity, simultaneous gradient descent achieves an exponential rate of convergence near a stable local Nash equilibrium. They also extended this result to other discrete-time saddle point dynamics such as optimistic mirror descent and predictive methods.
Negative Curvature Exploitation
The presence of negative curvature in the objective function indicates the existence of a potential descent direction, which is commonly exploited in order to escape saddle points and reach a local minimizer. Among these approaches are trust-region methods that guarantee convergence to a second-order stationary point [8, 33, 6]. While a naïve implementation of these methods would require the computation and inversion of the Hessian of the objective, this can be avoided by replacing the computation of the Hessian with Hessian-vector products, which can be computed as efficiently as gradient evaluations [35]. This is applied, e.g., using matrix-free Lanczos iterations [9] or online variants such as Oja's algorithm [1]. Subsampling the Hessian can furthermore reduce the dependence on the problem dimension by using various sampling schemes [18, 40]. Finally, [2, 41] showed that first-order information can act as a noisy power method, allowing one to find a negative curvature direction.

In contrast to these classical results that "blindly" try to escape any type of saddle point, our aim is to exploit curvature information to reach a specific type of stationary point that satisfies the min-max condition required at the optimum of the objective function.
3 Preliminaries
Definition: Locally Optimal Saddles
Let us define a neighbourhood around the point $z^* = (x^*, y^*)$ as

(4) $K_\gamma = \{ (x, y) \mid \|x - x^*\| \le \gamma, \ \|y - y^*\| \le \gamma \}$,

with a sufficiently small $\gamma > 0$. Throughout the paper, we follow a common approach for this type of problem, see e.g. [26, 28], and relax the condition of Eq. (2) to hold only in a local neighbourhood.
Definition 1 (Locally Optimal Saddle Point).
A point $z^* = (x^*, y^*)$ is a locally optimal saddle point of problem (1) if

(5) $f(x^*, y) \le f(x^*, y^*) \le f(x, y^*)$

holds for all $(x, y) \in K_\gamma$.
Assumptions
For the sake of further analysis, we require the function $f$ to be sufficiently smooth, and its second-order derivatives with respect to the parameters $x$ and $y$ to be non-degenerate at the optimum $z^* = (x^*, y^*)$.
Assumption 2 (Smoothness).
We assume that $f$ is a $C^2$ function, and that its gradient and Hessian are Lipschitz with respect to the parameters $x$ and $y$, i.e. we assume that the following inequalities hold:

(6) $\|\nabla_x f(x, y) - \nabla_x f(x', y)\| \le L_x \|x - x'\|$
(7) $\|\nabla_y f(x, y) - \nabla_y f(x, y')\| \le L_y \|y - y'\|$
(8) $\|\nabla^2_{xx} f(x, y) - \nabla^2_{xx} f(x', y)\| \le \rho_x \|x - x'\|$
(9) $\|\nabla^2_{yy} f(x, y) - \nabla^2_{yy} f(x, y')\| \le \rho_y \|y - y'\|$
(10) $\|\nabla^2_{xx} f(x, y) - \nabla^2_{xx} f(x, y')\| \le \rho_{xy} \|y - y'\|$
(11) $\|\nabla^2_{yy} f(x, y) - \nabla^2_{yy} f(x', y)\| \le \rho_{yx} \|x - x'\|$

Moreover, we assume bounded gradients, i.e.

(12) $\|\nabla_x f(x, y)\| \le \ell_x, \qquad \|\nabla_y f(x, y)\| \le \ell_y$.
Assumption 3 (Non-degenerate Hessian at $z^*$).
We assume that the matrices $\nabla^2_{xx} f(z^*)$ and $\nabla^2_{yy} f(z^*)$ are non-degenerate for all locally optimal saddle points $z^* = (x^*, y^*)$ as defined in Def. 1.
With the use of Assumption 3, we are able to establish necessary and sufficient conditions on $z^* = (x^*, y^*)$ to be a locally optimal saddle point.
Lemma 4.
Suppose that $f$ satisfies Assumption 3; then, $z^* = (x^*, y^*)$ is a locally optimal saddle point on $K_\gamma$ if and only if the gradient with respect to $(x, y)$ is zero, i.e.

(13) $\nabla_x f(x^*, y^*) = 0, \qquad \nabla_y f(x^*, y^*) = 0$,

and the second derivative at $(x^*, y^*)$ is positive definite in $x$ and negative definite in $y$ (in the game theory literature, such a point is commonly referred to as a local Nash equilibrium, see e.g. [24]), i.e. there exist $\mu_x, \mu_y > 0$ such that

(14) $\nabla^2_{xx} f(x^*, y^*) \succeq \mu_x I, \qquad \nabla^2_{yy} f(x^*, y^*) \preceq -\mu_y I$.
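These conditions can be checked mechanically from the gradient and the Hessian blocks. The following sketch (the function names and the toy objective are ours, chosen for illustration) tests zero gradient, positive definiteness in $x$, and negative definiteness in $y$:

```python
import numpy as np

def is_locally_optimal_saddle(grad, hess_xx, hess_yy, x, y, tol=1e-8):
    """Check the conditions of Lemma 4 at (x, y): zero gradient,
    positive definite hess_xx, negative definite hess_yy."""
    if np.linalg.norm(grad(x, y)) > tol:
        return False                                   # not a stationary point
    if np.linalg.eigvalsh(hess_xx(x, y)).min() <= tol:
        return False                                   # not a local minimum in x
    if np.linalg.eigvalsh(hess_yy(x, y)).max() >= -tol:
        return False                                   # not a local maximum in y
    return True

# f(x, y) = ||x||^2 - ||y||^2 has a (globally) optimal saddle at the origin.
grad = lambda x, y: np.concatenate([2 * x, -2 * y])
h_xx = lambda x, y: 2 * np.eye(len(x))
h_yy = lambda x, y: -2 * np.eye(len(y))
ok = is_locally_optimal_saddle(grad, h_xx, h_yy, np.zeros(2), np.zeros(2))
```

A non-stationary point, or a point where either Hessian block has the wrong sign, fails the check.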
4 Undesired Stability
Asymptotic Scenarios
There are three different asymptotic scenarios for the gradient iterations in Eq. (3): (i) divergence (the iterates grow without bound), (ii) being trapped in a loop (the iterates oscillate without converging), and (iii) convergence to a stationary point of the gradient updates. To the best of our knowledge, there is no convergence guarantee for general saddle point optimization. Typical convergence guarantees require convexity-concavity or somewhat relaxed conditions such as quasiconvexity-quasiconcavity of $f$ [7]. This paper focuses on the third case and investigates the theoretical guarantees for a convergent sequence. We will show that gradient-based optimization can converge to some undesired stationary points and propose an optimizer that uses extreme curvature information to alleviate this problem. We specifically highlight that we do not provide any convergence guarantee. Rather, we investigate whether a convergent sequence is guaranteed to yield a valid solution to the local saddle point problem, i.e., whether it always converges to a locally optimal saddle point as defined in Def. 1.
Local Stability
A stationary point of the gradient iterations can be either stable or unstable. The notion of stability characterizes the behavior of the gradient iterations in a local region around the stationary point. In the neighborhood of a stable stationary point, successive iterations of the method are not able to escape the region. Conversely, we consider a stationary point to be unstable if it is not stable [17]. A stationary point $z^*$ (for which $\nabla f(z^*) = 0$ holds) is a locally stable point of the gradient iterations in Eq. (3) if the Jacobian of its dynamics has only eigenvalues within the unit disk, i.e.

(15) $|\lambda_i| < 1$ for every eigenvalue $\lambda_i$ of the Jacobian of the update map of Eq. (3) at $z^*$.
Definition 5 (Stable Stationary Point of Gradient Dynamics).
A point $z^* = (x^*, y^*)$ is a stable stationary point of the gradient dynamics in Eq. (3) (for an arbitrarily small step size $\eta$) if $\nabla f(z^*) = 0$ and if the matrix

(16) $J(z^*) = \begin{pmatrix} -\nabla^2_{xx} f(z^*) & -\nabla^2_{xy} f(z^*) \\ \nabla^2_{yx} f(z^*) & \nabla^2_{yy} f(z^*) \end{pmatrix}$

only has eigenvalues with negative real part.
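Definition 5 translates directly into a numerical test; the sketch below (our own helper names) assembles the matrix of Eq. (16) from the Hessian blocks and checks the sign of the real parts of its eigenvalues:

```python
import numpy as np

def gradient_dynamics_jacobian(h_xx, h_xy, h_yy):
    """Jacobian of the vector field (-grad_x f, grad_y f) at a stationary
    point, i.e. the matrix of Eq. (16). Uses h_yx = h_xy^T."""
    top = np.hstack([-h_xx, -h_xy])
    bottom = np.hstack([h_xy.T, h_yy])
    return np.vstack([top, bottom])

def is_stable_stationary(h_xx, h_xy, h_yy):
    """Definition 5: stable iff every eigenvalue has negative real part."""
    eigs = np.linalg.eigvals(gradient_dynamics_jacobian(h_xx, h_xy, h_yy))
    return bool((eigs.real < 0).all())

# A locally optimal saddle (H_xx positive definite, H_yy negative definite)
# is always a stable stationary point of these dynamics:
stable = is_stable_stationary(np.array([[2.0]]), np.array([[0.0]]),
                              np.array([[-2.0]]))
```

Reversing the definiteness of the two blocks yields an unstable point, as expected.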
Random Initialization
In the following, we will use the notion of stability to analyze the asymptotic behavior of the gradient method. We start with a lemma extending known results for general minimization problems, which prove that gradient descent with random initialization almost surely converges to a stable stationary point [22].

Lemma 6.
Suppose that Assumption 2 holds. Then, the gradient iterations of Eq. (3) with random initialization almost surely avoid the unstable stationary points of the dynamics; hence, if the sequence of iterates converges, its limit is a stable stationary point.
Undesired Stable Stationary Point
If all stable stationary points of the gradient dynamics were locally optimal saddle points, then the result of Lemma 6 would guarantee almost sure convergence to a solution of the saddle point problem in Eq. (1). Previous work [26, 27] has shown that every locally optimal saddle point is a stable stationary point of the gradient dynamics. While for minimization problems the set of stable stationary points is the same as the set of local minima, this might not be the case for the problem we consider here. Indeed, the gradient dynamics might introduce additional stable points that are not locally optimal saddle points. We illustrate this claim in the next example.
Example
Consider the following two-dimensional saddle point problem (to guarantee smoothness, one can restrict the domain of $f$ to a bounded set):

(17) $\min_x \max_y f(x, y), \qquad f(x, y) = 2x^2 + y^2 + 4xy + \tfrac{4}{3}y^3 - \tfrac{1}{4}y^4$,

with $(x, y) \in \mathbb{R}^2$. The critical points of the function, i.e. points for which $\nabla f(x, y) = 0$, are

(18) $z_1 = (0, 0), \qquad z_2 = (-2 - \sqrt{2},\, 2 + \sqrt{2}), \qquad z_3 = (-2 + \sqrt{2},\, 2 - \sqrt{2})$.

Evaluating the Hessians at the three critical points gives rise to the following three matrices:

(19) $\nabla^2 f(z_1) = \begin{pmatrix} 4 & 4 \\ 4 & 2 \end{pmatrix}, \quad \nabla^2 f(z_2) = \begin{pmatrix} 4 & 4 \\ 4 & -4\sqrt{2} \end{pmatrix}, \quad \nabla^2 f(z_3) = \begin{pmatrix} 4 & 4 \\ 4 & 4\sqrt{2} \end{pmatrix}$.

We see that only $z_2$ is a locally optimal saddle point, namely that $\nabla^2_{xx} f(z_2) > 0$ and $\nabla^2_{yy} f(z_2) < 0$, whereas the two other points are both a local minimum in the parameter $y$, rather than a maximum. However, Figure 0(a) illustrates gradient steps converging to the undesired stationary point $z_1 = (0, 0)$ because it is a locally stable point of the dynamics (this can be easily shown by observing that the real parts of the eigenvalues of the matrix in Eq. (16), evaluated at $z_1$, are all negative). Hence, even small perturbations of the gradients in each step cannot avoid convergence to this point (see Figure 0(b)).
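The claims of this example can be verified numerically. Assuming the cubic-quartic objective $f(x, y) = 2x^2 + y^2 + 4xy + \frac{4}{3}y^3 - \frac{1}{4}y^4$, the following sketch checks stationarity of all three critical points, the sign of the curvature in $y$, and the stability of the gradient dynamics at the undesired point:

```python
import numpy as np

# f(x, y) = 2x^2 + y^2 + 4xy + (4/3)y^3 - (1/4)y^4
grad = lambda x, y: np.array([4 * x + 4 * y, 2 * y + 4 * x + 4 * y**2 - y**3])
curv_yy = lambda y: 2 + 8 * y - 3 * y**2   # d^2 f / dy^2 (d^2 f / dx^2 = 4 everywhere)

s = np.sqrt(2.0)
z1, z2, z3 = (0.0, 0.0), (-2 - s, 2 + s), (-2 + s, 2 - s)
for x, y in (z1, z2, z3):
    assert np.allclose(grad(x, y), 0.0)    # all three are critical points

# Only z2 has negative curvature in y, i.e. only z2 is a locally optimal saddle:
signs = [np.sign(curv_yy(y)) for _, y in (z1, z2, z3)]

# Stability of the gradient dynamics at the undesired point z1 = (0, 0):
# Jacobian of (-grad_x f, grad_y f) at z1, cf. the matrix of Eq. (16).
J = np.array([[-4.0, -4.0], [4.0, 2.0]])
stable_at_z1 = bool((np.linalg.eigvals(J).real < 0).all())
```

The eigenvalues of $J$ at $z_1$ are $-1 \pm i\sqrt{7}$, so their real parts are negative and $z_1$ is indeed a stable point of the dynamics despite not being locally optimal.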
5 Extreme Curvature Exploitation
The previous example has shown that gradient iterations on a saddle point problem can introduce undesired stable points. In this section, we propose a strategy to escape from these points. Our approach is based on exploiting curvature information, as in [9].
Extreme Curvature Direction
Let $\lambda_x$ be the minimum eigenvalue of $\nabla^2_{xx} f(z)$ with its associated unit-norm eigenvector $v_x$, and $\lambda_y$ be the maximum eigenvalue of $\nabla^2_{yy} f(z)$ with its associated unit-norm eigenvector $v_y$. Then, we define

(20) $v_z^{(x)} = \frac{\min(\lambda_x, 0)}{2\rho_x} \,\mathrm{sgn}\!\left(v_x^\top \nabla_x f(z)\right) v_x$
(21) $v_z^{(y)} = \frac{\max(\lambda_y, 0)}{2\rho_y} \,\mathrm{sgn}\!\left(v_y^\top \nabla_y f(z)\right) v_y$,

where $\mathrm{sgn}$ is the sign function. Using the above vectors, we define $v_z = (v_z^{(x)}, v_z^{(y)})$ as the extreme curvature direction at $z$.
Algorithm
Using the extreme curvature direction, we modify the gradient steps as follows:
(22) $x_{t+1} = x_t + v_{z_t}^{(x)} - \eta \nabla_x f(z_t), \qquad y_{t+1} = y_t + v_{z_t}^{(y)} + \eta \nabla_y f(z_t)$

This new update step is constructed by adding the extreme curvature direction to the gradient method of Eq. (3). From now on, we will refer to this modified update as the Cesp (curvature exploitation for the saddle point problem) method. Note that the algorithm reduces to gradient-based optimization in regions where $\nabla^2_{xx} f$ has only positive eigenvalues and $\nabla^2_{yy} f$ has only negative eigenvalues, as the extreme curvature vector is zero there. This includes the region around any locally optimal saddle point.
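A minimal sketch of one Cesp step follows; the helper names are ours, dense eigendecompositions stand in for the Hessian-free extraction discussed later, and the tie-breaking choice $\mathrm{sgn}(0) = +1$ is an assumption of this sketch:

```python
import numpy as np

def extreme_curvature_direction(h_xx, h_yy, gx, gy, rho_x, rho_y):
    """Eqs. (20)-(21): scaled eigenvectors of the most negative curvature in x
    and the most positive curvature in y; zero when the min-max structure holds."""
    lx, Vx = np.linalg.eigh(h_xx)            # eigenvalues in ascending order
    ly, Vy = np.linalg.eigh(h_yy)
    vx, vy = Vx[:, 0], Vy[:, -1]             # min-curvature / max-curvature vectors
    sx = 1.0 if vx @ gx >= 0 else -1.0       # sgn with sgn(0) = +1 (our choice)
    sy = 1.0 if vy @ gy >= 0 else -1.0
    dx = (min(lx[0], 0.0) / (2 * rho_x)) * sx * vx
    dy = (max(ly[-1], 0.0) / (2 * rho_y)) * sy * vy
    return dx, dy

def cesp_step(x, y, gx, gy, h_xx, h_yy, eta, rho_x=1.0, rho_y=1.0):
    """One update of Eq. (22): the gradient step of Eq. (3) plus the
    extreme curvature direction."""
    dx, dy = extreme_curvature_direction(h_xx, h_yy, gx, gy, rho_x, rho_y)
    return x + dx - eta * gx, y + dy + eta * gy

# Where H_xx > 0 and H_yy < 0, the curvature term vanishes (pure gradient step):
dx0, dy0 = extreme_curvature_direction(2 * np.eye(2), -2 * np.eye(2),
                                       np.ones(2), np.ones(2), 1.0, 1.0)
# Negative curvature in x produces a nonzero escape component:
dx1, dy1 = extreme_curvature_direction(np.array([[-2.0]]), np.array([[-1.0]]),
                                       np.zeros(1), np.zeros(1), 1.0, 1.0)
```

The two sample calls illustrate the key property stated above: the update coincides with Eq. (3) around locally optimal saddles and adds an escape component only where the min-max structure is violated.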
Stability
Extreme curvature exploitation has already been used for escaping from unstable stationary points (i.e. saddle points) of gradient descent in minimization problems [9]. In saddle point problems, curvature exploitation is advantageous not only for escaping unstable stationary points but also for escaping undesired stable stationary points of the gradient iterates. The upcoming two lemmas prove that the set of stable stationary points of the Cesp dynamics and the set of locally optimal saddle points are the same; therefore, the optimizer can only converge to a solution of the local saddle point problem. The issue of gradient-based optimization, as well as the theoretical guarantees of the Cesp method, are visualized in the Venn diagram in Figure 2: while for minimization problems the set of stable points of gradient descent equals the set of local minima, for saddle point problems gradient-based optimization introduces additional stable points outside of the set of locally optimal solutions. However, by exploiting extreme curvature with our proposed Cesp method, all points outside of the set of locally optimal saddles become non-stationary. Hence, every convergent sequence of the Cesp method yields a solution to the local saddle point problem.
Lemma 7.
A point $z = (x, y)$ is a stationary point of the iterates in Eq. (22) if and only if $z$ is a locally optimal saddle point.
We can conclude from the result of Lemma 7 that every stationary point of the Cesp dynamics is a locally optimal saddle point. The next Lemma establishes the stability of these points.
Escaping From Undesired Saddles
Extreme curvature exploitation allows us to escape from undesired saddles. In the next lemma, we show that the optimization trajectory of Cesp stays away from all undesired stationary points of the gradient dynamics.
Lemma 9.
Suppose that $\tilde{z} = (\tilde{x}, \tilde{y})$ is an undesired stationary point of the gradient dynamics, namely

(25) $\nabla f(\tilde{z}) = 0 \quad \text{and (w.l.o.g.)} \quad \lambda_{\min}\!\left(\nabla^2_{xx} f(\tilde{z})\right) < 0$.

Consider the iterates of Eq. (22) starting from $z_0$ in a $\gamma$-neighbourhood of $\tilde{z}$. After one step, the iterates escape the neighbourhood, i.e.

(26) $\|z_1 - \tilde{z}\| > \gamma$,

for a sufficiently small $\gamma$.
Implementation with Hessian-vector products
Since storing and computing the Hessian in high dimensions is very costly, we need a way to efficiently extract the extreme curvature direction. The most common approach for obtaining the eigenvector corresponding to the minimum eigenvalue of $\nabla^2_{xx} f(z)$ (and the eigenvalue itself) is to run power iterations on the shifted matrix $\beta I - \nabla^2_{xx} f(z)$ as

(27) $v_{t+1} = \frac{\left(\beta I - \nabla^2_{xx} f(z)\right) v_t}{\left\|\left(\beta I - \nabla^2_{xx} f(z)\right) v_t\right\|}$,

where $v_0$ is a random vector and the parameter $\beta > 0$ is chosen such that $\beta I - \nabla^2_{xx} f(z) \succeq 0$. Since this method only requires implicit Hessian computation through a Hessian-vector product, it can be implemented as efficiently as gradient evaluations [35]. The results of [21] provide a bound on the number of iterations required to extract the extreme curvature direction to a given accuracy with high probability (cf. [22]).

Comparison to second-order optimization
We would like to draw the reader's attention to the fact that the Cesp method only uses extreme curvature, which makes it conceptually different from second-order Newton-type optimization. Although there is a rich literature on second-order optimization for variational inequalities and convex-concave saddle point problems, to the best of our knowledge, there is neither theoretical nor practical evidence for the success of these methods on general smooth saddle point problems.
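To make the Hessian-free extraction described above concrete, here is a minimal sketch of power iterations on the shifted matrix (the function and parameter names are ours, and an explicit toy matrix stands in for the implicit Hessian-vector product):

```python
import numpy as np

def min_eig_power_iteration(hvp, dim, beta, iters=300, seed=0):
    """Approximate the minimum eigenvalue/eigenvector of a Hessian H via power
    iterations on the shifted matrix (beta*I - H), cf. Eq. (27). Only the
    Hessian-vector product hvp(v) = H @ v is needed; beta must be large enough
    that beta*I - H is positive semi-definite (e.g. a gradient Lipschitz bound)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = beta * v - hvp(v)          # one Hessian-vector product per step
        v /= np.linalg.norm(v)         # normalize after every iteration
    return v @ hvp(v), v               # Rayleigh quotient ~ lambda_min(H)

# Toy "Hessian" with known minimum eigenvalue -2:
H = np.diag([3.0, 1.0, -2.0])
lam_min, v = min_eig_power_iteration(lambda u: H @ u, dim=3, beta=4.0)
```

In practice `hvp` would be implemented via automatic differentiation, so the cost per iteration is comparable to a gradient evaluation, as noted above.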
6 Curvature Exploitation for LinearTransformed Gradient Steps
LinearTransformed Gradient Optimization
Applying a linear transformation to the gradient updates is commonly used to accelerate optimization for various types of problems. The resulting updates can be written in the general form

(28) $z_{t+1} = z_t + \eta\, A(z_t) \begin{pmatrix} -\nabla_x f(z_t) \\ \nabla_y f(z_t) \end{pmatrix}$,

where $A(z_t)$ is a symmetric, block-diagonal matrix. Different optimization methods use a different linear transformation $A$. Table 1 in Section B in the appendix illustrates the choice of $A$ for different optimizers. Adagrad [11], one of the most popular optimization methods in machine learning, belongs to this category.
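As an illustration of this class, the sketch below implements one update of the general form above with an Adagrad-style diagonal preconditioner (the function name, accumulator convention, and default constants are our own choices for this sketch):

```python
import numpy as np

def adagrad_saddle_step(x, y, gx, gy, sx, sy, eta=0.1, eps=1e-8):
    """One linear-transformed update: descent in x, ascent in y, with the
    diagonal preconditioner A = diag(1 / (sqrt(accumulated squared grads) + eps)).
    sx, sy are the running accumulators of squared gradients."""
    sx, sy = sx + gx**2, sy + gy**2
    x = x - eta * gx / (np.sqrt(sx) + eps)
    y = y + eta * gy / (np.sqrt(sy) + eps)
    return x, y, sx, sy

# On f(x, y) = x^2 - y^2 the preconditioned dynamics still contract toward 0:
x, y = np.array([1.0]), np.array([1.0])
sx, sy = np.zeros(1), np.zeros(1)
for _ in range(500):
    x, y, sx, sy = adagrad_saddle_step(x, y, 2 * x, -2 * y, sx, sy)
```

The preconditioner here is diagonal and positive definite, which is exactly the property the transformed Cesp variant below relies on.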
Extreme Curvature Exploitation
We can adapt Cesp to the linear-transformed variant:

(29) $z_{t+1} = z_t + v_{z_t} + \eta\, A(z_t) \begin{pmatrix} -\nabla_x f(z_t) \\ \nabla_y f(z_t) \end{pmatrix}$,

where we choose the linear transformation matrix $A(z_t)$ to be positive definite. This variant of Cesp is also able to filter out the undesired stable stationary points of the gradient method for the saddle point problem. The following lemma proves that it has the same properties as the non-transformed version.
Lemma 10.
Let $A(z)$ be positive definite. Then, $z = (x, y)$ is a stationary point of the iterates in Eq. (29) if and only if $z$ is a locally optimal saddle point.
A direct implication of Lemma 10 is that we can also use curvature exploitation for Adagrad. Later, we will experimentally show the advantage of using curvature exploitation for this method.
7 Experiments
7.1 Escaping From Undesired Stationary Points of the Toy Example
Previously, we saw that for the two-dimensional saddle point problem on the function of Eq. (17), gradient iterates may converge to an undesired stationary point that is not locally optimal. As shown in Figure 3, Cesp solves this issue. In this example, simultaneous gradient iterates converge to the undesired stationary point for many different initializations, whereas our method always converges to the locally optimal saddle point. A plot of the basins of attraction of the two optimizers on this example is presented in Figure 6 in the appendix.
7.2 Robust Optimization
Although robust optimization [4] is often formulated as a convex-concave saddle point problem, we consider robust optimization on neural networks, which do not fulfill this assumption. The optimization problem that we target here is an application of robust optimization to empirical risk minimization [29], namely solving

(30) $\min_\theta \max_{p \in \Delta_n} \sum_{i=1}^n p_i\, \ell(f_\theta(x_i), y_i) \quad \text{s.t.} \quad D\!\left(p \,\|\, \tfrac{1}{n}\mathbf{1}\right) \le \rho$,

where $\ell$ denotes the cost function to minimize, $\{(x_i, y_i)\}_{i=1}^n$ the data, and $D$ a divergence measure between the perturbed data distribution $p$ and the empirical data distribution $\frac{1}{n}\mathbf{1}$.

We use this framework on the Wisconsin breast cancer data set, which is a binary classification task with 30 attributes and 569 samples, and choose a multilayer perceptron with a non-convex sigmoid activation as the classifier. Due to the relatively small sample size, we can compute the gradient exactly in this case. We choose the objective $f(\theta, p)$ in this setting to be

(31) $f(\theta, p) = \sum_{i=1}^n p_i\, \ell(f_\theta(x_i), y_i) - \beta\, D\!\left(p \,\|\, \tfrac{1}{n}\mathbf{1}\right)$,

where we add a regularization term with coefficient $\beta > 0$ to enforce the divergence constraint. Figure 4 compares the gradient method (GD) and our Cesp optimizer on this problem in terms of the minimum eigenvalue of $\nabla^2_{\theta\theta} f$. Note that $f$ is concave with respect to $p$, and therefore its Hessian with respect to $p$ is always negative. The results indicate the anticipated behavior: Cesp more reliably drives a convergent sequence towards a solution where the minimum eigenvalue of $\nabla^2_{\theta\theta} f$ is positive.
7.3 Generative Adversarial Networks
This experiment evaluates the performance of the Cesp method for training a Generative Adversarial Network (GAN), which amounts to solving the saddle point problem

(32) $\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D_{\theta_D}(x)\right] + \mathbb{E}_{z \sim p_z}\!\left[\log\!\left(1 - D_{\theta_D}(G_{\theta_G}(z))\right)\right]$,

where the generator $G$ and the discriminator $D$ are represented by neural networks parameterized with the variables $\theta_G$ and $\theta_D$, respectively. We use the MNIST data set and a simple GAN architecture with one hidden layer of 100 units. More details about the network architecture and parameters are summarized in Table 2 in the appendix.
We investigate the advantage of curvature exploitation for Adagrad, which is a member of the class of linear-transformed gradient methods often used for saddle point problems. Moreover, we make use of power iterations, as described in Section 5, to efficiently approximate the extreme curvature vector. Note that since we are using mini-batches in this experiment, we do not have access to the exact gradient information but rely on an approximation here as well.
As before, we evaluate the efficacy of the negative curvature step in terms of the spectrum at an (approximately) convergent solution. We compare Cesp to the vanilla Adagrad optimizer. Since we are interested in a solution that gives rise to a locally optimal saddle point, we track (an approximation of) the smallest eigenvalue of the generator Hessian $\nabla^2_{\theta_G \theta_G} f$ and the largest eigenvalue of the discriminator Hessian $\nabla^2_{\theta_D \theta_D} f$ through the optimization. Using these estimates, we can evaluate whether a method has converged to a locally optimal saddle point. The results are shown in Figure 5. The decrease in the squared norm of the gradients indicates that both methods converge to a solution. Moreover, both fulfill the condition for a locally optimal saddle point in the discriminator parameters, i.e. the maximum eigenvalue of $\nabla^2_{\theta_D \theta_D} f$ is negative. However, the graph of the minimum eigenvalue of $\nabla^2_{\theta_G \theta_G} f$ shows that Cesp converges faster, and with less frequent and severe spikes, to a solution where this minimum eigenvalue is zero. Hence, the negative curvature step seems able to drive the optimization procedure to regions that yield points closer to a locally optimal saddle point.

Even though this empirical result highlights the benefits of curvature exploitation, we observe zero eigenvalues of the generator Hessian at convergent solutions, which violates the non-degeneracy condition required for our analysis. This observation is in accordance with recent empirical evidence [36] showing that the Hessian is degenerate for common deep learning architectures. The phenomenon itself, as well as approaches to address it, are left as future work. One potential direction would be to investigate whether higher-order derivatives could be used at points where the Hessian is degenerate.
8 Conclusion
We focused our study on reaching a solution to the local saddle point problem. First, we showed that gradient methods have stable stationary points that are not locally optimal, a problem that arises exclusively in saddle point optimization. Second, we proposed a novel approach that exploits extreme curvature information to avoid these undesired stationary points. We believe this work highlights the benefits of using curvature information for saddle point problems and might open the door to other novel algorithms with stronger global convergence guarantees.
References
 [1] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.
 [2] Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. arXiv preprint arXiv:1711.06673, 2017.
 [3] Kenneth Joseph Arrow, Leonid Hurwicz, Hirofumi Uzawa, and Hollis Burnley Chenery. Studies in linear and nonlinear programming. 1958.
 [4] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization. Princeton University Press, 2009.
 [5] Michele Benzi, Gene H. Golub, and Jörg Liesen. Numerical solution of saddle point problems. ACTA NUMERICA, 14:1–137, 2005.
 [6] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.
 [7] Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortes. Saddlepoint dynamics: conditions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization, 55(1):486–511, 2017.
 [8] Andrew R Conn, Nicholas IM Gould, and Philippe L Toint. Trust region methods. SIAM, 2000.
 [9] Frank E Curtis and Daniel P Robinson. Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412, 2017.
 [10] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in highdimensional nonconvex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2933–2941. Curran Associates, Inc., 2014.
 [11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Technical Report UCB/EECS-2010-24, EECS Department, University of California, Berkeley, March 2010.
 [12] Rui Gao and Anton J Kleywegt. Distributionally robust stochastic optimization with wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
 [13] Gauthier Gidel, Hugo Berard, Pascal Vincent, and Simon LacosteJulien. A variational inequality perspective on generative adversarial nets. arXiv preprint arXiv:1802.10551, 2018.
 [14] EG Golshtein. Generalized gradient method for finding saddlepoints. Matekon, 10(3):36–52, 1974.
 [15] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
 [16] Thomas Holding and Ioannis Lestas. On the convergence to saddle points of concaveconvex functions, the gradient method and emergence of oscillations. In Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pages 1143–1148. IEEE, 2014.
 [17] H.K. Khalil. Nonlinear Systems. Pearson Education. Prentice Hall, 2002.
 [18] Jonas Moritz Kohler and Aurelien Lucchi. Subsampled cubic regularization for nonconvex optimization. In International Conference on Machine Learning, 2017.
 [19] GM Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
 [20] T Kose. Solutions of saddle value problems by differential equations. Econometrica, Journal of the Econometric Society, pages 59–70, 1956.
 [21] J. Kuczyński and H. Woźniakowski. Estimating the largest eigenvalue by the power and lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, 1992.
 [22] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915, 2016.
 [23] Kevin LeytonBrown and Yoav Shoham. Essentials of Game Theory: A Concise, Multidisciplinary Introduction. Morgan and Claypool Publishers, 1st edition, 2008.
 [24] Tengyuan Liang and James Stokes. Interaction matters: A note on nonasymptotic local convergence of generative adversarial networks. arXiv preprint arXiv:1802.06132, 2018.
 [25] D Maistroskii. Gradient methods for finding saddle points. Matekon, 14(1):3–22, 1977.
 [26] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. arXiv preprint arXiv:1705.10461, 2017.
 [27] Vaishnavh Nagarajan and J Zico Kolter. Gradient descent gan optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5591–5600, 2017.
 [28] Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. CoRR, abs/1706.04156, 2017.
 [29] Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 2975–2984, 2017.
 [30] Angelia Nedić and Asuman Ozdaglar. Subgradient methods for saddlepoint problems. Journal of optimization theory and applications, 142(1):205–228, 2009.
 [31] Arkadi Nemirovski. Proxmethod with rate of convergence o (1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convexconcave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
 [32] AS Nemirovskii and DB Yudin. Cezare convergence of gradient method approximation of saddle points for convexconcave functions. Doklady Akademii Nauk SSSR, 239:1056–1059, 1978.
 [33] Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
 [34] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. fgan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
 [35] Barak A. Pearlmutter. Fast exact multiplication by the hessian. Neural Computation, 6:147–160, 1994.
 [36] Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.

 [37] Satinder Singh, Michael Kearns, and Yishay Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 541–548. Morgan Kaufmann Publishers Inc., 2000.
 [38] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
 [39] Hirofumi Uzawa. Iterative methods for concave programming. Studies in linear and nonlinear programming, 6:154–165, 1958.
 [40] Peng Xu, Farbod Roosta-Khorasani, and Michael W. Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164, 2017.
 [41] Yi Xu and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. arXiv preprint arXiv:1711.01944, 2017.
Appendix A Theoretical Analysis
A.1 Lemma 4
Lemma 4.
Suppose that $f$ satisfies Assumption 3; then, $z^* = (x^*, y^*)$ is a locally optimal saddle point if and only if the gradient of $f$ vanishes at $z^*$, i.e.
(33) $\nabla_x f(x^*, y^*) = 0, \qquad \nabla_y f(x^*, y^*) = 0,$
and the second derivative at $z^*$ is positive definite in $x$ and negative definite in $y$, i.e., there exist $\mu_x, \mu_y > 0$ such that
(34) $\nabla_x^2 f(x^*, y^*) \succeq \mu_x \mathbf{I}, \qquad \nabla_y^2 f(x^*, y^*) \preceq -\mu_y \mathbf{I}.$
Proof.
From Definition 1 it follows that a locally optimal saddle point $z^* = (x^*, y^*)$ is a point for which the following two conditions hold in a local neighbourhood around $z^*$:
(35) $f(x^*, y) \leq f(x^*, y^*) \leq f(x, y^*).$
Hence, $x^*$ is a local minimizer of $f(\cdot, y^*)$ and $y^*$ is a local maximizer of $f(x^*, \cdot)$. We therefore, without loss of generality, prove the statement of the lemma only for the minimizer $x^*$, namely that
$x^*$ is a local minimizer of $f(\cdot, y^*)$ $\Longleftrightarrow$ $\nabla_x f(x^*, y^*) = 0$ and $\nabla_x^2 f(x^*, y^*) \succ 0$.
The proof for the maximizer $y^*$ directly follows from this.

(i) If we assume that $\nabla_x f(x^*, y^*) \neq 0$, then there exists a feasible direction $\delta$ such that $\nabla_x f(x^*, y^*)^\top \delta < 0$, and we can find a step size $\eta > 0$ such that $x^* + \eta\delta$ stays within the local neighbourhood. Using the smoothness assumptions (Assumption 2), we arrive at the following inequality
(36) $f(x^* + \eta\delta, y^*) \leq f(x^*, y^*) + \eta\,\nabla_x f(x^*, y^*)^\top \delta + \frac{\eta^2}{2}\,\delta^\top \nabla_x^2 f(x^*, y^*)\,\delta + \frac{\rho}{6}\,\eta^3\|\delta\|^3.$
Hence, it holds that:
(37) $f(x^* + \eta\delta, y^*) - f(x^*, y^*) \leq \eta\,\nabla_x f(x^*, y^*)^\top \delta + O(\eta^2).$
By choosing the gradient descent direction $\delta = -\nabla_x f(x^*, y^*)$ (with $\eta$ small enough that the first-order term dominates), we can find a step size such that $f(x^* + \eta\delta, y^*) < f(x^*, y^*)$, which contradicts that $x^*$ is a local minimizer. Hence, $\nabla_x f(x^*, y^*) = 0$ is a necessary condition for a local minimizer.

(ii) To prove the second statement, we again make use of inequality (36) coming from the smoothness assumption and the update $x^* + \eta\delta$ with $\|\delta\| = 1$. From (i) we know that $\nabla_x f(x^*, y^*) = 0$ and, therefore, we obtain:
(38) $f(x^* + \eta\delta, y^*) - f(x^*, y^*) \leq \frac{\eta^2}{2}\,\delta^\top \nabla_x^2 f(x^*, y^*)\,\delta + \frac{\rho}{6}\,\eta^3$
(39) $= \frac{\eta^2}{2}\left(\delta^\top \nabla_x^2 f(x^*, y^*)\,\delta + \frac{\rho}{3}\,\eta\right).$
If $\nabla_x^2 f(x^*, y^*)$ is not positive semidefinite, then there exists at least one eigenvector $v$ with negative curvature, i.e. $v^\top \nabla_x^2 f(x^*, y^*)\,v < 0$. This implies that for a sufficiently small $\eta$, following the curvature vector $\delta = v$ decreases the function value, i.e., $f(x^* + \eta v, y^*) < f(x^*, y^*)$. This contradicts that $x^*$ is a local minimizer, which proves the necessary condition
(40) $\nabla_x^2 f(x^*, y^*) \succeq 0;$
under the non-degeneracy of Assumption 3, the inequality is strict.
∎
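As a numerical sanity check of this characterization (a toy sketch of our own, not part of the paper; the quadratic $f(x, y) = x^2 - y^2$ is chosen purely for illustration), one can verify the zero-gradient and definiteness conditions at a candidate point:

```python
import numpy as np

# Toy objective f(x, y) = x^2 - y^2 with candidate saddle point (0, 0).
def grad(z):
    x, y = z
    return np.array([2 * x, -2 * y])  # (grad_x f, grad_y f)

def hessian_blocks(z):
    # For this quadratic the Hessian blocks are constant matrices.
    return np.array([[2.0]]), np.array([[-2.0]])  # (H_xx, H_yy)

z_star = np.array([0.0, 0.0])
Hxx, Hyy = hessian_blocks(z_star)

locally_optimal = (
    np.allclose(grad(z_star), 0.0)           # zero gradient
    and np.all(np.linalg.eigvalsh(Hxx) > 0)  # positive definite in x
    and np.all(np.linalg.eigvalsh(Hyy) < 0)  # negative definite in y
)
```

For non-quadratic objectives the same check applies with the Hessian blocks evaluated at the candidate point, e.g. via automatic differentiation.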
A.2 Lemma 6
The following Lemma 11 proves that the gradient-based mapping for the saddle point problem is a diffeomorphism, which will be needed in the proof of Lemma 6.
Lemma 11.
Suppose that Assumption 2 holds; then the gradient mapping for the saddle point problem
(41) $G(x, y) = \big(x - \eta\,\nabla_x f(x, y),\; y + \eta\,\nabla_y f(x, y)\big)$
with step size $\eta < 1/L$ is a diffeomorphism.
Proof.
The following proof closely follows the proof of Proposition 4.5 in [22].
A necessary condition for a diffeomorphism is bijectivity. Hence, we need to check that $G$ is (i) injective and (ii) surjective for $\eta < 1/L$.

(i) Consider two points $z = (x, y)$ and $\tilde{z} = (\tilde{x}, \tilde{y})$ for which
(42) $G(z) = G(\tilde{z})$
holds. Then, we have that
(43) $x - \tilde{x} = \eta\big(\nabla_x f(x, y) - \nabla_x f(\tilde{x}, \tilde{y})\big)$
(44) $y - \tilde{y} = -\eta\big(\nabla_y f(x, y) - \nabla_y f(\tilde{x}, \tilde{y})\big).$
Note that, by the smoothness of $f$,
(45) $\|\nabla_x f(x, y) - \nabla_x f(\tilde{x}, \tilde{y})\|^2 + \|\nabla_y f(x, y) - \nabla_y f(\tilde{x}, \tilde{y})\|^2$
(46) $= \|\nabla f(z) - \nabla f(\tilde{z})\|^2 \leq L^2\,\|z - \tilde{z}\|^2,$
from which it follows that
(47) $\|z - \tilde{z}\|^2 = \eta^2\,\|\nabla f(z) - \nabla f(\tilde{z})\|^2$
(48) $\leq \eta^2 L^2\,\|z - \tilde{z}\|^2.$
For $\eta < 1/L$ this means $z = \tilde{z}$, and therefore $G$ is injective.

(ii) We will show that $G$ is surjective by constructing an explicit inverse mapping for both optimization problems individually. As suggested by [22], we make use of the proximal point algorithm on the function $f$ for the parameters $x$ and $y$ individually.
For the parameter $x$, the proximal point mapping of $f$ centered at $\bar{x}$ is given by
(49) $x^{\dagger} = \arg\min_{x}\; \tfrac{1}{2}\|x - \bar{x}\|^2 - \eta f(x, y).$
Moreover, note that this objective is strongly convex in $x$ if $\eta < 1/L$:
(50) $\nabla_x^2\big(\tfrac{1}{2}\|x - \bar{x}\|^2 - \eta f(x, y)\big) = \mathbf{I} - \eta\,\nabla_x^2 f(x, y)$
(51) $\succeq (1 - \eta L)\,\mathbf{I} \succ 0.$
Hence, the function has a unique minimizer $x^{\dagger}$, characterized by the first-order condition
(52) $x^{\dagger} - \bar{x} - \eta\,\nabla_x f(x^{\dagger}, y) = 0$
(53) $\Longleftrightarrow\; \bar{x} = x^{\dagger} - \eta\,\nabla_x f(x^{\dagger}, y),$
which means that every $\bar{x}$ has a unique preimage under the gradient mapping if $\eta < 1/L$.
The same line of reasoning can be applied to the parameter $y$ with the negative proximal point mapping of $f$ centered at $\bar{y}$, i.e.
(54) $y^{\dagger} = \arg\max_{y}\; -\tfrac{1}{2}\|y - \bar{y}\|^2 - \eta f(x, y).$
Similarly as before, we can observe that this objective is strictly concave for $\eta < 1/L$ and that its unique maximizer $y^{\dagger}$ satisfies $\bar{y} = y^{\dagger} + \eta\,\nabla_y f(x, y^{\dagger})$. This lets us conclude that the mapping $G$ is surjective if $\eta < 1/L$.
Observing that, for $\eta < 1/L$, $G^{-1}$ is continuously differentiable concludes the proof that $G$ is a diffeomorphism. ∎
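The injectivity part of the argument can be checked numerically. The sketch below is our own toy setup (the quadratic objective and the constant $L$ are illustrative assumptions): for a step size below $1/L$, the gradient mapping cannot map two distinct points to the same image, since it contracts distances by a factor of at most $\eta L < 1$.

```python
import numpy as np

# Illustrative smooth objective f(x, y) = x^2 - y^2 + 3*x*y.
# Its gradient is linear with Hessian [[2, 3], [3, -2]], so the gradient
# is L-Lipschitz with L = ||Hessian||_2 = sqrt(13).
L = np.sqrt(13)
eta = 0.5 / L  # step size below 1/L

def G(z):
    x, y = z
    gx = 2 * x + 3 * y    # grad_x f
    gy = 3 * x - 2 * y    # grad_y f
    return np.array([x - eta * gx,   # descent step in x
                     y + eta * gy])  # ascent step in y

# ||G(z1) - G(z2)|| >= (1 - eta*L) ||z1 - z2||, so G(z1) = G(z2)
# forces z1 = z2; check the bound on random pairs of points.
rng = np.random.default_rng(0)
for _ in range(100):
    z1, z2 = rng.normal(size=2), rng.normal(size=2)
    assert (np.linalg.norm(G(z1) - G(z2))
            >= (1 - eta * L) * np.linalg.norm(z1 - z2) - 1e-9)
```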
Lemma 6 (Random Initialization).
A.3 Lemma 7
Lemma 7.
The point $z^* = (x^*, y^*)$ is a stationary point of the iterates in Eq. (22) if and only if $z^*$ is a locally optimal saddle point.
Proof.
The point $z^*$ is a stationary point of the iterates if and only if $z_{t+1} = z_t = z^*$. Let us consider, w.l.o.g., only the stationarity condition with respect to $x$, i.e.
(55) $-\eta\,\nabla_x f(z^*) + v_x(z^*) = 0.$
We prove that the above equation holds only if $\nabla_x f(z^*) = 0$. This can be proven by a simple contradiction; suppose that $\nabla_x f(z^*) \neq 0$, then multiplying both sides of the above equation by $\nabla_x f(z^*)^\top$ yields
(56) $\nabla_x f(z^*)^\top v_x(z^*) = \eta\,\|\nabla_x f(z^*)\|^2.$
Since the left-hand side is non-positive (by the sign choice in the extreme curvature direction) and the right-hand side is positive, the above equation leads to a contradiction. Therefore, $\nabla_x f(z^*) = 0$ and $v_x(z^*) = 0$. This means that $\nabla_x^2 f(z^*) \succeq 0$ and, by the analogous argument for $y$, $\nabla_y^2 f(z^*) \preceq 0$; therefore, according to Lemma 4, $z^*$ is a locally optimal saddle point. ∎
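For intuition, here is a toy run of our own (with a hand-picked objective, not from the paper) of the plain simultaneous gradient descent/ascent dynamics of Eq. (3), which become stationary exactly at the locally optimal saddle point of $f(x, y) = x^2 - y^2$:

```python
import numpy as np

# Simultaneous gradient descent/ascent (Eq. (3)) on f(x, y) = x^2 - y^2,
# whose unique locally optimal saddle point is (0, 0).
eta = 0.1
z = np.array([1.0, -1.0])
for _ in range(500):
    gx, gy = 2 * z[0], -2 * z[1]     # grad_x f, grad_y f
    z = np.array([z[0] - eta * gx,   # descent step in x
                  z[1] + eta * gy])  # ascent step in y

# The update vanishes iff the gradient does; here both coordinates
# contract by a factor 0.8 per step towards the saddle point (0, 0).
```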
A.4 Lemma 8
Lemma 8.
Proof.
The proof is based on a simple idea: in a neighbourhood of a locally optimal saddle point, $f$ cannot have extreme curvature, i.e., the extreme curvature directions of Eq. (20) vanish. Hence, within this neighbourhood the update of Eq. (22) reduces to the gradient update in Eq. (3), which is stable according to [27, 26].
To prove our claim that negative curvature does not exist in this neighbourhood, we make use of the smoothness assumption. Suppose that $z$ lies in this neighbourhood; then the smoothness Assumption 2 implies
(59)  
(60)  
(61)  
(62)  
(63) 
Similarly, one can show that
(64) 
Therefore, the extreme curvature direction is zero according to the definition in Eq. (20). ∎
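In practice, the extreme curvature direction can be estimated from Hessian-vector products (cf. Pearlmutter [35]) without forming the Hessian explicitly. The following sketch is our own illustration, not the paper's implementation: it uses a finite-difference Hessian-vector product and power iteration to recover the eigenvalue of largest magnitude of a toy Hessian.

```python
import numpy as np

def hvp(grad_fn, x, v, eps=1e-5):
    # Finite-difference Hessian-vector product:
    # H v ~= (grad(x + eps*v) - grad(x - eps*v)) / (2*eps)
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2 * eps)

def extreme_curvature(grad_fn, x, iters=200, seed=0):
    # Power iteration on the Hessian: converges to the eigenvector whose
    # eigenvalue has the largest absolute value.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=x.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(grad_fn, x, v)
        v = hv / np.linalg.norm(hv)
    lam = v @ hvp(grad_fn, x, v)  # Rayleigh quotient = extreme eigenvalue
    return lam, v

# Toy gradient of f(x) = x0^2 - 3*x1^2, so the Hessian is diag(2, -6)
# and the most extreme curvature is -6.
grad_fn = lambda x: np.array([2 * x[0], -6 * x[1]])
lam, v = extreme_curvature(grad_fn, np.zeros(2))
```

With automatic differentiation the finite-difference approximation can be replaced by an exact Hessian-vector product at the cost of one extra backward pass.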
A.5 Lemma 9
Lemma 9.
Suppose that $z^*$ is an undesired stationary point of the gradient dynamics, namely
(65) 
Consider the iterates of Eq. (22) starting from $z_0$ in a neighbourhood of $z^*$. After one step, the iterates escape the neighbourhood of $z^*$, i.e.
(66) 
for a sufficiently small step size.
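To see why such points are unstable, consider a toy example of our own (Lemma 9 itself concerns the modified iterates of Eq. (22); here even the plain dynamics of Eq. (3) suffice to illustrate the repulsion): for $f(x, y) = x^2 + y^2$ the origin has zero gradient but positive curvature in $y$, so a single ascent step in $y$ moves away from it.

```python
import numpy as np

# f(x, y) = x^2 + y^2: the origin is a stationary point of the dynamics,
# but grad^2_y f = 2 > 0, so it is NOT a locally optimal saddle point.
eta = 0.1
z0 = np.array([0.0, 1e-3])            # start in a small neighbourhood of (0, 0)
gx, gy = 2 * z0[0], 2 * z0[1]         # grad_x f, grad_y f
z1 = np.array([z0[0] - eta * gx,      # descent step in x
               z0[1] + eta * gy])     # ascent step in y is repelled
escaped = np.linalg.norm(z1) > np.linalg.norm(z0)
```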
Proof.
Preliminaries: Consider compact notations
(67)  
(68)  
(69) 
Characterizing the extreme curvature: The choice of the extreme curvature direction ensures that
(70) 
holds. Since $z_0$ lies in a neighbourhood of $z^*$, we can use the smoothness of $f$ to relate the negative curvature at $z_0$ to the negative curvature at $z^*$: