Two-player games, as extensions of minimization problems, find wide application in economics [echenique2003equilibrium]
and machine learning (e.g., generative adversarial networks (GANs) [goodfellow2014generative, arjovsky2017wasserstein, park2019sphere], adversarial learning [shaham2018understanding, madry2018towards, dai2018sbeed], etc.). We focus on differentiable two-player zero-sum games, which are mathematically formulated as the following min-max optimization problem:
where $x$ and $y$ are the two players and $f(x, y)$ is the payoff function.
There are two main types of two-player games, namely simultaneous games and sequential games. As the names suggest, the difference between the two lies in the order in which the players act. In a two-player simultaneous game, $x$ and $y$ are in symmetric positions: neither observes the other's action before committing to its own at each step. Analyses of simultaneous games typically assume that the payoff function is convex-concave (i.e., $f(\cdot, y)$ is convex and $f(x, \cdot)$ is concave). Sequential games, by contrast, strictly prescribe the order of the two players' actions: $x$ plays the role of a leader who aims to reduce the loss (pay), while the follower $y$ tries to maximize his gain after observing the leader's action.
Generators and discriminators in generative adversarial networks (GANs) act as the leader and the follower, respectively, so training GANs is usually equivalent to solving a sequential min-max optimization problem [goodfellow2014generative, arjovsky2017towards, arjovsky2017wasserstein, gulrajani2017improved, park2019sphere]. The optimization difficulties of GANs, especially instability and nonconvergence, have been emphasized and discussed ever since GANs were first proposed [salimans2016improved, goodfellow2016nips, arjovsky2017wasserstein]. Most previous works stabilize GAN training through regularization and specific modeling. From the modeling point of view, proper regularization terms added to the loss function numerically stabilize training [goodfellow2016nips, gulrajani2017improved, brockLRW17, miyato2018spectral, cao2018improving]. Brock et al. [brockLRW17] encourage weights to be orthonormal via an additional regularization term to mitigate the instability of GANs. With 1-Lipschitz constraints on the discriminator, Wasserstein GAN achieves much better performance than vanilla GAN in terms of generation quality and training stability [arjovsky2017wasserstein, gulrajani2017improved, wei2018improving, adler2018banach]. However, these models still suffer from the same difficulties when trained with gradient descent ascent algorithms. To escape these optimization difficulties, some recent works on generative models estimate the score of the target distribution by minimizing the Fisher divergence [song2019generative, song2020sliced]. There is no free lunch, however: although score-based models are easier to optimize, sampling high-quality, high-dimensional data from the score is hard, because the score loses information relative to the density function. It is therefore important to develop efficient algorithms for GAN training (sequential games).
Gradient descent ascent (GDA) is the extension of gradient descent from minimization problems to min-max problems. Unfortunately, GDA has been shown to suffer from undesirable convergence behavior and strong rotation around fixed points [daskalakis2018limit]. To overcome these drawbacks, several variants have been proposed. Two Time-Scale GDA [heusel2017gans] and GDA-k [goodfellow2014generative] are two variants of GDA that are widely used in training GANs. Extra gradient (EG) [korpelevich1976extragradient, gidel2018a], optimistic GDA (OGDA) [daskalakis2018training] and consensus optimization (CO) [mescheder2017numerics], extended from algorithms for minimization problems, improve the convergence of GDA. Unfortunately, they are designed for solving convex-concave problems (simultaneous games).
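The rotational failure of GDA can be reproduced on the classic bilinear game $f(x, y) = xy$. The sketch below is our own minimal illustration (not code from the cited works); it shows the GDA iterates spiraling away from the unique equilibrium at the origin:

```python
import numpy as np

# Bilinear game f(x, y) = x * y; the unique equilibrium is (0, 0).
def gda_step(x, y, lr=0.1):
    gx = y  # df/dx
    gy = x  # df/dy
    # simultaneous gradient descent on x and gradient ascent on y
    return x - lr * gx, y + lr * gy

x, y = 1.0, 1.0
radii = [np.hypot(x, y)]
for _ in range(100):
    x, y = gda_step(x, y)
    radii.append(np.hypot(x, y))

# Each step multiplies the distance to the equilibrium by sqrt(1 + lr^2),
# so GDA rotates around the origin and slowly spirals outward.
assert radii[-1] > radii[0]
```

Damping exactly this rotation is what EG and OGDA were designed for in the convex-concave (simultaneous) setting.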
Recently, building on the works of Evtushenko [evtushenko1974some, evtushenko1974iterative], Jin et al. [jin2020local] defined an equilibrium notion (the local minimax) for differentiable sequential games which is more appropriate than the Nash equilibrium. The local minimax takes the sequential structure into account and makes use of the Schur complement of the Hessian matrix rather than merely its block diagonal. Based on the definition of local minimax, FR [Wang2020On] and TGDA [fiez2020implicit] were proposed recently and converge locally to local minimax points. Furthermore, to accelerate convergence, Newton-type methods were proposed in [zhang2020newton]. However, some of these methods may not be applicable to machine learning (deep learning) tasks; we discuss this in more detail in Section 3.
In this paper, we propose a novel algorithm, HessianFR, which converges faster than FR [Wang2020On] by adding Hessian information to the follower's update in each step, but without significant additional computation. Mathematically, the Hessian information reduces the condition number of the Jacobian matrix and thus accelerates convergence. Our contributions are as follows:
A new algorithm is proposed and is theoretically guaranteed, with proper learning rates, to converge locally to local minimax points and only to such points.
We theoretically and numerically study several fast computation methods, including diagonal approximations and conjugate gradient for the Hessian inverse, as well as stochastic learning, to lower the computational cost of each update. Both the diagonal methods and conjugate gradient perform well in practice.
Finally, we apply our algorithm to training generative adversarial networks on synthetic datasets to show the superiority of HessianFR over other algorithms in terms of both iterations and wall-clock time to convergence. Furthermore, we test our algorithm in the stochastic setting on large-scale image datasets (e.g., MNIST, CIFAR-10 and CelebA). The numerical results show that the proposed HessianFR outperforms other algorithms in terms of image generation quality.
Notation. In this paper, we use $\|\cdot\|$ to denote the Euclidean norm of a vector and the corresponding spectral norm of a matrix. Concisely, we abbreviate the gradient $\nabla_y f(x, y)$ as $\nabla_y f$ and the Hessian matrix $\nabla^2_{yy} f(x, y)$ as $\nabla^2_{yy} f$. Sometimes we write the spatial and temporal location of $y$ explicitly (e.g., $y_t$) to avoid ambiguity of notation; analogous notation holds for $x$ and so on. The spatial and temporal location is omitted when it is clear from context. To highlight a Hessian matrix evaluated at a given point of interest, we denote it as $\nabla^2_{yy} f(x^*, y^*)$, etc. We denote the maximal eigenvalue, the minimal eigenvalue and the spectral radius of a matrix by $\lambda_{\max}(\cdot)$, $\lambda_{\min}(\cdot)$ and $\rho(\cdot)$, respectively.
Differentiable two-player zero-sum sequential games are mathematically formulated as solving the following min-max optimization problem:
where $x$ and $y$ are the two players and $f(x, y)$ is the payoff function. Note that the payoff function is nonconvex-nonconcave in the sequential setting. In this paper, we mainly focus on solving the min-max problem for training generative adversarial networks.
2.1 Why Minimax Optimization
Most previous works define a local Nash equilibrium for min-max problems arising in the training of generative adversarial networks. We first review the definition and some properties of the (local) Nash equilibrium. Then we demonstrate that the Nash equilibrium is an overly strict notion for training generative adversarial networks.
Definition 1 (Local Nash equilibrium)
A point $(x^*, y^*)$ is a local Nash equilibrium for the min-max problem Eq. 1 if there exists $\delta > 0$ such that
$f(x^*, y) \le f(x^*, y^*) \le f(x, y^*)$ for all $(x, y)$ satisfying $\|x - x^*\| \le \delta$ and $\|y - y^*\| \le \delta$.
Proposition 1 (Necessary conditions for local Nash equilibrium)
A local Nash equilibrium $(x^*, y^*)$ of a twice differentiable payoff function must satisfy the following conditions: (1) it is a critical point (i.e., $\nabla_x f(x^*, y^*) = 0$ and $\nabla_y f(x^*, y^*) = 0$); (2) $\nabla^2_{xx} f(x^*, y^*) \succeq 0$ and $\nabla^2_{yy} f(x^*, y^*) \preceq 0$.
Proposition 2 (Sufficient conditions for local Nash equilibrium)
For a twice differentiable min-max problem Eq. 1, if a critical point $(x^*, y^*)$ satisfies $\nabla^2_{xx} f(x^*, y^*) \succ 0$ and $\nabla^2_{yy} f(x^*, y^*) \prec 0$, then it is a (strict) local Nash equilibrium.
A local Nash equilibrium implies that the payoff function is locally convex-concave, even though the game is not simultaneous. In other words, a local Nash equilibrium for the min-max problem $\min_x \max_y f(x, y)$ is also a local Nash equilibrium for the max-min problem $\max_y \min_x f(x, y)$. The Nash equilibrium is thus overly strict as an equilibrium notion for sequential games. Moreover, it is hard to say whether a Nash equilibrium even exists for nonconvex-nonconcave min-max problems; for a suitable two-dimensional payoff function, one can easily verify by Proposition 1 that no Nash equilibrium exists. The necessary condition for a Nash equilibrium requires the Hessian blocks $\nabla^2_{xx} f$ and $\nabla^2_{yy} f$ to be positive semi-definite and negative semi-definite respectively, regardless of the cross term $\nabla^2_{xy} f$, which further implies that the variables $x$ and $y$ are uncorrelated w.r.t. $f$ at the local Nash equilibrium.
Generative adversarial networks, first proposed by Goodfellow et al. [goodfellow2014generative], have attracted much attention and are widely applied in various fields [bowman2015generating, odena2017conditional]. In training generative adversarial networks, $x$ and $y$ represent the trainable parameters of the generator and discriminator networks, respectively. In JS-GAN (vanilla GAN) [goodfellow2014generative], the discriminator parameterized by $y$ measures the JS divergence between the generated distribution and the target distribution. Wasserstein GAN [arjovsky2017wasserstein] adopts a weaker but softer metric, numerically represented by a 1-Lipschitz neural network parameterized by $y$. The discriminator in Sphere GAN [park2019sphere] first projects the data onto a unit sphere and then measures distances on that manifold. Suppose that $y^*$ maximizes $f(x^*, \cdot)$ and thereby measures the distance between the generated distribution and the target distribution; it need not remain a local maximizer of $f(x, \cdot)$ for all $x$ in a neighborhood of $x^*$.
Based on the above analysis, it is inappropriate to characterize the solution of a sequential game by the Nash equilibrium. Jin et al. [jin2020local] defined the local minimax as the equilibrium of differentiable sequential games Eq. 1, which is intuitively and theoretically more suitable.
Definition 2 (Local minimax)
A point $(x^*, y^*)$ is a local minimax for the min-max problem Eq. 1 if there exist $\delta_0 > 0$ and a continuous function $h$ with $h(\delta) \to 0$ as $\delta \to 0$, such that
$f(x^*, y) \le f(x^*, y^*) \le \max_{y' : \|y' - y^*\| \le h(\delta)} f(x, y')$ for all $\delta \in (0, \delta_0]$ and all $(x, y)$ satisfying $\|x - x^*\| \le \delta$ and $\|y - y^*\| \le \delta$.
By the implicit function theorem, the definition can be further clarified [Wang2020On]: (1) $y^*$ is a local maximum of $f(x^*, \cdot)$; (2) $x^*$ is a local minimum of $\phi(x) := f(x, r(x))$, where $r$ is an implicit function defined by $\nabla_y f(x, r(x)) = 0$ in a neighborhood of $x^*$ with $r(x^*) = y^*$.
Here, the local minimax does not require $y^*$ to be a local maximum of $f(x, \cdot)$ for all $x$ in a neighborhood of $x^*$. Note that the local maximum of $f(x, \cdot)$ is allowed to move slightly (as controlled by $h$) with $x$. For more details, please refer to [jin2020local]. As with the local Nash equilibrium, necessary and sufficient conditions for the local minimax are established in [jin2020local].
Proposition 3 (Necessary conditions for local minimax)
A local minimax $(x^*, y^*)$ of a twice differentiable payoff function must satisfy the following conditions: (1) it is a critical point (i.e., $\nabla_x f(x^*, y^*) = 0$ and $\nabla_y f(x^*, y^*) = 0$); (2) $\nabla^2_{yy} f(x^*, y^*) \preceq 0$ and, when $\nabla^2_{yy} f(x^*, y^*)$ is invertible, the Schur complement satisfies $\nabla^2_{xx} f - \nabla^2_{xy} f\,(\nabla^2_{yy} f)^{-1}\,\nabla^2_{yx} f \succeq 0$.
Proposition 4 (Sufficient conditions for local minimax)
For a twice differentiable min-max problem Eq. 1, if a critical point $(x^*, y^*)$ satisfies $\nabla^2_{yy} f(x^*, y^*) \prec 0$ and $\nabla^2_{xx} f - \nabla^2_{xy} f\,(\nabla^2_{yy} f)^{-1}\,\nabla^2_{yx} f \succ 0$, then it is a (strict) local minimax.
Comparing Proposition 1 and Proposition 2 with Proposition 3 and Proposition 4, the main difference is that the local minimax utilizes the Schur complement of the Hessian matrix, while the local Nash equilibrium only involves its diagonal blocks. Intuitively, the Schur complement carries more information than the diagonal blocks: the latter ignore the correlation between the two variables, while the former uses the whole Hessian.
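The distinction can be made concrete on a toy Hessian with assumed numeric blocks (our illustration, not an example from the paper): the diagonal-block (Nash) test fails while the Schur-complement (minimax) test passes.

```python
import numpy as np

# Assumed 1x1 Hessian blocks of f at a critical point.
H_xx = np.array([[-0.5]])
H_xy = np.array([[1.0]])
H_yy = np.array([[-1.0]])

# The Nash test (Proposition 1) looks only at the diagonal blocks and
# requires H_xx >= 0; that fails here, so this is not a local Nash point.
assert np.all(np.linalg.eigvalsh(H_xx) < 0)

# The local minimax test (Proposition 4) uses the cross term H_xy through
# the Schur complement H_xx - H_xy H_yy^{-1} H_yx.
schur = H_xx - H_xy @ np.linalg.inv(H_yy) @ H_xy.T
assert np.all(np.linalg.eigvalsh(H_yy) < 0)   # follower block negative definite
assert np.all(np.linalg.eigvalsh(schur) > 0)  # Schur complement positive definite
```

Here the cross term turns $-0.5$ into a Schur complement of $+0.5$, so the point is a strict local minimax despite not being a Nash equilibrium.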
2.2 Why not GDA and its Variants
Two Time-Scale GDA (i.e., TTUR in [heusel2017gans]) and GDA-k (adopted in most GAN training [goodfellow2014generative, arjovsky2017wasserstein, gulrajani2017improved]) are the two most popular variants of GDA. Their global convergence is less than satisfactory: Two Time-Scale GDA may converge to an undesired point that is neither a Nash equilibrium nor a local minimax, and GDA-k is convergent only if it satisfies a Max-Oracle condition, which is extremely strict in practice [jin2020local]. Fortunately, GDA has local convergence properties with respect to local minimax points [zhang2020newton], which may explain the success of GANs trained by GDA in various tasks. We further find that Extra Gradient (EG) [korpelevich1976extragradient], which was derived for solving convex-concave problems (simultaneous games), is not suitable for solving sequential min-max problems. For more details on the (local) convergence of GDA and its variants to local minimax, please see Appendix A.
2.3 Follow-the-Ridge (FR) Algorithm
The Follow-the-Ridge (FR) algorithm [Wang2020On] is the first work studying the min-max problem Eq. 1 based on the local minimax. We briefly introduce its main idea here. Suppose that $(x_t, y_t)$ is on the ridge, i.e., $\nabla_y f(x_t, y_t) = 0$ and $y_t = r(x_t)$, where $r$ is the implicit function defined in Definition 2. In each step $t$, $x$ is updated by gradient descent, i.e., $x_{t+1} = x_t - \eta_x \nabla_x f(x_t, y_t)$. The leader can never foresee the follower's action, which is why simple gradient descent is adopted for updating $x$. The follower, however, witnesses the update of $x$ and hopes to exploit this additional information to stay on the ridge (i.e., to satisfy $\nabla_y f(x_{t+1}, y_{t+1}) = 0$). Given the current state $\nabla_y f(x_t, y_t) = 0$ and the known update $\Delta x = x_{t+1} - x_t$, the follower seeks $\Delta y$ with $\nabla_y f(x_t + \Delta x, y_t + \Delta y) = 0$. A Taylor expansion of $\nabla_y f$ implies that $\Delta y \approx -(\nabla^2_{yy} f)^{-1} \nabla^2_{yx} f\, \Delta x$.
Therefore, the correction term $-(\nabla^2_{yy} f)^{-1} \nabla^2_{yx} f\,(x_{t+1} - x_t)$ brings $y_{t+1}$ to (approximately) $r(x_{t+1})$, staying on the ridge. If $(x_t, y_t)$ is not on the ridge, a gradient ascent update for $y$ is needed to move closer to the ridge. In conclusion, the FR algorithm adds a correction term to GDA that keeps the iterates parallel to, or on, the ridge:
$$x_{t+1} = x_t - \eta_x \nabla_x f(x_t, y_t), \qquad y_{t+1} = y_t + \eta_y \nabla_y f(x_t, y_t) + \eta_x (\nabla^2_{yy} f)^{-1} \nabla^2_{yx} f\, \nabla_x f(x_t, y_t),$$
where $\eta_x$ and $\eta_y$ are the learning rates for $x$ and $y$, respectively.
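On a quadratic payoff the ridge correction can be written in closed form, which gives a compact sanity check of the FR update (a minimal sketch under assumed coefficients, not the authors' implementation):

```python
import numpy as np

# Toy payoff f(x, y) = 0.5*a*x**2 + b*x*y + 0.5*c*y**2 with c < 0;
# (0, 0) is a strict local minimax (H_yy = c < 0, Schur = a - b*b/c > 0).
a, b, c = 1.0, 1.0, -1.0

def fr_step(x, y, lr_x=0.05, lr_y=0.05):
    gx, gy = a * x + b * y, b * x + c * y
    x_new = x - lr_x * gx               # leader: plain gradient descent
    # follower: gradient ascent plus the ridge correction
    # dy = -H_yy^{-1} H_yx dx, with H_yy = c and H_yx = b here.
    y_new = y + lr_y * gy - (b / c) * (x_new - x)
    return x_new, y_new

x, y = 1.0, 1.0   # this start lies on the ridge (grad_y f = x - y = 0)
for _ in range(500):
    x, y = fr_step(x, y)

# FR stays on the ridge and converges to the local minimax (0, 0).
assert abs(x) < 1e-6 and abs(y) < 1e-6
```

Because the payoff is quadratic, the Taylor expansion behind the correction term is exact, so the iterates remain on the ridge $y = x$ at every step.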
In this section, we first develop the HessianFR algorithm based on the FR algorithm and the Newton method. Theoretically, we show that the proposed HessianFR algorithm enjoys a faster convergence rate to strict local minimax points than the FR algorithm. To reduce the computational cost in large-scale problems, we extend the deterministic HessianFR to a stochastic HessianFR with theoretical convergence guarantees. Furthermore, we discuss several methods for computing the Hessian inverse required by HessianFR.
3.1 Deterministic Algorithm
In this part, we introduce the motivation for the proposed algorithm and then establish its local convergence. The theoretical analysis shows that the proposed HessianFR is better than FR [Wang2020On] when the Hessian $\nabla^2_{yy} f$ is ill-conditioned and is no worse than GDN [zhang2020newton], while being more computationally friendly than GDN in implementation.
According to the convergence analysis of the FR algorithm in [Wang2020On], the theoretical convergence rate is governed by the condition numbers of the diagonal blocks of the Jacobian of the update map: the Jacobian matrix of FR at a local minimax is similar to a block triangular matrix with diagonal blocks $I - \eta_x S$ and $I + \eta_y \nabla^2_{yy} f$, where $S = \nabla^2_{xx} f - \nabla^2_{xy} f (\nabla^2_{yy} f)^{-1} \nabla^2_{yx} f$ denotes the Schur complement, so its eigenvalues are $1 - \eta_x \lambda(S)$ and $1 + \eta_y \lambda(\nabla^2_{yy} f)$. How can we improve the convergence of FR without extra computation (or with acceptable computation costs)? A direct way is to decrease the condition numbers of the two diagonal blocks of the Jacobian matrix. Furthermore, we hope the follower stays close to the optimal response, i.e., $y_t \approx r(x_t)$: the follower's action is essential for the min-max problem, and its fast convergence is expected; in GAN training, weak discriminators lead to disasters and failures. Therefore, instead of improving the update for $x$, accelerating $y$ seems more reasonable. Observe that the correction term for $y$ already involves the inverse Hessian $(\nabla^2_{yy} f)^{-1}$. This reminds us of the Newton method, which outperforms gradient-based (first-order) methods theoretically and numerically under suitable regularity conditions. Taking advantage of the Newton method, an extra correction term (a Newton step) $-\gamma (\nabla^2_{yy} f)^{-1} \nabla_y f$ can accelerate the convergence of $y$, i.e., it is added to the follower's FR update,
where $\eta_y$ and $\gamma$ are the learning rates for the gradient ascent term and the Newton correction term, respectively. Moreover, compared with FR, HessianFR merely requires the additional computation of $(\nabla^2_{yy} f)^{-1} \nabla_y f$; the extra cost is acceptable and can often be ignored in practice. The algorithm is named "HessianFR" because it adds a Hessian (Newton) step in the context of the FR algorithm. The pseudocode for HessianFR is shown in Algorithm 1. Here, we use a finite difference to compute the Hessian-vector product $\nabla^2_{yx} f\, v$, i.e., $\nabla^2_{yx} f\, v \approx \big(\nabla_y f(x + h v, y) - \nabla_y f(x, y)\big)/h$ as $h \to 0$. The computation of $(\nabla^2_{yy} f)^{-1}$ is discussed in Section 3.3.
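The finite-difference Hessian-vector product can be sketched as follows; `grad_y` and the quadratic test payoff are our own assumptions for illustration, and the step size `h` would need tuning in practice:

```python
import numpy as np

def hvp_yx(grad_y, x, y, v, h=1e-5):
    """Approximate the mixed Hessian-vector product (d^2 f / dy dx) @ v by
    differencing the y-gradient at two nearby leader positions x, x + h*v."""
    return (grad_y(x + h * v, y) - grad_y(x, y)) / h

# Sanity check on f(x, y) = x^T A y, where grad_y f = A^T x and the
# exact mixed Hessian-vector product is A^T v (A is an assumed matrix).
A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
grad_y = lambda x, y: A.T @ x
x, y = np.array([1.0, -1.0]), np.zeros(2)
v = np.array([0.5, 2.0])

approx = hvp_yx(grad_y, x, y, v)
assert np.allclose(approx, A.T @ v, atol=1e-4)
```

This avoids ever forming the mixed Hessian block explicitly: one extra gradient evaluation per product suffices.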
3.1.2 Relation to other algorithms
HessianFR is in fact a generalization of FR [Wang2020On] and GDN [zhang2020newton]: it is equivalent to FR if we set $\gamma = 0$. The update rule for GDN is as follows:
Note that by Taylor expansion, we have
which is a special case of (equivalent to) HessianFR if $\eta_y = 0$ and $\gamma = 1$. Theoretical and numerical comparisons of these three related algorithms are discussed later.
3.1.3 Convergence Analysis
We first prove the local convergence property of HessianFR: it converges, and only converges, to local minimax points. Then the theoretical convergence rates of HessianFR, FR and GDN are compared. With a proper choice of learning rates, our HessianFR is better than FR and comparable with GDN in theory.
Here, we first introduce the definition of strict stable points of an algorithm, which is useful for convergence analysis in optimization.
Definition 3 (Strict stable point of an algorithm)
For an algorithm defined by the update map $w_{t+1} = g(w_t)$, we call a point $w^*$ a strict stable point of the algorithm if $g(w^*) = w^*$ and $\rho(J(w^*)) < 1$,
where $J(w^*)$ is the Jacobian matrix of $g$ at $w^*$ and $\rho(\cdot)$ denotes the spectral radius.
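Definition 3 can be probed numerically. The sketch below uses a hypothetical linear update map (whose Jacobian is the map itself) to check that a spectral radius below one makes the fixed point attracting:

```python
import numpy as np

# Linear update map w_{t+1} = g(w_t) = J @ w_t with fixed point w* = 0.
# For a linear map the Jacobian at w* is J itself (assumed values).
J = np.array([[0.9, 0.2],
              [0.0, 0.5]])

rho = max(abs(np.linalg.eigvals(J)))
assert rho < 1.0   # w* = 0 is a strict stable point of this map

# Iterating the map contracts any nearby start toward the fixed point.
w = np.array([1.0, 1.0])
for _ in range(200):
    w = J @ w
assert np.linalg.norm(w) < 1e-6
```

The eigenvalues of this triangular `J` are $0.9$ and $0.5$, so the iterates shrink geometrically at rate about $0.9$ per step.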
Theorem 1 (Local convergence of HessianFR)
With a proper choice of learning rates, all strict local minimax points are strict stable fixed points of HessianFR. Moreover, any strict stable fixed point of HessianFR is a strict local minimax.
The Jacobian matrix of HessianFR at the local minimax is
Observe that the matrix
is invertible; hence the Jacobian is similar to
Therefore, the two matrices share the same eigenvalues. By the conditions for strict local minimax (Proposition 4), we have $\nabla^2_{yy} f(x^*, y^*) \prec 0$ and $\nabla^2_{xx} f - \nabla^2_{xy} f (\nabla^2_{yy} f)^{-1} \nabla^2_{yx} f \succ 0$. To guarantee the convergence of HessianFR, we require the spectral radius of the Jacobian to be smaller than one, i.e.,
which implies that
Note that the spectral radius is the infimum over all matrix norms; hence, according to [horn2012matrix], there exist a matrix norm and a constant such that the Jacobian matrix satisfies
Considering the Taylor expansion of the HessianFR update map around the fixed point, we have
There exists a small enough neighborhood of the fixed point such that, whenever an iterate lies in this neighborhood, the above contraction bound applies. Therefore, the iterates remain in the neighborhood and converge to the fixed point,
which completes the proof of the local convergence of HessianFR.
Conversely, suppose a point is a strict stable fixed point of HessianFR, i.e., it is a critical point and the Jacobian there has spectral radius smaller than one. This further implies that $\nabla^2_{yy} f \prec 0$ and that the Schur complement satisfies $\nabla^2_{xx} f - \nabla^2_{xy} f (\nabla^2_{yy} f)^{-1} \nabla^2_{yx} f \succ 0$, and we conclude that the point is a strict local minimax.
Note that the convergence rate of HessianFR can be roughly estimated by the eigenvalues of its Jacobian:
Setting $\gamma = 0$, the Jacobian matrix of FR at the local minimax is similar to
and the corresponding spectral radius is
If $\nabla^2_{yy} f$ is ill-conditioned, with a large maximal eigenvalue (in magnitude) but a small minimal one, then the achievable contraction of FR in the $y$-block is small. With a proper choice of $\gamma$ such that
the corresponding spectral radius of HessianFR is strictly smaller than that of FR, which implies the improved theoretical convergence of HessianFR over FR.
Setting $\eta_y = 0$ and $\gamma = 1$, the Jacobian matrix of GDN at the local minimax is similar to
which implies that
With a proper choice of $\eta_y$ and $\gamma$ (which always exists) such that
then the theoretical convergence rate of HessianFR is equal to that of GDN.
In conclusion, HessianFR is better than FR when $\nabla^2_{yy} f$ is ill-conditioned and is theoretically no worse than GDN. Note that setting the learning rate to $1$ (as in GDN) may be infeasible in deep learning applications. Although GDN outperforms FR in terms of theoretical local convergence, it requires strict pretraining, which is hard to achieve in practice. We will discuss this in the numerical part.
Preconditioning is a popular technique to accelerate convergence in machine learning and numerical linear algebra. Suppose that in each step of HessianFR, the gradient is preconditioned by a pair of diagonal matrices that are positive definite and bounded; then the convergence properties of HessianFR in Theorem 1 still hold. In the numerical part of this paper, we adopt the same preconditioning strategy as Adam [kingma2014adam].
Proposition 5 (Convergence of HessianFR with preconditioning)
Suppose that the gradients in HessianFR are preconditioned by a pair of symmetric, bounded, positive definite matrices $(P_x, P_y)$, i.e., $\nabla_x f$ and $\nabla_y f$ are replaced by $P_x \nabla_x f$ and $P_y \nabla_y f$, respectively. Then Theorem 1 still holds for HessianFR with preconditioning.
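As a rough sketch of the diagonal preconditioning assumed here (following Adam's second-moment recursion; the gradient values below are hypothetical), the preconditioner is a diagonal, bounded, positive definite matrix that rescales coordinates toward a common magnitude:

```python
import numpy as np

def adam_precondition(grad, v, beta2=0.999, eps=1e-8):
    """One step of Adam-style diagonal preconditioning: returns D^{-1} @ grad,
    where D = diag(sqrt(v) + eps) is symmetric positive definite."""
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # running second-moment estimate
    return grad / (np.sqrt(v) + eps), v

g = np.array([10.0, 0.01])      # badly scaled gradient coordinates
pg, v = adam_precondition(g, np.zeros_like(g))

# Preconditioning equalizes the coordinate scales, shrinking the
# effective condition number of the update.
assert pg[0] / pg[1] < g[0] / g[1]
```

Since the diagonal entries of the preconditioner stay within fixed positive bounds once the moment estimate stabilizes, this fits the boundedness assumption of Proposition 5.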
The update rule for HessianFR with preconditioning (HFR-P) is as follows:
The Jacobian for HessianFR with preconditioning is
which is similar to
Note that the matrix
is similar to
which is positive definite if is symmetric positive definite. Moreover, the matrix
is similar to
where both and are symmetric positive definite. We deduce that all eigenvalues of
which is similar to
are positive and real. Therefore, eigenvalues of two matrices
are all real and positive. Furthermore, Theorem 1 holds for HessianFR with preconditioning if
3.2 Stochastic Learning
Large-scale datasets and parameter sets are the norm in deep learning. In generative adversarial networks, we usually need to estimate millions of parameters, and the dataset size ranges from tens of thousands of images (e.g., MNIST and CIFAR-10) to hundreds of thousands (e.g., CelebA). Instead of solving a deterministic optimization problem, stochastic (mini-batch) learning is adopted to lower the computational cost and storage while still approximating the exact solution. In this section, we mainly analyze the convergence properties of HessianFR in the stochastic setting; the analysis also extends to FR.
We first derive the stochastic algorithm for training generative adversarial networks. For simplicity, the GAN model is formulated as
where $G$ and $D$ are the generator and the discriminator, respectively; the former takes as input a set of noise samples (usually Gaussian or uniform), while the latter sees the observed data from the target distribution. Writing the trainable parameters of $G$ and $D$ as $x$ and $y$, the objective function of the GAN model is rewritten as
Similarly, one can easily check that JS-GAN [goodfellow2014generative], WGAN (WGAN-clip [arjovsky2017wasserstein], WGAN-GP [gulrajani2017improved], WGAN-spectral [miyato2018spectral], etc.) and other GAN variants satisfy the above property. Therefore, with finite training data, the optimization problem Eq. 1 is rewritten as
where $f_i$ is the payoff function for the $i$-th training example. The stochastic payoff function used in each step is
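The relation between the full payoff and its mini-batch estimate can be sketched as follows (the per-sample payoff `f_i` and the dataset here are hypothetical placeholders, not the GAN objective itself):

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch = 1000, 32
data = rng.normal(size=n)          # assumed toy "training set"

def f_i(x, y, z):
    # hypothetical per-sample payoff for illustration only
    return (x - z) ** 2 - (y - z) ** 2

def full_payoff(x, y):
    return np.mean([f_i(x, y, z) for z in data])

def stochastic_payoff(x, y):
    idx = rng.choice(n, size=batch, replace=False)
    return np.mean([f_i(x, y, data[i]) for i in idx])

# Each mini-batch payoff is an unbiased estimate of the full payoff,
# so averaging many of them recovers the deterministic objective.
full = full_payoff(0.5, -0.5)
estimates = [stochastic_payoff(0.5, -0.5) for _ in range(2000)]
assert abs(np.mean(estimates) - full) < 0.05
```

The same unbiasedness carries over to the stochastic gradients and Hessian-vector products used in stochastic HessianFR, which is what the convergence analysis below relies on.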
3.2.2 Convergence Analysis
We have proved the convergence of HessianFR for the deterministic min-max problem Eq. 1. In this part, we derive a similar convergence property for stochastic HessianFR applied to Eq. 16 under some mild conditions (Section 3.2.2). Suppose that $(x^*, y^*)$ is a strict local minimax of the min-max problem Eq. 16 and consider a neighborhood of $(x^*, y^*)$.
[Standard assumptions for smoothness] The gradient and Hessian of each per-sample objective function are bounded in this neighborhood, i.e., we assume that the following inequalities hold
for all samples and all points in the neighborhood. Furthermore, since $f$ is defined as the average of the per-sample objectives, $f$ also satisfies the above inequalities, i.e.,