Training GANs with Centripetal Acceleration

02/24/2019 ∙ by Wei Peng, et al. ∙ Chinese Academy of Sciences ∙ NetEase, Inc.

Training generative adversarial networks (GANs) often suffers from cyclic behaviors of iterates. Based on a simple intuition that the direction of centripetal acceleration of an object moving in uniform circular motion is toward the center of the circle, we present the Simultaneous Centripetal Acceleration (SCA) method and the Alternating Centripetal Acceleration (ACA) method to alleviate the cyclic behaviors. Under suitable conditions, gradient descent methods with either SCA or ACA are shown to be linearly convergent for bilinear games. Numerical experiments are conducted by applying ACA to existing gradient-based algorithms in a GAN setup scenario, which demonstrate the superiority of ACA.


1 Introduction

Generative Adversarial Nets (GANs) [7] are recognized as powerful generative models and have been successfully applied to various fields such as image generation [8], representation learning [15] and super-resolution [17]. The idea behind GANs is an adversarial game between a generator network (G-net) and a discriminator network (D-net). The G-net attempts to generate synthetic data from noise to deceive the D-net, while the D-net tries to discern the synthetic data from the real data. The original GAN can be formulated as the min-max problem:

(1.1)    min_G max_D  E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 - D(G(z)))]

Though GANs are appealing, they are often hard to train. The main difficulty may be that the associated gradient vector field rotates around a Nash equilibrium, owing to imaginary components in the eigenvalues of the Jacobian [11], which results in oscillatory limiting behavior. A series of studies focuses on developing fast and stable methods for training GANs. Using the Jacobian, consensus optimization [11] diverts gradient updates toward descent directions of the field magnitude. More fundamentally, a differential game can always be decomposed into a potential game and a Hamiltonian game [1]. Potential games have been studied intensively [13] because gradient descent methods converge in them. Hamiltonian games obey a conservation law, so iterates generated by gradient descent tend to cycle or even diverge in these games; the Hamiltonian component is therefore a likely cause of cycling when gradient descent methods are applied. Based on these observations, the Symplectic Gradient Adjustment (SGA) method [1] modifies the associated vector field to guide the iterates across the curl of the Hamiltonian component of a differential game, and [4] uses a similar curl-crossing technique to alleviate rotations. By augmenting the Follow-the-Regularized-Leader algorithm [16] with an optimistic predictor of the next gradient, Optimistic Mirror Descent (OMD) methods are presented in [3] and analyzed in [5, 10, 9, 12]. Negative momentum is employed in [6] to deplete the kinetic energy of the cyclic motion so that iterates fall toward the center; [6] also observes that the alternating version of the negative momentum method is more stable.

Our idea is motivated by two observations. First and intuitively, the centripetal acceleration of an object in uniform circular motion points to the center of the circle, which suggests a direction that can guide iterates across the curl and out of cycling traps. Second, we seek a method that approximates the dynamics of consensus optimization or SGA, crossing the curl without computing the Jacobian, which reduces computational cost. These considerations led us to the centripetal acceleration methods, which can be used to adjust gradients in various methods such as SGD, RMSProp [18] and Adam [2]. For stability and effectiveness, we are also motivated by [6] to study the alternating scheme, which works even in a notoriously difficult GAN setup.

The main contributions are as follows:

  1. From two different perspectives, we present centripetal acceleration methods to alleviate the cyclic behaviors in training GANs. Specifically, we propose the Simultaneous Centripetal Acceleration (SCA) method and the Alternating Centripetal Acceleration (ACA) method.

  2. For bilinear games, which are purely adversarial, we prove that gradient descent with either SCA or ACA is linearly convergent under suitable conditions.

  3. Preliminary numerical simulations are conducted in a GAN setup scenario, which show that centripetal acceleration is useful when combined with several gradient-based algorithms.

Outline. The rest of the paper is organized as follows. In Section 2, we present simultaneous and alternating centripetal acceleration methods and discuss them with closely related works. In Section 3, focusing on bilinear games, we prove the linear convergence of gradient descent combined with the two centripetal acceleration methods. In Section 4, we conduct numerical experiments to test the effectiveness of centripetal acceleration methods. Section 5 concludes the paper.

2 Centripetal Acceleration Methods

A differentiable two-player game involves two loss functions defined over a shared parameter space: player 1 tries to minimize its loss while player 2 attempts to minimize its own. The goal is to find a local Nash equilibrium of the game, i.e., a pair of strategies such that, within some neighborhood, neither player can decrease its own loss by unilaterally changing its parameters.

Problem (1.1) induces such a two-player game: the G-net and the D-net are each parameterized by their own weights, and the problem becomes finding a local Nash equilibrium:

(2.1)

where

(2.2)

The simultaneous gradient descent method in training GANs [14] is

The alternating version is

However, directly applying gradient descent fails to approach the saddle point even in a toy model (see Fig. 2 in Section 4). By applying the Simultaneous Centripetal Acceleration (SCA) method, explained below, to adjust the gradients, we obtain gradient descent with SCA (Grad-SCA):

(2.3)
(2.4)
(2.5)
(2.6)

It can be seen that the gradient descent scheme is still employed in (2.4) and (2.6), while the gradients in (2.3) and (2.5) are adjusted by simultaneously adding the directions of centripetal acceleration. Adjusting the gradients by the Alternating Centripetal Acceleration (ACA) method instead yields gradient descent with ACA (Grad-ACA):

(2.7)
(2.8)
(2.9)
(2.10)

Grad-ACA also employs simple gradient descent steps but adjusts the gradients by adding the directions of centripetal acceleration alternately. Moreover, the idea of centripetal acceleration can be applied to other gradient-based methods, resulting in more efficient algorithms. For example, the RMSProp algorithm [18] with ACA, abbreviated RMSProp-ACA, performs well in our numerical experiments (see Section 4.2).
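The two schemes can be sketched in a few lines of Python. This is a minimal reading of the updates, assuming the centripetal term is a coefficient beta times the difference of successive gradients added to the current gradient before the step; the names grad_x, grad_y, alpha and beta are illustrative, not the paper's exact notation in (2.3)-(2.10):

```python
def grad_sca(grad_x, grad_y, x, y, alpha=0.1, beta=1.0, steps=3000):
    """Gradient descent with Simultaneous Centripetal Acceleration (sketch).

    Both players adjust their gradient by beta * (g_t - g_prev) before
    stepping; x descends and y ascends (min-max convention)."""
    gx_prev, gy_prev = grad_x(x, y), grad_y(x, y)
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)   # both evaluated at (x_t, y_t)
        x = x - alpha * (gx + beta * (gx - gx_prev))
        y = y + alpha * (gy + beta * (gy - gy_prev))
        gx_prev, gy_prev = gx, gy
    return x, y

def grad_aca(grad_x, grad_y, x, y, alpha=0.1, beta=1.0, steps=3000):
    """Gradient descent with Alternating Centripetal Acceleration (sketch).

    x is updated first; y's gradient is then evaluated at the new x."""
    gx_prev, gy_prev = grad_x(x, y), grad_y(x, y)
    for _ in range(steps):
        gx = grad_x(x, y)
        x = x - alpha * (gx + beta * (gx - gx_prev))
        gy = grad_y(x, y)                     # uses the freshly updated x
        y = y + alpha * (gy + beta * (gy - gy_prev))
        gx_prev, gy_prev = gx, gy
    return x, y

# Bilinear toy game f(x, y) = x * y: plain simultaneous gradient descent
# spirals outward, while both centripetal variants contract toward (0, 0).
gx = lambda x, y: y            # df/dx
gy = lambda x, y: x            # df/dy
x_s, y_s = grad_sca(gx, gy, 1.0, 1.0)
x_a, y_a = grad_aca(gx, gy, 1.0, 1.0)
```

On the scalar bilinear game this recursion is linear, so convergence or divergence is governed by the spectral radius of the induced iteration matrix, as analyzed in Section 3.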

The basic intuition behind employing centripetal acceleration is shown in Fig. 1. Consider an object in uniform circular motion about the origin. Its centripetal acceleration is perpendicular to the instantaneous velocity and points toward the origin. The cyclic behavior of iterates around a Nash equilibrium is similar to this circular motion, so the centripetal acceleration provides a direction along which the iterates can approach the target more quickly. An approximation of the centripetal acceleration term is then applied to gradient descent, as illustrated in Grad-SCA.

Figure 1: The basic intuition of centripetal acceleration methods.

The proposed centripetal acceleration methods are also inspired by the dynamics of consensus optimization. In a Hamiltonian game, the associated vector field is tangent to the level sets of the Hamiltonian, which prevents iterates from approaching the equilibrium at which the Hamiltonian vanishes. To illustrate the similarity between centripetal acceleration methods and consensus optimization in Hamiltonian games, we consider an n-player differential game in which each player has its own loss function. The simultaneous gradient stacks each player's gradient of its own loss, and its Jacobian is

(2.11)

The iteration scheme of consensus optimization is

(2.12)

and the corresponding continuous dynamics has the form:

(2.13)

When the step-size is small, the dynamics approximates

(2.14)

Rearranging terms, we obtain

(2.15)

Since the game is assumed to be Hamiltonian, the dynamical equation (2.15) becomes

(2.16)

Consequently, (2.16) is equivalent to

(2.17)

Discretizing the equation with a finite step-size, we obtain

(2.18)

which is exactly Grad-SCA. Furthermore, in Hamiltonian games, the dynamics of consensus optimization and of SGA plugged into gradient descent (Grad-SGA) are essentially the same. Therefore, the presented Grad-SCA can be regarded as a Jacobian-free approximation of consensus optimization or Grad-SGA.
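For bilinear games this Jacobian-free approximation can be checked exactly: the signed vector field is linear, so the difference of successive field evaluations equals the Jacobian applied to the last step, and the centripetal correction coincides with a consensus-optimization-style correction with effective weight alpha² · beta. A NumPy sketch under our assumed form of the SCA update (the coefficients and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
alpha, beta = 0.05, 1.0

# Signed vector field of the min-max bilinear game f(x, y) = x^T A y:
# v(x, y) = (A y, -A^T x), with Jacobian J = [[0, A], [-A^T, 0]].
def field(x, y):
    return np.concatenate([A @ y, -A.T @ x])

J = np.block([[np.zeros((3, 3)), A], [-A.T, np.zeros((3, 3))]])
assert np.allclose(J.T, -J)          # Hamiltonian: antisymmetric Jacobian

x, y = rng.standard_normal(3), rng.standard_normal(3)
v_prev = field(x, y)
x = x - alpha * v_prev[:3]           # one simultaneous gradient step
y = y - alpha * v_prev[3:]
v = field(x, y)

# Jacobian-free centripetal correction vs. Jacobian-based correction
# (for a linear field, v - v_prev = -alpha * J @ v_prev exactly).
sca_correction = -alpha * beta * (v - v_prev)
consensus_correction = -(alpha**2 * beta) * (J.T @ v_prev)
```

Because the field is linear, the two corrections agree to machine precision here; for general games they agree only to first order in the step-size.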

Related works. For a particular choice of the acceleration coefficient, Grad-SCA (2.3)-(2.6) reduces to OMD [3], which has the following form:

Very recently, from the perspective of generalizing OMD, [12] presented schemes similar to Grad-SCA and studied their convergence within a unified proximal-method framework. However, OMD is motivated by optimistically predicting the next gradient to equal the current one. Although the scheme of OMD coincides with Grad-SCA, we stress that the motivations are essentially different and lead to distinct parameter selection strategies. Owing to the similar dynamics, the presented methods inherit the parameter selection strategies of consensus optimization and SGA; in the second experiment in Section 4, for instance, the acceleration coefficient is taken considerably larger than the step-size rather than equal to it. Moreover, we analyze the alternating form (Grad-ACA) (2.7)-(2.10) and employ RMSProp-ACA in the numerical experiments. Therefore, the presented methods are not trivial generalizations of OMD, and the idea of centripetal acceleration is useful in its own right.

Another similar scheme [5] extrapolates the gradient from the past:

It can be rewritten as

which is equivalent to OMD. The algorithm may also be closely related to the predictive methods with the following form:

A unified framework to analyze OMD and predictive methods is presented in [9].

Last but not least, our idea of using an alternating scheme comes from negative momentum methods [6], which suggest that alternating forms may be more stable and effective in practice.

3 Linear Convergence for Bilinear Games

In this section, we focus on the convergence of Grad-SCA and Grad-ACA in the bilinear game:

(3.1)

Any stationary point of the game satisfies the first order conditions:

(3.2)
(3.3)

A stationary point exists if and only if the two linear terms lie in the ranges of the coupling matrix and of its transpose, respectively; we suppose that such a pair exists. Without loss of generality, we shift the stationary point to the origin. Then the problem is reformulated as:

(3.4)

In the following two subsections, we analyze the convergence properties of Grad-SCA and Grad-ACA, respectively. Technical details are deferred to the appendices.

3.1 Linear Convergence of Grad-SCA

For the bilinear game, Grad-SCA is specified as

(3.5)
(3.6)

Define the matrix as

(3.7)

The iterates generated by (3.5) and (3.6) evolve according to this matrix. For simplicity, we suppose that the coupling matrix is square and nonsingular in Propositions 3.2 and 3.3 and Corollary 3.4. We then prove linear convergence for a general matrix in Proposition 3.5 and Corollary 3.6. We will employ the following well-known lemma to establish linear convergence.

Lemma 3.1.

Suppose that the iteration matrix has spectral radius strictly less than one. Then the corresponding linear iterative system converges to zero linearly. Explicitly, for any rate between the spectral radius and one, there exists a constant such that

(3.8)
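Lemma 3.1 can be illustrated numerically. Under our assumed form of the Grad-SCA update (centripetal term beta times the difference of successive gradients; the sample matrix and coefficients below are illustrative, not the paper's exact matrix (3.7)), the scheme on a bilinear game is a linear recursion in the stacked state (x_k, y_k, x_{k-1}, y_{k-1}), whose spectral radius can be computed directly:

```python
import numpy as np

# Grad-SCA on f(x, y) = x^T A y, with a = alpha*(1+beta), b = alpha*beta:
#   x_{k+1} = x_k - a*A y_k     + b*A y_{k-1}
#   y_{k+1} = y_k + a*A^T x_k   - b*A^T x_{k-1}
A = np.diag([1.0, 0.5])        # sample coupling matrix (illustrative)
alpha, beta = 0.05, 1.0
a, b = alpha * (1 + beta), alpha * beta
n = A.shape[0]
I, Z = np.eye(n), np.zeros((n, n))
T = np.block([
    [I,       -a * A,  Z,        b * A],
    [a * A.T,  I,      -b * A.T, Z    ],
    [I,        Z,      Z,        Z    ],
    [Z,        I,      Z,        Z    ],
])
rho = np.max(np.abs(np.linalg.eigvals(T)))
# rho < 1, so by Lemma 3.1 the stacked iterates contract linearly.
s = np.ones(4 * n)
for _ in range(5000):
    s = T @ s
```

For these parameters the spectral radius is just below one, so convergence is linear but slow, consistent with the near-rotational dynamics of the bilinear game.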
Proposition 3.2.

Suppose that the coupling matrix is square and nonsingular. The eigenvalues of the iteration matrix are the roots of the fourth-order polynomials:

(3.9)

where the index runs over the collection of all eigenvalues.

Next, we consider conditions on the step-size and the acceleration coefficient.

Proposition 3.3.

Suppose that the coupling matrix is square and nonsingular. Then the iterates converge linearly to 0 if the step-size and the acceleration coefficient satisfy

(3.10)

where and denote the largest and the smallest eigenvalues, respectively.

Consider the special case when Grad-SCA reduces to OMD. Then we have the following corollary. The corollary is slightly weaker than the existing result [9, Lemma 3.1].

Corollary 3.4.

Suppose that the coupling matrix is square and nonsingular. If the OMD parameter conditions hold, then the scheme is linearly convergent, i.e., there exists a constant such that

Now we do not assume the coupling matrix to be square and nonsingular. Instead, suppose it has a given rank and admits a singular value decomposition. Denote the null space of the matrix and the null space of its transpose; note that any pair of points in these null spaces is a stationary point. We measure convergence through the orthogonal projections onto these subspaces.

Proposition 3.5.

Suppose that and . Then is linearly convergent.

By an analogous analysis, we have the following result for OMD.

Corollary 3.6.

If the parameter conditions hold, then the scheme is linearly convergent, i.e., there exists a constant such that

3.2 Linear Convergence of Grad-ACA

In this subsection, we consider Grad-ACA for the bilinear game,

(3.11)
(3.12)

The update can be rewritten as:

Thus we define the matrix

(3.13)

from which the linear recursion for the iterates follows immediately.
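Since the Grad-ACA update on a bilinear game is linear in the state (x_k, y_k, y_{k-1}), a matrix playing the role of (3.13) can be recovered numerically by probing the one-step map with basis vectors. The sketch below assumes the same illustrative form of the centripetal term as before:

```python
import numpy as np

A = np.diag([1.0, 0.5])        # sample coupling matrix (illustrative)
alpha, beta = 0.1, 1.0
n = A.shape[0]

def aca_step(state):
    # One Grad-ACA step on f(x, y) = x^T A y; state = (x_k, y_k, y_{k-1}).
    x, y, y_prev = state[:n], state[n:2 * n], state[2 * n:]
    gx, gx_prev = A @ y, A @ y_prev
    x_new = x - alpha * (gx + beta * (gx - gx_prev))
    gy, gy_prev = A.T @ x_new, A.T @ x    # y sees the freshly updated x
    y_new = y + alpha * (gy + beta * (gy - gy_prev))
    return np.concatenate([x_new, y_new, y])

# Probe the linear map column by column to recover the iteration matrix.
T = np.column_stack([aca_step(e) for e in np.eye(3 * n)])
rho = np.max(np.abs(np.linalg.eigvals(T)))

# Simulate the recursion; rho < 1 implies linear contraction (Lemma 3.1).
s = np.ones(3 * n)
for _ in range(4000):
    s = aca_step(s)
```

Note that the alternating state needs only the previous y, since the y-player's gradient is evaluated at the already-updated x.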

Proposition 3.7.

Suppose that is square and nonsingular. Consider the special case where . If , then is linearly convergent to , i.e., there exists a constant such that

Next, we do not assume the matrix to be square and nonsingular. Employing the SVD and the same techniques as in Proposition 3.5, we have

Corollary 3.8.

Consider the special case described above. If the step-size condition holds, then the scheme is linearly convergent, i.e., there exists a constant such that

which implies that linearly converges to the stationary point .

4 Numerical Simulation

4.1 A Simple Bilinear Game

In the first experiment, we tested Grad-SCA and Grad-ACA on the following bilinear game

(4.1)

The unique stationary point is the origin. The behaviors of the methods are presented in Fig. 2. Pure gradient descent steps do not converge to the origin in this simple game; with centripetal acceleration, however, both Grad-SCA and Grad-ACA converge to the origin.
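The three behaviors above can be reproduced qualitatively in a few lines; the step-size and acceleration coefficient below are illustrative choices, not necessarily the paper's settings:

```python
import numpy as np

def run(method, steps=3000, alpha=0.1, beta=1.0):
    """Iterate the chosen scheme on f(x, y) = x*y from (1, 1)."""
    x, y = 1.0, 1.0
    gx_prev, gy_prev = y, x
    for _ in range(steps):
        if method == "sim":              # simultaneous gradient descent
            x, y = x - alpha * y, y + alpha * x
        elif method == "alt":            # alternating gradient descent
            x = x - alpha * y
            y = y + alpha * x
        elif method == "aca":            # Grad-ACA (illustrative form)
            gx = y
            x = x - alpha * (gx + beta * (gx - gx_prev))
            gy = x
            y = y + alpha * (gy + beta * (gy - gy_prev))
            gx_prev, gy_prev = gx, gy
    return np.hypot(x, y)                # distance to the origin

r_sim, r_alt, r_aca = run("sim"), run("alt"), run("aca")
# r_sim blows up, r_alt stays on a bounded closed orbit, r_aca shrinks to ~0.
```

This mirrors Fig. 2: the simultaneous iterates spiral outward, the alternating iterates circle the origin without converging, and the centripetal adjustment turns the rotation into a contraction.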

We compared the effects of various step-sizes and acceleration coefficients in both simultaneous and alternating cases. Fig. 3 suggests that the alternating methods are preferable.

Figure 2: The effects of Grad-SCA and Grad-ACA in the simple bilinear game. Simultaneous gradient descent diverges, while alternating gradient descent keeps the iterates on a closed trajectory. In contrast, both Grad-SCA and Grad-ACA converge to the origin linearly, and the alternating version appears faster.

4.2 Mixture of Gaussians

In the second simulation, we set up a toy GAN model to compare several methods on learning a mixture of eight Gaussians with a fixed standard deviation (the code is available at https://github.com/dynames0098/GANsTrainingWithCenAcc). The ground truth is shown in Fig. 4.

Both the generator and the discriminator networks consist of four fully connected layers, each followed by a ReLU activation. The generator has two output neurons representing a generated point, while the discriminator has a single output that scores a sample. The generator's input is 16-dimensional Gaussian noise. We conducted the experiments on a server with an Intel i7-4790 CPU, an NVIDIA Titan Xp GPU, and 16 GB RAM, using TensorFlow 1.12 and Python 3.6.7.
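As a concrete sketch of the generator's shape, a NumPy forward pass can map 16-D Gaussian noise to 2-D points. The hidden width of 64 is our assumption (this extract does not state it), and we place the ReLU on the hidden layers only, leaving the 2-D output linear:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """He-initialized weights for a fully connected ReLU network."""
    return [(rng.standard_normal((m, k)) * np.sqrt(2.0 / m), np.zeros(k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def forward(params, h):
    # ReLU after every layer except the last (linear 2-D output point).
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

# Generator: 16-D noise -> three hidden layers of width 64 -> 2-D sample.
gen = init_mlp([16, 64, 64, 64, 2])
z = rng.standard_normal((5, 16))          # batch of 5 noise vectors
samples = forward(gen, z)                  # array of shape (5, 2)
```

The discriminator would have the same four-layer structure with a single output unit instead of two.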

Figure 3: Parameter selection in the simple bilinear game. We test Grad-SCA and Grad-ACA with varying step-size and acceleration parameters. Each grid point shows the logarithm of the squared distance to the origin after a fixed number of iterations. Note that the colormaps differ between the two images. Grad-ACA (left) converges over the entire parameter box, while Grad-SCA (right) may diverge if the step-size is much larger than the acceleration coefficient. In this simple experiment, a larger acceleration coefficient appears preferable. In particular, when the two parameters coincide, Grad-SCA reduces to OMD.
Figure 4: Kernel density estimation on 2560 samples of the ground truth.

We compared the results of several algorithms as shown in Fig. 6. Five methods are included in the comparison:

  1. RMSProp: Simultaneous RMSPropOptimizer provided by TensorFlow.

  2. RMSProp-alt: Alternating RMSPropOptimizer.

  3. ConOpt: Consensus optimizer [11].

  4. RMSProp-SGA: Symplectic gradient adjusted RMSPropOptimizer with sign alignment [1].

  5. RMSProp-ACA: RMSPropOptimizer with the alternating centripetal acceleration method.

To highlight the gains from parameter selection and the alternating strategy, despite the similarity in form to OMD, we also tested OMD on this simulation over a range of parameters (see Appendix B).

The centripetal acceleration methods incur extra computation to form the difference between successive gradients, as well as extra storage to maintain the previous gradients. Consensus optimization and SGA require additional Jacobian-related computations. Fig. 5 shows a timing comparison. By these measures, RMSProp-ACA is competitive with the other methods.

Figure 5: Timing comparison. RMSProp-ACA takes slightly more time than RMSProp, but far less than ConOpt and RMSProp-SGA.

5 Conclusion

In this paper, to alleviate the difficulty of finding a local Nash equilibrium in a smooth two-player game, we presented several gradient-based methods employing centripetal acceleration, including Grad-SCA and Grad-ACA. The proposed adjustments can easily be plugged into other gradient-based algorithms such as SGD, Adam, or RMSProp, in either simultaneous or alternating form. From the theoretical viewpoint, we proved that both Grad-SCA and Grad-ACA converge linearly for bilinear games under suitable conditions. In a simple bilinear game, centripetal acceleration makes the iterates converge stably to the Nash equilibrium, and these examples also suggest that alternating methods are preferable to simultaneous ones. In the GAN setup simulations, we showed that RMSProp-ACA is competitive with consensus optimization and symplectic gradient adjustment methods.

However, our theory covers only deterministic bilinear games, and our numerical simulations are limited. In practical training of GANs and their variants, the associated games are far more complicated due to stochasticity in the computation, the online procedure, and non-convexity. These issues call for further detailed study.

Figure 6: Comparison among several algorithms on the mixture of Gaussians. Five methods are included in the comparison. Each row displays one method, and each column shows samples generated by the G-net at a fixed sequence of iteration counts.

References

Appendix A Proofs in Section 3

A.1 Proof of Proposition 3.2

Proof. The characteristic polynomial of the matrix (3.7) is

(A.1)

which is equivalent to

(A.2)

Since the matrix is square and nonsingular, the excluded values cannot be roots of (A.2). Hence the roots of (A.2) must be the roots of

(A.3)

It follows that the eigenvalues must be the roots of the fourth-order polynomials:

A.2 Proof of Proposition 3.3

Proof. Given an eigenvalue of the iteration matrix, Proposition 3.2 gives

(A.4)

Denote and . Then the four roots of (A.4) are

Note that for a given complex number , the absolute value of the real part of is and the absolute value of the imaginary part of is . Therefore, since , all real parts of lie in the interval , where

(A.5)

and all imaginary parts of lie in the interval , where

(A.6)

Using the inequality

(A.7)

we have

(A.8)
(A.9)

Next, we discuss the two cases separately.
(1). In the first case, we suppose . Since for all , we have

Noting that , we obtain

(A.10)

Combining and (A.10) yields

from which it follows that

(A.11)
(A.12)

The inequality (A.11) follows from the eigenvalue bound, and (A.12) uses (A.7). The inequality above is equivalent to

Using (A.8) and (A.9), we obtain

(A.13)

Note that equality in (A.7) holds only in a degenerate case. Thus equality in (A.13) would contradict the assumptions, so the inequality is strict, which yields the linear convergence of the iterates.
(2). In the second case, assume . Since , using (A.5) and (A.6) directly, we have

(A.14)

which yields the linear convergence. ∎

A.3 Proof of Corollary 3.4

Proof. In this special case, from (A.8) and (A.9) we obtain

From Lemma 3.1 it follows that the iterates are linearly convergent. ∎

A.4 Proof of Proposition 3.5

Proof. Using the SVD, we have

According to the definition of the diagonal singular value matrix, the trailing components of the transformed iterates are zero. Therefore, we focus on their leading components. Let the reduced matrix consist of the leading rows and columns of the singular value matrix. Then we have

(A.15)
(A.16)

Define