LOGAN: Latent Optimisation for Generative Adversarial Networks

Yan Wu et al. · 12/02/2019

Training generative adversarial networks requires balancing delicate adversarial dynamics. Even with careful tuning, training may diverge or end up in a bad equilibrium with dropped modes. In this work, we introduce a new form of latent optimisation inspired by the CS-GAN and show that it improves adversarial dynamics by enhancing interactions between the discriminator and the generator. We develop supporting theoretical analysis from the perspectives of differentiable games and stochastic approximation. Our experiments demonstrate that latent optimisation can significantly improve GAN training, obtaining state-of-the-art performance on the ImageNet (128 × 128) dataset. Our model achieves an Inception Score (IS) of 148 and a Fréchet Inception Distance (FID) of 3.4, an improvement of 17% and 32% respectively, compared with the baseline BigGAN-deep model with the same architecture and number of parameters.


1 Introduction

Generative Adversarial Nets (GANs) are implicit generative models that can be trained to match a given data distribution. GANs were originally developed by Goodfellow et al. (2014) for image data. As the field of generative modelling has advanced, GANs have remained at the frontier, generating high-fidelity images at large scale (Brock et al., 2018). However, despite growing insights into the dynamics of GAN training, most recent advances in large-scale image generation have come from network architecture improvements (Radford et al., 2015; Zhang et al., 2019) or regularisation of particular parts of the model (Miyato et al., 2018; Miyato and Koyama, 2018). Inspired by the compressed sensing GAN (CS-GAN; Wu et al., 2019), we exploit the benefit of latent optimisation in adversarial games, using natural gradient descent to optimise the latent variable z at each step of training. This results in a scalable and easy-to-implement approach that improves the dynamic interaction between the discriminator and the generator. For clarity, we unify these approaches as latent-optimised GANs (LOGAN).

To summarise our contributions:

  1. We present a novel analysis of latent optimisation in GANs from the perspective of differentiable games and stochastic approximation (Balduzzi et al., 2018; Heusel et al., 2017). We argue that latent optimisation can improve the dynamics of adversarial training.

  2. Motivated by this analysis, we improve latent optimisation by taking advantage of efficient second-order updates.

  3. Our algorithm improves the state-of-the-art BigGAN-deep model (Brock et al., 2018) by a significant margin, without introducing any architectural change or additional parameters. The result is higher-quality images and more diverse samples (Figures 1 and 2).

Figure 1: Samples from BigGAN-deep (a) and LOGAN (b) with similarly high IS. Samples from the two panels were drawn from truncation levels corresponding to points C and D in Figure 3b respectively. (FID/IS: (a) 27.97/259.4, (b) 8.19/259.9)
Figure 2: Samples from BigGAN-deep (a) and LOGAN (b) with similarly low FID. Samples from the two panels were drawn from truncation levels corresponding to points A and B in Figure 3b respectively. (FID/IS: (a) 5.04/126.8, (b) 5.09/217.0)

2 Background

2.1 Notation

We use θ_G and θ_D to denote the vectors representing parameters of the generator and discriminator. We use x for images, and z for the latent source generating an image. We use a prime to denote a variable after one update step, e.g., z′. p(x) and p(z) denote the data distribution and source distribution respectively. E_{p(x)}[f(x)] indicates taking the expectation of the function f(x) over the distribution p(x).

2.2 Generative Adversarial Nets

        BigGAN-deep    baseline    LOGAN (GD)    LOGAN (NGD)
FID     5.7                                      3.4
IS      124.5                                    148
Table 1: Comparison of model scores. BigGAN-deep results are reproduced from Brock et al. (2018); "baseline" indicates our reproduced BigGAN-deep with small modifications. The 3rd and 4th columns are from the gradient descent (GD, ablated) and natural gradient descent (NGD) versions of LOGAN respectively. We report the Inception Score (IS, higher is better; Salimans et al. 2016) and Fréchet Inception Distance (FID, lower is better; Heusel et al. 2017).

A GAN consists of a generator G that generates an image x = G(z; θ_G) from a latent source z ∼ p(z), and a discriminator D that scores the generated images as D(G(z; θ_G); θ_D) (Goodfellow et al., 2014). Training GANs involves an adversarial game: while the discriminator tries to distinguish generated samples G(z) from data x, the generator tries to fool the discriminator. This procedure can be summarised as the following min-max game:

\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{x \sim p(x)}\!\left[ h_D\!\left( D(x;\, \theta_D) \right) \right] + \mathbb{E}_{z \sim p(z)}\!\left[ h_G\!\left( D(G(z;\, \theta_G);\, \theta_D) \right) \right] \qquad (1)

The exact form of eq. 1 depends on the choice of the loss-shaping functions h_D and h_G (Goodfellow et al., 2014; Arjovsky et al., 2017; Nowozin et al., 2016). To simplify our presentation and analysis, we use the Wasserstein loss (Arjovsky et al., 2017), so that h_D(t) = t and h_G(t) = −t. Our experiments with BigGAN-deep use the hinge loss (Lim and Ye, 2017; Tran et al., 2017), which is identical to this form in its linear regime. Our analysis can be generalised to other losses, as in previous theoretical work (e.g., Arora et al. 2017). To simplify notation, we abbreviate f(z; θ_D, θ_G) = D(G(z; θ_G); θ_D), which may be further simplified as f(z) when the explicit dependency on θ_D and θ_G can be omitted.
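As a concrete reference for these conventions, here is a minimal Python sketch of the two loss choices (written with PyTorch as an illustration; the function names are ours, and this paper's experiments used TensorFlow):

  import torch

  # Wasserstein-style losses matching the convention above: scores are
  # D(x) for real data and D(G(z)) for generated samples.
  def wasserstein_d_loss(scores_real, scores_fake):
      return scores_fake.mean() - scores_real.mean()  # L_D: push fake scores down

  def wasserstein_g_loss(scores_fake):
      return -scores_fake.mean()                      # L_G: push fake scores up

  # Hinge losses used with BigGAN-deep; identical to the Wasserstein form
  # in the linear regime (scores within [-1, 1]).
  def hinge_d_loss(scores_real, scores_fake):
      return (torch.relu(1.0 - scores_real) + torch.relu(1.0 + scores_fake)).mean()

  def hinge_g_loss(scores_fake):
      return -scores_fake.mean()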

Training GANs requires carefully balancing updates to D and G, and is sensitive to both architecture and algorithm choices (Salimans et al., 2016; Radford et al., 2015). A recent milestone is BigGAN (and BigGAN-deep, Brock et al. 2018), which pushed the boundary of high-fidelity image generation by scaling up GANs to an unprecedented level. BigGANs use an architecture based on residual blocks (He et al., 2016), in combination with regularisation mechanisms and self-attention (Saxe et al., 2014; Miyato et al., 2018; Zhang et al., 2019).

Here we aim to improve the adversarial dynamics during training. We focus on the second term in eq. 1, which is at the heart of the min-max game. For clarity, we explicitly write the losses for D as L_D(θ_D, θ_G) and for G as L_G(θ_D, θ_G), so the total loss vector can be written as

L = \begin{bmatrix} L_D \\ L_G \end{bmatrix} = \begin{bmatrix} \mathbb{E}_{p(z)}\!\left[ f(z) \right] \\ -\,\mathbb{E}_{p(z)}\!\left[ f(z) \right] \end{bmatrix} \qquad (2)

Computing the gradients with respect to θ_D and θ_G gives the following gradient, which cannot be expressed as the gradient of any single function (Balduzzi et al., 2018):

g = \begin{bmatrix} \partial L_D / \partial \theta_D \\ \partial L_G / \partial \theta_G \end{bmatrix} = \begin{bmatrix} \mathbb{E}_{p(z)}\!\left[ \partial f(z) / \partial \theta_D \right] \\ -\,\mathbb{E}_{p(z)}\!\left[ \partial f(z) / \partial \theta_G \right] \end{bmatrix} \qquad (3)

The fact that g is not the gradient of a single function implies that gradient updates in GANs can exhibit cycling behaviour, which can slow down or prevent convergence. Balduzzi et al. (2018) refer to vector fields of this form as the simultaneous gradient. Although many GAN models use alternating update rules (e.g., Goodfellow et al. 2014; Brock et al. 2018), following the gradient with respect to θ_D and θ_G alternately in each step, they can still suffer from cycling, so we use the simpler simultaneous gradient (eq. 3) for our analysis (see also Mescheder et al. 2017, 2018).
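The cycling can be seen in a toy example (ours, not from the paper): in the bilinear game where one player minimises x·y and the other maximises it, simultaneous gradient steps rotate around the equilibrium at the origin and spiral outward:

  # G-like player minimises x*y over x; D-like player maximises x*y over y.
  x, y, lr = 1.0, 1.0, 0.1
  for _ in range(100):
      x, y = x - lr * y, y + lr * x   # simultaneous updates
      # each step scales x*x + y*y by (1 + lr*lr), so the iterates spiral
      # away from the equilibrium (0, 0) instead of converging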

Figure 3: (a) Schematic of LOGAN. We first compute a forward pass through G and D with a sampled latent z. Then, we use gradients from the generator loss (dashed red arrow) to compute an improved latent, z′. After we use this optimised latent in a second forward pass, we compute gradients with respect to the model parameters θ_D, θ_G back through the latent optimisation. We use these gradients to update the model. (b) Truncation curves illustrate the FID/IS trade-off for each model, obtained by altering the range of the noise source z. GD: gradient descent. NGD: natural gradient descent. Points A, B, C, D correspond to the samples shown in Figures 1 and 2.

2.3 Latent Optimised GANs

  Input: data distribution p(x), latent distribution p(z), generator G(·; θ_G), discriminator D(·; θ_D), learning rate η, latent step size α, batch size N
  Initialise discriminator and generator parameters θ_D, θ_G
  repeat
     for i = 1 to N do
        Sample z ∼ p(z), x ∼ p(x)
        Compute the gradient g = ∂f(z)/∂z and use it to obtain Δz from eq. 4 (GD) or eq. 12 (NGD)
        Optimise the latent: z′ = [z + Δz]_{−1}^{1}, where [·]_{−1}^{1} indicates clipping the value between −1 and 1
        Compute the generator loss L_G^(i) = −D(G(z′))
        Compute the discriminator loss L_D^(i) = D(G(z′)) − D(x)
     end for
     Compute the batch losses L_D = (1/N) Σ_i L_D^(i) and L_G = (1/N) Σ_i L_G^(i)
     Update θ_D and θ_G with the gradients ∂L_D/∂θ_D, ∂L_G/∂θ_G
  until reaching the maximum number of training steps
Algorithm 1 Latent Optimised GANs with Automatic Differentiation

Inspired by compressed sensing (Candes et al., 2006; Donoho, 2006), Wu et al. (2019) introduced latent optimisation for GANs. We call this type of model a latent-optimised GAN (LOGAN). Latent optimisation has been shown to improve the stability of training as well as the final performance for medium-sized models such as DCGANs and Spectral Normalised GANs (Radford et al., 2015; Miyato et al., 2018). Latent optimisation exploits knowledge from D to guide updates of z. Intuitively, the gradient ∂f(z)/∂z points in the direction that better satisfies the discriminator D, which implies better samples. Therefore, instead of using the randomly sampled z, Wu et al. (2019) use the optimised latent

z' = z + \Delta z, \qquad \Delta z = \alpha\, \frac{\partial f(z)}{\partial z} \qquad (4)

in eq. 1 for training. (We use a single step of gradient-based optimisation during training, and justify this choice in Section 3.) The general algorithm is summarised in Algorithm 1 and illustrated in Figure 3a. We develop the natural gradient descent form of the latent update in Section 4.
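The single-step update with clipping can be sketched as follows (a PyTorch illustration of eq. 4 and Algorithm 1; D and G are assumed to be callables returning per-sample scores and images, and the step size alpha is a placeholder):

  import torch

  def latent_step_gd(D, G, z, alpha):
      # One gradient-descent latent-optimisation step (eq. 4). The gradient
      # graph is kept (create_graph=True) so that training can later
      # back-propagate through this update, as in Algorithm 1.
      z = z.detach().requires_grad_(True)
      f = D(G(z)).sum()                       # sum of per-sample scores f(z)
      (g,) = torch.autograd.grad(f, z, create_graph=True)
      delta_z = alpha * g                     # move z towards higher scores
      return torch.clamp(z + delta_z, -1.0, 1.0)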

3 Analysis of the Algorithm

To understand how latent optimisation improves GAN training, we analyse LOGAN as a 2-player differentiable game following Balduzzi et al. (2018); Gemp and Mahadevan (2018); Letcher et al. (2019). The appendix provides a complementary analysis that relates LOGAN to unrolled GANs (Metz et al., 2016) and stochastic approximation (Heusel et al., 2017; Borkar, 1997).

An important problem with gradient-based optimisation in GANs is that the vector field generated by the losses of the discriminator and generator is not a gradient vector field. It follows that gradient descent is not guaranteed to find a local optimum and can cycle, which can slow down convergence or lead to phenomena like mode collapse and mode hopping. Balduzzi et al. (2018) and Gemp and Mahadevan (2018) proposed Symplectic Gradient Adjustment (SGA) to improve the dynamics of gradient-based methods in adversarial games. For a game with gradient g (eq. 3), we define the Hessian as the second-order derivatives with respect to the parameters, H = ∇_{θ_D, θ_G} g. SGA uses the adjusted gradient

\tilde{g} = g + \lambda\, A^{\top} g \qquad (5)

where λ is a positive constant and A = (H − H^⊤)/2 is the anti-symmetric component of the Hessian. Applying SGA to GANs yields the adjusted updates (see Appendix B.1 for details):

\tilde{g} = \begin{bmatrix} \dfrac{\partial f}{\partial \theta_D} + \lambda\, \dfrac{\partial^{2} f}{\partial \theta_G\, \partial \theta_D}\, \dfrac{\partial f}{\partial \theta_G} \\[2mm] -\dfrac{\partial f}{\partial \theta_G} + \lambda\, \dfrac{\partial^{2} f}{\partial \theta_D\, \partial \theta_G}\, \dfrac{\partial f}{\partial \theta_D} \end{bmatrix} \qquad (6)

(expectations over p(z) are omitted for brevity).

Compared with g in eq. 3, the adjusted gradient has second-order terms reflecting the interactions between D and G. SGA significantly improves GAN training in simple examples (Balduzzi et al., 2018), allowing faster and more robust convergence to stable fixed points (local Nash equilibria). In addition, eq. 6 is loosely related to unrolled GANs (Metz et al., 2016), which we discuss in detail in Appendix B.2. Unfortunately, SGA is expensive to scale, because computing the second-order derivatives with respect to all parameters is expensive.

We can explicitly compute the gradients for the discriminator and generator at z′ after one step of latent optimisation by differentiating f(z′) (where z′ = z + Δz from eq. 4):

g'_D = \frac{\mathrm{d} f(z')}{\mathrm{d}\theta_D} = \frac{\partial f(z')}{\partial \theta_D} + \left(\frac{\partial \Delta z}{\partial \theta_D}\right)^{\!\top} \frac{\partial f(z')}{\partial z'} = \frac{\partial f(z')}{\partial \theta_D} + \alpha \left(\frac{\partial^{2} f(z)}{\partial \theta_D\, \partial z}\right)^{\!\top} \frac{\partial f(z')}{\partial z'} \qquad (7)

g'_G = \frac{\mathrm{d} f(z')}{\mathrm{d}\theta_G} = \frac{\partial f(z')}{\partial \theta_G} + \left(\frac{\partial \Delta z}{\partial \theta_G}\right)^{\!\top} \frac{\partial f(z')}{\partial z'} = \frac{\partial f(z')}{\partial \theta_G} + \alpha \left(\frac{\partial^{2} f(z)}{\partial \theta_G\, \partial z}\right)^{\!\top} \frac{\partial f(z')}{\partial z'} \qquad (8)

In both equations, the first terms represent how f(z′) depends on the parameters directly, and the second terms represent how f(z′) depends on the parameters via the optimised latent source. For the second equality in each, we substitute Δz = α ∂f(z)/∂z as the gradient-based update of z (eq. 4) and use ∂z/∂θ_D = ∂z/∂θ_G = 0, since the sampled z does not depend on the parameters. Differentiating Δz results in the second-order terms ∂²f(z)/∂θ_D∂z and ∂²f(z)/∂θ_G∂z. The original GAN's gradient (eq. 3) does not include any second-order term, since Δz = 0 without latent optimisation. LOGAN computes these extra terms by automatic differentiation when back-propagating through the latent optimisation process (see Algorithm 1).

The SGA updates in eq. 6 and the LOGAN updates in eq. 7 and 8 are strikingly similar, suggesting that the latent step used by LOGAN reduces the negative effects of cycling by introducing a symplectic gradient adjustment into the optimisation procedure. The role of the latent step can be formalised in terms of a third player, whose goal is to help the generator (see Appendix B for details).

Crucially, latent optimisation approximates SGA using only second-order derivatives with respect to the latent z and the parameters of the discriminator and generator separately. The second-order terms involving parameters of both the discriminator and the generator, which are extremely expensive to compute, are not used. For latents z with the dimensions typically used in GANs (e.g., 128–256, orders of magnitude less than the number of parameters), the second-order terms can be computed efficiently. In short, latent optimisation efficiently couples the gradients of the discriminator and generator, as prescribed by SGA, but via the much lower-dimensional latent source z, which makes the adjustment scalable.
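In code, this coupling requires no explicit Hessian: keeping the latent step in the autodiff graph is enough. A sketch, reusing the latent_step_gd function from Section 2.3 and assuming x is a batch of real images, D and G are PyTorch modules, and alpha is the latent step size:

  # Gradients of eq. 7 and 8 via ordinary back-propagation.
  z_opt = latent_step_gd(D, G, z, alpha)       # differentiable update (eq. 4)
  scores_fake = D(G(z_opt))
  loss_d = scores_fake.mean() - D(x).mean()    # Wasserstein form of L_D
  loss_g = -scores_fake.mean()                 # L_G

  # Back-propagating through z_opt yields the second-order terms of eq. 8;
  # detaching z_opt instead would recover the ordinary GAN gradient (eq. 3).
  grads_d = torch.autograd.grad(loss_d, list(D.parameters()), retain_graph=True)
  grads_g = torch.autograd.grad(loss_g, list(G.parameters()))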

Appendix B.3 further shows that latent optimisation accelerates the updates of the discriminator relative to those of the generator, facilitating convergence according to Heusel et al. (2017) (see also Figure 4b). Intuitively, the generator requires smaller updates than it otherwise would to achieve the same reduction of loss, because latent optimisation "helps" it. In summary, our analysis suggests that latent optimisation can improve GAN training dynamics by allowing larger steps of Δz in the direction of ∂f(z)/∂z without overshooting.

Figure 4: (a) Scaling of gradients in natural gradient descent, illustrated for a range of damping factors β including the value used in our BigGAN-deep experiments. (b) The update speed of the discriminator relative to the generator, shown as the difference in update size after each step. Lines are smoothed with a moving average using window size 20 (in total, there are 3007, 1659 and 1768 data points for each curve). All curves oscillated strongly after training collapsed.

4 LOGAN with Natural Gradient Descent

Our analysis suggests that LOGAN benefits from a strong optimiser for updating z. In this work, we use natural gradient descent (NGD; Amari 1998) for latent optimisation. NGD is an approximate second-order optimisation method (Pascanu and Bengio, 2013; Martens, 2014), and has been applied successfully in many domains. By using the positive semi-definite (PSD) Gauss-Newton matrix to approximate the (possibly negative-definite) Hessian, NGD often works even better than exact second-order methods. NGD is expensive in high-dimensional parameter spaces, even with approximations (Martens, 2014). However, we demonstrate that it is efficient for latent optimisation, even in very large models.

Given the gradient of z, g = ∂f(z)/∂z, NGD computes the update as

\Delta z = \alpha\, F^{-1} g \qquad (9)

where the Fisher information matrix F is defined as

F = \mathbb{E}_{p(t|z)}\!\left[ \nabla_z \ln p(t|z)\; \nabla_z \ln p(t|z)^{\top} \right] \qquad (10)

The log-likelihood function ln p(t|z) typically corresponds to a commonly used error function such as the cross-entropy loss. This correspondence is not necessary when we interpret NGD as an approximate second-order method, as has long been done (Martens, 2014). Nevertheless, Appendix C provides a Poisson log-likelihood interpretation of the hinge loss commonly used in GANs (Lim and Ye, 2017; Tran et al., 2017). An important difference between latent optimisation and commonly seen scenarios using NGD is that the expectation over the condition (z) is absent: since each z is only responsible for generating one image, it only minimises the loss for this particular instance.

More specifically, we use the empirical Fisher F′ with Tikhonov damping, as in TONGA (Roux et al., 2008):

F' = g\, g^{\top} + \beta\, I \qquad (11)

F′ is cheaper to compute than the full Fisher, since g is already available. The damping factor β regularises the step size, which is important when F′ only poorly approximates the Hessian or when the Hessian changes too much across the step. Using the Sherman-Morrison formula, the NGD update can be simplified into the following closed form:

\Delta z = \alpha\, F'^{-1} g = \frac{\alpha}{\beta + \lVert g \rVert^{2}}\; g \qquad (12)

which does not involve any matrix inversion. Thus, NGD adapts the step size according to the curvature estimate ‖g‖². When β is small, NGD normalises the gradient by its squared L2 norm, which is equivalent to whitening here since we assume all samples of z are independent. Figure 4a illustrates the scaling factor α/(β + ‖g‖²) for different values of β. NGD automatically smooths the scale of updates by down-scaling the gradients as their norm grows, which also contributes to the smoothed norms of updates (Figure 4b). Since the NGD update remains proportional to g, our analysis based on gradient descent in Section 3 still holds.
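A sketch of the NGD latent step using the closed form of eq. 12 (PyTorch; alpha and beta are the step size and damping factor, with values left to the caller):

  import torch

  def latent_step_ngd(D, G, z, alpha, beta):
      # Natural gradient descent step for z via eq. 12: with the damped
      # empirical Fisher F' = g g^T + beta I, the Sherman-Morrison identity
      # gives F'^{-1} g = g / (beta + ||g||^2), so no inversion is needed.
      z = z.detach().requires_grad_(True)
      f = D(G(z)).sum()
      (g,) = torch.autograd.grad(f, z, create_graph=True)
      # per-sample squared norms: the Fisher uses only the current sample
      sq_norm = (g * g).sum(dim=1, keepdim=True)
      delta_z = alpha * g / (beta + sq_norm)
      return torch.clamp(z + delta_z, -1.0, 1.0)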

Figure 5: (a) The change from z to z′ across training, measured in G's output space (‖G(z′) − G(z)‖) and in z's Euclidean space (‖Δz‖). The distances are normalised by their standard deviations computed from a moving window (see Appendix D). (b) Training curves from models with different "stop_gradient" operations. For reference, the training curve from an unablated model is plotted as the dashed line. All instances with stop_gradient collapsed (FID went up) early in training.

5 Experiments and Analysis

We focus on large-scale GANs based on BigGAN-deep (Brock et al., 2018) trained on 128 × 128 images from the ImageNet dataset (Deng et al., 2009). In Appendix E, we present results from applying our algorithm to Spectral Normalised GANs trained on the CIFAR dataset (Krizhevsky et al., 2009), which obtains state-of-the-art scores for this model.

5.1 Model Configuration

We used the standard BigGAN-deep architecture with three minor modifications: 1. We increased the size of the latent source from 128 to 256 dimensions, to compensate for the randomness of the source lost when optimising z. 2. We use the uniform distribution U(−1, 1) instead of the standard normal distribution N(0, I) for p(z), to be consistent with the clipping operation (Algorithm 1). 3. We use leaky ReLU instead of ReLU as the non-linearity, for smoother gradient flow through ∂f(z)/∂z.

Consistent with detailed findings in Brock et al. (2018), our experiments with this baseline model obtain only slightly better scores than those in Brock et al. (2018) (Table 1; see also Figure 8). We computed the FID and IS as in Brock et al. (2018), and computed IS values from checkpoints with the lowest FIDs. Finally, we computed the means and standard deviations of both measures from 5 models with different random seeds.

To apply latent optimisation, we use a damping factor β combined with a large step size α (eq. 12). As an additional way of damping, we only optimise a fixed fraction of z's dimensions; optimising the entire population of z was unstable in our experiments. Similar to Wu et al. (2019), we found it helpful to regularise the Euclidean norm of the weight change, with a fixed regulariser weight. All other hyper-parameters, including learning rates and a large batch size of 2048, remain the same as in BigGAN-deep; we did not optimise these hyper-parameters. We call this model LOGAN (NGD).
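The partial-dimension damping can be sketched with a simple mask (the fraction below is illustrative only, not the paper's setting; z and delta_z are as in the earlier sketches):

  # Only part of z receives the latent update; the rest keeps its sampled
  # value, which preserves some randomness of the source.
  mask = torch.zeros_like(z)
  mask[:, : z.shape[1] // 2] = 1.0      # illustrative: optimise half the dims
  z_opt = torch.clamp(z + mask * delta_z, -1.0, 1.0)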

5.2 Basic Results

Employing the same architecture and number of parameters as the BigGAN-deep baseline, LOGAN (NGD) achieved better FID and IS (Table 1). As observed by Brock et al. (2018), BigGAN training eventually collapsed in every experiment. Training with LOGAN also collapsed, perhaps due to higher-order dynamics beyond the scope we have analysed, but took significantly longer (600k steps versus 300k steps with BigGAN-deep).

During training, LOGAN was slower per step than BigGAN-deep because of the additional forward and backward pass. We found that optimising z during evaluation did not improve sample scores (even with up to 10 steps), so we do not optimise z for evaluation; LOGAN therefore has the same evaluation cost as the original BigGAN-deep. To help understand this behaviour, we plot the change from z to z′ during training in Figure 5a. Although the movement in Euclidean space, ‖Δz‖, grew until training collapsed, the movement in G's output space, measured as ‖G(z′) − G(z)‖, remained unchanged (see Appendix D for details). As shown in our analysis, optimising z improves the training dynamics, so LOGANs work well after training without requiring latent optimisation.

5.3 Ablation Studies

We verify our theoretical analysis from Section 3 by examining key components of Algorithm 1 in ablation studies. First, we experiment with using basic GD to optimise z, as in Wu et al. (2019), and call this model LOGAN (GD). A smaller step size was required; larger values were unstable and led to premature collapse of training. As shown in Table 1, the scores from LOGAN (GD) were worse than LOGAN (NGD) and similar to the baseline model.

We then evaluate the effects of removing the terms in eq. 8 that depend on Δz, which are absent from the ordinary gradient (eq. 3). Since we compute these terms by back-propagating through the latent optimisation procedure, we removed them by selectively blocking back-propagation with "stop_gradient" operations (e.g., in TensorFlow, Abadi et al. 2016). Figure 5b shows the change of FIDs for the three models corresponding to removing the θ_D term, removing the θ_G term, and removing both terms. As predicted by our analysis (Section 3), both terms help stabilise training; training diverged early for all three ablations.

5.4 Truncation and Samples

Truncation is a technique introduced by Brock et al. (2018) to illustrate the trade-off between FID and IS in a trained model. For a model trained with z from a source distribution symmetric around 0, such as the standard normal N(0, I) or the uniform U(−1, 1), down-scaling (truncating) the source by a factor s ∈ [0, 1] gives samples with higher visual quality but reduced diversity. This trade-off is quantified by evaluating IS and FID on samples drawn from truncated distributions.
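For a uniform source, truncation amounts to re-scaling the samples (a sketch; the helper name is ours):

  import torch

  def truncated_sample(batch_size, dim, s):
      # Down-scale a symmetric uniform source by s in [0, 1]: smaller s
      # trades sample diversity for visual quality.
      z = 2.0 * torch.rand(batch_size, dim) - 1.0   # z ~ U(-1, 1)
      return s * z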

Figure 3b plots the truncation curves for the baseline BigGAN-deep model, LOGAN (GD) and LOGAN (NGD), obtained by varying the truncation s from 1 (no truncation, upper-left ends of the curves) to 0 (extreme truncation, bottom-right ends). Each curve shows the trade-off between FID and IS for an individual model; curves towards the upper-right corner indicate better overall sample quality. The relative positions of the curves in Figure 3b show that LOGAN (NGD) has the best sample quality. Interestingly, although LOGAN (GD) and the baseline model have similar scores without truncation (upper-left ends of the curves; see also Table 1), LOGAN (GD) was better behaved with increasing truncation, suggesting that it still converged to a better equilibrium. For further reference, we plot truncation curves from additional baseline models in Figure 8.

Figures 1 and 2 show samples from chosen points on the truncation curves. In the high-IS regime, points C and D on the truncation curves both have similarly high IS near 260; samples from batches with such high IS have almost photo-realistic quality. Figure 1 shows that while the baseline model produced nearly uniform samples, LOGAN (NGD) could still generate highly diverse samples. On the other hand, points A and B from Figure 3b have similarly low FID near 5, indicating high sample diversity. Samples in Figure 2b show higher quality than those in Figure 2a (e.g., the interfaces between the elephants and the ground, the contours around the pandas).

6 Conclusion

In this work, we present the LOGAN model, which significantly improves the state-of-the-art on large-scale GAN training for image generation by optimising the latent source z. Our results illustrate improvements in quantitative evaluation and in sample quality and diversity. Moreover, our analysis suggests that LOGAN fundamentally improves adversarial training dynamics. We therefore expect our method to be useful in other tasks that involve adversarial training, including representation learning and inference (Donahue et al., 2017; Dumoulin et al., 2017), text generation (Zhang et al., 2019), style learning (Zhu et al., 2017; Karras et al., 2019), audio generation (Donahue et al., 2018) and video generation (Vondrick et al., 2016; Clark et al., 2019).

References

  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
  • S. Amari (1998) Natural gradient works efficiently in learning. Neural Computation 10 (2), pp. 251–276.
  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875.
  • S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang (2017) Generalization and equilibrium in generative adversarial nets. In Proceedings of the 34th International Conference on Machine Learning, pp. 224–232.
  • D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel (2018) The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642.
  • V. S. Borkar (1997) Stochastic approximation with two time scales. Systems & Control Letters 29 (5), pp. 291–294.
  • A. Brock, J. Donahue, and K. Simonyan (2018) Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
  • E. J. Candes, J. K. Romberg, and T. Tao (2006) Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics 59 (8), pp. 1207–1223.
  • A. Clark, J. Donahue, and K. Simonyan (2019) Efficient video generation on complex datasets. arXiv preprint arXiv:1907.06571.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
  • C. Donahue, J. McAuley, and M. Puckette (2018) Adversarial audio synthesis. arXiv preprint arXiv:1802.04208.
  • J. Donahue, P. Krähenbühl, and T. Darrell (2017) Adversarial feature learning. In ICLR.
  • D. L. Donoho (2006) Compressed sensing. IEEE Transactions on Information Theory 52 (4), pp. 1289–1306.
  • V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville (2017) Adversarially learned inference. In ICLR.
  • I. Gemp and S. Mahadevan (2018) Global convergence to the equilibrium of GANs using variational inequalities. arXiv preprint arXiv:1808.01531.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
  • M. W. Hirsch (1989) Convergent activation dynamics in continuous time networks. Neural Networks 2 (5), pp. 331–349.
  • T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410.
  • V. R. Konda and V. S. Borkar (1999) Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization 38 (1), pp. 94–123.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report.
  • A. Letcher, D. Balduzzi, S. Racanière, J. Martens, J. N. Foerster, K. Tuyls, and T. Graepel (2019) Differentiable game mechanics. Journal of Machine Learning Research 20 (84), pp. 1–40.
  • J. H. Lim and J. C. Ye (2017) Geometric GAN. arXiv preprint arXiv:1705.02894.
  • J. Martens (2014) New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193.
  • L. Mescheder, A. Geiger, and S. Nowozin (2018) Which training methods for GANs do actually converge? In International Conference on Machine Learning, pp. 3478–3487.
  • L. Mescheder, S. Nowozin, and A. Geiger (2017) The numerics of GANs. In Advances in Neural Information Processing Systems 30, pp. 1825–1835.
  • L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein (2016) Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
  • T. Miyato and M. Koyama (2018) cGANs with projection discriminator. arXiv preprint arXiv:1802.05637.
  • S. Nowozin, B. Cseke, and R. Tomioka (2016) f-GAN: training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279.
  • R. Pascanu and Y. Bengio (2013) Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584.
  • B. A. Pearlmutter (1994) Fast exact multiplication by the Hessian. Neural Computation 6 (1), pp. 147–160.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • N. L. Roux, P. Manzagol, and Y. Bengio (2008) Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems, pp. 849–856.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
  • A. M. Saxe, J. L. McClelland, and S. Ganguli (2014) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR.
  • D. Tran, R. Ranganath, and D. Blei (2017) Hierarchical implicit models and likelihood-free variational inference. In Advances in Neural Information Processing Systems, pp. 5523–5533.
  • C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pp. 613–621.
  • Y. Wu, M. Rosca, and T. Lillicrap (2019) Deep compressed sensing. In International Conference on Machine Learning, pp. 6850–6860.
  • H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354–7363.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.

Appendix A Additional Samples and Results

Figures 6 and 7 provide additional samples, organised similarly to Figures 1 and 2. Figure 8 shows additional truncation curves.

Figure 6: Samples from BigGAN-deep (a) and LOGAN (b) with similarly high Inception Scores. Samples from the two panels were drawn from truncation levels corresponding to points C and D in Figure 3b. (FID/IS: (a) 27.97/259.4, (b) 8.19/259.9)
Figure 7: Samples from BigGAN-deep (a) and LOGAN (b) with similarly low FID. Samples from the two panels were drawn from truncation levels corresponding to points A and B in Figure 3b. (FID/IS: (a) 5.04/126.8, (b) 5.09/217.0)
Figure 8: Truncation curves with additional baselines. In addition to the truncation curves reported in Figure 3b, here we also include the Spectral-Normalised GAN (Miyato et al., 2018), Self-Attention GAN (Zhang et al., 2019), original BigGAN and BigGAN-deep as presented in Brock et al. (2018).

Appendix B Detailed Analysis of Latent Optimisation

In this section we present three complementary analyses of LOGAN. In particular, we show how the algorithm brings together ideas from symplectic gradient adjustment, unrolled GANs and stochastic approximation with two time scales.

B.1 Approximate Symplectic Gradient Adjustment

To analyse LOGAN as a differentiable game we treat the latent step as adding a third player to the original game played by the discriminator and generator. The third player's parameter, z, is optimised online for each sample. Together the three players (latent player, discriminator, and generator) have losses averaged over a batch of samples:

\ell = \left[\, c\, L_G(z),\;\; \frac{1}{N}\sum_{j=1}^{N} L_D(z_j),\;\; \frac{1}{N}\sum_{j=1}^{N} L_G(z_j) \,\right]^{\top} \qquad (13)

where c = 1/N (N is the batch size) reflects the fact that each z is only optimised for a single sample, so its contribution to the total loss across a batch is small compared with θ_D and θ_G, which are directly optimised for batch losses. This choice of c is essential for the following derivation, and has an important practical implication. It means that the per-sample loss L_G(z), instead of the loss summed over a batch, should be the only loss function guiding latent optimisation. Therefore, when using natural gradient descent (Section 4), the Fisher information matrix should only be computed using the current sample z.

The resulting simultaneous gradient is

g = \left[\, c\,\frac{\partial L_G(z)}{\partial z},\;\; \frac{\partial L_D}{\partial \theta_D},\;\; \frac{\partial L_G}{\partial \theta_G} \,\right]^{\top} \qquad (14)

Following Balduzzi et al. (2018), we can write the Hessian of the game as:

H = \begin{bmatrix} c\,\dfrac{\partial^2 L_G}{\partial z^2} & c\,\dfrac{\partial^2 L_G}{\partial \theta_D \partial z} & c\,\dfrac{\partial^2 L_G}{\partial \theta_G \partial z} \\[1mm] c\,\dfrac{\partial^2 L_D}{\partial z \partial \theta_D} & \dfrac{\partial^2 L_D}{\partial \theta_D^2} & \dfrac{\partial^2 L_D}{\partial \theta_G \partial \theta_D} \\[1mm] c\,\dfrac{\partial^2 L_G}{\partial z \partial \theta_G} & \dfrac{\partial^2 L_G}{\partial \theta_D \partial \theta_G} & \dfrac{\partial^2 L_G}{\partial \theta_G^2} \end{bmatrix} \qquad (15)

The presence of a non-zero anti-symmetric component in the Hessian,

A = \frac{1}{2}\left( H - H^{\top} \right) \neq 0, \qquad (16)

implies the dynamics have a rotational component which can cause cycling or slow down convergence. Since c = 1/N ≪ 1 for typical batch sizes (e.g., N = 2048 in our BigGAN-deep experiments), we drop the sample index on z to simplify notation.

Symplectic gradient adjustment (SGA) counteracts the rotational force by adding an adjustment term to the gradient to obtain g̃ = g + λ A^⊤ g, which for the discriminator and generator has the form:

(17)
(18)

The gradient with respect to z is ignored, since the convergence of training only depends on θ_D and θ_G.

If we drop the last terms in eq. 17 and 18, which are expensive to compute for large models with high-dimensional θ_D and θ_G, and absorb the constants λ and c into a single step-size parameter, the adjusted updates can be rewritten as

(19)
(20)

Because of the third player, there are still terms that depend on z adjusting the gradients. Efficiently computing the Hessian-vector products in eq. 19 and 20 is non-trivial (e.g., Pearlmutter 1994). However, if we introduce the local approximation

\frac{\partial f(z')}{\partial z'} \approx \frac{\partial f(z)}{\partial z}, \qquad (21)

then the adjusted gradient becomes identical to eq. 8 from latent optimisation.

In other words, automatic differentiation in commonly used machine learning packages can compute the adjusted gradients for θ_D and θ_G when back-propagating through the latent optimisation process. Despite the approximation involved in this analysis, both our experiments in Section 5 and the results from Wu et al. (2019) verify that latent optimisation can significantly improve GAN training.

B.2 Relation with Unrolled GANs

Latent optimisation can be seen as unrolling GANs (Metz et al., 2016) in the space of the latent z, rather than the parameters. Unrolling in the latent space has the advantages that:

  1. LOGAN is more scalable than Unrolled GANs because it avoids second-order derivatives over a potentially very large number of parameters.

  2. While unrolling the update of θ_D only affects the gradient for θ_G (as in Metz et al. 2016), latent optimisation affects both θ_D and θ_G, as shown in eq. 7 and 8.

We next formally present this connection by showing that SGA can be seen as approximating unrolled GANs (Metz et al., 2016). For the update Δz, we have the first-order Taylor approximation of f at z:

f(z + \Delta z) \approx f(z) + \Delta z^{\top}\, \frac{\partial f(z)}{\partial z} \qquad (22)

Substituting Δz = α ∂f(z)/∂z, and taking the derivatives with respect to θ_G on both sides:

(23)

which is the same as eq. 18 (taking the negative sign). Compared with the exact gradient from the unroll:

(24)

The approximation in eq. 23 comes from evaluating f and its gradient at z rather than at z′, as a result of the linear approximation.

At this point, unrolling the latent update only adjusts the gradient for θ_G. Although it is expensive to unroll both θ_D and θ_G, in principle we can unroll the update of θ_G and compute the adjusted gradient for θ_D similarly, using a Taylor expansion with respect to θ_G:

(25)

which gives us the same update rule as SGA (eq. 17). This correspondence based on first order Taylor expansion is unsurprising, as SGA is based on linearising the adversarial dynamics (Balduzzi et al., 2018).

B.3 Stochastic Approximation with Two Time Scales

Heusel et al. (2017) used the theory of stochastic approximation to analyse GAN training. Viewing the training process as stochastic approximation with two time scales (Borkar, 1997; Konda and Borkar, 1999), they suggest that the update of the discriminator should be fast enough compared with that of the generator. Under mild assumptions, Heusel et al. (2017) proved that such a two time-scale update converges to a local Nash equilibrium. Their analysis follows the idea of perturbation (Hirsch, 1989), where the slow updates (of G) are interpreted as a small perturbation of the ODE describing the fast updates (of D). Importantly, the size of the perturbation is measured by the magnitude of parameter change, which is affected by both the learning rate and the gradients.

Here we show, in accordance with Heusel et al. (2017), that LOGAN accelerates discriminator updates and slows down generator updates, thus helping the convergence of the discriminator. We start by analysing the change of θ_G. We assume that, without latent optimisation, it takes an update Δθ_G to make a small constant amount of reduction ε in the loss L_G:

L_G(z;\, \theta_G + \Delta\theta_G) - L_G(z;\, \theta_G) = -\epsilon \qquad (26)

Now, using the optimised latent z′, we assess the change Δθ′_G required to achieve the same amount of reduction:

L_G(z';\, \theta_G + \Delta\theta_G') - L_G(z';\, \theta_G) = -\epsilon \qquad (27)

Intuitively, when Δz "helps" θ_G achieve the same goal of increasing f(z) by ε, the responsibility of θ_G becomes smaller, so it does not need to change as much as Δθ_G; thus ‖Δθ′_G‖ ≤ ‖Δθ_G‖.

Formally, the losses at the updated parameters have the following Taylor expansions around θ_G:

L_G(z;\, \theta_G + \Delta\theta_G) = L_G(z;\, \theta_G) + \Delta\theta_G^{\top}\, \frac{\partial L_G(z)}{\partial \theta_G} + R(\Delta\theta_G) \qquad (28)
L_G(z';\, \theta_G + \Delta\theta_G') = L_G(z';\, \theta_G) + \Delta\theta_G'^{\top}\, \frac{\partial L_G(z')}{\partial \theta_G} + R'(\Delta\theta_G') \qquad (29)

where the R's are higher-order terms in the increments. Using the assumptions of eq. 26 and 27, we can combine eq. 28 and 29:

(30)

where R and R′ collect the residual terms. Since the parameter updates follow gradient descent (eq. 3),

(31)

Therefore, we have the inequality

(32)

If we further assume Δθ_G and Δθ′_G are obtained from stochastic gradient descent with identical learning rate η,

\Delta\theta_G = -\eta\, \frac{\partial L_G(z)}{\partial \theta_G}, \qquad \Delta\theta_G' = -\eta\, \frac{\partial L_G(z')}{\partial \theta_G} \qquad (33)

substituting eq. 33 into eq. 32 gives

\lVert \Delta\theta_G' \rVert \;\le\; \lVert \Delta\theta_G \rVert \qquad (34)

The same analysis applies to the discriminator. The corresponding intuition is that it takes the discriminator additional effort to compensate for the exploitation by the optimised z′. We then obtain

\lVert \Delta\theta_D' \rVert \;\ge\; \lVert \Delta\theta_D \rVert \qquad (35)

However, since the adversarial losses satisfy L_D = −L_G, the discriminator's updates take the opposite signs of eq. 33. For sufficiently small ε, Δθ_G and Δθ′_G, the higher-order terms are close to zero, so the inequalities above hold under our assumptions.

Importantly, the larger the product terms in these inequalities are, the more robust the inequalities are to the error from the higher-order terms. Moreover, bigger steps Δz increase the speed gap between updating D and G, further facilitating convergence (in accordance with Heusel et al. 2017). Overall, our analysis suggests:

  1. More than one gradient descent step may not be helpful, since Δz accumulated over multiple GD steps may deviate from the direction of ∂f(z)/∂z.

  2. A large step Δz is helpful in facilitating convergence by widening the gap between D and G updates (Heusel et al., 2017).

  3. However, the step Δz cannot be too large. In addition to the linear approximation we used throughout our analysis, the approximate SGA breaks down when eq. 21 is strongly violated, i.e., when "overshooting" brings the gradient at z′ to the opposite sign of ∂f(z)/∂z.

Appendix C Poisson Likelihood from Hinge loss

Here we provide a probabilistic interpretation of the hinge loss for the generator, which leads naturally to the scenario of a family of discriminators. Although this interpretation is not necessary for our current algorithm, it may provide useful guidance for incorporating multiple discriminators.

We introduce the label t, with t = 1 for real data and t = 0 for fake samples. This section shows that the generator hinge loss

L_G = -D(G(z)) \qquad (36)

can be interpreted as a negative log-likelihood function:

L_G = -\ln p(t = 1 \mid z) \qquad (37)

Here p(t = 1 | z) is the probability that the generated image G(z) can fool the discriminator D.

The original GAN's discriminator can be interpreted as outputting a Bernoulli distribution p(t | x). In this case, if we parameterise p(t = 1 | z) = D(G(z)), the generator loss is the negative log-likelihood

L_G = -\ln D(G(z)) \qquad (38)

The Bernoulli, however, is not the only valid choice for the discriminator's output distribution. Instead of sampling "1" or "0", we assume that there are many identical discriminators that can independently vote to reject an input sample as fake. The number of votes V in a given interval can be described by a Poisson distribution with parameter λ, with the following PMF:

p(V = k) = \frac{\lambda^{k}\, e^{-\lambda}}{k!} \qquad (39)

The probability that a generated image can fool all the discriminators is the probability of receiving no vote for rejection:

p(V = 0) = e^{-\lambda} \qquad (40)

Therefore, we have the following negative log-likelihood as the generator loss if we parameterise λ = −D(G(z)):

L_G = -\ln p(V = 0) = \lambda = -D(G(z)) \qquad (41)

This interpretation has the caveat that the Poisson distribution is not well defined when λ = −D(G(z)) < 0. However, in general the discriminator's hinge loss

L_D = \mathbb{E}_{p(x)}\!\left[ \max\!\left(0,\, 1 - D(x)\right) \right] + \mathbb{E}_{p(z)}\!\left[ \max\!\left(0,\, 1 + D(G(z))\right) \right] \qquad (42)

pushes D(G(z)) towards values below −1, so that λ ≥ 1, via training.
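A quick numerical check of this correspondence (our illustration; the score value is arbitrary):

  import math

  # With lam = -D(G(z)), the negative log-probability of receiving no
  # rejection vote (eq. 40) equals the hinge generator loss -D(G(z)).
  d_score = -2.0              # discriminator output for a fake sample
  lam = -d_score              # Poisson rate of rejection votes
  p_fool = math.exp(-lam)     # P(V = 0), eq. 40
  assert abs(-math.log(p_fool) - (-d_score)) < 1e-12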

Appendix D Details in Computing Distances in Figure 5 a

For a temporal sequence x_1, x_2, … (e.g., the changes of ‖Δz‖ or ‖G(z′) − G(z)‖ at each training step in this paper), to normalise its variance while accounting for non-stationarity, we process it as follows. We first compute the moving average and standard deviation over a trailing window of size T:

\mu_t = \frac{1}{T} \sum_{\tau = t-T+1}^{t} x_\tau \qquad (43)
\sigma_t = \sqrt{\frac{1}{T} \sum_{\tau = t-T+1}^{t} \left( x_\tau - \mu_t \right)^2} \qquad (44)

Then we normalise the sequence as:

\tilde{x}_t = \frac{x_t}{\sigma_t} \qquad (45)

The result in Figure 5a is robust to the choice of window size; our experiments with a range of window sizes yielded visually similar plots.
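A sketch of this normalisation in Python (trailing windows; the epsilon guard is ours):

  import numpy as np

  def normalise_nonstationary(x, window):
      # Divide each point by the standard deviation of a trailing window
      # (eq. 43-45), normalising variance while tracking non-stationarity.
      x = np.asarray(x, dtype=float)
      out = np.empty_like(x)
      for t in range(len(x)):
          lo = max(0, t - window + 1)
          out[t] = x[t] / (np.std(x[lo:t + 1]) + 1e-8)
      return out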

Appendix E Experiments with DCGAN and CIFAR

To test if latent optimisation works with models at more moderate scales, we apply it to SN-GANs (Miyato et al., 2018). Although our experiments on this model are less thorough than in the main paper with BigGAN-deep, we hope to provide basic guidelines for researchers interested in applying latent optimisation on smaller models.

The experiments follow the same basic setup and hyper-parameter settings as the CS-GAN in Wu et al. (2019); there is no class conditioning in this model. For NGD, we found that a smaller damping factor β and a small weight-change regulariser (other parameters the same as in Wu et al. 2019), combined with optimising a different fraction of the latent source than in the BigGAN-deep experiments, worked best for SN-GANs.

In addition, we found that running extra latent optimisation steps benefited evaluation, so we use ten steps of latent optimisation in evaluation for the results in this section, although the models were still trained with a single optimisation step. We reckon that smaller models might not be "over-parametrised" enough to fully amortise the computation from optimising z, which can then further exploit the architecture at evaluation time. On the other hand, the overhead of running multiple iterations of latent optimisation is relatively small at this scale. We aim to further investigate this difference in future studies.

Table 2 shows the FID and IS alongside SN-GAN and CS-GAN, which used the same architecture. Here we observe a similarly significant improvement over the baseline SN-GAN model in both IS and FID. Figure 9 shows random samples from these two models. Overall, samples from LOGAN (NGD) have higher contrast and sharper contours.

        SN-GAN    CS-GAN    LOGAN (NGD)
FID
IS
Table 2: Comparison of scores. The first and second columns are reproduced from Miyato et al. (2018) and Wu et al. (2019) respectively. We report the Inception Score (IS, higher is better; Salimans et al. 2016) and Fréchet Inception Distance (FID, lower is better; Heusel et al. 2017).
Figure 9: (a) Samples from SN-GAN. (b) Samples from LOGAN (NGD).