1 Introduction
Generative Adversarial Nets (GANs) are implicit generative models that can be trained to match a given data distribution. GANs were originally developed by Goodfellow et al. (2014) for image data. As the field of generative modelling has advanced, GANs remain at the frontier, generating high-fidelity images at large scale (Brock et al., 2018). However, despite growing insights into the dynamics of GAN training, most recent advances in large-scale image generation come from network architecture improvements (Radford et al., 2015; Zhang et al., 2019), or regularisation of particular parts of the model (Miyato et al., 2018; Miyato and Koyama, 2018). Inspired by the compressed sensing GAN (CSGAN; Wu et al., 2019), we exploit the benefit of latent optimisation in adversarial games, using natural gradient descent to optimise the latent variable at each step of training. This results in a scalable and easy-to-implement approach that improves the dynamic interaction between the discriminator and the generator. For clarity, we unify these approaches as latent optimised GANs (LOGAN).
To summarise our contributions:

Motivated by this analysis, we improve latent optimisation by taking advantage of efficient second-order updates.
2 Background
2.1 Notation
We use $\theta_D$ and $\theta_G$ to denote the vectors representing parameters of the discriminator and generator. We use $x$ for images, and $z$ for the latent source generating an image. We use prime to denote a variable after one update step, e.g., $\theta_D'$. $p(x)$ and $p(z)$ denote the data distribution and source distribution respectively. $\mathbb{E}_{p(x)}[f(x)]$ indicates taking the expectation of function $f(x)$ over the distribution $p(x)$.
2.2 Generative Adversarial Nets
Table 1: FID and IS for BigGAN-deep, the baseline, LOGAN (GD) and LOGAN (NGD).
A GAN consists of a generator that generates image $x = G(z; \theta_G)$ from a latent source $z \sim p(z)$, and a discriminator that scores the generated images as $D(x; \theta_D)$ (Goodfellow et al., 2014). Training GANs involves an adversarial game: while the discriminator tries to distinguish generated samples $x = G(z; \theta_G)$ from data $x \sim p(x)$, the generator tries to fool the discriminator. This procedure can be summarised as the following min-max game:
$$\min_{\theta_D} \max_{\theta_G} \; \mathbb{E}_{x \sim p(x)}\left[h_D\big(D(x; \theta_D)\big)\right] + \mathbb{E}_{z \sim p(z)}\left[h_G\big(D(G(z; \theta_G); \theta_D)\big)\right] \tag{1}$$
The exact form of $h(\cdot)$ depends on the choice of loss function (Goodfellow et al., 2014; Arjovsky et al., 2017; Nowozin et al., 2016). To simplify our presentation and analysis, we use the Wasserstein loss (Arjovsky et al., 2017), so that $h_D(t) = -t$ and $h_G(t) = t$. Our experiments with BigGAN-deep use the hinge loss (Lim and Ye, 2017; Tran et al., 2017), which is identical to this form in its linear regime. Our analysis can be generalised to other losses as in previous theoretical work (e.g., Arora et al. 2017). To simplify notation, we abbreviate $f(z; \theta_D, \theta_G) = D(G(z; \theta_G); \theta_D)$, which may be further simplified as $f(z)$ when the explicit dependency on $\theta_D$ and $\theta_G$ can be omitted.
Training GANs requires carefully balancing updates to $D$ and $G$, and is sensitive to both architecture and algorithm choices (Salimans et al., 2016; Radford et al., 2015). A recent milestone is BigGAN (and BigGAN-deep, Brock et al. 2018), which pushed the boundary of high-fidelity image generation by scaling up GANs to an unprecedented level. BigGANs use an architecture based on residual blocks (He et al., 2016), in combination with regularisation mechanisms and self-attention (Saxe et al., 2014; Miyato et al., 2018; Zhang et al., 2019).
Here we aim to improve the adversarial dynamics during training. We focus on the second term in eq. 1 which is at the heart of the min-max game. For clarity, we explicitly write the losses for $D$ as $L_D$ and $G$ as $L_G$, so the total loss vector can be written as
$$L = \begin{bmatrix} L_D \\ L_G \end{bmatrix} = \begin{bmatrix} f(z) \\ -f(z) \end{bmatrix} \tag{2}$$
Computing the gradients with respect to $\theta_D$ and $\theta_G$ gives the following gradient, which cannot be expressed as the gradient of any single function (Balduzzi et al., 2018):
$$g = \begin{bmatrix} \frac{\partial L_D}{\partial \theta_D} \\ \frac{\partial L_G}{\partial \theta_G} \end{bmatrix} = \begin{bmatrix} \frac{\partial f(z)}{\partial \theta_D} \\ -\frac{\partial f(z)}{\partial \theta_G} \end{bmatrix} \tag{3}$$
The fact that $g$ is not the gradient of a function implies that gradient updates in GANs can exhibit cycling behaviour which can slow down or prevent convergence. Balduzzi et al. (2018) refer to vector fields of this form as the simultaneous gradient. Although many GAN models use alternating update rules (e.g., Goodfellow et al. 2014; Brock et al. 2018), following the gradient with respect to $\theta_D$ and $\theta_G$ alternately in each step, they can still suffer from cycling, so we use the simpler simultaneous gradient (eq. 3) for our analysis (see also Mescheder et al. 2017, 2018).
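The cycling behaviour can be illustrated on a toy problem. The sketch below is not taken from the paper's experiments: it assumes a two-parameter bilinear game with value d * g, in which one scalar player d minimises the value and the other scalar player g minimises its negative, as a stand-in for the discriminator and generator. The names `simultaneous_gradient` and `run` are hypothetical.

```python
import numpy as np

def simultaneous_gradient(theta):
    """Simultaneous gradient of the assumed toy game with value d * g.

    Player d minimises d * g and player g minimises -(d * g).
    """
    d, g = theta
    return np.array([g, -d])  # [d(value)/dd, d(-value)/dg]

def run(theta0, lr=0.1, steps=100):
    """Follow the simultaneous gradient with plain gradient descent."""
    theta = np.array(theta0, dtype=float)
    norms = [np.linalg.norm(theta)]
    for _ in range(steps):
        theta = theta - lr * simultaneous_gradient(theta)
        norms.append(np.linalg.norm(theta))
    return np.array(norms)

# Distance from the equilibrium (0, 0) at every step.
norms = run([1.0, 1.0])
print(norms[0], norms[-1])
```

In this toy game each update rotates the parameters around the equilibrium at the origin and multiplies their norm by sqrt(1 + lr^2), so the iterates spiral outwards instead of converging.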
2.3 Latent Optimised GANs
Inspired by compressed sensing (Candes et al., 2006; Donoho, 2006), Wu et al. (2019) introduced latent optimisation for GANs. We call this type of model latent-optimised GANs (LOGAN). Latent optimisation has been shown to improve the stability of training as well as the final performance for medium-sized models such as DCGANs and Spectral Normalised GANs (Radford et al., 2015; Miyato et al., 2018). Latent optimisation exploits knowledge from $D$ to guide updates of $z$. Intuitively, the gradient $\nabla_z f(z) = \frac{\partial f(z)}{\partial z}$ points in the direction that satisfies the discriminator $D$, which implies better samples. Therefore, instead of using the randomly sampled $z \sim p(z)$, Wu et al. (2019) use the optimised latent
$$\Delta z = \alpha \frac{\partial f(z)}{\partial z}, \qquad z' = z + \Delta z \tag{4}$$
in eq. 1 for training (footnote 1: we use a single step of gradient-based optimisation during training, and justify this choice in Section 3). The general algorithm is summarised in Algorithm 1 and illustrated in Figure 3 a. We develop the natural gradient descent form of latent update in Section 4.
3 Analysis of the Algorithm
To understand how latent optimisation improves GAN training, we analyse LOGAN as a 2-player differentiable game following Balduzzi et al. (2018); Gemp and Mahadevan (2018); Letcher et al. (2019). The appendix provides a complementary analysis that relates LOGAN to unrolled GANs (Metz et al., 2016) and stochastic approximation (Heusel et al., 2017; Borkar, 1997).
An important problem with gradient-based optimisation in GANs is that the vector field generated by the losses of the discriminator and generator is not a gradient vector field. It follows that gradient descent is not guaranteed to find a local optimum and can cycle, which can slow down convergence or lead to phenomena like mode collapse and mode hopping. Balduzzi et al. (2018); Gemp and Mahadevan (2018) proposed Symplectic Gradient Adjustment (SGA) to improve the dynamics of gradient-based methods in adversarial games. For a game with gradient $g$ (eq. 3), we define the Hessian as the second-order derivatives with respect to the parameters, $H = \nabla_\theta g$. SGA uses the adjusted gradient
$$g^* = g + \lambda A^T g \tag{5}$$
where $\lambda$ is a positive constant and $A = \frac{1}{2}(H - H^T)$ is the antisymmetric component of the Hessian. Applying SGA to GANs yields the adjusted updates (see Appendix B.1 for details):
$$g^* = \begin{bmatrix} \frac{\partial f(z)}{\partial \theta_D} + \lambda \left(\frac{\partial^2 f(z)}{\partial \theta_G\, \partial \theta_D}\right)^{T} \frac{\partial f(z)}{\partial \theta_G} \\ -\frac{\partial f(z)}{\partial \theta_G} + \lambda \left(\frac{\partial^2 f(z)}{\partial \theta_D\, \partial \theta_G}\right)^{T} \frac{\partial f(z)}{\partial \theta_D} \end{bmatrix} \tag{6}$$
Compared with $g$ in eq. 3, the adjusted gradient $g^*$ has second-order terms reflecting the interactions between $D$ and $G$. SGA significantly improves GAN training in simple examples (Balduzzi et al., 2018), allowing faster and more robust convergence to stable fixed points (local Nash equilibria). In addition, eq. 6 is closely related to unrolled GANs (Metz et al., 2016), which we discuss in detail in Appendix B.2. Unfortunately, SGA is expensive to scale because computing the second-order derivatives with respect to all parameters is expensive.
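To make the effect of the adjustment concrete, the following sketch applies SGA to an assumed toy bilinear game with value d * g (player d minimises it, player g minimises its negative). It is an illustration only, not the paper's setup: the game, the value of `lam` and the function names are hypothetical. The adjusted gradient is formed as the gradient plus `lam` times the transposed antisymmetric part of the Hessian applied to the gradient.

```python
import numpy as np

# Jacobian ("Hessian") of the simultaneous gradient [g, -d] of the toy game d * g.
H = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
A = 0.5 * (H - H.T)  # antisymmetric component of the Hessian

def gradient(theta):
    """Plain simultaneous gradient; equals [g, -d] for theta = [d, g]."""
    return H @ theta

def sga_gradient(theta, lam=1.0):
    """Symplectic gradient adjustment: grad + lam * A^T grad."""
    grad = gradient(theta)
    return grad + lam * (A.T @ grad)

def run(grad_fn, theta0=(1.0, 1.0), lr=0.1, steps=100):
    """Gradient descent on the given vector field; returns the final norm."""
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return np.linalg.norm(theta)

plain_norm = run(gradient)      # moves away from the equilibrium at (0, 0)
sga_norm = run(sga_gradient)    # contracts towards the equilibrium
print(plain_norm, sga_norm)
```

In this toy case the adjustment adds a component pointing towards the origin, which turns the outward spiral of the plain simultaneous gradient into a contraction.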
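We can explicitly compute the gradients for the discriminator and generator at $z'$ after one step of latent optimisation by differentiating $f(z')$ (where $z' = z + \Delta z$ from eq. 4):
$$\frac{d L_D}{d \theta_D} = \frac{\partial f(z')}{\partial \theta_D} + \left(\frac{\partial \Delta z}{\partial \theta_D}\right)^{T} \frac{\partial f(z')}{\partial \Delta z} = \frac{\partial f(z')}{\partial \theta_D} + \alpha \left(\frac{\partial^2 f(z)}{\partial z\, \partial \theta_D}\right)^{T} \frac{\partial f(z')}{\partial z'} \tag{7}$$
$$\frac{d L_G}{d \theta_G} = -\frac{\partial f(z')}{\partial \theta_G} - \left(\frac{\partial \Delta z}{\partial \theta_G}\right)^{T} \frac{\partial f(z')}{\partial \Delta z} = -\frac{\partial f(z')}{\partial \theta_G} - \alpha \left(\frac{\partial^2 f(z)}{\partial z\, \partial \theta_G}\right)^{T} \frac{\partial f(z')}{\partial z'} \tag{8}$$
In both equations, the first terms represent how $f(z')$ depends on the parameters directly and the second terms represent how $f(z')$ depends on the parameters via the optimised latent source. For the second equality, we substitute $\Delta z = \alpha \frac{\partial f(z)}{\partial z}$ as the gradient-based update of $z$ and use $\frac{\partial f(z')}{\partial \Delta z} = \frac{\partial f(z')}{\partial z'}$. Further differentiating $\Delta z$ results in the second-order terms $\left(\frac{\partial^2 f(z)}{\partial z\, \partial \theta_D}\right)^{T}$ and $\left(\frac{\partial^2 f(z)}{\partial z\, \partial \theta_G}\right)^{T}$. The original GAN's gradient (eq. 3) does not include any second-order term, since $\Delta z = 0$ without latent optimisation. LOGAN computes these extra terms by automatic differentiation when backpropagating through the latent optimisation process (see Algorithm 1).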
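The split into a direct term and a term through the optimised latent can be checked numerically. The sketch below is a hypothetical scalar example, not the paper's model: it assumes a toy score function a * tanh(b * z), where `a` plays the role of a discriminator parameter and `b` a generator parameter, and a latent step z' = z + ALPHA * df/dz. It compares the analytic total derivative with respect to `a` against a finite-difference estimate.

```python
import numpy as np

ALPHA = 0.5  # latent step size (arbitrary value for this toy example)

def f(z, a, b):
    """Toy scalar stand-in for the discriminator score of a generated sample."""
    return a * np.tanh(b * z)

def df_dz(z, a, b):
    """Partial derivative of f with respect to the latent z."""
    return a * b / np.cosh(b * z) ** 2

def optimised_latent(z, a, b):
    """One gradient step on the latent: z' = z + ALPHA * df/dz."""
    return z + ALPHA * df_dz(z, a, b)

def total_grad_a(z, a, b):
    """Analytic d f(z') / d a: direct term plus the term through z'."""
    z_new = optimised_latent(z, a, b)
    direct = np.tanh(b * z_new)              # partial f / partial a, evaluated at z'
    d2f_dz_da = b / np.cosh(b * z) ** 2      # mixed second derivative, evaluated at z
    through_latent = ALPHA * d2f_dz_da * df_dz(z_new, a, b)
    return direct + through_latent

def numeric_grad_a(z, a, b, eps=1e-6):
    """Central finite difference of a -> f(z'(a), a, b)."""
    hi = f(optimised_latent(z, a + eps, b), a + eps, b)
    lo = f(optimised_latent(z, a - eps, b), a - eps, b)
    return (hi - lo) / (2 * eps)

z, a, b = 0.3, 1.2, 0.7
analytic = total_grad_a(z, a, b)
numeric = numeric_grad_a(z, a, b)
first_order_only = np.tanh(b * optimised_latent(z, a, b))  # drops the second-order term
print(analytic, numeric, first_order_only)
```

The analytic and finite-difference values agree, while the first-order term alone does not, which is the contribution that automatic differentiation picks up when backpropagating through the latent step.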
The SGA updates in eq. 6 and the LOGAN updates in eq. 8 are strikingly similar, suggesting that the latent step used by LOGAN reduces the negative effects of cycling by introducing a symplectic gradient adjustment into the optimization procedure. The role of the latent step can be formalized in terms of a third player, whose goal is to help the generator (see appendix B for details).
Crucially, latent optimisation approximates SGA using only second-order derivatives with respect to the latent $z$ and parameters of the discriminator and generator separately. The second-order terms involving parameters of both the discriminator and the generator, which are extremely expensive to compute, are not used. For latent $z$'s with dimensions typically used in GANs (e.g., 128–256, orders of magnitude less than the number of parameters), the second-order terms can be computed efficiently. In short, latent optimisation efficiently couples the gradients of the discriminator and generator, as prescribed by SGA, but using the much lower-dimensional latent source $z$, which makes the adjustment scalable.
Appendix B.3 further shows that latent optimisation accelerates the speed of updating $D$ relative to the speed of updating $G$, facilitating convergence according to Heusel et al. (2017) (see also Figure 4 b). Intuitively, the generator requires less updating compared with $D$ to achieve the same reduction of loss because latent optimisation “helps” $G$. In summary, our analysis suggests that latent optimisation can improve GAN training dynamics by allowing larger single steps $\Delta z$ towards the direction of $\frac{\partial f(z)}{\partial z}$ without overshooting.
4 LOGAN with Natural Gradient Descent
Our analysis suggests that LOGAN benefits from strong optimisers for updating $z$. In this work, we use natural gradient descent (NGD, Amari 1998) for latent optimisation. NGD is an approximate second-order optimisation method (Pascanu and Bengio, 2013; Martens, 2014), and has been applied successfully in many domains. By using the positive semi-definite (PSD) Gauss-Newton matrix to approximate the (possibly negative definite) Hessian, NGD often works even better than exact second-order methods. NGD is expensive in high dimensional parameter spaces, even with approximations (Martens, 2014). However, we demonstrate that it is efficient for latent optimisation, even in very large models.
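Given the gradient of $z$, $g = \frac{\partial f(z)}{\partial z}$, NGD computes the update as
$$\Delta z = \alpha F^{-1} g \tag{9}$$
where the Fisher information matrix $F$ is defined as
$$F = \mathbb{E}_{p(t \mid z)}\left[\nabla \ln p(t \mid z)\, \nabla \ln p(t \mid z)^T\right] \tag{10}$$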
The log-likelihood function $\ln p(t \mid z)$ typically corresponds to commonly used error functions such as the cross entropy loss. This correspondence is not necessary when we interpret NGD as an approximate second-order method, as has long been done (Martens, 2014). Nevertheless, Appendix C provides a Poisson log-likelihood interpretation for the hinge loss commonly used in GANs (Lim and Ye, 2017; Tran et al., 2017). An important difference between latent optimisation and commonly seen scenarios using NGD is that the expectation over the condition ($z$) is absent. Since each $z$ is only responsible for generating one image, it only minimises the loss for this particular instance.
More specifically, we use the empirical Fisher $F'$ with Tikhonov damping, as in TONGA (Roux et al., 2008)
$$F' = g\, g^T + \beta I \tag{11}$$
$F'$ is cheaper to compute compared with the full Fisher, since $g$ is already available. The damping factor $\beta$ regularises the step size, which is important when $F'$ only poorly approximates the Hessian or when the Hessian changes too much across the step. Using the Sherman-Morrison formula, the NGD update can be simplified into the following closed form:
$$\Delta z = \alpha \left(\frac{I}{\beta} - \frac{g\, g^T}{\beta^2 + \beta\, g^T g}\right) g = \frac{\alpha}{\beta + \|g\|^2}\, g \tag{12}$$
which does not involve any matrix inversion. Thus, NGD adapts the step size according to the curvature estimate $c = \|g\|^2$. When $\beta$ is small, NGD normalises the gradient by its squared L2-norm, which is equivalent to whitening here since we assume all samples of $z$ are independent. Figure 4 a illustrates the scaling for different values of $\beta$. NGD automatically smooths the scale of updates by down-scaling the gradients as their norm grows, which also contributes to the smoothed norms of updates (Figure 4 b). Since the NGD update remains proportional to $g$, our analysis based on gradient descent in section 3 still holds.
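The closed form can be verified numerically. The sketch below assumes the damped empirical Fisher is the outer product of the latent gradient with itself plus a damping factor times the identity; under that assumption the Sherman-Morrison formula gives a step of `alpha * g / (beta + ||g||^2)`. The function names and the values of `alpha` and `beta` are arbitrary choices for illustration, not the paper's settings.

```python
import numpy as np

def ngd_step_closed_form(g, alpha, beta):
    """Damped NGD step without matrix inversion: alpha * g / (beta + ||g||^2)."""
    return alpha / (beta + g @ g) * g

def ngd_step_explicit(g, alpha, beta):
    """Same step obtained by solving with the damped empirical Fisher g g^T + beta I."""
    fisher = np.outer(g, g) + beta * np.eye(g.size)
    return alpha * np.linalg.solve(fisher, g)

rng = np.random.default_rng(0)
g = rng.normal(size=8)  # stand-in for the gradient of the score w.r.t. the latent

closed = ngd_step_closed_form(g, alpha=1.0, beta=0.5)
explicit = ngd_step_explicit(g, alpha=1.0, beta=0.5)
print(np.max(np.abs(closed - explicit)))
```

Because the step length is `alpha * ||g|| / (beta + ||g||^2)`, it can never exceed `alpha / (2 * sqrt(beta))`, which is one way to see how the damping factor bounds the update as gradient norms grow.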
5 Experiments and Analysis
We focus on large scale GANs based on BigGAN-deep (Brock et al., 2018) trained on images from the ImageNet dataset (Deng et al., 2009). In Appendix E, we present results from applying our algorithm on Spectral Normalised GANs trained on the CIFAR dataset (Krizhevsky et al., 2009), which obtains state-of-the-art scores on this model.
5.1 Model Configuration
We used the standard BigGAN-deep architecture with three minor modifications: 1. We increased the size of the latent source to compensate for the randomness of the source lost when optimising $z$. 2. We use the uniform distribution $\mathcal{U}(-1, 1)$ instead of the standard normal distribution $\mathcal{N}(0, 1)$ for $p(z)$ to be consistent with the clipping operation (Algorithm 1). 3. We use leaky ReLU instead of ReLU as the non-linearity for smoother gradient flow for $\frac{\partial f(z)}{\partial z}$.
Consistent with detailed findings in Brock et al. (2018), our experiment with this baseline model obtains only slightly better scores compared with those in Brock et al. (2018) (Table 1, see also Figure 8). We computed the FID and IS as in Brock et al. (2018), and computed IS values from checkpoints with the lowest FIDs. Finally, we computed the means and standard deviations for both measures from 5 models with different random seeds.
To apply latent optimisation, we use a damping factor $\beta$ combined with a large step size $\alpha$. As an additional way of damping, we only optimise a subset of $z$'s dimensions. Optimising the entire population of $z$ was unstable in our experiments. Similar to Wu et al. (2019), we found it was helpful to regularise the Euclidean norm of weight-change $\Delta z$, scaled by a regulariser weight. All other hyperparameters, including learning rates and a large batch size of 2048, remain the same as in BigGAN-deep; we did not optimise these hyperparameters. We call this model LOGAN (NGD).
5.2 Basic Results
Employing the same architecture and number of parameters as the BigGAN-deep baseline, LOGAN (NGD) achieved better FID and IS (Table 1). As observed by Brock et al. (2018), BigGAN training eventually collapsed in every experiment. Training with LOGAN also collapsed, perhaps due to higher-order dynamics beyond the scope we have analysed, but took significantly longer (600k steps versus 300k steps with BigGAN-deep).
During training, LOGAN was slower per step compared with BigGAN-deep because of the additional forward and backward pass. We found that optimising $z$ during evaluation did not improve sample scores (even up to 10 steps), so we do not optimise $z$ for evaluation. Therefore, LOGAN has the same evaluation cost as the original BigGAN-deep. To help understand this behaviour, we plot the change from $\Delta z$ during training in Figure 5 a. Although the movement in Euclidean space $\|\Delta z\|$ grew until training collapsed, the movement in $D$'s output space, measured as $\|f(z + \Delta z) - f(z)\|$, remained unchanged (see Appendix D for details). As shown in our analysis, optimising $z$ improves the training dynamics, so LOGANs work well after training without requiring latent optimisation.
5.3 Ablation Studies
We verify our theoretical analysis in section 3 by examining key components of Algorithm 1 via ablation studies. First, we experiment with using basic GD to optimise $z$, as in Wu et al. (2019), and call this model LOGAN (GD). A smaller step size was required; larger values were unstable and led to premature collapse of training. As shown in Table 1, the scores from LOGAN (GD) were worse than LOGAN (NGD) and similar to the baseline model.
We then evaluate the effects of removing the extra terms in eq. 8 introduced by latent optimisation, which are not in the ordinary gradient (eq. 3). Since we computed these terms by backpropagating through the latent optimisation procedure, we removed them by selectively blocking backpropagation with “stop_gradient” operations (e.g., in TensorFlow; Abadi et al. 2016). Figure 5 b shows the change of FIDs for the three models corresponding to removing each of the two terms individually and removing both terms. As predicted by our analysis (section 3), both terms help stabilise training; training diverged early for all three ablations.
5.4 Truncation and Samples
Truncation is a technique introduced by Brock et al. (2018) to illustrate the trade-off between the FID and IS in a trained model. For a model trained with $z \sim p(z)$ from a source distribution symmetric around $0$, such as the standard normal distribution $\mathcal{N}(0, 1)$ and the uniform distribution $\mathcal{U}(-1, 1)$, downscaling (truncating) the source $\bar{z} = s \cdot z$ with $0 \le s \le 1$ gives samples with higher visual quality but reduced diversity. We see this quantified in higher IS scores and lower FID when evaluating samples from truncated distributions.
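As a minimal sketch of truncation by downscaling, the snippet below multiplies sampled latents by a factor in [0, 1] before they would be passed to a generator. The function name `truncate`, the uniform source on [-1, 1] and the latent size of 256 are assumptions made for illustration; no generator is included.

```python
import numpy as np

def truncate(z, scale):
    """Downscale latents by a factor in [0, 1]; scale=1 means no truncation."""
    if not 0.0 <= scale <= 1.0:
        raise ValueError("scale must lie in [0, 1]")
    return scale * z

rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=(4, 256))  # assumed source distribution and size

z_trunc = truncate(z, 0.5)  # latents now lie in [-0.5, 0.5]
print(np.abs(z).max(), np.abs(z_trunc).max())
```

Sweeping the scale from 1 down towards 0 and evaluating FID and IS at each value is what traces out a truncation curve.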
Figure 3 b plots the truncation curves for the baseline BigGAN-deep model, LOGAN (GD) and LOGAN (NGD), obtained by varying the truncation (value of $s$) from no truncation (upper-left ends of the curves) to extreme truncation (bottom-right ends). Each curve shows the trade-off between FID and IS for an individual model; curves towards the upper-right corner indicate better overall sample quality. The relative positions of curves in Figure 3 b show that LOGAN (NGD) has the best sample quality. Interestingly, although LOGAN (GD) and the baseline model have similar scores without truncation (upper-left ends of the curves, see also Table 1), LOGAN (GD) was better behaved with increasing truncation, suggesting LOGAN (GD) still converged to a better equilibrium. For further reference, we plot truncation curves from additional baseline models in Figure 8.
Figure 1 and Figure 2 show samples from chosen points on the truncation curves. In the high IS domain, C and D on the truncation curves both have similarly high IS of near 260. Samples from batches with such high IS have almost photo-realistic image quality. Figure 1 shows that while the baseline model produced nearly uniform samples, LOGAN (NGD) could still generate highly diverse samples. On the other hand, A and B from Figure 3 b have similarly low FID of near 5, indicating high sample diversity. Samples in Figure 2 b show higher quality compared with those in a (e.g., the interfaces between the elephants and ground, the contours around the pandas).
6 Conclusion
In this work, we present the LOGAN model which significantly improves the state of the art on large scale GAN training for image generation by optimising the latent source $z$. Our results illustrate improvements in quantitative evaluation and samples with higher quality and diversity. Moreover, our analysis suggests that LOGAN fundamentally improves adversarial training dynamics. We therefore expect our method to be useful in other tasks that involve adversarial training, including representation learning and inference (Donahue et al., 2017; Dumoulin et al., 2017)
(Zhang et al., 2019), style learning (Zhu et al., 2017; Karras et al., 2019), audio generation (Donahue et al., 2018) and video generation (Vondrick et al., 2016; Clark et al., 2019).
References

Abadi et al. (2016). TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
Amari (1998). Natural gradient works efficiently in learning. Neural Computation 10 (2), pp. 251–276.
Arjovsky et al. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.
Arora et al. (2017). Generalization and equilibrium in generative adversarial nets. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 224–232.
Balduzzi et al. (2018). The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642.
Borkar (1997). Stochastic approximation with two time scales. Systems & Control Letters 29 (5), pp. 291–294.
Brock et al. (2018). Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
Candes et al. (2006). Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences 59 (8), pp. 1207–1223.
Clark et al. (2019). Efficient video generation on complex datasets. arXiv preprint arXiv:1907.06571.
Deng et al. (2009). ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
Donahue et al. (2018). Adversarial audio synthesis. arXiv preprint arXiv:1802.04208.
Donahue et al. (2017). Adversarial feature learning. In ICLR.
Donoho (2006). Compressed sensing. IEEE Transactions on Information Theory 52 (4), pp. 1289–1306.
Dumoulin et al. (2017). Adversarially learned inference. In ICLR.
Gemp and Mahadevan (2018). Global Convergence to the Equilibrium of GANs using Variational Inequalities. arXiv:1808.01531.
Goodfellow et al. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Heusel et al. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
Hirsch (1989). Convergent activation dynamics in continuous time networks. Neural Networks 2 (5), pp. 331–349.
Karras et al. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410.
Konda and Borkar (1999). Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization 38 (1), pp. 94–123.
Krizhevsky et al. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer.
Letcher et al. (2019). Differentiable game mechanics. Journal of Machine Learning Research 20 (84), pp. 1–40.
Lim and Ye (2017). Geometric GAN. arXiv preprint arXiv:1705.02894.
Martens (2014). New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193.
Mescheder et al. (2018). Which training methods for GANs do actually converge? In International Conference on Machine Learning, pp. 3478–3487.
Mescheder et al. (2017). The numerics of GANs. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 1825–1835.
Metz et al. (2016). Unrolled generative adversarial networks. CoRR abs/1611.02163.
Miyato et al. (2018). Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
Miyato and Koyama (2018). cGANs with projection discriminator. arXiv preprint arXiv:1802.05637.
Nowozin et al. (2016). f-GAN: training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279.
Pascanu and Bengio (2013). Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584.
Pearlmutter (1994). Fast exact multiplication by the Hessian. Neural Computation 6 (1), pp. 147–160.
Radford et al. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Roux et al. (2008). Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems, pp. 849–856.
Salimans et al. (2016). Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
Saxe et al. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR.
Tran et al. (2017). Hierarchical implicit models and likelihood-free variational inference. In Advances in Neural Information Processing Systems, pp. 5523–5533.
Vondrick et al. (2016). Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pp. 613–621.
Wu et al. (2019). Deep compressed sensing. In International Conference on Machine Learning, pp. 6850–6860.
Zhang et al. (2019). Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354–7363.
Zhu et al. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.
Appendix A Additional Samples and Results
Figures 6 and 7 provide additional samples, organised similarly to Figures 1 and 2. Figure 8 shows additional truncation curves.
Appendix B Detailed Analysis of Latent Optimisation
In this section we present three complementary analyses of LOGAN. In particular, we show how the algorithm brings together ideas from symplectic gradient adjustment, unrolled GANs and stochastic approximation with two time scales.
B.1 Approximate Symplectic Gradient Adjustment
To analyse LOGAN as a differentiable game we treat the latent step as adding a third player to the original game played by the discriminator and generator. The third player’s parameter, , is optimised online for each . Together the three players (latent player, discriminator, and generator) have losses averaged over a batch of samples:
(13) 
where ( is the batch size) reflects the fact that each is only optimised for a single sample , so its contribution to the total loss across a batch is small compared with and which are directly optimised for batch losses. This choice of is essential for the following derivation, and has important practical implication. It means that the persample loss , instead of the loss summed over a batch , should be the only loss function guiding latent optimisation. Therefore, when using natural gradient descent (Section 4), the Fisher information matrix should only be computed using the current sample .
The resulting simultaneous gradient is
(14) 
Following Balduzzi et al. (2018), we can write the Hessian of the game as:
(15) 
The presence of a nonzero antisymmetric component in the Hessian
(16) 
implies the dynamics have a rotational component which can cause cycling or slow down convergence. Since for typical batch sizes (e.g., for DCGAN and for BigGANdeep), we abbreviate to simplify notations.
Symplectic gradient adjustment (SGA) counteracts the rotational force by adding an adjustment term to the gradient to obtain , which for the discriminator and generator has the form:
(17)  
(18) 
The gradient with respect to is ignored since the convergence of training only depends on and .
If we drop the last terms in eq.17 and 18, which are expensive to compute for large models with highdimensional and , and use , the adjusted updates can be rewritten as
(19)  
(20) 
Because of the third player, there are still the terms depend on to adjust the gradients. Efficiently computing and is nontrivial (e.g., Pearlmutter 1994). However, if we introduce the local approximation
(21) 
then the adjusted gradient becomes identical to eq. 8 from latent optimisation.
In other words, automatic differentiation by commonly used machine learning packages can compute the adjusted gradient for and when backpropagating through the latent optimisation process. Despite the approximation involved in this analysis, both our experiments in section 5 and the results from Wu et al. (2019) verified that latent optimisation can significantly improve GAN training.
B.2 Relation with Unrolled GANs
Latent optimisation can be seen as unrolling GANs (Metz et al., 2016) in the space of the latent, rather than the parameters. Unrolling in the latent space has the advantages that:

LOGAN is more scalable than Unrolled GANs because it avoids second-order derivatives over a potentially very large number of parameters.
We next formally present this connection by showing that SGA can be seen as approximating Unrolled GANs (Metz et al., 2016). For the update , we have the Taylor expansion approximation at :
(22) 
Substitute , and take the derivatives with respect to on both sides:
(23) 
which is the same as eq. 18 (taking the negative sign). Compared with the exact gradient from the unroll:
(24) 
The approximation in eq. 23 comes from using and as a result of the linear approximation.
At this point, unrolling update only affects . Although it is expensive to unroll both and , in principle, we can unroll update and compute the gradient of similarly using :
(25) 
which gives us the same update rule as SGA (eq. 17). This correspondence based on first order Taylor expansion is unsurprising, as SGA is based on linearising the adversarial dynamics (Balduzzi et al., 2018).
B.3 Stochastic Approximation with Two Time Scales
Heusel et al. (2017) used the theory of stochastic approximation to analyse GAN training. Viewing the training process as stochastic approximation with two time scales (Borkar, 1997; Konda and Borkar, 1999), they suggest that the update of should be fast enough compared with that of . Under mild assumptions, Heusel et al. (2017) proved that such two timescale update converges to local Nash equilibrium. Their analysis follows the idea of perturbation (Hirsch, 1989), where the slow updates () are interpreted as a small perturbation over the ODE describing the fast update (). Importantly, the size of perturbation is measured by the magnitude of parameter change, which is affected by both the learning rate and gradients.
Here we show, in accordance with Heusel et al. (2017), that LOGAN accelerates discriminator updates and slows down generator updates, thus helping the convergence of discriminator. We start from analysing the change of . We assume that, without LO, it takes to make a small constant amount of reduction in loss :
(26) 
Now using the optimised , we assess the change required to achieve the same amount of reduction:
(27) 
Intuitively, when “helps” to achieve the same goal of increasing by , the responsible of becomes smaller, so it does not need to change as much as , thus .
Formally, and have the following Taylor expansions around and :
(28)  
(29) 
Where ’s are higher order terms of the increments. Using the assumption of eq. 26 and 27, we can combine eq. 28 and 29:
(30) 
where . Since in gradient descent (eq. 3),
(31) 
Therefore, we have the inequality
(32) 
If we further assume and
are obtained from stochastic gradient descent with identical learning rate,
(33) 
substituting eq. 33 into eq. 32 gives
(34) 
The same analysis applies to the discriminator. The similar intuition is that it takes the discriminator additional effort to compensate the exploitation from the optimised . We then obtain
(35) 
However, since the adversarial loss , we have and taking the opposite signs of eq.33. For sufficiently small , and , is close to zero, so under our assumptions of small , and .
Importantly, the bigger the product is, the more robust the inequality is to the error from . Moreover, bigger steps increase the speed gap between updating D and G, further facilitating convergence (in accordance with Heusel et al. (2017)). Overall, our analysis suggests:

More than one gradient descent step may not be helpful, since from multiple GD steps may deviate from the direction of .

A large step of is helpful in facilitating convergence by widening the gap between D and G updates (Heusel et al., 2017).

However, the step of cannot be too large. In addition to the linear approximation we used throughout our analysis, the approximate SGA breaks down when eq.21 is strongly violated when “overshoot” brings the gradients at to the opposite sign of .
Appendix C Poisson Likelihood from Hinge loss
Here we provide a probabilistic interpretation of the hinge loss for the generator, which leads naturally to the scenario of a family of discriminators. Although this interpretation is not necessary for our current algorithm, it may provide useful guidance for incorporating multiple discriminators.
We introduce the label $t = 1$ for real data and $t = 0$ for fake samples. This section shows that the generator hinge loss
$$L_G = -D(G(z)) \tag{36}$$
can be interpreted as a negative log-likelihood function:
$$L_G = -\ln p(t = 1; z, D, G) \tag{37}$$
Here $p(t = 1; z, D, G)$ is the probability that the generated image $G(z)$ can fool the discriminator $D$.
The original GAN’s discriminator can be interpreted as outputting a Bernoulli distribution $p(t; \rho) = \rho^{t} (1 - \rho)^{1 - t}$. In this case, if we parameterise $\rho = D(G(z))$, the generator loss is the negative log-likelihood
$$-\ln p(t = 1; z, D, G) = -\ln \rho = -\ln D(G(z)) \tag{38}$$
Bernoulli, however, is not the only valid choice as the discriminator’s output distribution. Instead of sampling “1” or “0”, we assume that there are many identical discriminators that can independently vote to reject an input sample as fake. The number of votes $k$ in a given interval can be described by a Poisson distribution with parameter $\lambda$ with the following PMF:
$$p(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!} \tag{39}$$
The probability that a generated image can fool all the discriminators is the probability of receiving no vote for rejection
$$p(k = 0; \lambda) = e^{-\lambda} \tag{40}$$
Therefore, we have the following negative log-likelihood as the generator loss if we parameterise $\lambda = -D(G(z))$:
$$-\ln p(k = 0; z, D, G) = \lambda = -D(G(z)) \tag{41}$$
This interpretation has a caveat that when $D(G(z)) > 0$ the Poisson distribution is not well defined. However, in general the discriminator’s hinge loss
$$L_D = \max\big(0, 1 - D(x)\big) + \max\big(0, 1 + D(G(z))\big) \tag{42}$$
pushes $D(G(z)) < 0$ via training.
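The key step of this interpretation is that the negative log-probability of receiving zero votes under a Poisson distribution equals the Poisson rate itself. The short check below uses the standard Poisson PMF; the rate value and the names `poisson_pmf` and `nll_no_votes` are arbitrary choices for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Standard Poisson probability mass function: lam^k * exp(-lam) / k!."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 1.7  # arbitrary positive rate, standing in for the negated discriminator score

# Negative log-likelihood of "no vote for rejection" (k = 0).
nll_no_votes = -math.log(poisson_pmf(0, lam))
print(nll_no_votes, lam)
```

Since the rate must be positive for the PMF to be defined, the check only makes sense when the discriminator score of the generated sample is negative, which is the caveat noted above.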
Appendix D Details in Computing Distances in Figure 5 a
For a temporal sequence (e.g. the changes of or
at each training step in this paper), to normalise its variance while accounting for the nonstationarity, we process it as follows. We first compute the moving average and standard deviation over a window of size
:(43)  
(44) 
Then normalise the sequence as:
(45) 
The result in Figure 5a is robust to the choice of window size. Our experiments with from to yielded visually similar plots.
Appendix E Experiments with DCGAN and CIFAR
To test if latent optimisation works with models at more moderate scales, we apply it to SNGANs (Miyato et al., 2018). Although our experiments on this model are less thorough than in the main paper with BigGANdeep, we hope to provide basic guidelines for researchers interested in applying latent optimisation on smaller models.
The experiments follows the same basic setup and hyperparameter settings as the CSGAN in Wu et al. (2019). There is no class conditioning in this model. For NGD, we found a smaller damping factor , a regulariser weight of (other parameters are same as in Wu et al. 2019), combined with optimising of the latent source (instead of for BigGANdeep) worked best for SNGANs.
In addition, we found running extra latent optimisation steps benefited evaluation, so we use ten steps of latent optimisation in evaluation for results in this section, although the models were still trained with a single optimisation step. We reckon that smaller models might not be “overparametrised” enough to fully amortise the computation from optimising , which can then further exploit the architecture in evaluation time. On the other hand, the overhead from running multiple iterations of latent optimisation is relatively small at this scale. We aim to further investigate this difference in future studies.
Table 2 shows the FID and IS alongside SNGAN and CSGAN which used the same architecture. Here we observe a similarly significant improvement over the baseline SNGAN model in both IS and FID. Figure 9 shows random samples from these two models. Overall, samples from LOGAN (NGD) have higher contrast and sharper contours.
Table 2: FID and IS for SNGAN, CSGAN and LOGAN (NGD).