This repository implements the latent optimization using automatic differentiation from the paper LOGAN.
Training generative adversarial networks requires balancing of delicate adversarial dynamics. Even with careful tuning, training may diverge or end up in a bad equilibrium with dropped modes. In this work, we introduce a new form of latent optimisation inspired by the CS-GAN and show that it improves adversarial dynamics by enhancing interactions between the discriminator and the generator. We develop supporting theoretical analysis from the perspectives of differentiable games and stochastic approximation. Our experiments demonstrate that latent optimisation can significantly improve GAN training, obtaining state-of-the-art performance for the ImageNet (128 x 128) dataset. Our model achieves an Inception Score (IS) of 148 and a Fréchet Inception Distance (FID) of 3.4, an improvement of 17% and 32% in IS and FID respectively, compared with the baseline BigGAN-deep model with the same architecture and number of parameters.
Generative Adversarial Nets (GANs) are implicit generative models that can be trained to match a given data distribution. GANs were originally developed by Goodfellow et al. (2014) for image data. As the field of generative modelling has advanced, GANs remain at the frontier, generating high-fidelity images at large scale (Brock et al., 2018). However, despite growing insights into the dynamics of GAN training, most recent advances in large-scale image generation come from network architecture improvements (Radford et al., 2015; Zhang et al., 2019), or regularisation of particular parts of the model (Miyato et al., 2018; Miyato and Koyama, 2018). Inspired by the compressed sensing GAN (CS-GAN; Wu et al., 2019), we exploit the benefit of latent optimisation in adversarial games using natural gradient descent to optimise the latent variable at each step of training. This results in a scalable and easy-to-implement approach that improves the dynamic interaction between the discriminator and the generator. For clarity, we unify these approaches as latent optimised GANs (LOGAN).
To summarise our contributions:
Motivated by this analysis, we improve latent optimisation by taking advantage of efficient second-order updates.
We use $\theta_G$ and $\theta_D$ to denote the vectors representing parameters of the generator and discriminator. We use $x$ for images, and $z$ for the latent source generating an image. We use prime to denote a variable after one update step, e.g., $\theta_D'$. $p(x)$ and $p(z)$ denote the data distribution and source distribution respectively. $\mathbb{E}_{p(x)}[f(x)]$ indicates taking the expectation of function $f(x)$ over the distribution $p(x)$.
[Table 1: FID and IS for BigGAN-deep, the baseline, LOGAN (GD) and LOGAN (NGD); the table values are not preserved here.]
A GAN consists of a generator $G$ that generates an image $x = G(z; \theta_G)$ from a latent source $z \sim p(z)$, and a discriminator $D$ that scores the generated images as $D(x; \theta_D)$ (Goodfellow et al., 2014). Training GANs involves an adversarial game: while the discriminator tries to distinguish generated samples $x = G(z; \theta_G)$ from data $x \sim p(x)$, the generator tries to fool the discriminator. This procedure can be summarised as the following min-max game:

$$\min_{\theta_D} \max_{\theta_G} \; \mathbb{E}_{x \sim p(x)}\left[h_D\left(D(x; \theta_D)\right)\right] + \mathbb{E}_{z \sim p(z)}\left[h_G\left(D(G(z; \theta_G); \theta_D)\right)\right] \tag{1}$$

The exact form of $h(\cdot)$ depends on the choice of loss function (Goodfellow et al., 2014; Arjovsky et al., 2017; Nowozin et al., 2016). To simplify our presentation and analysis, we use the Wasserstein loss (Arjovsky et al., 2017), so that $h_D(t) = -t$ and $h_G(t) = t$. Our experiments with BigGAN-deep use the hinge loss (Lim and Ye, 2017; Tran et al., 2017), which is identical to this form in its linear regime. Our analysis can be generalised to other losses as in previous theoretical work (e.g., Arora et al. 2017). To simplify notation, we abbreviate $f(z; \theta_D, \theta_G) = D(G(z; \theta_G); \theta_D)$, which may be further simplified as $f(z)$ when the explicit dependency on $\theta_D$ and $\theta_G$ can be omitted.
Training GANs requires carefully balancing updates to $D$ and $G$, and is sensitive to both architecture and algorithm choices (Salimans et al., 2016; Radford et al., 2015). A recent milestone is BigGAN (and BigGAN-deep, Brock et al. 2018), which pushed the boundary of high fidelity image generation by scaling up GANs to an unprecedented level. BigGANs use an architecture based on residual blocks (He et al., 2016), in combination with regularisation mechanisms and self-attention (Saxe et al., 2014; Miyato et al., 2018; Zhang et al., 2019).
Here we aim to improve the adversarial dynamics during training. We focus on the second term in eq. 1 which is at the heart of the min-max game. For clarity, we explicitly write the losses for $D$ as $L_D$ and $G$ as $L_G$, so the total loss vector can be written as

$$L = \begin{bmatrix} L_D \\ L_G \end{bmatrix} = \begin{bmatrix} \mathbb{E}_{p(z)}[f(z)] \\ -\mathbb{E}_{p(z)}[f(z)] \end{bmatrix} \tag{2}$$

Computing the gradients with respect to $\theta_D$ and $\theta_G$ gives the following gradient, which cannot be expressed as the gradient of any single function (Balduzzi et al., 2018):

$$g = \begin{bmatrix} \partial L_D / \partial \theta_D \\ \partial L_G / \partial \theta_G \end{bmatrix} = \begin{bmatrix} \mathbb{E}_{p(z)}\left[\partial f(z) / \partial \theta_D\right] \\ -\mathbb{E}_{p(z)}\left[\partial f(z) / \partial \theta_G\right] \end{bmatrix} \tag{3}$$

The fact that $g$ is not the gradient of a function implies that gradient updates in GANs can exhibit cycling behaviour which can slow down or prevent convergence. Balduzzi et al. (2018) refer to vector fields of this form as the simultaneous gradient. Although many GAN models use alternating update rules (e.g., Goodfellow et al. 2014; Brock et al. 2018), following the gradient with respect to $\theta_D$ and $\theta_G$ alternately in each step, they can still suffer from cycling, so we use the simpler simultaneous gradient (eq. 3) for our analysis (see also Mescheder et al. 2017, 2018).
Inspired by compressed sensing (Candes et al., 2006; Donoho, 2006), Wu et al. (2019) introduced latent optimisation for GANs. We call this type of model latent-optimised GANs (LOGAN). Latent optimisation has been shown to improve the stability of training as well as the final performance for medium-sized models such as DCGANs and Spectral Normalised GANs (Radford et al., 2015; Miyato et al., 2018). Latent optimisation exploits knowledge from $D$ to guide updates of $z$. Intuitively, the gradient $\nabla_z f(z) = \partial f(z) / \partial z$ points in the direction that satisfies the discriminator $D$, which implies better samples. Therefore, instead of using the randomly sampled $z \sim p(z)$, Wu et al. (2019) use the optimised latent

$$\Delta z = \alpha \frac{\partial f(z)}{\partial z}, \qquad z' = z + \Delta z \tag{4}$$

in eq. 1 for training. (We use a single step of gradient-based optimisation during training, and justify this choice in section 3.) The general algorithm is summarised in Algorithm 1 and illustrated in Figure 3 a. We develop the natural gradient descent form of the latent update in Section 4.
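A minimal sketch of a single gradient-based latent step may help make the update concrete. It rests on assumptions: the score `f` below is a hypothetical toy stand-in for $D(G(z))$ with an analytic gradient, and the step size and the clipping range $[-1, 1]$ are illustrative values, not the settings used in the experiments.

```python
import math

def f(z, w, v):
    """Toy score f(z) = sum_i v_i * tanh(w_i * z_i); a hypothetical stand-in for D(G(z))."""
    return sum(vi * math.tanh(wi * zi) for zi, wi, vi in zip(z, w, v))

def grad_f(z, w, v):
    """Analytic gradient df/dz of the toy score."""
    return [vi * wi * (1.0 - math.tanh(wi * zi) ** 2) for zi, wi, vi in zip(z, w, v)]

def latent_gd_step(z, w, v, alpha=0.1, clip=1.0):
    """One latent step z' = clip(z + alpha * df/dz), moving z towards a higher score."""
    g = grad_f(z, w, v)
    return [max(-clip, min(clip, zi + alpha * gi)) for zi, gi in zip(z, g)]

# Illustrative values only.
z = [0.2, -0.5, 0.1]
w = [1.0, 0.5, -2.0]
v = [0.3, -1.0, 0.8]
z_new = latent_gd_step(z, w, v)
print(f(z, w, v), f(z_new, w, v))
```

In a real model the gradient would come from automatic differentiation rather than a hand-written formula, and the generator and discriminator would then be trained on the optimised latent.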
To understand how latent optimisation improves GAN training, we analyse LOGAN as a 2-player differentiable game following Balduzzi et al. (2018); Gemp and Mahadevan (2018); Letcher et al. (2019). The appendix provides a complementary analysis that relates LOGAN to unrolled GANs (Metz et al., 2016) and stochastic approximation (Heusel et al., 2017; Borkar, 1997).
An important problem with gradient-based optimization in GANs is that the vector-field generated by the losses of the discriminator and generator is not a gradient vector field. It follows that gradient descent is not guaranteed to find a local optimum and can cycle, which can slow down convergence or lead to phenomena like mode collapse and mode hopping. Balduzzi et al. (2018); Gemp and Mahadevan (2018) proposed Symplectic Gradient Adjustment (SGA) to improve the dynamics of gradient-based methods in adversarial games. For a game with gradient $g$ (eq. 3), we define the Hessian as the second order derivatives with respect to the parameters, $H = \nabla_\theta g$. SGA uses the adjusted gradient

$$g^* = g + \lambda A^\top g \tag{5}$$

where $\lambda$ is a positive constant and $A = \frac{1}{2}(H - H^\top)$ is the anti-symmetric component of the Hessian. Applying SGA to GANs yields the adjusted updates (see Appendix B.1 for details):
Compared with $g$ in eq. 3, the adjusted gradient $g^*$ has second-order terms reflecting the interactions between $D$ and $G$. SGA significantly improves GAN training in simple examples (Balduzzi et al., 2018), allowing faster and more robust convergence to stable fixed points (local Nash equilibria). In addition, eq. 6 is closely related to unrolled GANs (Metz et al., 2016), which we discuss in detail in Appendix B.2. Unfortunately, SGA is difficult to scale because computing the second-order derivatives with respect to all parameters is expensive.
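As a worked sketch, the SGA-adjusted gradient for the GAN game can be derived for a single sample with expectations dropped. This is a reconstruction from the definitions of the simultaneous gradient and the anti-symmetric part of the Hessian, and not necessarily the layout used by the authors. The shorthand $f_D = \partial f / \partial \theta_D$, $f_G = \partial f / \partial \theta_G$ and $f_{DG} = \partial^2 f / \partial \theta_D \partial \theta_G$ (a $|\theta_D| \times |\theta_G|$ matrix, with $f_{GD} = f_{DG}^\top$) is an assumption introduced here for brevity.

```latex
\begin{aligned}
g &= \begin{bmatrix} f_D \\ -f_G \end{bmatrix}, \qquad
H = \begin{bmatrix} f_{DD} & f_{DG} \\ -f_{GD} & -f_{GG} \end{bmatrix} \\
A &= \tfrac{1}{2}\left(H - H^\top\right)
   = \begin{bmatrix} 0 & f_{DG} \\ -f_{GD} & 0 \end{bmatrix} \\
g^* &= g + \lambda A^\top g
   = \begin{bmatrix} f_D + \lambda \, f_{DG} \, f_G \\ -f_G + \lambda \, f_{GD} \, f_D \end{bmatrix}
\end{aligned}
```

Each player's adjusted gradient thus picks up a mixed second-order term weighted by the other player's gradient, which is the interaction the text refers to.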
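We can explicitly compute the gradients for the discriminator and generator at $z'$ after one step of latent optimisation by differentiating $f(z')$ (where $z' = z + \Delta z$ from eq. 4):

$$\frac{d f(z')}{d \theta_D} = \frac{\partial f(z')}{\partial \theta_D} + \left(\frac{\partial \Delta z}{\partial \theta_D}\right)^{\!\top} \frac{\partial f(z')}{\partial \Delta z} = \frac{\partial f(z')}{\partial \theta_D} + \alpha \left(\frac{\partial^2 f(z)}{\partial z \, \partial \theta_D}\right)^{\!\top} \frac{\partial f(z')}{\partial z'}$$

$$\frac{d f(z')}{d \theta_G} = \frac{\partial f(z')}{\partial \theta_G} + \left(\frac{\partial \Delta z}{\partial \theta_G}\right)^{\!\top} \frac{\partial f(z')}{\partial \Delta z} = \frac{\partial f(z')}{\partial \theta_G} + \alpha \left(\frac{\partial^2 f(z)}{\partial z \, \partial \theta_G}\right)^{\!\top} \frac{\partial f(z')}{\partial z'}$$

In both equations, the first terms represent how $f(z')$ depends on the parameters directly and the second terms represent how $f(z')$ depends on the parameters via the optimised latent source. For the second equality, we substitute $\Delta z = \alpha \, \partial f(z) / \partial z$ as the gradient-based update of $z$ and use $\partial f(z') / \partial \Delta z = \partial f(z') / \partial z'$. Further differentiating $\Delta z$ results in the second-order terms $\partial^2 f(z) / \partial z \, \partial \theta_D$ and $\partial^2 f(z) / \partial z \, \partial \theta_G$. The original GAN’s gradient (eq. 3) does not include any second-order term, since $\Delta z = 0$ without latent optimisation. LOGAN computes these extra terms by automatic differentiation when back-propagating through the latent optimisation process (see Algorithm 1).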
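The split into a direct term and a term that flows through the latent step can be checked numerically. The sketch below assumes a hypothetical scalar toy game, $f(z) = \theta_D \tanh(\theta_G z)$, chosen only because its derivatives are easy to write by hand; all values are illustrative. It compares the sum of the two terms for $\theta_D$ against a finite-difference estimate of the gradient taken through the latent update.

```python
import math

def f(z, theta_d, theta_g):
    """Hypothetical scalar toy score f(z) = theta_d * tanh(theta_g * z)."""
    return theta_d * math.tanh(theta_g * z)

def df_dz(z, theta_d, theta_g):
    """Gradient of the toy score with respect to the latent z."""
    return theta_d * theta_g * (1.0 - math.tanh(theta_g * z) ** 2)

def f_after_latent_step(z, theta_d, theta_g, alpha):
    """f(z') with z' = z + alpha * df/dz, so z' itself depends on the parameters."""
    return f(z + alpha * df_dz(z, theta_d, theta_g), theta_d, theta_g)

def grad_theta_d_terms(z, theta_d, theta_g, alpha):
    """Return (direct term, term through the latent step) of d f(z') / d theta_d."""
    z_new = z + alpha * df_dz(z, theta_d, theta_g)
    direct = math.tanh(theta_g * z_new)  # z' held fixed
    d2f_dz_dtheta_d = theta_g * (1.0 - math.tanh(theta_g * z) ** 2)
    via_latent = alpha * d2f_dz_dtheta_d * df_dz(z_new, theta_d, theta_g)
    return direct, via_latent

z, theta_d, theta_g, alpha = 0.3, 0.7, 1.2, 0.5  # illustrative values
direct, via_latent = grad_theta_d_terms(z, theta_d, theta_g, alpha)

eps = 1e-5
finite_diff = (
    f_after_latent_step(z, theta_d + eps, theta_g, alpha)
    - f_after_latent_step(z, theta_d - eps, theta_g, alpha)
) / (2 * eps)
print(direct + via_latent, finite_diff)
```

Dropping `via_latent` corresponds to blocking back-propagation through the latent update, which leaves only the first-order term.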
The SGA updates in eq. 6 and the LOGAN updates in eq. 8 are strikingly similar, suggesting that the latent step used by LOGAN reduces the negative effects of cycling by introducing a symplectic gradient adjustment into the optimization procedure. The role of the latent step can be formalized in terms of a third player, whose goal is to help the generator (see appendix B for details).
Crucially, latent optimisation approximates SGA using only second-order derivatives with respect to the latent $z$ and parameters of the discriminator and generator separately. The second order terms involving parameters of both the discriminator and the generator – which are extremely expensive to compute – are not used. For latent $z$’s with dimensions typically used in GANs (e.g., 128–256, orders of magnitude less than the number of parameters), the second order terms can be computed efficiently. In short, latent optimisation efficiently couples the gradients of the discriminator and generator, as prescribed by SGA, but using the much lower-dimensional latent source which makes the adjustment scalable.
Appendix B.3 further shows that latent optimisation accelerates the speed of updating $D$ relative to the speed of updating $G$, facilitating convergence according to Heusel et al. (2017) (see also Figure 4 b). Intuitively, the generator requires less updating compared with $D$ to achieve the same reduction of loss because latent optimisation “helps” $G$. In summary, our analysis suggests that latent optimisation can improve GAN training dynamics by allowing larger single steps towards the direction of $\partial f(z) / \partial z$ without overshooting.
Our analysis suggests that LOGAN benefits from strong optimisers for updating $z$. In this work, we use natural gradient descent (NGD, Amari 1998) for latent optimisation. NGD is an approximate second-order optimisation method (Pascanu and Bengio, 2013; Martens, 2014), and has been applied successfully in many domains. By using the positive semi-definite (PSD) Gauss-Newton matrix to approximate the (possibly negative definite) Hessian, NGD often works even better than exact second-order methods. NGD is expensive in high dimensional parameter spaces, even with approximations (Martens, 2014). However, we demonstrate that it is efficient for latent optimisation, even in very large models.
Given the gradient of $z$, $g = \partial f(z) / \partial z$, NGD computes the update as

$$\Delta z = \alpha \, F^{-1} g$$

where the Fisher information matrix $F$ is defined as

$$F = \mathbb{E}_{p(t \mid z)}\left[\nabla \ln p(t \mid z) \, \nabla \ln p(t \mid z)^\top\right]$$

The log-likelihood function $\ln p(t \mid z)$ typically corresponds to commonly used error functions such as the cross entropy loss. This correspondence is not necessary when we interpret NGD as an approximate second-order method, as has long been done (Martens, 2014). Nevertheless, Appendix C provides a Poisson log-likelihood interpretation for the hinge loss commonly used in GANs (Lim and Ye, 2017; Tran et al., 2017). An important difference between latent optimisation and commonly seen scenarios using NGD is that the expectation over the condition ($z$) is absent. Since each $z$ is only responsible for generating one image, it only minimises the loss for this particular instance.
More specifically, we use the empirical Fisher $F'$ with Tikhonov damping, as in TONGA (Roux et al., 2008)

$$F' = g \, g^\top + \beta I$$

$F'$ is cheaper to compute compared with the full Fisher, since $g$ is already available. The damping factor $\beta$ regularises the step size, which is important when $F'$ only poorly approximates the Hessian or when the Hessian changes too much across the step. Using the Sherman-Morrison formula, the NGD update can be simplified into the following closed form:

$$\Delta z = \alpha \left( \frac{I}{\beta} - \frac{g \, g^\top}{\beta^2 + \beta \, g^\top g} \right) g = \frac{\alpha}{\beta + \lVert g \rVert^2} \, g$$

which does not involve any matrix inversion. Thus, NGD adapts the step size according to the curvature estimate. When $\beta$ is small, NGD normalises the gradient by its squared L2-norm, which is equivalent to whitening here since we assume all samples of $z$ are independent. Figure 4 a illustrates the scaling for different values of $\beta$. NGD automatically smooths the scale of updates by down-scaling the gradients as their norm grows, which also contributes to the smoothed norms of updates (Figure 4 b). Since the NGD update remains proportional to $g$, our analysis based on gradient descent in section 3 still holds.
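The closed form can be sanity-checked without inverting anything. The sketch below assumes the damped empirical Fisher $g g^\top + \beta I$ and uses illustrative values for the gradient, step size and damping. It computes the update as $\alpha g / (\beta + \lVert g \rVert^2)$ and then confirms that multiplying it by the damped Fisher gives back $\alpha g$. The function names are hypothetical.

```python
def ngd_update(g, alpha, beta):
    """Closed-form damped NGD step: alpha * g / (beta + ||g||^2)."""
    sq_norm = sum(gi * gi for gi in g)
    scale = alpha / (beta + sq_norm)
    return [scale * gi for gi in g]

def damped_fisher_times(g, beta, v):
    """Compute (g g^T + beta I) v without forming the matrix."""
    dot = sum(gi * vi for gi, vi in zip(g, v))
    return [gi * dot + beta * vi for gi, vi in zip(g, v)]

g = [3.0, 4.0]           # illustrative latent gradient, ||g||^2 = 25
alpha, beta = 0.9, 5.0   # illustrative step size and damping factor
delta_z = ngd_update(g, alpha, beta)

# If delta_z = alpha * F'^{-1} g, then F' delta_z must equal alpha * g.
residual = [lhs - alpha * gi
            for lhs, gi in zip(damped_fisher_times(g, beta, delta_z), g)]
print(delta_z, residual)
```

Because the update is a rescaling of `g`, its cost is linear in the latent dimension, and its norm shrinks as the gradient norm grows.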
We focus on large scale GANs based on BigGAN-deep (Brock et al., 2018) trained on $128 \times 128$ images from the ImageNet dataset (Deng et al., 2009). In Appendix E, we present results from applying our algorithm on Spectral Normalised GANs trained on the CIFAR dataset (Krizhevsky et al., 2009), which obtains state-of-the-art scores on this model.
We used the standard BigGAN-deep architecture with three minor modifications: 1. We increased the size of the latent source, to compensate for the randomness of the source lost when optimising $z$. 2. We use the uniform distribution $\mathcal{U}(-1, 1)$ instead of the standard normal distribution $\mathcal{N}(0, 1)$ for $p(z)$, to be consistent with the clipping operation (Algorithm 1). 3. We use leaky ReLU instead of ReLU as the non-linearity, for smoother gradient flow for $\partial f(z) / \partial z$.
Consistent with detailed findings in Brock et al. (2018), our experiment with this baseline model obtains only slightly better scores compared with those in Brock et al. (2018) (Table 1, see also Figure 8). We computed the FID and IS as in Brock et al. (2018), and computed IS values from checkpoints with the lowest FIDs. Finally, we computed the means and standard deviations for both measures from 5 models with different random seeds.
To apply latent optimisation, we use a damping factor $\beta$ combined with a large step size $\alpha$. As an additional way of damping, we only optimise a subset of $z$’s dimensions. Optimising the entire population of $z$ was unstable in our experiments. Similar to Wu et al. (2019), we found it was helpful to regularise the Euclidean norm of the weight-change $\Delta z$ with a weighted penalty. All other hyper-parameters, including learning rates and a large batch size of 2048, remain the same as in BigGAN-deep; we did not optimise these hyper-parameters. We call this model LOGAN (NGD).
Employing the same architecture and number of parameters as the BigGAN-deep baseline, LOGAN (NGD) achieved better FID and IS (Table 1). As observed by Brock et al. (2018), BigGAN training eventually collapsed in every experiment. Training with LOGAN also collapsed, perhaps due to higher-order dynamics beyond the scope we have analysed, but took significantly longer (600k steps versus 300k steps with BigGAN-deep).
During training, LOGAN was slower per step compared with BigGAN-deep because of the additional forward and backward pass. We found that optimising $z$ during evaluation did not improve sample scores (even up to 10 steps), so we do not optimise $z$ for evaluation. Therefore, LOGAN has the same evaluation cost as the original BigGAN-deep. To help understand this behaviour, we plot the change from $\Delta z$ during training in Figure 5 a. Although the movement in Euclidean space grew until training collapsed, the movement in $D$’s output space, measured as $f(z + \Delta z) - f(z)$, remained unchanged (see Appendix D for details). As shown in our analysis, optimising $z$ improves the training dynamics, so LOGANs work well after training without requiring latent optimisation.
We verify our theoretical analysis in section 3 by examining key components of Algorithm 1 via ablation studies. First, we experiment with using basic GD to optimise $z$, as in Wu et al. (2019), and call this model LOGAN (GD). A smaller step size was required; larger values were unstable and led to premature collapse of training. As shown in Table 1, the scores from LOGAN (GD) were worse than LOGAN (NGD) and similar to the baseline model.
We then evaluate the effects of removing the terms in eq. 8 that arise from the latent update $\Delta z$, which are not in the ordinary gradient (eq. 3). Since we computed these terms by back-propagating through the latent optimisation procedure, we removed them by selectively blocking back-propagation with “stop_gradient” operations (e.g., in TensorFlow; Abadi et al., 2016). Figure 5 b shows the change of FIDs for the three models corresponding to removing either one of the two terms and removing both terms. As predicted by our analysis (section 3), both terms help stabilise training; training diverged early for all three ablations.
Truncation is a technique introduced by Brock et al. (2018) to illustrate the trade-off between the FID and IS in a trained model. For a model trained with $z \sim p(z)$ from a source distribution symmetric around $0$, such as the standard normal distribution $\mathcal{N}(0, 1)$ and the uniform distribution $\mathcal{U}(-1, 1)$, down-scaling (truncating) the source $\bar{z} = s \cdot z$ with $0 \le s \le 1$ gives samples with higher visual quality but reduced diversity. We see this quantified in higher IS scores and lower FID when evaluating samples from truncated distributions.
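Truncation by down-scaling, as described above, amounts to multiplying each latent vector by a factor between 0 and 1 before it is passed to the generator. A minimal sketch, assuming latents drawn uniformly from $[-1, 1]$ and a hypothetical helper name:

```python
import random

def truncate_latents(latents, scale):
    """Down-scale each latent vector by `scale` in [0, 1]; scale = 1 leaves it unchanged."""
    if not 0.0 <= scale <= 1.0:
        raise ValueError("scale must lie in [0, 1]")
    return [[scale * zi for zi in z] for z in latents]

rng = random.Random(0)
# Three illustrative 4-dimensional latents from a symmetric uniform source.
latents = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(3)]
truncated = truncate_latents(latents, 0.5)
print(truncated[0])
```

Sweeping `scale` from 1 towards 0 and scoring the generated samples at each value would trace out one truncation curve.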
Figure 3 b plots the truncation curves for the baseline BigGAN-deep model, LOGAN (GD) and LOGAN (NGD), obtained by varying the truncation factor from no truncation (upper-left ends of the curves) to extreme truncation (bottom-right ends). Each curve shows the trade-off between FID and IS for an individual model; curves towards the upper-right corner indicate better overall sample quality. The relative positions of curves in Figure 3 b show that LOGAN (NGD) has the best sample quality. Interestingly, although LOGAN (GD) and the baseline model have similar scores without truncation (upper-left ends of the curves, see also Table 1), LOGAN (GD) was better behaved with increasing truncation, suggesting LOGAN (GD) still converged to a better equilibrium. For further reference, we plot truncation curves from additional baseline models in Figure 8.
Figure 1 and Figure 2 show samples from chosen points on the truncation curves. In the high IS domain, C and D on the truncation curves both have similarly high IS of near 260. Samples from batches with such high IS have almost photo-realistic image quality. Figure 1 shows that while the baseline model produced nearly uniform samples, LOGAN (NGD) could still generate highly diverse samples. On the other hand, A and B from Figure 3 b have similarly low FID of near 5, indicating high sample diversity. Samples in Figure 2 b show higher quality compared with those in Figure 2 a (e.g., the interfaces between the elephants and ground, the contours around the pandas).
In this work, we present the LOGAN model which significantly improves the state-of-the-art on large scale GAN training for image generation by optimising the latent source $z$. Our results illustrate improvements in quantitative evaluation and samples with higher quality and diversity. Moreover, our analysis suggests that LOGAN fundamentally improves adversarial training dynamics. We therefore expect our method to be useful in other tasks that involve adversarial training, including representation learning and inference (Donahue et al., 2017; Dumoulin et al., 2017), style learning (Zhu et al., 2017; Karras et al., 2019), audio generation (Donahue et al., 2018) and video generation (Vondrick et al., 2016; Clark et al., 2019).
Abadi et al. (2016). TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
Konda and Borkar (1999). Actor-critic–type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization 38 (1), pp. 94–123.
Zhu et al. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.
In this section we present three complementary analyses of LOGAN. In particular, we show how the algorithm brings together ideas from symplectic gradient adjustment, unrolled GANs and stochastic approximation with two time scales.
To analyse LOGAN as a differentiable game we treat the latent step $\Delta z$ as adding a third player to the original game played by the discriminator and generator. The third player’s parameter, $\Delta z$, is optimised online for each $z \sim p(z)$. Together the three players (latent player, discriminator, and generator) have losses averaged over a batch of samples:
where $\eta = 1/N$ ($N$ is the batch size) reflects the fact that each $\Delta z$ is only optimised for a single sample $z$, so its contribution to the total loss across a batch is small compared with $\theta_D$ and $\theta_G$ which are directly optimised for batch losses. This choice of $\eta$ is essential for the following derivation, and has important practical implications. It means that the per-sample loss, instead of the loss summed over a batch, should be the only loss function guiding latent optimisation. Therefore, when using natural gradient descent (Section 4), the Fisher information matrix should only be computed using the current sample $z$.
The resulting simultaneous gradient is
Following Balduzzi et al. (2018), we can write the Hessian of the game as:
The presence of a non-zero anti-symmetric component in the Hessian
implies the dynamics have a rotational component which can cause cycling or slow down convergence. Since for typical batch sizes (e.g., for DCGAN and for BigGAN-deep), we abbreviate to simplify notations.
Symplectic gradient adjustment (SGA) counteracts the rotational force by adding an adjustment term to the gradient to obtain , which for the discriminator and generator has the form:
The gradient with respect to is ignored since the convergence of training only depends on and .
Because of the third player, there are still the terms depend on to adjust the gradients. Efficiently computing and is non-trivial (e.g., Pearlmutter 1994). However, if we introduce the local approximation
then the adjusted gradient becomes identical to eq. 8 from latent optimisation.
In other words, automatic differentiation by commonly used machine learning packages can compute the adjusted gradient for and when back-propagating through the latent optimisation process. Despite the approximation involved in this analysis, both our experiments in section 5 and the results from Wu et al. (2019) verified that latent optimisation can significantly improve GAN training.
Latent optimisation can be seen as unrolling GANs (Metz et al., 2016) in the space of the latent, rather than the parameters. Unrolling in the latent space has the advantages that:
LOGAN is more scalable than Unrolled GANs because it avoids second-order derivatives over a potentially very large number of parameters.
We next formally present this connection by showing that SGA can be seen as approximating Unrolled GANs (Metz et al., 2016). For the update , we have the Taylor expansion approximation at :
Substitute , and take the derivatives with respect to on both sides:
which is the same as eq. 18 (taking the negative sign). Compared with the exact gradient from the unroll:
The approximation in eq. 23 comes from using and as a result of the linear approximation.
At this point, unrolling update only affects . Although it is expensive to unroll both and , in principle, we can unroll update and compute the gradient of similarly using :
which gives us the same update rule as SGA (eq. 17). This correspondence based on first order Taylor expansion is unsurprising, as SGA is based on linearising the adversarial dynamics (Balduzzi et al., 2018).
Heusel et al. (2017) used the theory of stochastic approximation to analyse GAN training. Viewing the training process as stochastic approximation with two time scales (Borkar, 1997; Konda and Borkar, 1999), they suggest that the update of $D$ should be fast enough compared with that of $G$. Under mild assumptions, Heusel et al. (2017) proved that such a two time-scale update converges to a local Nash equilibrium. Their analysis follows the idea of perturbation (Hirsch, 1989), where the slow updates ($G$) are interpreted as a small perturbation over the ODE describing the fast update ($D$). Importantly, the size of perturbation is measured by the magnitude of parameter change, which is affected by both the learning rate and gradients.
Here we show, in accordance with Heusel et al. (2017), that LOGAN accelerates discriminator updates and slows down generator updates, thus helping the convergence of discriminator. We start from analysing the change of . We assume that, without LO, it takes to make a small constant amount of reduction in loss :
Now using the optimised , we assess the change required to achieve the same amount of reduction:
Intuitively, when “helps” to achieve the same goal of increasing by , the responsible of becomes smaller, so it does not need to change as much as , thus .
Formally, and have the following Taylor expansions around and :
where . Since in gradient descent (eq. 3),
Therefore, we have the inequality
If we further assume and
are obtained from stochastic gradient descent with identical learning rate,
The same analysis applies to the discriminator. The similar intuition is that it takes the discriminator additional effort to compensate the exploitation from the optimised . We then obtain
However, since the adversarial loss , we have and taking the opposite signs of eq.33. For sufficiently small , and , is close to zero, so under our assumptions of small , and .
Importantly, the bigger the product is, the more robust the inequality is to the error from . Moreover, bigger steps increase the speed gap between updating D and G, further facilitating convergence (in accordance with Heusel et al. (2017)). Overall, our analysis suggests:
More than one gradient descent step may not be helpful, since $\Delta z$ from multiple GD steps may deviate from the direction of $\partial f(z) / \partial z$.
A large step of $\Delta z$ is helpful in facilitating convergence by widening the gap between D and G updates (Heusel et al., 2017).
However, the step of $\Delta z$ cannot be too large. In addition to the linear approximation we used throughout our analysis, the approximate SGA breaks down when eq. 21 is strongly violated, i.e. when an “overshoot” brings the gradient $\partial f(z') / \partial z'$ to the opposite sign of $\partial f(z) / \partial z$.
Here we provide a probabilistic interpretation of the hinge loss for the generator, which leads naturally to the scenario of a family of discriminators. Although this interpretation is not necessary for our current algorithm, it may provide useful guidance for incorporating multiple discriminators.
We introduce the label $t = 1$ for real data and $t = 0$ for fake samples. This section shows that the generator hinge loss

$$L_G = -D(G(z))$$

can be interpreted as a negative log-likelihood function:

$$L_G = -\ln p(t = 1; z, D, G)$$

Here $p(t = 1; z, D, G)$ is the probability that the generated image $G(z)$ can fool the discriminator $D$.
The original GAN’s discriminator can be interpreted as outputting a Bernoulli distribution. In this case, if we parameterise , the generator loss is the negative log-likelihood
Bernoulli, however, is not the only valid choice as the discriminator’s output distribution. Instead of sampling “1” or “0”, we assume that there are many identical discriminators that can independently vote to reject an input sample as fake. The number of votes $k$ in a given interval can be described by a Poisson distribution with parameter $\lambda$ with the following PMF:

$$p(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$

The probability that a generated image can fool all the discriminators is the probability of receiving no vote for rejection

$$p(k = 0; \lambda) = e^{-\lambda}$$

Therefore, we have the following negative log-likelihood as the generator loss if we parameterise $\lambda = -D(G(z))$:

$$-\ln p(k = 0; \lambda) = \lambda = -D(G(z))$$

This interpretation has a caveat that when $D(G(z)) > 0$ the Poisson distribution is not well defined. However, in general the discriminator’s hinge loss

$$L_D = \max\left(0, 1 - D(x)\right) + \max\left(0, 1 + D(G(z))\right)$$

pushes $D(G(z)) < 0$ via training.
For a temporal sequence (e.g. the changes of or
at each training step in this paper), to normalise its variance while accounting for the non-stationarity, we process it as follows. We first compute the moving average and standard deviation over a window of size:
Then normalise the sequence as:
The result in Figure 5a is robust to the choice of window size. Our experiments with from to yielded visually similar plots.
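The normalisation described in this appendix can be sketched as a rolling z-score. The exact formulas are not preserved in the text above, so the details here are assumptions: a trailing window ending at each point, the population standard deviation, and a hypothetical function name.

```python
import math

def rolling_normalise(seq, window):
    """Normalise each point by the mean and std of the trailing `window` samples ending at it."""
    out = []
    for t in range(window - 1, len(seq)):
        chunk = seq[t - window + 1 : t + 1]
        mean = sum(chunk) / window
        var = sum((x - mean) ** 2 for x in chunk) / window
        std = math.sqrt(var)
        # A constant window has zero spread; report 0 instead of dividing by zero.
        out.append((seq[t] - mean) / std if std > 0 else 0.0)
    return out

seq = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # illustrative, steadily growing sequence
normalised = rolling_normalise(seq, window=3)
print(normalised)
```

For a steadily growing sequence every normalised value is the same, which shows how the procedure removes a non-stationary trend while keeping fluctuations relative to the local scale.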
To test if latent optimisation works with models at more moderate scales, we apply it to SN-GANs (Miyato et al., 2018). Although our experiments on this model are less thorough than in the main paper with BigGAN-deep, we hope to provide basic guidelines for researchers interested in applying latent optimisation on smaller models.
The experiments follow the same basic setup and hyper-parameter settings as the CS-GAN in Wu et al. (2019). There is no class conditioning in this model. For NGD, we found that a smaller damping factor and a regulariser weight chosen for this model (other parameters are the same as in Wu et al. 2019), combined with optimising a different proportion of the latent source than for BigGAN-deep, worked best for SN-GANs.
In addition, we found running extra latent optimisation steps benefited evaluation, so we use ten steps of latent optimisation in evaluation for results in this section, although the models were still trained with a single optimisation step. We reckon that smaller models might not be “over-parametrised” enough to fully amortise the computation from optimising $z$, which can then further exploit the architecture in evaluation time. On the other hand, the overhead from running multiple iterations of latent optimisation is relatively small at this scale. We aim to further investigate this difference in future studies.
Table 2 shows the FID and IS alongside SN-GAN and CS-GAN which used the same architecture. Here we observe similarly significant improvement over the baseline SN-GAN model in both IS and FID. Figure 9 shows random samples from these two models. Overall, samples from LOGAN (NGD) have higher contrast and sharper contours.