1 Introduction
Generative Adversarial Networks (GANs) [6] are one of the most common tools for learning complex distributions. However, the original GAN is tricky to train and often suffers from problems such as mode collapse [6] and nonconvergence. When mode collapse occurs, the generator maps different inputs to the same output.
In this paper, we view GAN training as a continual learning problem and show that catastrophic forgetting is present in GANs. Catastrophic forgetting in neural networks [5]
is the phenomenon where learning a new skill catastrophically damages performance on previously learned skills. In the machine learning literature, mode collapse and catastrophic forgetting are usually studied independently. A number of methods have been proposed to address mode collapse
[1, 7, 18, 22] and nonconvergence [8, 13, 22] in GANs, and catastrophic forgetting [3, 10, 21, 23] in neural networks, as independent problems. We show that catastrophic forgetting and mode collapse are interrelated, and that their combined effect can make training nonconvergent. Our view allows the application of continual learning techniques to GANs. Experiments on synthetic and MNIST datasets confirm that continual learning helps GANs converge to better equilibria, i.e., equilibria with less mode collapse. Our contributions are:

We propose a novel view of GANs training as a continual learning problem.

We establish a sufficient condition for GANs to converge. The condition explains the effectiveness of a number of methods in stabilizing GANs.

We show how catastrophic forgetting can make the discriminator violate our convergence condition, making the training of GANs nonconvergent.

We propose a way to leverage continual learning algorithms in training GANs to avoid catastrophic forgetting and improve convergence.
2 Preliminaries: Training Dynamics of GANs in the Data Space
Several previous works, such as [13, 14], analyzed the convergence of GANs in the parameter space with simplified models where the generator and the discriminator are linear, single-parameter functions. In this paper, we take a different approach and examine convergence in the data space. This allows us to work with more complicated networks of higher capacity.
The training of GANs only takes a finite number of iterations in practice. The set of noise vectors, therefore, is a finite set. Let
G be the generator and D be the discriminator. We denote the set of noise vectors as Z, the set of generated samples as G(Z), and the set of real samples as X_r. Let x = G(z) be a fake sample and g be the gradient of the discriminator's loss w.r.t. x. Note that when the gradient backpropagates from D to x, it only passes through the discriminator; the generator has not been involved yet. g only depends on the discriminator's parameters, not the generator's parameters. Updating the generator using SGD with a small enough learning rate moves x in the direction of -g, by a distance proportional to ||g||. In Fig. 1, the opposite of the gradient at each datapoint is shown by an arrow. If the discriminator is fixed, then updating the generator with SGD moves x along the associated integral curve in the direction of the arrows and increases the value of D(x). As fake samples should converge to real samples from the target distribution, integral curves should converge to real samples.
In practice, gradients are averaged over a minibatch so an individual fake sample in the minibatch does not move along its integral curve. However, if the generator has large enough capacity, it can move each fake sample along a path which is close to the integral curve. For simplicity, we assume that minibatches of size 1 are used in training and updating the generator with the gradient from a fake sample does not affect other fake samples.
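The update described above can be sketched numerically. This is a minimal toy illustration, not the paper's setup: the quadratic score function `D`, the real datapoint `x_r`, and the learning rate are all hypothetical, chosen only to show that one SGD step moves a fake sample in the direction of the score gradient, by a distance proportional to the gradient's norm.

```python
import numpy as np

def numeric_grad(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at point x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x); d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

# Hypothetical discriminator score: highest at the real datapoint x_r = (1, 0).
x_r = np.array([1.0, 0.0])
D = lambda x: -float(np.sum((x - x_r) ** 2))

# With D fixed, an SGD update on the generator moves a fake sample x_f
# along the gradient of its score, by a distance proportional to its norm.
x_f = np.array([-1.0, 0.5])
lr = 0.1
g = numeric_grad(D, x_f)
x_f_new = x_f + lr * g
assert D(x_f_new) > D(x_f)  # the update increases the fake sample's score
```

Iterating this step traces out (an approximation of) the integral curve of the gradient field, which is the picture drawn by the arrows in Fig. 1.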
3 Catastrophic Forgetting in GANs
In GANs, there are two scenarios where catastrophic forgetting can hurt the generative performance: (1) GANs are used to learn a set of distributions p_1, ..., p_T that are introduced sequentially; and (2) GANs are used to learn a single distribution p_r.
The first scenario is a standard continual learning problem. The discriminator D_t at task t does not have access to the distributions p_1, ..., p_{t-1}. D_t, thus, forgets about previous target distributions and cannot teach the generator to generate samples from these distributions. In order to maximally deceive D_t, the generator G_t only produces samples from p_t. This setting was studied in [20], which used the Elastic Weight Consolidation (EWC) algorithm [10] to protect important information about old target distributions. With EWC, at task t, D_t still remembers old target distributions, so G_t cannot fool D_t by generating samples from p_t only and must generate samples from all target distributions.
The second scenario is the main subject of this paper. It can be viewed as a continual learning problem in which the discriminator has to discriminate a sequence of model distributions from the target distribution p_r, and the generator has to deceive a sequence of discriminators. At step t, the discriminator D_t still has access to samples from the target distribution p_r, but it cannot access samples from previous model distributions. As a result, D_t is biased toward discriminating the current model distribution from the target distribution, forgetting previous model distributions. Furthermore, D_t is likely to focus on separating fake samples from nearby real samples, ignoring distant real samples (blue box in Fig. 1). In Fig. 1, the discriminator assigns higher scores to datapoints that are further away from fake samples, regardless of the true labels of these points. Datapoints to the right of the red boxes in Fig. 1 have higher scores than real datapoints elsewhere in the plane. Moving from one panel of Fig. 1 to the next, the direction of almost all vectors changes when fake samples move, suggesting that the discriminator has little memory of past (real and fake) samples.
Gradients w.r.t. real samples that the discriminator has forgotten point in wrong directions (top red box in Fig. 1) or have small norms (blue boxes in Fig. 1). This implies that the discriminator overemphasizes the current fake samples and nearby real samples while lowering the importance of distant real samples. Note that distant real samples are present in every minibatch, as large minibatches of size 256 were used. Catastrophic forgetting happens not only because old data are no longer accessible, but also because of the way the network distributes its capacity (see Section 5.3 for possible solutions).
If the generator has mode collapse, then the discriminator only focuses on the small set of modes of p_r that are close to the samples produced by the generator. That worsens catastrophic forgetting in the discriminator, making it focus its capacity on a smaller region of the space and forget other regions, i.e., assign wrong values to datapoints in these regions: real datapoints in the blue box in Fig. 1 receive low scores while fake datapoints in the black box receive high scores. The generator could also fool the discriminator by turning back to an old state which the discriminator has forgotten. When this happens, the two networks can fall into a loop and never settle at an equilibrium. The vector field/discriminator and the generated distribution/generator at iteration 20000 (Fig. 1) are very similar to those at iteration 3000 (Fig. 1), implying that the networks have completed cycle(s) of the learning loop. For the experiment in the first row of Fig. 1, the loop continues for many cycles without any sign of breaking.
The generator and the discriminator are 2-hidden-layer MLPs with 64 hidden neurons. Unless specified otherwise, SGD with a learning rate of 0.03 was used for both the generator and the discriminator.
Figure 1: (a)-(c) The vector field of standard GAN at iterations 3000, 3500, and 20000. (d)-(f) The averaged vector field at iterations 3000, 3500, and 20000. (g)-(i) Dense EWCGAN at iterations 5000, 10000, and 20000. (j)-(k) GAN + Adam optimizer at iterations 400 and 600. (l) GAN + TTUR at iteration 10000; figure reproduced with permission from [22].
4 When do GANs converge?
4.1 A sufficient condition for convergence
Consider the gradient field generated by a converged discriminator in Fig. 1. Gradients w.r.t. datapoints near a real datapoint point toward that real datapoint. We call such a point a sink of the discriminator. Sinks are local maxima of the discriminator's function. If a fake sample falls into the basin of attraction of a sink, it will be attracted toward that sink.
In Fig. 1, real datapoints are sinks of the discriminator. If the discriminator is fixed (and therefore, the vector field), then a generator trained to maximize the score of fake samples will converge to some of these sinks (real datapoints).
Proposition 1 (Sufficient condition for convergence).
Given a GAN with generator G and discriminator D, if D has sinks at fixed locations, fake samples are located in the basins of attraction of these sinks, and G has large enough capacity that it can move any fake datapoint independently of all other fake datapoints, then G and D converge to an equilibrium.
Proof.
See Appendix A. ∎
We now analyze how likely the condition is to be satisfied in practice. (1) The discriminator has sinks at fixed locations: this requirement is the hardest to satisfy. As shown in Section 4.2, catastrophic forgetting tends to remove local maxima/sinks from the discriminator, making GANs unable to converge. The locations of sinks could also change during training. Because the goal of the generator is to generate samples from the target distribution, we want the sinks to be real samples. In Section 5, we introduce and compare a number of ways to make real samples fixed sinks. (2) Fake samples are located near sinks: as the generator tries to improve fake samples' scores, it moves them toward local maxima/sinks of the discriminator. Therefore, fake samples are likely to be in the basins of attraction of some sinks (if they exist) as training progresses. (3) The generator has enough capacity to move any fake sample independently of all other fake samples: such a generator can be created by associating a set of parameters with each fake sample. That generator, in turn, can be approximated using a 1-hidden-layer MLP (Appendix C). In practice, generators are large deep neural networks which usually have enough capacity to move distant fake samples along independent paths. To make GANs satisfy our condition, the key is to create fixed sinks.
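The roles of sinks and basins of attraction can be illustrated numerically. This is a toy 1-D sketch, not the paper's experiment: the discriminator `D` below is a hypothetical function with two local maxima (sinks) at -1 and +1, and a fake sample driven by gradient ascent converges to whichever sink's basin contains it.

```python
import numpy as np

# Hypothetical discriminator with two sinks (local maxima) at x = -1 and x = +1.
D = lambda x: -(x**2 - 1.0)**2
dD = lambda x: -4.0 * x * (x**2 - 1.0)   # analytic derivative of D

def ascend(x_f, lr=0.05, steps=500):
    """Move a fake sample along the discriminator's gradient (the G update)."""
    for _ in range(steps):
        x_f = x_f + lr * dD(x_f)
    return x_f

# A fake sample converges to the sink whose basin of attraction contains it.
assert abs(ascend(0.3) - 1.0) < 1e-3    # right basin -> sink at +1
assert abs(ascend(-0.3) + 1.0) < 1e-3   # left basin  -> sink at -1
```

If the sinks sit at real datapoints and stay fixed, this is exactly the convergent behavior Proposition 1 describes; if forgetting removes or moves the maxima, the ascent has no fixed target.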
4.2 Catastrophic forgetting in the Dirac GAN
Figure 2: Dirac GAN with a multilayer discriminator and Leaky ReLU activation function. The blue line represents the discriminator's function. The real datapoint
and the fake datapoint are shown by blue and red dots. (a)-(d): Dirac GAN trained with two fake samples: an old fake sample on the left and the current fake sample on the right. (e): empirically optimal Dirac discriminator trained on one fake sample.
To see the effect of catastrophic forgetting on GANs and to motivate our solution, let us consider the Dirac GAN of Mescheder et al. [13]. Dirac GAN learns a 1-dimensional Dirac distribution located at the origin. The discriminator and the generator are Lipschitz linear functions: D(x) = ψx, G(z) = θ. The training objective is defined as
L(θ, ψ) = f(ψθ) + f(0)    (1)
where f is a real-valued function; f(t) = -log(1 + e^{-t}) for the original GAN, and f(t) = t for the Wasserstein GAN [1]. Dirac GAN is simple to analyze, and results for Dirac GAN generalize well to more complicated GANs in higher dimensional spaces [13]. The original Dirac GAN cannot satisfy the condition in Proposition 1 because the discriminator is always a monotonic function with no local extrema. We consider here Dirac GANs with high capacity multilayer discriminators.
4.2.1 Catastrophic forgetting in high capacity Dirac GAN
Consider a high capacity multilayer discriminator built from linear layers and a monotonic activation function such as ReLU, Leaky ReLU, Sigmoid, or Tanh. This discriminator can have local extrema (see, for example, Fig. 2(b)). However, if the training dataset contains only 1 fake datapoint, the empirically optimal discriminator D* will be a monotonic function (see Fig. 2(e) for an example and Appendix B for the calculation). D* uses all of its capacity to maximize the difference between its outputs on the real and the fake datapoint. Optimizing performance on the current fake distribution pushes the discriminator toward D* and tends to make it a monotonic function. If the discriminator is fixed and monotonic, the generator will move the fake datapoint past the real datapoint and diverge to infinity. When alternating gradient descent is used, forgetting makes the generator and the discriminator oscillate around the equilibrium (Fig. 5 in Appendix D). For Dirac GAN to converge, the discriminator must have high performance on old fake distributions, i.e., it must not catastrophically forget old fake distributions.
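The oscillation of the linear Dirac GAN under alternating gradient descent can be reproduced in a few lines. This is a sketch of the Wasserstein case f(t) = t (the learning rate and initial values are illustrative): D(x) = ψx, G(z) = θ, real datapoint at 0, and the pair (θ, ψ) circles the equilibrium (0, 0) instead of converging to it.

```python
import numpy as np

# Dirac WGAN sketch: D(x) = psi * x, G(z) = theta, real datapoint at 0.
# Alternating ascent: D on D(0) - D(theta) = -psi*theta, then G on psi*theta.
theta, psi, lr = 1.0, 0.5, 0.1
r0 = theta**2 + psi**2
for _ in range(5000):
    psi = psi - lr * theta      # discriminator update
    theta = theta + lr * psi    # generator update (uses the new psi)
r = theta**2 + psi**2
# The pair oscillates around the equilibrium (0, 0): its distance from the
# origin stays bounded but never shrinks to zero.
assert 0.5 * r0 < r < 2.0 * r0
```

The alternating update is a fixed linear map with determinant 1 and complex eigenvalues on the unit circle, so orbits are bounded closed curves around the equilibrium, matching the oscillation reported in Fig. 5 of Appendix D.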
Catastrophic forgetting removes extrema from the Dirac discriminator. The same phenomenon is observed for higher dimensional discriminators. There are no sinks in the vector fields in Fig. 1. Because the discriminator does not remember that datapoints to the right of the red boxes in Fig. 1 are fake, it focuses its capacity on the current fake datapoints and assigns higher scores to datapoints that are further away from the current fake datapoints. Moving from left to right across the red boxes, scores of datapoints monotonically increase. Catastrophic forgetting prevents the discriminator from making real datapoints its local maxima.
To check whether a real MNIST image x is at or near a local maximum, we compute the gradient of the discriminator w.r.t. the image, g = ∇_x D(x), and visualize the function f(α) = D(x + αg). If x is at or near a local maximum, then f will have a local maximum at or near α = 0. Otherwise, f will have no local maxima. The result is shown in Fig. 3. For most images, f is saturated with a value close to 1. For some other images, f has a very sharp peak at or near α = 0, implying that the discriminator overfits to these images and does not generalize well. The result suggests that the discriminator forgets most of the real images and overfits to others. Because images in the same class are similar, forgetting occurs at the class level: in Fig. 3, only some classes have local maxima, and the generator can only generate samples from these classes. Catastrophic forgetting and mode collapse are interrelated.
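This probing procedure can be sketched with a toy discriminator. The 2-D Gaussian bump `D` below is hypothetical (a stand-in for a trained network): we score D along the line x + αg, and the resulting 1-D slice peaks near α = 0 exactly when x sits near a local maximum.

```python
import numpy as np

def numeric_grad(D, x, eps=1e-5):
    """Central-difference gradient of D at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x); d[i] = eps
        g[i] = (D(x + d) - D(x - d)) / (2 * eps)
    return g

def probe(D, x, alphas):
    """Score D along the line x + alpha * g, with g = grad D(x), unit-normalized."""
    g = numeric_grad(D, x)
    g = g / (np.linalg.norm(g) + 1e-12)
    return np.array([D(x + a * g) for a in alphas])

# Hypothetical discriminator with a local maximum at the "real image" (1, 0).
x_r = np.array([1.0, 0.0])
D = lambda x: float(np.exp(-np.sum((x - x_r) ** 2)))

alphas = np.linspace(-1.0, 1.0, 201)
f = probe(D, x_r + 0.05, alphas)        # probe a point near the real datapoint
# The slice peaks in the interior, close to alpha = 0.
assert np.argmax(f) not in (0, len(f) - 1)
assert abs(alphas[np.argmax(f)]) < 0.2
```

For a forgotten image, the analogous slice is monotone or saturated, with its maximum at an endpoint of the probed interval.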


4.2.2 Old data helps GANs to converge
To alleviate catastrophic forgetting, one solution is to keep fake samples from previous training iterations and reintroduce them to the current discriminator [17]. The effect of using old fake samples on Dirac GAN is shown in Figs. 2(a)-2(d). When there are two fake samples on different sides of the real sample, the optimal discriminator must have a local maximum at the real datapoint. (Ideally, the discriminator should have high performance on all old fake distributions. However, enforcing that is nontrivial, and for Dirac GAN, one old fake distribution is enough for convergence.) The generator and the discriminator converge to an equilibrium where the generator can generate the real datapoint.
Although this method improves the convergence of GANs by creating local maxima, it requires additional memory to keep fake samples from previous iterations. For one dimensional data, a single old fake datapoint on the other side of the real datapoint is enough. For the two dimensional data in Fig. 1, fake datapoints surrounding a real datapoint are required to make the real datapoint a local maximum. Because of the curse of dimensionality, the number of required fake samples grows exponentially with the dimensionality of a sample. This method, therefore, is not effective for high dimensional data. The experiment in Fig. 6 in Appendix E confirms that this method has little effect on GANs trained on MNIST. In Section 5, we show how to use continual learning techniques to achieve results similar to storing information about all previous fake distributions, without storing any old fake samples.
5 Improving GANs with Continual Learning
5.1 Method
The discriminator D_t at step t is likely to be the best discriminator for separating the model distribution at step t from the target distribution. The ensemble of D_1, ..., D_t will likely have good performance on all model distributions seen so far. We would like to detect parameters that are crucial to the performance of the discriminator on the t-th task and transfer that knowledge to subsequent discriminators, creating an approximation of the ensemble.
Continual learning algorithms such as EWC [10] prevent parameters that are important to a task from deviating too far from their optimal values. For each old task
i, a regularization term of the following form is added to the current loss function:
(λ_i / 2) Σ_j F_{i,j} (θ_j - θ*_{i,j})², where F_i is the Fisher information matrix calculated at the end of task i, θ is the vector of current parameters, θ*_i is task i's optimal parameter vector, and λ_i controls the relative importance of task i to the current task. Naive application of EWC to GANs requires the computation of a Fisher information matrix at every iteration, and the number of regularization terms equals the number of iterations. The online EWC algorithm [19] removes the need for multiple regularization terms by keeping only a running sum of Fisher information matrices and the last optimal parameters; its regularizer takes the form (λ/2) Σ_j F̃_j (θ_j - θ*_j)². We apply online EWC to GANs. We also note that very old fake samples have lower quality than more recent ones and could add noise to the training. A simple solution is to exponentially forget old samples; in our method, this is implemented by exponentially decaying old Fisher information matrices. Because consecutive discriminators are similar and tend to forget the same set of samples, we sample a discriminator every K iterations, which improves the diversity of the ensemble and reduces computation. We have the following loss for the discriminator at iteration t:
L̂_D = L_D + (λ/2) Σ_j F̃_j (θ_j - θ*_j)²,  with  F̃ ← α F̃ + F_t    (2)
where 0 < α ≤ 1 and λ > 0. F_t is computed every K iterations, using the real and fake samples seen in iterations t - K to t. K controls the diversity of discriminators in the ensemble. α controls how fast old Fisher matrices are forgotten: the smaller α is, the faster old information is forgotten. λ controls the balance between the current task (minibatch) and old tasks. L_D could be any standard loss function such as the cross entropy or Wasserstein loss.
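The bookkeeping behind this regularizer fits in a few lines. This is an illustrative numpy sketch, not the paper's implementation: it uses a diagonal Fisher estimate (squared gradients averaged over recent samples), and the names `update_fisher`, `ewc_penalty`, and the constants are ours.

```python
import numpy as np

def update_fisher(F_running, grads, alpha=0.95):
    """Exponentially decay the old Fisher info and add the latest estimate.

    F_running: running diagonal Fisher matrix; grads: per-sample gradients of
    the discriminator loss from the last K iterations; alpha < 1 forgets old
    fake distributions faster.
    """
    F_new = np.mean(np.square(grads), axis=0)   # diagonal Fisher estimate
    return alpha * F_running + F_new

def ewc_penalty(theta, theta_star, F_running, lam=1.0):
    """EWC-style regularizer: (lam/2) * sum_j F_j * (theta_j - theta*_j)^2."""
    return 0.5 * lam * float(np.sum(F_running * (theta - theta_star) ** 2))

# Usage sketch: every K iterations, refresh F and the anchor parameters
# theta_star, then add ewc_penalty(...) to the discriminator's loss.
F = np.zeros(3)
F = update_fisher(F, np.array([[1.0, 0.0, 2.0], [1.0, 2.0, 0.0]]))
p = ewc_penalty(np.array([1.0, 1.0, 1.0]), np.zeros(3), F, lam=2.0)
```

Parameters with large accumulated Fisher values (important to past discriminators) are pulled back toward their anchors, while unimportant ones remain free to fit the current minibatch.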
5.2 Ensembles of discriminators create sinks
Consider a sequence of discriminators D_1, ..., D_t. Let D̄_t be the discriminator whose output is the running average of the outputs of the discriminators in the sequence:
D̄_t(x) = (1/t) Σ_{i=1}^{t} D_i(x)    (3)
The vector field generated by D̄_t is the running average of the vector fields at different time steps. The averaged vector field is shown in Fig. 1. To reduce computation, the average is taken every 100 iterations. The averaged vector field is robust to changes in the generated distribution. It has a much nicer pattern than the vector field of an individual discriminator: gradients w.r.t. datapoints inside the circle point toward real datapoints.
Because the discriminator always has access to the real dataset and it maximizes the score of real training examples, the score of any real example is likely to be higher than the scores of the majority of its neighbors most of the time. Averaging scores at different time steps is therefore likely to make a real datapoint a local maximum. The real datapoint in the blue/black box in Fig. 1 is not a local maximum at any individual time step, but it is a local maximum in the averaged vector field. Real datapoints are likely to be local maxima/sinks of the averaged discriminator D̄_t. A discriminator trained with our method approximates D̄_t, so it is expected to create sinks in a way similar to an ensemble. Our method helps the discriminator satisfy our convergence condition by making real datapoints fixed sinks.
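A minimal 1-D example shows how averaging creates a sink that no individual snapshot has. The two snapshot discriminators `D1` and `D2` below are hypothetical: each scores only one side of the real datapoint x_r = 0 low (where the fake samples currently are) and is flat elsewhere, so neither has x_r as a strict maximum, but their average does.

```python
import numpy as np

# Averaged discriminator D_bar(x) = (1/t) * sum_i D_i(x), as in Eqn. 3,
# sketched with two hypothetical snapshots from different training steps.
D1 = lambda x: -np.maximum(x, 0.0)    # fake samples currently on the right
D2 = lambda x: -np.maximum(-x, 0.0)   # fake samples currently on the left
D_bar = lambda x: 0.5 * (D1(x) + D2(x))

xs = np.linspace(-1.0, 1.0, 201)
assert np.argmax(D1(xs)) == 0         # no strict maximum at x_r for D1 alone
assert np.argmax(D_bar(xs)) == 100    # strict maximum (sink) at x_r = 0
assert D_bar(0.0) > D_bar(0.1) and D_bar(0.0) > D_bar(-0.1)
```

Each snapshot remembers only its own fake distribution; the average remembers both, which is exactly the property the EWC regularizer distills into a single discriminator.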
5.3 Comparison to other methods
5.3.1 Zero–centered Gradient Penalties
Zero–centered Gradient Penalty on training examples only:
To reduce oscillation in Dirac GAN, Mescheder et al. [13] proposed the zero-centered gradient penalty on real training examples (R1) / fake training examples (R2), which pushes the gradient at real / fake examples toward 0. The loss for the discriminator is defined as
L̂_D = L_D + (γ/2) E_{x ~ p_r} [ ||∇_x D(x)||² ]    (4)
where L_D is the loss in Eqn. 1. In Dirac GAN, the penalty encourages the generator and the discriminator to reach an equilibrium where the generator can generate the real datapoint. (Although R1 significantly improves the quality of generated samples, [22] suggested that it might encourage GANs to remember the real training examples, resulting in poor generalization capability.) When trained with R1, the discriminator maximizes the scores of real datapoints and forces the gradient w.r.t. them toward 0. R1 thus encourages real training datapoints to be local maxima/sinks of the discriminator. Because the locations of real datapoints are fixed, R1 helps the discriminator satisfy the condition in Proposition 1. Fig. 3 shows that these penalties are effective in making real training examples local maxima. During training, fake samples are attracted to different local maxima/sinks at different locations. Mode collapse is thus reduced (see generated samples in Fig. 3).
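The penalty term itself is easy to sketch. This is an illustrative numpy version using finite differences (a real implementation would backpropagate through the discriminator); the two toy discriminators and the batch of one real point are hypothetical.

```python
import numpy as np

def numeric_grad(D, x, eps=1e-5):
    """Central-difference gradient of D at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x); d[i] = eps
        g[i] = (D(x + d) - D(x - d)) / (2 * eps)
    return g

def r1_penalty(D, real_batch, gamma=10.0):
    """Zero-centered gradient penalty on real samples, as in Eqn. 4:
    (gamma/2) * E_{x ~ p_r} ||grad_x D(x)||^2."""
    sq_norms = [float(np.sum(numeric_grad(D, x) ** 2)) for x in real_batch]
    return 0.5 * gamma * float(np.mean(sq_norms))

# A discriminator whose gradient vanishes at the real datapoint pays almost
# no penalty; one with slope s there pays (gamma/2) * s^2.
x_r = np.array([0.0])
flat_at_real = lambda x: -float(np.sum(x ** 2))   # grad is 0 at x_r
sloped = lambda x: float(np.sum(2.0 * x))         # grad is 2 at x_r
assert r1_penalty(flat_at_real, [x_r]) < 1e-6
assert abs(r1_penalty(sloped, [x_r], gamma=10.0) - 20.0) < 1e-3
```

Minimizing the penalty flattens the discriminator at the (fixed) real datapoints, which is why R1 pushes them toward being stationary points, though, as discussed next, a zero gradient alone does not guarantee they become sinks.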
In Fig. 1, although the gradient around the real datapoint in the blue box is close to 0, that real datapoint is not a sink of the vector field. Fig. 1 shows an extreme case where the gradients w.r.t. all of the real and fake datapoints are close to 0 while none of them are sinks, and the two networks collapse to a bad local equilibrium. The phenomenon suggests that forcing the gradient at real datapoints toward 0 is not enough to make them sinks. Our method creates sinks by distilling the knowledge from an ensemble of discriminators. If the distillation is perfect, then our method only fails where the ensemble fails. Therefore, the diversity of discriminators in the ensemble is crucial to our method. Our method is orthogonal to and can be combined with zero-centered gradient penalties to create a better regularizer for GANs.
Zero-centered Gradient Penalty on interpolated samples:
Thanh-Tung et al. [22] studied the generalization capability of GANs and showed that R1, R2, and non-zero-centered gradient penalties do not improve the generalization of GANs. The authors proposed to improve the generalization of GANs using a gradient penalty of the following form:
L̂_D = L_D + (γ/2) E_{x̂} [ ||∇_{x̂} D(x̂)||² ]    (5)
where x̂ lies on a path from a fake sample to a real sample. They showed that this 0-GP inherits the convergence guarantee of R1/R2 and helps the discriminator distribute its capacity more equally between regions of the space. When 0-GP is used, the discriminator is less likely to over-focus on current fake samples, so distant real training examples are less likely to be forgotten. Situations similar to the blue boxes in Fig. 1 are less likely to happen. 0-GP, therefore, can be seen as a method for avoiding catastrophic forgetting.
5.3.2 Momentum based optimizers
The running average of gradients in Eqn. 3 is similar to that in momentum-based optimizers. The similarity suggests that optimizers with momentum can also alleviate the catastrophic forgetting problem. The running average of gradients is a simple form of memory that preserves information about old samples. Updating the discriminator with the running average improves its performance on current samples as well as old samples. This partly explains the success of momentum-based optimizers such as SGD with momentum and Adam [9] in training GANs. Fig. 1 shows the effect of the Adam optimizer on the 8-Gaussian dataset. When fake samples move, the vector field/discriminator does not change as much as before. More interestingly, gradients in the red box in Fig. 1 still point toward the real sample although fake samples are located nearby. See Fig. 7 in Appendix E for an evolution sequence of a GAN trained with Adam.
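The memory effect of momentum can be seen directly in the update rule. This is a generic SGD-with-momentum sketch (standard textbook form, not the paper's optimizer configuration), showing that the update direction still carries a decayed copy of gradients from old batches after the fake samples have moved.

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr=0.1, beta=0.9):
    """SGD with momentum: velocity is a running average of past gradients,
    a simple memory preserving information about old (real/fake) samples."""
    velocity = beta * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity

theta = np.zeros(2)
v = np.zeros(2)
old_grad = np.array([1.0, 0.0])   # gradient from an old fake minibatch
new_grad = np.array([0.0, 1.0])   # gradient after the fake samples moved
theta, v = sgd_momentum_step(theta, old_grad, v)
theta, v = sgd_momentum_step(theta, new_grad, v)
# The update direction still remembers the old gradient's direction.
assert v[0] > 0.0 and v[1] > 0.0
```

Because the velocity decays geometrically rather than being discarded, the discriminator's parameters keep drifting in directions that were useful for old fake distributions, which damps the abrupt vector-field changes seen with plain SGD.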
5.3.3 Mixture of discriminators
Arora et al. [2] showed that an infinite mixture of generators and an infinite mixture of discriminators converge to an equilibrium. The authors experimentally showed that finite mixtures approximate infinite mixtures well and result in improved stability and sample quality. From our point of view, if each discriminator in the mixture remembers a different region of the data space, then using a mixture of discriminators reduces the probability that a set of samples is catastrophically forgotten.
When mixtures are used, the memory requirement grows linearly with the number of components in the mixture. That prevents this method from scaling up to large networks and datasets. Our method does not require additional discriminators and generators and has the same memory requirement as a standard GAN. Because the memory requirement stays the same when we vary the sampling interval, we can control the diversity of the ensemble without using additional memory. Our method, therefore, is applicable to massive datasets and networks.
6 Experimental results and discussion
Although our regularizer can be added to arbitrary loss functions, we perform experiments with the cross entropy loss. Our GAN is denoted as EWCGAN. We note that Synaptic Intelligence (SI) [23] and similar algorithms can be adapted to GANs in the same way.
The result of EWCGAN on the 8-Gaussian dataset is shown in Fig. 1. Continual learning effectively reduces the catastrophic forgetting problem: the vector field has a more stable pattern and is robust to changes in the generator's distribution. The generator and the discriminator converge to a good equilibrium where the generator can generate all modes of the target distribution.
Fig. 4 shows the result on the MNIST dataset. For each setting, the best result of 10 different runs is reported. We note the convergence in Fig. 4: going from iteration 10000 to iteration 20000, for most images the digits stay the same, and only their quality improves. Without EWC, the GAN in Fig. 4 does not exhibit convergence: the digits keep changing as training continues. EWCGANs with a small sampling interval also do not converge. When the interval is small, the discriminators in the ensemble are very similar, so they are likely to forget the same set of samples. As a result, the distilled discriminator also forgets that same set of samples. A larger interval improves the diversity of discriminators in the ensemble, resulting in less forgetting and more diverse images. In Fig. 4, the diversity of generated images increases as the interval increases.
7 Related work
Catastrophic forgetting: Liang et al. [12] independently came up with the same idea of using continual learning methods to improve GANs. (Liang et al.'s paper was completed months after ours and has not been accepted to any conference or journal; they agreed that our paper is the first to consider the catastrophic forgetting problem in GANs.) Based on intuition, the authors proposed a slightly different way of using continual learning techniques to improve GANs, and showed that EWC, SI, and their proposed method slightly improve the performance of GANs on toy and CIFAR-10 datasets. However, they did not provide any theoretical analysis of the catastrophic forgetting problem or its relation to the convergence of GANs. Our paper, on the other hand, focuses on the theoretical aspect of the problem. We provide a theoretical analysis of the problem and its effects, and show that continual learning techniques improve the convergence of GANs by helping the discriminator satisfy our convergence condition.
Convergence: Prior works on the convergence of GANs usually study convergence in parameter space [8, 13, 14, 15]. However, there is no guarantee that a GAN converging in parameter space can well approximate the target distribution. Our paper studies convergence in data space, which is more closely related to the capability of GANs to approximate the target distribution. We propose a method to help GANs converge to equilibria where generators can generate samples from the target distribution.
8 Conclusion
In this paper, we study the catastrophic forgetting problem in GANs. We show that catastrophic forgetting is a reason for nonconvergence. We then establish a sufficient condition for convergence and show that the condition is violated when catastrophic forgetting happens. From that insight, we propose to apply continual learning techniques to GANs to alleviate the catastrophic forgetting problem. Experiments on synthetic and MNIST datasets confirm that continual learning techniques improve the convergence of GANs and the diversity of generated samples.
References
 [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 [2] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 224–232, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

 [3] Arslan Chaudhry, Puneet K. Dokania, Thalaiyasingam Ajanthan, and Philip H. S. Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In The European Conference on Computer Vision (ECCV), September 2018.
 [4] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
 [5] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128 – 135, 1999.
 [6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
 [7] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5767–5777. Curran Associates, Inc., 2017.
 [8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6626–6637. Curran Associates, Inc., 2017.
 [9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [10] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
 [11] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. CoRR, abs/1706.02515, 2017.
 [12] Kevin J Liang, Chunyuan Li, Guoyin Wang, and Lawrence Carin. Generative adversarial network training is a continual learning problem. arXiv e-prints, arXiv:1811.11083, Nov 2018.
 [13] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3478–3487, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
 [14] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1825–1835. Curran Associates, Inc., 2017.
 [15] Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent gan optimization is locally stable. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5585–5595. Curran Associates, Inc., 2017.

 [16] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
 [17] David Pfau and Oriol Vinyals. Connecting generative adversarial networks and actor-critic methods. In NIPS Workshop on Adversarial Training, 2016.
 [18] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
 [19] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4528–4537, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
 [20] Ari Seff, Alex Beatson, Daniel Suo, and Han Liu. Continual learning in generative adversarial nets. CoRR, abs/1705.08395, 2017.
 [21] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2990–2999. Curran Associates, Inc., 2017.
 [22] Hoang Thanh-Tung, Truyen Tran, and Svetha Venkatesh. Improving generalization and stability of generative adversarial networks. In International Conference on Learning Representations, 2019.
 [23] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3987–3995, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
Appendix A Proof for Proposition 1
Proof.
Let $\mathcal{S}$ be the set of fixed sinks of the vector field $v$. Consider a fake datapoint $x$ in the basin of attraction $B(s)$ of a sink $s \in \mathcal{S}$. Because the movement of $x$ is independent of other fake datapoints, $x$ can freely move toward $s$. Training the generator with gradient descent will make $x$ converge to $s$. This is true for all fake datapoints in $B(s)$. Because the locations of the sinks are fixed during training, the generator and the discriminator will come to an equilibrium where the generator can generate some of the sinks. ∎
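The convergence argument can be illustrated with a toy simulation (entirely our own construction: the potential $f(x) = (x^2 - 1)^2$ and its sinks at $\pm 1$ are hypothetical, not from the paper). Each fake datapoint follows a fixed vector field independently of the others, so gradient descent drives it to the sink whose basin of attraction it starts in:

```python
# Hypothetical illustration of the proof of Proposition 1: fake datapoints
# following a *fixed* vector field v(x) = -f'(x) with f(x) = (x^2 - 1)^2,
# which has sinks at x = -1 and x = +1 (basins separated at x = 0).
def v(x):
    return -4.0 * x * (x * x - 1.0)  # -f'(x)

def descend(x, lr=0.01, steps=2000):
    # each datapoint moves independently of all other fake datapoints
    for _ in range(steps):
        x = x + lr * v(x)
    return x

fakes = [-1.7, -0.3, 0.4, 1.9]          # initial fake datapoints
limits = [descend(x) for x in fakes]    # -> close to [-1, -1, +1, +1]
```

Because the sinks never move, every datapoint settles at a sink and training reaches an equilibrium, mirroring the argument above.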
Appendix B The empirically optimal Dirac discriminator
Given a discriminator $D(x) = f(wx)$, where $w \in \mathbb{R}$, $|w| \le 1$, and $f$ is a nondecreasing activation function such as ReLU, Leaky ReLU, Sigmoid, or Tanh. In the Dirac GAN setting [13], the real datapoint is $x = 0$ and the fake datapoint is $x = \theta$. The empirically optimal discriminator must maximize the difference between $D(0)$ and $D(\theta)$.
If $f$ is ReLU or Leaky ReLU or Tanh, then $f(0) = 0$ and $D(0) = 0$. If $f$ is Sigmoid, then $f(0) = 1/2$ and $D(0) = 1/2$. For both cases, because $f$ is nondecreasing, we have
$$D(0) - D(\theta) = f(0) - f(w\theta) \le f(0) - f(-|w\theta|) \le f(0) - f(-|\theta|).$$
Thus
$$\max_{|w| \le 1} \big( D(0) - D(\theta) \big) = f(0) - f(-|\theta|).$$
The equality is achieved for all cases when $\operatorname{sign}(w) = -\operatorname{sign}(\theta)$ and $|w| = 1$. The optimal discriminator's parameter is $w^{*} = -\operatorname{sign}(\theta)$. If $f$ is the Leaky ReLU activation function, the optimal discriminator is
$$D^{*}(x) = \mathrm{LeakyReLU}\big( -\operatorname{sign}(\theta)\, x \big) = \begin{cases} -\operatorname{sign}(\theta)\, x & \text{if } \operatorname{sign}(\theta)\, x \le 0, \\ -\alpha \operatorname{sign}(\theta)\, x & \text{otherwise,} \end{cases}$$
where $\alpha$ is the negative slope of the Leaky ReLU activation function. The empirically optimal discriminator with Leaky ReLU activation is shown in Fig. 2(e).
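As a quick sanity check of this optimum (a hedged sketch: the setup with the real datapoint at $0$, the fake datapoint at $\theta$, a discriminator $D(x) = f(wx)$ with $|w| \le 1$, and the negative slope $\alpha = 0.1$ are our assumptions here, not code from the paper), a grid search over $w$ confirms that $D(0) - D(\theta)$ is maximized at $w = -\operatorname{sign}(\theta)$:

```python
def leaky_relu(x, alpha=0.1):
    # Leaky ReLU with negative slope alpha (an assumed value)
    return x if x >= 0 else alpha * x

def gap(w, theta, f=leaky_relu):
    # D(0) - D(theta) for the Dirac discriminator D(x) = f(w * x)
    return f(0.0) - f(w * theta)

theta = 1.5  # fake datapoint to the right of the real datapoint at 0
grid = [i / 100.0 for i in range(-100, 101)]  # w in [-1, 1]
best_w = max(grid, key=lambda w: gap(w, theta))
# best_w == -1.0 == -sign(theta): the discriminator slopes away from
# the fake datapoint as steeply as the weight constraint allows
```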
Appendix C Constructing powerful generators
Given a dataset $X = \{x_1, \dots, x_n\}$ of $d_x$-dimensional fake samples which are created from $d_z$-dimensional noise vectors $Z = \{z_1, \dots, z_n\}$. For simplicity, we assume that the noise vectors are normalized: $\|z_i\| = 1$. A generator that approximately satisfies the requirement in the condition of Proposition 1 can be constructed as a 1-hidden-layer MLP as follows:
$$G(z) = X^{\top} \sigma(\beta Z z),$$
where $\beta > 0$ is a large constant and $\sigma$ is the softmax function. $Z \in \mathbb{R}^{n \times d_z}$ and $X \in \mathbb{R}^{n \times d_x}$ are defined as the matrices whose $i$-th rows are $z_i^{\top}$ and $x_i^{\top}$, respectively.
For large enough $\beta$, $\sigma(\beta Z z_i)$ will become a one-hot vector with the $i$-th element being 1. The output of $G(z_i)$ is then the $i$-th row of $X$, and a gradient update to $G(z_i)$ will affect that row only. Such an MLP generator can move any individual fake sample along a path independent of all other fake samples.
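A minimal sketch of this construction in plain Python (the names `make_generator` and `beta`, and the toy data, are ours): the hidden layer computes a softmax over the inner products $z_i \cdot z$, which is approximately one-hot for large $\beta$, so the output is the matching fake sample.

```python
import math

def softmax(v, beta):
    # numerically stable softmax of beta * v
    m = max(v)
    e = [math.exp(beta * (x - m)) for x in v]
    s = sum(e)
    return [x / s for x in e]

def make_generator(Z, X, beta=1000.0):
    """One-hidden-layer MLP generator: G(z) = X^T softmax(beta * Z z).
    Z holds the (unit-norm) noise vectors as rows, X the fake samples.
    For large beta the softmax is ~one-hot, so G(z_i) returns row i of X
    and a gradient update to G(z_i) would touch that row only."""
    def G(z):
        logits = [sum(a * b for a, b in zip(zi, z)) for zi in Z]  # Z z
        w = softmax(logits, beta)
        return [sum(w[i] * X[i][j] for i in range(len(X)))
                for j in range(len(X[0]))]
    return G

# toy example: two unit noise vectors, two 2-dimensional fake samples
Z = [[1.0, 0.0], [0.0, 1.0]]
X = [[5.0, 1.0], [-3.0, 2.0]]
G = make_generator(Z, X)
```

Here `G(Z[0])` recovers the first sample and `G(Z[1])` the second, so each sample can be moved independently by editing its row of `X`.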
Appendix D Catastrophic forgetting in a high-capacity Dirac GAN
All experiments in this paper were done on a single Nvidia GTX 1080 GPU with 8 GB of memory. The code was written in PyTorch 1.0 [16].
See Fig. 5.
Appendix E Additional Experiments
The effect of using old fake data on the MNIST dataset. We trained a standard GAN with MLP discriminator and generator on the MNIST dataset for 20,000 iterations. The networks' architectures and other hyperparameters are the same as those used in our experiment in Fig. 4. We use a buffer of size $k$ to store the last $k$ fake minibatches. At every iteration, we generate a random number between 0 and 1; if the number is lower than a threshold $p$, an old minibatch is randomly selected from the buffer and used to train the discriminator. We fixed the buffer size and varied $p$ between 0.1 and 0.5 to observe the effect of old data. For all values of $p$, the model suffers from severe mode collapse and the generated samples have low quality.
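The replay scheme described above can be sketched as follows (a hedged illustration, not the code we used; `ReplayBuffer`, `capacity`, and `p` are our names, and the discriminator update itself is omitted):

```python
import random
from collections import deque

class ReplayBuffer:
    """Keep the last `capacity` fake minibatches; with probability `p`,
    replay a randomly chosen old minibatch to the discriminator."""
    def __init__(self, capacity=50, p=0.3, seed=0):
        self.buf = deque(maxlen=capacity)  # oldest minibatches fall off
        self.p = p
        self.rng = random.Random(seed)

    def push(self, fake_batch):
        # store the freshly generated fake minibatch
        self.buf.append(fake_batch)

    def maybe_sample(self):
        # draw a number in [0, 1); below the threshold -> replay old data
        if self.buf and self.rng.random() < self.p:
            return self.rng.choice(list(self.buf))
        return None
```

In a training loop, each iteration would `push` the new fake minibatch and, whenever `maybe_sample` returns a batch, feed it to the discriminator as additional fake data.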