chainGAN_repo
Code for the https://arxiv.org/pdf/1811.08081.pdf
We propose a new architecture and training methodology for generative adversarial networks. Current approaches attempt to learn the transformation from a noise sample to a generated data sample in one shot. Our proposed generator architecture, called ChainGAN, uses a two-step process. It first attempts to transform a noise vector into a crude sample, similar to a traditional generator. Next, a chain of networks, called editors, attempts to sequentially enhance this sample. We train each of these units independently, instead of with end-to-end backpropagation on the entire chain. Our model is robust, efficient, and flexible, as we can apply it to various network architectures. We provide rationale for our choices and experimentally evaluate our model, achieving competitive results on several datasets.
Generative Adversarial Networks (GANs) are a class of generative models that have shown promising results in multiple domains, including image generation, text generation, and style transfer (Goodfellow et al., 2014; Zhang et al., 2017; Zhu et al., 2017). GANs use a game-theoretic framework pitting two agents, a discriminator $D$ and a generator $G$, against one another to learn a data distribution $P_r$. The generator attempts to transform a noise vector $z$ into a sample $\tilde{x} = G(z)$ that resembles the training data. Meanwhile, the discriminator acts as a classifier trying to distinguish between real samples, $x \sim P_r$, and generated ones, $\tilde{x} \sim P_g$. Equation 1 is the original minimax objective proposed by Goodfellow et al. (2014), which can also be viewed as minimizing the Jensen-Shannon divergence between two distributions, $P_r$ and $P_g$. These agents are most often implemented as neural networks, and Goodfellow et al. (2014) proved that, under suitable conditions, the generator distribution will match the data distribution, $P_g = P_r$.

$$\min_G \max_D \; \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim P_g}[\log(1 - D(\tilde{x}))] \tag{1}$$
Despite their success, GANs are notoriously hard to train and many variants have been proposed to tackle shortcomings. These shortcomings include mode collapse (i.e., the tendency for the generator to produce samples lying on a few modes instead of over the whole data space), training instability, and convergence properties. In most of these modified GANs, however, the generator is realized as a single network that transforms noise into a generated data sample.
In this paper, we propose a sequential generator architecture composed of a number of networks. In the first step, a traditional generator (in this paper called the base generator) is used to transform a noise sample into a generated sample. The sample is then fed into a network (called an editor) which attempts to enhance its quality, as judged by the discriminator. We have a number of these editor networks connected in a chain, such that the output of one is used as the input of the next. Together, the base generator and the editors ahead of it are called the chain generator. Any pre-existing generator architecture can be used for the base generator, making the approach quite flexible. For the editors, architectures designed for sample enhancement work well. The whole chain generator network can be quite small; in fact, it is smaller than most existing generator networks, yet produces equivalent or better results (Section 4.2).
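As a minimal sketch of the chain described above, the Python below composes a base generator with a list of editors and returns every intermediate sample. The names `chain_generate`, `toy_base` and `make_toy_editor`, and the arithmetic stand-ins for networks, are illustrative assumptions rather than the authors' implementation.

```python
from typing import Callable, List, Sequence

Sample = List[float]
Network = Callable[[Sample], Sample]


def chain_generate(base_generator: Network,
                   editors: Sequence[Network],
                   z: Sample) -> List[Sample]:
    """Return [base output, editor 1 output, ..., editor n output]."""
    outputs = [base_generator(z)]           # crude sample produced from noise
    for editor in editors:                  # each editor takes the previous output
        outputs.append(editor(outputs[-1]))
    return outputs


# Toy stand-ins (hypothetical); real networks would be neural networks.
def toy_base(z: Sample) -> Sample:
    return [2.0 * v for v in z]


def make_toy_editor(shift: float) -> Network:
    return lambda x: [v + shift for v in x]


samples = chain_generate(toy_base,
                         [make_toy_editor(1.0), make_toy_editor(0.5)],
                         [1.0, -1.0])
print(samples)
```

Every element of `samples` is a complete sample, so any prefix of the chain can be used on its own.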
Each network in the chain generator is trained independently, based on the scores it receives from the discriminator. This is done instead of end-to-end backpropagation through the whole base generator + editor chain (Section 3.3), and allows each network in the chain to get direct feedback on its output. A similar approach has been quite successful in classification tasks. Huang et al. (2017) showed that, in a classification architecture composed of a chain of ResNet blocks, training each of these blocks independently produces better results than an end-to-end training scheme (He et al., 2015).
In this paper, we propose a sequential generator architecture and a corresponding training regime for GANs. The benefits of this approach include:
The formulation by Goodfellow et al. (2014) (Equation 1) is equivalent to minimizing the Jensen-Shannon divergence between the data distribution $P_r$ and the generated distribution $P_g$. However, Arjovsky et al. (2017) showed that, in practice, this leads to a loss that is potentially not continuous, causing training difficulty. To address this, they presented the Wasserstein GAN (WGAN), which minimizes the Wasserstein (or 'Earth mover') distance between two distributions. This is a weaker distance than Jensen-Shannon, and Arjovsky et al. (2017) showed that, under mild conditions, it results in a function that is continuous everywhere and differentiable almost everywhere. The WGAN objective can be expressed as:

$$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] \tag{2}$$

where $\mathcal{D}$ is the set of all 1-Lipschitz functions. $D$ no longer discerns between real and fake samples but is used to approximate the Wasserstein distance between the two distributions (it is here called a 'critic'). $G$ in turn looks to minimize this approximated distance. Several approaches have been proposed to enforce the Lipschitz constraint on the critic, the most notable being the gradient penalty (Gulrajani et al., 2017). This method penalizes the norm of the critic's gradient with respect to its input and was shown to have stable training behavior on various architectures. Due to its strong theoretical and empirical results, we use a modified version of this objective for our work. The objective for WGAN with gradient penalty can be expressed as:

$$L = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right] \tag{3}$$

where $P_{\hat{x}}$ samples uniformly along straight lines between pairs of points sampled from the data distribution $P_r$ and the generator distribution $P_g$ (Gulrajani et al., 2017). $\lambda$ is the gradient penalty coefficient.
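To make the gradient penalty term concrete, here is a small sketch that assumes a linear critic D(x) = w · x, whose gradient with respect to its input is simply w. That choice, the helper names and the coefficient value of 10 are assumptions made for illustration; a real critic would be a neural network differentiated with automatic differentiation.

```python
import math
import random
from typing import List


def interpolate(x_real: List[float], x_fake: List[float], eps: float) -> List[float]:
    """Point on the straight line between a real and a generated sample."""
    return [eps * r + (1.0 - eps) * f for r, f in zip(x_real, x_fake)]


def linear_critic_grad(w: List[float], x_hat: List[float]) -> List[float]:
    """Gradient of D(x) = w . x with respect to x (constant for a linear critic)."""
    return list(w)


def gradient_penalty(w: List[float], x_real: List[float], x_fake: List[float],
                     lam: float, rng: random.Random) -> float:
    eps = rng.random()                          # eps drawn uniformly from [0, 1)
    x_hat = interpolate(x_real, x_fake, eps)
    grad = linear_critic_grad(w, x_hat)
    grad_norm = math.sqrt(sum(g * g for g in grad))
    return lam * (grad_norm - 1.0) ** 2         # penalize deviation of the norm from 1


rng = random.Random(0)
penalty = gradient_penalty([3.0, 4.0], [0.0, 0.0], [1.0, 1.0], lam=10.0, rng=rng)
print(penalty)  # gradient norm is 5, so the penalty is 10 * (5 - 1)^2 = 160.0
```

The penalty is zero exactly when the gradient norm at the interpolated point equals 1, which is how the term pushes the critic toward the 1-Lipschitz constraint.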
A diverse range of architectures has been proposed for the generator and discriminator networks. DCGAN has shown promising results by using a deep convolutional architecture for both the generator and discriminator (Radford et al., 2015). However, it uses a single large generator network instead of the sequential architecture that we have developed. Our architecture provides results similar to DCGAN while being more efficient, both in network parameters (Section 4.2) and in memory. EnhanceGAN, proposed by Deng et al. (2017), successfully uses GANs for unsupervised image enhancement. In this model, the generator is given an image and tries to improve its aesthetic quality. The editors in our network are similar, as they are responsible for enhancing images in an unsupervised fashion.
Sequential generator architectures have been proposed for tasks where the underlying data has some inherent sequential structure. TextGAN uses an LSTM architecture for the generator to generate a sentence, one word at a time (Zhang et al., 2017). Ghosh et al. (2016) used an RNN-style discriminator and generator to solve abstract reasoning problems; their model attempts to predict the next logical diagram, given a sequence of past diagrams. C-RNN-GAN is another example that uses GANs on continuous sequential data like music (Mogren, 2016). Unlike these works, we do not assume the data is inherently sequential, nor do we produce a single sample by generating a number of smaller units; rather, the output of any network in the chain generator is a whole sample, and the succeeding ones try to enhance it. Moreover, each editor network has its own set of parameters, unlike LSTMs or RNNs where these are shared. This allows each editor to be independent and makes our model efficient, as we do not have to backpropagate across time.
The concept of using multiple generators has been proposed previously. Hoang et al. (2017) used multiple generators which share all but the last layer, trained alongside a single discriminator and a classifier. Their work has shown resistance to mode collapse and achieves state-of-the-art results on some datasets. Their work was influenced by MIX+GAN, which showed strong theoretical and empirical results by using a mixture of GANs instead of a single generator and discriminator (Arora et al., 2017). Notably, Arora et al. showed that using a mixture of GANs guarantees the existence of an approximate equilibrium, leading to stable training. However, a drawback of their work was the large memory cost of using multiple GANs; they suggest a mixture size of 5. Our work is similar to these in that we use multiple generators. The output of each editor is a different transformation of the noise vector and can be viewed as originating from a different generator. However, the generators in our model are not independent; rather, they are connected in a chain and meant to build upon the previous ones.
A clear advantage of a sequential approach is that each output in the sequence is conditioned on the previous one, simplifying the problem of generating a complicated sample by mapping it to a sequence of simpler problems. A sequential approach also means that the loss or other latent information in the intermediate steps can be used to guide the training process, which is not possible in one-shot generation.
As such, we propose a sequential generator architecture and an associated training objective for GANs. We observe that the challenge of generating a data sample from noise can be divided into two components: generating a crude representation of what that sample might be, and then iteratively refining it (Figure 1). A generator trained on images may first generate a rough picture of a car or a boat, e.g., but it can then be successively edited and enhanced. Intuitively, this is akin to the editorial process wherein a writer produces a rough sketch of an article, which then passes through a number of editors before the finished copy is published.
We use a WGAN with gradient penalty as the starting point for our architecture due to its theoretical robustness and training stability (Section 2.1). However, instead of using a single generator, we use the chain generator consisting of a base generator, denoted by $G_b$, and several editors, denoted by $E_1, \dots, E_n$. The base generator takes in a noise vector $z$ and produces a sample $\tilde{x}_0$ attempting to fool the critic, as in the traditional GAN architecture. The editors are connected in a chain where the $i$-th one takes as input a sample $\tilde{x}_{i-1}$ and tries to produce an enhanced version as the output, $\tilde{x}_i$ (Figure 1). Thus, the output of each editor is conditioned on the previous one. Our network can be expressed recursively as:

$$\tilde{x}_0 = G_b(z), \qquad \tilde{x}_i = E_i(\tilde{x}_{i-1}), \quad i = 1, \dots, n \tag{4}$$

Each of the intermediate samples in the chain, $\tilde{x}_i$, is an attempt to fool the critic. In the WGAN formulation, the critic is used to approximate the Wasserstein distance between the data distribution $P_r$ and the generated distribution $P_g$, where $P_g$ is obtained by transforming $z$ using the network $G$. In our formulation, each output $\tilde{x}_i$ is obtained by the following transformation on $z$: $\tilde{x}_i = E_i(E_{i-1}(\dots E_1(G_b(z)) \dots))$. As such, $\tilde{x}_i$ can be viewed as being sampled from a distribution $P_{g_i}$, and a critic $D_i$ is needed to approximate the Wasserstein distance between $P_r$ and $P_{g_i}$ (see Section 31 for a detailed explanation). For a chain of $n$ editors plus the base generator, $n+1$ critics are needed. For the $i$-th editor and critic, the objective can be expressed as:

$$\min_{E_i} \max_{\lVert D_i \rVert_L \le 1} \; \mathbb{E}_{x \sim P_r}[D_i(x)] - \mathbb{E}_{\tilde{x}_i \sim P_{g_i}}[D_i(\tilde{x}_i)] \tag{5}$$

Each network in the chain generator, whether the base generator or any of the editors, is trained independently, based on critic scores for that network's output (Section 7.1). At every iteration, we randomly choose one of them to update, and also use its output to train the corresponding critic, $D_i$. This means that the goal for any network in the chain generator is to ensure its outputs are as realistic as possible. Without loss of generality, we can express the training rule for the chain generator using canonical gradient descent with learning rate $\alpha$ as:

$$\theta_{E_i} \leftarrow \theta_{E_i} + \alpha \, \nabla_{\theta_{E_i}} \mathbb{E}_{z}\left[ D_i\big(E_i(\tilde{x}_{i-1})\big) \right] \tag{6}$$

where $\theta_{E_i}$ denotes the parameters of the $i$-th network in the chain and its input $\tilde{x}_{i-1}$ is treated as fixed (for the base generator, $G_b$ takes the place of $E_i$ and $z$ is its input).
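The schedule described above (pick one network at random, update it and its critic, leave everything else untouched) can be sketched on a deliberately tiny scalar problem. Everything in this block is an assumption for illustration: each network only adds a bias, each critic is D_i(x) = w_i * x, and weight clipping from the original WGAN stands in for the gradient penalty to keep the example short.

```python
import random
from typing import List


def chain_output(biases: List[float], z: float, upto: int) -> float:
    """Output of network `upto` in the toy chain: x_i = x_{i-1} + b_i, starting from z."""
    x = z
    for b in biases[: upto + 1]:
        x += b
    return x


def train_step(biases: List[float], critic_w: List[float], real: float,
               z: float, lr: float, rng: random.Random) -> int:
    """Update one randomly chosen network and its critic; return its index."""
    i = rng.randrange(len(biases))             # 0 = base generator, i >= 1 = editor i
    fake = chain_output(biases, z, i)          # upstream networks are only evaluated
    # Critic step: minimize D_i(fake) - D_i(real) for D_i(x) = w_i * x.
    critic_w[i] -= lr * (fake - real)
    critic_w[i] = max(-1.0, min(1.0, critic_w[i]))   # clipping keeps D_i 1-Lipschitz
    # Network step: minimize -D_i(fake); its derivative with respect to b_i is -w_i.
    biases[i] += lr * critic_w[i]
    return i


rng = random.Random(0)
biases = [0.0, 0.0, 0.0]       # base generator plus two editors
critic_w = [0.0, 0.0, 0.0]     # one critic per network in the chain
for _ in range(200):
    before = list(biases)
    i = train_step(biases, critic_w, real=3.0, z=rng.gauss(0.0, 1.0), lr=0.05, rng=rng)
    changed = [j for j in range(len(biases)) if biases[j] != before[j]]
    assert changed in ([], [i])    # only the selected network may move
print(biases, critic_w)
```

The in-loop assertion checks the property that matters here: a step never modifies any network other than the one selected for that iteration.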
Our training regime is in contrast to performing end-to-end backpropagation (from the last editor all the way to the base generator). The motivation for this is threefold. First, we want each editor to do its best in generating realistic samples and receive direct feedback on its performance. This way, we can utilize the intermediate samples generated in the chain to guide the training process: editor $i$ can start with the distribution $P_{g_{i-1}}$ and use the critic feedback to construct a $P_{g_i}$ that is closer to $P_r$. Second, we can possibly use fewer editors during evaluation. For example, we can train with the chain generator composed of $G_b$, $E_1$, $E_2$, …, $E_n$, but in evaluation observe that after editor $E_k$ ($k < n$), the quality of the samples saturates or does not improve significantly (Figure 8). As such, during evaluation we can cut off the chain generator at editor $E_k$ and use this smaller chain for downstream tasks, making it more compact. Performing end-to-end backpropagation would mean that only the samples generated by the last editor would be viewed by the critic, and the goal of the chain generator would be to generate good samples by the last editor. That network could no longer be made compact at evaluation, and the number of editors to use would be an additional hyperparameter.
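One way to act on the saturation argument above is to score every output in the chain and keep only the networks up to the earliest one that is close to the best. The sketch below does this; `choose_cutoff`, the tolerance parameter and the score values are hypothetical and are not taken from the paper's results.

```python
from typing import Sequence


def choose_cutoff(scores: Sequence[float], tolerance: float = 0.0) -> int:
    """Index of the earliest network whose score is within `tolerance` of the best.

    scores[0] belongs to the base generator and scores[i] to editor i.
    """
    best = max(scores)                     # raises ValueError on an empty sequence
    for i, score in enumerate(scores):
        if score >= best - tolerance:
            return i
    return len(scores) - 1                 # only reached if tolerance is negative


# Hypothetical per-network quality scores (higher is better).
scores = [5.2, 5.6, 6.0, 5.9, 5.8, 5.7]
exact = choose_cutoff(scores)              # earliest network reaching the best score
relaxed = choose_cutoff(scores, 0.5)       # accept a small loss for a shorter chain
print(exact, relaxed)                      # 2 1
```

With a cutoff of `k`, only the base generator and the first `k` editors need to be kept for evaluation.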
The last rationale behind our training approach is efficiency. Observe that in the canonical GAN architecture, the gradient computation when updating the generator is $\nabla_{\theta_G} D(G(z))$, which involves traversing both the generator and critic graphs and can be expensive for large generator networks. Performing end-to-end backpropagation on our chain generator would cause a similar issue, as the entire chain would need to be retained alongside the critic. However, training each of these networks separately means that the gradient computation for an update step becomes $\nabla_{\theta_{E_i}} D_i(E_i(\tilde{x}_{i-1}))$, so only a single editor or base generator and its critic need to be traversed and loaded on the GPU (Section 7.1). Since these networks are quite small, our model can be made very efficient.
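The efficiency argument can be illustrated with simple bookkeeping of how many parameters backpropagation touches per generator-side update. The parameter counts and the function name below are hypothetical, and the sketch only counts the backward pass; it ignores the forward evaluation of upstream networks needed to produce the input.

```python
from typing import List, Optional


def params_in_backward_pass(chain: List[int], critic: int, index: Optional[int]) -> int:
    """Parameters touched by backpropagation for one generator-side update.

    chain  -- parameter counts [base generator, editor 1, ..., editor n]
    critic -- parameter count of the critic scoring the updated output
    index  -- network being trained, or None for end-to-end training
    """
    if index is None:
        return sum(chain) + critic      # gradients flow through the whole chain
    return chain[index] + critic        # gradients stop at the chosen network


# Hypothetical sizes: one base generator and five editors.
chain = [500_000, 100_000, 100_000, 100_000, 100_000, 100_000]
critic = 1_000_000
end_to_end = params_in_backward_pass(chain, critic, None)
independent = params_in_backward_pass(chain, critic, 3)
print(end_to_end, independent)          # 2000000 1100000
```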
We deploy our model on several datasets that are commonly used to evaluate GANs, including: (i) MNIST: 28x28 grayscale images of handwritten digits (LeCun & Cortes, 2010); (ii) CIFAR-10: 32x32 colour images across 10 classes (Krizhevsky et al., ); (iii) CelebA face dataset: 214x110 colour images of celebrity faces (Liu et al., 2015). For the CIFAR-10 dataset, we compute the inception score and compare it against several GAN variants (Salimans et al., 2016). Our goal is to show that our model can be used with existing generator architectures to provide competitive scores while using a smaller network.
We implement multiple critics by sharing all but the last layer (Section 7.3). To verify our results were not due to this, we also experiment with using a single critic, which did not degrade results. Each editor is composed of a few residual blocks since Deng et al. (2017) showed that such an architecture performs well for unsupervised image enhancement tasks (see Section 7.4 for details). We use this editor architecture for all our experiments and have five such networks connected in a chain. A key goal when designing our networks was efficiency; in fact, all implementations of our chain generator have fewer parameters than comparable models.
Randomly selected MNIST examples from our DCGAN with editors model at different epochs.
Our base generator and the critics use a convolutional architecture similar to DCGAN. The entire chain generator network is quite small, using approximately a third of the number of parameters of the DCGAN generator. Early in the training, the base generator's samples are crude, but the editors work to enhance them significantly (Figure 33). As the training progresses, the base generator's samples increase in quality and the editors' effect becomes more subtle. Visually, it appears that the editors collaborate and build upon each other; the class and overall structure of the image remain the same throughout the editing process.
| Model | Generator Arch. | Score | Best Score | Params |
|---|---|---|---|---|
| 1 | Small DCGAN (=256) | 5.02 ± 0.05 | N/A | 700,000 |
| 2 | Tiny DCGAN (=128) + Editors | 5.00 ± 0.09 | 5.67 ± 0.07 (Edit 3) | 866,000 |
| 3 | Small DCGAN + Editors (one Critic) | 5.24 ± 0.04 | 6.05 ± 0.08 (Edit 2) | 1,250,000 |
| 4 | Small DCGAN + Editors (multi Critic) | 5.63 ± 0.06 | 5.86 ± 0.05 (Edit 2) | 1,250,000 |
| 5 | Small DCGAN + Editors (end-to-end) | N/A | 2.56 | 1,250,000 |
| 6 | DCGAN - WGAN+GP (=512) | 5.18 ± 0.05 | N/A | 1,730,000 |
| 7 | Small ResNet | 5.75 ± 0.05 | | 720,000 |
| 8 | Small ResNet + Editors | 6.35 ± 0.09 | 6.70 ± 0.03 (Edit 3) | 1,000,000 |
| 9 | ResNet - WGAN+GP | 6.86 | N/A | 1,220,000 |
We experiment with different architectures for the base generator. These include variants of DCGAN and a ResNet architecture proposed by Gulrajani et al. (2017). We add a chain of five editors to these base generators and train them according to Algorithm 1. We compare them against the original (and larger) versions of these architectures. For original architectures, we use the tuned hyperparameters recommended in the pertinent paper/implementation. For the ChainGAN variants, we use default parameters instead. We wanted to evaluate the robustness of our model and believe that better results than ours can be obtained by tuning hyperparameters. All models compared are trained to the same number of epochs using the WGAN formulation with a gradient penalty term (WGAN+GP). Table 1 contains the results of these experiments.
Based on inception scores, our DCGAN variants of the chain generator using multiple critics (models 2 and 4) were able to outperform the base DCGAN model (model 6) while using fewer parameters. In a related experiment, we train the base generator by itself (i.e., without editors) against the critic (model 1), to note the effect of the editors on the base generator. In this case, the base generator exhibits better inception scores when trained alongside editors than without them, as shown in Fig. (a). This suggests that the sequential model not only improves the overall sample quality, but may also improve base generator training. To discern the role of our training regime, we run an experiment that uses end-to-end backpropagation on a chain generator with 5 editors (model 5), noting a significant drop in performance.
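The "fewer parameters" comparison can be made explicit with a little arithmetic on the parameter counts reported in Table 1. The dictionary and helper below are only a convenience for this calculation; the counts themselves are copied from the table.

```python
# Generator parameter counts as reported in Table 1, keyed by model number.
params = {
    2: 866_000,      # Tiny DCGAN + Editors
    4: 1_250_000,    # Small DCGAN + Editors (multi Critic)
    6: 1_730_000,    # DCGAN - WGAN+GP
    8: 1_000_000,    # Small ResNet + Editors
    9: 1_220_000,    # ResNet - WGAN+GP
}


def relative_size(model: int, baseline: int) -> float:
    """Parameter count of `model` as a fraction of `baseline`, rounded to 2 decimals."""
    return round(params[model] / params[baseline], 2)


print(relative_size(2, 6))   # 0.5
print(relative_size(4, 6))   # 0.72
print(relative_size(8, 9))   # 0.82
```

So the chain generators in these comparisons use roughly 50% to 82% of the parameters of the full-size baselines.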

| Method (unlabelled) | Inception score |
|---|---|
| Real Data | 11.24 ± 0.12 |
| WGAN (Arjovsky et al., 2017) | 3.82 ± 0.06 |
| MIX + WGAN (Arora et al., 2017) | 4.04 ± 0.07 |
| DCGAN - WGAN+GP | 5.18 ± 0.05 |
| ALI (Dumoulin et al., 2016) | 5.35 ± 0.05 |
| BEGAN (Berthelot et al., 2017) | 5.62 |
| Small DCGAN (WGAN+GP) + Editors | 6.05 ± 0.04 |
| Small ResNet (WGAN+GP) + Editors | 6.70 ± 0.02 |
| ResNet - WGAN+GP | 6.86 ± 0.04 |
| EGAN-Ent-VI (Dai et al., 2017) | 7.07 ± 0.10 |
| MGAN (Hoang et al., 2017) | 8.33 ± 0.10 |
The variant of ChainGAN using a ResNet base generator also uses a smaller network (model 8), while achieving comparable scores. While performing slightly worse than the base model on the standard inception score (IV2-TF), we outperform the base model on the IV3-Torch inception model used in Barratt & Sharma (2018). We also see the same effect as above: results from the base generator are better when it is in the chain and trained alongside editors (model 8) than when it is trained by itself (model 7). Figure (b) shows random samples from Editor 3 of the Small ResNet + Editors model.

Table 2 compares our models against different GAN variants. Entries denoted by refer to models that we implemented (all using PyTorch). The ResNet - WGAN+GP is the model proposed by Gulrajani et al. (2017); we closely followed all their recommendations yet were unable to achieve their stated inception score of 7.86.

We also trained our model on the celebrity face dataset (Liu et al., 2015). We do not center-crop the images, which would focus only on the face; rather, we downsample but maintain the whole image. As in the previous experiments, the editors build upon one another rather than working at cross purposes. Section 7.2 shows some samples generated by our model.
In our experiments, we were able to successfully train a sequential GAN architecture to achieve competitive results. Our model uses a smaller network and the training regime is also more efficient than the alternatives (Section 7.1). We trained with 5 editors, but in evaluation needed only two or three editors to achieve the best scores (Table 1); this suggests that we can use a smaller network during evaluation or for downstream tasks. Our results yielded diverse samples and we do not notice any visual signs of mode collapse (Figure (b)). To ensure that our generator architecture and training scheme are responsible for the performance, we ran experiments with a single critic and saw that results did not degrade (see model 3 in Table 1); with end-to-end backpropagation through the chain generator, however, results degraded significantly (see model 5 in Table 1).
We also experimented with inducing a game between the editors themselves, using their loss functions. In one formulation, each editor's loss was the difference of the critic scores between itself and the previous editor, scaled by a weighting factor. As such, each editor would compete against the previous one and try to outperform it. In another formulation, each editor's loss was a discounted sum of all the critic scores ahead of it. This would force earlier editors to take on more responsibility since they influence those further ahead. Although promising, these methods often led to unstable training. Future work could focus on training stability with these approaches, as there is merit in exploring strategies between editors.

The idea of splitting sample generation into a multi-step process is very flexible and can be extended in a number of ways. Currently, editors are completely unsupervised, but this can be extended to make each editor responsible for a different feature. One approach might build upon InfoGAN, which provides the generator with several latent variables, along with noise, and tries to maximize the mutual information between the generated samples and the latent variables (Chen et al., 2016). Using a ChainGAN approach, we could use the base generator to simply generate a sample, and each editor would be responsible for a different latent variable, with a goal of maximizing mutual information between that variable and its output.
ChainGAN may also prove effective in text generation, a domain with which GANs struggle. In this context, the base generator could be responsible for generating a bag of words, and the editors would be responsible for reordering them, e.g., for coherence. Editors could similarly be used to change the tone or sentiment of text appropriately.
In this paper, we present a new sequential approach to GANs. Instead of attempting to generate a sample from noise in one shot, our model first generates a crude sample via the base generator and then successively enhances it through a chain of networks called editors. We use multiple critics, each corresponding to a different network in the chain generator. Our model is efficient and we successfully trained on a number of datasets to achieve competitive results, outperforming some existing models. Furthermore, our scheme is very flexible and there are several avenues for extension.
Proceedings of International Conference on Computer Vision (ICCV), 2015.
C-RNN-GAN: Continuous recurrent neural networks with adversarial training. NIPS, 2016.
Unpaired image-to-image translation using cycle-consistent adversarial networks. CVPR, 2017.

The Wasserstein distance can be intuitively interpreted as the cost of the cheapest transport strategy that transforms one distribution into another. Between two distributions $P_r$ and $P_g$, this can be expressed as:

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[ \lVert x - y \rVert \right] \tag{7}$$

where $\Pi(P_r, P_g)$ is the set of all distributions whose marginals are $P_r$ and $P_g$. Arjovsky et al. (2017) show that, using the Kantorovich-Rubinstein duality and a slight reframing, this objective can be expressed as:

$$K \cdot W(P_r, P_g) = \sup_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] \tag{8}$$

Here, $\mathcal{D}$ is the set of all $K$-Lipschitz functions for some $K$. We can use a neural network to approximate the function $D$. Now that we have an approximation for the Wasserstein distance between $P_r$ and $P_g$, the generator network seeks to minimize this, giving us the WGAN formulation of:

$$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{z}[D(G(z))] \tag{9}$$

In our ChainGAN model, the output of each editor can be seen as a different transformation of the noise vector $z$. For example, Editor $i$'s output is obtained by the following transformation on $z$: $\tilde{x}_i = E_i(E_{i-1}(\dots E_1(G_b(z)) \dots))$. Thus, to approximate the Wasserstein distance between $P_r$ and $P_{g_i}$, the distribution of $\tilde{x}_i$, we use a network $D_i$. For a base generator and $n$ editors, we have $n+1$ critics.
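For intuition about the quantity the critics approximate, the Wasserstein-1 distance has a closed form in one dimension: for two empirical samples of equal size, the cheapest transport plan matches sorted values pairwise. The sketch below uses that special case only; it is not how the distance is estimated for images, and the function name is illustrative.

```python
from typing import Sequence


def wasserstein_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # In 1-D the optimal plan pairs the k-th smallest x with the k-th smallest y.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


print(wasserstein_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0]))  # 1.0
print(wasserstein_1d([0.0, 2.0], [1.0, 1.0]))            # 1.0
```

In the first call every point has to move by one unit once both samples are sorted, so the average transport cost is 1.0.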
The input to each ResBlock (of any type) is added to the output of that ResBlock after undergoing a resample (of the same type as the ResBlock in question) and a dimension-invariant convolution. The architecture choices for the different ResBlocks all come from the WGAN-GP paper (Gulrajani et al., 2017). ResBlock type I is composed of a convolution with padding 1, followed by a ReLU activation function, followed by another similar convolution, followed by mean pooling. ResBlock type II is composed of a ReLU activation function followed by a convolution with padding 1, then another ReLU followed by a similar convolution, and a mean pooling layer at the end. ResBlock type III is the same as ResBlock type II except that it includes a batch normalization layer before each ReLU and has an upsampling layer instead of the mean pooling layer. ResBlock type IV is similar to types II and III except that it has no resampling layer. The last convolution layer is just a convolution with padding 1.
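Critic ()

| Layer | Kernel Size | Resample | Output shape |
|---|---|---|---|
| Input Image | | | |
| Residual Block Type I | [] | Down | |
| Residual Block Type II | [] | Down | |
| Residual Block Type II | [] | Down | |
| Residual Block Type II | [] | Down | |
| Linear ( such layers) | | | 1 |

Base Generator ( or )

| Layer | Kernel Size | Resample | Output shape |
|---|---|---|---|
| | | | 128 |
| Linear | | | |
| Residual Block Type III | [] | Up | |
| Residual Block Type III | [] | Up | |
| Residual Block Type III | [] | Up | |
| Convolution | [] | | |

Editor ()

| Layer | Kernel Size | Resample | Output shape |
|---|---|---|---|
| Previous Image | | | |
| Residual Block Type IV | [] | | |
| Residual Block Type IV | [] | | |
| Residual Block Type IV | [] | | |
| Convolution | [] | | |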
Each generator convolution transpose refers to a layer with a convolution transpose of stride 2 and no padding, followed by batch normalization, and then a ReLU activation function. Each critic convolution refers to a convolution operation with stride 2 and padding of 1, followed by a LeakyReLU activation function. Notice here that we do not use batch normalization for the critic. The editor used with the DCGAN variant shares its architecture with its counterpart in the ResNet section and is only reproduced here for clarity.

Critic ()

| Layer | Kernel Size | Output shape |
|---|---|---|
| Input Image | | |
| Convolution | | |
| Convolution | [] | |
| Convolution | [] | |
| Convolution | [] | |
| Linear ( such layers) | | 1 |

Base Generator ( or )

| Layer | Kernel Size | Output shape |
|---|---|---|
| | | 128 |
| Linear | | |
| Convolution Transpose | [] | |
| Convolution Transpose | [] | |
| Convolution Transpose | [] | |

Editor ()

| Layer | Kernel Size | Resample | Output shape |
|---|---|---|---|
| Previous Image | | | |
| Residual Block Type IV | [] | | |
| Residual Block Type IV | [] | | |
| Residual Block Type IV | [] | | |
| Convolution | [] | | |
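Since the output shapes in the tables above are not legible, the following sketch shows how spatial sizes follow from the stated strides and paddings using the standard convolution size formulas (dilation 1, no output padding). The kernel sizes 4 and 2 and the starting resolutions are assumptions chosen purely for illustration; they are not taken from the paper.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel


# Critic side: stride 2, padding 1 as described; kernel size 4 is assumed.
critic_sizes = [32]
for _ in range(3):
    critic_sizes.append(conv_out(critic_sizes[-1], kernel=4, stride=2, padding=1))
print(critic_sizes)      # [32, 16, 8, 4]

# Generator side: stride 2, no padding as described; kernel size 2 is assumed.
generator_sizes = [4]
for _ in range(3):
    generator_sizes.append(conv_transpose_out(generator_sizes[-1], kernel=2, stride=2, padding=0))
print(generator_sizes)   # [4, 8, 16, 32]
```

With these assumed kernels each critic convolution halves the resolution and each generator layer doubles it; other kernel sizes would give different intermediate shapes.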