AGAN: Towards Automated Design of Generative Adversarial Networks

06/25/2019 ∙ by Hanchao Wang, et al. ∙ Baidu, Inc.

Recent progress in Generative Adversarial Networks (GANs) has shown promising signs of improving GAN training via architectural change. Despite some early success, at present the design of GAN architectures requires human expertise and laborious trial-and-error testing, and often draws inspiration from its image classification counterpart. In the current paper, we present the first neural architecture search algorithm, automated neural architecture search for deep generative models, or AGAN for short, that is specifically suited for GAN training. For unsupervised image generation tasks on CIFAR-10, our algorithm finds architectures that outperform state-of-the-art models under the same regularization techniques. For supervised tasks, the automatically searched architectures also achieve highly competitive performance, outperforming the best human-invented architectures at resolution 32×32. Moreover, we empirically demonstrate that the modules learned by AGAN are transferable to other image generation tasks such as STL-10.




1 Introduction

Generative Adversarial Networks (GANs) have attracted much research interest since their introduction [1], with a wide range of applications such as image generation, text-to-image synthesis, and style transfer [2, 3, 4], among many others. A GAN learns a target distribution by involving two deep neural networks, namely the generator G and the discriminator D, in a minimax game. The generator G aims to generate samples that resemble real samples from the target distribution, while the discriminator D aims to distinguish the generated samples from the real samples. The model is then trained with simultaneous SGD until a Nash equilibrium is achieved.
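For reference, the minimax game in the original GAN formulation [1] is usually written as below. The symbols $p_{\mathrm{data}}$ (target distribution) and $p_z$ (latent prior) are notation assumed here for illustration; they are not defined in the text above.

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```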

Due to the minimax optimization and simultaneous SGD, GANs are known to suffer from training instabilities. To mitigate the issue, a string of works focus on the choice of the GAN objective function. Notably, in Wasserstein GAN [5], the authors propose to minimize the Wasserstein distance between the model and target distributions, instead of the original Jensen-Shannon divergence. In LS-GAN [6], the authors consider a least-squares loss, which corresponds to minimizing the Pearson $\chi^2$ divergence between the distributions. In $f$-GAN [7], the authors show that any $f$-divergence can be used as the GAN objective. Another line of work focuses on regularization and normalization techniques, especially the Lipschitz continuity of the discriminator and the conditioning of the generator [8]. Prominent examples include gradient penalty [9], which penalizes the model when the gradient norm moves away from 1, and spectral normalization [10], which normalizes each layer by its largest singular value, estimated with the power iteration method.
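To make the power iteration idea concrete, here is a minimal NumPy sketch of estimating the largest singular value of a weight matrix and dividing by it. It is an illustration only, not the implementation used in [10]: the function names, the iteration count, and the fresh random start vector are assumptions (practical implementations typically keep a persistent vector and run a single iteration per training step).

```python
import numpy as np

def spectral_norm_estimate(weight, n_iters=50, seed=0):
    """Estimate the largest singular value of a 2-D matrix by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(weight.shape[0])  # assumed random start vector
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = weight.T @ u                      # right singular vector estimate
        v /= np.linalg.norm(v) + 1e-12
        u = weight @ v                        # left singular vector estimate
        u /= np.linalg.norm(u) + 1e-12
    return float(u @ weight @ v)              # sigma ~= u^T W v

def spectrally_normalize(weight, n_iters=50):
    """Rescale the matrix so that its largest singular value is about 1."""
    return weight / spectral_norm_estimate(weight, n_iters)
```

For convolutional layers, the kernel would first be reshaped into a 2-D matrix; that step is omitted here.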

Different from existing approaches, we investigate automating the design of neural architectures to stabilize GAN training and improve performance. There is empirical evidence [9, 10] suggesting that generator and discriminator architectures affect the stability of GAN training, and hence the quality and diversity of the generated images. Despite this early evidence, we observe that DCGAN-style [11] and ResNet [9] architectures are by far the most prevalent architectures in the GAN literature. Such architectures are built upon highly successful modules used primarily in discriminative tasks, and their optimality in generative model construction is questionable.

Neural architecture search (NAS) has emerged as a promising research direction in recent years. On benchmark data sets including Penn Treebank, CIFAR-10 and ImageNet, NAS algorithms have been shown to design architectures that rival or even outperform the best human-invented architectures [12, 13, 14]. The direct application of NAS to GAN architecture design is, however, non-trivial, due to at least two factors. First, the generator of a GAN consists of up-sampling modules, which are almost never used in image classifiers. Typical image classifiers only use down-sampling modules, and hence we cannot borrow experience from well-studied NAS directly. Second, architectures of GANs have been much less explored. Compared with the traditional NAS application of image classification, we aim at searching through a large variety of topological structures with less human prior knowledge imposed. To the best of our knowledge, we are the first group that aims to perform automated architecture design of deep generative models.

We build our automated neural architecture search on reinforcement learning. In our algorithm, an RNN controller encodes the architectures of the up-sampling, down-sampling, and normal modules of the GAN. We carefully crafted the search space and proposed a new form of reward shaping so that the algorithm is guided faster towards promising architectures. We performed a comprehensive experimental study to evaluate the novelty, performance, and transferability of the identified GAN architectures.

In sum, our main contributions are described below.

  • We presented the first automated neural architecture search algorithm, AGAN, that is specifically designed for the optimization of neural network architectures in deep generative models.

  • We have identified three novel, modularized architectures, AGAN-A, AGAN-B, and AGAN-C, with distinct topologies.

  • In our comprehensive experimental study we found that AGAN-A, AGAN-B, and AGAN-C have comparable performance to the best GAN models designed by human experts. In addition, AGAN-C outperforms the state-of-the-art models under the same regularization techniques for unsupervised image generation tasks on CIFAR-10.

  • We empirically evaluated and confirmed that the modules learned by AGAN are transferable to other data sets such as STL-10.

The rest of the paper is organized as follows. In Section 2, we present an overview of GAN and NAS. We discuss our methodology in Section 3 and present our experimental evaluation of AGAN in Section 4. In Section 5, we briefly discuss the differences that we observed between our search and traditional NAS, and conclude.

2 Related Work

Equipped with multilayer perceptrons as generator and discriminator, the original GAN [1] can successfully learn the data distribution of MNIST, but fails at more complicated image generation tasks. In DCGAN [11], the authors propose a novel class of CNNs as generator and discriminator, together with a set of architecture guidelines for stable convolutional GAN training. Most notably, in the generator, the spatial activation size is doubled every layer while the number of output channels is halved; the discriminator closely resembles the reverse of the generator. Gulrajani et al. [9] propose a ResNet architecture for GAN on CIFAR-10. In particular, the residual blocks in the generator perform nearest-neighbor up-sampling before the second convolution, while some blocks in the discriminator perform average pooling after the second convolution. Many later GAN models are built upon DCGAN-style or ResNet architectures, such as SNGAN [10], SAGAN [15] and BigGAN [16]. In SAGAN, the authors propose a self-attention layer that models the non-local dependency between high-resolution and low-resolution feature maps. They also make a minor modification to the discriminator by altering the number of hidden-layer output channels in residual blocks. In BigGAN, the authors introduce further architectural changes, including shared class embedding and skip connections from the latent variable $z$.

Concerning the architecture design of GANs, a particular line of work focuses on how to make use of label information to improve performance. In the classic conditional GAN [17] framework, label information is concatenated to the input or hidden representations to model a conditional distribution. Miyato et al. [18] propose a projection-based way to incorporate label information into the discriminator. On the other hand, De Vries et al. [19] introduce Conditional Batch Normalization to visual question answering tasks, which learns a scale and a shift for each class label. Conditional Batch Normalization is widely used in image generation models [10, 15, 16] to provide label information for the generator of GAN.

A neural architecture search algorithm requires an objective, quantitative metric to measure the performance of the underlying models. In the case of GANs, well-studied methods such as kernel density estimation (KDE, or Parzen window estimation) have been questioned as a suitable indicator of the visual fidelity of generated images [20]. Inception Score (IS) [21] and Fréchet Inception Distance (FID) are arguably the most popular evaluation metrics in the literature. IS uses a pre-trained Google Inception model to classify generated samples. It is defined as

$$\mathrm{IS} = \exp\left(\mathbb{E}_{x \sim p_g}\left[D_{\mathrm{KL}}\left(p(y \mid x) \,\|\, p(y)\right)\right]\right),$$

where $p_g$ is the generated distribution, $p(y \mid x)$ is the conditional label distribution through the Inception model, and $p(y)$ is the marginal of $p(y \mid x)$ over $p_g$. Similarly, FID uses the Google Inception model as a feature extractor and computes the distance between the real distribution $p_r$ and $p_g$ as

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right),$$

where $\mu_r$, $\Sigma_r$, $\mu_g$, $\Sigma_g$ are the means and covariances of the extracted features under the real and generated distributions.
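As an illustration of the two metrics, the following NumPy sketch computes IS from a matrix of class probabilities and the Fréchet distance from feature statistics. It assumes the Inception model has already been applied elsewhere: `probs` (predicted label distributions) and the means and covariances are taken as given inputs. The function names are hypothetical. The trace of the matrix square root is obtained from the eigenvalues of the covariance product, which is a convenience of this sketch and not necessarily how reference implementations compute it.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, K) array; row i is p(y | x_i) for a generated sample x_i."""
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), averaged over samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))               # exp of the mean KL divergence

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between two Gaussians fitted to extracted features."""
    diff = mu_r - mu_g
    # Tr((Sigma_r Sigma_g)^{1/2}) is the sum of square roots of the eigenvalues
    # of the product, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)
```

Two sanity checks follow from the definitions: uniform predictions give an IS of 1, and identical feature statistics give a distance of 0.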

NAS has become a mainstream research topic since Zoph and Le [12] found a state-of-the-art recurrent cell on Penn Treebank and a highly competitive architecture on CIFAR-10 using Reinforcement Learning (RL). Various RL methods have been successfully applied to NAS, including vanilla policy gradient [23, 24], Proximal Policy Optimization (PPO) [13, 25] and Q-learning [26, 27]. An alternative approach is to use evolutionary algorithms [28, 14, 29], maintaining and evolving a large population of neural architectures. In contrast to the aforementioned gradient-free optimization methods, Liu et al. [30] propose a gradient-based search strategy based on a continuous relaxation of the architecture representation. Other gradient-based approaches include Neural Architecture Optimization (NAO) [31] and ProxylessNAS [32].

Inspired by Google Inception model, Zoph et al. [13] and Zhong [27] propose a search space based on two types of convolutional cells, named normal and reduction cell. This design leads to a simplified yet quality search space and enables the transferability of resulting architecture found by NAS. It is widely adopted by many later works [33, 25, 14, 30, 31]. Our work also falls in the category of searching cell topology but differs in the following ways:

  • In all previous RL-based NAS algorithms, the convolutional cell consists solely of unary and binary operations, except for the final concatenation. In other words, a candidate cell topology can only be a DAG with indegree no greater than 2. Our architecture representation allows searching through cells with arbitrary topology.

  • Previous works search for discriminative models with normal or down-sampling modules, whereas we search for generative models, where up-sampling modules play a significant role.

3 Method

Our work makes use of the Neural Architecture Search with Reinforcement Learning framework proposed by [12]. A controller recurrent neural network (RNN) samples architectures of the generator and discriminator of the GAN simultaneously. The sampled architectures are then sent to computation nodes for training and evaluation using Inception Score. The resulting performance is used as feedback to update the controller RNN parameters using the REINFORCE rule [34]. Below we describe the three critical components of our design: (i) the controller architecture, (ii) the set of operations that we use to construct a GAN (a.k.a. the search space), and (iii) how the controller is trained with reinforcement learning.

3.1 Controller architecture

The controller is a two-layer LSTM consisting of three segments (Figure 1), programming the up-sampling module in the generator and the down-sampling and normal modules in the discriminator, respectively. In each segment, the controller iteratively outputs a candidate operation in the module and an adjacency vector indicating the tensors that will be fed into the next operation; the output, either an operation or an adjacency vector, is then fed into the next step as input.

All operations are sampled through a softmax classifier with sample temperature $T_o$ and logit clipping constant $C_o$:

$$o_t \sim \mathrm{softmax}\left(C_o \tanh\left(W_o h_t / T_o\right)\right),$$

where $o_t$ is the output operation and $h_t$ is the last hidden layer at the current time step. $o_t$ is fed into the controller RNN through an embedding layer; the embedding parameters are only shared within the same segment.

The adjacency vector is sampled from an element-wise independent multivariate Bernoulli distribution:

$$a_t \sim \mathrm{Bernoulli}\left(\sigma\left(C_a \tanh\left(W_a h_t / T_a\right)\right)\right),$$

where $a_t$ is the adjacency vector and $\sigma$ is the sigmoid activation. $a_t$ is fed into the controller RNN through a linear projection layer whose parameters are similarly shared within the same segment.
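The two sampling steps can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the common convention of dividing logits by the temperature and then clipping them with a scaled tanh, and it takes the pre-activation `logits` (the projected hidden state) as given. The function names and any numeric values passed in are placeholders.

```python
import numpy as np

def clipped_logits(logits, temperature, clip):
    """Assumed convention: divide by the temperature, then squash to [-clip, clip]."""
    return clip * np.tanh(np.asarray(logits, dtype=np.float64) / temperature)

def sample_operation(logits, temperature, clip, rng):
    """Sample one operation index from a softmax over clipped logits."""
    z = clipped_logits(logits, temperature, clip)
    p = np.exp(z - z.max())                   # numerically stable softmax
    p /= p.sum()
    return int(rng.choice(len(p), p=p)), p

def sample_adjacency(logits, temperature, clip, rng):
    """Sample a 0/1 adjacency vector with independent Bernoulli entries."""
    z = clipped_logits(logits, temperature, clip)
    p = 1.0 / (1.0 + np.exp(-z))              # element-wise sigmoid
    return (rng.random(p.shape) < p).astype(np.int64), p
```

Clipping bounds how peaked either distribution can become: with clipping constant `clip`, no two operation probabilities can differ by more than a factor of `exp(2 * clip)`, which keeps the controller exploring early in the search.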

Figure 1: Controller RNN architecture. Above: The controller consists of three segments, programming the up-sampling module in the generator, the down-sampling and normal modules in the discriminator, respectively; Below: In each segment, the controller samples an operation and an adjacency vector in turn in an autoregressive manner.

3.2 The Search Space

The outputs of each segment in the controller RNN will be used to program a module in the child model. At each time step, we select tensors according to the sampled adjacency vector, and feed their sum into next operation.

More precisely, each module takes the outputs of the last two modules, prev and input, as inputs. The output sequence of a controller RNN segment always starts and ends with an operation: $o_1, a_2, o_2, \dots, a_N, o_N$. The module is constructed as follows:

1. Apply the first operation $o_1$ to prev to form the skip connection. Let $S = \{s_0, s_1\}$, where $s_0 = \text{input}$ and $s_1 = o_1(\text{prev})$.

2. For each $i \geq 2$, select tensors from $S$ according to $a_i$. If $a_i = 0$, the input of the module will be selected.

3. Apply $o_i$ to the sum of the selected tensors and add the resulting tensor to $S$.

4. Repeat Step 2 and Step 3.

5. Concatenate the tensors in $S$ that have never served as an input to form the final output.

Figure 2: A normal module defined by the controller sequence conv 1x1, adjacency vector, maxpool 3x3, adjacency vector, sep 3x3, adjacency vector, avgpool 7x7. 1) Apply conv 1x1 to prev. 2) Select input according to the adjacency vector and apply maxpool 3x3. 3) No tensors selected; apply sep 3x3 to input. 4) Sum over two tensors as instructed by the adjacency vector and apply avgpool 7x7. 5) Concatenate the two tensors that have never served as an input to form the final output.
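The construction procedure above can be sketched in a few lines of Python. This is one reading of the steps, under stated assumptions: the candidate pool starts with the module input and the skip connection, an all-zero adjacency vector falls back to the module input, and tensors are arrays with channels on axis 0. The function name `build_module` and the toy operations in the usage example are hypothetical.

```python
import numpy as np

def build_module(prev, inp, ops, adjs):
    """ops: callables o_1..o_N; adjs: 0/1 vectors, one per operation after the first."""
    tensors = [inp, ops[0](prev)]   # assumed pool: module input, then skip connection
    used = [False, False]           # whether each tensor has served as an input
    for op, adj in zip(ops[1:], adjs):
        picked = [i for i, bit in enumerate(adj[:len(tensors)]) if bit]
        if not picked:
            picked = [0]            # empty selection: fall back to the module input
        for i in picked:
            used[i] = True
        tensors.append(op(sum(tensors[i] for i in picked)))
        used.append(False)
    # Concatenate along the channel axis every tensor that never fed an operation.
    return np.concatenate([t for t, u in zip(tensors, used) if not u], axis=0)
```

A usage example mirroring the shape of the Figure 2 sequence, with simple arithmetic standing in for the real operations:

```python
prev, inp = np.ones((1, 2, 2)), 3.0 * np.ones((1, 2, 2))
ops = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 1, lambda x: x * 10]
adjs = [[1, 0], [0, 0, 0], [0, 0, 1, 1]]
out = build_module(prev, inp, ops, adjs)  # two unused tensors are concatenated
```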

In addition, we adopt the following heuristics to ensure the computation graph is well-defined:

  • The first operation is interpreted as an up-sampling (down-sampling) operation if the previous module is an up-sampling (down-sampling) module.

  • For up-sampling modules, the operations applied to , will be interpreted as up-sampling operations.

  • For down-sampling modules, the operations applied right before concatenation will be interpreted as down-sampling operations.

  • convolutions are applied to the final output to keep the number of channels constant.

The meta-architectures of the generator and the discriminator are manually determined as follows. Starting with a linear layer, the generator consists of up-sampling modules, followed by a convolution and a tanh activation. The discriminator starts with a convolution, followed by down-sampling modules, normal modules, a global sum pooling layer and a linear layer. For the conditional version of the model, the discriminator logit is augmented with a projection layer as in [18].

Figure 3: Meta-architecture of the generator and discriminator

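We use the hinge loss [36] as the objective function, where

$$L_D = \mathbb{E}_{x \sim p_r}\left[\max\left(0, 1 - D(x)\right)\right] + \mathbb{E}_{z \sim p_z}\left[\max\left(0, 1 + D(G(z))\right)\right],$$

$$L_G = -\mathbb{E}_{z \sim p_z}\left[D(G(z))\right].$$

Here $p_r$ is the real data distribution and $p_z$ is the prior over the latent variable $z$.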
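As a small illustration, the hinge objective in its commonly used form can be computed from raw discriminator outputs as below. The function names are hypothetical, and the inputs are assumed to be arrays of unnormalized discriminator scores on real and generated batches.

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    """Discriminator loss: penalize real scores below +1 and fake scores above -1."""
    d_real = np.asarray(d_real, dtype=np.float64)
    d_fake = np.asarray(d_fake, dtype=np.float64)
    return float(np.mean(np.maximum(0.0, 1.0 - d_real))
                 + np.mean(np.maximum(0.0, 1.0 + d_fake)))

def g_hinge_loss(d_fake):
    """Generator loss: raise the discriminator score on generated samples."""
    return float(-np.mean(np.asarray(d_fake, dtype=np.float64)))
```

Samples that the discriminator already scores beyond the margin contribute no gradient to its loss, which is one reason this objective is often paired with spectral normalization.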

To cover a large variety of candidate architectures, we collect the following set of operations as our normal operations:


  • identity

  • convolution

  • convolution

  • dilated convolution

  • depthwise-separable convolution

  • depthwise-separable convolution

  • depthwise-separable convolution

  • then convolution

  • then convolution

  • then convolution

  • max pooling

  • max pooling

  • average pooling

  • average pooling

  • average pooling

For up-sampling modules, based on state-of-the-art GAN architectures, we consider two different types of up-sampling operations: 1) transposed convolutions, and 2) nearest-neighbor interpolation followed by any convolution in the list above. For down-sampling modules, motivated by optimized residual blocks, we include 1) convolution followed by strided average pooling and 2) strided average pooling followed by convolution as two types of atomic operations.

We use BN - ReLU - Conv for all convolutional operations in G, and ReLU - Conv for all convolutional operations in D. There is neither Batch Normalization nor ReLU between the two convolutions of a "then" operation.

3.3 Training with Reinforcement Learning

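We use the REINFORCE rule [34] to update the controller RNN parameters $\theta$. Let $s_{1:T} = (s_1, \dots, s_T)$ be the output sequence of the controller, including both operational and connectivity choices. We have the following update rule for $\theta$:

$$\theta \leftarrow \theta + \eta \sum_{t=1}^{T} \nabla_\theta \log P\left(s_t \mid s_{1:t-1}; \theta\right)\left(R - b\right),$$

where $\eta$ is the learning rate, $R$ is the reward for taking actions $s_{1:T}$, and $b$ is the baseline for variance reduction. In particular, when $s_t$ is an operation or an adjacency vector, $\log P(s_t \mid s_{1:t-1}; \theta)$ is the negative softmax or sigmoid cross-entropy, respectively.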
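A toy sketch of this update is given below. It replaces the controller RNN with a table of independent logits (one row per decision) and uses an exponential moving average of past rewards as the baseline; both are simplifying assumptions made here, as are the function names and the toy reward that stands in for the shaped Inception Score.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(theta, actions, reward, baseline, lr):
    """theta: (T, K) logits, one row per decision; actions: T sampled indices."""
    grad = np.zeros_like(theta)
    for t, a in enumerate(actions):
        grad[t] = -softmax(theta[t])
        grad[t, a] += 1.0                     # d log p(a) / d logits = onehot(a) - p
    return theta + lr * (reward - baseline) * grad

def train_toy_controller(steps=500, T=3, K=4, lr=0.2, decay=0.9, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((T, K))
    baseline = 0.0                            # assumed moving-average baseline
    for _ in range(steps):
        actions = [int(rng.choice(K, p=softmax(theta[t]))) for t in range(T)]
        # Toy reward: fraction of decisions equal to option 0.
        reward = float(np.mean([a == 0 for a in actions]))
        theta = reinforce_step(theta, actions, reward, baseline, lr)
        baseline = decay * baseline + (1 - decay) * reward
    return theta
```

When the reward equals the baseline the parameters do not move, which is the sense in which the baseline reduces variance without changing the expected gradient.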

We measure the performance of GAN using Inception Score. More precisely, we propose the following reward shaping

where the two constants make the reward more sensitive as IS approaches its optimal value. Due to the instability of GAN training, the Inception Score needs to be averaged over multiple training runs to ensure a reliable measurement. In practice, however, we found that the proposed NAS algorithm works with a single training run per sampled architecture.

4 Experiments

4.1 Data Sets

We used two data sets in our experimental study. The CIFAR-10 data set consists of 60,000 32×32 color images in 10 different classes. The data set is divided into a training set of 50,000 images and a testing set with the remaining 10,000 images. Only the training set is used for our experiments. The STL-10 data set is an image data set of 96×96 color images. It is composed of 13,000 labeled images in 10 different classes, with 500 training images and 800 testing images per class, and an additional 100,000 unlabeled images for unsupervised learning.

For data preprocessing, we follow the setup in [10] by scaling the images to $[-1, 1]$ and then adding random noise, for both data sets.

4.2 Experimental Procedure

The controller used in our experimental study is a two-layer LSTM with units, consisting of three segments. Each segment outputs a sequence of actions ( operations and adjacency vectors), encoding a DAG of nodes. We use sample temperature and logit clipping when sampling operations, and when sampling adjacency vectors. The controller is trained using policy gradient with learning rate . We compensate the loss with an entropy temperature to ensure better exploration. The controller is updated whenever rewards are collected from child models. We use Titan X GPUs training for days, with an overall sample complexity of .

When constructing the GAN model, we fix the number of channels in both the generator and discriminator to be . We find that using global sum pooling instead of global average pooling in the penultimate layer of the discriminator stabilizes the training. We use Adam optimizer [37] for optimization with , and . The discriminator is updated steps per one generator update step. The number of samples generated is per G update and per D update. We use batch size for real samples. To evaluate the architecture, the model is trained for steps. Inception Score is then calculated based on generated samples divided into groups. For reward shaping, we choose (Inception Score of the real data) and .

(a) Average Inception Score
(b) Best Inception Score
Figure 4: Progression of Inception Score on CIFAR-10
Model with the highest Inception Score (when trained for steps) is generated as early as step . The controller, however, continues to learn the distribution that samples better performing models on average. In fact, the best models when trained to full size are generated at the later stage of the architecture search.

4.3 Learning GAN architecture on CIFAR-10

For the task of supervised image generation on CIFAR-10, we take top candidate models discovered in the architecture search and train for steps. We scale up the models by doubling the number of channels in both the generator and the discriminator. The label information is fed into G via Conditional Batch Normalization (CBN) [19] and into D via projection [18]. We use Spectral Normalization [10] for the discriminator but not the generator. The best architectures are reported in Table 1.

Method Inception Score FID
Real data
DCGAN style
DCGAN [38]
Salimans et al. [21]
SN-GAN [18] [18]
Table 1: Supervised image generation on CIFAR-10
(a) AGAN-A
(b) AGAN-B
(c) AGAN-C
Figure 5: Images generated by AGANs in supervised image generation tasks

AGAN-A and AGAN-B outperform all DCGAN-style architectures. The best architecture we found, AGAN-B, also outperforms all ResNet architectures with input resolution or less than M parameters. In particular, BigGAN [16] architecture resides on input images and has parameters. The architectures we proposed have much less parameters (M, M and M, respectively) in comparison.

We also train models with the same topology for unsupervised image generation tasks. We drop the projection layer in D and use Batch Normalization in place of CBN in G. We find that scaling up does not guarantee performance gain in this setting. All of the architectures proposed outperform DCGAN-style architectures and AGAN-C outperforms all ResNet architectures in terms of Inception Score.

Method Inception Score FID
Real data
DCGAN style
DCGAN [39] [40]
MMD GAN [41]
WGAN-GP [10] 24.8 [40]
Salimans et al.
Coulomb GANs 27.3
Table 2: Unsupervised image generation on CIFAR-10

In Figure 6 we decipher the architecture of the learned model AGAN-A. Note that the topologies of all three modules (up-sampling, down-sampling, and normal) are quite different from the modules used in existing models. Such an architecture is a hybrid between Inception and ResNet in that each cell, as deciphered here, contains multiple branches, while cells are stacked together in a way resembling ResNet, as shown previously in Figure 3. We believe this is the first time that Inception-ResNet hybrid architectures have been used for GANs. Moreover, the cells that we see here are quite different from the Inception cells typically used in discriminative models, which provides evidence supporting our original idea that optimal GAN architectures could be quite different from those of discriminative models. The architectures of AGAN-B and AGAN-C bear some resemblance to AGAN-A, and we omit their diagrams for brevity.

(a) Up-sampling module
(b) Down-sampling module
(c) Normal module
Figure 6: Topology of modules in AGAN-A
Note that the up-sample (down-sample) operations following prev will only be applied when the module is preceded by an up-sampling (down-sampling) module.

4.4 Transferability of AGAN

One potential advantage of a modularized search space is that it enables the transferability of the learned architecture: modules generated on smaller data sets can be used as building blocks to construct networks on larger data sets, where direct neural architecture search may be infeasible or unfavorable. In this experiment, we empirically evaluate the transferability of some of our learned modules, namely AGAN-A and AGAN-C.

Our STL-10 network has the same meta-architecture as the one for CIFAR-10, with the distinction that the first up-sampling module in G takes a different input size. We also resize the STL-10 images. As shown in Table 3, although their topologies are not optimized for STL-10, AGAN-A and AGAN-C achieve highly competitive performance, outperforming all DCGAN-style architectures. The experiment provides evidence suggesting that the architectures we identified might be applicable to a wide range of data sets.

Method Inception Score FID
Real data
DCGAN style
DCGAN [42]
WGAN-GP [43] [10]
Splitting GAN
Table 3: Unsupervised image generation on STL-10
(a) Up-sampling
(b) Down-sampling
(c) Normal
Figure 7: Empirical distribution of sampled operations by module over time
Learned operations: (a) up-sampling module: conv , conv then (b) down-sampling module: conv , dilated conv (c) normal module: average pooling , ,

5 Discussion & Conclusion

As illustrated in Figure 7, in our search of GAN architectures, the controller RNN learns drastically different distributions over operations for three module types. The up-sampling modules predominately favor conv and conv then ; the down-sampling modules favor conv and dilated conv ; the normal modules favor average pooling , and . This justifies our choice of segmentation of controller RNN. We point out that it is at least in contrast to RL-based NAS algorithms over image classifiers [25, 27, 13], where both the normal cell and reduction cell choose among depthwise-separable convolutions, max and average poolings.

In addition, we observe that

  • the up-sampling modules prefer upsample-then-convolution operations, over transposed convolutions;

  • the down-sampling modules prefer convolution-then-downsample operations over downsample-then-convolutions;

  • the normal modules mostly consist of average poolings, and hence have very few parameters;

  • depthwise-separable convolutions are absent from most networks sampled in the later stage of the search.

In our experiments we observe that the order of operations within a module matters considerably. For example, in down-sampling modules, whether we perform down-sampling at the beginning or at the end of the module may have a significant impact on the overall performance, though the exact mechanism of the effect is not clear.

In conclusion, we present AGAN, the first neural architecture search algorithm for deep generative models. We demonstrate that, with careful design of the controller architecture and search space, an RL-based NAS algorithm can discover highly competitive architectures that rival the best human-invented GAN architectures. Further reducing model size and enabling fast inference are on our future research agenda.