AlphaGAN: Fully Differentiable Architecture Search for Generative Adversarial Networks

06/16/2020 ∙ by Yuesong Tian, et al. ∙ Zhejiang University ∙ Tencent ∙ Columbia University

Generative Adversarial Networks (GANs) are formulated as minimax game problems, whereby generators attempt to approach real data distributions by virtue of adversarial learning against discriminators. The intrinsic complexity of this problem makes it challenging to enhance the performance of generative networks. In this work, we aim to boost model learning from the perspective of network architectures, by incorporating recent progress on automated architecture search into GANs. To this end, we propose a fully differentiable search framework for generative adversarial networks, dubbed alphaGAN. The search process is formalized as solving a bi-level minimax optimization problem, in which the outer-level objective seeks a suitable network architecture towards pure Nash Equilibrium, conditioned on the generator and discriminator network parameters optimized with a traditional GAN loss in the inner level. The entire optimization performs a first-order method by alternately minimizing the two-level objective in a fully differentiable manner, enabling architecture search to be completed in an enormous search space. Extensive experiments on the CIFAR-10 and STL-10 datasets show that our algorithm can obtain high-performing architectures with only 3 GPU hours on a single GPU, in a search space comprised of approximately 2×10^11 possible configurations. We also provide a comprehensive analysis of the behavior of the searching process and the properties of searched architectures, which would benefit further research on architectures for generative models. Pretrained models and codes are available at




1 Introduction

Generative Adversarial Networks (GANs) Goodfellow et al. (2014) have shown promising performance on a variety of generative tasks (e.g., image generation Brock et al. (2018), image translation Zhu et al. (2017); Choi et al. (2018), dialogue generation Li et al. (2017), and image inpainting Yu et al. (2018)), which are typically formulated as adversarial learning between a pair of networks, called the generator and the discriminator Oliehoek et al. (2017); Salimans et al. (2016). However, pursuing high-performance generative networks is non-trivial. The challenges of training models may arise from every factor in the process, from loss functions to network architectures. There is a rich history of research aiming to improve training stabilization and alleviate mode collapse by introducing new generative adversarial functions (e.g., Wasserstein distance Arjovsky et al. (2017), least-squares loss Mao et al. (2017), and hinge loss Miyato et al. (2018)) or regularization (e.g., gradient penalty Gulrajani et al. (2017); Kodali et al. (2017)).

Alongside the direction of improving loss functions, improving architectures has proven important for stabilizing training and improving generalization. Radford et al. (2015) exploit deep convolutional networks in both the generator and the discriminator, and a series of approaches Miyato et al. (2018); Gulrajani et al. (2017); Ye et al. (2019); Brock et al. (2018) show that residual blocks He et al. (2016) are capable of facilitating the training of GANs. However, such manual architecture design typically requires considerable effort and domain-specific knowledge from human experts, and is even more challenging for GANs due to their intrinsic minimax formulation. Recent progress on architecture search for a variety of supervised learning tasks Zoph and Le (2016); Liu et al. (2018); Zoph et al. (2018); Brock et al. (2017) has shown that remarkable results can be obtained by automating the architecture search process.

In this paper, we aim to address the problem of GAN architecture search from the perspective of game theory, since GAN training is essentially a minimax problem Goodfellow et al. (2014) targeting the pure Nash Equilibrium of the generator and the discriminator Salimans et al. (2016); Heusel et al. (2017). From this perspective, we propose a fully differentiable architecture search framework for GANs, dubbed alphaGAN, in which a differentiable evaluation metric is introduced to guide architecture search towards pure Nash Equilibrium Nash and others (1950). Motivated by DARTS Liu et al. (2018), we formulate the search process of alphaGAN as a bi-level minimax optimization problem and solve it efficiently via stochastic gradient-type methods. Specifically, the outer-level objective optimizes the generator architecture parameters towards pure Nash Equilibrium, whereas the inner-level constraint optimizes the weight parameters conditioned on the currently searched architecture. The formulation of alphaGAN is generic and task-agnostic, and is suitable for any generative task with a minimax formulation.

This work is related to several recent methods. GAN architecture search with a reinforcement learning paradigm has been proposed in Wang and Huan (2019); Gong et al. (2019), rewarded by Inception Score Salimans et al. (2016), a task-dependent, non-differentiable metric. Extensive experiments, including comparisons to these methods, demonstrate the effectiveness of the proposed algorithm in both performance and efficiency. In particular, alphaGAN can discover high-performing architectures while being much faster than other automated architecture search methods. We also present comprehensive studies for better understanding the searching process and the searched architectures, which we hope will facilitate research on architecture design and search for generative tasks.

2 Preliminaries

Minimax games have regained a lot of attention Osborne and Rubinstein (1994); Du and Pardalos (2013) since being popularized in machine learning, for instance through generative adversarial networks (GANs) Salimans et al. (2016) and reinforcement learning Ho and Ermon (2016); Pinto et al. (2017). Given a function f: U × V → ℝ, we consider the minimax game and its dual form:

min_{u∈U} max_{v∈V} f(u, v),  and  max_{v∈V} min_{u∈U} f(u, v).   (1)

The pure equilibrium Nash and others (1950) of a minimax game can be used to characterize the best decisions of the two players u and v for the above minimax game.

Definition 1

A pair (u*, v*) ∈ U × V is called a pure equilibrium of the game if it holds that

f(u*, v) ≤ f(u*, v*) ≤ f(u, v*),

where u ∈ U and v ∈ V. When the minimax game attains the same value as its dual problem, (u*, v*) is the pure equilibrium of the game. Hence, the gap between the minimax problem and its dual form can be used to measure the degree of approaching pure equilibrium Grnarova et al. (2019).
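As a concrete illustration (our own toy example, not from the paper), the gap between a minimax problem and its dual can be computed exactly for a small bilinear game such as f(u, v) = u·v on [-1, 1]², whose pure equilibrium is (0, 0):

```python
import numpy as np

# Toy bilinear game f(u, v) = u * v on [-1, 1]^2 (illustrative choice,
# not from the paper); its pure equilibrium is (0, 0).
def f(u, v):
    return u * v

grid = np.linspace(-1.0, 1.0, 201)  # dense grid standing in for exact best responses

def duality_gap(u, v):
    # gap between the primal value max_{v'} f(u, v') and the dual value min_{u'} f(u', v)
    return f(u, grid).max() - f(grid, v).min()

print(duality_gap(0.0, 0.0))   # 0.0 at the pure equilibrium
print(duality_gap(0.5, -0.5))  # 1.0 away from it
```

The gap is non-negative and vanishes exactly at the equilibrium, which is what makes it usable as a quantitative training signal.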

Generative Adversarial Network (GAN), proposed in Goodfellow et al. (2014), is mathematically defined as a minimax game problem with a binary cross-entropy loss between the distributions of real images and synthetic images generated by the model. Despite the remarkable progress achieved by GANs, training high-performance models is still challenging for many generative tasks, as training is fragile to almost every factor in the process. Architectures of GANs have proven useful for stabilizing training and improving generalization Miyato et al. (2018); Gulrajani et al. (2017); Ye et al. (2019); Brock et al. (2018), and we hope to discover architectures by automating the design process with limited computational resources in a principled, differentiable manner.

3 GAN Architecture Search as Fully Differential Optimization

3.1 Formulation

Differentiable architecture search was first proposed in Liu et al. (2018), where the problem is formulated as a bi-level optimization:

min_α L_val(w*(α), α)   s.t.   w*(α) = argmin_w L_train(w, α),   (2)

where α and w denote the optimized variables of architectures and network parameters, respectively. In other words, it seeks the optimal architecture that performs best on the validation set with the network parameters trained on the training set. The search process is supervised by minimizing the cross-entropy loss, which is a differentiable function and a good surrogate for the target metric, accuracy. By virtue of continuous relaxation of the search space, the entire framework is differentiable and can be easily incorporated into other supervised learning tasks.
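The alternating first-order strategy used to solve such bi-level problems can be sketched on a toy objective (hypothetical losses chosen purely for illustration; DARTS itself uses network losses): the inner step fits w to the current α, and the outer step nudges α so that the fitted w does well on a "validation" criterion.

```python
# Toy first-order bi-level optimization in the spirit of DARTS (illustrative
# losses, not the paper's): L_train(w, a) = (w - a)^2, L_val(w) = (w - 1)^2.
# The inner optimum is w*(a) = a, so the outer problem drives a toward 1.
w, a = 0.0, 0.0
lr = 0.1
for _ in range(500):
    # inner step: gradient descent on L_train with a fixed
    w -= lr * 2.0 * (w - a)
    # outer step (first-order approximation): since w tracks w*(a) = a,
    # dL_val/da is approximated by dL_val/dw evaluated at the current w
    a -= lr * 2.0 * (w - 1.0)
print(round(a, 3), round(w, 3))  # both converge to 1.0
```

The first-order approximation (treating the current w as the inner optimum) is the same trick that keeps the full search cheap.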

Deploying such a framework for searching GAN architectures is non-trivial. The training of GANs corresponds to the optimization of a minimax problem (as shown above), which learns an optimal generator that tries to fool an additional discriminator. However, generator evaluation is independent of the discriminator and is instead based on extra metrics (e.g., Inception Score Salimans et al. (2016) and FID Heusel et al. (2017)), which are typically discrete and task-dependent.

Evaluation function.

Using a suitable and differentiable evaluation metric to inspect the on-the-fly quality of both the generator and the discriminator is necessary for a GAN architecture search framework. Due to the intrinsic minimax property of GANs, the training process of GANs can be viewed as a zero-sum game, as in Salimans et al. (2016); Goodfellow et al. (2014), in which two players compete in an adversarial manner. The universal objective of training GANs can consequently be regarded as reaching the pure equilibrium of Definition 1. Hence, we adopt the primal-dual gap function Grnarova et al. (2019); Peng et al. (2020) for evaluating the generalization of vanilla GANs. Given a pair of G and D, the duality-gap function is defined as

DG(G, D) = max_{D̄} f(G, D̄) − min_{Ḡ} f(Ḡ, D).   (3)

The evaluation metric is non-negative, and DG(G, D) = 0 is achieved only when the pure equilibrium of Definition 1 holds. Function (3) provides a quantified measure of “how close the current GAN is to pure equilibrium”, which can be used for assessing model capacity.

The architecture search for GANs can be formulated as the following bi-level optimization problem:

min_α DG(G(α, w_G*), D(w_D*))   s.t.   (w_G*, w_D*) = argmin_{w_G} max_{w_D} f(G(α, w_G), D(w_D)),   (4)

where DG is evaluated on the validation dataset and supervises seeking the optimal generator architecture as the outer-level problem, and the inner-level optimization over (w_G, w_D) aims to learn suitable network parameters (of both the generator and the discriminator) for the GAN under the current architecture.

In this work, we exploit the hinge loss from Miyato et al. (2018); Zhang et al. (2018) as the generative adversarial function f, i.e.,

f(G, D) = E_{x∼p_data}[min(0, −1 + D(x))] + E_{z∼p_z}[min(0, −1 − D(G(z)))],   (5)

which has been commonly used in image generation tasks due to its stable behavior during training.
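A minimal numerical sketch of the hinge terms (our own NumPy illustration; the array shapes and function names are assumptions, and real implementations operate on discriminator logits):

```python
import numpy as np

# Hinge adversarial losses as used in SN-GAN-style training. The discriminator
# maximizes E[min(0, -1 + D(x))] + E[min(0, -1 - D(G(z)))], which equals
# minimizing the two hinge terms below; the generator minimizes -E[D(G(z))].
def d_hinge_loss(d_real, d_fake):
    return np.maximum(0.0, 1.0 - d_real).mean() + np.maximum(0.0, 1.0 + d_fake).mean()

def g_hinge_loss(d_fake):
    return -d_fake.mean()

d_real = np.array([2.0, 0.5])    # logits on real images
d_fake = np.array([-2.0, -0.5])  # logits on generated images
print(d_hinge_loss(d_real, d_fake))  # 0.5: only samples inside the margin are penalized
print(g_hinge_loss(d_fake))          # 1.25
```

The margin makes the discriminator gradient vanish for samples it already classifies confidently, which is often credited for the stability noted above.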

AlphaGAN formulation.

By integrating the generative adversarial function (5) and the evaluation function (3) into the bi-level optimization (4), we obtain the final objective for the framework as follows,

min_α  f(G(α, w_G*), D*) − f(G*, D(w_D*))   (6)
s.t.   (w_G*, w_D*) = argmin_{w_G} max_{w_D} f(G(α, w_G), D(w_D)),   (7)

where the generator G and the discriminator D are parameterized with weight variables w_G and w_D, respectively, and D* and G* denote the best responses in the duality gap (3). The search process thus involves three groups of parameters: weight parameters w = (w_G, w_D), test-weight parameters (the weights of G* and D*), and architecture parameters α, as we are mainly concerned with generator architectures. The architecture of the discriminator could also be optimized in this framework, but we find that its contribution to seeking better generator architectures is marginal and in practice even hampers the process (more details can be found in Section 4.3). Weight parameters are updated on the training dataset based on the current architecture parameters to approach the optimum of the inner-level objective. Architecture parameters are optimized by reducing the duality gap on the validation dataset as the outer-level problem.

Discussion. The loss values of optimizing generators or discriminators cannot explicitly describe “how well the GAN has been trained” due to the minimax structure. Adversarial loss functions (e.g., Eq. (5)) therefore may not be a suitable surrogate to evaluate and supervise the learning of generator architectures. We instead adopt the duality gap, a generic evaluation metric for minimax problems, as the outer-level objective.

3.2 Algorithm and Optimization

In this section, we give a detailed description of the training algorithm and optimization process of alphaGAN. We first describe the overall network structure of the generator and the discriminator, the search space of the generator, and the continuous relaxation of the architecture parameters.

Base Backbone of G and D. An illustration of the entire structure of the generator and the discriminator is shown in appendix B. The generator is constructed by stacking several cells whose topology is identical to those in AutoGAN Gong et al. (2019) and SN-GAN Miyato et al. (2018) (shown in appendix B). Each cell, regarded as a directed acyclic graph, is comprised of nodes representing intermediate feature maps and edges connecting pairs of nodes via different operations. We apply a fixed network architecture for the discriminator, based on the conventional design of Miyato et al. (2018).

Parameters: Initialize weight parameters (w_G, w_D). Initialize generator architecture parameters α. Initialize the base learning rate, momentum parameter, and exponential moving average parameter for the Adam optimizer.

1:  for t = 1, …, T do
2:     Set α ← α_{t−1} and (w_G, w_D) ← (w_G^{t−1}, w_D^{t−1});
3:     for i = 1, …, n_w do
4:        Sample real data x from the training set and noise z; estimate the gradient of the adversarial loss in Eq. (7) with respect to (w_G, w_D) at the current point, dubbed g_w;
5:        Update w_D by an Adam ascent step with g_w;
6:        Update w_G by an Adam descent step with g_w;
7:     end for
8:     Set (w_G^t, w_D^t) ← (w_G, w_D);
9:     Receive the architecture parameters α and the network weight parameters (w_G^t, w_D^t); estimate the test-weight parameters of G* and D* via Algorithm 2;
10:    for j = 1, …, n_α do
11:       Sample real data x from the validation set and latent variables z; estimate the gradient of the duality-gap loss in Eq. (3) with respect to α at the current point, dubbed g_α;
12:       Update α by an Adam descent step with g_α;
13:    end for
14:    Set α_t = α;
15:  end for
16:  Return α = α_T
Algorithm 1 Searching the architecture of alphaGAN

Parameters: Receive the architecture parameters α and the weight parameters (w_G, w_D). Initialize the weight parameters (u, v) for G* and D*. Initialize the base learning rate, momentum parameter, and EMA parameter for the Adam optimizer.

1:  for i = 1, …, n_{D*} do
2:     Sample real data x from the validation dataset and noise z; estimate the gradient of the adversarial loss in Eq. (6) with respect to v at the current point, dubbed g_v;
3:     Update v by an Adam ascent step with g_v;
4:  end for
5:  for j = 1, …, n_{G*} do
6:     Sample noise z; estimate the gradient of the adversarial loss in Eq. (6) with respect to u at the current point, dubbed g_u;
7:     Update u by an Adam descent step with g_u;
8:  end for
9:  Return (u, v), i.e., G* and D*
Algorithm 2 Solving G* and D*

Search space of G. The search space is composed of two types of operations, i.e., normal operations and up-sampling operations.

The pool of normal operations is comprised of {conv_1x1, conv_3x3, conv_5x5, sep_conv_3x3, sep_conv_5x5, sep_conv_7x7}. The pool of up-sampling operations is comprised of {deconv, nearest, bilinear}, where “deconv” denotes the ConvTransposed_2x2 operation. Our method allows approximately 2×10^11 possible configurations for the generator architecture, a search space significantly larger than that of AutoGAN Gong et al. (2019).

Continuous relaxation. The discrete selection of operations is approximated by a soft decision with a mutually exclusive function, following Liu et al. (2018). Formally, let o denote a candidate operation on the edge from node i to node j, and let α_o^{(i,j)} represent the architecture parameter with respect to operation o on that edge. Then the edge output induced by the input node x_i can be calculated by

ō^{(i,j)}(x_i) = Σ_{o∈O} [exp(α_o^{(i,j)}) / Σ_{o'∈O} exp(α_{o'}^{(i,j)})] · o(x_i),

and the final output of node j is summed over all of its preceding nodes, i.e., x_j = Σ_{i<j} ō^{(i,j)}(x_i). The selection of up-sampling operations follows the same procedure.
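The softmax-weighted mixture can be sketched as follows (a toy NumPy version: the "operations" here are stand-in scalar maps, whereas the real ones are convolution layers):

```python
import numpy as np

# DARTS-style continuous relaxation: an edge's output is a softmax-weighted
# sum of all candidate operations. Operation names follow the normal-operation
# pool above; the toy callables below just scale the input for illustration.
ops = {
    "conv_1x1": lambda x: 1.0 * x,
    "conv_3x3": lambda x: 2.0 * x,
    "sep_conv_3x3": lambda x: 3.0 * x,
}

def mixed_op(x, alphas):
    # alphas: one architecture parameter per candidate operation on this edge
    a = np.array([alphas[name] for name in ops])
    w = np.exp(a - a.max())
    w = w / w.sum()                  # softmax over the operation pool
    return sum(wi * op(x) for wi, op in zip(w, ops.values()))

uniform = {"conv_1x1": 0.0, "conv_3x3": 0.0, "sep_conv_3x3": 0.0}
print(mixed_op(1.0, uniform))  # 2.0: equal weights average the three operations
```

Discretization at the end of search then keeps, on each edge, the operation with the largest architecture parameter.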

Solving alphaGAN. We apply an alternating minimization method to solve alphaGAN (6)-(7) with respect to its variables in Algorithm 1, which is a fully differentiable gradient-type algorithm. Algorithm 1 is composed of three parts. The first part (lines 3-8), called the “weight_part”, optimizes the weight parameters (w_G, w_D) on the training dataset via the Adam optimizer Kingma and Ba (2014). The second part (line 9), called the “test-weight_part”, optimizes the test-weight parameters of G* and D*, and the third part (lines 10-12), called the “arch_part”, optimizes the architecture parameters α by minimizing the duality gap. Both the “test-weight_part” and the “arch_part” are optimized over the validation dataset via the Adam optimizer. Algorithm 2 details the process of computing G* and D* by updating weight parameters initialized from the last searched generator architecture parameters and the related network weight parameters. In summary, the variables are optimized in an alternating fashion.
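The three-part loop can be summarized structurally as below (a schematic with stub gradient functions and hypothetical names, not the paper's implementation; the real updates are Adam steps on the GAN losses):

```python
import numpy as np

# Structural sketch of the alternating scheme of Algorithms 1-2:
# "weight_part" updates (w_G, w_D) on training data, "test-weight_part"
# approximates the best responses G*, D* on validation data, and "arch_part"
# descends the duality gap with respect to alpha. Gradients are stubs.
def search(steps=3, n_w=2, n_arch=2, lr=0.1):
    rng = np.random.default_rng(0)
    w = rng.normal(size=4)    # stand-in for the GAN weights (w_G, w_D)
    alpha = np.zeros(3)       # architecture parameters for 3 candidate ops

    def adv_grad(w, alpha):                  # stub gradient of the adversarial loss
        return w - alpha.mean()

    def gap_grad(alpha, w, w_star):          # stub gradient of the duality gap
        return alpha - (w - w_star).mean()

    for _ in range(steps):
        for _ in range(n_w):                 # weight_part (training set)
            w = w - lr * adv_grad(w, alpha)
        w_star = w.copy()                    # test-weight_part (Algorithm 2):
        for _ in range(n_w):                 # refine best-response weights
            w_star = w_star - lr * adv_grad(w_star, alpha)
        for _ in range(n_arch):              # arch_part (validation set)
            alpha = alpha - lr * gap_grad(alpha, w, w_star)
    return alpha, w

alpha, w = search()
print(alpha.shape, w.shape)  # (3,) (4,)
```

The key design choice this mirrors is that architecture parameters only ever see the duality-gap signal, never the raw adversarial loss.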

4 Experiments

In this section, we conduct extensive experiments on CIFAR-10 Torralba et al. (2008) and STL-10 Coates et al. (2011). First, the generator architecture is searched on CIFAR-10, and the discretized optimal structure is used as the network configuration, whose parameters are fully re-trained from scratch following Gong et al. (2019) in Section 4.1. We compare alphaGAN with other automated GAN methods on multiple measures to demonstrate its effectiveness. Second, the generalization of the searched architectures is verified by fully training them on STL-10 and evaluating them in Section 4.2. To further understand the properties of our method, a series of studies on the key components of the framework is presented in Section 4.3.

During searching, we use a minibatch size of 64 for both generators and discriminators, and a channel number of 256 for generators and 128 for discriminators. When fully training the network, we use a minibatch size of 128 for generators and 64 for discriminators; the channel number is again set to 256 for generators and 128 for discriminators. As the discriminator architecture is not an optimized variable, an identical architecture is used for searching and re-training (the configuration is the same as in Gong et al. (2019)). These configurations are used by default unless stated otherwise. For testing, 50,000 images are generated with random noise, and IS Salimans et al. (2016) and FID Heusel et al. (2017) are used to evaluate the performance of generators. All experiments run on a Tesla P40 GPU. More details of the experimental setup and further empirical studies of the proposed method can be found in the appendix.

Method | IS (↑ is better) | FID (↓ is better) | Search method | Search time (GPU hours)
DCGAN (Radford et al., 2015) | - | - | manual | -
SN-GAN (Miyato et al., 2018) | - | - | manual | -
Progressive GAN (Ye et al., 2019) | 8.80±0.05 | - | manual | -
WGAN-GP, ResNet (Gulrajani et al., 2017) | - | - | manual | -
AutoGAN (Gong et al., 2019) | - | - | RL | -
AutoGAN† | - | - | RL | 82
AGAN (Wang and Huan, 2019) | - | - | RL | -
Random search (Li and Talwalkar, 2019) | - | - | random | -
alphaGAN | - | 11.38 | gradient | -
alphaGAN | - | - | gradient | -
Table 1: Comparison with state-of-the-art GANs on CIFAR-10. † denotes results reproduced by us, with the structure released by AutoGAN and trained under the same setting as AutoGAN.

4.1 Searching on CIFAR-10

We first compare our method with recent automated GAN methods. During the searching process, the entire dataset is randomly split into two sets for training and validation, each containing 25,000 images. For a fair comparison, we report the performance of the best run (over 3 runs) for the reproduced baselines and ours in Table 1, and provide the performance of several representative works with manually designed architectures for reference. As there is inevitable perturbation in searching optimal architectures due to stochastic initialization and optimization Arber Zela et al. (2019), we provide a detailed analysis and discussion of the searching process. The statistical properties of architectures searched by alphaGAN are given in appendix C.

Performances of alphaGAN under two search configurations are shown in Tab. 1, obtained by adjusting the step sizes n_w and n_α for updating the weight_part and test-weight_part in Algorithm 1: one configuration passes through an entire epoch of the training and validation sets in each loop, while the other uses smaller interval steps.

The results show that our method performs well in both settings and outperforms the other automated GAN methods in terms of both efficiency and performance. alphaGAN obtains the lowest FID among all the baselines, outperforming the RL-based AutoGAN (reported in Gong et al. (2019)) and the random search baseline Li and Talwalkar (2019). Compared to the automated baselines, alphaGAN shows a substantial advantage in searching efficiency. In particular, alphaGAN attains the best tradeoff between efficiency and performance: it achieves comparable results while searching a much larger space (significantly larger than the RL-based baselines) in a considerably efficient manner (i.e., only 3 GPU hours compared to the baselines' tens to thousands of GPU hours). The architecture obtained by alphaGAN is light-weight and computationally efficient, reaching a good trade-off between performance and time complexity. We also conduct experiments of searching on STL-10 (shown in appendix E.5) and observe consistent phenomena, demonstrating that the effectiveness of our method is not confined to the CIFAR-10 dataset.

4.2 Transferability on STL-10

Method | Params (M) | FLOPs (G) | IS | FID
SN-GAN (Miyato et al., 2018) | - | - | - | -
ProbGAN (He et al.) | - | - | - | -
Improving MMD GAN (Wang et al., 2018) | - | - | - | -
AutoGAN (Gong et al., 2019) | - | - | - | -
AGAN (Wang and Huan, 2019) | - | - | - | -
alphaGAN | - | - | 9.92±0.13 | 22.63
Table 2: Results on STL-10. The structures of the two alphaGAN variants are searched on CIFAR-10 and fully trained on STL-10. † denotes reproduced results, with the architectural configurations released by the original papers.

To validate the transferability of the architectures obtained by alphaGAN, we directly train models using the obtained architectures on the STL-10 dataset. The results are shown in Table 2. Both alphaGAN variants show remarkable superiority in performance over the baselines with either automated or manually designed architectures, revealing that the architectures searched by alphaGAN can be effectively exploited across datasets. It is surprising that alphaGAN is best-behaved, achieving the top performance in both IS and FID scores. This also shows that, compared to increasing model complexity, an appropriate selection and composition of operations can improve model performance in a more efficient manner, which is consistent with the primary motivation of automating architecture search.

4.3 Ablation Study

We conduct ablation experiments on CIFAR-10 to better understand the influence of individual components under different configurations of both alphaGAN variants, including: the effect of searching the discriminator architecture, and how to obtain the optimal generator G*. More experiments on the channel number during search and on fixing α are given in appendix E.

Type | Search D? | Obtain G* (update w / update α) | IS | FID
Table 3: Ablation studies on CIFAR-10.

Search D's architecture or not? A natural question for alphaGAN is whether searching discriminator structures can facilitate the searching and training of generators. The results in Table 3 show that searching the discriminator does not help the search for the optimal generator. We also trained GANs with the architectures obtained by searching G and D jointly, and the final performance is inferior to retraining with a given discriminator configuration. Simultaneously searching the architectures of both G and D potentially increases the effect of inferior discriminators, which may hamper the search for optimal generators conditioned on strong discriminators. In this regard, solely learning the generator's architecture may be a better choice.

How to obtain G* and D*? In the definition of the duality gap, G* and D* denote the global optima of G and D, respectively. As both the architecture and the network parameters are variables of G*, we conduct experiments investigating the effect of updating the weights w and the architecture parameters α for attaining G*. The results in Table 3 show that updating w alone achieves the best performance. Approximating G* with weight updates only means that the architectures of G and G* are identical, and hence optimizing the architecture parameters in (6) can be viewed as compensating for the gap induced by the weight parameters of G* and D*.

5 Analysis

We have seen that alphaGAN can find high-performing architectures, and would like to provide a clearer picture of the proposed algorithm in this section.

5.1 Robustness on Model Scaling

It would be interesting to know how the architectures perform when scaling model complexity up or down. To this end, we introduce a ratio that simply re-scales the channel dimension of the network configuration for the full-training step. The relation between performance and parameter size is illustrated in Fig. 1. The range for attaining promising performance is relatively narrow for one of the searched architectures, mainly because of the light-weight property induced by its dominant depthwise separable convolutions: light-weight architectures naturally result in highly sparse connections between network neurons, which may be sensitive to configuration differences between searching and re-training. In contrast, the other architecture shows acceptable performance over a wide range of parameter sizes (from 2M to 18M). Still, both present some degree of robustness to scaling of the original searching configuration.

Figure 1: Relation between model capacity and performance. The channel numbers of G and D in the alphaGAN variants and in AutoGAN are chosen to align the model capacities with AutoGAN.
(a) IS in search
(b) FID in search
Figure 2: Tracking architectures during searching. One alphaGAN variant is denoted in blue with plus markers and the other in red with triangle markers.

5.2 Architectures on Searching

To understand the search process of alphaGAN, we track the intermediate structures of both alphaGAN variants during searching and fully train them on CIFAR-10 (Fig. 2). We observe a clear trend that the architectures improve towards high performance during searching, though slight oscillation may occur. Specifically, one variant realizes gradual improvement in performance during the process, while the other displays faster convergence in the early stage and achieves comparable results. This indicates that solving the inner-level optimization problem with rough approximations (using more steps always yields a closer approximation of the optimum) can significantly benefit the efficiency of solving the bi-level problem without sacrificing accuracy.

5.3 Relation between Architectures and Performances

We investigate the relation between architectures and performance by analyzing the operation distribution of searched architectures (figures are shown in appendix C). For simplicity, we divide the structures into two classes: “superior” (IS > 8.0, FID < 15.0) and “inferior” (IS < 8.0, FID > 15.0). Comparing superior and inferior architectures, we make the following observations. For up-sampling operations, superior architectures are dominated by nearest and bilinear operations. For normal operations in superior architectures, the statistical preference for (or against) certain operations is less pronounced than in inferior architectures, indicating that a moderate mixture of dense and depthwise separable convolution operations is beneficial for network performance. More detailed analyses can be found in appendix C.

6 Conclusion

We presented alphaGAN, a fully differentiable architecture search framework for GANs, which efficiently and effectively seeks high-performing generator architectures within a vast space of possible configurations, achieving comparable or superior performance to state-of-the-art architectures that are either manually designed or automatically searched. In addition, our analysis of architecture performance and operation distributions during search gives some insights into architecture design, which may promote further research on architecture improvement. We mainly focused on vanilla GANs in this work; extending the framework to conditional GANs, in which extra regularization on parts of the networks is typically imposed for task specialization, is left as future work.


Appendix A Experiment Details

A.1 Searching on CIFAR-10

The CIFAR-10 dataset is comprised of 50,000 images for training, each with resolution 32×32. We randomly split the dataset into two sets during searching: one is used as the training set for optimizing the network parameters (25,000 images), and the other as the validation set for optimizing the architecture parameters (25,000 images). For a fair comparison, the discriminator adopted in searching is the same as the discriminator in AutoGAN Gong et al. (2019). The batch sizes of both the generator and the discriminator are set to 64. We use Adam as the optimizer for both the weight parameters and the architecture parameters, each with its own learning rate, momentum parameters, and weight decay.

We use the entire training set of CIFAR-10 for retraining the network parameters after obtaining architectures. The discriminator exploited in the re-training phase is identical to that used during searching. The batch size of the generator is set to 128, and the batch size of the discriminator is set to 64. Adam is again used as the optimizer.

A.2 Transferability

For the STL-10 dataset, we resize the training images to 48×48 out of consideration for memory and computational overhead. The batch sizes and channel numbers of the generator and the discriminator follow the default configuration described in Section 4. The learning rates of the generator and the discriminator are identical, and Adam is again used as the optimizer with momentum parameters and weight decay.

Appendix B The structures of the generator and the discriminator

(a) G
(b) D
Figure 3: The topology of the generator and the discriminator.

The entire structures of the generator and the discriminator are illustrated in Fig. 3.

(a) The cell of G
(b) The cell of D
Figure 4: The topology of the cell in the generator and the discriminator. The topology of the generator and the discriminator is identical to those of AutoGAN Gong et al. (2019) and SN-GAN Miyato et al. (2018).

The topology of the cells in the generator and the discriminator is illustrated in Fig. 4. In the cell of the generator, two of the edges correspond to up-sampling operations, and the remaining edges are normal operations. In the cell of the discriminator, two edges are avg_pool_2x2 operations with stride 2, two edges are conv_3x3 operations with stride 1, and one edge is a conv_1x1 operation with stride 1.

Figure 5: The structure of alphaGAN.
Figure 6: The structure of alphaGAN.

The structures of the two searched alphaGAN variants are shown in Fig. 5 and Fig. 6.

Appendix C Relation between performance and structure

The distributions of operations in “superior” and “inferior” architectures are shown in Fig. 7 and Fig. 8, respectively. We make the following observations: first, for up-sampling operations, superior architectures tend to exploit “nearest” or “bilinear” rather than “deconvolution” operations. Second, “conv_1x1” operations dominate in cell_1 of superior generators, suggesting that convolutions with large kernel sizes may not be optimal when the spatial dimensions of feature maps are relatively small (i.e., 8x8). Finally, convolutions with large kernels (e.g., conv_5x5, sep_conv_3x3, and sep_conv_5x5) are preferred at higher resolutions (i.e., cell_3 of superior generators), indicating the benefit of integrating information from relatively large receptive fields for low-level representations at high resolutions.

Figure 7: The distributions of normal operations.
Figure 8: The distributions of up-sampling operations.

Appendix D Generated Samples

Generated samples of alphaGAN on STL-10 are shown in Fig. 9.

Figure 9: Generated samples of alphaGAN on STL-10.

Appendix E Additional Results

In this section, we present more experimental results and analysis (omitted from the main paper due to the page limit), including the Gumbel-max trick, warm-up protocols, an ablation study on step sizes for the 'arch part', the effect of channel numbers during searching, searching on STL-10, and an analysis of failure cases. The 'baseline' in Tab. 4 denotes the structure searched under the default settings of alphaGAN.

E.1 Gumbel-max Trick

Gumbel-max trick Maddison et al. (2014) can be written as

$$p_k = \frac{\exp\left((\log \alpha_k + g_k)/\tau\right)}{\sum_{i \in \mathcal{O}} \exp\left((\log \alpha_i + g_i)/\tau\right)},$$

where $p_k$ is the probability of selecting operation $k$ after Gumbel-max, and $\alpha_k$ represents the architecture parameter of operation $k$. $\mathcal{O}$ represents the operation search space, $g_k$ denotes samples drawn from the Gumbel(0,1) distribution, and $\tau$ represents the temperature that controls the sharpness of the distribution. Instead of continuous relaxation, the trick chooses an operation on each edge, enabling discretization during searching. We compare the results of searching with and without the Gumbel-max trick. The results in Tab. 4 show that searching with Gumbel-max may not be an essential factor for obtaining high-performance generator architectures.
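As an illustrative sketch (not the paper's implementation), the relaxed selection described above can be written in a few lines of Python; the function name and the assumption of positive, unnormalized architecture parameters are ours:

```python
import math
import random

def gumbel_softmax_select(alphas, tau=1.0, rng=random):
    """Sample a relaxed operation-selection distribution.

    alphas: positive, unnormalized architecture parameters, one per
    candidate operation; tau: temperature controlling sharpness."""
    # g_k ~ Gumbel(0, 1) via inverse-transform sampling of uniforms
    g = [-math.log(-math.log(rng.uniform(1e-12, 1.0))) for _ in alphas]
    logits = [(math.log(a) + gk) / tau for a, gk in zip(alphas, g)]
    # numerically stable softmax over the perturbed logits
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Lower temperatures push the distribution toward a one-hot choice,
# i.e., a hard selection of a single operation per edge.
probs = gumbel_softmax_select([0.1, 0.6, 0.3], tau=0.5)
```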

E.2 Warm-up protocols

The generator contains two parts of parameters: the architecture parameters and the network weights. The optimization of the architecture parameters is highly related to the network weights. Intuitively, pretraining the network weights can benefit the architecture search, since a better initialization may facilitate convergence. To investigate this effect, we fix the architecture parameters and only update the network weights during the initial half of the searching schedule, and then optimize the two alternately. This strategy is denoted as 'Warm-up' in Table 4.
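A minimal sketch of this warm-up schedule, where `update_w` and `update_alpha` are hypothetical stand-ins for the weight and architecture optimization steps (the midpoint switch follows the "initial half" description above):

```python
def search_with_warmup(total_steps, update_w, update_alpha):
    """Warm-up search schedule: only network weights are updated in
    the first half of the steps; afterwards weights and architecture
    parameters are optimized alternately."""
    for step in range(total_steps):
        update_w(step)
        if step >= total_steps // 2:  # warm-up ends at the midpoint
            update_alpha(step)
```

Here each callback receives the step index so the caller can, e.g., adjust learning rates; the exact alternation pattern in the paper may differ.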

Type Name Gumbel-max? Fix alphas? IS FID
alphaGAN baseline
alphaGAN baseline
Table 4: Gumbel-max trick and Warm-up.

The results show that the strategy may not help performance, i.e., the IS of 'Warm-up' is slightly worse than that of the baseline and the FID of 'Warm-up' is worse than that of the baseline, while it can improve searching efficiency, i.e., it spends GPU-hours for alphaGAN (compared to 22 GPU-hours for the baseline) and GPU-hour for alphaGAN (compared to GPU-hours for the baseline).

(a) IS
(b) FID
Figure 10: The effect of different step sizes of ’arch part’.

E.3 Effect of Step Sizes

We analyze the effect of different step sizes on the 'arch part', corresponding to the optimization of the architecture parameters in Algorithm 1 (lines 10-13). Since alphaGAN has larger step sizes for the 'weight part' and 'test-weight part' than alphaGAN, the step size of the 'arch part' can be adjusted over a wider range. We select alphaGAN for these experiments; the results are shown in Fig. 10. We observe that the method is fairly robust to different step sizes on the IS metric, while performance on the FID metric may be hampered by an improperly chosen step size.

Search channels Re-train channels Params (M) FLOPs (G) IS FID
G_ D_ G_ D_
G_ D_
G_ D_ G_ D_
G_ D_
G_ D_ G_ D_
G_ D_
G_ D_ G_ D_
G_ D_
G_ D_
G_ D_
Table 5: The channels in searching on the alphaGAN.

E.4 Effect of Channels in Searching

Under the default settings of alphaGAN, we search and re-train the networks with the same predefined channel dimensions (i.e., G_channels=256 and D_channels=128). To explore the impact of the channel dimensions during searching on the final performance of the searched architectures, we adjust the channel numbers of the generator and the discriminator during searching based on the searching configuration of alphaGAN. The results are shown in Tab. 5. We observe that our method achieves acceptable performance under a wide range of channel numbers (i.e., ). We also find that using consistent channel dimensions during the searching and re-training phases benefits the final performance.

When reducing channels during searching, we observe an increasing trend toward depth-wise convolutions with large kernels (e.g., 7x7), indicating that the operation selection induced by the automated mechanism adapts to the need of preserving the entire information flow (i.e., increasing information extraction over the spatial dimensions to compensate for the channel limits).
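The arithmetic behind this intuition can be sketched as follows; the parameter-count formulas are standard, while the specific channel number is only illustrative. A depth-wise separable convolution scales linearly in the channel count, so large spatial kernels stay cheap when channels shrink:

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_sep_params(c_in, c_out, k):
    """Depth-wise k x k convolution followed by a point-wise 1x1
    convolution: linear rather than quadratic in channel count."""
    return c_in * k * k + c_in * c_out

# Illustrative comparison at C=64: a depth-wise separable 7x7
# needs far fewer parameters than even a standard 3x3 convolution.
standard_3x3 = conv_params(64, 64, 3)        # 36864
dw_sep_7x7 = depthwise_sep_params(64, 64, 7) # 7232
```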

Search time
Dataset of
Params (M) FLOPs (G) IS FID
Table 6: Search on STL-10. We search alphaGAN on STL-10 and re-train the searched structure on STL-10 and CIFAR-10. In our repeated experiments, failure cases are prevented.

E.5 Searching on STL-10

We also search alphaGAN on STL-10. The channel dimensions in the generator and the discriminator are set to 64 (due to the GPU memory limit). We use 48x48 as the image resolution. The remaining experimental settings are the same as those for searching on CIFAR-10. When retraining the networks, the settings remain the same as in Section A.2.

The results of three runs are shown in Tab. 6. Our method achieves high performance on both STL-10 and CIFAR-10, demonstrating that the effectiveness and transferability of alphaGAN are not confined to a single dataset. alphaGAN remains efficient and can obtain a structure reaching the state of the art on STL-10 with only GPU-hours. We also find that no failure case occurs in the three repeated experiments, in contrast to searching on CIFAR-10, which may be related to latent factors intrinsic to the datasets (e.g., resolution, number of categories); we leave this investigation as future work.

(a) Distribution of normal operations
(b) Distribution of up operations
Figure 11: Distributions of operations in normal cases and failure cases of alphaGAN.
Name Description Params (M) FLOPs (G) IS FID
alphaGAN normal case
failure case
alphaGAN normal case
failure case
Table 7: Repeated search on CIFAR-10.

E.6 Failure cases

As pointed out in the main paper, the search of alphaGAN occasionally encounters failure cases, analogous to other NAS methods Zela et al. (2019). To better understand the method, we present a comparison between normal cases and failure cases in Tab. 7 and the distributions of operations in Fig. 11. We find that deconvolution operations dominate in the failure cases. To validate this, we conduct experiments on a variant with deconvolution operations removed from the search space, under the configuration of alphaGAN. The results (with 6 runs) in Tab. 8 show that failure cases can be prevented in this scenario.

Name Params (M) FLOPs (G) IS FID
Table 8: Search without deconvolution on alphaGAN.

We also test another setting by integrating a conv_1x1 operation with the interpolation operations (i.e., nearest and bilinear), making them learnable in the manner of a deconvolution; we denote this as 'learnable interpolation'. The results (with 6 runs) under the configuration of alphaGAN are shown in Tab. 9, suggesting that the failure cases can also be alleviated by this strategy.
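A minimal numpy sketch of such a 'learnable interpolation' path, assuming nearest-neighbour interpolation followed by a point-wise convolution (the function names are ours, not the paper's):

```python
import numpy as np

def nearest_upsample(x, scale=2):
    """Nearest-neighbour interpolation of a (C, H, W) feature map."""
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

def conv1x1(x, w):
    """A 1x1 convolution is a per-pixel linear map over channels;
    w has shape (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def learnable_interpolation(x, w, scale=2):
    """Interpolation followed by a trainable 1x1 convolution, so the
    up-sampling path carries learnable parameters, analogous to a
    deconvolution but without its checkerboard-prone kernel."""
    return conv1x1(nearest_upsample(x, scale), w)
```

Swapping `nearest_upsample` for bilinear interpolation gives the second variant mentioned above; only the interpolation kernel changes, the trainable part stays the 1x1 convolution.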

Method Name Params (M) FLOPs (G) IS FID
Learnable Interpolation Repeat_1
Table 9: The effect of ’learnable interpolation’ on alphaGAN.


  • [1] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter (2019) Understanding and robustifying differentiable architecture search. arXiv preprint arXiv:1909.09656. Cited by: §4.1.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §1.
  • [3] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §1, §1, §2.
  • [4] A. Brock, T. Lim, J. M. Ritchie, and N. Weston (2017) Smash: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344. Cited by: §1.
  • [5] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797. Cited by: §1.
  • [6] A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223. Cited by: §4.
  • [7] D. Du and P. M. Pardalos (2013) Minimax and applications. Vol. 4, Springer Science & Business Media. Cited by: §2.
  • [8] X. Gong, S. Chang, Y. Jiang, and Z. Wang (2019) Autogan: neural architecture search for generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3224–3234. Cited by: §A.1, Figure 4, §1, §3.2, §3.2, §4.1, Table 1, Table 2, §4, §4.
  • [9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §1, §2, §3.1.
  • [10] P. Grnarova, K. Y. Levy, A. Lucchi, N. Perraudin, I. Goodfellow, T. Hofmann, and A. Krause (2019) A domain agnostic measure for monitoring and evaluating gans. In Advances in Neural Information Processing Systems, pp. 12069–12079. Cited by: §2, §3.1.
  • [11] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §1, §1, §2, Table 1.
  • [12] H. He, H. Wang, G. Lee, and Y. Tian (2019) Probgan: towards probabilistic gan with theoretical guarantees. Cited by: Table 2.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • [14] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §1, §3.1, §4.
  • [15] J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565–4573. Cited by: §2.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.2.
  • [17] N. Kodali, J. Abernethy, J. Hays, and Z. Kira (2017) On convergence and stability of gans. arXiv preprint arXiv:1705.07215. Cited by: §1.
  • [18] J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, and D. Jurafsky (2017) Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547. Cited by: §1.
  • [19] L. Li and A. Talwalkar (2019) Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638. Cited by: §4.1, Table 1.
  • [20] H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §1, §1, §3.1, §3.2.
  • [21] C. J. Maddison, D. Tarlow, and T. Minka (2014) A* sampling. In Advances in Neural Information Processing Systems, pp. 3086–3094. Cited by: §E.1.
  • [22] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802. Cited by: §1.
  • [23] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: Figure 4, §1, §1, §2, §3.1, §3.2, Table 1, Table 2.
  • [24] J. F. Nash et al. (1950) Equilibrium points in n-person games. Proceedings of the national academy of sciences 36 (1), pp. 48–49. Cited by: §1, §2.
  • [25] F. A. Oliehoek, R. Savani, J. Gallego-Posada, E. Van der Pol, E. D. De Jong, and R. Groß (2017) GANGs: generative adversarial network games. arXiv preprint arXiv:1712.00679. Cited by: §1.
  • [26] M. J. Osborne and A. Rubinstein (1994) A course in game theory. MIT press. Cited by: §2.
  • [27] C. Peng, H. Wang, X. Wang, and Z. Yang (2020) DG-GAN: the GAN with the duality gap. Cited by: §3.1.
  • [28] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta (2017) Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2817–2826. Cited by: §2.
  • [29] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §1, Table 1.
  • [30] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §1, §1, §1, §2, §3.1, §3.1, §4.
  • [31] A. Torralba, R. Fergus, and W. T. Freeman (2008) 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine intelligence 30 (11), pp. 1958–1970. Cited by: §4.
  • [32] H. Wang and J. Huan (2019) Agan: towards automated design of generative adversarial networks. arXiv preprint arXiv:1906.11080. Cited by: §1, Table 1, Table 2.
  • [33] W. Wang, Y. Sun, and S. Halgamuge (2018) Improving mmd-gan training with repulsive loss function. arXiv preprint arXiv:1812.09916. Cited by: Table 2.
  • [34] S. Ye, X. Feng, T. Zhang, X. Ma, S. Lin, Z. Li, K. Xu, W. Wen, S. Liu, J. Tang, et al. (2019) Progressive dnn compression: a key to achieve ultra-high weight pruning and quantization rates using admm. arXiv preprint arXiv:1903.09769. Cited by: §1, §2, Table 1.
  • [35] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5505–5514. Cited by: §1.
  • [36] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter (2019) Understanding and robustifying differentiable architecture search. arXiv preprint arXiv:1909.09656. Cited by: §E.6.
  • [37] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §3.1.
  • [38] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §1.
  • [39] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §1.
  • [40] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §1.