1 Introduction
Generative adversarial networks (GANs) have been successfully applied to a wide range of generation tasks, including image generation [12, 3, 1, 53, 15], text-to-image synthesis [44, 62, 40], and image translation [25, 7], to name a few. To further improve generation quality, several extensions have been proposed, including regularization terms [4, 14], progressive training strategies [26], attention mechanisms [59, 61], and new architectures [27, 3].
While designing favorable neural architectures for GANs has been very successful, it typically requires a large amount of time, effort, and domain expertise. For instance, several state-of-the-art GANs [27, 3] design appreciably complex generator or discriminator backbones to better generate high-resolution images. To alleviate this network engineering pain, an efficient automated architecture search framework for GANs is highly needed. Meanwhile, neural architecture search (NAS) has been applied and proved effective in discriminative tasks such as image classification [30] and segmentation [33]. Encouraged by this, AGAN [52] and AutoGAN [11] have introduced neural architecture search methods for GANs based on reinforcement learning (RL), thereby enabling a significant speedup of the architecture search process.
Similar to other architecture search tasks (image classification, image segmentation), the recently proposed RL-based GAN architecture search method AGAN [52] optimizes the entire architecture at once. Since the same policy might sample different architectures, it is likely to suffer from noisy gradients and high variance, which can further harm the stability of the policy update. To circumvent this issue, AutoGAN [11] uses multi-level architecture search (MLAS) with a progressive optimization formulation. However, because optimization is based on the best performance of the current architecture level, this progressive formulation potentially leads to a local minimum solution.
To overcome these drawbacks, we reformulate the GAN architecture search problem as a Markov decision process (MDP). The new formulation is partly inspired by the human-designed Progressive GAN [26], which has been shown to improve generation quality progressively in the intermediate outputs of each architecture cell. In our new formulation, a sequence of decisions is made during the entire architecture design process, which allows state-based sampling and thus alleviates the variance. In addition, as we will show later in the paper, by using a carefully designed reward, this new formulation also allows us to target effective global optimization over the entire architecture.
More importantly, the MDP formulation can better facilitate off-policy RL training to improve data efficiency. The previously proposed RL-based GAN architecture search methods [11, 52] are based on on-policy RL, whose limited data efficiency results in considerably long training times. Specifically, an on-policy RL approach generally requires frequent sampling of a batch of architectures generated by the current policy in order to update the policy. Moreover, new samples must be collected for each gradient step, while the previous batches are simply discarded. This quickly becomes very expensive, as the number of gradient steps and samples increases with the complexity of the task, especially in architecture search tasks. By comparison, off-policy reinforcement learning algorithms make use of past experience, which enables RL agents to learn more efficiently. This has been proven effective in other RL tasks, including legged locomotion [32] and complex video games [38]. However, exploiting off-policy data for GAN architecture search poses new challenges. Training the policy network with off-policy data inevitably becomes unstable, because these training samples are systematically different from the on-policy ones. This presents a great challenge to the stability and convergence of the algorithm [2]. Our proposed MDP formulation can make a difference here. By allowing state-based sampling, the new formulation alleviates this instability and better supports the off-policy strategy.
The contributions of this paper are twofold:

We reformulate the problem of neural architecture search for GANs as an MDP for smoother architecture sampling, which enables a more effective RL-based search algorithm and potentially more global optimization.

We propose an efficient and effective off-policy RL NAS framework for GAN architecture search (EGAN), which is 6 times faster than existing RL-based GAN search approaches with competitive performance.
We conduct a variety of experiments to validate the effectiveness of EGAN. Our discovered architectures yield better results than those of RL-based competitors. EGAN is efficient, as it is able to find a highly competitive model within 7 GPU hours.
2 Related Work
Reinforcement Learning Recent progress in model-free reinforcement learning (RL) [49] has fostered promising results in many interesting tasks, ranging from gaming [37, 48] to planning and control problems [23, 31, 58, 18, 6, 19] and even AutoML [65, 41, 34]. However, model-free deep RL methods are notoriously expensive in terms of their sample complexity. One reason for the poor sample efficiency is the use of on-policy reinforcement learning algorithms, such as trust region policy optimization (TRPO) [46], proximal policy optimization (PPO) [47], or REINFORCE [56]. On-policy learning algorithms require new samples generated by the current policy for each gradient step. In contrast, off-policy algorithms aim to reuse past experience. Recent developments in off-policy reinforcement learning algorithms, such as soft actor-critic (SAC) [17], have demonstrated substantial improvements in both performance and sample efficiency over previous on-policy methods.
Neural architecture search Neural architecture search methods aim to automatically search for a good neural architecture for various tasks, such as image classification [30] and segmentation [33], in order to ease the burden of handcrafting dedicated architectures for specific tasks. Several approaches have been proposed to tackle the NAS problem. Zoph and Le [65] proposed a reinforcement learning-based method that trains an RNN controller to design the neural network. Guo et al. [16] exploited a novel graph convolutional neural network for policy learning in reinforcement learning. Further successful approaches include evolutionary algorithm-based methods [43], differentiable methods [35], and one-shot learning methods [5, 35]. Early works on RL-based NAS algorithms [57, 41, 65, 34] proposed to optimize the entire trajectory (i.e., the entire neural architecture). To the best of our knowledge, most of the previously proposed RL-based NAS algorithms used on-policy RL algorithms, such as REINFORCE or PPO. The exception is [63], which uses Q-learning for NAS, a value-based method that only supports discrete state space problems. For on-policy algorithms, since each update requires new data collected by the current policy and the reward is based on training the internal neural network architecture, on-policy training of RL-based NAS algorithms inevitably becomes computationally expensive.
GAN Architecture Search Due to the specificities of GANs and their challenges, such as instability and mode collapse, the NAS approaches proposed for discriminative models cannot be directly transferred to the architecture search of GANs. Only recently have a few approaches been introduced to tackle the specific challenges of GAN architectures. AutoGAN introduced a neural architecture search for GANs based on reinforcement learning (RL), thereby enabling a significant speedup of the process of architecture selection [11]. The AutoGAN algorithm is based on on-policy reinforcement learning. The proposed multi-level architecture search (MLAS) aims at progressively finding well-performing GAN architectures and completes the task in around 2 GPU days. Similarly, AGAN [52] uses reinforcement learning for generative architecture search in a larger search space. The computational cost of AGAN is comparatively very high (1200 GPU days). In addition, AdversarialNAS [10] and DEGAS [9] adopted a different approach, i.e., a differentiable search strategy [35], for the GAN architecture search problem.
3 Preliminary
In this section, we briefly review the basic concepts and notations used in the following sections.
3.1 Generative Adversarial Networks
The training of GANs involves an adversarial competition between two players, a generator and a discriminator. The generator aims at generating realistic-looking images to ‘fool’ its opponent. Meanwhile, the discriminator aims to distinguish whether an image is real or fake. This can be formulated as a min-max optimization problem:
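$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D(G(z)) \right) \right] \tag{1}$$
where $G$ and $D$ are the generator and discriminator parametrized by neural networks, $z$ is sampled from random noise, $x$ are the real images, and $G(z)$ are the generated images.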
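The standard GAN value function [12], E[log D(x)] + E[log(1 - D(G(z)))], can be estimated from discriminator outputs on a batch of real and generated samples. The sketch below illustrates this on toy numbers; the function name `gan_value` and the example probabilities are assumptions made for illustration only.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real images, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated images, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well yields a value close to 0,
# while an undecided discriminator (all outputs 0.5) yields 2 * log(0.5).
confident = gan_value([0.99, 0.98], [0.01, 0.02])
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator tries to push this value up, while the generator tries to push it down by making `d_fake` larger.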
3.2 Reinforcement Learning
A Markov decision process (MDP) is a discrete-time stochastic control process. At each time step, the process is in some state $s$, and its associated decision-maker chooses an available action $a$. Given the action, the process moves into a new state $s'$ at the next step, and the agent receives a reward.
An MDP can be described as a tuple $(\mathcal{S}, \mathcal{A}, r, P, \rho)$, where $\mathcal{S}$ is the set of states that is able to precisely describe the current situation, $\mathcal{A}$ is the set of actions, $r(s, a)$ is the reward function, $P(s' \mid s, a)$ is the transition probability function, and $\rho(s)$ is the initial state distribution.
MDPs can be particularly useful for solving optimization problems via reinforcement learning. In a general reinforcement learning setup, an agent is trained to interact with the environment and get a reward from its interaction. The goal is to find a policy $\pi$ that maximizes the cumulative reward $J(\pi)$:
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) \right] \tag{2}$$
where $\rho_\pi$ denotes the distribution of state-action pairs induced by $\pi$. While standard RL merely maximizes the expected cumulative reward, the maximum entropy RL framework considers a more general objective [64], which favors stochastic policies. This objective shows a strong connection to the exploration-exploitation trade-off, and could also prevent the policy from getting stuck in local optima. Formally, it is given by
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \beta \, \mathcal{H}(\pi(\cdot \mid s_t)) \right] \tag{3}$$
where $\beta$ is the temperature parameter that controls the stochasticity of the optimal policy.
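To make the effect of the temperature concrete, the following sketch evaluates an entropy-regularized return for a single trajectory with discrete action distributions. It is an illustration only: the helper names, the toy rewards, and the temperature value 0.2 are assumptions, and a real objective would take an expectation over many trajectories.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_return(rewards, policies, temperature):
    """Sum over steps of reward plus temperature-weighted policy entropy."""
    return sum(r + temperature * entropy(p) for r, p in zip(rewards, policies))


rewards = [1.0, 0.5]
greedy = [[1.0, 0.0], [1.0, 0.0]]    # deterministic policy, zero entropy
uniform = [[0.5, 0.5], [0.5, 0.5]]   # maximally stochastic over two actions

plain = max_entropy_return(rewards, greedy, temperature=0.2)
bonus = max_entropy_return(rewards, uniform, temperature=0.2)
```

With the same rewards, the stochastic policy receives an additional bonus of `2 * 0.2 * log(2)`, which is how the objective favors exploration; setting the temperature to zero recovers the standard cumulative reward.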
4 Problem Formulation
4.1 Motivation
Given a fixed search space, GAN architecture search agents aim to discover an optimal network architecture on a given generation task. Existing RL methods update the policy network by using batches of entire architectures sampled from the current policy. Even though these data samples are only used for the current update step, the sampled GAN architectures nevertheless require tedious training and evaluation processes. The sampling efficiency is therefore very low resulting in limited learning progress of the agents. Moreover, the entire architecture sampling leads to a high variance, which might influence the stability of the policy update.
The key motivation of the proposed methodology is to stabilize and accelerate the learning process by using step-wise sampling instead of entire-trajectory-based sampling, and by making efficient use of past experiences from previous policies. To achieve this, we propose to formulate the GAN architecture search problem as an MDP and solve it by off-policy reinforcement learning.
4.2 GAN Architecture Search formulated as MDP
We propose to formulate the GAN architecture search problem as a Markov decision process (MDP), which enables state-based sampling. This further boosts the learning process and overcomes the large variance that stems from sampling entire architectures, which makes it inherently difficult to train a policy using off-policy data.
Formulating the GAN architecture search problem as an MDP provides a mathematical description of the architecture search process. An MDP can be described as a tuple $(\mathcal{S}, \mathcal{A}, r, P, \rho)$, where $\mathcal{S}$ is the set of states that is able to precisely describe the current architecture (such as the current cell number, the structure, or the performance of the architecture), $\mathcal{A}$ is the set of actions that defines the architecture design of the next cell, $r$ is the reward function used to define how good the architecture is, $P$ is the transition probability function indicating the training process, and $\rho$ is the initial architecture. We define a cell as the architecture block that we search for in one step. The design details of states, actions, and rewards are discussed in Section 5.
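The cell-by-cell decision process described above can be sketched as a small data structure plus a rollout loop. This is a minimal illustration, not the authors' implementation: the class names, the field layout, and the toy policy and step functions are all hypothetical stand-ins for the policy network and the proxy training.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class State:
    depth: int                          # number of cells designed so far
    performance: Tuple[float, float]    # (IS, FID) of the current architecture
    representation: Tuple[float, ...]   # fixed-size encoding of the architecture


@dataclass(frozen=True)
class Transition:
    state: State
    action: Tuple[int, ...]             # design choices for the next cell
    reward: float
    next_state: State
    done: bool                          # True once the last cell is designed


def rollout(initial: State,
            policy: Callable[[State], Tuple[int, ...]],
            step: Callable[[State, Tuple[int, ...]], Tuple[State, float]],
            num_cells: int) -> List[Transition]:
    """Design `num_cells` cells one decision at a time, recording every step."""
    transitions, state = [], initial
    for t in range(num_cells):
        action = policy(state)                    # state-based sampling
        next_state, reward = step(state, action)  # train and evaluate the new cell
        transitions.append(
            Transition(state, action, reward, next_state, t == num_cells - 1))
        state = next_state
    return transitions


# Toy stand-ins for the policy network and for proxy training (hypothetical).
def toy_policy(state: State) -> Tuple[int, ...]:
    return (state.depth % 2, 0, 1)


def toy_step(state: State, action: Tuple[int, ...]) -> Tuple[State, float]:
    inception, fid = state.performance
    new = State(state.depth + 1, (inception + 1.0, fid - 10.0), state.representation)
    return new, 1.0


trajectory = rollout(State(0, (1.0, 100.0), (0.0, 0.0)), toy_policy, toy_step, 3)
```

Because every step is stored as its own transition, each one can later be reused for policy updates independently of the trajectory it came from.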
It is important to highlight that the formulation proposed in this paper has two main differences compared to previous RL methods for neural architecture search. Firstly, it is essentially different to the classic RL approaches for NAS [65], which formulate the task as an optimization problem over the entire trajectory/architecture. Instead, the MDP formulation proposed here enables us to do the optimization based on the disentangled steps of cell design. Secondly, it is also different to the progressive formulation used by AutoGAN [11], where the optimization is based on the best performance of the current architecture level and can potentially lead to a local minimum solution. Instead, the proposed formulation enables us to potentially conduct a more global optimization using cumulative reward without the burden of calculating the reward over the full trajectory at once. It is important to point out that the multilevel optimization formulation used in AutoGAN [11] does not have this property.
5 Off-policy RL for GAN Architecture Search
In this section, we integrate off-policy RL into GAN architecture search by making use of the newly proposed MDP formulation. We introduce several innovations to address the challenges of an off-policy learning setup.
The MDP formulation of GAN architecture search enables us to use off-policy reinforcement learning for a step-wise optimization of the entire search process to maximize the cumulative reward.
5.1 RL for GAN Architecture Search
Before we move on to the off-policy solver, we need to design the state, reward, and action to meet the requirements of both the GAN architecture design and the MDP formulation.
5.1.1 State
An MDP requires a state representation that can precisely represent the current network up to the current step. Most importantly, this state needs to be stable during training to avoid adding more variance to the training of the policy network. The stability requirement is particularly relevant since the policy network relies on it to design the next cell. The design of the state is one of the main challenges we face when adopting off-policy RL for GAN architecture search.
Inspired by the progressive GAN [26], which has been shown to improve generation quality in the intermediate RGB outputs of each architecture cell, we propose a progressive state representation for GAN architecture search. Specifically, given a fixed batch of input noise, we adopt the average output of each cell as the progressive state representation. We downsample this representation to impose a constant size across different cells. Note that there are alternative ways to encode the network information. For example, one could also deploy another network to encode the previous layers. However, we find the proposed design both efficient and effective.
In addition to the progressive state representation, we also use network performance (Inception Score / FID) and layer number to provide more information about the state. To summarize, the designed state includes the depth, performance of the current architecture, and the progressive state representation.
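The state described above (batch-averaged cell output, downsampled to a constant size, plus depth and performance) can be sketched as follows. This is a simplified illustration under stated assumptions: cell outputs are treated as flat lists rather than images, average pooling is used as the downsampling operator, and the function names and the `size` parameter are hypothetical.

```python
def average_over_batch(batch):
    """Element-wise mean of a batch of equally sized flat cell outputs."""
    n = len(batch)
    return [sum(sample[i] for sample in batch) / n for i in range(len(batch[0]))]


def downsample(values, size):
    """Average-pool a flat list down to exactly `size` entries."""
    if len(values) % size != 0:
        raise ValueError("length must be a multiple of the target size")
    width = len(values) // size
    return [sum(values[i * width:(i + 1) * width]) / width for i in range(size)]


def build_state(cell_outputs, depth, inception_score, fid, size=4):
    """Concatenate depth, performance, and the progressive representation."""
    representation = downsample(average_over_batch(cell_outputs), size)
    return [float(depth), inception_score, fid] + representation


# Outputs of a shallow cell (4 values) and a deeper cell (8 values) for a
# fixed batch of two noise vectors; both map to a state of the same length.
shallow = build_state([[0.0, 2.0, 4.0, 6.0], [2.0, 4.0, 6.0, 8.0]],
                      depth=1, inception_score=5.0, fid=40.0, size=2)
deep = build_state([[1.0] * 8, [3.0] * 8],
                   depth=2, inception_score=6.0, fid=30.0, size=2)
```

Keeping the noise batch fixed means that changes in the state reflect changes in the architecture and its weights rather than sampling noise.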
5.1.2 Action
Given the current state, which encodes the information about previous layers, the policy network decides on the next action. The action describes the architecture of one cell. For example, if we follow the search space used by AutoGAN [11], action will contain skip options, upsampling operations, shortcut options, different types of convolution blocks, and the normalization option, as shown in Figure 2.
This can then be defined as $a = [\text{conv}, \text{norm}, \text{upsample}, \text{shortcut}, \text{skip}]$. The action output by the agent is decoded into an operation by a softmax classifier. To demonstrate the effectiveness of our off-policy method and enable a fair comparison, in all of our experiments we use the same search space as AutoGAN [11], which means that we search for generator cells, while the discriminator architecture is predesigned and grows as the generator becomes deeper. More details on the search space are discussed in Section 6.
5.1.3 Reward
We propose to design the reward function as the performance improvement after adding the new cell. In this work, we use both the Inception Score (IS) and the Fréchet Inception Distance (FID) as indicators of network performance. Since IS increases with quality (the higher the better) and FID decreases with quality (the lower the better), the proposed reward function can be formulated as:
$$R_t(s, a) = IS(t) - IS(t-1) + \alpha \left( FID(t-1) - FID(t) \right) \tag{4}$$
where $\alpha$ is a factor that balances the trade-off between the two indicators. We use a fixed $\alpha$ in our main experiments. The motivation behind using a combined reward is based on an empirical finding indicating that IS and FID are not always consistent with each other and can lead to a biased choice of architectures. A detailed discussion about the choice of indicators is provided in Section 7.
By employing the performance improvement in each step instead of only using performance as proposed in [11], RL can maximize the expected sum of rewards over the entire trajectory. Because the intermediate terms cancel when the step-wise rewards are summed, this enables us to target the potential global optimal structure with the highest reward (up to a constant that does not depend on the policy):
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ R_t(s_t, a_t) \right] = \mathbb{E}_{\text{architecture} \sim \rho_\pi} \left[ IS_{\text{final}} - \alpha \, FID_{\text{final}} \right] \tag{5}$$
where $IS_{\text{final}}$ and $FID_{\text{final}}$ are the final scores of the entire architecture.
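A reward defined as the step-wise improvement (IS gain plus a weighted FID reduction) has the convenient property that its sum over a trajectory telescopes to the difference between the final and the initial scores. The sketch below checks this numerically; the exact functional form, the weight `alpha = 0.01`, and the score sequences are illustrative assumptions, not values from the experiments.

```python
def step_reward(is_prev, is_curr, fid_prev, fid_curr, alpha):
    """Improvement in IS plus alpha-weighted reduction in FID for one new cell."""
    return (is_curr - is_prev) + alpha * (fid_prev - fid_curr)


def cumulative_reward(is_scores, fid_scores, alpha):
    """Sum of step rewards along one trajectory of cells."""
    return sum(
        step_reward(is_scores[t - 1], is_scores[t],
                    fid_scores[t - 1], fid_scores[t], alpha)
        for t in range(1, len(is_scores))
    )


# Scores before any cell is added (index 0) and after each of three cells.
is_scores = [1.0, 4.0, 6.5, 8.0]
fid_scores = [300.0, 120.0, 40.0, 15.0]
alpha = 0.01  # assumed weight, chosen only for this example

total = cumulative_reward(is_scores, fid_scores, alpha)
telescoped = (is_scores[-1] - is_scores[0]) + alpha * (fid_scores[0] - fid_scores[-1])
```

Since the initial scores are the same for every trajectory, maximizing the summed step rewards is equivalent to maximizing the final IS minus the weighted final FID.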
5.2 Offpolicy RL Solver
The proposed designs of state, reward, and action fulfill the criteria of an MDP and make it possible to stabilize training with off-policy samples. We are now free to choose any off-policy RL solver to improve data efficiency.
In this paper, we apply the off-the-shelf soft actor-critic (SAC) algorithm [17], an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, as the learning algorithm. It has been demonstrated to be 10 to 100 times more data-efficient than on-policy algorithms on traditional RL tasks. In SAC, the actor aims at maximizing the expected reward while also maximizing entropy. This significantly increases training stability and improves exploration during training.
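For the learning of the critic, the objective function is defined as:
$$J_Q = \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ \tfrac{1}{2} \left( Q(s, a) - Q_{\text{tar}}(s, a) \right)^2 \right] \tag{6}$$
where $Q_{\text{tar}}$ is the approximation target for $Q$:
$$Q_{\text{tar}}(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim P,\, \epsilon \sim \mathcal{N}} \left[ Q(s', f_\theta(\epsilon, s')) - \beta \log \pi_\theta(f_\theta(\epsilon, s') \mid s') \right] \tag{7}$$
with discount factor $\gamma$ and next state $s'$. The objective function of the policy network is given by:
$$J_\pi(\theta) = \mathbb{E}_{s \sim \mathcal{D},\, \epsilon \sim \mathcal{N}} \left[ \beta \log \pi_\theta(f_\theta(\epsilon, s) \mid s) - Q(s, f_\theta(\epsilon, s)) \right] \tag{8}$$
where $\pi_\theta$ is parameterized by a neural network $f_\theta$, $\epsilon$ is an input vector consisting of Gaussian noise, and $\mathcal{D}$ is the replay buffer for storing the MDP tuples [38]. $\beta$ is a positive Lagrange multiplier that controls the relative importance of the policy entropy versus the reward.
5.3 Implementation of EGAN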
In this section, we present the implementation details of the proposed off-policy RL framework EGAN. The training process is briefly outlined in Algorithm 1.
5.3.1 Agent Training
Since we reformulated NAS as a multi-step MDP, our agent makes several decisions in any trajectory $\tau$. In each step, the agent stores this experience in the memory buffer $\mathcal{D}$. Once the minimum memory length is reached, the agent is updated with the Adam [28] optimizer via the objective function presented in Eq. 8, by sampling a batch of data from the memory buffer in an off-policy way.
The entire search comprises two periods: the exploration period and the exploitation period. During the exploration period, the agent samples any possible architecture, whereas in the exploitation period, the agent chooses the best architecture in order to quickly stabilize the policy.
The exploration period lasts for 70% of the iterations, and the exploitation period takes the remaining 30%. Once the memory threshold is reached, the policy is updated once for every exploration step. For every exploitation step, the policy is updated 10 times in order to converge quickly.
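The update schedule described above can be sketched with a small replay buffer and a loop that performs one update per exploration step and ten per exploitation step once the buffer is large enough. This is a hedged sketch: the class and function names, the buffer capacity, the placeholder transitions, and the reading of the split as 70% exploration are assumptions, and `update_fn` stands in for one gradient step on the policy and critic.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity memory of past transitions, sampled uniformly."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size, rng):
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def run_search(num_iterations, steps_per_trajectory, min_buffer, batch_size,
               update_fn, explore_percent=70, explore_updates=1,
               exploit_updates=10, seed=0):
    """Run the two-period schedule and return the number of policy updates."""
    rng = random.Random(seed)
    buffer = ReplayBuffer(capacity=10_000)
    explore_iterations = num_iterations * explore_percent // 100
    updates = 0
    for iteration in range(num_iterations):
        exploring = iteration < explore_iterations
        for step in range(steps_per_trajectory):
            # Placeholder for the (state, action, reward, next_state) tuple.
            buffer.add((iteration, step, exploring))
            if len(buffer) < min_buffer:
                continue  # wait until the memory threshold is reached
            for _ in range(explore_updates if exploring else exploit_updates):
                update_fn(buffer.sample(batch_size, rng))
                updates += 1
    return updates


batches = []
num_updates = run_search(num_iterations=10, steps_per_trajectory=3,
                         min_buffer=6, batch_size=4, update_fn=batches.append)
```

Because batches are drawn from the whole buffer, transitions collected under earlier policies keep contributing to later updates, which is the source of the data efficiency.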
5.3.2 Proxy Task
We use a progressive proxy task in order to collect rewards quickly. When a new cell is added, we train the current full trajectory for one epoch and calculate the reward for the current cell. Within a trajectory, the weights of the previous cells are kept and trained together with the new cell. In order to accurately estimate the Q-value of each state-action pair, we reset the weights of the neural network after finishing the design of the entire architecture trajectory.
6 Experiments
6.1 Dataset
In this paper, we use the CIFAR-10 dataset [29] to evaluate the effectiveness and efficiency of the proposed EGAN framework. The CIFAR-10 dataset consists of 50,000 training images and 10,000 test images with a resolution of 32×32. We use its training set without any data augmentation technique to search for the architecture with the highest cumulative return for a GAN generator. Furthermore, to evaluate the transferability of the discovered architecture, we also adopt the STL-10 dataset [8] to train the network, again without any data augmentation, to make a fair comparison to previous works.
6.2 Search Space
To verify the effectiveness of the off-policy framework and to enable a fair comparison, we use the same search space as used in the AutoGAN experiments [11]. There are five control variables: 1) Skip operation, which is a binary value indicating whether the current cell takes a skip connection from any specific cell as its input. Note that each cell could take multiple skip connections from other preceding cells. 2) Pre-activation [21] and post-activation convolution blocks. 3) Three types of normalization operations, including batch normalization [24], instance normalization [51], and no normalization. 4) Upsampling operation, which is standard in current image generation GANs, including bilinear upsampling, nearest neighbor upsampling, and stride-2 deconvolution. 5) Shortcut operation.
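The five control variables can be written down as a small table of options, with one softmax over each group turning the agent's raw outputs into a concrete cell design. The sketch below is illustrative only: the option names, the treatment of the skip and shortcut variables as single binary choices, and the greedy argmax decoding are assumptions rather than details taken from AutoGAN or EGAN.

```python
import math

# Hypothetical encoding of the per-cell search space described above.
SEARCH_SPACE = {
    "conv": ["pre_activation", "post_activation"],
    "norm": ["batch_norm", "instance_norm", "none"],
    "upsample": ["bilinear", "nearest", "deconv_stride2"],
    "shortcut": [False, True],
    "skip": [False, True],  # simplified: one flag instead of one per earlier cell
}


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_action(logits_by_group):
    """Pick the most probable option in every group of the search space."""
    cell = {}
    for name, options in SEARCH_SPACE.items():
        probs = softmax(logits_by_group[name])
        cell[name] = options[probs.index(max(probs))]
    return cell


cell = decode_action({
    "conv": [0.1, 2.0],
    "norm": [3.0, 0.0, -1.0],
    "upsample": [0.0, 0.5, 0.2],
    "shortcut": [0.0, 1.0],
    "skip": [1.0, 0.0],
})
```

During exploration one would typically sample from the softmax probabilities instead of taking the argmax; the greedy choice shown here corresponds to picking the currently preferred option.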
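Table 1: Results on CIFAR-10.
Methods                  | Inception Score | FID   | Search Cost (GPU days)
-------------------------|-----------------|-------|-----------------------
DCGAN [42]               |                 |       |
Improved GAN [45]        |                 |       |
LR-GAN [60]              |                 |       |
DFM [55]                 |                 |       |
ProbGAN [20]             | 7.75            | 24.60 |
WGAN-GP, ResNet [14]     |                 |       |
Splitting GAN [13]       |                 |       |
SN-GAN [36]              |                 |       |
MGAN [22]                |                 | 26.7  |
Dist-GAN [50]            |                 |       |
Progressive GAN [26]     | 8.80 ± .05      |       |
Improv MMD GAN [54]      | 8.29            | 16.21 |
Random search-1 [11]     | 8.09            | 17.34 |
Random search-2 [11]     | 7.97            | 21.39 |
AGAN [52]                |                 | 30.5  | 1200
AutoGAN-top1 [11]        |                 | 12.42 | 2
AutoGAN-top2 [11]        |                 | 13.67 | 2
AutoGAN-top3 [11]        |                 | 13.68 | 2
EGAN-top1                |                 | 11.26 | 0.3
EGAN-top2                |                 | 12.96 | 0.3
EGAN-top3                |                 | 12.48 | 0.3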
6.3 Results
The generator architecture discovered by EGAN on the CIFAR-10 training set is displayed in Figure 3. For the task of unconditional CIFAR-10 image generation (no class labels used), several notable observations can be summarized:

EGAN prefers post-activation convolution blocks to pre-activation convolution blocks. This finding is contrary to AutoGAN’s preference, but coincides with previous experience from human experts.

EGAN prefers the use of batch normalization. This finding is also contrary to AutoGAN’s choice, but is in line with experts’ common practice.

EGAN prefers bilinear upsampling to nearest neighbor upsampling. In theory, this provides finer upsampling between different cells.
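Table 2: Results on STL-10.
Methods                  | Inception Score | FID   | Search Cost (GPU days)
-------------------------|-----------------|-------|-----------------------
D2GAN [39]               | 7.98            |       | Manual
DFM [55]                 |                 |       | Manual
ProbGAN [20]             |                 | 46.74 | Manual
SN-GAN [36]              |                 |       | Manual
Dist-GAN [50]            |                 | 36.19 | Manual
Improving MMD GAN [54]   | 9.34            | 37.63 | Manual
AGAN [52]                |                 | 52.7  | 1200
AutoGAN-top1 [11]        |                 | 31.01 | 2
EGAN-top1                | 9.51 ± .09      | 25.35 | 0.3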

Our EGAN framework takes only about 0.3 GPU days for searching, while AGAN spends 1200 GPU days and AutoGAN spends 2 GPU days.
We train the discovered EGAN from scratch for 500 epochs and summarize the IS and FID scores in Table 1. On the CIFAR-10 dataset, our model achieves a highly competitive FID of 11.26 compared to the published results of AutoGAN [11] and handcrafted GANs [42, 45, 60, 55, 20, 14, 13, 36, 22, 54]. In terms of IS, EGAN is also highly competitive with AutoGAN [11]. We additionally report the performance of the top-2 and top-3 architectures discovered in one search. Both have higher performance than their respective AutoGAN counterparts.
We also test the transferability of EGAN. We retrain the weights of the discovered EGAN architecture using the STL-10 training and unlabeled sets for the unconditional image generation task. EGAN achieves highly competitive performance on both IS (9.51) and FID (25.35), as shown in Table 2.
Because our main contribution is the new formulation and the use of off-policy RL for GAN architecture search, we compare the proposed method directly with existing RL-based algorithms. We use the exact same search space as AutoGAN, which does not include the search for a discriminator. As GAN training is an interactive procedure between generator and discriminator, one might expect better performance if the search is conducted on both networks. We report our scores using the exact same evaluation procedure provided by the authors of AutoGAN. The reported scores are based on the best models achieved during training, evaluated at 20-epoch intervals. The mean and standard deviation of the IS are calculated based on a 10-fold evaluation of 50,000 generated images. We additionally report the performance curves against training steps of EGAN and AutoGAN for three runs in the supplementary material.
Table 3: Ablation of the reward choice.
Methods                      | Inception Score | FID   | Search Cost (GPU days)
-----------------------------|-----------------|-------|-----------------------
AutoGAN-top1 [11]            |                 | 12.42 | 2
EGAN (IS and FID as reward)  |                 |       | 0.3
EGAN (IS only as reward)     | 8.81 ± .11      |       | 0.1
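The 10-fold IS evaluation mentioned above can be sketched as follows: the Inception Score of a set of samples is the exponential of the mean KL divergence between each predicted class distribution p(y|x) and the marginal p(y), and the reported mean and standard deviation are taken over equally sized folds. The function names are hypothetical, the class probabilities would in practice come from a pretrained Inception network, and the use of the population standard deviation is an assumption.

```python
import math
import statistics


def inception_score(probs):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y)."""
    n, k = len(probs), len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_sum = 0.0
    for p in probs:
        kl_sum += sum(pj * math.log(pj / marginal[j])
                      for j, pj in enumerate(p) if pj > 0.0)
    return math.exp(kl_sum / n)


def inception_score_folds(probs, num_folds=10):
    """Mean and standard deviation of the score over consecutive folds."""
    fold_size = len(probs) // num_folds
    scores = [inception_score(probs[i * fold_size:(i + 1) * fold_size])
              for i in range(num_folds)]
    return statistics.mean(scores), statistics.pstdev(scores)


# Toy predictions over two classes: confident and balanced versus uninformative.
sharp = [[1.0, 0.0], [0.0, 1.0]] * 10
flat = [[0.5, 0.5]] * 20

sharp_mean, sharp_std = inception_score_folds(sharp)
flat_mean, flat_std = inception_score_folds(flat)
```

Confident predictions spread evenly over the classes give a score equal to the number of classes, while uninformative predictions give the minimum score of 1.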

7 Discussion
7.1 Reward Choice: IS and FID
IS and FID are the two main evaluation metrics for GANs. We conduct an ablation study using different combinations as the reward: IS only (the FID weight is set to zero) and the combination of IS and FID (a nonzero FID weight). Our agent successfully discovered two different architectures. When only IS is used as the reward signal, the agent discovered a different architecture using only 0.1 GPU days. The searched architecture achieved a state-of-the-art IS of 8.86 but a relatively plain FID of 15.78, as shown in Table 3. This demonstrates the effectiveness of the proposed method, as we are encouraging the agent to find the architecture with a higher IS. Interestingly, this special architecture shows that, at least in certain edge cases, IS and FID may not always have a strong positive correlation. This finding motivates us to additionally use FID as part of the reward. When both IS and FID are used as the reward signal, the discovered architecture performs well in terms of both metrics. The search with this combined reward takes 0.3 GPU days (compared to 0.1 GPU days for IS-only optimization) because of the relatively expensive FID computation.
7.2 Reproducibility
We train our agent with 3 different seeds. As shown in Figure 5, the policy steadily converges during the exploitation period. EGAN finds similar architectures with relatively good performance on the proxy task.
8 Conclusion
In this work, we proposed a novel off-policy reinforcement learning method, EGAN, to efficiently and effectively search for GAN architectures. We reformulated the problem as an MDP and overcame the challenges of using off-policy data. We first introduced a new progressive state representation. We additionally introduced a new reward function, which allowed us to target a potentially global optimization in our MDP formulation. EGAN achieves state-of-the-art efficiency in GAN architecture search, and the discovered architecture shows highly competitive performance compared to other state-of-the-art methods. In future work, we plan to simultaneously optimize the generator and discriminator architectures in a multi-agent context.
Acknowledgement
The contributions of Yuan Tian, Qin Wang, and Olga Fink were funded by the Swiss National Science Foundation (SNSF) Grant no. PP00P2_176878.
References

[1]
(2017)
CVAEgan: finegrained image generation through asymmetric training.
In
Proceedings of the IEEE International Conference on Computer Vision
, pp. 2745–2754. Cited by: §1.  [2] (2009) Convergent temporaldifference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pp. 1204–1212. Cited by: §1.
 [3] (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §1, §1.
 [4] (2016) Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093. Cited by: §1.
 [5] (2017) Smash: oneshot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344. Cited by: §2.
 [6] (2020) Realtime model calibration with deep reinforcement learning. arXiv preprint arXiv:2006.04001. Cited by: §2.

[7]
(2018)
Stargan: unified generative adversarial networks for multidomain imagetoimage translation
. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 8789–8797. Cited by: §1. 
[8]
(2011)
An analysis of singlelayer networks in unsupervised feature learning.
In
Proceedings of the fourteenth International Conference on Artificial Intelligence and Statistics
, pp. 215–223. Cited by: §6.1.  [9] (2019) DEGAS: differentiable efficient generator search. arXiv preprint arXiv:1912.00606. Cited by: §2.
 [10] (2019) AdversarialNAS: adversarial neural architecture search for gans. arXiv preprint arXiv:1912.02037. Cited by: §2.
 [11] (2019) Autogan: neural architecture search for generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3224–3234. Cited by: §1, §1, §2, §4.2, Figure 2, §5.1.2, §5.1.2, §5.1.3, §6.2, §6.3, Table 1, Table 2, Table 3.
 [12] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §1.
 [13] (2017) Classsplitting generative adversarial networks. arXiv preprint arXiv:1709.07359. Cited by: §6.3, Table 1.
 [14] (2017) Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777. Cited by: §1, §6.3, Table 1.
 [15] (2019) Autoembedding generative adversarial networks for high resolution image synthesis. IEEE Transactions on Multimedia 21 (11), pp. 2726–2737. Cited by: §1.
 [16] (2019) Nat: neural architecture transformer for accurate and compact architectures. In Advances in Neural Information Processing Systems, pp. 737–748. Cited by: §2.
 [17] (2018) Soft actorcritic: offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §2, §5.2.
 [18] (2019) H infinity modelfree reinforcement learning with robust stability guarantee. arXiv preprint arXiv:1911.02875. Cited by: §2.
 [19] (2020) Actorcritic reinforcement learning for control with stability guarantee. arXiv preprint arXiv:2004.14288. Cited by: §2.
 [20] (2019) Probgan: towards probabilistic gan with theoretical guarantees. International Conference on Learning Representations. Cited by: §6.3, Table 1, Table 2.
 [21] (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §6.2.
 [22] (2018) MGAN: training generative adversarial nets with multiple generators. Cited by: §6.3, Table 1.
 [23] (2019) Learning agile and dynamic motor skills for legged robots. Science Robotics 4 (26), pp. eaau5872. Cited by: §2.
 [24] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §6.2.

 [25] (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134. Cited by: §1.
 [26] (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §1, §1, §5.1.1, Table 1.
 [27] (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §1, §1.
 [28] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.3.1.
 [29] (2009) Learning multiple layers of features from tiny images. Cited by: §6.1.
 [30] (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1, §2.
 [31] (2016) Learning dexterous manipulation policies from experience and imitation. arXiv preprint arXiv:1611.05095. Cited by: §2.
 [32] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
 [33] (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 82–92. Cited by: §1, §2.
 [34] (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: §2, §2.
 [35] (2019) DARTS: differentiable architecture search. In International Conference on Learning Representations, External Links: Link Cited by: §2, §2.
 [36] (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §6.3, Table 1, Table 2.
 [37] (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §2.
 [38] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1, §5.2.
 [39] (2017) Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2670–2680. Cited by: Table 2.
 [40] (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2337–2346. Cited by: §1.
 [41] (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §2, §2.
 [42] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §6.3, Table 1.

 [43] (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2902–2911. Cited by: §2.
 [44] (2016) Generative adversarial text to image synthesis. In International Conference on Machine Learning, pp. 1060–1069. Cited by: §1.
 [45] (2016) Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234–2242. Cited by: §6.3, Table 1.
 [46] (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §2.
 [47] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2.
 [48] (2014) Deterministic policy gradient algorithms. Cited by: §2.
 [49] (1992) Reinforcement learning is direct adaptive optimal control. IEEE Control Systems Magazine 12 (2), pp. 19–22. Cited by: §2.
 [50] (2018) Dist-gan: an improved gan using distance constraints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 370–385. Cited by: Table 1, Table 2.
 [51] (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §6.2.
 [52] (2019) Agan: towards automated design of generative adversarial networks. arXiv preprint arXiv:1906.11080. Cited by: §1, §1, §1, §2, Table 1, Table 2.
 [53] (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807. Cited by: §1.

 [54] (2019) Improving MMD-GAN training with repulsive loss function. In International Conference on Learning Representations. Cited by: §6.3, Table 1, Table 2.
 [55] (2016) Improving generative adversarial networks with denoising feature matching. Cited by: §6.3, Table 1, Table 2.
 [56] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §2.
 [57] (2018) SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926. Cited by: §2.
 [58] (2019) Iterative reinforcement learning based design of dynamic locomotion skills for cassie. arXiv preprint arXiv:1903.09537. Cited by: §2.
 [59] (2018) Attngan: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324. Cited by: §1.
 [60] (2017) Lr-gan: layered recursive generative adversarial networks for image generation. arXiv preprint arXiv:1703.01560. Cited by: §6.3, Table 1.
 [61] (2019) Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354–7363. Cited by: §1.
 [62] (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 5907–5915. Cited by: §1.
 [63] (2018) Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2423–2432. Cited by: §2.
 [64] (2010) Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Cited by: §3.2.
 [65] (2017) Neural architecture search with reinforcement learning. International Conference on Learning Representations. Cited by: §2, §2, §4.2.