E2GAN
[ECCV 2020] "Off-Policy Reinforcement Learning for Efficient and Effective GAN Architecture Search" by Yuan Tian, Qin Wang, Zhiwu Huang, Wen Li, Dengxin Dai, Minghao Yang, Jun Wang, Olga Fink
In this paper, we introduce a new reinforcement learning (RL) based neural architecture search (NAS) methodology for effective and efficient generative adversarial network (GAN) architecture search. The key idea is to formulate the GAN architecture search problem as a Markov decision process (MDP) for smoother architecture sampling, which enables a more effective RL-based search algorithm by targeting the potential global optimal architecture. To improve efficiency, we exploit an off-policy GAN architecture search algorithm that makes efficient use of the samples generated by previous policies. Evaluation on two standard benchmark datasets (i.e., CIFAR-10 and STL-10) demonstrates that the proposed method is able to discover highly competitive architectures for generally better image generation results with a considerably reduced computational burden: 7 GPU hours. Our code is available at https://github.com/Yuantian013/E2GAN.
Generative adversarial networks (GANs) have been successfully applied to a wide range of generation tasks, including image generation [12, 3, 1, 53, 15], text-to-image synthesis [44, 62, 40] and image translation [25, 7], to name a few. To further improve the generation quality, several extensions have been proposed, including regularization terms [4, 14], progressive training strategies [26], attention mechanisms [59, 61], and new architectures [27, 3].
While the design of favorable neural architectures for GANs has been very successful, it typically requires a large amount of time, effort, and domain expertise. For instance, several state-of-the-art GANs [27, 3] design appreciably complex generator or discriminator backbones to better generate high-resolution images. To alleviate the network engineering pain, an efficient automated architecture search framework for GANs is highly needed. On the other hand, neural architecture search (NAS) has been applied and proved effective in discriminative tasks such as image classification [30] and segmentation [33]. Encouraged by this, AGAN [52] and AutoGAN [11] have introduced neural architecture search methods for GANs based on reinforcement learning (RL), thereby enabling a significant speedup of the architecture search process.
Similar to other architecture search tasks (image classification, image segmentation), the recently proposed RL-based GAN architecture search method AGAN [52] optimizes the entire architecture at once. Since the same policy might sample different architectures, it is likely to suffer from noisy gradients and high variance, which can further harm the stability of the policy update. To circumvent this issue, AutoGAN [11] uses multi-level architecture search (MLAS) with a progressive optimization formulation. However, because optimization is based on the best performance of the current architecture level, this progressive formulation can lead to a locally optimal solution.
To overcome these drawbacks, we reformulate the GAN architecture search problem as a Markov decision process (MDP). The new formulation is partly inspired by the human-designed Progressive GAN [26], which has been shown to improve generation quality progressively in the intermediate outputs of each architecture cell. In our new formulation, a sequence of decisions is made during the entire architecture design process, which allows state-based sampling and thus alleviates the variance. In addition, as we will show later in the paper, by using a carefully designed reward, this new formulation also allows us to target effective global optimization over the entire architecture.
More importantly, the MDP formulation can better facilitate off-policy RL training to improve data efficiency. The previously proposed RL-based GAN architecture search methods [11, 52] are based on on-policy RL, leading to limited data efficiency that results in considerably long training time. Specifically, an on-policy RL approach generally requires frequent sampling of a batch of architectures generated by the current policy to update the policy. Moreover, new samples have to be collected for each gradient step, while the previous batches are discarded. This quickly becomes very expensive, as the number of gradient steps and samples increases with the complexity of the task, especially in architecture search tasks. By comparison, off-policy reinforcement learning algorithms make use of past experience, which enables the RL agents to learn more efficiently. This has been proven effective in other RL tasks, including legged locomotion [32] and complex video games [38]. However, exploiting off-policy data for GAN architecture search poses new challenges. Training the policy network inevitably becomes unstable when using off-policy data, because these training samples are systematically different from the on-policy ones. This presents a great challenge to the stability and convergence of the algorithm [2]. Our proposed MDP formulation can make a difference here. By allowing state-based sampling, the new formulation alleviates this instability and better supports the off-policy strategy.
The contributions of this paper are twofold:
We reformulate the problem of neural architecture search for GANs as an MDP for smoother architecture sampling, which enables a more effective RL-based search algorithm and potentially more global optimization.
We propose an efficient and effective off-policy RL NAS framework for GAN architecture search (E2GAN), which is 6 times faster than existing RL-based GAN search approaches with competitive performance.
We conduct a variety of experiments to validate the effectiveness of E2GAN. Our discovered architectures yield better results compared to RL-based competitors. E2GAN is efficient, as it is able to find a highly competitive model within 7 GPU hours.
Reinforcement Learning. Recent progress in model-free reinforcement learning (RL) [49] has fostered promising results in many interesting tasks, ranging from gaming [37, 48] to planning and control problems [23, 31, 58, 18, 6, 19] and even AutoML [65, 41, 34]. However, model-free deep RL methods are notoriously expensive in terms of their sample complexity. One reason for the poor sample efficiency is the use of on-policy reinforcement learning algorithms, such as trust region policy optimization (TRPO) [46], proximal policy optimization (PPO) [47] or REINFORCE [56]. On-policy learning algorithms require new samples generated by the current policy for each gradient step. In contrast, off-policy algorithms aim to reuse past experience. Recent off-policy reinforcement learning algorithms, such as soft actor-critic (SAC) [17], have demonstrated substantial improvements in both performance and sample efficiency over previous on-policy methods.
Neural architecture search. Neural architecture search methods aim to automatically search for a good neural architecture for various tasks, such as image classification [30] and segmentation [33], in order to ease the burden of handcrafting dedicated architectures for specific tasks. Several approaches have been proposed to tackle the NAS problem. Zoph and Le [65] proposed a reinforcement-learning-based method that trains an RNN controller to design the neural network. Guo et al. [16] exploited a novel graph convolutional neural network for policy learning in reinforcement learning. Other successful approaches include evolutionary-algorithm-based methods [43], differentiable methods [35] and one-shot learning methods [5, 35]. Early RL-based NAS algorithms [57, 41, 65, 34] proposed to optimize the entire trajectory (i.e., the entire neural architecture). To the best of our knowledge, most of the previously proposed RL-based NAS algorithms used on-policy RL algorithms, such as REINFORCE or PPO. The exception is [63], which uses Q-learning for NAS; this is a value-based method and only supports discrete state space problems. For on-policy algorithms, since each update requires new data collected by the current policy and the reward is based on training the internal neural network architecture, the on-policy training of RL-based NAS algorithms inevitably becomes computationally expensive.
GAN Architecture Search. Due to the specificities of GANs and their challenges, such as instability and mode collapse, the NAS approaches proposed for discriminative models cannot be directly transferred to the architecture search of GANs. Only recently have a few approaches been introduced that tackle the specific challenges of GAN architectures. AutoGAN introduced a neural architecture search for GANs based on reinforcement learning (RL), thereby enabling a significant speedup of the process of architecture selection [11]. The AutoGAN algorithm is based on on-policy reinforcement learning. Its multi-level architecture search (MLAS) aims at progressively finding well-performing GAN architectures and completes the task in around 2 GPU days. Similarly, AGAN [52] uses reinforcement learning for generative architecture search in a larger search space. The computational cost of AGAN is comparatively very high (1200 GPU days). In addition, AdversarialNAS [10] and DEGAS [9] adopted a different approach, i.e., a differentiable search strategy [35], for the GAN architecture search problem.
In this section, we briefly review the basic concepts and notations used in the following sections.
The training of GANs involves an adversarial competition between two players, a generator and a discriminator. The generator aims at generating realistic-looking images to 'fool' its opponent. Meanwhile, the discriminator aims to distinguish whether an image is real or fake. This can be formulated as a min-max optimization problem:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big] \tag{1}$$
where $G$ and $D$ are the generator and discriminator parametrized by neural networks, $z$ is sampled from random noise, $x$ are the real images and $G(z)$ are the generated images.
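As a small numerical illustration of the min-max objective, the sketch below estimates its value from discriminator outputs on a handful of real and generated samples. The function name `gan_value` and the example probabilities are hypothetical and only serve to show how the two expectation terms combine.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the GAN objective.

    d_real: discriminator outputs D(x) on real images, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated images, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident and correct discriminator pushes the value towards 0,
# which is the upper bound of the objective.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator tries to increase this value, while the generator tries to decrease it by making `d_fake` larger.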
A Markov decision process (MDP) is a discrete-time stochastic control process. At each time step, the process is in some state $s$, and its associated decision-maker chooses an available action $a$. Given the action, the process moves into a new state $s'$ at the next step, and the agent receives a reward.
An MDP can be described as a tuple $(\mathcal{S}, \mathcal{A}, r, P, \rho)$, where $\mathcal{S}$ is the set of states that is able to precisely describe the current situation, $\mathcal{A}$ is the set of actions, $r$ is the reward function, $P$ is the transition probability function, and $\rho$ is the initial state distribution.
MDPs can be particularly useful for solving optimization problems via reinforcement learning. In a general reinforcement learning setup, an agent is trained to interact with the environment and receives a reward from its interaction. The goal is to find a policy $\pi$ that maximizes the cumulative reward $\sum_t r(s_t, a_t)$:
$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[\sum_t r(s_t, a_t)\Big] \tag{2}$$
where $\tau = (s_0, a_0, s_1, a_1, \dots)$ denotes a trajectory generated by following $\pi$.
While standard RL merely maximizes the expected cumulative reward, the maximum entropy RL framework considers a more general objective [64], which favors stochastic policies. This objective shows a strong connection to the exploration-exploitation trade-off, and can also prevent the policy from getting stuck in local optima. Formally, it is given by
$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[\sum_t \big( r(s_t, a_t) + \beta\, \mathcal{H}(\pi(\cdot \mid s_t)) \big)\Big] \tag{3}$$
where $\mathcal{H}$ denotes the entropy and $\beta$ is the temperature parameter that controls the stochasticity of the optimal policy.
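To make the role of the temperature concrete, the following sketch evaluates the entropy-augmented return of a single trajectory for a discrete action distribution. The helper names and the toy numbers are assumptions chosen for illustration; `beta` stands for the temperature parameter.

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def soft_return(rewards, action_probs, beta):
    """Sum over steps of r_t + beta * H(pi(.|s_t)) along one trajectory."""
    return sum(r + beta * entropy(p) for r, p in zip(rewards, action_probs))

rewards = [1.0, 0.5]
deterministic = [[1.0, 0.0], [1.0, 0.0]]  # policy always picks action 0
uniform = [[0.5, 0.5], [0.5, 0.5]]        # policy picks uniformly at random

plain = soft_return(rewards, uniform, beta=0.0)          # standard RL return
det_soft = soft_return(rewards, deterministic, beta=0.2)  # no entropy bonus
uni_soft = soft_return(rewards, uniform, beta=0.2)        # entropy bonus added
```

With `beta = 0` the objective reduces to the standard cumulative reward; a larger `beta` rewards more stochastic policies for the same environment reward.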
Given a fixed search space, GAN architecture search agents aim to discover an optimal network architecture on a given generation task. Existing RL methods update the policy network by using batches of entire architectures sampled from the current policy. Even though these data samples are only used for the current update step, the sampled GAN architectures nevertheless require tedious training and evaluation processes. The sampling efficiency is therefore very low resulting in limited learning progress of the agents. Moreover, the entire architecture sampling leads to a high variance, which might influence the stability of the policy update.
The key motivation of the proposed methodology is to stabilize and accelerate the learning process by step-wise sampling instead of entire-trajectory-based sampling and by making efficient use of past experiences from previous policies. To achieve this, we propose to formulate the GAN architecture search problem as an MDP and solve it by off-policy reinforcement learning.
We propose to formulate the GAN architecture search problem as a Markov decision process (MDP), which enables state-based sampling. It further boosts the learning process and overcomes the potential challenge of a large variance stemming from sampling entire architectures, which makes it inherently difficult to train a policy using off-policy data.
Formulating the GAN architecture search problem as an MDP provides a mathematical description of the architecture search process. An MDP can be described as a tuple $(\mathcal{S}, \mathcal{A}, r, P, \rho)$, where $\mathcal{S}$ is the set of states that is able to precisely describe the current architecture (such as the current cell number, the structure or the performance of the architecture), $\mathcal{A}$ is the set of actions that defines the architecture design of the next cell, $r$ is the reward function used to define how good the architecture is, $P$ is the transition probability function indicating the training process, and $\rho$ is the initial architecture. We define a cell as the architecture block that we search for in one step. The design details of states, actions, and rewards are discussed in Section 5.
It is important to highlight that the formulation proposed in this paper has two main differences compared to previous RL methods for neural architecture search. Firstly, it is essentially different from the classic RL approaches for NAS [65], which formulate the task as an optimization problem over the entire trajectory/architecture. Instead, the MDP formulation proposed here enables us to do the optimization based on the disentangled steps of cell design. Secondly, it is also different from the progressive formulation used by AutoGAN [11], where the optimization is based on the best performance of the current architecture level and can potentially lead to a locally optimal solution. Instead, the proposed formulation enables us to potentially conduct a more global optimization using the cumulative reward without the burden of calculating the reward over the full trajectory at once. It is important to point out that the multi-level optimization formulation used in AutoGAN [11] does not have this property.
In this section, we integrate off-policy RL in the GAN architecture search by making use of the newly proposed MDP formulation. We introduce several innovations to address the challenges of an off-policy learning setup.
The MDP formulation of GAN architecture search enables us to use off-policy reinforcement learning for a step-wise optimization of the entire search process to maximize the cumulative reward.
Before we move on to the off-policy solver, we need to design the state, reward, and action to meet the requirements of both the GAN architecture design and the MDP formulation.
An MDP requires a state representation that can precisely represent the current network up to the current step. Most importantly, this state needs to be stable during training to avoid adding more variance to the training of the policy network. The stability requirement is particularly relevant since the policy network relies on it to design the next cell. The design of the state is one of the main challenges we face when adopting off-policy RL for GAN architecture search.
Inspired by Progressive GAN [26], which has been shown to improve generation quality in the intermediate RGB outputs of each architecture cell, we propose a progressive state representation for GAN architecture search. Specifically, given a fixed batch of input noise, we adopt the average output of each cell as the progressive state representation. We down-sample this representation to impose a constant size across different cells. Note that there are alternative ways to encode the network information. For example, one could also deploy another network to encode the previous layers. However, we find the proposed design both efficient and effective.
In addition to the progressive state representation, we also use network performance (Inception Score / FID) and layer number to provide more information about the state. To summarize, the designed state includes the depth, performance of the current architecture, and the progressive state representation.
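The state design described above can be sketched in a few lines. This is only an illustrative reading of the text, under several assumptions that are not taken from the paper: cell outputs are treated as single-channel square maps, "average output" is interpreted as the mean over the noise batch, down-sampling is done by average pooling, and the target size of 2 x 2 is arbitrary. All function names are hypothetical.

```python
def average_over_batch(batch):
    """Element-wise mean of a batch of equally sized 2-D maps."""
    n = len(batch)
    rows, cols = len(batch[0]), len(batch[0][0])
    return [[sum(img[i][j] for img in batch) / n for j in range(cols)]
            for i in range(rows)]

def downsample(image, size):
    """Average-pool a square map to size x size (side must be a multiple of size)."""
    k = len(image) // size
    return [[sum(image[i * k + di][j * k + dj]
                 for di in range(k) for dj in range(k)) / (k * k)
             for j in range(size)]
            for i in range(size)]

def build_state(cell_outputs, depth, inception_score, fid, size=2):
    """Concatenate depth, performance, and the pooled progressive representation."""
    pooled = downsample(average_over_batch(cell_outputs), size)
    flat = [value for row in pooled for value in row]
    return [float(depth), inception_score, fid] + flat

# Two toy 4x4 cell outputs produced from a fixed noise batch of size 2.
zeros = [[0.0] * 4 for _ in range(4)]
twos = [[2.0] * 4 for _ in range(4)]
state_cell1 = build_state([zeros, twos], depth=1, inception_score=5.0, fid=80.0)

# A deeper cell with a larger 8x8 output maps to a state of the same length.
big = [[1.0] * 8 for _ in range(8)]
state_cell2 = build_state([big], depth=2, inception_score=6.0, fid=60.0)
```

Because the pooled representation has a fixed size, the policy network can consume states from cells of different resolutions without any change to its input layer.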
Given the current state, which encodes the information about previous layers, the policy network decides on the next action. The action describes the architecture of one cell. For example, if we follow the search space used by AutoGAN [11], the action will contain skip options, upsampling operations, shortcut options, different types of convolution blocks, and the normalization option, as shown in Figure 2. This can then be defined as $a = [\text{conv}, \text{norm}, \text{upsample}, \text{shortcut}, \text{skip}]$. The action output by the agent is decoded into an operation by a softmax classifier. To demonstrate the effectiveness of our off-policy method and enable a fair comparison, in all of our experiments we use the same search space as AutoGAN [11], which means we search for generator cells, while the discriminator architecture is predesigned and grows as the generator becomes deeper. More details on the search space are discussed in Section 6.
We propose to design the reward function as the performance improvement after adding the new cell. In this work, we use both the Inception Score (IS) and the Fréchet Inception Distance (FID) as indicators of the network performance. Since IS is progressive (the higher the better) and FID is degressive (the lower the better), the proposed reward function can be formulated as:
$$r(s_t, a_t) = IS(t) - IS(t-1) + \alpha \big( FID(t-1) - FID(t) \big) \tag{4}$$
where $IS(t)$ and $FID(t)$ denote the scores of the architecture after adding cell $t$, and $\alpha$ is a factor that balances the trade-off between the two indicators. We use a fixed $\alpha$ in our main experiments. The motivation behind using a combined reward is the empirical finding that IS and FID are not always consistent with each other, so that relying on only one of them can lead to a biased choice of architectures. A detailed discussion about the choice of indicators is provided in Section 7.
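The improvement-based reward is simple enough to state directly in code. The sketch below assumes the reward is the IS gain plus a weighted FID reduction, as described above; the function name, argument names and all numeric values are illustrative only and are not results from the paper.

```python
def step_reward(is_prev, is_curr, fid_prev, fid_curr, alpha):
    """Reward for adding one cell: IS gain plus alpha times FID reduction."""
    return (is_curr - is_prev) + alpha * (fid_prev - fid_curr)

# Toy numbers: IS rises by 1.0 and FID drops by 10.0 after adding the cell.
combined = step_reward(5.0, 6.0, 80.0, 70.0, alpha=0.5)

# With alpha = 0 the reward only tracks the Inception Score.
is_only = step_reward(5.0, 6.0, 80.0, 70.0, alpha=0.0)

# A cell that hurts both metrics yields a negative reward.
harmful = step_reward(6.0, 5.5, 70.0, 75.0, alpha=0.5)
```

Since FID is "lower is better", the FID term is written as previous minus current, so that both terms are positive when the new cell helps.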
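By employing the performance improvement in each step as the reward, instead of only using the performance as proposed in [11], RL can maximize the expected sum of rewards over the entire trajectory. This enables us to target the potential globally optimal structure with the highest reward:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_t r(s_t, a_t)\Big] = \mathbb{E}_{\tau \sim \pi}\big[ IS_{\text{final}} - \alpha\, FID_{\text{final}} \big] + C \tag{5}$$
where $IS_{\text{final}}$ and $FID_{\text{final}}$ are the final scores of the entire architecture, and $C$ is a constant determined by the scores before the first cell is added.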
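The step from the per-cell rewards to the final scores follows from a telescoping sum. The derivation below is a sketch under the assumption that the reward of each step is the IS gain plus $\alpha$ times the FID reduction, for a trajectory of $T$ cells; $IS(0)$ and $FID(0)$ are notation introduced here for the scores before the first cell is added, and they are assumed not to depend on the policy.

```latex
\begin{align*}
\sum_{t=1}^{T} r(s_t, a_t)
  &= \sum_{t=1}^{T} \big[ IS(t) - IS(t-1) \big]
     + \alpha \sum_{t=1}^{T} \big[ FID(t-1) - FID(t) \big] \\
  &= IS(T) - IS(0) + \alpha \big[ FID(0) - FID(T) \big] \\
  &= \underbrace{IS(T) - \alpha\, FID(T)}_{\text{final scores}}
     \; + \; \underbrace{\alpha\, FID(0) - IS(0)}_{\text{independent of the policy}}
\end{align*}
```

All intermediate scores cancel, so maximizing the cumulative reward is equivalent to maximizing the combined final score of the complete architecture.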
The proposed designs of state, reward, and action fulfill the criteria of MDPs and make it possible to stabilize the training using off-policy samples. We are now free to choose any off-policy RL solver to improve data efficiency.
In this paper, we apply the off-the-shelf soft actor-critic algorithm (SAC) [17], an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, as the learning algorithm. It has been demonstrated to be 10 to 100 times more data-efficient than on-policy algorithms on traditional RL tasks. In SAC, the actor aims at maximizing the expected reward while also maximizing entropy. This increases training stability significantly and improves exploration during training.
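For the learning of the critic, the objective function is defined as:
$$J_Q(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\Big[ \tfrac{1}{2} \big( Q_\theta(s,a) - Q_{\text{target}}(s,a) \big)^2 \Big] \tag{6}$$
where $Q_{\text{target}}$ is the approximation target for $Q_\theta$, which in SAC [17] takes the form:
$$Q_{\text{target}}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s', \epsilon}\Big[ Q_{\bar{\theta}}\big(s', f_\phi(\epsilon, s')\big) - \beta \log \pi_\phi\big(f_\phi(\epsilon, s') \mid s'\big) \Big] \tag{7}$$
The objective function of the policy network is given by:
$$J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D},\, \epsilon}\Big[ \beta \log \pi_\phi\big(f_\phi(\epsilon, s) \mid s\big) - Q_\theta\big(s, f_\phi(\epsilon, s)\big) \Big] \tag{8}$$
where $\pi_\phi$ is parameterized by a neural network $f_\phi$, $\epsilon$ is an input vector consisting of Gaussian noise, and $\mathcal{D}$ is the replay buffer for storing the MDP tuples [38]. $\gamma$ is the discount factor, $Q_{\bar{\theta}}$ is a slowly updated target copy of the critic, and $\beta$ is a positive Lagrange multiplier that controls the relative importance of the policy entropy versus the reward.
In this section, we present the implementation details of the proposed off-policy RL framework E2GAN. The training process is briefly outlined in Algorithm 1.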
Since we reformulated NAS as a multi-step MDP, our agent makes several decisions in any trajectory $\tau$. In each step, the agent stores this experience in the memory buffer $\mathcal{D}$. Once the minimum memory length threshold is reached, the agent is updated using the Adam [28] optimizer via the objective function presented in Eq. 8 by sampling a batch of data from the memory buffer in an off-policy way.
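The buffering logic described above can be illustrated with a minimal replay buffer. This is a generic sketch rather than the authors' implementation: the class name, the capacity, the threshold value and the tuple layout are all assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of MDP transitions with a minimum-size gate."""

    def __init__(self, capacity, min_size):
        self.storage = deque(maxlen=capacity)  # oldest entries are dropped first
        self.min_size = min_size

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def ready(self):
        """True once enough transitions are stored to start policy updates."""
        return len(self.storage) >= self.min_size

    def sample(self, batch_size, rng=random):
        """Draw a batch uniformly, regardless of which policy produced it."""
        return rng.sample(list(self.storage), batch_size)

buffer = ReplayBuffer(capacity=5, min_size=3)
buffer.add("s0", "a0", 0.4, "s1", False)
buffer.add("s1", "a1", 0.1, "s2", False)
ready_after_two = buffer.ready()
buffer.add("s2", "a2", 0.3, "s3", True)
ready_after_three = buffer.ready()

# Adding more transitions than the capacity evicts the oldest ones.
for i in range(4):
    buffer.add(f"x{i}", f"b{i}", 0.0, f"x{i + 1}", False)
batch = buffer.sample(3)
```

Because every cell decision is stored as its own transition, one sampled architecture contributes several training samples, and old transitions remain usable for later gradient steps.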
The entire search comprises two periods: the exploration period and the exploitation period. During the exploration period, the agent samples any possible architecture. During the exploitation period, the agent chooses the best architecture in order to quickly stabilize the policy.
The exploration period lasts for 70% of the iterations, and the exploitation period takes the remaining 30%. Once the memory threshold is reached, the policy is updated once for every exploration step. For every exploitation step, the policy is updated 10 times in order to converge quickly.
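The two-period schedule can be expressed as a small helper. The sketch assumes the split is 70% exploration followed by 30% exploitation of the total iterations, with one update per exploration step and ten per exploitation step once the buffer threshold is reached; the function names and the total of 100 iterations are hypothetical.

```python
def search_phase(iteration, total_iterations, explore_percent=70):
    """Return the period that a zero-based iteration index belongs to."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    if iteration * 100 < explore_percent * total_iterations:
        return "exploration"
    return "exploitation"

def updates_for_step(phase, buffer_ready):
    """Number of policy updates to run after one search step."""
    if not buffer_ready:
        return 0  # wait until the memory threshold is reached
    return 1 if phase == "exploration" else 10

total = 100
phases = [search_phase(i, total) for i in range(total)]
update_counts = [updates_for_step(p, buffer_ready=True) for p in phases]
```

The higher update frequency in the exploitation period trades additional gradient steps, which are cheap, for fewer architecture evaluations, which are expensive.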
We use a progressive proxy task in order to collect the rewards quickly. When a new cell is added, we train the current full trajectory for one epoch and calculate the reward for the current cell. Within a trajectory, the previous cells' weights are kept and trained together with the new cell. In order to accurately estimate the Q-value of each state-action pair, we reset the weights of the neural network after finishing the design of the entire architecture trajectory.
In this paper, we use the CIFAR-10 dataset [29] to evaluate the effectiveness and efficiency of the proposed E2GAN framework. The CIFAR-10 dataset consists of 50,000 training images and 10,000 test images with a resolution of 32×32. We use its training set without any data augmentation technique to search for the architecture with the highest cumulative return for a GAN generator. Furthermore, to evaluate the transferability of the discovered architecture, we also adopt the STL-10 dataset [8] to train the network without any other data augmentation to make a fair comparison to previous works.
To verify the effectiveness of the off-policy framework and to enable a fair comparison, we use the same search space as used in the AutoGAN experiments [11]. There are five control variables: 1) Skip operation, which is a binary value indicating whether the current cell takes a skip connection from any specific cell as its input. Note that each cell could take multiple skip connections from other preceding cells. 2) Pre-activation [21] and post-activation convolution block. 3) Three types of normalization operations, including batch normalization [24], instance normalization [51], and no normalization. 4) Upsampling operation, which is standard in current image generation GANs, including bilinear upsampling, nearest neighbor upsampling, and stride-2 deconvolution. 5) Shortcut operation.
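The five control variables can be written down as a small lookup table, together with a decoder that turns per-variable logits into a concrete cell description. This is a hypothetical sketch: the option names, the treatment of the shortcut as a binary choice, and the use of an argmax over the softmax output (rather than sampling) are assumptions made for illustration.

```python
import math

SEARCH_SPACE = {
    "conv_block": ["pre_activation", "post_activation"],
    "normalization": ["batch_norm", "instance_norm", "none"],
    "upsampling": ["bilinear", "nearest", "deconv_stride2"],
    "shortcut": [False, True],
}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_cell(logits_per_choice, skip_logits):
    """Map policy logits to one cell configuration.

    logits_per_choice: dict with one logit list per key of SEARCH_SPACE
    skip_logits: one [no_skip, skip] logit pair per preceding cell
    """
    cell = {}
    for name, options in SEARCH_SPACE.items():
        probs = softmax(logits_per_choice[name])
        cell[name] = options[probs.index(max(probs))]
    # One binary skip decision for every preceding cell.
    cell["skip"] = [softmax(pair)[1] > 0.5 for pair in skip_logits]
    return cell

example_logits = {
    "conv_block": [0.1, 2.0],
    "normalization": [3.0, 0.0, -1.0],
    "upsampling": [0.5, 0.2, 0.1],
    "shortcut": [0.0, 1.0],
}
cell = decode_cell(example_logits, skip_logits=[[0.0, 2.0], [1.0, -1.0]])

# Number of configurations per cell, not counting the skip connections.
configs_without_skip = 1
for options in SEARCH_SPACE.values():
    configs_without_skip *= len(options)
```

The number of skip decisions grows with the depth of the cell, which is why they are passed separately from the fixed-size choices.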
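| Methods | Inception Score | FID | Search Cost (GPU days) |
| --- | --- | --- | --- |
| DCGAN [42] | | | |
| Improved GAN [45] | | | |
| LRGAN [60] | | | |
| DFM [55] | | | |
| ProbGAN [20] | 7.75 | 24.60 | |
| WGAN-GP, ResNet [14] | | | |
| Splitting GAN [13] | | | |
| SN-GAN [36] | | | |
| MGAN [22] | | 26.7 | |
| Dist-GAN [50] | | | |
| Progressive GAN [26] | 8.80 ± .05 | | |
| Improv MMD GAN [54] | 8.29 | 16.21 | |
| Random search-1 [11] | 8.09 | 17.34 | |
| Random search-2 [11] | 7.97 | 21.39 | |
| AGAN [52] | | 30.5 | 1200 |
| AutoGAN-top1 [11] | | 12.42 | 2 |
| AutoGAN-top2 [11] | | 13.67 | 2 |
| AutoGAN-top3 [11] | | 13.68 | 2 |
| E2GAN-top1 | | 11.26 | 0.3 |
| E2GAN-top2 | | 12.96 | 0.3 |
| E2GAN-top3 | | 12.48 | 0.3 |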
The generator architecture discovered by E2GAN on the CIFAR-10 training set is displayed in Figure 3. For the task of unconditional CIFAR-10 image generation (no class labels used), several notable observations can be summarized:
E2GAN prefers post-activation convolution blocks to pre-activation convolution blocks. This finding is contrary to AutoGAN's preference, but coincides with previous experience from human experts.
E2GAN prefers the use of batch normalization. This finding is also contrary to AutoGAN's choice, but is in line with experts' common practice.
E2GAN prefers bilinear upsampling to nearest neighbour upsampling. In theory, this provides a finer upsampling ability between different cells.
| Methods | Inception Score | FID | Search Cost (GPU days) |
| --- | --- | --- | --- |
| D2GAN [39] | 7.98 | | Manual |
| DFM [55] | | | Manual |
| ProbGAN [20] | | 46.74 | Manual |
| SN-GAN [36] | | | Manual |
| Dist-GAN [50] | | 36.19 | Manual |
| Improving MMD GAN [54] | 9.34 | 37.63 | Manual |
| AGAN [52] | | 52.7 | 1200 |
| AutoGAN-top1 [11] | | 31.01 | 2 |
| E2GAN-top1 | 9.51 ± .09 | 25.35 | 0.3 |
Our E2GAN framework takes only about 0.3 GPU days for searching, while AGAN spends 1200 GPU days and AutoGAN spends 2 GPU days.
We train the discovered E2GAN architecture from scratch for 500 epochs and summarize the IS and FID scores in Table 1. On the CIFAR-10 dataset, our model achieves a highly competitive FID of 11.26 compared to the published results of AutoGAN [11] and handcrafted GANs [42, 45, 60, 55, 20, 14, 13, 36, 22, 54]. In terms of IS, E2GAN is also highly competitive with AutoGAN [11]. We additionally report the performance of the top-2 and top-3 architectures discovered in one search. Both have higher performance than the respective AutoGAN counterparts.
We also test the transferability of E2GAN. We retrain the weights of the discovered E2GAN architecture using the STL-10 training and unlabeled set for the unconditional image generation task. E2GAN achieves a highly competitive performance on both IS (9.51) and FID (25.35), as shown in Table 2.
Because our main contribution is the new formulation and the use of off-policy RL for the GAN architecture search framework, we compare the proposed method directly with existing RL-based algorithms. We use the exact same search space as AutoGAN, which does not include the search for a discriminator. As GAN training is an interactive procedure between generator and discriminator, one might expect better performance if the search is conducted on both networks. We report our scores using the exact same evaluation procedure provided by the authors of AutoGAN. The reported scores are based on the best models achieved during training on a 20-epoch evaluation interval. Mean and standard deviation of the IS score are calculated based on the 10-fold evaluation of 50,000 generated images. We additionally report the performance curves against training steps of E2GAN and AutoGAN for three runs in the supplementary material.

| Methods | Inception Score | FID | Search Cost (GPU days) |
| --- | --- | --- | --- |
| AutoGAN-top1 [11] | | 12.42 | 2 |
| E2GAN (IS and FID as reward) | | | 0.3 |
| E2GAN (IS only as reward) | 8.81 ± .11 | | 0.1 |
IS and FID are the two main evaluation metrics for GANs. We conduct an ablation study using different combinations of them as the reward. Specifically, we compare IS only (trade-off factor $\alpha = 0$) and the combination of IS and FID ($\alpha > 0$). Our agent successfully discovered two different architectures. When only IS is used as the reward signal, the agent discovered a different architecture using only 0.1 GPU days. The searched architecture achieved a state-of-the-art IS of 8.86, as shown in Table 3, but a relatively plain FID of 15.78. This demonstrates the effectiveness of the proposed method, as we are encouraging the agent to find the architecture with a higher IS. Interestingly, this special architecture shows that, at least in certain edge cases, IS and FID may not always have a strong positive correlation. This finding motivates us to additionally use FID as part of the reward. When both IS and FID are used as the reward signal, the discovered architecture performs well in terms of both metrics. The search with this combined reward takes 0.3 GPU days (compared to 0.1 GPU days for the IS-only optimization) because of the relatively expensive FID computation.
We train our agent with 3 different seeds. As shown in Figure 5, we observe that the policy of our agent converges steadily in the exploitation period. E2GAN can find similar architectures with relatively good performance on the proxy task.
In this work, we proposed a novel off-policy reinforcement learning method, E2GAN, to efficiently and effectively search for GAN architectures. We reformulated the problem as an MDP and overcame the challenges of using off-policy data. We first introduced a new progressive state representation. We additionally introduced a new reward function, which allowed us to target the potential global optimization in our MDP formulation. E2GAN achieves state-of-the-art efficiency in GAN architecture search, and the discovered architecture shows highly competitive performance compared to other state-of-the-art methods. In future work, we plan to simultaneously optimize the generator and discriminator architectures in a multi-agent context.
The contributions of Yuan Tian, Qin Wang, and Olga Fink were funded by the Swiss National Science Foundation (SNSF) Grant no. PP00P2_176878.
Proceedings of the IEEE International Conference on Computer Vision, pp. 2745–2754.
StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797.
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223.
Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2902–2911.
Improving MMD-GAN training with repulsive loss function. In International Conference on Learning Representations.