Evolutionary-Neural Hybrid Agents for Architecture Search

11/24/2018 ∙ by Krzysztof Maziarz, et al.

Neural Architecture Search has recently shown the potential to automate the design of Neural Networks. Neural Network agents trained with Reinforcement Learning can learn complex patterns and can explore a vast and compositional search space. On the other hand, evolutionary algorithms offer the greediness and sample efficiency needed for such an application, as each sample requires a considerable amount of resources. We propose a class of Evolutionary-Neural hybrid agents (Evo-NAS) that retain the best qualities of the two approaches. We show that the Evo-NAS agent can outperform both Neural and Evolutionary agents, both on a synthetic task and on architecture search for a suite of text classification datasets.




1 Introduction

Neural Networks (NNs) have yielded success in many supervised learning problems. However, designing state-of-the-art deep learning algorithms requires many decisions, which normally involve human time and expertise. As an alternative, Auto-ML approaches aim to automate manual design with meta-learning agents. Many different approaches have been proposed for architecture optimization, including random search, evolutionary algorithms, Bayesian optimization, and NNs trained with Reinforcement Learning.

Deep Reinforcement Learning (deep RL) is one of the most common approaches. It involves sampling architectures from a distribution, which is modeled by a deep neural network agent. The parameters of the agent's NN are trained with RL to maximize the performance of the generated models on the downstream task. Architecture search based on deep RL has yielded success in the automatic design of state-of-the-art RNN cells (Zoph & Le, 2017), convolutional blocks (Zoph et al., 2017), activation functions (Ramachandran et al., 2018), optimizers (Bello et al., 2017; Wichrowska et al., 2017) and data augmentation strategies (Cubuk et al., 2018).

Recently, (Real et al., 2018) showed that evolutionary approaches, with appropriate regularization, can match or outperform deep RL methods on architecture search tasks where sample efficiency is critical. Evolutionary approaches can efficiently leverage a single good model by generating similar models via a mutation process. Deep RL methods generate new models by sampling from a learned distribution, and cannot easily latch on to the patterns of a single model unless it has been promoted multiple times through the learning process. However, evolutionary methods have the disadvantage of relying on heuristics or random sampling when choosing mutations. Unlike approaches based on a Neural Network (NN) agent, evolutionary approaches are unable to learn patterns to drive the search.

The main contribution of this paper is to introduce a class of Evolutionary-Neural hybrid agents (Evo-NAS). We propose an evolutionary agent whose mutations are guided by a NN trained with RL. This combines the sample efficiency of Evolutionary agents with the ability to learn complex patterns.

In Section 3 we give a brief description of state-of-the-art Neural and Evolutionary agents, and introduce the Evo-NAS agent in Section 4. Then, in Section 6, we present and discuss the properties of the proposed Evolutionary-Neural agent by applying it to a synthetic task. Finally, we apply Evo-NAS to architecture search benchmarks, showing that it outperforms both RL-based and Evolution-based algorithms on architecture search for a variety of text and image classification datasets.

2 Related Work

In recent years, progress has been made in automating the design process required to produce state-of-the-art neural networks. Recent methods have shown that learning-based approaches can achieve state-of-the-art results on ImageNet (Zoph & Le, 2017; Liu et al., 2017b). These results have subsequently been scaled by transferring architectural building blocks between datasets (Zoph et al., 2017). Some works explicitly address resource efficiency (Zhong et al., 2018; Pham et al., 2018), which is crucial, as architecture search is known to require a large amount of resources (Zoph & Le, 2017).

Another important approach to architecture search is neuro-evolution (Floreano et al., 2008; Stanley et al., 2009; Real et al., 2017; Conti et al., 2017). Recent work has highlighted the importance of regularizing the evolutionary search process, showing that evolution can match or outperform a learning-based baseline (Real et al., 2018). Others have applied genetic algorithms to evolve the weights of the model (Such et al., 2017).

Other than deep RL and evolution, different approaches have been applied to architecture search and hyper-parameter optimization: cascade-correlation (Fahlman & Lebiere, 1990), boosting (Cortes et al., 2016), deep-learning based tree search (Negrinho & Gordon, 2017), hill-climbing (Elsken et al., 2017) and random search (Bergstra & Bengio, 2012).

3 Baselines

We compare the proposed Evo-NAS agent with a set of baselines. These baseline algorithms include the state-of-the-art approaches in architecture search literature (Zoph & Le, 2017; Real et al., 2018).

Random Search (RS)

generates a new model by sampling every architectural choice from a uniform distribution over the available actions. Thus, it disregards the models generated in the past and their rewards. The performance of the RS agent gives a sense of the complexity of the task at hand, and allows one to estimate the quality gains attributable to the use of a more complex agent.
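As a minimal sketch, the Random Search baseline can be written as a single uniform draw per architectural choice (the toy search space below is purely illustrative, not the paper's):

```python
import random

def random_search_sample(search_space):
    """Sample one architecture by drawing each choice uniformly at random,
    ignoring all previously generated models and their rewards."""
    return {name: random.choice(options) for name, options in search_space.items()}

# Hypothetical toy search space, for illustration only.
space = {
    "num_layers": [1, 2, 3, 4],
    "hidden_units": [64, 128, 256],
    "activation": ["relu", "tanh", "swish"],
}
model = random_search_sample(space)
```

Each call is independent of every other call, which is exactly why RS serves as a complexity baseline rather than a learning agent.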

Neural Architecture Search (NAS)

(Zoph & Le, 2017) uses an RNN based agent to perform a sequence of architectural choices that define a generated model. These choices can also include hyper-parameters such as learning rate. The resulting model is then trained on a downstream task, and its quality on a validation set is computed. This quality metric serves as the reward for training the agent using a policy gradient approach. In the following sections, we will refer to the standard NAS agent as the Neural agent.
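A highly simplified stand-in for the NAS controller is sketched below. To stay self-contained it replaces the RNN with one learnable logit vector per decision (an illustrative assumption, not the paper's architecture), trained with a Reinforce-style policy-gradient step:

```python
import math, random

class NeuralAgentSketch:
    """Simplified stand-in for the NAS controller: one learnable logit
    vector per architectural decision instead of a recurrent network."""

    def __init__(self, choices_per_step, lr=0.1):
        self.logits = [[0.0] * n for n in choices_per_step]
        self.lr = lr

    def _probs(self, logits):
        # Numerically stable softmax over one decision's logits.
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        return [e / z for e in exps]

    def sample(self):
        """Sample one action per decision from the learned distributions."""
        actions = []
        for logits in self.logits:
            probs = self._probs(logits)
            actions.append(random.choices(range(len(logits)), weights=probs)[0])
        return actions

    def update(self, actions, advantage):
        """Policy-gradient step: raise the log-probability of the taken
        actions, scaled by the reward advantage."""
        for logits, a in zip(self.logits, actions):
            probs = self._probs(logits)
            for i in range(len(logits)):
                grad = (1.0 if i == a else 0.0) - probs[i]  # d log p(a) / d logit_i
                logits[i] += self.lr * advantage * grad
```

In the real agent the choices are produced autoregressively by an RNN, so later decisions are conditioned on earlier ones; this sketch drops that conditioning for brevity.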

Regularized Evolution Architecture Search

(Real et al., 2018) is a variant of the tournament selection method (Goldberg & Deb, 1991). A population of generated models is improved in iterations. At each iteration, a sample of models is selected at random. The best model of the sample, the parent model, is mutated to produce a new generated model with an altered architecture, which is trained and added to the population. The regularization consists of discarding the oldest model of the population instead of the one with the lowest reward. This avoids over-weighting "lucky" models that may have reached a high reward due to the variance of the NN's training process. In the following sections, we will refer to the Regularized Evolution agent as the Evolutionary agent.
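The aging-evolution loop described above can be sketched as follows (the reward, model representation, and mutation operator are left abstract; defaults are illustrative):

```python
import collections
import random

def regularized_evolution(reward_fn, random_model, mutate, cycles=300,
                          population_size=20, sample_size=5):
    """Regularized (aging) evolution: tournament selection, with the
    oldest population member discarded instead of the worst one."""
    population = collections.deque()
    history = []
    # Seed the population with random models.
    for _ in range(population_size):
        m = random_model()
        entry = (m, reward_fn(m))
        population.append(entry)
        history.append(entry)
    for _ in range(cycles):
        sample = random.sample(list(population), sample_size)
        parent = max(sample, key=lambda e: e[1])   # tournament winner
        child = mutate(parent[0])
        entry = (child, reward_fn(child))
        population.append(entry)
        history.append(entry)
        population.popleft()                       # age-based removal
    return max(history, key=lambda e: e[1])
```

On a toy bit-string task (reward = number of ones, mutation = flip one bit), the loop quickly climbs toward the all-ones string, illustrating how mutation exploits a single good model.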

4 Evolutionary-Neural Architecture Search

We propose a hybrid agent, which combines the sampling efficiency of Evolutionary approaches with the ability to learn architectural patterns of a RL trained Neural agent.

As for the Evolutionary agent, new models are generated by mutating a parent model, which is selected from the population of the most recent models. Unlike the Evolutionary agent, the mutations are not sampled at random among the possible architectural choices, but from distributions inferred by a NN trained with RL.

Figure 1: Overview of how the Evo-NAS agent creates a sequence of actions specifying a generated model, given a parent model. The colored blocks represent actions that the agent has to perform (e.g. architectural and hyperparameter choices). Each action is sampled from the distribution learned by the agent's neural network with probability , or reused from the parent trial with probability .

Specifically, at mutation sampling time, the sequence of architectural choices of the parent model is fed into the Evolutionary-Neural controller. The Evo-NAS controller can either reuse the parent's architectural choice, or sample a new one from the learned distributions. If an action is re-sampled, it is conditioned on all the prior actions. The probability of mutating a parent action is a hyperparameter of the model, which we refer to as the mutation probability: . The Evo-NAS sampling algorithm is represented in Figure 1. Refer to Appendix A for a comparison with the baseline algorithms.
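The per-action keep-or-resample rule can be sketched as below. For brevity the learned distributions are fixed per position here; in the actual agent a re-sampled action is conditioned on all prior actions via the controller network:

```python
import random

def evo_nas_sample(parent_actions, agent_distributions, mutation_prob):
    """For each position, keep the parent's choice with probability
    1 - mutation_prob, or re-sample from the agent's learned
    distribution with probability mutation_prob."""
    child = []
    for parent_action, dist in zip(parent_actions, agent_distributions):
        if random.random() < mutation_prob:
            choices, weights = zip(*dist.items())
            child.append(random.choices(choices, weights=weights)[0])
        else:
            child.append(parent_action)
    return child
```

Setting the mutation probability to 0 reproduces the parent exactly, while setting it to 1 ignores the parent and recovers pure sampling from the learned distributions.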

The Evo-NAS agent is initialized so that the distributions over the architectural choices are uniform. Thus, a freshly initialized Evo-NAS agent produces random mutations, just like an Evolutionary agent. During training, the Evo-NAS agent's NN parameters are updated using a policy gradient approach to maximize the expected quality metric achieved by the generated models on the downstream task. Therefore, the distributions over actions become more skewed with time, promoting the patterns of the good models. In contrast, an Evolutionary agent is unable to learn mutation patterns, since the distributions from which the mutations are sampled remain constantly uniform.

In summary, Evo-NAS is designed to retain the sample efficiency of Evolution, which efficiently leverages good generated models through a mutation strategy, while its underlying neural network can learn complex patterns, as a Neural agent does.

5 Training algorithms

We consider two alternative training algorithms for NN-based agents such as Neural and Evo-NAS:

Reinforce

(Williams, 1992). Reinforce is the standard on-policy policy-gradient training algorithm. It is often considered the default choice for applications where an agent needs to be trained to explore a complex search space to find the optimal solution, as is the case in architecture search. This approach has the disadvantage of not being as sample efficient as off-policy alternatives, which are able to reuse samples. Reinforce is the training algorithm chosen in the original NAS paper (Zoph & Le, 2017).

Priority Queue Training

(PQT) (Abolafia et al., 2018). With PQT, the NN gradients are generated to directly maximize the log-likelihood of the best samples produced so far. This training algorithm has higher sample efficiency than Reinforce, as good models generate multiple updates. PQT has the simplicity of supervised learning, since the best models are directly promoted as if they constituted the supervised training set, with no need for reward scaling as in Reinforce, or sample probability scaling as in off-policy training.
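The bookkeeping behind PQT amounts to a bounded priority queue of the best trials seen so far; a minimal sketch (the class name and interface are our own, for illustration):

```python
import heapq

class PriorityQueueTrainer:
    """Keep the top-k trials seen so far; the agent's NN is then trained
    to maximize the log-likelihood of these best action sequences, as if
    they formed a supervised training set."""

    def __init__(self, k=10):
        self.k = k
        self.best = []  # min-heap of (reward, actions) pairs

    def record(self, reward, actions):
        """Add a finished trial; evict the worst of the best if over capacity."""
        heapq.heappush(self.best, (reward, tuple(actions)))
        if len(self.best) > self.k:
            heapq.heappop(self.best)

    def training_targets(self):
        # No reward scaling: the top trials are promoted directly.
        return [actions for _, actions in self.best]
```

Because each top trial can contribute gradient updates many times, a single good model is exploited repeatedly, which is the source of PQT's sample efficiency over Reinforce.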

6 Experiments

To highlight the properties of the different approaches, we consider two characteristics: 1) whether the agent has learnable parameters, enabling it to learn patterns; 2) whether the agent is capable of efficiently leveraging good past experiences by using mutations. These two characteristics are independent, and for a given architecture search algorithm, either of them may or may not be present. In Table 1, we summarize the characteristics of the methods we aim to compare.

Algorithm Learning Mutation
Random Search No No
Neural agent Yes No
Evolutionary agent No Yes
Evo-NAS agent Yes Yes
Table 1: Properties of the compared architecture search algorithms.

6.1 Synthetic task: Learn to count

We first compare the different agents on a synthetic toy task, designed to have properties and complexity similar to those of architecture search tasks.

This task can be described as learning to count. The agent is asked to choose a sequence of integer numbers, where each number is selected from the set . The complexity of the task can be controlled with the value of .

After producing a sequence, , the agent receives the reward:

This reward is designed to encourage every two adjacent numbers to be close to each other, but also to keep the first number small and the last number large. The maximum reward is achieved by the single optimal sequence . Note that the reward is in the range . For a detailed definition of the synthetic task and an analysis of its properties, see Appendix B.
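The exact formula is given in Appendix B; since it is not reproduced here, the sketch below is a hypothetical reconstruction from the prose description only (squared-gap penalties and the 1/(1 + imbalance) form are our assumptions):

```python
def imbalance(seq, n):
    """Hypothetical imbalance: penalize (squared) gaps between adjacent
    numbers, a large first number, and a small last number."""
    gaps = sum((b - a) ** 2 for a, b in zip(seq, seq[1:]))
    return (seq[0] - 1) ** 2 + gaps + (n - seq[-1]) ** 2

def reward(seq, n):
    # Reward grows as imbalance shrinks; bounded in (0, 1] under this sketch.
    return 1.0 / (1.0 + imbalance(seq, n))
```

Under this form, a smoothly increasing sequence from 1 to n scores far better than a jumbled one, matching the stated design intent.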

Figure 2: Comparison between Reinforce and Priority-Queue-Training (PQT) on the "Learn to count" synthetic task, for both the Neural (Left) and Evo-NAS (Right) agents. The plot shows the best reward attained (Y-axis) after a given number of trials (X-axis). Each experiment was run 20 times, and the shaded area represents a 70% confidence interval.

We perform a preliminary tuning of the hyperparameters of the agents to ensure a fair comparison. For the following experiments, we set the mutation probability , the population size , and the sample size , for both the Evolutionary and the Evo-NAS agent. We set the learning rate to 0.0005 for the Neural agent and 0.0001 for the Evo-NAS agent. We set the entropy penalty to 0.1 for the Neural agent and 0.2 for the Evo-NAS agent. PQT maximizes the log-likelihood of the top 5% of trials for the Neural agent and the top 20% of trials for the Evo-NAS agent.

We start by comparing Reinforce and PQT as alternative training algorithms for the Neural and Evo-NAS agents. The results of the experiments are shown in Figure 2. We found that PQT outperformed Reinforce for both the Neural and the Evo-NAS agent. For the Evo-NAS agent, the gain is especially pronounced at the beginning of the experiment. Thus, we conclude that PQT provides a more sample-efficient training signal than Reinforce. We will use PQT as the training algorithm for the Neural and Evo-NAS agents in the following experiments.

Figure 3: Comparison of different agents on the "Learn to count" synthetic task. On the Y-axis, the plots show the moving average of the reward over 50 trials (Left) and the best reward attained so far (Right). Each experiment was run 20 times, and the shaded area represents a 70% confidence interval.

Figure 3 shows the results of the comparison on the "Learn to count" task between the Random Search, Neural, Evolutionary and Evo-NAS agents. The Evolutionary agent finds better models than the Neural agent during the initial 1000 samples, while in the second half of the experiment, the Neural agent outperforms the Evolutionary agent. Our interpretation is that the Evolutionary agent's efficient exploitation gives it a better start by mutating good trials. The Neural agent needs 1000+ samples to learn the required patterns, but once this is achieved it can generate better samples than those produced by the Evolutionary agent's random mutations. The results show that the Evo-NAS agent achieves both the sample efficiency of Evolutionary approaches and the learning capability of Neural approaches. Evo-NAS's fast initial improvement shows its ability to take advantage of the sample efficiency of Evolution, while learning the proper mutation patterns allows it to keep outperforming the Evolutionary agent. The poor performance of the Random Search agent shows that the "Learn to count" task is non-trivial. Also, comparing with Figure 2, we see that the Neural agent would not have been able to catch up with the Evolutionary agent within the 5000 trials of this experiment had it been trained with Reinforce instead of PQT.

6.2 Text classification tasks

Figure 4: Results of the experiments on 7 text classification tasks. Each experiment was run 10 times. For each run, we selected the model that obtained the best ROC-AUC on the validation set (the best reward). These best models were then evaluated by computing the ROC-AUC on the holdout test set. The empty circles in the plot represent the test ROC-AUC achieved by each of the 10 best models. The filled circles represent the means of the empty circles. We superpose standard deviation bars.

We now compare the different agents on a real architecture search task. The Neural, Evolutionary and Evo-NAS agents are applied to the task of finding architectures for 7 text classification datasets. References for these datasets are provided in Appendix C Table 4.

Similarly to (Wong et al., 2018), we design a search space of common architectural and hyperparameter choices that define two-tower “wide and deep” models (Cheng et al., 2016). One tower is a deep FFNN, built by stacking: a pre-trained text-embedding module, a stack of fully connected layers, and a softmax classification layer. The other tower is a wide-shallow layer that directly connects the one-hot token encodings to the softmax classification layer with a linear projection. The wide tower allows the model to learn task-specific biases for each token directly, such as trigger words, while the deep tower allows it to learn complex patterns. The wide tower is regularized with L1 loss to promote sparsity, and the deep tower is regularized with L2 loss. The details of the search space are reported in Appendix C Table 5.

The agent defines the generated model architecture by selecting a value for every available architectural or hyper-parameter choice. The first action selects the pre-trained text-embedding module. The details of the text-embedding modules are reported in Appendix C Table 6. These modules are available via the TensorFlow Hub service (https://www.tensorflow.org/hub). Using pre-trained text-embedding modules has two benefits: first, it improves the quality of the generated models trained on smaller datasets, and second, it decreases the convergence time of the generated models. The optimizer for the deep column can be either Adagrad (Duchi et al., 2011) or Lazy Adam (https://www.tensorflow.org/api_docs/python/tf/contrib/opt/LazyAdamOptimizer). "Lazy Adam" refers to a commonly used version of Adam (Kingma & Ba, 2014) that computes the moving averages only on the current batch. These are efficient optimizers that allow halving the back-propagation time compared to more expensive optimizers such as Adam. The optimizer used for the wide column is FTRL (McMahan, 2011). Notice that this search space is not designed to discover original architectures that set a new state-of-the-art for this type of task; it is rather a medium-complexity architecture search environment that allows us to analyze the properties of the agents.

All the experiments in this section are run with a fixed budget: trials are trained in parallel for 2 hours with 2 CPUs each. Choosing a small budget allows us to run a higher number of replicas for each experiment, increasing the significance of the results. It also makes the budget accessible to most of the scientific community, thus increasing the reproducibility of the results.

During the experiments, the models sampled by the agent are trained on the training set of the current text classification task, and the area under the ROC curve (ROC-AUC) computed on the validation set is used as the reward for the agent. To compare the generated models that achieved the best reward, we compute the ROC-AUC on the holdout test set as the final unbiased evaluation metric. For the datasets that do not come with a pre-defined train/validation/test split, we split randomly: 80%/10%/10% respectively for the training, validation and test sets.
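The random 80/10/10 split can be sketched as follows (the seeded shuffle is our assumption, added only so the split is reproducible):

```python
import random

def split_dataset(examples, seed=0):
    """Random 80%/10%/10% train/validation/test split, for datasets
    that come without a pre-defined split."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```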

We use ROC-AUC instead of the more commonly used accuracy, since it provides a less noisy reward signal. We validated this hypothesis with a preliminary experiment on the ConsumerComplaints task: for a sample of 30 models, we computed 4 metrics: ROC-AUC on the validation and test sets, and accuracy on the validation and test sets. The Pearson correlation between the validation ROC-AUC and the test ROC-AUC was 99.96%, while that between the validation accuracy and the test accuracy was 99.70%. The scatter plots of these two sets are reported in Appendix C Figure 9.
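The correlation used above is the plain Pearson coefficient, which can be computed directly (a standard-definition sketch, equivalent to what any statistics library provides):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two paired metric lists,
    e.g. validation vs. test scores of the same models."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near 1 means the validation metric ranks models almost exactly as the test metric does, which is the property desired of a reward signal.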

Dataset Neural Evolutionary Evo-NAS
Table 2: Best ROC-AUC (%) on the test set for each algorithm and dataset. We report the average over 10 distinct architecture search runs, as well as ±2 standard error of the mean (s.e.m.). Bolding indicates the best performing algorithm, or those within 2 s.e.m. of the best.

We validate the results of the comparison between PQT and Reinforce from Section 6.1 by running 5 experiment replicas for each of the 7 tasks using the Neural agent with both training algorithms. We measure an average relative gain of in the final test ROC-AUC from using PQT instead of Reinforce.

We use PQT for the following experiments to train the Evo-NAS and Neural agents, maximizing the log-likelihood of the top 5 trials. For the Evo-NAS and Evolutionary agents, we set the mutation probability to . The Evo-NAS and Neural agents use PQT with learning rate 0.0001 and entropy penalty 0.1. These parameters were selected with a preliminary tuning to ensure a fair comparison.

To measure the quality of the models generated by the three agents, we run 10 architecture search experiment replicas for each of the 7 tasks, and we measure the test ROC-AUC obtained by the best model generated by each experiment replica. The results are summarized in Table 2 and Figure 4. The Evo-NAS agent achieves the best performance on all 7 tasks. On 3 out of 7 tasks it significantly outperforms both the Neural and the Evolutionary agents.

We also report the number of trials each of the agents performed during the 2h long experiments, and summarize the results in Appendix C in Table 7 and Figure 10. We find that the Evolutionary and Evo-NAS agents strongly outperform the Neural agent in terms of the number of trials performed. The Evo-NAS agent achieves the largest number of trials on 6 out of 7 datasets, and the Evolutionary agent on 5 out of 7; on 4 datasets the two perform joint best. This suggests that evolutionary algorithms are biased towards faster models, as shown in (Real et al., 2018).

An in-depth analysis of the architectures that achieved the best performance is beyond the scope of this paper. However, we mention a few relevant patterns of the best architectures generated across tasks. The FFNNs for the deep part of the network are often shallow and wide. The learning rate for both the wide and deep parts is at the bottom of the range (). The L1 and L2 regularization are often disabled. Our interpretation of this last observation is that reducing the number of parameters is a simpler and more effective regularization, which is preferred over adding L1 and L2 factors to the loss.

To verify that the learning patterns highlighted in Section 6.1 generalize, we plot in Figure 5 the reward moving average for two tasks: 20Newsgroups and ConsumerComplaints. For these experiments, we extended the time budget from 2h to 5h; this extension is needed to capture the longer-term trends exhibited by the Neural and Evo-NAS agents. We run 3 replicas for each task. In the early stages of the experiments, the quality of the samples generated by the Neural agent is on the same level as that of randomly generated samples, while the quality of the samples generated by the Evo-NAS and Evolutionary agents grows steadily. In the second half of the experiments, the Neural agent starts applying the learned patterns to the generated samples. The quality of the samples generated by the Evolutionary agent flattens, which we assume is because the quality of the samples in the population is close to the optimum and cannot improve further, since good mutation patterns cannot be learned. Finally, we observe that the Evo-NAS agent keeps generating better samples.

Figure 5: Reward moving average for the compared agents. The average is computed over a window of 50 consecutive trials. We ran 3 replicas for each experiment. The shaded area represents the minimum and maximum values of the rolling average across the runs.

6.3 Image classification task

We also compare the agents on a different architecture search domain: image classification. This is a higher complexity task and the most common benchmark for architecture search (Zoph & Le, 2017; Real et al., 2018; Liu et al., 2017a, 2018).

As shown in recent studies (Zoph et al., 2017; Liu et al., 2017a), the definition of the architecture search space is critical for achieving state-of-the-art performance. In this line of experiments, we reuse the Factorized Hierarchical Search Space defined in (Tan et al., 2018), a recently proposed search space that has been shown to reach state-of-the-art performance. We abstain from proposing an improved search space that could set a new state-of-the-art for this benchmark, since the main objective of this work is to analyze and compare the properties of the agents.

As the target image classification task we use ImageNet (Russakovsky et al., 2015). As it is common in the architecture search literature, we create a validation set by randomly selecting 50K images from the training set. The accuracy computed on this validation set is used as the reward for the agents, while the original ImageNet test set is used only for the final evaluation.

Following common practice in previous architecture search work (Zoph & Le, 2017; Real et al., 2018; Tan et al., 2018)

, we conduct architecture search experiments on a smaller proxy task, and then transfer the top-performing discovered model to the target task. As a simple proxy task we use ImageNet itself, but with fewer training steps. During architecture search, we train each generated model for 5 epochs using an aggressive learning schedule, and evaluate the model on the 50K validation images.

During a single architecture search experiment, each agent trains thousands of models. However, only the model achieving the best reward is transferred to the full ImageNet. Following (Tan et al., 2018), for full ImageNet training we train for 400 epochs using the RMSProp optimizer with decay 0.9 and momentum 0.9, batch norm momentum 0.9997, and weight decay 0.00001. The learning rate increases linearly from 0 to 0.256 during the first 5-epoch warmup training stage, and then decays by a factor of 0.97 every 2.4 epochs. We use standard Inception preprocessing and resize input images to .
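The warmup-then-decay learning rate schedule described above can be sketched as a function of the (fractional) epoch; the function name and continuous-decay formulation are illustrative:

```python
def learning_rate(epoch, peak=0.256, warmup_epochs=5,
                  decay_rate=0.97, decay_every=2.4):
    """Linear warmup from 0 to the peak over the first 5 epochs, then
    decay by a factor of 0.97 every 2.4 epochs."""
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs
    steps = (epoch - warmup_epochs) / decay_every
    return peak * decay_rate ** steps
```

In practice such a schedule is usually applied stepwise per training step rather than as a continuous function; the continuous form above is just the simplest faithful rendering of the stated numbers.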
Every architecture search experiment trains 60 generated models in parallel. Each model is trained on a Cloud TPUv2, and takes approximately 3 hours to complete training on the proxy task. Because of the high cost of the experiments, we limit the tuning of the agents' hyper-parameters, and set them to the values that worked well in the previous experiments. The Evo-NAS and Neural agents use PQT with learning rate 0.0001 and entropy penalty 0.1. PQT maximizes the log-likelihood of the top 5% of trials. The population size is and the sample size , for both the Evolutionary and the Evo-NAS agent. The only parameter we tune in a preliminary step is the mutation probability , since in our experience this parameter is the most sensitive to the complexity of the search space, and the Factorized Hierarchical Search Space used for these experiments is orders of magnitude more complex: it contains different architectures. To tune , we run 4 experiments using the Evolutionary agent with values: , and choose the best to use for both the Evo-NAS and Evolutionary agents. Due to the high cost of the experiments, we do not run experiment replicas. Notice that this is common practice for architecture search in the image domain.

Figure 6: Quality metrics of the different agents during the first 6k trials of architecture search on the image classification proxy task. We report the moving average of the reward over 50 trials (Left) and the best reward attained so far (Right).

In Figure 6 we show the metrics tracked during the architecture search experiments. Each architecture search experiment required ~304 hours to produce 6000 trials. The plot of the moving average of the reward confirms the properties observed in the previous lines of experiments: the Neural agent has a slower start, while Evo-NAS retains the initial sample efficiency of the Evolutionary agent and improves the quality of the generated samples in the longer term by leveraging its learning ability. The same properties are visible in the plot of the best reward in Figure 6 (Right). The Neural agent has a slower start, but is able to close the gap with the Evolutionary agent in the longer term, while Evo-NAS shows an initial rate of improvement comparable to the Evolutionary agent and outperforms the other agents in the later stages.

As an additional baseline, we run a Random Search agent for up to 3000 trials. As expected, its reward moving average does not improve over time. The maximum reward it achieves is 52.54, while Evo-NAS achieves a reward of 57.68 with the same number of trials. These results confirm the complexity of the task. Refer to Appendix D Figure 11 for more details on the comparison with Random Search.

The best rewards achieved by each agent are respectively: for the Neural agent, for the Evolutionary agent, and for Evo-NAS. Trial 2003 of the Evo-NAS agent is the first one that outperforms all models generated by the other agents. Thus, Evo-NAS surpasses the performance of the other agents with only 1/3 of the trials. Furthermore, during the course of the entire experiment, Evo-NAS generates 1063 models achieving higher reward than any model generated by the other agents.

Finally, for each of the agents we select the generated model that achieved the best reward, train it on the full ImageNet task, and evaluate on the held-out test set. This allows us to measure the extent to which the reward gains on the proxy task translate to the full task, and also to compare with other results published on this benchmark. The results are reported in Table 3, which shows that the reward gains translate to the final task. Also, the achieved test errors are comparable to the best results published on this benchmark. Notice that this comparison is influenced by factors unrelated to the choice of the agent; for example, the definition of the architecture search space is an important factor in determining the quality of the generated models on the downstream task. MNasNet-92 is the only published result of a network generated by exploring the same Factorized Hierarchical Search Space (Tan et al., 2018). It achieves slightly lower results even compared to our Neural agent baseline. Our hypothesis is that this delta can be explained by MNasNet having been generated by maximizing a hybrid reward that also accounts for model latency.

Architecture | Test error (top-1) | Search cost (gpu-days) | Search method
Inception-v1 (Szegedy et al., 2015) 30.2 manual
MobileNet (Howard et al., 2017) 29.4 manual
ShuffleNet-v1 (Zhang et al., 2017) 29.1 manual
ShuffleNet-v2 (Zhang et al., 2017) 26.3 manual
DARTS (Liu et al., 2018) 26.9 4 gradient
NASNet-A (Zoph et al., 2017) 26.0 1800 rl
NASNet-B (Zoph et al., 2017) 27.2 1800 rl
NASNet-C (Zoph et al., 2017) 27.5 1800 rl
PNAS (Liu et al., 2017a) 25.8 225 smbo
AmoebaNet-A (Real et al., 2018) 25.5 3150 evo.
AmoebaNet-B (Real et al., 2018) 26.0 3150 evo.
AmoebaNet-C (Real et al., 2018) 24.3 3150 evo.
MnasNet-92 (Tan et al., 2018) 25.2 988 rl
Neural agent best model 24.78 740 rl
Evolutionary best model 24.70 740 evo.
Evo-NAS best model 24.57 740 evo. + rl
Table 3: Comparison of mobile-sized state-of-the-art image classifiers on ImageNet.

The architectures of the best models generated by the 3 agents show noticeable common patterns. The core of all 3 networks is mostly constructed from convolutions with kernel size 5 by 5, and the networks have a similar depth of 22 or 23 blocks. The networks found by the Evolutionary and Evo-NAS agents have the same first and last blocks, but Evo-NAS tends to use more filters (such as 192 and 384) in the later stages to achieve higher accuracy than the Evolutionary agent. For more details about the architecture structures, refer to Appendix D Figure 12.

7 Conclusion

We introduce a class of Evo-NAS hybrid agents, designed to retain both the sample efficiency of evolutionary approaches and the ability of Neural agents to learn complex architectural patterns. We experiment on synthetic, text and image classification tasks, analyze the properties of the proposed Evo-NAS agent, and show that it outperforms both Neural and Evolutionary agents. Additionally, we show that Priority Queue Training outperforms Reinforce on architecture search applications as well.


Appendix A Details on baseline architecture search algorithms

Figure 7: Overview of how the Evolutionary agent creates the sequence of actions specifying a generated model, given a parent trial. The colored blocks represent actions that the agent has to perform. Each action is resampled at random with probability p, or reused from the parent with probability 1 − p.
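The resample-or-reuse step depicted in Figure 7 can be sketched in a few lines. Note that `mutate`, the toy `space`, and the mutation probability value are illustrative choices of ours, not the paper's implementation:

```python
import random

def mutate(parent_actions, search_space, p=0.05):
    """Create a child action sequence from a parent trial.

    Each action is independently resampled uniformly at random with
    probability p, and copied from the parent otherwise.
    """
    child = []
    for i, parent_action in enumerate(parent_actions):
        if random.random() < p:
            child.append(random.choice(search_space[i]))
        else:
            child.append(parent_action)
    return child

# Toy search space: 4 decisions, each with its own set of options.
space = [["relu", "tanh"], [32, 64, 128], [True, False], [0.0, 0.1, 0.2]]
parent = ["relu", 64, True, 0.1]
child = mutate(parent, space, p=0.5)
assert len(child) == len(parent)
```

With a small p, a child differs from its parent in only a few positions, which is what makes the evolutionary agent a greedy local modifier of its best past trials.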

Figure 8: Overview of how the Neural agent samples a trial. The colored blocks represent actions that the agent has to perform. Each action is sampled from a distribution defined by an RNN.
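The autoregressive sampling in Figure 8 can be sketched as follows, where `dummy_policy_step` stands in for the trained RNN cell (here it emits uniform logits and passes its state through unchanged; the real agent's cell is learned with RL):

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_trial(policy_step, search_space, state=None):
    """Sample a sequence of actions autoregressively: at each step the
    recurrent policy turns its state and the previous action into logits
    over the current options, and the sampled action is fed back in."""
    prev_action, actions = None, []
    for options in search_space:
        logits, state = policy_step(state, prev_action, len(options))
        idx = random.choices(range(len(options)), weights=softmax(logits))[0]
        prev_action = options[idx]
        actions.append(prev_action)
    return actions

# Stand-in for a trained RNN cell: uniform logits, state passed through.
def dummy_policy_step(state, prev_action, num_options):
    return [0.0] * num_options, state

toy_space = [["relu", "tanh"], [32, 64, 128], [True, False]]
trial = sample_trial(dummy_policy_step, toy_space)
assert all(trial[i] in toy_space[i] for i in range(len(toy_space)))
```

Because each step conditions on the previously sampled action, the policy can learn compositional patterns across the sequence of architectural decisions.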

Appendix B Details on "Learn to count" toy task

Given a sequence, its imbalance is defined as follows:

Note that imbalance is high when:

  • two adjacent values produced by the agent are far apart

  • is large

  • is small

The reward observed by the agent after choosing a sequence is:

Notice that:

  • in order to maximize reward, the agent must minimize imbalance.

  • .

  • there is a single optimal sequence that achieves the reward of , namely .

The proposed toy task has multiple key properties:

  • The size of the search space is , which even for small is already too big for any exhaustive search algorithm to succeed.

  • As shown by our experiments in Section 6.1, Random Search performs very poorly. This allows us to attribute the discovery of good sequences to the properties of the algorithm, rather than to accidental discovery over time.

    The experimental observation that Random Search performs badly can be intuitively explained as follows. Let be a uniform distribution over , then:

    This suggests that, for a random sequence , can be expected to be much smaller than , which is confirmed by Figure 3.

  • The search space exhibits a sequential structure, with a notion of locally optimal decisions, making it a good task both for learning patterns, and for mutating past trials by local modifications.
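To make the search-space-size argument concrete: with n sequential decisions and m options per decision (illustrative symbols and values of ours; the exact quantities are elided in this extraction), the space has m^n elements, and since a single optimal sequence exists, uniform random search hits it with probability m^(-n) per trial:

```python
# Illustrative values: n sequential decisions, m options each.
n, m = 20, 10

search_space_size = m ** n               # grows exponentially with n
hit_probability = 1 / search_space_size  # chance one uniform sample is the optimum

# Even for these modest values, exhaustive search is hopeless.
assert search_space_size == 10 ** 20
assert hit_probability < 1e-19
```

This is why an agent must exploit the local, sequential structure of the task rather than rely on chance discoveries.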

Appendix C Details on text classification experiments

Dataset Reference
20Newsgroups (Lang, 1995)
Brown (Francis & Kučera, 1982)
ConsumerComplaints catalog.data.gov
McDonalds crowdflower.com
NewsAggregator (Lichman, 2013)
Reuters (Debole & Sebastiani, 2004)
SmsSpamCollection (Almeida et al., 2011)
Table 4: References for the datasets used in the text experiments.

Parameters Search space
1) Input embedding modules Refer to Table 6
2) Fine-tune input embedding module {True, False}
3) Use convolution {True, False}
4) Convolution activation {relu, relu6, leaky relu, swish, sigmoid, tanh}
5) Convolution batch norm {True, False}
6) Convolution max ngram length {2, 3}
7) Convolution dropout rate {0.0, 0.1, 0.2, 0.3, 0.4}
8) Convolution number of filters {32, 64, 128}
9) Number of hidden layers {0, 1, 2, 3, 5}
10) Hidden layers size {64, 128, 256}
11) Hidden layers activation {relu, relu6, leaky relu, swish, sigmoid, tanh}
12) Hidden layers normalization {none, batch norm, layer norm}
13) Hidden layers dropout rate {0.0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5}
14) Deep optimizer name {adagrad, lazy adam}
15) Lazy adam batch size {128, 256}
16) Deep tower learning rate {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}
17) Deep tower regularization weight {0.0, 0.0001, 0.001, 0.01}
18) Wide tower learning rate {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}
19) Wide tower regularization weight {0.0, 0.0001, 0.001, 0.01}
20) Number of training samples {1e5, 2e5, 5e5, 1e6, 2e6, 5e6}
Table 5: The search space defined for text classification experiments.
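As an illustration of how a trial maps onto this search space, the sketch below encodes a subset of Table 5 as a dictionary and draws one random configuration; the parameter names and the `sample_architecture` helper are ours, not the paper's code:

```python
import random

# A subset of the parameters from Table 5, written as a dictionary of
# options (the full space has 20 parameters).
SEARCH_SPACE = {
    "use_convolution": [True, False],
    "convolution_activation": ["relu", "relu6", "leaky relu", "swish",
                               "sigmoid", "tanh"],
    "number_of_hidden_layers": [0, 1, 2, 3, 5],
    "hidden_layers_size": [64, 128, 256],
    "deep_optimizer_name": ["adagrad", "lazy adam"],
    "deep_tower_learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1, 0.5],
}

def sample_architecture(space):
    """Draw one configuration uniformly at random (one Random Search trial)."""
    return {name: random.choice(options) for name, options in space.items()}

config = sample_architecture(SEARCH_SPACE)
assert all(config[k] in SEARCH_SPACE[k] for k in SEARCH_SPACE)
```

Each agent (Random, Neural, Evolutionary, Evo-NAS) differs only in how it chooses among these per-parameter options, not in the space itself.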

Language/ID                  Dataset size   Embed dim.   Vocab. size   Training algorithm   TensorFlow Hub handle (prefix: https://tfhub.dev/google/)
English-small 7B 50 982k Lang. model nnlm-en-dim50-with-normalization/1
English-big 200B 128 999k Lang. model nnlm-en-dim128-with-normalization/1
English-wiki-small 4B 250 1M Skipgram Wiki-words-250-with-normalization/1
Universal-sentence-encoder - 512 - (Cer et al., 2018) universal-sentence-encoder/2
Table 6: Options for text input embedding modules. These are pre-trained text embedding tables, trained on corpora of different languages and sizes. The text input to these modules is tokenized according to the module dictionary and normalized by lower-casing and stripping rare characters. We provide the handles for the modules that are publicly distributed via the TensorFlow Hub service (https://www.tensorflow.org/hub).

Figure 9: Correlation of validation accuracy with test accuracy (Left) and validation ROC-AUC with test ROC-AUC (Right). The correlation is higher for ROC-AUC. For plotting the correlations, we used the ConsumerComplaints dataset.

Figure 10: Number of trials performed for the experiments from Figure 4. The empty circles represent the number of trials performed in each of the 10 experiment replicas. The filled circles represent the means of the empty circles. We superpose standard deviation bars.

Dataset Neural Evolutionary Evo-NAS
Table 7: The number of trials performed for the experiments from Figure 4. We report the average over 10 runs, as well as 2 standard-error-of-the-mean (s.e.m.). Bolding indicates the algorithm with the highest number of trials or those that have performed within 2 s.e.m. of the largest number of trials.

Dataset Train samples Valid samples Test samples Classes Lang Mean text length
20 Newsgroups 15,076 1,885 1,885 20 En 2,000
Brown Corpus 400 50 50 15 En 20,000
Consumer Complaints 146,667 18,333 18,334 157 En 1,000
McDonalds 1,176 147 148 9 En 516
News Aggregator 338,349 42,294 42,294 4 En 57
Reuters 8,630 1,079 1,079 90 En 820
SMS Spam Collection 4,459 557 557 2 En 81
Table 8: Statistics of the text classification tasks.

Appendix D Details on image experiments

Figure 11: Quality metrics of the different agents during the first 3k trials of architecture search on the image classification proxy task. We report the moving average of the reward over 50 trials (Left) and the best reward attained so far (Right).

Figure 12: Neural networks achieving the best reward for image classification generated by: (a) Neural agent, (b) Evolutionary agent, (c) Evo-NAS agent. For a detailed description of the Factorized Hierarchical Search Space and its modules refer to (Tan et al., 2018).