With breakthroughs achieved by deep learning models in applications such as speech recognition, object detection, and machine translation, the automated design of state-of-the-art neural network architectures for a given learning task has garnered attention in recent years. Deep learning architectures designed via algorithmically driven search procedures have surpassed their hand-crafted counterparts for tasks such as image recognition and language modeling (luo2018neural; zoph2018learning), underlining the practical importance of this field of research.
Techniques for neural architecture search broadly fall under reinforcement learning (zoph2016neural; zoph2018learning; pham2018efficient), Bayesian optimization (shahriari2015taking; kandasamy2018neural; ma2019deep; white2019bananas), evolutionary learning (real2019regularized; liu2017hierarchical; real2017large; xie2017genetic; elsken2018efficient), and gradient-based methods (liu2018darts; luo2018neural; shaw2019squeezenas), with random search widely regarded as a competitive baseline (bergstra2012random; li2019random).
This paper presents Neural Architecture Search with ReMAADE, which incorporates recent advancements in neural networks such as the Transformer (vaswani2017attention) to design an auto-regressive model and discover good architectures using policy gradient. We show that neural architecture search with ReMAADE achieves a test error of 5.91% on NASBench-101 (ying2019bench) when given an exploration budget of 3,200 architectures and significantly outperforms all algorithms with the exception of BANANAS (white2019bananas).
Our contributions can be summarized as:
We present a 2-stream attention based architecture for capturing dependencies between hyper-parameters. This architecture’s parametric complexity is invariant to the dimensionality of the search-space, yet the architecture is expressive enough to capture long range dependencies. This makes it suitable for discovering good architectures in high-dimensional search spaces within a limited exploration budget.
We show how performance can be further improved by combining models with shared parameters, each conditioned on a different autoregressive factorization order. This allows the algorithm to discover better architectures with increased exploration budgets without getting stuck in sub-optimal architectures.
|NAS||Neural Architecture Search|
|ReMAADE||REINFORCE and Masked Attention Auto-Regressive Density Estimators|
|BANANAS||Bayesian Optimization with Neural Architectures for Neural Architecture Search|
|MADE||Masked Autoregressive Density Estimator|
|NADE||Neural Autoregressive Distribution Estimator|
|PPO||Proximal Policy Optimization|
|NASBOT||Neural Architecture Search with Bayesian Optimization and optimal Transport|
|MCTS||Monte Carlo Tree Search|
2 Related Work
A subset of the wider problem of hyper-parameter optimization (bergstra2011algorithms; snoek2012practical), neural architecture search focuses exclusively on the search for network architecture elements, rather than model training related hyper-parameters such as the learning rate. Here, the search space is usually predefined, discrete, multi-dimensional and sometimes of variable dimensionality.
In methods based on Bayesian optimization with Gaussian processes (GPs) (kandasamy2018neural; falkner2018bohb; bergstra2011algorithms; white2019bananas), the function mapping the neural network architecture to the validation error is modelled as a GP. This permits easy computation of the posterior distribution of the validation error for a new architecture. The performance of the method hinges on designing an appropriate kernel function for the GP and an acquisition function for fetching the next set of architectures for evaluation. BANANAS (white2019bananas) achieves state-of-the-art results on NASBench-101 by deploying an ensemble of feed-forward neural networks to model the GP, and acquiring new architectures for evaluation by mutating the set of best architectures found thus far. A limitation of these approaches is that performance degrades in high-dimensional search spaces, because larger samples are required to update the posterior distribution (pmf).
In evolutionary learning models, inspired by genetic evolution, the architectures are modelled as gene strings. The search proceeds by mutating and combining strings so as to home in on promising architectures. Model-free RL-based techniques (zoph2016neural; zoph2018learning; baker2016designing; pham2018efficient; wang2018alphax) specify a policy network that learns to output desirable architectures. Search proceeds by training the policy network using Q-learning or policy gradient. These techniques are flexible in that they can search over variable-length architectures, and have shown very promising results for neural architecture search.
Let the search space comprise $N$ hyperparameters, indexed by $i = 1, \dots, N$. RL methods based on policy gradient (zoph2016neural; zoph2018learning) specify a policy network, parametrized by $\theta$, to learn a desired probability distribution over values of the hyperparameters, $p_\theta(a_1, \dots, a_N)$, where $a_i$ denotes the value of the $i$-th hyper-parameter.
We set up the training regime such that the policy network learns to assign higher probabilities to those sequences of hyperparameter values (henceforth referred to as strings) that yield a higher accuracy on the cross-validation dataset. Accordingly, we maximize
$$J(\theta) = \mathbb{E}_{a \sim p_\theta}\left[ f(a) - b \right] \qquad (1)$$
where $a = (a_1, \dots, a_N)$ is a string sampled from the policy $p_\theta$ and $f$ is an unknown function that maps strings from the hyperparameter search space to the ML model’s accuracy on the validation dataset.
$b$ is a baseline function to reduce the variance of the estimate of the gradient of (1). We set $b$ as the exponential moving average of accuracies of previously sampled architectures. $f(a) - b$ is referred to as the advantage function, denoted by $A(a)$.
We optimize the objective via gradient ascent, where the gradient can be estimated using the REINFORCE rule (williams1992simple). The optimization procedure alternates between two steps until the exploration budget is exhausted:
sample a batch of action strings based on the current state of the policy network and fetch the corresponding rewards from the environment
update the policy network’s parameters using policy gradient
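The two alternating steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the policy network described in this paper: the policy is a set of independent categorical distributions (one per hyperparameter), `toy_reward` is a hypothetical stand-in for the environment, and the learning rate, batch size, and moving-average decay are arbitrary choices.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reinforce_search(reward_fn, n_params, n_values, budget,
                     batch_size=10, lr=0.5, ema_decay=0.9, seed=0):
    """REINFORCE with an exponential-moving-average baseline.

    Assumption: one independent categorical distribution per hyperparameter
    stands in for the policy network.
    """
    rng = np.random.default_rng(seed)
    logits = np.zeros((n_params, n_values))      # policy parameters
    baseline = None                              # EMA of past rewards
    best_string, best_reward, queries = None, -np.inf, 0
    while queries + batch_size <= budget:
        probs = softmax(logits)
        # Step 1: sample a batch of action strings and fetch their rewards.
        batch = np.array([[rng.choice(n_values, p=probs[i]) for i in range(n_params)]
                          for _ in range(batch_size)])
        rewards = np.array([reward_fn(a) for a in batch], dtype=float)
        queries += batch_size
        if baseline is None:
            baseline = rewards.mean()
        # Step 2: policy-gradient update. For a softmax over logits,
        # grad log p(a) = onehot(a) - probs.
        grad = np.zeros_like(logits)
        for a, r in zip(batch, rewards):
            onehot = np.eye(n_values)[a]         # shape (n_params, n_values)
            grad += (r - baseline) * (onehot - probs)
        logits += lr * grad / batch_size         # gradient ascent
        baseline = ema_decay * baseline + (1 - ema_decay) * rewards.mean()
        top = int(rewards.argmax())
        if rewards[top] > best_reward:
            best_string, best_reward = batch[top].copy(), rewards[top]
    return best_string, best_reward, queries

def toy_reward(a):
    """Hypothetical environment: fraction of hyperparameters set to value 2."""
    return float(np.mean(np.asarray(a) == 2))

best_string, best_reward, queries = reinforce_search(toy_reward, n_params=5,
                                                     n_values=3, budget=150)
```

Here the budget is counted in environment queries, so the loop stops once another full batch would exceed it.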
The exploration budget can be quantified in units of computation, such as the number of GPU/TPU hours spent training architectures or, alternatively, the number of times the policy network can query the environment to fetch an architecture’s score. In the latter case, it is assumed that all architectures consume identical compute to train.
4 Autoregressive Models for Density Estimation
The policy network can be set up as an autoregressive model, an approach that has been successfully applied to language models (mikolov2010recurrent; kim2016character), generative models of images (oord2016pixel; salimans2017pixelcnn++; chen2017pixelsnail) and speech (oord2016wavenet).
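For reference, an autoregressive model factorizes the joint distribution over a string by the chain rule. The form below is the standard one and is presumably what the text refers to as (2); the symbols $N$ (number of hyperparameters), $a_i$ (value of the $i$-th hyperparameter), and $\theta$ (policy parameters) are assumed notation.

```latex
p_\theta(a_1, \dots, a_N) \;=\; \prod_{i=1}^{N} p_\theta\left(a_i \mid a_1, \dots, a_{i-1}\right)
```

Each factor is a categorical distribution over the values of one hyperparameter, conditioned on the values already chosen.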
The choice of a parametric architecture for modelling the terms in (2) becomes crucial as it needs to balance expressiveness against model complexity. The former is important for learning dependencies between hyper-parameters over increasing string lengths, while the latter needs to be economized to discover optimal strings within an exploration budget. RNN-based networks struggle to learn adequate context representations over longer sequence lengths because of the vanishing/exploding gradient problem (pascanu2013difficulty). On the other hand, the parametric complexity of masked multi-layer perceptron-based models, such as MADE (germain2015made) and NADE (uria2016neural), increases with string length (Table 2).
5 Masked Attention Autoregressive Density Estimators (MAADE)
To strike a balance between expressiveness and complexity in designing the policy network, we propose the Masked Attention Autoregressive Density Estimation (MAADE) architecture, comprising three layers:
Embedding layer maps hyper-parameters and hyper-parameter values to a $d$-dimensional vector space
Context Representation layer models dependencies between hyper-parameters as specified by the auto-regressive factorization
Density Estimation Layer computes the probability density for a string
Embedding Layer Let $q_i \in \mathbb{R}^d$, $i = 1, \dots, N$, be query vectors for each of the $N$ hyper-parameters. We also maintain value vectors $v_k \in \mathbb{R}^d$ for the values that each hyper-parameter can take. We assume without loss of generality that all hyper-parameters assume categorical values in the same space $\mathcal{V}$ and dimensionality $K$. We therefore share the embedding layer across hyper-parameters. Given the value of the $i$-th hyper-parameter, $a_i$, the corresponding value vector is $v_{a_i}$, which we denote as $v_i$ with slight abuse of notation. Note that this framework can easily be extended to deal with hyper-parameters operating in different spaces as well, such that we maintain a separate embedding layer for each hyper-parameter family.
Context Representation Layer This layer produces $d$-dimensional contextual vectors, $c_i$, for each hyper-parameter $i$ as a function of the hyper-parameter being predicted and previously seen hyper-parameters:
$$c_i = g_\theta\left(i, a_1, \dots, a_{i-1}\right)$$
Inspired by XLNet (yang2019xlnet), we use a two-stream masked attention-based architecture comprising query and key vectors to compose the contextual vector of each hyper-parameter. A notable departure from XLNet is that since we are not predicting probabilities for a position but for a given hyper-parameter, we let the query vector of the target hyper-parameter attend to preceding key vectors. Each key vector attends to preceding key vectors as well as itself.
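The attention pattern of the two streams can be written down as a pair of boolean masks. The sketch below is an illustration only; the function name `two_stream_masks` and the convention that `order[t]` is the hyper-parameter predicted at step `t` are assumptions made for this example.

```python
import numpy as np

def two_stream_masks(order):
    """Return (query_mask, key_mask), boolean matrices of shape (N, N).

    mask[i, j] is True when hyper-parameter i may attend to hyper-parameter j.
    `order` is a permutation of range(N); order[t] is the hyper-parameter
    predicted at step t of the autoregressive factorization.
    """
    order = np.asarray(order)
    n = len(order)
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n)                   # rank[i] = step at which i is predicted
    # Query stream: the target sees strictly preceding hyper-parameters only.
    query_mask = rank[None, :] < rank[:, None]
    # Key stream: each key sees preceding hyper-parameters and itself.
    key_mask = rank[None, :] <= rank[:, None]
    return query_mask, key_mask

query_mask, key_mask = two_stream_masks([2, 0, 1])
```

With the order `[2, 0, 1]`, hyper-parameter 2 is predicted first and therefore has an empty query-stream context, while hyper-parameter 1 is predicted last and may attend to both 2 and 0.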
Stream 1 is initialized from the query vectors and stream 2 from the key vectors. Then we update the streams as:
Finally, we get the contextual representations as:
Simplified Transformer Block
To specify the Simplified Transformer block, we use a simplified version of the Transformer (vaswani2017attention). In the attention layer, we eschew dot-product attention in favour of additive attention (bahdanau2014neural) to model interactions between query and key/value vectors, as it was found to marginally improve the policy network’s performance. We also found the policy network’s performance to deteriorate when , obviating the need for both residual connections and layer-normalization. Finally, we do away with positional encoding since the sequence in which preceding hyper-parameters in the auto-regressive order were encountered doesn’t matter.
We provide details of the computation steps in the Simplified Transformer block:
where Masked Attention is the additive attention operation, and PosFF is the position-wise feed-forward operation with the ReLU non-linearity replaced by tanh:
Note that both the streams share parameters of the masked attention and feed-forward operations.
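One possible reading of the Simplified Transformer block is sketched below in NumPy. It is a hedged illustration rather than the exact computation used in this paper: the parameter names (`Wq`, `Wk`, `w`, `W1`, `b1`, `W2`, `b2`), the use of keys as values, the single attention head, and the zero context for a hyper-parameter with nothing to attend to are all assumptions.

```python
import numpy as np

def init_params(d, rng, scale=0.1):
    """Random parameters for one block (hypothetical initialisation)."""
    shapes = {"Wq": (d, d), "Wk": (d, d), "w": (d,),
              "W1": (d, d), "b1": (d,), "W2": (d, d), "b2": (d,)}
    return {name: scale * rng.standard_normal(shape) for name, shape in shapes.items()}

def masked_additive_attention(Q, K, mask, Wq, Wk, w):
    """Additive attention: score[i, j] = w . tanh(Wq q_i + Wk k_j).

    Q, K have shape (N, d); mask[i, j] is True if query i may attend to key j.
    Keys double as values here (an assumption).
    """
    hidden = np.tanh((Q @ Wq)[:, None, :] + (K @ Wk)[None, :, :])   # (N, N, d)
    scores = np.where(mask, hidden @ w, -np.inf)                     # (N, N)
    has_ctx = mask.any(axis=1, keepdims=True)
    scores = np.where(has_ctx, scores, 0.0)      # avoid all -inf rows
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = np.where(mask, weights, 0.0)
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return weights @ K                            # empty context -> zero vector

def pos_ff(X, W1, b1, W2, b2):
    """Position-wise feed-forward with tanh in place of ReLU."""
    return np.tanh(X @ W1 + b1) @ W2 + b2

def simplified_block(Q, K, mask, p):
    """Attention followed by PosFF; no residuals, layer norm, or positional encoding."""
    ctx = masked_additive_attention(Q, K, mask, p["Wq"], p["Wk"], p["w"])
    return pos_ff(ctx, p["W1"], p["b1"], p["W2"], p["b2"])

rng = np.random.default_rng(0)
params = init_params(8, rng)                      # shared by both streams
Q, K = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
strict_mask = np.tril(np.ones((4, 4), dtype=bool), k=-1)
out = simplified_block(Q, K, strict_mask, params)
```

Calling `simplified_block` with the same `params` for both streams, and with the query-stream or key-stream mask respectively, reflects the parameter sharing noted above.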
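Density Estimation Layer In this layer, we pass the context representation $c_i$ through an affine transformation specific to the target hyper-parameter $i$, with weights $W_i$ and bias $\beta_i$, followed by a softmax,
$$p_\theta\left(\cdot \mid a_1, \dots, a_{i-1}\right) = \mathrm{softmax}\left(W_i c_i + \beta_i\right) \qquad (9)$$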
For training the policy network, we can optimize either a PPO objective (schulman2017proximal) or the term in (1). Performance can potentially be improved further by averaging over multiple autoregressive factorization orders with shared parameters, as done in (uria2016neural; yang2019xlnet). To do so, we explicitly condition the density on the autoregressive factorization order and share the policy network’s parameters across orders.
Accordingly, we set up the following framework: Let $\mathcal{Z}_N$ denote the set of all possible permutations of length-$N$ index sequences. Let $\mathcal{Z} \subseteq \mathcal{Z}_N$ with set size $S$. Let $z_t$ denote the $t$-th element of a permutation $z \in \mathcal{Z}$. We then define the joint probability of the autoregressive factorization order and action string tuple, $(z, a)$, as:
$$p_\theta(z, a) = \frac{1}{S} \prod_{t=1}^{N} p_\theta\left(a_{z_t} \mid a_{z_1}, \dots, a_{z_{t-1}}, z\right)$$
To sample from this distribution, we sample a permutation uniformly at random from $\mathcal{Z}$. We then fix the autoregressive factorization order based on the sampled permutation and sample the action string based on the MAADE architecture.
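The sampling procedure just described can be sketched as follows. The callable `cond_probs` is a hypothetical stand-in for the MAADE network, and the helper names are assumptions introduced for this example; only the two-stage structure (uniform choice of order, then autoregressive sampling along that order) comes from the text.

```python
import math
import numpy as np

def sample_order_set(n, s, rng):
    """Draw S distinct factorization orders (permutations of range(n))."""
    assert s <= math.factorial(n), "cannot draw more distinct orders than n!"
    orders = []
    while len(orders) < s:
        candidate = tuple(int(i) for i in rng.permutation(n))
        if candidate not in orders:
            orders.append(candidate)
    return orders

def sample_string(order, cond_probs, rng):
    """Sample an action string along `order`; return (string, log-probability).

    cond_probs(i, context) is a hypothetical stand-in for the policy network:
    it returns a probability vector over the values of hyper-parameter i,
    given `context`, a dict {index: value} of hyper-parameters already set.
    """
    context, log_prob = {}, 0.0
    for i in order:
        p = np.asarray(cond_probs(i, context), dtype=float)
        value = int(rng.choice(len(p), p=p))
        log_prob += float(np.log(p[value]))
        context[i] = value
    return [context[i] for i in sorted(context)], log_prob

def sample_joint(order_set, cond_probs, rng):
    """Sample (order, string) and return the joint log-probability."""
    order = order_set[int(rng.integers(len(order_set)))]   # uniform over the S orders
    string, log_prob = sample_string(order, cond_probs, rng)
    return order, string, log_prob - math.log(len(order_set))

rng = np.random.default_rng(0)
uniform = lambda i, context: np.full(3, 1.0 / 3.0)        # toy conditional
order_set = sample_order_set(n=4, s=3, rng=rng)
order, string, joint_log_prob = sample_joint(order_set, uniform, rng)
```

The returned string is indexed by hyper-parameter rather than by sampling step, so strings sampled under different orders are directly comparable.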
We maximize the following PPO objective:
An unbiased estimate of the gradient of (11) is:
We update the policy network parameters via gradient ascent,
where the PPO coefficient, the batch size $B$ and the number of factorization orders $S$ constitute ReMAADE’s hyper-parameters, which, along with the learning rate, can be tuned via cross-validation.
We expect that higher values of $S$ will lead to improved performance as we increase the exploration budget. Also note that in the limit when $S = N!$, the training objective is identical to the bidirectional training objective in (yang2019xlnet). We can optionally add an entropy term to the objective to encourage exploration if we have a large exploration budget (mnih2016asynchronous).
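For orientation, the widely used clipped surrogate from (schulman2017proximal) is sketched below. Whether ReMAADE uses exactly this variant, and whether its PPO coefficient corresponds to the `clip` parameter here, are assumptions; the function name and the default value are illustrative only.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip=0.2):
    """Standard clipped PPO surrogate, averaged over a batch of action strings.

    logp_new / logp_old: log-probabilities of each sampled string under the
    current and the sampling-time policy; advantages: reward minus baseline.
    """
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    # The element-wise minimum removes the incentive to move the policy
    # far from the one that generated the samples.
    return float(np.minimum(unclipped, clipped).mean())
```

When the current and sampling-time policies coincide, the probability ratio is 1 and the objective reduces to the mean advantage.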
7 ReMAADE algorithm
We now describe the REINFORCE with Masked Attention Auto-regressive Density Estimators (ReMAADE) algorithm.
Search space of architectures
Environment function $f$ that maps an architecture from the search space to the corresponding validation accuracy
ReMAADE hyperparameters: the batch size $B$, the set size $S$ of auto-regressive orderings to sample from, the learning rate, and the PPO coefficient.
MAADE hyperparameters: the embedding dimension $d$ for the transformer block and the number of transformer blocks stacked.
Complexity A key advantage of attention-based networks is that the model complexity of the context representation layer doesn’t change with the string length $N$. This is crucial for the NAS problem, where the policy network needs to discover the best architecture within a limited exploration budget.
|Architecture||Complexity of Context Representation layer|
|Masked Attention Autoregressive Density Estimator (MAADE)|
|Neural Autoregressive Distribution Estimator (NADE)|
|Masked Autoregressive Density Estimator (MADE)|
|LSTM based controller|
8 Results on NASBench-101
NASBench-101 search space The NASBench-101 dataset (ying2019bench) is a public architecture dataset to facilitate NAS research and compare NAS algorithms. The search space comprises the elements of small feed-forward structures called cells. These cells are assembled together in a predefined manner to form an overall convolutional neural network architecture that is trained on the CIFAR-10 dataset.
A cell comprises 7 nodes, of which the first node is the input node and the last node is the output. The remaining 5 nodes need to be assigned one of 3 operations: 1x1 convolution, 3x3 convolution, or 3x3 max pooling. The nodes then need to be connected to form a valid directed acyclic graph (DAG). The NAS algorithm therefore needs to specify the operations for each of the 5 nodes, and then specify the edges to form a valid DAG. To limit the search space, NASBench-101 imposes additional constraints: the total number of edges cannot exceed 9, and there needs to be a path from the input node to the output node. This results in 423K valid and unique ways to specify a cell. NASBench-101 has precomputed the validation and test errors for all the neural network architectures that can be designed from these 423K cell configurations.
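The constraints listed above can be checked with a few lines of code. This is a sketch of the rules as stated in the text, not the NASBench-101 API: the operation labels in `OPS`, the upper-triangular adjacency-matrix encoding, and the function name are assumptions, and the check does not detect isomorphic duplicates.

```python
import numpy as np

OPS = ("conv1x1", "conv3x3", "maxpool3x3")   # hypothetical labels for the 3 operations
NUM_NODES, MAX_EDGES = 7, 9

def is_valid_cell(adjacency, ops):
    """Check a cell against the constraints described in the text.

    adjacency: 7x7 0/1 matrix, adjacency[i][j] = 1 for an edge i -> j.
    ops: operations for the 5 intermediate nodes.
    """
    adj = np.asarray(adjacency, dtype=int)
    if adj.shape != (NUM_NODES, NUM_NODES) or len(ops) != NUM_NODES - 2:
        return False
    if any(op not in OPS for op in ops) or not np.isin(adj, (0, 1)).all():
        return False
    # A strictly upper-triangular matrix guarantees a DAG in node order.
    if np.any(np.tril(adj)) or adj.sum() > MAX_EDGES:
        return False
    # Propagate reachability from the input (node 0) in topological order.
    reachable = {0}
    for src in range(NUM_NODES):
        if src in reachable:
            reachable.update(np.flatnonzero(adj[src]).tolist())
    return (NUM_NODES - 1) in reachable

chain = np.zeros((NUM_NODES, NUM_NODES), dtype=int)
for node in range(NUM_NODES - 1):
    chain[node, node + 1] = 1                 # input -> 1 -> ... -> output
chain_is_valid = is_valid_cell(chain, ["conv3x3"] * 5)
```

A policy network that samples operations and edges independently of these rules would need such a filter, or a penalty on invalid strings, before querying the benchmark.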
To benchmark ReMAADE on NASBench-101, we investigate short-term performance (exploration budget of 150 architectures) and medium-term performance (exploration budget of 3,200 architectures). We include random search, which is regarded as a competitive baseline (li2019random), regularized evolution (real2019regularized), and AlphaX, an RL algorithm that uses MCTS (wang2018alphax). We also compare with several algorithms based on Bayesian optimization with GP priors: BOHB (falkner2018bohb), tree-structured Parzen estimator (TPE) (bergstra2011algorithms), BANANAS (white2019bananas), and NASBOT (kandasamy2018neural). For all NAS algorithms, during a trial, we track the best random validation error achieved after $t$ explorations and the corresponding random test error. We report metrics averaged over 500 trials for each NAS algorithm. For ReMAADE, in all experiments, we set , and used Adam (adam) for updating the policy network’s parameters.
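The tracking protocol can be sketched as follows. The function names are hypothetical, and reporting the standard deviation across trials as the spread is an assumption, since the text does not state what the reported spread measures.

```python
import numpy as np

def incumbent_test_error(val_errors, test_errors):
    """Test error of the best-validation architecture after each exploration."""
    val = np.asarray(val_errors, dtype=float)
    test = np.asarray(test_errors, dtype=float)
    best_idx = np.zeros(len(val), dtype=int)
    for t in range(1, len(val)):
        # Keep the incumbent unless the new architecture has lower validation error.
        best_idx[t] = t if val[t] < val[best_idx[t - 1]] else best_idx[t - 1]
    return test[best_idx]

def average_over_trials(trials):
    """trials: list of (val_errors, test_errors) pairs of equal length.

    Returns the mean and standard deviation of the incumbent test error
    after each exploration, computed across trials.
    """
    curves = np.stack([incumbent_test_error(v, t) for v, t in trials])
    return curves.mean(axis=0), curves.std(axis=0)
```

Selecting the incumbent on validation error and reporting its test error keeps the test set out of the search loop.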
Short Term Performance
NAS algorithms need to discover good architectures within 150 explorations to be of practical use. In this setting (Table 3), ReMAADE outperforms all algorithms with the exception of BANANAS. For ReMAADE, we set .
|Algorithm||Test Error (in %)|
|TPE||6.43 ± 0.16|
|BOHB||6.40 ± 0.12|
|Random Search||6.36 ± 0.12|
|AlphaX||6.31 ± 0.13|
|NASBOT||6.35 ± 0.10|
|Reg Evolution||6.20 ± 0.13|
|ReMAADE||6.15 ± 0.13|
|BANANAS||5.77 ± 0.31|
8.1 Ablation studies
We perform an ablation study to understand the importance of the autoregressive component and MAADE in designing the context representation layer. We can use REINFORCE without an autoregressive model, which we call plain vanilla REINFORCE. In other words, we assume that all actions are independent:
$$p_\theta(a_1, \dots, a_N) = \prod_{i=1}^{N} p_\theta(a_i)$$
This amounts to updating only the bias terms in (9) using policy gradient, and yields a baseline test set error of 6.26%. Interestingly, using MADE (germain2015made) to design the autoregressive model failed to improve upon plain vanilla REINFORCE, underscoring the importance of MAADE in capturing auto-regressive dependencies. In the short term setting, performance was marginally better without PPO.
|Algorithm||Test Error (in %)|
|Random Search||6.30 ± 0.25|
|Plain vanilla REINFORCE||6.26 ± 0.22|
|REINFORCE with MADE||6.25 ± 0.22|
|ReMAADE with PPO||6.17 ± 0.25|
|ReMAADE w/o PPO||6.16 ± 0.25|
Medium Term Performance
We also investigate and benchmark the performance of ReMAADE when the exploration budget is 3,200. To take advantage of the increased budget, we use a PPO objective, set the learning rate to , and increase MAADE’s parametric complexity by setting . The entropy term did not improve performance and so we dropped it. Other hyper-parameter values were: .
|Algorithm||Test Error (in %)|
|TPE||6.19 ± 0.83|
|Random Search||6.03 ± 0.23|
|Reg Evolution||6.04 ± 0.23|
In this setting, ReMAADE outperforms all algorithms except BANANAS, and does so by a larger margin than in the short-term setting.
8.2 Effect of auto-regressive ordering ensembles
Autoregressive density estimation models struggle with the terms in (11) that condition on long contexts, since they use a fixed capacity in the context representation layer. This can be mitigated to some extent by picking an autoregressive factorization order that exploits spatio-temporal dependencies using an appropriate architecture. For instance, in generative modelling of images, the raster scan ordering is preferred as it is able to capture spatial dependencies in the immediate neighborhood (oord2016pixel).
In the case of neural architecture search, however, it is not clear a priori which autoregressive order to fix. Therefore, training an ensemble of models, each with a different autoregressive factorization order, with parameters shared across all models, can potentially improve performance. In line with this intuition, we found performance to improve, then dip, before asymptotically improving again as $S$ increases.
|S||Test Error (in %)|
|1||5.95 ± 0.18|
|2||5.94 ± 0.19|
|4||5.91 ± 0.19|
|6||5.92 ± 0.19|
|8||5.93 ± 0.19|
|16||5.92 ± 0.19|
|N!||5.92 ± 0.19|
9 Conclusion, Future Work
In this work, we propose a neural architecture search algorithm that uses reinforcement learning with a Transformer-based controller, and we improve upon it by conditioning on, and then ensembling across, multiple autoregressive factorization orders. We present our results on the NASBench-101 search space. We evaluate our algorithm in the short term and the medium term against a number of NAS approaches such as random search, regularized evolution, and Bayesian optimization. We also present ablation studies on the impact of the various components on performance. An interesting line of work would be to leverage the meta-network from BANANAS (white2019bananas) to further train the MAADE policy network and improve performance. We also plan to investigate the performance of ReMAADE on other neural architecture search spaces such as DARTS (liu2018darts), and on the wider problem of hyper-parameter optimization.
Appendix A Best Practices
To facilitate evaluation and reproduction of our research, we address all the items in the checklist of (lindauer2019best).
Code for the training pipeline used to evaluate the final architectures We used the search space of the architectures reported in NASBench-101, and thus the accuracies for all the architectures were precomputed. We have published code to train the policy network for the ReMAADE algorithm; we also provide code showing how the different NAS algorithms were evaluated.
Code for the search space The publicly available NASBench-101 dataset search space was used.
Hyperparameters used for the final evaluation pipeline, as well as random seeds The hyperparameters were left unchanged.
For all NAS methods you compare, did you use exactly the same NAS benchmark, including the same dataset, search space, and code for training the architectures and hyperparameters for that code? Yes, as NASBench-101 was used for evaluation, the search space was fixed accordingly. All the different NAS methods we surveyed were evaluated against the same NASBench-101 dataset.
Did you control for confounding factors? Yes, all the experiments across all NAS algorithms were run on the same NASBench-101 framework.
Did you run ablation studies? Yes, the results for the ablation studies have been outlined in the paper.
Did you use the same evaluation protocol for the methods being compared? Yes, the same evaluation protocol was used.
Did you compare performance over time? Yes, the performance was evaluated both in the short term, with an exploration budget of 150 architectures, and in the medium term, with an exploration budget of 3,200 architectures.
Did you compare to random search? Yes.
Did you perform multiple runs of your experiments and report seeds? Yes, we ran 500 trials of each experiment, with a different seed for each trial on NASBench-101. These results are completely reproducible.
Did you use tabular or surrogate benchmarks for in-depth evaluations? Yes, all our experiments were evaluated against the NASBench-101 dataset.
Did you report how you tuned hyperparameters, and what time and resources this required? We explored certain ranges of hyper-parameters to get the best performance. These have been mentioned in the paper.
Did you report the time for the entire end-to-end NAS method? Since all our experiments were run against the NASBench-101 framework for which we had pre-computed results, we were able to test the architectures generated by our network without training them from scratch.
Did you report all details of your experimental setup? Yes, all the details for the experimental setup have been reported.