Differentiable architecture search for convolutional and recurrent networks
This paper addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Unlike conventional approaches of applying evolution or reinforcement learning over a discrete and non-differentiable search space, our method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent. Extensive experiments on CIFAR-10, ImageNet, Penn Treebank and WikiText-2 show that our algorithm excels in discovering high-performance convolutional architectures for image classification and recurrent architectures for language modeling, while being orders of magnitude faster than state-of-the-art non-differentiable techniques.
Discovering state-of-the-art neural network architectures requires substantial effort from human experts. Recently, there has been a growing interest in developing algorithmic solutions to automate the manual process of architecture design. The automatically searched architectures have achieved highly competitive performance in tasks such as image classification (Zoph & Le, 2017; Zoph et al., 2018; Liu et al., 2018b, a; Real et al., 2018) and object detection (Zoph et al., 2018).
The best existing architecture search algorithms are computationally demanding despite their remarkable performance. For example, obtaining a state-of-the-art architecture for CIFAR-10 and ImageNet required 2000 GPU days of reinforcement learning (RL) (Zoph et al., 2018) or 3150 GPU days of evolution (Real et al., 2018). Several approaches for speeding up the search have been proposed, such as imposing a particular structure on the search space (Liu et al., 2018b, a), weights or performance prediction for each individual architecture (Brock et al., 2018; Baker et al., 2018) and weight sharing/inheritance across multiple architectures (Elsken et al., 2017; Pham et al., 2018b; Cai et al., 2018; Bender et al., 2018), but the fundamental challenge of scalability remains. An inherent cause of inefficiency for the dominant approaches, e.g. based on RL, evolution, MCTS (Negrinho & Gordon, 2017), SMBO (Liu et al., 2018a) or Bayesian optimization (Kandasamy et al., 2018), is the fact that architecture search is treated as a black-box optimization problem over a discrete domain, which requires a large number of architecture evaluations.
In this work, we approach the problem from a different angle, and propose a method for efficient architecture search called DARTS (Differentiable ARchiTecture Search). Instead of searching over a discrete set of candidate architectures, we relax the search space to be continuous, so that the architecture can be optimized with respect to its validation set performance by gradient descent. The data efficiency of gradient-based optimization, as opposed to inefficient black-box search, allows DARTS to achieve competitive performance with the state of the art using orders of magnitude less computation resources. It also outperforms another recent efficient architecture search method, ENAS (Pham et al., 2018b). Notably, DARTS is simpler than many existing approaches as it does not involve controllers (Zoph & Le, 2017; Baker et al., 2017; Zoph et al., 2018; Pham et al., 2018b; Zhong et al., 2018), hypernetworks (Brock et al., 2018) or performance predictors (Liu et al., 2018a), yet it is generic enough to handle both convolutional and recurrent architectures.
The idea of searching architectures within a continuous domain is not new (Saxena & Verbeek, 2016; Ahmed & Torresani, 2017; Veniat & Denoyer, 2017; Shin et al., 2018), but there are several major distinctions. While prior works seek to fine-tune a specific aspect of an architecture, such as filter shapes or branching patterns in a convolutional network, DARTS is able to learn high-performance architecture building blocks with complex graph topologies within a rich search space. Moreover, DARTS is not restricted to any specific architecture family, and is applicable to both convolutional and recurrent networks.
In our experiments (Sect. 3) we show that DARTS is able to design a convolutional cell that achieves 2.76 ± 0.09% test error on CIFAR-10 for image classification using 3.3M parameters, which is competitive with the state-of-the-art result by regularized evolution (Real et al., 2018) obtained using three orders of magnitude more computation resources. The same convolutional cell also achieves 26.7% top-1 error when transferred to ImageNet (mobile setting), which is comparable to the best RL method (Zoph et al., 2018). On the language modeling task, DARTS efficiently discovers a recurrent cell that achieves 55.7 test perplexity on Penn Treebank (PTB), outperforming both extensively tuned LSTM (Melis et al., 2018) and all the existing automatically searched cells based on NAS (Zoph & Le, 2017) and ENAS (Pham et al., 2018b).
Our contributions can be summarized as follows:
We introduce a novel algorithm for differentiable network architecture search based on bilevel optimization, which is applicable to both convolutional and recurrent architectures.
Through extensive experiments on image classification and language modeling tasks we show that gradient-based architecture search achieves highly competitive results on CIFAR-10 and outperforms the state of the art on PTB. This is a very interesting result, considering that so far the best architecture search methods used non-differentiable search techniques, e.g. based on RL (Zoph et al., 2018) or evolution (Real et al., 2018; Liu et al., 2018b).
We achieve remarkable efficiency improvement (reducing the cost of architecture discovery to a few GPU days), which we attribute to the use of gradient-based optimization as opposed to non-differentiable search techniques.
We show that the architectures learned by DARTS on CIFAR-10 and PTB are transferable to ImageNet and WikiText-2, respectively.
The implementation of DARTS is available at https://github.com/quark0/darts
We describe our search space in general form in Sect. 2.1, where the computation procedure for an architecture (or a cell in it) is represented as a directed acyclic graph. We then introduce a simple continuous relaxation scheme for our search space which leads to a differentiable learning objective for the joint optimization of the architecture and its weights (Sect. 2.2). Finally, we propose an approximation technique to make the algorithm computationally feasible and efficient (Sect. 2.3).
Following Zoph et al. (2018); Real et al. (2018); Liu et al. (2018a, b), we search for a computation cell as the building block of the final architecture. The learned cell could either be stacked to form a convolutional network or recursively connected to form a recurrent network.
A cell is a directed acyclic graph consisting of an ordered sequence of $N$ nodes. Each node $x^{(i)}$ is a latent representation (e.g. a feature map in convolutional networks) and each directed edge $(i,j)$ is associated with some operation $o^{(i,j)}$ that transforms $x^{(i)}$. We assume the cell to have two input nodes and a single output node. For convolutional cells, the input nodes are defined as the cell outputs in the previous two layers (Zoph et al., 2018). For recurrent cells, these are defined as the input at the current step and the state carried from the previous step. The output of the cell is obtained by applying a reduction operation (e.g. concatenation) to all the intermediate nodes.
Each intermediate node is computed based on all of its predecessors:
$$x^{(j)} = \sum_{i<j} o^{(i,j)}\big(x^{(i)}\big) \tag{1}$$
A special zero operation is also included to indicate a lack of connection between two nodes. The task of learning the cell therefore reduces to learning the operations on its edges.
Let $\mathcal{O}$ be a set of candidate operations (e.g., convolution, max pooling, zero) where each operation represents some function $o(\cdot)$ to be applied to $x^{(i)}$. To make the search space continuous, we relax the categorical choice of a particular operation to a softmax over all possible operations:
$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)}\, o(x) \tag{2}$$
where the operation mixing weights for a pair of nodes $(i,j)$ are parameterized by a vector $\alpha^{(i,j)}$ of dimension $|\mathcal{O}|$. The task of architecture search then reduces to learning a set of continuous variables $\alpha = \{\alpha^{(i,j)}\}$, as illustrated in Fig. 1. At the end of search, a discrete architecture can be obtained by replacing each mixed operation $\bar{o}^{(i,j)}$ with the most likely operation, i.e., $o^{(i,j)} = \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$. In the following, we refer to $\alpha$ as the (encoding of the) architecture.
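The relaxation of a single edge can be illustrated with a minimal sketch. It assumes scalar inputs and three toy stand-in operations (`identity`, `double`, `zero`); these names and the helper functions `mixed_op` and `discretize` are hypothetical and are not taken from the DARTS implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scalar stand-ins for the candidate operations on one edge.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}

def mixed_op(x, alpha):
    """Softmax-weighted sum of every candidate operation applied to x."""
    weights = softmax([alpha[name] for name in CANDIDATE_OPS])
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))

def discretize(alpha):
    """Replace the mixed operation by the candidate with the largest logit."""
    return max(alpha, key=alpha.get)

alpha_edge = {"identity": 0.0, "double": 0.0, "zero": 0.0}
uniform_output = mixed_op(3.0, alpha_edge)  # equal weights: (3 + 6 + 0) / 3
alpha_edge["double"] = 5.0                  # pretend the search raised this logit
chosen = discretize(alpha_edge)
```

With all logits at zero the edge averages its candidates, and once one logit dominates, the argmax step recovers a single discrete operation.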
After relaxation, our goal is to jointly learn the architecture $\alpha$ and the weights $w$ within all the mixed operations (e.g. weights of the convolution filters). Analogous to architecture search using RL (Zoph & Le, 2017; Zoph et al., 2018; Pham et al., 2018b) or evolution (Liu et al., 2018b; Real et al., 2018) where the validation set performance is treated as the reward or fitness, DARTS aims to optimize the validation loss, but using gradient descent.
Denote by $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ the training and the validation loss, respectively. Both losses are determined not only by the architecture $\alpha$, but also the weights $w$ in the network. The goal for architecture search is to find $\alpha^*$ that minimizes the validation loss $\mathcal{L}_{val}(w^*, \alpha^*)$, where the weights $w^*$ associated with the architecture are obtained by minimizing the training loss $w^* = \arg\min_w \mathcal{L}_{train}(w, \alpha^*)$. This implies a bilevel optimization problem with $\alpha$ as the upper-level variable and $w$ as the lower-level variable:
$$\min_{\alpha} \; \mathcal{L}_{val}\big(w^*(\alpha), \alpha\big) \tag{3}$$
$$\text{s.t.} \quad w^*(\alpha) = \arg\min_{w} \; \mathcal{L}_{train}(w, \alpha) \tag{4}$$
The nested formulation also arises in gradient-based hyperparameter optimization (Maclaurin et al., 2015; Pedregosa, 2016; Franceschi et al., 2018), which is related in the sense that the architecture $\alpha$ could be viewed as a special type of hyperparameter, although its dimension is substantially higher than scalar-valued hyperparameters such as the learning rate, and it is harder to optimize.
Evaluating the architecture gradient exactly can be prohibitive due to the expensive inner optimization. We therefore propose a simple approximation scheme as follows:
$$\nabla_\alpha \mathcal{L}_{val}\big(w^*(\alpha), \alpha\big) \tag{5}$$
$$\approx \nabla_\alpha \mathcal{L}_{val}\big(w - \xi \nabla_w \mathcal{L}_{train}(w, \alpha), \alpha\big) \tag{6}$$
where $w$ denotes the current weights maintained by the algorithm, and $\xi$ is the learning rate for a step of inner optimization. The idea is to approximate $w^*(\alpha)$ by adapting $w$ using only a single training step, without solving the inner optimization (equation 4) completely by training until convergence. Related techniques have been used in meta-learning for model transfer (Finn et al., 2017), gradient-based hyperparameter tuning (Luketina et al., 2016) and unrolled generative adversarial networks (Metz et al., 2017). Note equation 6 will reduce to $\nabla_\alpha \mathcal{L}_{val}(w, \alpha)$ if $w$ is already a local optimum for the inner optimization and thus $\nabla_w \mathcal{L}_{train}(w, \alpha) = 0$.
The iterative procedure is outlined in Alg. 1. While we are not currently aware of the convergence guarantees for our optimization algorithm, in practice it is able to reach a fixed point with a suitable choice of $\xi$ (a simple working strategy is to set $\xi$ equal to the learning rate for $w$'s optimizer). We also note that when momentum is enabled for weight optimization, the one-step unrolled learning objective in equation 6 is modified accordingly and all of our analysis still applies.
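The alternating procedure can be sketched on a toy bilevel problem. The sketch below assumes scalar variables, a training loss $(w - \alpha)^2$ and a validation loss $(w - 1)^2$, and an update order (architecture step first, then weight step); all of these are illustrative assumptions, not the setup used in the experiments. The gradients are written out by hand so that the one-step unrolled objective is visible.

```python
def train_grad_w(w, alpha):
    """d/dw of the toy training loss (w - alpha)^2."""
    return 2.0 * (w - alpha)

def arch_grad(w, alpha, xi):
    """Gradient w.r.t. alpha of the toy validation loss at one-step-adapted weights."""
    w_prime = w - xi * train_grad_w(w, alpha)  # one virtual training step
    dval_dw_prime = 2.0 * (w_prime - 1.0)      # d/dw' of (w' - 1)^2
    dw_prime_dalpha = 2.0 * xi                 # w' depends on alpha through the step
    return dval_dw_prime * dw_prime_dalpha

def search(steps=2000, xi=0.1, lr_w=0.1, lr_alpha=0.1):
    """Alternate an architecture step on the validation loss and a weight step on the training loss."""
    w, alpha = 0.0, -2.0
    for _ in range(steps):
        alpha -= lr_alpha * arch_grad(w, alpha, xi)
        w -= lr_w * train_grad_w(w, alpha)
    return w, alpha

w_final, alpha_final = search()
```

On this toy problem both variables settle near 1, the point where the inner problem is solved ($w = \alpha$) and the validation loss is minimal. Because the toy validation loss has no direct dependence on $\alpha$, setting `xi=0.0` gives a zero architecture gradient here; that is a property of this example only.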
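Applying chain rule to the approximate architecture gradient (equation 6) yields
$$\nabla_\alpha \mathcal{L}_{val}(w', \alpha) - \xi \nabla^2_{\alpha, w} \mathcal{L}_{train}(w, \alpha)\, \nabla_{w'} \mathcal{L}_{val}(w', \alpha) \tag{7}$$
where $w' = w - \xi \nabla_w \mathcal{L}_{train}(w, \alpha)$ denotes the weights for a one-step forward model. The expression above contains an expensive matrix-vector product in its second term. Fortunately, the complexity can be substantially reduced using the finite difference approximation. Let $\epsilon$ be a small scalar (we found $\epsilon = 0.01 / \lVert \nabla_{w'} \mathcal{L}_{val}(w', \alpha) \rVert_2$ to be sufficiently accurate in all of our experiments) and $w^{\pm} = w \pm \epsilon \nabla_{w'} \mathcal{L}_{val}(w', \alpha)$. Then:
$$\nabla^2_{\alpha, w} \mathcal{L}_{train}(w, \alpha)\, \nabla_{w'} \mathcal{L}_{val}(w', \alpha) \approx \frac{\nabla_\alpha \mathcal{L}_{train}(w^+, \alpha) - \nabla_\alpha \mathcal{L}_{train}(w^-, \alpha)}{2\epsilon} \tag{8}$$
Evaluating the finite difference requires only two forward passes for the weights and two backward passes for $\alpha$, and the complexity is reduced from $\mathcal{O}(|\alpha||w|)$ to $\mathcal{O}(|\alpha| + |w|)$.
When $\xi = 0$, the second-order derivative in equation 7 will disappear. In this case, the architecture gradient is given by $\nabla_\alpha \mathcal{L}_{val}(w, \alpha)$, corresponding to the simple heuristic of optimizing the validation loss by assuming the current $w$ is the same as $w^*(\alpha)$. This leads to some speed-up but empirically worse performance, according to our experimental results in Table 1 and Table 2. In the following, we refer to the case of $\xi = 0$ as the first-order approximation, and refer to the gradient formulation with $\xi > 0$ as the second-order approximation.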
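The finite difference idea can be checked numerically on a scalar stand-in. The sketch assumes a toy training loss $\sin(\alpha w)$, for which the mixed second derivative is available in closed form; the loss, the function names and the scale of 0.01 are illustrative assumptions. In the scalar case the matrix-vector product collapses to an ordinary product with `v`, which stands in for the validation gradient.

```python
import math

def grad_alpha_train(w, alpha):
    """d/d alpha of the toy training loss sin(alpha * w)."""
    return w * math.cos(alpha * w)

def exact_mixed_term(w, alpha, v):
    """Closed-form mixed second derivative of the toy loss, multiplied by v."""
    return (math.cos(alpha * w) - alpha * w * math.sin(alpha * w)) * v

def finite_difference_term(w, alpha, v, scale=0.01):
    """Central difference of grad_alpha_train along direction v in weight space."""
    eps = scale / abs(v)  # step size shrinks as the direction grows
    w_plus, w_minus = w + eps * v, w - eps * v
    return (grad_alpha_train(w_plus, alpha) - grad_alpha_train(w_minus, alpha)) / (2.0 * eps)

w, alpha, v = 0.7, 1.3, 2.5
exact = exact_mixed_term(w, alpha, v)
approx = finite_difference_term(w, alpha, v)
```

Only two evaluations of the gradient with respect to `alpha` are needed, and the mixed second derivative is never formed explicitly, which is the source of the complexity reduction.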
To form each node in the discrete architecture, we retain the top-$k$ strongest operations (from distinct nodes) among all non-zero candidate operations collected from all the previous nodes. The strength of an operation is defined as $\frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}$. To make our derived architecture comparable with those in the existing works, we use $k = 2$ for convolutional cells (Zoph et al., 2018; Liu et al., 2018a; Real et al., 2018) and $k = 1$ for recurrent cells (Pham et al., 2018b).
The zero operations are excluded in the above for two reasons. First, we need exactly $k$ non-zero incoming edges per node for fair comparison with the existing models. Second, the strength of the zero operations is underdetermined, as increasing the logits of zero operations only affects the scale of the resulting node representations, and does not affect the final classification outcome due to the presence of batch normalization (Ioffe & Szegedy, 2015).
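The derivation step for one node can be sketched as follows. The function `derive_node`, the operation names and the logit values are hypothetical; the sketch assumes that the strength of an operation is its softmax probability on its edge, with the zero operation kept in the normalization but excluded from selection.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def derive_node(alpha_by_pred, k):
    """Keep the k strongest non-zero operations, each from a distinct predecessor.

    alpha_by_pred maps a predecessor index to a dict {op_name: logit}.
    """
    best_per_pred = []
    for pred, logits in alpha_by_pred.items():
        names = list(logits)
        probs = softmax([logits[n] for n in names])
        # Strongest operation on this edge, ignoring the zero operation.
        strength, name = max((p, n) for p, n in zip(probs, names) if n != "zero")
        best_per_pred.append((strength, pred, name))
    best_per_pred.sort(reverse=True)
    return [(pred, name) for _, pred, name in best_per_pred[:k]]

# Hypothetical logits for a node with three predecessors.
alpha_node = {
    0: {"sep_conv_3x3": 1.0, "max_pool_3x3": 0.2, "zero": 3.0},
    1: {"sep_conv_3x3": 0.1, "max_pool_3x3": 0.3, "zero": -1.0},
    2: {"sep_conv_3x3": 2.0, "max_pool_3x3": -0.5, "zero": 0.0},
}
selected = derive_node(alpha_node, k=2)
```

In this example predecessor 0 is dropped: its large zero logit leaves little probability mass for any non-zero operation on that edge, so the edges from predecessors 2 and 1 are retained.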
Our experiments on CIFAR-10 and PTB consist of two stages, architecture search (Sect. 3.1) and architecture evaluation (Sect. 3.2). In the first stage, we search for the cell architectures using DARTS, and determine the best cells based on their validation performance. In the second stage, we use these cells to construct larger architectures, which we train from scratch and evaluate on the test set. We also investigate the transferability of the best cells learned on CIFAR-10 and PTB by evaluating them on ImageNet and WikiText-2 (WT2) respectively.
We include the following operations in $\mathcal{O}$: $3 \times 3$ and $5 \times 5$ separable convolutions, $3 \times 3$ and $5 \times 5$ dilated separable convolutions, $3 \times 3$ max pooling, $3 \times 3$ average pooling, identity, and zero. All operations are of stride one (if applicable) and the convolved feature maps are padded to preserve their spatial resolution. We use the ReLU-Conv-BN order for convolutional operations, and each separable convolution is always applied twice (Zoph et al., 2018; Real et al., 2018; Liu et al., 2018a).
Our convolutional cell consists of $N = 7$ nodes, among which the output node is defined as the depthwise concatenation of all the intermediate nodes (input nodes excluded). The rest of the setup follows Zoph et al. (2018); Liu et al. (2018a); Real et al. (2018), where a network is then formed by stacking multiple cells together. The first and second nodes of cell $k$ are set equal to the outputs of cell $k-2$ and cell $k-1$, respectively, and $1 \times 1$ convolutions are inserted as necessary. Cells located at the $1/3$ and $2/3$ of the total depth of the network are reduction cells, in which all the operations adjacent to the input nodes are of stride two. The architecture encoding therefore is $(\alpha_{normal}, \alpha_{reduce})$, where $\alpha_{normal}$ is shared by all the normal cells and $\alpha_{reduce}$ is shared by all the reduction cells.
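The placement of reduction cells in the stack can be sketched as below. The helper names are hypothetical, and rounding the one-third and two-thirds positions down with integer division is an assumption made for illustration; the text only states the fractions.

```python
def cell_layout(num_cells):
    """Label each cell in the stack, with reduction cells at 1/3 and 2/3 of the depth."""
    reduce_at = {num_cells // 3, 2 * num_cells // 3}  # assumed rounding rule
    return ["reduce" if i in reduce_at else "normal" for i in range(num_cells)]

def feature_map_sizes(num_cells, input_size):
    """Spatial resolution after each cell, halved by every reduction cell."""
    size, sizes = input_size, []
    for kind in cell_layout(num_cells):
        if kind == "reduce":
            size //= 2  # stride-two operations next to the cell inputs
        sizes.append(size)
    return sizes

search_layout = cell_layout(8)            # the small network used during search
search_sizes = feature_map_sizes(8, 32)   # CIFAR-10 images are 32 x 32
```

Under this rounding rule an 8-cell stack has reduction cells at positions 2 and 5 (zero-based), and a 20-cell stack at positions 6 and 13.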
Detailed experimental setup for this section can be found in Sect. A.1.1.
Our set of available operations includes linear transformations followed by one of tanh, relu, sigmoid activations, as well as the identity mapping and the zero operation. The choice of these candidate operations follows Zoph & Le (2017); Pham et al. (2018b).
Our recurrent cell consists of $N = 12$ nodes. The very first intermediate node is obtained by linearly transforming the two input nodes, adding up the results and then passing through a tanh activation function, as done in the ENAS cell (Pham et al., 2018b). The rest of the cell is learned. Other settings are similar to ENAS, where each operation is enhanced with a highway bypass (Zilly et al., 2016) and the cell output is defined as the average of all the intermediate nodes. As in ENAS, we enable batch normalization in each node to prevent gradient explosion during architecture search, and disable it during architecture evaluation. Our recurrent network consists of only a single cell, i.e. we do not assume any repetitive patterns within the recurrent architecture.
Detailed experimental setup for this section can be found in Sect. A.1.2.
Figure 3: Search progress of DARTS for convolutional cells on CIFAR-10 and recurrent cells on Penn Treebank. We keep track of the most recent architectures over time. Each architecture snapshot is re-trained from scratch using the training set (for 100 epochs on CIFAR-10 and for 300 epochs on PTB) and then evaluated on the validation set. For each task, we repeat the experiments 4 times with different random seeds, and report the median and the best (per run) validation performance of the architectures over time. As references, we also report the results (under the same evaluation setup; with comparable number of parameters) of the best existing cells discovered using RL or evolution, including NASNet-A (Zoph et al., 2018) (2000 GPU days), AmoebaNet-A (3150 GPU days) (Real et al., 2018) and ENAS (0.5 GPU day) (Pham et al., 2018b).
To determine the architecture for final evaluation, we run DARTS four times with different random seeds and pick the best cell based on its validation performance obtained by training from scratch for a short period (100 epochs on CIFAR-10 and 300 epochs on PTB). This is particularly important for recurrent cells, as the optimization outcomes can be initialization-sensitive (Fig. 3).
To evaluate the selected architecture, we randomly initialize its weights (weights learned during the search process are discarded), train it from scratch, and report its performance on the test set. We note the test set is never used for architecture search or architecture selection.
Besides CIFAR-10 and PTB, we further investigated the transferability of our best convolutional cell (searched on CIFAR-10) and recurrent cell (searched on PTB) by evaluating them on ImageNet (mobile setting) and WikiText-2, respectively. More details of the transfer learning experiments can be found in Sect. A.2.3 and Sect. A.2.4.
The CIFAR-10 results for convolutional architectures are presented in Table 1. Notably, DARTS achieved comparable results with the state of the art (Zoph et al., 2018; Real et al., 2018) while using three orders of magnitude less computation resources (i.e. 1.5 or 4 GPU days vs 2000 GPU days for NASNet and 3150 GPU days for AmoebaNet). Moreover, with slightly longer search time, DARTS outperformed ENAS (Pham et al., 2018b) by discovering cells with comparable error rates but fewer parameters. The longer search time is due to the fact that we have repeated the search process four times for cell selection. This practice is less important for convolutional cells, however, because the performance of discovered architectures does not strongly depend on initialization (Fig. 3).
To better understand the necessity of bilevel optimization, we investigated a simplistic search strategy, where $\alpha$ and $w$ are jointly optimized over the union of the training and validation sets using coordinate descent. The resulting best convolutional cell (out of 4 runs) yielded 4.16 ± 0.16% test error using 3.1M parameters, which is worse than random search. In the second experiment, we optimized $\alpha$ simultaneously with $w$ (without alteration) using SGD, again over all the data available (training + validation). The resulting best cell yielded 3.56 ± 0.10% test error using 3.0M parameters. We hypothesize that these heuristics would cause $\alpha$ (analogous to hyperparameters) to overfit the training data, leading to poor generalization. Note that $\alpha$ is not directly optimized on the training set in DARTS.
Table 2 presents the results for recurrent architectures on PTB, where a cell discovered by DARTS achieved the test perplexity of 55.7. This is on par with the state-of-the-art model enhanced by a mixture of softmaxes (Yang et al., 2018), and better than all the rest of the architectures that are either manually or automatically discovered. Note that our automatically searched cell outperforms the extensively tuned LSTM (Melis et al., 2018), demonstrating the importance of architecture search in addition to hyperparameter search. In terms of efficiency, the overall cost (4 runs in total) is within 1 GPU day, which is comparable to ENAS and significantly faster than NAS (Zoph & Le, 2017).
It is also interesting to note that random search is competitive for both convolutional and recurrent models, which reflects the importance of the search space design. Nevertheless, with comparable or less search cost, DARTS is able to significantly improve upon random search in both cases (2.76 ± 0.09 vs 3.29 ± 0.15 on CIFAR-10; 55.7 vs 59.4 on PTB).
Results in Table 3 show that the cell learned on CIFAR-10 is indeed transferable to ImageNet. It is worth noticing that DARTS achieves competitive performance with the state-of-the-art RL method (Zoph et al., 2018) while using three orders of magnitude less computation resources.
|Architecture|Top-1 Test Error (%)|Top-5 Test Error (%)|Params (M)|Mult-Adds (M)|Search Cost (GPU days)|Search Method|
|---|---|---|---|---|---|---|
|Inception-v1 (Szegedy et al., 2015)|30.2|10.1|6.6|1448|–|manual|
|MobileNet (Howard et al., 2017)|29.4|10.5|4.2|569|–|manual|
|ShuffleNet 2× () (Zhang et al., 2017)|26.3|–|5|524|–|manual|
|NASNet-A (Zoph et al., 2018)|26.0|8.4|5.3|564|2000|RL|
|NASNet-B (Zoph et al., 2018)|27.2|8.7|5.3|488|2000|RL|
|NASNet-C (Zoph et al., 2018)|27.5|9.0|4.9|558|2000|RL|
|AmoebaNet-A (Real et al., 2018)|25.5|8.0|5.1|555|3150|evolution|
|AmoebaNet-B (Real et al., 2018)|26.0|8.5|5.3|555|3150|evolution|
|AmoebaNet-C (Real et al., 2018)|24.3|7.6|6.4|570|3150|evolution|
|PNAS (Liu et al., 2018a)|25.8|8.1|5.1|588|225|SMBO|
|DARTS (searched on CIFAR-10)|26.7|8.7|4.7|574|4|gradient-based|
Table 4 shows that the cell identified by DARTS transfers to WT2 better than ENAS, although the overall results are less strong than those presented in Table 2 for PTB. The weaker transferability between PTB and WT2 (as compared to that between CIFAR-10 and ImageNet) could be explained by the relatively small size of the source dataset (PTB) for architecture search. The issue of transferability could potentially be circumvented by directly optimizing the architecture on the task of interest.
We presented DARTS, a simple yet efficient architecture search algorithm for both convolutional and recurrent networks. By searching in a continuous space, DARTS is able to match or outperform the state-of-the-art non-differentiable architecture search methods on image classification and language modeling tasks with remarkable efficiency improvement by several orders of magnitude.
There are many interesting directions to improve DARTS further. For example, the current method may suffer from discrepancies between the continuous architecture encoding and the derived discrete architecture. This could be alleviated, e.g., by annealing the softmax temperature (with a suitable schedule) to enforce one-hot selection. It would also be interesting to investigate performance-aware architecture derivation schemes based on the shared parameters learned during the search process.
The authors would like to thank Zihang Dai, Hieu Pham and Zico Kolter for useful discussions.
International Conference on Machine Learning, pp. 549–558, 2018.
A theoretically grounded application of dropout in recurrent neural networks. In NIPS, pp. 1019–1027, 2016.
Automatic differentiation in PyTorch. In NIPS-W, 2017.
Since the architecture will be varying throughout the search process, we always use batch-specific statistics for batch normalization rather than the global moving average. Learnable affine parameters in all batch normalizations are disabled during the search process to avoid rescaling the outputs of the candidate operations.
To carry out architecture search, we hold out half of the CIFAR-10 training data as the validation set. A small network of 8 cells is trained using DARTS for 50 epochs, with batch size (for both the training and validation sets) and the initial number of channels 16. The numbers were chosen to ensure the network can fit into a single GPU. We use momentum SGD to optimize the weights $w$, with initial learning rate (annealed down to zero following a cosine schedule without restart (Loshchilov & Hutter, 2016)), momentum , and weight decay . We use zero initialization for architecture variables (the $\alpha$'s in both the normal and reduction cells), which implies an equal amount of attention (after taking the softmax) over all possible ops. At the early stage this ensures that the weights in every candidate op receive sufficient learning signal (more exploration). We use Adam (Kingma & Ba, 2014) as the optimizer for $\alpha$, with initial learning rate , momentum and weight decay . The search takes one day on a single GPU (all of our experiments were performed using NVIDIA GTX 1080Ti GPUs).
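The cosine schedule without restart mentioned above can be written in a few lines. This is a minimal sketch of the standard cosine annealing formula; the function name `cosine_lr` and the initial learning rate of 0.1 are illustrative assumptions, not the values used in the experiments.

```python
import math

def cosine_lr(step, total_steps, lr_init, lr_min=0.0):
    """Cosine annealing without restart, from lr_init at step 0 down to lr_min."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))

# One value per epoch over a 50-epoch search, with a hypothetical initial rate.
schedule = [cosine_lr(epoch, 50, lr_init=0.1) for epoch in range(51)]
```

The rate starts at its initial value, passes through half of it at the midpoint, and reaches zero at the final epoch.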
For architecture search, both the embedding and the hidden sizes are set to 300. The linear transformation parameters across all incoming operations connected to the same node are shared (their shapes are all $300 \times 300$), as the algorithm always has the option to focus on one of the predecessors and mask away the others. Tying the weights leads to memory savings and faster computation, allowing us to train the continuous architecture using a single GPU. Learnable affine parameters in batch normalizations are disabled, as we did for convolutional cells. The network is then trained for 50 epochs using SGD without momentum, with learning rate , batch size 256, BPTT length 35, and weight decay . We apply variational dropout (Gal & Ghahramani, 2016) of to word embeddings, to the cell input, and to all the hidden nodes. A dropout of is also applied to the output layer. Other training settings are identical to those in Merity et al. (2018); Yang et al. (2018). Similarly to the convolutional architectures, we use Adam for the optimization of $\alpha$ (initialized as zeros), with initial learning rate , momentum and weight decay . The search takes 6 hours on a single GPU.
A large network of 20 cells is trained for 600 epochs with batch size 96. The initial number of channels is increased from 16 to 36 to ensure our model size is comparable with other baselines in the literature (around 3M). Other hyperparameters remain the same as the ones used for architecture search. Following existing works (Pham et al., 2018b; Zoph et al., 2018; Liu et al., 2018a; Real et al., 2018), additional enhancements include cutout (DeVries & Taylor, 2017), path dropout of probability and auxiliary towers with weight . The training takes 1.5 days on a single GPU with our implementation in PyTorch (Paszke et al., 2017). Since the CIFAR results are subject to high variance even with exactly the same setup (Liu et al., 2018b), we report the mean and standard deviation of 10 independent runs for our full model.
A single-layer recurrent network with the discovered cell is trained until convergence with batch size 64 using averaged SGD (Polyak & Juditsky, 1992) (ASGD), with learning rate and weight decay . To speed up training, we start with SGD and trigger ASGD using the same protocol as in Yang et al. (2018); Merity et al. (2018). Both the embedding and the hidden sizes are set to 850 to ensure our model size is comparable with other baselines. The token-wise dropout on the embedding layer is set to 0.1. Other hyperparameters remain exactly the same as those for architecture search. For fair comparison, we do not finetune our model at the end of the optimization, nor do we use any additional enhancements such as dynamic evaluation (Krause et al., 2017) or continuous cache (Grave et al., 2016). The training takes 3 days on a single 1080Ti GPU with our PyTorch implementation. To account for implementation discrepancies, we also incorporated the ENAS cell (Pham et al., 2018b) into our codebase and trained their network under the same setup as our discovered cells.
We consider the mobile setting where the input image size is $224 \times 224$ and the number of multiply-add operations in the model is restricted to be less than 600M.
A network of 14 cells is trained for 250 epochs with batch size 128, weight decay and initial SGD learning rate 0.1 (decayed by a factor of 0.97 after each epoch). Other hyperparameters follow Zoph et al. (2018); Real et al. (2018); Liu et al. (2018a) (we did not conduct extensive hyperparameter tuning). The training takes 12 days on a single GPU.
We use embedding and hidden sizes 700, weight decay , and hidden-node variational dropout 0.15. Other hyperparameters remain the same as in our PTB experiments.
To better understand the effect of depth for architecture search, we conducted architecture search on CIFAR-10 by increasing the number of cells in the stack from 8 to 20. The initial number of channels is reduced from 16 to 6 due to the memory budget of a single GPU. All the other hyperparameters remain the same. The search cost doubles and the resulting cell achieves 2.88 ± 0.09% test error, which is slightly worse than the 2.76 ± 0.09% obtained using a shallower network. This particular setup may have suffered from the enlarged discrepancy of the number of channels between architecture search and final evaluation. Moreover, searching with a deeper model might require different hyperparameters due to the increased number of layers to back-prop through.
In this section, we analyze the complexity of our search space for convolutional cells.
Each of our discretized cells allows $\prod_{k=1}^{4} \frac{(k+1)k}{2} \times 7^2 \approx 10^9$ possible DAGs without considering graph isomorphism (recall we have 7 non-zero ops, 2 input nodes, 4 intermediate nodes with 2 predecessors each). Since we are jointly learning both normal and reduction cells, the total number of architectures is approximately $(10^9)^2 = 10^{18}$. This is greater than the $5.6 \times 10^{14}$ of PNAS (Liu et al., 2018a) which learns only a single type of cell.
Also note that we retained the top-2 predecessors per node only at the very end, and our continuous search space before this final discretization step is even larger. Specifically, each relaxed cell (a fully connected graph) contains $2 + 3 + 4 + 5 = 14$ learnable edges, allowing $(7 + 1)^{14} \approx 4 \times 10^{12}$ possible configurations ($+1$ to include the zero op indicating a lack of connection). Again, since we are learning both normal and reduction cells, the total number of architectures covered by the continuous space before discretization is $(4 \times 10^{12})^2 \approx 10^{25}$.
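The counts in this section follow from the stated cell structure (7 non-zero ops, 2 input nodes, 4 intermediate nodes with 2 predecessors each) and can be reproduced with a short calculation. The function and constant names below are hypothetical.

```python
from math import comb

NUM_NONZERO_OPS = 7
NUM_INPUT_NODES = 2
NUM_INTERMEDIATE_NODES = 4

def predecessors_of(k):
    """Number of earlier nodes visible to the k-th intermediate node (k starts at 1)."""
    return NUM_INPUT_NODES + (k - 1)

def discrete_cell_count():
    """Cells in which every intermediate node picks 2 distinct predecessors and one non-zero op per edge."""
    total = 1
    for k in range(1, NUM_INTERMEDIATE_NODES + 1):
        total *= comb(predecessors_of(k), 2) * NUM_NONZERO_OPS ** 2
    return total

def relaxed_edge_count():
    """Edges of the fully connected relaxed cell."""
    return sum(predecessors_of(k) for k in range(1, NUM_INTERMEDIATE_NODES + 1))

def relaxed_cell_count():
    """Per-edge choices among the non-zero ops plus the zero op."""
    return (NUM_NONZERO_OPS + 1) ** relaxed_edge_count()

discrete = discrete_cell_count()   # 1,037,664,180, i.e. about 10^9
relaxed = relaxed_cell_count()     # 8^14, i.e. about 4.4 * 10^12
```

Squaring each count accounts for learning the normal and reduction cells jointly.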