1 Introduction
Deep Neural Networks (DNNs) have demonstrated state-of-the-art performance on many machine learning tasks such as image recognition
Krizhevsky et al. (2012), speech recognition Hannun et al. (2014), and language modeling Sutskever et al. (2014). Despite these successes, crafting neural architectures is usually a time-consuming and costly process that requires expert knowledge and experience. Recently, Neural Architecture Search (NAS) has drawn much attention from both industry and academia Negrinho and Gordon (2017); Zoph and Le (2016) because it learns better networks automatically from data. NAS approaches can be categorized into two main groups. The methods in the first group use black-box optimization approaches, such as Reinforcement Learning (RL)
Zoph and Le (2016); Pham et al. (2018); Baker et al. (2016); Zoph et al. (2017); Zhong et al. (2017) or Genetic Algorithms (GA)
Real et al. (2017); Xie and Yuille (2017); Liu et al. (2017b); Real et al. (2018), to optimize a reward function. The algorithm of Liu et al. (2017a) is also a black-box scheme, although it uses a slightly more efficient optimization method. The main drawback of black-box optimization approaches is their computational cost: both RL- and GA-based approaches need to train thousands of deep learning models to learn a neural network architecture. The methods in the second group instead formulate neural architecture search as a differentiable optimization problem and use alternating gradient descent to find the optimal solution. One representative example is the differentiable architecture search (DARTS)
Liu et al. (2018) method, which has been shown to perform well on multiple benchmark datasets and is also computationally more efficient than the black-box approaches.

We consider the problem of one-shot NAS, which is critical for resource-constrained applications because different tasks require different neural network architectures. For example, a simple problem such as classifying image color can be handled by a simple architecture, e.g., a two-layer neural network, whereas classifying cats and dogs from images requires a complex neural network. Prior work on NAS in resource-constrained environments is based on black-box optimization and is computationally too expensive for one-shot NAS.
In this paper, we propose Resource-Constrained Differentiable Architecture Search (RC-DARTS) for one-shot NAS with a good balance between efficiency and accuracy. The RC-DARTS method requires the learned architecture to maximize accuracy under user-defined resource constraints. We use FLOPs and model size as resource constraints in this paper for simplicity, but platform-aware speed could also be used as a constraint by fitting a nonlinear mapping from neural network architecture to inference latency on a device. Our method builds upon the differentiable architecture search (DARTS) Liu et al. (2018) by adding resource constraints, turning architecture search into a constrained optimization problem. The search space remains continuous, which allows the use of gradient descent methods. To solve this optimization problem, we propose an iterative projection algorithm that learns architectures in the feasible set defined by the constraints. We also develop a multi-level search strategy to learn different architectures for layers at different depths, considering that layers at different depths account for distinctive proportions of the overall model size and FLOPs. As such, the proposed RC-DARTS enjoys the merits of both a differentiable search space and a cost-aware training process. It is experimentally demonstrated that RC-DARTS learns better lightweight architectures, which are useful on mobile platforms with constrained computing resources. These properties make the RC-DARTS approach suitable for one-shot resource-constrained NAS problems.
To summarize, we make the following contributions:

We propose an end-to-end resource-constrained NAS framework which is trained in a one-shot manner using standard gradient backpropagation. An iterative projection algorithm is introduced to solve the constrained optimization problem.

We present a multi-level search strategy to learn different architectures for layers at different depths of the network. We also learn a new connection cell between adjacent cells. This facilitates learning Pareto-optimal architectures across all layers of a network.

We show that the proposed RC-DARTS algorithm achieves state-of-the-art performance in terms of accuracy, model size, and FLOPs on the CIFAR-10 and ImageNet datasets.
2 Background
2.1 Architectural building block
Since it is computationally expensive to search for the architecture of the whole network on a large-scale dataset, recent NAS methods Zoph et al. (2017); Liu et al. (2017b, a); Pham et al. (2018) usually search for the best architectural building block (or "cell") on a small-scale dataset. Multiple building blocks with the same architecture but independently learned weights are then stacked to create a deeper network for larger datasets.
As shown in Figure 1 (a), a block is represented as a directed acyclic graph (DAG) $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Each node $x^{(i)} \in \mathcal{V}$ is an intermediate representation (e.g. a feature map in convolutional networks), and $N = |\mathcal{V}|$ is the predefined number of nodes in the block. Corresponding to each edge $(i,j) \in \mathcal{E}$, there is an operation $o^{(i,j)}$ taking the intermediate representation $x^{(i)}$ as input and outputting $o^{(i,j)}(x^{(i)})$. Each operation belongs to $\mathcal{O}$, a predefined set of all possible operations including pooling, convolution, zero connection (i.e. no connection between two nodes), etc. The intermediate representation of a node is the summation of all of its predecessors' transformed outputs: $x^{(j)} = \sum_{i \in P_j} o^{(i,j)}(x^{(i)})$, where $P_j$ is the set of predecessors of $x^{(j)}$. Following NASNet Zoph et al. (2017) and DARTS Liu et al. (2018), we define special input nodes and an output node for each block. The first two nodes $x^{(1)}$ and $x^{(2)}$ in a block are the input nodes, which transform the outputs of the previous two blocks, respectively. For the first block, both input nodes take the input images. The node $x^{(N)}$ is the output node, which is the concatenation of all intermediate nodes, i.e. $x^{(N)} = \operatorname{concat}(x^{(3)}, \ldots, x^{(N-1)})$.
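The node update rule above can be sketched in a few lines; scalar values stand in for feature maps here, and the toy operation set is purely illustrative, not the paper's search space.

```python
def cell_forward(input0, input1, ops):
    """Forward pass of a DARTS-style block: ops[(i, j)] is the operation on
    edge i -> j, nodes 0 and 1 are the block inputs, and every intermediate
    node is the sum of its predecessors' transformed outputs."""
    nodes = [input0, input1]
    num_intermediate = len({j for (_, j) in ops})
    for j in range(2, 2 + num_intermediate):
        nodes.append(sum(ops[(i, j)](nodes[i]) for i in range(j) if (i, j) in ops))
    return nodes[2:]  # the output node concatenates these along channels

# Toy usage with scalar "feature maps": node 2 = 1 + 2*2 = 5, node 3 = 1 + 5 = 6.
ops = {(0, 2): lambda x: x, (1, 2): lambda x: 2 * x,
       (0, 3): lambda x: x, (2, 3): lambda x: x}
out = cell_forward(1.0, 2.0, ops)
```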
2.2 DARTS
Since $\mathcal{O}$ is a discrete set, most NAS methods employ time-consuming RL or genetic algorithms to select the best operation for each edge of a cell. Recently, DARTS proposed a continuous relaxation that makes the architecture search space continuous, so that the architecture can be optimized through gradient descent. Concretely, the categorical operation for edge $(i,j)$ is replaced by a mixing operation $\bar{o}^{(i,j)}$ which outputs the softmax-weighted sum of all possible operations in $\mathcal{O}$: $\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}\, o(x)$. The mixing weights for edge $(i,j)$ are parameterized by a vector $\alpha^{(i,j)} \in \mathbb{R}^{|\mathcal{O}|}$. The discrete architecture is derived in two steps. The first step retains the strongest predecessors of each node, based on the strength of the corresponding edges; the strength of edge $(i,j)$ is defined as $\max_{o \in \mathcal{O},\, o \neq \text{zero}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}$. The second step replaces the mixing operation on edge $(i,j)$ with the single operation that has the largest mixing weight:
$o^{(i,j)} = \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$.   (1)
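A minimal pure-Python sketch of the continuous relaxation and the discretization step; the candidate operations and mixing weights below are illustrative:

```python
import math

def mixed_op(x, alphas, ops):
    """Continuous relaxation: the edge output is the softmax-weighted sum of
    every candidate operation's output (the mixing operation)."""
    exps = [math.exp(a) for a in alphas]
    z = sum(exps)
    return sum((e / z) * op(x) for e, op in zip(exps, ops))

def discretize(alphas, ops):
    """Derivation step of Eq. (1): keep only the operation with the largest
    mixing weight (DARTS additionally excludes the zero operation here)."""
    best = max(range(len(alphas)), key=lambda k: alphas[k])
    return ops[best]

ops = [lambda x: x, lambda x: 2.0 * x]   # toy candidate operations
y = mixed_op(2.0, [0.0, 0.0], ops)       # equal weights: 0.5*2 + 0.5*4 = 3.0
best_op = discretize([0.1, 2.0], ops)    # selects the second (doubling) op
```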
The objective of DARTS is to learn the neural network architecture by optimizing the following bilevel problem:
$\min_{\alpha} \ \mathcal{L}_{val}(w^*(\alpha), \alpha) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha)$   (2)
where $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ denote the losses on the training and validation datasets, respectively, and $w$ denotes the weights of the operations. It is difficult to obtain an exact solution of Eq. (2) for both the weights $w$ and the architecture hyperparameters $\alpha$ at the same time; DARTS therefore uses coordinate gradient descent to alternately update the weights and the hyperparameters, fixing one while updating the other.

3 RC-DARTS
RC-DARTS aims to learn deep architectures under resource constraints, such as memory, FLOPs, and inference speed, which are very important for both mobile platforms and real-time applications. In Section 3.1, we formulate RC-DARTS as a constrained optimization problem and introduce its objective function. In Section 3.2, we introduce the iterative projection method to efficiently optimize the constrained objective function of RC-DARTS. In Section 3.3, we propose a multi-level search strategy to adaptively search different architectures for different layers; a new connection cell is introduced to better trade off resource costs against accuracy. An overview of RC-DARTS is illustrated in Figure 1.
3.1 Objective function of RC-DARTS
Different from DARTS, RC-DARTS adds resource constraints to the objective function, requiring the learned architectures to satisfy task-dependent resource constraints such as model size and computational complexity. The objective function of RC-DARTS is
$\min_{\alpha} \ \mathcal{L}_{val}(w^*(\alpha), \alpha)$   (3)
s.t.  $w^*(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha)$   (4)
$C^{low} \preceq \Phi(\alpha) \preceq C^{up}$   (5)
where $\Phi(\alpha) = [\phi_1(\alpha), \ldots, \phi_M(\alpha)]$ consists of $M$ cost functions, each of which maps the architecture hyperparameters $\alpha$ to a particular resource cost. $C^{low}$ and $C^{up}$ are user-defined lower and upper bounds on the costs, respectively, i.e., the costs are constrained to lie in the range $[C^{low}, C^{up}]$. In this work, we consider two widely used resource costs in Eq. (5) (i.e. $M = 2$): the number of parameters and the number of floating-point operations (FLOPs).
We introduce the functional form of $\Phi(\alpha)$ as follows. The exact cost of a neural network architecture can be computed by deriving a discretized architecture from the hyperparameters $\alpha$ according to Eq. (1) and computing the cost of the discretized network. However, since this discretization is not a continuous function of $\alpha$, it is challenging to optimize the objective with gradient descent. Similar to DARTS, we apply a continuous relaxation to the resource constraints, where the cost of edge $(i,j)$ is calculated as the softmax over all possible operations' costs:
$\phi_m(\alpha) = \sum_{j} \sum_{i} \mathbb{1}(i \in P_j)\, \sigma(\alpha^{(i,j)})^{\top} c_m^{(i,j)}$   (6)
where $c_m^{(i,j)} \in \mathbb{R}^{|\mathcal{O}|}$ consists of the costs of all operations in $\mathcal{O}$ under the $m$-th resource, $\sigma$ is the softmax function, $\mathbb{1}(\cdot)$ is the indicator function, and $P_j$ is the set of predecessor nodes of node $x^{(j)}$ (ref. Section 2.2 for the selection of predecessor nodes). Eq. (6) uses the expectation of resource costs in a cell as an approximation to the actual cost of the discrete architecture derived from $\alpha$. There are two advantages to this functional form. First, since Eq. (6) is differentiable w.r.t. $\alpha$, it enables the use of gradient descent to optimize the objective function of RC-DARTS. Second, it is easy to implement, because the resource cost of each candidate operation on an edge is independent of the values of $\alpha$; $c_m^{(i,j)}$ is therefore fixed and can be computed before training. To optimize a more complicated resource constraint, such as inference speed on a particular platform, one could also learn a neural network mapping architecture hyperparameters to resource cost. We leave this as future work.
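Because the per-operation costs are fixed, the relaxed cost of Eq. (6) reduces to a softmax-weighted dot product per edge, summed over edges. A minimal sketch, with illustrative edge keys and cost values:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def expected_cost(alphas, op_costs):
    """Relaxed resource cost in the spirit of Eq. (6): for each edge, take the
    softmax of its architecture parameters dotted with the fixed, precomputed
    per-operation cost vector, then sum over edges."""
    total = 0.0
    for edge, a in alphas.items():
        total += sum(w * c for w, c in zip(softmax(a), op_costs[edge]))
    return total

# Two toy edges with two candidate operations each.
alphas = {(0, 2): [0.0, 0.0], (1, 2): [math.log(3.0), 0.0]}
op_costs = {(0, 2): [2.0, 4.0], (1, 2): [1.0, 5.0]}
cost = expected_cost(alphas, op_costs)  # 0.5*2+0.5*4 = 3.0, plus 0.75*1+0.25*5 = 2.0
```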
Note that we set both lower-bound and upper-bound constraints on $\Phi(\alpha)$ to prevent the model from learning oversimplified architectures: the lower-bound constraints ensure that the model has sufficient representational capacity. Since the cost functions are non-convex in $\alpha$ because of the softmax in Eq. (6), there is no closed-form solution to the objective function. In the following, we introduce an iterative projection method to optimize the constrained objective.
3.2 Iterative projection
The proposed iterative projection method optimizes the objective function of RC-DARTS in two alternating phases. Phase I is unconstrained training, which searches for better architectures by learning in a larger parameter space without constraints. Phase II is architecture projection, which projects the network architecture parameters output by phase I onto the nearest point in the feasible set defined by the constraints in Eq. (5). The procedure is shown in Algorithm 1.
3.2.1 Unconstrained training
In this phase, the objective function of RC-DARTS is the same as that of DARTS (i.e. Eq. (2)) because the constraints are not considered. As illustrated in Algorithm 1, $w$ and $\alpha$ are alternately updated for a fixed number of iterations via gradient descent. Note that in step 2 we adopt the simple heuristic of optimizing $\alpha$ by assuming $w$ and $\alpha$ are independent, which corresponds to the first-order approximation in DARTS, because it is more computationally efficient while achieving good performance.

There are two benefits to performing phase I. First, by jointly optimizing the weights and the architecture, the network learns a good starting point for the projection in phase II. Second, after a phase II step (note that phase I and phase II are performed in an iterative way), phase I learns the architecture in a larger parameter space with no constraints on $\alpha$. Thus, even if a projection step results in a suboptimal network architecture, the unconstrained training step can still improve upon it. Besides, it is observed that networks in the initial training stage are fragile to perturbations of the weights Yosinski et al. (2014). To improve training stability, we use a warm-start strategy, i.e. the first phase I in Algorithm 1 runs for more iterations than subsequent ones. The warm start reduces the risk of falling into a bad local optimum in the projection step.
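The phase I alternating update can be sketched with toy quadratic losses standing in for the real training/validation losses (all names and losses below are illustrative, not the paper's):

```python
def unconstrained_training(w, alpha, grad_w_train, grad_alpha_val,
                           lr_w=0.1, lr_a=0.1, steps=100):
    """First-order alternating updates as in phase I: one gradient step on the
    weights w under the training loss, then one step on the architecture
    parameters alpha under the validation loss, each treating the other as a
    constant (the first-order DARTS approximation)."""
    for _ in range(steps):
        w = w - lr_w * grad_w_train(w, alpha)
        alpha = alpha - lr_a * grad_alpha_val(w, alpha)
    return w, alpha

# Toy losses: L_train(w, a) = (w - a)^2 and L_val(w, a) = (w - 1)^2 + (a - 1)^2.
w, a = unconstrained_training(
    0.0, 5.0,
    grad_w_train=lambda w, a: 2.0 * (w - a),
    grad_alpha_val=lambda w, a: 2.0 * (a - 1.0),
)
# Both w and alpha approach 1, the minimizer of the validation loss.
```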
3.2.2 Architecture projection
In this phase, we project the architecture hyperparameters $\alpha$ output by phase I onto the nearest point in the feasible set defined by Eq. (5). The objective of the projection is
$\min_{\alpha'} \ \tfrac{1}{2}\|\alpha' - \alpha\|_2^2 \quad \text{s.t.} \quad C^{low} \preceq \Phi(\alpha') \preceq C^{up}$   (7)
Because the $\phi_m$ are non-convex functions of $\alpha$ (ref. Eq. (6)), there is no closed-form solution to Eq. (7). We transform Eq. (7) into its Lagrangian form, Eq. (8), for optimization.
$\min_{\alpha'} \ \tfrac{1}{2}\|\alpha' - \alpha\|_2^2 + \sum_{m=1}^{M} \Big[ \lambda_m \max\big(\phi_m(\alpha') - C^{up}_m,\, 0\big) + \mu_m \max\big(C^{low}_m - \phi_m(\alpha'),\, 0\big) \Big]$   (8)
We use regular gradient descent to optimize Eq. (8). At time step 0, $\alpha'$ is initialized to $\alpha$. At each subsequent time step, $\alpha'$ is updated by descending along the gradient of Eq. (8). The update is carried out iteratively until all constraints are satisfied or the maximum number of iterations is reached. The resulting architecture hyperparameters are used as the initialization of the next phase I (i.e. step 4 in Algorithm 1).
In our experiments, we set the weighting terms $\lambda_m$ and $\mu_m$ to be identical for all constraints for simplicity. To facilitate convergence, the gap between the applied bounds and the target bounds is diminished exponentially during training, so that at the end of training the applied bounds equal $C^{low}$ and $C^{up}$. Since $\Phi(\alpha)$ is fast to compute for simple resource constraints, the proposed iterative projection step is much faster than the unconstrained training step. Thus, RC-DARTS is almost as efficient as regular DARTS.
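A self-contained sketch of the projection phase, assuming a proximal term plus hinge penalties on a single scalar cost bound; central finite differences stand in for autograd, and the toy edge cost is illustrative:

```python
import math

def project(alpha, cost_fn, c_lo, c_hi, lam=50.0, lr=0.05, max_iter=500):
    """Sketch of phase II: gradient descent on
    0.5 * ||a - alpha||^2 + lam * (hinge violation of the cost bounds),
    stopping as soon as the relaxed cost lands inside [c_lo, c_hi]."""
    def objective(a):
        c = cost_fn(a)
        violation = max(c - c_hi, 0.0) + max(c_lo - c, 0.0)
        proximal = 0.5 * sum((x - y) ** 2 for x, y in zip(a, alpha))
        return proximal + lam * violation

    a, eps = list(alpha), 1e-4
    for _ in range(max_iter):
        if c_lo <= cost_fn(a) <= c_hi:  # all constraints satisfied
            break
        grad = []
        for k in range(len(a)):         # central finite differences
            up, down = a[:], a[:]
            up[k] += eps
            down[k] -= eps
            grad.append((objective(up) - objective(down)) / (2 * eps))
        a = [x - lr * g for x, g in zip(a, grad)]
    return a

# Toy edge with two candidate ops of cost 1 and 10; push the softmax-weighted
# expected cost under an upper bound of 3.
def edge_cost(a):
    e = [math.exp(x) for x in a]
    return (e[0] * 1.0 + e[1] * 10.0) / sum(e)

alpha_projected = project([0.0, 3.0], edge_cost, 0.0, 3.0)
```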
3.3 Multi-level architecture search
In previous methods Zoph et al. (2017); Liu et al. (2017b, a); Pham et al. (2018), a single architecture is shared by all normal cells and another is shared by all reduction cells. We argue that such a simple solution is suboptimal for learning resource-constrained networks, for two reasons. First, cells at different network depths exhibit large variation in resource cost (#params and #FLOPs), because the number of filter channels is increased wherever the resolution of the input is reduced. This design is widely used in current deep networks Zoph et al. (2017); Liu et al. (2017b, a); Pham et al. (2018) to avoid bottlenecks in the information flow Szegedy et al. (2016): low-level layers (i.e. layers near the input) have larger FLOPs than high-level layers, while high-level layers have more parameters than low-level layers. To make the learned architectures satisfy given resource constraints, it is therefore better to let the architecture of a cell vary with its depth.
Second, cells at different depths have different effects on the network's overall performance. For example, it is observed that low-level layers (near the input) are less sensitive to reductions in the number of parameters Han et al. (2015).
Because of the above two reasons, we propose the multi-level search strategy to achieve a better trade-off between the architecture's resource costs and accuracy. Specifically, we evenly divide all layers learned by normal cells into groups, where cells in each group share the same architecture. In this way, RC-DARTS is encouraged to learn architectures adaptively for different cells. In addition, to obtain more lightweight architectures, we learn a new type of cell, named the connection cell, to learn the candidate connections between cells instead of predefining the connections to be 1x1 convolutions as in Zoph et al. (2017); Liu et al. (2017b, a); Pham et al. (2018). The connection cell is formulated in the same way as normal/reduction cells (ref. Section 2.1); the differences are that there is only one input node and one intermediate node inside a connection cell. More details on the connection cell are given in the experiments section.
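The even division of normal-cell layers into architecture-sharing groups can be sketched as follows (the function name and grouping policy are illustrative):

```python
def assign_cell_groups(num_normal_layers, num_groups):
    """Multi-level search sketch: evenly divide the normal-cell layers into
    groups; cells within a group share one architecture (one alpha vector),
    so different depths can learn different architectures."""
    base, rem = divmod(num_normal_layers, num_groups)
    groups, layer = [], 0
    for g in range(num_groups):
        size = base + (1 if g < rem else 0)  # spread any remainder evenly
        groups.append(list(range(layer, layer + size)))
        layer += size
    return groups
```

For example, six normal layers split into three groups yields groups of two layers each, every group learning its own cell architecture.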
4 Experiments
Similar to DARTS, our experiments consist of three steps. First, RC-DARTS is applied to search for convolutional cells on CIFAR-10, and the best cells, selected by their validation performance during search, are used as building blocks in the following steps. Then, a larger network is constructed by stacking the selected cells and retrained on CIFAR-10 for comparison between RC-DARTS and other state-of-the-art methods. Finally, we show that the searched convolutional cells are transferable to large datasets through experiments on ImageNet.
4.1 Architecture search on CIFAR-10
4.1.1 Experimental settings
Dataset. CIFAR-10 Krizhevsky and Hinton (2009) is a popular image classification benchmark containing 50,000 training images and 10,000 test images. During the architecture search process, we follow DARTS Liu et al. (2018) and evenly divide the training set into two parts, used as the training and validation sets for architecture search.
Search space.
As described in Section 3.3, there are three types of cells in RC-DARTS: normal, reduction, and connection cells. Following Zoph et al. (2017); Liu et al. (2017b, a); Pham et al. (2018); Liu et al. (2018), reduction cells are placed at 1/3 and 2/3 of the total depth of the network, between normal cells. Connection cells are placed between every pair of connected cells to transform the number of channels of the previous cell's output to that of the current cell's input.
For each reduction cell and normal cell, there are $N$ nodes, including two input nodes and one output node (see the method section for the definition of input and output nodes). We include the following 8 types of operations in $\mathcal{O}$ for each edge of a normal or reduction cell: 3x3 and 5x5 separable convolutions; 3x3 and 5x5 dilated separable convolutions; 3x3 max pooling; 3x3 average pooling; identity connection (i.e. the output is equal to the input); and zero (i.e. no connection). Both the separable and dilated convolution operators consist of a ReLU-Conv-BN sequence; for separable convolution, the "ReLU-Conv-BN" block is applied twice with independently learned weights. There are four candidate operations on each edge of a connection cell: a dilated convolution, and group convolutions with the number of groups equal to 1, 2, and 4, respectively. For group convolutions with more than one group, channel shuffling Zhang et al. (2017b) is used.
Training settings
Following DARTS Liu et al. (2018), half of the CIFAR-10 training data is held out as the validation set for architecture search. A small network of stacked cells is trained using RC-DARTS for 50 epochs, with batch size 64 (for both the training and validation iterations) and an initial channel count of 16. Momentum SGD and the Adam optimizer are used to alternately optimize the weights $w$ and the architecture hyperparameters $\alpha$ in the unconstrained training step. In the architecture projection step (ref. Eq. (8)), we use the Adam optimizer with an initial learning rate of 3e-4 and no weight decay. The number of iterations in phase I is set to 150 (i.e. 0.5 epoch); the number of iterations in phase II is set to 500. The lower/upper bounds for the constraints on #params and FLOPs are 1.8e5/2.0e5 and 2.8e7/3.3e7, respectively. These values were chosen by experiment. All experiments run in PyTorch on NVIDIA V100 GPUs.
4.2 Interpreting architecture search
In Figure 2, we compare the normal and reduction cells learned by RC-DARTS with those learned by DARTS. We make the following observations:

Compared with the normal cell in DARTS, the normal cells in RC-DARTS contain more inexpensive operations. For example, as shown in Figure 2, there are more skip connections and max-pooling operations in the mid-level/high-level cells of RC-DARTS. Compared to DARTS, RC-DARTS also uses more dilated convolutions instead of separable convolutions, because the latter are more computationally expensive. These results demonstrate that RC-DARTS effectively learns lightweight architectures when resource constraints are imposed.

Compared to DARTS, the normal cells in RC-DARTS have more connections between intermediate nodes. For example, in the DARTS normal cell (Figure 2), only the node indexed by 3 is connected to another intermediate node (the node indexed by 2). In contrast, the connections among nodes in the RC-DARTS cells are much denser. Since connections between nodes implicitly increase the depth of the network, we conjecture that networks with more connections between intermediate nodes have stronger learning capabilities for the same number of parameters. This is consistent with expert knowledge on designing effective neural network architectures; for example, the efficient Inception models Szegedy et al. (2016, 2017) have rich connections inside each inception module.
4.3 Architecture evaluation
4.3.1 CIFAR-10
Experimental settings
For a fair comparison with DARTS, we follow Liu et al. (2018) to build a larger network of stacked cells, which is trained for 600 epochs with batch size 96. Other training hyperparameters are also the same as in DARTS. Following Pham et al. (2018); Zoph et al. (2017); Liu et al. (2017b); Real et al. (2018), we add additional improvements including cutout DeVries and Taylor (2017), path dropout with probability 0.3, and an auxiliary tower with weight 0.4. We report the mean results of 4 independent training runs of our full model. To compare with DARTS at the same number of parameters, we adjust the channel multiplier of both DARTS and RC-DARTS so that their model sizes are roughly the same. The resulting models are denoted "{model}-C{#channels}".
Results
The comparison of RC-DARTS with other state-of-the-art methods on CIFAR-10 is presented in Table 1. Notably, RC-DARTS achieves results comparable to the state of the art Zoph et al. (2017); Real et al. (2018) while using three orders of magnitude fewer computational resources (i.e. 1 GPU day vs. 1800 GPU days for NASNet and 3150 GPU days for AmoebaNet). Compared with the full-sized DARTS of Liu et al. (2018), which has an initial channel count of 16, RC-DARTS-C42 achieves competitive performance with fewer parameters and a lower search cost (1 GPU day versus 4). Since RC-DARTS aims to learn lightweight models under resource constraints, we put more emphasis on the comparison of models with low resource budgets. Compared with the compressed DARTS models (i.e. DARTS-C20 and DARTS-C12), RC-DARTS-C22 and RC-DARTS-C14 outperform them by a large margin, which clearly demonstrates the effectiveness of RC-DARTS at learning resource-constrained architectures.
Compared with DPP-Net, as shown in Table 1, on CIFAR-10, RC-DARTS-C14 reduces the error rate of DPP-Net by 1.67% (a 28.6% relative improvement). We use the RC-DARTS-C54 model for a clearer comparison with DPP-Net on ImageNet. With similar FLOPs (520M vs. 523M), RC-DARTS-C54 achieves a lower error rate (25.7% vs. 26.0%), a smaller model size (4.4M vs. 4.8M), and a lower search cost (1 vs. 8 GPU days). These results demonstrate the advantages of RC-DARTS over DPP-Net.
Table 1: Comparison with state-of-the-art architectures on CIFAR-10.

Architecture | Test Error (%) | Params (M) | Search Cost (GPU days) | Search Method
DenseNet-BC Huang et al. (2017b) | 3.46 | 25.6 | – | manual
NASNet-A + cutout Zoph et al. (2017) | 2.65 | 3.3 | 1800 | RL
NASNet-A + cutout Zoph et al. (2017) | 2.83 | 3.1 | 3150 | RL
AmoebaNet-A + cutout Real et al. (2018) | 3.34 ± 0.06 | 3.2 | 3150 | evolution
AmoebaNet-A + cutout Real et al. (2018) | 3.12 | 3.1 | 3150 | evolution
AmoebaNet-B + cutout Real et al. (2018) | 2.55 ± 0.05 | 2.8 | 3150 | evolution
Hierarchical Evo Liu et al. (2017b) | 3.75 ± 0.12 | 15.7 | 300 | evolution
PNAS Liu et al. (2017a) | 3.41 ± 0.09 | 3.2 | 225 | SMBO
ENAS + cutout Pham et al. (2018) | 2.89 | 4.6 | 0.5 | RL
DPP-Net Dong et al. (2018) | 5.84 | 0.45 | 8 | SMBO
DARTS Liu et al. (2018) | 2.83 ± 0.06 | 3.4 | 4 | gradient-based
DARTS-C20 | 3.44 | 1.0 | 4 | gradient-based
DARTS-C12 | 4.86 | 0.48 | 4 | gradient-based
RC-DARTS-C42 | 2.81 ± 0.03 | 3.3 | 1 | gradient-based
RC-DARTS-C22 | 3.02 | 1.0 | 1 | gradient-based
RC-DARTS-C14 | 4.17 | 0.43 | 1 | gradient-based
Table 2: Comparison with state-of-the-art architectures on ImageNet (mobile setting).

Architecture | Top-1 Err (%) | Top-5 Err (%) | Params (M) | FLOPs (M) | Search Cost (GPU days) | Search Method
Inception-V1 Szegedy et al. (2015) | 30.2 | 10.1 | 6.6 | 1448 | – | manual
MobileNet Howard et al. (2017) | 29.4 | 10.5 | 4.2 | 569 | – | manual
ShuffleNet 2x (v1) Zhang et al. (2017b) | 29.1 | 10.2 | 5 | 524 | – | manual
ShuffleNet 2x (v1) Zhang et al. (2017b) | 26.3 | – | 5 | 524 | – | manual
MnasNet Tan et al. (2018) | 26.0 | 8.25 | 4.2 | 317 | 1800 | RL
NASNet-A Zoph et al. (2017) | 26.0 | 8.4 | 5.3 | 564 | 1800 | RL
NASNet-B Zoph et al. (2017) | 27.2 | 8.7 | 5.3 | 488 | 1800 | RL
NASNet-C Zoph et al. (2017) | 27.5 | 9.0 | 4.9 | 558 | 1800 | RL
AmoebaNet-A Real et al. (2018) | 25.5 | 8.0 | 5.1 | 555 | 3150 | evolution
AmoebaNet-B Real et al. (2018) | 26.0 | 8.5 | 5.3 | 555 | 3150 | evolution
AmoebaNet-C Real et al. (2018) | 24.3 | 7.6 | 6.4 | 570 | 3150 | evolution
PNAS Liu et al. (2017a) | 25.8 | 8.1 | 5.1 | 588 | 225 | SMBO
DPP-Net Dong et al. (2018) | 26.0 | 8.2 | 4.8 | 523 | 8 | SMBO
DARTS | 26.9 | 9.0 | 4.9 | 595 | 4 | gradient-based
DARTS-C24 | 34.8 | 13.8 | 1.4 | 140 | 4 | gradient-based
RC-DARTS-C58 | 25.1 | 7.8 | 4.9 | 590 | 1 | gradient-based
RC-DARTS-C28 | 32.9 | 13.3 | 1.4 | 138 | 1 | gradient-based
4.3.2 ImageNet
Experimental settings
Ablative study
We evaluate the effects of MLS and CC on DARTS and RC-DARTS over four combinations (with/without MLS/CC). We adjust the channel multipliers of the different models so that they are compared at comparable model sizes. Two model sizes (4.9M and 1.4M) are used to evaluate models with different resource costs. All other training settings are the same as in Sec. 4.1.1. Average values over four independent runs are reported to reduce randomness.
Table 3: Ablation study of multi-level search (MLS) and connection cells (CC); error rates (%) on the ImageNet test set.

Architecture | w/o (MLS+CC) | w/ MLS | w/ CC | w/ (MLS+CC) | Params (M)
DARTS | 26.9 | 29.3 | 26.4 | 28.6 | 4.9
RC-DARTS | 26.0 | 25.6 | 25.4 | 25.1 | 4.9
DARTS | 34.8 | 37.7 | 34.3 | 36.2 | 1.4
RC-DARTS | 33.7 | 33.4 | 33.5 | 32.9 | 1.4
When neither MLS nor CC is used, RC-DARTS reduces the error rates of DARTS at similar model sizes (4.9M/1.4M) by 0.9%/1.1%, respectively. These results verify the effectiveness of the proposed iterative projection algorithm. Using MLS improves the performance of RC-DARTS, as it adaptively learns architectures for layers at different depths (see Sec. 3.3). Interestingly, we find that MLS incurs an accuracy loss for DARTS. This may be due to overfitting caused by the enlarged architecture parameter space (note that its size is proportional to the number of levels in MLS). Experimental results support this explanation: compared with RC-DARTS, DARTS has higher error rates on the test set but lower error rates on the training set. In contrast, the resource constraints (i.e. Eq. (5)) in RC-DARTS can be seen as regularizers on the parameter space, reducing the risk of overfitting. Using CC consistently reduces the error rates of both DARTS and RC-DARTS. Finally, RC-DARTS with both MLS and CC achieves the lowest error rates among all models.
Results
Table 2 lists the results of the evaluation on the ImageNet dataset, and shows that the cells learned on CIFAR-10 are transferable to ImageNet. Notably, RC-DARTS achieves competitive performance with the state-of-the-art RL method Zoph et al. (2017) while using three orders of magnitude fewer computational resources for architecture search. Compared with DARTS at the same number of parameters and FLOPs, RC-DARTS significantly outperforms DARTS in terms of accuracy.
Compared with MobileNetV2 x1.4, RC-DARTS-C58 achieves the same error rate with comparable FLOPs while using only 71% of the former's model size (4.9M vs. 6.9M). Besides, we also compare RC-DARTS-C46 (i.e. #channels set to 46) with MobileNetV2. The former achieves a top-1 error rate of 27.6% with 380M FLOPs and 3.2M parameters. Thus, with comparable FLOPs, RC-DARTS-C46 achieves a lower error rate with a smaller model size. These experiments demonstrate the advantage of RC-DARTS over MobileNetV2 in learning lightweight models. Although MnasNet-92 has fewer FLOPs than RC-DARTS-C58 at a comparable error rate, the former requires >2000 times more search cost than the latter, prohibiting its use when computational budgets are limited.
Lastly, we note that both MobileNetV2 and MnasNet use the more efficient inverted residual block (IRB) Sandler et al. (2018) as the basic block/cell. Since the set of candidate operations is orthogonal to our proposed methods, we could conveniently replace the ordinary convolutional operations in RC-DARTS with IRBs to boost performance; as this is out of the scope of this paper (i.e. one-shot NAS with resource constraints), we leave this investigation to future work. Although RC-DARTS has larger FLOPs than MnasNet Tan et al. (2018) at a comparable error rate, the search cost of the former is an order of magnitude smaller, which is essential for one-shot architecture search.
5 Related Work
Recently, automatic neural architecture search (NAS) has been demonstrated to be effective on multiple AI tasks, achieving state-of-the-art results. We classify works on NAS into three types according to the employed optimization approach: reinforcement-learning-based methods, evolutionary-algorithm-based methods, and methods using other optimization approaches. In addition, we also review works that consider multiple objectives in optimization, as well as manually designed mobile architectures.
RL-based methods
In the seminal work of Zoph and Le (2016), a Neural Architecture Search (NAS) method is developed to search for network architectures. NAS consists of two basic components: an LSTM-based controller which generates layer-wise architecture options (e.g. filter size/stride, pooling, etc.), and the REINFORCE algorithm Williams (1992), which updates the weights of the controller using the accuracy of sampled architectures as the reward. Many follow-up works use a similar pipeline. Zoph et al. (2017) propose NASNet, introducing a modular search space to reduce the search cost and enhance the generalization capability of searched architectures: the basic architectural building block (named "cell" therein) is first learned on a small-scale dataset and then transferred to a large-scale dataset by stacking multiple copies. ENAS Pham et al. (2018) speeds up NAS by sharing parameters across child models during the search process; it is observed that such weight sharing also improves the performance of searched models. Cai et al. (2018) propose an efficient architecture search method by restricting the controller's actions to varying the depth/width of the network with function-preserving transformations, so that previously validated child models can be reused to save search cost. Instead of REINFORCE, Baker et al. (2016) and Zhong et al. (2017) use Q-learning for architecture search. The above methods focus on improving the accuracy of searched models and ignore their resource costs. In comparison, RC-DARTS maximizes accuracy under resource constraints, and is thus more suitable for learning lightweight architectures for mobile platforms.

More recently, several works integrate resource constraints as objectives in architecture search. Zhou et al. (2018) modify the reward function to penalize searched models when constraints are not met; the method is only tested on small-scale datasets. MnasNet Tan et al. (2018) explicitly adds resource constraints to the reward and achieves promising results in trading off accuracy against model complexity. Compared with MnasNet, RC-DARTS achieves comparable performance under similar resource constraints while using three orders of magnitude less search time (see the experiments section).
EA-based methods
Early attempts at using evolutionary algorithms to optimize neural architectures include Angeline et al. (1994); Stanley and Miikkulainen (2002); Floreano et al. (2008). However, since EA is used to learn both architectures and weights, these methods are difficult to apply to modern deep networks, which have a large number of parameters. Recent works, including Real et al. (2017); Xie and Yuille (2017); Liu et al. (2017b); Real et al. (2018), separate the learning of architectures and weights: the former are learned using EA while the latter are learned using conventional gradient descent. When learning architectures with EA, a "mutation" chooses among different architectural options, e.g., the filter size, the number of layers, etc. [19] proposed to treat neural architecture search as a multi-objective optimization task and adopt an evolutionary algorithm to search for models under two objectives, runtime speed and classification accuracy. However, the performance of the searched models is not comparable to hand-crafted small CNNs, and the number of GPUs required is enormous. A prominent disadvantage of EA-based methods is that they require enormous computational resources (generally a few hundred GPU days), prohibiting their use in budget-limited settings.
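The mutation step described above can be illustrated with a toy genome. The option lists and mutation rules below are hypothetical, purely to show the mechanics of perturbing one architectural option per offspring:

```python
import random

# Toy sketch of EA-based architecture search "mutation": a genome is a
# list of layer specs, and one mutation perturbs a single architectural
# option (filter size or network depth). The option set is made up.
FILTER_SIZES = [1, 3, 5, 7]

def mutate(genome, rng):
    child = [dict(layer) for layer in genome]  # copy the parent genome
    if rng.random() < 0.5 and len(child) > 1:
        # depth mutation: remove a randomly chosen layer
        child.pop(rng.randrange(len(child)))
    else:
        # filter-size mutation: resample one layer's filter size
        rng.choice(child)["filter"] = rng.choice(FILTER_SIZES)
    return child

rng = random.Random(1)
parent = [{"filter": 3}, {"filter": 3}, {"filter": 3}]
child = mutate(parent, rng)
```

The expensive part is not the mutation itself but evaluating each offspring's fitness, which requires training it, hence the hundreds of GPU days.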
Other optimization-based methods
Gaussian-process-based Bayesian optimization is used in Kandasamy et al. (2018); Swersky et al. (2014) for neural architecture search. However, its performance is unsatisfactory compared with the previous two categories of methods. To reduce the excessive computational cost of evaluating candidate architectures in the original NAS method Zoph and Le (2016), Baker et al. (2017) proposes to predict the performance of model architectures instead of performing conventional training. Brock et al. (2017) adopts a similar idea by training an extra network to generate the weights of candidate architectures and using random search to find good models. In another direction, the sequential model-based optimization (SMBO) algorithm Hutter et al. (2011) is adopted to guide the selection of model architectures by learning a predictive model. Based on SMBO, Liu et al. (2017a) achieves performance comparable to NASNet with a significantly smaller computational budget. Following Liu et al. (2017a), DPP-Net Dong et al. (2018) integrates resource constraints into the objective function for searching efficient models. Since RC-DARTS performs architecture search in a one-shot manner, it not only has higher search efficiency than DPP-Net but also achieves a better trade-off between accuracy and model complexity, as validated by extensive experimental results.
Recently, Liu et al. (2018) solves the NAS problem from a different angle and proposes an efficient architecture search method named DARTS (Differentiable Architecture Search). Instead of searching over a discrete set of candidate architectures, they relax the search space to be continuous, so that the architecture can be optimized with respect to its validation set performance by gradient descent. However, the above methods ignore the computational costs of the obtained architectures, making the learned models suboptimal in computational resource usage. RC-DARTS is built upon DARTS but significantly improves the efficiency of the learned models while achieving better accuracy. Different from DARTS, RC-DARTS aims to learn optimal architectures that satisfy customized resource constraints. To the best of our knowledge, RC-DARTS is the first method to resolve one-shot NAS with resource constraints. Since the exact resource costs of output architectures are discrete, it is challenging to incorporate resource constraints into training and solve the optimization problem. We approximate the resource costs with a continuous relaxation and address the resulting non-convex constrained optimization with a novel iterative projection algorithm. In addition, we propose a multi-level search strategy and new types of connection cells to achieve a better trade-off between accuracy and resource costs.
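The continuous relaxation that DARTS introduces, and on which our method builds, can be sketched as follows: each edge computes a softmax-weighted mixture of candidate operations, so the architecture parameters (the alphas) become differentiable, and the strongest operation is kept after search. The scalar toy "operations" below stand in for real convolution and pooling ops:

```python
import math

# Minimal sketch of the DARTS continuous relaxation. The candidate
# operations here are toy scalar functions, not real network ops.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double":   lambda x: 2.0 * x,  # stand-in for a parametric op
    "zero":     lambda x: 0.0,      # the "none" connection
}

def mixed_op(x, alphas):
    """Edge output: sum over ops of softmax(alpha)_o * o(x)."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))

def discretize(alphas):
    """After search, keep only the op with the largest alpha."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alphas)), key=lambda i: alphas[i])]

# With equal alphas the mixture averages the ops: (x + 2x + 0) / 3 = x
y = mixed_op(3.0, [0.0, 0.0, 0.0])
chosen = discretize([0.1, 2.0, -1.0])
```

Because `mixed_op` is differentiable in the alphas, architecture search becomes alternating gradient descent; the difficulty RC-DARTS addresses is that resource costs are defined on the discretized architecture, not on the smooth mixture.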
Hand-crafted lightweight models
Many hand-crafted models have been proposed to achieve high performance on mobile devices with limited computational resources. Among these lightweight models, group convolution and its variants play critical roles in improving efficiency. The IGCV models Zhang et al. (2017a); Xie et al. (2018) introduce novel interleaved group convolutions to improve the representation capabilities of models with the same number of parameters and the same model complexity. Both MobileNet Howard et al. (2017) and ShuffleNet Zhang et al. (2017b) employ depthwise convolution (an extreme case of group convolution in which the number of groups equals the number of input channels) to significantly reduce model size and complexity while retaining comparable accuracy. CondenseNet Huang et al. (2017a) proposes a novel learnable group convolution to improve the computational efficiency of DenseNet. MobileNet V2 Sandler et al. (2018) proposes an inverted residual block (IRB) to achieve a better trade-off between accuracy and computational cost than Howard et al. (2017). ShuffleNet V2 Ma et al. (2018) proposes practical guidelines for designing efficient networks and achieves state-of-the-art results in terms of the efficiency-accuracy trade-off. Although the above models achieve promising results in balancing accuracy and computational cost, their design processes are usually time-consuming and costly, relying on trial and error by experts in the field. In comparison, our proposed RC-DARTS is a general architecture search method that automatically learns architectures from data in a one-shot manner. More importantly, we experimentally verify that RC-DARTS is advantageous over hand-crafted models on the combined criteria of model size/complexity, search cost, and accuracy.
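A back-of-the-envelope parameter count shows why the depthwise separable convolution used by these lightweight models is so effective. For a k x k convolution with c_in input and c_out output channels (biases omitted; the 3x3, 256-channel setting below is just an example):

```python
# Parameter counts for a standard convolution versus a depthwise
# separable one (MobileNet/ShuffleNet-style). Biases are omitted and
# the layer sizes are illustrative, not taken from any specific model.
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # one k x k filter per input channel, then a 1x1 pointwise mix
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 256, 256)        # 589,824 parameters
sep = depthwise_separable_params(3, 256, 256)  # 67,840 parameters
ratio = std / sep                              # roughly an 8.7x reduction
```

The same factoring reduces multiply-accumulate operations by a similar factor, which is the main source of the efficiency gains reported for these architectures.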
6 Conclusion
In this work, we presented RC-DARTS, a novel end-to-end framework for one-shot neural architecture search under resource constraints, in which a customized network architecture is learned for each particular dataset. RC-DARTS employs differentiable architecture search techniques while taking resource constraints into account by adding them as constraints in the objective function. RC-DARTS achieves a good balance among architecture search speed, resource constraints, and the quality of the searched architecture, making it well suited to the resource-constrained neural architecture search problem. On the CIFAR-10 and ImageNet datasets, RC-DARTS achieves state-of-the-art performance in terms of accuracy, model size, and complexity. In the future, we plan to explore using a neural network to learn the mapping between network hyperparameters and more complicated resource costs, such as inference speed on specific hardware.
References

 Angeline et al. [1994] Angeline, P.J., Saunders, G.M., Pollack, J.B., 1994. An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks 5, 54–65.
 Baker et al. [2016] Baker, B., Gupta, O., Naik, N., Raskar, R., 2016. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167.
 Baker et al. [2017] Baker, B., Gupta, O., Raskar, R., Naik, N., 2017. Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823 .
 Brock et al. [2017] Brock, A., Lim, T., Ritchie, J.M., Weston, N., 2017. Smash: oneshot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344 .
 Cai et al. [2018] Cai, H., Chen, T., Zhang, W., Yu, Y., Wang, J., 2018. Efficient architecture search by network transformation, AAAI.
 DeVries and Taylor [2017] DeVries, T., Taylor, G.W., 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 .
 Dong et al. [2018] Dong, J.D., Cheng, A.C., Juan, D.C., Wei, W., Sun, M., 2018. DPP-Net: Device-aware progressive search for Pareto-optimal neural architectures. arXiv preprint arXiv:1806.08198.
 Floreano et al. [2008] Floreano, D., Dürr, P., Mattiussi, C., 2008. Neuroevolution: from architectures to learning. Evolutionary Intelligence 1, 47–62.
 Han et al. [2015] Han, S., Mao, H., Dally, W.J., 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 .
 Hannun et al. [2014] Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., et al., 2014. Deep speech: Scaling up endtoend speech recognition. arXiv preprint arXiv:1412.5567 .
 Howard et al. [2017] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 .
 Huang et al. [2017a] Huang, G., Liu, S., van der Maaten, L., Weinberger, K.Q., 2017a. Condensenet: An efficient densenet using learned group convolutions. CoRR abs/1711.09224. URL: http://arxiv.org/abs/1711.09224, arXiv:1711.09224.
 Huang et al. [2017b] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q., 2017b. Densely connected convolutional networks., in: CVPR, p. 3.
 Hutter et al. [2011] Hutter, F., Hoos, H.H., Leyton-Brown, K., 2011. Sequential model-based optimization for general algorithm configuration, in: International Conference on Learning and Intelligent Optimization, Springer. pp. 507–523.
 Kandasamy et al. [2018] Kandasamy, K., Neiswanger, W., Schneider, J.G., Poczos, B., Xing, E.P., 2018. Neural architecture search with Bayesian optimisation and optimal transport, in: Neural Information Processing Systems, pp. 2016–2025.
 Krizhevsky and Hinton [2009] Krizhevsky, A., Hinton, G., 2009. Learning multiple layers of features from tiny images. Technical Report. Citeseer.

 Krizhevsky et al. [2012] Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, pp. 1097–1105.
 Liu et al. [2017a] Liu, C., Zoph, B., Shlens, J., Hua, W., Li, L.J., FeiFei, L., Yuille, A., Huang, J., Murphy, K., 2017a. Progressive neural architecture search. arXiv preprint arXiv:1712.00559 .
 Liu et al. [2017b] Liu, H., Simonyan, K., Vinyals, O., Fernando, C., Kavukcuoglu, K., 2017b. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436 .
 Liu et al. [2018] Liu, H., Simonyan, K., Yang, Y., 2018. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055 .
 Ma et al. [2018] Ma, N., Zhang, X., Zheng, H., Sun, J., 2018. Shufflenet V2: practical guidelines for efficient CNN architecture design. CoRR abs/1807.11164. URL: http://arxiv.org/abs/1807.11164, arXiv:1807.11164.
 Negrinho and Gordon [2017] Negrinho, R., Gordon, G., 2017. Deeparchitect: Automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792 .
 Pham et al. [2018] Pham, H., Guan, M.Y., Zoph, B., Le, Q.V., Dean, J., 2018. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268 .
 Real et al. [2018] Real, E., Aggarwal, A., Huang, Y., Le, Q.V., 2018. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548 .
 Real et al. [2017] Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Tan, J., Le, Q., Kurakin, A., 2017. Largescale evolution of image classifiers. arXiv preprint arXiv:1703.01041 .
 Sandler et al. [2018] Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L., 2018. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation, CVPR.
 Stanley and Miikkulainen [2002] Stanley, K.O., Miikkulainen, R., 2002. Evolving neural networks through augmenting topologies. Evolutionary computation 10, 99–127.
 Sutskever et al. [2014] Sutskever, I., Vinyals, O., Le, Q.V., 2014. Sequence to sequence learning with neural networks, in: Advances in neural information processing systems, pp. 3104–3112.
 Swersky et al. [2014] Swersky, K., Duvenaud, D.K., Snoek, J., Hutter, F., Osborne, M.A., 2014. Raiders of the lost architecture: Kernels for bayesian optimization in conditional parameter spaces. arXiv: Machine Learning .

 Szegedy et al. [2017] Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A., 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: AAAI, p. 12.
 Szegedy et al. [2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions, in: CVPR, pp. 1–9.

 Szegedy et al. [2016] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2016. Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826.
 Tan et al. [2018] Tan, M., Chen, B., Pang, R., Vasudevan, V., Le, Q.V., 2018. Mnasnet: Platformaware neural architecture search for mobile. arXiv preprint arXiv:1807.11626 .
 Williams [1992] Williams, R.J., 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 229–256.
 Xie et al. [2018] Xie, G., Wang, J., Zhang, T., Lai, J., Hong, R., Qi, G., 2018. IGCV2: interleaved structured sparse convolutional neural networks, in: CVPR.
 Xie and Yuille [2017] Xie, L., Yuille, A.L., 2017. Genetic CNN, in: ICCV, pp. 1388–1397.
 Yosinski et al. [2014] Yosinski, J., Clune, J., Bengio, Y., Lipson, H., 2014. How transferable are features in deep neural networks?, in: Advances in neural information processing systems, pp. 3320–3328.
 Zhang et al. [2017a] Zhang, T., Qi, G., Xiao, B., Wang, J., 2017a. Interleaved group convolutions for deep neural networks. arXiv: Computer Vision and Pattern Recognition .
 Zhang et al. [2017b] Zhang, X., Zhou, X., Lin, M., Sun, J., 2017b. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CoRR abs/1707.01083. URL: http://arxiv.org/abs/1707.01083, arXiv:1707.01083.
 Zhong et al. [2017] Zhong, Z., Yan, J., Liu, C.L., 2017. Practical network blocks design with Q-learning. arXiv preprint arXiv:1708.05552.
 Zhou et al. [2018] Zhou, Y., Ebrahimi, S., Arik, S.Ö., Yu, H., Liu, H., Diamos, G., 2018. Resource-efficient neural architect. CoRR abs/1806.07912. URL: http://arxiv.org/abs/1806.07912, arXiv:1806.07912.
 Zoph and Le [2016] Zoph, B., Le, Q.V., 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 .
 Zoph et al. [2017] Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V., 2017. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012.