Exploiting Operation Importance for Differentiable Neural Architecture Search

11/24/2019 ∙ by Xukai Xie, et al. ∙ Tianjin University ∙ Princeton University

Recently, differentiable neural architecture search methods have significantly reduced the search cost by constructing a super network and relaxing the architecture representation, assigning architecture weights to the candidate operations. All existing methods determine the importance of each operation directly from its architecture weight. However, architecture weights cannot accurately reflect the importance of each operation; that is, the operation with the highest weight might not be the one that yields the best performance. To alleviate this deficiency, we propose a simple yet effective solution to neural architecture search, termed exploiting operation importance for effective neural architecture search (EoiNAS), in which a new indicator is proposed to fully exploit the operation importance and guide the model search. Based on this new indicator, we propose a gradual operation pruning strategy to further improve the search efficiency and accuracy. Experimental results demonstrate the effectiveness of the proposed method. Specifically, we achieve an error rate of 2.50% on CIFAR-10, which significantly outperforms state-of-the-art methods. When transferred to ImageNet, our architecture achieves a top-1 error of 25.6%, comparable to the state-of-the-art performance under the mobile setting.


1 Introduction

Designing a reasonable network architecture for a specific problem is a challenging task. Better designed network architectures usually lead to significant performance improvements. In recent years, neural architecture search (NAS) [39, 40, 2, 8, 21, 23, 28, 24] has demonstrated success in designing neural architectures automatically. Many architectures produced by NAS methods have achieved higher accuracy than manually designed ones in tasks such as image classification [39], super resolution [5], semantic segmentation [3, 20] and object detection [9]. NAS methods not only boost model performance, but also liberate human experts from tedious architecture tweaking.

Figure 1: Correlation between stand-alone model accuracy and learned architecture weights. We replace the selected operation on the first edge of the first cell of the final architecture with each of the other candidate operations, for both DARTS [23] and GDAS [7], and fully train the resulting models until convergence.

So far, three basic frameworks have gained growing interest, i.e., evolutionary algorithm (EA)-based NAS [22, 28, 35], reinforcement learning (RL)-based NAS [39, 40, 27], and gradient-based NAS [23, 36, 7]. In both EA-based and RL-based approaches, the search procedure requires the validation accuracy of numerous architecture candidates, which is computationally expensive. For example, the reinforcement learning method [39, 40] trains and evaluates more than 20,000 neural networks across 500 GPUs over 4 days. These approaches use a large amount of computational resources, which is inefficient and unaffordable.

Figure 2: Illustration of the NAS procedure. (a) Cell. (b) Candidate operations. (c) Rank Difference: we visualize the corresponding importance of each operation by the colors and numbers; a change in ranking occurs between architecture weight ranking and the true one. (d) Selected architectures.

To eliminate this deficiency, gradient-based NAS methods [23, 36, 7, 4] such as DARTS [23] and GDAS [7] have recently been presented. They construct a super network and relax the architecture representation by assigning continuous weights to the candidate operations. In DARTS, a computation cell is searched as the building block of the final architecture, and each cell is represented as a directed acyclic graph (DAG) consisting of an ordered sequence of N nodes. The discrete search space is then relaxed into a continuous one, so that network and architecture parameters can be jointly optimized by gradient descent. DARTS achieves performance comparable to EA-based [22] and RL-based [39] methods while requiring a search cost of only a few GPU-days. To further accelerate the search procedure, GDAS [7] samples one sub-graph according to the architecture weights in a differentiable way at each training iteration.

Existing methods select the candidate operations based on their architecture weights to derive the target architecture. To examine this practice, stand-alone models are constructed and fully trained for the possible architectures in the search space. However, architecture weights cannot accurately reflect the importance of each operation. To illustrate this issue, the accuracy obtained by each stand-alone model is compared with the corresponding architecture weight; their correlation is plotted in Figure 1. We can see that the operation with the highest architecture weight does not achieve the best accuracy. Furthermore, the architecture weights of candidate operations are often close to each other; in this case, it is difficult to decide which candidate operation is optimal. Figure 2 illustrates the NAS procedure.
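To make the kind of check behind Figure 1 concrete, the snippet below sketches how the rank correlation between learned architecture weights and stand-alone accuracies for a single edge could be measured; the numeric values are placeholders and Kendall's tau is our choice of metric, not necessarily the one used in the paper.

```python
# Sketch: quantify how well architecture weights rank candidate operations.
# The numbers below are illustrative placeholders, not values from the paper.
from scipy.stats import kendalltau

# Softmax-normalized architecture weights of the 8 candidate operations on one edge.
arch_weights = [0.18, 0.15, 0.14, 0.13, 0.12, 0.11, 0.09, 0.08]
# Test accuracy (%) of the stand-alone model obtained by fixing that operation.
standalone_acc = [96.8, 97.1, 96.5, 97.3, 96.9, 96.4, 96.2, 96.0]

tau, p_value = kendalltau(arch_weights, standalone_acc)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
# A tau close to 1 would mean the weights rank operations the same way the
# stand-alone accuracies do; Figure 1 suggests the correlation is actually weak.
```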

Given the limitation of architecture weights, it is natural to ask: can we improve architecture search performance by applying a more effective indicator to guide the model search? To this end, we propose a simple yet effective solution to neural architecture search, termed exploiting operation importance for effective neural architecture search (EoiNAS). The main idea of our method has two parts:

1) It is well-recognized that operation A is better than operation B if A reaches higher validation accuracy with fewer training epochs during the search process. According to this criterion, a new indicator is proposed to fully exploit the operation importance and guide the model search; the training iterations and validation accuracy of each operation can be recorded during the search.

2) Based on this new indicator, we propose a gradual operation pruning strategy to further improve the search efficiency and accuracy. We denote the training of every k epochs as a step. In each step, we prune the most inferior operation according to the new indicator. This process continues until only one operation remains; this operation can be regarded as the best operation to derive the final architecture. Owing to the gradual operation pruning strategy, our super network exhibits fast convergence.

The effectiveness of EoiNAS is verified on the standard vision setting, i.e., searching on CIFAR-10, and evaluating on both CIFAR-10/100 and ImageNet datasets. We achieve state-of-the-art performance of 2.50% test error on CIFAR-10 using 3.4M parameters. When transferred to ImageNet, it achieves top-1/5 errors of 25.6%/8.3% respectively, comparable to the state-of-the-art performance under the mobile setting.

The remainder of this paper is organized as follows: In Section 2, we review the related work of recent neural architecture search algorithms and describe our search method in Section 3. After experiments are shown in Section 4, we conclude this paper in Section 5.

2 Related Work

With the rapid development of deep learning, significant performance gains have been achieved across a wide range of computer vision problems, most of which are attributable to manually designed network architectures [11, 15, 19, 33, 34, 10]. Recently, a new research field named neural architecture search (NAS) [39, 40, 2, 8, 21] has been attracting increasing attention. The goal is to find automatic ways of designing neural architectures to replace conventional handcrafted ones. According to the heuristics used to explore the large architecture space, existing NAS approaches can be roughly divided into three categories, namely, evolutionary algorithm-based approaches [22, 28, 35], reinforcement learning-based approaches [39, 40, 27] and gradient-based approaches [23, 36, 7].

Reinforcement learning based NAS. A reinforcement learning based approach has been proposed by Zoph et al. [39, 40] for neural architecture search. They use a recurrent network as a controller to generate the model description of a child neural network designated for a given task. The resulting architecture (NASNet) improved over the existing hand-crafted network models of its time.

Evolutionary algorithm-based NAS. An alternative search technique has been proposed by Real et al. [28], where an evolutionary (genetic) algorithm is used to find a neural architecture tailored to a given task. The evolved neural network (AmoebaNet) further improved the performance over NASNet. Although these works achieved state-of-the-art results on various classification tasks, their main disadvantage is the large amount of computational resources they demand.

Gradient-based NAS. Contrary to treating architecture search as a black-box optimization problem, gradient-based neural architecture search methods [23, 36, 7] utilize the gradients obtained during training to optimize the neural architecture. DARTS [23] relaxed the search space to be continuous, so that the architecture could be optimized with respect to its validation-set performance by gradient descent. Gradient-based approaches thus successfully accelerate the architecture search procedure, requiring only several GPU-days. Because DARTS optimizes the entire super network during the search process, it may suffer from a discrepancy between the continuous architecture encoding and the derived discrete architecture. GDAS [7] suggested an alternative method to alleviate this discrepancy: it treats the search problem as sampling from a distribution over architectures, where the distribution itself is learned in a continuous way. The distribution is expressed via slack softened one-hot variables that multiply the operations and make the sampling procedure differentiable. SNAS [36] applied a similar technique, constraining the architecture parameters to be one-hot to tackle the inconsistency in optimization objectives between the search and evaluation scenarios. To bridge the depth gap between search and evaluation, PDARTS [4] divides the search process into multiple stages and progressively increases the network depth at the end of each stage. In addition, MdeNAS [38] proposes a multinomial distribution learning method for extremely effective NAS, which treats the search space as a joint multinomial distribution and optimizes the distribution to have a high expectation of performance.

Figure 3: Search space. (a) A cell contains 7 nodes: two input nodes, four intermediate nodes that apply sampled operations to the input nodes and the preceding intermediate nodes, and an output node that concatenates the outputs of the four intermediate nodes. (b) An edge between two nodes denotes a possible operation, sampled according to the discrete probability distribution in the search space.

3 Methodology

We first describe our search space and continuous relaxation in general form in Section 3.1, where the computation procedure of an architecture is represented as a directed acyclic graph. We then propose a new indicator to fully exploit the importance of each operation in Section 3.2. Finally, we design a gradual operation pruning strategy that makes the super network exhibit fast convergence and high training accuracy in Section 3.3.

3.1 Search Space and Continuous Relaxation

In this work, we leverage GDAS [7] as our baseline framework. Our goal is to search for a robust cell and apply it to a network of L cells. As shown in Figure 3, a cell is defined as a directed acyclic graph (DAG) of N nodes \{x_1, x_2, \dots, x_N\}, where each node is a network layer, i.e., it performs a specific mathematical function. We denote the operation space as \mathcal{O}, in which each element represents a candidate operation o(\cdot). An edge (i, j) represents the information flow connecting node i and node j; it consists of a set of operations weighted by the architecture weights \alpha_{i,j}, and is thus formulated as:

f_{i,j}(x_i) = \sum_{o=1}^{O} p_{i,j}^{o} \, o(x_i),    (1)

p_{i,j}^{o} = \frac{\exp(\alpha_{i,j}^{o})}{\sum_{o'=1}^{O} \exp(\alpha_{i,j}^{o'})},    (2)

where \alpha_{i,j}^{o} is the o-th element of an O-dimensional learnable vector \alpha_{i,j}, and p_{i,j} encodes the sampling distribution of the function between node i and node j, as we will discuss below. Intuitively, a well-learned \alpha_{i,j} could represent the relative importance of each operation o for transforming the feature map x_i. Similar to GDAS, between node i and node j, we sample one operation from \mathcal{O} according to the discrete probability distribution p_{i,j} characterized by Eq. (2). During the search, we calculate each node in a cell as:

x_j = \sum_{i < j} f_{i,j}(x_i),    (3)

where f_{i,j} is the operation sampled from \mathcal{O} according to p_{i,j}.

Since the operation is sampled from a discrete probability distribution, we cannot back-propagate gradients to optimize \alpha_{i,j}. To allow back-propagation, we use the Gumbel-Max trick [10, 25] and the softmax function [16] to re-formulate Eq. (3) as Eq. (4), which provides an efficient way to draw samples from a discrete probability distribution in a differentiable way:

x_j = \sum_{i < j} \sum_{o=1}^{O} h_{i,j}^{o} \, o(x_i),    (4)

h_{i,j}^{o} = \frac{\exp\big((\log p_{i,j}^{o} + g_{o}) / T\big)}{\sum_{o'=1}^{O} \exp\big((\log p_{i,j}^{o'} + g_{o'}) / T\big)},    (5)

Here g_{o} are i.i.d. samples drawn from Gumbel(0, 1); o(\cdot) indicates the o-th function in \mathcal{O}; h_{i,j}^{o} is the o-th element of h_{i,j} and is the weight of o(\cdot) in the transformation function between node i and node j; T is the temperature parameter [10], which controls the Gumbel-Softmax distribution. As T approaches zero, the Gumbel-Softmax distribution becomes equivalent to the discrete probability distribution. The temperature parameter is annealed from 5.0 to 0.0 during our search.
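As a minimal sketch of the Gumbel-Softmax sampling in Eqs. (4)-(5), one edge could be evaluated as follows; the helper name, the straight-through trick, and the assumption of a 1-D weight vector per edge are illustrative choices, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def sample_edge_output(x, ops, alpha, temperature):
    """Gumbel-Softmax sampling of one operation on an edge, mirroring Eqs. (4)-(5).

    x: input feature map; ops: list of candidate modules on this edge;
    alpha: learnable O-dimensional architecture weights alpha_{i,j}; temperature: T.
    """
    log_p = F.log_softmax(alpha, dim=-1)                      # log p_{i,j} from Eq. (2)
    u = torch.rand_like(alpha).clamp_(1e-10, 1.0 - 1e-10)
    gumbel = -torch.log(-torch.log(u))                        # i.i.d. Gumbel(0, 1) noise g_o
    h = F.softmax((log_p + gumbel) / temperature, dim=-1)     # soft one-hot h_{i,j}, Eq. (5)
    index = h.argmax(dim=-1)                                  # operation actually executed
    one_hot = torch.zeros_like(h).scatter_(-1, index.view(1), 1.0)
    weights = (one_hot - h).detach() + h                      # straight-through estimator
    return sum(w * op(x) for w, op in zip(weights, ops))
```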

Our candidate operation set \mathcal{O} contains the following 8 operations: (1) identity, (2) zero, (3) 3x3 separable convolution, (4) 3x3 dilated separable convolution, (5) 5x5 separable convolution, (6) 5x5 dilated separable convolution, (7) average pooling, (8) max pooling. We search for two kinds of cells, i.e., the normal cell and the reduction cell. When searching the normal cell, each operation in \mathcal{O} has a stride of 1. For the reduction cell, the stride of the operations on the 2 input nodes is 2. Once we discover the best normal cell and reduction cell, we stack copies of these cells to make up a neural network.
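For illustration, a candidate set of this kind can be written as a name-to-module mapping in PyTorch. The simplified building blocks below (a single depthwise-separable block, 3x3 pooling kernels, and plain strided pooling in place of the strided identity) follow the common DARTS convention and are assumptions about the exact modules used here.

```python
import torch
import torch.nn as nn

def sep_conv(c, kernel, stride, dilation=1):
    """Depthwise-separable conv block (ReLU-Conv-BN), simplified from DARTS-style cells."""
    pad = dilation * (kernel - 1) // 2
    return nn.Sequential(
        nn.ReLU(inplace=False),
        nn.Conv2d(c, c, kernel, stride, pad, dilation=dilation, groups=c, bias=False),
        nn.Conv2d(c, c, 1, bias=False),
        nn.BatchNorm2d(c),
    )

class Zero(nn.Module):
    """The 'zero' operation: drops the connection entirely (with spatial subsampling)."""
    def __init__(self, stride):
        super().__init__()
        self.stride = stride
    def forward(self, x):
        return torch.zeros_like(x[:, :, ::self.stride, ::self.stride])

def candidate_ops(c, stride):
    """Name-to-module mapping of the 8 candidate operations (DARTS-style convention)."""
    return {
        "identity":     nn.Identity() if stride == 1 else nn.AvgPool2d(1, stride),
        "zero":         Zero(stride),
        "sep_conv_3x3": sep_conv(c, 3, stride),
        "dil_conv_3x3": sep_conv(c, 3, stride, dilation=2),
        "sep_conv_5x5": sep_conv(c, 5, stride),
        "dil_conv_5x5": sep_conv(c, 5, stride, dilation=2),
        "avg_pool_3x3": nn.AvgPool2d(3, stride, padding=1),
        "max_pool_3x3": nn.MaxPool2d(3, stride, padding=1),
    }
```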

Figure 4: Ideal and real cases of the weight deviation. The dark dashed curve indicates the ideal weight distribution, and the red solid curve denotes the weight distribution that might occur in real cases. The deviation of the architecture weights should be large enough that we can clearly judge which operation is more important.
Figure 5: Distribution of architecture weights. The small circles and crosses show each operation weight, and the solid curves denote the kernel density estimate (KDE) of the weight distribution. The architecture weights are distributed too densely, which makes it difficult to distinguish the important operations from the others.

3.2 Operation Importance Indicator

Architecture Weights Deviation. In previous algorithms, operation importance is ranked by the architecture weights, which are supposed to represent the relative importance of a candidate operation versus the others. When the search process is over, these methods select the most important operation and prune the other, inferior operations according to the values of the architecture weights.

However, architecture weights cannot accurately reflect the importance of each operation. As shown in Figure 4, the dark dashed curve and the red solid curve indicate the weight distribution in the ideal and real cases, respectively. In the ideal case, the deviation of the architecture weights is large enough that we can clearly judge which operation is more important. However, this requirement may not always hold, which can lead to unexpected results. In Figure 5, statistical information collected from the cells of DARTS and GDAS demonstrates this analysis. The small circles and crosses show each observation in the weight distribution, and the solid curves denote the kernel density estimate (KDE) [32], a non-parametric way to estimate the probability density function of a random variable. As Figure 5 shows, a large proportion of the operations have architecture weights concentrated in a small interval, which makes it difficult to distinguish the important operations from the others. Figure 2 (c) also illustrates this issue: a change in ranking occurs between the architecture weight ranking and the true one.
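The density estimate behind Figure 5 can be reproduced along the following lines; the weight values are placeholders and scipy's gaussian_kde is one standard estimator choice, not necessarily the one used for the figure.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Placeholder softmax-normalized architecture weights collected from one searched cell;
# the values are illustrative, not taken from DARTS or GDAS checkpoints.
weights = np.array([0.14, 0.13, 0.13, 0.12, 0.12, 0.12, 0.12, 0.12])

kde = gaussian_kde(weights)                 # non-parametric density estimate
grid = np.linspace(0.0, 0.3, 300)
density = kde(grid)
print("density peaks near", grid[density.argmax()])
# A single sharp peak over a narrow interval means the weights cluster together,
# which is exactly the situation that makes ranking operations unreliable.
```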

The Proposed Indicator. It is well-recognized that operation A is better than operation B if A reaches higher validation accuracy with fewer training iterations during the search process. Therefore, for each operation, the ratio of validation accuracy to training iterations can be used to determine the operation importance. This ratio is represented by

R_{i,j}^{o} = \frac{A_{i,j}^{o}}{t_{i,j}^{o}},    (6)

where A_{i,j}^{o} and t_{i,j}^{o} are the validation accuracy and the number of training iterations of each operation on each edge, respectively. The values of the accuracy parameters might also be close to each other, which would affect the importance judgement; in this case, we consider the operation with the higher architecture weight to be more important. Therefore, we combine the accuracy parameters with the architecture weights to obtain an effective indicator I as in Eq. (7), which can fully exploit the importance of each operation:

I_{i,j}^{o} = \lambda R_{i,j}^{o} + (1 - \lambda) \, \alpha_{i,j}^{o},    (7)

where \alpha_{i,j}^{o} is the architecture weight of the o-th operation between node i and node j, and \lambda is a parameter controlling the balance between the two parts, which is set to 0.5 in this work.
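A direct translation of this indicator, under the reconstruction of Eqs. (6)-(7) given above (the exact scaling of the accuracy/iteration ratio is an assumption), looks as follows.

```python
def operation_importance(accuracy, iterations, arch_weight, lam=0.5):
    """Operation importance indicator, following the reconstruction of Eqs. (6)-(7).

    accuracy:    validation accuracy A recorded for this operation on this edge
    iterations:  training iterations t this operation has received so far
    arch_weight: architecture weight alpha of this operation on this edge
    lam:         balance between the accuracy/iteration ratio and the weight (0.5 here)
    """
    ratio = accuracy / max(iterations, 1)            # Eq. (6): good ops get accurate quickly
    return lam * ratio + (1.0 - lam) * arch_weight   # Eq. (7)

# Two operations with similar architecture weights can still be separated
# if one reaches comparable accuracy in fewer iterations.
print(operation_importance(accuracy=0.90, iterations=300, arch_weight=0.14))
print(operation_importance(accuracy=0.89, iterations=450, arch_weight=0.15))
```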

Compared to previous methods [23, 7] that judge the operation importance directly by the architecture weights, our proposed indicator can effectively reflect the operation importance, which helps select the optimal operation and thus achieve higher accuracy. Applying this effective indicator can significantly improve architecture search performance.

Based on this new indicator I, a gradual operation pruning strategy is applied during the search process to further improve the search efficiency and accuracy, as we will discuss next.

Type | Architecture | GPUs | Time (days) | Params (M) | C10 Test Error (%) | C100 Test Error (%) | Search Method
Human expert | ResNet + CutOut [11] | - | - | 1.7 | 4.61 | 22.10 | manual
Human expert | DenseNet [15] | - | - | 25.6 | 3.46 | 17.18 | manual
Human expert | SENet [14] | - | - | 11.2 | 4.05 | - | manual
NAS | MetaQNN [1] | 10 | 8-10 | 11.2 | 6.92 | 27.14 | RL
NAS | NAS [39] | 800 | 21-28 | 7.1 | 4.47 | - | RL
NAS | NASNet-A [40] | 450 | 3-4 | 3.3 | 3.41 | - | RL
NAS | NASNet-A + CutOut [40] | 450 | 3-4 | 3.3 | 2.65 | - | RL
NAS | ENAS [27] | 1 | 0.45 | 4.6 | 3.54 | 19.43 | RL
NAS | ENAS + CutOut [27] | 1 | 0.45 | 4.6 | 2.89 | - | RL
NAS | AmoebaNet-A + CutOut [28] | 450 | 7.0 | 3.2 | 3.34 | 18.93 | evolution
NAS | AmoebaNet-B + CutOut [28] | 450 | 7.0 | 2.8 | 2.55 | - | evolution
NAS | Hierarchical NAS [22] | 200 | 1.5 | 61.3 | 3.63 | - | evolution
NAS | Progressive NAS [21] | 100 | 1.5 | 3.2 | 3.63 | 19.53 | SMBO
NAS | DARTS (1st) + CutOut [23] | 1 | 0.38 | 3.3 | 3.00 | 17.76 | gradient-based
NAS | DARTS (2nd) + CutOut [23] | 1 | 1.0 | 3.4 | 2.82 | 17.54 | gradient-based
NAS | SNAS + CutOut [36] | 1 | 1.5 | 2.9 | 2.98 | - | gradient-based
NAS | GDAS [7] | 1 | 1.0 | 3.4 | 3.87 | 19.68 | gradient-based
NAS | GDAS + CutOut [7] | 1 | 1.0 | 3.4 | 2.93 | 18.38 | gradient-based
NAS | MdeNAS + CutOut [38] | 1 | 0.16 | 3.61 | 2.55 | - | MDL
NAS | Random Search + CutOut [23] | 1 | 4.0 | 3.2 | 3.29 | - | random
NAS | EoiNAS | 1 | 0.6 | 3.4 | 3.42 | 18.4 | gradient-based
NAS | EoiNAS + CutOut | 1 | 0.6 | 3.4 | 2.50 | 17.3 | gradient-based
Table 1: Classification errors of EoiNAS and benchmark architectures on CIFAR-10 and CIFAR-100. "-" denotes values not reported.

3.3 Gradual Operation Pruning Strategy

In existing methods, all candidate operations are kept throughout the search process, and unimportant operations are removed according to the architecture weights only after the search is over, when the final architecture is derived. However, for clearly unimportant operations, there is no need to waste time and computational resources sampling and training them.

Therefore, we propose a gradual operation pruning strategy to further improve the search efficiency and accuracy. We denote the training of every k epochs as a step. In each step, we prune the most inferior operation according to the new indicator. In the next step, we identify the most inferior operation among the remaining operations and prune it. This process continues until only one operation remains; this operation can be regarded as the best operation for deriving the final architecture. Owing to the gradual operation pruning strategy, our super network exhibits fast convergence.

Our complete searching algorithm is presented in Algorithm 1. At the initialization of the search process, we perform gradient-descent based optimization over only the network parameters w for the first 20 epochs. This helps obtain balanced architecture weights between parameterized operations (e.g., convolutions) and non-parameterized operations (e.g., skip-connect). Then, we perform gradient-descent based optimization of the architecture parameters and network parameters in an alternating manner. Specifically, we optimize the network weights w by descending \nabla_w L_train(w, \alpha) on the training set, and optimize the architecture parameters \alpha by descending \nabla_\alpha L_val(w, \alpha) on the validation set. An operation is pruned after 20 epochs if its corresponding operation importance indicator I, which is updated along the training iterations, is the lowest. When the search procedure is finished, we decode the discrete cell architecture by first retaining the two strongest predecessor edges for each node (with the strength of the edge from node i to node j taken as that of its strongest operation), and then choosing the most likely operation on each retained edge by taking the argmax.
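The decoding step can be sketched as below; the dictionary layout keyed by edge and the helper name decode_cell are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def decode_cell(indicator, num_inputs=2, num_intermediate=4, top_k=2):
    """Sketch of decoding the discrete cell from per-edge importance scores.

    indicator[(i, j)] is assumed to hold one score per remaining candidate operation
    on edge (i, j). For each intermediate node j we keep its top_k strongest incoming
    edges and the argmax operation on each kept edge.
    """
    genotype = []
    for j in range(num_inputs, num_inputs + num_intermediate):
        incoming = [(i, scores) for (i, jj), scores in indicator.items() if jj == j]
        # Rank predecessor edges by the score of their best operation; keep the top-k.
        incoming.sort(key=lambda item: float(np.max(item[1])), reverse=True)
        for i, scores in incoming[:top_k]:
            genotype.append((i, j, int(np.argmax(scores))))
    return genotype

# Toy usage with two intermediate nodes: two of the three edges into node 3 survive.
toy = {(0, 2): [0.20, 0.35], (1, 2): [0.10, 0.35, 0.30],
       (0, 3): [0.50], (1, 3): [0.40], (2, 3): [0.30]}
print(decode_cell(toy, num_intermediate=2))
```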

Input: Training set: D_train; Validation set: D_val;
          Operation set: O
Init: Network parameters: w; Architecture parameters: α;
       Validation accuracy: A; Training iterations: t;
       Temperature parameter: T

1: while not converged do
2:     Sample a sub-graph to train according to Eq. (5);
3:     Update w by descending ∇_w L_train(w, α) on D_train;
4:     Update α by descending ∇_α L_val(w, α) on D_val;
5:     Update A and t for the sampled operations;
6:     if epoch > 20 and epoch % 20 == 0 then
7:         Calculate the operation importance I by Eq. (7);
8:         Prune the operation with the lowest I on each edge;
9:     end if
10: end while
11: Derive the final architecture based on the indicator I;
12: Optimize the derived architecture on the training set
Algorithm 1: Efficient Neural Architecture Search
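To make the control flow of Algorithm 1 concrete, the toy sketch below replaces actual training and validation with random statistics and keeps only the sampling, bookkeeping, and pruning schedule; every helper and constant here is illustrative rather than the authors' code.

```python
import random

def operation_importance(acc, iters, weight, lam=0.5):
    return lam * acc / max(iters, 1) + (1 - lam) * weight

def gradual_pruning_search(num_edges=14, num_ops=8, epochs=160, warmup=20, prune_every=20):
    """Toy sketch of the gradual operation pruning schedule (Algorithm 1's outer loop).

    Real training/validation is replaced by random numbers so the control flow runs
    on its own; in the actual search these statistics come from training the sampled
    sub-graphs and evaluating them on the validation split.
    """
    # Per-edge bookkeeping: remaining ops with architecture weight, accuracy, iterations.
    edges = {
        e: {o: {"weight": 1.0 / num_ops, "acc": 0.0, "iters": 0} for o in range(num_ops)}
        for e in range(num_edges)
    }
    for epoch in range(1, epochs + 1):
        for e, ops in edges.items():
            o = random.choice(list(ops))                         # stand-in for Eq. (5) sampling
            ops[o]["iters"] += 1                                 # record training iterations
            ops[o]["acc"] = max(ops[o]["acc"], random.random())  # stand-in validation accuracy
        if epoch > warmup and epoch % prune_every == 0:
            for e, ops in edges.items():
                if len(ops) > 1:                                 # prune the most inferior op
                    worst = min(ops, key=lambda o: operation_importance(
                        ops[o]["acc"], ops[o]["iters"], ops[o]["weight"]))
                    del ops[worst]
    # Whatever survives on each edge is the operation kept for the final cell.
    return {e: next(iter(ops)) for e, ops in edges.items()}

if __name__ == "__main__":
    print(gradual_pruning_search())
```

With 8 candidate operations, pruning one per 20-epoch step after a 20-epoch warm-up leaves exactly one operation per edge at epoch 160, matching the schedule described above.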

4 Experiments

4.1 Datasets

We conduct experiments on three popular image classification datasets, including CIFAR-10, CIFAR-100 [18] and ImageNet [29]. Architecture search is performed on CIFAR-10, and the discovered architectures are evaluated on all three datasets.

Both CIFAR-10 and CIFAR-100 have 50K training and 10K testing RGB images with a fixed spatial resolution of 32x32. These images are equally distributed over 10 classes in CIFAR-10 and 100 classes in CIFAR-100. In the architecture search scenario, the training set is equally split into two subsets, one for updating the network parameters and the other for updating the architecture parameters. In the evaluation scenario, the standard training/testing split is used.
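A minimal sketch of this 50/50 split with torchvision follows; the batch size and augmentation choices are assumptions, not the paper's exact settings.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler
from torchvision import datasets, transforms

# Split the CIFAR-10 training set in half: one part updates the network weights,
# the other updates the architecture parameters.
transform = transforms.Compose([transforms.RandomCrop(32, padding=4),
                                transforms.RandomHorizontalFlip(),
                                transforms.ToTensor()])
train_data = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)

num_train = len(train_data)                     # 50,000 images
indices = torch.randperm(num_train).tolist()
split = num_train // 2

weight_loader = DataLoader(train_data, batch_size=64,
                           sampler=SubsetRandomSampler(indices[:split]))
arch_loader = DataLoader(train_data, batch_size=64,
                         sampler=SubsetRandomSampler(indices[split:]))
```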

We use ImageNet to test the transferability of the architectures discovered on CIFAR-10. Specifically, we use a subset of ImageNet, namely ILSVRC2012, which contains 1,000 object categories with 1.28M training and 50K validation images. Following the conventions [40, 23], we apply the mobile setting, where the input image size is 224x224.

Type | Architecture | GPUs | Time (days) | Params (M) | MAdds (M) | Top-1 Error (%) | Top-5 Error (%) | Search Method
Human expert | Inception-v1 [10] | - | - | 6.6 | 1448 | 30.2 | 10.1 | manual
Human expert | MobileNet-V2 [30] | - | - | 3.4 | 300 | 28.0 | - | manual
Human expert | MobileNet-V3 [13] | - | - | 5.4 | 219 | 24.8 | - | manual
Human expert | ShuffleNet [37] | - | - | 5.0 | 524 | 26.3 | - | manual
NAS | NASNet-A [40] | 450 | 3-4 | 5.3 | 564 | 26.0 | 8.4 | RL
NAS | NASNet-B [40] | 450 | 3-4 | 5.3 | 488 | 27.2 | 8.7 | RL
NAS | NASNet-C [40] | 450 | 3-4 | 4.9 | 558 | 27.5 | 9.0 | RL
NAS | AmoebaNet-A [28] | 450 | 7.0 | 5.1 | 555 | 25.5 | 8.0 | evolution
NAS | AmoebaNet-B [28] | 450 | 7.0 | 5.3 | 555 | 26.0 | 8.5 | evolution
NAS | AmoebaNet-C [28] | 450 | 7.0 | 6.4 | 570 | 24.3 | 7.6 | evolution
NAS | Progressive NAS [21] | 100 | 1.5 | 5.1 | 588 | 25.8 | 8.1 | SMBO
NAS | DARTS (2nd) [23] | 1 | 1.0 | 4.9 | 595 | 26.9 | 9.0 | gradient-based
NAS | SNAS [36] | 1 | 1.5 | 4.3 | 522 | 27.3 | 9.2 | gradient-based
NAS | GDAS [7] | 1 | 1.0 | 5.3 | 581 | 26.0 | 8.5 | gradient-based
NAS | MdeNAS [38] | 1 | 0.16 | 6.1 | 596 | 25.5 | 7.9 | MDL
NAS | EoiNAS | 1 | 0.6 | 5.0 | 570 | 25.6 | 8.3 | gradient-based
Table 2: Comparison with state-of-the-art architectures on ImageNet (mobile setting). All the NAS networks are searched on CIFAR-10 and then directly transferred to ImageNet. "-" denotes values not reported.
Figure 6: Detailed structure of the best cells discovered on CIFAR-10 by our EoiNAS. (a) Normal cell. (b) Reduction cell. The definition of the operations on the edges is in Section 3.1. In the normal cell, the stride of operations on 2 input nodes is 1 and the stride is 2 in the reduction cell.

4.2 Implementation Details

Following the pipeline in GDAS [7], our experiments consist of three stages. First, EoiNAS is applied to search for the best normal/reduction cells on CIFAR-10. Then, a larger network is constructed by stacking the learned cells and retrained on both CIFAR-10 and CIFAR-100. The performance of EoiNAS is compared with other state-of-the-art NAS methods. Finally, we transfer the cells learned on CIFAR-10 to ImageNet to evaluate their performance on larger datasets.

Network Configurations. The neural cells for the CNN are searched on CIFAR-10 following [23, 36, 7]. The candidate function set O has 8 different functions, as introduced in Section 3.1. By default, we train a small network of 8 cells for 160 epochs in total and set the number of initial channels in the first convolution layer, C, to 16. Cells located at 1/3 and 2/3 of the total depth of the network are reduction cells, in which all the operations adjacent to the input nodes have stride two.

Parameter Settings. For the network parameters w, we use SGD optimization. We start with a learning rate of 0.025 and anneal it down to 0.001 following a cosine schedule, with a momentum of 0.9 and a weight decay of 0.0003. For the architecture parameters α, we use zero initialization, which implies an equal amount of attention over all possible operations, and we use Adam optimization [17] with a learning rate of 0.0003, momentum (0.5, 0.999) and a weight decay of 0.001. To control the temperature parameter T of the Gumbel-Softmax in Eq. (5), we use an exponentially decaying schedule: T is initialized to 5 and finally reduced to 0. Following [7], we run EoiNAS 4 times with different random seeds and pick the best cell based on its validation performance. This procedure reduces the high variance of the searched results.
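A sketch of these settings in PyTorch is given below, assuming the super network exposes its weight and architecture parameters separately; since an exponential schedule cannot reach exactly zero, a small floor is used for the final temperature.

```python
import torch

def build_optimizers(model, arch_parameters, epochs=160):
    """Optimizers matching the settings described above (model/arch split is assumed)."""
    w_optimizer = torch.optim.SGD(model.parameters(), lr=0.025,
                                  momentum=0.9, weight_decay=3e-4)
    w_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        w_optimizer, T_max=epochs, eta_min=0.001)          # 0.025 -> 0.001 (cosine)
    a_optimizer = torch.optim.Adam(arch_parameters, lr=3e-4,
                                   betas=(0.5, 0.999), weight_decay=1e-3)
    return w_optimizer, w_scheduler, a_optimizer

def temperature(epoch, epochs=160, t_max=5.0, t_min=1e-3):
    """Exponentially decaying Gumbel-Softmax temperature, annealed from 5 toward 0."""
    return t_max * (t_min / t_max) ** (epoch / max(epochs - 1, 1))
```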

Our EoiNAS takes about 0.6 GPU-days to finish the search procedure on a single NVIDIA 1080Ti GPU. The best cells searched by EoiNAS are shown in Figure 6.

4.3 Results on CIFAR-10 and CIFAR-100

For CIFAR, we built a network with 20 cells and 36 input channels and trained it for 600 epochs with batch size 128. Cutout regularization [6] of length 16, drop-path with probability 0.3 and an auxiliary tower with weight 0.4 [39] are applied. A standard SGD optimizer with a weight decay of 0.0003 and a momentum of 0.9 is used. The initial learning rate is 0.025, which is decayed to 0 following the cosine rule.

Evaluation results and comparisons with state-of-the-art approaches are summarized in Table 1. As demonstrated in Table 1, EoiNAS achieves test errors of 2.50% and 17.3% on CIFAR-10 and CIFAR-100, respectively, with a search cost of only 0.6 GPU-days. To obtain the same level of performance, AmoebaNet [28] spent more than three orders of magnitude more computational resources (3150 GPU-days vs 0.6 GPU-days). Our EoiNAS also outperforms GDAS [7] and SNAS [36] by a large margin. Notably, architectures discovered by EoiNAS outperform MdeNAS [38], the previously most efficient approach, while using fewer parameters. In addition, we compare our method to random search (RS) [23], which is considered a very strong baseline. Note that the accuracy of the model searched by EoiNAS is 0.7% higher than that of RS.

4.4 Results on ImageNet

The ImageNet dataset is used to test the transferability of the architectures discovered on CIFAR-10. We adopt the same network configuration as GDAS [7], i.e., a network of 14 cells and 48 input channels. The network is trained for 250 epochs with batch size 128 on a single NVIDIA 1080Ti GPU, which takes 12 days with the PyTorch [26] implementation. The network parameters are optimized using an SGD optimizer with an initial learning rate of 0.1 (decayed linearly after each epoch), a momentum of 0.9 and weight decay. Additional enhancements, including label smoothing and an auxiliary loss tower, are applied during training.

Evaluation results and comparisons with state-of-the-art approaches are summarized in Table 2. The architecture discovered by EoiNAS outperforms that discovered by GDAS in terms of both classification accuracy and model size, demonstrating the transferability of the discovered architecture from a small dataset to a large dataset.

4.5 Ablation Studies

In addition, we have conducted a series of ablation studies that validate the importance and effectiveness of the proposed operation importance indicator as well as the gradual operation pruning strategy incorporated in the design of EoiNAS.

In Table 3, we show ablation studies on CIFAR-10. GDAS [7] is our baseline framework. GOP denotes the gradual operation pruning strategy and OII denotes the proposed operation importance indicator. All architectures are trained for 600 epochs. As the results show, our super network exhibits fast convergence and high training accuracy owing to the gradual operation pruning strategy. The structure of the best cells discovered on CIFAR-10 is shown in Figure 7. By pruning inferior operations gradually during the search process, we achieve a clear improvement in performance while using less search time.

Table 3 also demonstrates the effectiveness of the proposed operation importance indicator. The proposed indicator can better judge the importance of each operation and achieve higher accuracy. These results reveal the necessity of the operation importance indicator.

Architecture | GOP | OII | Time (days) | Params (M) | Error (%)
Baseline | - | - | 1.0 | 3.4 | 2.93
Baseline + GOP | ✓ | - | 0.6 | 3.4 | 2.72
EoiNAS | ✓ | ✓ | 0.6 | 3.4 | 2.50
Table 3: Ablation studies on CIFAR-10. GDAS [7] is our baseline. GOP denotes the gradual operation pruning strategy and OII denotes the proposed operation importance indicator. All architectures are trained for 600 epochs.

4.6 Searched Architecture Analysis

In differentiable NAS methods, the architecture weights are not able to accurately reflect the importance of each operation, as discussed in Section 1, because the accuracy of the fully trained stand-alone models and the corresponding architecture weights have low correlation. The proposed operation importance indicator can better decide which operation should be kept on each edge and which edges should be the inputs of each node, especially for the selection of skip-connect.

The skip-connect operation plays an important role in the cell structure. As well studied in [12, 31], including a reasonable number and placement of skip connections makes gradient flow easier and the optimization of deep neural networks more stable. Comparing the searched results in Figure 6 and Figure 7, the architecture discovered by EoiNAS on CIFAR-10 tends to preserve the skip-connect operations in a hierarchical way, which facilitates gradient back-propagation and gives the network better convergence.

Figure 7: Detailed structure of the best cells discovered on CIFAR-10 using only the gradual operation pruning strategy. When pruning inferior operations and deriving the final architecture during the search process, the operation importance is determined only by the architecture weights.

Besides, comparing Figure 6 and Figure 7, we can see that EoiNAS encourages connections in a cell to cascade through more levels; in other words, there are more layers in the cell, making the evaluation network deeper and achieving better classification performance.

Finally, the combination of the operation importance indicator with the gradual operation pruning strategy lets the two components reinforce each other. The indicator accurately represents the importance of each operation and determines which operations remain and which are pruned. Meanwhile, by gradually pruning inferior operations, we obtain an increasingly accurate indicator.

5 Conclusion

In this paper, we presented EoiNAS, a simple yet efficient architecture search algorithm for convolutional networks, in which a new indicator was proposed to fully exploit the operation importance to guide the model search. A gradual operation pruning strategy was proposed during the search process to further improve the search efficiency. By gradually pruning the inferior operations based on the proposed operation importance indicator, EoiNAS drastically reduced the computation consumption while achieving excellent model accuracies on CIFAR-10/100 and ImageNet, which outperformed the human-designed networks and other state-of-the-art NAS methods.

References