AtomNAS
Code for ICLR 2020 paper 'AtomNAS: Fine-Grained End-to-End Neural Architecture Search'
The design of the search space is a critical problem for neural architecture search (NAS) algorithms. We propose a fine-grained search space comprised of atomic blocks, a minimal search unit much smaller than the ones used in recent NAS algorithms. This search space facilitates direct selection of channel numbers and kernel sizes in convolutions. In addition, we propose a resource-aware architecture search algorithm which dynamically selects atomic blocks during training. The algorithm is further accelerated by a dynamic network shrinkage technique. Instead of a search-and-retrain two-stage paradigm, our method can simultaneously search and train the target architecture in an end-to-end manner. Our method achieves state-of-the-art performance under several FLOPs configurations on ImageNet with a negligible searching cost. We open our entire codebase at: https://github.com/meijieru/AtomNAS
Human-designed neural networks have already been surpassed by machine-designed ones. Neural Architecture Search (NAS) has become the mainstream approach to discover efficient and powerful network structures (Zoph and Le (2017); Pham et al. (2018); Tan et al. (2019); Liu et al. (2019a)). Although the tedious searching process is conducted by machines, humans are still extensively involved in the design of the NAS algorithms. The design of search spaces is critical for NAS algorithms and different choices have been explored. Cai et al. (2019) and Wu et al. (2019) utilize supernets with multiple choices in each layer to accommodate a sampled network on the GPU. Chen et al. (2019b) progressively grow the depth of the supernet and remove unnecessary blocks during the search. Tan and Le (2019a) propose to search the scaling factor of image resolution, channel multiplier and layer numbers in scenarios with different computation budgets. Stamoulis et al. (2019b) propose to use different kernel sizes in each layer of the supernet and reuse the weights of larger kernels for small kernels. Howard et al. (2019); Tan and Le (2019b) adopt Inverted Residuals with Linear Bottlenecks (MobileNetV2 block) (Sandler et al., 2018), a building block with lightweight depth-wise convolutions for highly efficient networks in mobile scenarios.

However, the proposed search spaces generally have only a small set of choices for each block. DARTS and related methods (Liu et al., 2019a; Chen et al., 2019b; Liang et al., 2019) use around 10 different operations between two network nodes. Howard et al. (2019); Cai et al. (2019); Wu et al. (2019); Stamoulis et al. (2019b) search the expansion ratios in the MobileNetV2 block but still limit them to a few discrete values. We argue that a more fine-grained search space is essential to find optimal neural architectures. Specifically, the searched building block in a supernet should be as small as possible to generate the most diversified model structures.
We revisit the architectures of state-of-the-art networks (Howard et al. (2019); Tan and Le (2019b); He et al. (2016)) and find a commonly used building block: convolution - channel-wise operation - convolution. We reinterpret such a structure as an ensemble of computationally independent blocks, which we call atomic blocks. This new formulation enables a much larger and more fine-grained search space. Starting from a supernet which is built upon atomic blocks, the search for exact channel numbers and various operations can be achieved by selecting a subset of the atomic blocks.
For the efficient exploration of the new search space, we propose a NAS algorithm named AtomNAS to conduct architecture search and network training simultaneously. Specifically, an importance factor is introduced to each atomic block. A penalty term proportional to the computation cost of the atomic block is enforced on the network. By jointly learning the importance factors along with the weights of the network, AtomNAS selects the atomic blocks which contribute to the model capacity with relatively small computation cost.
Training on large supernets is computationally demanding. We observe that the scaling factors of many atomic blocks permanently vanish at the early stage of model training. We propose a dynamic network shrinkage technique which removes the ineffective atomic blocks on the fly and greatly reduces the computation cost of AtomNAS.
In our experiments, our method achieves 75.9% top-1 accuracy on the ImageNet dataset at around 360M FLOPs, which is 0.9% higher than the state-of-the-art model (Stamoulis et al., 2019b). By further incorporating additional modules, our method achieves 77.6% top-1 accuracy. It outperforms MixNet by 0.6% using 363M FLOPs, which is a new state-of-the-art under the mobile scenario.
In summary, the major contributions of our work are:
We propose a fine-grained search space which includes the exact number of channels and mixed operations (e.g., combination of different convolution kernels).
We propose an efficient end-to-end NAS algorithm named AtomNAS which can simultaneously search the network architecture and train the final model. No fine-tuning is needed after AtomNAS finishes.
With the proposed search space and AtomNAS, we achieve state-of-the-art performance on the ImageNet dataset under the mobile setting.
Recently, there has been a growing interest in automated neural architecture design. Reinforcement-learning-based NAS methods (Zoph and Le, 2017; Tan et al., 2019; Tan and Le, 2019b, a) are usually computationally intensive, which hampers their usage under a limited computational budget. To accelerate the search procedure, ENAS (Pham et al., 2018) represents the search space using a directed acyclic graph and aims to search the optimal subgraph within the large supergraph. A training strategy of parameter sharing among subgraphs is proposed to significantly increase the searching efficiency. The similar idea of optimizing optimal subgraphs within a supergraph is also adopted by Liu et al. (2019a); Wu et al. (2019); Guo et al. (2019); Cai et al. (2019). A prominent disadvantage of the above methods is that their coarse search spaces only include limited categories of properties, e.g. kernel size, expansion ratio, the number of layers, etc. Because of the restriction of the search space, it is difficult to learn optimal architectures under computational resource constraints. In contrast, our method proposes a fine-grained search space to enable searching more flexible network architectures under various resource constraints.

Assuming that many parameters in the network are unnecessary, network pruning methods start from a computation-intensive model, identify the unimportant connections and remove them to get a compact and efficient network. An early method (Han et al., 2016) simultaneously learns the important connections and weights. However, non-regularly removing connections in these works makes it hard to achieve the theoretical speedup ratio on realistic hardware due to extra overhead in caching and indexing. To tackle this problem, structured network pruning methods (He et al., 2017b; Liu et al., 2017; Luo et al., 2017; Ye et al., 2018; Gordon et al., 2018) are proposed to prune structured components in networks, e.g. the entire channel and kernel. In this way, empirical acceleration can be achieved on modern computing devices. Liu et al. (2017); Ye et al. (2018); Gordon et al. (2018) encourage channel-level sparsity by imposing the L1 regularizer on the channel dimension. Recently, Liu et al. (2019b) show that in structured network pruning, the learned weights are unimportant. This suggests structured network pruning is actually a neural architecture search focusing on channel numbers. Our method jointly searches the channel numbers and a mix of operations, which is a much larger search space.
We formulate our neural architecture search method in a fine-grained search space with the atomic block used as the basic search unit. An atomic block is comprised of two convolutions connected by a channel-wise operation. By stacking atomic blocks, we obtain larger building blocks (e.g. residual block and MobileNetV2 block) proposed in a variety of state-of-the-art models including ResNet and MobileNet V2/V3 (He et al., 2016; Howard et al., 2019; Sandler et al., 2018). In Section 3.1, we first show that larger network building blocks (e.g. MobileNetV2 block) can be represented by an ensemble of atomic blocks. Based on this view, we propose a fine-grained search space using atomic blocks. In Section 3.2, we propose a resource-aware atomic block selection method for end-to-end architecture search. Finally, we propose a dynamic network shrinkage technique in Section 3.3, which greatly reduces the search cost.
Under the typical block-wise NAS paradigm (Tan et al., 2019; Tan and Le, 2019b), the search space of each block in a neural network is represented as the Cartesian product $\mathcal{C} = \prod_i \mathcal{P}_i$, where each $\mathcal{P}_i$ is the set of all choices of the $i$-th configuration such as kernel size, number of channels and type of operation. For example, a product $\mathcal{P}_1 \times \mathcal{P}_2 \times \mathcal{P}_3$ with $|\mathcal{P}_1| = 3$, $|\mathcal{P}_2| = 2$ and $|\mathcal{P}_3| = 4$ represents a search space of three types of convolutions by two kernel sizes and four options of channel number. A block in the resulting model can only pick one convolution type from the three and one output channel number from the four values. This paradigm greatly limits the search space due to the few choices of each configuration. Here we present a more fine-grained search space by decomposing the network into smaller and more basic building blocks.
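To make the size of such a block-wise search space concrete, the following is a minimal sketch. The specific convolution types, kernel sizes and channel numbers are illustrative assumptions rather than values taken from the paper; only the 3 x 2 x 4 structure comes from the text. The helper `num_atomic_subsets` is likewise hypothetical and simply counts the subsets of a set of atomic blocks.

```python
from itertools import product

# Hypothetical per-configuration choice sets (illustrative values only).
conv_types = ["conv", "depthwise_conv", "dilated_conv"]
kernel_sizes = [3, 5]
channel_numbers = [24, 32, 64, 128]

# The block-wise search space is the Cartesian product of the choice sets.
search_space = list(product(conv_types, kernel_sizes, channel_numbers))
num_blockwise_choices = len(search_space)  # 3 * 2 * 4


def num_atomic_subsets(num_atomic_blocks):
    """Number of ways to keep or drop each atomic block independently."""
    return 2 ** num_atomic_blocks


print(num_blockwise_choices)
print(num_atomic_subsets(18))
```

Even a small number of independently selectable units yields far more candidates than the product of a few discrete options, which is the motivation for the decomposition that follows.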
We denote $f^{c \to c'}(X)$ as a convolution operator, where $X$ is the input tensor and $c$, $c'$ are the input and output channel numbers respectively. A wide range of manually-designed and NAS architectures share a structure that joins two convolutions by a channel-wise operation:

$$Y = \left(f_1^{c' \to c''} \circ g \circ f_0^{c \to c'}\right)(X) \tag{1}$$

where $g$ is a channel-wise operator. For example, in VGG (Simonyan and Zisserman, 2015) and a Residual Block (He et al., 2016), $f_0$ and $f_1$ are convolutions and $g$ is one of Maxpool, ReLU and BN-ReLU; in a MobileNetV2 block (Sandler et al., 2018), $f_0$ and $f_1$ are point-wise convolutions and $g$ is a depth-wise convolution with BN-ReLU. Eq. (1) can be reformulated as follows:

$$Y = \sum_{i=1}^{c'} \left(f_1[:, i] \circ g[i] \circ f_0[i, :]\right)(X) \tag{2}$$

where $f_0[i, :]$ is the $i$-th convolution kernel of $f_0$, $g[i]$ is the operator of the $i$-th channel of $g$, and $\{f_1[:, i]\}_{i=1}^{c'}$ are obtained by splitting the kernel tensor of $f_1$ along the input channel dimension. Each term in the summation can be seen as a computationally independent block, which is called an atomic block. Fig. 1 illustrates this reformulation. By determining whether to keep each atomic block in the final model individually, the search of channel number is enabled through channel selection, which greatly enlarges the search space.
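The reformulation in Eq. (2) can be checked numerically. The sketch below is an illustration under simplifying assumptions, not the authors' implementation: both convolutions are point-wise and evaluated at a single spatial position (so they reduce to matrix-vector products), and the channel-wise operator is a plain ReLU. All function and variable names are hypothetical.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


def fused_block(w0, w1, x):
    """Y = f1(g(f0(x))) with point-wise f0, f1 and channel-wise ReLU as g."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]


def atomic_sum(w0, w1, x):
    """Same output, accumulated one atomic block (one hidden channel) at a time."""
    y = [0.0] * len(w1)
    for i, row0 in enumerate(w0):
        # i-th kernel of f0 followed by the i-th channel of g.
        h_i = relu(sum(w * xi for w, xi in zip(row0, x)))
        # i-th input-channel slice of f1 scatters h_i to every output channel.
        for j in range(len(w1)):
            y[j] += w1[j][i] * h_i
    return y


rng = random.Random(0)
c, c_mid, c_out = 4, 6, 3
w0 = [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(c_mid)]
w1 = [[rng.uniform(-1, 1) for _ in range(c_mid)] for _ in range(c_out)]
x = [rng.uniform(-1, 1) for _ in range(c)]

y_fused = fused_block(w0, w1, x)
y_atomic = atomic_sum(w0, w1, x)
max_diff = max(abs(a - b) for a, b in zip(y_fused, y_atomic))
print(max_diff < 1e-9)
```

Dropping the `i`-th iteration of the loop in `atomic_sum` corresponds to removing one atomic block, that is, one intermediate channel.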
This formulation also naturally includes the selection of operators. To gain a better understanding, we first generalize Eq. (2) as:

$$Y = \sum_{i=1}^{c'} \left(f_{1i} \circ g_i \circ f_{0i}\right)(X) \tag{3}$$

Note the array indices are moved to subscripts. In this formulation, we can use different types of operators for $f_{0i}$, $g_i$ and $f_{1i}$; in other words, $f_0$, $g$ and $f_1$ can each be a combination of different operators and each atomic block can use different operators such as convolution with different kernel sizes.
Formally, the search space is formulated as a supernet which is built based on the structure in Eq. (1); such a structure satisfies Eq. (3) and thus can be represented by atomic blocks; each of $f_0$, $g$ and $f_1$ is a combination of operators. The new search space includes some state-of-the-art network architectures. For example, by allowing $g$ to be a combination of convolutions with different kernel sizes, the MixConv block in MixNet (Tan and Le, 2019b) becomes a special case in our search space. In addition, our search space facilitates discarding any number of channels in $g$, resulting in a more fine-grained channel configuration. In comparison, the channel numbers are determined heuristically in Tan and Le (2019b).

In this work, we adopt a differentiable neural architecture search paradigm where the model structure is discovered in a full pass of model training. With the supernet defined above, the final model can be produced by discarding part of the atomic blocks during training. Following DARTS (Liu et al. (2019a)), we introduce a scaling factor $\alpha$ to scale the output of each atomic block in the supernet. Eq. (3) then becomes

$$Y = \sum_{i=1}^{c'} \alpha_i \left(f_{1i} \circ g_i \circ f_{0i}\right)(X) \tag{4}$$

Here, each $\alpha_i$ is tied with an atomic block comprised of three operators $f_{0i}$, $g_i$ and $f_{1i}$. The scaling factors are learned jointly with the network weights. Once the training finishes, the atomic blocks with factors smaller than a threshold are discarded.
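The final selection step described above, discarding atomic blocks whose scaling factors fall below a threshold, can be sketched as follows. This is a minimal illustration; the function name, the example factors and the default threshold are assumptions made for the example.

```python
def select_atomic_blocks(scaling_factors, threshold=1e-3):
    """Return indices of atomic blocks whose |scaling factor| reaches the threshold."""
    return [i for i, a in enumerate(scaling_factors) if abs(a) >= threshold]


# Hypothetical learned scaling factors for six atomic blocks.
scaling_factors = [0.8, 2e-4, -0.05, 0.0, 1.3, -5e-4]
kept = select_atomic_blocks(scaling_factors)
print(kept)  # [0, 2, 4]
```

The magnitude is used rather than the signed value because a scaling factor of either sign can carry signal; only factors near zero mark a block as removable.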
We still need to address two issues related to the factor $\alpha$. First, where should we put the factors in the supernet? The scaling parameters in the BN layers can be directly used as such scaling factors (Liu et al. (2017)). In most cases, $g$ contains at least one BN layer and we use the scaling parameters of the last BN layer in $g$ as $\alpha$. If $g$ has no BN layers, which is rare, we can place $\alpha$ anywhere between $f_0$ and $f_1$, as long as we apply regularization terms to the weights of $f_0$ and $f_1$ (e.g., weight decay) in order to prevent the weights in $f_0$ and $f_1$ from getting too large and canceling the effect of $\alpha$.
The second issue is how to avoid performance deterioration after discarding some of the atomic blocks. For example, DARTS discards operations with small scale factors after iterative training of model parameters and scale factors. Since the scale factors of the discarded operations are not small enough, the performance of the network is affected and retraining is needed to adjust the weights again. In order to maintain the performance of the supernet after dropping some atomic blocks, the scaling factors of those atomic blocks should be sufficiently small. Inspired by the channel pruning work in Liu et al. (2017), we add an L1 norm penalty loss on $\alpha$, which effectively pushes many scaling factors to near-zero values. At the end of learning, atomic blocks with $\alpha$ close to zero are removed from the supernet. Note that since the BN scales change more dramatically during training due to the regularization term, the running statistics of BNs might be inaccurate and need to be calculated again using the training set.
With the added regularization term, the training loss is

$$\mathcal{L} = E + \lambda \sum_{i \in \mathcal{S}} c_i \, |\alpha_i| \tag{5}$$

$$c_i = \frac{\hat{c}_i}{\sum_{k \in \mathcal{S}} \hat{c}_k} \tag{6}$$

where $\lambda$ is the coefficient of the L1 penalty term, $\mathcal{S}$ is the index set of all atomic blocks, and $E$ is the conventional training loss (e.g. cross-entropy loss combined with the weight decay term). $|\alpha_i|$ is weighted by a coefficient $c_i$ which is proportional to the computation cost of the $i$-th atomic block, i.e. $\hat{c}_i$. By using computation-cost-aware regularization, we encourage the model to learn network structures that strike a good balance between accuracy and efficiency. In this paper, we use FLOPs as the criterion of computation cost. Other metrics such as latency and energy consumption can be used similarly. As a result, the whole loss function $\mathcal{L}$ trades off between accuracy and FLOPs.

Usually, the supernet is much larger than the final search result. We observe that many atomic blocks become “dead” starting from the early stage of the search, i.e., their scaling factors are close to zero till the end of the search. To utilize computational resources more efficiently and speed up the search process, we propose a dynamic network shrinkage algorithm which cuts down the network architecture by removing atomic blocks once they are deemed “dead”.
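Returning to the resource-aware loss of Eq. (5) and Eq. (6), the penalty can be sketched in a few lines. This is an illustration, not the released implementation: it assumes the cost coefficients are the per-block FLOPs normalized to sum to one, and every name and number below is hypothetical.

```python
def normalized_costs(flops):
    """Cost coefficient of each atomic block: its share of the total FLOPs."""
    total = float(sum(flops))
    return [f / total for f in flops]


def resource_aware_penalty(scaling_factors, flops, lam):
    """lam * sum_i c_i * |alpha_i|, with c_i proportional to the block's FLOPs."""
    costs = normalized_costs(flops)
    return lam * sum(c * abs(a) for c, a in zip(costs, scaling_factors))


def total_loss(task_loss, scaling_factors, flops, lam):
    """Conventional training loss plus the FLOPs-weighted L1 penalty."""
    return task_loss + resource_aware_penalty(scaling_factors, flops, lam)


# Hypothetical values: three atomic blocks with different costs.
scaling_factors = [1.0, -0.5, 0.2]
flops = [100, 300, 600]
penalty = resource_aware_penalty(scaling_factors, flops, lam=0.1)
loss = total_loss(2.0, scaling_factors, flops, lam=0.1)
print(round(penalty, 6), round(loss, 6))
```

Because an expensive block receives a larger coefficient, the same magnitude of scaling factor costs it more, so the optimizer keeps it only if it contributes enough to the training loss.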
We adopt a conservative strategy to decide whether an atomic block is “dead”: for scaling factors $\alpha$, we maintain its momentum $\hat{\alpha}$ which is updated as

$$\hat{\alpha} \leftarrow \beta \hat{\alpha} + (1 - \beta)\, \alpha_t \tag{7}$$

where $\alpha_t$ is the scaling factors at the $t$-th iteration and $\beta$ is the decay term. An atomic block is considered “dead” if both $\hat{\alpha}$ and $\alpha_t$ are smaller than a threshold, which is set to 1e-3 throughout experiments.
Once the total FLOPs of “dead” blocks reach a predefined threshold, we remove those blocks from the supernet. As discussed above, we recalculate BN’s running statistics before deploying the network. The whole training process is presented in Algorithm 1.
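One shrinkage step, combining the momentum update of Eq. (7), the "dead" test and the FLOPs-triggered removal, can be sketched as below. This is a simplified illustration rather than Algorithm 1 itself: blocks are stored in plain dictionaries keyed by a block id, magnitudes are compared against the threshold, and the names `shrink_step` and `dead_flops_trigger` as well as all numeric values are assumptions.

```python
def shrink_step(alphas, momentum, flops, beta, dead_flops_trigger, threshold=1e-3):
    """Update momentum, find dead blocks, and remove them if they are costly enough.

    alphas, momentum, flops: dicts mapping block id -> value for alive blocks.
    Returns the list of removed block ids (empty if nothing was removed).
    """
    # Exponential moving average of the scaling factors.
    for k in alphas:
        momentum[k] = beta * momentum[k] + (1.0 - beta) * alphas[k]

    # Conservative test: both the current factor and its momentum must be small.
    dead = [k for k in alphas
            if abs(alphas[k]) < threshold and abs(momentum[k]) < threshold]

    # Only rebuild the network once the dead blocks account for enough FLOPs.
    if sum(flops[k] for k in dead) >= dead_flops_trigger:
        for k in dead:
            del alphas[k], momentum[k], flops[k]
        return dead
    return []


alphas = {0: 0.9, 1: 1e-5, 2: 2e-4}
momentum = {0: 0.9, 1: 5e-4, 2: 0.5}
flops = {0: 10, 1: 4, 2: 6}
removed = shrink_step(alphas, momentum, flops, beta=0.9, dead_flops_trigger=3)
print(removed, sorted(alphas))
```

In this example block 2 has a small current factor but a large momentum, so it survives; requiring both quantities to be small guards against removing a block whose factor only dipped briefly.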
We show the FLOPs of a sample network during the search process in Fig. 2. We start from a supernet with 1521M FLOPs and dynamically discard “dead” atomic blocks to reduce search cost. As a result, the overall search-and-train cost is only slightly higher than that of training the searched model from scratch.
We first describe the implementation details in Section 4.1 and then compare AtomNAS with previous state-of-the-art methods under various FLOPs constraints in Section 4.2. Finally, we provide more analysis about AtomNAS in Section 4.3.
The picture on the left of Fig. 3 illustrates a search block in the supernet. Within this search block, $f_0$ is a point-wise convolution that expands the input channel number; $g$ is a mix of three depth-wise convolutions with different kernel sizes, and $f_1$ is another point-wise convolution that projects the channel number to the output channel number. Similar to Sandler et al. (2018), if the output dimension stays the same as the input dimension, we use a skip connection to add the input to the output. The number of atomic blocks in the search block equals the number of expanded channels. The overall architecture of the supernet is shown in the table on the right of Fig. 3. The supernet has 21 search blocks.
We use the same training configuration (e.g., RMSProp optimizer, EMA on weights and exponential learning rate decay) as Tan et al. (2019); Stamoulis et al. (2019b) and do not use extra data augmentation such as MixUp (Zhang et al., 2018) and AutoAugment (Cubuk et al., 2018). We find that using this configuration is sufficient for our method to achieve good performance. Our results are shown in Table 1 and Table 3. When training the supernet, we use a total batch size of 2048 on 32 Tesla V100 GPUs and train for 350 epochs. For our dynamic network shrinkage algorithm, we use a fixed momentum factor $\beta$ in Eq. (7). At the beginning of the training, all of the weights are randomly initialized. To avoid removing atomic blocks with high penalties (i.e., FLOPs) prematurely, the weight of the penalty term in Eq. (5) is increased from 0 to the target $\lambda$ by a linear scheduler during the first 25 epochs. By setting the weight of the L1 penalty term to three different values, we obtain networks with three different sizes: AtomNAS-A, AtomNAS-B, and AtomNAS-C. They have similar FLOPs to previous state-of-the-art networks under 400M FLOPs: MixNet-S (Tan and Le, 2019b), MixNet-M (Tan and Le, 2019b) and Single-Path (Stamoulis et al., 2019b).

Block  f  n  stride
3x3 conv  32(16)  1  2
3x3 MB  16  1  1
searchable  24  4  2
searchable  40  4  2
searchable  80  4  2
searchable  96  4  1
searchable  192  4  2
searchable  320  1  1
avgpool  -  1  1
fc  1000  1  -
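The linear warm-up of the penalty weight during the first 25 epochs can be sketched as follows. This is an illustration only: the function name is hypothetical, the target value used in the example is a placeholder rather than a value from the paper, and the epoch is allowed to be fractional so the weight can be updated per iteration.

```python
def penalty_weight(epoch, target, warmup_epochs=25):
    """Linearly ramp the L1 penalty weight from 0 to `target` over the warm-up."""
    if epoch >= warmup_epochs:
        return target
    return target * epoch / warmup_epochs


target = 1e-4  # placeholder target weight, not taken from the paper
for epoch in (0, 12.5, 25, 100):
    print(epoch, penalty_weight(epoch, target))
```

Starting from zero gives the randomly initialized weights time to become informative before expensive atomic blocks are pushed toward removal.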
We apply AtomNAS to search for high-performance lightweight models on the ImageNet 2012 classification task (Deng et al., 2009). Table 1 compares our methods with previous state-of-the-art models, either manually designed or searched.
With models directly produced by AtomNAS, our method achieves the new state-of-the-art under all FLOPs constraints. In particular, AtomNAS-C achieves 75.9% top-1 accuracy with only 360M FLOPs, and surpasses all other models, including models like P-DARTS and DenseNAS which have much higher FLOPs.
Techniques like the Swish activation function (Ramachandran et al., 2018) and the Squeeze-and-Excitation (SE) module (Hu et al., 2018) consistently improve the accuracy with marginal FLOPs cost. For a fair comparison with methods that use these techniques, we directly modify the searched network by replacing all ReLU activations with Swish and adding an SE module with ratio 0.5 to every block, and then retrain the network from scratch. Note that unlike other methods, we do not search the configuration of Swish and SE, and therefore the performance might not be optimal. Extra data augmentations such as MixUp and AutoAugment are still not used. We train the models from scratch with a total batch size of 4096 on 32 Tesla V100 GPUs for 250 epochs.

Simply adding these techniques improves the results further. AtomNAS-A+ achieves 76.3% top-1 accuracy with 260M FLOPs, which outperforms many heavier models including MnasNet-A2. It performs as well as EfficientNet-B0 (Tan and Le, 2019a) while using 130M fewer FLOPs and no extra data augmentations. It also outperforms the previous state-of-the-art MixNet-S by 0.5%. In addition, AtomNAS-C+ improves the top-1 accuracy on ImageNet to 77.6%, surpassing the previous state-of-the-art MixNet-M by 0.6% and becoming the overall best performing model under 400M FLOPs.
Fig. 4 visualizes the top-1 accuracy on ImageNet for different models. It is clear that our fine-grained search space and the end-to-end resource-aware search method boost the performance significantly.
Model  Parameters  FLOPs  Top-1(%)  Top-5(%)
MobileNetV1 (Howard et al., 2017)  4.2M  575M  70.6  89.5
MobileNetV2 (Sandler et al., 2018)  3.4M  300M  72.0  91.0
MobileNetV2 (our impl.)  3.4M  301M  73.6  91.5
MobileNetV2 (1.4)  6.9M  585M  74.7  92.5
ShuffleNetV2 (Ma et al., 2018)  3.5M  299M  72.6  -
ShuffleNetV2 2×  7.4M  591M  74.9  -
FBNet-A (Wu et al., 2019)  4.3M  249M  73.0  -
FBNet-C  5.5M  375M  74.9  -
Proxyless (mobile) (Cai et al., 2019)  4.1M  320M  74.6  92.2
Single-Path (Stamoulis et al., 2019b)  4.4M  334M  75.0  92.2
NASNet-A (Zoph and Le, 2017)  5.3M  564M  74.0  91.6
DARTS (second order) (Liu et al., 2019a)  4.9M  595M  73.1  -
P-DARTS (cifar 10) (Chen et al., 2019b)  4.9M  557M  75.6  92.6
DenseNAS-A (Fang et al., 2019)  7.9M  501M  75.9  92.6
FairNAS-A (Chu et al., 2019b)  4.6M  388M  75.3  92.4
AtomNAS-A  3.9M  258M  74.6  92.1
AtomNAS-B  4.4M  326M  75.5  92.6
AtomNAS-C  4.7M  360M  75.9  92.7
SCARLET-A (Chu et al., 2019a)  6.7M  365M  76.9  93.4
MnasNet-A1 (Tan et al., 2019)  3.9M  312M  75.2  92.5
MnasNet-A2  4.8M  340M  75.6  92.7
MixNet-S (Tan and Le, 2019b)  4.1M  256M  75.8  92.8
MixNet-M  5.0M  360M  77.0  93.3
EfficientNet-B0 (Tan and Le, 2019a)  5.3M  390M  76.3  93.2
SE-DARTS+ (Liang et al., 2019)  6.1M  594M  77.5  93.6
AtomNAS-A+  4.7M  260M  76.3  93.0
AtomNAS-B+  5.5M  329M  77.2  93.5
AtomNAS-C+  5.9M  363M  77.6  93.6
We plot the structure of the searched architecture AtomNAS-C in Fig. 5, from which we see more flexibility of channel number selection, not only among different operators within each block, but also across the network. In Fig. 5(a), we visualize the ratio between atomic blocks with different kernel sizes in all 21 search blocks. First, we notice that all search blocks have convolutions of all three kernel sizes, showing that AtomNAS learns the importance of using multiple kernel sizes in the network architecture. Another observation is that AtomNAS tends to keep more atomic blocks at the later stages of the network. This is because in earlier stages, convolutions of the same kernel size cost more FLOPs; AtomNAS is aware of this (thanks to its resource-aware regularization) and tries to keep as few computationally costly atomic blocks as possible.
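The observation that the same operation costs more in earlier stages follows from the feature-map resolution. The sketch below estimates the multiply-adds of one atomic block made of a point-wise expansion, a depth-wise convolution and a point-wise projection. It is an assumption-laden illustration: stride 1 is assumed, BN and activation costs are ignored, and the resolutions and channel numbers in the example are hypothetical.

```python
def atomic_block_flops(height, width, c_in, c_out, kernel_size):
    """Multiply-adds of one atomic block at a given feature-map resolution.

    Per spatial position: c_in for one expanded channel of the point-wise
    convolution, kernel_size**2 for the depth-wise convolution on that channel,
    and c_out for projecting that channel to every output channel.
    """
    per_position = c_in + kernel_size * kernel_size + c_out
    return height * width * per_position


# Hypothetical early block (high resolution, few channels) versus
# late block (low resolution, many channels), both with a 3x3 kernel.
early = atomic_block_flops(56, 56, c_in=24, c_out=24, kernel_size=3)
late = atomic_block_flops(7, 7, c_in=192, c_out=192, kernel_size=3)
print(early, late)
```

Under these assumptions the early block is roughly nine times more expensive than the late one despite having far fewer channels, which is why a FLOPs-weighted penalty prunes early blocks more aggressively.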
To demonstrate the effectiveness of the resource-aware regularization in Section 3.2, we compare it with a baseline without the FLOPs-related coefficients $c_i$, which is widely used in network pruning (Liu et al., 2017; He et al., 2017b). Table 2 shows the results. First, by using the same L1 penalty coefficient $\lambda$, the baseline achieves a network with similar performance but using many more FLOPs; then by increasing $\lambda$, the baseline obtains a network which has similar FLOPs but inferior performance (i.e., about 1.0% lower). In Fig. 5(b) we visualize the ratio of different types of atomic blocks of the baseline network. The baseline network keeps more atomic blocks in the earlier blocks, which have higher computation cost due to higher input resolution. In contrast, AtomNAS is aware of the resource constraint, thus keeping more atomic blocks in the later blocks and achieving much better performance.
Setting  FLOPs  Top-1(%)
Baseline (same $\lambda$)  445M  76.1
Baseline (larger $\lambda$)  370M  74.9
AtomNAS-C  360M  75.9
As the BN’s running statistics might be inaccurate as explained in Section 3.2 and Section 3.3, we recalculate the running statistics of BN before inference, by forwarding 131k randomly sampled training images through the network. Table 3 shows the impact of the BN recalibration. The top-1 accuracies of AtomNAS-A, AtomNAS-B, and AtomNAS-C on ImageNet improve by 1.4%, 1.7%, and 1.2% respectively, which clearly shows the benefit of BN recalibration.
Model  w/o Recalibration  w/ Recalibration
AtomNAS-A  73.2  74.6 (+1.4)
AtomNAS-B  73.8  75.5 (+1.7)
AtomNAS-C  74.7  75.9 (+1.2)
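BN recalibration amounts to discarding the stale running statistics and recomputing them from activations produced by forwarding training images through the final network. The sketch below shows the idea for a single channel. It is a simplification under stated assumptions: the activations are given as a flat list, the population variance is used, and the function name is hypothetical.

```python
def recalibrate_bn_stats(channel_activations):
    """Recompute the mean and variance of one BN channel from fresh activations."""
    n = len(channel_activations)
    mean = sum(channel_activations) / n
    var = sum((a - mean) ** 2 for a in channel_activations) / n
    return mean, var


# Hypothetical pre-BN activations of one channel, collected over sampled images.
activations = [1.0, 2.0, 3.0, 4.0]
mean, var = recalibrate_bn_stats(activations)
print(mean, var)  # 2.5 1.25
```

Only the statistics change in this step; the learned weights and scaling factors are left untouched.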
Our dynamic network shrinkage algorithm speeds up the search-and-train process significantly. For AtomNAS-C, the total time for search-and-training is 25.5 hours. For reference, training the final architecture from scratch takes 22 hours. Note that as the supernet shrinks, both the GPU memory consumption and the forward-backward time are significantly reduced. Thus it is possible to dynamically change the batch size once sufficient GPU memory becomes available, which would further speed up the whole procedure.
In this paper, we revisit the common structure, i.e., two convolutions joined by a channel-wise operation, and reformulate it as an ensemble of atomic blocks. This perspective enables a much larger and more fine-grained search space. To efficiently explore this huge fine-grained search space, we propose an end-to-end algorithm named AtomNAS, which conducts architecture search and network training jointly. The searched networks achieve significantly better accuracy than previous state-of-the-art methods at a small extra cost.
IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2980–2988.
MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861.
Single-path mobile automl: efficient convnet design and NAS hyperparameter optimization. CoRR abs/1907.00959.

In this section, we assess the performance of AtomNAS models as feature extractors for object detection and instance segmentation on the COCO dataset (Lin et al., 2014). We first pretrain AtomNAS models (without the Swish activation function (Ramachandran et al., 2018) and the Squeeze-and-Excitation (SE) module (Hu et al., 2018)) on ImageNet, use them as drop-in replacements for the backbone in the Mask-RCNN model (He et al., 2017a) by building the detection head on top of the last feature map, and fine-tune the model on the COCO dataset.
We use the open-source code MMDetection (Chen et al., 2019a). All the models are trained on COCO train2017 with batch size 16 and evaluated on COCO val2017. Following the schedule used in the open-source implementation of TPU-trained Mask-RCNN (https://github.com/tensorflow/tpu/tree/master/models/official/mask_rcnn), the learning rate is decreased by a factor of 10 at the 15th and 20th epochs. The models are trained for 23 epochs in total.

Table 4 compares the results with other baseline backbone models. The detection results of baseline models are from Stamoulis et al. (2019a). We can see that all three AtomNAS models outperform the baselines on the object detection task. The results demonstrate that our models have better transferability than the baselines, which may be because mixed operations (i.e., multi-scale operations here) are more important to object detection and instance segmentation.
Model  FLOPs  Cls (%)  detect-mAP (%)  seg-mAP (%)
MobileNetV2 (Sandler et al., 2018)  301M  73.6  30.5  -
Proxyless (mobile) (Cai et al., 2019)  320M  74.6  32.9  -
Proxyless (mobile) (our impl.)  320M  74.9  32.7  30.0
Single-Path+ (Stamoulis et al., 2019a)  -  75.6  33.0  -
Single-Path (our impl.)  334M  75.0  32.0  29.7
AtomNAS-A  258M  74.6  32.7  30.1
AtomNAS-B  326M  75.5  33.6  30.8
AtomNAS-C  360M  75.9  34.1  31.4
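The step learning-rate schedule used for COCO fine-tuning (a tenfold decrease at the 15th and 20th epochs over 23 epochs) can be sketched as below. The base learning rate is left as a parameter because it is not assumed here, the function name is hypothetical, and epochs are taken to be zero-indexed, which is an assumption about the convention.

```python
def step_learning_rate(epoch, base_lr, milestones=(15, 20), factor=0.1):
    """Multiply the learning rate by `factor` at every milestone already reached."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= factor
    return lr


# Relative learning rate (base_lr = 1.0) over the 23 training epochs.
schedule = [step_learning_rate(epoch, base_lr=1.0) for epoch in range(23)]
print(schedule[0], schedule[15], schedule[22])
```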
