Neural architecture search has recently been dominated by one-shot methods [Brock et al.2018, Bender et al.2018, Stamoulis et al.2019, Guo et al.2019, Cai, Zhu, and Han2019]. Fundamentally, a supernet which incorporates the whole search space enjoys fast convergence through weight sharing. Evaluating the performance of a model by picking a single path from the supernet then becomes straightforward. According to [Chu et al.2019], fair training of the supernet shows a remarkable improvement in model ranking. However, the scalability of a supernet is quite limited. By contrast, since pure reinforcement or evolutionary approaches train each model independently for evaluation, shallower models can also stand out if they exhibit good performance. This is very beneficial, as it achieves automatic architectural compression. To enable a similar property in the family of one-shot methods, we install identity choice blocks for network downscaling, accompanied by a linearly equivalent transformation that relays inter-block information.
Figure 1: Top: the accuracy of the supernet with LET has a much smaller standard deviation. Bottom: histogram of training accuracies of sampled one-shot models within the last epoch.
To summarize, our main contributions are threefold.
We uncover an important but neglected illness of previous one-shot architecture search methods: their lack of scalability.
We present proofs of why a plain identity block fails to assure fair scalability. We propose a simple adjustment, linearly equivalent transformation, that distinctly improves supernet training.
Our generated networks outperform their counterparts, notably EfficientNet-B0 [Tan and Le2019]. SCARLET-A claims a new state-of-the-art 76.9% top-1 accuracy on ImageNet at the level of 400M multiply-adds. SCARLET-C also reaches a competitive 75.6% with much fewer multiply-adds. More importantly, SCARLET-B, the closest model to EfficientNet-B0 and a shallow version from our search space, distinguishes itself with 76.3%.
One-Shot Neural Architecture Search
In one-shot approaches, a supernet is constructed to represent the whole search space, within which each path is a stand-alone model. The supernet is trained only once; child models inherit its weights, which makes performance evaluation easier and faster than with other incomplete-training techniques. Notable works are [Bender et al.2018, Stamoulis et al.2019, Guo et al.2019, Cai, Zhu, and Han2019]. Recent advances address training stability and the mismatched accuracy range between one-shot models and stand-alone ones: FairNAS achieves better convergence and results by enforcing strict fairness constraints [Chu et al.2019].
Scalability and Network Transformation
Given a neural network, scaling its size up or down for various application scenarios has been studied experimentally. Common model upscaling practices include increasing depth, width, as well as input image resolution [He et al.2016, Zagoruyko and Komodakis2016, Huang et al.2018]. Recent work by [Tan and Le2019] proposes an effective compound scaling method that incorporates all three, with a balance found by grid search. Nevertheless, these methods leave parameter sharing out of the discussion; each scaled network has to be trained from scratch. Network transformation is a solution for model scaling that reuses the weights of the original structure. For instance, Net2Net by [Chen, Goodfellow, and Shlens2015] introduced two transformation schemes to pass parameters on to either wider or deeper student networks.
Review of the Supernet Training with Variable Depths
The training of scalable supernets has gone unheeded, or at least has not been carefully dealt with. Generally, mainstream one-shot approaches suffer from unstable training [Bender et al.2018]. This problem deteriorates when scalability is considered. Zero operations are added to skip blocks in [Cai, Zhu, and Han2019] to allow flexibility in width and depth, but its training details are not thoroughly discussed. [Guo et al.2019] adopts the same search-space design to draw a fair comparison, yet the intermediate process is not reported either, so we cannot tell how much difference skip connections make.
We are motivated by [Chu et al.2019], which displays interesting feature maps after the first choice blocks and attests that all choice blocks learn similar knowledge, even at corresponding channels. This is an important incentive for us to investigate whether fair training of a scalable supernet is possible. We argue that ensuring fairness for common scalability practices like identity blocks is necessary. As pure identity blocks are direct short paths that learn no information, we accommodate this defect by injecting a learning unit: a convolution without non-linear activation. We name it linearly equivalent transformation (LET); see Figure 2. Be aware that this layer carries parameters in the one-shot supernet, whereas in stand-alone models they can be folded into the subsequent layer, as we prove below.
Given an input chickadee image from ImageNet (ID: n01592084_7680), we illustrate both low-level and high-level feature maps of the supernet trained with our proposed improvements in Figure 5. Pure identity easily interferes with the training process, as it causes possible channel reduction and non-congruence with other choice blocks. Note that the channel size of the feature map after Choice 6 in Figure 5 (a) is half that of the others, because the previous channel size is 16 while the other choice blocks output 32 channels. This effect is largely attenuated with LET. As the network goes deeper, we still observe consistent high-level features. Specifically, when LET is not enforced, high-level features in deeper channels easily get blurred out, while the supernet with LET enabled continues to learn useful features there. We discuss this formally in the next section.
Stabilize Training by Equivalent Transformation
A critical requirement for the transformation is equivalence, so that the transformed model behaves exactly as the original entity. To proceed, we first define equivalence.
Given a valid input space $\mathcal{X}$, a function $f$ is equivalent to $g$ on $\mathcal{X}$ if and only if $f(x) = g(x)$ for all $x \in \mathcal{X}$, where $x$, $f(x)$ and $g(x)$ are tensors of any shape.

The equivalence of two neural networks: models $A$ and $B$ with weights $W_A$ and $W_B$ are equivalent if and only if $A(\cdot\,; W_A)$ and $B(\cdot\,; W_B)$ are equivalent.
The 2D convolution and the fully-connected (FC) layer are two of the most widely used operations in deep neural networks. For the image classification task, a mini-batch of $N$ images with $C$ channels can be denoted by $x \in \mathbb{R}^{N \times C \times H \times W}$. Let $A$ be a deep neural network with $n$ layers $L_1, \dots, L_n$, and let $x_i$ be the feature maps of the $i$-th layer. For simplicity, we omit the batch dimension $N$. Hence, $x_i = L_i(x_{i-1})$ for $i = 1, \dots, n$, with $x_0 = x$.
In general, the shape of $x_i$ can be $C_i \times H_i \times W_i$ or $C_i$, where the former is generated from 2D convolutions and the latter from fully-connected operations. We denote a 2D convolution by its weight tensor $W \in \mathbb{R}^{C_{out} \times C_{in} \times k \times k}$. For convenience, we use square kernels, neglect dilation rates, and consider only stride 1; other setups can easily be proven in the same manner. Here, we consider two scenarios:
$x_{i-1}$ and $x_i$ have shapes $C_{i-1}$ and $C_i$ respectively; $L_i$ is a linear operation without activation, and the first operation of $L_{i+1}$ is a linear (fully-connected) layer.
$x_{i-1}$ and $x_i$ have shapes $C_{i-1} \times H_{i-1} \times W_{i-1}$ and $C_i \times H_i \times W_i$ respectively; $L_i$ is a 2D convolution without activation or bias, and the first operation of $L_{i+1}$ is a 2D convolution.
Note that the choice block in a layer can be either a simple or a complex block; we only require that it begins with a 2D convolution or an FC layer. This requirement is weak enough to cover most neural networks.
For the first scenario, we replace $L_i$ and the first FC operation of $L_{i+1}$ with an identity operation and another linear layer $L'_{i+1}$ to construct $B$ from $A$; then we can ensure $B \equiv A$.

First, we copy $B$'s weights from $A$ except for $L_i$ and $L_{i+1}$. It suffices to make $L'_{i+1}$ equivalent to the two successive operations above. For any $x_{i-1}$, let $W_i$ and $W_{i+1}$ denote the weight matrices of $L_i$ and the first FC of $L_{i+1}$, and let $W'$ be the weight matrix of $L'_{i+1}$.

Second, we can calibrate $W' = W_{i+1} W_i$, since $W_{i+1} (W_i\, x_{i-1}) = (W_{i+1} W_i)\, x_{i-1}$.

Combining both steps gives $B \equiv A$. ∎
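The first scenario can be checked numerically: two successive bias-free linear layers collapse into a single layer whose weight is the product of the two weight matrices. A minimal numpy sketch (the dimensions are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: L_i maps 16 -> 16 features, L_{i+1} maps 16 -> 32.
W_i = rng.standard_normal((16, 16))    # weight of the linear LET layer L_i
W_i1 = rng.standard_normal((32, 16))   # weight of the first FC of L_{i+1}
x = rng.standard_normal(16)            # feature vector x_{i-1}

# Network A: two successive linear layers (no activation, no bias).
y_a = W_i1 @ (W_i @ x)

# Network B: identity followed by the single calibrated layer W' = W_{i+1} W_i.
W_prime = W_i1 @ W_i
y_b = W_prime @ x                      # identity(x) fed into the merged FC

assert np.allclose(y_a, y_b)
```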
For the second scenario, we substitute and with an identity operation and another 2D convolution to construct from , then we can ensure .
First, we copy ’s weights from except for the and . The only thing to prove is that we can replace and with equivalently. Second, We prove that any , the above declaration holds. Let and be the weight matrices of and . Let be the weight matrix of .
We can make by setting
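The convolutional case admits the same numerical check: a bias-free 1×1 convolution followed by a k×k convolution equals one k×k convolution whose kernel contracts the two weight tensors over the shared channel axis. A sketch with a naive numpy convolution (all shapes are illustrative):

```python
import numpy as np

def conv2d(x, w):
    """Valid 2D convolution (cross-correlation, as in deep learning
    frameworks): x has shape (C_in, H, W), w has shape (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - k + 1, wd - k + 1))
    for o in range(c_out):
        for i in range(h - k + 1):
            for j in range(wd - k + 1):
                out[o, i, j] = np.sum(w[o] * x[:, i:i + k, j:j + k])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))          # feature map x_{i-1}
w_let = rng.standard_normal((4, 4, 1, 1))   # 1x1 LET convolution L_i
w_next = rng.standard_normal((6, 4, 3, 3))  # first conv of L_{i+1}

# Network A: LET conv followed by the 3x3 conv.
y_a = conv2d(conv2d(x, w_let), w_next)

# Network B: identity, then one conv whose kernel contracts the 1x1 LET
# weights into the 3x3 kernel over the channel dimension.
w_merged = np.einsum('ochw,ci->oihw', w_next, w_let[:, :, 0, 0])
y_b = conv2d(x, w_merged)

assert np.allclose(y_a, y_b)
```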
Since we don't search the number of channels, we replace the identity choice with a $1 \times 1$ convolution without bias or activation for convolutional networks, and with an FC layer without activation when the identity sits between FC layers. This procedure is illustrated in Figure 2: the left-most architecture is the original and commonly used version, the middle one is its equivalent version, and the right-most one is the architecture for stand-alone training. Attention should be paid when training stand-alone models: the number of feature maps usually increases with depth, so when the input and output channel counts differ, as in Figure 2, we must adjust the input channel number to make the convolution work.
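As a concrete illustration, the identity replacement can be sketched as a bias-free, activation-free 1×1 convolution; the class and its names below are ours, not from the paper:

```python
import numpy as np

class LinearTransformBlock:
    """Sketch of an LET choice block: a 1x1 convolution without bias or
    activation. When the input and output channel counts differ, the 1x1
    kernel also maps C_in -> C_out, which a pure identity cannot do."""
    def __init__(self, c_in, c_out, rng):
        self.w = rng.standard_normal((c_out, c_in))  # 1x1 kernel, squeezed

    def __call__(self, x):  # x: (C_in, H, W)
        # A 1x1 convolution is just a per-pixel linear map over channels.
        return np.einsum('oc,chw->ohw', self.w, x)

rng = np.random.default_rng(0)
block = LinearTransformBlock(16, 32, rng)   # e.g. the 16 -> 32 case above
y = block(rng.standard_normal((16, 8, 8)))
assert y.shape == (32, 8, 8)
```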
We perform the search directly on ImageNet 1k dataset [Deng et al.2009] and randomly select 50k images from the training set as a validation set. The original validation set is used as the test set to report accuracy.
We consider two search spaces in this paper. The difference lies in whether squeeze-and-excitation (SE) [Hu, Shen, and Sun2018] is included in the search. The version without SE is used for comparisons within a shorter time; the other is used for fair comparisons with MnasNet [Tan et al.2019] and EfficientNet [Tan and Le2019].
Scalable MBV2 Search Space. We utilize standard MobileNetV2 inverted bottleneck blocks [Sandler et al.2018] to build up our search space, following [Cai, Zhu, and Han2019]. We let the convolutional kernels range over $\{3, 5, 7\}$ and the expansion rates over $\{3, 6\}$; the number of filter channels per layer is retained. On top of this, we include an identity block with linearly equivalent transformation for scalability, giving 7 choices per layer and a search space of size $7^{19}$.
Scalable MBV2 Search Space with SE. We align our search space with the latest works by including squeeze-and-excitation blocks. Specifically, we give each MBV2 block an SE option, so in total we have 13 options per layer and the overall search space size becomes $13^{19}$. The detailed choice blocks per layer are displayed in Table 1. Note that index 12 refers to the identity block with the equivalent transformation ($1 \times 1$ Conv); the remaining choices are typical MBV2 blocks.
We build our pipeline on a multi-objective approach [Chu, Zhang, and Xu2019] with three objectives: classification error, multiply-adds, and the number of parameters. We choose multiply-adds because we don't search models for specific hardware. We also impose a FLOPs constraint to act as a mobile requirement. NSGA-II [Deb et al.2002] is a powerful approach for this kind of problem, so we build our work on top of it.
As [Zhang et al.2018] states, mobile models are prone to underfitting rather than overfitting. Therefore, we also try to maximize the number of parameters, and we adopt the weighted NSGA-II approach of [Chu, Zhang, and Xu2019] to weight the different objectives. In short, our problem is a constrained three-objective optimization: minimize the classification error and the multiply-adds, and maximize the number of parameters, with each objective assigned its own preference weight.
We encode the chromosome with per-layer choice indices, so a model is represented by a sequence of 19 indices. Otherwise, we follow the standard NSGA-II procedure and only point out differences where necessary.
We initialize the population with varied choice blocks to encourage exploration.
For simplicity, we use single-point crossover.
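The encoding and crossover described above can be sketched as follows; the helper names are ours, and the 19-layer, 13-choice dimensions correspond to the search space with SE:

```python
import random

NUM_LAYERS = 19    # searchable layers
NUM_CHOICES = 13   # choice blocks per layer (Table 1)

def random_chromosome(rng):
    """A model is encoded as one choice index per layer."""
    return [rng.randrange(NUM_CHOICES) for _ in range(NUM_LAYERS)]

def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two chromosomes at a random cut point."""
    point = rng.randrange(1, NUM_LAYERS)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

rng = random.Random(0)
a, b = random_chromosome(rng), random_chromosome(rng)
c, d = single_point_crossover(a, b, rng)
assert len(c) == NUM_LAYERS and len(d) == NUM_LAYERS
```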
Weighted Non-dominated Sorting
We encode different preferences for the various objectives by defining a weighted crowding distance.
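One possible reading of the weighted crowding distance is the standard NSGA-II crowding distance with each objective's normalized contribution scaled by a preference weight; the exact formulation in [Chu, Zhang, and Xu2019] may differ, so treat this as a sketch:

```python
import math

def weighted_crowding_distance(front, weights):
    """Crowding distance over a non-dominated front, with each objective's
    normalized gap scaled by a preference weight (hypothetical formulation).
    front: list of objective vectors; weights: one weight per objective."""
    n = len(front)
    dist = [0.0] * n
    for k, w in enumerate(weights):
        order = sorted(range(n), key=lambda i: front[i][k])
        span = front[order[-1]][k] - front[order[0]][k] or 1.0
        dist[order[0]] = dist[order[-1]] = math.inf  # always keep extremes
        for j in range(1, n - 1):
            i = order[j]
            if dist[i] != math.inf:
                gap = front[order[j + 1]][k] - front[order[j - 1]][k]
                dist[i] += w * gap / span
    return dist

# Toy front: (classification error, multiply-adds), equal preference weights.
front = [(0.24, 300.0), (0.25, 280.0), (0.26, 250.0)]
d = weighted_crowding_distance(front, weights=(0.5, 0.5))
```

Individuals with a larger weighted distance are preferred during selection, which biases the search toward the preferred regions of the Pareto front.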
We run 120 epochs with a population size of 70 to obtain 8400 models. This search stage takes about 1.5 GPU days on a Tesla V100. We then sample 4 models from the final Pareto front with approximately equal crowding distance and train them to completion.
For supernet training, we use the same settings as FairNAS [Chu et al.2019], except that we train for 60 epochs; this takes about 10 GPU days.
For the full training stage, we use the same configuration as FairNAS [Chu et al.2019] and standard Inception pre-processing [Szegedy et al.2017]. Unlike EfficientNet, we don't apply the AutoAugment policy [Cubuk et al.2018], because many state-of-the-art algorithms report their results without it. Moreover, we use the RMSProp optimizer with 0.9 momentum, a batch size of 4096, and an initial learning rate of 0.256 that decays by 0.01 every 2.4 epochs. We use a dropout rate of 0.2 [Srivastava et al.2014] before the last FC layer, along with L2 weight decay.
Search hyper-parameters: population N = 70, mutation ratio = 0.8.
Table 3: Comparison with state-of-the-art models on ImageNet.

Model | Mult-Adds (M) | Params (M) | Top-1 (%) | Top-5 (%)
MobileNetV2 1.0 [Sandler et al.2018] | 300 | 3.4 | 72.0 | 91.0
MobileNetV3 Large 1.0 [Howard et al.2019] | 219 | 5.4 | 75.2 | 92.2
MnasNet-A1 [Tan et al.2019] | 312 | 3.9 | 75.2 | 92.5
MnasNet-A2 [Tan et al.2019] | 340 | 4.8 | 75.6 | 92.7
FBNet-B [Wu et al.2019] | 295 | 4.5 | 74.1 | -
Proxyless-R Mobile [Cai, Zhu, and Han2019] | 320 | 4.0 | 74.6 | 92.2
Proxyless GPU [Cai, Zhu, and Han2019] | 465 | 7.1 | 75.1 | -
Single-Path NAS [Stamoulis et al.2019] | 365 | 4.3 | 75.0 | 92.2
FairNAS-A [Chu et al.2019] | 388 | 4.6 | 75.3 | 92.4
MoGA-A [Chu, Zhang, and Xu2019] | 304 | 5.1 | 75.9 | 92.8
EfficientNet B0 [Tan and Le2019] | 390 | 5.3 | 76.3 | 93.2
Comparison with State-of-the-Art Methods
Table 3 gives a clear comparison of state-of-the-art models on the ImageNet dataset. We pick models within the range of 200M to 400M FLOPs. Our SCARLET series marks a new state of the art: SCARLET-A surpasses EfficientNet-B0 by +0.5% top-1 accuracy with 25M fewer FLOPs, and SCARLET-B achieves the same top-1 accuracy with 61M fewer FLOPs. Although both A and B have more parameters, this should be encouraged, as it relates to representational power [Chu, Zhang, and Xu2019] and does not necessarily increase inference latency. SCARLET-C also achieves a competitive result, with +0.3% top-1 accuracy over FairNAS-A at 108M fewer FLOPs.
SCARLET-A makes full use of large kernels (5×5 and 7×7) to enlarge the receptive field. Besides, it activates many squeeze-and-excitation blocks (12 out of 19) to improve its classification performance. At the early stages, it favors either large kernels with small expansion ratios or small kernels with large expansion ratios to balance the trade-off between accuracy and FLOPs.
SCARLET-B chooses two identity operations. Compared with A, it shortens the network depth at the last stages. Besides, it utilizes squeeze-and-excitation blocks extensively (14 out of 17) and places a large-expansion block with large kernels at the tail stage.
SCARLET-C uses three identity operations and mostly small expansion ratios to cut down FLOPs, reserving large expansion ratios for the tail stage, where the resolution is lowest. It prefers large kernels before the downsampling layers. Besides, it makes extensive use of squeeze-and-excitation to boost accuracy.
Equivalent Transformation vs. Identity
To check the validity of our method, we use the plain identity as a baseline choice block, as is common in prior works. We train the two supernets under the same settings for 60 epochs and report the average running accuracy and its standard deviation per epoch in Figure 1 (top). With the linearly equivalent transformation, the supernet obtains a clearly higher top-1 accuracy on the training set than the baseline, and with much lower variance, indicating that each model is trained fairly. To further verify this, we sample all the models in the last (60th) epoch and report their accuracies as a histogram in Figure 1 (bottom). Identity disturbs the training, and quite a few models suffer seriously, their accuracies falling far below the rest; such models are severely under-estimated by the supernet. LET, in contrast, compensates for this and brings the models into a reasonable range.
Equivalent vs. Non-equivalent Transformation
Here we show that a non-equivalent transformation changes the representational power of a neural network: simply adding a ReLU violates the equivalence. To demonstrate this, we randomly sample a model encoding and forcibly flip some choice blocks to identity: [1, 3, 1, 0, 12, 0, 0, 0, 12, 12, 12, 12, 12, 0, 0, 0, 12, 12, 9], where 12 denotes the identity block. We train this model with ReLU (non-equivalent) and without it (equivalent) on the identity layers, using the same seeds, training tricks, initialization strategy, and hyper-parameters. Figure 3 indicates that this trivial modification non-negligibly affects representational power. In our scalable supernet we must guarantee an equivalent transformation, which is why ReLU cannot be applied.
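The loss of equivalence caused by ReLU can be seen on a two-layer toy example: without the activation the pair collapses exactly into one matrix, while with it the collapsed matrix no longer reproduces the output:

```python
import numpy as np

# Two tiny bias-free layers: W1 plays the role of the identity-slot
# transformation, W2 the first operation of the next block.
W1 = np.array([[1., 0.], [0., -1.]])
W2 = np.array([[1., 1.], [1., -1.]])
x = np.array([1., 1.])

# Linearly equivalent: the pair collapses exactly into a single matrix.
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# With a ReLU in between, the negative component is clipped, so the
# collapsed matrix no longer reproduces the output: equivalence is lost.
with_relu = W2 @ np.maximum(W1 @ x, 0.0)   # ReLU(W1 @ x) = [1, 0]
without = (W2 @ W1) @ x
assert not np.allclose(with_relu, without)
```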
Discussion and Future Work
Weight sharing is one of the most critical features of efficient neural architecture search. Most one-shot approaches concentrate on finding good networks among parallel choices. This schema hardly meets the requirement of flexibility and can even cause inherent conflicts. Since a neural network learns features layer by layer, it is highly sensitive to any scaling operation. Our equivalent transformation can be regarded as a buffer for such operations. Nevertheless, it works best under certain conditions: in our search spaces, a linear transformation only has to match a single inverted bottleneck layer, which involves just two non-linear activation functions. When the matched function is more complicated, it becomes harder to compensate.
In this paper, we unveil the overlooked scalability issue in one-shot neural architecture search approaches. We show that simply adding identity blocks introduces training instability. By compensating the learning process with a linearly equivalent transformation, we close the gap between scalability and stability. We prove and demonstrate that such a transformation preserves representational power. The renewed supernet can then be trained to the desired convergence and delivers competitive neural architectures. Namely, with fewer FLOPs than EfficientNet-B0, SCARLET-A achieves 76.9% top-1 accuracy on ImageNet. SCARLET-B illustrates that shallower models can perform better, hitting the same 76.3% as EfficientNet-B0 with much reduced FLOPs. SCARLET-C reaches 75.6%, also exceeding its peers of similar size.
- [Bender et al.2018] Bender, G.; Kindermans, P.-J.; Zoph, B.; Vasudevan, V.; and Le, Q. 2018. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, 549–558.
- [Brock et al.2018] Brock, A.; Lim, T.; Ritchie, J. M.; and Weston, N. 2018. Smash: one-shot model architecture search through hypernetworks. International Conference on Learning Representations.
- [Cai, Zhu, and Han2019] Cai, H.; Zhu, L.; and Han, S. 2019. Proxylessnas: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations.
- [Chen, Goodfellow, and Shlens2015] Chen, T.; Goodfellow, I.; and Shlens, J. 2015. Net2net: Accelerating learning via knowledge transfer. International Conference on Learning Representations.
- [Chu et al.2019] Chu, X.; Zhang, B.; Xu, R.; and Li, J. 2019. Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845.
- [Chu, Zhang, and Xu2019] Chu, X.; Zhang, B.; and Xu, R. 2019. Moga: Searching beyond mobilenetv3. arXiv preprint arXiv:1908.01314.
- [Cubuk et al.2018] Cubuk, E. D.; Zoph, B.; Mane, D.; Vasudevan, V.; and Le, Q. V. 2018. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
- [Deb et al.2002] Deb, K.; Pratap, A.; Agarwal, S.; and Meyarivan, T. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2):182–197.
- [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.
- [Guo et al.2019] Guo, Z.; Zhang, X.; Mu, H.; Heng, W.; Liu, Z.; Wei, Y.; and Sun, J. 2019. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420.
- [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
- [Howard et al.2019] Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. 2019. Searching for mobilenetv3. arXiv preprint arXiv:1905.02244.
- [Hu, Shen, and Sun2018] Hu, J.; Shen, L.; and Sun, G. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141.
- [Huang et al.2018] Huang, Y.; Cheng, Y.; Chen, D.; Lee, H.; Ngiam, J.; Le, Q. V.; and Chen, Z. 2018. Gpipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965.
- [Sandler et al.2018] Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520.
- [Schulman et al.2017] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- [Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
- [Stamoulis et al.2019] Stamoulis, D.; Ding, R.; Wang, D.; Lymberopoulos, D.; Priyantha, B.; Liu, J.; and Marculescu, D. 2019. Single-path nas: Designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877.
- [Szegedy et al.2017] Szegedy, C.; Ioffe, S.; Vanhoucke, V.; and Alemi, A. A. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.
- [Tan and Le2019] Tan, M., and Le, Q. V. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning.
- [Tan et al.2019] Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; and Le, Q. V. 2019. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- [Wu et al.2019] Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; and Keutzer, K. 2019. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [Zagoruyko and Komodakis2016] Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. Proceedings of the British Machine Vision Conference.
- [Zhang et al.2018] Zhang, X.; Zhou, X.; Lin, M.; and Sun, J. 2018. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [Zoph et al.2018] Zoph, B.; Vasudevan, V.; Shlens, J.; and Le, Q. V. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8697–8710.