Introduction
Neural architecture search (NAS) has recently been dominated by one-shot methods [Brock et al. 2018; Bender et al. 2018; Stamoulis et al. 2019; Guo et al. 2019; Cai, Zhu, and Han 2019]. Fundamentally, a supernet that incorporates the whole search space enjoys fast convergence through weight sharing, and the performance of a candidate model can then be conveniently evaluated by picking a single path from the supernet. According to [Chu et al. 2019], fair training of the supernet remarkably improves model ranking. However, the scalability of a supernet is quite limited. By contrast, since pure reinforcement-learning or evolutionary approaches train each model independently for evaluation, shallower models can also stand out if they perform well; this is beneficial because it achieves automatic architectural compression. To enable a similar property in the one-shot family, we install identity choice blocks for network downscaling, accompanied by a linearly equivalent transformation that relays inter-block information.
Figure 1. Top: the accuracy of the supernet with LET has a much smaller standard deviation. Bottom: histogram of training accuracies of sampled one-shot models within the last epoch.
To summarize, our main contributions are threefold.

We uncover an important but neglected deficiency of previous one-shot architecture search methods: the lack of scalability.

We prove why a plain identity block fails to assure fair scalability, and we propose a simple adjustment, the linearly equivalent transformation, that distinctly improves the supernet's strength.

Our generated networks enjoy overwhelming performance compared with their counterparts, especially EfficientNet-B0 [Tan and Le 2019]. SCARLET-A claims a new state-of-the-art 76.9% top-1 accuracy on ImageNet at the level of 400M multiply-adds. SCARLET-C also reaches a competitive 75.6% with far fewer multiply-adds. More importantly, the closest model to EfficientNet-B0, SCARLET-B, a shallow version from our search space, distinguishes itself with 76.3%.
Related Works
One-Shot Neural Architecture Search
In one-shot approaches, a supernet is constructed to represent the whole search space, within which each path is a standalone model. The supernet is trained only once; child models inherit its weights, so evaluating their performance is easier and faster than with other incomplete-training techniques. Notable works include [Bender et al. 2018; Stamoulis et al. 2019; Guo et al. 2019; Cai, Zhu, and Han 2019]. Recent advances address training stability and the accuracy-range mismatch between one-shot models and standalone ones; FairNAS achieves better convergence and results by enforcing strict fairness constraints [Chu et al. 2019].
Scalability and Network Transformation
Scaling a given neural network up or down for various application scenarios has been studied experimentally. Common model-upscaling practices include increasing depth, width, and input image resolution [He et al. 2016; Zagoruyko and Komodakis 2016; Huang et al. 2018]. Recent work by [Tan and Le 2019] proposes an effective compound scaling method that balances all three via grid search. Nevertheless, these methods leave parameter sharing out of the discussion: each scaled network has to be trained from scratch. Network transformation enables model scaling by reusing the weights of the original structure. For instance, Net2Net [Chen, Goodfellow, and Shlens 2015] introduced two transformation schemes to pass parameters on to wider or deeper student networks.

Review of the Supernet Training with Variable Depths
The training of scalable supernets has gone largely unexamined, or at least has not been dealt with carefully. Generally, mainstream one-shot approaches suffer from unstable training [Bender et al. 2018], and this problem deteriorates when scalability is considered. Zero operations are added to skip blocks in [Cai, Zhu, and Han 2019] to gain flexibility in width and depth; however, its training details are not thoroughly discussed. [Guo et al. 2019] adopts the same search space design to draw a fair comparison, but the intermediate process is likewise unreported, so we cannot tell how much difference the skip connections make.
We are motivated by [Chu et al. 2019], which displays interesting feature maps after the first choice blocks. It attests that all choice blocks learn similar knowledge, even at corresponding channels. This is an important incentive for us to investigate whether a fair training of a scalable supernet is possible. We argue that ensuring fairness for common scalability practices such as identity blocks is necessary. As pure identity blocks are direct shortcut paths and learn no information, we have to compensate for this defect by injecting a learning unit. Here we remedy the issue with a convolution without nonlinear activations, named the linearly equivalent transformation (LET); see Figure 2. Be aware that the identity layer in standalone models has no parameters, while its one-shot counterpart (the LET) does.
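As a minimal sketch (assuming numpy; the function and variable names below are ours, not from the paper), the LET is simply a 1×1 convolution with no bias and no nonlinearity; with identity-matrix weights it reduces exactly to the plain identity block, while remaining trainable:

```python
import numpy as np

def let_block(x, w):
    """Linearly equivalent transformation: a 1x1 convolution with no bias
    and no nonlinearity. x has shape (C_in, H, W); w has shape (C_out, C_in).
    Each output pixel is a linear mix of the input channels at that pixel."""
    return np.tensordot(w, x, axes=([1], [0]))  # -> (C_out, H, W)

# With identity weights, the LET behaves exactly like the identity choice.
x = np.random.rand(16, 8, 8)
w_id = np.eye(16)
assert np.allclose(let_block(x, w_id), x)
```

Unlike a bare skip connection, the weight matrix here can drift away from the identity during training, which is what lets the block keep learning.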
Given an input image of a chickadee from ImageNet (ImageNet ID: n01592084_7680), we illustrate both high-level and low-level feature maps of the supernet trained with our proposed improvements in Figure 5. A pure identity easily interferes with the training process, since it causes possible channel reduction and incongruence with other choice blocks. Note that the channel size of the feature map after Choice 6 in Figure 5(a) is half that of the others, because the previous channel size is 16 while the other choice blocks output 32 channels. This effect is largely attenuated with LET. As the network goes deeper, we still observe consistent high-level features. Specifically, when LET is not enforced, high-level features in deeper channels easily get blurred out, while the supernet with LET enabled continues to learn useful features in deeper channels. We discuss this formally in the next section.
Stabilize Training by Equivalent Transformation
A critical requirement for a transformation is equivalence, so that the transformed model behaves exactly as the original. To proceed, we first define equivalence.
Definition 1.
Given a valid space S, a function f is equivalent to g on S if and only if f(x) = g(x) for all x ∈ S, where x can be a tensor of any shape.
Definition 2.
The equivalence of two neural networks: two models A and B with weights W_A and W_B are equivalent, A ≡ B, if and only if the functions they compute, f_A(·; W_A) and f_B(·; W_B), are equivalent.
The 2D convolution and the fully-connected (FC) layer are two of the most widely used operations in deep neural networks. For the image classification task, a mini-batch of N images with C channels can be denoted by x ∈ R^{N×C×H×W}. Let A be a deep neural network with L layers and x_l be the feature maps of the l-th layer, computed by f_l. For simplicity, we omit the batch dimension N. Hence,

x_l = f_l(x_{l−1}), l = 1, …, L.  (1)
In general, the shape of x_l can be C_l×H_l×W_l or C_l, where the former is generated by 2D convolutions and the latter by fully-connected operations. We denote a 2D convolution with a k×k kernel as CONV_k. For convenience, we use square kernels and neglect dilation rates, considering only the stride-1 case. Other setups can be easily proven in the same manner. Here, we consider two scenarios:

x_{l−1} and x_l have shapes C_{l−1} and C_l respectively; f_l is a linear (fully-connected) operation without activation, and the first operation of f_{l+1} is also a linear (fully-connected) one,

x_{l−1} and x_l have shapes C_{l−1}×H×W and C_l×H×W respectively; f_l is a 2D convolution without activation or bias, and the first operation of f_{l+1} is a 2D convolution.
Note that the choice block in a layer can be either a simple operation or a complex block; we only require that it begins with a 2D convolution or an FC layer. This requirement is weak enough to cover most neural networks.
Lemma 1.
For the first scenario, if we replace f_l and f_{l+1} with an identity operation and another linear operation g_{l+1} to construct B from A, then we can ensure A ≡ B.
Proof.
First, we copy B's weights from A except for f_l and f_{l+1}. We can make A ≡ B if we can let g_{l+1} be equivalent to the two successive operations f_{l+1} ∘ f_l. For any input x_{l−1}, let W_l and W_{l+1} denote the weight matrices of f_l and f_{l+1}, and let W* be the weight matrix of g_{l+1}.
Second, we can calibrate W* = W_{l+1} W_l, so that g_{l+1}(x_{l−1}) = W_{l+1} W_l x_{l−1} = f_{l+1}(f_l(x_{l−1})).
Combining both steps, we can make A ≡ B. ∎
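Lemma 1 can be checked numerically. The sketch below (numpy assumed; variable names are illustrative) folds two activation-free FC layers into a single calibrated one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)           # x_{l-1}
W_l = rng.standard_normal((8, 8))    # weight matrix of f_l
W_l1 = rng.standard_normal((4, 8))   # weight matrix of f_{l+1}

# Model B: f_l replaced by identity, g_{l+1} calibrated as W* = W_{l+1} W_l.
W_star = W_l1 @ W_l

out_A = W_l1 @ (W_l @ x)             # f_{l+1}(f_l(x))
out_B = W_star @ x                   # g_{l+1}(identity(x))
assert np.allclose(out_A, out_B)
```

The calibration is exact because matrix multiplication is associative; no approximation is involved.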
Lemma 2.
For the second scenario, if we substitute f_l and f_{l+1} with an identity operation and another 2D convolution g_{l+1} to construct B from A, then we can ensure A ≡ B.
Proof.
First, we copy B's weights from A except for f_l and f_{l+1}. The only thing to prove is that we can replace f_{l+1} ∘ f_l with g_{l+1} equivalently. Second, we prove that this holds for any x_{l−1}. Let W_l and W_{l+1} be the kernel weights of f_l and f_{l+1}, and let W* be the kernel weight of g_{l+1}.
We can make A ≡ B by setting the kernel of g_{l+1} to the composition of the two convolutions: when f_l is a 1×1 convolution with channel-mixing matrix W_l, this amounts to W*[o, i, :, :] = Σ_c W_{l+1}[o, c, :, :] · W_l[c, i]. ∎
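Lemma 2 can likewise be verified for the 1×1 case used in our search space. A naive sketch (numpy assumed; `conv2d` is our own toy valid-mode convolution, not a library call):

```python
import numpy as np

def conv2d(x, w):
    """Naive 'valid' 2D cross-correlation. x: (C_in, H, W); w: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    h, wd = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(h):
            for j in range(wd):
                out[o, i, j] = np.sum(w[o] * x[:, i:i+k, j:j+k])
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 6, 6))
W1 = rng.standard_normal((5, 3, 1, 1))  # f_l: 1x1 conv (the LET)
W2 = rng.standard_normal((4, 5, 3, 3))  # f_{l+1}: 3x3 conv

# Fold the 1x1 conv into the 3x3 conv: W*[o, i] = sum_c W2[o, c] * W1[c, i].
W_star = np.einsum('ochw,ci->oihw', W2, W1[:, :, 0, 0])

out_A = conv2d(conv2d(x, W1), W2)  # two successive convs in model A
out_B = conv2d(x, W_star)          # single calibrated conv in model B
assert np.allclose(out_A, out_B)
```

Because the 1×1 convolution only mixes channels pointwise, it commutes with the spatial accumulation of the following kernel, which is why the fold is exact.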
Since we do not search the number of channels, we replace the identity choice with a 1×1 convolution without bias or activation in convolutional networks, and with an FC layer without activation when an identity is needed between FC layers. This procedure is illustrated in Figure 2: the leftmost architecture is the original, commonly used version; the middle one is its equivalent version; and the rightmost is the architecture for standalone training. Attention must be paid when training standalone models: the number of feature maps usually increases with depth, so in Figure 2, when the input and output channel numbers differ, we adjust the input channel number to make the convolution work.
Experiment Setup
Dataset
We perform the search directly on ImageNet 1k dataset [Deng et al.2009] and randomly select 50k images from the training set as a validation set. The original validation set is used as the test set to report accuracy.
Search Space
We consider two search spaces in this paper; the difference lies in whether squeeze-and-excitation (SE) [Hu, Shen, and Sun 2018] is also searched. The version without SE is used to make comparisons within a shorter time; the other is used for fair comparisons with MnasNet [Tan et al. 2019] and EfficientNet [Tan and Le 2019].
Scalable MBV2 Search Space. We build our search space from standard MobileNetV2 inverted-bottleneck blocks [Sandler et al. 2018], following [Cai, Zhu, and Han 2019]. We let the convolutional kernel sizes be within {3, 5, 7} and the expansion rates within {3, 6}; the number of filters per layer is retained. On top of this, we include an identity block with the linearly equivalent transformation for scalability. The size of this search space is 7^19.
Scalable MBV2 Search Space with SE. We align our search space with the latest works by including squeeze-and-excitation blocks. Specifically, we give each MBV2 block an SE option, so in total we have 13 options per layer. The overall size of the search space then becomes 13^19. The detailed choice blocks per layer are displayed in Table 1. Note that index 12 refers to the identity block with the equivalent transformation (a 1×1 Conv); the remaining choices are typical MBV2 blocks.
Table 1: Choice blocks per layer.

Index | Expansion | Kernel Size | SE
0 | 3 | 3 |
1 | 3 | 3 | ✓
2 | 3 | 5 |
3 | 3 | 5 | ✓
4 | 3 | 7 |
5 | 3 | 7 | ✓
6 | 6 | 3 |
7 | 6 | 3 | ✓
8 | 6 | 5 |
9 | 6 | 5 | ✓
10 | 6 | 7 |
11 | 6 | 7 | ✓
12 | – | 1 | –
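For illustration, the index-to-block mapping of Table 1 can be decoded as follows (a hypothetical helper; the tuple ordering simply mirrors the table, and the names are ours):

```python
# 12 MBV2 variants (expansion x kernel x SE) plus the LET identity at index 12.
CHOICES = [(e, k, se) for e in (3, 6) for k in (3, 5, 7) for se in (False, True)]
CHOICES.append((None, 1, False))  # index 12: identity with LET (1x1 conv)

def decode(index):
    """Map a Table 1 choice index to its block configuration."""
    expansion, kernel, se = CHOICES[index]
    return {"expansion": expansion, "kernel": kernel, "se": se}

assert decode(0) == {"expansion": 3, "kernel": 3, "se": False}
assert decode(11) == {"expansion": 6, "kernel": 7, "se": True}
assert decode(12)["kernel"] == 1
```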
Pipeline
We adopt a multi-objective approach as our pipeline [Chu, Zhang, and Xu 2019], considering three objectives: classification error, multiply-adds, and the number of parameters. We choose multiply-adds because we do not search models for specific hardware. We also impose a FLOPs constraint to act as a mobile requirement. NSGA-II [Deb et al. 2002] is a powerful approach for this kind of problem, so we build our work on top of it.
As [Zhang et al. 2018] states, mobile models are prone to underfitting rather than overfitting. Therefore, we try to maximize the number of parameters and use the weighted NSGA-II approach to weight the different objectives, as in [Chu, Zhang, and Xu 2019]. In short, our problem can be defined as follows:

minimize_m  Error(m), MultAdds(m), −Params(m).  (2)

Each of the three objectives is assigned its own weight. We use the choice index to encode the chromosome, so a model can be represented by a sequence of per-layer choice indices. We otherwise follow the standard NSGA-II procedure and only point out the differences where necessary.
Initialization
We initialize the population with diverse choice blocks to encourage exploration.
Crossover
For simplicity, we use single-point crossover.
Mutation
We use a mixed approach: a PPO-based controller to encourage exploitation [Schulman et al. 2017] and roulette-wheel selection to encourage exploration. The hyperparameters are listed in Table 2.
Weighted Non-dominated Sorting
We express different preferences over the objectives by defining a weighted crowding distance.
We run 120 epochs with a population size of 70 to obtain 8400 models. This search stage takes about 1.5 GPU days on a Tesla V100. We then sample 4 models from the final Pareto front at approximately equal crowding-distance intervals and train them completely.
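The evolutionary operators above can be sketched as follows (Python; names and constants are illustrative, with 19 layers and 13 choices taken from the SE search space, and a uniform resampling standing in for the mixed PPO/roulette mutation):

```python
import random

N_LAYERS, N_CHOICES = 19, 13  # layers in the supernet, options per layer

def crossover(a, b, rng):
    """Single-point crossover of two chromosomes (lists of choice indices)."""
    p = rng.randrange(1, N_LAYERS)
    return a[:p] + b[p:]

def mutate(chrom, rng, ratio=0.8):
    """Mutation sketch: with probability `ratio`, resample one layer's choice.
    (The paper's PPO-guided / roulette-wheel mix is simplified to uniform here.)"""
    child = list(chrom)
    if rng.random() < ratio:
        i = rng.randrange(N_LAYERS)
        child[i] = rng.randrange(N_CHOICES)
    return child

rng = random.Random(0)
a = [rng.randrange(N_CHOICES) for _ in range(N_LAYERS)]
b = [rng.randrange(N_CHOICES) for _ in range(N_LAYERS)]
child = mutate(crossover(a, b, rng), rng)
assert len(child) == N_LAYERS and all(0 <= c < N_CHOICES for c in child)
```

Any chromosome produced this way remains a valid path through the supernet, so it can be scored directly with inherited weights.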
Training Strategy
For supernet training, we use the same settings as FairNAS [Chu et al. 2019], except that we train for 60 epochs, which takes about 10 GPU days.

For the full training stage, we again use the FairNAS [Chu et al. 2019] configuration, with standard Inception preprocessing [Szegedy et al. 2017]. Unlike EfficientNet, we do not apply the AutoAugment policy [Cubuk et al. 2018], because many state-of-the-art algorithms report their results without it. Moreover, we use the RMSProp optimizer with 0.9 momentum, a batch size of 4096, and an initial learning rate of 0.256 that decays by 0.01 every 2.4 epochs. We use a dropout rate of 0.2 [Srivastava et al. 2014] before the last FC layer, together with weight decay.

Table 2: Hyperparameters of the search stage.

Item | Value
Population N | 70
Mutation Ratio | 0.8
Experiment Results
Table 3: Comparison with state-of-the-art models on ImageNet.

Methods | Mult-Adds (M) | Params (M) | Top-1 (%) | Top-5 (%)
MobileNetV2 1.0 [Sandler et al. 2018] | 300 | 3.4 | 72.0 | 91.0
MobileNetV3 Large 1.0 [Howard et al. 2019] | 219 | 5.4 | 75.2 | 92.2
MnasNet-A1 [Tan et al. 2019] | 312 | 3.9 | 75.2 | 92.5
MnasNet-A2 [Tan et al. 2019] | 340 | 4.8 | 75.6 | 92.7
FBNet-B [Wu et al. 2019] | 295 | 4.5 | 74.1 | –
Proxyless-R Mobile [Cai, Zhu, and Han 2019] | 320 | 4.0 | 74.6 | 92.2
Proxyless GPU [Cai, Zhu, and Han 2019] | 465 | 7.1 | 75.1 | –
Single-Path NAS [Stamoulis et al. 2019] | 365 | 4.3 | 75.0 | 92.2
FairNAS-A [Chu et al. 2019] | 388 | 4.6 | 75.3 | 92.4
MoGA-A [Chu, Zhang, and Xu 2019] | 304 | 5.1 | 75.9 | 92.8
EfficientNet-B0 [Tan and Le 2019] | 390 | 5.3 | 76.3 | 93.2
SCARLET-A (Ours) | 365 | 6.7 | 76.9 | 93.4
SCARLET-B (Ours) | 329 | 6.5 | 76.3 | 93.0
SCARLET-C (Ours) | 280 | 6.0 | 75.6 | 92.6
Comparison with State-of-the-Art Methods
Table 3 gives a clear comparison of state-of-the-art models on the ImageNet dataset; we pick models within the range of 200M to 400M FLOPs. It is clear that our SCARLET series marks a new state of the art: SCARLET-A surpasses EfficientNet-B0 by +0.5% top-1 accuracy with 25M fewer FLOPs, and SCARLET-B achieves the same top-1 accuracy with 61M fewer FLOPs. Although both A and B have more parameters, this should be encouraged, as parameter count relates to representational power [Chu, Zhang, and Xu 2019] and does not necessarily increase inference latency. SCARLET-C also achieves a competitive result, with +0.3% higher top-1 accuracy than FairNAS-A at 108M fewer FLOPs.
SCARLET-A makes full use of large kernels (5×5 and 7×7) to enlarge the receptive field. Besides, it activates many squeeze-and-excitation blocks (12 out of 19) to improve its classification performance. At the early stages, it favors either large kernels with small expansion ratios or small kernels with large expansion ratios, balancing the trade-off between accuracy and FLOPs.
SCARLET-B chooses two identity operations. Compared with A, it shortens the network depth at the last stages. It also utilizes squeeze-and-excitation blocks extensively (14 out of 17) and places a large-expansion block with large kernels at the tail stage.
SCARLET-C uses three identity operations and employs small expansion ratios extensively to cut down FLOPs, reserving large expansion ratios for the low-resolution tail stage. It prefers large kernels before the downsampling layers and also makes extensive use of squeeze-and-excitation to boost accuracy.
Ablation Study
Equivalent Transformation vs. Identity
To check the validity of our method, we use the plain identity as the baseline choice block, as is common in prior works. We train the two supernets under the same settings for 60 epochs and report the running mean and standard deviation of the accuracy per epoch in Figure 1. Our method with the linearly equivalent transformation obtains notably higher top-1 accuracy on the training set than the baseline. Moreover, it has much lower variance, indicating that each model is trained fairly. To further verify this, we sample all the models in the last (60th) epoch and report their metrics in a histogram, shown at the bottom of Figure 1. The identity blocks trouble the training: quite a few models suffer seriously, their metrics falling far below the rest, so they are severely underestimated by the supernet. LET, in contrast, compensates for this and brings the models into a reasonable range.
Equivalent vs. Non-equivalent Transformation
Here we show that a non-equivalent transformation changes the representational power of a neural network: a simple modification such as adding a ReLU violates the equivalence. To demonstrate this, we randomly sample a model encoding and forcibly flip some choice blocks to identity (index 12 denotes the identity block): [1, 3, 1, 0, 12, 0, 0, 0, 12, 12, 12, 12, 12, 0, 0, 0, 12, 12, 9]. We train this model with ReLU (non-equivalent) and without ReLU (equivalent) in the identity layers, using the same seeds, training tricks, initialization strategy, and hyperparameters. Figure 3 indicates that this trivial modification non-negligibly affects representational power. In our scalable supernet we must guarantee an equivalent transformation, which is why ReLU cannot be applied.

Discussion and Future Work
Weight sharing is one of the most critical features of efficient neural architecture search. Most one-shot approaches concentrate on how to find useful networks among parallel choices. This schema hardly meets the requirement for flexibility, and can even cause inherent conflicts: since a neural network learns features layer by layer, it is highly sensitive to any scaling operation. Our equivalent transformation can be regarded as a buffer for such operations. Nevertheless, it works best under certain conditions. In our search spaces, a linear transformation is matched against a single inverted-bottleneck layer, where only two nonlinear activation functions are involved; when the matched function is more complicated, it becomes harder to compensate.
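The compensation limit can be seen in a two-line counterexample (numpy assumed; the matrices are chosen for illustration): once a ReLU sits between two linear maps, no single matrix can reproduce the composition for all inputs.

```python
import numpy as np

# Minimal counterexample: a ReLU between two linear maps breaks linearity,
# so the pair can no longer be folded into one matrix as in Lemma 1.
x = np.array([1.0, -1.0])
W1 = np.eye(2)                               # f_l
W2 = np.ones((1, 2))                         # first op of f_{l+1}

equivalent = W2 @ (W1 @ x)                   # = [0.0]; foldable into W2 @ W1
nonequivalent = W2 @ np.maximum(W1 @ x, 0)   # = [1.0]; ReLU changed the result

assert np.isclose(equivalent[0], 0.0)
assert np.isclose(nonequivalent[0], 1.0)
```

With only two nonlinearities per matched inverted-bottleneck layer, the mismatch stays small enough for the linear transformation to act as a buffer; with more nonlinearities, the gap above compounds.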
How to perform flexible search efficiently remains open. Google's reinforced approach built on huge computing resources [Zoph et al. 2018; Tan et al. 2019] is neither affordable nor environmentally friendly. Making the search process both flexible and efficient is one of our future works.
Conclusion
In this paper, we unveil the overlooked scalability issue in one-shot neural architecture search. We show that simply adding identity blocks introduces training instability. By compensating the learning process with a linearly equivalent transformation, we close the gap between scalability and stability; we prove and demonstrate that the transformation is identical in terms of representational power. The renewed supernet can then be trained to the desired convergence and delivers competitive neural architectures. Namely, with fewer FLOPs than EfficientNet-B0, SCARLET-A achieves 76.9% top-1 accuracy on ImageNet. SCARLET-B illustrates that shallow models can perform well, hitting the same 76.3% as EfficientNet-B0 with much-reduced FLOPs. SCARLET-C strikes 75.6%, also exceeding peers of similar size.
References
[Bender et al. 2018] Bender, G.; Kindermans, P.-J.; Zoph, B.; Vasudevan, V.; and Le, Q. 2018. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, 549–558.
[Brock et al. 2018] Brock, A.; Lim, T.; Ritchie, J. M.; and Weston, N. 2018. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations.
[Cai, Zhu, and Han 2019] Cai, H.; Zhu, L.; and Han, S. 2019. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations.
[Chen, Goodfellow, and Shlens 2015] Chen, T.; Goodfellow, I.; and Shlens, J. 2015. Net2Net: Accelerating learning via knowledge transfer. In International Conference on Learning Representations.
[Chu et al. 2019] Chu, X.; Zhang, B.; Xu, R.; and Li, J. 2019. FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845.
[Chu, Zhang, and Xu 2019] Chu, X.; Zhang, B.; and Xu, R. 2019. MoGA: Searching beyond MobileNetV3. arXiv preprint arXiv:1908.01314.
[Cubuk et al. 2018] Cubuk, E. D.; Zoph, B.; Mane, D.; Vasudevan, V.; and Le, Q. V. 2018. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
[Deb et al. 2002] Deb, K.; Pratap, A.; Agarwal, S.; and Meyarivan, T. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2):182–197.
[Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 248–255.
[Guo et al. 2019] Guo, Z.; Zhang, X.; Mu, H.; Heng, W.; Liu, Z.; Wei, Y.; and Sun, J. 2019. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420.
[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
[Howard et al. 2019] Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. 2019. Searching for MobileNetV3. arXiv preprint arXiv:1905.02244.
[Hu, Shen, and Sun 2018] Hu, J.; Shen, L.; and Sun, G. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141.
[Huang et al. 2018] Huang, Y.; Cheng, Y.; Chen, D.; Lee, H.; Ngiam, J.; Le, Q. V.; and Chen, Z. 2018. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965.
[Sandler et al. 2018] Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520.
[Schulman et al. 2017] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[Srivastava et al. 2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
[Stamoulis et al. 2019] Stamoulis, D.; Ding, R.; Wang, D.; Lymberopoulos, D.; Priyantha, B.; Liu, J.; and Marculescu, D. 2019. Single-Path NAS: Designing hardware-efficient ConvNets in less than 4 hours. arXiv preprint arXiv:1904.02877.
[Szegedy et al. 2017] Szegedy, C.; Ioffe, S.; Vanhoucke, V.; and Alemi, A. A. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.
[Tan and Le 2019] Tan, M., and Le, Q. V. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning.
[Tan et al. 2019] Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; and Le, Q. V. 2019. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[Wu et al. 2019] Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; and Keutzer, K. 2019. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[Zagoruyko and Komodakis 2016] Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. In Proceedings of the British Machine Vision Conference.
[Zhang et al. 2018] Zhang, X.; Zhou, X.; Lin, M.; and Sun, J. 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[Zoph et al. 2018] Zoph, B.; Vasudevan, V.; Shlens, J.; and Le, Q. V. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8697–8710.