ScarletNAS: Bridging the Gap Between Scalability and Fairness in Neural Architecture Search

08/16/2019 · Xiangxiang Chu, et al. · Xiaomi

One-shot neural architecture search features fast training of a supernet in a single run. A pivotal issue for this weight-sharing approach is its lack of scalability. A simple adjustment with identity blocks renders a scalable supernet, but it causes unstable training, which makes the subsequent model ranking unreliable. In this paper, we introduce a linearly equivalent transformation to soothe the training turbulence, along with a proof that the transformed path is identical to the original one in terms of representational power. The overall method is named SCARLET (SCAlable supeRnet with Linearly Equivalent Transformation). We show through experiments that linearly equivalent transformations can indeed harmonize supernet training. With an EfficientNet-like search space and a multi-objective reinforced evolutionary backend, it generates a series of competitive models: Scarlet-A achieves 76.9% top-1 accuracy on ImageNet, outperforming EfficientNet-B0 by a large margin; the shallower Scarlet-B exemplifies the proposed scalability, attaining the same 76.3% accuracy with much fewer FLOPs; Scarlet-C scores a competitive 75.6%, exceeding its peers of similar sizes. The models and evaluation code are released online at https://github.com/xiaomi-automl/ScarletNAS .

Introduction

Neural architecture search has recently been dominated by one-shot methods [Brock et al.2018, Bender et al.2018, Stamoulis et al.2019, Guo et al.2019, Cai, Zhu, and Han2019]. Fundamentally, a supernet which incorporates the whole search space enjoys fast convergence through weight sharing. Evaluating the performance of models by picking single paths from the supernet then becomes handy. According to [Chu et al.2019], fair training of the supernet shows a remarkable improvement in model ranking. However, the scalability of a supernet is quite limited. By contrast, as pure reinforcement or evolutionary approaches train each model independently for evaluation, shallower models can also stand out if they exhibit good performance. This is very beneficial as it achieves automatic architectural compression. To enable a similar property for the family of one-shot methods, we install identity choice blocks for network downscaling, accompanied by a linearly equivalent transformation that acts as a relay for inter-block information.

Figure 1: Training process of scalable supernets with and without linearly equivalent transformation (LET). Top: the accuracy of the supernet with LET has a much smaller standard deviation. Bottom: histogram of the training accuracies of one-shot models sampled within the last epoch.

To summarize, our main contributions are threefold.

  • We uncover an important but neglected issue in previous one-shot architecture search methods: the lack of scalability.

  • We present proofs of why a plain identity fails to assure fair scalability. We propose a simple adjustment with linearly equivalent transformation that distinctly improves the strength of the supernet.

  • Our generated networks enjoy overwhelming performance compared to their counterparts, especially EfficientNet-B0 [Tan and Le2019]. SCARLET-A claims a new state of the art with 76.9% top-1 accuracy on ImageNet at the level of 400M multiply-adds. SCARLET-C also reaches a competitive 75.6% with much fewer multiply-adds. More importantly, SCARLET-B, the closest model to EfficientNet-B0 and a shallower version from our search space, distinguishes itself with 76.3%.

Related Works

One-Shot Neural Architecture Search

In one-shot approaches, a supernet is constructed to represent the whole search space, within which each path is a stand-alone model. The supernet is trained only once; child models can then inherit its weights, which makes evaluating their performance easier and faster than with other incomplete-training techniques. Notable works include [Bender et al.2018, Stamoulis et al.2019, Guo et al.2019, Cai, Zhu, and Han2019]. Recent advances are concerned with training stability and the mismatched accuracy range between one-shot models and stand-alone ones; FairNAS achieves better convergence and results by enforcing a strict fairness constraint [Chu et al.2019].

Scalability and Network Transformation

Given a neural network, scaling its size up or down for various application scenarios has been studied experimentally. Common model upscaling practices include increasing depth, width, and input image resolution [He et al.2016, Zagoruyko and Komodakis2016, Huang et al.2018]. Recent work by [Tan and Le2019] proposes an effective compound scaling method that balances all three via grid search. Nevertheless, these methods leave parameter sharing out of the discussion: each scaled network has to be trained from scratch. Network transformation is a solution for model scaling that reuses the weights of the original structure. For instance, Net2Net by [Chen, Goodfellow, and Shlens2015] invented two transformation schemes to pass on parameters to either wider or deeper student networks.

Review of the Supernet Training with Variable Depths

The training of scalable supernets has largely gone unheeded, or at least has not been carefully dealt with. Generally, mainstream one-shot approaches suffer from unstable training [Bender et al.2018], and this problem deteriorates when scalability is considered. Zero operations are added in [Cai, Zhu, and Han2019] so that blocks can be skipped, giving flexibility in width and depth; however, its training details are not thoroughly discussed. [Guo et al.2019] adopts the same search space design to draw a fair comparison; the intermediate process is likewise not reported, so we cannot tell how much difference skip connections make.


Figure 2: Identity with linearly equivalent transformation. CB: Choice Block. ID: Identity. Three architectures are compared: a one-shot path with a plain identity, a one-shot path with LET, and the stand-alone counterpart (see text). Note that $c_i$ denotes the output channel size of layer $i$; the Conv operations shown can also be FC layers.

We are motivated by [Chu et al.2019], which displays interesting feature maps after the first choice blocks. It attests that all choice blocks learn similar knowledge, even at corresponding channels. This is an important incentive for us to investigate whether a fair training of a scalable supernet is possible. We argue that ensuring fairness for common scalability practices like identity blocks is necessary. As pure identity blocks are direct short paths and do not learn any information, we have to accommodate this defect by injecting a learning unit. Here we remedy the issue with a 1×1 convolution without non-linear activation, which we name linearly equivalent transformation (LET), see Figure 2. Be aware that the corresponding layer in a stand-alone model and its one-shot counterpart differ in their number of input channels, and hence in parameter count (see Figure 2).
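To make the construction concrete, below is a minimal PyTorch sketch of such a choice block. The class name `LETIdentity` and its exact placement inside a supernet are our own; the paper only specifies a 1×1 convolution without bias or non-linear activation.

```python
import torch
import torch.nn as nn

class LETIdentity(nn.Module):
    """A minimal sketch of the scalable 'identity' choice described above: a 1x1
    convolution with no bias and no non-linearity, so it stays a purely linear map
    that can later be merged into the next block's first convolution."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

# Usage: drop-in replacement for a plain identity inside a list of choice blocks.
block = LETIdentity(16, 32)
print(block(torch.randn(2, 16, 56, 56)).shape)  # torch.Size([2, 32, 56, 56])
```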

Given a chickadee image from ImageNet (ID: n01592084_7680) as input, we illustrate both low-level and high-level feature maps of the supernet trained with our proposed improvement in Figure 5. Pure identity easily interferes with the training process as it can cause channel reduction and incongruence with other choice blocks. Note that the channel size of the feature map after Choice 6 in Figure 5 (a) is half that of the others, because the previous layer outputs 16 channels while the other choice blocks output 32. This effect is largely attenuated with LET. As the network goes deeper, we still observe consistent high-level features. Specifically, when LET is not enforced, high-level features in deeper channels easily get blurred out, while the supernet with LET enabled continues to learn useful features in deeper channels. We discuss this formally in the next section.

Stabilize Training by Equivalent Transformation

A critical requirement for the transformation is equivalence, so that the transformed model behaves exactly as the original. To proceed, we first define equivalence.

Definition 1.

Given a valid input space $\mathcal{X}$, a function $f$ is equivalent to $g$ on $\mathcal{X}$ if and only if $f(x) = g(x)$ for all $x \in \mathcal{X}$, where $x$, $f(x)$ and $g(x)$ are tensors of any shape.

Definition 2.

The equivalence of two neural networks: models $A$ and $B$ with weights $W_A$ and $W_B$ satisfy $A \equiv B$ if and only if $A(\cdot\,; W_A)$ and $B(\cdot\,; W_B)$ are equivalent.

The 2D convolution and the fully-connected layer are two of the most widely used operations in deep neural networks. For the image classification task, a mini-batch of $N$ images with $C$ channels can be denoted by $x_0 \in \mathbb{R}^{N \times C \times H \times W}$. Let $A$ be a deep neural network with $n$ layers and $x_i$ be the feature maps of the $i$-th layer. For simplicity, we omit the batch dimension $N$. Hence,

$x_i = f_i(x_{i-1}), \quad i = 1, 2, \ldots, n. \qquad (1)$

In general, the shape of $x_i$ can be $(c_i, h_i, w_i)$ or $(c_i,)$, where the former is generated by 2D convolutions and the latter by fully-connected operations. We denote a 2D convolution with kernel size $k \times k$ as $\mathrm{Conv}_{k \times k}$. For convenience, we use square kernels and neglect dilation rates; other setups can be easily proven in the same manner. Here, we consider two scenarios:

  • $x_{i-1}$ and $x_i$ have shape $(c_{i-1},)$ and $(c_i,)$ respectively. $f_i$ is a linear (fully-connected) operation without activation, and the first operation of $f_{i+1}$ is also a linear (fully-connected) operation,

  • $x_{i-1}$ and $x_i$ have shape $(c_{i-1}, h, w)$ and $(c_i, h, w)$ respectively. $f_i$ is a $1 \times 1$ 2D convolution without activation or bias. The first operation of $f_{i+1}$ is a 2D convolution $\mathrm{Conv}_{k \times k}$.

Note that the choice block in a layer can be either a simple operation or a complex block; we only require that its first operation is a 2D convolution or an FC layer. This requirement is weak enough to cover most neural networks.

Lemma 1.

For the first scenario, we replace $f_i$ and $f_{i+1}$ with an identity operation and another linear operation $f'_{i+1}$ to construct $B$ from $A$; then we can ensure $A \equiv B$.

Proof.

First, we copy $B$'s weights from $A$ except for $f_i$ and $f_{i+1}$. We can make $A \equiv B$ if we can let $f'_{i+1}$ be equivalent to the two successive operations $f_{i+1} \circ f_i$. For any $x_{i-1}$, let $W_i$ and $W_{i+1}$ denote the weight matrices of $f_i$ and $f_{i+1}$, and let $W'_{i+1}$ be the weight matrix of $f'_{i+1}$.

Second, we can calibrate

$W'_{i+1} = W_{i+1} W_i, \quad \text{so that} \quad f'_{i+1}(x_{i-1}) = W_{i+1} W_i \, x_{i-1} = f_{i+1}(f_i(x_{i-1})).$

We can make $A \equiv B$ by combining both steps. ∎
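As an illustration of Lemma 1 (our own, not from the paper), the following sketch checks numerically that two stacked fully-connected layers without an intermediate activation collapse into a single linear layer with $W'_{i+1} = W_{i+1} W_i$:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f_i = nn.Linear(64, 64, bias=False)   # plays the role of the linear LET layer
f_next = nn.Linear(64, 128)           # first (linear) operation of the next block

merged = nn.Linear(64, 128)
with torch.no_grad():
    merged.weight.copy_(f_next.weight @ f_i.weight)  # W' = W_{i+1} W_i
    merged.bias.copy_(f_next.bias)

x = torch.randn(4, 64)
print(torch.allclose(f_next(f_i(x)), merged(x), atol=1e-5))  # True
```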

Lemma 2.

For the second scenario, we substitute $f_i$ and $f_{i+1}$ with an identity operation and another 2D convolution $f'_{i+1}$ to construct $B$ from $A$; then we can ensure $A \equiv B$.

Proof.

First, we copy $B$'s weights from $A$ except for $f_i$ and $f_{i+1}$. The only thing to prove is that we can replace $f_i$ and $f_{i+1}$ with $f'_{i+1}$ equivalently. Second, we prove that for any $x_{i-1}$ the above claim holds. Let $W_i \in \mathbb{R}^{c_i \times c_{i-1} \times 1 \times 1}$ and $W_{i+1} \in \mathbb{R}^{c_{i+1} \times c_i \times k \times k}$ be the weight tensors of $f_i$ and $f_{i+1}$, and let $W'_{i+1} \in \mathbb{R}^{c_{i+1} \times c_{i-1} \times k \times k}$ be the weight tensor of $f'_{i+1}$.

We can make $A \equiv B$ by setting

$W'_{i+1}[o, c, u, v] = \sum_{m=1}^{c_i} W_{i+1}[o, m, u, v] \, W_i[m, c, 1, 1],$

since $f_i$ is a $1 \times 1$ convolution without bias or activation, so applying $W_i$ and then $W_{i+1}$ yields exactly the same result as applying $W'_{i+1}$ directly. ∎
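The convolutional case can be checked in the same way. The sketch below is our own illustration of the kernel-merging construction in Lemma 2, using an einsum form of the formula above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
c_prev, c_mid, c_out = 16, 16, 32

let = nn.Conv2d(c_prev, c_mid, kernel_size=1, bias=False)      # the LET layer f_i
conv_next = nn.Conv2d(c_mid, c_out, kernel_size=3, padding=1)  # first conv of f_{i+1}

merged = nn.Conv2d(c_prev, c_out, kernel_size=3, padding=1)    # stand-alone f'_{i+1}
with torch.no_grad():
    # W'[o, c, u, v] = sum_m W_{i+1}[o, m, u, v] * W_i[m, c]
    merged.weight.copy_(
        torch.einsum("omuv,mc->ocuv", conv_next.weight, let.weight[:, :, 0, 0])
    )
    merged.bias.copy_(conv_next.bias)

x = torch.randn(2, c_prev, 14, 14)
print(torch.allclose(conv_next(let(x)), merged(x), atol=1e-4))  # True
```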

Since we don’t search the number of channels, we replace the choice identity with convolution without bias or activation for convolution networks and FC layer without activation when the identity is needed between FC layers. This procedure is illustrated in Figure 2. The left-most architecture is the original and commonly used version. The middle one is its equivalent version and the right-most version is the architecture for stand-alone training. Attention should be paid for training stand-alone models. The number of feature maps usually increases with depth. For instance, in Figure 2, , we should adjust the input channel number to make convolution works.

Experiment Setup

Dataset

We perform the search directly on ImageNet 1k dataset [Deng et al.2009] and randomly select 50k images from the training set as a validation set. The original validation set is used as the test set to report accuracy.
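A minimal sketch of this split, assuming a standard torchvision ImageNet layout; the dataset root and the random seed are placeholders of our own:

```python
import random
from torch.utils.data import Subset
from torchvision.datasets import ImageNet

train_full = ImageNet(root="/data/imagenet", split="train")

indices = list(range(len(train_full)))
random.Random(0).shuffle(indices)
search_val = Subset(train_full, indices[:50_000])    # used to rank candidate models
search_train = Subset(train_full, indices[50_000:])  # used to train the supernet
```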

Search Space

We consider two search spaces in this paper. The difference lies in whether squeeze-and-excitation (SE) blocks [Hu, Shen, and Sun2018] are included in the search. The version without SE is used for comparisons within a shorter time; the other is used for fair comparisons with MnasNet [Tan et al.2019] and EfficientNet [Tan and Le2019].

Scalable MBV2 Search Space. We utilize standard MobileNetV2 inverted bottleneck blocks [Sandler et al.2018] to build our search space, following [Cai, Zhu, and Han2019]. We let the convolutional kernel sizes be within {3, 5, 7} and the expansion rates within {3, 6}; the number of filter channels per layer is kept fixed. On top of this, we include an identity block with linearly equivalent transformation for scalability. The size of this search space is $7^{19}$.

Scalable MBV2 Search Space with SE. We align our search space with the latest works by including squeeze-and-excitation blocks. Specifically, we give each MBV2 block an SE option, so in total we have 13 options per layer. The overall size of the search space now becomes $13^{19}$. The detailed choice blocks per layer are displayed in Table 1. Note that Index 12 refers to an identity block with the equivalent transformation ($1 \times 1$ Conv). The remaining choices are typical MBV2 blocks.

Index  Expansion  Kernel Size  SE
0      3          3            -
1      3          3            ✓
2      3          5            -
3      3          5            ✓
4      3          7            -
5      3          7            ✓
6      6          3            -
7      6          3            ✓
8      6          5            -
9      6          5            ✓
10     6          7            -
11     6          7            ✓
12     -          1            -
Table 1: Each layer in our search space has 13 choices.
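For clarity, the choice table can be generated programmatically as below. The layer count of 19 is our inference from the 19-gene encodings shown later in the paper, not something stated in this subsection:

```python
# Reconstructs Table 1: 12 MBV2 variants plus the identity/LET choice at index 12.
CHOICES = []
for expansion in (3, 6):
    for kernel in (3, 5, 7):
        for se in (False, True):
            CHOICES.append({"expansion": expansion, "kernel": kernel, "se": se})
CHOICES.append({"identity_let": True, "kernel": 1})  # index 12: identity with 1x1-conv LET

NUM_LAYERS = 19  # assumption, see note above
assert len(CHOICES) == 13
print(f"search space size: 13^{NUM_LAYERS} = {len(CHOICES) ** NUM_LAYERS:.3e}")
```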

Pipeline

We utilize a multi-objective approach as our search pipeline [Chu, Zhang, and Xu2019]. We consider three objectives: classification error, multiply-adds, and the number of parameters. We choose multiply-adds because we do not search models for specific hardware. We also impose a constraint on FLOPs to act as a mobile requirement. NSGA-II [Deb et al.2002] is a powerful approach to this kind of problem, so we build our work on top of it.

As [Zhang et al.2018] states, mobile models are prone to underfitting rather than overfitting. Therefore, we try to maximize the number of parameters and utilize the weighted NSGA-II approach of [Chu, Zhang, and Xu2019] to weight the different objectives. In short, our problem can be defined as follows,

$\min_{m \in \mathcal{S}} \big(\, \mathrm{err}(m),\ \mathrm{madds}(m),\ -\mathrm{params}(m) \,\big) \quad \text{s.t.} \quad \mathrm{madds}(m) \le \text{mobile budget} \qquad (2)$

Each of the above three objectives is assigned its own weight (cf. Table 2).

We encode the chromosome with choice indices, so a model can be represented as a sequence of per-layer indices. Apart from that, we follow the standard NSGA-II procedure and only point out the differences where necessary.
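A minimal sketch of this encoding; the helper `random_chromosome` and the uniform sampling used for illustration are our own assumptions:

```python
import random

NUM_LAYERS, NUM_CHOICES = 19, 13   # consistent with Table 1 and the encodings shown later

def random_chromosome(rng: random.Random) -> list[int]:
    """One candidate model = one choice index per searchable layer."""
    return [rng.randrange(NUM_CHOICES) for _ in range(NUM_LAYERS)]

rng = random.Random(0)
population = [random_chromosome(rng) for _ in range(70)]   # population size from Table 2
print(population[0])
```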

Initialization

We initialize the population so as to include a variety of choice blocks, encouraging exploration.

Crossover

For simplicity, we use single-point crossover, sketched below.
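A sketch of this operator under the encoding above; the function name and random source are our own:

```python
import random

def single_point_crossover(parent_a: list[int], parent_b: list[int],
                           rng: random.Random) -> tuple[list[int], list[int]]:
    """Cut both parents at one random position and swap the tails."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

rng = random.Random(0)
a, b = [0] * 19, [12] * 19
print(single_point_crossover(a, b, rng))
```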

Mutation

We use a mixed approach: a PPO-based controller to encourage exploitation [Schulman et al.2017] and roulette-wheel selection to encourage exploration, as sketched below. The hyper-parameters are listed in Table 2.
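The sketch below covers only the roulette-wheel exploration branch; the per-choice scores and the exact mixing with the PPO controller are not specified here, so they are placeholders:

```python
import random

def roulette_pick(scores: list[float], rng: random.Random) -> int:
    """Pick an index with probability proportional to its (non-negative) score."""
    total = sum(scores)
    r = rng.uniform(0.0, total)
    acc = 0.0
    for i, s in enumerate(scores):
        acc += s
        if acc >= r:
            return i
    return len(scores) - 1

def roulette_mutate(chromosome: list[int], choice_scores: list[float],
                    rng: random.Random) -> list[int]:
    """Replace one randomly chosen gene with a choice drawn by roulette wheel
    (exploration branch only; the PPO-based exploitation branch is omitted)."""
    out = list(chromosome)
    pos = rng.randrange(len(out))
    out[pos] = roulette_pick(choice_scores, rng)
    return out
```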

Weighted Non-dominated Sorting

We encode different preferences among the objectives by defining a weighted crowding distance.
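The following sketch shows one plausible reading of this weighting: the standard NSGA-II crowding distance, with each objective's normalized gap scaled by its preference weight. The exact formula and the example weights are assumptions, not taken from the paper:

```python
def weighted_crowding_distance(front: list[tuple[float, ...]],
                               weights: tuple[float, ...]) -> list[float]:
    """NSGA-II crowding distance with per-objective preference weights."""
    n = len(front)
    if n == 0:
        return []
    dist = [0.0] * n
    for j, w in enumerate(weights):
        order = sorted(range(n), key=lambda i: front[i][j])
        dist[order[0]] = dist[order[-1]] = float("inf")   # boundary points kept
        span = front[order[-1]][j] - front[order[0]][j]
        if span == 0.0:
            continue
        for r in range(1, n - 1):
            gap = front[order[r + 1]][j] - front[order[r - 1]][j]
            dist[order[r]] += w * gap / span              # weighted normalized gap
    return dist

# Example: (error, mult-adds, -params) triples for a small front, with made-up weights.
print(weighted_crowding_distance(
    [(0.24, 365, -6.7), (0.25, 329, -6.5), (0.26, 280, -6.0)], (0.5, 0.3, 0.2)))
```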

We run the evolution for 120 epochs with a population size of 70, obtaining 8,400 models in total. This search stage takes about 1.5 GPU days on a Tesla V100. We then sample 4 models from the final Pareto front at approximately equal crowding distances and train them fully.

Training Strategy

For the training of the supernet, we use the same settings as FairNAS [Chu et al.2019], except that we train for 60 epochs, which takes about 10 GPU days.

For the full training stage, we also use the same configuration as FairNAS [Chu et al.2019], with standard Inception pre-processing [Szegedy et al.2017]. Unlike EfficientNet, we do not apply the AutoAugment policy [Cubuk et al.2018] because many state-of-the-art algorithms report their results without it. Moreover, we use the RMSProp optimizer with 0.9 momentum, a batch size of 4096, and an initial learning rate of 0.256 that decays by 0.01 every 2.4 epochs. We use a dropout rate of 0.2 [Srivastava et al.2014] before the last FC layer and L2 weight decay.
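A hedged sketch of this recipe in PyTorch; the backbone is a tiny stand-in rather than a SCARLET model, the decay factor is only one reading of "decays by 0.01 every 2.4 epochs", and the (unstated) weight-decay value is left out:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                       # stand-in backbone + classifier head
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(p=0.2),                       # dropout 0.2 before the last FC layer
    nn.Linear(32, 1000),
)

BATCH_SIZE = 4096
STEPS_PER_EPOCH = 1_281_167 // BATCH_SIZE    # ImageNet-1k training images / batch size

optimizer = torch.optim.RMSprop(model.parameters(), lr=0.256, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=int(2.4 * STEPS_PER_EPOCH),    # decay every 2.4 epochs, stepped per batch
    gamma=0.99,                              # assumption: "decays 0.01" read as 1% decay
)
```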


Item            Value   Item              Value
Population N    70      Mutation Ratio    0.8
                0.2                       0.65
                0.15                      0.7
                0.3
Table 2: Hyperparameters for the weighted NSGA-II approach.

Experiment Results

Methods                                       Mult-Adds (M)  Params (M)  Top-1 (%)  Top-5 (%)
MobileNetV2 1.0 [Sandler et al.2018]          300            3.4         72.0       91.0
MobileNetV3 Large 1.0 [Howard et al.2019]     219            5.4         75.2       92.2
MnasNet-A1 [Tan et al.2019]                   312            3.9         75.2       92.5
MnasNet-A2 [Tan et al.2019]                   340            4.8         75.6       92.7
FBNet-B [Wu et al.2019]                       295            4.5         74.1       -
Proxyless-R Mobile [Cai, Zhu, and Han2019]    320            4.0         74.6       92.2
Proxyless GPU [Cai, Zhu, and Han2019]         465            7.1         75.1       -
Single-Path NAS [Stamoulis et al.2019]        365            4.3         75.0       92.2
FairNAS-A [Chu et al.2019]                    388            4.6         75.3       92.4
MoGA-A [Chu, Zhang, and Xu2019]               304            5.1         75.9       92.8
EfficientNet B0 [Tan and Le2019]              390            5.3         76.3       93.2
SCARLET-A (Ours)                              365            6.7         76.9       93.4
SCARLET-B (Ours)                              329            6.5         76.3       93.0
SCARLET-C (Ours)                              280            6.0         75.6       92.6
Table 3: Comparison of neural models on ImageNet. The input size is 224×224. Some entries are based on the corresponding published code.

Comparison with State-of-the-Art Methods

Table 3 gives a clear comparison of state-of-the-art models on the ImageNet dataset. We pick models within the range of 200M to 400M FLOPs. It is clear that our SCARLET series marks a new state of the art: SCARLET-A surpasses EfficientNet-B0 with a +0.6% increase in top-1 accuracy and 25M fewer FLOPs, and SCARLET-B achieves the same top-1 accuracy with 61M fewer FLOPs. Although both A and B have higher parameter counts, this should be encouraged, as parameter count is related to representational power [Chu, Zhang, and Xu2019] and does not necessarily increase inference latency. SCARLET-C also achieves a competitive result, a +0.3% increase in top-1 accuracy over FairNAS-A, while costing 108M fewer FLOPs.

SCARLET-A makes full use of large kernels (5×5 and 7×7) to enlarge its receptive field. Besides, it activates many squeeze-and-excitation blocks (12 out of 19) to improve classification performance. At the early stages, it favors either large kernels with small expansion ratios or small kernels with large expansion ratios to balance the trade-off between accuracy and FLOPs.

SCARLET-B chooses two identity operations. Compared with A, it shortens the network depth at the last stages. Besides, it utilizes squeeze-and-excitation blocks extensively (14 out of 17) and places a large-expansion block with large kernels at the tail stage.

SCARLET-C uses three identity operations and relies extensively on small expansion ratios to cut down FLOPs, reserving large expansion ratios for the tail stage where the spatial resolution is smallest. It prefers large kernels before the downsampling layers. Besides, it makes extensive use of squeeze-and-excitation to boost accuracy.

Figure 3: Training of a random model where identity blocks are enabled with a linear transformation vs. with a non-linear transformation.

Figure 4: The architectures of SCARLET-A,B,C. Notice the dashed lines refer to downsampling points. The stem and tail parts are omitted. Best viewed in color.

Ablation Study

Equivalent Transformation vs. Identity

To check the validity of our method, we use the plain identity as a basic choice to form the baseline group, as is commonly done in prior works. We train the two supernets under the same settings for 60 epochs. We report the per-epoch average training accuracy and its standard deviation in Figure 1. Our method with linearly equivalent transformation obtains considerably higher top-1 accuracy on the training set than the baseline. Moreover, it has much lower variance, which indicates that each model is trained fairly. To further verify this, we sample all the models in the last (60-th) epoch and report their metrics as a histogram, shown at the bottom of Figure 1. Identity causes trouble for the training: quite a few models suffer seriously, with their metrics falling far below the rest, and are therefore severely under-estimated by the supernet. In contrast, LET compensates for this and brings the models into a reasonable range.

Equivalent vs Non-equivalent Transformation

Here we show that a non-equivalent transformation changes the representational power of a neural network. A simple modification, adding a ReLU function, is enough to violate the equivalence. To demonstrate this, we randomly sample a model encoding and then forcibly flip some choice blocks to identity (index 12): [1, 3, 1, 0, 12, 0, 0, 0, 12, 12, 12, 12, 12, 0, 0, 0, 12, 12, 9]. We train this model with ReLU (non-equivalent) and without ReLU (equivalent) applied to the identity layers, using the same seeds, training tricks, initialization strategy, and hyper-parameters. Figure 3 indicates that this trivial modification non-negligibly affects the representational power. In our scalable supernet we have to guarantee an equivalent transformation, which is why ReLU cannot be applied.
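The effect can also be seen without any training; the toy check below (our own) shows that the linear 1×1 + convolution pair merges exactly, while inserting a ReLU between them breaks the equivalence:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 8, 14, 14)
w_let = torch.randn(8, 8, 1, 1)     # 1x1 LET kernel (no bias, no activation)
w_next = torch.randn(16, 8, 3, 3)   # first conv of the following block

w_merged = torch.einsum("omuv,mc->ocuv", w_next, w_let[:, :, 0, 0])

linear_path = F.conv2d(F.conv2d(x, w_let), w_next, padding=1)
relu_path = F.conv2d(F.relu(F.conv2d(x, w_let)), w_next, padding=1)
merged_path = F.conv2d(x, w_merged, padding=1)

print(torch.allclose(linear_path, merged_path, atol=1e-4))  # True: equivalent
print(torch.allclose(relu_path, merged_path, atol=1e-4))    # False: ReLU breaks it
```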

Discussion and Future Work

Weight sharing is one of the most critical features for efficient neural architecture search. Most one-shot approaches concentrate on how to find useful networks by choosing among parallel choices. This schema hardly meets the requirement for flexibility and even causes conflicts inherently. Since a neural network learns features layer by layer, it is highly sensitive to any scaling operation. Our equivalent transformation can be regarded as a buffer for such operations. Nevertheless, it works best under certain conditions. In our search spaces, a linear transformation is used to match a single inverted bottleneck layer, in which only two non-linear activation functions are involved. When the matched function is too complicated, it becomes more difficult to compensate.

How to perform flexible search efficiently remains an open question. Google's reinforced approach on top of huge computing resources [Zoph et al.2018, Tan et al.2019] is neither affordable nor environmentally friendly. One of our future goals is to make the search process both flexible and efficient.

Conclusion

In this paper, we unveil the overlooked scalability issue in one-shot neural architecture search approaches. We show that simply adding identity blocks introduces training instability. By compensating the learning process with a linearly equivalent transformation, we bridge the gap between scalability and stability. We prove and demonstrate that such a transformation is identical in terms of representational power. The renewed supernet can then be trained with the desired convergence and delivers competitive neural architectures. Namely, with fewer FLOPs than EfficientNet-B0, SCARLET-A achieves 76.9% top-1 accuracy on ImageNet. SCARLET-B illustrates that shallower models can also perform well, hitting the same 76.3% as EfficientNet-B0 with much reduced FLOPs. SCARLET-C strikes 75.6%, also exceeding its peers of similar sizes.

(a) Identity, first choice block
(b) Identity with LET, first choice block
(c) Identity, high-level choice block
(d) Identity with LET, high-level choice block
Figure 5: Learned low-level and high-level features with and without linearly equivalent transformation (LET).

References