MixPath
The expressiveness of the search space is a key concern in neural architecture search (NAS). Previous block-level approaches mainly focus on searching networks that chain one operation after another. Incorporating a multi-path search space into the one-shot doctrine remains untackled. In this paper, we investigate supernet behavior under the multi-path setting, which we call MixPath. During sampled training, simply switching multiple paths on and off incurs severe feature inconsistency, which deteriorates convergence. To remedy this effect, we employ what we term shadow batch normalizations (SBN) to follow various path patterns. Experiments performed on CIFAR-10 show that our approach is effective regardless of the number of allowable paths. Further experiments are conducted on ImageNet for a fair comparison with the latest NAS methods. Our code will be available at https://github.com/xiaomi-automl/MixPath.git.
Carrying a high promise of complete automation in network design across various domains [31, 24, 8, 6, 29, 14, 26, 23], the research on neural architecture search now enters into a stage of fierce competition [32, 33, 5, 25, 22].
Among various mainstream paradigms, one-shot approaches [2, 1, 15, 7, 22] make use of a weight-sharing mechanism that greatly reduces computational cost. Typically, in the first stage, a supernet is trained to convergence to serve as an evaluator of sub-model performance. The second, searching stage can be done with an evolutionary algorithm (EA), reinforcement learning (RL), or even random sampling. It is thus of utmost importance for the supernet evaluator to rank models accurately. FairNAS [7] discusses this thoroughly, arguing that fair training of each sampled block contributes to the final ranking. As each of its searchable cells is independently supervised by the corresponding block in the teacher network, randomness is largely reduced and each cell is well trained, both of which improve ranking ability.
Exploring a multi-path search space is made possible by the differentiable method Fair DARTS [9]. However, devising its one-shot counterpart is a non-trivial problem. One-Shot [1] can be thought of as multi-path training since it dynamically drops paths from the supernet, which reportedly comes with instability that even regularization tricks and recalibration of batch normalization statistics do not help much. In this paper, we dive into the real causes and undertake a unified approach, which we call MixPath, that incorporates most of the preceding one-shot works and adds multi-path capability.
Our contributions can be summarized into four aspects.
We propose a uniform approach for one-shot NAS that steps out of the existing single-path limitation and enables multi-path expressiveness (at most $m$ paths), which bridges the gap between one-shot and multi-branch searching. From this perspective, existing single-path weight-sharing approaches become the special case $m = 1$.
We disclose the obstacles that make vanilla multi-path approaches fail and propose a new, lightweight mechanism, shadow batch normalization, to stabilize the training of the over-parameterized supernet at negligible cost. Moreover, it boosts the most critical capacity of a supernet, model ranking consistency, and sets a record on the NAS-Bench-101 search space.
We prove that the number of shadow batch normalizations can be greatly reduced to grow linearly with the maximum number ($m$) of activable paths instead of exponentially, exploiting the underlying mechanics that secure weight-sharing.
We search proxylessly on ImageNet at a cost of 10 GPU days. The searched models obtain new state-of-the-art results on the ImageNet 1k classification task, comparable with MixNet models searched with 300 times more computing power. Moreover, our model MixPath-B makes use of multi-branch feature aggregation and obtains higher accuracy than EfficientNet-B0 with fewer FLOPs and parameters.
Weight-sharing Mechanism. In recent one-shot neural architecture search methods that apply the weight-sharing paradigm, intermediate features learned by different operations exhibit high similarities [7, 5]. Ensuring such similarity is important for training a supernet to better convergence with improved stability. For instance, placing skip connections in parallel with inverted bottleneck blocks [28] creates a large feature discrepancy that harms supernet training. This is rectified by appending an equivariant learnable stabilizer [5] to each skip connection, which is dedicated to boosting feature similarities and in turn supernet performance. We can draw a rule of thumb for designing advanced one-shot search algorithms: when a supernet fails to converge, we should first look into its feature similarities.
Mixed Depthwise Convolution. MixNet [33] proposes a MixConv operator that processes equally partitioned channels with different depthwise convolutions, which has proved effective for image classification. Still, MixNet follows MnasNet [31] for architecture search, which comes at an immense cost that is infeasible in practice. AtomNAS [25] incorporates MixConv with variable channel sizes in its search space. To amortize the search cost, it applies the differentiable method DARTS [24] while removing dead blocks on the fly. Despite the high performance of the resulting models, such fine-grained channel partitions lead to large incongruence, which requires specific treatment on mobile devices.
Multi-branch Feature Aggregation. To our knowledge, the first multi-branch neural architecture dates back to ResNet [16], which uses a skip connection branch for image classification. ResNeXt pushes the multi-branch design further [35], aggregating homogeneous convolutions by addition. Therefore, the combination of mixed depth-wise convolution and multi-branch design is reasonable.
Conditional Batch Normalization. Batch normalization [19] has greatly facilitated the training of neural networks by normalizing layer inputs to have fixed means and variances. When training supernets, a single batch normalization has difficulty capturing dynamic inputs from various heterogeneous operations. Slimmable neural networks [40] introduce a shared supernet that can run with switchable channels at four different scales (1, 0.75, 0.5, 0.25). Training such a network suffers from feature inconsistency across switches; therefore, they apply an independent batch normalization for each switch configuration to encode conditional statistics. The jointly trained supernet thus enjoys improved accuracy at all four scales. However, it requires an increasing number of batch normalizations for arbitrary channel widths, which is impractical because of the intensive computation. The follow-up work US-Nets [39] circumvents this issue with distributed computing. In addition, post-statistics for networks of different channel widths are computed on a subset of the target dataset to save more time.
Informally, existing weight-sharing approaches [24, 9, 15, 7, 5] can be classified into four categories along two dimensions: prior-learning type and multi-path support, as shown in Figure 2. Specifically, DARTS [24] and Fair DARTS [9] both learn priors towards a promising network, while the latter allows multiple paths between any two nodes. One-shot methods [15, 7, 5] do not learn priors but train a supernet to evaluate submodels instead. So far, they only consider a single-path search space. It is thus natural to devise their multi-path counterpart.
A mixture of operations has the potential to balance the trade-off between performance and model cost better than a single operation. Without loss of generality, the computational cost of an inverted bottleneck with an input of $H \times W \times C_{in}$ features, $C_{mid}$ middle channels and $C_{out}$ output channels can be formulated as $HWC_{mid}(C_{in} + k^2 + C_{out})$, where $k$ is the kernel size of the depth-wise convolution. Usually $k$ is set to 3 or 5. When $C_{in} + C_{out}$ dominates $k^2$, we can boost the representative power of the depth-wise transformation by mixing more kernels with a negligible cost increase. This design can be regarded as a straightforward combination of MixConv and ResNeXt.
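The cost argument can be made concrete with a small sketch. It assumes stride 1, counts only the two point-wise convolutions and the depth-wise convolutions, and assumes every mixed kernel runs over all middle channels before the outputs are summed. The function name and the example shapes are hypothetical and not taken from the paper.

```python
def inverted_bottleneck_madds(h, w, c_in, c_mid, c_out, kernel_sizes):
    """Multiply-adds of an inverted bottleneck with parallel depth-wise kernels.

    Assumptions (not from the paper): stride 1, no bias, BN and activations
    ignored, each depth-wise path processes all c_mid channels.
    """
    pointwise = h * w * c_mid * (c_in + c_out)          # 1x1 expand + 1x1 project
    depthwise = h * w * c_mid * sum(k * k for k in kernel_sizes)
    return pointwise + depthwise


# Hypothetical layer shape, chosen only for illustration.
single = inverted_bottleneck_madds(14, 14, 80, 480, 80, kernel_sizes=[3])
mixed = inverted_bottleneck_madds(14, 14, 80, 480, 80, kernel_sizes=[3, 5])
print(f"relative cost of adding a 5x5 path: {mixed / single:.3f}")
```

Under these assumptions the point-wise terms dominate, so adding a second kernel raises the block cost by a modest fraction rather than doubling it.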
Multi-path support is not as easy as it may seem. We expect to train a supernet that can accurately predict the performance of multi-path submodels. To do so, we can train the supernet by randomly activating one multi-path model at each step. This is based on the assumption that weights from multi-path training fit well in a multi-path submodel. Here we apply Bernoulli sampling to independently activate or deactivate each operation. One would hope for training as steady as in single-path methods [15, 7]. However, this does not hold in our pilot experiments with the MixPath supernet on the ImageNet dataset [12], where the one-shot models have low accuracies that swing back and forth, as shown by the blue line in Figure 3.
Figure 3. Top: accuracies of one-shot models sampled during supernet training on ImageNet. Twenty models are randomly sampled every two epochs. Bottom: histogram of the accuracies of 3k randomly sampled one-shot models. Enabling shadow batch normalization improves supernet training and one-shot performance.
One-Shot [1] is a similar case of multi-path training. Because it gradually drops out paths, so that some operations are activated and others are not, it can be seen as a way of multi-path sampling. As also shown in Figure 3, it suffers from more severe training difficulty than vanilla training of MixPath. For a long stretch of early epochs, One-Shot fails to learn any useful information to pass on to its one-shot submodels.
According to [7, 5], the similarity of intermediate features learned by different operations is crucial for the stability of supernet training. To solve the above problem, an intuitive solution is to use a variable number of BNs to track the changing features from the combinations of different operations. Take the case of at most $m = 2$ paths, where the outputs of two paths are $x$ and $y$: there are five kinds of feature combinations, thus five BNs are needed to track their statistics respectively. We call such a BN a Shadow Batch Normalization (SBN), as it shadows the different features. However, the number of SBNs in such a naive multi-path setting is exponential. Consider the set of alternative paths in a choice block, and let $m$ be the maximum number of selected paths; the number of possible combinations in a choice block then grows exponentially. In fact, if some specific conditions are met, the number of SBNs can be reduced to $m$. Complete explanations and related proofs follow, all taking $m = 2$ as an example.
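To illustrate the gap between the naive and the reduced number of SBNs, the following sketch counts path combinations. It assumes a choice block with `num_paths` candidate paths in which any non-empty subset of at most `max_active` paths may be active; this counting model and the function names are illustrative assumptions rather than the paper's exact formulation.

```python
from math import comb


def naive_sbn_count(num_paths: int, max_active: int) -> int:
    """One BN per distinct non-empty subset of at most max_active paths."""
    return sum(comb(num_paths, k) for k in range(1, max_active + 1))


def reduced_sbn_count(max_active: int) -> int:
    """One BN per possible number of active paths."""
    return max_active


for n in (4, 8, 16):
    # With max_active == n, the naive count is 2**n - 1.
    print(n, naive_sbn_count(n, n), reduced_sbn_count(n))
```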
Let $I$ be the input images, and let $x$ and $y$ be the outputs of two selective paths in a choice block. Firstly, two important definitions are given.
Definition 1 (Condition of Zero Order): Given two high-dimensional functions, $x = f(I)$ and $y = g(I)$, we say that $f$ and $g$ satisfy the condition of zero order if $f(I) \approx g(I)$ for any valid $I$.
Note that both $x$ and $y$ are high-dimensional CNN feature maps. Therefore, we reshape them into vectors and use their cosine similarity to measure the degree of approximation. For Definition 1, considering how weight-sharing works, the cosine similarity between channel-wise feature maps is very high in a steadily trained supernet [7, 5]. This leads to $f(I) \approx g(I)$; that is to say, $f$ and $g$ satisfy the condition of zero order, which also holds in our case.
With the above discussion, we can draw the following two lemmas.
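Lemma 1. If $x$ and $y$ satisfy the condition of zero order, their expectations and variances are approximately equal: $E[x] \approx E[y]$ and $\mathrm{Var}[x] \approx \mathrm{Var}[y]$.
Proof. Let $x = f(I)$ and $y = g(I)$, and let $p(I)$ be the distribution of the inputs. First we consider the expectations of $x$ and $y$. Because $x$ and $y$ are obtained by the two functions $f$ and $g$, the expectations can be written as:
$$E[x] = \int f(I)\,p(I)\,dI, \qquad E[y] = \int g(I)\,p(I)\,dI. \tag{1}$$
According to the condition of zero order, we have $f(I) \approx g(I)$, and $p(I)$ is the same for both $x$ and $y$. So we have $E[x] \approx E[y]$.
Now we prove $\mathrm{Var}[x] \approx \mathrm{Var}[y]$. Note that $\mathrm{Var}[x] = E[x^2] - E[x]^2$ and $\mathrm{Var}[y] = E[y^2] - E[y]^2$, thus we only need to prove $E[x^2] \approx E[y^2]$. Similar to the proof for the expectation, these can be written as:
$$E[x^2] = \int f^2(I)\,p(I)\,dI, \qquad E[y^2] = \int g^2(I)\,p(I)\,dI. \tag{2}$$
According to the condition of zero order, we can prove that $E[x^2] \approx E[y^2]$. ∎
In summary, we can draw the conclusion that $x$ and $y$ have approximately the same expectation and variance. Similarly, it can be proved that this conclusion holds when $m > 2$. Based on Lemma 1, we can further get the next lemma.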
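Lemma 2. If $m$ is the maximum number of selected paths and each pair of outputs of the parallel selective paths in a choice block meets Definition 1, there are $m$ kinds of expectations and variances among all possible combinations of chosen paths.
Proof. This is obviously true when $m = 1$. For the case of $m = 2$, we have $E[x] \approx E[y]$. When the two paths are both selected, the output becomes $x + y$, and its expectation can be written as:
$$E[x + y] = E[x] + E[y] \approx 2E[x]. \tag{3}$$
For the variance, we have $\mathrm{Var}[x] \approx \mathrm{Var}[y]$. Since $x \approx y$, we also have $\mathrm{Cov}(x, y) \approx \mathrm{Var}[x]$, and the variance of $x + y$ can be written as:
$$\mathrm{Var}[x + y] = \mathrm{Var}[x] + \mathrm{Var}[y] + 2\,\mathrm{Cov}(x, y) \approx 4\,\mathrm{Var}[x]. \tag{4}$$
Therefore, there are two kinds of expectations and variances: $E[x]$ and $\mathrm{Var}[x]$ for $\{x, y\}$, and $2E[x]$ and $4\,\mathrm{Var}[x]$ for $\{x + y\}$.
Similarly, in the case where $m > 2$, there will be $m$ kinds of expectations and variances. ∎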
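The two-path case of the lemma can be checked numerically. The sketch below uses synthetic one-dimensional "features": `x` is an arbitrary affine function of random inputs and `y` is a slightly perturbed copy, which stands in for two paths that satisfy the condition of zero order. All constants are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=200_000)

# Two synthetic path outputs with y approximately equal to x.
x = 1.5 + 0.8 * inputs
y = x + 0.01 * rng.normal(size=inputs.shape)

mean_ratio = (x + y).mean() / x.mean()   # expected to be close to 2
var_ratio = (x + y).var() / x.var()      # expected to be close to 4
print(f"mean ratio: {mean_ratio:.3f}, variance ratio: {var_ratio:.3f}")
```

The mean roughly doubles and the variance roughly quadruples, so a single BN cannot serve both the one-path and the two-path outputs without a mismatch in statistics.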
At this point, the number of SBNs has been reduced to $m$, where $SBN_i$ tracks the combinations that contain $i$ paths. Compared with Switchable Batch Normalization [40], the difference is as follows. Switchable BN is applied to catch the statistics of a limited number of sub-architectures with different channel configurations (4 or 8 in the original paper). Shadow BN is designed to catch the changing statistics from the flexible (exponentially many) combinations of different searchable paths while keeping the number of channels fixed.
We take a popular search space as an example to illustrate our unified approach for one-shot neural architecture search, MixPath, which mixes different numbers of paths, as shown in Figure 1. SBNs ensure stable supernet training in MixPath. In particular, the input tensor is divided into four groups, where at most $m$ paths with various kernel sizes are randomly chosen. Note that the actual number of paths in MixPath is randomly selected from $\{1, \dots, m\}$. For example, it is sampled with equal probability from $\{1, 2\}$ if $m = 2$. The outputs of the selected paths are then added and followed by an SBN. According to the above analysis, the outputs of the paths are approximately identically distributed, and the number of SBNs increases linearly with $m$. Therefore, each MixPath block contains $m$ SBNs, $SBN_1, \dots, SBN_m$. When $i$ paths are actually selected, $SBN_i$ is activated.
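The block described above can be sketched in NumPy as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: features are 2-D arrays of shape (batch, channels), the path outputs are precomputed, and the class and function names are hypothetical.

```python
import numpy as np


class ShadowBatchNorm:
    """One set of BN statistics and affine parameters per number of active paths."""

    def __init__(self, num_channels, max_paths, momentum=0.1, eps=1e-5):
        self.momentum, self.eps = momentum, eps
        self.running_mean = np.zeros((max_paths, num_channels))
        self.running_var = np.ones((max_paths, num_channels))
        self.gamma = np.ones((max_paths, num_channels))
        self.beta = np.zeros((max_paths, num_channels))

    def __call__(self, x, num_active, training=True):
        i = num_active - 1  # row i holds the statistics for i + 1 active paths
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.running_mean[i] += self.momentum * (mean - self.running_mean[i])
            self.running_var[i] += self.momentum * (var - self.running_var[i])
        else:
            mean, var = self.running_mean[i], self.running_var[i]
        return self.gamma[i] * (x - mean) / np.sqrt(var + self.eps) + self.beta[i]


def mixpath_forward(path_outputs, sbn, max_paths, rng):
    """Sample the number of paths uniformly, sum them, apply the matching SBN."""
    num_active = int(rng.integers(1, max_paths + 1))
    chosen = rng.choice(len(path_outputs), size=num_active, replace=False)
    mixed = sum(path_outputs[j] for j in chosen)
    return sbn(mixed, num_active), num_active


rng = np.random.default_rng(0)
paths = [rng.normal(loc=1.0, size=(64, 8)) for _ in range(4)]
sbn = ShadowBatchNorm(num_channels=8, max_paths=2)
out, k = mixpath_forward(paths, sbn, max_paths=2, rng=rng)
```

Only the statistics row matching the sampled number of paths is updated in a given step, which is what keeps the one-path and two-path statistics from contaminating each other.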
To summarize the overall pipeline of our approach, we illustrate the details of MixPath supernet training in Algorithm 1. Next, we proceed with the well-known evolutionary algorithm NSGA-II [11] for searching. In particular, our objectives are to maximize the classification accuracy while minimizing the FLOPs.
With the guidance of MixPath and shadow BN, we design a search space containing 12 inverted bottlenecks, each of which has 4 kernel size choices (3, 5, 7, 9) for the depth-wise layer and 2 choices (3, 6) for the expansion rate. Hence, a huge search space is obtained, whose size ranges from the $m = 1$ case to the $m = 4$ case.
For each case, we directly train the supernet on CIFAR-10 for the same 600 epochs until it converges. The batch size is set to 256, and we use the SGD optimizer with 0.9 momentum and weight decay. During training, we set the initial learning rate to 0.06 and apply a cosine scheduler. Training and random search take about 19 GPU hours on a single Tesla V100 card. After the supernet is trained, a random search algorithm samples 1000 models to obtain the model accuracy distribution.
The comparisons with recent state-of-the-art models on CIFAR-10 are listed in Table 1.
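As a rough sense of scale for this search space, the sketch below counts architectures under one possible reading of the description: each of the 12 blocks picks one expansion rate and any non-empty subset of at most $m$ of the 4 kernel sizes. This counting model is an assumption made for illustration, and the resulting numbers are not quoted from the paper.

```python
from math import comb


def search_space_size(max_paths, num_blocks=12, num_kernels=4, num_expansions=2):
    """Number of architectures when each block mixes at most max_paths kernels."""
    kernel_subsets = sum(comb(num_kernels, k) for k in range(1, max_paths + 1))
    return (num_expansions * kernel_subsets) ** num_blocks


for m in (1, 2, 3, 4):
    print(f"m = {m}: about {search_space_size(m):.2e} architectures")
```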
Models | Params (M) | FLOPs (M) | Test Error (%) | Type |
---|---|---|---|---|
NASNet-A [42] | 3.3 | 608 | 2.65 | RL |
DARTS [24] | 3.3 | 528 | 2.86 | GD |
SNAS [36] | 2.9 | 422 | 2.98 | GD |
GDAS [13] | 3.4 | 519 | 2.93 | GD |
P-DARTS [4] | 3.4 | 532 | 2.50 | GD |
PC-DARTS [37] | 3.6 | 558 | 2.57 | GD |
FairDARTS-a [9] | 2.8 | 371 | 2.54 | GD |
MixNet-M [33] | 5.1 | 360 | 2.10 | TF |
MixPath-a (ours) | 5.3 | 473 | 2.60 | OS |
MixPath-b (ours) | 3.5 | 299 | 2.17 | TF |
Layer-wise search based on the inverted bottleneck block is another commonly used space [31, 32, 34, 7, 33]. Therefore, we also search on ImageNet proxylessly based on the search space of MnasNet [31]. However, we fix the expansion rate as in [33] and focus on searching the depth-wise convolution layers of the inverted bottleneck blocks and their combinations (18 layers in total). Specifically, we search the kernel sizes (3, 5, 7, 9) of the depth-wise layer and their combinations, which builds the first search space. Moreover, we also construct a group-based kernel search as in MixNet to make fair comparisons. In particular, we evenly split the depth-wise layer along the channel dimension into 4 groups and search the kernel sizes and their combinations within each group, which forms the second search space.
We search under two settings and use the same hyper-parameters for each. We use a batch size of 1024 and the SGD optimizer with 0.9 momentum and weight decay. The initial learning rate is 0.1 and is scheduled to zero by a cosine decay strategy within 120 epochs, which involves about 150k back-propagation steps and takes about 10 GPU days on Tesla V100 machines.
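The quoted step count and schedule can be reproduced with a short sketch. It assumes the standard ILSVRC-2012 training set size of 1,281,167 images and a per-step cosine decay without warm-up; whether the authors decay per step or per epoch is not stated, so the schedule below is only one plausible reading.

```python
import math

IMAGENET_TRAIN_IMAGES = 1_281_167  # ILSVRC-2012 training set size
BATCH_SIZE = 1024
EPOCHS = 120

steps_per_epoch = math.ceil(IMAGENET_TRAIN_IMAGES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS  # on the order of 150k updates


def cosine_lr(step, total, base_lr=0.1):
    """Cosine decay from base_lr at step 0 to zero at step == total."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total))


print(total_steps, cosine_lr(0, total_steps), cosine_lr(total_steps // 2, total_steps))
```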
We use the same training tricks as MnasNet [31] to train searched models. Unlike EfficientNet [27], we don’t use auto-augment tricks [10]. The performances of our searched models and comparisons with state-of-the-art architectures are listed in Table 2.
MixPath-A uses 349M multiply-adds to obtain 76.9% top-1 accuracy on the ImageNet validation set. By contrast, MixNet-M uses about 10M more FLOPs and 300 times more GPU days (we approximate its cost by that of MnasNet, 3000+ GPU days) to reach this level of accuracy.
Compared with EfficientNet-B0, MixPath-B uses fewer FLOPs and parameters to obtain a higher top-1 validation accuracy (76.7%). It makes extensive use of larger kernels instead of small ones. Small kernels are mainly used in parallel with large ones to balance the trade-off between FLOPs and accuracy. We attribute the high accuracy to the feature aggregation of multiple branches. Moreover, this lightweight model benefits from the analysis in Section 3.1, where multi-branch design is shown to be promising for balancing accuracy and inference complexity.
Models | Multiply-adds (M) | Params (M) | Top-1 (%) | Top-5 (%) |
---|---|---|---|---|
MobileNetV2 [sandler2018mobilenetv2] | 300 | 3.4 | 72.0 | 91.0 |
DARTS (2nd order) [24] | 595 | 4.9 | 73.1 | - |
PDARTS (CIFAR-10) [4] | 557 | 4.9 | 75.6 | 92.6 |
PCDARTS [37] | 586 | 5.3 | 74.9 | 92.2 |
FairDARTS-C [9] | 380 | 4.2 | 75.1 | 92.4 |
FBNet-B [wu2018fbnet] | 295 | 4.5 | 74.1 | - |
MnasNet-A2 [31] | 340 | 4.8 | 75.6 | 92.7 |
MobileNetV3 [howard2019searching] | 219 | 5.4 | 75.2 | 92.2 |
Proxyless-R [cai2018proxylessnas] | 320 | 4.0 | 74.6 | 92.2 |
FairNAS-A [7] | 388 | 4.6 | 75.3 | 92.4 |
Single-Path [stamoulis2019single] | 365 | 4.3 | 75.0 | 92.2 |
SPOS [guo2019single] | 328 | 3.4 | 74.9 | 92.0 |
MixNet-M [33] | 360 | 5.0 | 76.6 (77) | 93.2 |
AtomNAS-B [25] | 326 | 4.4 | 75.5 | 92.6 |
EfficientNet B0 [27] | 390 | 5.3 | 76.3 | 93.2 |
SCARLET-A [chu2019scarletnas] | 365 | 6.7 | 76.9 | 93.4 |
MixPath-A (ours) | 349 | 5.0 | 76.9 | 93.4 |
MixPath-B (ours) | 378 | 5.1 | 76.7 | 93.3 |
We also evaluate the transferability of MixPath on the CIFAR-10 dataset, as shown in Table 1. We take the model trained from scratch on ImageNet and fine-tune it on CIFAR-10. The settings follow [18] and [21]. Compared with MixNet-M [33], MixPath achieves 97.83% accuracy with only 3.5M parameters and 299M FLOPs on CIFAR-10. This shows that models searched with our method also transfer well.
The theoretical analysis of the batch normalization parameters can be verified by experiments. Without loss of generality, we set $m = 2$ and collect statistics on the four parameters of shadow BN across all channels for the first choice block. While $SBN_1$ acts as the shadow of a single branch, $SBN_2$ does so for the case of two paths. The histograms of the four parameters are shown in Figure 5. Based on the theoretical analysis of Section 3.3, $\mu_2 \approx 2\mu_1$ and $\sigma_2^2 \approx 4\sigma_1^2$, which can be observed in Figure 5. It is interesting to see that the other two learnable parameters, $\gamma$ and $\beta$, are quite similar for $SBN_1$ and $SBN_2$. In such a case, vanilla batch normalization plus post-calibration of $\mu$ and $\sigma^2$ is promising [1].
We run four experiments with different $m$ ($m = 1, 2, 3, 4$) to investigate the effect of shadow batch normalization, with all other variables controlled. Each experiment randomly samples 1k models and reports the test accuracy distributions on CIFAR-10 in Figure 6. MixPath falls back to single-path when $m = 1$. Shadow batch normalization begins to demonstrate its power for $m > 1$, where its absence leads to a bad supernet with lower accuracy and a much larger gap. This means that the supernet severely underestimates the performance of a large proportion of architectures.
We compare the adopted NSGA-II search algorithm with random search by charting the Pareto front of the models found by both methods in Figure 7. NSGA-II has a clear advantage in that its final Pareto-front models have higher accuracies and fewer multiply-adds.
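For readers unfamiliar with the term, a Pareto front over the two objectives (higher accuracy, fewer multiply-adds) can be extracted as in the sketch below. It shows only the non-domination test, not NSGA-II itself, and the candidate models and their numbers are made up for illustration.

```python
def pareto_front(models):
    """Return the non-dominated (name, accuracy, madds) tuples, sorted by cost.

    A model is dominated if another model is at least as accurate and at
    least as cheap, and strictly better in one of the two objectives.
    """
    front = []
    for name, acc, cost in models:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for _, a, c in models
        )
        if not dominated:
            front.append((name, acc, cost))
    return sorted(front, key=lambda item: item[2])


# Hypothetical candidates: (name, top-1 accuracy, multiply-adds in millions).
candidates = [("a", 0.90, 300), ("b", 0.92, 350), ("c", 0.91, 400), ("d", 0.89, 320)]
front = pareto_front(candidates)
print([name for name, _, _ in front])
```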
As mentioned above, the most critical role of the one-shot supernet is to differentiate good architectures from bad ones. NAS-Bench-101 [41] is a good benchmark for evaluating model ranking and is also used to score our method in this paper. The shadow BN is placed after the input edge of Node 5.
Batch normalization calibration is a BN post-processing trick used in [1, 15, 25] to correct the biased mean and variance with extra data and computational resources. Following [41], we use Kendall Tau to evaluate model ranking performance for four comparison groups: shadow BN with and without post BN recalibration, and vanilla BN with and without post BN recalibration. We randomly sample 70 models, look up their top-1 accuracies in the NAS-Bench-101 table, and calculate the metric. In particular, we run two experiments, for $m = 4$ and $m = 3$, each across three seeds, and train the supernet for 100 epochs with batch size 96 and learning rate 0.025. The results are shown in Table 3.
Type | Kendall Tau ($m$ = 4) | Kendall Tau ($m$ = 3) |
---|---|---|
Shadow BN | 0.393 ± 0.017 | 0.318 ± 0.034 |
Shadow BN + CA | 0.597 ± 0.037 | 0.592 ± 0.024 |
w/o Shadow BN | 0.167 ± 0.038 | 0.045 ± 0.060 |
w/o Shadow BN + CA | 0.368 ± 0.134 | 0.430 ± 0.031 |
It is notable that BN post-calibration (CA) boosts Kendall Tau in every case, which indicates the validity of this post-processing step [1]. For $m = 4$, Shadow BN without calibration still ranks architectures better than vanilla BN with calibration, by 0.025 in Kendall Tau (0.393 vs. 0.368). Combining the two tricks further boosts the score to 0.597, which sets a new record for model ranking on NAS-Bench-101. From the theoretical analysis, shadow BN degrades to vanilla BN when $m = 1$; accordingly, the Kendall Tau gap between shadow BN and vanilla BN narrows when we decrease $m$ from 4 to 3.
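For reference, the ranking metric can be computed as in the following sketch, which implements the tie-free Kendall Tau (tau-a) over all pairs of models. The function name and the toy inputs are illustrative; the paper does not state which variant or library it uses.

```python
from itertools import combinations


def kendall_tau(predicted, ground_truth):
    """Kendall Tau-a: (concordant - discordant) / number of pairs."""
    if len(predicted) != len(ground_truth) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        sign = (predicted[i] - predicted[j]) * (ground_truth[i] - ground_truth[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    num_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / num_pairs


# Toy example: supernet-predicted accuracies vs. stand-alone accuracies.
print(kendall_tau([0.61, 0.65, 0.63, 0.70], [0.91, 0.92, 0.93, 0.94]))
```

A value of 1 means the supernet orders every pair of models the same way as stand-alone training, 0 means no agreement beyond chance, and -1 means a fully reversed ranking.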
The statement that highly similar features of competing choices play a critical role in supporting weight sharing under the one-shot setting seems promising [7]. We also calculate the similarities between the feature maps of one path and of two paths in MixPath, reported in Table 4. We use the features of the first searchable block of the supernet on 10 images to obtain the mean and variance of the similarity value. The mean cosine similarity is above 0.93, thus meeting the condition of zero order.
We further calculate the Jacobian matrix of these features with respect to the input images, also shown in Table 4. In particular, we forward a given image through the network to obtain two feature maps, one from a single path and the other from the summation of two paths. Then we use automatic differentiation to calculate the Jacobian matrices and obtain their cosine similarity. Since network weights are updated by back-propagation, such high similarity (above 0.9) ensures stable weight updates regardless of the randomly alternating number of paths.
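The similarity measure used here amounts to flattening two feature maps and taking the cosine of the angle between them. A minimal NumPy sketch follows; the random tensors merely stand in for real supernet features, and the function name is hypothetical.

```python
import numpy as np


def feature_cosine_similarity(x, y):
    """Cosine similarity between two feature maps after flattening them."""
    x = np.asarray(x, dtype=np.float64).reshape(-1)
    y = np.asarray(y, dtype=np.float64).reshape(-1)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero feature map")
    return float(x @ y / denom)


rng = np.random.default_rng(0)
one_path = rng.normal(size=(16, 8, 8))                    # stand-in feature map
two_paths = one_path + 0.1 * rng.normal(size=(16, 8, 8))  # a slightly perturbed copy
sim = feature_cosine_similarity(one_path, two_paths)
print(f"cosine similarity: {sim:.4f}")
```

Note that cosine similarity ignores scale, so it measures agreement in direction only; differences in magnitude between one-path and two-path features are what the separate BN statistics absorb.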
Type | mean | variance |
---|---|---|
Feature Map | 0.9342 | 6e-5 |
Jacobian Matrix | 0.9042 | 1e-5 |
In this paper, we propose a unified approach for one-shot neural architecture search that bridges the gap between one-shot and multi-path search. Existing single-path approaches can be regarded as a special case of ours. The proposed method uses shadow batch normalization to catch the changing features from various branch combinations, which successfully addresses two difficulties of vanilla multi-path training: the unstable training of the supernet and its weak model ranking. Moreover, under certain conditions we can reduce the number of shadow BNs to be linear in the maximum number of paths $m$. Extensive experiments on NAS-Bench-101 show that our method boosts the model ranking capacity of the one-shot supernet by clear margins.
Based on thorough theoretical reasoning of weight-sharing mechanism and batch normalization’s functionality, we are able to offer practical guidelines that might shed lights on future design of one-shot NAS algorithms.
We are grateful to Deli Zhao for his insightful discussion about feature aggregations.
International Conference on Machine Learning, pages 549–558, 2018.
International Conference on Computer Vision, 2019.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In International Conference on Machine Learning, 2019.