Noisy Differentiable Architecture Search

05/07/2020 ∙ by Xiangxiang Chu, et al. ∙ Xiaomi

Simplicity is the ultimate sophistication. Differentiable Architecture Search (DARTS) has become one of the mainstream paradigms of neural architecture search. However, it suffers from several disturbing factors in its optimization process, which make its results unstable and hard to reproduce. FairDARTS points out that skip connections natively enjoy an unfair advantage in the exclusive competition among operations, which is the primary cause of the dramatic performance collapse. While FairDARTS turns the unfair competition into a collaborative one, we instead impede the unfair advantage by injecting unbiased random noise into the output of skip connections. In effect, the optimizer perceives this difficulty at each training step and refrains from overshooting on skip connections, yet in the long run it still converges to the right solution area, since no bias is added to the gradient. We name this novel approach NoisyDARTS. Our experiments on CIFAR-10 and ImageNet attest that it effectively breaks the skip connection's unfair advantage and yields better performance, producing a series of models that achieve state-of-the-art results on both datasets.


1 Introduction

Performance collapse caused by an excessive number of skip connections in the inferred model is a fatal drawback of differentiable architecture search approaches (Liu et al. (2019); Chen et al. (2019a); Zela et al. (2020); Chu et al. (2019a)). A considerable amount of previous research focuses on addressing this issue. FairDARTS (Chu et al. (2019a)) attributes the cause of this collapse to the unfair advantage that skip connections hold in an exclusively competitive environment. Under this perspective, it summarizes several currently effective approaches as different ways of avoiding the unfair advantage. Inspired by these empirical observations, we adopt a different and straightforward approach: injecting unbiased noise into the output of skip connections. The underlying philosophy is simple: the injected noise perturbs the gradient flow through skip connections, so that their unfair advantage cannot comfortably take effect.

Our contributions can be summarized into the following aspects.

  • We propose a simple but effective approach to address the performance collapse issue in differentiable architecture search: injecting noise into the gradient flow of skip connections.

  • We dive into the requirements of the desired noise, which should be of small variance as well as unbiased for the gradient flow in terms of expectation. Particularly, we found that Gaussian noise with zero mean and small variance is a handy solution.

  • Our research corroborates the root-cause analysis of performance collapse in FairDARTS (Chu et al. (2019a)), since suppressing the unfairness of skip connections indeed yields substantially better results.

  • We perform extensive experiments on two widely used search spaces and two standard datasets to verify the effectiveness and robustness of the proposed approach. With an efficient search, we achieve state-of-the-art results on both CIFAR-10 (97.61%) and ImageNet (77.9%). Good transferability is also verified on object detection tasks.

2 Related Work

Performance collapse in DARTS. The notorious performance collapse of DARTS has been confirmed by many works (Chen et al. (2019a); Zela et al. (2020); Chu et al. (2019a)). To remedy this failure, Chen et al. (2019a) carefully set a hard constraint to limit the number of skip connections. This is a strong prior, since architectures within this regularized search space generally perform well, as indicated by Chu et al. (2019a). Meanwhile, Zela et al. (2020) devised several search spaces to show that DARTS leads to degenerate models in which skip connections dominate. To robustify the search process, they proposed monitoring the sharpness of the validation loss curvature, which correlates with the derived model's performance. This, however, adds considerable extra computation, as it requires the eigenspectrum of the Hessian matrix. Chu et al. (2019a) instead relax the search space to avoid exclusive competition: each operation is given an independent architectural weight through a sigmoid, mimicking a multi-hot encoding rather than the original softmax over a one-hot encoding. This modification enlarges the search space, as it allows multiple choices between every two nodes, whereas the intrinsic collapse in the original DARTS search space still calls for a better solution.

Noise tricks. Noise injection is a common and effective trick in many deep learning scenarios. Vincent et al. (2008) add noise to auto-encoders to extract robust features. Fortunato et al. (2018) propose NoisyNet, which utilizes random noise to perform exploration in reinforcement learning. Chen et al. (2015) make use of noise to handle network expansion in knowledge transfer tasks. Meanwhile, injecting noise into gradients has been shown to greatly facilitate the training of deep neural networks (Neelakantan et al. (2015); Zhang et al. (2019)). Most recently, a noisy student has been devised to progressively leverage unlabelled data (Xie et al. (2019a)).

3 Noisy DARTS

3.1 Motivation

Following the analysis of FairDARTS (Chu et al. (2019a)), a natural approach to resolving performance collapse is to suppress the unfair advantage directly. Injecting noise into the skip connection operation does exactly this. The noise serves to perturb the otherwise overwhelming gradient flow through skip connections so that optimization proceeds more fairly. In other words, the skip connection operation has to overcome this disturbance in order to win robustly. The remaining problem is to design a proper type of noise.

3.2 Requirements for the Injected Noise

We let $\tilde{x}$ be the noise injected into the skip operation, and $\alpha_{skip}$ be the corresponding architectural weight. The loss of the skip operation can then be written as,

$$\mathcal{L}_{skip} = \mathcal{L}_{val}\big(\sigma(\alpha_{skip})\,(x + \tilde{x})\big) \qquad (1)$$

where $\mathcal{L}_{val}$ is the validation loss function and $\sigma$ is the softmax operation over the architectural weights $\alpha$. Approximately, when the noise is much smaller than the output features, i.e. $\tilde{x} \ll x$, we also have

$$\mathcal{L}_{val}\big(\sigma(\alpha_{skip})\,(x + \tilde{x})\big) \approx \mathcal{L}_{val}\big(\sigma(\alpha_{skip})\,x\big) \qquad (2)$$

In the noisy scenario, the gradient of the architecture along the skip connection operation becomes,

$$\frac{\partial \mathcal{L}_{skip}}{\partial \alpha_{skip}} = \frac{\partial \mathcal{L}_{val}}{\partial f_{skip}} \cdot (x + \tilde{x}) \cdot \frac{\partial \sigma(\alpha_{skip})}{\partial \alpha_{skip}}, \qquad f_{skip} = \sigma(\alpha_{skip})\,(x + \tilde{x}) \qquad (3)$$

As random noise brings uncertainty to the gradient update, skip connections have to overcome this difficulty in order to win over other operations; their unfair advantage is thus much weakened. However, not all types of noise are equally effective in this regard. A basic requirement is that, while suppressing the unfair advantage, the noise should not bias the gradient in terms of expectation. Formally, the expectation of the gradient can be written as,

$$\mathbb{E}\left[\frac{\partial \mathcal{L}_{skip}}{\partial \alpha_{skip}}\right] = \mathbb{E}\left[\frac{\partial \mathcal{L}_{val}}{\partial f_{skip}} \cdot (x + \tilde{x}) \cdot \frac{\partial \sigma(\alpha_{skip})}{\partial \alpha_{skip}}\right] \approx \frac{\partial \mathcal{L}_{val}}{\partial f_{skip}} \cdot \frac{\partial \sigma(\alpha_{skip})}{\partial \alpha_{skip}} \cdot \big(x + \mathbb{E}[\tilde{x}]\big) \qquad (4)$$

Based on the premise stated in Equation 2, we take $\frac{\partial \mathcal{L}_{val}}{\partial f_{skip}} \cdot \frac{\partial \sigma(\alpha_{skip})}{\partial \alpha_{skip}}$ out of the expectation in Equation 4 to make an approximation. As there is still an extra $\mathbb{E}[\tilde{x}]$ in the gradient of the skip connection, to keep the gradient unbiased, $\mathbb{E}[\tilde{x}]$ must be 0. It is thus natural to introduce small and unbiased noise, i.e. noise with zero mean and small variance. For simplicity, we use Gaussian noise; other options should work as well.
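To make the unbiasedness argument concrete, here is a minimal PyTorch sketch (our illustration, not code from the paper) that Monte-Carlo-estimates the architecture gradient of a toy noisy skip connection: with zero-mean noise the estimate should stay close to the noiseless gradient, while biased noise drifts away from it.

```python
import torch

torch.manual_seed(0)

# Toy setting: one skip connection competing with one other operation.
alpha = torch.tensor([0.0, 0.0], requires_grad=True)  # [alpha_skip, alpha_other]
x = torch.randn(256)                                  # flattened feature
target = torch.randn(256)

def arch_grad(noise_mean, noise_std, n_samples=2000):
    """Average gradient of a simple validation loss w.r.t. alpha_skip."""
    grads = []
    for _ in range(n_samples):
        w = torch.softmax(alpha, dim=0)
        noise = noise_mean + noise_std * torch.randn_like(x)
        out = w[0] * (x + noise) + w[1] * torch.tanh(x)   # noisy mixed output
        loss = ((out - target) ** 2).mean()
        g, = torch.autograd.grad(loss, alpha)
        grads.append(g[0].item())
    return sum(grads) / len(grads)

print("noiseless       :", arch_grad(0.0, 0.0, n_samples=1))
print("zero-mean noise :", arch_grad(0.0, 0.2))   # should stay close to the noiseless value
print("biased noise    :", arch_grad(0.5, 0.2))   # systematically shifted
```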

3.3 Stepping out of the Performance Collapse by Noise

Based on the above analysis, we propose NoisyDARTS to step out of the performance collapse of DARTS. In practice, we inject Gaussian noise into skip connections to weaken the unfair advantage.

Formally, the edge $e_{i,j}$ from node $i$ to node $j$ in each cell operates on the $i$-th input feature $x_i$, and its output is denoted as $o_{i,j}(x_i)$. The intermediate node $x_j$ gathers all inputs from its incoming edges:

$$x_j = \sum_{i < j} o_{i,j}(x_i) \qquad (5)$$

Let $\mathcal{O}$ be the set of candidate operations on edge $e_{i,j}$. Specifically, let $o^{skip}_{i,j}$ be the skip connection operation. NoisyDARTS injects the additive noise $\tilde{x}$ into the skip operation to obtain the mixed output,

$$\bar{o}_{i,j}(x_i) = \sum_{o \in \mathcal{O},\, o \neq o^{skip}} \frac{\exp(\alpha_o)}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'})}\, o(x_i) + \frac{\exp(\alpha_{skip})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'})}\, \big(x_i + \tilde{x}\big) \qquad (6)$$

To ensure that the gradient of a skip connection is unbiased and the noise is small enough, we set $\mathbb{E}[\tilde{x}] = 0$ and the standard deviation of $\tilde{x}$ to be $\lambda$ times that of $x$, where $\lambda$ is a positive coefficient. That is to say, the standard deviation of the noise changes accordingly with each mini-batch of samples $x$. Setting a low $\lambda$, we meet $\tilde{x} \ll x$ as required by Equation 2.

Compared with DARTS, the only modification is injecting the noise into the skip connections. The architecture search problem remains the same as in the original DARTS, namely to alternately learn the architecture parameters $\alpha$ and the network weights $w$ that minimize the validation loss $\mathcal{L}_{val}$, as shown in Equation 7.

$$\min_{\alpha} \ \mathcal{L}_{val}\big(w^{*}(\alpha), \alpha\big) \qquad (7)$$
$$\text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w} \ \mathcal{L}_{train}(w, \alpha)$$

The modified optimization process codenamed NoisyDARTS is shown in Algorithm 1.

1:  Input: Architecture parameters $\alpha$, network weights $w$, noise control parameters $\mu$, $\sigma$.
2:  while not reaching the maximum number of search epochs do
3:     Inject random Gaussian noise $\tilde{x} \sim \mathcal{N}(\mu, \sigma^2)$ into the skip connections' output.
4:     Update network weights $w$ by descending $\nabla_{w}\,\mathcal{L}_{train}(w, \alpha)$.
5:     Update architecture parameters $\alpha$ by descending $\nabla_{\alpha}\,\mathcal{L}_{val}(w, \alpha)$.
6:  end while
7:  Derive the final architecture according to the learned $\alpha$.
Algorithm 1 NoisyDARTS: Noisy Differentiable Architecture Search
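A first-order search loop matching Algorithm 1 might look like the following sketch; `supernet.weights()`, `supernet.arch_parameters()`, and `supernet.derive_architecture()` are hypothetical helpers, and the noise injection itself happens inside the supernet's skip branches, as in the mixed-op sketch above.

```python
import torch

def search(supernet, train_loader, val_loader, epochs, criterion,
           lr_w=0.025, lr_alpha=1e-3):
    """First-order NoisyDARTS-style search loop (illustrative sketch only)."""
    opt_w = torch.optim.SGD(supernet.weights(), lr=lr_w, momentum=0.9)
    opt_alpha = torch.optim.Adam(supernet.arch_parameters(), lr=lr_alpha)

    for epoch in range(epochs):
        for (x_trn, y_trn), (x_val, y_val) in zip(train_loader, val_loader):
            # 1) update architecture parameters alpha on validation data
            opt_alpha.zero_grad()
            criterion(supernet(x_val), y_val).backward()
            opt_alpha.step()

            # 2) update network weights w on training data
            # (Gaussian noise is injected inside the skip branches during forward)
            opt_w.zero_grad()
            criterion(supernet(x_trn), y_trn).backward()
            opt_w.step()

    return supernet.derive_architecture()   # e.g. argmax over alpha per edge
```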

4 Experiments

4.1 Search Space

To verify the validity of our method, we adopt two widely used search spaces: the DARTS search space (Liu et al. (2019)) and the MobileNetV2-like search space as in Cai et al. (2019). The former consists of a stack of duplicate normal cells and reduction cells, each represented by a DAG of 7 nodes, where every edge among intermediate nodes has 7 possible operations (max pooling, average pooling, skip connection, separable convolutions 3×3 and 5×5, dilated convolutions 3×3 and 5×5). The latter comprises 19 layers, each with 7 choices: inverted bottleneck blocks denoted as Ex_Ky (expansion rate x, kernel size y) and a skip connection.
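For reference, the two search spaces can be written down as plain Python constants; the DARTS operation names follow the genotype identifiers used in the appendix, while the exact expansion rates and kernel sizes listed for the MobileNetV2-like space follow the usual ProxylessNAS convention and are our assumption.

```python
# DARTS cell-based search space: 7 candidate operations per edge.
DARTS_OPS = [
    "max_pool_3x3", "avg_pool_3x3", "skip_connect",
    "sep_conv_3x3", "sep_conv_5x5",
    "dil_conv_3x3", "dil_conv_5x5",
]

# MobileNetV2-like layer-wise search space: 19 layers, 7 choices per layer.
# Ex_Ky = inverted bottleneck with expansion rate x and kernel size y
# (rates/kernels below assumed per the ProxylessNAS convention).
MOBILENET_CHOICES = [
    "E3_K3", "E3_K5", "E3_K7",
    "E6_K3", "E6_K5", "E6_K7",
    "skip_connect",
]
NUM_LAYERS = 19
```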

4.2 Searching on CIFAR-10

In the search phase, we use similar hyperparameters and tricks as Liu et al. (2019). All experiments are done on a Tesla V100 GPU with PyTorch 1.0 (Paszke et al. (2019)). The search phase takes about 7 GPU hours, which is less than the previously reported cost (12 GPU hours) thanks to a better implementation. We only use the first-order approach for optimization since it is more cost-effective. The best models are selected under noise with zero mean and σ = 0.2 (cf. Table 5). An example of the evolution of architectural weights during this search phase is exhibited in Figure 1.

(a) Normal cell
(b) Reduction cell
Figure 1: Evolution of architectural weights during the NoisyDARTS searching phase on CIFAR-10. Skip connections in normal cells are largely suppressed.

For training a single model, we use the same strategy and data processing tricks as Chen et al. (2019a); Liu et al. (2019), which takes about 16 GPU hours. The results are shown in Table 1. The best NoisyDARTS model achieves a new state-of-the-art result of 97.61% with only 534M FLOPS and 3.25M parameters. The searched cells are shown in Figure 2 and Table 6 (supplementary). Interestingly, although this model chooses as many as 4 skip connections in its reduction cell, it still obtains a highly competitive result. We attribute this to the elimination of the unfair advantage (Chu et al. (2019a)): the unfair advantage is suppressed to a level where the activated skip connections indeed contribute to the performance of the selected model.

Models Params (M) MultAdds (M) Top-1 (%) Type
NASNet-A (Zoph et al. (2018)) 3.3 608 97.35 RL
ENAS (Pham et al. (2018)) 4.6 626 97.11 RL
MdeNAS (Zheng et al. (2019)) 3.6 599 97.45 MDL
DARTS(first order)(Liu et al. (2019)) 3.3 528 97.00 GD
SNAS (Xie et al. (2019b)) 2.8 422 97.15 GD
GDAS (Dong and Yang (2019)) 3.37 519 97.07 GD
SGAS (Cri.2 avg.) (Li et al. (2019)) 3.9 - 97.33 GD
P-DARTS (Chen et al. (2019a)) 3.4 532 97.5 GD
PC-DARTS (Xu et al. (2020)) 3.6 558 97.43 GD
RDARTS (Zela et al. (2020)) - - 97.05 GD
FairDARTS (Chu et al. (2019a)) 3.32±0.46 458±61 97.46 GD
NoisyDARTS-a (Ours) 3.25 534 97.61 GD
NoisyDARTS-b (Ours) 3.01 494 97.53 GD
NoisyDARTS-A-t (Ours) 4.3 447 98.28 TF
Table 1: Results on CIFAR-10. : MultAdds computed using the genotypes provided by the authors. : Averaged on training the best model for several times. GD: Gradient-based, TF: Transferred from ImageNet.
(a) Normal cell
(b) Reduction cell
Figure 2: NoisyDARTS-a cells searched on CIFAR-10.
(a) Normal cell
(b) Reduction cell
Figure 3: NoisyDARTS-b cells searched on CIFAR-10.

4.3 Searching Proxylessly on ImageNet

For ImageNet experiments, we use exactly the same search space as Cai et al. (2019); Chu et al. (2019b). In the search phase, Gaussian noise with zero mean and a variance of 0.2 is injected along the skip connection operations. We split the original training set into two subsets of equal size to act as our training and validation sets, and treat the original validation set as the test set. We use the SGD optimizer with a batch size of 768 for the network weights; their learning rate is initialized to 0.045 and decays to 0 within 30 epochs following a cosine schedule. Besides, we utilize the Adam optimizer (Kingma and Ba (2015)) with a constant learning rate of 0.001 for the architecture parameters. This stage takes about 12 GPU days on Tesla V100 machines. In the training phase for the inferred stand-alone models, we use similar training tricks as EfficientNet (Tan and Le (2019)). Similar to Liu et al. (2019, 2018), the objective of this experiment is to find the best model without imposing constraints on FLOPS or the number of parameters. The evolution of dominating operations during the search is illustrated in Figure 4. Compared with DARTS, the injected noise in NoisyDARTS successfully eliminates the unfair advantage.

Figure 4: Stacked plot of dominant operations during searching on ImageNet. The inferred model of DARTS (left) obtains 66.4% top-1 accuracy on ImageNet validation dataset. The inferred model of NoisyDARTS (right) obtains 76.1% top-1 accuracy.

The ImageNet classification results are shown in Table 2. Our model NoisyDARTS-A obtains a new state-of-the-art result: 76.1% top-1 accuracy on the ImageNet validation dataset with only 4.9M parameters. When equipped with more tricks as in EfficientNet, namely squeeze-and-excitation (Hu et al. (2018)), the Swish activation, and AutoAugment (Cubuk et al. (2019)), it obtains 77.9% top-1 accuracy with 5.5M parameters. The searched architecture is shown in Figure 5.

Models FLOPS (M) Params (M) Top-1 (%) Top-5 (%)
MobileNetV2(1.4) (Sandler et al. (2018)) 585 6.9 74.7 92.2
NASNet-A (Zoph et al. (2018)) 564 5.3 74.0 91.6
AmoebaNet-A(Real et al. (2018)) 555 5.1 74.5 92.0
MnasNet-92 (Tan et al. (2019)) 388 3.9 74.79 92.1
MdeNAS(Zheng et al. (2019)) - 6.1 74.5 92.1
DARTS (Liu et al. (2019)) 574 4.7 73.3 91.3
SNAS (Xie et al. (2019b)) 522 4.3 72.7 90.8
GDAS (Dong and Yang (2019)) 581 5.3 74.0 91.5
PNAS (Liu et al. (2018)) 588 5.1 74.2 91.9
FBNet-C (Wu et al. (2019)) 375 5.5 74.9 92.3
FairNAS-C (Chu et al. (2019c)) 321 4.4 74.7 92.1
P-DARTS (Chen et al. (2019a)) 577 5.1 74.9 92.3
FairDARTS-B (Chu et al. (2019a)) 541 4.8 75.1 92.5
NoisyDARTS-A (Ours) 446 4.9 76.1 93.0
MobileNetV3 (Howard et al. (2019)) 219 5.4 75.2 92.2
MoGA-A (Chu et al. (2020)) 304 5.1 75.9 92.8
MixNet-M (Tan and Le. (2019)) 360 5.0 77.0 93.3
EfficientNet B0 (Tan and Le (2019)) 390 5.3 76.3 93.2
NoisyDARTS-A (Ours) 449 5.5 77.9 94.0

Table 2: Classification results on ImageNet. : Based on its published code. : Searched on CIFAR-10. : Searched on CIFAR-100. : Searched on ImageNet. : w/ SE and Swish
Figure 5: NoisyDARTS-A searched on ImageNet. Different colors represent different stages.

4.4 Transferring to CIFAR-10

We transferred the NoisyDARTS-A model searched on ImageNet to CIFAR-10. The model is trained for 200 epochs with a batch size of 256 and a learning rate of 0.05. We set the weight decay to 0.0, the dropout rate to 0.1, and the drop-connect rate to 0.1. In addition, we use AutoAugment to avoid overfitting. The transferred model, NoisyDARTS-A-t, achieves 98.28% top-1 accuracy with only 447M FLOPS, as shown in Table 1.
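The transfer-training settings above can be collected into a small configuration dictionary for convenience; the key names below are purely illustrative and not tied to the authors' code.

```python
# Hypothetical config mirroring the CIFAR-10 transfer settings described above.
TRANSFER_CFG = {
    "epochs": 200,
    "batch_size": 256,
    "learning_rate": 0.05,
    "weight_decay": 0.0,
    "dropout_rate": 0.1,
    "drop_connect_rate": 0.1,
    "auto_augment": True,   # used to avoid overfitting
}
```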

4.5 Transferred Results on Object Detection

We further evaluate the transferability of our searched models on the COCO object detection task (Lin et al. (2014)). Specifically, we use them as drop-in replacements for the backbone in the RetinaNet framework (Lin et al. (2017)) and rely on the MMDetection toolbox, which provides good implementations of various detection algorithms (Chen et al. (2019b)). Following the same training settings as Lin et al. (2017), all models in Table 3 are trained and evaluated on the COCO dataset for 12 epochs. The learning rate is initialized to 0.01 and decayed by a factor of 0.1 at epochs 8 and 11. As shown in Table 3, our model transfers better than the other models under mobile settings.

Backbones FLOPS (M) Acc (%) AP (%) AP50 (%) AP75 (%) APS (%) APM (%) APL (%)
MobileNetV2 (Sandler et al. (2018)) 300 72.0 28.3 46.7 29.3 14.8 30.7 38.1
Single-Path NAS (Stamoulis et al. (2019)) 365 75.0 30.7 49.8 32.2 15.4 33.9 41.6
MnasNet-A2 (Tan et al. (2019)) 340 75.6 30.5 50.2 32.0 16.6 34.1 41.1
MobileNetV3 (Howard et al. (2019)) 219 75.2 29.9 49.3 30.8 14.9 33.3 41.1
MixNet-M (Tan and Le. (2019)) 360 77.0 31.3 51.7 32.4 17.0 35.0 41.9
NoisyDARTS-A (Ours) 449 77.9 33.1 53.4 34.8 18.5 36.6 44.4
Table 3: Object detection of various drop-in backbones. : w/ SE and Swish

4.6 Ablation Study

4.6.1 With vs Without Noise

We compare the models searched with and without noise on the two commonly used search spaces in Table 4. NoisyDARTS robustly escapes the performance collapse across different search spaces and datasets. Note that without noise, the differentiable approach performs severely worse and obtains only 66.4% top-1 accuracy on the ImageNet classification task. In contrast, our simple yet effective method finds a state-of-the-art model with 76.1% top-1 accuracy.

Type Dataset Top-1 (%)
w/ Noise CIFAR-10 97.6
w/o Noise CIFAR-10 97.0
w/ Noise ImageNet 76.1
w/o Noise ImageNet 66.4
Table 4: NoisyDARTS robustly escapes the performance collapse across different search spaces and datasets.

4.6.2 Zero-mean (unbiased) Noise vs. Biased Noise

According to Equation 4, good noise should interfere little with the overall gradient flow in terms of expectation while preventing skip connections from over-benefiting locally. Our ablation experiments confirm this hypothesis, as shown in Table 5. The searched cells are shown in Table 7 (supplementary). When noise with a non-zero mean is injected, it brings in a deterministic bias that overshoots the gradient and misguides the whole optimization process.

Noise Type μ σ Top-1 (%)
Gaussian 0.0 0.1 97.53
Gaussian 0.5 0.1 96.91
Gaussian 1.0 0.1 96.90
Gaussian 0.0 0.2 97.61
Gaussian 0.5 0.2 97.35
Gaussian 1.0 0.2 97.07
Table 5: Ablation experiments on Gaussian noise with different means and standard deviations.

5 Conclusion

In this paper, we proposed a novel differentiable architecture search approach, NoisyDARTS. By injecting unbiased Gaussian noise into the output of skip connections, we let the optimizer perceive the disturbed gradient flow, so that the unfair advantage is largely attenuated. Experiments show that NoisyDARTS works both effectively and robustly. The searched models achieve state-of-the-art results on CIFAR-10 and ImageNet. NoisyDARTS-a and NoisyDARTS-b also confirm that our method can tolerate many skip connections, as long as they substantially contribute to the performance of the derived model.


Appendix A Experiments

Model Architecture Genotype
NoisyDARTS-a Genotype(normal=[(’sep_conv_3x3’, 1), (’sep_conv_3x3’, 0), (’skip_connect’, 0), (’dil_conv_3x3’, 2), (’dil_conv_3x3’, 3), (’sep_conv_3x3’, 1), (’dil_conv_5x5’, 4), (’dil_conv_3x3’, 3)], normal_concat=range(2, 6), reduce=[(’max_pool_3x3’, 0), (’dil_conv_5x5’, 1), (’skip_connect’, 2), (’max_pool_3x3’, 0), (’skip_connect’, 2), (’skip_connect’, 3), (’skip_connect’, 2), (’dil_conv_5x5’, 4)], reduce_concat=range(2, 6))
NoisyDARTS-b Genotype(normal=[(’sep_conv_3x3’, 1), (’sep_conv_3x3’, 0), (’skip_connect’, 2), (’sep_conv_3x3’, 0), (’dil_conv_3x3’, 3), (’skip_connect’, 0), (’dil_conv_3x3’, 3), (’dil_conv_3x3’, 4)], normal_concat=range(2, 6), reduce=[(’max_pool_3x3’, 0), (’skip_connect’, 1), (’max_pool_3x3’, 0), (’skip_connect’, 2), (’skip_connect’, 2), (’max_pool_3x3’, 0), (’skip_connect’, 2), (’avg_pool_3x3’, 0)], reduce_concat=range(2, 6))
Table 6: NoisyDARTS architecture genotypes searched on CIFAR-10.
() Architecture Genotype
(0.5, 0.1) Genotype(normal=[(’sep_conv_3x3’, 1), (’skip_connect’, 0), (’skip_connect’, 0), (’dil_conv_3x3’, 2), (’dil_conv_5x5’, 3), (’skip_connect’, 0), (’dil_conv_5x5’, 3), (’dil_conv_3x3’, 4)], normal_concat=range(2, 6), reduce=[(’avg_pool_3x3’, 0), (’skip_connect’, 1), (’avg_pool_3x3’, 0), (’avg_pool_3x3’, 1), (’avg_pool_3x3’, 0), (’avg_pool_3x3’, 1), (’avg_pool_3x3’, 0), (’avg_pool_3x3’, 1)], reduce_concat=range(2, 6))
(1.0, 0.1) Genotype(normal=[(’sep_conv_3x3’, 0), (’sep_conv_5x5’, 1), (’dil_conv_3x3’, 2), (’dil_conv_3x3’, 0), (’dil_conv_5x5’, 3), (’dil_conv_5x5’, 2), (’dil_conv_3x3’, 4), (’dil_conv_3x3’, 3)], normal_concat=range(2, 6), reduce=[(’max_pool_3x3’, 0), (’avg_pool_3x3’, 1), (’skip_connect’, 2), (’avg_pool_3x3’, 0), (’skip_connect’, 3), (’skip_connect’, 2), (’dil_conv_5x5’, 4), (’skip_connect’, 3)], reduce_concat=range(2, 6))
(0.5, 0.2) Genotype(normal=[(’sep_conv_3x3’, 1), (’skip_connect’, 0), (’dil_conv_3x3’, 2), (’skip_connect’, 0), (’dil_conv_5x5’, 3), (’dil_conv_3x3’, 2), (’dil_conv_3x3’, 4), (’dil_conv_3x3’, 3)], normal_concat=range(2, 6), reduce=[(’avg_pool_3x3’, 1), (’avg_pool_3x3’, 0), (’avg_pool_3x3’, 1), (’avg_pool_3x3’, 0), (’avg_pool_3x3’, 1), (’avg_pool_3x3’, 0), (’dil_conv_5x5’, 4), (’avg_pool_3x3’, 0)], reduce_concat=range(2, 6))
(1.0, 0.2) Genotype(normal=[(’sep_conv_3x3’, 1), (’skip_connect’, 0), (’dil_conv_5x5’, 2), (’sep_conv_3x3’, 1), (’dil_conv_3x3’, 3), (’dil_conv_3x3’, 2), (’dil_conv_3x3’, 4), (’dil_conv_5x5’, 3)], normal_concat=range(2, 6), reduce=[(’max_pool_3x3’, 0), (’max_pool_3x3’, 1), (’max_pool_3x3’, 0), (’max_pool_3x3’, 1), (’dil_conv_5x5’, 3), (’max_pool_3x3’, 0), (’dil_conv_5x5’, 3), (’avg_pool_3x3’, 0)], reduce_concat=range(2, 6))
Table 7: NoisyDARTS architecture genotypes searched on CIFAR-10 under biased noise.