Noisy Differentiable Architecture Search
Simplicity is the ultimate sophistication. Differentiable Architecture Search (DARTS) has become one of the mainstream paradigms of neural architecture search. However, it suffers from several disturbing factors in its optimization process, which make its results unstable and hard to reproduce. FairDARTS points out that skip connections natively enjoy an unfair advantage in the exclusive competition among operations, which is the primary cause of dramatic performance collapse. While FairDARTS turns the unfair competition into a collaborative one, we instead impede the unfair advantage by injecting unbiased random noise into the output of skip operations. In effect, the optimizer perceives this difficulty at each training step and refrains from overshooting on skip connections; in the long run it still converges to the right solution area, since no bias is added to the gradient. We name this novel approach NoisyDARTS. Our experiments on CIFAR-10 and ImageNet attest that it effectively breaks the skip connection's unfair advantage and yields better performance, generating a series of models that achieve state-of-the-art results on both datasets.
Performance collapse caused by an excessive number of skip connections in the inferred model is a fatal drawback of differentiable architecture search approaches (Liu et al. (2019); Chen et al. (2019a); Zela et al. (2020); Chu et al. (2019a)). A considerable amount of previous research has focused on addressing this issue. FairDARTS (Chu et al. (2019a)) attributes the collapse to the unfair advantage of skip connections in an exclusively competitive environment, and under this perspective summarizes several current effective approaches as different ways of avoiding that advantage. Inspired by their empirical observations, we adopt quite a different and straightforward approach: injecting unbiased noise into the skip connections’ output. The underlying philosophy is simple: the injected noise perturbs the gradient flow via skip connections, so that their unfair advantage cannot comfortably take effect.
Our contributions can be summarized into the following aspects.
We propose a simple but effective approach to address the performance collapse issue in the differentiable architecture search: injecting noise into the gradient flow of skip connections.
We dive into the requirements of the desired noise, which should be of small variance as well as unbiased for the gradient flow in terms of expectation. Particularly, we found that Gaussian noise with zero mean and small variance is a handy solution.
Our research corroborates the root-cause analysis of performance collapse in FairDARTS (Chu et al. (2019a)), since suppressing the unfairness from skip connections indeed yields substantially better models.
We performed extensive experiments on two widely used search spaces and two standard datasets to verify the effectiveness and robustness of the proposed approach. With efficient search, we achieve state-of-the-art results on both CIFAR-10 (97.61%) and ImageNet (77.9%). Good transferability is also verified on object detection tasks.
Performance collapse in DARTS. The notorious performance collapse of DARTS is unanimously confirmed by many (Chen et al. (2019a); Zela et al. (2020); Chu et al. (2019a)). To remedy this failure, Chen et al. (2019a) carefully set a hard constraint to limit the number of skip connections. This is a strong prior, since architectures within this regularized search space generally perform well, as indicated by Chu et al. (2019a). Meanwhile, Zela et al. (2020) devised several search spaces to prove that DARTS leads to degenerate models where skip connections are dominant. To robustify the searching process, they proposed to monitor the sharpness of the validation loss curvature, which correlates with the induced model’s performance. This, however, adds too much extra computation, as it needs to calculate the eigenspectrum of the Hessian matrix. Chu et al. (2019a) instead relax the search space to avoid exclusive competition. They allow each operation to have an independent architectural weight by using a sigmoid to mimic a multi-hot encoding, rather than the original softmax for a one-hot encoding. This modification enlarges the search space, as it allows multiple choices between every two nodes, whereas the intrinsic collapse in the original DARTS search space still calls for a better solution.
Noise injection is a common and effective trick in many deep learning scenarios. Vincent et al. (2008) add noise to auto-encoders to extract robust features. Fortunato et al. (2018) propose NoisyNet, which utilizes random noise to perform exploration in reinforcement learning. Chen et al. (2015) make use of noise to handle network expansion in knowledge transfer tasks. Meanwhile, injecting noise into gradients has been shown to greatly facilitate the training of deep neural networks (Neelakantan et al. (2015); Zhang et al. (2019)). Most recently, a noisy student has also been devised to progressively exploit unlabelled data (Xie et al. (2019a)).
As enlightened by FairDARTS (Chu et al. (2019a)), a more natural approach to resolve performance collapse is to suppress the unfair advantage directly, and injecting noise into the skip connection operation does exactly that. The noise hereby plays the role of disturbing the overwhelming gradient flow via skip connections, so that optimization proceeds more fairly. In other words, the skip connection has to overcome this disturbance to win robustly. The remaining problem is to design a proper type of noise.
We let $\tilde{x}$ be the noise injected into the skip operation, $x$ the feature it carries, and $\alpha$ the corresponding architectural weight. The loss along the skip operation can then be written as

$$L_{skip} = L_{val}\big(\sigma(\alpha)\,(x + \tilde{x})\big), \qquad (1)$$

where $L_{val}$ is the validation loss function and $\sigma$ is the softmax operation over the architectural weights. When the noise is much smaller than the output features, i.e. $\tilde{x} \ll x$, we also have the approximation

$$\frac{\partial L_{val}}{\partial (x + \tilde{x})} \approx \frac{\partial L_{val}}{\partial x}. \qquad (2)$$
In the noisy scenario, the gradient of the architectural weight along the skip connection operation becomes

$$\frac{\partial L_{skip}}{\partial \alpha} = \frac{\partial L_{val}}{\partial (x + \tilde{x})}\,\frac{\partial \sigma(\alpha)}{\partial \alpha}\,(x + \tilde{x}). \qquad (3)$$
As random noise brings uncertainty to the gradient update, skip connections have to overcome this difficulty in order to win over other operations, so their unfair advantage is much weakened. However, not all types of noise are equally effective in this regard. A basic requirement is that, while suppressing the unfair advantage, the noise shouldn’t bring any bias to the gradient in terms of its expectation. Formally, the expectation of the gradient can be written as

$$\mathbb{E}\!\left[\frac{\partial L_{skip}}{\partial \alpha}\right] = \mathbb{E}\!\left[\frac{\partial L_{val}}{\partial (x + \tilde{x})}\,\frac{\partial \sigma(\alpha)}{\partial \alpha}\,(x + \tilde{x})\right] \approx \frac{\partial L_{val}}{\partial x}\,\frac{\partial \sigma(\alpha)}{\partial \alpha}\,\big(x + \mathbb{E}[\tilde{x}]\big). \qquad (4)$$
Based on the premise stated in Equation 2, we take $\frac{\partial L_{val}}{\partial x}$ out of the expectation in Equation 4 to make this approximation. As there is still an extra $\mathbb{E}[\tilde{x}]$ in the gradient of the skip connection, $\mathbb{E}[\tilde{x}]$ must be 0 to keep the gradient unbiased. It is thus natural to introduce small and unbiased noise, i.e., noise with zero mean and small variance. For simplicity, we just use Gaussian noise; other options should work as well.
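This requirement can be sanity-checked numerically. The sketch below is illustrative only (the scalar toy loss, the sigmoid surrogate for a two-way softmax, and every constant are our own assumptions, not the paper's code): a Monte-Carlo estimate of the expected gradient stays close to the noise-free gradient when the injected noise has zero mean, and drifts away once the noise mean is shifted.

```python
import numpy as np

# Monte-Carlo check: zero-mean noise leaves the expected gradient of the
# skip weight (almost) unbiased; mean-shifted noise visibly biases it.
rng = np.random.default_rng(0)

def grad_alpha(alpha, x, noise, target=1.0):
    # Toy loss L = 0.5 * (sigma(alpha) * (x + noise) - target)^2, where
    # sigma is a sigmoid standing in for a two-way softmax.
    s = 1.0 / (1.0 + np.exp(-alpha))
    ds = s * (1.0 - s)                       # d sigma / d alpha
    return (s * (x + noise) - target) * ds * (x + noise)

alpha, x = 0.3, 2.0
clean = grad_alpha(alpha, x, 0.0)            # noise-free gradient
zero_mean = grad_alpha(alpha, x, rng.normal(0.0, 0.1, 200_000)).mean()
biased = grad_alpha(alpha, x, rng.normal(0.5, 0.1, 200_000)).mean()
# zero_mean deviates from clean only by an O(variance) term;
# the mean-shifted noise moves the expected gradient far away.
```

Note the residual O(σ²) deviation in the zero-mean case: it vanishes as the noise variance shrinks, which is exactly why the paper also requires the noise to be small.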
Based on the above analysis, we propose NoisyDARTS to step out of the performance collapse of DARTS. In practice, we inject Gaussian noise into skip connections to weaken the unfair advantage.
Formally, the edge from node $i$ to node $j$ in each cell operates on the input feature $x_i$, and its output is denoted as $\bar{o}^{(i,j)}(x_i)$. The intermediate node $x_j$ gathers all inputs from the incoming edges:

$$x_j = \sum_{i < j} \bar{o}^{(i,j)}(x_i). \qquad (5)$$
Let $O^{(i,j)}$ be the set of candidate operations on edge $(i,j)$. Specially, let $o_{skip}^{(i,j)}$ be the skip connection among them. NoisyDARTS injects the additive noise $\tilde{x}$ into the skip operation to get a mixed output,

$$\bar{o}^{(i,j)}(x_i) = \sum_{o \neq o_{skip}} \sigma\big(\alpha_o^{(i,j)}\big)\, o(x_i) + \sigma\big(\alpha_{skip}^{(i,j)}\big)\,\big(o_{skip}(x_i) + \tilde{x}\big). \qquad (6)$$
To ensure that the gradient of a skip connection is unbiased and the noise is small enough, we sample $\tilde{x} \sim \mathcal{N}(\mu, \sigma_{\tilde{x}}^2)$ with $\mu = 0$ and $\sigma_{\tilde{x}} = \lambda \cdot \mathrm{std}(x)$, where $\lambda$ is a positive coefficient. That is to say, the standard deviation of the noise changes accordingly with each mini-batch of samples, being $\lambda$ times that of $x$. Setting a low $\lambda$, we meet $\tilde{x} \ll x$ as required by Equation 2.
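The noisy mixed output of Equation 6 can be sketched as follows. This is a minimal numpy illustration under our own assumptions (shapes, op list, and function names are hypothetical, not the authors' implementation); noise is added only to the skip branch, with zero mean and standard deviation λ·std(x) as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_output(x, ops, alphas, skip_idx, lam=0.1, rng=rng):
    """Weighted sum of candidate ops on one edge; ops is a list of
    callables and skip_idx marks the skip connection."""
    weights = softmax(alphas)
    out = np.zeros_like(x)
    for k, op in enumerate(ops):
        y = op(x)
        if k == skip_idx:
            # additive Gaussian noise, scaled by the mini-batch feature std
            y = y + rng.normal(0.0, lam * x.std(), size=x.shape)
        out += weights[k] * y
    return out

x = rng.normal(size=(8, 16))
ops = [lambda x: x,                  # skip connection
       lambda x: np.maximum(x, 0)]   # toy stand-in for a conv op
y = mixed_output(x, ops, alphas=np.array([0.2, -0.1]), skip_idx=0)
```

With lam=0 the function reduces exactly to the ordinary DARTS mixed output, which makes the modification easy to ablate.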
Compared with DARTS, the only modification is the noise injected into the skip connections. The architecture search problem remains the same as in the original DARTS, which is to alternately learn the architectural weights $\alpha$ and the network weights $w$ that minimize the validation loss $L_{val}$, as shown in Equation 7:

$$\min_{\alpha} \; L_{val}\big(w^{*}(\alpha), \alpha\big) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w} L_{train}(w, \alpha). \qquad (7)$$
The modified optimization process codenamed NoisyDARTS is shown in Algorithm 1.
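The alternating loop can be illustrated with a deliberately tiny 1-D toy (our own sketch, not the paper's Algorithm 1 verbatim: the two-op edge, the scalar "conv" op w*x, and the data below are all assumptions). Each iteration takes one first-order weight step on training data and one architecture step on validation data, with zero-mean noise injected into the skip branch in both forward passes. The target t = 2x is reachable only through the parametric op, so the loop should drive the mixture toward it.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def step(x, t, w, alphas, lam=0.1, update="alpha", lr=0.1):
    p = softmax(alphas)
    noise = rng.normal(0.0, lam * abs(x))   # zero-mean noise on the skip op
    y = np.array([x + noise, w * x])        # [noisy skip, toy conv]
    out = p @ y
    err = out - t
    if update == "w":                       # weight step (training split)
        w = w - lr * err * p[1] * x         # dL/dw = err * p1 * x
    else:                                   # architecture step (val split)
        alphas = alphas - lr * err * p * (y - out)  # dL/da_k = err*p_k*(y_k-out)
    return w, alphas

def forward_clean(x, w, alphas):            # noise-free inference
    p = softmax(alphas)
    return p[0] * x + p[1] * w * x

w, alphas = 0.5, np.zeros(2)
for _ in range(1000):
    w, alphas = step(1.0, 2.0, w, alphas, update="w")      # train split
    w, alphas = step(1.2, 2.4, w, alphas, update="alpha")  # val split
```

After training, the noise-free output approaches the target, i.e. the mixture has learned to rely on the parametric branch rather than overshooting on the skip connection.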
We search in two spaces: the standard DARTS cell-based search space and a MobileNetV2-like chained search space. The former consists of a stack of duplicated normal cells and reduction cells, each represented by a DAG of 7 nodes, with each edge among intermediate nodes having 7 possible operations (max pooling, average pooling, skip connection, separable convolutions 3×3 and 5×5, dilated convolutions 3×3 and 5×5). The latter comprises 19 layers, each with 7 choices: inverted bottleneck blocks denoted as Ex_Ky (expansion rate x, kernel size y×y) and a skip connection.
In the search phase, we use similar hyperparameters and tricks as Liu et al. (2019). All experiments are done on a Tesla V100 with PyTorch 1.0 (Paszke et al. (2019)). The search phase takes about 7 GPU hours, which is less than the previously reported cost (12 GPU hours) due to a better implementation. We only use the first-order approach for optimization since it is more efficient. The best models are selected under noise with zero mean and a small variance. An example of the evolution of architectural weights during this search phase is exhibited in Figure 1.
For training a single model, we use the same strategy and data processing tricks as Chen et al. (2019a); Liu et al. (2019), and it takes about 16 GPU hours. The results are shown in Table 1. The best NoisyDARTS model achieves a new state-of-the-art result of 97.61% with only 534M FLOPS and 3.25M parameters. The searched cells are shown in Figure 2 and Table 6 (supplementary). Interestingly, although this model chooses as many as 4 skip connections in its reduction cells, it still obtains a highly competitive result. We attribute this to the elimination of the unfair advantage (Chu et al. (2019a)): the unfair advantage is suppressed to a level where the activated skip connections indeed contribute to the performance of the selected model.
| Models | Params (M) | FLOPS (M) | Top-1 (%) | Type |
| --- | --- | --- | --- | --- |
| NASNet-A (Zoph et al. (2018)) | 3.3 | 608 | 97.35 | RL |
| ENAS (Pham et al. (2018)) | 4.6 | 626 | 97.11 | RL |
| MdeNAS (Zheng et al. (2019)) | 3.6 | 599 | 97.45 | MDL |
| DARTS (first order) (Liu et al. (2019)) | 3.3 | 528 | 97.00 | GD |
| SNAS (Xie et al. (2019b)) | 2.8 | 422 | 97.15 | GD |
| GDAS (Dong and Yang (2019)) | 3.37 | 519 | 97.07 | GD |
| SGAS (Cri.2 avg.) (Li et al. (2019)) | 3.9 | - | 97.33 | GD |
| P-DARTS (Chen et al. (2019a)) | 3.4 | 532 | 97.5 | GD |
| PC-DARTS (Xu et al. (2020)) | 3.6 | 558 | 97.43 | GD |
| RDARTS (Zela et al. (2020)) | - | - | 97.05 | GD |
| FairDARTS (Chu et al. (2019a)) | 3.32±0.46 | 458±61 | 97.46 | GD |
In the search phase, Gaussian noise with zero mean and a variance of 0.2 is injected along the skip connection operations. We split the original training set into two datasets of equal size to act as our training and validation datasets, and treat the original validation set as the test set. We use the SGD optimizer with a batch size of 768. The learning rate for the network weights is initialized to 0.045 and decays to 0 within 30 epochs following the cosine decay strategy. Besides, we utilize the Adam optimizer with a constant learning rate of 0.001 for the architectural weights. This stage takes about 12 GPU days on Tesla V100 machines. In the training phase for the inferred standalone models, we use similar training tricks as EfficientNet (Tan and Le (2019)). Similar to Liu et al. (2019, 2018), the objective of this experiment is to find the best model without considering other constraints on FLOPS or number of parameters. The evolution of dominating operations during the search is illustrated in Figure 4. Compared with DARTS, the injected noise in NoisyDARTS successfully eliminates the unfair advantage.
The ImageNet classification results are shown in Table 2. Our model NoisyDARTS-A obtains a new state-of-the-art result on the ImageNet validation dataset with only 4.9M parameters. After being equipped with more tricks as in EfficientNet, namely squeeze-and-excitation (Hu et al. (2018)), the Swish activation, and AutoAugment (Cubuk et al. (2019)), it obtains 77.9% top-1 accuracy with 5.5M parameters. The searched architecture is shown in Figure 5.
| Models | FLOPS (M) | Params (M) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| MobileNetV2 (1.4) (Sandler et al. (2018)) | 585 | 6.9 | 74.7 | 92.2 |
| NASNet-A (Zoph et al. (2018)) | 564 | 5.3 | 74.0 | 91.6 |
| AmoebaNet-A (Real et al. (2018)) | 555 | 5.1 | 74.5 | 92.0 |
| MnasNet-92 (Tan et al. (2019)) | 388 | 3.9 | 74.79 | 92.1 |
| MdeNAS (Zheng et al. (2019)) | - | 6.1 | 74.5 | 92.1 |
| DARTS (Liu et al. (2019)) | 574 | 4.7 | 73.3 | 91.3 |
| SNAS (Xie et al. (2019b)) | 522 | 4.3 | 72.7 | 90.8 |
| GDAS (Dong and Yang (2019)) | 581 | 5.3 | 74.0 | 91.5 |
| PNAS (Liu et al. (2018)) | 588 | 5.1 | 74.2 | 91.9 |
| FBNet-C (Wu et al. (2019)) | 375 | 5.5 | 74.9 | 92.3 |
| FairNAS-C (Chu et al. (2019c)) | 321 | 4.4 | 74.7 | 92.1 |
| P-DARTS (Chen et al. (2019a)) | 577 | 5.1 | 74.9 | 92.3 |
| FairDARTS-B (Chu et al. (2019a)) | 541 | 4.8 | 75.1 | 92.5 |
| MobileNetV3 (Howard et al. (2019)) | 219 | 5.4 | 75.2 | 92.2 |
| MoGA-A (Chu et al. (2020)) | 304 | 5.1 | 75.9 | 92.8 |
| MixNet-M (Tan and Le (2019)) | 360 | 5.0 | 77.0 | 93.3 |
| EfficientNet B0 (Tan and Le (2019)) | 390 | 5.3 | 76.3 | 93.2 |
We transferred our NoisyDARTS-A model searched on ImageNet to CIFAR-10. The model is trained for 200 epochs with a batch size of 256 and a learning rate of 0.05. We set the weight decay to 0.0, the dropout rate to 0.1, and the drop-connect rate to 0.1. In addition, we use AutoAugment to avoid overfitting. The transferred model NoisyDARTS-A-t achieves a highly competitive top-1 accuracy with only 447M FLOPS, as shown in Table 1.
We further evaluate the transferability of our searched models on the COCO object detection task (Lin et al. (2014)). Particularly, we utilize them as drop-in replacements for the backbone in the RetinaNet framework (Lin et al. (2017)). Besides, we use the MMDetection toolbox, since it provides good implementations of various detection algorithms (Chen et al. (2019b)). Following the same training settings as Lin et al. (2017), all models in Table 3 are trained and evaluated on the COCO dataset for 12 epochs. The learning rate is initialized to 0.01 and decayed by 0.1 at epochs 8 and 11. As shown in Table 3, our model obtains better transferability than the other models under mobile settings.
| Models | FLOPS (M) | Top-1 (%) | AP | AP50 | AP75 | APS | APM | APL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MobileNetV2 (Sandler et al. (2018)) | 300 | 72.0 | 28.3 | 46.7 | 29.3 | 14.8 | 30.7 | 38.1 |
| Single-Path NAS (Stamoulis et al. (2019)) | 365 | 75.0 | 30.7 | 49.8 | 32.2 | 15.4 | 33.9 | 41.6 |
| MnasNet-A2 (Tan et al. (2019)) | 340 | 75.6 | 30.5 | 50.2 | 32.0 | 16.6 | 34.1 | 41.1 |
| MobileNetV3 (Howard et al. (2019)) | 219 | 75.2 | 29.9 | 49.3 | 30.8 | 14.9 | 33.3 | 41.1 |
| MixNet-M (Tan and Le (2019)) | 360 | 77.0 | 31.3 | 51.7 | 32.4 | 17.0 | 35.0 | 41.9 |
We compare the models searched with and without noise on two commonly used search spaces in Table 4. NoisyDARTS robustly escapes the performance collapse across different search spaces and datasets. Note that without noise, the differentiable approach performs severely worse on the ImageNet classification task, whereas our simple yet effective method finds a state-of-the-art model.
According to Equation 4, good noise should interfere little with overall gradient flow in terms of expectation but prevent skip connections from overbenefiting locally. Our ablation experiments confirm this hypothesis, as shown in Table 5. The searched cells are shown in Table 7 (supplementary). When a non-zero mean noise is injected, it brings in a deterministic bias which overshoots the gradient and misguides the whole optimization process.
| Noise Type | Top-1 (%) |
| --- | --- |
In this paper, we proposed a novel differentiable architecture search approach, NoisyDARTS. By injecting unbiased Gaussian noise into the skip connections’ output, we let the optimization process perceive the disturbed gradient flow, so that the unfair advantage is largely attenuated. Experiments show that NoisyDARTS works both effectively and robustly. The searched models achieved state-of-the-art results on CIFAR-10 and ImageNet. NoisyDARTS-a and NoisyDARTS-b also confirm that our method can retain many skip connections, as long as they substantially contribute to the performance of the derived model.
Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103, 2008.
Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9348–9355, 2019.
Regularized Evolution for Image Classifier Architecture Search. International Conference on Machine Learning, AutoML Workshop, 2018.
| Model / noise setting | Genotype |
| --- | --- |
| NoisyDARTS-a | `Genotype(normal=[('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('skip_connect', 0), ('dil_conv_3x3', 2), ('dil_conv_3x3', 3), ('sep_conv_3x3', 1), ('dil_conv_5x5', 4), ('dil_conv_3x3', 3)], normal_concat=range(2, 6), reduce=[('max_pool_3x3', 0), ('dil_conv_5x5', 1), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 3), ('skip_connect', 2), ('dil_conv_5x5', 4)], reduce_concat=range(2, 6))` |
| NoisyDARTS-b | `Genotype(normal=[('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('skip_connect', 2), ('sep_conv_3x3', 0), ('dil_conv_3x3', 3), ('skip_connect', 0), ('dil_conv_3x3', 3), ('dil_conv_3x3', 4)], normal_concat=range(2, 6), reduce=[('max_pool_3x3', 0), ('skip_connect', 1), ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2), ('avg_pool_3x3', 0)], reduce_concat=range(2, 6))` |
| (0.5, 0.1) | `Genotype(normal=[('sep_conv_3x3', 1), ('skip_connect', 0), ('skip_connect', 0), ('dil_conv_3x3', 2), ('dil_conv_5x5', 3), ('skip_connect', 0), ('dil_conv_5x5', 3), ('dil_conv_3x3', 4)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 0), ('skip_connect', 1), ('avg_pool_3x3', 0), ('avg_pool_3x3', 1), ('avg_pool_3x3', 0), ('avg_pool_3x3', 1), ('avg_pool_3x3', 0), ('avg_pool_3x3', 1)], reduce_concat=range(2, 6))` |
| (1.0, 0.1) | `Genotype(normal=[('sep_conv_3x3', 0), ('sep_conv_5x5', 1), ('dil_conv_3x3', 2), ('dil_conv_3x3', 0), ('dil_conv_5x5', 3), ('dil_conv_5x5', 2), ('dil_conv_3x3', 4), ('dil_conv_3x3', 3)], normal_concat=range(2, 6), reduce=[('max_pool_3x3', 0), ('avg_pool_3x3', 1), ('skip_connect', 2), ('avg_pool_3x3', 0), ('skip_connect', 3), ('skip_connect', 2), ('dil_conv_5x5', 4), ('skip_connect', 3)], reduce_concat=range(2, 6))` |
| (0.5, 0.2) | `Genotype(normal=[('sep_conv_3x3', 1), ('skip_connect', 0), ('dil_conv_3x3', 2), ('skip_connect', 0), ('dil_conv_5x5', 3), ('dil_conv_3x3', 2), ('dil_conv_3x3', 4), ('dil_conv_3x3', 3)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 1), ('avg_pool_3x3', 0), ('avg_pool_3x3', 1), ('avg_pool_3x3', 0), ('avg_pool_3x3', 1), ('avg_pool_3x3', 0), ('dil_conv_5x5', 4), ('avg_pool_3x3', 0)], reduce_concat=range(2, 6))` |
| (1.0, 0.2) | `Genotype(normal=[('sep_conv_3x3', 1), ('skip_connect', 0), ('dil_conv_5x5', 2), ('sep_conv_3x3', 1), ('dil_conv_3x3', 3), ('dil_conv_3x3', 2), ('dil_conv_3x3', 4), ('dil_conv_5x5', 3)], normal_concat=range(2, 6), reduce=[('max_pool_3x3', 0), ('max_pool_3x3', 1), ('max_pool_3x3', 0), ('max_pool_3x3', 1), ('dil_conv_5x5', 3), ('max_pool_3x3', 0), ('dil_conv_5x5', 3), ('avg_pool_3x3', 0)], reduce_concat=range(2, 6))` |