An Empirical Study on the Robustness of NAS based Architectures

07/16/2020 ∙ by Chaitanya Devaguptapu, et al. ∙ Indian Institute of Technology Hyderabad

Most existing methods for Neural Architecture Search (NAS) focus on achieving state-of-the-art (SOTA) performance on standard datasets and do not explicitly search for adversarially robust models. In this work, we study the adversarial robustness of existing NAS architectures, comparing it with state-of-the-art handcrafted architectures, and provide reasons for why it is essential. We draw some key conclusions on the capacity of current NAS methods to tackle adversarial attacks through experiments on datasets of different sizes.




1 Introduction

The choice of neural network architecture plays a crucial role in many challenging applications such as image classification (lecunetal98; Krizhevskyetal2012), object detection (renetal2015), and image segmentation (heetal2017). However, in most cases, these architectures are designed by experts in an ad-hoc, trial-and-error fashion. Early efforts on NAS (Zophetal2016) alleviate the pain of hand-designing these architectures by partially automating the search for the best-performing architecture. Since the work by Zophetal2016, there has been much interest in this space, and many researchers have proposed approaches that improve performance while decreasing computational cost. Popular examples include yanetal2019, phametal2018, and chenetal2019. At the time of writing, the SOTA models on image classification and object detection are NAS-based (Tanetal20191; tanetal2019), which shows that NAS plays an important role in solving standard learning tasks, especially in computer vision.

Adversarial robustness is defined as the accuracy of a model when adversarial examples (images perturbed with imperceptible noise) are provided as input. NAS is used in real-world applications such as medical image segmentation (baeetal2019) and autonomous driving (haoetal2019). Given the criticality of these applications, the adversarial robustness of these architectures is also crucial, in addition to good performance. Most existing NAS methods focus only on achieving SOTA performance and do not discuss the adversarial robustness of the final architecture. While NAS can be used to search for adversarially robust architectures, this remains a largely unexplored area. Most NAS methods find the best-performing architecture (in terms of accuracy) for ImageNet by using smaller datasets like CIFAR-10/100 (cifar10; cifar100) as a proxy for the search. As shown in Figure 1, this may not be useful: the performance of NAS on small-scale datasets may not translate to large-scale ones, especially in terms of robustness. In this work, we take the first steps to compare the robustness of handcrafted models with NAS-based architectures. In particular, we seek to answer the following questions:

  • How do NAS-based methods compare with handcrafted models in terms of robustness?

  • How does the robustness of NAS-based architectures vary with dataset size?

  • Does increasing parameters of NAS-based architectures help improve robustness?

The remainder of this paper is organized as follows. Section 2 provides an overview of early and recent works on NAS, along with a brief description of how our work differs from existing efforts. Section 3 discusses our experimental setup, followed by a detailed discussion on our inferences from our experiments in Section 4. Section 5 summarizes our findings and key takeaways.

2 Related Work

Adversarial Attacks and Robustness:

Adversarial examples, in general, are samples whose perturbations are imperceptible to the human eye but that can fool a deep classifier into predicting an incorrect class with high confidence. Several effective adversarial attacks have been proposed over the years, such as FGSM (fgsm), R-FGSM (rfgsm), StepLL (stepll), PGD (pgd) and SparseFool (Modas_2019_CVPR). Please see chakrborty2018 for more information.

Figure 1: Difference between the best NAS and the best handcrafted accuracies (DNHA) as the size of the dataset increases. DNHA = max(NAS) - max(Hand-crafted). The figure shows that as the dataset scale increases or the attack gets stronger, handcrafted models are more robust.

Neural Architecture Search (NAS) proposes a methodology to automate the design of neural network architectures for a given task. Over the years, several approaches have emerged to search architectures, including Reinforcement Learning (RL) (Zophetal2016), neuro-evolutionary approaches (real2019regularized), sequential decision processes (liu2018progressive), one-shot methods (phametal2018) and fully differentiable gradient-based methods (Darts2018). RL-based methods treat the generation of an architecture as a sequence of actions taken by an agent in a search space, with a reward based on the performance of the generated architecture. Neuro-evolutionary algorithms, on the other hand, evolve a population of models, with each generation carrying the best cell architectures forward and introducing mutations in operations and connections. Both kinds of methods are, in general, computationally expensive. Techniques using sequential model-based optimization, such as Progressive NAS (liu2018progressive), follow a greedy approach that reduces the cost by incrementally growing the cell architecture in depth from a previous optimal sub-structure. One-shot NAS methods like Efficient NAS (phametal2018) improve search time by treating all possible architectures as different subgraphs of a supergraph and sharing weights among common edges, thereby searching 1000x faster than the RL-based method of Zophetal2016.

One-shot fully differentiable NAS methods, such as DARTS (Darts2018), are based on a continuous relaxation of the architecture representation that allows the use of gradient descent for efficient architecture search. These gradient-based methods are orders of magnitude faster than non-differentiable techniques, but they require the entire supergraph to reside in GPU memory during architecture search. P-DARTS (pdarts) improves over DARTS by progressively increasing the search depth, bridging the gap between search and evaluation that arises in DARTS (which searches in a shallow setting and evaluates in a deep setting). Partially-Connected DARTS (pcdarts), a SOTA approach in NAS, significantly improves the efficiency of one-shot NAS by sampling parts of the super-network and adding edge normalization, which reduce redundancy and uncertainty in search, respectively. DenseNAS (densenas) is another state-of-the-art method that attempts to improve search space design by additionally searching block counts and block widths in a densely connected search space. Despite this plethora of methods, there has been little effort to understand the adversarial robustness of the final learned architectures.
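The continuous relaxation used by DARTS-style methods can be illustrated with a minimal sketch. It assumes toy scalar functions as stand-ins for real candidate operations; the names `CANDIDATE_OPS`, `mixed_op` and `discretize` are hypothetical and do not come from any NAS codebase. Each edge outputs a softmax-weighted sum of all candidate operations, so the architecture weights become continuous and can be trained by gradient descent; after search, the highest-weighted operation is kept.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate operations on one edge of a cell (toy scalar
# stand-ins for convolution, pooling, skip connection, and so on).
CANDIDATE_OPS = {
    "skip": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}

def mixed_op(x, alphas):
    """Relaxed edge: softmax-weighted sum of all candidate operations."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))

def discretize(alphas):
    """Simplified final step: keep the operation with the largest weight."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alphas)), key=lambda i: alphas[i])]

y_uniform = mixed_op(3.0, [0.0, 0.0, 0.0])  # equal weights: (3 + 6 + 0) / 3
chosen = discretize([0.1, 1.5, -0.3])       # "double" has the largest weight
```

Because every candidate operation is evaluated on every edge, the whole supergraph must be held in memory during search, which is the cost noted above.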

NAS and Robustness: While dongetal2018, pgd and xieetal2019 show the role of network architectures in adversarial robustness through several experiments, they focus only on handcrafted architectures. Very recently, there have been limited efforts to improve adversarial robustness using architecture search (guo2019meets; vargas2019evolving). guo2019meets proposes a robust architecture search framework that leverages one-shot NAS. However, this method does not portray the true picture of architectural robustness, as adversarial training is used in all of its experiments. The authors mainly focus on searching for networks that are adversarially robust on CIFAR-10, and the final architecture is transferred directly to CIFAR-100, SVHN and Tiny-ImageNet. As our analysis shows, current NAS approaches are already adequately robust on datasets like CIFAR-10 and CIFAR-100; it is, in fact, on larger datasets that they show a significant reduction in adversarial robustness. In addition to the use of adversarial training, the clean accuracy (test set accuracy) of the models used in guo2019meets is much lower than the SOTA numbers on the respective datasets, which restricts the deployment of these models in real-world applications. Though guo2019meets makes comparisons with three variants of ResNets (heetal2015) on ImageNet (imagenet_cvpr09), there are better handcrafted alternatives to ResNets in terms of both clean accuracy and robustness (as discussed in Section 4). vargas2019evolving uses black-box attacks to generate a fixed set of adversarial examples on CIFAR-10 and uses these examples to search for a robust architecture. This experimental setting is constrained and does not reflect true model robustness, as the adversarial examples are fixed a priori and white-box attacks are not studied. Neither guo2019meets nor vargas2019evolving makes any comparison with existing NAS methods, which, as per our study, are already robust to an extent.

In an attempt to understand the robustness of NAS-based methods completely from an architecture perspective, we mainly focus on evaluating the trend in the robustness of SOTA NAS methods on white-box attacks such as FGSM and PGD on datasets of different sizes, including large-scale datasets such as ImageNet (imagenet_cvpr09). We provide an extensive evaluation of the adversarial robustness of several NAS approaches and compare it with the standard and widely used handcrafted models.

3 Robustness of NAS Models: A Study

In this work, we empirically study the robustness of NAS models, and seek to answer the questions stated in Section 1. We begin by describing the design of our experiments, including datasets, models, attacks and metrics.

Datasets. Since the primary goal of our work is to compare the robustness of architectures at different dataset scales, we need problem settings where datasets of different scales are readily available. Given the dearth of publicly available datasets at different scales for problems like medical image segmentation and autonomous driving, we use standard, widely used image classification datasets. In addition to the standard CIFAR-10 (cifar10) dataset, which consists of 60K images of resolution 32×32, we also choose CIFAR-100 to test whether the same robustness trends hold when the number of classes increases by a factor of 10. Since most real-world applications deal with large-scale datasets, we also test robustness on the ImageNet (imagenet_cvpr09) dataset, which consists of over a million images from 1000 classes. This makes our study more complete than earlier works.

Architectures. We selected the most commonly used NAS methods, including DARTS (Darts2018), P-DARTS (pdarts) and NSGA-Net (nsganet), as well as recent methods like PC-DARTS (pcdarts) and DenseNAS (densenas). For a fair comparison, we evaluate five well-known handcrafted architectures and at least four NAS architectures on each dataset mentioned above. For all experiments, we either use pre-trained models made available by the respective authors or train the models from scratch until we obtain the performance reported in the respective papers. NSGA-Net's results are only available for CIFAR-10/100 because its implementation does not support ImageNet. Similarly, the DenseNAS implementation does not support CIFAR-10/100, so its results are shown only for ImageNet.

Adversarial Attacks. For adversarial robustness, we test against FGSM (fgsm), R-FGSM (rfgsm), StepLL (stepll) and PGD (pgd). Details of the attacks are: For FGSM, we use ; for R-FGSM, , ; for StepLL, , ; and for PGD, , with 10 iterations. All these parameter choices are standard and widely used in the community. Architectures are trained using standard training protocols, and no explicit adversarial training is performed.
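To make the single-step and iterative attacks concrete, the following is a minimal sketch of FGSM and PGD under an L-infinity budget. It assumes a toy two-feature logistic model in place of a deep network so that the input gradient has a closed form. The weights, the input, and the values of `eps` and `step` are hypothetical and are not the settings used in this study; the random initialization often used with PGD and the clipping to a valid image range are omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def loss_grad_wrt_input(x, y, w, b):
    """Gradient of binary cross-entropy w.r.t. the input of a logistic model."""
    p = sigmoid(logit(x, w, b))
    return [(p - y) * wi for wi in w]

def fgsm(x, y, w, b, eps):
    """Single step of size eps along the sign of the loss gradient."""
    g = loss_grad_wrt_input(x, y, w, b)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def pgd(x, y, w, b, eps, step, iters=10):
    """Repeated signed-gradient steps, each projected back into the
    L-infinity ball of radius eps around the original input."""
    x_adv = list(x)
    for _ in range(iters):
        g = loss_grad_wrt_input(x_adv, y, w, b)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

def predict(x, w, b):
    return int(sigmoid(logit(x, w, b)) >= 0.5)

# Hypothetical model and input, chosen only for illustration.
w, b = [2.0, -1.0], 0.0
x, y = [0.3, 0.1], 1
x_fgsm = fgsm(x, y, w, b, eps=0.25)
x_pgd = pgd(x, y, w, b, eps=0.25, step=0.05, iters=10)
```

In this toy setting both attacks flip the prediction from class 1 to class 0 while keeping every coordinate within `eps` of the original input. For a deep network the gradient would come from automatic differentiation rather than a closed form.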

Model          Clean %   FGSM    R-FGSM   Step LL   PGD
VGG16 BN       94.07     55.58   52.03    87.01     11.94
Resnet-18      93.48     55.82   52.25    88.34     14.72
Resnet-50      94.38     53.45   49.76    88.97     16.14
Densenet-121   94.76     54.17   50.93    88.98     15.96
Densenet-169   94.74     56.17   53.27    89.76     17.51
DARTS          97.03     64.77   55.32    90.89     3.27
PDARTS         97.12     64.88   56.02    91.25     4.22
NSGA Net       96.94     73.76   64.26    93.64     4.15
PC-DARTS       97.05     66.72   58.38    91.27     4.45
Table 1: CIFAR-10 (Top-1 Accuracy)

Metrics. We use Clean Accuracy and Adversarial Accuracy as our performance metrics. Clean Accuracy refers to the accuracy on the unperturbed test set as provided in the dataset. For each attack, we measure Adversarial Accuracy by perturbing the test set examples using the methods proposed in the papers listed above. These adversarial accuracies are reported in the tables as FGSM, R-FGSM, Step LL and PGD, depending on the attack.
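Both metrics are the same top-k accuracy computation applied to different inputs. The sketch below assumes the model outputs are already available as per-class scores; the helper `top_k_accuracy` and all score values are hypothetical and only illustrate how Top-1 and Top-k numbers are obtained.

```python
def top_k_accuracy(scores, labels, k=1):
    """Percentage of examples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)

# Hypothetical class scores for three test examples over four classes,
# before and after an adversarial perturbation of the inputs.
clean_scores = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.5, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
adv_scores = [[0.3, 0.4, 0.2, 0.1], [0.2, 0.5, 0.2, 0.1], [0.4, 0.3, 0.2, 0.1]]
labels = [0, 1, 3]

clean_top1 = top_k_accuracy(clean_scores, labels, k=1)  # Clean Accuracy
adv_top1 = top_k_accuracy(adv_scores, labels, k=1)      # Adversarial Accuracy
adv_top2 = top_k_accuracy(adv_scores, labels, k=2)
```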

4 Analysis and Results

As mentioned earlier, the primary goal of this work is to understand the robustness of NAS-based architectures at different dataset scales. In this section, we compare the robustness of NAS-based architectures with that of handcrafted models at different dataset scales by answering each of the questions listed in Section 1.

Figure 2: Comparison of robustness and clean accuracy of different architectures

4.1 How do NAS-based methods compare with handcrafted models in terms of robustness?

The robustness of different NAS-based architectures on CIFAR-10, CIFAR-100 and ImageNet is shown in Tables 1, 2 and 3, respectively. On CIFAR-10 and CIFAR-100, NSGA-Net (nsganet) outperforms every other architecture in terms of architectural robustness against FGSM, R-FGSM and StepLL by a significant margin. Against the PGD attack, however, NAS-based architectures fall behind handcrafted models by a significant margin.

Model          Clean %   FGSM    R-FGSM   Step LL   PGD
VGG16 BN       72.05     22.68   16.23    60.09     2.20
Resnet-18      63.87     21.90   16.59    55.68     3.53
Resnet-50      73.09     23.94   18.55    62.92     3.43
Densenet-121   78.71     28.87   22.45    69.68     4.22
Densenet-169   82.44     28.59   22.27    73.24     4.18
DARTS          82.43     33.42   21.00    67.46     0.70
PDARTS         79.36     33.50   24.93    66.32     2.07
NSGA Net       85.54     42.40   31.47    74.58     0.86
PC-DARTS       81.83     34.21   22.56    66.94     1.50
Table 2: CIFAR-100 (Top-1 Accuracy)

The trend seen on CIFAR-10/100 did not hold for large-scale datasets like ImageNet in our experiments. As shown in Table 3, handcrafted models like Densenets are more robust than NAS-based methods. For StepLL, DenseNAS-R3 outperforms handcrafted models by a small margin of 0.6 percentage points. Since this difference is in Top-5 accuracy, it is not very significant. For PGD, PC-DARTS outperforms VGG-16 in Top-1 accuracy by 0.04 percentage points.

The robustness trend for each of the three datasets is shown in Figure 2. For stronger attacks like PGD, handcrafted models are generally more robust than NAS-based architectures. While NAS-based architectures achieve SOTA clean accuracy, their robustness is very erratic. Unless clean accuracy is the only criterion, NAS-based architectures are not a good choice.

4.2 How does the robustness of NAS-based architectures vary with dataset size?

As discussed in Section 4.1 and shown in Tables 1 and 2, at the scale of CIFAR-10 and CIFAR-100, NAS-based architectures are more robust than handcrafted models to simple attacks like FGSM, R-FGSM and StepLL. Nevertheless, as the dataset scale increases, their robustness falls below that of handcrafted architectures even for these relatively simple attacks.

Model            Params (M)   Clean %         FGSM            R-FGSM          Step LL         PGD
                              Top-1   Top-5   Top-1   Top-5   Top-1   Top-5   Top-1   Top-5   Top-1   Top-5
VGG16 BN         138.36       73.36   91.52   21.59   43.54   13.15   31.51   39.73   62.82   0.72    0.92
Resnet-18        11.68        69.76   89.08   21.23   41.42   13.75   31.99   52.14   76.14   0.09    1.21
Resnet-50        25.55        76.13   92.86   30.75   53.90   22.91   45.45   58.15   81.60   0.14    2.91
Densenet-121     7.97         74.43   91.97   39.736  64.63   29.90   55.85   60.90   83.79   0.22    3.79
Densenet-169     14.14        75.60   92.81   44.55   69.29   34.43   61.46   61.95   84.70   0.38    6.38
DARTS            5.97         73.30   91.26   38.94   63.45   29.47   53.60   51.18   81.98   0.28    1.31
PDARTS           6.20         75.62   92.61   39.52   63.47   30.91   54.51   60.19   84.80   0.62    2.03
PC-DARTS         5.34         75.37   92.49   42.38   66.75   33.42   57.94   59.46   84.66   0.76    2.39
DenseNAS-Large   6.48         76.06   92.80   32.70   55.73   25.42   46.93   57.60   82.44   0.32    1.33
DenseNAS-R3      24.65        77.05   93.26   36.10   59.45   30.0    53.08   61.46   85.36   0.36    1.88
Table 3: Results on ImageNet

Figure 1 shows the difference between the maximum accuracy of NAS-based architectures and the maximum accuracy of handcrafted models for all four attacks shown in Tables 1, 2 and 3. In general, as the dataset scale increases, the robustness of NAS-based methods decreases relative to handcrafted models. To confirm our claims on the latest NAS architectures, we ran experiments using the recently introduced PC-DARTS (pcdarts) and DenseNAS (densenas); even these are not robust when compared to the handcrafted models.
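The DNHA quantity plotted in Figure 1 can be recomputed directly from the tables. The sketch below copies the adversarial accuracies from Tables 1 to 3 and evaluates DNHA = max(NAS) - max(Hand-crafted) per attack. It assumes Top-1 accuracy is used for ImageNet, which may differ from the metric used in the figure; the names `HANDCRAFTED`, `NAS` and `dnha` are introduced here for illustration.

```python
ATTACKS = ("FGSM", "R-FGSM", "Step LL", "PGD")

# Top-1 adversarial accuracies copied from Tables 1-3, one tuple per model,
# ordered as in ATTACKS.
HANDCRAFTED = {
    "CIFAR-10": [(55.58, 52.03, 87.01, 11.94), (55.82, 52.25, 88.34, 14.72),
                 (53.45, 49.76, 88.97, 16.14), (54.17, 50.93, 88.98, 15.96),
                 (56.17, 53.27, 89.76, 17.51)],
    "CIFAR-100": [(22.68, 16.23, 60.09, 2.20), (21.90, 16.59, 55.68, 3.53),
                  (23.94, 18.55, 62.92, 3.43), (28.87, 22.45, 69.68, 4.22),
                  (28.59, 22.27, 73.24, 4.18)],
    "ImageNet": [(21.59, 13.15, 39.73, 0.72), (21.23, 13.75, 52.14, 0.09),
                 (30.75, 22.91, 58.15, 0.14), (39.736, 29.90, 60.90, 0.22),
                 (44.55, 34.43, 61.95, 0.38)],
}
NAS = {
    "CIFAR-10": [(64.77, 55.32, 90.89, 3.27), (64.88, 56.02, 91.25, 4.22),
                 (73.76, 64.26, 93.64, 4.15), (66.72, 58.38, 91.27, 4.45)],
    "CIFAR-100": [(33.42, 21.00, 67.46, 0.70), (33.50, 24.93, 66.32, 2.07),
                  (42.40, 31.47, 74.58, 0.86), (34.21, 22.56, 66.94, 1.50)],
    "ImageNet": [(38.94, 29.47, 51.18, 0.28), (39.52, 30.91, 60.19, 0.62),
                 (42.38, 33.42, 59.46, 0.76), (32.70, 25.42, 57.60, 0.32),
                 (36.10, 30.0, 61.46, 0.36)],
}

def dnha(dataset):
    """DNHA per attack: best NAS accuracy minus best handcrafted accuracy."""
    best_nas = [max(col) for col in zip(*NAS[dataset])]
    best_hand = [max(col) for col in zip(*HANDCRAFTED[dataset])]
    return {a: round(n - h, 2) for a, n, h in zip(ATTACKS, best_nas, best_hand)}

results = {name: dnha(name) for name in NAS}
for name, row in results.items():
    print(name, row)
```

With the tabulated values, DNHA is positive for FGSM, R-FGSM and Step LL on both CIFAR datasets and negative for those three attacks on ImageNet.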

In conclusion, even though NAS-based architectures achieve SOTA test set performance for a dataset, their robustness varies heavily based on the dataset scale.

4.3 Does an increase in the number of parameters of NAS-based architectures help improve robustness?

dongetal2018 and pgd observed that within the same family of architectures, increasing the number of network parameters helps improve robustness.

Figure 3: Robustness of DenseNAS architectures with increasing order of parameters

We therefore hypothesize that increasing model capacity also benefits the robustness of NAS-based architectures. To study this claim, we used different DenseNAS (densenas) architectures with an increasing number of parameters. The results are shown in Table 4, and Figure 3 contains the corresponding plot.

Model   Params (M)   Clean %   FGSM    R-FGSM   Step LL   PGD
A       4.76         90.24     49.48   40.44    79.48     0.88
B       5.58         90.97     53.70   45.09    80.86     1.13
C       6.13         91.53     56.44   47.76    82.1      1.22
Large   6.48         92.80     55.73   46.93    82.44     1.33
R1      11.09        90.57     51.82   44.17    79.41     0.93
R2      19.46        91.75     56.68   50.51    81.07     1.19
R3      24.65        93.26     59.45   53.08    85.36     1.88
Table 4: DenseNAS architectures comparison on ImageNet (Top-5 Accuracy)

We run experiments on two families of DenseNAS: models A, B, C, and Large, which use a MobileNetV2-based search space, and R1, R2, and R3, which use a ResNet-based search space. Within each search space, an increase in parameters generally results in an increase in robustness: robustness increases as we move from model R1 to R3 and from model A to Large. There is a minor drop in robustness from model C to Large for the FGSM and R-FGSM attacks, but since the two models have nearly the same number of parameters, this drop is insignificant and does not affect the overall trend.
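The within-family trend can be checked mechanically against Table 4. The sketch below copies the table rows, sorts each family by parameter count, and lists every step where adversarial accuracy drops as parameters grow; the names `FAMILIES` and `accuracy_drops` are introduced here for illustration.

```python
ATTACKS = ("FGSM", "R-FGSM", "Step LL", "PGD")

# Rows copied from Table 4 (Top-5 accuracy):
# (model, params in millions, FGSM, R-FGSM, Step LL, PGD)
FAMILIES = {
    "MobileNetV2-based": [
        ("A", 4.76, 49.48, 40.44, 79.48, 0.88),
        ("B", 5.58, 53.70, 45.09, 80.86, 1.13),
        ("C", 6.13, 56.44, 47.76, 82.1, 1.22),
        ("Large", 6.48, 55.73, 46.93, 82.44, 1.33),
    ],
    "ResNet-based": [
        ("R1", 11.09, 51.82, 44.17, 79.41, 0.93),
        ("R2", 19.46, 56.68, 50.51, 81.07, 1.19),
        ("R3", 24.65, 59.45, 53.08, 85.36, 1.88),
    ],
}

def accuracy_drops(rows):
    """Return (smaller_model, larger_model, attack) triples where adversarial
    accuracy decreases even though the parameter count increases."""
    ordered = sorted(rows, key=lambda r: r[1])
    drops = []
    for small, large in zip(ordered, ordered[1:]):
        for i, attack in enumerate(ATTACKS):
            if large[2 + i] < small[2 + i]:
                drops.append((small[0], large[0], attack))
    return drops

drops = {family: accuracy_drops(rows) for family, rows in FAMILIES.items()}
```

On these rows the only decreases are from C to Large under FGSM and R-FGSM; every other step within a family is an increase.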

In conclusion, increasing the parameters of a NAS-based architecture can improve its robustness, albeit at the cost of increased training and inference time.

5 Conclusion

We conclude from our study that existing work on NAS and robustness is largely superfluous, i.e., NAS-based architectures are already robust on the datasets considered in these papers and lack robustness only on large datasets, which have not been attempted; and that the number of parameters correlates with robustness within the same family of architectures. Explicitly searching for robust architectures using NAS is an important problem that has not been thoroughly studied to date.