Effective, Efficient and Robust Neural Architecture Search

11/19/2020
by   Zhixiong Yue, et al.

Recent advances in adversarial attacks show the vulnerability of deep neural networks searched by Neural Architecture Search (NAS). Although NAS methods can find network architectures with the state-of-the-art performance, the adversarial robustness and resource constraint are often ignored in NAS. To solve this problem, we propose an Effective, Efficient, and Robust Neural Architecture Search (E2RNAS) method to search a neural network architecture by taking the performance, robustness, and resource constraint into consideration. The objective function of the proposed E2RNAS method is formulated as a bi-level multi-objective optimization problem with the upper-level problem as a multi-objective optimization problem, which is different from existing NAS methods. To solve the proposed objective function, we integrate the multiple-gradient descent algorithm, a widely studied gradient-based multi-objective optimization algorithm, with the bi-level optimization. Experiments on benchmark datasets show that the proposed E2RNAS method can find adversarially robust architectures with optimized model size and comparable classification accuracy.


1 Introduction

Figure 1: Comparison of the architecture search procedure between DARTS (top) and the proposed E2RNAS (bottom). We formulate E2RNAS as a bi-level multi-objective optimization problem. There are two key differences between E2RNAS and DARTS. Firstly, an adversarial training method is adopted to improve the robustness of the proposed E2RNAS model. Secondly, we evaluate the E2RNAS model with two objectives, the validation loss and the number of parameters, for both effectiveness and efficiency, i.e., to learn a compact architecture with a controllable number of parameters. Therefore, E2RNAS can search an effective, efficient, and robust architecture.

Deep learning has achieved great success in many areas, such as computer vision, natural language processing, speech, and gaming. The design of the neural network architecture is important for such success. However, this design relies heavily on the knowledge and experience of experts, and even experienced experts cannot always design the optimal architecture. Therefore, Neural Architecture Search (NAS), which aims to design the architecture of neural networks in an automated way, has attracted great attention in recent years. NAS has demonstrated the capability to find neural network architectures with state-of-the-art performance in various tasks [emh19, lsy19, tcpvshl19, xzll19]. Search strategies in NAS are based on several techniques, including reinforcement learning [pham2018efficient, zoph2016neural], evolutionary algorithms [liu2017hierarchical, real2019regularized], Bayesian optimization, and gradient descent [lsy19, xu2019pc, chen2019progressive]. As a representative gradient-based NAS method, the Differentiable ARchiTecture Search (DARTS) method [lsy19] has become popular because of its good performance and low search cost.

However, these NAS methods are typically designed to optimize only the accuracy during the architecture search process while neglecting other significant objectives, which results in very limited application scenarios. For example, a deep neural network with high computational and storage demands is difficult to deploy to embedded devices (e.g., mobile phones and IoT devices), where resources are limited. Besides, the robustness of deep neural networks is also important. It is well known that trained neural networks are easily misled by adversarial examples [fgsm15, pgd17, fast_fgsm], which makes them risky to deploy in real-world applications. For example, a spammer can easily bypass an anti-spam email filter system by adding some special characters as perturbations, and a self-driving car may fail to recognize a guideboard correctly after some adversarial patches are attached to it.

Therefore, multi-objective NAS has drawn great attention recently, because more than performance needs to be considered when NAS meets real-world applications [cztl20, emh19, tcpvshl19]. In [jin2019rc, emh19, cztl20], the model size and computational cost are considered to satisfy some resource constraints. Besides, some works [Dong20, RobNets] search for differentiable architectures that can defend against adversarial attacks. However, to the best of our knowledge, there is no work that simultaneously optimizes the three objectives: performance, robustness, and resource constraints.

To fill this gap, this paper proposes an Effective, Efficient, and Robust Neural Architecture Search method (E2RNAS) to balance the trade-off among multiple objectives. Built on DARTS, the proposed E2RNAS method formulates the entire objective function as a bi-level multi-objective optimization problem where the upper-level problem is a multi-objective optimization problem, which can be viewed as an extension of the objective function proposed in DARTS. To the best of our knowledge, there is little work on solving such a bi-level multi-objective optimization problem with gradient descent techniques. To solve this problem, we propose an optimization algorithm that combines the multiple gradient descent algorithm (MGDA) [desideri12] with the bi-level optimization algorithm [colson2007overview].

Specifically, the contributions of this paper are three-fold.

  • We propose the E2RNAS method for searching effective, efficient and robust network architectures, leading to a practical DARTS-based framework for multi-objective NAS.

  • We formulate the objective function of the E2RNAS method as a novel bi-level multi-objective optimization problem and propose an efficient algorithm to solve it.

  • Experiments on benchmark datasets show that the proposed E2RNAS method can find adversarially robust architectures with optimized model size and comparable classification accuracy.

2 Related Works

2.1 Adversarial Attack and Defence

Deep neural networks are not robust when facing adversarial attacks [sze14]. Most adversarial attacks are white-box attacks, which assume that the attack algorithm can access all configurations of the trained neural network, including the architecture and model weights.

Fast Gradient Sign Method

Goodfellow et al. [fgsm15] propose the Fast Gradient Sign Method (FGSM) for generating adversarial examples. It directly uses the sign of the gradient of the loss function with respect to the input as the direction of the adversarial perturbation:

$$x_{adv} = x + \epsilon \cdot \mathrm{sign}\big(\nabla_{x}\mathcal{L}(\theta;\, x, y)\big),$$

where $x$ is the original input, $\epsilon$ is a small scalar that represents the strength of the perturbation, $\theta$ denotes the parameters of the victim model, $y$ is the original ground-truth label for the input $x$, $\mathrm{sign}(\cdot)$ denotes the elementwise sign function, $\mathcal{L}$ denotes the loss function used for training the victim model, and $\nabla_{x}\mathcal{L}(\theta;\, x, y)$ denotes its gradient with respect to $x$.
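For illustration, a minimal PyTorch sketch of such a one-step FGSM attack is given below; the model, the cross-entropy loss, and the assumption that inputs lie in [0, 1] are placeholders rather than the exact setup used in this paper.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """One-step FGSM: x_adv = x + epsilon * sign(grad_x L(theta; x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    x_adv = x + epsilon * grad.sign()        # step in the signed-gradient direction
    return x_adv.clamp(0.0, 1.0).detach()    # keep inputs in the assumed [0, 1] range
```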

Projected Gradient Descent (PGD)

Instead of generating one-step perturbations as in FGSM, Kurakin et al. [pgd17] propose the PGD method, which applies a small number of iterative steps. To ensure that the perturbation stays in the $\epsilon$-neighborhood of the original image, the PGD method clips the intermediate result after each iteration:

$$x^{t+1} = \mathrm{Clip}_{x,\epsilon}\big(x^{t} + \gamma \cdot \mathrm{sign}(\nabla_{x}\mathcal{L}(\theta;\, x^{t}, y))\big),$$

where $x^{t}$ is the perturbed example generated at the $t$-th step, $\gamma$ is the attack step size, and $\mathrm{Clip}_{x,\epsilon}(\cdot)$ elementwisely clips its input to lie within the interval $[x-\epsilon,\, x+\epsilon]$.
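A corresponding sketch of the iterative attack with per-step clipping to the $\epsilon$-ball follows; the number of steps and the step size are placeholder arguments, not the values used in the experiments.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon, step_size, num_steps):
    """Iterative attack: repeated signed-gradient steps, each clipped to the epsilon-ball."""
    x_adv = x.clone().detach()
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step_size * grad.sign()
        # Project back into the epsilon-neighborhood of the original input.
        x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon)
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```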

Adversarial Training

Adversarial training is an effective method for defending against adversarial attacks [fgsm15, pgd17, pgd7, fast_fgsm]. Goodfellow et al. [fgsm15] leverage the FGSM as a regularizer to train deep neural networks and make the model more resistant to adversarial examples. Wong et al. [fast_fgsm] use FGSM adversarial training with random initialization for the perturbation; their method speeds up the adversarial training process while being as effective as PGD-based adversarial training.

2.2 Multi-Objective Optimization

Multi-objective optimization aims to optimize more than one objective function simultaneously. Among different techniques to solve multi-objective problems, we are interested in gradient-based multi-objective optimization algorithms [desideri12, fliege2000steepest, schaffler2002stochastic], which leverage the Karush-Kuhn-Tucker (KKT) conditions [kuhn2014nonlinear] to find a common descent direction for all objectives.

In this paper, we utilize one such method, MGDA [desideri12]. With $M$ objective functions $\mathcal{L}_1(\theta), \ldots, \mathcal{L}_M(\theta)$ to be minimized, MGDA is an iterative method that first solves the following quadratic programming problem:

$$\min_{\lambda_1,\ldots,\lambda_M}\ \Big\|\sum_{m=1}^{M}\lambda_m \nabla_{\theta}\mathcal{L}_m(\theta)\Big\|_2^2\quad \mathrm{s.t.}\ \ \sum_{m=1}^{M}\lambda_m = 1,\ \ \lambda_m \ge 0\ \ \forall m, \tag{1}$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm of a vector and $\lambda_m$ can be viewed as a weight for the $m$-th objective, and then minimizes the weighted objective $\sum_{m=1}^{M}\lambda_m\mathcal{L}_m(\theta)$ with respect to $\theta$. When convergent, MGDA finds a Pareto-stationary solution.
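For reference, the sketch below approximately solves the quadratic program in problem (1) for $M$ flattened gradient vectors with a simple Frank-Wolfe loop; it is only illustrative, and practical MGDA implementations may differ in details.

```python
import torch

def mgda_weights(grads, num_iters=20):
    """Approximately solve min_{lambda in simplex} || sum_m lambda_m * g_m ||_2^2
    with a simple Frank-Wolfe loop.  `grads` is a list of M flattened gradient
    vectors (one per objective), all of the same length."""
    G = torch.stack(grads)                 # (M, P) matrix of per-objective gradients
    M = G.shape[0]
    lam = torch.full((M,), 1.0 / M)        # start from uniform weights on the simplex
    GG = G @ G.t()                         # (M, M) Gram matrix of the gradients
    for _ in range(num_iters):
        d = GG @ lam                       # gradient of the quadratic objective w.r.t. lambda
        idx = torch.argmin(d)              # Frank-Wolfe vertex: the best single objective
        e = torch.zeros_like(lam)
        e[idx] = 1.0
        diff = e - lam
        denom = diff @ GG @ diff           # curvature along the line-search direction
        if denom <= 0:
            break
        step = (-(lam @ GG @ diff) / denom).clamp(0.0, 1.0)
        lam = lam + step * diff            # exact line search for the quadratic objective
    return lam
```

For two objectives, the same problem admits the closed-form solution used later in Eq. (11).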

2.3 Multi-Objective NAS

Most NAS methods focus on searching architectures with the best accuracy. However, in real-world applications, other factors, such as model size and robustness, must be considered. To take those factors into consideration, several works on multi-objective NAS have been proposed in recent years. LEMONADE [emh19] considers two objectives: maximizing the validation accuracy and minimizing the number of parameters. It is based on an evolutionary algorithm and thus its search cost is quite high. MnasNet [tcpvshl19] uses a reinforcement learning approach to optimize both the accuracy and the inference latency when searching the architecture. Chen et al. [cztl20] perform neural architecture search based on reinforcement learning to optimize three objectives: maximizing the validation accuracy, minimizing the number of parameters, and minimizing the number of FLOPs. FBNet [wu2019fbnet] also considers both the accuracy and the model latency when searching the architecture via a gradient-based method that solves the corresponding multi-objective problem. Built on DARTS, RC-DARTS [jin2019rc] considers searching architectures with high accuracy while constraining the number of model parameters of the searched architecture below a threshold; its objective function is formulated as a constrained optimization problem and a projected gradient descent method is proposed to solve it. Also based on DARTS, GOLD-NAS [bi2020gold] considers three objectives, namely maximizing the validation accuracy, minimizing the number of parameters, and minimizing the number of FLOPs, and, by enlarging the search space, it proposes a one-level optimization algorithm instead of the bi-level optimization.

3 The E2RNAS Method

In this section, we present the proposed E2RNAS method. We first give an overview of the DARTS method, then introduce how to achieve robustness and how to formulate the objective that constrains the number of parameters in the searched architecture. Finally, we present the bi-level multi-objective problem of the proposed E2RNAS method as well as its optimization.

3.1 Preliminary: DARTS

DARTS [lsy19] aims to learn a Directed Acyclic Graph (DAG) called a cell, which can be stacked to form a neural network architecture. Each cell consists of $N$ nodes $\{x_0, x_1, \ldots, x_{N-1}\}$, each of which denotes a hidden representation. Let $\mathcal{O}$ denote a discrete operation space. Each edge $(i, j)$ of the DAG represents an operation function $o(\cdot)$ (e.g., skip connection or pooling) from $\mathcal{O}$, performed at node $x_j$ with a probability $p_{o}^{(i,j)}$ obtained by applying a softmax to the architecture parameters $\alpha^{(i,j)}$. Therefore, we can formulate each edge as a weighted sum that combines all the operations in $\mathcal{O}$ as $\bar{o}^{(i,j)}(x_i)=\sum_{o\in\mathcal{O}} p_{o}^{(i,j)}\, o(x_i)$. An intermediate node is the sum of its predecessors, $x_j = \sum_{i<j}\bar{o}^{(i,j)}(x_i)$. The output of the cell, node $x_{N-1}$, is the concatenation of the outputs of all intermediate nodes, excluding the two input nodes $x_0$ and $x_1$. Therefore, $\alpha = \{\alpha^{(i,j)}\}$ parameterizes the searched architecture, where the edges $(i,j)$ range over the set $\mathcal{E}$ of all the edges from all the cells.
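The continuous relaxation of a single edge can be sketched in PyTorch as follows; keeping the architecture parameters inside the module is only for brevity, whereas DARTS stores them outside the network and shares them across cells of the same type.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation of one edge: a softmax-weighted sum over candidate operations."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)   # the discrete operation space O on this edge
        # One architecture parameter per candidate operation on this edge.
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))

    def forward(self, x):
        p = F.softmax(self.alpha, dim=0)          # operation probabilities p_o on this edge
        return sum(p_o * op(x) for p_o, op in zip(p, self.ops))
```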

Let $\mathcal{D}_{tr}$ denote the training dataset and $\mathcal{Y}_{tr}$ denote the corresponding set of labels. Similarly, the validation dataset and labels are denoted by $\mathcal{D}_{val}$ and $\mathcal{Y}_{val}$. We use $w$ to denote all the weights of the neural network and $\mathcal{L}(\cdot)$ to denote the loss function. DARTS solves a bi-level optimization problem as

$$\min_{\alpha}\ \mathcal{L}_{val}\big(w^{*}(\alpha), \alpha\big)\qquad \mathrm{s.t.}\ \ w^{*}(\alpha)=\arg\min_{w}\ \mathcal{L}_{train}(w, \alpha), \tag{2}$$

where $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ represent the training and validation losses, respectively. Here $\min_{\alpha}\mathcal{L}_{val}(w^{*}(\alpha),\alpha)$ is called the upper-level problem and $w^{*}(\alpha)=\arg\min_{w}\mathcal{L}_{train}(w,\alpha)$ is called the lower-level problem.

When the search procedure finishes, the final architecture is determined by keeping, on each edge of each cell, the operation with the largest probability, i.e., $o^{(i,j)}=\arg\max_{o\in\mathcal{O}}\, p_{o}^{(i,j)}$.
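A sketch of this discretization step is given below; it keeps only the strongest operation per edge and omits the additional DARTS rule of retaining the two strongest incoming edges per intermediate node.

```python
import torch

def discretize(alpha):
    """Select, on every edge, the operation with the largest probability.
    `alpha` has shape (num_edges, num_ops) and holds the architecture parameters."""
    probs = torch.softmax(alpha, dim=-1)
    return probs.argmax(dim=-1)          # index of the chosen operation for each edge
```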

3.2 Adversarial Training for Robustness

In E2RNAS, we expect the searched architecture to be robust, which means that the performance of a model trained with the searched architecture remains stable when perturbations are added to the input data. To improve the robustness of the searched architecture, we leverage the adversarial training method in [fast_fgsm] to train a robust model.

Following [fast_fgsm], for each sample $x$ and its corresponding label $y$, we can generate a perturbation $\delta$ for $x$ using one single step as

$$\delta = \mathrm{Clip}_{[-\epsilon,\epsilon]}\big(\delta_0 + \gamma\cdot \mathrm{sign}(\nabla_{x}\mathcal{L}(w, \alpha;\, x+\delta_0, y))\big),$$

where $\epsilon$ is the perturbation size, $\delta_0$ is randomly initialized from a uniform distribution on the interval $[-\epsilon, \epsilon]$, and $\gamma$ is the attack step size. Therefore, we generate the adversarial instance as $x_{adv} = x + \delta$. Obviously, FGSM is a special case of this method when $\delta_0$ is initialized with zero and $\gamma = \epsilon$. This FGSM-based adversarial training method with random initialization for $\delta_0$ [fast_fgsm] can effectively defend against the PGD adversarial attack [pgd17], while not adding much computational cost to the architecture search procedure.
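A minimal sketch of this randomly initialized one-step perturbation is shown below; the model and the cross-entropy loss are placeholders, and clipping $x+\delta$ to the valid pixel range is omitted.

```python
import torch
import torch.nn.functional as F

def fast_fgsm_perturbation(model, x, y, epsilon, step_size):
    """FGSM with a random start: delta_0 ~ U(-eps, eps), one signed-gradient step,
    then clip delta back to [-eps, eps]; returns the adversarial example x + delta."""
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon).requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    grad = torch.autograd.grad(loss, delta)[0]
    delta = (delta + step_size * grad.sign()).clamp(-epsilon, epsilon)
    return (x + delta).detach()
```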

We use these perturbed data to learn the network parameters so that the trained model can defend against adversarial attacks. Therefore, we aim to minimize the training loss on the perturbed data as

$$\min_{w}\ \mathcal{L}_{train}^{adv}(w, \alpha) := \frac{1}{|\mathcal{D}_{tr}|}\sum_{(x,y)\in(\mathcal{D}_{tr},\,\mathcal{Y}_{tr})}\mathcal{L}\big(w, \alpha;\ x+\delta,\ y\big). \tag{3}$$

Note that this adversarial training method trains the model only on adversarial examples, which is different from the FGSM-based adversarial training method [fgsm15] that uses them as a regularization term for training.

3.3 Objective Function of Resource Constraints

Architectures with a small number of parameters have more application scenarios, even on resource-constrained mobile devices. Therefore, we regard resource constraints as one of the desired objectives.

Following DARTS [lsy19], we determine the operation on each edge of each cell in the final architecture as the one with the largest probability. So the number of parameters in an architecture can be computed as

$$N_p(\alpha) = \sum_{(i,j)\in\mathcal{E}} C\Big(\arg\max_{o\in\mathcal{O}}\ p_{o}^{(i,j)}\Big), \tag{4}$$

where $C(o)$ denotes the number of parameters corresponding to the operation $o$.

Note that the $\arg\max$ in Eq. (4) is a non-differentiable operation, making the computation of the gradient of $N_p(\alpha)$ with respect to $\alpha$ infeasible. To make this operation differentiable, we use the softmax trick to approximate the $\arg\max$ and formulate the approximation as

$$\widetilde{N}_p(\alpha) = \sum_{(i,j)\in\mathcal{E}}\ \sum_{o\in\mathcal{O}} p_{o}^{(i,j)}\, C(o). \tag{5}$$

Furthermore, to prevent the model from searching over-simplified architectures (i.e., ones containing too many parameter-free operations), which leads to unsatisfactory performance, we add a lower bound $L$ to the parameter size $\widetilde{N}_p(\alpha)$ in Eq. (5). Therefore, the objective function of the resource constraint can be formulated as

$$\mathcal{L}_{par}(\alpha) = \max\big(\widetilde{N}_p(\alpha),\ L\big). \tag{6}$$

Different from RC-DARTS [jin2019rc], which directly adds the resource constraint to the original DARTS objective function (2) as a constraint and formulates the objective function as a constrained optimization problem, here we treat the resource constraint as an objective.
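To make the construction of Eqs. (4)-(6) concrete, a minimal PyTorch sketch of the differentiable size objective is given below; the per-operation parameter counts are assumed identical across edges for simplicity, and the $\max(\cdot, L)$ reading of the lower bound is our interpretation of Eq. (6) rather than a verbatim implementation.

```python
import torch
import torch.nn.functional as F

def resource_objective(alpha, op_param_counts, lower_bound):
    """Differentiable surrogate of the architecture size.
    alpha: (num_edges, num_ops) architecture parameters.
    op_param_counts: (num_ops,) parameter count of each candidate operation
    (assumed identical across edges here; in practice it depends on the edge)."""
    probs = F.softmax(alpha, dim=-1)              # per-edge operation probabilities
    n_params = (probs * op_param_counts).sum()    # softmax relaxation of the argmax count
    # Lower bound L: below it, the objective exerts no further pressure to shrink.
    return n_params.clamp(min=float(lower_bound))
```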

3.4 Bi-level Multi-Objective Formulation

E2RNAS aims to search the architecture parameter $\alpha$ to minimize the validation loss for effectiveness and the number of parameters for efficiency, while achieving robustness via adversarial training. Thus, we combine Eqs. (3) and (6) as well as the adversarial training to formulate the entire objective function as

$$\min_{\alpha}\ \Big(\mathcal{L}_{val}\big(w^{*}(\alpha), \alpha\big),\ \mathcal{L}_{par}(\alpha)\Big)\qquad \mathrm{s.t.}\ \ w^{*}(\alpha) = \arg\min_{w}\ \mathcal{L}_{train}^{adv}(w, \alpha). \tag{7}$$

Problem (7) is similar to the bi-level optimization problem (2) in DARTS: the lower-level problem ($w^{*}(\alpha)=\arg\min_{w}\mathcal{L}_{train}^{adv}(w,\alpha)$) is similar, but there is a significant difference in that the upper-level problem (the minimization over $\alpha$) contains two objectives. So problem (7) is a bi-level multi-objective optimization problem, which is a generalization of problem (2) in DARTS. There are few works on bi-level multi-objective optimization [calvete2010linear, deb2009solving, zhang2012improved, ruuska2012constructing] and, to the best of our knowledge, the proposed optimization algorithm introduced in the following is the first gradient-based algorithm to solve general bi-level multi-objective optimization problems.

Problem (7) can be understood as a two-stage optimization. Firstly, given an architecture parameter $\alpha$, we can learn a robust model with optimal model weights $w^{*}(\alpha)$ via empirical risk minimization on adversarial examples. Secondly, given $w^{*}(\alpha)$, the architecture parameter $\alpha$ is updated on the validation dataset by making a trade-off between the performance and the model size. Therefore, we can solve problem (7) in two stages, which are described as follows.

Updating $w$

Given the architecture parameter $\alpha_k$, $w$ can be simply updated as

$$w_{k+1} = w_{k} - \eta_{w}\,\nabla_{w}\mathcal{L}_{train}^{adv}(w_{k}, \alpha_{k}), \tag{8}$$

where $k$ denotes the index of the iteration and $\eta_{w}$ denotes the learning rate for $w$.

Updating $\alpha$

After obtaining $w_{k+1}$, we can optimize the upper-level problem to update the architecture parameter $\alpha$. As the upper-level problem is a multi-objective optimization problem, we adopt MGDA to solve it. In MGDA, we first need to solve problem (1), which requires the computation of the gradients of the two objectives with respect to $\alpha$. The gradient of $\mathcal{L}_{par}(\alpha)$ with respect to $\alpha$ is easy to compute, while the gradient of $\mathcal{L}_{val}(w^{*}(\alpha), \alpha)$ with respect to $\alpha$ is a bit complicated, as $w^{*}(\alpha)$ is also a function of $\alpha$ and it is too expensive to obtain $w^{*}(\alpha)$ exactly. Therefore, we use a second-order approximation as

$$\nabla_{\alpha}\mathcal{L}_{val}\big(w^{*}(\alpha), \alpha\big) \approx \nabla_{\alpha}\mathcal{L}_{val}\big(w - \eta_{w}\nabla_{w}\mathcal{L}_{train}^{adv}(w, \alpha),\ \alpha\big). \tag{9}$$

Obviously, when $\eta_{w}=0$, $w$ becomes an approximation of $w^{*}(\alpha)$ and Eq. (9) degenerates to the first-order approximation, which can speed up the gradient computation and reduce the memory cost but leads to worse performance [lsy19]. So we use the second-order approximation in Eq. (9). Then, due to the two objectives in the upper-level problem of problem (7), we can simplify problem (1) as a one-dimensional quadratic function of $\lambda$ as

$$\min_{\lambda\in[0,1]}\ \big\|\lambda\, g_1 + (1-\lambda)\, g_2\big\|_2^2, \tag{10}$$

where $g_1 = \nabla_{\alpha}\mathcal{L}_{val}(w^{*}(\alpha), \alpha)$ and $g_2 = \nabla_{\alpha}\mathcal{L}_{par}(\alpha)$ denote the gradients of the two objectives, respectively. Here $\lambda$ can be viewed as the weight for the first objective and $1-\lambda$ is the weight for the second objective. It is easy to show that problem (10) has an analytical solution as

$$\lambda^{*} = \max\left(\min\left(\frac{(g_2 - g_1)^{\top} g_2}{\|g_1 - g_2\|_2^{2}},\ 1\right),\ 0\right). \tag{11}$$

After that, we can update $\alpha$ by minimizing $\lambda^{*}\mathcal{L}_{val}(w^{*}(\alpha),\alpha) + (1-\lambda^{*})\mathcal{L}_{par}(\alpha)$ as

$$\alpha_{k+1} = \alpha_{k} - \eta_{\alpha}\big(\lambda^{*} g_1 + (1-\lambda^{*})\, g_2\big), \tag{12}$$

where $\eta_{\alpha}$ denotes the learning rate for $\alpha$.
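A direct sketch of this closed-form computation is given below; the two gradients are passed as lists of tensors (as returned by automatic differentiation), and a small constant guards against division by zero.

```python
import torch

def solve_lambda(g1, g2):
    """Analytical solution of the two-objective min-norm problem:
    lambda* = clip((g2 - g1)^T g2 / ||g1 - g2||^2, 0, 1)."""
    g1 = torch.cat([g.reshape(-1) for g in g1])   # gradient of the validation loss w.r.t. alpha
    g2 = torch.cat([g.reshape(-1) for g in g2])   # gradient of the resource objective w.r.t. alpha
    diff = g1 - g2
    lam = torch.dot(g2 - g1, g2) / (torch.dot(diff, diff) + 1e-12)
    return lam.clamp(0.0, 1.0)
```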

Require: Datasets $(\mathcal{D}_{tr}, \mathcal{Y}_{tr})$ and $(\mathcal{D}_{val}, \mathcal{Y}_{val})$, batch size $B$, perturbation size $\epsilon$, minimum constraint $L$, learning rates $\eta_{w}$ and $\eta_{\alpha}$
Ensure: Learned architecture parameter $\alpha$
1:  Randomly initialize $w$ and $\alpha$;
2:  $k \leftarrow 0$;
3:  while not converged do
4:     Sample a mini-batch of size $B$;
5:     Compute the adversarial training loss $\mathcal{L}_{train}^{adv}$ according to Eq. (3);
6:     Update $w$ according to Eq. (8);
7:     Compute the two objective functions $\mathcal{L}_{val}$ and $\mathcal{L}_{par}$ and the corresponding gradients $g_1$ and $g_2$;
8:     Compute $\lambda^{*}$ according to Eq. (11);
9:     Update $\alpha$ according to Eq. (12);
10:    $k \leftarrow k + 1$;
11: end while
Algorithm 1: E2RNAS
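As an illustration of how Algorithm 1 fits together, the following sketch assembles one search iteration from the earlier snippets; it uses the first-order approximation instead of Eq. (9) for brevity, and it assumes that `model` is a DARTS-style supernet whose mixed operations read the shared architecture tensor `alpha`, that `w_optimizer` covers only the network weights, and that `fast_fgsm_perturbation`, `resource_objective`, and `solve_lambda` are the sketches given above.

```python
import torch
import torch.nn.functional as F

def search_step(model, alpha, w_optimizer, alpha_optimizer, train_batch, val_batch,
                epsilon, step_size, lower_bound, op_param_counts):
    """One simplified E2RNAS-style iteration: adversarial update of the weights w,
    followed by an MGDA-weighted update of the architecture parameters alpha."""
    x_tr, y_tr = train_batch
    x_val, y_val = val_batch

    # Lower level: update w on adversarial examples (cf. Eqs. (3) and (8)).
    x_adv = fast_fgsm_perturbation(model, x_tr, y_tr, epsilon, step_size)
    w_optimizer.zero_grad()
    F.cross_entropy(model(x_adv), y_tr).backward()
    w_optimizer.step()

    # Upper level: gradients of the two objectives w.r.t. alpha (first-order variant).
    val_loss = F.cross_entropy(model(x_val), y_val)
    g1 = torch.autograd.grad(val_loss, [alpha])
    size_loss = resource_objective(alpha, op_param_counts, lower_bound)
    g2 = torch.autograd.grad(size_loss, [alpha])

    # Combine them with the MGDA weight and update alpha (cf. Eqs. (11) and (12)).
    lam = solve_lambda(g1, g2)
    alpha_optimizer.zero_grad()
    alpha.grad = (lam * g1[0] + (1.0 - lam) * g2[0]).detach()
    alpha_optimizer.step()
```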

Comparison between E2RNAS and DARTS

Though the proposed E2RNAS method is based on DARTS, there are two key differences between them, as shown in Figure 1. Firstly, E2RNAS adopts adversarial training to improve the robustness of the resulting neural network. Secondly, E2RNAS evaluates the model with two objectives: minimizing the validation loss for effectiveness and minimizing the number of parameters for efficiency. Therefore, E2RNAS can search an effective, efficient, and robust architecture. The whole algorithm is summarized in Algorithm 1.

4 Experiments

In this section, we empirically evaluate the proposed E2RNAS method on three image datasets: CIFAR-10 [krizhevsky2009learning], CIFAR-100 [krizhevsky2009learning], and SVHN [netzer2011reading]. Details about these datasets are presented in the Appendix.

4.1 Implementation Details

Search Space

The search space adopts the same setting as DARTS [lsy19]. There are two types of cells: the reduction cell and the normal cell. The reduction cells are located at 1/3 and 2/3 of the total depth of the network, and all other cells are normal cells. For both reduction and normal cells, there are seven nodes in each cell, including four intermediate nodes, two input nodes, and one output node. In both normal and reduction cells, the set of operations $\mathcal{O}$ contains eight operations: 3×3 and 5×5 separable convolutions, 3×3 and 5×5 dilated separable convolutions, 3×3 max pooling, 3×3 average pooling, identity, and zero. For the convolution operators, the ReLU-Conv-BN order is used.

Training Settings

Following DARTS [lsy19], half of the standard training set is used for training the model and the other half for validation. A small network of 8 cells is trained via the FGSM-based adversarial training method [fast_fgsm] in Eq. (3) with a batch size of 64 and 16 initial channels for 50 epochs. Following the setting of [fast_fgsm], the perturbation of the FGSM adversary is randomly initialized from the uniform distribution on $[-\epsilon, \epsilon]$; the perturbation size $\epsilon$ and the attack step size follow the setting of [fast_fgsm]. The SGD optimizer with momentum and weight decay is used. The proposed method is implemented in PyTorch 0.3.1 and all the experiments are conducted on Tesla V100S GPUs with 32GB CUDA memory.

Evaluation Settings

A large network of 20 cells is trained on the full training set for 600 epochs, with a batch size of 96, 36 initial channels, a cutout of length 16, a dropout probability of 0.2, and an auxiliary tower with weight 0.4. To make the model sizes comparable, we adjust the initial number of channels of each cell for both DARTS and the proposed E2RNAS method, which is denoted by "model-C{channels}". The accuracy is tested on the full testing set. Adversarial examples are generated on the testing set using the PGD attack [pgd17]; the attack takes 10 iterative steps, with the perturbation size and step size set as suggested in [pgd7].
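For reference, the PGD accuracy reported in the tables can be computed with a loop of the following form; this is a sketch that reuses the `pgd_attack` snippet from Section 2.1, with placeholder attack hyperparameters rather than the exact values used in the paper.

```python
import torch

def pgd_accuracy(model, loader, epsilon, step_size, num_steps, device):
    """Robust accuracy (%) under a multi-step PGD attack; reuses `pgd_attack` above."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y, epsilon, step_size, num_steps)
        with torch.no_grad():
            pred = model(x_adv).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return 100.0 * correct / total
```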

Architecture   Test Err. (%)↓   Params (MB)↓   PGD Acc. (%)↑   Search Cost (GPU days)↓   Search Method
DenseNet-BC [huang2017densely] 3.46 25.6 - - manual
NASNet-A [zoph2016neural] 2.65 3.3 - 2000 RL
AmoebaNet-B [real2019regularized] 2.55±0.05 2.8 - 3150 evolution
Hierarchical Evolution [liu2017hierarchical] 3.75±0.12 15.7 - 300 evolution
PNAS [liu2018progressive] 3.41±0.09 3.2 - 225 SMBO
ENAS [pham2018efficient] 2.89 4.6 - 0.5 RL
DARTS [lsy19] 2.59 3.349 6.57 0.595 gradient-based
DARTS-C28 2.68 2.061 5.42 0.595 gradient-based
DARTS-C20 3.15 1.083 3.90 0.595 gradient-based
DARTS-C12 3.09 0.416 3.08 0.595 gradient-based
P-DARTS [chen2019progressive] 2.59 3.434 8.35 0.247 gradient-based
PC-DARTS [xu2019pc] 2.65 3.635 9.53 0.426 gradient-based
E2RNAS-C46 3.64 3.383 10.21 0.836 gradient-based
E2RNAS-C36 4.19 2.102 9.61 0.836 gradient-based
E2RNAS-C25 4.86 1.042 7.76 0.836 gradient-based
E2RNAS-C16 6.03 0.449 6.76 0.836 gradient-based
Table 1: Comparison with state-of-the-art NAS methods on the CIFAR-10 dataset. ↑ indicates that a larger value is better, while ↓ indicates that a lower value is better. Some baselines are trained without the cutout augmentation, and some results are obtained with the code released by the original authors. "model-C{channels}" means the architecture searched by "model" is evaluated with the initial number of channels set to "channels". The search cost of E2RNAS is recorded on a single Tesla V100S GPU.

4.2 Analysis on Experimental Results

Search Architecture on CIFAR-10

The normal and reduction cells searched by the E2RNAS method on the CIFAR-10 dataset are presented in Figures 2 and 3, respectively. Different from DARTS [lsy19], the reduction cell in E2RNAS contains many convolution operations, while the normal cell includes only one operation with parameters (a separable convolution). Thus, the parameter size of the architecture searched by E2RNAS is lower than that of DARTS, because the stacked network contains far fewer reduction cells than normal cells.

Figure 2: The normal cell in E2RNAS learned on CIFAR-10.
Figure 3: The reduction cell in E2RNAS learned on CIFAR-10.

Architecture Evaluation on CIFAR-10

The comparison of the proposed E2RNAS method with state-of-the-art NAS methods on the CIFAR-10 dataset is shown in Table 1. Notably, E2RNAS outperforms the NAS methods in [zoph2016neural, real2019regularized, liu2017hierarchical, liu2018progressive] by finding a more lightweight architecture with a search cost that is three to four orders of magnitude lower, at the price of a slightly higher test error rate. Moreover, although ENAS [pham2018efficient] slightly outperforms E2RNAS in test accuracy and search time, it finds a deeper architecture with about double the model size (4.6MB for "ENAS" vs. 2.102MB for "E2RNAS-C36").

Dataset   Architecture   Test Err. (%)↓   Params (MB)↓   PGD Acc. (%)↑

CIFAR-100

DARTS [lsy19] 17.17 3.401 2.06
DARTS-C34 17.70 3.047 1.67
DARTS-C27 17.78 1.960 1.70
DARTS-C19 19.15 1.010 1.34
P-DARTS [chen2019progressive] 15.67 3.485 4.58
PC-DARTS [xu2019pc] 16.66 3.687 4.29
E2RNAS-C38 19.30 3.459 4.90
E2RNAS-C36 19.19 3.120 4.00
E2RNAS-C29 19.80 2.075 3.78
E2RNAS-C20 22.97 1.041 3.44

SVHN

DARTS [lsy19] 2.16 3.449 46.78
DARTS-C34 2.18 2.998 41.32
DARTS-C28 2.13 2.061 35.35
DARTS-C20 2.16 1.083 40.38
P-DARTS [chen2019progressive] 2.12 3.433 49.11
PC-DARTS [xu2019pc] 2.20 3.635 54.81
E2RNAS-C39 2.21 3.421 44.15
E2RNAS-C36 2.14 2.935 53.82
E2RNAS-C30 2.13 2.075 52.38
E2RNAS-C21 2.21 1.062 54.96
Table 2: Comparison with state-of-the-art NAS methods on the CIFAR-100 and SVHN datasets. ↑ indicates that a larger value is better, while ↓ indicates that a lower value is better. Some results are obtained with the code released by the original authors. "model-C{channels}" means the architecture searched by "model" is evaluated with the initial number of channels set to "channels".
Method   Test Err. (%)↓   Params (MB)↓   PGD Acc. (%)↑
E2RNAS 4.19 2.102 9.61
 w/o  adv 2.75 3.733 10.35
 w/o  adv (C27) 2.84 2.148 8.91
 w/o adv ($\lambda = 0$) 7.95 1.370 4.00
 w/o  nop 8.29 1.370 5.21
 w/o  MGDA 5.48 2.105 8.11
 w/o  L 8.30 1.370 4.39
Table 3: Ablation study on the CIFAR-10 dataset. "adv" means using adversarial training in the lower-level problem of problem (7); "nop" indicates adding the resource constraint to the upper-level problem of problem (7); "MGDA" denotes using MGDA to make a trade-off between the accuracy and the model size, and without MGDA, equal weights of the two objectives in problem (7) are used ($\lambda^{*} = 0.5$ in Eq. (12)); "L" is the lower bound of the number of parameters. ↑ indicates that a larger value is better, while ↓ indicates that a lower value is better. "C27" means the initial number of channels in the architecture evaluation is changed to 27 instead of the default 36.

Compared to the original DARTS [lsy19], "E2RNAS-C36" significantly improves the robustness with a lower model size and a comparable search cost, while the classification error increases slightly. Some studies [raghunathan2019adversarial, yang2020closer] show that increased robustness is usually accompanied by decreased test accuracy. Therefore, the increased test error of E2RNAS results from the improved robustness and the decreased parameter size, which indicates that E2RNAS makes a better trade-off among these three goals than DARTS.

Besides, both P-DARTS [chen2019progressive] and PC-DARTS [xu2019pc] search for a deeper architecture with less search cost than "E2RNAS-C36", so they slightly outperform it in the test error rate with competitive PGD accuracy. Applying the E2RNAS method to P-DARTS and PC-DARTS to make a trade-off among multiple objectives (accuracy, robustness, and the number of parameters) is left for future work.

To further compare the performance of E2RNAS and DARTS, we change the initial number of channels in the architecture evaluation for both methods to keep roughly similar model sizes. According to the results shown in Table 1, E2RNAS remarkably improves the robustness with comparable classification accuracy. For example, comparing "E2RNAS-C46" with "DARTS", the PGD accuracy increases by about 1.6 times, while the test error increases by only around 0.9%.

In summary, the experimental results in Table 1 show that E2RNAS can search significantly more robust architectures with a lower model size and comparable classification accuracy, compared with state-of-the-art NAS methods.

Architecture Evaluation on CIFAR-100 and SVHN

The comparison of E2RNAS with DARTS on the CIFAR-100 and SVHN datasets is presented in Table 2. The performance of E2RNAS on the CIFAR-100 dataset is similar to that on the CIFAR-10 dataset, in that E2RNAS can search a robust architecture with a lower model size and a slightly decreased test accuracy. For example, compared to DARTS, "E2RNAS-C36" reduces the number of parameters by about 0.3MB and improves the PGD accuracy by nearly two times, though the test error increases slightly (by about 2%). In addition, E2RNAS shows excellent results on the SVHN dataset. It not only significantly improves the robustness but also achieves competitive test accuracy with a lower parameter size. For instance, compared to DARTS, "E2RNAS-C36" reduces the model size by about 15% and increases the PGD accuracy, while keeping competitive performance. Therefore, these quantitative experiments indicate that E2RNAS can search robust architectures with a lower model size and comparable performance.

Figure 4: Architecture evaluation of E2RNAS on the CIFAR-10 dataset using different values of the minimum constraint $L$ in Eq. (6): (a) test error, (b) the number of parameters, and (c) PGD accuracy.

4.3 Ablation Study

In this section, we study how each design choice in E2RNAS influences its performance on different objectives. The corresponding results are presented in Table 3. The adversarial training (abbreviated as adv) in the lower-level problem of problem (7) transforms training data into adversarial examples and aims to learn a robust model given an architecture. The resource constraint (abbreviated as nop) in the upper-level problem of problem (7) constrains the parameter size of the searched architecture. The multiple-gradient descent algorithm (abbreviated as MGDA) is applied to solve the upper-level problem of problem (7), which is a multi-objective problem that minimizes both the validation loss and the model size. Without MGDA, the upper-level problem is solved by minimizing an equally weighted sum of the two objectives ($\lambda^{*} = 0.5$ in Eq. (12)). The lower bound (abbreviated as L) of the number of parameters prevents the model from searching over-simplified architectures.

Impact of Adversarial Training

The adversarial training, which trains a neural network on adversarial examples, is an effective method for improving the robustness of a neural network. Thus, we apply it in the lower-level problem of problem (7) and hope the searched architecture can defense adversarial attacks. Here we discuss two impacts of the adversarial training in details.

Firstly, using adversarial training tends to reduce the number of parameters, which may lead to worse accuracy. We notice that the parameter size of the architecture searched by DARTS with adversarial training ("E2RNAS w/o nop" in Table 3) is only 1.37MB, which means that the searched architecture contains many parameter-free operations. Therefore, it has a larger test error because of its over-simplified architecture, although its PGD accuracy is higher than that of DARTS with a comparable model size ("DARTS-C20" in Table 1). Besides, compared with E2RNAS, the model size of "E2RNAS w/o adv" increases by 1.631MB, which indicates that adversarial training significantly decreases the number of parameters. However, this can be alleviated by constraining the parameter size with the lower bound $L$.

Secondly, using adversarial training can help E2RNAS make a trade-off between robustness and accuracy. We notice that adversarial training significantly influences the model size. Therefore, to make a fair comparison, we set the initial number of channels of "E2RNAS w/o adv" to 27 ("E2RNAS w/o adv (C27)") in the architecture evaluation to keep its model size roughly similar to that of E2RNAS. The result in Table 3 shows that "E2RNAS" has better robustness but lower accuracy than "E2RNAS w/o adv (C27)".

Therefore, using adversarial training can help E2RNAS to make a trade-off among multiple objectives and search a robust architecture with a lower model size.

Effectiveness of MGDA

MGDA is used to solve the upper-level problem of problem (7). We quantitatively compare the performance of E2RNAS with and without MGDA ("E2RNAS" vs. "E2RNAS w/o MGDA" in Table 3) and find that solving it with MGDA achieves much better results in terms of the test accuracy, parameter size, and PGD accuracy. So, instead of using equal weights, MGDA can find a good weighting of the objectives and make a trade-off among them.

Necessity of the Lower Bound $L$

We find that training E2RNAS without the minimum constraint ("E2RNAS w/o L" in Table 3) searches an architecture with many parameter-free operations (its parameter size is only 1.370MB). There are three reasons for this phenomenon. Firstly, the instability of DARTS sometimes makes it converge to extreme architectures (e.g., architectures full of skip-connections) [zela2019understanding, chen2019progressive]. Secondly, as discussed above, using adversarial training in the lower-level problem tends to reduce the number of parameters. Finally, only optimizing the number of parameters in the upper-level problem ("E2RNAS w/o adv ($\lambda = 0$)" in Table 3) also results in searched architectures with many parameter-free operations. Therefore, it is necessary to constrain the number of parameters with a lower bound $L$ to prevent E2RNAS from searching over-simplified architectures.

Figure 4 shows the architecture evaluation results of E2RNAS on the CIFAR-10 dataset using different values of $L$ in Eq. (6). We set this hyperparameter to 1 in this work because E2RNAS achieves the best performance (the lowest test error rate in Figure 4(a), an acceptable model size in Figure 4(b), and the highest PGD accuracy in Figure 4(c)) when $L = 1$.

5 Conclusions

In this paper, we propose the E2RNAS method, which optimizes multiple objectives simultaneously to search for an effective, efficient, and robust architecture. The proposed objective function is formulated as a bi-level multi-objective problem, and we design an algorithm that integrates MGDA with bi-level optimization. Experiments demonstrate that E2RNAS can find adversarially robust architectures with optimized model sizes and comparable classification accuracy on various datasets. In future work, we are interested in extending the proposed E2RNAS method to search for multiple Pareto-optimal architectures at the same time.

References