Deep learning has achieved great success in many areas, such as computer vision, natural language processing, speech, and gaming. The design of the neural network architecture is important for such success. However, such design relies heavily on the knowledge and experience of experts, and even experienced experts cannot guarantee an optimal architecture. Therefore, Neural Architecture Search (NAS), which aims to design neural network architectures in an automated way, has attracted great attention in recent years. NAS has demonstrated the capability to find neural network architectures with state-of-the-art performance in various tasks [emh19, lsy19, tcpvshl19, xzll19]. Search strategies in NAS are based on several techniques, including reinforcement learning [pham2018efficient, zoph2016neural], evolutionary algorithms [liu2017hierarchical, real2019regularized], Bayesian optimization, and gradient descent [lsy19, xu2019pc, chen2019progressive]. As a representative of gradient-descent-based NAS methods, the Differentiable ARchiTecture Search (DARTS) method [lsy19] has become popular because of its good performance and low search cost.
However, those NAS methods are typically designed to optimize only the accuracy during the architecture search process while neglecting other significant objectives, which results in very limited application scenarios. For example, a deep neural network with high computational and storage demands is difficult to deploy to embedded devices (e.g., mobile phones and IoT devices), where resources are limited. Besides, the robustness of deep neural networks is also important. It is well known that trained neural networks are easily misled by adversarial examples [fgsm15, pgd17, fast_fgsm], which makes them risky to deploy in real-world applications. For example, a spammer can easily bypass an anti-spam email filter by adding some special characters as perturbations, and a self-driving car may fail to recognize a guideboard correctly after adversarial patches are attached to it.
Therefore, multi-objective NAS has drawn great attention recently because we need to consider more than performance when NAS meets real-world applications [cztl20, emh19, tcpvshl19]. In [jin2019rc, emh19, cztl20], the model size and computational cost are considered to satisfy resource constraints. Besides, some works [Dong20, RobNets] search for differentiable architectures that can defend against adversarial attacks. However, to the best of our knowledge, there is no work that simultaneously optimizes all three objectives: performance, robustness, and resource consumption.
To fill this gap, this paper proposes an Effective, Efficient, and Robust Neural Architecture Search method (E2RNAS) to balance the trade-off among multiple objectives. Built on DARTS, the proposed E2RNAS method formulates the entire objective function as a bi-level multi-objective optimization problem, where the upper-level problem is a multi-objective optimization problem; this can be viewed as an extension of the objective function proposed in DARTS. To the best of our knowledge, there is little work on solving such bi-level multi-objective optimization problems with gradient descent techniques. To solve this problem, we propose an optimization algorithm that combines the multiple gradient descent algorithm (MGDA) [desideri12] and the bi-level optimization algorithm [colson2007overview].
Specifically, the contributions of this paper are three-fold.
We propose the E2RNAS method for searching effective, efficient and robust network architectures, leading to a practical DARTS-based framework for multi-objective NAS.
We formulate the objective function of the E2RNAS method as a novel bi-level multi-objective optimization problem and propose an efficient algorithm to solve it.
Experiments on benchmark datasets show that the proposed E2RNAS method can find adversarially robust architectures with optimized model size and comparable classification accuracy.
2 Related Works
2.1 Adversarial Attack and Defence
Deep neural networks are not robust when facing adversarial attacks [sze14]. Most adversarial attacks are white-box attacks, which assume attack algorithms can access all configurations of the trained neural network, including the architecture and model weights.
Fast Gradient Sign Method
Goodfellow et al. [fgsm15] propose the Fast Gradient Sign Method (FGSM) for generating adversarial examples. It directly uses the sign of the gradient of the loss function with respect to the input as the direction of the adversarial perturbation:
$x_{adv} = x + \epsilon \cdot \mathrm{sign}(\nabla_x \ell(\theta, x, y)),$
where $x$ is the original input, $\epsilon$ is a small scalar representing the strength of the perturbation, $\theta$ denotes the parameters of the victim model, $y$ is the original ground-truth label for input $x$, $\mathrm{sign}(\cdot)$ denotes the elementwise sign function, $\ell$ denotes the loss function used for training the victim model, and $\nabla_x \ell$ denotes its gradient with respect to $x$.
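As an illustration, the FGSM step above can be sketched in a few lines of PyTorch. This is a minimal sketch, not the paper's implementation; the `fgsm_attack` helper and the linear model in the usage below are our own illustrative choices:

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps):
    # Compute the loss gradient with respect to the input x (not the weights).
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    # Step by eps along the elementwise sign of the gradient.
    return (x + eps * grad.sign()).detach()
```

Since every entry of the sign is in {-1, 0, 1}, each input coordinate moves by at most $\epsilon$, so the adversarial example stays inside the $\ell_\infty$ ball of radius $\epsilon$ around $x$.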
Projected Gradient Descent (PGD)
Instead of generating one-step perturbations as in FGSM, Kurakin et al. [pgd17] propose the PGD method, which applies a small number of iterative steps. To ensure the perturbation stays in the $\epsilon$-neighborhood of the original image, the PGD method clips the intermediate result after each iteration:
$x^{t+1} = \mathrm{clip}_{[x - \epsilon,\, x + \epsilon]}\big(x^{t} + \alpha \cdot \mathrm{sign}(\nabla_x \ell(\theta, x^{t}, y))\big),$
where $x^{t}$ is the perturbed input generated in the $t$-th step, $\alpha$ is the attack step size, and $\mathrm{clip}_{[a, b]}(\cdot)$ elementwise clips its input to lie within the interval $[a, b]$.
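A corresponding PGD sketch in PyTorch follows. The `pgd_attack` helper is hypothetical; for brevity it omits clipping to a valid pixel range:

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps, alpha, steps):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv = x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # One signed-gradient step of size alpha...
            x_adv = x_adv + alpha * grad.sign()
            # ...then project back into the eps-neighborhood of x.
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)
    return x_adv.detach()
```

The projection after every step is what distinguishes PGD from simply repeating FGSM: the accumulated perturbation can never leave the $\epsilon$-ball.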
Adversarial training is an effective method for defending against adversarial attacks [fgsm15, pgd17, pgd7, fast_fgsm]. Goodfellow et al. [fgsm15] leverage the FGSM as a regularizer to train deep neural networks and make the model more resistant to adversarial examples. Wong et al. [fast_fgsm] use FGSM adversarial training with random initialization for the perturbation. Their method speeds up the adversarial training process while being as effective as PGD-based adversarial training.
2.2 Multi-Objective Optimization
Multi-objective optimization aims to optimize more than one objective function simultaneously. Among different techniques to solve multi-objective problems, we are interested in gradient-based multi-objective optimization algorithms [desideri12, fliege2000steepest, schaffler2002stochastic], which leverage the Karush-Kuhn-Tucker (KKT) conditions [kuhn2014nonlinear] to find a common descent direction for all objectives.
In this paper, we utilize one such method, MGDA [desideri12]. With $m$ objective functions $f_1(\theta), \dots, f_m(\theta)$ to be minimized, MGDA is an iterative method that first solves the following quadratic programming problem:
$\min_{\lambda_1, \dots, \lambda_m} \Big\| \sum_{i=1}^{m} \lambda_i \nabla f_i(\theta) \Big\|_2^2 \quad \mathrm{s.t.}\ \sum_{i=1}^{m} \lambda_i = 1,\ \lambda_i \ge 0\ \ \forall i, \qquad (1)$
where $\|\cdot\|_2$ denotes the $\ell_2$ norm of a vector and $\lambda_i$ can be viewed as a weight for the $i$-th objective, and then minimizes the weighted objective $\sum_{i=1}^{m} \lambda_i f_i(\theta)$ with respect to $\theta$. When convergent, MGDA finds a Pareto-stationary solution.
2.3 Multi-Objective NAS
Most NAS methods focus on searching architectures with the best accuracy. However, in real-world applications, other factors, such as model size and robustness, must be considered. To take those factors into consideration, several works on multi-objective NAS have been proposed in recent years. LEMONADE [emh19] considers two objectives: maximizing the validation accuracy and minimizing the number of parameters. It is based on an evolutionary algorithm, and thus its search cost is quite high. MnasNet [tcpvshl19] uses a reinforcement learning approach to optimize both the accuracy and inference latency when searching the architecture. Chen et al. [cztl20] perform neural architecture search based on reinforcement learning to optimize three objectives: maximizing the validation accuracy, minimizing the number of parameters, and minimizing the number of FLOPs. FBNet [wu2019fbnet] also considers both accuracy and model latency when searching the architecture, via a gradient-based method that solves the corresponding multi-objective problem. Built on DARTS, RC-DARTS [jin2019rc] searches for architectures with high accuracy while constraining the number of parameters of the searched architecture below a threshold. Its objective function is therefore formulated as a constrained optimization problem, and a projected gradient descent method is proposed to solve it. Based on DARTS, GOLD-NAS [bi2020gold] considers three objectives (maximizing the validation accuracy, minimizing the number of parameters, and minimizing the number of FLOPs) and, by enlarging the search space, proposes a one-level optimization algorithm instead of bi-level optimization.
3 The E2RNAS Method
In this section, we present the proposed E2RNAS method. We first give an overview of the DARTS method, then introduce how to achieve robustness, and formulate the objective that constrains the number of parameters in the searched architecture. Finally, we present the bi-level multi-objective problem of the proposed E2RNAS method as well as its optimization.
3.1 Preliminary: DARTS
DARTS [lsy19] aims to learn a Directed Acyclic Graph (DAG) called a cell, which can be stacked to form a neural network architecture. Each cell consists of $N$ nodes $\{x_1, \dots, x_N\}$, each of which denotes a hidden representation. Let $\mathcal{O}$ denote a discrete operation space. The edge $(i, j)$ of the DAG represents an operation function $o(\cdot) \in \mathcal{O}$ (e.g., skip connection or pooling) performed at node $x_j$, with a probability $p_o^{(i,j)} = \exp(\alpha_o^{(i,j)}) / \sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})$. Therefore, we can formulate each edge as a weighted sum combining all the operations in $\mathcal{O}$: $\bar{o}^{(i,j)}(x_i) = \sum_{o \in \mathcal{O}} p_o^{(i,j)}\, o(x_i)$. An intermediate node is the sum of its predecessors, i.e., $x_j = \sum_{i < j} \bar{o}^{(i,j)}(x_i)$. The output of the cell, node $x_N$, is the concatenation of the outputs of all intermediate nodes, excluding the two input nodes $x_1$ and $x_2$. Therefore, $\alpha = \{\alpha^{(i,j)} \mid (i,j) \in E\}$ parameterizes the searched architecture, where $E$ denotes the set of all the edges from all the cells.
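The mixed edge computation above can be sketched in PyTorch. The toy operation set below (identity, tanh, ReLU) is an illustrative stand-in for the real convolution and pooling candidates:

```python
import torch
import torch.nn.functional as F

# Toy candidate operations standing in for DARTS's conv/pooling/skip choices.
ops = [torch.nn.Identity(), torch.nn.Tanh(), torch.nn.ReLU()]
# One architecture-parameter vector alpha for a single edge.
alpha = torch.zeros(len(ops), requires_grad=True)

def mixed_op(x):
    # Softmax over alpha yields the mixing probabilities p_o.
    p = F.softmax(alpha, dim=0)
    # The edge output is the probability-weighted sum of all candidate ops.
    return sum(p_i * op(x) for p_i, op in zip(p, ops))
```

Because `mixed_op` is differentiable in `alpha`, the architecture parameters can be optimized by gradient descent alongside the network weights, which is the core idea of DARTS.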
Let $\mathcal{D}_{tr}$ denote the training dataset and $\mathcal{Y}_{tr}$ denote the corresponding set of labels. Similarly, the validation dataset and labels are denoted by $\mathcal{D}_{val}$ and $\mathcal{Y}_{val}$. We use $w$ to denote all the weights of the neural network and $\ell$ to denote the loss function. DARTS solves a bi-level optimization problem:
$\min_{\alpha} \mathcal{L}_{val}(w^*(\alpha), \alpha) \quad \mathrm{s.t.}\ w^*(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha), \qquad (2)$
where $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ represent the training and validation losses, respectively. Here $\min_{\alpha} \mathcal{L}_{val}(w^*(\alpha), \alpha)$ is called the upper-level problem and $w^*(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha)$ is called the lower-level problem.
When the search procedure finishes, the final architecture is determined by keeping, on each edge, the operation with the largest probability, i.e., $o^{(i,j)} = \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$.
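This discretization step amounts to an argmax over each edge's architecture parameters; a minimal sketch (the helper and operation names are illustrative):

```python
import torch

def discretize(alpha, op_names):
    """Keep, on each edge, the operation with the largest architecture weight."""
    return [op_names[int(row.argmax())] for row in alpha]
```

Here `alpha` is a matrix with one row of architecture weights per edge, and the result is the list of chosen operation names.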
3.2 Adversarial Training for Robustness
In E2RNAS, we expect the searched architecture to be robust, meaning that for a trained model with the searched architecture, its performance is stable when perturbations are added to the input data. To improve the robustness of the searched architecture, we leverage the adversarial training method in [fast_fgsm] to train a robust model.
Following [fast_fgsm], for each sample $x$ and its corresponding label $y$, we can generate a perturbation $\delta$ for $x$ using one single step:
$\delta = \mathrm{clip}_{[-\epsilon, \epsilon]}\big(\delta_0 + \alpha \cdot \mathrm{sign}(\nabla_x \ell(w, x + \delta_0, y))\big),$
where $\epsilon$ is the perturbation size, $\delta_0$ is randomly initialized from a uniform distribution on the interval $[-\epsilon, \epsilon]$, and $\alpha$ is the attack step size. Therefore, we generate the adversarial instance as $x_{adv} = x + \delta$. Obviously, FGSM is a special case of this method when $\delta_0$ is initialized with zero and $\alpha = \epsilon$. This FGSM-based adversarial training method with random initialization for $\delta_0$ [fast_fgsm] can effectively defend against the PGD adversarial attack [pgd17], while not adding much computational cost to the architecture search procedure.
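A sketch of this one-step perturbation in PyTorch (the `fast_fgsm_delta` helper is our own illustrative name; clipping to a valid input range is omitted):

```python
import torch

def fast_fgsm_delta(model, loss_fn, x, y, eps, alpha):
    # Random start: uniform initialization inside the eps-ball.
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    loss = loss_fn(model(x + delta), y)
    grad = torch.autograd.grad(loss, delta)[0]
    # One FGSM step of size alpha, then clip back into [-eps, eps].
    return torch.clamp(delta + alpha * grad.sign(), -eps, eps).detach()
```

Setting `delta` to zero at initialization and `alpha = eps` recovers plain FGSM, matching the special case noted above.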
We use these perturbed data to learn the network parameters so that the trained model can defend against adversarial attacks. Therefore, we aim to minimize the training loss on the perturbed data:
$\mathcal{L}_{train}^{adv}(w, \alpha) = \frac{1}{|\mathcal{D}_{tr}|} \sum_{(x, y)} \ell(w, x + \delta, y), \qquad (3)$
where the sum is over all pairs $(x, y)$ from $\mathcal{D}_{tr}$ and $\mathcal{Y}_{tr}$. Note that this adversarial training method trains the model only on adversarial examples, which is different from the FGSM-based adversarial training method [fgsm15] that uses them as a regularization term for training.
3.3 Objective Function of Resource Constraints
Architectures with a small number of parameters can be deployed in more application scenarios, including resource-constrained mobile devices. Therefore, we regard the resource constraint as one of the desired objectives.
Following DARTS [lsy19], we determine the operation on each edge of the final architecture as the one with the largest probability. So the number of parameters of an architecture can be computed as
$P(\alpha) = \sum_{(i,j) \in E} C\big(\arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}\big), \qquad (4)$
where $C(o)$ denotes the number of parameters corresponding to the operation $o$.
Note that the $\arg\max$ in Eq. (4) is a non-differentiable operation, making the computation of the gradient of $P(\alpha)$ with respect to $\alpha$ infeasible. To make this operation differentiable, we use the softmax trick to approximate the $\arg\max$ and formulate the approximation as
$\tilde{P}(\alpha) = \sum_{(i,j) \in E} \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}\, C(o). \qquad (5)$
Furthermore, to prevent the model from searching over-simplified architectures (i.e., ones containing too many parameter-free operations), which lead to unsatisfactory performance, we add a lower bound $L$ to the parameter size in Eq. (5). Therefore, the objective function of the resource constraint can be formulated as
$f_{nop}(\alpha) = \max\{\tilde{P}(\alpha), L\}. \qquad (6)$
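The relaxed parameter count and its lower bound can be sketched as follows. The per-operation parameter counts below are illustrative placeholders, and the `max`-based bound reflects one plausible reading of how the lower bound enters the objective:

```python
import torch
import torch.nn.functional as F

# Illustrative per-operation parameter counts C(o) (e.g., in MB) for the ops:
# zero, skip_connect, sep_conv_3x3, sep_conv_5x5
op_params = torch.tensor([0.0, 0.0, 0.05, 0.12])

def expected_param_size(alpha):
    """Differentiable softmax relaxation of the argmax-based parameter count."""
    p = F.softmax(alpha, dim=-1)       # one row of probabilities per edge
    return (p * op_params).sum()

def resource_objective(alpha, lower_bound):
    # Below the lower bound L the objective is flat, so gradient descent
    # stops pushing the architecture toward parameter-free operations.
    return torch.clamp(expected_param_size(alpha), min=lower_bound)
```

Because `expected_param_size` is a smooth function of `alpha`, its gradient can be combined with the validation-loss gradient in the multi-objective update.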
Different from RC-DARTS [jin2019rc] that directly adds the resource constraint into the original DARTS objective function (2) as a constraint and formulates the objective function as a constrained optimization problem, here we take it as an objective function.
3.4 Bi-level Multi-Objective Formulation
E2RNAS aims to search for the architecture parameter $\alpha$ that minimizes the validation loss for effectiveness and the number of parameters for efficiency, while achieving robustness via adversarial training. Thus, we combine Eqs. (3) and (6) as well as the adversarial training to formulate the entire objective function as
$\min_{\alpha} \big(\mathcal{L}_{val}(w^*(\alpha), \alpha),\ f_{nop}(\alpha)\big) \quad \mathrm{s.t.}\ w^*(\alpha) = \arg\min_{w} \mathcal{L}_{train}^{adv}(w, \alpha). \qquad (7)$
Problem (7) is similar to the bi-level optimization problem (2) in DARTS: the lower-level problem (learning $w^*(\alpha)$) is similar, but there is a significant difference in that the upper-level problem (optimizing $\alpha$) contains two objectives. So problem (7) is a bi-level multi-objective optimization problem, which is a generalization of problem (2) in DARTS. There are few works on bi-level multi-objective optimization [calvete2010linear, deb2009solving, zhang2012improved, ruuska2012constructing] and, to the best of our knowledge, the optimization algorithm introduced in the following is the first gradient-based algorithm for solving general bi-level multi-objective optimization problems.
Problem (7) can be understood as a two-stage optimization. Firstly, given an architecture parameter $\alpha$, we can learn a robust model with optimal weights $w^*(\alpha)$ via empirical risk minimization on adversarial examples. Secondly, given $w^*(\alpha)$, the architecture parameter $\alpha$ is updated on the validation dataset by trading off performance against model size. Therefore, we can solve problem (7) in two stages, which are described as follows.
Given the architecture parameter $\alpha$, $w$ can be simply updated as
$w_{t+1} = w_t - \eta \nabla_w \mathcal{L}_{train}^{adv}(w_t, \alpha), \qquad (8)$
where $t$ denotes the index of the iteration and $\eta$ denotes the learning rate.
After obtaining $w$, we can optimize the upper-level problem to update the architecture parameter $\alpha$. As the upper-level problem is a multi-objective optimization problem, we adopt MGDA to solve it. In MGDA, we first need to solve problem (1), which requires computing the gradients of the two objectives with respect to $\alpha$. The gradient of $f_{nop}(\alpha)$ with respect to $\alpha$ is easy to compute, while the gradient of $\mathcal{L}_{val}(w^*(\alpha), \alpha)$ is more complicated, as $w^*(\alpha)$ is also a function of $\alpha$ and it is too expensive to obtain $w^*(\alpha)$ exactly. Therefore, we use a second-order approximation:
$\nabla_\alpha \mathcal{L}_{val}(w^*(\alpha), \alpha) \approx \nabla_\alpha \mathcal{L}_{val}(w', \alpha) - \eta\, \nabla^2_{\alpha, w} \mathcal{L}_{train}^{adv}(w, \alpha)\, \nabla_{w'} \mathcal{L}_{val}(w', \alpha), \qquad (9)$
where $w' = w - \eta \nabla_w \mathcal{L}_{train}^{adv}(w, \alpha)$ denotes the weights after a one-step update. Obviously, when $\eta = 0$, $w'$ becomes $w$ and Eq. (9) degenerates to the first-order approximation, which speeds up the gradient computation and reduces the memory cost but leads to worse performance [lsy19]. So we use the second-order approximation in Eq. (9). Then, since the upper-level problem of problem (7) has only two objectives, we can simplify problem (1) to a one-dimensional quadratic function of $\lambda$:
$\min_{\lambda \in [0, 1]} \|\lambda g_1 + (1 - \lambda) g_2\|_2^2, \qquad (10)$
where $g_1$ and $g_2$ denote the gradients of the two objectives with respect to $\alpha$, respectively. Here $\lambda$ can be viewed as the weight for the first objective and $1 - \lambda$ as the weight for the second objective. It is easy to show that problem (10) has an analytical solution:
$\lambda^* = \mathrm{clip}_{[0, 1]}\Big(\frac{(g_2 - g_1)^\top g_2}{\|g_1 - g_2\|_2^2}\Big). \qquad (11)$
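For two objectives this closed form is a few lines of PyTorch. The helper names below are our own illustrative choices:

```python
import torch

def mgda_two_task_weight(g1, g2):
    """Minimize ||lam*g1 + (1-lam)*g2||^2 over lam in [0, 1] in closed form."""
    diff = g1 - g2
    denom = diff.dot(diff)
    if denom.item() == 0.0:
        return torch.tensor(0.5)   # identical gradients: any weight is optimal
    lam = torch.dot(g2 - g1, g2) / denom
    return lam.clamp(0.0, 1.0)

def mgda_update(alpha, g1, g2, lr):
    lam = mgda_two_task_weight(g1, g2)
    # Step along the common descent direction of both objectives.
    return alpha - lr * (lam * g1 + (1.0 - lam) * g2)
```

Two sanity checks: for orthogonal gradients of equal norm the weights are equal, and for exactly opposing gradients the combined direction vanishes, i.e., the point is Pareto-stationary.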
After that, we can update $\alpha$ by taking a step along the weighted descent direction:
$\alpha_{t+1} = \alpha_t - \mu \big(\lambda^* g_1 + (1 - \lambda^*) g_2\big), \qquad (12)$
where $\mu$ denotes the learning rate for $\alpha$.
Comparison between E2RNAS and DARTS
Though the proposed E2RNAS method is based on DARTS, there are two key differences between them, which are shown in Figure 1. Firstly, E2RNAS adopts adversarial training to improve the robustness of the corresponding neural network. Secondly, E2RNAS evaluates the model with two objectives: minimizing the validation loss for effectiveness and the number of parameters for efficiency. Therefore, E2RNAS can search for an effective, efficient, and robust architecture. The whole algorithm is summarized in Algorithm 1.
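The alternating search procedure can be illustrated end-to-end on toy scalar objectives. Everything below (the losses, learning rates, and iteration count) is an illustrative stand-in for the real networks and datasets, showing only the structure of the two-stage update:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: w plays the role of model weights, alpha of architecture params.
torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
alpha = torch.zeros(3, requires_grad=True)

def adv_train_loss(w, alpha):     # stands in for the adversarial training loss
    return ((w - 1.0) ** 2).sum() + 0.1 * F.softmax(alpha, 0)[1]

def val_loss(w, alpha):           # stands in for the validation loss
    return ((w.sum() - 1.0) ** 2) * F.softmax(alpha, 0)[0]

def param_size(alpha):            # stands in for the resource objective
    return (F.softmax(alpha, 0) * torch.tensor([0.0, 0.1, 0.2])).sum()

eta, mu = 0.1, 0.1
for _ in range(5):
    # Stage 1: lower-level gradient step on w (Eq. (8)).
    gw = torch.autograd.grad(adv_train_loss(w, alpha), w)[0]
    w = (w - eta * gw).detach().requires_grad_(True)
    # Stage 2: MGDA-weighted upper-level step on alpha (Eqs. (10)-(12)).
    g1 = torch.autograd.grad(val_loss(w, alpha), alpha)[0]
    g2 = torch.autograd.grad(param_size(alpha), alpha)[0]
    diff = g1 - g2
    if float(diff.dot(diff)) == 0.0:
        lam = 0.5
    else:
        lam = float((torch.dot(g2 - g1, g2) / diff.dot(diff)).clamp(0, 1))
    alpha = (alpha - mu * (lam * g1 + (1 - lam) * g2)).detach().requires_grad_(True)
```

One iteration thus interleaves an adversarial weight update with a two-objective architecture update, exactly mirroring the two stages described above.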
4 Experiments
In this section, we empirically evaluate the proposed E2RNAS method on three image datasets: CIFAR-10 [krizhevsky2009learning], CIFAR-100 [krizhevsky2009learning], and SVHN [netzer2011reading]. Details about these datasets are presented in the Appendix.
4.1 Implementation Details
The search space adopts the same setting as DARTS [lsy19]. There are two types of cells: the reduction cell and the normal cell. The reduction cells are located at 1/3 and 2/3 of the total depth of the network, and all other cells are normal cells. For both reduction and normal cells, there are seven nodes in each cell, including four intermediate nodes, two input nodes, and one output node. In both normal and reduction cells, the set of operations $\mathcal{O}$ contains eight operations: $3 \times 3$ and $5 \times 5$ separable convolutions, $3 \times 3$ and $5 \times 5$ dilated separable convolutions, $3 \times 3$ max pooling, $3 \times 3$ average pooling, identity, and zero. For the convolution operations, the ReLU-Conv-BN order is used.
Following DARTS [lsy19], half of the standard training set is used for training the model and the other half for validation. A small network of 8 cells is trained via the FGSM-based adversarial training method [fast_fgsm] in Eq. (3) with a batch size of 64 and 16 initial channels for 50 epochs. Following the setting of [fast_fgsm], the perturbation of the FGSM adversary is randomly initialized from the uniform distribution on $[-\epsilon, \epsilon]$, and the perturbation size $\epsilon$ and attack step size $\alpha$ follow [fast_fgsm]. The SGD optimizer with momentum and weight decay is used. The proposed method is implemented in PyTorch 0.3.1, and all the experiments are conducted on Tesla V100S GPUs with 32GB of CUDA memory.
A large network of 20 cells is trained on the full training set for 600 epochs, with a batch size of 96, 36 initial channels, a cutout of length 16, a dropout probability of 0.2, and an auxiliary tower of weight 0.4. To make the model sizes comparable, we adjust the number of initial channels of each cell for both DARTS and the proposed E2RNAS method, denoted by the suffix "C" followed by the number of initial channels (e.g., "E2RNAS-C36"). The accuracy is tested on the full testing set. Adversarial examples are generated on the testing set using the PGD attack [pgd17] with perturbation size $\epsilon$. The PGD attack takes 10 iterative steps with the step size suggested in [pgd7].
| Architecture | Test Err. (%) | Params (MB) | PGD Acc. (%) | Search Cost (GPU days) | Search Method |
| --- | --- | --- | --- | --- | --- |
| Hierarchical Evolution [liu2017hierarchical] | 3.75±0.12 | 15.7 | - | 300 | evolution |
4.2 Analysis on Experimental Results
Search Architecture on CIFAR-10
The normal and reduction cells searched by the E2RNAS method on the CIFAR-10 dataset are presented in Figures 2 and 3, respectively. Different from DARTS [lsy19], the reduction cell in E2RNAS contains many convolution operations, while the normal cell includes only one parameterized operation (a separable convolution). Since the network contains far more normal cells than reduction cells, the parameter size of the architecture searched by E2RNAS is lower than that of DARTS.
Architecture Evaluation on CIFAR-10
The comparison of the proposed E2RNAS method with state-of-the-art NAS methods on the CIFAR-10 dataset is shown in Table 1. Notably, E2RNAS outperforms the NAS methods in [zoph2016neural, real2019regularized, liu2017hierarchical, liu2018progressive] by searching for a more lightweight architecture with a search cost lower by three to four orders of magnitude and only a slightly higher test error rate. Moreover, although ENAS [pham2018efficient] slightly outperforms E2RNAS in test accuracy and search time, it finds a deeper architecture with about double the model size (4.6MB for "ENAS" vs. 2.102MB for "E2RNAS-C36").
| Dataset | Architecture | Test Err. (%) | Params (MB) | PGD Acc. (%) |
| Method | adv | nop | MGDA | L | Test Err. (%) | Params (MB) | PGD Acc. (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| w/o adv (C27) | | | | | 2.84 | 2.148 | 8.91 |
| w/o adv () | | | | | 7.95 | 1.370 | 4.00 |
Compared to the original DARTS [lsy19], "E2RNAS-C36" significantly improves robustness with a smaller model size and comparable search cost, while the classification error increases slightly. Some studies [raghunathan2019adversarial, yang2020closer] show that increased robustness is usually accompanied by decreased test accuracy. Therefore, the increased test error of E2RNAS results from the improved robustness and the decreased parameter size, which indicates that E2RNAS can make a better trade-off among these three goals than DARTS.
Besides, both P-DARTS [chen2019progressive] and PC-DARTS [xu2019pc] search for a deeper architecture with less search cost than "E2RNAS-C36", so they slightly outperform E2RNAS in test error rate with competitive PGD accuracy. Applying the E2RNAS method to P-DARTS and PC-DARTS to make a trade-off among multiple objectives (accuracy, robustness, and the number of parameters) is left for future work.
To further compare the performance of E2RNAS and DARTS, we change the initial number of channels in the architecture evaluation for both methods to keep roughly similar model sizes. According to the results shown in Table 1, E2RNAS remarkably improves robustness with comparable classification accuracy. For example, comparing "E2RNAS-C46" with "DARTS", the PGD accuracy increases by about 1.6 times, while the test error increases by only around 0.9%.
In summary, the experimental results in Table 1 show that E2RNAS can search significantly more robust architectures with a smaller model size and comparable classification accuracy, compared with state-of-the-art NAS methods.
Architecture Evaluation on CIFAR-100 and SVHN
The comparison of E2RNAS with DARTS on the CIFAR-100 and SVHN datasets is presented in Table 2. The performance of E2RNAS on the CIFAR-100 dataset is similar to that on the CIFAR-10 dataset, in that E2RNAS can search a robust architecture with a smaller model size and a slightly decreased test accuracy. For example, compared to DARTS, "E2RNAS-C36" reduces the number of parameters by 0.3MB and nearly doubles the PGD accuracy, though the test error is slightly increased (about 2%). In addition, E2RNAS shows excellent results on the SVHN dataset. It not only significantly improves robustness but also achieves competitive test accuracy with a smaller parameter size. For instance, compared to DARTS, "E2RNAS-C36" reduces the model size by about 15% and increases the PGD accuracy, while keeping competitive performance. Therefore, these quantitative experiments indicate that E2RNAS can search robust architectures with a smaller model size and comparable performance.
4.3 Ablation Study
In this section, we study how each design in E2RNAS influences its performance on different objectives. The corresponding results are presented in Table 3. The adversarial training (abbreviated as adv) in the lower-level problem of problem (7) transforms training data into adversarial examples and aims to learn a robust model for a given architecture. The resource constraint (abbreviated as nop) in the upper-level problem of problem (7) constrains the parameter size of the searched architecture. The multiple gradient descent algorithm (abbreviated as MGDA) is applied to solve the upper-level problem of problem (7), which is a multi-objective problem minimizing both the validation loss and the model size. Without MGDA, the upper-level problem is solved by minimizing an equally weighted sum of the two objectives ($\lambda = 1/2$ in Eq. (12)). The lower bound (abbreviated as L) of the number of parameters prevents the model from searching over-simplified architectures.
Impact of Adversarial Training
Adversarial training, which trains a neural network on adversarial examples, is an effective method for improving the robustness of a neural network. Thus, we apply it in the lower-level problem of problem (7), so that the searched architecture can defend against adversarial attacks. Here we discuss two impacts of adversarial training in detail.
Firstly, using adversarial training tends to reduce the number of parameters, which may lead to worse accuracy. We notice that the parameter size of the architecture searched by DARTS with adversarial training ("E2RNAS w/o nop" in Table 3) is only 1.37MB, which means that the searched architecture contains many parameter-free operations. Therefore, it has a larger test error because of its over-simplified architecture, although its PGD accuracy is larger than that of DARTS with a comparable model size ("DARTS-C20" in Table 1). Besides, compared with E2RNAS, the model size of "E2RNAS w/o adv" increases by 1.631MB, which indicates that adversarial training significantly decreases the number of parameters. However, this can be alleviated by constraining the parameter size with a lower bound $L$.
Secondly, using adversarial training helps E2RNAS make a trade-off between robustness and accuracy. We notice that adversarial training can significantly influence the model size. Therefore, to make a fair comparison, we set the number of initial channels of "E2RNAS w/o adv" to 27 ("E2RNAS w/o adv (C27)") in the architecture evaluation, to keep its model size roughly similar to that of E2RNAS. The result in Table 3 shows that "E2RNAS" has better robustness but lower accuracy than "E2RNAS w/o adv (C27)".
Therefore, using adversarial training helps E2RNAS make a trade-off among multiple objectives and search a robust architecture with a smaller model size.
Effectiveness of MGDA
MGDA is used to solve the upper-level problem of problem (7). We quantitatively compare the performance of E2RNAS with and without MGDA ("E2RNAS" vs. "E2RNAS w/o MGDA" in Table 3) and find that solving with MGDA achieves much better results in test accuracy, parameter size, and PGD accuracy. So instead of using equal weights, MGDA can find a good weighting of the objectives and make a trade-off among them.
Necessity of the Lower Bound L
We find that training E2RNAS without the minimum constraint ("E2RNAS w/o L" in Table 3) searches an architecture with many parameter-free operations (its parameter size is only 1.370MB). There are three reasons for this phenomenon. Firstly, the instability of DARTS sometimes makes it converge to extreme architectures (e.g., full of skip connections) [zela2019understanding, chen2019progressive]. Secondly, as discussed above, using adversarial training in the lower-level problem tends to reduce the number of parameters. Finally, optimizing only the number of parameters in the upper-level problem ("E2RNAS w/o adv ()" in Table 3) also results in searched architectures with many parameter-free operations. Therefore, it is necessary to constrain the number of parameters with a lower bound $L$ to prevent E2RNAS from searching over-simplified architectures.
Hence, we set this hyperparameter to 1 in our work, because E2RNAS achieves the best performance with this setting: the lowest test error rate in Figure 4(a), an acceptable model size in Figure 4(b), and the highest PGD accuracy in Figure 4(c).
5 Conclusion
In this paper, we propose the E2RNAS method, which optimizes multiple objectives simultaneously to search for an effective, efficient, and robust architecture. The proposed objective function is formulated as a bi-level multi-objective problem, and we design an algorithm that integrates MGDA with bi-level optimization. Experiments demonstrate that E2RNAS can find adversarially robust architectures with optimized model size and comparable classification accuracy on various datasets. In future work, we are interested in extending the proposed E2RNAS method to search for multiple Pareto-optimal architectures in a single run.