SmoothDARTS
Code for our ICML 2020 paper "Stabilizing Differentiable Architecture Search via Perturbation-based Regularization".
Differentiable architecture search (DARTS) is a prevailing NAS solution for identifying architectures. Based on a continuous relaxation of the architecture space, DARTS learns a differentiable architecture weight and largely reduces the search cost. However, its stability and generalizability have been challenged for yielding deteriorating architectures as the search proceeds. We find that the precipitous validation loss landscape, which leads to a dramatic performance drop when distilling the final architecture, is an essential factor that causes instability. Based on this observation, we propose a perturbation-based regularization, named SmoothDARTS (SDARTS), to smooth the loss landscape and improve the generalizability of DARTS. In particular, our new formulations stabilize DARTS by either random smoothing or adversarial attack. The search trajectory on NAS-Bench-1Shot1 demonstrates the effectiveness of our approach, and thanks to the improved stability, we achieve performance gains across various search spaces on 4 datasets. Furthermore, we mathematically show that SDARTS implicitly regularizes the Hessian norm of the validation loss, which accounts for a smoother loss landscape and improved performance. The code is available at https://github.com/xiangningchen/SmoothDARTS.
Neural architecture search (NAS) has emerged as a rational next step to automate the trial and error paradigm of architecture design. It is straightforward to search by reinforcement learning
(Zoph and Le, 2017; Zoph et al., 2018; Zhong et al., 2018) or evolutionary algorithms (Stanley and Miikkulainen, 2002; Miikkulainen et al., 2019; Real et al., 2017; Liu et al., 2017) due to the discrete nature of the architecture space. However, these methods usually require massive computational resources. A variety of approaches have since been proposed to reduce the search cost, including one-shot architecture search (Pham et al., 2018; Bender et al., 2018; Brock et al., 2018), performance estimation
(Klein et al., 2017; Baker, 2018), and network morphisms (Elsken et al., 2019; Cai et al., 2018, 2018). For example, one-shot architecture search methods construct a super-network covering all candidate architectures, where subnetworks with shared components also share the corresponding weights. The super-network is then trained only once, which is much more efficient. In particular, DARTS (Liu et al., 2019) builds a continuous mixture architecture and relaxes the categorical architecture search problem to learning a differentiable architecture weight α.

Despite being computationally efficient, the stability and generalizability of DARTS have been challenged recently. Many works (Zela et al., 2020a; Yu et al., 2020) have observed that although the validation accuracy of the mixture architecture keeps growing, the performance of the derived architecture collapses at evaluation time. Such instability makes DARTS converge to distorted architectures. For instance, Chu et al. (2019) and Liang et al. (2019) find that parameter-free operations such as skip connection dominate the generated architecture, and DARTS has a preference towards wide and shallow structures (Shu et al., 2020). To alleviate this issue, some works (Zela et al., 2020a; Liang et al., 2019) propose to early-stop the search process based on handcrafted criteria. However, the inherent instability exists from the very beginning, and early stopping is a compromise that does not actually improve the search algorithm.
An important source of such instability is the final projection step that derives the actual discrete architecture from the continuous mixture architecture. There is often a huge performance drop in this projection step, so the validation accuracy of the mixture architecture, which is what DARTS optimizes, may not be correlated with the final validation accuracy. As shown in Figure 1(a), DARTS often converges to a sharp region, so small perturbations dramatically decrease the validation accuracy, let alone the projection step. Moreover, the sharp cone in the landscape illustrates that the network weight w is almost only applicable to the current architecture weight α. Bender et al. (2018) also discover a similar phenomenon: the shared weight of the one-shot network is sensitive and only works for a few subnetworks. This empirically prevents DARTS from fully exploring the architecture space.
To address these problems, we propose two novel formulations. Intuitively, the optimization of α is based on a w that performs well on nearby configurations rather than only the current one. This leads to smoother landscapes, as shown in Figure 1(b, c). Our contributions are as follows:
We present SmoothDARTS (SDARTS) to overcome the instability and lack of generalizability of DARTS. Instead of assuming the shared weight w is the minimizer with respect to the current architecture weight α, we formulate w as the minimizer of the randomly smoothed function, defined as the expected loss within a neighborhood of the current α. The resulting approach, called SDARTS-RS, requires scarcely any additional computational cost but is surprisingly effective. We also propose a stronger formulation that forces w to minimize the worst-case loss within a neighborhood of α, which can be solved by adversarial training. The resulting algorithm, called SDARTS-ADV, leads to even better stability and improved performance.
Mathematically, we show that the performance drop caused by discretization is highly related to the norm of the Hessian of the validation loss with respect to the architecture weight α, which was also noted empirically in (Zela et al., 2020a). Furthermore, we show that both our regularization techniques implicitly minimize this term, which explains why our methods significantly improve DARTS across various settings.
The proposed methods consistently improve DARTS and match or improve over state-of-the-art results on various search spaces of CIFAR-10 and Penn Treebank. Besides, extensive experiments show that our methods outperform other regularization approaches on three datasets across four search spaces.
Similar to prior work (Zoph et al., 2018), DARTS only searches for the architecture of cells, which are stacked to compose the full network. Within a cell, the nodes are organized as a DAG (Figure 2), where every node is a latent representation and every edge (i, j) is associated with an operation o^{(i,j)}. It is inherently difficult to perform an efficient search since the choice of operation on every edge is discrete. As a solution, DARTS constructs a mixed operation on every edge:

ō^{(i,j)}(x) = Σ_{o ∈ O} [exp(α_o^{(i,j)}) / Σ_{o' ∈ O} exp(α_{o'}^{(i,j)})] · o(x),

where O is the candidate operation corpus and α_o^{(i,j)} denotes the corresponding architecture weight for operation o on edge (i, j). Therefore, the original categorical choice per edge is parameterized by a vector α^{(i,j)} with dimension |O|, and the architecture search is relaxed to learning a continuous architecture weight α. With such relaxation, DARTS formulates a bilevel optimization objective:

min_α  L_val(w*(α), α)   s.t.   w*(α) = argmin_w L_train(w, α).   (1)
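As a toy illustration of this continuous relaxation, the sketch below mixes a few hypothetical one-dimensional "operations" with softmax-normalized architecture weights. The operation list and vector shapes are illustrative assumptions; the real DARTS operations are convolutions and poolings over feature maps.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical candidate operation corpus O for a single edge,
# acting on a toy 1-D feature vector.
ops = [
    lambda x: x,                   # skip connection
    lambda x: np.maximum(x, 0.0),  # stand-in for a parameterized op
    lambda x: np.zeros_like(x),    # zero op
]

def mixed_op(x, alpha_edge):
    """Continuous relaxation: softmax(alpha)-weighted sum over candidate ops."""
    weights = softmax(alpha_edge)
    return sum(w * op(x) for w, op in zip(weights, ops))

x = np.array([-1.0, 2.0])
alpha_edge = np.array([1.0, 0.0, -1.0])  # one alpha entry per candidate op
y = mixed_op(x, alpha_edge)
```

After the search, pruning keeps only the operation with the largest architecture weight, discarding the softmax mixture.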
Then, w and α are updated via gradient descent alternately, where w*(α) is approximated by the current w or by a one-step-forward w. DARTS sparked a wave of follow-up work in NAS, with many approaches springing up to make further improvements (Xie et al., 2019; Dong and Yang, 2019; Cai et al., 2019; Yao et al., 2020; Xu et al., 2020).
After the search, DARTS simply prunes out all operations on every edge except the one with the largest α_o at evaluation time. Under such perturbation, its stability and generalizability have been widely challenged (Zela et al., 2020a; Liang et al., 2019; Chu et al., 2019). DARTS+ (Liang et al., 2019) proposes to early-stop the search based on the number of skip connections. Zela et al. (2020a) empirically point out that the dominant eigenvalue λ_max of the Hessian matrix ∇²_α L_val is highly correlated with the stability. They also present another early-stopping criterion (DARTS-ES) to prevent λ_max from exploding. Besides, partial channel connection (Xu et al., 2020), ScheduledDropPath (Zoph et al., 2018), and L2 regularization on w have also been shown to improve the stability of DARTS.

NAS-Bench-1Shot1 (Zela et al., 2020b) is a benchmark architecture dataset covering three search spaces on CIFAR-10. It provides a mapping between the continuous space of differentiable NAS and the discrete one of NAS-Bench-101 (Ying et al., 2019), the first architecture dataset proposed to lower the entry barrier of NAS. By querying NAS-Bench-1Shot1, researchers can obtain the necessary quantities for a specific architecture (e.g., test accuracy) in milliseconds. Using this benchmark, we track the anytime test error of various NAS algorithms, which allows us to compare their stability.
In this paper, we claim that DARTS should be robust against perturbation of the architecture weight α. Relatedly, the field of adversarial robustness aims to overcome the vulnerability of neural networks to contrived input perturbations (Szegedy et al., 2014). Random smoothing (Lecuyer et al., 2019; Cohen et al., 2019) is a popular method to improve model robustness. Another effective approach is adversarial training (Goodfellow et al., 2015; Madry et al., 2018b), which intuitively optimizes the worst-case training loss. To the best of our knowledge, we are the first to apply this idea to stabilize NAS.

During the DARTS search procedure, a continuous architecture weight α is used, but it eventually has to be projected to derive the discrete architecture. There is often a huge performance drop at this projection stage, so a good mixture architecture does not imply a good final architecture. Therefore, although DARTS can consistently reduce the validation error of the mixture architecture, the validation error after projection is very unstable and can even blow up, as shown in Figures 3 and 4.
This phenomenon has been discussed in several recent papers (Zela et al., 2020a; Liang et al., 2019), and Zela et al. (2020a) empirically find that the instability is related to the norm of the Hessian ∇²_α L_val. To verify this, we plot the validation accuracy landscape of DARTS in Figure 1(a), which is extremely sharp: a small perturbation of α can reduce the validation accuracy from over 90% to less than 10%. This also undermines DARTS' ability to explore the architecture space: α can only change slightly at each iteration because the current w only works within a small local region.
To address this issue, intuitively we want to force L_val(w, α) to be smooth with respect to perturbations of α. This leads to the following two versions of SDARTS, which redefine w*(α):

SDARTS-RS:  w*(α) = argmin_w E_{δ ~ U[-ε, ε]} [L_train(w, α + δ)],
SDARTS-ADV: w*(α) = argmin_w max_{‖δ‖ ≤ ε} L_train(w, α + δ),   (2)

where U[-ε, ε] represents the uniform distribution between -ε and ε. The main idea is that instead of using the w that only performs well on the current α, we replace it by the w*(α) defined in (2) that performs well within a neighborhood of α. This forces our algorithms to focus on (w, α) pairs with smooth loss landscapes. For SDARTS-RS, we set w as the minimizer of the expected loss under small random perturbation bounded by ε. This is based on the idea of random smoothing, which randomly averages the neighborhood of a given function to obtain a smoother version (Cohen et al., 2019; Lecuyer et al., 2019). On the other hand, for SDARTS-ADV we set w to minimize the worst-case training loss under small perturbation of α. This is based on the idea of adversarial training, a widely used technique in adversarial defense (Madry et al., 2018a).

The optimization algorithm for solving the proposed formulations is described in Algorithm 1. Similar to DARTS, our algorithm is based on alternating minimization between w and α. For SDARTS-RS, w is the minimizer of the expected loss altered by a randomly chosen δ, which can be optimized by SGD directly. We sample the following δ and add it to α before running a single step of SGD on w¹:

δ ~ U[-ε, ε].   (3)

¹ We use a uniform random perturbation for simplicity; in practice the approach also works with other random perturbations, such as Gaussian.
This approach is very simple (adding only one line of code) and efficient (it introduces almost no overhead), and we find that it is quite effective at improving stability. As shown in Figure 1(b), the sharp cone disappears and the landscape becomes much smoother, maintaining high validation accuracy under perturbations of α.
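A minimal sketch of the SDARTS-RS weight update on a toy differentiable loss: perturb α uniformly within [-ε, ε], then take a plain SGD step on w. The quadratic `train_loss`, the dimensions, and the step sizes are illustrative assumptions, not the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_loss(w, alpha):
    # Toy stand-in for L_train(w, alpha): any differentiable function works.
    return 0.5 * np.sum((w - alpha) ** 2)

def grad_w(w, alpha):
    # Gradient of the toy loss with respect to w.
    return w - alpha

def sdarts_rs_step(w, alpha, eps=0.3, lr=0.1):
    """One SDARTS-RS update: sample delta ~ U[-eps, eps], then SGD on w
    at the perturbed architecture weight alpha + delta."""
    delta = rng.uniform(-eps, eps, size=alpha.shape)
    return w - lr * grad_w(w, alpha + delta)

w = np.zeros(3)
alpha = np.array([1.0, -1.0, 0.5])
for _ in range(200):
    w = sdarts_rs_step(w, alpha)
```

The only change relative to a vanilla DARTS weight step is the single line that samples and adds δ.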
For SDARTS-ADV, we consider the worst-case loss under a certain perturbation level, which is a stronger requirement than the expected loss in SDARTS-RS. The resulting landscape is even smoother, as illustrated in Figure 1(c). In this case, updating w requires solving a min-max optimization problem. We employ the widely used multi-step projected gradient descent (PGD) on the negative training loss to iteratively compute δ:

δ ← Π_{‖δ‖ ≤ ε} (δ + η_δ ∇_δ L_train(w, α + δ)),   (4)

where Π_{‖δ‖ ≤ ε} denotes the projection onto the chosen norm ball (e.g., clipping in the case of the ℓ∞ norm) and η_δ denotes the attack learning rate.
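The PGD computation of δ in (4) can be sketched on the same kind of toy loss, with the ℓ∞ projection implemented as clipping. The loss function and hyperparameters below are illustrative assumptions.

```python
import numpy as np

def train_loss(w, alpha):
    # Toy stand-in for L_train(w, alpha).
    return 0.5 * np.sum((w - alpha) ** 2)

def grad_alpha(w, alpha):
    # Gradient of the toy loss with respect to alpha; note that
    # d L / d delta = d L / d alpha evaluated at alpha + delta.
    return -(w - alpha)

def pgd_delta(w, alpha, eps=0.3, lr=0.1, steps=7):
    """Multi-step PGD: ascend the training loss in delta, then project
    onto the l_inf ball of radius eps (projection = clipping)."""
    delta = np.zeros_like(alpha)
    for _ in range(steps):
        g = grad_alpha(w, alpha + delta)
        delta = np.clip(delta + lr * g, -eps, eps)
    return delta

w = np.zeros(3)
alpha = np.array([1.0, -1.0, 0.5])
delta = pgd_delta(w, alpha)  # worst-case perturbation of alpha for this w
```

The weight w is then updated by an ordinary SGD step on L_train(w, α + δ), as in SDARTS-RS but with the adversarial δ.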
In the next section, we mathematically explain why SDARTS-RS and SDARTS-ADV improve the stability and generalizability of DARTS.
It has been empirically pointed out in (Zela et al., 2020a) that the dominant eigenvalue of ∇²_α L_val (the spectral norm of the Hessian) is highly correlated with the generalization quality of DARTS solutions. In standard DARTS training, the Hessian norm usually blows up, which leads to deteriorating (test) performance of the solutions. In Figure 5, we plot this Hessian norm during the training procedure and find that the proposed methods, both SDARTS-RS and SDARTS-ADV, consistently reduce it. In the following, we first explain why the spectral norm of the Hessian is correlated with solution quality, and then formally show that our algorithms implicitly control the Hessian norm.
Assume α* is the optimal solution of (1) in the continuous space, while ᾱ is the discrete solution obtained by projecting α* onto the simplex. Based on a Taylor expansion, and assuming ∇_α L_val(w*, α*) = 0 due to the optimality condition, we have

L_val(w*, ᾱ) − L_val(w*, α*) ≈ ½ (ᾱ − α*)ᵀ H̄ (ᾱ − α*) ≤ ½ ‖ᾱ − α*‖² ‖H̄‖,   (5)

where H̄ is the average Hessian. If we assume that the Hessian is stable in a local region, then the quantity ‖∇²_α L_val(w*, α*)‖ approximately bounds the performance drop when projecting α* to ᾱ with a fixed w*. After fine-tuning, L_val(w̄, ᾱ), where w̄ is the optimal weight corresponding to ᾱ, is expected to be even smaller than L_val(w*, ᾱ), provided the training and validation losses are highly correlated. Therefore, the performance of (w̄, ᾱ), which is the quantity we care about, is also bounded by (5). Note that the bound can be quite loose, since it assumes the network weight remains unchanged when switching from α* to ᾱ. A more precise bound could be computed by viewing L_val as a function parameterized only by α, and then calculating its derivative/Hessian.
Given the observation that the solution quality of DARTS is related to ‖∇²_α L_val‖, an immediate thought is to explicitly control this quantity during the optimization procedure. To implement this idea, we add an auxiliary term, a finite-difference estimate of the Hessian norm, to the loss function when updating α. However, this requires much additional memory to build a computational graph of the gradient, and Figure 3 suggests that while it has some effect compared with DARTS, it is worse than both SDARTS-RS and SDARTS-ADV. One potential reason is the high dimensionality: there are too many directions of δ to choose from, and we can only randomly sample a subset of them at each iteration.
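To make the finite-difference idea concrete, the sketch below estimates an average curvature E_d[dᵀHd] of a toy loss via random-direction second differences. The toy loss, step size, and number of sampled directions are illustrative assumptions; in the actual method this estimate would be added as a penalty when updating α.

```python
import numpy as np

rng = np.random.default_rng(1)

def val_loss(alpha):
    # Toy stand-in for L_val(w, alpha), with known Hessian H = diag(2, 6).
    return alpha[0] ** 2 + 3.0 * alpha[1] ** 2

def hessian_penalty(loss, alpha, eps=1e-2, n_dirs=50):
    """Finite-difference estimate of E_d[d^T H d] over random unit
    directions d: [f(a + eps d) + f(a - eps d) - 2 f(a)] / eps^2."""
    total = 0.0
    for _ in range(n_dirs):
        d = rng.standard_normal(alpha.shape)
        d /= np.linalg.norm(d)  # uniform random direction
        total += (loss(alpha + eps * d) + loss(alpha - eps * d)
                  - 2.0 * loss(alpha)) / eps ** 2
    return total / n_dirs

alpha = np.array([0.5, -0.2])
pen = hessian_penalty(val_loss, alpha)  # should be near trace(H)/2 = 4
```

Only a random subset of directions is sampled per iteration, which is exactly the high-dimensionality limitation noted above.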
In SDARTS-RS, the objective function becomes

E_{δ ~ U[-ε, ε]} [L_train(w, α + δ)]   (6)
≈ E_δ [L_train(w, α) + δᵀ ∇_α L_train(w, α) + ½ δᵀ ∇²_α L_train(w, α) δ]   (7)
= L_train(w, α) + (ε²/6) tr(∇²_α L_train(w, α)),   (8)

where the second term in (7) cancels out since E[δ] = 0, and the off-diagonal elements of the third term become 0 after taking the expectation over δ (for i ≠ j, E[δ_i δ_j] = 0, while E[δ_i²] = ε²/3). The update of w in SDARTS-RS thus implicitly controls the trace norm of ∇²_α L_train. If the matrix is close to positive semi-definite, this approximately regularizes the (positive) eigenvalues of ∇²_α L_train. Therefore, we observe that SDARTS-RS empirically reduces the Hessian norm throughout its training procedure.
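The cancellation argument behind (6)-(8) can be checked numerically: for δ drawn uniformly from [-ε, ε] per coordinate, E[δᵀHδ] equals (ε²/3)·tr(H), with the off-diagonal contributions vanishing in expectation. A Monte Carlo sketch (the matrix H and the value of ε are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

eps = 0.3
H = np.array([[2.0, 0.7],
              [0.7, 5.0]])  # arbitrary symmetric "Hessian"

# Sample delta ~ U[-eps, eps]^2 and average the quadratic form delta^T H delta.
samples = rng.uniform(-eps, eps, size=(200_000, 2))
empirical = np.mean(np.einsum("ni,ij,nj->n", samples, H, samples))

# Prediction from (8): only the diagonal survives, each with E[delta_i^2] = eps^2 / 3.
predicted = eps ** 2 / 3.0 * np.trace(H)
```

The empirical average matches the trace formula despite the nonzero off-diagonal entry of H.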
SDARTS-ADV ensures that the validation loss is small under the worst-case perturbation of α. If we assume the Hessian matrix is roughly constant within the ε-ball, then adversarial training implicitly minimizes

max_{‖δ‖ ≤ ε} L_val(w, α + δ)   (9)
≈ L_val(w, α) + max_{‖δ‖ ≤ ε} ½ δᵀ ∇²_α L_val(w, α) δ.   (10)

When the perturbation is measured in the ℓ2 norm, the second term becomes (ε²/2) ‖∇²_α L_val‖ (the spectral norm of the Hessian), and when the perturbation is in the ℓ∞ norm, the second term is bounded by (ε²/2) Σ_{i,j} |(∇²_α L_val)_{ij}|. Thus SDARTS-ADV also approximately minimizes the norm of the Hessian. In addition, notice that from (9) to (10) we assume the gradient ∇_α L_val(w, α) is 0, a property that holds only at α = α*. At intermediate steps, for a general α, the stability under perturbation is related not only to the Hessian but also to the gradient, and SDARTS-ADV can still implicitly keep the landscape smooth by minimizing the first-order term in the Taylor expansion of (9).
In this section, we first track the anytime performance of our methods on NAS-Bench-1Shot1 (Section 5.1), which demonstrates their superior stability and generalizability. Then we perform experiments on the widely used CNN cell space on CIFAR-10 (Section 5.2) and the RNN cell space on PTB (Section 5.3). In Section 5.4, we present a detailed comparison between our methods and other popular regularization techniques. Finally, in Section 5.5 we examine the generated architectures and illustrate that our methods mitigate DARTS' bias towards certain operations and connection patterns.
NAS-Bench-1Shot1 consists of 3 search spaces based on CIFAR-10, which contain 6,240, 29,160, and 363,648 architectures, respectively. The macro architecture of models in all spaces is constructed from 3 stacked blocks, with a max-pooling operation in between as the down-sampler. Each block contains 3 stacked cells, and the micro architecture of each cell is represented as a DAG. Besides the operation on every edge, the search algorithm also needs to determine the topology of edges connecting the input nodes, the output node, and the choice blocks. We refer to the original paper (Zela et al., 2020b) for details about the search spaces.
We compare our methods with state-of-the-art NAS algorithms on all 3 search spaces. Descriptions of the compared baselines can be found in Appendix 7.1. We run every NAS algorithm for 100 epochs (twice the default DARTS setting) to allow a thorough and comprehensive analysis of search stability and generalizability. Hyperparameters for the 5 baselines are set to their defaults. For both SDARTS-RS and SDARTS-ADV, the perturbation on α is performed after the softmax layer. We initialize the norm-ball radius ε as 0.03 and linearly increase it to 0.3 in all our experiments. The random perturbation in SDARTS-RS is sampled uniformly between -ε and ε, and we use a 7-step PGD attack under an ℓ∞ norm ball to obtain the δ in SDARTS-ADV. Other settings are the same as DARTS.

To search for 100 epochs on a single NVIDIA GTX 1080 Ti GPU, ENAS, DARTS, GDAS, NASP, and PC-DARTS require 10.5h, 8h, 4.5h, 5h, and 6h, respectively. The extra time of SDARTS-RS is spent only on random sampling, so its search time is approximately the same as DARTS (8h). SDARTS-ADV needs extra forward and backward passes to perform the adversarial attack, so it takes 16h. Note that this can be largely reduced by setting the number of PGD steps to 1 (FGSM (Goodfellow et al., 2015)), which brings only a small performance decrease according to our experiments.
We plot the anytime test error averaged over 6 independent runs in Figure 4. The trajectory (mean ± std) of the spectral norm of ∇²_α L_val is shown in Figure 5. Note that ENAS is not included in Figure 5 since it does not have an architecture weight α. We provide our detailed analysis below.
DARTS generates architectures with deteriorating performance as the number of search epochs grows, in accordance with the observations in (Zela et al., 2020a; Liang et al., 2019). The single-path modifications (GDAS, NASP) help to some extent; e.g., GDAS avoids finding worse architectures and remains stable. However, GDAS suffers from premature convergence to suboptimal architectures, and NASP is effective for the first few search epochs before its performance starts to fluctuate like ENAS. A potential reason is that the architecture weight is clipped to the nearest boundary when it cannot satisfy a range constraint, which confuses NASP when choosing among operations whose weights are similar on certain edges. The partial channel connection introduced by PC-DARTS makes it the best baseline on Spaces 1 and 3, but PC-DARTS also suffers from severely degraded performance on Space 2.
SDARTS-RS outperforms all 5 baselines on all 3 search spaces. It better explores the architecture space while overcoming the instability issue of DARTS. SDARTS-ADV achieves even better performance by forcing w to minimize the worst-case loss around a neighborhood of α. Its anytime test error continues to decrease even when the search epoch exceeds 80, which does not occur for any other method.
As explained in Section 4, the spectral norm of the Hessian ∇²_α L_val has a strong correlation with stability and solution quality: a large norm leads to poor generalizability and stability. In agreement with the theoretical analysis that our methods keep minimizing this norm (Section 4), both SDARTS-RS and SDARTS-ADV keep it at a low level throughout the search procedure. In comparison, the Hessian norms of all baselines continue to increase, growing by more than 10 times after 100 search epochs. Though GDAS has the lowest norm at the beginning, it suffers the largest growth rate. The partial channel connection in PC-DARTS cannot regularize the Hessian norm; its trajectory is similar to DARTS and NASP, which is consistent with their comparably unstable performance.
Architecture  Test Error (%)  Params (M)  Search Cost (GPU days)  Search Method
DenseNet-BC (Huang et al., 2017)^{⋆}  3.46  25.6  –  manual
NASNet-A (Zoph et al., 2018)  2.65  3.3  2000  RL
AmoebaNet-A (Real et al., 2019)  –  3.2  3150  evolution
AmoebaNet-B (Real et al., 2019)  –  2.8  3150  evolution
PNAS (Liu et al., 2018)^{⋆}  –  3.2  225  SMBO
ENAS (Pham et al., 2018)  2.89  4.6  0.5  RL
NAONet (Luo et al., 2018)  3.53  3.1  0.4  NAO
DARTS (1st) (Liu et al., 2019)  –  3.3  0.4  gradient
DARTS (2nd) (Liu et al., 2019)  –  3.3  1  gradient
SNAS (moderate) (Xie et al., 2019)  –  2.8  1.5  gradient
GDAS (Dong and Yang, 2019)  2.93  3.4  0.3  gradient
BayesNAS (Zhou et al., 2019)  –  3.4  0.2  gradient
ProxylessNAS (Cai et al., 2019)^{†}  2.08  –  4.0  gradient
NASP (Yao et al., 2020)  –  3.3  0.1  gradient
PC-DARTS (Xu et al., 2020)  –  3.6  0.1  gradient
R-DARTS(L2) (Zela et al., 2020a)  –  –  1.6  gradient
SDARTS-RS  –  3.4  0.4^{‡}  gradient
SDARTS-ADV  –  3.3  1.3^{‡}  gradient
^{⋆} Obtained without cutout augmentation.
^{†} Obtained on a different space with PyramidNet (Han et al., 2017) as the backbone.
^{‡} Recorded on a single GTX 1080 Ti GPU.
Table 1: Comparison with state-of-the-art image classifiers on CIFAR-10.
We employ SDARTS-RS and SDARTS-ADV to search for CNN cells on CIFAR-10, following the search space (with 7 operations) in DARTS (Liu et al., 2019). The macro architecture is obtained by stacking the convolutional cell 8 times, and every cell contains 7 nodes (2 input nodes, 4 intermediate nodes, and 1 output node). Other detailed settings for searching and evaluation can be found in Appendix 7.2; they are the same as DARTS.
Table 1 summarizes the comparison of our methods with state-of-the-art algorithms, and the searched normal cells are visualized in Figure 2. We achieve performance gains compared with DARTS and most of its variants. Moreover, the variance of SDARTS-RS is considerably lower than that of the baselines, and SDARTS-ADV achieves even better stability. PC-DARTS slightly outperforms our methods but has a higher variance. It warm-starts the network weight w for the first 15 epochs, and its number of search epochs is comparatively small, which may alleviate the instability issue discussed in Section 5.1. Nevertheless, when searching on various simplified search spaces across 3 datasets, our methods achieve superior stability and test accuracy compared with PC-DARTS, as shown in Section 5.4.

Besides searching for CNN cells, our methods are applicable to various scenarios such as identifying RNN cells. Following DARTS (Liu et al., 2019), the RNN search space based on PTB contains 5 candidate functions: tanh, relu, sigmoid, identity, and zero. The macro architecture of the RNN network is comprised of only a single cell. The first intermediate node is manually fixed, and the remaining nodes are determined by the search algorithm. When searching, we train the RNN network for 50 epochs with sequence length 35. During evaluation, the final architecture is trained by an SGD optimizer, where the batch size is set to 64 and the learning rate is fixed at 20. These settings are the same as DARTS.
The results are shown in Table 2. SDARTS-RS achieves a validation perplexity of 58.7 and a test perplexity of 56.4, while SDARTS-ADV achieves a validation perplexity of 58.3 and a test perplexity of 56.1. We outperform the other NAS methods with similar model size, which demonstrates the effectiveness of our methods on the RNN space. LSTM + SE obtains better results than ours, but it benefits from a handcrafted ensemble structure.
Architecture  Valid Perplexity  Test Perplexity  Params (M)
LSTM + SE (Yang et al., 2018)^{⋆}  58.1  56.0  22
NAS (Zoph and Le, 2017)  –  64.0  25
ENAS (Pham et al., 2018)  60.8  58.6  24
DARTS (1st) (Liu et al., 2019)  60.2  57.6  23
DARTS (2nd) (Liu et al., 2019)^{†}  58.1  55.7  23
GDAS (Dong and Yang, 2019)  59.8  57.5  23
NASP (Yao et al., 2020)  59.9  57.3  23
SDARTS-RS  58.7  56.4  23
SDARTS-ADV  58.3  56.1  23
^{⋆} LSTM + SE denotes an LSTM with 15 softmax experts.
^{†} We achieve 58.5 validation and 56.2 test perplexity when training the architecture found by DARTS (2nd) ourselves.
Table 2: Comparison with state-of-the-art language models on PTB.
Dataset  Space  DARTS  PC-DARTS  DARTS-ES  R-DARTS(DP)  R-DARTS(L2)  SDARTS-RS  SDARTS-ADV
C10  S1  3.84  3.11  3.01  3.11  2.78  2.78  2.73 
S2  4.85  3.02  3.26  3.48  3.31  2.75  2.65  
S3  3.34  2.51  2.74  2.93  2.51  2.53  2.49  
S4  7.20  3.02  3.71  3.58  3.56  2.93  2.87  
C100  S1  29.46  18.87  28.37  25.93  24.25  17.02  16.88 
S2  26.05  18.23  23.25  22.30  22.44  17.56  17.24  
S3  28.90  18.05  23.73  22.36  23.99  17.73  17.12  
S4  22.85  17.16  21.26  22.18  21.94  17.17  15.46  
SVHN  S1  4.58  2.28  2.72  2.55  4.79  2.26  2.16 
S2  3.53  2.39  2.60  2.52  2.51  2.37  2.07  
S3  3.41  2.27  2.50  2.49  2.48  2.21  2.05  
S4  3.05  2.37  2.51  2.61  2.50  2.35  1.98 
Our methods can be viewed as a way to regularize DARTS (they implicitly regularize the Hessian norm of the validation loss). In this section, we compare SDARTS-RS and SDARTS-ADV with other popular regularization techniques. The compared baselines are 1) partial channel connection (PC-DARTS (Xu et al., 2020)); 2) ScheduledDropPath (Zoph et al., 2018) (R-DARTS(DP)); 3) L2 regularization on w (R-DARTS(L2)); and 4) early stopping (DARTS-ES (Zela et al., 2020a)). Descriptions of the compared regularization baselines are given in Appendix 7.1.

We perform a thorough comparison on the 4 simplified search spaces proposed in (Zela et al., 2020a) across 3 datasets (CIFAR-10, CIFAR-100, and SVHN). All search spaces use the same macro architecture as in Section 5.2; the difference is that they contain only a portion of the candidate operations (details in Appendix 7.3). Results in Table 3 are obtained by running every method 4 independent times and picking the final architecture based on validation accuracy (retraining from scratch for a few epochs). Other settings are the same as in Section 5.2.
The discovered cells are shown in the Appendix (Figures 7, 8, 9, and 10). Our methods achieve substantial performance gains compared with the baselines. SDARTS-ADV is the best method on all 12 benchmarks, and SDARTS-RS takes second place on 10 benchmarks. The cell discovered on S3 for CIFAR-10 even achieves higher test accuracy than all methods in Table 1 (except ProxylessNAS, which searches on a PyramidNet backbone).
As pointed out in (Zela et al., 2020a; Liang et al., 2019; Shu et al., 2020), DARTS tends to fall into distorted architectures that converge faster, which is another manifestation of its instability. Here we examine the generated architectures to see whether our methods overcome this bias.
Space  DARTS  PC-DARTS  DARTS-ES  SDARTS-RS  SDARTS-ADV
S1  1.0  0.5  0.375  0.125  0.125 
S2  0.875  0.75  0.25  0.375  0.125 
S3  1.0  0.125  1.0  0.125  0.125 
S4  0.625  0.125  0.0  0.0  0.0 
Many works (Zela et al., 2020a; Liang et al., 2019) have found that parameter-free operations such as skip connection dominate the generated architecture. Though they make architectures converge faster, excessive parameter-free operations can largely reduce the model's representational capability and result in low test accuracy. As shown in Table 4, we find a similar phenomenon when searching with DARTS on the 4 simplified search spaces of Section 5.4. The proportion of parameter-free operations even reaches 100% on S1 and S3, and DARTS cannot distinguish the harmful noise operation on S4. PC-DARTS achieves some improvement, but not enough, since noise still appears. DARTS-ES is effective on S2 and S4 but fails on S3, where all operations found are skip connections. We do not show R-DARTS(DP) and R-DARTS(L2) here because their discovered cells have not been released. In comparison, both SDARTS-RS and SDARTS-ADV succeed in controlling the proportion of parameter-free operations on all search spaces.
Shu et al. (2020) demonstrate, from both empirical and theoretical perspectives, that DARTS tends to favor wide and shallow cells, since they often have smoother loss landscapes and faster convergence. However, these cells may not generalize better than their narrower and deeper variants (Shu et al., 2020). Following their definitions (supposing every intermediate node has the same width; detailed definitions are given in Appendix 7.4), the best cell generated by our methods on the standard CNN space (Section 5.2) has width 3 and depth 4. In contrast, ENAS has width 5 and depth 2, DARTS has width 3.5 and depth 3, and PC-DARTS has width 4 and depth 2. Consequently, we succeed in mitigating the bias in connection patterns.
We introduce SmoothDARTS (SDARTS), a perturbation-based regularization that improves the stability and generalizability of differentiable architecture search. Specifically, the regularization is carried out via random smoothing or adversarial attack. SDARTS possesses a much smoother landscape and has a theoretical guarantee of regularizing the Hessian norm of the validation loss. Extensive experiments illustrate the effectiveness of SDARTS, and it outperforms various other regularization techniques.
DeVries, T. and Taylor, G. W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552. Cited by: §7.2.
Dong, X. and Yang, Y. (2019). Searching for a robust neural architecture in four GPU hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1761–1770. Cited by: §2.1, Table 1, Table 2, 3rd item.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. In ICLR. Cited by: §3.2.

ENAS (Pham et al., 2018) first trains the shared parameters of a one-shot network. In the search phase, it samples subnetworks and uses the validation error as the reward signal to update an RNN controller following the REINFORCE (Williams, 1992) rule. Finally, it samples several architectures guided by the trained controller and derives the one with the highest validation accuracy.
DARTS (Liu et al., 2019) builds a mixture architecture similar to ENAS. The difference is that it relaxes the discrete architecture space into a continuous and differentiable representation by assigning a weight α_o to every operation o. The network weight w and α are then updated via gradient descent alternately, on the training set and the validation set respectively. For evaluation, DARTS prunes out all operations except the one with the largest α_o on every edge, which yields the final architecture.
NASP (Yao et al., 2020) is another modification of DARTS, based on a proximal algorithm. A discrete version ᾱ of the architecture weight is computed every search epoch by applying a proximal operation to the continuous α. The gradient of ᾱ is then used to update the corresponding α after backpropagation.
PC-DARTS (Xu et al., 2020) evaluates only a random proportion of the channels. This partial channel connection not only accelerates the search but also serves as a regularization that controls the bias towards parameter-free operations, as explained by the authors.
R-DARTS(DP) (Zela et al., 2020a) runs DARTS with different intensities of ScheduledDropPath regularization (Zoph et al., 2018) and picks the final architecture according to performance on the validation set. In ScheduledDropPath, each path in the cell is dropped out with a probability that increases linearly over the training procedure.

R-DARTS(L2) (Zela et al., 2020a) runs DARTS with different amounts of L2 regularization and selects the final architecture in the same way as R-DARTS(DP). Specifically, the L2 regularization is applied to the inner loop (i.e., the network weight w) of the bilevel optimization problem.

DARTS-ES (Zela et al., 2020a) early-stops the search procedure of DARTS if the increase of λ_max (the dominant eigenvalue of the Hessian ∇²_α L_val) exceeds a threshold. This prevents λ_max, which is highly correlated with the stability and generalizability of DARTS, from exploding.
For the search phase, we train the mixture architecture for 50 epochs, with the 50K CIFAR-10 training images equally split into a training and a validation set. Following (Liu et al., 2019), the network weight w is optimized on the training set by an SGD optimizer with momentum 0.9 and weight decay 3e-4, where the learning rate is annealed from 0.025 to 1e-3 following a cosine schedule. Meanwhile, we use an Adam optimizer with learning rate 3e-4 and weight decay 1e-3 to learn the architecture weight α on the validation set. For the evaluation phase, the macro structure consists of 20 cells and the initial number of channels is set to 36. We train the final architecture for 600 epochs using the SGD optimizer with a learning rate cosine-scheduled from 0.025 to 0, a momentum of 0.9, and a weight decay of 3e-4. The drop probability of ScheduledDropPath increases linearly from 0 to 0.2, and an auxiliary tower (Zoph and Le, 2017) is employed with a weight of 0.4. We also use Cutout (DeVries and Taylor, 2017) as the data augmentation technique and report the result (mean ± std) of 4 independent runs with different random seeds.
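For reference, the cosine learning-rate schedule described above can be written as a small helper. The function name is ours; the epoch count and endpoints match the search-phase setting quoted above (0.025 annealed to 1e-3 over 50 epochs).

```python
import math

def cosine_lr(epoch, total_epochs=50, lr_max=0.025, lr_min=1e-3):
    """Cosine-annealed learning rate: lr_max at epoch 0, lr_min at the end."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1.0 + math.cos(math.pi * epoch / total_epochs)
    )
```

The evaluation phase uses the same shape with lr_min = 0 over 600 epochs.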
The first space S1 contains 2 popular operators per edge, as shown in Figure 6; S2 restricts the set of candidate operations on every edge to {3 × 3 separable convolution, skip connection}; the operation set in S3 is {3 × 3 separable convolution, skip connection, zero}; and S4 simplifies the set to {3 × 3 separable convolution, noise}.
Specifically, the depth of a cell is the number of connections on the longest path from the input nodes to the output node, while the width of a cell is computed by summing the widths of all intermediate nodes that are directly connected to the input nodes. The width of a node is defined as the number of channels for convolutions and the feature dimension for linear operations (in (Shu et al., 2020), all intermediate nodes are assumed to have the same width for simplicity). In particular, if an intermediate node is only partially connected to the input nodes (i.e., it also has connections to other intermediate nodes), its contribution is deducted by the fraction of intermediate predecessors it is connected to when computing the cell width.
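Under one reading of these definitions, and assuming unit node width, depth and width can be computed directly from the cell DAG. The connectivity in `edges` below is a hypothetical example, and the partial-connection deduction follows our interpretation of the text above.

```python
# Toy cell DAG: "in0"/"in1" are input nodes, n0..n3 intermediate, "out" output.
# edges[v] lists the predecessors of v (hypothetical connectivity).
edges = {
    "n0": ["in0", "in1"],
    "n1": ["in0", "n0"],
    "n2": ["n0", "n1"],
    "n3": ["in1", "n2"],
    "out": ["n0", "n1", "n2", "n3"],
}
inputs = {"in0", "in1"}

def depth(node):
    """Number of connections on the longest path from an input to `node`."""
    if node in inputs:
        return 0
    return 1 + max(depth(p) for p in edges[node])

def width(c=1.0):
    """Sum of widths of intermediate nodes touching the inputs (node width c);
    a partially connected node counts by its fraction of input predecessors."""
    total = 0.0
    for node, preds in edges.items():
        if node == "out":
            continue
        frac_in = sum(p in inputs for p in preds) / len(preds)
        total += c * frac_in
    return total
```

For this example the cell has depth 5 (in0 → n0 → n1 → n2 → n3 → out) and width 2.0 (n0 contributes 1, n1 and n3 contribute 0.5 each), illustrating how fractional widths like the 3.5 reported for DARTS can arise.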