1 Introduction
Recently, Neural Architecture Search (NAS) has attracted a lot of attention for its potential to democratize deep learning. For a practical end-to-end deep learning platform, NAS plays a crucial role in discovering task-specific architectures depending on users' configurations (e.g. dataset, evaluation metric, etc.). Pioneers in this field developed prototypes based on reinforcement learning
nas; amoebanet and Bayesian optimization pnas. These works usually incur large computation overheads, which makes them impractical to use. More recent algorithms significantly reduce the search cost, including one-shot methods enas; oneshot, a continuous relaxation of the space darts, and network morphisms morphisms. In particular, darts proposes a differentiable NAS framework, DARTS, converting the categorical operation selection problem into learning a continuous architecture mixing weight. They formulate a bilevel optimization objective, allowing the architecture search to be performed efficiently by a gradient-based optimizer. While current differentiable NAS methods achieve encouraging results, they still have shortcomings that hinder their real-world application. Firstly, although they view the architecture mixing weight as a learnable parameter that can be directly optimized during search, the derived continuous architecture has no guarantee to perform well when it is stiffly discretized at evaluation time. Several works have cast doubts on the stability and generalization of these differentiable NAS methods smoothdarts; understanding. They discover that directly optimizing the architecture mixing weight is prone to overfitting the validation set and often leads to distorted structures (e.g. the searched architecture is dominated by parameter-free operations). Secondly, there exists a gap between the search and evaluation phases: due to the large memory consumption of differentiable NAS, proxy tasks are usually employed during search, such as using a smaller dataset, or searching with a shallower and narrower network.
In this paper, we propose an effective approach that addresses the aforementioned shortcomings, named Dirichlet Neural Architecture Search (DrNAS). Inspired by the fact that directly optimizing the architecture mixing weight is equivalent to performing point estimation (MLE/MAP) from a probabilistic perspective, we instead formulate differentiable NAS as a distribution learning problem, which naturally induces stochasticity and encourages exploration. Making use of the probability simplex support of Dirichlet samples, DrNAS models the architecture mixing weight as random variables sampled from a parameterized Dirichlet distribution. The Dirichlet objective can thus be optimized efficiently in an end-to-end fashion, by employing pathwise derivative estimators to compute the gradient through the distribution pathwise. A straightforward optimization, however, turns out to be problematic due to the uncontrolled variance of the Dirichlet: too much variance leads to training instability, while too little variance results in insufficient exploration. In light of this, we apply an additional distance constraint directly on the Dirichlet concentration parameter to strike a balance between exploration and exploitation. We further derive a theoretical bound showing that the constrained distributional objective promotes the stability and generalization of architecture search by implicitly controlling the Hessian of the validation error.
Furthermore, to enable a direct search on large-scale tasks, we propose a progressive architecture learning scheme, eliminating the gap between the search and evaluation phases. Based on partial channel connection pcdarts, we maintain a task-specific supernetwork with the same depth and number of channels as the evaluation phase throughout the search. To prevent the loss of information and instability induced by partial connection, we divide the search phase into multiple stages and progressively increase the channel fraction via network transformation net2net. Meanwhile, we prune the operation space according to the learnt distribution to maintain memory efficiency.
We conduct extensive experiments on different datasets and search spaces to demonstrate DrNAS's effectiveness. On the DARTS search space darts, we achieve an average error rate of 2.46% on CIFAR-10, which ranks top amongst NAS methods. Furthermore, DrNAS achieves superior performance on large-scale tasks such as ImageNet: it obtains a top-1/5 error of 23.7%/7.1%, surpassing the previous state-of-the-art (24.0%/7.3%) under the mobile setting. On NAS-Bench-201 nasbench201, we also set new state-of-the-art results on all three datasets with low variance.
2 The Proposed Approach
In this section, we first briefly review differentiable NAS setups and generalize the formulation to motivate distribution learning. We then lay out our proposed DrNAS and describe its optimization in section 2.2. In section 2.3, we provide a generalization result by showing that our method implicitly regularizes the Hessian norm over the architecture parameter. The progressive architecture learning method that enables direct search is then described in section 2.4.
2.1 Preliminaries: Differentiable Architecture Search
Cell-Based Search Space
The cell-based search space is constructed by replications of normal and reduction cells nasnet; darts. A normal cell keeps the spatial resolution while a reduction cell halves it but doubles the number of channels. Every cell is represented by a DAG of nodes and edges, where every node x_i represents a latent representation and every edge (i, j) is associated with an operation o_{(i,j)} (e.g. max pooling or convolution) selected from a predefined candidate space O. The output of a node is the summation of all its input flows, i.e. x_j = Σ_{i<j} o_{(i,j)}(x_i), and the concatenation of intermediate node outputs composes the cell output, where the first two input nodes x_0 and x_1 are fixed to be the outputs of the previous two cells.
Gradient-Based Search via Continuous Relaxation
To enable gradient-based optimization, darts apply a continuous relaxation to the discrete space. Concretely, the information passed from node x_i to node x_j is computed as a weighted sum of all operations along the edge, forming a mixed operation ō_{(i,j)}(x_i) = Σ_{o∈O} θ_o · o(x_i). The operation mixing weight θ is defined over the probability simplex, and its magnitude represents the strength of each operation. The architecture search can therefore be cast as selecting, for each edge, the operation associated with the highest mixing weight. To avoid abuse of terminology, we refer to θ as the architecture/operation mixing weight, and to the concentration parameter β in DrNAS as the architecture parameter, throughout the paper.
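As a concrete illustration, the continuous relaxation can be sketched in a few lines; the candidate operations and names below are illustrative stand-ins, not the actual search space:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_op(x, ops, alpha):
    """Edge output: softmax-weighted sum of all candidate operation outputs."""
    theta = softmax(alpha)  # operation mixing weight, lies on the simplex
    return sum(w * op(x) for w, op in zip(theta, ops))

# Toy candidate space: skip connection, "none" (zero), and a ReLU stand-in.
ops = [lambda x: x,
       lambda x: np.zeros_like(x),
       lambda x: np.maximum(x, 0.0)]
x = np.array([-1.0, 2.0])
alpha = np.zeros(3)          # uniform mixing weight
y = mixed_op(x, ops, alpha)  # -> [-1/3, 4/3]
```

Because the output is a smooth function of `alpha`, gradients of any downstream loss flow back into the mixing weight, which is what makes the search differentiable.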
Bilevel Optimization with Simplex Constraints
With continuous relaxation, the network weight w and the operation mixing weight θ can be jointly optimized by solving a constrained bilevel optimization problem:
(1)   min_θ  L_val(w*(θ), θ)    s.t.   w*(θ) = argmin_w  L_train(w, θ),    θ ∈ Δ^{|O|−1}
where the simplex constraint θ ∈ Δ^{|O|−1} can either be handled explicitly via a Lagrangian function gega, or eliminated by substitution (e.g. θ = softmax(α)) darts. In the next section we describe how this generalized formulation motivates our method.
2.2 Differentiable Architecture Search as Distribution Learning
Learning a Distribution over Operation Mixing Weight
Previous differentiable architecture search methods view the operation mixing weight θ as a learnable parameter that can be directly optimized darts; pcdarts; gega. This has been shown to cause θ to overfit the validation set and thus induce a large generalization error understanding; nasbench1shot1; smoothdarts. We recognize that this treatment is equivalent to performing a point estimation (e.g. MLE/MAP) of θ in the probabilistic view, which is inherently prone to overfitting prml; bayesian. Furthermore, directly optimizing θ incurs no exploration in the search space, and thus causes the search algorithm to commit to suboptimal paths in the DAG that converge faster at the beginning but plateau quickly WideShallow.
Based on these insights, we formulate differentiable architecture search as a distribution learning problem. The operation mixing weight θ is treated as a random variable sampled from a learnable distribution. Formally, let q(θ|β) denote the distribution of θ parameterized by β. The bilevel objective is then given by:
(2)   min_β  E_{θ∼q(θ|β)}[ L_val(w*(β), θ) ]    s.t.   w*(β) = argmin_w  E_{θ∼q(θ|β)}[ L_train(w, θ) ]
Since θ lies on the probability simplex, we select the Dirichlet distribution to model its behavior, i.e. θ ∼ Dir(β), where β represents the Dirichlet concentration parameter. The Dirichlet distribution is a widely used distribution over the probability simplex drvae; lda, and it enjoys several nice properties that enable gradient-based training pathwise.
The concentration parameter β controls the sampling behavior of the Dirichlet distribution and is crucial in balancing exploration and exploitation during the search phase. When β_o < 1 for most o, the Dirichlet tends to produce sparse samples with high variance, reducing training stability; when β_o > 1 for most o, the samples become dense with low variance, leading to insufficient exploration. Therefore, we add a constraint to the objective (2) that restricts the distance between β and a fixed anchor β̂. The constrained objective can be written as:
(3)   min_β  E_{θ∼Dir(β)}[ L_val(w*(β), θ) ]    s.t.   w*(β) = argmin_w  E_{θ∼Dir(β)}[ L_train(w, θ) ],    d(β, β̂) ≤ δ
which can be solved using the penalty method ppo:
(4)   min_β  E_{θ∼Dir(β)}[ L_val(w*(β), θ) ] + λ d(β, β̂)    s.t.   w*(β) = argmin_w  E_{θ∼Dir(β)}[ L_train(w, θ) ]
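A Monte-Carlo sketch of the penalized upper-level objective: sample mixing weights from the Dirichlet, average a validation loss over the samples, and add the distance penalty. The loss function and all names here are hypothetical stand-ins, with a squared ℓ2 distance chosen for the penalty:

```python
import numpy as np

rng = np.random.default_rng(0)

def penalized_objective(beta, beta_hat, lam, val_loss, n_samples=512):
    """Estimate E_{theta ~ Dir(beta)}[L_val(theta)] + lam * d(beta, beta_hat)."""
    thetas = rng.dirichlet(beta, size=n_samples)        # samples on the simplex
    expected_loss = np.mean([val_loss(t) for t in thetas])
    penalty = lam * np.sum((beta - beta_hat) ** 2)      # squared l2 distance
    return expected_loss + penalty

beta_hat = np.ones(4)                                   # anchor concentration
obj = penalized_objective(np.ones(4), beta_hat, lam=0.001,
                          val_loss=lambda t: t.sum())   # stand-in loss
```

With the stand-in loss `t.sum()`, every sample evaluates to 1 and the penalty vanishes at the anchor, so the objective is 1; in the actual method the gradient with respect to `beta` is obtained through the pathwise derivative estimator rather than this black-box Monte-Carlo estimate.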
In section 2.3, we also derive a theoretical bound showing that the constrained Dirichlet NAS formulation (3) additionally promotes stability and generalization of the architecture search by implicitly regularizing the Hessian of validation loss w.r.t. architecture parameters.
Learning Dirichlet Parameters via Pathwise Derivative Estimator
Optimizing objective (4) with gradient-based methods requires backpropagation through the stochastic Dirichlet samples. The commonly used reparameterization trick does not apply to the Dirichlet distribution, so we approximate the gradient of Dirichlet samples via the pathwise derivative estimator pathwise:
(5)   ∂θ_i/∂β_j ≈ − [ 1{i=j} · ∂F(θ_i; β_i, β_0−β_i)/∂β_i + (1 − 1{i=j}) · ∂F(θ_i; β_i, β_0−β_i)/∂(β_0−β_i) ] / f(θ_i; β_i, β_0−β_i)
where F(·; a, b) and f(·; a, b)
denote the CDF and PDF of the Beta distribution respectively,
1{i=j} is the indicator function arising from ∂β_i/∂β_j, and β_0 = Σ_k β_k is the sum of concentrations. F(x; a, b) = I_x(a, b) is the regularized incomplete beta function, whose gradient can be computed by simple numerical approximation. We refer to pathwise for the complete derivations.

Joint Optimization of Model Weight and Architecture Parameter
With the pathwise derivative estimator, the model weight w and concentration β can be jointly optimized with gradient descent. Concretely, we draw a sample θ ∼ Dir(β) for every forward pass, and the gradients can be obtained easily through backpropagation. Following DARTS darts, we approximate w* in the lower-level objective of (4) with one step of gradient descent, and run alternating updates between w and β.

Selecting the Best Architecture
At the end of the search phase, a learnt distribution over the operation mixing weight is obtained. We then select the best operation for each edge as the most likely operation in expectation:
(6)   o*_{(i,j)} = argmax_{o∈O}  E_{θ∼Dir(β)}[ θ_o ]
In the Dirichlet case, the expectation is simply the Dirichlet mean E[θ_o] = β_o / Σ_{o'} β_{o'}. Note that under the distribution learning framework, we are able to sample a wide range of architectures from the learnt distribution. This property alone has many potential uses. For example, in practical settings where both accuracy and latency matter, the learnt distribution can be used to find architectures under resource restrictions in a post-search phase. We leave these extensions to future work.
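The selection rule reduces to an argmax over the Dirichlet mean per edge; a minimal sketch (the operation names and the two-edge concentration matrix are hypothetical):

```python
import numpy as np

def select_ops(betas, op_names):
    """Pick the most likely operation per edge: E[theta_o] = beta_o / sum(beta)."""
    means = betas / betas.sum(axis=1, keepdims=True)   # Dirichlet mean per edge
    return [op_names[i] for i in means.argmax(axis=1)]

betas = np.array([[2.0, 1.0, 1.0],     # one row of concentrations per edge
                  [0.5, 0.5, 3.0]])
ops = ["skip_connect", "max_pool_3x3", "sep_conv_3x3"]
chosen = select_ops(betas, ops)        # -> ['skip_connect', 'sep_conv_3x3']
```

Since the mean is monotone in β_o for a fixed edge, the argmax over the mean equals the argmax over the raw concentrations; normalizing simply makes the quantities interpretable as expected mixing weights.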
2.3 Implicit Regularization on Hessian
It has been observed that the generalization error of differentiable NAS is highly related to the dominant eigenvalue of the Hessian of the validation loss w.r.t. the architecture parameter. Several recent works report that a large dominant eigenvalue of
∇²_α L_val in DARTS results in poor generalization performance understanding; smoothdarts. In the following proposition we derive an approximate lower bound of our objective (3), which demonstrates that our method implicitly controls this Hessian matrix.

Proposition 1
Let d(β, β̂) = ‖β − β̂‖_2 and β̂ = 1 in the bilevel formulation (3). If the Hessian ∇²_z L_val(w*, θ̄) is positive semidefinite, the upper-level objective can be approximately lower-bounded by:
(7)   E_{θ∼Dir(β)}[ L_val(w*, θ) ]  ⪆  L_val(w*, θ̄) + (σ_min²/2) · tr( ∇²_z L_val(w*, θ̄) )
with: θ̄ = softmax(μ) the Dirichlet mean, z the softmax basis variable of the Laplace approximation, and σ_min² = min_i Σ_ii the smallest diagonal entry of the approximating Gaussian covariance, which the constraint d(β, β̂) ≤ δ keeps bounded away from zero.
This proposition is derived from the Laplace approximation to the Dirichlet distribution laplace; topic. The lower bound (7) indicates that minimizing the expected validation loss implicitly controls the trace norm of the Hessian matrix. Empirically, we observe that DrNAS always maintains the dominant eigenvalue of the Hessian at a low level (Appendix 6.2). The detailed proof can be found in Appendix 6.1.
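The mechanism behind this bound can be checked numerically on a toy quadratic loss, where the gap between the expected loss and the loss at the mean equals exactly ½·tr(H·Cov), with H the Hessian and Cov the sampling covariance. This is a toy verification under our own choice of A and β, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic stand-in loss L(theta) = theta^T A theta; its Hessian is 2A (PSD here).
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 3.0]])

beta = np.array([4.0, 2.0, 2.0])
mean = beta / beta.sum()                        # Dirichlet mean
b0 = beta.sum()
# Closed-form Dirichlet covariance: (diag(mean) - mean mean^T) / (b0 + 1).
cov = (np.diag(mean) - np.outer(mean, mean)) / (b0 + 1)

thetas = rng.dirichlet(beta, size=400000)
mc_gap = np.mean(np.einsum('ni,ij,nj->n', thetas, A, thetas)) - mean @ A @ mean
exact_gap = np.trace(A @ cov)                   # = 0.5 * tr(Hessian @ cov)
```

The Monte-Carlo gap matches the closed-form trace term, illustrating how the sampling variance couples the expected loss to the curvature: a larger covariance (bounded below via the constraint on β) means a larger penalty on the Hessian trace.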
2.4 Progressive Architecture Learning
The GPU memory consumption of differentiable NAS methods grows linearly with the size of the operation candidate space. Therefore, they usually resort to an easier proxy task, such as training on a smaller dataset, or searching with fewer layers and channels proxylessnas. For instance, in DARTS darts the architecture search is performed on 8 cells with 16 initial channels, but during evaluation the network has 20 cells and 36 initial channels. Such a gap makes it hard to derive an optimal architecture for the target task proxylessnas.
PC-DARTS pcdarts proposes a partial channel connection to reduce the memory overheads of differentiable NAS, where only a random subset of channels is sent to the mixed operation while the remaining channels bypass it through a shortcut. However, this causes loss of information and makes the selection of operations unstable, since the sampled subsets may vary widely across iterations. This drawback is amplified when combined with the proposed method, since we learn the architecture distribution from Dirichlet samples, which already injects a certain amount of stochasticity. As shown in Table 1, when directly applying partial channel connection with distribution learning, the test accuracy of the searched architecture decreases by over 3% and 18% on CIFAR-10 and CIFAR-100 respectively if we send only 1/8 of the channels to the mixed operation.
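Partial channel connection can be sketched in one dimension as follows; this is a simplified fixed-subset sketch (the actual method samples the subset randomly per iteration, and the names are ours):

```python
import numpy as np

def partial_channel_mixed_op(x, ops, theta, K):
    """Send 1/K of the channels through the mixed operation, bypass the rest."""
    c = x.shape[0] // K
    head, tail = x[:c], x[c:]                      # selected subset vs. bypass
    mixed = sum(w * op(head) for w, op in zip(theta, ops))
    return np.concatenate([mixed, tail])

x = np.arange(8, dtype=float)          # 8 "channels"
ops = [lambda h: h, lambda h: 2.0 * h] # toy candidate operations
theta = np.array([0.5, 0.5])           # mixing weight
y = partial_channel_mixed_op(x, ops, theta, K=4)
```

Only `c = 8 // 4 = 2` channels pass through the mixed operation (memory scales with 1/K of the candidate computations), while the other six are forwarded untouched, which is exactly the source of the information loss discussed above.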
To alleviate the information loss and instability while remaining memory-efficient, we propose a progressive learning scheme that gradually increases the fraction of channels forwarded to the mixed operation, and meanwhile prunes the operation space based on the learnt distribution. We split the search process into consecutive stages and, at the initial stage, construct a task-specific supernetwork with the same depth and number of channels as the evaluation phase. After each stage, we increase the partial channel fraction, so that the supernetwork in the next stage is wider, i.e. has more convolution channels, and in turn preserves more information. This is achieved by enlarging every convolution weight with a random mapping function similar to Net2Net net2net. The mapping function g, with C̃_in > C_in, is defined as
(8)   g(j) = j  if j ≤ C_in,   and   g(j) ∼ Uniform{1, …, C_in}  otherwise,   for j = 1, …, C̃_in
To widen layer l, we replace its convolution weight W^(l) ∈ R^{C_out × C_in × H × W} with a new weight W̃^(l) ∈ R^{C_out × C̃_in × H × W}:
(9)   W̃^(l)_{o, j, h, w} = W^(l)_{o, g(j), h, w}
where C_out, C_in, H, and W denote the number of output channels, the number of input channels, the filter height, and the filter width respectively. Intuitively, we copy W^(l) directly into W̃^(l) and fill the remaining part by random replication as defined in g. Unlike Net2Net, we do not divide by a replication factor here, because the information flow on each edge has the same scale regardless of the partial fraction. After widening the supernetwork, we reduce the operation space by pruning out less important operations according to the Dirichlet concentration parameter learnt in the previous stage, maintaining a consistent memory consumption. As illustrated in Table 1, the proposed progressive architecture learning scheme effectively discovers high-accuracy architectures while retaining a low GPU memory overhead.
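The widening step can be sketched as follows, widening only the input-channel axis for brevity; the function and variable names are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def widen_input_channels(W, new_in):
    """Net2Net-style widening: identity mapping for existing input channels,
    random replication for the new ones, with no rescaling."""
    c_in = W.shape[1]
    g = np.concatenate([np.arange(c_in),                        # g(j) = j for j <= C_in
                        rng.integers(0, c_in, new_in - c_in)])  # random otherwise
    return W[:, g, :, :]

W = rng.standard_normal((8, 4, 3, 3))       # (C_out, C_in, H, W)
W_wide = widen_input_channels(W, new_in=8)  # doubled input channels
```

The first `C_in` slices of the widened weight are the original weight verbatim, and every new slice is a copy of some existing one, so the supernetwork's function is preserved up to the (deliberately omitted) replication rescaling.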
          CIFAR-10   CIFAR-100
1/K = 1     2437       2439
1/K = 2     1583       1583
1/K = 4     1159       1161
1/K = 8      949        949
Ours         949        949

Table 1: GPU memory consumption (MB) of distribution learning with partial channel fraction 1/K, versus our progressive scheme ("Ours"), on CIFAR-10 and CIFAR-100.
3 Discussions and Relationship to Prior Work
Early methods in NAS usually include a full training and evaluation procedure in every iteration as the inner loop to guide the subsequent search nas; nasnet; amoebanet. Consequently, their computational overheads are unacceptable for practical usage, especially on large-scale tasks.
Differentiable NAS
Recently, many works have been proposed to improve the efficiency of NAS enas; morphisms; darts; oneshot. Amongst them, DARTS darts proposes a differentiable NAS framework, which introduces a continuous architecture parameter that relaxes the discrete search space. Despite being efficient, DARTS only optimizes a single point on the simplex in every search epoch, which has no guarantee to generalize well after the discretization during evaluation, so its stability and generalization have been widely challenged randomnas; understanding; smoothdarts. Following DARTS, SNAS snas and GDAS gdas leverage the Gumbel-softmax trick to learn the exact architecture parameter. However, their reparameterization is motivated from a reinforcement learning perspective, which amounts to an approximation with softmax rather than an architecture distribution. Besides, their methods require tuning a temperature schedule hman; mann: GDAS linearly decreases the temperature from 10 to 1, while SNAS anneals it from 1 to 0.03. In comparison, the proposed method automatically learns the architecture distribution without hand-crafted scheduling. BayesNAS BayesNAS applies Bayesian learning to NAS. Specifically, it casts NAS as a model compression problem and uses a Bayesian neural network as the supernetwork, which is difficult to optimize and requires oversimplified approximations. Our method instead models the stochasticity in the architecture mixing weight directly, as it is directly related to the generalization of differentiable NAS algorithms understanding; smoothdarts.

Memory Overhead
When dealing with the large memory consumption of differentiable NAS, previous works mainly sample a few paths during search. For instance, ProxylessNAS proxylessnas employs binary gates and samples two paths in every search epoch. Similarly, GDAS gdas and DSNAS dsnas both enforce a discrete constraint after the Gumbel-softmax reparameterization, i.e. only one path is activated. However, such discretization manifests premature convergence and may harm search stability nasbench1shot1. Our experiments in section 4.3 also empirically demonstrate this phenomenon. As an alternative, PC-DARTS pcdarts proposes a partial channel connection, where only a portion of the channels is sent to the mixed operation. However, partial connection can cause loss of information as shown in section 2.4, and PC-DARTS searches on a shallower network with fewer channels, suffering from the search-evaluation gap. Our solution, by progressively pruning the operation space while widening the network, searches in a task-specific manner and achieves superior accuracy on challenging datasets like ImageNet (+2.8% over BayesNAS, +2.3% over GDAS, +2.0% over DSNAS, +1.2% over ProxylessNAS, and +0.5% over PC-DARTS).
4 Experimental Results
In this section, we evaluate our proposed method DrNAS on two search spaces: the CNN search space of DARTS darts and NAS-Bench-201 nasbench201. For the DARTS space, we conduct experiments on both CIFAR-10 and ImageNet in sections 4.1 and 4.2 respectively. For NAS-Bench-201, we test all three supported datasets (CIFAR-10, CIFAR-100, ImageNet-16-120 imagenet16) in section 4.3.
4.1 Results on CIFAR-10
Architecture Space
For both the search and evaluation phases, we stack 20 cells to compose the network and set the initial channel number to 36. We place the reduction cells at 1/3 and 2/3 of the network depth, and each cell consists of seven nodes. Following previous works darts, the operation space contains 8 choices: 3×3 and 5×5 separable convolution, 3×3 and 5×5 dilated separable convolution, 3×3 max pooling, 3×3 average pooling, skip connection, and none (zero).
Search Settings
We equally divide the 50K training images into two parts: one is used for optimizing the network weights by momentum SGD, and the other for learning the Dirichlet architecture distribution by an Adam optimizer. Since the Dirichlet concentration must be positive, we apply a shifted exponential linear mapping and optimize over the unconstrained parameter β̃ instead. We use the ℓ2 norm to constrain the distance between β and the anchor β̂. β̃ is initialized from a standard Gaussian with scale 0.001, and λ in (4) is set to 0.001. These settings are consistent across all experiments. For progressive architecture learning, the whole search process consists of 2 stages, each with 25 iterations. In the first stage, we set the partial channel parameter K to 6 to fit the supernetwork into a single GTX 1080 Ti GPU with 11GB memory, i.e. only 1/6 of the features are sampled on each edge. In the second stage, we prune half of the candidates and meanwhile widen the network twice, i.e. the operation space size reduces from 8 to 4 and K becomes 3.
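One plausible realization of a shifted exponential linear mapping (a sketch of one natural choice, not necessarily the paper's exact mapping): optimize an unconstrained β̃ and map it through ELU(·) + 1, which is strictly positive and behaves like the identity for large values:

```python
import numpy as np

def elu(x):
    """Exponential linear unit: x for x > 0, exp(x) - 1 otherwise."""
    return np.where(x > 0, x, np.exp(x) - 1.0)

beta_tilde = np.array([-2.0, 0.0, 3.0])   # unconstrained parameter
beta = elu(beta_tilde) + 1.0              # strictly positive concentration
```

Since ELU is bounded below by -1, the shift by 1 guarantees β > 0 everywhere, while β̃ = 0 maps to the uniform concentration β = 1, matching the small-scale Gaussian initialization around the anchor.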
Retrain Settings
The evaluation phase uses the entire 50K training set to train the network from scratch for 600 epochs. The network weight is optimized by an SGD optimizer with a cosine-annealed learning rate initialized to 0.025, a momentum of 0.9, and weight decay. To allow a fair comparison with previous works, we also employ cutout regularization with length 16, DropPath nasnet with probability 0.3, and an auxiliary tower with weight 0.4.
Results
Table 2 summarizes the performance of DrNAS compared with other popular NAS methods, and we also visualize the searched cells in appendix 6.2. DrNAS achieves a test error of 2.46%, ranking among the top of recent NAS results. ProxylessNAS is the only method that achieves a lower test error than ours, but it searches on a different space with a much longer search time and yields a larger model. We also perform experiments to assign proper credit to the two parts of our proposed algorithm, i.e. the Dirichlet architecture distribution and the progressive learning scheme. When searching on a proxy task with 8 stacked cells and 16 initial channels, as is the convention darts; pcdarts, we achieve a test error of 2.54%, which surpasses most baselines. Our progressive learning algorithm eliminates the gap between the proxy and target tasks, which further reduces the test error. Consequently, both parts contribute substantially to our performance gains.
Architecture  |  Test Error (%)  |  Params (M)  |  Search Cost (GPU days)  |  Method
DenseNet-BC densenet^{⋆}  |  3.46  |  25.6  |  -  |  manual
NASNet-A nasnet  |  2.65  |  3.3  |  2000  |  RL
AmoebaNet-A amoebanet  |  -  |  3.2  |  3150  |  evolution
AmoebaNet-B amoebanet  |  -  |  2.8  |  3150  |  evolution
PNAS pnas^{⋆}  |  -  |  3.2  |  225  |  SMBO
ENAS enas  |  2.89  |  4.6  |  0.5  |  RL
DARTS (1st) darts  |  -  |  3.3  |  0.4  |  gradient
DARTS (2nd) darts  |  -  |  3.3  |  1.0  |  gradient
SNAS (moderate) snas  |  -  |  2.8  |  1.5  |  gradient
GDAS gdas  |  2.93  |  3.4  |  0.3  |  gradient
BayesNAS BayesNAS  |  -  |  3.4  |  0.2  |  gradient
ProxylessNAS proxylessnas^{†}  |  2.08  |  5.7  |  4.0  |  gradient
P-DARTS pdarts  |  2.50  |  3.4  |  0.3  |  gradient
PC-DARTS pcdarts  |  -  |  3.6  |  0.1  |  gradient
SDARTS-ADV smoothdarts  |  -  |  3.3  |  1.3  |  gradient
GAEA + PC-DARTS gaea  |  -  |  3.7  |  0.1  |  gradient
DrNAS (without progressive learning)  |  2.54  |  4.0  |  0.4^{‡}  |  gradient
DrNAS  |  2.46  |  4.1  |  0.6^{‡}  |  gradient

^{⋆} Obtained without cutout augmentation.
^{†} Obtained on a different space with PyramidNet pyramidnet as the backbone.
^{‡} Recorded on a single GTX 1080 Ti GPU.

Table 2: Comparison with state-of-the-art image classifiers on CIFAR-10.
4.2 Results on ImageNet
Architecture Space
The network architecture for ImageNet differs slightly from that for CIFAR-10: we stack 14 cells and set the initial channel number to 48. Following previous works pcdarts; pdarts, we also first downscale the spatial resolution from 224×224 to 28×28 with three convolution layers of stride 2. The other settings remain the same as in section 4.1.

Search Settings
Following PC-DARTS pcdarts, we randomly sample 10% and 2.5% of the images from the 1.3M training set to alternately learn the network weight and the Dirichlet architecture distribution, by a momentum SGD and an Adam optimizer respectively. We use 8 RTX 2080 Ti GPUs for both search and evaluation, and the progressive pruning setup is the same as on CIFAR-10, i.e. 2 stages with the operation space size shrinking from 8 to 4, and the partial channel parameter K reducing from 6 to 3.
Retrain Settings
For architecture evaluation, we train the network for 250 epochs with an SGD optimizer with a momentum of 0.9, weight decay, and a linearly decayed learning rate initialized to 0.5. We also use label smoothing and an auxiliary tower with weight 0.4 during training. Learning rate warmup is employed for the first 5 epochs, following previous works pdarts; pcdarts.
Results
As shown in Table 3, we achieve a top-1/5 test error of 23.7%/7.1%, outperforming all compared baselines and achieving state-of-the-art performance in the ImageNet mobile setting. The searched cells are visualized in appendix 6.2. Similar to section 4.1, we also report the result achieved with 8 cells and 16 initial channels, a common setup for the proxy task on ImageNet pcdarts. The obtained 24.2% top-1 error is already highly competitive, which demonstrates the effectiveness of architecture distribution learning on large-scale tasks. Our progressive learning scheme then further improves the top-1/5 error by 0.5%/0.2%. Therefore, learning in a task-specific manner is essential for discovering better architectures.
Architecture  |  Top-1 Error (%)  |  Top-5 Error (%)  |  Params (M)  |  Search Cost (GPU days)  |  Method
Inception-v1 inceptionv1  |  30.1  |  10.1  |  6.6  |  -  |  manual
MobileNet mobilenets  |  29.4  |  10.5  |  4.2  |  -  |  manual
ShuffleNet (v1) shufflenetv1  |  26.4  |  10.2  |  -  |  -  |  manual
ShuffleNet (v2) shufflenetv2  |  25.1  |  -  |  -  |  -  |  manual
NASNet-A nasnet  |  26.0  |  8.4  |  5.3  |  2000  |  RL
AmoebaNet-C amoebanet  |  24.3  |  7.6  |  6.4  |  3150  |  evolution
PNAS pnas  |  25.8  |  8.1  |  5.1  |  225  |  SMBO
MnasNet-92 mnasnet  |  25.2  |  8.0  |  4.4  |  -  |  RL
DARTS (2nd) darts  |  26.7  |  8.7  |  4.7  |  4.0  |  gradient
SNAS (mild) snas  |  27.3  |  9.2  |  4.3  |  1.5  |  gradient
GDAS gdas  |  26.0  |  8.5  |  5.3  |  0.3  |  gradient
BayesNAS BayesNAS  |  26.5  |  8.9  |  3.9  |  0.2  |  gradient
DSNAS dsnas^{†}  |  25.7  |  8.1  |  -  |  -  |  gradient
ProxylessNAS (GPU) proxylessnas^{†}  |  24.9  |  7.5  |  7.1  |  8.3  |  gradient
P-DARTS (CIFAR-10) pdarts  |  24.4  |  7.4  |  4.9  |  0.3  |  gradient
P-DARTS (CIFAR-100) pdarts  |  24.7  |  7.5  |  5.1  |  0.3  |  gradient
PC-DARTS (CIFAR-10) pcdarts  |  25.1  |  7.8  |  5.3  |  0.1  |  gradient
PC-DARTS (ImageNet) pcdarts^{†}  |  24.2  |  7.3  |  5.3  |  3.8  |  gradient
GAEA + PC-DARTS gaea^{†}  |  24.0  |  7.3  |  5.6  |  3.8  |  gradient
DrNAS (without progressive learning)^{†}  |  24.2  |  7.3  |  5.2  |  3.9  |  gradient
DrNAS^{†}  |  23.7  |  7.1  |  5.7  |  4.6  |  gradient

^{†} The architecture is searched on ImageNet; otherwise it is searched on CIFAR-10 or CIFAR-100.

Table 3: Comparison with state-of-the-art image classifiers on ImageNet under the mobile setting.
4.3 Results on NAS-Bench-201
Recently, some researchers have questioned whether the expert knowledge applied in the evaluation protocol, rather than the search algorithm itself, accounts for the impressive results achieved by leading NAS methods HardEvaluation; randomnas. To further verify the effectiveness of DrNAS, we therefore perform experiments on NAS-Bench-201 nasbench201, where architecture performance can be obtained directly by querying the database. NAS-Bench-201 supports three datasets (CIFAR-10, CIFAR-100, ImageNet-16-120 imagenet16) and has a unified cell-based search space containing 15,625 architectures. We refer to the paper nasbench201 for details of the space. Our experiments are performed in a task-specific manner, i.e. the search and evaluation are based on the same dataset. The hyperparameters of all compared methods are set to their defaults, and for DrNAS we use the same search settings as in section 4.1. We run every method 4 independent times with different random seeds and report the mean and standard deviation in Table 4.

As shown, we achieve the best accuracy on all three datasets. On CIFAR-100, we even reach the global optimum. Specifically, DrNAS outperforms DARTS darts, GDAS gdas, DSNAS dsnas, PC-DARTS pcdarts, and SNAS snas by 103.8%, 35.9%, 30.4%, 6.4%, and 4.3% on average. We notice that the two methods (GDAS and DSNAS) that enforce a discrete constraint, i.e. sample only a single path in every search iteration, perform undesirably, especially on CIFAR-100. In comparison, SNAS, which employs a similar Gumbel-softmax trick but without the discretization, performs much better. Consequently, a discrete constraint during search can reduce GPU memory consumption but empirically suffers from instability. By contrast, we develop the progressive learning scheme on top of architecture distribution learning, enjoying both memory efficiency and strong search performance.
Method  |  CIFAR-10 (validation / test)  |  CIFAR-100 (validation / test)  |  ImageNet-16-120 (validation / test)
ResNet resnet  |  90.83 / 93.97  |  70.42 / 70.86  |  44.53 / 43.63
Random (baseline)  |  -  |  -  |  -
RSPS randomnas  |  -  |  -  |  -
Reinforce nasnet  |  -  |  -  |  -
ENAS enas  |  -  |  -  |  -
DARTS (1st) darts  |  -  |  -  |  -
DARTS (2nd) darts  |  -  |  -  |  -
GDAS gdas  |  -  |  -  |  -
SNAS snas  |  -  |  -  |  -
DSNAS dsnas  |  -  |  -  |  -
PC-DARTS pcdarts  |  -  |  -  |  -
DrNAS  |  -  |  -  |  -
optimal  |  91.61 / 94.37  |  73.49 / 73.51  |  46.77 / 47.31

Table 4: Accuracy (%) on NAS-Bench-201, mean and standard deviation over 4 runs.
5 Conclusion
In this paper, we propose Dirichlet Neural Architecture Search (DrNAS). We formulate differentiable NAS as a constrained distribution learning problem, which explicitly models the stochasticity in the architecture mixing weight and balances exploration and exploitation in the search space. The proposed method can be optimized efficiently via gradient-based algorithms, and possesses theoretical benefits for improving generalization. Furthermore, we propose a progressive learning scheme to eliminate the gap between search and evaluation. DrNAS consistently achieves strong performance across several image classification tasks, revealing its potential to play a crucial role in future end-to-end deep learning platforms.
References
6 Appendix
6.1 Proof of Proposition 1
Preliminaries:
Before the development of the pathwise derivative estimator, the Laplace approximation with a softmax basis was extensively used to approximate the Dirichlet distribution laplace; topic. The approximate distribution is:
(10)   Dir(θ(z) | β) ≈ N(z | μ, Σ) · g(z),   θ = softmax(z)
where z is the softmax basis variable with θ = softmax(z),
z follows a multivariate normal distribution, and
g(z) is an arbitrary density that ensures integrability topic. The mean and diagonal covariance matrix of z depend on the Dirichlet concentration parameter β:

(11)   μ_i = log β_i − (1/K) Σ_{k=1}^K log β_k,    Σ_ii = (1/β_i)(1 − 2/K) + (1/K²) Σ_{k=1}^K 1/β_k
It can be obtained directly from (11) that softmax(μ) equals the Dirichlet mean β / Σ_k β_k. Sampling from the approximate distribution can be done by first sampling z from N(μ, Σ) and then applying the softmax function to obtain θ. We will leverage the fact that this approximation supports explicit reparameterization to derive our proof.
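The approximation can be checked numerically: sampling z from the Gaussian with the moments above and pushing the samples through the softmax yields a distribution whose mean is close to the exact Dirichlet mean. This is a toy check with our own variable names and a diagonal covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

beta = np.array([4.0, 2.0, 2.0])
K = len(beta)
# Mean and diagonal covariance of the softmax-basis Gaussian:
mu = np.log(beta) - np.log(beta).mean()
var = (1.0 / beta) * (1.0 - 2.0 / K) + np.sum(1.0 / beta) / K**2

z = mu + np.sqrt(var) * rng.standard_normal((200000, K))
approx_mean = softmax(z).mean(axis=0)   # mean of softmax-Gaussian samples
exact_mean = beta / beta.sum()          # exact Dirichlet mean
```

Note that softmax(μ) recovers the Dirichlet mean exactly, while the Monte-Carlo mean of the pushed-through samples deviates slightly because the softmax is nonlinear; the deviation shrinks as the concentrations grow and the covariance contracts.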
Proof:
Applying the above Laplace approximation to the Dirichlet distribution, the unconstrained upper-level objective in (3) can be written as:
(12)   E_{θ∼Dir(β)}[ L_val(w*, θ) ] ≈ E_{z∼N(μ,Σ)}[ L_val(w*, softmax(z)) ]
(13)   ≈ E_{z∼N(μ,Σ)}[ L_val(w*, θ̄) + (z−μ)ᵀ ∇_z L_val + ½ (z−μ)ᵀ ∇²_z L_val (z−μ) ]
(14)   = L_val(w*, θ̄) + ½ E_{z∼N(μ,Σ)}[ (z−μ)ᵀ ∇²_z L_val (z−μ) ]
(15)   = L_val(w*, θ̄) + ½ tr( Σ ∇²_z L_val )
(16)   ≥ L_val(w*, θ̄) + ½ (min_i Σ_ii) tr( ∇²_z L_val )
(17)   = L_val(w*, θ̄) + (σ_min²/2) tr( ∇²_z L_val )

where (12) uses the Laplace approximation with θ̄ = softmax(μ), (13) is a second-order Taylor expansion of L_val around μ, the first-order term vanishes in (14) since E[z−μ] = 0, and (16) uses that Σ is diagonal and ∇²_z L_val is positive semidefinite.
In our full objective, we constrain the Euclidean distance between the learnt Dirichlet concentration β and the fixed prior concentration β̂ = 1, i.e. ‖β − β̂‖_2 ≤ δ. The diagonal covariance of the approximating softmax Gaussian can then be bounded as:
(18)   Σ_ii = (1/β_i)(1 − 2/K) + (1/K²) Σ_{k=1}^K 1/β_k
(19)   ≥ (1/(1+δ))(1 − 2/K) + 1/(K(1+δ)) = (1 − 1/K)/(1+δ) > 0,

since the constraint implies β_i ≤ 1 + δ for every i. Hence σ_min² is bounded away from zero, and minimizing the expected validation loss in (3) implicitly penalizes tr(∇²_z L_val), which completes the proof.
6.2 Searched Architectures
We visualize the searched normal and reduction cells in figures 1 and 2, which are searched directly on CIFAR-10 and ImageNet respectively.
6.3 Trajectory of the Hessian Norm
We track the anytime Hessian norm on NAS-Bench-201 in figure 3. The result is averaged over 4 independent runs. We observe that the largest eigenvalue grows about 10 times when searching with DARTS for 100 epochs. In comparison, DrNAS always keeps the Hessian norm at a low level, in agreement with our theoretical analysis in section 2.3.