Recently, Neural Architecture Search (NAS) has attracted considerable attention for its potential to democratize deep learning. For a practical end-to-end deep learning platform, NAS plays a crucial role in discovering task-specific architectures that depend on users’ configurations (e.g. dataset, evaluation metric, etc.). Pioneers in this field developed prototypes based on reinforcement learning nas, evolutionary algorithms amoebanet, and Bayesian optimization pnas. These works usually incur large computational overheads, which makes them impractical to use. More recent algorithms significantly reduce the search cost, including one-shot methods enas; oneshot, a continuous relaxation of the space darts, and network morphisms morphisms. In particular, darts proposes a differentiable NAS framework, DARTS, which converts the categorical operation selection problem into learning a continuous architecture mixing weight. They formulate a bi-level optimization objective that allows the architecture search to be performed efficiently by a gradient-based optimizer.
While current differentiable NAS methods achieve encouraging results, they still have shortcomings that hinder their real-world application. Firstly, although they view the architecture mixing weight as a learnable parameter that can be directly optimized during search, the derived continuous architecture is not guaranteed to perform well when it is rigidly discretized for evaluation. Several works have cast doubt on the stability and generalization of these differentiable NAS methods smoothdarts; understanding. They discover that directly optimizing the architecture mixing weight is prone to overfitting the validation set and often leads to distorted structures (e.g. the searched architecture is dominated by parameter-free operations). Secondly, there exists a gap between the search and evaluation phases. Due to the large memory consumption of differentiable NAS, proxy tasks are usually employed during search, such as using a smaller dataset or searching with a shallower and narrower network.
In this paper, we propose an effective approach, named Dirichlet Neural Architecture Search (DrNAS), that addresses the aforementioned shortcomings. Inspired by the fact that directly optimizing the architecture mixing weight is equivalent to performing point estimation (MLE/MAP) from a probabilistic perspective, we instead formulate differentiable NAS as a distribution learning problem, which naturally induces stochasticity and encourages exploration. Making use of the probability simplex property of Dirichlet samples, DrNAS models the architecture mixing weight as random variables sampled from a parameterized Dirichlet distribution. Optimizing the Dirichlet objective can thus be done efficiently in an end-to-end fashion, by employing pathwise derivative estimators to compute the gradient of the distribution pathwise. A straightforward optimization, however, turns out to be problematic due to the uncontrolled variance of the Dirichlet: too much variance leads to training instability, while too little variance results in insufficient exploration. In light of that, we apply an additional distance constraint directly on the Dirichlet concentration parameter to strike a balance between exploration and exploitation. We further derive a theoretical bound showing that the constrained distributional objective promotes stability and generalization of architecture search by implicitly controlling the Hessian of the validation error.
Furthermore, to enable a direct search on large-scale tasks, we propose a progressive architecture learning scheme, eliminating the gap between the search and evaluation phases. Based on partial channel connection pcdarts, we maintain a task-specific super-network of the same depth and number of channels as the evaluation phase throughout searching. To prevent loss of information and instability induced by partial connection, we divide the search phase into multiple stages and progressively increase the channel fraction via network transformation net2net. Meanwhile, we prune the operation space according to the learnt distribution to maintain the memory efficiency.
We conduct extensive experiments on different datasets and search spaces to demonstrate DrNAS’s effectiveness. Based on the DARTS search space darts, we achieve an average error rate of 2.46% on CIFAR-10, which ranks top amongst NAS methods. Furthermore, DrNAS achieves superior performance on large-scale tasks such as ImageNet. It obtains a top-1/5 error of 23.7%/7.1%, surpassing the previous state-of-the-art (24.0%/7.3%) under the mobile setting. On NAS-Bench-201 nasbench201, we also set new state-of-the-art results on all three datasets with low variance.
2 The Proposed Approach
In this section, we first briefly review differentiable NAS setups and generalize the formulation to motivate distribution learning. We then lay out our proposed DrNAS and describe its optimization in section 2.2. In section 2.3, we provide a generalization result by showing that our method implicitly regularizes the Hessian norm over the architecture parameter. The progressive architecture learning method that enables direct search is then described in section 2.4.
2.1 Preliminaries: Differentiable Architecture Search
Cell-Based Search Space
The cell-based search space is constructed by replications of normal and reduction cells nasnet; darts. A normal cell keeps the spatial resolution while a reduction cell halves it but doubles the number of channels. Every cell is represented by a DAG with $N$ nodes and $E$ edges, where every node represents a latent representation $x^{i}$ and every edge is associated with an operation $o^{(i,j)}$ (e.g. max pooling or convolution) selected from a predefined candidate space $\mathcal{O}$. The output of a node is a summation of all input flows, i.e. $x^{j} = \sum_{i<j} o^{(i,j)}(x^{i})$, and a concatenation of intermediate node outputs, i.e. $\mathrm{concat}(x^{2}, \dots, x^{N-1})$, composes the cell output, where the first two input nodes $x^{0}$ and $x^{1}$ are fixed to be the outputs of the previous two cells.
Gradient-Based Search via Continuous Relaxation
To enable gradient-based optimization, darts applies a continuous relaxation to the discrete space. Concretely, the information passed from node $i$ to node $j$ is computed by a weighted sum of all operations along the edge, forming a mixed-operation $\hat{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \theta_o^{(i,j)} o(x)$. The operation mixing weight $\theta^{(i,j)}$ is defined over the probability simplex and its magnitude represents the strength of each operation. Therefore, the architecture search can be cast as selecting the operation associated with the highest mixing weight for each edge. To avoid abuse of terminology, we refer to $\theta$ as the architecture/operation mixing weight, and to the concentration parameter $\beta$ in DrNAS as the architecture parameter throughout the paper.
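The relaxation and the final discretization step can be illustrated with a minimal sketch. The candidate operations below are toy scalar functions standing in for convolutions and pooling, and the names `CANDIDATES`, `mixed_op` and `discretize` are hypothetical, not taken from DARTS.

```python
# Toy candidate space: scalar functions standing in for real operations.
CANDIDATES = {
    "skip": lambda x: x,
    "double": lambda x: 2.0 * x,   # hypothetical stand-in for a convolution
    "zero": lambda x: 0.0,         # the "none" operation
}

def mixed_op(x, theta):
    """Weighted sum of all candidate operations on one edge."""
    assert abs(sum(theta.values()) - 1.0) < 1e-9, "theta must lie on the simplex"
    return sum(theta[name] * op(x) for name, op in CANDIDATES.items())

def discretize(theta):
    """Keep only the operation with the largest mixing weight."""
    return max(theta, key=theta.get)

theta = {"skip": 0.2, "double": 0.7, "zero": 0.1}
y = mixed_op(3.0, theta)     # 0.2 * 3 + 0.7 * 6 + 0.1 * 0
best = discretize(theta)     # the edge would keep "double"
```

During search every candidate contributes to the edge output, whereas after discretization only the selected operation remains, which is where the gap between the continuous and the discrete architecture comes from.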
Bilevel-Optimization with Simplex Constraints
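With continuous relaxation, the network weight $w$ and the operation mixing weight $\theta$ can be jointly optimized by solving a constrained bi-level optimization problem:
$$\min_{\theta}\ \mathcal{L}_{val}(w^*, \theta) \quad \text{s.t.} \quad w^* = \arg\min_{w} \mathcal{L}_{train}(w, \theta), \quad \sum_{o=1}^{|\mathcal{O}|} \theta_o^{(i,j)} = 1,\ \forall\, (i,j),\ i<j \tag{1}$$
where the simplex constraint can be either solved explicitly via a Lagrangian function gega, or eliminated by a substitution method (e.g. $\theta = \mathrm{Softmax}(\alpha)$ with unconstrained $\alpha$) darts. In the next section we describe how this generalized formulation motivates our method.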
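The substitution route can be sketched as follows: an unconstrained vector is pushed through a softmax, so the simplex constraint holds by construction. The variable name `alpha` and its values are illustrative assumptions.

```python
import math

def softmax(alpha):
    """Map unconstrained reals to a point on the probability simplex."""
    m = max(alpha)                        # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]

alpha = [1.5, -0.3, 0.0, 2.0]   # arbitrary unconstrained values
theta = softmax(alpha)          # non-negative entries that sum to one
```

A gradient step on `alpha` can then move freely in the reals while `theta` never leaves the simplex.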
2.2 Differentiable Architecture Search as Distribution Learning
Learning a Distribution over Operation Mixing Weight
Previous differentiable architecture search methods view the operation mixing weight $\theta$ as a learnable parameter that can be directly optimized darts; pcdarts; gega. This has been shown to cause $\theta$ to overfit the validation set and thus induce a large generalization error understanding; nasbench1shot1; smoothdarts. We recognize that this treatment is equivalent to performing point estimation (e.g. MLE/MAP) of $\theta$ from a probabilistic view, which is inherently prone to overfitting prml; bayesian. Furthermore, directly optimizing $\theta$ incurs no exploration in the search space, and thus causes the search algorithm to commit to suboptimal paths in the DAG that converge faster at the beginning but plateau quickly WideShallow.
Based on these insights, we formulate differentiable architecture search as a distribution learning problem. The operation mixing weight $\theta$ is treated as random variables sampled from a learnable distribution. Formally, let $q(\theta|\beta)$ denote the distribution of $\theta$ parameterized by $\beta$. The bi-level objective is then given by:
$$\min_{\beta}\ \mathbb{E}_{q(\theta|\beta)}\big[\mathcal{L}_{val}(w^*, \theta)\big] \quad \text{s.t.} \quad w^* = \arg\min_{w} \mathcal{L}_{train}(w, \theta) \tag{2}$$
Since $\theta$ lies on the probability simplex, we select the Dirichlet distribution to model its behavior, i.e. $q(\theta|\beta) \sim \mathrm{Dir}(\beta)$, where $\beta$ represents the Dirichlet concentration parameter. The Dirichlet distribution is a widely used distribution over the probability simplex drvae; lda, and it enjoys several nice properties that enable gradient-based training pathwise.
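The concentration parameter $\beta$ controls the sampling behavior of the Dirichlet distribution and is crucial in balancing exploration and exploitation during the search phase. When $\beta_o \ll 1$ for most $o = 1, \dots, |\mathcal{O}|$, the Dirichlet tends to produce sparse samples with high variance, reducing training stability; when $\beta_o \gg 1$ for most $o$, the samples will be dense with low variance, leading to insufficient exploration. Therefore, we add a constraint to objective (2) to restrict the distance between $\beta$ and the anchor $\hat{\beta} = 1$. The constrained objective can be written as:
$$\min_{\beta}\ \mathbb{E}_{q(\theta|\beta)}\big[\mathcal{L}_{val}(w^*, \theta)\big] \quad \text{s.t.} \quad w^* = \arg\min_{w} \mathcal{L}_{train}(w, \theta), \quad d(\beta, \hat{\beta}) \le \delta \tag{3}$$
which can be solved using the penalty method ppo:
$$\min_{\beta}\ \mathbb{E}_{q(\theta|\beta)}\big[\mathcal{L}_{val}(w^*, \theta)\big] + \lambda\, d(\beta, \hat{\beta}) \quad \text{s.t.} \quad w^* = \arg\min_{w} \mathcal{L}_{train}(w, \theta) \tag{4}$$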
In section 2.3, we also derive a theoretical bound showing that the constrained Dirichlet NAS formulation (3) additionally promotes stability and generalization of the architecture search by implicitly regularizing the Hessian of validation loss w.r.t. architecture parameters.
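The effect of the concentration parameter on the samples can be checked empirically. The sketch below uses only the Python standard library, draws Dirichlet samples by normalizing Gamma variates, and reports the average size of the largest coordinate; the helper names, the symmetric concentrations 0.05 and 50, and the number of draws are illustrative assumptions.

```python
import random

def sample_dirichlet(beta, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    g = [rng.gammavariate(b, 1.0) for b in beta]
    total = sum(g)
    return [x / total for x in g]

def mean_max_weight(concentration, num_ops=8, draws=2000, seed=0):
    """Average of the largest coordinate; values near 1 indicate sparse samples."""
    rng = random.Random(seed)
    beta = [concentration] * num_ops
    return sum(max(sample_dirichlet(beta, rng)) for _ in range(draws)) / draws

sparse = mean_max_weight(0.05)   # concentration far below 1
dense = mean_max_weight(50.0)    # concentration far above 1
```

With a small concentration most of the mass of each sample falls on a single operation, whereas with a large concentration the samples stay close to the uniform mixing weight, which matches the exploration and stability trade-off described above.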
Learning Dirichlet Parameters via Pathwise Derivative Estimator
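Optimizing objective (4) with gradient-based methods requires back-propagation through the stochastic nodes of Dirichlet samples. The commonly used reparameterization trick does not apply to the Dirichlet distribution; therefore, we approximate the gradient of Dirichlet samples via pathwise derivative estimators pathwise:
$$\frac{d\theta_i}{d\beta_j} = -\frac{\frac{\partial F_{Beta}}{\partial \beta_j}(\theta_j \mid \beta_j, \beta_{tot} - \beta_j)}{f_{Beta}(\theta_j \mid \beta_j, \beta_{tot} - \beta_j)} \times \left(\frac{\delta_{ij} - \theta_i}{1 - \theta_j}\right), \quad i, j = 1, \dots, |\mathcal{O}| \tag{5}$$
where $F_{Beta}$ and $f_{Beta}$ denote the CDF and PDF of the beta distribution respectively, $\delta_{ij}$ is the indicator function, and $\beta_{tot}$ is the sum of concentrations. $F_{Beta}$ is the regularized incomplete beta function, whose gradient can be computed by simple numerical approximation. We refer to pathwise for the complete derivations.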
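A minimal numerical sketch of such an estimator is given below. It assumes the estimator has the form dθ_i/dβ_j = -(∂F/∂β_j) / f × (δ_ij - θ_i) / (1 - θ_j), where F and f are the CDF and PDF of Beta(β_j, β_tot - β_j) evaluated at θ_j. The CDF is computed with Simpson's rule and its derivative with a central finite difference, which is a simplification chosen for readability rather than the approximation used in pathwise; all function names are hypothetical, and the sketch assumes concentrations above 1 so the integrand stays bounded.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in [0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * x ** (a - 1) * (1 - x) ** (b - 1)

def beta_cdf(x, a, b, steps=2000):
    """Regularized incomplete beta function via composite Simpson's rule.

    Assumes a, b >= 1 so that the integrand is bounded on [0, x].
    """
    h = x / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * beta_pdf(k * h, a, b)
    return total * h / 3

def pathwise_grad(theta, beta, j, eps=1e-4):
    """Return d theta_i / d beta_j for every i."""
    rest = sum(beta) - beta[j]            # beta_tot - beta_j
    # Central finite difference of the CDF w.r.t. its first shape parameter.
    dF = (beta_cdf(theta[j], beta[j] + eps, rest)
          - beta_cdf(theta[j], beta[j] - eps, rest)) / (2 * eps)
    scale = -dF / beta_pdf(theta[j], beta[j], rest)
    return [scale * ((1.0 if i == j else 0.0) - theta[i]) / (1 - theta[j])
            for i in range(len(theta))]

theta = [0.2, 0.3, 0.5]     # a point on the simplex
beta = [2.0, 3.0, 4.0]      # concentrations, all above 1 for this sketch
grad = pathwise_grad(theta, beta, j=0)
```

Two sanity checks follow from the formula: the entries of `grad` sum to zero because the sample must stay on the simplex, and raising a concentration increases its own coordinate while decreasing the others.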
Joint Optimization of Model Weight and Architecture Parameter
With the pathwise derivative estimator, the model weight $w$ and the concentration $\beta$ can be jointly optimized with gradient descent. Concretely, we draw a sample $\theta \sim \mathrm{Dirichlet}(\beta)$ for every forward pass, and the gradients can be obtained easily through backpropagation. Following DARTS darts, we approximate $w^*$ in the lower-level objective of (4) with one step of gradient descent, and alternate updates between $w$ and $\beta$.
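The alternating scheme with a one-step approximation of the lower-level solution can be sketched on a toy problem. This example is deterministic and scalar: `a` stands in for the architecture parameter, `w` for the network weight, the training loss is (w - a)^2 and the validation loss is (w - 2)^2. No Dirichlet sampling is involved, and the losses, step sizes and names are assumptions made for illustration only.

```python
def train_loss_grad_w(w, a):
    """Gradient of the toy training loss (w - a)^2 with respect to w."""
    return 2.0 * (w - a)

def val_loss_grad_w(w):
    """Gradient of the toy validation loss (w - 2)^2 with respect to w."""
    return 2.0 * (w - 2.0)

def search(steps=300, xi=0.1, lr_w=0.1, lr_a=0.5):
    w, a = 0.0, 0.0
    for _ in range(steps):
        # Upper level: one virtual step w' = w - xi * dL_train/dw approximates w*(a).
        w_virtual = w - xi * train_loss_grad_w(w, a)
        dw_da = 2.0 * xi                    # derivative of w' with respect to a
        a -= lr_a * val_loss_grad_w(w_virtual) * dw_da
        # Lower level: ordinary gradient step on the weight with a held fixed.
        w -= lr_w * train_loss_grad_w(w, a)
    return w, a

w_final, a_final = search()
```

Here the exact bi-level solution is `a = 2` with `w = 2`, and the alternating updates approach it even though the lower-level problem is never solved to convergence inside the loop.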
Selecting the Best Architecture
At the end of the search phase, a learnt distribution of the operation mixing weight is obtained. We then select the best operation for each edge as the most likely operation in expectation:
$$o^{(i,j)} = \arg\max_{o \in \mathcal{O}}\ \mathbb{E}_{q(\theta_o^{(i,j)} \mid \beta^{(i,j)})}\big[\theta_o^{(i,j)}\big] \tag{6}$$
In the Dirichlet case, the expectation term is simply the Dirichlet mean $\beta_o^{(i,j)} / \sum_{o'} \beta_{o'}^{(i,j)}$. Note that under the distribution learning framework, we are able to sample a wide range of architectures from the learnt distribution. This property alone has considerable potential. For example, in practical settings where both accuracy and latency are of concern, the learnt distribution can be used to find architectures under resource restrictions in a post-search phase. We leave these extensions to future work.
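The selection rule amounts to an argmax over the Dirichlet mean on every edge. In the sketch below the operation names and the concentration values are made up for illustration.

```python
def dirichlet_mean(beta):
    """Mean of a Dirichlet distribution: each concentration over their sum."""
    total = sum(beta)
    return [b / total for b in beta]

def select_operations(concentrations, op_names):
    """Pick, for each edge, the operation with the largest Dirichlet mean."""
    chosen = {}
    for edge, beta in concentrations.items():
        mean = dirichlet_mean(beta)
        chosen[edge] = op_names[mean.index(max(mean))]
    return chosen

OPS = ["sep_conv_3x3", "max_pool_3x3", "skip_connect"]        # illustrative names
learnt = {(0, 2): [2.5, 0.8, 1.1], (1, 2): [0.6, 0.9, 3.0]}   # made-up values
arch = select_operations(learnt, OPS)
```

Because the mean is a monotone function of the concentration within one edge, the same result is obtained by taking the argmax of the concentrations directly.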
2.3 Implicit Regularization on Hessian
It has been observed that the generalization error of differentiable NAS is highly related to the dominant eigenvalue of the Hessian of the validation loss w.r.t. the architecture parameter. Several recent works report that a large dominant eigenvalue of this Hessian in DARTS results in poor generalization performance understanding; smoothdarts. In the following proposition we derive an approximate lower bound of our objective (3), which demonstrates that our method implicitly controls this Hessian matrix.
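Proposition 1. Let $d(\beta, \hat{\beta}) = \|\beta - \hat{\beta}\|_2 \le \delta$ and $\hat{\beta} = 1$ in the bi-level formulation (3). If $\nabla^2_{\mu} \tilde{\mathcal{L}}_{val}(w^*, \mu)$ is positive semi-definite, the upper-level objective can be approximately bounded by:
$$\mathbb{E}_{\theta}\big[\mathcal{L}_{val}(w^*, \theta)\big] \gtrsim \tilde{\mathcal{L}}_{val}(w^*, \mu) + \frac{1}{2}\left(\frac{1}{1+\delta}\Big(1 - \frac{2}{|\mathcal{O}|}\Big) + \frac{1}{|\mathcal{O}|}\frac{1}{1+\delta}\right) \mathrm{tr}\big(\nabla^2_{\mu} \tilde{\mathcal{L}}_{val}(w^*, \mu)\big) \tag{7}$$
with $\tilde{\mathcal{L}}_{val}(w^*, \mu) = \mathcal{L}_{val}(w^*, \mathrm{Softmax}(\mu))$ and $\mu_o = \log \beta_o - \frac{1}{|\mathcal{O}|}\sum_{o'} \log \beta_{o'}$, $o = 1, \dots, |\mathcal{O}|$.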
This proposition is derived from the Laplace approximation to the Dirichlet distribution laplace; topic. The lower bound (7) indicates that minimizing the expected validation loss controls the trace norm of the Hessian matrix. Empirically, we observe that DrNAS always maintains the dominant eigenvalue of the Hessian at a low level (Appendix 6.3). The detailed proof can be found in Appendix 6.1.
2.4 Progressive Architecture Learning
The GPU memory consumption of differentiable NAS methods grows linearly with the size of the operation candidate space. Therefore, they usually use an easier proxy task such as training with a smaller dataset, or searching with fewer layers and channels proxylessnas. For instance, the architecture search is performed on 8 cells and 16 initial channels in DARTS darts, whereas the network has 20 cells and 36 initial channels during evaluation. Such a gap makes it hard to derive an optimal architecture for the target task proxylessnas.
PC-DARTS pcdarts proposes a partial channel connection to reduce the memory overhead of differentiable NAS, where only a random subset of channels is sent to the mixed-operation while the remaining channels are bypassed through a shortcut. However, this method causes loss of information and makes the selection of operations unstable, since the sampled subsets may vary widely across iterations. This drawback is amplified when combined with the proposed method, since we learn the architecture distribution from Dirichlet samples, which already inject a certain amount of stochasticity. As shown in Table 1, when directly applying partial channel connection with distribution learning, the test accuracy of the searched architecture drops by over 3% and 18% on CIFAR-10 and CIFAR-100 respectively if we send only 1/8 of the channels to the mixed-operation.
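The idea of a partial channel connection can be sketched in a few lines. Each channel is reduced to a single scalar here, `mixed_op` is a placeholder callable, and the function name and the uniform random choice of channels are assumptions for illustration; the actual PC-DARTS implementation differs in its details.

```python
import random

def partial_channel_forward(channels, mixed_op, k, rng):
    """Send 1/k of the channels through mixed_op and bypass the rest."""
    n = len(channels)
    picked = set(rng.sample(range(n), n // k))
    return [mixed_op(c) if i in picked else c for i, c in enumerate(channels)]

rng = random.Random(0)
features = [1.0] * 16                      # 16 toy channels, one scalar each
out = partial_channel_forward(features, lambda c: 10.0 * c, k=8, rng=rng)
processed = sum(1 for v in out if v == 10.0)
```

With `k=8` only 2 of the 16 channels pass through the mixed-operation, so the activations that need to be stored for all candidate operations shrink accordingly, at the price of the operations seeing only a small and changing part of the feature map.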
To alleviate this information loss and instability while remaining memory-efficient, we propose a progressive learning scheme which gradually increases the fraction of channels that are forwarded to the mixed-operation and meanwhile prunes the operation space based on the learnt distribution. We split the search process into consecutive stages and construct a task-specific super-network with the same depth and number of channels as the evaluation phase at the initial stage. Then after each stage, we increase the partial channel fraction, which means that the super-network in the next stage will be wider, i.e. have more convolution channels, and in turn preserve more information. This is achieved by enlarging every convolution weight with a random mapping function similar to Net2Net net2net. The mapping function $g: \{1, \dots, q\} \to \{1, \dots, n\}$ with $q > n$ is defined as
$$g(j) = \begin{cases} j & j \le n \\ \text{random sample from } \{1, \dots, n\} & j > n \end{cases} \tag{8}$$
To widen layer $l$, we replace its convolution weight $W^{(l)} \in \mathbb{R}^{Out \times In \times H \times W}$ with a new weight $U^{(l)}$:
$$U^{(l)}_{o,j,h,w} = W^{(l)}_{o,g(j),h,w} \tag{9}$$
where $Out$, $In$, $H$, $W$ denote the number of output and input channels, filter height and width respectively. Intuitively, we copy $W^{(l)}$ directly into $U^{(l)}$ and fill the remaining part by choosing randomly as defined in $g$. Unlike Net2Net, we do not divide $U^{(l)}$ by a replication factor here because the information flow on each edge has the same scale regardless of the partial fraction. After widening the super-network, we reduce the operation space by pruning less important operations according to the Dirichlet concentration parameter $\beta$ learnt from the previous stage, maintaining a consistent memory consumption. As illustrated in Table 1, the proposed progressive architecture learning scheme effectively discovers high-accuracy architectures while retaining a low GPU memory overhead.
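The widening step can be sketched on a small weight matrix. The sketch assumes that the input-channel dimension is the one being widened, collapses each filter slice to a single scalar, and uses 0-based indices; `make_mapping` and `widen_input_channels` are hypothetical names.

```python
import random

def make_mapping(n, q, rng):
    """Identity on the first n indices, a random index below n afterwards."""
    return [j if j < n else rng.randrange(n) for j in range(q)]

def widen_input_channels(weight, q, rng):
    """weight[o][j] is one filter slice; returns a copy with q input channels."""
    n = len(weight[0])
    g = make_mapping(n, q, rng)
    return [[row[g[j]] for j in range(q)] for row in weight]

rng = random.Random(0)
W = [[1.0, 2.0], [3.0, 4.0]]                 # 2 output x 2 input channels
U = widen_input_channels(W, q=4, rng=rng)    # no division by a replication factor
```

The first two columns of `U` reproduce `W` exactly and the new columns are copies of existing ones, so the widened layer starts from the weights learnt in the previous stage rather than from a fresh initialization.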
3 Discussions and Relationship to Prior Work
Early methods in NAS usually include a full training and evaluation procedure in every iteration as the inner loop to guide the subsequent search nas; nasnet; amoebanet. Consequently, their computational overheads are prohibitive for practical usage, especially on large-scale tasks.
Recently, many works have been proposed to improve the efficiency of NAS enas; morphisms; darts; oneshot. Amongst them, DARTS darts proposes a differentiable NAS framework, which introduces a continuous architecture parameter that relaxes the discrete search space. Despite being efficient, DARTS only optimizes a single point on the simplex in every search epoch, which is not guaranteed to generalize well after the discretization during evaluation. As a result, its stability and generalization have been widely challenged randomnas; understanding; smoothdarts. Following DARTS, SNAS snas and GDAS gdas leverage the Gumbel-softmax trick to learn the exact architecture parameter. However, their reparameterization is motivated from a reinforcement learning perspective, which is an approximation with softmax rather than an architecture distribution. Besides, their methods require tuning of the temperature schedule hman; mann. GDAS linearly decreases the temperature from 10 to 1 while SNAS anneals it from 1 to 0.03. In comparison, the proposed method can automatically learn the architecture distribution without the need for handcrafted scheduling. BayesNAS BayesNAS applies Bayesian learning to NAS. Specifically, they cast NAS as a model compression problem and use a Bayesian neural network as the super-network, which is difficult to optimize and requires oversimplified approximations. In contrast, our method considers the stochasticity in the architecture mixing weight, as it is directly related to the generalization of differentiable NAS algorithms understanding; smoothdarts.
When dealing with the large memory consumption of differentiable NAS, previous works mainly sample a few paths during search. For instance, ProxylessNAS proxylessnas employs binary gates and samples two paths every search epoch. Similarly, GDAS gdas and DSNAS dsnas both enforce a discrete constraint after the Gumbel-softmax reparameterization, i.e. only one path is activated. However, such discretization leads to premature convergence and may harm the search stability nasbench1shot1. Our experiments in section 4.3 also empirically demonstrate this phenomenon. As an alternative, PC-DARTS pcdarts proposes a partial channel connection, where only a portion of channels is sent to the mixed-operation. However, partial connection can cause loss of information as shown in section 2.4, and PC-DARTS searches on a shallower network with fewer channels, and thus suffers from the gap between search and evaluation. Our solution, by progressively pruning the operation space and meanwhile widening the network, searches in a task-specific manner and achieves superior accuracy on challenging datasets like ImageNet (+2.8% over BayesNAS, +2.3% over GDAS, +2.0% over DSNAS, +1.2% over ProxylessNAS, and +0.5% over PC-DARTS).
4 Experimental Results
In this section, we evaluate our proposed method DrNAS on two search spaces: the CNN search space in DARTS darts and NAS-Bench-201 nasbench201. For DARTS space, we conduct experiments on both CIFAR-10 and ImageNet in section 4.1 and 4.2 respectively. For NAS-Bench-201, we test all 3 supported datasets (CIFAR-10, CIFAR-100, ImageNet-16-120 imagenet16) in section 4.3.
4.1 Results on CIFAR-10
For both search and evaluation phases, we stack 20 cells to compose the network and set the initial channel number to 36. We place the reduction cells at 1/3 and 2/3 of the network depth, and each cell consists of $N = 6$ nodes. Following previous works darts, the operation space contains 8 choices, including $3 \times 3$ and $5 \times 5$ separable convolution, $3 \times 3$ and $5 \times 5$ dilated separable convolution, $3 \times 3$ max pooling, $3 \times 3$ average pooling, skip connection, and none (zero).
We equally divide the 50K training images into two parts: one is used for optimizing the network weights by momentum SGD and the other for learning the Dirichlet architecture distribution by an Adam optimizer. Since the Dirichlet concentration $\beta$ must be positive, we apply the shifted exponential linear mapping $\beta = \mathrm{ELU}(\eta) + 1$ and optimize over $\eta$ instead. We use the $\ell_2$ norm to constrain the distance between $\beta$ and the anchor $\hat{\beta} = 1$. The parameter $\eta$ is initialized by a standard Gaussian with scale 0.001, and $\lambda$ in (4) is set to 0.001. These settings are consistent for all experiments. For progressive architecture learning, the whole search process consists of 2 stages, each with 25 iterations. In the first stage, we set the partial channel parameter $K$ to 6 to fit the super-network into a single GTX 1080Ti GPU with 11GB memory, i.e. only 1/6 of the features are sampled on each edge. For the second stage, we prune half of the candidates and meanwhile double the network width, i.e. the operation space size reduces from 8 to 4 and $K$ becomes 3.
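The positivity mapping and the distance penalty can be sketched as follows, assuming the mapping is ELU(η) + 1 with the usual ELU slope of 1 and that the anchor is the all-ones vector; the function names and the example values are illustrative.

```python
import math

def shifted_elu(eta):
    """ELU(eta) + 1, which is strictly positive for every real eta."""
    return eta + 1.0 if eta > 0 else math.exp(eta)

def l2_penalty(beta, anchor=1.0, lam=1e-3):
    """lam * ||beta - anchor||_2, the penalty term added to the objective."""
    return lam * math.sqrt(sum((b - anchor) ** 2 for b in beta))

eta = [-2.0, 0.0, 0.5]
beta = [shifted_elu(e) for e in eta]     # [exp(-2), 1.0, 1.5]
penalty = l2_penalty(beta)
```

With this mapping an initialization of η close to zero places the concentration close to the anchor, so the penalty starts near zero and grows only as the search moves the distribution away from it.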
The evaluation phase uses the entire 50K training set to train the network from scratch for 600 epochs. The network weight is optimized by an SGD optimizer with a cosine annealing learning rate initialized as 0.025, a momentum of 0.9, and a weight decay of $3 \times 10^{-4}$. To allow a fair comparison with previous work, we also employ cutout regularization with length 16, drop-path nasnet with probability 0.3, and an auxiliary tower of weight 0.4.
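For reference, a cosine annealing schedule with the stated initial learning rate and number of epochs can be written as below. The minimum learning rate of 0 and the per-epoch (rather than per-step) update are assumptions, since neither is specified in the text.

```python
import math

def cosine_lr(epoch, total_epochs=600, base_lr=0.025, min_lr=0.0):
    """Cosine-annealed learning rate; min_lr = 0 is an assumption."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos)

schedule = [cosine_lr(e) for e in range(0, 601, 150)]   # epochs 0, 150, ..., 600
```

The rate starts at 0.025, passes through half of that value at epoch 300, and reaches the minimum at epoch 600.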
Table 2 summarizes the performance of DrNAS compared with other popular NAS methods, and we also visualize the searched cells in appendix 6.2. DrNAS achieves a test error of 2.46%, ranking among the top of recent NAS results. ProxylessNAS is the only method that achieves a lower test error than ours, but it searches on a different space with a much longer search time and has a larger model size. We also perform experiments to assign proper credit to the two parts of our proposed algorithm, i.e. the Dirichlet architecture distribution and the progressive learning scheme. When searching on a proxy task with 8 stacked cells and 16 initial channels, as is conventional darts; pcdarts, we achieve a test error of 2.54%, which surpasses most baselines. Our progressive learning algorithm eliminates the gap between the proxy and target tasks, which further reduces the test error. Consequently, both parts contribute substantially to our performance gains.
|Architecture||Params (M)||Search Cost (GPU days)||Search Method|
|DARTS (1st) darts||3.3||0.4||gradient|
|DARTS (2nd) darts||3.3||1.0||gradient|
|SNAS (moderate) snas||2.8||1.5||gradient|
|GAEA + PC-DARTS gaea||3.7||0.1||gradient|
|DrNAS (without progressive learning)||4.0||0.4||gradient|
Obtained without cutout augmentation.
Obtained on a different space with PyramidNet pyramidnet as the backbone.
Recorded on a single GTX 1080Ti GPU.
Table 2: Comparison with state-of-the-art image classifiers on CIFAR-10.
4.2 Results on ImageNet
The network architecture for ImageNet is slightly different from that for CIFAR-10 in that we stack 14 cells and set the initial channel number to 48. We also first downscale the spatial resolution from $224 \times 224$ to $28 \times 28$ with three convolution layers of stride 2, following previous works pcdarts; pdarts. The other settings remain the same as in section 4.1.
Following PC-DARTS pcdarts, we randomly sample 10% and 2.5% of the images from the 1.3M training set to alternately learn the network weight and the Dirichlet architecture distribution by a momentum SGD and an Adam optimizer respectively. We use 8 RTX 2080 Ti GPUs for both search and evaluation, and the setup of progressive pruning is the same as that on CIFAR-10, i.e. 2 stages with the operation space size shrinking from 8 to 4 and the partial channel parameter reducing from 6 to 3.
For architecture evaluation, we train the network for 250 epochs by an SGD optimizer with a momentum of 0.9, a weight decay of $3 \times 10^{-5}$, and a linearly decayed learning rate initialized as 0.5. We also use label smoothing and an auxiliary tower of weight 0.4 during training. Learning rate warm-up is employed for the first 5 epochs, following previous works pdarts; pcdarts.
As shown in Table 3, we achieve a top-1/5 test error of 23.7%/7.1%, outperforming all compared baselines and achieving state-of-the-art performance in the ImageNet mobile setting. The searched cells are visualized in appendix 6.2. Similar to section 4.1, we also report the result achieved with 8 cells and 16 initial channels, which is a common setup for the proxy task on ImageNet pcdarts. The obtained 24.2% top-1 error is already highly competitive, which demonstrates the effectiveness of architecture distribution learning on large-scale tasks. Our progressive learning scheme then further reduces the top-1/5 error by 0.5%/0.2%. Therefore, learning in a task-specific manner is essential to discover better architectures.
|Architecture||Top-1 Error (%)||Top-5 Error (%)||Params (M)||Search Cost (GPU days)||Search Method|
|ShuffleNet (v1) shufflenet-v1||26.4||10.2|| ||-||manual|
|ShuffleNet (v2) shufflenet-v2||25.1||-|| ||-||manual|
|DARTS (2nd) darts||26.7||8.7||4.7||4.0||gradient|
|SNAS (mild) snas||27.3||9.2||4.3||1.5||gradient|
|ProxylessNAS (GPU) proxylessnas||24.9||7.5||7.1||8.3||gradient|
|P-DARTS (CIFAR-10) pdarts||24.4||7.4||4.9||0.3||gradient|
|P-DARTS (CIFAR-100) pdarts||24.7||7.5||5.1||0.3||gradient|
|PC-DARTS (CIFAR-10) pcdarts||25.1||7.8||5.3||0.1||gradient|
|PC-DARTS (ImageNet) pcdarts||24.2||7.3||5.3||3.8||gradient|
|GAEA + PC-DARTS gaea||24.0||7.3||5.6||3.8||gradient|
|DrNAS (without progressive learning)||24.2||7.3||5.2||3.9||gradient|
The architecture is searched on ImageNet, otherwise it is searched on CIFAR-10 or CIFAR-100.
4.3 Results on NAS-Bench-201
Recently, some researchers have suggested that the expert knowledge applied to the evaluation protocol plays an important role in the impressive results achieved by leading NAS methods HardEvaluation; randomnas. To further verify the effectiveness of DrNAS, we therefore perform experiments on NAS-Bench-201 nasbench201, where architecture performance can be obtained directly by querying the database. NAS-Bench-201 supports 3 datasets (CIFAR-10, CIFAR-100, ImageNet-16-120 imagenet16) and has a unified cell-based search space containing 15,625 architectures. We refer to their paper nasbench201 for details of the space. Our experiments are performed in a task-specific manner, i.e. the search and evaluation are based on the same dataset. The hyperparameters for all compared methods are set to their defaults, and for DrNAS we use the same search settings as in section 4.1. We run every method 4 independent times with different random seeds and report the mean and standard deviation in Table 4.
As shown, we achieve the best accuracy on all 3 datasets. On CIFAR-100, we even achieve the global optimum. Specifically, DrNAS outperforms DARTS darts, GDAS gdas, DSNAS dsnas, PC-DARTS pcdarts, and SNAS snas by 103.8%, 35.9%, 30.4%, 6.4%, and 4.3% on average. We notice that the two methods (GDAS and DSNAS) that enforce a discrete constraint, i.e. only sample a single path every search iteration, perform poorly, especially on CIFAR-100. In comparison, SNAS, which employs a similar Gumbel-softmax trick but without the discretization, performs much better. Consequently, a discrete constraint during search can reduce the GPU memory consumption but empirically suffers from instability. In comparison, we develop the progressive learning scheme on top of architecture distribution learning, enjoying both memory efficiency and strong search performance.
|DARTS (1st) darts|
|DARTS (2nd) darts|
5 Conclusion
In this paper, we propose Dirichlet Neural Architecture Search (DrNAS). We formulate differentiable NAS as a constrained distribution learning problem, which explicitly models the stochasticity in the architecture mixing weight and balances exploration and exploitation in the search space. The proposed method can be optimized efficiently via gradient-based algorithms, and it comes with a theoretical benefit for generalization. Furthermore, we propose a progressive learning scheme to eliminate the gap between search and evaluation. DrNAS consistently achieves strong performance across several image classification tasks, which reveals its potential to play a crucial role in future end-to-end deep learning platforms.
6.1 Proof of Proposition 1
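Before the development of the pathwise derivative estimator, the Laplace approximation in the softmax basis was extensively used to approximate the Dirichlet distribution laplace; topic. The approximated Dirichlet distribution is:
$$p(\theta(h) \mid \beta) = \frac{\Gamma\big(\sum_o \beta_o\big)}{\prod_o \Gamma(\beta_o)} \prod_o \theta_o^{\beta_o}\, g(\mathbf{1}^T h) \tag{10}$$
where $\theta$ is the softmax-transformed $h$, $h$ follows a multivariate normal distribution, and $g(\cdot)$ is an arbitrary density to ensure integrability topic. The mean $\mu$ and diagonal covariance matrix $\Sigma$ of $h$ depend on the Dirichlet concentration parameter $\beta$:
$$\mu_o = \log \beta_o - \frac{1}{|\mathcal{O}|} \sum_{o'} \log \beta_{o'}, \qquad \Sigma_o = \frac{1}{\beta_o}\Big(1 - \frac{2}{|\mathcal{O}|}\Big) + \frac{1}{|\mathcal{O}|^2} \sum_{o'} \frac{1}{\beta_{o'}} \tag{11}$$
It can be directly obtained from (11) that the Dirichlet mean $\beta_o / \sum_{o'} \beta_{o'}$ equals $\mathrm{Softmax}(\mu)_o$. Sampling from the approximated distribution can be done by first sampling $h$ from $\mathcal{N}(\mu, \Sigma)$ and then applying the softmax function to obtain $\theta$. We will leverage the fact that this approximation supports explicit reparameterization to derive our proof.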
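The softmax-basis parameters can be computed and checked numerically. The sketch below assumes the standard form of this approximation, with mean log β_o minus the average of the log concentrations and diagonal variance (1/β_o)(1 - 2/K) + (1/K²) Σ 1/β_o' for K operations; the concentration values and function names are illustrative.

```python
import math

def laplace_params(beta):
    """Mean and diagonal covariance of the softmax-basis Laplace approximation."""
    k = len(beta)
    mean_log = sum(math.log(b) for b in beta) / k
    mean_inv = sum(1.0 / b for b in beta) / k
    mu = [math.log(b) - mean_log for b in beta]
    var = [(1.0 / b) * (1.0 - 2.0 / k) + mean_inv / k for b in beta]
    return mu, var

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

beta = [0.9, 1.0, 1.2, 1.1]
mu, var = laplace_params(beta)
recovered_mean = softmax(mu)                     # equals beta / sum(beta)

# Lower bound on the variances when ||beta - 1||_2 <= delta.
k = len(beta)
delta = math.sqrt(sum((b - 1.0) ** 2 for b in beta))
lower_bound = (1.0 - 2.0 / k) / (1.0 + delta) + 1.0 / (k * (1.0 + delta))
```

The softmax of the mean recovers the Dirichlet mean, and every diagonal variance stays above `lower_bound` because each concentration is at most 1 + δ under the distance constraint.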
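Applying the above Laplace approximation to the Dirichlet distribution, and writing $\tilde{\mathcal{L}}_{val}(w^*, \mu) = \mathcal{L}_{val}(w^*, \mathrm{Softmax}(\mu))$, the unconstrained upper-level objective in (3) can then be written as:
$$\mathbb{E}_{\theta \sim \mathrm{Dir}(\beta)}\big[\mathcal{L}_{val}(w^*, \theta)\big] \approx \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \Sigma)}\big[\tilde{\mathcal{L}}_{val}(w^*, \mu + \epsilon)\big] \approx \tilde{\mathcal{L}}_{val}(w^*, \mu) + \frac{1}{2}\,\mathrm{tr}\big(\Sigma\, \nabla^2_{\mu} \tilde{\mathcal{L}}_{val}(w^*, \mu)\big)$$
where the last step is a second-order Taylor expansion around $\mu$ whose first-order term vanishes because $\mathbb{E}[\epsilon] = 0$. In our full objective, we constrain the Euclidean distance between the learnt Dirichlet concentration $\beta$ and the fixed prior concentration $\hat{\beta} = 1$, so that $\beta_o \le 1 + \delta$ for every $o$. The diagonal entries of the covariance matrix of the approximated softmax Gaussian can therefore be bounded as:
$$\Sigma_o = \frac{1}{\beta_o}\Big(1 - \frac{2}{|\mathcal{O}|}\Big) + \frac{1}{|\mathcal{O}|^2} \sum_{o'} \frac{1}{\beta_{o'}} \ \ge\ \frac{1}{1+\delta}\Big(1 - \frac{2}{|\mathcal{O}|}\Big) + \frac{1}{|\mathcal{O}|}\frac{1}{1+\delta}$$
Since the Hessian is positive semi-definite, its diagonal entries are non-negative, and substituting this bound into the trace term yields (7).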
6.2 Searched Architectures
6.3 Trajectory of the Hessian Norm
We track the anytime Hessian norm on NAS-Bench-201 in figure 3. The result is obtained by averaging over 4 independent runs. We observe that the largest eigenvalue grows by about 10 times when searching with DARTS for 100 epochs. In comparison, DrNAS always maintains the Hessian norm at a low level, which is in agreement with our theoretical analysis in section 2.3.
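One common way to estimate a dominant eigenvalue is power iteration, sketched below on a small explicit matrix. The text does not state how the eigenvalue was computed, so this is an assumption for illustration; the matrix `H` is made up, and in practice the matrix-vector product would typically be replaced by a Hessian-vector product obtained through automatic differentiation.

```python
def dominant_eigenvalue(matrix, iters=200):
    """Power iteration for a symmetric matrix given as nested lists."""
    n = len(matrix)
    v = [1.0] * n
    eig = 0.0
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
        # Rayleigh quotient v^T A v of the normalized iterate.
        eig = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(n))
                  for i in range(n))
    return eig

H = [[2.0, 1.0], [1.0, 3.0]]     # made-up symmetric "Hessian"
lam_max = dominant_eigenvalue(H)
```

For this matrix the exact eigenvalues are (5 ± √5)/2, so the iteration should approach the larger one, roughly 3.618.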