One-Shot Neural Architecture Search Through A Posteriori Distribution Guided Sampling

06/23/2019 · Yizhou Zhou, et al. · Microsoft, USTC

The emergence of one-shot approaches has greatly advanced the research on neural architecture search (NAS). Recent approaches train an over-parameterized super-network (one-shot model) and then sample and evaluate a number of sub-networks, which inherit weights from the one-shot model. The overall searching cost is significantly reduced as training is avoided for sub-networks. However, the network sampling process is casually treated and the inherited weights from an independently trained super-network perform sub-optimally for sub-networks. In this paper, we propose a novel one-shot NAS scheme to address the above issues. The key innovation is to explicitly estimate the joint a posteriori distribution over network architecture and weights, and sample networks for evaluation according to it. This brings two benefits. First, network sampling under the guidance of a posteriori probability is more efficient than conventional random or uniform sampling. Second, the network architecture and its weights are sampled as a pair to alleviate the sub-optimal weights problem. Note that estimating the joint a posteriori distribution is not a trivial problem. By adopting variational methods and introducing a hybrid network representation, we convert the distribution approximation problem into an end-to-end neural network training problem which is neatly approached by variational dropout. As a result, the proposed method reduces the number of sampled sub-networks by orders of magnitude. We validate our method on the fundamental image classification task. Results on Cifar-10, Cifar-100 and ImageNet show that our method strikes the best trade-off between precision and speed among NAS methods. On Cifar-10, we speed up the searching process by 20x and achieve a higher precision than the best network found by existing NAS methods.

1 Introduction

Neural architecture search (NAS), which automates the design of artificial neural networks (ANN), has received increasing attention in recent years. It is capable of finding ANNs that achieve similar or even better performance than manually designed ones. NAS is essentially a bi-level optimization task, as shown in Fig. 1(a). Let $\mathcal{A}$ denote the set of possible network architectures under a predefined search space, and let $\alpha \in \mathcal{A}$ and $w_\alpha$ denote an architecture and its corresponding weights, respectively. The lower-level objective optimizes the weights as

$w_\alpha^* = \arg\min_{w_\alpha} \mathcal{L}_{train}\big(\mathcal{N}(\alpha, w_\alpha)\big),$   (1)

where $\mathcal{L}_{train}$ is the loss criterion evaluated on the training dataset $\mathcal{D}_{train}$ and $\mathcal{N}(\alpha, w_\alpha)$ denotes the network with architecture $\alpha$ and weights $w_\alpha$. The upper-level objective optimizes the network architecture on the validation dataset, using the weights that have been optimized by the lower-level task:

$\alpha^* = \arg\min_{\alpha \in \mathcal{A}} \mathcal{L}_{val}\big(\mathcal{N}(\alpha, w_\alpha^*)\big),$   (2)

where $\mathcal{L}_{val}$ is the loss criterion on the validation dataset $\mathcal{D}_{val}$. To solve this bi-level problem, approaches based on evolution liu2017hierarchical; real2018regularized, reinforcement learning baker2016designing; zoph2016neural; zoph2018learning; tan2018mnasnet; zhong2018practical; zela2018towards; real2018regularized; baker2017accelerating; swersky2014freeze; domhan2015speeding; klein2016learning; liu2018progressive, or gradient-based methods liu2018darts; cai2018proxylessnas; brock2017smash; xie2018snas have been proposed. However, most of these methods either suffer from high computational complexity (often on the order of thousands of GPU days) liu2017hierarchical; real2018regularized; baker2016designing; zoph2016neural; zoph2018learning, or lack a convergence guarantee cai2018proxylessnas; liu2018darts; xie2018snas.

Figure 1: Illustration of NAS mechanisms. (a) Solving NAS by bi-level optimization, which is computation- and resource-demanding. (b) Sampling-based one-shot NAS. The sampling of architectures is independent of the training dataset, and there is often a mismatch between the shared weights and the sampled architectures. (c) Our NASAS samples architecture-weight pairs w.r.t. the a posteriori distribution estimated on the training dataset, and directly outputs the searched network without fine-tuning.

Rather than directly tackling the bi-level problem, some attempts veniat2018learning; Wu2018FBNetHE; saxena2016convolutional; shin2018differentiable; ahmed2017connectivity; xie2018snas relax the discrete search space $\mathcal{A}$ to a continuous one $\bar{\mathcal{A}} = r(\mathcal{A})$, where $r(\cdot)$ denotes the continuous relaxation and $\bar{\alpha} \in \bar{\mathcal{A}}$ stands for the topology of the relaxed architecture. The weights and architecture are then jointly optimized with a single objective function

$(\bar{\alpha}^*, w^*) = \arg\min_{\bar{\alpha} \in \bar{\mathcal{A}},\, w} \mathcal{L}_{train}\big(\mathcal{N}(\bar{\alpha}, w)\big).$   (3)

The optimal architecture $\alpha^*$ is then derived by discretizing the continuous architecture $\bar{\alpha}^*$. These methods greatly simplify the optimization problem and enable end-to-end training. However, since the validation set is not involved in Eq. (3), the search results are inevitably biased towards the training dataset.
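For concreteness, the continuous relaxation used by gradient-based methods such as DARTS can be sketched as a softmax-weighted mixture of candidate operations. This is only a minimal illustration of the Eq. (3)-style relaxation under our own naming, not the method proposed in this paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """DARTS-style relaxation sketch: the discrete choice among candidate
    operations is replaced by a softmax-weighted sum, so the relaxed
    architecture parameters can be optimized jointly with the weights."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)                      # candidate operations
        self.alpha = nn.Parameter(torch.zeros(len(ops)))   # relaxed architecture parameters

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

Discretizing afterwards (e.g., keeping the operation with the largest weight) recovers a discrete architecture, but, as noted above, the validation set never enters the objective.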

More recent NAS methods tend to reduce the computational complexity by decoupling the bi-level optimization problem into a sequential one bender2018understanding; brock2017smash; guo2019single. Specifically, a super-network (one-shot model) $\mathcal{S}$ is defined and the search space $\mathcal{A}$ is constrained to contain only sub-networks of $\mathcal{S}$. As shown in Fig. 1(b), recent one-shot NAS methods first optimize the weights $W$ of the super-network by solving

$W^* = \arg\min_{W} \mathcal{L}_{train}\big(\mathcal{N}(\mathcal{S}, W)\big).$   (4)

Then a number of sub-networks are sampled from $\mathcal{S}$ and the best-performing sub-network is picked out with

$\alpha^* = \arg\min_{\alpha \in \mathcal{A}} \mathcal{L}_{val}\big(\mathcal{N}(\alpha, W^*(\alpha))\big),$   (5)

where $W^*(\alpha)$ denotes the weights of architecture $\alpha$ inherited from $W^*$. The core assumption of this one-shot NAS scheme is that the best-performing sub-network shares weights with the optimal super-network, so that sampled sub-networks do not need to be re-trained during the search. This greatly boosts the efficiency of NAS. However, the assumption does not always hold. A clue can be found in the common practice that previous one-shot methods rely on fine-tuning to further improve the performance of the best model found. Previous research has also pointed out that the mismatch between the weights and architectures of sampled sub-networks can jeopardize the subsequent ranking results xie2018snas. Besides, the sampling process is treated casually, with random or uniform sampling. We believe there is large room for improvement in efficiency.
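As a rough sketch of this decoupled pipeline (the sampler and evaluator are hypothetical callables supplied by the user; uniform sampling stands in for whatever strategy a given method uses):

```python
def one_shot_search(super_net_weights, sample_arch, evaluate, num_candidates):
    """Two-stage one-shot NAS sketch: assuming the super-network weights have
    already been trained once (Eq. (4)), rank sampled sub-networks that
    inherit those weights (Eq. (5)) and return the best one."""
    best_acc, best_arch = -1.0, None
    for _ in range(num_candidates):
        arch = sample_arch()                       # e.g. uniform over sub-networks
        acc = evaluate(arch, super_net_weights)    # sub-network inherits super-net weights
        if acc > best_acc:
            best_acc, best_arch = acc, arch
    return best_arch, best_acc
```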

In this paper, we propose a novel NAS strategy, namely NAS through A posteriori distribution guided Sampling (NASAS). In NASAS, we estimate the a posteriori distribution over the architecture-weight pair $(\alpha, w_\alpha)$ with a variational distribution $q_\theta(\alpha, w_\alpha)$, where $\theta$ denotes the variational parameters. The optimal $\theta$, denoted by $\theta^*$, can be found by

$\theta^* = \arg\min_{\theta} D\big(q_\theta(\alpha, w_\alpha) \,\|\, p(\alpha, w_\alpha \mid \mathcal{D}_{train})\big),$   (6)

where $D(\cdot \,\|\, \cdot)$ measures the distance between two distributions. Note that finding $\theta^*$ is not a trivial problem; the details are presented in Section 2. After $\theta^*$ is found, we look for the optimal architecture by

$\alpha^* = \arg\max_{(\alpha, w_\alpha) \sim q_{\theta^*}} E_{val}\big(\mathcal{N}(\alpha, w_\alpha)\big),$   (7)

where $E_{val}$ denotes the evaluation criterion on the validation dataset. In a nutshell, NASAS leverages the training dataset to estimate the a posteriori distribution, based on which sampling is performed, and then uses the validation set for performance evaluation.

The flow chart of NASAS is illustrated in Fig. 1(c). Our work has two main innovations compared with recently proposed one-shot approaches. First, we greatly improve the efficiency of the network search process through guided sampling; as a result, the search time can be reduced by orders of magnitude while achieving the best performance. Second, we approximate the joint distribution over architectures and weights to alleviate the mismatch problem mentioned earlier, which not only improves the reliability of the ranking results but also allows us to directly output the best-performing network found without fine-tuning. We evaluate NASAS on the image classification task. It achieves a 1.98% test error in 11.1 GPU days on Cifar-10, while the best network found by existing NAS methods only achieves a 2.07% test error at 200 GPU days. NASAS also achieves state-of-the-art performance with a 14.8% test error in 8.7 GPU days on Cifar-100, and a 24.80% test error in around 40 GPU days on ImageNet under a relaxed mobile setting.

2 NASAS

In this section, we first formulate the target problem of our NASAS, and then propose an end-to-end trainable solution to estimate the joint a posteriori distribution over architectures and weights, followed by an efficient sampling and ranking scheme to facilitate the search process.

2.1 Notation and Problem Formulation

Given a one-shot model $\mathcal{S}$, let $W_{l}^{k} \in \mathbb{R}^{k \times k \times c_{in} \times c_{out}}$ denote the convolution weight matrix of layer $l$ with spatial kernel size $k$, where $c_{in}$ and $c_{out}$ denote the number of input and output channels, respectively. We use $W_{l,i}^{k}$ to denote the kernel sliced along the input channel dimension and $W = \{W_{l,i}^{k}\}$ to denote the weights of the whole one-shot model. As deriving a sub-network in $\mathcal{S}$ is equivalent to deactivating a set of convolution kernels, a sub-network architecture can be specified by a set of random variables $\mathbf{z} = \{\mathbf{z}_{l,i}^{k}\}$, where $\mathbf{z}_{l,i}^{k} \in \{0, 1\}$ indicates deactivating (zero) or activating (one) convolution kernel $W_{l,i}^{k}$. Later on we use boldface for random variables.

Although we need a joint a posteriori distribution over $\mathbf{z}$ and $W$, we do not have to derive the joint distribution explicitly, since deactivating or activating a convolution kernel is equivalent to multiplying the kernel by a binary mask. Instead, we combine them into a new random variable $\boldsymbol{\omega} = \{\boldsymbol{\omega}_{l,i}^{k}\}$, where $\boldsymbol{\omega}_{l,i}^{k} = \mathbf{z}_{l,i}^{k} \cdot W_{l,i}^{k}$. Thus, the key problem in NASAS is to estimate the a posteriori distribution over the hybrid network representation $\boldsymbol{\omega}$. Mathematically,

$p(\boldsymbol{\omega} \mid X, Y) = \dfrac{p(Y \mid X, \boldsymbol{\omega})\, p(\boldsymbol{\omega})}{\int p(Y \mid X, \boldsymbol{\omega})\, p(\boldsymbol{\omega})\, d\boldsymbol{\omega}},$   (8)

where $X$ and $Y$ denote the training samples and labels, respectively. $p(Y \mid X, \boldsymbol{\omega})$ is the likelihood, which can be inferred by $\mathcal{N}(\boldsymbol{\omega})$, where $\mathcal{N}(\boldsymbol{\omega})$ denotes the sub-network defined by the hybrid representation $\boldsymbol{\omega}$, and $p(\boldsymbol{\omega})$ is the a priori distribution of the hybrid representation. Because the marginalized likelihood in Eq. (8) is intractable, we use a variational distribution $q_\theta(\boldsymbol{\omega})$ to approximate the true a posteriori distribution and reformulate our target problem as

$\theta^* = \arg\min_{\theta} D\big(q_\theta(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega} \mid X, Y)\big), \qquad \alpha^* = \arg\max_{\boldsymbol{\omega} \sim q_{\theta^*}} E_{val}\big(\mathcal{N}(\boldsymbol{\omega})\big).$   (9)

Here we choose the KL divergence and classification accuracy to instantiate $D$ and $E_{val}$, respectively.
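To make the hybrid representation concrete, the sketch below (a simplified PyTorch module; the module name and initialization are our own assumptions) shows a convolution whose input-channel kernel slices are gated by binary variables drawn from a shared, learnable keep probability, i.e. $\boldsymbol{\omega} = \mathbf{z} \cdot W$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv2d(nn.Module):
    """Minimal sketch of the hybrid representation omega = z * W: each
    input-channel slice of the kernel is kept or dropped by a binary gate z,
    with one shared keep probability per layer/kernel-size group."""
    def __init__(self, c_in, c_out, k):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.01)
        self.p_keep = nn.Parameter(torch.tensor(0.9))   # shared keep probability (variational parameter)
        self.c_in, self.pad = c_in, k // 2

    def forward(self, x):
        p = self.p_keep.clamp(1e-3, 1 - 1e-3)
        z = torch.bernoulli(p.expand(self.c_in))        # one gate per input-channel slice
        omega = self.weight * z.view(1, -1, 1, 1)       # hybrid variable omega = z * W
        # the hard Bernoulli draw is relaxed during training (see Section 2.2.2)
        return F.conv2d(x, omega, padding=self.pad)
```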

2.2 A Posteriori Distribution Approximation

We employ Variational Inference (VI) to approximate the true a posteriori distribution $p(\boldsymbol{\omega} \mid X, Y)$ with $q_\theta(\boldsymbol{\omega})$ by minimizing the negative Evidence Lower Bound (ELBO)

$\mathcal{L}_{VI}(\theta) = -\sum_{n=1}^{N} \int q_\theta(\boldsymbol{\omega}) \log p(y_n \mid x_n, \boldsymbol{\omega})\, d\boldsymbol{\omega} + \mathrm{KL}\big(q_\theta(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega})\big),$   (10)

where $N$ is the number of training samples and $(x_n, y_n)$ denotes the $n$-th training pair. Inspired by gal2016uncertainty; gal2016dropout, we propose solving Eq. (10) with the network-friendly variational dropout.

2.2.1 Approximation by Network Training

We employ the re-parametrization trick kingma2013auto and choose a deterministic and differentiable transformation function $g(\theta, \epsilon)$ that re-parameterizes $q_\theta(\boldsymbol{\omega})$ as $\boldsymbol{\omega} = g(\theta, \epsilon)$, where $\epsilon \sim p(\epsilon)$ is a parameter-free distribution. Take a uni-variate Gaussian distribution $q_{\mu,\sigma}(\omega) = \mathcal{N}(\mu, \sigma^2)$ as an example: its re-parametrization can be $\omega = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, where $\mu$ and $\sigma$ are the variational parameters. Gal et al. gal2016uncertainty; gal2016dropout have shown that when the network weights are re-parameterized with

$w_l = \theta_l\, \mathrm{diag}\big([\mathbf{z}_{l,j}]_{j=1}^{K_l}\big), \qquad \mathbf{z}_{l,j} \sim \mathrm{Bernoulli}(p_l),$   (11)

a function draw w.r.t. the variational distribution over network weights can be efficiently implemented via network inference. Concretely, a function draw is equivalent to randomly masking the deterministic weight matrices of a neural network, which is known as the dropout operation srivastava2014dropout. Similarly, we replace $W_{l,i}^{k}$ in our hybrid representation with a deterministic variational parameter $\theta_{l,i}^{k}$ and reformulate $\boldsymbol{\omega}_{l,i}^{k}$ as

$\boldsymbol{\omega}_{l,i}^{k} = \mathbf{z}_{l,i}^{k} \cdot \theta_{l,i}^{k}.$   (12)

In Eq. (12) we have an additional random variable $\mathbf{z}_{l,i}^{k}$ that controls the activation of kernels and whose distribution is unknown. Here we propose using the marginal probability $p(\mathbf{z}_{l,i}^{k} = 1)$ to characterize its behavior, because the marginal reflects the expected probability of selecting kernel $W_{l,i}^{k}$ given the training dataset. It exactly matches the real behavior if the selections of kernels in a one-shot model are independent. Since the joint distribution of the network architecture $\mathbf{z}$ is a multivariate Bernoulli distribution, its marginal distribution obeys $\mathbf{z}_{l,i}^{k} \sim \mathrm{Bernoulli}(p_{l,i}^{k})$ dai2013multivariate, where $p_{l,i}^{k}$ is now also a variational parameter that should be optimized. Therefore, we have

$\boldsymbol{\omega}_{l,i}^{k} = \mathbf{z}_{l,i}^{k} \cdot \theta_{l,i}^{k}, \qquad \mathbf{z}_{l,i}^{k} \sim \mathrm{Bernoulli}(p_{l}^{k}).$   (13)

Here we omit the subscript $i$ in the original $p_{l,i}^{k}$ because the importance of branches that come from the same kernel-size group and layer should be identical. By grouping $\mathbf{z}_{l,i}^{k}$ and $\theta_{l,i}^{k}$ into a single variable, Eq. (13) takes the same form as Eq. (11). Now Eq. (10) can be rewritten as

$\mathcal{L}_{VI}(\theta, p) = -\sum_{n=1}^{N} \int q_{\theta,p}(\boldsymbol{\omega}) \log p(y_n \mid x_n, \boldsymbol{\omega})\, d\boldsymbol{\omega} + \mathrm{KL}\big(q_{\theta,p}(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega})\big),$   (14)

where the variational parameters consist of both the deterministic kernel weights $\theta$ and the architecture distribution parameters $p$. The expected log likelihood (the integral term) in the equation above is usually estimated by Monte Carlo (MC) estimation:

$\int q_{\theta,p}(\boldsymbol{\omega}) \log p(y_n \mid x_n, \boldsymbol{\omega})\, d\boldsymbol{\omega} \;\approx\; \log p(y_n \mid x_n, \hat{\boldsymbol{\omega}}), \qquad \hat{\boldsymbol{\omega}} \sim q_{\theta,p}(\boldsymbol{\omega}).$   (15)

Eq. (15) indicates that the negative ELBO can be computed very efficiently: it equals the KL term minus the log likelihood inferred by the one-shot network (now re-parameterized by $\theta$ and $p$). During each network inference, convolution kernels are randomly deactivated with probability $1 - p_{l}^{k}$, which is exactly equivalent to a dropout neural network.

Approximating the a posteriori distribution over the hybrid network representation is thus converted into optimizing the one-shot model with dropout and a KL regularization term. As long as the derivatives of both terms are tractable, we can efficiently train it in an end-to-end fashion.
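As a rough illustration (assuming a generic PyTorch classifier whose convolution kernels are already dropped stochastically with the learned probabilities during the forward pass; the function and argument names are hypothetical), a single MC estimate of the negative ELBO in Eqs. (14)-(15) reduces to a cross-entropy term plus a KL regularizer:

```python
import torch.nn.functional as F

def neg_elbo_step(model, x, y, kl_term, n_train):
    """One MC sample of the negative ELBO: a single stochastic forward pass
    (one function draw from q) plus the KL term between q and the prior,
    scaled by the number of training samples."""
    logits = model(x)                      # kernels randomly deactivated inside the model
    nll = F.cross_entropy(logits, y)       # average of -log p(y | x, omega_hat) over the batch
    return nll + kl_term / n_train
```

Here `kl_term` is assumed to be precomputed from the current variational parameters, e.g. with a regularizer of the form sketched in Section 2.2.2.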

2.2.2 Network Optimization

In addition to the variational parameters $\theta$, the activation probability $p_{l}^{k}$ in Eq. (13) should also be optimized (either via grid search gal2016uncertainty or a gradient-based method gal2017concrete), so we need to compute $\partial \mathcal{L}_{VI} / \partial p_{l}^{k}$. If each convolution kernel is deactivated with a prior probability $\pi$ along with a Gaussian weight prior $\mathcal{N}(0, \sigma_0^2)$, then the a priori distribution of the hybrid representation is exactly a spike-and-slab prior. Following gal2016dropout; gal2017concrete, the derivatives of Eq. (15) can be computed as

$\dfrac{\partial \hat{\mathcal{L}}_{VI}}{\partial (\theta, p)} \approx -\sum_{n=1}^{N} \dfrac{\partial \log p\big(y_n \mid x_n, \hat{\boldsymbol{\omega}}(\theta, p)\big)}{\partial (\theta, p)} + \dfrac{\partial}{\partial (\theta, p)} \sum_{l,k} \Big( \dfrac{p_{l}^{k}}{2\sigma_0^{2}} \sum_{i=1}^{N_{l}^{k}} \big\|\theta_{l,i}^{k}\big\|_2^{2} \;-\; N_{l}^{k}\, \mathcal{H}(p_{l}^{k}) \Big),$   (16)

where $\mathcal{H}(p) = -p\log p - (1-p)\log(1-p)$ is the entropy of a Bernoulli variable and $N_{l}^{k}$ denotes the number of input channels for the convolution kernels of spatial size $k$ at layer $l$. Please note that the above derivation is obtained by setting the prior probability $\pi$ to zero, which means that the network architecture prior is the whole one-shot model. The motivation for this choice is that a proper architecture prior is usually difficult to acquire or even estimate, whereas the whole one-shot model is a reasonable prior when we choose an over-parameterized network that has proven effective on many tasks as our one-shot model. Besides, this choice provides a more stable way to optimize $p_{l}^{k}$ gal2016uncertainty. Therefore, we build the one-shot models in our experiments upon manually designed networks.
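The KL term can be written down compactly. The sketch below assumes a concrete-dropout-style form (a keep-probability-weighted weight decay minus a Bernoulli entropy per kernel group), with `sigma0` the standard deviation of the Gaussian weight prior; the grouping and scaling are our assumptions, not a verbatim implementation:

```python
import torch

def kl_regularizer(group_weights, keep_probs, sigma0=1.0):
    """Sketch of the regularizer implied by Eq. (16). `group_weights[g]` holds
    the kernel tensor (c_out, c_in, k, k) of group g and `keep_probs[g]` its
    shared keep probability as a 0-dim tensor."""
    reg = 0.0
    for theta, p in zip(group_weights, keep_probs):
        n_in = theta.shape[1]                                   # number of input-channel slices
        l2 = (theta ** 2).sum()
        entropy = -(p * torch.log(p) + (1 - p) * torch.log(1 - p))
        reg = reg + p / (2.0 * sigma0 ** 2) * l2 - n_in * entropy
    return reg
```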

Since the first term in Eq. (16) involves computing the derivative through a non-differentiable Bernoulli draw (recall $\mathbf{z}_{l,i}^{k} \sim \mathrm{Bernoulli}(p_{l}^{k})$ in Eq. (13)), we employ the Gumbel-softmax jang2016categorical to relax the discrete distribution to a continuous one, so that $\mathbf{z}_{l,i}^{k}$ in Eq. (16) and Eq. (13) can be deterministically drawn with

$\mathbf{z}_{l,i}^{k} = \mathrm{sigmoid}\Big( t \cdot \big( \log p_{l}^{k} - \log(1 - p_{l}^{k}) + \log u - \log(1 - u) \big) \Big), \qquad u \sim \mathrm{Uniform}(0, 1),$   (17)

where $t$ is the temperature that decides how steep the sigmoid function is; as $t$ goes to infinity, the above parametrization becomes exactly equivalent to drawing the sample from the Bernoulli distribution. (A similar relaxation is used in gal2017concrete without Gumbel-softmax.)

By adopting Eq. (17), the derivatives in Eq. (16) can be propagated via the chain rule. Combining Eq. (8), Eq. (10) and Eq. (15), one can see that the a posteriori distribution over the hybrid representation can be approximated by simply training the one-shot model in an end-to-end fashion with two additional regularization terms and learnable dropout ratios.

2.3 Sampling and Ranking

Once the variational distribution $q_{\theta^*}(\boldsymbol{\omega})$ is obtained, we sample a group of $C$ network candidates w.r.t. $q_{\theta^*}$, where $C$ is the number of samples. According to Eq. (13), our sampling process is performed by activating convolution kernels stochastically with the learned probability $p_{l}^{k}$, which is equivalent to a regular dropout operation. Specifically, each candidate is sampled by randomly dropping convolution kernel $W_{l,i}^{k}$ with probability $1 - p_{l}^{k}$ for every $l$, $i$ and $k$ in the one-shot model. The sampled candidates are then evaluated and ranked on a held-out validation dataset. Thanks to the hybrid network representation, we actually sample architecture-weight pairs, which relieves the mismatch problem. Finally, the best-performing candidate is selected by Eq. (7).

Please note that our a posteriori distribution guided sampling scheme, though not by design, leads to an adaptive dropout that reflects the importance of different parts of the one-shot model. It thus removes the dependency on the hyper-parameter-sensitive, carefully designed dropout probabilities of previous one-shot methods bender2018understanding.
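The sampling-and-ranking step can be sketched as follows (the mask-application and evaluation callables, and the dictionary of per-group keep probabilities, are assumptions about the surrounding code):

```python
import torch

def guided_sample_and_rank(keep_probs, apply_masks, evaluate, num_candidates=1500):
    """Sketch of posterior-guided sampling: draw architecture-weight pairs by
    Bernoulli-masking kernel groups with the learned keep probabilities (an
    adaptive, learned dropout), then rank them on held-out validation data."""
    best_acc, best_masks = -1.0, None
    for _ in range(num_candidates):
        masks = {name: torch.bernoulli(p) for name, p in keep_probs.items()}
        candidate = apply_masks(masks)     # zero out the dropped kernels in the one-shot model
        acc = evaluate(candidate)          # validation accuracy, as in Eq. (7)
        if acc > best_acc:
            best_acc, best_masks = acc, masks
    return best_masks, best_acc
```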

3 Experiments

To fully investigate the behavior of NASAS, we test it on six one-shot super-networks. Because we set the prior probability $\pi$ to zero to facilitate Eq. (16), we construct the super-networks based on architecture priors derived from manually designed networks. We evaluate NASAS on three datasets: Cifar-10, Cifar-100 and ImageNet. For every one-shot super-network, we insert a dropout layer after each convolution layer according to Eq. (17) to facilitate the computation of Eq. (16). This modification introduces negligible overhead in parameters and FLOPs. NASAS is trained end-to-end with Stochastic Gradient Descent (SGD) using a single P40 GPU for Cifar-10/Cifar-100 and 4 M40 GPUs for ImageNet. Once a model converges, we sample convolution kernels w.r.t. the learned dropout ratios to obtain 1500/5000/1500 candidate architectures for Cifar-10, Cifar-100 and ImageNet, respectively. These candidates are ranked on a held-out validation dataset and the best-performing one is selected as the final search result.

3.1 Cifar-10 and Cifar-100

One-shot Model and Hyper-parameters. We test our NASAS with four super-networks, namely SupNet-M/MI and SupNet-E/EI, on Cifar-10 and Cifar-100. They are based on the manually designed multi-branch ResNet gastaldi2017shake and the architecture obtained by ENAS pham2018efficient , respectively. Please refer to the supplementary material for more details of the one-shot models and all hyper-parameter settings used in this paper.

Method Error(%) GPU Days Params(M) Search Method
shake-shake gastaldi2017shake 2.86 - 26.2 -
shake-shake + cutout devries2017improved 2.56 - 26.2 -
NAS zoph2016neural 4.47 22400 7.1 RL
NAS + more filters zoph2016neural 3.65 22400 37.4 RL
NASNET-A + cutout zoph2018learning 2.65 1800 3.3 RL
Micro NAS + Q-Learning zhong2018practical 3.60 96 - RL
PathLevel EAS + cutout cai2018path 2.30 8.3 13.0 RL
ENAS + cutout pham2018efficient 2.89 0.5 4.6 RL
EAS (DenseNet) cai2018efficient 3.44 10 10.7 RL
AmoebaNet-A + cutout real2018regularized 3.34 3150 3.2 evolution
Hierarchical Evo liu2017hierarchical 3.63 300 61.3 evolution
PNAS liu2018progressive 3.63 225 3.2 SMBO
SMASH brock2017smash 4.03 1.5 16.0 gradient-based
DARTS + cutout liu2018darts 2.83 4 3.4 gradient-based
SNAS + cutout xie2018snas 2.85 1.5 2.8 gradient-based
NAONet + cutout luo2018neural 2.07 200 128 gradient-based
One-Shot Top bender2018understanding 3.70 - 45.3 gradient-based
NASAS-E 2.73 2.5 3.1 guided sampling
NASAS-EI 2.56 5.5 10.8 guided sampling
NASAS-M 2.20 4.8 21.6 guided sampling
NASAS-MI 2.06 6.5 33.4 guided sampling
NASAS-MI* 1.98 11.1 32.8 guided sampling
Table 1: Performance comparison with other state-of-the-art results on Cifar-10. Please note that we do not fine-tune the networks searched by our method. * indicates the architecture searched by sampling 10000 candidates. The full table can be viewed in the supplementary material.
Method Error(%) GPU Days Params(M) Search Method
NASNET-A zoph2018learning 19.70 1800 3.3 RL
ENAS pham2018efficient 19.43 0.5 4.6 RL
AmoebaNet-B real2018regularized 17.66 3150 2.8 evolution
PNAS liu2018progressive 19.53 150 3.2 SMBO
NAONet + cutout luo2018neural 14.36 200 128 gradient-based
NASAS-MI(ours) 14.28 11 46.4 guided sampling
Table 2: Performance comparison with other state-of-the-art results on Cifar-100. Please note that we do not fine-tune the network searched by our method.
SupNet-EI SupNet-E SupNet-MI SupNet-M
Err. Param. Err. Param. Err. Param. Err. Param.
Full model 2.78% 15.3M 2.98% 4.6M -% 72.7M 2.58% 26.2M
Random w/o FT 13.45% 10.7M 15.87% 3.0M 9.75% 35.4M 2.63% 22.4M
Random w/ FT 3.16% 10.7M 3.47% 3.0M 2.69% 35.4M 2.56% 22.4M
NASAS 2.56% 10.8M 2.73% 3.1M 2.06% 33.4M 2.20% 21.6M
(a) Impact of our a posteriori distribution guided sampling. w/o FT and w/ FT indicate whether the best searched architecture is fine-tuned on the dataset. Our NASAS does not need fine-tuning.
Weight prior 50 150 250 500
Error(%) 2.13 2.06 2.27 2.39
Params(M) 49.9 33.4 23.8 18.2
(b) Impact of the weight prior on SupNet-MI.
SupNet-EI SupNet-M SupNet-EI†
2.74% 2.49% 2.68%
2.56% 2.20% -
(c) Impact of the temperature $t$. † denotes fine-tuned results.
Candidates 0.05k 0.5k 1.5k 5.0k 10k 20k 50k
Error(%) 2.17 2.06 2.06 2.04 1.98 - -
GPU Days 0.02 0.23 0.69 2.31 4.63 9.26 23.15
(d) Impact of the number of sampled candidate architectures on SupNet-MI.
Table 3: Ablation study and parameter analysis.

Comparison with State-of-the-art Methods. Table 1 shows the comparison results on Cifar-10. Here NASAS-X denotes the performance of NASAS on the super-network SupNet-X. From top to bottom, the first group consists of state-of-the-art manually designed architectures on Cifar-10; the following three groups list related NAS methods that adopt different search algorithms, e.g., RL, evolution, and gradient descent; the last group shows the performance of NASAS. The results show that NASAS finds advanced architectures in a much more efficient and effective way: it finds an architecture with the lowest error of 1.98% in only 11.1 GPU days.

We also list in Table 1 the two networks, the multi-branch ResNet gastaldi2017shake and ENAS pham2018efficient, that inspired our super-network designs. Our NASAS-E and NASAS-M outperform "ENAS + cutout" and "shake-shake + cutout" by 0.16% and 0.36%, respectively, with smaller model sizes. In the inflated cases, our NASAS-MI/EI find architectures with even higher performance. Compared with the sampling-based one-shot method "One-Shot Top", which achieves a competitive 3.7% classification error by randomly sampling 20000 network architectures, our NASAS attains much higher performance by sampling only 1500 network architectures, thanks to the a posteriori distribution guided sampling.

Table 2 further demonstrates the performance of NASAS on the more challenging Cifar-100 dataset. NASAS achieves a good trade-off between efficiency and accuracy: it reaches a 14.8% error rate with only 8.7 GPU days, which is very competitive in terms of both performance and search time.

Please note that the results of NASAS are achieved during the search process without any additional fine-tuning of the searched architectures' weights, while those of other methods are obtained by fine-tuning the searched models. We discuss this point further in the following ablation study.

Ablation Study and Parameter Analysis. We first evaluate the effect of our a posteriori distribution guided sampling in Table 3(a). Compared with the "Random" sampling baseline, implemented with the predefined dropout strategy discussed in bender2018understanding, NASAS finds better sub-networks that bring a relative gain of 14%-23%. Evidently, a posteriori distribution guided sampling is much more effective, which validates that our approach learns a meaningful distribution for efficient architecture search. Besides, as the table shows, there is usually a huge performance gap between architectures searched with a predefined distribution with and without fine-tuning, which reveals the mismatch problem.

Table 3(b) discusses the weight prior in Eq. (16). We find that a good prior usually makes the weight-decay-like term in Eq. (16) fall into a commonly used weight decay range, so we choose the prior by grid search. As shown in the table, the weight prior affects both the error rate and the model size: the larger the prior value, the smaller the number of parameters. Since the objective of NAS is to maximize performance rather than to minimize the number of parameters, we choose the setting with the minimal error rate.

Table 3(c) shows the impact of the temperature $t$ in Eq. (17). A smaller $t$ leads to a lower error, which is consistent with the analysis of Eq. (17). The corresponding fine-tuned results of NASAS provide only marginal improvement, which further demonstrates the reliability of NASAS in sampling both architectures and weights.

We further evaluate the impact of the number of sampled candidates in Table 3(d). Performance improves as the number of samples, and hence the GPU days, increases. We choose 1500 sampled architectures as a trade-off between complexity and accuracy. Please also note that, compared with other sampling-based NAS methods, our scheme achieves a 2.17% error rate by sampling only 50 architectures with the assistance of the estimated a posteriori distribution. This further shows that the estimated distribution captures essential information about the architecture distribution and thus significantly facilitates the sampling process in terms of both efficiency and accuracy.

Model ResNet50 Inflated ResNet50 NASAS-R-50
Error 23.96% 22.93% 22.73%
Params 25.6M 44.0M 26.0M
Table 4: Test results on ImageNet with a relatively small super-network based on ResNet-50.

3.2 ImageNet

We further evaluate NASAS on ImageNet with two super-networks based on ResNet50 he2016deep and DenseNet121 huang2017densely, respectively. Please find detailed experimental settings in the supplementary material. Rather than transferring architectures searched on a smaller dataset, the efficiency and flexibility of our method enable us to directly search architectures on ImageNet within a few days.

We first provide test results of NASAS on ImageNet in Table 4, using a relatively small search space obtained by inflating ResNet50 without limiting the model size. Hyper-parameters and the training process for the three models are identical for a fair comparison. NASAS-R-50 outperforms ResNet50 by 1.23% with a similar number of parameters. Table 5 shows the comparison with state-of-the-art results on ImageNet. In this test, we control the size of the searched architecture to be comparable to those of other NAS methods under the mobile setting. Still, our NASAS outperforms them. Please note that the size control limits our choice of the weight prior; as shown in Table 3(b), this may prevent us from finding better-performing architectures.

Method Error(%)(Top1/Top5) GPU Days Params(M) Search Method
NASNET-A zoph2018learning 26.0/8.4 1800 5.3 RL
NASNET-B zoph2018learning 27.2/8.7 1800 5.3 RL
NASNET-C zoph2018learning 27.5/9.0 1800 4.9 RL
AmoebaNet-A real2018regularized 25.5/8.0 3150 5.1 evolution
AmoebaNet-B real2018regularized 26.0/8.5 3150 5.3 evolution
PNAS liu2018progressive 25.8/8.1 225 5.1 SMBO
FBNet-C Wu2018FBNetHE 25.1/- 9 5.5 gradient-based
SinglePath guo2019single 25.3/- 12 - sampling-based
DARTS liu2018darts 26.9/9.0 4 4.9 gradient-based
SNAS xie2018snas 27.3/9.2 1.5 4.3 gradient-based
NASAS-D-121(ours) 24.8/7.5 26 6.6 guided sampling
Table 5: Performance comparison with other state-of-the-art results on ImageNet. Please note that our model is directly searched on ImageNet in 26 GPU days.

3.3 Discussions

Weight Sharing. Weight sharing is a popular method adopted by one-shot models to greatly boost the efficiency of NAS, but it is not well understood why sharing weights is effective elsken2018neural; bender2018understanding. In NASAS, as discussed in Section 2.2, we find that weight sharing can be viewed as a re-parametrization that enables us to estimate the a posteriori distribution via end-to-end network training.

Limitations and Future Work. One limitation of NASAS is that it cannot explicitly select non-parametric operations such as pooling. Another is that NASAS requires prior knowledge about architectures, which is hard to obtain; here we approximate the prior only with manually designed networks. Our future work may therefore include 1) enabling selection of non-parametric operations (e.g., assigning a 1x1 convolution after each pooling operation as a surrogate to decide whether the pooling branch is needed), and 2) investigating the robustness of NASAS to different prior architectures.

4 Conclusion

In this paper, we propose a new one-shot NAS approach, NASAS, which explicitly approximates the a posteriori distribution over network architectures and weights via network training to facilitate a more efficient search process. It enables candidate architectures to be sampled w.r.t. the a posteriori distribution approximated on the training dataset, rather than a uniform or predefined distribution. It also alleviates the mismatch problem between architectures and shared weights by sampling architecture-weight pairs, which makes the ranking results more reliable. NASAS is efficiently implemented and optimized in an end-to-end way, and thus can be easily extended to other large-scale tasks.

References