
Dynamic Distribution Pruning for Efficient Network Architecture Search

Network architectures obtained by Neural Architecture Search (NAS) have shown state-of-the-art performance in various computer vision tasks. Despite the exciting progress, the computational complexity of the forward-backward propagation and the search process makes it difficult to apply NAS in practice. In particular, most previous methods require thousands of GPU days for the search process to converge. In this paper, we propose a dynamic distribution pruning method towards extremely efficient NAS, which samples architectures from a joint categorical distribution. The search space is dynamically pruned every few epochs to update this distribution, and the optimal neural architecture is obtained when only one structure remains. We conduct experiments on two widely-used datasets in NAS. On CIFAR-10, the optimal structure obtained by our method achieves a state-of-the-art 1.9% test error, while the search process is more than 1,000 times faster (only 1.5 GPU hours on a Tesla V100) than the state-of-the-art NAS algorithms. On ImageNet, our model achieves 75.2% top-1 accuracy under the MobileNet settings, with a time cost of only 2 GPU days, a 100% acceleration over the fastest NAS algorithm. The code is available at <>



1 Introduction

Deep neural networks have demonstrated their extraordinary power for automatic feature engineering, which however involves extensive human effort in finding good network architectures. To eliminate such handcrafted architecture design, neural architecture search (NAS) was recently proposed to automatically discover suitable networks by searching over a vast architecture space. Recent endeavors have demonstrated the superior ability of NAS in finding more effective network architectures, which have achieved state-of-the-art performance in various computer vision tasks and beyond, such as image classification Zoph and Le (2016), semantic segmentation Chen et al. (2018); Liu et al. (2019) and language modeling Liu et al. (2018b); Zoph et al. (2018). Despite the remarkable progress, existing NAS methods are limited by intensive computation and memory costs in the offline architecture search. For example, reinforcement learning (RL) based methods Zoph et al. (2018); Zoph and Le (2016) train and evaluate large numbers of neural networks across many GPUs over multiple days. To accelerate this training, recent methods like DARTS Liu et al. (2018b) reduce the search time by formulating the task in a differentiable manner, where the search space is relaxed to a continuous space. Thus, the objective function is optimized by gradient descent, which reduces the search time to a few GPU days while retaining a comparable accuracy. However, DARTS still suffers from high GPU memory consumption, which increases linearly with the size of the candidate search set. Therefore, the need to speed up NAS algorithms remains urgent when applying them to real-world applications.

Figure 1: (a) The overall framework of the proposed Dynamic Distribution Pruning NAS. The method first samples architectures from the search space according to the corresponding distribution. Then, the generated network is trained for $T$ epochs with forward and backward propagation. The probability of each candidate is estimated by testing the network on the validation set. Finally, the element with the lowest probability is pruned from the candidates. (b) The probabilistic graphical model view of our framework, including the objective probability distribution we aim to optimize.

A conventional NAS method consists of three parts Elsken et al. (2018): search space, search strategy, and performance estimation. Most NAS methods share the same search space, and have intensive computational requirements in the search strategy and performance estimation. In terms of search strategy, reinforcement learning Sutton and Barto (2018) and evolutionary algorithms Back (1996) are widely used in the literature, which require a large number of structure-performance pairs to find the optimal architecture. In terms of performance estimation, most NAS methods Zoph et al. (2018); Chen et al. (2018) use full-fledged training and validation over each searched architecture, which is computationally expensive and thus limits the search exploration. To reduce the computational cost, Zoph et al. (2018); Zela et al. (2018) propose to use early stopping to estimate the performance with a shorter training time. However, extensive experiments in Ying et al. (2019) show that the performance ranking is not consistent during different training epochs, which indicates that the early stopping may result in sub-optimal architectures.

In this paper, we propose a Dynamic Distribution Pruning method for extremely efficient Neural Architecture Search (termed as DDPNAS), which considers architectures as sampled from a dynamic joint categorical distribution. More specifically, we introduce a dynamic distribution to control the choices of inputs and operations, and a specific network structure is directly obtained via sampling. In the searching process, we generate different samples and train them on the training set for a few epochs. Then, the evaluation results on the validation set are used to estimate the parameters of the distribution. The element with the lowest probability will be dynamically pruned. Finally, the best architecture is achieved when there is only one architecture left in the search space. Fig. 1 shows the overall framework of the proposed DDPNAS.

We validate the search efficiency and performance of our architecture for the classification task on CIFAR-10 Krizhevsky and Hinton (2009) and ImageNet-2012 Russakovsky et al. (2015). Our architecture reaches a state-of-the-art test error on CIFAR-10 (i.e., 1.9%). On ImageNet, our model achieves 75.2% top-1 accuracy under the MobileNet setting (MobileNet V1/V2 Howard et al. (2017); Sandler et al. (2018)). Our contributions are summarized as follows:

  • We introduce a novel NAS strategy, referred to as DDPNAS, which is memory-efficient and flexible on large datasets. For the first time, we enable NAS to have a similar computational cost to the training of conventional CNNs.

  • A new theoretical perspective is introduced in NAS. Rather than optimizing a proxy like other methods Zoph et al. (2018); Liu et al. (2018b), we directly optimize in the NAS search space. Our model can thus be easily incorporated into most existing NAS algorithms to speed up the search process. A theoretical analysis is further provided.

  • In the experiments on CIFAR-10 and ImageNet, we show that DDPNAS achieves remarkable search efficiency, e.g., 1.9% test error on CIFAR-10 after 1.5 hours of searching with one Tesla V100 (more than 1,000 times faster compared with the state-of-the-art algorithms Zoph et al. (2018); Real et al. (2018)). When evaluating on ImageNet, DDPNAS can directly search over the full ImageNet dataset within 2 days, which achieves 75.2% top-1 accuracy under the MobileNet settings.

2 Related Work

Neural architecture search is an automatic architecture engineering technique, which has received significant attention over the last few years. For a given dataset, architectures with high accuracy or low latency are obtained by performing a heuristic search in a predefined search space. For image classification, most human-designed networks are built by stacking reduction (i.e., the spatial dimension is reduced and the channel size is increased) and norm (i.e., the spatial and channel dimensions are preserved) cells He et al. (2016); Krizhevsky et al. (2012); Simonyan and Zisserman (2014); Huang et al. (2017); Hu et al. (2018). Therefore, existing NAS methods Zoph and Le (2016); Zoph et al. (2018); Liu et al. (2018a, b) can search architectures under the same settings to work on a small search space.

Many different search algorithms have been proposed to explore the neural architecture space using specific search strategies. One popular method is to model NAS as a reinforcement learning (RL) problem Zoph and Le (2016); Zoph et al. (2018); Baker et al. (2016); Cai et al. (2018a); Liu et al. (2018a); Cai et al. (2018b). Zoph et al. (2018) employ a recurrent neural network as the policy function to sequentially generate a string that encodes the specific neural architecture. The policy network can be trained with the policy gradient algorithm or the proximal policy optimization. Cai et al. (2018a, b) propose a method that regards the architecture search space as a tree structure for network transformation. In this method, new network architectures can be generated from a parent network with some predefined transformations, which reduces the search space and thus speeds up the search. An alternative way to explore the architecture space is through evolutionary methods, which evolve a population of network structures using evolutionary algorithms Xie and Yuille (2017); Real et al. (2018). Although the above architecture search algorithms have achieved state-of-the-art results on various tasks, a large amount of computational resources is still needed.

To overcome this problem, several recent works have proposed to accelerate NAS in a one-shot setting, which has demonstrated the possibility of finding the optimal network architecture within a few GPU days. In this one-shot architecture search, each architecture in the search space is considered as a sub-graph sampled from a super-graph, and the search process can be accelerated by parameter sharing Pham et al. (2018). Liu et al. (2018b) jointly optimized the weights within two nodes with the hyper-parameters under continuous relaxation. Both the weights in the graph and the hyper-parameters are updated via standard gradient descent. However, the method in Liu et al. (2018b) still suffers from large GPU memory footprints, and the search complexity is still not applicable to real-world applications. To this end, Cai et al. (2018c) adopted the differentiable framework and proposed to search architectures without any proxy. However, the method still keeps the same search algorithms as the previous work Liu et al. (2018b).

Different from the previous methods, we consider NAS in another way: the operation selection is considered as a sample from a dynamic categorical distribution. Thus the optimal architecture can be obtained through distribution pruning, which achieves an extreme efficiency as quantified in Sec. 4.

3 The Methodology

In this section, we present the proposed dynamic distribution pruning method for neural architecture search. We first describe the architecture search space in Sec. 3.1. Then, the proposed dynamic distribution pruning framework is introduced in Sec. 3.2. Finally, a theoretical analysis of the error bound of the proposed method is provided in Sec. 3.3.

3.1 Architecture Search Space

We follow the same architecture search space as in Liu et al. (2018b); Zoph and Le (2016); Zoph et al. (2018). A network consists of a pre-defined number of cells Zoph and Le (2016), which can be either norm cells or reduction cells. Each cell takes the outputs of the two previous cells as input. A cell is a fully-connected directed acyclic graph (DAG) of $M$ nodes, i.e., $\{x_1, x_2, \dots, x_M\}$. Each node $x_j$ takes the dependent nodes as input, and generates an output through a sum operation

$$x_j = \sum_{i<j} o^{(i,j)}(x_i).$$

Here each node is a specific tensor (e.g., a feature map in convolutional neural networks) and each directed edge $(i,j)$ between $x_i$ and $x_j$ denotes an operation $o^{(i,j)}(\cdot)$, which is sampled from the corresponding search space $\mathcal{O}^{(i,j)}$. Note that the constraint $i<j$ ensures there will be no cycles in a cell. Each cell takes outputs of two dependent cells as input, and we set the two input nodes as $x_{-1}$ and $x_0$ for simplicity. Following Liu et al. (2018b), the operation search space $\mathcal{O}$ consists of $K=8$ operations: $3\times3$ max pooling, no connection (zero), $3\times3$ average pooling, skip connection (identity), $3\times3$ dilated convolution with rate 2, $5\times5$ dilated convolution with rate 2, $3\times3$ depth-wise separable convolution, and $5\times5$ depth-wise separable convolution. Therefore, the size of the whole search space is $K^{|E_M|}$, where $E_M$ is the set of possible edges with $M$ intermediate nodes in the fully-connected DAG. In our case with $M=4$, together with the two input nodes, the total number of cell structures in the search space is $8^{2+3+4+5} = 8^{14}$, which is an extremely large space to search, and thus requires efficient optimization strategies.
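The size of this search space can be checked with a few lines of Python. This is a minimal sketch that assumes the DARTS-style setting of two input nodes, four intermediate nodes and eight candidate operations per edge; the function names `count_edges` and `search_space_size` are illustrative and not taken from the paper.

```python
def count_edges(num_intermediate: int, num_inputs: int = 2) -> int:
    """Edges in a fully-connected cell DAG: intermediate node j receives one
    edge from every input node and from every earlier intermediate node."""
    return sum(num_inputs + j for j in range(num_intermediate))


def search_space_size(num_ops: int, num_intermediate: int, num_inputs: int = 2) -> int:
    """One operation is chosen independently on each edge, giving K ** |E|."""
    return num_ops ** count_edges(num_intermediate, num_inputs)


edges = count_edges(4)           # 2 + 3 + 4 + 5 (assumed: 4 intermediate nodes)
size = search_space_size(8, 4)   # assumed: 8 candidate operations per edge
print(edges, size)
```

Even for a single cell, exhaustive evaluation is clearly infeasible under these assumptions, which is the motivation for sampling and pruning rather than enumerating.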

3.2 Dynamic Distribution Pruning

As illustrated in Fig. 1, the architecture search is formulated as a staged conditional sampling process in our approach. More specifically, for a given edge $(i,j)$, we introduce a dynamic categorical distribution $p(\theta^{(i,j)})$, with $\theta_k^{(i,j)} \ge 0$ and $\sum_{k=1}^{K} \theta_k^{(i,j)} = 1$. In other words, the sampling process begins from the latent state $\theta$. Then each operation is selected to be the $k$-th operation with a probability $\theta_k^{(i,j)}$. We follow the state-of-the-art works in Liu et al. (2018b); Pham et al. (2018), and use an over-parametrized parent network containing all possible operations at each edge with a weighted probability $\alpha^{(i,j)}$:

$$o^{(i,j)}(x_i) = \sum_{k=1}^{K} \alpha_k^{(i,j)}\, o_k^{(i,j)}(x_i), \quad \text{where} \quad \sum_{k=1}^{K} \alpha_k^{(i,j)} = 1.$$

This design allows the neural architecture search to be optimized through stochastic gradient descent (SGD) by using an EM-like algorithm, i.e., iteratively fixing $\alpha$ to update the network parameters $W$, and fixing $W$ to update $\alpha$. While the real-valued weights bring convenience in optimization Liu et al. (2018b), they also require every possible operation in $\mathcal{O}$ to be evaluated, which directly causes impractically long training time. Instead, we set $\alpha^{(i,j)}$ as a one-hot indicator vector:

$$\alpha_k^{(i,j)} \in \{0, 1\}, \qquad \sum_{k=1}^{K} \alpha_k^{(i,j)} = 1,$$

which can be sampled from a categorical distribution, $\alpha^{(i,j)} \sim \mathrm{Cat}(\theta^{(i,j)})$. While being able to bring significant speed-up, this discrete weight design also leads to difficulty in optimization. Here we propose to optimize it through the validation likelihood as a proxy, which has nice theoretical properties and is one of the core contributions of this paper. Given a set of indicator variables $\alpha$, a network structure is determined. Based on the training data, we are then able to train the model to get the parameter set $W$, which finally allows us to test on the validation set to obtain the labels, as shown in Fig. 1. While our ultimate goal is to find an optimal network architecture that can be fully represented by $\alpha$, we show in the following theorem that it is equivalent to maximizing the likelihood of the validation target.
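The one-hot sampling step described above can be sketched as follows. This is an illustrative sketch only: the edge layout (nodes 0 and 1 as inputs, four intermediate nodes), the uniform initial distribution, and the names `sample_one_hot` and `sample_architecture` are assumptions made for the example, not details confirmed by the paper.

```python
import random


def sample_one_hot(theta, rng):
    """Draw a one-hot indicator vector from a categorical distribution theta."""
    k = rng.choices(range(len(theta)), weights=theta, k=1)[0]
    return [1 if i == k else 0 for i in range(len(theta))]


def sample_architecture(thetas, rng):
    """Sample one operation per edge; thetas maps edge -> list of probabilities."""
    return {edge: sample_one_hot(theta, rng) for edge, theta in thetas.items()}


rng = random.Random(0)
num_ops = 8  # assumed number of candidate operations per edge
# Assumed cell layout: nodes 0 and 1 are inputs, nodes 2..5 are intermediate.
edges = [(i, j) for j in range(2, 6) for i in range(j)]
thetas = {edge: [1.0 / num_ops] * num_ops for edge in edges}
alpha = sample_architecture(thetas, rng)
```

Because each sampled vector activates exactly one operation per edge, a forward pass evaluates a single operation instead of all candidates, which is where the memory and time savings over a weighted mixture come from. An operation whose probability has been set to zero is never drawn.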

Theorem 1.

In a certain training epoch, the structure variable $\alpha$ directly determines the validation performance, specifically:


As illustrated in Fig. 1, the function can be formulated as:


where $X_{tr}$ and $Y_{tr}$ are the inputs and labels from the training set, $W$ is the set of network weights, and $T$ denotes the training epochs. Since the data are observed variables, during a specific training epoch $t$, Eq. 2 can be further simplified to:


To simplify the analysis, without loss of generality, we assume the network weights $W$ are initialized as constants, which means $W$ is fixed given a certain structure and training epoch. We can then further simplify Eq. 3 to


As shown in Eq. 4, the structure variable $\alpha$ directly determines the validation performance, i.e., if a structure shows a better performance on the validation set, the corresponding $\alpha$ holds a high probability, and vice versa. Therefore, the theorem is true during any specific training epoch. ∎

Based on Theorem 1, the structure distribution can be optimized by maximizing the validation likelihood, which involves the standard sampling, training, evaluating and updating processes. While we reduce the computation requirement by $K$ times by using discrete $\alpha$s, such a procedure is still time-consuming considering the large search space and the complexity of network training. Inspired by Ying et al. (2019), we further propose to use a dynamic pruning process to boost the efficiency by a large margin. Ying et al. (2019) conducted a series of experiments showing that in the early stage of training, the validation accuracy ranking of different network architectures is not a reliable indicator of the final architecture quality. However, we observe that the experimental results actually suggest a nice property: if an architecture performs badly at the beginning of training, there is little hope that it can be the final optimal model. As the training progresses, this observation shows less uncertainty. Based on this intuition, we derive a simple yet effective pruning process: during training, as the epoch index $t$ increases, we progressively prune the worst-performing model. Further theoretical analysis shows that this strategy has a nice theoretical bound, as will be introduced in Sec. 3.3.

Specifically, as illustrated in Fig. 1, we first sample the network structure by sampling a set of indicator vectors $\alpha$ from the categorical distribution. Then, these structures are trained for $T$ epochs, and the probability of $\alpha$ is estimated by


Using Theorem 1, the distribution of the latent state $\theta$ is updated by softmax:


Note that a non-zero $\alpha_k^{(i,j)}$ denotes that the structure selects the $k$-th operation at edge $(i,j)$. Finally, on each edge we prune the element with minimal probability in $\theta^{(i,j)}$. The optimal structure is obtained when there is only one architecture left in the distribution. Our dynamic distribution pruning algorithm is presented in Alg. 1.

Input: training data; validation data; searching hyper graph.
Output: optimal structure.
while more than one structure remains in the search space do
    Disjointly sample network structures;
    for t = 1, …, T epochs do
        Optimize the network weights on the training data;
        Evaluate the performance on the validation data;
    end for
    Estimate the distribution (Eq. 5 and Eq. 6);
    Prune the minimal element of the distribution on each edge;
end while
Algorithm 1 Dynamic Distribution Pruning
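The loop in Alg. 1 can be illustrated with a toy, self-contained sketch. Several parts are assumptions rather than the paper's procedure: real training is replaced by a hypothetical `toy_validation_accuracy` function with a hidden per-operation quality table, the probability estimate is taken to be the mean accuracy of the samples that used each operation (the exact form of Eq. 5 is not reproduced here), and all function names are illustrative.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_validation_accuracy(arch, quality, rng, noise):
    """Stand-in for 'train for T epochs, then test on the validation set'."""
    mean_q = sum(quality[e][k] for e, k in arch.items()) / len(arch)
    return mean_q + rng.gauss(0.0, noise)


def ddp_search(quality, rng, noise=0.01):
    """Prune one operation per edge per round until one architecture is left."""
    remaining = {e: list(range(len(q))) for e, q in quality.items()}
    while any(len(ops) > 1 for ops in remaining.values()):
        # Disjoint sampling: every surviving operation is used at least once per round.
        n_samples = max(len(ops) for ops in remaining.values())
        shuffled = {e: rng.sample(ops, len(ops)) for e, ops in remaining.items()}
        archs = [{e: ops[s % len(ops)] for e, ops in shuffled.items()}
                 for s in range(n_samples)]
        accs = [toy_validation_accuracy(a, quality, rng, noise) for a in archs]
        for e, ops in remaining.items():
            if len(ops) == 1:
                continue
            # Assumed estimator: mean accuracy of the samples that used each operation.
            scores = []
            for k in ops:
                hits = [acc for a, acc in zip(archs, accs) if a[e] == k]
                scores.append(sum(hits) / len(hits))
            theta = softmax(scores)
            ops.pop(theta.index(min(theta)))  # prune the least probable operation
    return {e: ops[0] for e, ops in remaining.items()}


rng = random.Random(0)
# Hypothetical hidden quality of each operation on two edges.
quality = {(0, 2): [0.2, 0.9, 0.4], (1, 2): [0.7, 0.1, 0.3]}
best = ddp_search(quality, rng)
print(best)
```

With several edges, the score of an operation is confounded by the operations sampled on the other edges, so a single round in this toy is a noisy estimate. The search cost shrinks every round because the number of sampled structures follows the number of surviving operations.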

3.3 Theoretical Analysis

To ensure that the greedy pruning works well, we need an accurate early estimate of the probability in Eq. 5. To this end, a theoretical upper bound is given below.

Corollary 1.

In the $t$-th epoch, the standard deviation $\sigma_t$ of the estimation error in Eq. 5 is


where are two constants, and is the epoch when convergence is reached.

The corollary is a generalized conclusion from Ying et al. (2019), which demonstrates the validation performance at th epoch and that at the th epoch, where convergence is met, have an increasing Spearman Rank Correlation when is approaching . And shows strong significance in linear relationship with , with the assumption that is a constant. Considering this assumption is widely true in popular learning rate reduction schemes such as cosine annealed schedule, we generalize this empirical formulation in a formal mathematical language in Eq. 7 by introducing a deviation function of the estimation error , e.g., a low rank correlation corresponds to a high deviation.
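The Spearman rank correlation referred to above can be computed directly from two lists of validation accuracies. The sketch below uses the standard no-ties formula $\rho = 1 - 6\sum d_i^2 / (n(n^2-1))$; the accuracy values are hypothetical and chosen only to show a ranking that becomes more reliable later in training.

```python
def ranks(values):
    """Rank values from 1..n (assumes all values are distinct, i.e. no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


def spearman(a, b):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical validation accuracies of five architectures.
final_acc = [0.91, 0.93, 0.89, 0.95, 0.90]  # at convergence
early_acc = [0.40, 0.55, 0.35, 0.50, 0.45]  # after a few epochs
late_acc = [0.85, 0.88, 0.83, 0.92, 0.84]   # close to convergence
print(spearman(early_acc, final_acc), spearman(late_acc, final_acc))
```

In this made-up example the early ranking already identifies the worst architecture correctly even though it misorders the top two, which matches the intuition that poor early performers rarely become the final optimum.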

We consider pruning makes a mistake when the pruned architecture is actually the optimal one, which means the prediction error is considerably large. While the s may vary case by case, when taking all possible architectures in into consideration, there exists a threshold such that if we consider pruning makes a mistake when , we can get the same error rate, i.e. probability of pruning making mistakes. Following Eq. 7, the threshold has the form

Theorem 2.

The upper bound of the error rate of Alg. 1 is


Following the discussion above, we have the error rate equivalent to . From Chebyshev’s inequality, we have


While the bound above is for one epoch only, if we consider a series of pruning operations until one architecture is left, the overall bound of the total error rate is


Based on the standard deviation and the threshold defined in Eq. 7 and Eq. 8, the error bound can be further formulated as


Theorem 2 quantitatively demonstrates the rationale of the dynamic pruning design. The error bound is decided by the quantities defined in Eq. 7 and Eq. 8. On one hand, when the training just begins, the deviation of the estimation error is large, and we have to be conservative and avoid pruning architectures early. On the other hand, as training gets closer to convergence, we can prune more aggressively with a guaranteed low risk of missing the optimal architecture.
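The role of Chebyshev's inequality in this argument can be illustrated numerically. For a zero-mean error with standard deviation $\sigma$, the inequality gives $P(|e| \ge \tau) \le \sigma^2 / \tau^2$. In the sketch below, the threshold, the shrinking deviations, and the use of a simple sum over pruning steps (a union bound) are all assumptions for illustration; they are not the constants or the exact bound of Theorem 2.

```python
def chebyshev_bound(sigma, tau):
    """Upper bound on P(|error| >= tau) for a zero-mean error with std sigma."""
    return min(1.0, (sigma / tau) ** 2)


tau = 0.1                          # hypothetical pruning threshold
sigmas = [0.05, 0.03, 0.02, 0.01]  # hypothetical deviations, shrinking as training progresses
bounds = [chebyshev_bound(s, tau) for s in sigmas]

# Assumed aggregation: a union bound over successive pruning steps.
total_bound = min(1.0, sum(bounds))
print(bounds, total_bound)
```

The per-step bound falls quadratically as the deviation shrinks, so later pruning steps contribute little to the total risk, which is why pruning can become more aggressive as training approaches convergence.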

4 Experiments

In this section, we compare our approach with state-of-the-art methods in both effectiveness and efficiency on CIFAR-10 and ImageNet. First, we conduct experiments under the same settings as previous methods Liu et al. (2018b); Cai et al. (2018b); Zoph et al. (2018); Liu et al. (2018a) to evaluate the generalization capability, i.e., first searching on the CIFAR-10 dataset, then stacking the optimal cells into deeper networks. Second, we further perform experiments to search architectures directly on ImageNet under the mobile settings by following Cai et al. (2018c). Our results show that we can obtain network architectures with comparable performance but with far fewer GPU hours.

4.1 Search on CIFAR-10 and Transfer

In this experimental setting, we first search neural architectures on an over-parameterized network, and then evaluate the best architecture with a stacked deeper network. We ran the experiment multiple times and found that the resulting architectures showed only slight variance in performance, which demonstrates the stability of the proposed method.

4.1.1 Experiment Settings

We use the same datasets and evaluation metrics as existing NAS works Liu et al. (2018b); Cai et al. (2018b); Zoph et al. (2018); Liu et al. (2018a). First, most experiments are conducted on CIFAR-10 Krizhevsky and Hinton (2009), which has 50K training images and 10K testing images with resolution $32\times32$ from 10 classes. The color intensities of all images are normalized. During architecture search, we randomly hold out part of the training set as a validation set. To further evaluate the generalization capability, we stack the optimal cell discovered on CIFAR-10 into a deeper network, and then evaluate the classification accuracy on ILSVRC 2012 ImageNet Russakovsky et al. (2015), which consists of 1,000 classes with 1.28M training images and 50K validation images. Here, we consider the mobile setting where the input image size is $224\times224$ and the number of multiply-add operations is less than 600M.

In the search process, the network is built from a fixed number of cells, where the reduction cells are inserted in the second and the third layers. The number of search epochs depends on the estimating epoch $T$; in our experiment, the network is trained for fewer than 150 epochs, with a batch size of 512 (due to the shallow network and few operation samplings). We use SGD with momentum to optimize the network weights $W$, with a learning rate annealed down to zero following a cosine schedule, a momentum of 0.9, and weight decay. The category parameters use a separate learning rate. The search takes only 1.5 GPU hours with only one Tesla V100 on CIFAR-10.
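The cosine schedule mentioned above has a simple closed form, shown in the sketch below. The initial learning rate of 0.1 and the 150-epoch horizon are placeholder values chosen for illustration (the search learning rate is not stated here), and `cosine_lr` is an illustrative helper rather than the authors' code.

```python
import math


def cosine_lr(epoch, total_epochs, lr_init, lr_min=0.0):
    """Cosine-annealed learning rate: lr_init at epoch 0, lr_min at total_epochs."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cos)


# Placeholder values: 150 epochs, initial learning rate 0.1 (both assumed).
schedule = [cosine_lr(e, 150, 0.1) for e in range(151)]
print(schedule[0], schedule[75], schedule[150])
```

The rate decays slowly at first, fastest around the midpoint, and flattens out near zero at the end of training.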

In the architecture evaluation step, our experimental settings are similar to Liu et al. (2018b); Zoph et al. (2018); Pham et al. (2018). A large network of stacked cells is trained with additional regularization, such as cutout DeVries and Taylor (2017). When stacking cells to evaluate on ImageNet, we use two initial strided convolutional layers before stacking cells, with scale reduction at four of the cells. The total number of FLOPs is determined by the initial number of channels. The network is trained for 250 epochs with a batch size of 512, weight decay, and an initial SGD learning rate of 0.1. All the experiments and models are implemented in PyTorch Paszke et al. (2017).

| Architecture | Test Error (%) | Params (M) | Search Cost (GPU days) | Search Method |
| --- | --- | --- | --- | --- |
| ResNet-18 He et al. (2016) | 3.53 | 11.1 | - | Manual |
| DenseNet Huang et al. (2017) | 4.77 | 1.0 | - | Manual |
| SENet Hu et al. (2018) | 4.05 | 11.2 | - | Manual |
| NASNet-A Zoph et al. (2018) | 2.65 | 3.3 | 1800 | RL |
| AmoebaNet-A Real et al. (2018) | 3.34 | 3.2 | 3150 | Evolution |
| PNAS Liu et al. (2018a) | 3.41 | 3.2 | 225 | SMBO |
| ENAS Pham et al. (2018) | 2.89 | 4.6 | 0.5 | RL |
| Path-level NAS Cai et al. (2018b) | 3.64 | 3.2 | 8.3 | RL |
| DARTS (first order) Liu et al. (2018b) | 2.94 | 3.1 | 1.5 | Gradient-based |
| DARTS (second order) Liu et al. (2018b) | 2.83 | 3.4 | 4 | Gradient-based |
| Random Sample Liu et al. (2018b) | 3.49 | 3.1 | - | - |
| DDPNAS | 2.58 | 3.4 | 0.06 | Pruning |
| DDPNAS (large) | 1.9 | 4.8 | 0.06 | Pruning |

Table 1: Test error rates for our discovered architecture, human-designed networks and other NAS architectures on CIFAR-10. For fair comparison, we select the architectures and results with similar parameter counts and training conditions. In addition, we further train the optimal architecture in a larger setting, i.e., with more initial channels, more training epochs, and extra regularization.

4.1.2 Results on CIFAR-10

We compare our method with both manually designed networks and other NAS networks. The manually designed networks include ResNet He et al. (2016), DenseNet Huang et al. (2017) and SENet Hu et al. (2018). For NAS networks, we classify them according to different search methods, such as RL methods (NASNet Zoph et al. (2018), ENAS Pham et al. (2018) and Path-level NAS Cai et al. (2018b)), evolutionary algorithms (AmoebaNet Real et al. (2018)), Sequential Model Based Optimization (SMBO) (PNAS Liu et al. (2018a)), and gradient-based methods (DARTS Liu et al. (2018b)).

The summarized results for convolutional architectures on CIFAR-10 are presented in Tab. 1. In addition, we define an enhanced training variant, where a larger network with more initial channels is trained for more epochs with Auto-Augmentation Cubuk et al. (2018) and dropout Srivastava et al. (2014). It is worth noting that the proposed method outperforms the state-of-the-art methods Zoph et al. (2018); Liu et al. (2018b) in accuracy while consuming much less computation (0.06 GPU days, compared with 1800 GPU days in Zoph et al. (2018)). We attribute our superior results to our novel way of solving the problem with pruning, as well as the fast learning procedure: the network architecture can be directly obtained from the distribution when it converges. In contrast, previous methods Zoph et al. (2018) evaluate architectures only when the training process is complete, which is highly inefficient. Another notable observation in Tab. 1 is that, even with random sampling in the search space, the test error rate in Liu et al. (2018b) is only 3.49%, which is comparable with the previous methods in the same search space. We can therefore conclude that the high performance of the previous methods comes partially from the search space, which is carefully and manually designed with specific expert knowledge. Meanwhile, the proposed method quickly explores the search space and generates a better architecture. We also report the results of hand-crafted networks in Tab. 1. Clearly, our method shows a notable enhancement, which indicates its superiority in both resource consumption and test accuracy.
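The search-cost comparison can be made concrete by converting the reported 1.5 GPU hours into GPU days and dividing the search costs listed in Tab. 1 by it. The sketch below only redoes this arithmetic on the reported figures; the variable names are illustrative.

```python
# Search costs in GPU days, as listed in Tab. 1.
search_cost_gpu_days = {
    "NASNet-A": 1800,
    "AmoebaNet-A": 3150,
    "PNAS": 225,
    "ENAS": 0.5,
    "Path-level NAS": 8.3,
    "DARTS (first order)": 1.5,
    "DARTS (second order)": 4,
}

ddpnas_gpu_days = 1.5 / 24  # 1.5 GPU hours, rounded to 0.06 in Tab. 1

speedup = {name: cost / ddpnas_gpu_days
           for name, cost in search_cost_gpu_days.items()}
for name, factor in sorted(speedup.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {factor:.0f}x")
```

By this arithmetic, the factor of more than 1,000 holds against the RL, evolutionary and SMBO entries with three- and four-digit GPU-day costs (NASNet-A, AmoebaNet-A, PNAS), while the gap to ENAS and first-order DARTS is roughly one order of magnitude.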

4.1.3 Results on ImageNet

We further compare our method under the mobile setting on ImageNet to demonstrate the generalization capability. The best architecture obtained by our algorithm on CIFAR-10 is transferred to ImageNet, which follows the same experimental setting as the works in Zoph et al. (2018); Pham et al. (2018); Cai et al. (2018b). Results in Tab. 2 show that the best cell architecture on CIFAR-10 is transferable to ImageNet. The proposed method achieves comparable accuracy to state-of-the-art methods Zoph et al. (2018); Real et al. (2018); Liu et al. (2018a); Pham et al. (2018); Liu et al. (2018b); Cai et al. (2018b) while using far fewer computational resources.

| Architecture | Top-1 Accuracy (%) | Top-5 Accuracy (%) | Params (M) | Search Cost (GPU days) | Search Method |
| --- | --- | --- | --- | --- | --- |
| MobileNetV2 Sandler et al. (2018) | 72.0 | 91.0 | 3.4 | - | Manual |
| ShuffleNetV2 2x (V2) Ma et al. (2018) | 73.7 | - | 5 | - | Manual |
| NASNet-A Zoph et al. (2018) | 74.0 | 91.6 | 5.3 | 1800 | RL |
| AmoebaNet-A Real et al. (2018) | 74.5 | 92.0 | 5.1 | 3150 | Evolution |
| AmoebaNet-C Real et al. (2018) | 75.7 | 92.4 | 6.4 | 3150 | Evolution |
| PNAS Liu et al. (2018a) | 74.2 | 91.9 | 5.1 | 225 | SMBO |
| DARTS Liu et al. (2018b) | 73.1 | 91.0 | 4.9 | 4 | Gradient-based |
| DDPNAS (Ours) | 74.3 | 91.8 | 4.51 | 0.06 | Pruning |

Table 2: Comparison with the state-of-the-art image classification methods on ImageNet. All the NAS networks are searched on CIFAR-10, and then directly transferred to ImageNet.

| Model | Top-1 (%) | Search time (GPU days) | GPU latency |
| --- | --- | --- | --- |
| MobileNetV2 | 72.0 | - | 6.1ms |
| ShuffleNetV2 | 72.6 | - | 7.3ms |
| Proxyless (GPU) Cai et al. (2018c) | 74.8 | 4 | 5.1ms |
| Proxyless (CPU) Cai et al. (2018c) | 74.1 | 4 | 7.4ms |
| DDPNAS | 75.2 | 2 | 6.09ms |

Table 3: Comparison with the state-of-the-art image classification methods on ImageNet under the mobile settings. The networks are directly searched on ImageNet with MobileNetV2 Sandler et al. (2018) as the backbone.

4.2 Search on ImageNet

The minimal time and GPU memory consumption make applying our algorithm on ImageNet feasible. We further conduct a search experiment on ImageNet by following Cai et al. (2018c). In particular, we employ a set of mobile convolutional layers with various kernel sizes and expansion ratios. To further accelerate the search, we directly use the network with the CPU and GPU structure obtained in Cai et al. (2018c). In this way, the zero and identity layers in the search space are abandoned, and we only search the hyper-parameters related to the convolutional layers.

On ImageNet, we keep the same search hyper-parameters as on CIFAR-10. We follow the training settings in Cai et al. (2018c) and train the models across multiple Tesla V100 GPUs with a learning rate annealed down to zero following a cosine schedule. Experimental results are reported in Tab. 3, where our DDPNAS achieves superior performance compared to both human-designed and automatically searched architectures with much less computation cost.

5 Conclusion

In this paper, we presented DDPNAS, the first pruning-based architecture search algorithm based on dynamic distributions for convolutional networks, which is able to reduce the search time by pruning the search space in the early training stage. DDPNAS can drastically reduce the computation cost while achieving excellent model accuracies on CIFAR-10 and ImageNet compared with other NAS methods. Furthermore, DDPNAS can directly search on ImageNet, outperforming human-designed networks and other NAS methods under mobile settings.