1 Introduction
Machine learning researchers have invested much effort over the last decades into feature engineering, the process of handcrafting features for machine learning algorithms. With the proliferation of deep learning, this process has been replaced by the manual design of larger and more complex models.
Model design requires domain expertise and many rounds of trial-and-error. Neural architecture search (NAS) (Zoph and Le (2016)) automates this process using RL; however, searching for a new architecture can require thousands of GPU hours. Due to the cost of prevailing NAS methods, most techniques search for an architecture over a small proxy dataset and release the discovered architecture as their contribution. This is suboptimal: neither the proxy dataset nor the resource constraints targeted during the search could possibly address all downstream uses of this architecture.
Differentiable NAS (DNAS) (Cai et al. (2018b); Liu et al. (2018)) methods aim to alleviate this limitation by building a superset network (supernetwork or search space) and searching for useful subnetworks using gradient descent. These supernetworks are typically composed of densely connected building blocks with multiple parallel operations. The goal of the search method is to prune connections and operations, trading representational capacity for efficiency, to fit a certain computation budget. DNAS methods can be viewed as pruning methods with the subtle difference that they are applied on manually designed supernetworks with redundant components while pruning methods (LeCun et al. (1990); Han et al. (2015)) are usually applied on standard models.
The canonical approach to DNAS is to select an operator from a fixed set of operators by gating their outputs and treating them as unmodifiable (black-box) units. In this sense, DNAS has inherited some of the limitations of RL methods since they cannot dynamically change the units during optimization. For instance, to learn the width of each layer, DNAS and RL methods typically enumerate a set of fixed-width operators, generating independent outputs for each. This is not only computationally expensive but also a coarse way of exploring subnetworks.
We propose a search method inspired by structured pruning. For each output feature of a layer (i.e., an output channel in a convolutional layer, or a neuron in a dense layer), we assign a Bernoulli random variable (a mask) indicating whether that feature should be used. We use the Logistic-Sigmoid distribution (Maddison et al. (2016)) to relax the binary constraint and learn the masking probabilities using gradient descent, optimizing for various resource constraints such as model size or FLOPs. We export an architecture, defined as a mapping from layers to numbers of neurons, by sampling a mask at the end of training.
Our method can be applied to any search space by simply inserting masks after each layer. It is fine-grained in that we search over a larger space of architectures than ordinary DNAS methods by applying masks on operator outputs as well as on the intermediate layers that compose each operator. Our method can thus simultaneously select a subset of operators and modify them.
In some sense, DNAS has shifted the problem of architecture design to search space design. Many DNAS works target a single metric on a single, manually-designed search space; however, each search space may come with its own merits. This coupling between search space and algorithm makes it hard to (1) compare different search algorithms, and (2) understand the biases inherent to different search spaces (Sciuto et al. (2019); Radosavovic et al. (2019)). NAS-Bench-101 (Ying et al. (2019)) addresses the former, providing a large set of architectures trained on CIFAR to evaluate RL-based NAS algorithms. On the other hand, our method can be used to study the latter. Since our method can easily be injected into any DNAS search space, we can characterize their bias toward certain metrics. We find that some produce models that are more Pareto-efficient for model size while others are more Pareto-efficient for FLOPs/latency.
When applied to well-known search spaces (Bender et al. (2018); Wu et al. (2019)), our method matches or outperforms the original search method. When applied to the One-Shot search space, it achieves state-of-the-art small-model accuracy on ImageNet (by a 2-5% margin). When using ImageNet-learned architectures as backbones for detection, it achieves +4 mAP over mobile baselines on COCO. When applied to commonly used ResNet models, it outperforms pruning baselines.
2 Related Work
Neural architecture search (NAS) automates the design of neural net models with machine learning. Early approaches (Zoph and Le (2016); Baker et al. (2016)) train a controller to build the network with reinforcement learning (RL). These methods require training and evaluating thousands of candidate models and are prohibitively expensive for most applications. Weight-sharing methods (Brock et al. (2017); Pham et al. (2018); Cai et al. (2018a)) amortize the cost of evaluation; however, (Sciuto et al. (2019)) suggest that these amortized evaluations are noisy estimates of actual performance.
Of growing interest are mobile NAS methods which produce smaller architectures that fit certain computational budgets or are optimized for certain hardware platforms. MnasNet (Tan et al. (2019)) is an RL-based method that optimizes directly for specific metrics (e.g., mobile latency) but takes several thousand GPU-hours to search. One-shot and differentiable neural architecture search (DNAS) (Bender et al. (2018); Liu et al. (2018)) methods cast the problem as finding optimal subnetworks in a continuous relaxation of NAS search spaces which is optimized using gradient descent.
Our work is most closely related to probabilistic DNAS methods which learn stochastic gating variables to select operators in these search spaces. (Cai et al. (2018b)) use hard (binary) gates and a straight-through estimation of the gradient, whereas (Xie et al. (2018); Wu et al. (2019); Dong and Yang (2019)) use soft (non-binary) gates sampled from the Gumbel-Softmax distribution (Jang et al. (2016); Maddison et al. (2016)) to relax the discrete choice over operators. In contrast, our method performs a fine-grained search over the set and composition of operators. Some methods learn a single cell structure that is repeated throughout the network (Dong and Yang (2019); Xie et al. (2018); Bender et al. (2018)) whereas our method learns cell structures independently.
Our work draws inspiration from structured pruning methods (Luo et al. (2017); Liu et al. (2017); Wen et al. (2016)). MorphNet (Gordon et al. (2018)) adjusts the number of neurons in each layer with an L1 penalty on BatchNorm scale coefficients, treating them as gates. (Louizos et al. (2017)) propose an L0-regularization method to induce exact sparsity for one-round compression. Recent work by (Mei et al. (2020)) independently proposes fine-grained search with a sparsity-inducing penalty. In contrast, we propose a stochastic method that samples sparse architectures throughout the search process.
Recent analytical works highlight the importance of search space design. Of particular relevance is the study in (Radosavovic et al. (2019)) which finds that randomly sampled architectures from certain spaces (e.g., DARTS (Liu et al. (2018))) are superior when normalizing for certain measures of complexity. (Sciuto et al. (2019)) find that randomly sampling the search space produces architectures on par with both controllerbased and DNAS methods. (Xie et al. (2019)) suggest that the wiring of search spaces plays a critical role in the performance of subnetworks. The success of NAS methods, therefore, can be attributed in no small part to search space design.
3 Method
We search for architectures that minimize both a task loss L and a computational cost C. Our approach is akin to stochastic differentiable search methods such as (Xie et al. (2018)) which formulate architecture search as sampling subnetworks in a large supernetwork composed of redundant components (operators). While efficient, these methods add restrictive priors to the learned architectures: (1) the operators (e.g., a set of depthwise-separable convolutions with various widths and kernel sizes) are hand-designed and the search algorithm cannot modify them, and (2) the search algorithm is limited to selecting a single operator for each layer in the network.
Our method relaxes both constraints by (1) modifying operators during the search process, and (2) allowing more than one operator per layer. To concretely illustrate the benefit of (1), we focus on the width (i.e., number of filters or neurons) of each convolution. To modify the width during the search process, our method learns a sampling distribution over individual neurons in the supernetwork instead of a distribution over operators. As a result, the operators in the learned architectures can have fine-grained, variable widths which are not limited to a predefined set of values.
Our method progresses in two phases: an architecture learning phase (AL) where we output an optimized architecture by minimizing both L and C, followed by a retraining phase with L alone. Our loss for AL is similar to sparsity-inducing regularizers (Gordon et al. (2018)). Sec. 3.1 describes our stochastic relaxation and sampling method, Sec. 3.2 describes the masking layer and regularization penalty, and Sec. 3.3 describes our formulation of fine-grained search on existing spaces.
3.1 Inducing Sparsity with the Logistic-Sigmoid Distribution
Let W be the weights of the network. We assume computational costs of the form C(𝟙[W ≠ 0]), i.e., a function of the set of indicators for whether each weight is nonzero. FLOPs, size, and latency can be expressed exactly or well-approximated in this form. The objective is then:

min_W L(W) + λ·C(𝟙[W ≠ 0])    (1)
We refer to λ as the regularization strength. Since 𝟙[W ≠ 0] has zero gradient wherever it is differentiable, we cannot minimize C with gradient descent. Instead, we formulate the problem as a stochastic estimation task. Let m denote a binary mask to be applied on W, where the m_i are independent Bernoulli(θ_i) variables. We minimize the usage of weight w_i by minimizing the probability θ_i that its mask is 1, so we can safely prune w_i. Our sampled architectures are defined by m. By substituting W with W ⊙ m, our objective becomes:

min_{W,θ} E_m[L(W ⊙ m)] + λ·E_m[C(m)]    (2)
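As a sanity check on the stochastic objective, when the cost is linear in the mask (e.g., a parameter count), the expected cost of a sampled architecture equals the cost evaluated at the keep-probabilities. A minimal Monte Carlo sketch in NumPy (the probabilities below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.9, 0.5, 0.1])          # keep-probabilities for 3 weights

# Sample many independent Bernoulli masks and average the parameter count.
masks = (rng.random((200_000, 3)) < theta).astype(float)
sampled_cost = masks.sum(axis=1).mean()    # Monte Carlo estimate of E[C(m)]

# For a linear cost (number of unmasked weights), E[C(m)] = sum(theta).
assert abs(sampled_cost - theta.sum()) < 0.02
```

This is why regularizing the probabilities directly (the expected cost) is a faithful surrogate for the sampled cost when the cost function is linear.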
where ⊙ denotes elementwise product. Unless otherwise specified, all expectations are taken w.r.t. m ~ Bernoulli(θ) and we drop the subscript on E for brevity. We can estimate the gradient w.r.t. θ with black-box methods, e.g., perturbation methods (Spall et al. (1992)) or the log-derivative trick (Williams (1992)); however, these estimators generally suffer from high variance. Instead, we relax m with a continuous sample from the Logistic-Sigmoid distribution:

m_i = σ((log(θ_i / (1 − θ_i)) + l) / τ), where l = log(u) − log(1 − u), u ~ Uniform(0, 1).

The Logistic-Sigmoid distribution is the binary case of the Gumbel-Softmax (a.k.a. Concrete) distribution (Jang et al. (2016); Maddison et al. (2016)). As τ → 0, m_i approaches {1, 0} with probability {θ_i, 1 − θ_i} respectively. Factoring out the logistic noise as a parameter-free component allows us to backpropagate through the mask and learn θ with gradient descent. The resulting gradient estimator has lower variance than black-box methods (Maddison et al. (2016)). We optimize the logits log(θ / (1 − θ)) in practice for numerical stability.
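The sampling step above can be sketched as follows. `sample_relaxed_mask` is a hypothetical helper name, and the code assumes the binary-Concrete parameterization just described:

```python
import numpy as np

def sample_relaxed_mask(theta, temperature, rng):
    """Draw a continuous relaxation of Bernoulli(theta) masks.

    Logistic(0, 1) noise is added to the log-odds of theta and squashed
    through a sigmoid; as temperature -> 0 the samples approach {0, 1}.
    """
    u = rng.uniform(1e-8, 1.0 - 1e-8, size=np.shape(theta))
    logistic_noise = np.log(u) - np.log1p(-u)       # Logistic(0, 1) sample
    log_odds = np.log(theta) - np.log1p(-theta)     # the logits we optimize
    return 1.0 / (1.0 + np.exp(-(log_odds + logistic_noise) / temperature))

rng = np.random.default_rng(0)
theta = np.full(10_000, 0.7)
m = sample_relaxed_mask(theta, temperature=0.05, rng=rng)
# At low temperature the mask is near-binary and ~70% of entries are "on".
assert np.all((m >= 0) & (m <= 1))
assert abs((m > 0.5).mean() - 0.7) < 0.03
```

Because the noise is a parameter-free input, the sample is differentiable in the logits and no straight-through estimator is needed.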
Our stochastic relaxation allows us to better model the sparsity of the learned architectures during the search phase than deterministic relaxations. To illustrate, consider the common deterministic approach of relaxing 𝟙[W ≠ 0] with a norm-based penalty on the weights (Wen et al. (2016); Gordon et al. (2018); Mei et al. (2020)). In this case, the weights can be far from 0 during training, which can be problematic if the network relies on the information encoded in these pseudo-sparse weights to make accurate predictions. Instead, we want to simulate real sparsity during AL. Other deterministic methods apply a saturating nonlinearity (e.g., sigmoid or softmax) to force values close to {0, 1} (Liu et al. (2018)). However, these functions suffer from vanishing gradients at the extrema: once a weight is close to zero, it remains close to zero. This limits the number of sparse networks explored during AL. In contrast, our sampled mask is close to {0, 1} at all times at low τ, which forces the network to adapt to sparse activations, and the mask can be nonzero even as θ_i approaches 0, which allows the network to visit diverse sparse states during AL.
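The saturation argument can be checked numerically: the gradient of a deterministic sigmoid gate vanishes once the gate saturates, so a gate that reaches zero effectively stays there. `sigmoid_grad` is a throwaway helper, not part of the method:

```python
import math

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Maximal gradient at the midpoint, essentially zero at the extremes.
assert sigmoid_grad(0.0) == 0.25
assert sigmoid_grad(10.0) < 1e-4
```

A stochastic gate avoids this trap because sampling keeps nonzero probability mass on both sides of the threshold even when θ is small.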
3.2 Group Masking and Regularization
As neurons are pruned during the search process, we can prune downstream dependencies as well. We group each neuron and its downstream dependencies by sharing a single mask across all their elements. To illustrate, consider the weight matrices of two 1x1 convolutions A ∈ ℝ^{c_in×n} and B ∈ ℝ^{n×c_out}, where the output of A is fed into B. If output neuron i of A is pruned, then input row i of B can be pruned and vice versa. Therefore, all elements in column i of A and row i of B share a scalar mask m_i.
This row-column grouping can be implemented conveniently by applying a separate mask on each channel of the activations produced by each convolution and fully-connected layer. This allows us to encapsulate all architecture-learning meta-variables in a drop-in layer which can easily be inserted into the search space.
Let C_A and C_B be the contributions of A and B to the total cost C. To encourage sparsity, we can either regularize the sampled mask m or the distribution parameters θ: the former penalizes the cost of sampled architectures while the latter penalizes the expected cost. In our example above, the sampled and expected costs (in number of parameters) are:

C_A + C_B = (c_in + c_out) · Σ_i m_i    (3)

E[C_A + C_B] = (c_in + c_out) · Σ_i θ_i    (4)
Note however that c_in and c_out are dynamic quantities: as inputs to A and outputs of B are masked out by adjacent masking layers, c_in and c_out decrease as well. To capture this dynamic behavior, we apply our differentiable relaxation from Sec. 3.1 again to approximate c_in and c_out. Let m^in (with parameters θ^in) and m^out (with parameters θ^out) be per-channel masks on the inputs to A and the outputs of B. The sampled and expected costs are then:

C_A + C_B = (Σ_j m^in_j + Σ_k m^out_k) · Σ_i m_i    (5)

E[C_A + C_B] = (Σ_j θ^in_j + Σ_k θ^out_k) · Σ_i θ_i    (6)

where (6) follows from the independence of the masks.
We observe that minimizing (6) is more stable than minimizing (5). We use (6) for our results in Sec. 4, and scale appropriately for different costs such as FLOPs.
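The expected cost in (6) is simple to compute from the keep-probabilities alone. A sketch with a hypothetical helper name, assuming the parameter-count cost decomposition for the A-B pair described above:

```python
def expected_pair_cost(theta_in, theta, theta_out):
    """Expected #params of two chained 1x1 convolutions A and B that share
    mask probabilities `theta` on their common channels, with `theta_in` /
    `theta_out` the keep-probabilities of the pair's inputs and outputs."""
    return (sum(theta_in) + sum(theta_out)) * sum(theta)

# With all adjacent channels kept (probability 1), this reduces to the
# static cost (c_in + c_out) * sum(theta): here (3 + 2) * 1.0 = 5.0.
assert expected_pair_cost([1, 1, 1], [0.5, 0.5], [1, 1]) == 5.0
```

For FLOPs one would additionally weight each term by the spatial dimensions of the corresponding activation, which is why the penalty is rescaled for different cost targets.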
After AL, we export a single architecture, defined as a mapping from each convolution layer to its expected number of neurons under the learned distribution parameters θ. In our example above, convolution A would have Σ_i θ_i neurons in the exported architecture.
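The export step amounts to summing keep-probabilities per layer. In the sketch below, `export_architecture` is a hypothetical helper; rounding to an integer and keeping at least one neuron are our illustrative choices, not prescribed by the text:

```python
def export_architecture(layer_thetas):
    """Map each layer name to its expected neuron count, rounded to an int.

    `layer_thetas` maps layer name -> list of per-channel keep-probabilities.
    The rounding and the floor of 1 neuron are assumptions for illustration.
    """
    return {name: max(1, round(sum(thetas)))
            for name, thetas in layer_thetas.items()}

arch = export_architecture({"cell0/conv1": [0.9, 0.8, 0.1, 0.05]})
assert arch == {"cell0/conv1": 2}   # ~1.85 expected neurons -> 2
```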
3.3 FineGrained Search
To apply our method to DNAS search spaces, we simply insert masking layers after convolution layers as illustrated in Fig. 1. We run our search algorithm on the OneShot and FBNet search spaces. The OneShot search space is composed of a series of cells which are in turn composed of densely connected blocks. Each block consists of several operators, each of which applies a series of convolutions on the blocks’ inputs. Similarly, the FBNet search space is composed of stages which are in turn composed of blocks. The outputs of the operators are added together. We refer the reader to (Bender et al. (2018); Wu et al. (2019)) for more details.
DNAS methods generally gate the operator outputs directly to select a single operator with, e.g., a softmax layer. In contrast, our architectures can have more than one operator per layer. Operators are removed from the network by learning to sample all-zero masks on the operator's output or on the output of any of its intermediate activations. This process is illustrated in Fig. 1(c): note that our method can select between 0 and all operators in each block. Since our method simultaneously optimizes the set of operators and their widths, the space of possible architectures we search is an exponentially larger superset of the original search space.
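The removal mechanism can be illustrated with a toy additive block: an all-zero mask on one operator's output drops it from the sum entirely (the values below are illustrative, not from the paper):

```python
import numpy as np

op_outputs = [np.ones((2, 4)), 2.0 * np.ones((2, 4))]   # two parallel operators
masks = [np.ones(4), np.zeros(4)]                        # operator 2 fully masked

block_out = sum(out * m for out, m in zip(op_outputs, masks))

# The block is now functionally a single-operator block, so operator 2
# (and all of its weights) can be pruned from the exported architecture.
assert np.allclose(block_out, op_outputs[0])
```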
Our method matches the performance of the original search algorithms when applied to the original One-Shot and FBNet search spaces with no modifications. However, these search spaces are designed for coarse-grained search (operator-level sampling). We propose two minor modifications to the search space to take full advantage of fine-grained search. Importantly, these modifications do not improve the accuracy of the architectures in and of themselves; they only give more flexibility for fine-grained search and reduce the runtime of the search phase.
Concat Aggregator. By adding operator outputs, we force all operator output dimensions to match during AL. This restricts fine-grained search in that each operator must have the same output shape. Instead, we can concatenate the outputs and pass them through a 1x1 convolution (concat aggregator), which is a generalization of the additive aggregator. The benefits are twofold: (1) operator outputs can have variable sizes, and (2) the 1x1 convolution can learn a better mixing formula for the operator outputs. In practice, we observe that the concat aggregator works better on the One-Shot search space when targeting model size and the additive aggregator works better on FBNet when targeting FLOPs.
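The claim that the concat aggregator generalizes the additive one can be checked directly: a 1x1 convolution whose weight stacks identity matrices reproduces the sum exactly (1x1 convolutions over channels are written as matrix products in this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8))   # operator outputs: (spatial positions, channels)
x2 = rng.normal(size=(4, 8))

additive = x1 + x2

# Concat aggregator: concatenate channels, then a 1x1 conv (a matmul over
# channels). Stacked identity weights recover exact addition, so the concat
# aggregator can represent the additive one, plus strictly more mixings.
w = np.vstack([np.eye(8), np.eye(8)])
concat_agg = np.concatenate([x1, x2], axis=1) @ w

assert np.allclose(additive, concat_agg)
```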
Removing Redundant Operators. To explore various architectural hyperparameter choices, coarse-grained NAS methods enumerate a discrete set of options. For instance, learning whether a convolution in a given block should have 16 or 32 filters would require including two separate weight tensors in the set of options. Not only is this computationally inefficient, scaling both latency and memory with each additional operator, but the enumeration may not be granular enough to contain the optimal size. In contrast, fine-grained search can shrink the 32-filter convolution to be functionally equivalent to the 16-filter convolution; therefore, we only need to include the former. In practice, this results in a 3x reduction in the number of operators in the FBNet search space and a 2.5x reduction in search runtime, with no loss of quality in the learned architectures.
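The equivalence behind this reduction is easy to verify: masking out 16 of the 32 output channels leaves a network identical to one built with a 16-filter convolution, so the smaller enumerated option is redundant (a toy matmul stands in for a 1x1 convolution here):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))            # 5 positions, 3 input channels
w32 = rng.normal(size=(3, 32))         # a 32-filter 1x1 conv (as a matmul)

mask = np.concatenate([np.ones(16), np.zeros(16)])
masked_out = (x @ w32) * mask

# The surviving channels match a 16-filter conv built from the same weights,
# so a separate 16-filter operator need not be enumerated.
assert np.allclose(masked_out[:, :16], x @ w32[:, :16])
assert np.allclose(masked_out[:, 16:], 0.0)
```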
4 Results
We use TensorFlow (Abadi et al. (2016)) for all our experiments. Our algorithm takes hours to search and hours to retrain on ImageNet using a single 4x4 (32-core) Cloud TPU.

4.1 Our Method on the One-Shot Search Space
In this section, we evaluate our search algorithm on the One-Shot search space (Bender et al. (2018)) to find efficient architectures for ImageNet classification (Russakovsky et al. (2015)). To compare against their results, we target model size. We use the same search space instantiation — 8 cells, 4 blocks per cell, separable convolutions, and downsampling every 2 cells. We merge the outputs of each path with a concat aggregator. Despite increasing the number of parameters, the concat aggregator does not increase the base accuracy of the supernetwork. The search space is illustrated in Fig. 1(a) — note that we apply masks on operator outputs as well as on the individual convolutions that compose each operator. Our reproduction of their supernetwork matches their published results.
The mask-logit variables are initialized to (). We fix the temperature τ without annealing and use our relaxation of m for both forward and backward passes. To learn architectures of different sizes, we vary the regularization coefficient λ. For AL, we train for 100 epochs using ADAM with batch size 512 and learning rate 1.6, decayed by 0.5 every 35 epochs. For retraining, we use the same settings, except we train until convergence (400 epochs) and double the batch size and learning rate (1024 and 3.2, resp.) for faster convergence (Smith et al. (2017)).

| Model | Top-1 Acc. | #Params |
|---|---|---|
| A | 69.9±0.1% | 1.3±0.02M |
| One-Shot Small (Bender et al. (2018)) | 67.9% | 1.4M |
| MnasNet-Small (Tan et al. (2019)) | 64.9% | 1.9M |
| MobileNetV3-Small-0.75 (Howard et al. (2019)) | 65.4% | 2.4M |
| B | 75.0±0.5% | 2.7±0.06M |
| MobileNetV3-Small-1.0 (Howard et al. (2019)) | 67.4% | 2.9M |
| One-Shot Small (Bender et al. (2018)) | 72.4% | 3.0M |
| MnasNet-65 (Tan et al. (2019)) | 73.0% | 3.6M |
| FBNet-A (Wu et al. (2019)) | 73.0% | 4.3M |
| C | 77.1±0.03% | 4.4±0.04M |
| AtomNAS-B (Mei et al. (2020)) | 75.5% | 4.4M |
| One-Shot Small (Bender et al. (2018)) | 74.2% | 5.1M |
| EfficientNet-B0 (Tan and Le (2019)) | 76.3% | 5.3M |
| MobileNetV3-Large-1.0 (Howard et al. (2019)) | 75.2% | 5.4M |
Fig. 2 shows the performance of our method against the One-Shot algorithm and random search. Table 5 shows our results compared with other mobile classification models. Our search algorithm outperforms both random search and the One-Shot search algorithm, and achieves state-of-the-art top-1 accuracy in the mobile regime across several mobile NAS baselines, outperforming EfficientNet-B0 and MobileNetV3. Our search time is comparable with other DNAS methods and significantly faster than MnasNet, which supplies the base network for MobileNetV3 and EfficientNet. Note that our full search space has 78.5% top-1 accuracy on ImageNet, which is an upper bound on the performance of our subnetworks. Although this upper bound is well below state-of-the-art ImageNet accuracy, we are still able to produce state-of-the-art small models.


4.2 Comparing Search Spaces
We investigate whether certain search spaces are suited for particular computational costs and provide evidence in favor. A rigorous study would require enumerating and evaluating all searchable subnetworks on each space, which is infeasible. Instead, (Sciuto et al. (2019); Radosavovic et al. (2019)) study the efficiency of search spaces by randomly sampling architectures. This analysis is useful in determining the inherent advantages of each search space independently of the search algorithm being used. However, search algorithms may be biased toward particular subspaces of architectures based on the specific cost targeted during search (Gordon et al. (2018)) and uniform sampling may not capture this bias. Therefore, in addition to random sampling, it may be useful to compare search spaces via the performance of a search algorithm under different cost objectives.
We investigate with FBNet (Wu et al. (2019)) since its construction significantly differs from the One-Shot search space and similar constructions are used in other works (Howard et al. (2019); Mei et al. (2020)). We use FLOPs as a second metric of interest. To make a meaningful comparison between search spaces, we first verify that our method matches the performance of FBNet search, as shown in Table 2 (left).¹ We then run with both FLOPs and size costs on the One-Shot and FBNet search spaces, as shown in Table 2 (right). Our method finds more parameter-efficient networks in the One-Shot search space and more FLOPs-efficient networks in the FBNet search space, by significant margins.

¹ FBNet-{B, C} and our {B, C} models were evaluated using our reimplementation of their training code.
4.3 Our Method on the ResNet Search Space
We compare the performance of our method with (1) the width multiplier, a commonly used compression heuristic that uniformly scales down the number of filters in each layer (Howard et al. (2017)), and (2) MorphNet, a deterministic model compression technique which uses L1 regularization to induce sparsity (Gordon et al. (2018)). We use MorphNet as a baseline since it can target various computational costs and (Mei et al. (2020)) use a similar technique.

Fig. 3 shows our results on ResNet-50 and ResNet-152 on ImageNet. On both networks, our method outperforms the width multiplier and performs on par with MorphNet.
4.4 Mobile Object Detection
In this section, we demonstrate the performance of our ImageNet-learned architectures as backbones for mobile object detection, using the SSDLite meta-architecture (Sandler et al. (2018)) designed for small models. We connect the output of cell 5 (stride 16) to the first layer of the feature extractor and the output of the final 1x1 convolution before the global pool (stride 32) to the second layer. We compare against MobileNets and MnasNet, which both use SSDLite.
| Backbone | Params | mAP | mAP (small) | mAP (medium) | mAP (large) |
|---|---|---|---|---|---|
| Small | 1.91M | 19.1 | 2.6 | 15.4 | 37.5 |
| MobileNetV3-Small | 1.77M | 14.9 | 0.7 | 5.6 | 28.0 |
| Large | 3.02M | 25.8 | 4.4 | 24.1 | 47.8 |
| MobileNetV3-Large | 3.22M | 21.8 | 1.9 | 12.7 | 40.7 |
| MnasNet-A1 | 4.90M | 23.0 | 3.8 | 21.7 | 42.0 |
Our results are shown in Table 3. We achieve a +4 mAP margin over MobileNetV3. Note that instead of transferring ImageNet-learned architectures, we could also apply our search method to learn the backbone directly on the detection dataset, using differentiable relaxations of search spaces designed for detection such as (Chen et al. (2019)). This would likely produce more efficient architectures and is left as future work.
4.5 On Convergence and Reducing Runtime
| Target Size | Epochs (AL) | Acc. |
|---|---|---|
| 2M params | 100 | 71.8% |
| | 40 | 71.5% |
| | 20 | 70.4% |
| 5M params | 100 | 76.4% |
| | 40 | 76.2% |
| | 20 | 75.8% |
We explore the limits of reducing the sample complexity of the architecture learning phase. Given a target model size, we explore the tradeoff between running AL for longer with a weak regularization strength λ and converging faster with a strong one. We demonstrate with two different target sizes (2M and 5M params). The results are shown in Fig. 4. In both cases, we can truncate AL to 40 epochs with a negligible drop in accuracy, reducing the runtime of our search by 2.5x. Searching for only 20 epochs reduces model quality by 0.5-1% but results in a 2x speedup over the One-Shot method while still producing better models.
5 Conclusion
We present a fine-grained differentiable architecture search method which stochastically samples subnetworks and discovers well-performing models that minimize resource constraints such as memory or FLOPs. While most DNAS methods select from a fixed set of operations, our method modifies operators during optimization, thereby searching a much larger set of architectures. We demonstrate the effectiveness of our approach on two contemporary DNAS search spaces (Wu et al. (2019); Bender et al. (2018)) and produce state-of-the-art small models on ImageNet. While most NAS works focus on FLOPs or latency, there is significant practical benefit to low-memory models in both server-side and on-device applications.
Our method can be applied to any model or search space by inserting a mask-sampling layer after every convolution. Due to its small search cost, our method can learn efficient architectures for any task or dataset on the fly.
Broader Impact
Deep models have been doubling in size every few months since 2012 and have a large carbon footprint; see (Strubell et al. (2019); Schwartz et al. (2019)). Moreover, state-of-the-art models are often too large to deploy on low-resource devices, limiting their use to flagship mobile devices that are too expensive for most consumers. By automating the design of models that are lightweight and consume little energy, and doing so with an algorithm that is also lightweight and adaptive to different constraints, our community can make sure that the fruits of ML/AI are shared more broadly with society, are not limited to the most affluent, and do not become a major contributor to climate change.
Contributions
Shraman led the research, ran most of the experiments, and wrote most of the paper. Yair helped with writing and provided valuable feedback through code review.
Elad and Yair jointly proposed the idea to apply structured pruning for DNAS. Hanhan, Shraman, and Yair jointly proposed the idea to apply GumbelSoftmax for finegrained search. Yair developed the LogisticSigmoid regularizer, and the method matured through continuous discussion between Elad, Shraman, and Yair.
Max ran experiments on object detection and provided critical engineering help. Elad and Hanhan ran experiments on ResNet.
References
 Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 Baker et al. (2016) Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
 Bender et al. (2018) Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pages 549–558, 2018.
 Brock et al. (2017) Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Smash: One-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.

 Cai et al. (2018a) Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018a.
 Cai et al. (2018b) Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018b.
 Chen et al. (2019) Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Chunhong Pan, and Jian Sun. Detnas: Neural architecture search on object detection. arXiv preprint arXiv:1903.10979, 2019.

 Dong and Yang (2019) Xuanyi Dong and Yi Yang. Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1761–1770, 2019.
 Gordon et al. (2018) Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. Morphnet: Fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1586–1595, 2018.

 Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 Howard et al. (2019) Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. arXiv preprint arXiv:1905.02244, 2019.
 Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
 LeCun et al. (1990) Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
 Lindauer and Hutter (2019) Marius Lindauer and Frank Hutter. Best practices for scientific research on neural architecture search. arXiv preprint arXiv:1909.02453, 2019.
 Liu et al. (2018) Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 Liu et al. (2017) Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pages 2736–2744, 2017.
 Louizos et al. (2017) Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312, 2017.
 Luo et al. (2017) Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 5058–5066, 2017.
 Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Mei et al. (2020) Jieru Mei, Yingwei Li, Xiaochen Lian, Xiaojie Jin, Linjie Yang, Alan Yuille, and Jianchao Yang. Atomnas: Fine-grained end-to-end neural architecture search. arXiv preprint arXiv:1912.09640, 2020.
 Pham et al. (2018) Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
 Radosavovic et al. (2019) Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo, and Piotr Dollár. On network design spaces for visual recognition. arXiv preprint arXiv:1905.13214, 2019.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
 Sandler et al. (2018) Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
 Schwartz et al. (2019) Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green ai, 2019.
 Sciuto et al. (2019) Christian Sciuto, Kaicheng Yu, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. Evaluating the search phase of neural architecture search. arXiv preprint arXiv:1902.08142, 2019.
 Smith et al. (2017) Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
 Spall et al. (1992) James C Spall et al. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE transactions on automatic control, 37(3):332–341, 1992.
 Strubell et al. (2019) Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019. doi: 10.18653/v1/p19-1355. URL http://dx.doi.org/10.18653/v1/p19-1355.
 Tan and Le (2019) Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
 Tan et al. (2019) Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2820–2828, 2019.
 Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pages 2074–2082, 2016.
 Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
 Wu et al. (2019) Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.
 Xie et al. (2019) Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1284–1293, 2019.
 Xie et al. (2018) Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural architecture search. arXiv preprint arXiv:1812.09926, 2018.
 Xu et al. (2020) Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai Xiong. PC-DARTS: Partial channel connections for memory-efficient differentiable architecture search. International Conference on Learning Representations, 2020.
 Ying et al. (2019) Chris Ying, Aaron Klein, Esteban Real, Eric Christiansen, Kevin Murphy, and Frank Hutter. NAS-Bench-101: Towards reproducible neural architecture search. arXiv preprint arXiv:1902.09635, 2019.
 Zoph and Le (2016) Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
Appendix
Hyperparameters and Training Details
We use the following training setup for OneShot, FiGS-OneShot, FBNet, and FiGS-FBNet:

Batch size 512 and smooth exponential learning rate decay initialized to 1.6 and decayed by 0.5 every 35 epochs.

Moving average decay rate of 0.9997 for BatchNorm eval statistics and eval weights.

ADAM optimizer with default hyperparameters (β₁ = 0.9, β₂ = 0.999).

Weight decay with coefficient .

Standard ResNet data augmentation (He et al. (2016)): random crop, flip, color adjustment.
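The learning-rate schedule and moving averages above can be written out explicitly. A minimal sketch (the function names are ours, not taken from any released code):

```python
def smooth_exp_decay(epoch, init_lr=1.6, decay_rate=0.5, decay_epochs=35.0):
    """Smooth exponential decay: lr = init_lr * decay_rate ** (epoch / decay_epochs).

    With the defaults above, the learning rate starts at 1.6 and is
    halved every 35 epochs, decaying continuously rather than in steps.
    """
    return init_lr * decay_rate ** (epoch / decay_epochs)


def ema_update(ema_value, new_value, decay=0.9997):
    """One exponential-moving-average step, as used for BatchNorm eval
    statistics and eval weights (decay rate 0.9997)."""
    return decay * ema_value + (1.0 - decay) * new_value
```

In TensorFlow these correspond to a continuous (non-staircase) exponential decay schedule plus `tf.train.ExponentialMovingAverage`.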
We use the same setup for our ResNet results in Sec. 4.3, except that we set the learning rate schedule closer to that of He et al. (2016): initialized to 0.64 and smoothly decayed by 0.2 every 30 epochs.
We use the above training setup for both search and retraining, with the exception that we retrain until convergence. To accelerate retraining, we double the batch size and learning rate (to 1024 and 3.2, respectively), as per Smith et al. (2017). This does not improve the accuracy of our models. We do not tune the hyperparameters of our learned architectures.
We provide regularization strengths () for OneShot in Table 1. Regularization strengths for FBNet-{B, C} are () respectively.
To find an appropriate order of magnitude for , we log-scale searched (once) for on OneShot. We found that setting produced indistinguishable results, and fixed for all experiments.
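A log-scale search of this kind simply sweeps the regularization strength over powers of ten. A minimal sketch, where the candidate exponent range is our assumption since the exact values searched are not listed in the source:

```python
def log_scale_candidates(low_exp=-9, high_exp=-3):
    """Candidate regularization strengths spaced one order of magnitude apart.

    The exponent range here is illustrative; only the order-of-magnitude
    spacing reflects the search described in the text.
    """
    return [10.0 ** k for k in range(low_exp, high_exp + 1)]
```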
Recent works (Lindauer and Hutter (2019); Mei et al. (2020)) call attention to the special techniques commonly used in NAS papers. To be explicit, we do not use any of the following techniques when training our models:

Squeeze-Excite layers.

Swish activation.

CutOut, MixUp, AutoAugment, or any other augmentation not explicitly listed in our training setup.

Dropout, DropBlock, ScheduledDropPath, Shake-Shake, or any other regularization not explicitly listed in our training setup above.
Without these techniques, we are able to outperform state-of-the-art architectures like EfficientNet-B0, which do use some of them. Given the results of Mei et al. (2020), we are optimistic that applying techniques like Squeeze-Excite, Swish, and AutoAugment can further increase the Pareto-efficiency of our networks, but that is outside the scope of this work.
All experiments (including the OneShot, FBNet, and MorphNet baselines) were run on the same hardware (a 32-core Cloud TPU) using TensorFlow.
Search Space Details
For FiGS-OneShot, we use the same search space instantiation presented in Bender et al. (2018) (Sec. 3.4) for ImageNet — 8 cells, 4 blocks per cell, separable convolutions, and downsampling with stride-2 average pooling every 2 cells. We use a base width of 64 filters. We verify our search space implementation by reproducing their “All On” results in Table 1. To assist with fine-grained search, we make one modification, as mentioned in Sec. 3.3: we combine operator outputs by concatenating them and passing the result through a 1x1 convolution (instead of adding them) to decouple their output dimensions. The extra 1x1 convolution does not in and of itself increase the accuracy of the supernetwork or of the learned architectures. As shown in Fig. 4, the concat aggregator helps FiGS produce better architectures.
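To see why concatenation decouples output dimensions, note that addition requires every operator to emit the same channel count, while concatenation followed by a 1x1 convolution does not. An illustrative NumPy sketch, not the paper's implementation (for a dense spatial map, a 1x1 convolution reduces to a matrix multiply over the channel axis):

```python
import numpy as np

def combine_add(outputs):
    # Elementwise addition: every branch must emit the same channel count.
    return sum(outputs)

def combine_concat_1x1(outputs, proj):
    # Concatenate along channels, then project with a 1x1 convolution
    # (a matrix multiply over the channel axis). Branch widths C_i may
    # differ freely; only sum(C_i) must match proj's input dimension.
    x = np.concatenate(outputs, axis=-1)  # shape [H, W, sum(C_i)]
    return x @ proj                       # shape [H, W, C_out]

# Two branches with different widths (3 and 5 channels) over a 2x2 map.
a = np.ones((2, 2, 3))
b = np.ones((2, 2, 5))
proj = np.ones((8, 4))                    # 1x1 kernel: [C_in, C_out]
out = combine_concat_1x1([a, b], proj)    # shape (2, 2, 4)
```

Combining `a` and `b` with `combine_add` would fail because their channel counts differ, which is exactly the coupling the concat aggregator removes.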
For FiGS-FBNet, we do not include group convolutions in our set of operators, so we compare only against FBNet-{B, C}, which also do not include group convolutions.
Miscellany
The multiple points for EfficientNet-B0 in Fig. 2 were generated by applying a uniform width multiplier.
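A uniform width multiplier scales every layer's channel count by the same factor; public MobileNet/EfficientNet implementations additionally round the result to a multiple of 8. A sketch of that convention (our reconstruction, not code from this paper):

```python
def apply_width_multiplier(filters, multiplier, divisor=8):
    """Scale a layer's channel count and round to a multiple of `divisor`.

    Follows the rounding convention of the public MobileNet/EfficientNet
    code: round to the nearest multiple of `divisor`, never dropping more
    than 10% below the unrounded target.
    """
    scaled = filters * multiplier
    rounded = max(divisor, int(scaled + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * scaled:
        rounded += divisor
    return rounded
```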
Model | Top-1 Acc. | #Params | Ratio-to-Ours

A () | 69.9 ± 0.1% | 1.3 ± 0.02M | 1.0x
OneShot Small (Bender et al. (2018)) | 67.9% | 1.4M | 1.1x
MnasNet-Small (Tan et al. (2019)) | 64.9% | 1.9M | 1.5x
MobileNetV3-Small 0.75 (Howard et al. (2019)) | 65.4% | 2.4M | 1.8x

B () | 75.0 ± 0.5% | 2.7 ± 0.06M | 1.0x
MobileNetV2 0.75x (Sandler et al. (2018)) | 69.8% | 2.6M | 1.0x
MobileNetV3-Small 1.0 (Howard et al. (2019)) | 67.4% | 2.9M | 1.1x
OneShot Small (Bender et al. (2018)) | 72.4% | 3.0M | 1.2x
MobileNetV2 1.0x (Sandler et al. (2018)) | 72.0% | 3.4M | 1.3x
MnasNet-65 (Tan et al. (2019)) | 73.0% | 3.6M | 1.4x
AtomNAS-A (Mei et al. (2020)) | 74.6% | 3.9M | 1.5x
MobileNetV3-Large 0.75 (Howard et al. (2019)) | 73.3% | 4.0M | 1.5x
FBNet-A (Wu et al. (2019)) | 73.0% | 4.3M | 1.8x

C () | 77.1 ± 0.03% | 4.4 ± 0.04M | 1.0x
AtomNAS-B (Mei et al. (2020)) | 75.5% | 4.4M | 1.0x
FBNet-B (Wu et al. (2019)) | 74.1% | 4.5M | 1.0x
MnasNet-A2 (Tan et al. (2019)) | 75.6% | 4.8M | 1.1x
OneShot Small (Bender et al. (2018)) | 74.2% | 5.1M | 1.2x
MobileNetV2 1.3x (Sandler et al. (2018)) | 74.4% | 5.3M | 1.2x
PC-DARTS (Xu et al. (2020)) | 75.8% | 5.3M | 1.2x
EfficientNet-B0 (Tan and Le (2019)) | 76.3% | 5.3M | 1.2x
MobileNetV3-Large 1.0 (Howard et al. (2019)) | 75.2% | 5.4M | 1.2x
FBNet-C (Wu et al. (2019)) | 74.9% | 5.5M | 1.3x