Fine-Grained Stochastic Architecture Search

by Shraman Ray Chaudhuri, et al.

State-of-the-art deep networks are often too large to deploy on mobile devices and embedded systems. Mobile neural architecture search (NAS) methods automate the design of small models, but state-of-the-art NAS methods are expensive to run. Differentiable neural architecture search (DNAS) methods reduce the search cost but explore a limited subspace of candidate architectures. In this paper, we introduce Fine-Grained Stochastic Architecture Search (FiGS), a differentiable search method that searches over a much larger set of candidate architectures. FiGS simultaneously selects and modifies operators in the search space by applying a structured sparse regularization penalty based on the Logistic-Sigmoid distribution. We show results across 3 existing search spaces, matching or outperforming the original search algorithms and producing state-of-the-art parameter-efficient models on ImageNet (e.g., 75.4% top-1). Using our models as backbones for object detection with SSDLite, we achieve significantly higher mAP on COCO (e.g., 25.8 with 3.0M params) than MobileNetV3 and MnasNet.






1 Introduction

Machine learning researchers have invested much effort over the last decades into feature engineering, the process of hand-crafting features for machine learning algorithms. With the proliferation of deep learning, this process has been replaced by the manual design of larger and more complex models.

Model design requires domain expertise and many rounds of trial-and-error. Neural architecture search (NAS) (Zoph and Le (2016)) automates this process using RL; however, searching for a new architecture can require thousands of GPU hours. Due to the cost of prevailing NAS methods, most techniques search for an architecture over a small proxy dataset and release the discovered architecture as their contribution. This is suboptimal—neither the proxy dataset nor the resource constraints targeted during the search could possibly address all downstream uses of this architecture.

Differentiable NAS (DNAS) (Cai et al. (2018b); Liu et al. (2018)) methods aim to alleviate this limitation by building a superset network (super-network or search space) and searching for useful sub-networks using gradient descent. These super-networks are typically composed of densely connected building blocks with multiple parallel operations. The goal of the search method is to prune connections and operations, trading representational capacity for efficiency, to fit a certain computation budget. DNAS methods can be viewed as pruning methods with the subtle difference that they are applied on manually designed super-networks with redundant components while pruning methods (LeCun et al. (1990); Han et al. (2015)) are usually applied on standard models.

The canonical approach to DNAS is to select an operator from a fixed set of operators by gating their outputs and treating them as unmodifiable (black-box) units. In this sense, DNAS has inherited some of the limitations of RL methods since they cannot dynamically change the units during optimization. For instance, to learn the width of each layer, DNAS and RL methods typically enumerate a set of fixed-width operators, generating independent outputs for each. This is not only computationally expensive but also a coarse way of exploring sub-networks.

We propose FiGS (Fine-Grained Stochastic Architecture Search), a search method inspired by structured pruning. For each output feature of a layer (i.e., output channel in a convolutional layer, or neuron in a dense layer), we assign a Bernoulli random variable (mask) indicating whether that feature should be used. We use the Logistic-Sigmoid distribution (Maddison et al. (2016)) to relax the binary constraint and learn the masking probabilities using gradient descent, optimizing for various resource constraints such as model size or FLOPs. We export an architecture, defined as a mapping from layers to numbers of neurons, by sampling a mask at the end of training.

FiGS can be applied to any search space by simply inserting masks after each layer. Our method is fine-grained in that we search over a larger space of architectures than ordinary DNAS methods by applying masks on operator outputs as well as on the intermediate layers that compose each operator. FiGS can thus simultaneously select a subset of operators and modify them.

In some sense, DNAS has shifted the problem of architecture design to search space design. Many DNAS works target a single metric on a single, manually-designed search space; however, each search space may come with its own merits. This coupling between search space and algorithm makes it hard to (1) compare different search algorithms, and (2) understand the biases inherent to different search spaces (Sciuto et al. (2019); Radosavovic et al. (2019)). NAS-bench-101 (Ying et al. (2019)) addresses the former, providing a large set of architectures trained on CIFAR to evaluate RL-based NAS algorithms. On the other hand, our method can be used to study the latter. Since our method can easily be injected into any DNAS search space, we can characterize their bias toward certain metrics. We find that some produce models that are more Pareto-efficient for model size while others are more Pareto-efficient for FLOPs/latency.

When applied to well-known search spaces (Bender et al. (2018); Wu et al. (2019)), FiGS matches or outperforms the original search method. When applied on the One-Shot search space, FiGS achieves state-of-the-art small-model accuracy on ImageNet (by a 2-5% margin). When using ImageNet-learned architectures as backbones for detection, FiGS achieves +4 mAP over mobile baselines on COCO. When applied to commonly used ResNet models, FiGS outperforms pruning baselines.

2 Related Work

Neural architecture search (NAS) automates the design of neural net models with machine learning. Early approaches (Zoph and Le (2016); Baker et al. (2016)

) train a controller to build the network with reinforcement learning (RL). These methods require training and evaluating thousands of candidate models and are prohibitively expensive for most applications. Weight sharing methods (

Brock et al. (2017); Pham et al. (2018); Cai et al. (2018a)) amortize the cost of evaluation; however, (Sciuto et al. (2019)

) suggest that these amortized evaluations are noisy estimates of actual performance.

Of growing interest are mobile NAS methods which produce smaller architectures that fit certain computational budgets or are optimized for certain hardware platforms. MnasNet (Tan et al. (2019)) is an RL-based method that optimizes directly for specific metrics (e.g., mobile latency) but takes several thousand GPU-hours to search. One-shot and differentiable neural architecture search (DNAS) (Bender et al. (2018); Liu et al. (2018)) methods cast the problem as finding optimal subnetworks in a continuous relaxation of NAS search spaces which is optimized using gradient descent.

Our work is most closely related to probabilistic DNAS methods which learn stochastic gating variables to select operators in these search spaces. (Cai et al. (2018b)) use hard (binary) gates and a straight-through estimation of the gradient, whereas (Xie et al. (2018); Wu et al. (2019); Dong and Yang (2019)) use soft (non-binary) gates sampled from the Gumbel-Softmax distribution (Jang et al. (2016); Maddison et al. (2016)) to relax the discrete choice over operators. In contrast, our method performs a fine-grained search over the set and composition of operators. Some methods learn a single cell structure that is repeated throughout the network (Dong and Yang (2019); Xie et al. (2018); Bender et al. (2018)) whereas our method learns cell structures independently.

Our work draws inspiration from structured pruning methods (Luo et al. (2017); Liu et al. (2017); Wen et al. (2016)). MorphNet (Gordon et al. (2018)) adjusts the number of neurons in each layer with an L1 penalty on BatchNorm scale coefficients, treating them as gates. (Louizos et al. (2017)) propose a method to induce exact sparsity for one-round compression. Recent work by (Mei et al. (2020)) independently proposes fine-grained search with an L1 penalty. In contrast, we propose a stochastic method that samples sparse architectures throughout the search process.

Recent analytical works highlight the importance of search space design. Of particular relevance is the study in (Radosavovic et al. (2019)) which finds that randomly sampled architectures from certain spaces (e.g., DARTS (Liu et al. (2018))) are superior when normalizing for certain measures of complexity. (Sciuto et al. (2019)) find that randomly sampling the search space produces architectures on par with both controller-based and DNAS methods. (Xie et al. (2019)) suggest that the wiring of search spaces plays a critical role in the performance of sub-networks. The success of NAS methods, therefore, can be attributed in no small part to search space design.

3 Method

Figure 1: Search Spaces and Operator Selection with FiGS. (a) and (b): the One-Shot and FBNet search spaces (resp.). “DW” denotes depthwise convolution. Orange edges in (b) indicate tensors that must have the same shape due to the additive skip connection. (c): To control the set of operators, one only needs to insert masking layers (blue) after them. Numbers next to edges indicate the number of non-zero channels in the mask. We deselect operators by sampling a zero mask. Note that the additive aggregator (top) forces the output shapes of each operator to match whereas the concat aggregator (bottom) allows arbitrary shapes and selecting a subset.

We search for architectures that minimize both a task loss L and a computational cost C. Our approach is akin to stochastic differentiable search methods such as (Xie et al. (2018)) which formulate the architecture search problem as sampling subnetworks in a large supernetwork composed of redundant components (operators). While efficient, these methods add restrictive priors to the learned architectures: (1) the operators (e.g., a set of depthwise-separable convolutions with various widths and kernel sizes) are hand-designed and the search algorithm cannot modify them, and (2) the search algorithm is limited to selecting a single operator for each layer in the network.

FiGS relaxes both constraints by (1) modifying operators during the search process, and (2) allowing more than one operator per layer. To concretely illustrate the benefit of (1), we focus on the width (i.e., number of filters or neurons) of each convolution. To modify the width during the search process, FiGS learns a sampling distribution over individual neurons in the supernetwork instead of a distribution over operators. As a result, the operators in the learned architectures can have fine-grained, variable widths which are not limited to a pre-defined set of values.

FiGS progresses in two phases: an architecture learning (AL) phase, where we output an optimized architecture by minimizing both L and C, followed by a retraining phase with L alone. Our loss for AL is similar to sparsity-inducing regularizers (Gordon et al. (2018)). Sec. 3.1 describes our stochastic relaxation of C and our sampling method, Sec. 3.2 describes the masking layer and regularization penalty, and Sec. 3.3 describes our recipe for fine-grained search on existing spaces.

3.1 Inducing Sparsity with the Logistic-Sigmoid Distribution

Let W be the weights of the network. We assume computational costs of the form C(1[W ≠ 0]), i.e., a function of the set of indicators for whether each weight is nonzero. FLOPs, size, and latency can be expressed exactly or well-approximated in this form. The objective is then:

    min_W  L(W) + λ · C(1[W ≠ 0])

We refer to λ as the regularization strength. Since the indicator 1[w ≠ 0] has zero gradient wherever w ≠ 0, we cannot minimize C with gradient descent. Instead, we formulate the problem as a stochastic estimation task. Let m denote a binary mask to be applied on W, where the m_i ~ Bernoulli(π_i) are independent Bernoulli variables. We minimize the usage of w_i by minimizing the probability π_i that the mask is 1, so we can safely prune w_i. Our sampled architectures are defined by m. By substituting 1[W ≠ 0] with m, our objective becomes:

    min_{W, π}  E_m [ L(W ⊙ m) + λ · C(m) ]

where ⊙ denotes element-wise product. Unless otherwise specified, all expectations are taken w.r.t. m and we drop the subscript for brevity. We can estimate the gradient w.r.t. π with black-box methods, e.g., perturbation methods (Spall et al. (1992)) or the log-derivative trick (Williams (1992)); however, these estimators generally suffer from high variance. Instead, we relax m with a continuous sample from the Logistic-Sigmoid distribution:

    m_i = σ( (log α_i + l_i) / τ ),   l_i = log u_i − log(1 − u_i),   u_i ~ Uniform(0, 1)

where σ is the sigmoid function and τ > 0 is a temperature. The Logistic-Sigmoid distribution is the binary case of the Gumbel-Softmax (a.k.a. Concrete) distribution (Jang et al. (2016); Maddison et al. (2016)). As τ → 0, m_i approaches {0, 1} with probability {1 − π_i, π_i} respectively, where π_i = σ(log α_i). Factoring out the logistic noise as a parameter-free component allows us to back-propagate through the mask and learn log α with gradient descent. The resulting gradient estimator has lower variance than black-box methods (Maddison et al. (2016)). We optimize log α in practice for numerical stability.

Our stochastic relaxation allows us to better model the sparsity of the learned architectures during the search phase than deterministic relaxations. To illustrate, consider the common deterministic approach of relaxing the L0 penalty with an L1 norm on the weights (Wen et al. (2016); Gordon et al. (2018); Mei et al. (2020)). In this case, the weights can be far from 0 during training, which can be problematic if the network relies on the information encoded in these pseudo-sparse weights to make accurate predictions. Instead, we want to simulate real sparsity during AL. Other deterministic methods apply a saturating nonlinearity (e.g., sigmoid or softmax) to force values close to {0, 1} (Liu et al. (2018)). However, these functions suffer from vanishing gradients at the extrema: once a weight is close to zero, it remains close to zero. This limits the number of sparse networks explored during AL. In contrast, our sampled mask is close to {0, 1} at all times at low τ, which forces the network to adapt to sparse activations, and the mask can be non-zero even as π approaches 0, which allows the network to visit diverse sparse states during AL.

3.2 Group Masking and Regularization

As neurons are pruned during the search process, we can prune downstream dependencies as well. We group each neuron and its downstream dependencies by sharing a single mask across all their elements. To illustrate, consider the weight matrices A and B of two 1x1 convolutions, where the output of A is fed into B. If output neuron i of A is pruned (column i of A is zero), then row i of B can be pruned, and vice versa. Therefore, all elements in column i of A and row i of B share a scalar mask m_i.

This row-column grouping can be implemented conveniently by applying a separate mask on each channel of the activations produced by each convolution and fully-connected layer. This allows us to encapsulate all architecture learning meta-variables (the mask logits) in a drop-in layer which can easily be inserted in the search space.
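The row-column grouping can be checked with a small numpy sketch (matrices stand in for 1x1 convolutions; all names and shapes are illustrative): sharing one scalar mask per neuron means a pruned neuron removes a column of A and a row of B without changing the output.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, n, d_out = 5, 4, 3
A = rng.normal(size=(d_in, n))   # first 1x1 conv (as a matrix)
B = rng.normal(size=(n, d_out))  # second 1x1 conv, consumes A's output
x = rng.normal(size=(1, d_in))

m = np.array([1.0, 0.0, 1.0, 1.0])  # shared per-neuron mask; neuron 1 pruned

# Mask applied per channel on A's output activations.
y = ((x @ A) * m) @ B

# Because m[1] == 0, both column 1 of A and row 1 of B are dead weight:
A_pruned = A[:, m > 0]
B_pruned = B[m > 0, :]
y_pruned = (x @ A_pruned) @ B_pruned
assert np.allclose(y, y_pruned)  # identical output with the group removed
```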

Let C_A and C_B be the contributions of A and B to the total cost C. To encourage sparsity, we can either regularize the mask m or the distribution parameters π. Since E[m_i] = π_i, the former penalizes the cost of sampled architectures while the latter penalizes the expected cost. In our example above, with n shared neurons between A (input dimension d_in) and B (output dimension d_out), the sampled and expected costs (in number of parameters) are:

    C_sampled = (d_in + d_out) Σ_i m_i,    C_expected = (d_in + d_out) Σ_i π_i
Note however that d_in and d_out are themselves dynamic quantities: as inputs to A and outputs of B are masked out by adjacent masking layers, C_A and C_B decrease as well. To capture this dynamic behavior, we apply our differentiable relaxation from Sec. 3.1 again to approximate d_in and d_out. Let m_in and m_out be per-channel masks on inputs to A and outputs of B, with parameters π_in and π_out. The sampled and expected costs are then:

    C_sampled = (Σ_j m_in,j)(Σ_i m_i) + (Σ_i m_i)(Σ_k m_out,k)
    C_expected = (Σ_j π_in,j)(Σ_i π_i) + (Σ_i π_i)(Σ_k π_out,k)

We observe that minimizing the expected cost is more stable than minimizing the sampled cost. We use the expected cost for our results in Sec. 4, and scale the penalty appropriately for different costs such as FLOPs.

After AL, we export a single architecture, defined as a mapping from each convolution layer to its expected number of neurons under the learned distribution parameters π. In our example above, convolution A would have Σ_i π_i neurons (rounded to the nearest integer) in the exported architecture.
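A small numpy sketch of the expected-cost computation for two chained 1x1 convolutions and the exported width; the logit values, shapes, and the rounding rule for export are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Learned mask logits for: inputs to A, shared A/B neurons, outputs of B.
log_alpha_in  = np.array([2.0, 2.0, -3.0])        # 3 input channels
log_alpha_mid = np.array([1.0, -2.0, -2.0, 3.0])  # 4 shared neurons
log_alpha_out = np.array([2.5, 2.5])              # 2 output channels

pi_in, pi_mid, pi_out = (sigmoid(a) for a in
                         (log_alpha_in, log_alpha_mid, log_alpha_out))

# Expected #params of two chained 1x1 convs A (in x mid) and B (mid x out):
# each surviving neuron costs (expected fan-in + expected fan-out) weights.
expected_cost = pi_in.sum() * pi_mid.sum() + pi_mid.sum() * pi_out.sum()

# Exported width for A's output: expected number of surviving neurons.
exported_width = int(round(pi_mid.sum()))
```

Because `expected_cost` is a smooth function of the logits, it can be added to the task loss (scaled by λ) and minimized with gradient descent.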

3.3 Fine-Grained Search

To apply our method to DNAS search spaces, we simply insert masking layers after convolution layers as illustrated in Fig. 1. We run our search algorithm on the One-Shot and FBNet search spaces. The One-Shot search space is composed of a series of cells which are in turn composed of densely connected blocks. Each block consists of several operators, each of which applies a series of convolutions on the blocks’ inputs. Similarly, the FBNet search space is composed of stages which are in turn composed of blocks. The outputs of the operators are added together. We refer the reader to (Bender et al. (2018); Wu et al. (2019)) for more details.

DNAS methods generally gate the operator outputs directly to select a single operator with, e.g., a softmax layer. In contrast, our architectures can have more than one operator per layer. Operators are removed from the network by learning to sample all-zero masks on the operator’s output or on the output of any of its intermediate activations. This process is illustrated in Fig. 1(c) – note that FiGS can select anywhere between 0 and all operators in each block. Since our method simultaneously optimizes the set of operators and their widths, the space of possible architectures we search is an exponentially larger superset of the original search space.
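A minimal numpy sketch of operator deselection via all-zero masks (linear maps stand in for operators; shapes and mask values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))

# Three candidate operators in a block, here simple linear maps.
ops = [rng.normal(size=(8, 4)) for _ in range(3)]

# Per-channel masks on each operator's output; an all-zero mask
# deselects the operator entirely.
masks = [np.ones(4), np.zeros(4), np.ones(4)]

# Additive aggregation of the masked operator outputs.
y = sum((x @ W) * m for W, m in zip(ops, masks))

# Operator 1 is deselected: the block behaves as if it only
# contained operators 0 and 2.
y_without_op1 = (x @ ops[0]) * masks[0] + (x @ ops[2]) * masks[2]
assert np.allclose(y, y_without_op1)
```

Unlike a softmax over operators, any subset of masks (from none to all) can survive, which is what makes the searched space a superset of the original one.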

FiGS matches the performance of the original search algorithms when applied to the original One-Shot and FBNet search spaces with no modifications. However, these search spaces are designed for coarse-grained search (operator-level sampling). We propose two minor modifications to the search spaces to take full advantage of fine-grained search. Importantly, these modifications do not improve the accuracy of the architectures in and of themselves; they only give more flexibility to fine-grained search and reduce the runtime of the search phase.

Concat Aggregator. Adding operator outputs forces all output dimensions to match during AL. This restricts fine-grained search in that each operator must have the same output shape. Instead, we can concatenate the outputs and pass them through a 1x1 convolution (concat aggregator), which is a generalization of the additive aggregator. The benefits are two-fold: (1) operator outputs can have variable sizes, and (2) FiGS can learn a better mixing formula for operator outputs. In practice, we observe that the concat aggregator works better on the One-Shot search space when targeting model size and the additive aggregator works better on FBNet when targeting FLOPs.
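The claim that the concat aggregator generalizes the additive one can be checked with a small numpy sketch (matrices stand in for 1x1 convolutions; shapes are illustrative): with stacked-identity mixing weights, concat-then-mix reproduces addition exactly, while also admitting operators of different widths.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 6                       # channels per operator output
y1 = rng.normal(size=(1, c))
y2 = rng.normal(size=(1, c))

# Additive aggregator: requires matching shapes.
y_add = y1 + y2

# Concat aggregator: concatenate, then mix with a 1x1 conv (a matrix here).
W = np.vstack([np.eye(c), np.eye(c)])  # (2c, c); stacked identities recover the sum
y_concat = np.concatenate([y1, y2], axis=1) @ W
assert np.allclose(y_add, y_concat)

# Unlike addition, concat also accepts operators of different widths:
y3 = rng.normal(size=(1, 2))      # a narrower operator output
W2 = rng.normal(size=(c + 2, c))  # learned mixing weights
y_mixed = np.concatenate([y1, y3], axis=1) @ W2
```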

Removing Redundant Operators.

To explore various architectural hyperparameter choices, coarse-grained NAS methods enumerate a discrete set of options. For instance, learning whether a convolution in a given block should have 16 or 32 filters requires including two separate weight tensors in the set of options. Not only is this computationally inefficient – scaling both latency and memory with each additional operator – but the enumeration may not be granular enough to contain the optimal size. In contrast, fine-grained search can shrink the 32-filter convolution to be functionally equivalent to the 16-filter convolution; therefore, we only need to include the former. In practice, this results in a 3x reduction in the number of operators in the FBNet search space and a 2.5x reduction in search runtime, with no loss of quality in the learned architectures.
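A minimal numpy sketch of this equivalence (linear maps stand in for convolutions; shapes are illustrative): masking off half the channels of a 32-filter operator yields exactly what the enumerated 16-filter operator would compute.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))

W32 = rng.normal(size=(8, 32))  # the "32-filter" conv as a matrix
W16 = W32[:, :16]               # the enumerated "16-filter" alternative

# Fine-grained search can shrink the 32-filter operator by masking
# channels; with the last 16 channels masked off, it computes exactly
# what the 16-filter operator would.
m = np.concatenate([np.ones(16), np.zeros(16)])
y_masked = (x @ W32) * m
assert np.allclose(y_masked[:, :16], x @ W16)
assert np.allclose(y_masked[:, 16:], 0.0)
```

This is why only the widest variant of each operator needs to be instantiated in the supernetwork.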

4 Results

We use TensorFlow (Abadi et al. (2016)) for all our experiments. Our algorithm takes hours to search and hours to retrain on ImageNet using a single 4x4 (32-core) Cloud TPU.

4.1 FiGS on One-Shot Search Space

Figure 2: Left: Model size vs. accuracy for state-of-the-art mobile architectures on ImageNet. The architectures learned by FiGS produce SOTA results. Right: FiGS vs. random search. We sample 30 architectures (yellow) from the One-Shot space with random subsets of operators and random width multipliers. The right-most red point is the supernetwork. FiGS outperforms random search.

In this section, we evaluate our search algorithm on the One-Shot search space (Bender et al. (2018)) to find efficient architectures for ImageNet classification (Russakovsky et al. (2015)). To compare against their results, we target model size. We use the same search space instantiation — 8 cells, 4 blocks per cell, separable convolutions, and downsampling every 2 cells. We merge the outputs of each path with a concat aggregator. Despite increasing the number of parameters, the concat aggregator does not increase the base accuracy of the supernetwork. The search space is illustrated in Fig. 1(a) — note that we apply masks on operator outputs as well as on individual convolutions that compose the operator. Our reproduction of their supernetwork matches their published results.

The mask-logits variables are initialized to (). We fix τ without annealing and use our relaxation of m for both forward and backward passes. To learn architectures of different sizes, we vary the regularization coefficient λ. For AL, we train for 100 epochs using ADAM with batch size 512 and learning rate 1.6, decayed by 0.5 every 35 epochs. For retraining, we use the same settings, except we train until convergence (400 epochs) and double the batch size and learning rate (1024 and 3.2, resp.) for faster convergence (Smith et al. (2017)).

Model  Top-1 Acc.  #Params
FiGS-A ()  69.9 ± 0.1%  1.3 ± 0.02M
One-Shot-Small (Bender et al. (2018))  67.9%  1.4M
MnasNet-Small (Tan et al. (2019))  64.9%  1.9M
MobileNetV3-Small-0.75 (Howard et al. (2019))  65.4%  2.4M
FiGS-B ()  75.0 ± 0.5%  2.7 ± 0.06M
MobileNetV3-Small-1.0 (Howard et al. (2019))  67.4%  2.9M
One-Shot-Small (Bender et al. (2018))  72.4%  3.0M
MnasNet-65 (Tan et al. (2019))  73.0%  3.6M
FBNet-A (Wu et al. (2019))  73.0%  4.3M
FiGS-C ()  77.1 ± 0.03%  4.4 ± 0.04M
AtomNAS-B (Mei et al. (2020))  75.5%  4.4M
One-Shot-Small (Bender et al. (2018))  74.2%  5.1M
EfficientNet-B0 (Tan and Le (2019))  76.3%  5.3M
MobileNetV3-Large-1.0 (Howard et al. (2019))  75.2%  5.4M
Table 1: Comparison with modern mobile classification architectures and DNAS methods on ImageNet. FiGS produces the smallest and most accurate models in each category, and significantly outperforms the One-Shot baseline. Error bars were computed by running FiGS 6 times with the same λ, exporting a single architecture after each run as described in Sec. 3.2, and retraining from scratch.

Fig. 2 shows the performance of FiGS against the One-Shot algorithm and random search. Table 1 shows our results compared with other mobile classification models. Our search algorithm outperforms both random search and the One-Shot search algorithm, and achieves state-of-the-art top-1 accuracy in the mobile regime across several mobile NAS baselines, outperforming EfficientNet-B0 and MobileNetV3. Our search time is comparable with other DNAS methods and significantly faster than MnasNet, which supplies the base network for MobileNetV3 and EfficientNet. Note that our full search space has 78.5% top-1 accuracy on ImageNet, which is an upper bound on the performance of our sub-networks. Although this upper bound is well below state-of-the-art ImageNet accuracy, we are still able to produce state-of-the-art small models.

Model  Acc.  MAdds
FiGS-B  72.2  295M
FBNet-B  72.3  295M
FiGS-C  73.5  385M
FBNet-C  73.3  385M

Target  Search Space  Acc.  Params  MAdds
Params  One-Shot  77.1  4.4M  –
Params  FBNet  74.7  5.2M  –
FLOPs  One-Shot  68.0  –  400M
FLOPs  FBNet  74.0  –  400M
Table 2: Left: FiGS on the FBNet search space. Using our drop-in sampling layer, we are able to effectively search the FBNet space and match the performance of models found by (Wu et al. (2019)). Right: Search spaces may have intrinsic biases from their manual construction. We find that models produced from the One-Shot space are parameter-efficient while those from the FBNet space are FLOPs-efficient.

4.2 Comparing Search Spaces

We investigate whether certain search spaces are suited for particular computational costs and provide evidence in favor. A rigorous study would require enumerating and evaluating all searchable subnetworks on each space, which is infeasible. Instead, (Sciuto et al. (2019); Radosavovic et al. (2019)) study the efficiency of search spaces by randomly sampling architectures. This analysis is useful in determining the inherent advantages of each search space independently of the search algorithm being used. However, search algorithms may be biased toward particular sub-spaces of architectures based on the specific cost targeted during search (Gordon et al. (2018)) and uniform sampling may not capture this bias. Therefore, in addition to random sampling, it may be useful to compare search spaces via the performance of a search algorithm under different cost objectives.

We investigate with FBNet (Wu et al. (2019)) since its construction significantly differs from the One-Shot search space and similar constructions are used in other works (Howard et al. (2019); Mei et al. (2020)). We use FLOPs as a second metric of interest. To make a meaningful comparison between search spaces, we first verify that FiGS matches the performance of FBNet search, as shown in Table 2 (left). (FBNet-{B, C} and FiGS-{B, C} were evaluated using our re-implementation of their training code.) We then run FiGS with both FLOPs and size costs on the One-Shot and FBNet search spaces, as shown in Table 2 (right). FiGS finds more parameter-efficient networks in the One-Shot search space and more FLOPs-efficient networks in the FBNet search space, by significant margins.

Figure 3: FiGS vs. architecture compression baselines on ResNet-{50, 152}. FiGS outperforms width multiplier and performs on par with MorphNet.

4.3 FiGS on ResNet Search Space

We compare the performance of FiGS with (1) width multiplier, a commonly used compression heuristic that uniformly scales down the number of filters in each layer (Howard et al. (2017)), and (2) MorphNet, a deterministic model compression technique which uses L1 regularization to induce sparsity (Gordon et al. (2018)). We use MorphNet as a baseline since it can target various computational costs and (Mei et al. (2020)) use a similar technique.

Fig. 3 shows our results on ResNet-50 and ResNet-152 on ImageNet. On both networks, FiGS outperforms width multiplier and performs on par with MorphNet.

4.4 Mobile Object Detection

In this section, we demonstrate the performance of our ImageNet-learned architectures as backbones for mobile object detection, using the SSDLite meta-architecture (Sandler et al. (2018)) designed for small models. We connect the output of cell 5 (stride 16) to the first layer of the feature extractor and the output of the final 1x1 before global pool (stride 32) to the second layer. We compare against MobileNets and MnasNet, which both use SSDLite.

Backbone  Params  mAP  mAP (S)  mAP (M)  mAP (L)
FiGS-Small  1.91M  19.1  2.6  15.4  37.5
MobileNetV3-Small  1.77M  14.9  0.7  5.6  28.0
FiGS-Large  3.02M  25.8  4.4  24.1  47.8
MobileNetV3-Large  3.22M  21.8  1.9  12.7  40.7
MnasNet-A1  4.90M  23.0  3.8  21.7  42.0
Table 3: Mobile object detection on COCO 2017 test-dev with SSDLite meta-architecture. FiGS-Large outperforms both MobileNetV3-Large and MnasNet-A1 with fewer params.

Our results are shown in Table 3. We achieve a +4 mAP margin over MobileNetV3. Note that instead of transferring ImageNet-learned architectures, we could also apply our search method to learn the backbone directly on the detection dataset, using differentiable relaxations of search spaces designed for detection such as (Chen et al. (2019)). This would likely produce more efficient architectures and is left as future work.

4.5 On Convergence and Reducing Runtime

Target Size  Epochs (AL)  Acc.
2M params  100  71.8%
2M params  40  71.5%
2M params  20  70.4%
5M params  100  76.4%
5M params  40  76.2%
5M params  20  75.8%
Table 4: Effects of regularization strength and search budget on final model accuracy. Early stopping allows a 2.5x-5x saving in search time with minimal accuracy drop.

We explore the limits of reducing the sample-complexity of the architecture learning phase. Given a target model size, we explore the trade-off between running AL for longer with a weak λ and converging faster with a strong λ. We demonstrate with two different target sizes (2M and 5M params). The results are shown in Table 4. In both cases, we can truncate AL to 40 epochs with negligible drop in accuracy, reducing the runtime of our search by 2.5x. Searching for only 20 epochs reduces model quality by 0.5-1% but results in a 2x speedup over the One-Shot method while still producing better models.

5 Conclusion

We present a fine-grained differentiable architecture search method which stochastically samples sub-networks and discovers well-performing models that minimize resource constraints such as memory or FLOPs. While most DNAS methods select from a fixed set of operations, our method modifies operators during optimization, thereby searching a much larger set of architectures. We demonstrate the effectiveness of our approach on two contemporary DNAS search spaces (Wu et al. (2019); Bender et al. (2018)) and produce SOTA small models on ImageNet. While most NAS works focus on FLOPs or latency, there is significant practical benefit for low-memory models in both server-side and on-device applications.

FiGS can be applied to any model or search space by inserting a mask-sampling layer after every convolution. Due to its small search cost, our method can learn efficient architectures for any task or dataset on-the-fly.

Broader Impact

Deep models have been doubling in size every few months since 2012 and have a large carbon footprint; see (Strubell et al. (2019); Schwartz et al. (2019)). Moreover, state-of-the-art models are often too large to deploy on low-resource devices, limiting their use to flagship mobile devices that are too expensive for most consumers. By automating the design of models that are lightweight and consume little energy, and doing so with an algorithm that is itself lightweight and adaptive to different constraints, our community can make sure that the fruits of ML/AI are shared more broadly with society, are not limited to the most affluent, and do not become a major contributor to climate change.


Shraman led the research, ran most of the experiments, and wrote most of the paper. Yair helped with writing and provided valuable feedback through code review.

Elad and Yair jointly proposed the idea to apply structured pruning for DNAS. Hanhan, Shraman, and Yair jointly proposed the idea to apply Gumbel-Softmax for fine-grained search. Yair developed the Logistic-Sigmoid regularizer, and the method matured through continuous discussion between Elad, Shraman, and Yair.

Max ran experiments on object detection and provided critical engineering help. Elad and Hanhan ran experiments on ResNet.



Hyperparameters and Training Details

We use the following training setup for One-Shot, FiGS-One-Shot, FBNet, and FiGS-FBNet:

  • Batch size 512 and smooth exponential learning rate decay initialized to 1.6 and decayed by 0.5 every 35 epochs.

  • Moving average decay rate of 0.9997 for BatchNorm eval statistics and eval weights.

  • ADAM optimizer with its default hyperparameters.

  • Weight decay with a fixed coefficient.

  • Standard ResNet data augmentation (He et al. (2016)): random crop, flip, color adjustment.

We use the same setup for our ResNet results in Sec. 4.3, except we set the LR schedule closer to that of (He et al. (2016)): initialized to 0.64 and smoothly decayed by 0.2 every 30 epochs.
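The schedule above can be written as a one-line function. This sketch (the name and signature are ours, not the paper's) reproduces both the default setup and the ResNet variant via its arguments.

```python
def smooth_exp_decay(epoch, init_lr=1.6, decay_rate=0.5, decay_epochs=35.0):
    """Smooth (non-staircase) exponential LR decay: the learning rate is
    multiplied by `decay_rate` once every `decay_epochs` epochs, and
    interpolated continuously in between."""
    return init_lr * decay_rate ** (epoch / decay_epochs)
```

With the defaults this gives 1.6 at epoch 0 and 0.8 at epoch 35; the ResNet variant is `smooth_exp_decay(epoch, init_lr=0.64, decay_rate=0.2, decay_epochs=30.0)`.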

We use the above training setup for both search and retraining, with the exception that we retrain until convergence. To accelerate retraining, we double the batch size and learning rate (to 1024 and 3.2, respectively), as per (Smith et al. (2017)). This does not improve the accuracy of our models. We do not tune hyperparameters of our learned architectures.
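The batch-size/LR doubling is a joint scaling in the spirit of Smith et al. (2017), where increasing the batch size plays the same role as decaying the learning rate. A trivial helper (hypothetical name, not from the paper) makes the rule explicit:

```python
def scale_for_retraining(batch_size, lr, factor=2):
    """Scale batch size and learning rate by the same factor so training
    dynamics stay comparable while each epoch takes fewer steps."""
    return batch_size * factor, lr * factor
```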

We provide regularization strengths for FiGS-One-Shot in Table 1. Regularization strengths for FiGS-FBNet-{B,C} are () respectively.

To find an appropriate order of magnitude, we performed a log-scale search (once) on FiGS-One-Shot. Settings within that range produced indistinguishable results, so we fixed a single value for all experiments.
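A log-scale search of this kind simply evaluates candidates spaced one order of magnitude apart. The sketch below shows the grid construction; the exponent range in the usage example is a placeholder, not the range used in the paper.

```python
def log_scale_grid(low_exp, high_exp):
    """Candidate hyperparameter values spaced one order of magnitude
    apart, i.e. 10**low_exp, 10**(low_exp+1), ..., 10**high_exp."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]
```

For example, `log_scale_grid(-3, -1)` yields candidates 1e-3, 1e-2, and 1e-1; each would be used for one search run and the resulting models compared.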

Recent works (Lindauer and Hutter (2019); Mei et al. (2020)) note that special techniques used in NAS papers can confound comparisons. To be explicit, we do not use any of the following techniques in training our models:

  • Squeeze-Excite layers.

  • Swish activation.

  • CutOut, MixUp, AutoAugment, or any other augmentation not explicitly listed in our training setup.

  • Dropout, DropBlock, ScheduledDropPath, Shake-Shake or any other regularization not explicitly listed in our training setup above.

Without these techniques, we are able to outperform state-of-the-art architectures such as EfficientNet-B0, which does use some of them. Given the results of (Mei et al. (2020)), we are optimistic that applying techniques like Squeeze-Excite, Swish, and AutoAugment could further increase the Pareto-efficiency of our networks, but that is outside the scope of this work.

All experiments (including One-Shot, FBNet, MorphNet baselines) were run on the same hardware (32-core Cloud TPU) using TensorFlow.

Search Space Details

For FiGS-One-Shot, we use the same search space instantiation presented in Bender et al. (2018) (Sec. 3.4) for ImageNet: 8 cells, 4 blocks per cell, separable convolutions, and downsampling with stride-2 average pooling every 2 cells. We use a base width of 64 filters. We verify our search space implementation by reproducing their "All On" results in Table 1. To assist with fine-grained search, we make one modification, as mentioned in Sec. 3.3: we combine operator outputs by concatenating them and passing the result through a 1x1 convolution (instead of adding) to decouple their output dimensions. The extra 1x1 convolution does not, in and of itself, increase the accuracy of the supernetwork or the learned architectures. As shown in Fig. 4, the concat aggregator helps FiGS produce better architectures.
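The benefit of the concat aggregator is purely structural: it decouples branch widths. The shape-level sketch below (function names are ours) contrasts the two aggregators and counts the weights of the fusing 1x1 convolution.

```python
def add_aggregator_channels(branch_channels):
    """Additive aggregation: every branch must emit the same number of
    channels, so the searched widths are coupled across operators."""
    assert len(set(branch_channels)) == 1, "additive merge needs equal widths"
    return branch_channels[0]

def concat_aggregator_params(branch_channels, out_channels):
    """Concat + 1x1 conv: each branch width can shrink independently
    during search; fusing the concatenation costs
    (sum of branch widths) * out_channels weights (bias/BN ignored)."""
    return sum(branch_channels) * out_channels
```

For instance, with additive aggregation three branches must all emit, say, 64 channels, whereas with concatenation the search can leave them at 16, 48, and 8 channels and fuse them with a 72x64 projection.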

Figure 4:

Effect of additive vs. concat aggregation on fine-grained search over the One-Shot search space. The freedom to set the channel count of each operator's output independently allows FiGS to learn better architectures.

For FiGS-FBNet, we do not include group convolutions in our set of operators so we only compare against FBNet-{B,C} which also do not include group convolutions.


The multiple points for EfficientNet-B0 in Fig. 2 were generated by applying a uniform width multiplier.
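A uniform width multiplier scales every layer's channel count by the same factor. The sketch below assumes the common MobileNet-style convention of rounding to a multiple of 8; that rounding rule is our assumption, not something stated here.

```python
def apply_width_multiplier(channels, multiplier, divisor=8):
    """Scale a layer's channel count by `multiplier` and round to the
    nearest multiple of `divisor` (never below `divisor` itself)."""
    scaled = int(channels * multiplier + divisor / 2) // divisor * divisor
    return max(divisor, scaled)
```

For example, `apply_width_multiplier(32, 0.75)` gives 24 channels; applying the same multiplier to every layer traces out the accuracy/size curve plotted for EfficientNet-B0.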

Model                                           Top-1 Acc.   #Params      Ratio-to-Ours
FiGS-A                                          69.9±0.1%    1.3±0.02M    1.0x
One-Shot-Small (Bender et al. (2018))           67.9%        1.4M         1.1x
MnasNet-Small (Tan et al. (2019))               64.9%        1.9M         1.5x
MobileNetV3-Small-0.75 (Howard et al. (2019))   65.4%        2.4M         1.8x
FiGS-B                                          75.0±0.5%    2.7±0.06M    1.0x
MobileNetV2-0.75x (Sandler et al. (2018))       69.8%        2.6M         1.0x
MobileNetV3-Small-1.0 (Howard et al. (2019))    67.4%        2.9M         1.1x
One-Shot-Small (Bender et al. (2018))           72.4%        3.0M         1.2x
MobileNetV2-1.0x (Sandler et al. (2018))        72.0%        3.4M         1.3x
MnasNet-65 (Tan et al. (2019))                  73.0%        3.6M         1.4x
AtomNAS-A (Mei et al. (2020))                   74.6%        3.9M         1.5x
MobileNetV3-Large-0.75 (Howard et al. (2019))   73.3%        4.0M         1.5x
FBNet-A (Wu et al. (2019))                      73.0%        4.3M         1.8x
FiGS-C                                          77.1±0.03%   4.4±0.04M    1.0x
AtomNAS-B (Mei et al. (2020))                   75.5%        4.4M         1.0x
FBNet-B (Wu et al. (2019))                      74.1%        4.5M         1.0x
MnasNet-A2 (Tan et al. (2019))                  75.6%        4.8M         1.1x
One-Shot-Small (Bender et al. (2018))           74.2%        5.1M         1.2x
MobileNetV2-1.3x (Sandler et al. (2018))        74.4%        5.3M         1.2x
PC-DARTS (Xu et al. (2020))                     75.8%        5.3M         1.2x
EfficientNet-B0 (Tan and Le (2019))             76.3%        5.3M         1.2x
MobileNetV3-Large-1.0 (Howard et al. (2019))    75.2%        5.4M         1.2x
FBNet-C (Wu et al. (2019))                      74.9%        5.5M         1.3x
Table 5: Extended version of Table 1: comparison with modern mobile classification architectures and DNAS methods on ImageNet. FiGS produces the smallest and most accurate models in each category. Ratio-to-Ours indicates how much larger each network is compared to ours.