1 Introduction
ConvNets are the de facto method for computer vision. In many computer vision tasks, a better ConvNet design usually leads to significant accuracy improvement. In previous works, accuracy improvement comes at the cost of higher computational complexity, making it more challenging to deploy ConvNets to mobile devices, on which computing capacity is limited. Instead of solely focusing on accuracy, recent work also aims to optimize for efficiency, especially latency. However, designing efficient and accurate ConvNets is a difficult problem. Challenges mainly come from the following aspects.
Intractable design space: The design space of a ConvNet is combinatorial. Using VGG16 [18] as a motivating example: VGG16 contains 16 layers. Assume that for each layer of the network, we can choose a kernel size and a filter number, each from a small set of options. Even with such simplified design choices and shallow layers, the number of possible architectures grows exponentially with the number of layers and is far too large to enumerate. Moreover, training a ConvNet is very time-consuming, typically taking days or even weeks. As a result, previous ConvNet design rarely explores the design space. A typical flow of manual ConvNet design is illustrated in Figure 2(a). Designers propose initial architectures and train them on the target dataset. Based on the performance, designers evolve the architectures accordingly. Limited by the cost of training ConvNets, the design flow can only afford a few iterations of experiments, which is far too few to sufficiently explore the design space.
Starting from [30], recent works adopt neural architecture search (NAS) to explore the design space automatically. Many previous works [30, 31, 20] use reinforcement learning (RL) to guide the search, and a typical flow is illustrated in Figure 2(b). A controller samples architectures from the search space to be trained. To reduce the training cost, sampled architectures are trained on a smaller proxy dataset such as CIFAR-10 or trained for fewer epochs on ImageNet. The performance of the trained networks is then used to train and improve the controller. Previous works [30, 31, 20] have demonstrated the effectiveness of such methods in finding accurate and efficient ConvNet models. However, training each architecture is still time-consuming, and it usually takes thousands of architectures to train the controller. As a result, the computational cost of such methods is prohibitively high, which leads to the next challenge of ConvNet design.
Nontransferable optimality: The optimality of ConvNet architectures is conditioned on many factors such as input resolutions and target devices. Once these factors change, the optimal architecture is likely to be different. A common practice to reduce the FLOP count of a network is to shrink the input resolution. A smaller input resolution may require a smaller receptive field of the network and therefore fewer layers. On a different device, the same operator can have different latency, so we need to adjust the ConvNet architecture to achieve the best accuracy-efficiency trade-off. Ideally, we should design different ConvNet architectures case-by-case. In practice, however, limited by the cost of previous manual and automated approaches, we can only afford to design one ConvNet and use it for all conditions.
Inconsistent efficiency metrics: Most of the efficiency metrics we care about depend not only on the ConvNet architecture but also on the hardware and software configurations of the target device. Such metrics include latency, power, and energy; in this paper, we mainly focus on latency. To simplify the problem, most previous works adopt hardware-agnostic metrics such as FLOPs (more strictly, the number of multiply-add operations) to evaluate a ConvNet's efficiency. However, a ConvNet with a lower FLOP count is not necessarily faster. For example, NASNet-A [31] has a FLOP count similar to that of MobileNetV1 [6], but its complicated and fragmented cell-level structure is not hardware-friendly, so its actual latency is higher [17]. The inconsistency between hardware-agnostic metrics and actual efficiency makes ConvNet design more difficult.
To address the above problems, we propose to use differentiable neural architecture search (DNAS) to discover hardware-aware efficient ConvNets. The flow of our algorithm is illustrated in Figure 1. DNAS allows us to explore a layer-wise search space where we can choose a different block for each layer of the network. Following [21], DNAS represents the search space by a super net whose operators execute stochastically. We relax the problem of finding the optimal architecture to finding a distribution that yields the optimal architecture. By using the Gumbel Softmax technique [9], we can directly train the architecture distribution using gradient-based optimization such as SGD. The search process is extremely fast compared with previous reinforcement learning (RL) based methods. The loss used to train the stochastic super net consists of both the cross-entropy loss that leads to better accuracy and the latency loss that penalizes the network's latency on a target device. To estimate the latency of an architecture, we measure the latency of each operator in the search space and use a lookup table model to compute the overall latency by adding up the latency of each operator. Using this model allows us to quickly estimate the latency of architectures in this enormous search space. More importantly, it makes the latency differentiable with respect to layer-wise block choices.
We name the models discovered by DNAS FBNets. FBNets surpass the state-of-the-art efficient ConvNets designed manually and automatically. FBNet-B achieves 74.1% top-1 accuracy with 295M FLOPs and 23.1 ms latency on a Samsung S8 phone, 2.4x smaller and 1.5x faster than MobileNetV2-1.3 with the same accuracy. While being better than MnasNet, FBNet-B has a search cost of 216 GPU-hours, 421x lower than the cost for MnasNet estimated based on [20]. Such a low search cost enables us to redesign ConvNets case-by-case. For different resolution and channel scaling, FBNets achieve 1.5% to 6.4% absolute gain in top-1 accuracy compared with MobileNetV2 models. The smallest FBNet achieves 50.2% accuracy and 2.9 ms latency (345 frames per second) with a batch size of 1 on a Samsung S8. Using DNAS to search for device-specific ConvNets, an iPhone-X-optimized model achieves 1.4x speedup on an iPhone X compared with a Samsung-optimized model.
2 Related work
Efficient ConvNet models: Designing efficient ConvNets has attracted much research attention in recent years. SqueezeNet [8] is one of the early works focusing on reducing the parameter size of ConvNet models. It was originally designed for classification, but later extended to object detection [22] and LiDAR point-cloud segmentation [24, 26]. Following SqueezeNet, SqueezeNext [3] and ShiftNet [23] achieve further parameter size reduction. Recent works change the focus from parameter size to FLOPs. MobileNetV1 and MobileNetV2 [6, 17] use depthwise convolutions to replace the more expensive spatial convolutions. ShuffleNet [29] uses group convolution and shuffle operations to reduce the FLOP count further. More recent works realize that FLOP count does not always reflect the actual hardware efficiency. To improve actual latency, ShuffleNetV2 [13] proposes a series of practical guidelines for efficient ConvNet design. Synetgy [28] combines ideas from ShuffleNetV2 and ShiftNet to co-design hardware-friendly ConvNets and FPGA accelerators.
Neural Architecture Search: [30, 31] first propose to use reinforcement learning (RL) to search for neural architectures to achieve competitive accuracy with low FLOPs. Early NAS methods are computationally expensive. Recent works try to reduce the computational cost by weight sharing [16] or by using gradient-based optimization [12]. [25, 1] further develop the idea of differentiable neural architecture search by combining it with Gumbel Softmax [9]. Early works on NAS [31, 16, 12] focus on cell-level architecture search, and the same cell structure is repeated in all layers of a network. However, such fragmented and complicated cell-level structures are not hardware-friendly, and the actual efficiency is low. Most recently, [20] explores a stage-level hierarchical search space, allowing different blocks for different stages of a network, while blocks inside a stage are still the same. Instead of focusing on FLOPs, [20] aims to optimize the latency on target devices. Besides searching for new architectures, works such as [27, 5] focus on adapting existing models to improve efficiency.
3 Method
In this paper, we use differentiable neural architecture search (DNAS) to solve the problem of ConvNet design. We formulate the neural architecture search problem as
$$\min_{a \in \mathcal{A}} \min_{w_a} \mathcal{L}(a, w_a). \quad (1)$$
Given an architecture space $\mathcal{A}$, we seek to find an optimal architecture $a \in \mathcal{A}$ such that after training its weights $w_a$, it can achieve the minimal loss $\mathcal{L}(a, w_a)$. In our work, we focus on three factors of the problem: a) the search space $\mathcal{A}$; b) the loss function $\mathcal{L}(a, w_a)$ that considers actual latency; c) an efficient search algorithm.
3.1 The Search Space
Previous works [30, 31, 16, 11, 12] focus on cell-level architecture search. Once a cell structure is searched, it is used in all the layers across the network. However, many searched cell structures are very complicated and fragmented and are therefore slow when deployed to mobile CPUs [17, 13]. Besides, at different layers, the same cell structure can have a different impact on the accuracy and latency of the overall network. As shown in [20] and in our experiments, allowing different layers to choose different blocks leads to better accuracy and efficiency.
In this work, we construct a layer-wise search space with a fixed macro-architecture, and each layer can choose a different block. The macro-architecture is described in Table 1. It defines the number of layers and the input/output dimensions of each layer. The first and the last three layers of the network have fixed operators. For the rest of the layers, the block type needs to be searched. The filter numbers for each layer are hand-picked empirically. We use relatively small channel sizes for early layers, since the input resolution at early layers is large, and the computational cost (FLOP count) is quadratic in the input size.
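| Input shape | Block | f | n | s |
|---|---|---|---|---|
| $224^2 \times 3$ | 3x3 conv | 16 | 1 | 2 |
| $112^2 \times 16$ | TBS | 16 | 1 | 1 |
| $112^2 \times 16$ | TBS | 24 | 4 | 2 |
| $56^2 \times 24$ | TBS | 32 | 4 | 2 |
| $28^2 \times 32$ | TBS | 64 | 4 | 2 |
| $14^2 \times 64$ | TBS | 112 | 4 | 1 |
| $14^2 \times 112$ | TBS | 184 | 4 | 2 |
| $7^2 \times 184$ | TBS | 352 | 1 | 1 |
| $7^2 \times 352$ | 1x1 conv | 1984 | 1 | 1 |
| $7^2 \times 1984$ | 7x7 avgpool | - | 1 | 1 |
| $1984$ | fc | 1000 | 1 | - |

Table 1: Macro-architecture of the search space. "TBS" marks layers whose block type is to be searched. f denotes the filter number, n denotes the number of blocks in a stage, and s denotes the stride of the first block in a stage. Input shapes are given for a 224-by-224 input.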
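As a quick consistency check on Table 1, the sketch below walks the stages and derives each stage's input shape together with the number of searchable layers. It assumes a 224x224x3 input and that every stride-2 stage exactly halves the resolution; the names `MACRO_ARCH` and `trace_shapes` are illustrative and not part of the original work.

```python
# Rows of Table 1 up to the 1x1 conv, as (block, f, n, s).
# The avgpool and fc rows are left out because they have no filter number or stride.
MACRO_ARCH = [
    ("3x3 conv", 16, 1, 2),
    ("TBS", 16, 1, 1),
    ("TBS", 24, 4, 2),
    ("TBS", 32, 4, 2),
    ("TBS", 64, 4, 2),
    ("TBS", 112, 4, 1),
    ("TBS", 184, 4, 2),
    ("TBS", 352, 1, 1),
    ("1x1 conv", 1984, 1, 1),
]


def trace_shapes(resolution=224, channels=3):
    """Return the (resolution, channels, block) seen at the input of each stage."""
    rows = []
    for block, f, n, s in MACRO_ARCH:
        rows.append((resolution, channels, block))
        # Only the first block of a stage applies the stride s (assumption:
        # padding is chosen so that stride 2 halves the resolution exactly).
        resolution //= s
        channels = f
    return rows, resolution, channels


shapes, final_resolution, final_channels = trace_shapes()
searchable_layers = sum(n for block, f, n, s in MACRO_ARCH if block == "TBS")

for res, ch, block in shapes:
    print(f"{res}x{res}x{ch} -> {block}")
print("searchable layers:", searchable_layers)
```

The count of searchable layers obtained this way is the sum of n over the TBS rows.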
Each searchable layer in the network can choose a different block from the layer-wise search space. The block structure is inspired by MobileNetV2 [17] and ShiftNet [23], and is illustrated in Figure 3. It contains a point-wise (1x1) convolution, a K-by-K depthwise convolution where K denotes the kernel size, and another 1x1 convolution. ReLU activation functions follow the first 1x1 convolution and the depthwise convolution, but no activation function follows the last 1x1 convolution. If the output dimension stays the same as the input dimension, we use a skip connection to add the input to the output. Following [17, 23], we use a hyperparameter, the expansion ratio $e$, to control the block. It determines how much we expand the output channel size of the first 1x1 convolution compared with its input channel size. Following [20], we also allow choosing a kernel size of 3 or 5 for the depthwise convolution. In addition, we can choose to use group convolution for the first and the last 1x1 convolution to reduce the computational complexity. When we use group convolution, we follow [29] and add a channel shuffle operation to mix the information between channel groups.
In our experiments, our layer-wise search space contains 9 candidate blocks, with their configurations listed in Table 2. Note that we also have a block called "skip", which directly feeds the input feature map to the output without actual computation. This candidate block essentially allows us to reduce the depth of the network.
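To make the effect of the expansion ratio, kernel size, and group convolution concrete, here is a rough multiply-add count for one candidate block. It is only a sketch under stated assumptions: the stride is applied in the depthwise convolution, the channel count is divisible by the group number, and bias, normalization, activation, channel shuffle, and the skip-connection addition are ignored. The function name `block_macs` is hypothetical.

```python
def block_macs(h, w, c_in, c_out, expansion, kernel, group=1, stride=1):
    """Approximate multiply-adds of a 1x1 conv -> KxK depthwise conv -> 1x1 conv block."""
    c_mid = c_in * expansion                 # channels after the first 1x1 convolution
    h_out, w_out = h // stride, w // stride  # assumed: stride lives in the depthwise conv

    # First 1x1 (optionally grouped) convolution runs at the input resolution.
    pw1 = h * w * c_in * c_mid // group
    # Depthwise convolution: one KxK filter per channel.
    dw = h_out * w_out * c_mid * kernel * kernel
    # Last 1x1 (optionally grouped) convolution runs at the output resolution.
    pw2 = h_out * w_out * c_mid * c_out // group
    return pw1 + dw + pw2


# Compare a few configurations on a 14x14x112 feature map.
for name, (e, k, g) in {"k3_e1": (1, 3, 1), "k3_e1_g2": (1, 3, 2), "k5_e6": (6, 5, 1)}.items():
    print(name, block_macs(14, 14, 112, 112, expansion=e, kernel=k, group=g))
```

Under these assumptions, grouping halves the cost of both 1x1 convolutions, while a larger kernel only affects the depthwise term.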
In summary, our overall search space contains 22 layers, and each layer can choose from the 9 candidate blocks in Table 2, so it contains $9^{22} \approx 10^{21}$ possible architectures. Finding the optimal layer-wise block assignment in such an enormous search space is a nontrivial task.
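| Block type | Expansion | Kernel | Group |
|---|---|---|---|
| k3_e1 | 1 | 3 | 1 |
| k3_e1_g2 | 1 | 3 | 2 |
| k3_e3 | 3 | 3 | 1 |
| k3_e6 | 6 | 3 | 1 |
| k5_e1 | 1 | 5 | 1 |
| k5_e1_g2 | 1 | 5 | 2 |
| k5_e3 | 3 | 5 | 1 |
| k5_e6 | 6 | 5 | 1 |
| skip | - | - | - |

Table 2: Configurations of the candidate blocks in the search space.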
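The candidate blocks and the resulting search space size can be written down directly. The sketch below is illustrative only: `CANDIDATE_BLOCKS` and `parse_block_name` are hypothetical names, and the tuple layout `(expansion, kernel, group)` is a choice made here for readability.

```python
# Candidate blocks of Table 2 as (expansion, kernel, group); "skip" has no convolution.
CANDIDATE_BLOCKS = {
    "k3_e1": (1, 3, 1),
    "k3_e1_g2": (1, 3, 2),
    "k3_e3": (3, 3, 1),
    "k3_e6": (6, 3, 1),
    "k5_e1": (1, 5, 1),
    "k5_e1_g2": (1, 5, 2),
    "k5_e3": (3, 5, 1),
    "k5_e6": (6, 5, 1),
    "skip": None,
}


def parse_block_name(name):
    """Recover (expansion, kernel, group) from a name such as 'k5_e1_g2'."""
    if name == "skip":
        return None
    fields = {part[0]: int(part[1:]) for part in name.split("_")}
    return (fields["e"], fields["k"], fields.get("g", 1))


NUM_SEARCHABLE_LAYERS = 22  # number of layers whose block type is searched
search_space_size = len(CANDIDATE_BLOCKS) ** NUM_SEARCHABLE_LAYERS
print(f"{search_space_size:.3e} possible architectures")
```

The block names encode their own configuration, which is what `parse_block_name` relies on.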
3.2 Latency-Aware Loss Function
The loss function used in (1) has to reflect not only the accuracy of a given architecture but also the latency on the target hardware. To achieve this goal, we define the following loss function:
$$\mathcal{L}(a, w_a) = \mathrm{CE}(a, w_a) \cdot \alpha \log(\mathrm{LAT}(a))^{\beta}. \quad (2)$$
The first term $\mathrm{CE}(a, w_a)$ denotes the cross-entropy loss of architecture $a$ with parameter $w_a$. The second term $\mathrm{LAT}(a)$ denotes the latency of the architecture on the target hardware, measured in microseconds. The coefficient $\alpha$ controls the overall magnitude of the loss function. The exponent coefficient $\beta$ modulates the magnitude of the latency term.
The cross-entropy term can be easily computed. However, the latency term is more difficult, since we need to measure the actual runtime of an architecture on a target device. To cover the entire search space, we would need to measure about $10^{21}$ architectures, which is an impossible task.
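A minimal sketch of such a latency-aware loss follows, assuming the multiplicative form $\mathrm{CE}(a, w_a) \cdot \alpha \log(\mathrm{LAT}(a))^{\beta}$ with a natural logarithm. The function name and the values chosen for `ALPHA` and `BETA` are placeholders for illustration, not settings taken from the text above.

```python
import math


def latency_aware_loss(cross_entropy, latency_us, alpha, beta):
    """CE(a, w_a) * alpha * log(LAT(a))**beta, with LAT(a) in microseconds."""
    if latency_us <= 1.0:
        # log(LAT) must be positive for a fractional exponent to be well defined.
        raise ValueError("latency must exceed 1 microsecond")
    return cross_entropy * alpha * math.log(latency_us) ** beta


# Hypothetical coefficient values, for illustration only.
ALPHA, BETA = 0.2, 0.6

fast = latency_aware_loss(cross_entropy=2.0, latency_us=20_000.0, alpha=ALPHA, beta=BETA)
slow = latency_aware_loss(cross_entropy=2.0, latency_us=30_000.0, alpha=ALPHA, beta=BETA)
print(fast, slow)
```

With equal cross-entropy, the slower architecture receives the larger loss; `alpha` rescales the whole loss, while `beta` controls how strongly latency differences are felt.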
To solve this problem, we use a latency lookup table model to estimate the overall latency of a network based on the runtime of each operator. More formally, we assume
$$\mathrm{LAT}(a) = \sum_l \mathrm{LAT}(b_l^{(a)}), \quad (3)$$
where $b_l^{(a)}$ denotes the block at layer $l$ of architecture $a$. This assumes that on the target processor, the runtime of each operator is independent of other operators. The assumption is valid for many mobile CPUs and DSPs, where operators are computed sequentially one by one. This way, by benchmarking the latency of a few hundred operators used in the search space, we can easily estimate the actual runtime of the architectures in the entire search space. More importantly, as will be explained in Section 3.3, using the lookup table model makes the latency term in the loss function (2) differentiable with respect to layer-wise block choices, and this allows us to use gradient-based optimization to solve problem (1).
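The lookup table model amounts to a sum of per-layer entries. The sketch below uses a tiny three-layer table with made-up latencies; all numbers, and the choice to give "skip" zero latency, are assumptions for illustration. The table is indexed by layer because the same block type sees different input shapes at different layers.

```python
# Hypothetical lookup table: LATENCY_LUT_US[layer][block_name] -> latency in microseconds.
LATENCY_LUT_US = [
    {"k3_e1": 120.0, "k5_e6": 610.0, "skip": 0.0},
    {"k3_e1": 95.0, "k5_e6": 480.0, "skip": 0.0},
    {"k3_e1": 60.0, "k5_e6": 300.0, "skip": 0.0},
]


def estimate_latency_us(architecture, lut):
    """Add up the benchmarked latency of the block chosen at each layer."""
    if len(architecture) != len(lut):
        raise ValueError("architecture must name one block per layer")
    return sum(layer_lut[block] for layer_lut, block in zip(lut, architecture))


print(estimate_latency_us(["k5_e6", "skip", "k3_e1"], LATENCY_LUT_US))  # 670.0
```

Only the table entries need to be benchmarked on the device; any architecture in the space can then be scored without running it.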
3.3 The Search Algorithm
Solving problem (1) through brute-force enumeration of the search space is very expensive. The inner problem of optimizing $w_a$ involves training a neural network. For ImageNet classification, training a ConvNet typically takes several days or even weeks. The outer problem of optimizing $a \in \mathcal{A}$ has a combinatorially large search space.
Most of the early works on NAS [30, 31, 20] follow the paradigm above. To reduce the computational cost, the inner problem is replaced by training candidate architectures on an easier proxy dataset. For example, [30, 31] train the architecture on the CIFAR-10 dataset, and [20] trains on ImageNet but only for 5 epochs. The learned architectures are then transferred to the target dataset. To avoid exhaustively iterating through the search space, [30, 31, 20] use reinforcement learning to guide the exploration. Despite these improvements, solving problem (1) is still prohibitively expensive: training a network on the proxy dataset is still time-consuming, and thousands of architectures need to be trained before reaching the optimal solution.
We adopt a different paradigm for solving problem (1). We first represent the search space by a stochastic super net. The super net has the same macro-architecture as described in Table 1, and each layer contains 9 parallel blocks as described in Table 2. During the inference of the super net, only one candidate block is sampled and executed, with the sampling probability
$$P_{\theta_l}(b_l = b_{l,i}) = \mathrm{softmax}(\theta_{l,i}; \theta_l) = \frac{\exp(\theta_{l,i})}{\sum_j \exp(\theta_{l,j})}. \quad (4)$$
$\theta_l$ contains the parameters that determine the sampling probability of each block at layer $l$. Equivalently, the output of layer $l$ can be expressed as
$$x_{l+1} = \sum_i m_{l,i} \cdot b_{l,i}(x_l), \quad (5)$$
where $m_{l,i}$ is a random variable in $\{0, 1\}$ that evaluates to 1 if block $b_{l,i}$ is sampled. The sampling probability is determined by equation (4). $b_{l,i}(x_l)$ denotes the output of block $i$ at layer $l$ given the input feature map $x_l$. We let each layer sample independently; therefore, the probability of sampling an architecture $a$ can be described as
$$P_\theta(a) = \prod_l P_{\theta_l}(b_l = b_{l,i}^{(a)}), \quad (6)$$
where $\theta$ denotes the vector consisting of all the $\theta_{l,i}$ for each block $i$ at layer $l$, and $b_l = b_{l,i}^{(a)}$ denotes that in the sampled architecture $a$, block $i$ is chosen at layer $l$.
Instead of solving for the optimal architecture $a \in \mathcal{A}$, which has a discrete search space, we relax the problem to optimizing the probability $P_\theta$ of the stochastic super net to achieve the minimum expected loss. Formally, we rewrite the discrete optimization problem (1) as
$$\min_\theta \min_{w_a} \mathbb{E}_{a \sim P_\theta}\{\mathcal{L}(a, w_a)\}. \quad (7)$$
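The layer-wise sampling and the resulting architecture probability can be sketched in plain Python, assuming the sampling probability in (4) is a softmax over the per-layer parameters and that layers sample independently as in (6). The function names and the small parameter matrix `theta` are illustrative only.

```python
import math
import random


def softmax(theta_l):
    """Numerically stable softmax over the parameters of one layer."""
    shift = max(theta_l)
    exps = [math.exp(t - shift) for t in theta_l]
    total = sum(exps)
    return [e / total for e in exps]


def sample_architecture(theta, rng):
    """Sample one block index per layer, each layer independently of the others."""
    return [rng.choices(range(len(t)), weights=softmax(t))[0] for t in theta]


def architecture_probability(theta, arch):
    """Product over layers of the probability of the chosen block."""
    prob = 1.0
    for theta_l, i in zip(theta, arch):
        prob *= softmax(theta_l)[i]
    return prob


# Toy super net with 2 layers and 3 candidate blocks per layer.
theta = [[0.0, 1.0, -1.0], [2.0, 0.0, 0.0]]
rng = random.Random(0)
arch = sample_architecture(theta, rng)
print(arch, architecture_probability(theta, arch))
```

Because the layers are independent, the probabilities of all architectures in this toy space sum to one, which is what makes the expectation in (7) well defined.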
The loss function in (7) is differentiable with respect to the architecture weights $w_a$ and can therefore be optimized by stochastic gradient descent (SGD). However, the loss is not directly differentiable with respect to the sampling parameter $\theta$, since we cannot pass the gradient through the discrete random variable $m_{l,i}$ to $\theta_{l,i}$. To sidestep this, we relax the discrete mask variable $m_{l,i}$ to be a continuous random variable computed by the Gumbel Softmax function [9, 14]
$$m_{l,i} = \mathrm{GumbelSoftmax}(\theta_{l,i} \mid \theta_l) = \frac{\exp[(\theta_{l,i} + g_{l,i})/\tau]}{\sum_j \exp[(\theta_{l,j} + g_{l,j})/\tau]}, \quad (8)$$
where $g_{l,i}$ is a random noise following the Gumbel distribution. The Gumbel Softmax function is controlled by a temperature parameter $\tau$. As $\tau$ approaches 0, it approximates the discrete categorical sampling following the distribution in (6). As $\tau$ becomes larger, $m_{l,i}$ becomes a continuous random variable. Regardless of the value of $\tau$, the mask $m_{l,i}$ is directly differentiable with respect to the parameter $\theta_{l,i}$. The technique of using Gumbel Softmax for neural architecture search is also proposed in [25, 1].
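The relaxed mask can be illustrated without any deep learning framework. The sketch below draws standard Gumbel noise by inverse transform sampling and applies a temperature-scaled softmax to the perturbed parameters; it is a plain-Python illustration rather than the authors' implementation, and the names `gumbel_noise` and `gumbel_softmax` are chosen here.

```python
import math
import random


def gumbel_noise(rng):
    """Draw g ~ Gumbel(0, 1) as -log(-log(u)) with u uniform in (0, 1)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); rng.random() never returns 1.0
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_softmax(theta_l, tau, rng):
    """Continuous mask over the candidate blocks of one layer at temperature tau."""
    logits = [(t + gumbel_noise(rng)) / tau for t in theta_l]
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
theta_l = [0.5, 0.0, -0.5]
for tau in (5.0, 1.0, 0.1):
    mask = gumbel_softmax(theta_l, tau, rng)
    print(tau, [round(m, 3) for m in mask])
```

Lower temperatures push the mask toward a one-hot vector, while higher temperatures spread it more evenly across the candidate blocks; in both cases the entries are smooth functions of `theta_l` for a fixed noise draw.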
As a result, the cross-entropy term from the loss function (2) is differentiable with respect to the mask $m_{l,i}$ and therefore $\theta_{l,i}$. For the latency term, since we use the lookup-table-based model for efficiency estimation, equation (3) can be written as
$$\mathrm{LAT}(a) = \sum_l \sum_i m_{l,i} \cdot \mathrm{LAT}(b_{l,i}). \quad (9)$$
The latency of each operator $\mathrm{LAT}(b_{l,i})$ is a constant coefficient, so the overall latency of architecture $a$ is differentiable with respect to the mask $m_{l,i}$, and therefore $\theta_{l,i}$.
Consequently, the loss function (2) is fully differentiable with respect to both the weights $w_a$ and the architecture distribution parameter $\theta$. This allows us to use SGD to efficiently solve problem (1).
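Because the latency is linear in the mask, its gradients have a closed form. The sketch below computes the mask-weighted latency and, for one layer, the gradient with respect to the sampling parameters obtained by applying the chain rule through a temperature-scaled softmax. The latency numbers are hypothetical and the function names are illustrative.

```python
def expected_latency(masks, lut):
    """Sum over layers l and blocks i of m[l][i] * LAT(b[l][i])."""
    return sum(m * lat for m_l, lat_l in zip(masks, lut) for m, lat in zip(m_l, lat_l))


def latency_grad_wrt_theta(mask_l, lat_l, tau):
    """Gradient of one layer's latency with respect to its sampling parameters.

    With m = softmax((theta + g) / tau), dm[i]/dtheta[j] = m[i] * ((i == j) - m[j]) / tau,
    so dLAT/dtheta[j] = m[j] * (LAT[j] - sum_i m[i] * LAT[i]) / tau.
    """
    mean = sum(m * lat for m, lat in zip(mask_l, lat_l))
    return [m * (lat - mean) / tau for m, lat in zip(mask_l, lat_l)]


# Hypothetical per-operator latencies (microseconds) for 2 layers x 2 blocks.
lut = [[100.0, 400.0], [50.0, 250.0]]
masks = [[0.75, 0.25], [0.5, 0.5]]

print(expected_latency(masks, lut))                       # 325.0
print(latency_grad_wrt_theta(masks[0], lut[0], tau=1.0))  # [-56.25, 56.25]
```

Blocks slower than the layer's current mask-weighted latency get a positive gradient, so a gradient descent step lowers their sampling parameter, which is how the latency term steers the search toward faster operators.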
Our search process is now equivalent to training the stochastic super net. During training, we compute $\partial \mathcal{L} / \partial w_a$ to train each operator's weights in the super net. This is no different from training an ordinary ConvNet. After the operators are trained, different operators can contribute differently to the accuracy and the efficiency of the overall network. Therefore, we compute $\partial \mathcal{L} / \partial \theta$ to update the sampling probability $P_\theta$ for each operator. This step selects operators with better accuracy and lower latency and suppresses the opposite ones. After the super net training finishes, we can obtain the optimal architectures by sampling from the architecture distribution $P_\theta$.
As will be shown in the experiment section, the proposed DNAS algorithm is orders of magnitude faster than previous RL-based NAS while generating better architectures.
4 Experiments
4.1 ImageNet Classification
To demonstrate the efficacy of our proposed method, we use DNAS to search for ConvNet models on the ImageNet 2012 classification dataset [2], and we name the discovered models FBNets. We aim to discover models with high accuracy and low latency on target devices. In our first experiment, we target a Samsung Galaxy S8 with a Qualcomm Snapdragon 835 platform. The model is deployed with Caffe2 with an int8 inference engine for mobile devices.
Before the search starts, we first build the latency lookup table described in Section 3.2 on the target device. Next, we train a stochastic super net with the search space described in Section 3.1. We set the input resolution of the network to 224-by-224. To reduce the training time, we randomly choose 100 classes from the original 1000 classes to train the stochastic super net. We train the stochastic super net for 90 epochs. In each epoch, we first train the operator weights $w_a$ and then the architecture probability parameter $\theta$. $w_a$ is trained on 80% of the ImageNet training set using SGD with momentum. The architecture distribution parameter $\theta$ is trained on the remaining 20% of the ImageNet training set with the Adam optimizer [10]. To control the temperature of the Gumbel Softmax from equation (8), we use an exponentially decaying temperature. After the search finishes, we sample several architectures from the trained distribution $P_\theta$ and train them from scratch. Our architecture search framework is implemented in PyTorch [15], and searched models are trained in Caffe2. More training details will be provided in the supplementary materials.
Our experiment results are summarized in Table 3. We compare our searched models with state-of-the-art efficient models designed both automatically and manually. The primary metrics we care about are top-1 accuracy on the ImageNet validation set and latency. If the latency is not available, we use FLOP count as the secondary efficiency metric. For baseline models, we directly cite the parameter size, FLOP count, and top-1 accuracy from the original papers. Since our network is deployed with Caffe2 with a highly efficient int8 implementation, we would have an unfair latency advantage over other baselines. Therefore, we implement the baseline models ourselves and measure their latency under the same environment for a fair comparison. For automatically designed models, we also compare the search method, search space, and search cost.
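| Model | Search method | Search space | Search cost (GPU hours / relative) | #Params | #FLOPs | CPU Latency | Top-1 acc (%) |
|---|---|---|---|---|---|---|---|
| 1.0-MobileNetV2 [17] | manual | - | - | 3.4M | 300M | 21.7 ms | 72.0 |
| 1.5-ShuffleNetV2 [13] | manual | - | - | 3.5M | 299M | 22.0 ms | 72.6 |
| CondenseNet (G=C=8) [7] | manual | - | - | 2.9M | 274M | 28.4 ms | 71.0 |
| MnasNet-65 [20] | RL | stage-wise | 91K / 421x | 3.6M | 270M | - | 73.0 |
| DARTS [12] | gradient | cell | 288 / 1.33x | 4.9M | 595M | - | 73.1 |
| FBNet-A (ours) | gradient | layer-wise | 216 / 1.0x | 4.3M | 249M | 19.8 ms | 73.0 |
| 1.3-MobileNetV2 [17] | manual | - | - | 5.3M | 509M | 33.8 ms | 74.4 |
| CondenseNet (G=C=4) [7] | manual | - | - | 4.8M | 529M | 28.7 ms | 73.8 |
| MnasNet [20] | RL | stage-wise | 91K / 421x | 4.2M | 317M | 23.7 ms | 74.0 |
| NASNet-A [31] | RL | cell | 48K / 222x | 5.3M | 564M | - | 74.0 |
| PNASNet [11] | SMBO | cell | 6K / 27.8x | 5.1M | 588M | - | 74.2 |
| FBNet-B (ours) | gradient | layer-wise | 216 / 1.0x | 4.5M | 295M | 23.1 ms | 74.1 |
| 1.4-MobileNetV2 [17] | manual | - | - | 6.9M | 585M | 37.4 ms | 74.7 |
| 2.0-ShuffleNetV2 [13] | manual | - | - | 7.4M | 591M | 33.3 ms | 74.9 |
| MnasNet-92 [20] | RL | stage-wise | 91K / 421x | 4.4M | 388M | - | 74.8 |
| FBNet-C (ours) | gradient | layer-wise | 216 / 1.0x | 5.5M | 375M | 28.1 ms | 74.9 |

Table 3: Comparison of FBNets with manually and automatically designed baselines on ImageNet. Models are listed in three groups by accuracy level, each group ending with an FBNet row. A dash marks a value that is not applicable or not available.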
Table 3 divides the models into three categories according to their accuracy level. In the first group, FBNet-A achieves 73.0% accuracy, better than 1.0-MobileNetV2 (+1.0%), 1.5-ShuffleNetV2 (+0.4%), and CondenseNet (+2%), and is on par with DARTS and MnasNet-65. Regarding latency, FBNet-A is 1.9 ms (relative 9.6%), 2.2 ms (relative 11%), and 8.6 ms (relative 43%) better than the MobileNetV2, ShuffleNetV2, and CondenseNet counterparts. Although we did not optimize for FLOP count directly, FBNet-A's FLOP count is only 249M, 50M smaller (relative 20%) than MobileNetV2 and ShuffleNetV2, 20M (relative 8%) smaller than MnasNet-65, and 2.4x smaller than DARTS. In the second group, FBNet-B achieves accuracy comparable to 1.3-MobileNetV2, but the latency is 1.46x lower, and the FLOP count is 1.73x smaller, even smaller than 1.0-MobileNetV2 and 1.5-ShuffleNetV2. Compared with MnasNet, FBNet-B's accuracy is 0.1% higher, latency is 0.6 ms lower, and FLOP count is 22M (relative 7%) smaller. We do not have the latency of NASNet-A and PNASNet, but the accuracy is comparable, and the FLOP count is 1.9x and 2.0x smaller. In the third group, FBNet-C achieves 74.9% accuracy, the same as 2.0-ShuffleNetV2 and better than all others. The latency is 28.1 ms, 1.33x and 1.19x faster than MobileNetV2 and ShuffleNetV2. The FLOP count is 1.56x, 1.58x, and 1.03x smaller than MobileNetV2, ShuffleNetV2, and MnasNet-92.
Among all the automatically searched models, FBNet's performance is much stronger than that of DARTS, PNAS, and NAS, and better than that of MnasNet. Moreover, the search cost is orders of magnitude lower. MnasNet [20] does not disclose the exact search cost (in terms of GPU-hours). However, it mentions that the controller samples 8,000 models during the search and each model is trained for five epochs. According to our experiments, training MnasNet for one epoch takes 17 minutes using 8 GPUs. So the estimated cost of training 8,000 models for 5 epochs is about 91K GPU hours. In comparison, the FBNet search takes 8 GPUs for only 27 hours, so the computational cost is only 216 GPU hours, or 421x lower than MnasNet, 222x lower than NAS, 27.8x lower than PNAS, and 1.33x lower than DARTS.
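The search cost estimate follows from simple arithmetic on the figures quoted above, as the short sketch below shows. The variable names are illustrative; the inputs are the numbers stated in the text.

```python
# MnasNet estimate: 8,000 sampled models, 5 epochs each, 17 minutes per epoch on 8 GPUs.
models, epochs, minutes_per_epoch, gpus = 8000, 5, 17, 8
mnasnet_gpu_hours = models * epochs * minutes_per_epoch / 60 * gpus

# FBNet search: 8 GPUs for 27 hours.
fbnet_gpu_hours = 8 * 27

print(f"MnasNet estimate: {mnasnet_gpu_hours:,.0f} GPU hours")
print(f"FBNet search:     {fbnet_gpu_hours} GPU hours")
print(f"ratio (unrounded estimate): {mnasnet_gpu_hours / fbnet_gpu_hours:.1f}x")
print(f"ratio (rounded to 91K):     {91_000 / fbnet_gpu_hours:.1f}x")
```

The unrounded estimate is just under 91K GPU hours, so the ratio comes out at roughly 420x, or 421x when the rounded 91K figure is used.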
We visualize some of our searched FBNets, MobileNetV2, and MnasNet in Figure 4.
4.2 Different Resolution and Channel Size Scaling
A common technique to reduce the computational cost of a ConvNet is to reduce the input resolution or channel size without changing the ConvNet structure. This approach is likely to be sub-optimal. We hypothesize that with a different input resolution and channel size scaling, the optimal ConvNet structure will be different. To test this, we use DNAS to search for several different combinations of input resolution and channel size scaling. Thanks to the superior efficiency of DNAS, we can finish the search very quickly. The result is summarized in Table 4. Compared with MobileNetV2 under the same input size and channel size scaling, our searched models achieve 1.5% to 6.4% better accuracy with similar latency. In particular, the FBNet-96-0.35-1 model achieves 50.2% (+4.7%) accuracy and 2.9 ms latency (345 frames per second) on a Samsung Galaxy S8.
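The frame rate quoted for the smallest model is the reciprocal of its latency at batch size 1. A two-line check, with a hypothetical helper name:

```python
def frames_per_second(latency_ms, batch_size=1):
    """Throughput implied by a per-batch latency given in milliseconds."""
    return batch_size * 1000.0 / latency_ms


print(round(frames_per_second(2.9)))  # 345
print(round(50.2 - 45.5, 1))          # 4.7 points over MobileNetV2 at (96, 0.35)
```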

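| Input size, channel scaling | Model | #Parameters | #FLOPs | CPU Latency | Top-1 acc (%) |
|---|---|---|---|---|---|
| (224, 0.35) | MobileNetV2-224-0.35 | 1.7M | 59M | 9.3 ms | 60.3 |
| | MnasNet-scale-224-0.35 | 1.9M | 76M | 10.7 ms | 62.4 (+2.1) |
| | FBNet-224-0.35 | 2.0M | 72M | 10.7 ms | 65.3 (+5.0) |
| (192, 0.50) | MobileNetV2 | 2.0M | 71M | 8.4 ms | 63.9 |
| | MnasNet-search-192-0.5 | - | - | - | 65.6 (+1.7) |
| | FBNet-192-0.5 (ours) | 2.6M | 73M | 9.9 ms | 65.9 (+2.0) |
| (128, 1.0) | MobileNetV2 | 3.5M | 99M | 8.4 ms | 65.3 |
| | MnasNet-scale-128-1.0 | 4.2M | 103M | 9.2 ms | 67.3 (+2.0) |
| | FBNet-128-1.0 (ours) | 4.2M | 92M | 9.0 ms | 67.0 (+1.7) |
| (128, 0.50) | MobileNetV2 | 2.0M | 32M | 4.8 ms | 57.7 |
| | FBNet-128-0.5 (ours) | 2.4M | 32M | 5.1 ms | 60.0 (+2.3) |
| (96, 0.35) | MobileNetV2 | 1.7M | 11M | 3.8 ms | 45.5 |
| | FBNet-96-0.35-1 (ours) | 1.8M | 12.9M | 2.9 ms | 50.2 (+4.7) |
| | FBNet-96-0.35-2 (ours) | 1.9M | 13.7M | 3.6 ms | 51.9 (+6.4) |

Table 4: FBNets searched for different input resolutions and channel size scaling. Accuracy gains in parentheses are relative to MobileNetV2 in the same group.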
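| Model | #Parameters | #FLOPs | Latency on iPhone X | Latency on Samsung S8 | Top-1 acc (%) |
|---|---|---|---|---|---|
| FBNet-iPhoneX | 4.47M | 322M | 19.84 ms (target) | 23.33 ms | 73.20 |
| FBNet-S8 | 4.43M | 293M | 27.53 ms | 22.12 ms (target) | 73.27 |

Table 5: FBNets searched for different target devices.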
We visualize the architecture of FBNet-96-0.35-1 in Figure 4. We can see that many layers are skipped, and the network is much shallower than FBNet-{A, B, C}, whose input size is 224. We conjecture that this is because with a smaller input size, the receptive field needed to parse the image also becomes smaller, so having more layers will not effectively increase the accuracy.
4.3 Different Target Devices
In previous ConvNet design practices, the same ConvNet model is deployed to many different devices. However, this is sub-optimal since different computing platforms and software implementations can have different characteristics. To validate this, we conduct searches targeting two mobile devices: a Samsung Galaxy S8 with a Qualcomm Snapdragon 835 platform, and an iPhone X with an A11 Bionic processor. We use the same architecture search space, but different latency lookup tables collected from the two target devices. All the architecture search and training protocols are the same. After searching and training the two models, we deploy them to both the Samsung Galaxy S8 and the iPhone X to benchmark the overall latency. The result is summarized in Table 5.
As we can see, the two models reach similar accuracy (73.20% vs. 73.27%). The FBNet-iPhoneX model's latency is 19.84 ms on its target device, but when deployed to a Samsung S8, its latency increases to 23.33 ms. On the other hand, FBNet-S8 reaches a latency of 22.12 ms on a Samsung S8, but when deployed to an iPhone X, the latency rises to 27.53 ms, 7.69 ms (relative 39%) higher than that of FBNet-iPhoneX. This demonstrates the necessity of redesigning ConvNets for different target devices.
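The cross-device penalties can be recomputed from the four latency measurements. The dictionary layout and the helper name below are illustrative; the numbers are the ones reported in the text.

```python
# Measured latency in milliseconds: LATENCY_MS[model][device].
LATENCY_MS = {
    "FBNet-iPhoneX": {"iPhone X": 19.84, "Samsung S8": 23.33},
    "FBNet-S8": {"iPhone X": 27.53, "Samsung S8": 22.12},
}


def penalty(device, native_model, foreign_model):
    """Absolute and relative slowdown of running the foreign model on the device."""
    native = LATENCY_MS[native_model][device]
    foreign = LATENCY_MS[foreign_model][device]
    return foreign - native, (foreign - native) / native


abs_ms, rel = penalty("iPhone X", "FBNet-iPhoneX", "FBNet-S8")
print(f"iPhone X:   +{abs_ms:.2f} ms ({rel:.0%})")
abs_ms_s8, rel_s8 = penalty("Samsung S8", "FBNet-S8", "FBNet-iPhoneX")
print(f"Samsung S8: +{abs_ms_s8:.2f} ms ({rel_s8:.0%})")
```

The penalty is asymmetric: the Samsung-optimized model loses far more on the iPhone X than the iPhone-optimized model loses on the Samsung S8.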
The two models are visualized in Figure 4. Note that FBNet-S8 uses many blocks with 5x5 depthwise convolution, while FBNet-iPhoneX only uses them in the last two stages. We examine the depthwise convolution operators used in the two models and compare their runtime on both devices. As shown in Figure 5, the upper three operators are faster on the iPhone X and are therefore automatically adopted in FBNet-iPhoneX. The lower three operators are significantly faster on the Samsung S8, and they are automatically adopted in FBNet-S8. Notice the drastic runtime differences of the lower three operators between the two target devices. This explains why the Samsung-S8-optimized model performs poorly on an iPhone X. It shows that DNAS can automatically optimize the choice of operators and generate different ConvNets optimized for different devices.
5 Conclusion
We present DNAS, a differentiable neural architecture search framework. It optimizes over a layer-wise search space and represents the search space by a stochastic super net. The actual target device latency of blocks is used to compute the loss for super net training. FBNets, a family of models discovered by DNAS, surpass state-of-the-art models, both manually and automatically designed: FBNet-B achieves 74.1% top-1 accuracy with 295M FLOPs and 23.1 ms latency, 2.4x smaller and 1.5x faster than MobileNetV2-1.3 with the same accuracy. It also achieves better accuracy and lower latency than MnasNet, the state-of-the-art efficient model designed automatically; we estimate that the search cost of DNAS is about 420x smaller. Such efficiency allows us to conduct searches for different input resolutions and channel scaling. Discovered models achieve 1.5% to 6.4% accuracy gains. The smallest FBNet achieves 50.2% accuracy with a latency of 2.9 ms (345 frames/sec) with batch size 1. Compared with the Samsung-optimized FBNet, the iPhone-X-optimized FBNet achieves 1.4x speedup on an iPhone X, showing that DNAS is able to adapt to different target devices automatically.
References
 [1] Anonymous. Snas: stochastic neural architecture search. In Submitted to International Conference on Learning Representations, 2019. under review.

[2]
J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei.
Imagenet: A largescale hierarchical image database.
In
Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on
, pages 248–255. Ieee, 2009.  [3] A. Gholami, K. Kwon, B. Wu, Z. Tai, X. Yue, P. Jin, S. Zhao, and K. Keutzer. Squeezenext: Hardwareaware neural network design. arXiv preprint arXiv:1803.10615, 2018.
 [4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [5] Y. He, J. Lin, Z. Liu, H. Wang, L.J. Li, and S. Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800, 2018.
 [6] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [7] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. group, 3(12):11, 2017.
 [8] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnetlevel accuracy with 50x fewer parameters and¡ 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
 [9] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbelsoftmax. arXiv preprint arXiv:1611.01144, 2016.
 [10] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [11] C. Liu, B. Zoph, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017.
 [12] H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 [13] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. arXiv preprint arXiv:1807.11164, 2018.
 [14] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 [15] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
 [16] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
 [17] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
 [18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [20] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
 [21] T. Veniat and L. Denoyer. Learning time/memory-efficient deep architectures with budgeted super networks. arXiv preprint arXiv:1706.00046, 2017.
 [22] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer. Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In CVPR Workshops, pages 446–454, 2017.
 [23] B. Wu, A. Wan, X. Yue, P. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, and K. Keutzer. Shift: A zero flop, zero parameter alternative to spatial convolutions. arXiv preprint arXiv:1711.08141, 2017.
 [24] B. Wu, A. Wan, X. Yue, and K. Keutzer. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1887–1893. IEEE, 2018.
 [25] B. Wu, Y. Wang, P. Zhang, Y. Tian, P. Vajda, and K. Keutzer. Mixed precision quantization of convnets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090, 2018.
 [26] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer. Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. arXiv preprint arXiv:1809.08495, 2018.
 [27] T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam. Netadapt: Platform-aware neural network adaptation for mobile applications. Energy, 41:46, 2018.
 [28] Y. Yang, Q. Huang, B. Wu, T. Zhang, L. Ma, G. Gambardella, M. Blott, L. Lavagno, K. Vissers, J. Wawrzynek, et al. Synetgy: Algorithm-hardware co-design for convnet accelerators on embedded fpgas. arXiv preprint arXiv:1811.08634, 2018.
 [29] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.
 [30] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
 [31] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2(6), 2017.
Appendix A Experiment details
We describe further experiment details in this appendix to help other researchers reproduce our work. Our architecture search is divided into two stages. In the first stage, we train the stochastic super net to find an optimal architecture distribution. In the second stage, we sample architectures from the distribution and train them from scratch.
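The second stage can be illustrated with a minimal sketch. It assumes that the learned architecture distribution is an independent categorical distribution per layer, obtained by applying a softmax to per-layer architecture parameters. The names `theta`, `softmax`, and `sample_architectures`, as well as the toy parameter values, are hypothetical and are not taken from our implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_architectures(theta, num_samples, seed=0):
    """Draw architectures from a layer-wise categorical distribution.

    theta[l][i] is a hypothetical architecture parameter (logit) for
    candidate block i at layer l. Each sampled architecture is a list
    holding one chosen block index per layer.
    """
    rng = random.Random(seed)
    probs = [softmax(layer) for layer in theta]
    architectures = []
    for _ in range(num_samples):
        arch = [rng.choices(range(len(p)), weights=p, k=1)[0] for p in probs]
        architectures.append(arch)
    return architectures


# Toy distribution: 3 layers with 4 candidate blocks each (made-up values).
theta = [
    [2.0, 0.1, 0.1, 0.1],
    [0.0, 0.0, 3.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
archs = sample_architectures(theta, num_samples=6)
```

Each entry of `archs` would then be instantiated as a concrete network and trained from scratch.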
To train the stochastic super net, we randomly sample 100 classes from the original 1,000 classes of ImageNet. Training the super net on this smaller proxy dataset is much faster. We train the stochastic super net for 90 epochs with a batch size of 192. In each epoch, we first train the operator parameters on 80% of the training set using stochastic gradient descent with momentum. The initial learning rate is 0.1, and it decays following a cosine schedule. The momentum is 0.9, and weight decay is . Next, we train the architecture distribution parameter on the remaining 20% of the training set with the Adam optimizer [10] with a learning rate of and weight decay of . Splitting the data between weight training and architecture parameter training ensures that the architecture generalizes to the validation dataset. To control the Gumbel Softmax in (8), we use an initial temperature of 5.0 and exponentially anneal it by every epoch. For the loss function in (2), we set to 0.2 and to 0.6. We use the standard ResNet data augmentation [4] to process the input images. We found that at the beginning of training, operators are usually not sufficiently trained, so their contributions to the accuracy are not clear. However, their costs always differ significantly from each other. As a consequence, the super net may always pick low-cost operators at the beginning of training. To prevent this, we postpone the training of the architecture parameter by 10 epochs to allow the operator weights to be sufficiently trained first. At the end of the super net training, we sample 6 architectures from the final distribution to be trained from scratch.
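The per-epoch schedule described above can be sketched as follows. This is an illustration under stated assumptions rather than our training code: the cosine schedule is evaluated once per epoch and decays to zero, epochs are counted from 0, and the annealing factor `TEMP_DECAY` is a placeholder, not the value used in our experiments. All function and constant names are hypothetical.

```python
import math

TOTAL_EPOCHS = 90         # super net training length
BASE_LR = 0.1             # initial SGD learning rate for operator weights
INIT_TEMPERATURE = 5.0    # initial Gumbel Softmax temperature
ARCH_WARMUP_EPOCHS = 10   # epochs before architecture parameters are trained
TEMP_DECAY = 0.95         # ASSUMPTION: placeholder per-epoch annealing factor


def cosine_lr(epoch, total_epochs=TOTAL_EPOCHS, base_lr=BASE_LR):
    """Cosine decay from base_lr at epoch 0 towards 0 at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


def temperature(epoch, init=INIT_TEMPERATURE, decay=TEMP_DECAY):
    """Exponential annealing: multiply the temperature by decay each epoch."""
    return init * decay ** epoch


def train_arch_params(epoch, warmup=ARCH_WARMUP_EPOCHS):
    """Architecture parameters are frozen during the first warmup epochs."""
    return epoch >= warmup


def supernet_schedule():
    """Per-epoch settings for the two alternating training phases."""
    return [
        {
            "epoch": e,
            "weight_lr": cosine_lr(e),           # phase 1: 80% split, SGD
            "temperature": temperature(e),
            "train_arch": train_arch_params(e),  # phase 2: 20% split, Adam
        }
        for e in range(TOTAL_EPOCHS)
    ]
```

In each epoch, the operator weights would be updated on the 80% split with `weight_lr`, and the architecture parameters would be updated on the 20% split only when `train_arch` is true.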
To train the sampled architectures, we use different training protocols for different models. Here we describe the training protocol for FBNet-{A, B, C}. These models have an input resolution of 224 and a channel size scaling of 1.0. We train the models with a batch size of 256 on 8 GPUs for 360 epochs. We set the initial learning rate to 0.1 and decay it by 10x at 90, 180, and 270 epochs. The momentum is 0.9, weight decay is . We use dropout at the last convolution layer of the network, and the dropout ratio is 0.2. We use the standard GoogLeNet data augmentation [19] to randomly resize the image during training.
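The step learning-rate schedule above can be written as a short function. This is a minimal sketch that assumes epochs are counted from 0 and that each decay takes effect at the start of the milestone epoch; the function name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr=0.1, milestones=(90, 180, 270), factor=0.1):
    """Learning rate for a given epoch under 10x step decay.

    The rate is multiplied by `factor` once for every milestone that
    has been reached, so it drops at epochs 90, 180, and 270.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# Learning rate for each of the 360 training epochs.
schedule = [step_lr(e) for e in range(360)]
```

Under these assumptions the rate is 0.1 for epochs 0 to 89 and is divided by 10 at each milestone, so the last 90 epochs are trained at one thousandth of the initial rate.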