MobileNetV3 in pytorch and ImageNet pretrained models
We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm, and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small, which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder, Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 15% compared to MobileNetV2. MobileNetV3-Small is 4.6% more accurate compared to MobileNetV2. MobileNetV3-Large detection is 25% faster at roughly the same accuracy as MobileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 30% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.
Efficient neural networks are becoming ubiquitous in mobile applications enabling entirely new on-device experiences. They are also a key enabler of personal privacy allowing a user to gain the benefits of neural networks without needing to send their data to the server to be evaluated. Advances in neural network efficiency not only improve user experience via higher accuracy and lower latency, but also help preserve battery life through reduced power consumption.
This paper describes the approach we took to develop MobileNetV3 Large and Small models in order to deliver the next generation of high accuracy efficient neural network models to power on-device computer vision. The new networks push the state of the art forward and demonstrate how to blend automated search with novel architecture advances to build effective models.
The goal of this paper is to develop the best possible mobile computer vision architectures optimizing the accuracy-latency trade off on mobile devices. To accomplish this we introduce (1) complementary search techniques, (2) new efficient versions of nonlinearities practical for the mobile setting, (3) new efficient network design, (4) a new efficient segmentation decoder. We present thorough experiments demonstrating the efficacy and value of each technique evaluated on a wide range of use cases and mobile phones.
The paper is organized as follows. We start with a discussion of related work in Section 2. Section 3 reviews the efficient building blocks used for mobile models. Section 4 reviews architecture search and the complementary nature of the MnasNet and NetAdapt algorithms. Section 5 describes novel architecture design improving on the efficiency of the models found through the joint search. Section 6 presents extensive experiments for classification, detection and segmentation in order to demonstrate efficacy and understand the contributions of different elements. Section 7 contains conclusions and future work.
Designing deep neural network architecture for the optimal trade-off between accuracy and efficiency has been an active research area in recent years. Both novel hand-crafted structures and algorithmic neural architecture search have played important roles in advancing this field.
SqueezeNet extensively uses 1x1 convolutions with squeeze and expand modules, primarily focusing on reducing the number of parameters. More recent work shifts the focus from reducing parameters to reducing the number of operations (MAdds) and the actual measured latency. MobileNetV1 employs depthwise separable convolution to substantially improve computation efficiency. MobileNetV2 expands on this by introducing a resource-efficient block with inverted residuals and linear bottlenecks. ShuffleNet utilizes group convolution and channel shuffle operations to further reduce the MAdds. CondenseNet learns group convolutions at the training stage to keep useful dense connections between layers for feature re-use. ShiftNet proposes the shift operation interleaved with point-wise convolutions to replace expensive spatial convolutions.
To automate the architecture design process, reinforcement learning (RL) was first introduced to search efficient architectures with competitive accuracy [53, 54, 3, 27, 35]. A fully configurable search space can grow exponentially large and intractable, so early works on architecture search focus on cell-level structure search, and the same cell is reused in all layers. Recently, MnasNet explored a block-level hierarchical search space allowing different layer structures at different resolution blocks of a network. To reduce the computational cost of search, a differentiable architecture search framework with gradient-based optimization is used in [28, 5, 45]. Focusing on adapting existing networks to constrained mobile platforms, [48, 15, 12] proposed more efficient automated network simplification algorithms.
Quantization [23, 25, 47, 41, 51, 52, 37] is another important complementary effort to improve network efficiency through reduced precision arithmetic. Finally, knowledge distillation [4, 17] offers an additional complementary method to generate small accurate "student" networks with the guidance of a large "teacher" network.
Mobile models have been built on increasingly more efficient building blocks. MobileNetV1 introduced depthwise separable convolutions as an efficient replacement for traditional convolution layers. Depthwise separable convolutions effectively factorize traditional convolution by separating spatial filtering from the feature generation mechanism. Depthwise separable convolutions are defined by two separate layers: a lightweight depthwise convolution for spatial filtering and heavier 1x1 pointwise convolutions for feature generation.
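The saving from this factorization can be made concrete with a small multiply-add (MAdds) count. The following is a minimal sketch, not code from the paper: the function names and the example layer shape (56x56 output, 64 to 128 channels, 3x3 kernel) are assumptions chosen only for illustration.

```python
def standard_conv_madds(h, w, c_in, c_out, k):
    """Multiply-adds of a dense k x k convolution with an h x w x c_out output."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_madds(h, w, c_in, c_out, k):
    """Multiply-adds of a k x k depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = h * w * c_in * k * k   # spatial filtering, one filter per channel
    pointwise = h * w * c_in * c_out   # feature generation across channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    dense = standard_conv_madds(56, 56, 64, 128, 3)
    separable = depthwise_separable_madds(56, 56, 64, 128, 3)
    print(f"dense: {dense}, separable: {separable}, ratio: {dense / separable:.2f}")
```

For a 3x3 kernel the reduction factor approaches 9 as the number of output channels grows, which is why the pointwise convolution is the heavier of the two layers.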
MobileNetV2 introduced the linear bottleneck and inverted residual structure in order to make even more efficient layer structures by leveraging the low rank nature of the problem. This structure is shown in Figure 3 and is defined by a 1x1 expansion convolution followed by depthwise convolutions and a 1x1 projection layer. The input and output are connected with a residual connection if and only if they have the same number of channels. This structure maintains a compact representation at the input and the output while expanding to a higher-dimensional feature space internally to increase the expressiveness of nonlinear per-channel transformations.
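As a rough illustration of the block described above, the sketch below counts the multiply-adds of the three layers and decides whether a residual connection applies. It is a simplified model under stated assumptions: the function name and example sizes are hypothetical, the input size is assumed divisible by the stride, and the extra stride-1 requirement for the shortcut reflects common implementations rather than a statement in this text.

```python
def inverted_residual_madds(h, w, c_in, c_exp, c_out, kernel=3, stride=1):
    """Return (multiply-adds, uses_residual) for one inverted residual block.

    h and w are the input spatial size and are assumed divisible by stride.
    """
    h_out, w_out = h // stride, w // stride
    expand = h * w * c_in * c_exp                        # 1x1 expansion convolution
    depthwise = h_out * w_out * c_exp * kernel * kernel  # depthwise spatial filtering
    project = h_out * w_out * c_exp * c_out              # 1x1 linear projection
    # Assumption: the shortcut needs matching channels and, as in common
    # implementations, an unchanged spatial size (stride 1).
    uses_residual = stride == 1 and c_in == c_out
    return expand + depthwise + project, uses_residual


if __name__ == "__main__":
    # Hypothetical block: 14x14 input, 80 -> 480 -> 80 channels.
    print(inverted_residual_madds(14, 14, 80, 480, 80))
```

The two 1x1 convolutions dominate the cost, while the depthwise layer on the widest representation stays comparatively cheap.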
MnasNet built upon the MobileNetV2 structure by introducing lightweight attention modules based on squeeze and excitation into the bottleneck structure. Note that the squeeze and excitation modules are integrated in a different location than in the originally proposed ResNet-based modules. The module is placed after the depthwise filters in the expansion in order for attention to be applied on the largest representation, as shown in Figure 4.
For MobileNetV3, we use a combination of these layers as building blocks in order to build the most effective models. Layers are also upgraded with modified nonlinearities [36, 13, 16]. Both squeeze and excitation and the swish nonlinearity use the sigmoid, which can be inefficient to compute as well as challenging to keep accurate in fixed point arithmetic, so we replace it with the hard sigmoid [2, 11] as discussed in section 5.2.
Network search has shown itself to be a very powerful tool for discovering and optimizing network architectures [53, 43, 5, 48]. For MobileNetV3 we use platform-aware NAS to search for the global network structures by optimizing each network block. We then use the NetAdapt algorithm to search per layer for the number of filters. These techniques are complementary and can be combined to effectively find optimized models for a given hardware platform.
Similar to MnasNet, we employ a platform-aware neural architecture search approach to find the global network structures. Since we use the same RNN-based controller and the same factorized hierarchical search space, we find similar results as MnasNet for Large mobile models with target latency around 80ms. Therefore, we simply reuse the same MnasNet-A1 as our initial Large mobile model, and then apply NetAdapt and other optimizations on top of it.
However, we observe the original reward design is not optimized for small mobile models. Specifically, it uses a multi-objective reward $ACC(m) \times [LAT(m)/TAR]^w$ to approximate Pareto-optimal solutions, by balancing model accuracy $ACC(m)$ and latency $LAT(m)$ for each model $m$ based on the target latency $TAR$. We observe that accuracy changes much more dramatically with latency for small models; therefore, we need a smaller weight factor $w = -0.15$ (vs the original $w = -0.07$ in MnasNet) to compensate for the larger accuracy change for different latencies. Enhanced with this new weight factor $w$, we start a new architecture search from scratch to find the initial seed model and then apply NetAdapt and other optimizations to obtain the final MobileNetV3-Small model.
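To see how the weight factor changes the trade-off, the sketch below evaluates the MnasNet-style multi-objective reward, commonly written as ACC(m) × [LAT(m)/TAR]^w. The function name and the accuracy and latency numbers are assumptions chosen for illustration, not measurements from the paper.

```python
def search_reward(accuracy, latency_ms, target_ms, w):
    """Multi-objective reward ACC * (LAT / TAR) ** w, with w < 0."""
    if latency_ms <= 0 or target_ms <= 0:
        raise ValueError("latencies must be positive")
    return accuracy * (latency_ms / target_ms) ** w


if __name__ == "__main__":
    # Hypothetical candidate: 70% accuracy at twice the target latency.
    for w in (-0.07, -0.15):
        print(w, round(search_reward(0.70, 40.0, 20.0, w), 4))
```

A more negative `w` penalizes the same latency overshoot more strongly, so a slow candidate needs a larger accuracy gain to score as well as a fast one.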
The second technique that we employ in our architecture search is NetAdapt. This approach is complementary to platform-aware NAS: it allows fine-tuning of individual layers in a sequential manner, rather than trying to infer a coarse but global architecture. We refer to the original paper for the full details. In short, the technique proceeds as follows:
1. Start with a seed network architecture found by platform-aware NAS.
2. For each step:
(a) Generate a set of new proposals. Each proposal represents a modification of an architecture that generates at least $\delta$ reduction in latency compared to the previous step.
(b) For each proposal we use the pre-trained model from the previous step and populate the new proposed architecture, truncating and randomly initializing missing weights as appropriate. Fine-tune each proposal for $T$ steps to get a coarse estimate of the accuracy.
(c) Select the best proposal according to some metric.
3. Iterate the previous step until the target latency is reached.
In the original NetAdapt algorithm, the metric was to minimize the accuracy change. We modify this algorithm and optimize the ratio between accuracy change and latency change. That is, for all proposals generated during each NetAdapt step, we pick the one that maximizes $\frac{\Delta \text{Acc}}{|\Delta \text{latency}|}$, with $\Delta \text{latency}$ satisfying the constraint in 2(a). The intuition is that because our proposals are discrete, we prefer proposals that maximize the slope of the trade-off curve.
This process is repeated until the latency reaches its target, and then we re-train the new architecture from scratch. We use the same proposal generator as was used in NetAdapt for MobileNetV2. Specifically, we allow the following two types of proposals:
Reduce the size of any expansion layer;
Reduce bottleneck in all blocks that share the same bottleneck size - to maintain residual connections.
For our experiments we used $T = 10000$ and find that while it increases the accuracy of the initial fine-tuning of the proposals, it does not change the final accuracy when trained from scratch. We set $\delta = 0.01|L|$, where $L$ is the latency of the seed model.
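The selection rule of a single NetAdapt step, as modified here, can be sketched as follows. This is an illustrative outline only: the `Proposal` class, the function name, and the example numbers are hypothetical, and proposal generation and fine-tuning are assumed to have happened elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    name: str
    accuracy: float    # coarse accuracy after short fine-tuning
    latency_ms: float  # measured latency of the proposed architecture


def select_proposal(proposals, prev_accuracy, prev_latency_ms, min_reduction_ms):
    """Pick the proposal maximizing delta_acc / |delta_latency|.

    Only proposals that reduce latency by at least min_reduction_ms are
    considered. Returns None if no proposal qualifies.
    """
    if min_reduction_ms <= 0:
        raise ValueError("min_reduction_ms must be positive")
    best, best_ratio = None, float("-inf")
    for p in proposals:
        delta_latency = p.latency_ms - prev_latency_ms
        if -delta_latency < min_reduction_ms:
            continue  # does not satisfy the latency-reduction constraint
        ratio = (p.accuracy - prev_accuracy) / abs(delta_latency)
        if ratio > best_ratio:
            best, best_ratio = p, ratio
    return best


if __name__ == "__main__":
    # Hypothetical numbers: previous model at 71.0% accuracy and 60 ms.
    candidates = [
        Proposal("shrink_expansion", 70.0, 58.0),
        Proposal("shrink_bottleneck", 70.5, 59.2),
        Proposal("tiny_change", 70.9, 59.9),
    ]
    print(select_proposal(candidates, 71.0, 60.0, min_reduction_ms=0.6))
```

In this example the third candidate is rejected for saving too little latency, and the first wins because it loses the least accuracy per millisecond saved.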
In addition to network search, we also introduce several new components to the model to further improve the final model. We redesign the computationally expensive layers at the beginning and the end of the network. We also introduce a new nonlinearity, h-swish, a modified version of the recent swish nonlinearity, which is faster to compute and more quantization-friendly.
Once models are found through architecture search, we observe that some of the last layers as well as some of the earlier layers are more expensive than others. We propose some modifications to the architecture to reduce the latency of these slow layers while maintaining the accuracy. These modifications are outside of the scope of the current search space.
The first modification reworks how the last few layers of the network interact in order to produce the final features more efficiently. Current models based on MobileNetV2’s inverted bottleneck structure and variants use 1x1 convolution as a final layer in order to expand to a higher-dimensional feature space. This layer is critically important in order to have rich features for prediction. However, this comes at a cost of extra latency.
To reduce latency and preserve the high dimensional features, we move this layer past the final average pooling. This final set of features is now computed at 1x1 spatial resolution instead of 7x7 spatial resolution. The outcome of this design choice is that the computation of the features becomes nearly free in terms of computation and latency.
Once the cost of this feature generation layer has been mitigated, the previous bottleneck projection layer is no longer needed to reduce computation. This observation allows us to remove the projection and filtering layers in the previous bottleneck layer, further reducing computational complexity. The original and optimized last stages can be seen in figure 5. The efficient last stage reduces the latency by 10 milliseconds, which is 15% of the running time, and reduces the number of operations by 30 million MAdds with almost no loss of accuracy. Section 6 contains detailed results.
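The effect of moving the final 1x1 feature layer past average pooling follows directly from the multiply-add count of a pointwise convolution, which scales with the spatial area. A minimal sketch, with hypothetical channel counts (320 in, 1280 out) chosen only to illustrate the scaling:

```python
def pointwise_conv_madds(h, w, c_in, c_out):
    """Multiply-adds of a 1x1 convolution applied on an h x w feature map."""
    return h * w * c_in * c_out


if __name__ == "__main__":
    # Hypothetical channel counts; only the 7x7 vs 1x1 resolutions come from the text.
    before_pooling = pointwise_conv_madds(7, 7, 320, 1280)
    after_pooling = pointwise_conv_madds(1, 1, 320, 1280)
    print(before_pooling, after_pooling, before_pooling // after_pooling)
```

Whatever the channel counts, computing the layer at 1x1 instead of 7x7 resolution divides its cost by 49.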
Another expensive layer is the initial set of filters. Current mobile models tend to use 32 filters in a full 3x3 convolution to build initial filter banks for edge detection. Often these filters are mirror images of each other. We experimented with reducing the number of filters and using different nonlinearities to try and reduce redundancy. We settled on using the hard swish nonlinearity for this layer as it performed as well as other nonlinearities tested. We were able to reduce the number of filters to 16 while maintaining the same accuracy as 32 filters using either ReLU or swish. This saves an additional 3 milliseconds and 10 million MAdds.
While the swish nonlinearity, $\text{swish}(x) = x \cdot \sigma(x)$, improves accuracy, it comes with non-zero cost in embedded environments, as the sigmoid function is much more expensive to compute on mobile devices. We deal with this problem in two ways.
We replace the sigmoid function with its piece-wise linear hard analog, $\frac{\text{ReLU6}(x+3)}{6}$, similar to [11, 44]. The minor difference is that we use ReLU6 rather than a custom clipping constant. Similarly, the hard version of swish becomes $\text{h-swish}[x] = x \frac{\text{ReLU6}(x+3)}{6}$.
A similar version of hard-swish was also recently proposed. The comparison of the soft and hard versions of the sigmoid and swish nonlinearities is shown in figure 6. Our choice of constants was motivated by simplicity and being a good match to the original smooth version. In our experiments, we found the hard versions of all these functions to have no discernible difference in accuracy, but multiple advantages from a deployment perspective. First, optimized implementations of ReLU6 are available on virtually all software and hardware frameworks. Second, in quantized mode, it eliminates potential numerical precision loss caused by different implementations of the approximate sigmoid. Finally, even optimized implementations of quantized sigmoid tend to be far slower than their ReLU counterparts. In our experiments, replacing h-swish with swish in quantized mode increased inference latency by 15%. (In floating point mode, memory access dominates the latency cost.)
The cost of applying a nonlinearity decreases as we go deeper into the network, since each layer's activation memory typically halves every time the resolution drops. Incidentally, we find that most of the benefits of swish are realized by using it only in the deeper layers. Thus in our architectures we only use h-swish in the second half of the model. We refer to tables 1 and 2 for the precise layout.
Even with these optimizations, h-swish still introduces some latency cost. However, as we demonstrate in section 6, the net effect on accuracy and latency is positive, and it provides an avenue for further software optimization: once the smooth sigmoid is replaced by a piece-wise linear function, most of the overhead is in memory accesses, which could be eliminated by fusing the nonlinearities with the previous layers.
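The hard nonlinearities discussed in this section are simple enough to write out directly. The following scalar sketch uses only the Python standard library and assumes the usual definitions ReLU6(x) = min(max(x, 0), 6), hard sigmoid = ReLU6(x + 3) / 6, and h-swish(x) = x · ReLU6(x + 3) / 6. It is meant to show how closely the hard versions track the smooth ones, not to stand in for an optimized kernel.

```python
import math


def relu6(x):
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def hard_sigmoid(x):
    """Piece-wise linear approximation of the sigmoid."""
    return relu6(x + 3.0) / 6.0


def hard_swish(x):
    """Piece-wise approximation of swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    return x * sigmoid(x)


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:+.1f}  swish={swish(x):+.4f}  h-swish={hard_swish(x):+.4f}")
```

Outside the interval [-3, 3] the hard version is exactly 0 or exactly x, which is part of what makes it friendly to fixed point arithmetic.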
In MnasNet, the size of the squeeze-and-excite bottleneck was relative to the size of the convolutional bottleneck. Instead, we fix them all to be 1/4 of the number of channels in the expansion layer. We find that doing so increases the accuracy, at a modest increase in the number of parameters and with no discernible latency cost.
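To give a feel for the "modest increase of number of parameters", the sketch below computes the squeeze-and-excite bottleneck width and parameter count when the bottleneck is fixed to 1/4 of the expansion channels. It assumes the module consists of two fully connected layers with biases and uses plain integer division; the function names are hypothetical, and real implementations may round channel counts differently.

```python
def se_bottleneck_channels(expansion_channels, reduction=4):
    """Width of the squeeze-and-excite bottleneck (1/4 of the expansion size)."""
    return max(1, expansion_channels // reduction)


def se_parameter_count(expansion_channels, reduction=4):
    """Parameters of two fully connected layers with biases (an assumed layout)."""
    squeeze = se_bottleneck_channels(expansion_channels, reduction)
    reduce_fc = expansion_channels * squeeze + squeeze
    expand_fc = squeeze * expansion_channels + expansion_channels
    return reduce_fc + expand_fc


if __name__ == "__main__":
    # Hypothetical expansion sizes.
    for channels in (72, 240, 960):
        print(channels, se_bottleneck_channels(channels), se_parameter_count(channels))
```

Because the module acts on globally pooled features, its compute cost is independent of the spatial resolution, in line with the negligible latency impact reported above.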
MobileNetV3 is defined as two models: MobileNetV3-Large and MobileNetV3-Small. These models are targeted at high and low resource use cases respectively. The models are created through applying platform-aware NAS and NetAdapt for network search and incorporating the network improvements defined in this section. See table 1 and 2 for full specification of our networks.
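Table 1: Specification for MobileNetV3-Large (final rows). NBN denotes no batch normalization.
|Operator||exp size||#out||SE||NL||s|
|conv2d 1x1, NBN||-||1280||-||HS||1|
|conv2d 1x1, NBN||-||k||-||-||1|
Table 2: Specification for MobileNetV3-Small (final rows).
|Operator||exp size||#out||SE||NL||s|
|conv2d 1x1, NBN||-||1280||-||HS||1|
|conv2d 1x1, NBN||-||k||-||-||1|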
We present experimental results to demonstrate the effectiveness of the new MobileNetV3 models. We report results on classification, detection and segmentation. We also report various ablation studies to shed light on the effects of various design decisions.
As has become standard, we use ImageNet for all our classification experiments and compare accuracy versus various measures of resource usage such as latency and multiply adds (MAdds).
We train our models using a synchronous training setup on a 4x4 TPU Pod using the standard TensorFlow RMSPropOptimizer with 0.9 momentum. We use an initial learning rate of 0.1, with batch size 4096 (128 images per chip), and a learning rate decay rate of 0.01 every 3 epochs. We use dropout of 0.8, l2 weight decay of 1e-5, and the same image preprocessing as Inception. Finally, we use an exponential moving average with decay 0.9999. All our convolutional layers use batch-normalization layers with average decay of 0.99.
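The exponential moving average of the weights mentioned above follows the standard update shadow = decay · shadow + (1 − decay) · value. The sketch below is a plain-Python illustration of that update with the stated decay of 0.9999. It is not the TensorFlow implementation, and the function names are hypothetical.

```python
def ema_update(shadow, value, decay=0.9999):
    """One exponential-moving-average step for a single scalar weight."""
    return decay * shadow + (1.0 - decay) * value


def ema_update_all(shadow_weights, weights, decay=0.9999):
    """Apply the EMA step element-wise to two equally long lists of weights."""
    if len(shadow_weights) != len(weights):
        raise ValueError("weight lists must have the same length")
    return [ema_update(s, v, decay) for s, v in zip(shadow_weights, weights)]


if __name__ == "__main__":
    shadow = 0.0
    for _ in range(10000):          # hypothetical number of training steps
        shadow = ema_update(shadow, 1.0)
    print(round(shadow, 4))         # fraction of the way towards the constant value 1.0
```

With a decay this close to 1, the averaged weights change slowly and effectively summarize roughly the last ten thousand training steps.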
To measure latencies we use standard Google Pixel phones and run all networks through the standard TFLite Benchmark Tool. We use single-threaded large core in all our measurements. We don’t report multi-core inference time, since we find this setup not very practical for mobile applications.
As can be seen in figure 1, our models outperform the current state of the art such as MnasNet, ProxylessNAS and MobileNetV2. We report the floating point performance on different Pixel phones in table 3. We include quantization results in table 4.
In figure 7 we show the MobileNetV3 performance trade-offs as a function of multiplier and resolution. Note how MobileNetV3-Small outperforms the MobileNetV3-Large with multiplier scaled to match the performance by nearly 3%. On the other hand, resolution provides even better trade-offs than multiplier. However, it should be noted that resolution is often determined by the problem (e.g. segmentation and detection problems generally require higher resolution), and thus can't always be used as a tunable parameter.
In table 5 and figure 8 we show how the decision of where to insert h-swish affects the latency. Of particular importance, we note that using h-swish on the entire network results in a slight increase in accuracy (0.2), while adding nearly 20% in latency, and again ending up under the efficient frontier.
On the other hand, using h-swish moves the efficient frontier up compared to ReLU despite still being about 12% more expensive. Finally, we note that as h-swish gets optimized by fusing it into the convolutional operator, we expect the latency gap between h-swish and ReLU to drop significantly, if not disappear. However, such improvement can't be expected between h-swish and swish, since computing the sigmoid is inherently more expensive.
|Top 1||Latency P-1|
|74.5 (-.7%)||59 (-12%)|
|75.4 (+.2 %)||78 (+20%)|
|75.0 (-.3%)||64 (-3%)|
In figure 9 we show how introduction of different components moved along the latency/accuracy curve.
Following MobileNetV2, we attach the first layer of SSDLite to the last feature extractor layer that has an output stride of 16, and attach the second layer of SSDLite to the last feature extractor layer that has an output stride of 32. Following the detection literature, we refer to these two feature extractor layers as C4 and C5, respectively. For MobileNetV3-Large, C4 is the expansion layer of the 13-th bottleneck block. For MobileNetV3-Small, C4 is the expansion layer of the 9-th bottleneck block. For both networks, C5 is the layer immediately before pooling.
We additionally reduce the channel counts of all feature layers between C4 and C5 by a factor of 2. This is because the last few layers of MobileNetV3 are tuned to output 1000 classes, which may be redundant when transferred to COCO with 90 classes.
The results on the COCO test set are given in Tab. 6. With the channel reduction, MobileNetV3-Large is 25% faster than MobileNetV2 with near identical mAP. MobileNetV3-Small with channel reduction also attains higher mAP than MobileNetV2 and MnasNet at similar latency. For both MobileNetV3 models the channel reduction trick contributes a latency reduction with no mAP loss, suggesting that ImageNet classification and COCO object detection may prefer different feature extractor shapes.
|Backbone||mAP||Latency (ms)||Params (M)||MAdd (B)|
In this subsection, we employ MobileNetV2 and the proposed MobileNetV3 as network backbones for the task of mobile semantic segmentation. Additionally, we compare two segmentation heads. The first one, referred to as R-ASPP, was proposed together with MobileNetV2. R-ASPP is a reduced design of the Atrous Spatial Pyramid Pooling module [7, 8, 9], which adopts only two branches consisting of a 1x1 convolution and a global-average pooling operation [29, 50]. In this work, we propose another light-weight segmentation head, referred to as Lite R-ASPP (or LR-ASPP), as shown in Fig. 10. Lite R-ASPP, improving over R-ASPP, deploys the global-average pooling in a fashion similar to the Squeeze-and-Excitation module, in which we employ a large pooling kernel with a large stride (to save some computation) and only one 1x1 convolution in the module. We apply atrous convolution [18, 40, 33, 6] to the last block of MobileNetV3 to extract denser features, and further add a skip connection from low-level features to capture more detailed information.
We conduct the experiments on the Cityscapes dataset with the mIOU metric, and only exploit the 'fine' annotations. We employ the same training protocol as [8, 39]. All our models are trained from scratch without pretraining on ImageNet, and are evaluated with a single-scale input. Similar to object detection, we observe that we could reduce the channels in the last block of the network backbone by a factor of 2 without degrading the performance significantly. We think this is because the backbone is designed for 1000-class ImageNet image classification while there are only 19 classes on Cityscapes, implying there is some channel redundancy in the backbone.
We report our Cityscapes validation set results in Tab. 7. As shown in the table, we observe that (1) reducing the channels in the last block of network backbone by a factor of 2 significantly improves the speed while maintaining similar performances (row 1 vs. row 2, and row 5 vs. row 6), (2) the proposed segmentation head LR-ASPP is slightly faster than R-ASPP  while performance is improved (row 2 vs. row 3, and row 6 vs. row 7), (3) reducing the filters in the segmentation head from 256 to 128 improves the speed at the cost of slightly worse performance (row 3 vs. row 4, and row 7 vs. row 8), (4) when employing the same setting, MobileNetV3 model variants attain similar performance while being slightly faster than MobileNetV2 counterparts (row 1 vs. row 5, row 2 vs. row 6, row 3 vs. row 7, and row 4 vs. row 8), (5) MobileNetV3-Small attains similar performance as MobileNetV2-0.5 while being faster, and (6) MobileNetV3-Small is significantly better than MobileNetV2-0.35 while yielding similar speed.
Tab. 8 shows our Cityscapes test set results. Our segmentation models with MobileNetV3 as the network backbone significantly outperform ESPNetv2, CCC2, and ESPNetv1 by 10.5%, 10.6%, and 12.3%, respectively, while being faster in terms of MAdds. The performance drops slightly by 0.6% when not employing atrous convolution to extract dense feature maps in the last block of MobileNetV3, but the speed is improved to 1.98B (for half-resolution input), which is 1.7, 1.59, and 2.24 times faster than ESPNetv2, CCC2, and ESPNetv1, respectively. Furthermore, our models with MobileNetV3-Small as the network backbone still outperform all of them by a healthy margin of at least 6.2%. Our fastest model variant is 13.6% better than ESPNetv2-small with a slightly faster inference speed.
|N||Backbone||RF2||SH||F||mIOU||Params||Madds||CPU (f)||CPU (h)|
|Backbone||OS||mIOU||Madds (f)||Madds (h)||CPU (f)||CPU (h)|
|ESPNetv2 small ||-||54.7||2.26B||0.56B||-||-|
In this paper we introduced MobileNetV3 Large and Small models demonstrating new state of the art in mobile classification, detection and segmentation. We have described our efforts to harness multiple types of network architecture search as well as advances in network design to deliver the next generation of mobile models. We have also shown how to adapt nonlinearities like swish and apply squeeze and excite in a quantization friendly and efficient manner introducing them into the mobile model domain as effective tools. We also introduced a new form of lightweight segmentation decoders called LR-ASPP. While it remains an open question of how best to blend automatic search techniques with human intuition, we are pleased to present these first positive results and will continue to refine methods as future work.
We would like to thank Dmitry Kalenichenko, Menglong Zhu, Jon Shlens, Xiao Zhang, Benoit Jacob, Alex Stark, Achille Brighton and Sergey Ioffe for helpful feedback and discussion.
TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
Semantic segmentation of satellite images using a modified cnn with hard-swish activation function. In VISIGRAPP, 2019.
The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
NIPS Deep Learning and Representation Learning Workshop, 2015.
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, NIPS, pages 963–971, 2014.
We give a detailed table containing multiply-adds, accuracy, parameter count and latency in Table 9.