This repo contains code for *FD-MobileNet: Improved MobileNet with A Fast Downsampling Strategy*.
We present Fast-Downsampling MobileNet (FD-MobileNet), an efficient and accurate network for very limited computational budgets (e.g., 10-140 MFLOPs). Our key idea is applying an aggressive downsampling strategy to the MobileNet framework. In FD-MobileNet, we perform 32× downsampling within 12 layers, only half the layers in the original MobileNet. This design brings three advantages: (i) It remarkably reduces the computational cost. (ii) It increases the information capacity and achieves significant performance improvements. (iii) It is engineering-friendly and provides fast actual inference speed. Experiments on the ILSVRC 2012 and PASCAL VOC 2007 datasets demonstrate that FD-MobileNet consistently outperforms MobileNet and achieves comparable results with ShuffleNet under different computational budgets, for instance, surpassing MobileNet by 5.5% on the ILSVRC 2012 top-1 accuracy and 3.6% on the VOC 2007 mAP under a complexity of 12 MFLOPs. On an ARM-based device, FD-MobileNet achieves 1.11× inference speedup over MobileNet and 1.82× over ShuffleNet under the same complexity.
Deep convolutional neural networks (CNNs) have become one of the most important methods in computer vision tasks such as image classification [1, 2, 3, 4], object detection [5, 6, 7, 8] and semantic segmentation [9, 10]. However, state-of-the-art CNNs require enormous computational resources and huge model sizes, which prevents them from being deployed on mobile or embedded devices.
For this reason, the inference-time compression and acceleration of deep neural networks has attracted the attention of the deep learning community in recent years. The related work is conventionally categorized into four classes. Tensor decomposition methods [11, 12] factorize a convolutional layer into several smaller convolutional layers, which reduces the overall complexity and the number of parameters. This class of methods conventionally involves a low-rank estimation process and a fine-tuning process, leading to a slow training procedure. Parameter quantization methods [13, 14] propose to utilize low-precision parameters in neural networks and provide significant theoretical speedup and enormous memory savings. However, current hardware is not well optimized for low-precision computation, so specific hardware is required for quantization methods to achieve an ideal speedup. Network pruning methods [15, 16] attempt to discover and alleviate parameter and structure redundancy in deep neural networks. Early pruning approaches adopt an unstructured pruning scheme and induce random memory accesses, which are not well supported by current hardware. Recent research on network pruning mainly focuses on structured pruning to leverage existing hardware. Finally, compact networks [17, 18, 19] are specifically designed to be both accurate and computationally economical on mobile or embedded devices.
Unlike the other methods, which mainly focus on compressing pre-trained models, compact networks can be trained from scratch. Additionally, compact networks are orthogonal to the other methods and can be further accelerated by them. In view of these advantages, various compact network architectures have been proposed. Among these networks, MobileNet and ShuffleNet achieve state-of-the-art performance.
ShuffleNet is composed of a variant of the bottleneck unit named the ShuffleNet unit. The ShuffleNet unit utilizes bypass connections for better representation capability. Benefiting from the powerful ShuffleNet unit, ShuffleNet achieves significant performance improvements over previous architectures [17, 18]. However, the bypass connection structure introduces multiple information paths in the computing graph, which induces frequent memory/cache switches in the engineering implementation on mobile or embedded devices. Consequently, the actual inference speed of ShuffleNet on physical devices is not ideal.
On the contrary, MobileNet exploits depthwise separable convolutions as its building blocks in a simple stacking architecture. This design allows a more efficient utilization of memory and cache, and MobileNet is significantly faster than ShuffleNet in actual inference speed under the same complexity. However, MobileNet adopts a slow downsampling strategy, which induces severe performance degradation when the computational budget is relatively small, for instance, 10-140 MFLOPs. In such a slow downsampling strategy, more layers have large feature maps, so the feature representation is more detailed. However, the number of channels in the network is restricted, thus the information capacity is relatively small. If the width of a network is further shrunk to fit an extremely limited complexity, the information capacity will become too small and the performance of the network will collapse.
In this paper, we present a highly efficient and accurate network named Fast-Downsampling MobileNet (FD-MobileNet) for extremely limited computational resources (e.g., 10 to 140 MFLOPs). Instead of merely shrinking the width of the network to fit small computational budgets, we compose FD-MobileNet by adopting a fast downsampling strategy in the MobileNet framework. In the proposed FD-MobileNet, we perform 32× downsampling within the first 12 layers, which is only half of the number in the original MobileNet. After that, a sequence of depthwise separable convolutions is applied for better representation capability. Benefiting from the fast downsampling strategy, FD-MobileNet has the following three advantages: (i) The computational cost of FD-MobileNet is reduced as the spatial dimensions of the feature maps are smaller. (ii) FD-MobileNet allows more channels than the MobileNet counterpart under the same complexity. This remarkably increases the information capacity of FD-MobileNet, which is critical to the performance of very small networks. (iii) FD-MobileNet inherits the simple architecture from MobileNet and provides a fast inference speed in engineering implementation.
We conduct extensive experiments to examine the effectiveness of the proposed FD-MobileNet. First, we compare FD-MobileNet with other state-of-the-art compact networks on the ILSVRC 2012 dataset. Then, we examine the generalization ability of FD-MobileNet on the PASCAL VOC 2007 dataset. Experiments show that the proposed FD-MobileNet significantly outperforms MobileNet and achieves comparable performance with ShuffleNet under various computational budgets. For instance, FD-MobileNet achieves improvements of 5.5% on the ILSVRC 2012 top-1 accuracy and 3.6% on the VOC 2007 mAP over MobileNet under the computational budget of 12 MFLOPs. Finally, we evaluate the actual inference speed of FD-MobileNet on an ARM-based device. Under a complexity of 12 MFLOPs, FD-MobileNet provides a 1.11× speedup over MobileNet and 1.82× over ShuffleNet. Our code will be made publicly available later.
In this section, we present the design of Fast-Downsampling MobileNet (FD-MobileNet). FD-MobileNet is composed of the highly efficient depthwise separable convolutions and adopts a fast downsampling strategy. Benefiting from this design, FD-MobileNet achieves both high accuracy and high efficiency under very limited computational budgets.
Depthwise Separable Convolutions. Following MobileNet, FD-MobileNet exploits depthwise separable convolutions as the building blocks. A depthwise separable convolution factorizes a standard convolution into a depthwise convolution and a pointwise convolution, with an 8 to 9 times reduction in FLOPs for 3×3 kernels. In practice, depthwise separable convolutions can achieve comparable performance with standard convolutions while providing great efficiency on computation-limited devices.
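The size of this reduction can be checked with a short calculation. The sketch below counts multiply-accumulates for a standard convolution and for its depthwise separable factorization; the function names and the example layer shapes are illustrative assumptions, not taken from the paper. The ratio works out to 1/C_out + 1/k², so for 3×3 kernels it approaches 1/9 as the number of output channels grows.

```python
def standard_conv_macs(k, c_in, c_out, h, w):
    """Multiply-accumulates of a k x k standard convolution on an h x w output map."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k, c_in, c_out, h, w):
    """Multiply-accumulates of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


def reduction_factor(k, c_in, c_out, h, w):
    """How many times cheaper the depthwise separable version is."""
    return standard_conv_macs(k, c_in, c_out, h, w) / depthwise_separable_macs(k, c_in, c_out, h, w)


if __name__ == "__main__":
    # Hypothetical layer shapes, chosen only for illustration.
    for channels in (128, 256, 512, 1024):
        factor = reduction_factor(3, channels, channels, 14, 14)
        print(f"{channels:4d} channels: {factor:.2f}x fewer multiply-accumulates")
```

For the channel counts above the factor stays between 8 and 9, which matches the range quoted in the text. With very few output channels the 1/C_out term is no longer negligible and the saving is smaller.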
Fast Downsampling Strategy. Modern CNNs adopt a hierarchical architecture, where the spatial dimensions of the layers within the same stage are kept identical, and the spatial dimensions in the next stage are reduced by downsampling. In view of the restricted computational budgets, compact networks suffer from both weak feature representation and restricted information capacity. Different downsampling strategies provide a trade-off between detailed feature representation and large information capacity for compact networks. In a slow downsampling strategy, downsampling is performed in the later layers of the network, thus more layers have large spatial dimensions. In contrast, in a fast downsampling strategy downsampling is performed at the beginning of the network, which significantly reduces the computational cost. Consequently, given a fixed computational budget, a slow downsampling strategy is inclined to generate more detailed features, whereas a fast downsampling strategy can increase the number of channels and allows more information to be encoded.
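The trade-off between resolution and width can be made concrete with a simplified model. The sketch below considers a single pointwise (1×1) convolution with equal input and output width, whose cost is side² · channels², and asks how wide it can be under a fixed budget. The budget value and the helper name `max_width` are assumptions chosen for illustration only.

```python
import math


def max_width(budget_macs, side):
    """Largest channel count c such that a 1x1 convolution with c input and
    c output channels on a side x side feature map fits within budget_macs."""
    return math.isqrt(budget_macs // (side * side))


if __name__ == "__main__":
    budget = 1_000_000  # hypothetical per-layer budget in multiply-accumulates
    for side in (28, 14, 7):
        print(f"{side:2d}x{side:<2d} feature map -> up to {max_width(budget, side)} channels")
```

Under this simplified model, each halving of the feature map side roughly doubles the affordable width, which is the mechanism a fast downsampling strategy exploits.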
When the computational budget is extremely small, the information capacity plays a more important role in the performance of a network. Conventionally, the number of channels is reduced to adapt a compact network architecture to a certain complexity. In the case where a slow downsampling scheme is adopted, the network becomes too narrow to encode adequate information, which induces severe performance degradation. For instance, under a complexity of 12 MFLOPs, the original MobileNet architecture only has 128 channels in the last layer before the global pooling, thus the information capacity is very limited.
Based on this insight, we propose to adopt a fast downsampling strategy in the architecture of FD-MobileNet and postpone the feature extraction process to the smallest resolution. The faster downsampling is implemented by consecutively applying depthwise separable convolutions with large strides at the beginning of the network. Here we do not use max pooling because we find that it brings no performance improvement while introducing extra computation. The proposed FD-MobileNet accepts an image with a size of 224×224 pixels, and performs 4× downsampling within the first 2 layers and 32× downsampling within merely 12 layers, whereas the original MobileNet needs 4 and 24 layers, respectively, to perform the same downsampling. More specifically, the 12 layers are composed of 1 standard convolutional layer, 5 depthwise separable convolutions (each has a depthwise convolutional layer and a pointwise convolutional layer), and 1 depthwise convolutional layer. Fig. 1 illustrates the comparison of the downsampling strategies of FD-MobileNet, MobileNet and ShuffleNet under the computational budget of 140 MFLOPs. From the figure, it is observed that FD-MobileNet is significantly shallower than the other architectures before the feature maps are shrunk to 7×7.
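The numbers in this paragraph can be reproduced with a few lines of arithmetic. The sketch below assumes that 32× downsampling is realized by five stride-2 layers and that padding is chosen so each of them exactly halves the feature map; the function name is hypothetical, and the exact position of each strided layer within the 12 layers is not modeled.

```python
def downsample_trace(input_side, num_stride2_layers):
    """Feature map side length after each stride-2 layer, assuming exact halving."""
    sides = []
    side = input_side
    for _ in range(num_stride2_layers):
        side //= 2
        sides.append(side)
    return sides


# Layer count of the downsampling part as described in the text:
# 1 standard conv + 5 depthwise separable convs (2 layers each) + 1 depthwise conv.
LAYERS_BEFORE_SMALLEST_RESOLUTION = 1 + 5 * 2 + 1

if __name__ == "__main__":
    print(downsample_trace(224, 5))           # side length after each stride-2 layer
    print(LAYERS_BEFORE_SMALLEST_RESOLUTION)  # number of layers used for 32x downsampling
```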
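Table 1: Overall architecture of FD-MobileNet 1×, with the computational cost in MFLOPs.

| Layer | MFLOPs |
|---|---|
| Conv, 32, /2 | 10.8 |
| DWConv, 32, /2 | 7.3 |
| DWConv, 64, /2 | 20.6 |
| DWConv, 128, /2 | 19.9 |
| DWConv, 256, /2 | 84.7 |
| Global Average Pooling | 1.0 |
| 1000-d fc, Softmax | |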
Remaining Layers. The fast downsampling strategy significantly reduces the computational cost of the layers before the smallest spatial dimensions (7×7). Under the computational budget of 140 MFLOPs, MobileNet spends about 129 MFLOPs on the largest 4 resolutions, whereas FD-MobileNet only spends about 59 MFLOPs, as shown in Table 1. Consequently, more layers and more channels can be leveraged in the proposed architecture. Here we exploit 6 depthwise separable convolutions to improve the representation capability of the generated features. The first 5 depthwise separable convolutions have 512 output channels, while the last one has 1024, which is twice the number in the MobileNet counterpart (0.5 MobileNet-224). The increase in the number of channels contributes to a larger information capacity, which is critical to the performance of networks under extremely limited computational resources.
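The 59 MFLOPs figure can be cross-checked against the MFLOPs column of Table 1. The sketch below assumes that the first four rows of that column correspond to the four largest resolutions, the fifth row to the 7×7 stage, and the last value to the pooling and classifier; this mapping is an interpretation of the table, not something it states explicitly.

```python
# MFLOPs values copied from Table 1, in row order.
LARGE_RESOLUTION_MFLOPS = [10.8, 7.3, 20.6, 19.9]  # assumed: four largest resolutions
SMALLEST_RESOLUTION_MFLOPS = 84.7                   # assumed: 7x7 stage
CLASSIFIER_MFLOPS = 1.0                             # global pooling and fc

cost_before_smallest = sum(LARGE_RESOLUTION_MFLOPS)
total_cost = cost_before_smallest + SMALLEST_RESOLUTION_MFLOPS + CLASSIFIER_MFLOPS

if __name__ == "__main__":
    print(f"largest 4 resolutions: {cost_before_smallest:.1f} MFLOPs")
    print(f"whole network:         {total_cost:.1f} MFLOPs")
    print(f"share spent at 7x7:    {SMALLEST_RESOLUTION_MFLOPS / total_cost:.0%}")
```

Under this reading, the first four rows sum to roughly 59 MFLOPs and the whole network to roughly 144 MFLOPs, consistent with the figures quoted for FD-MobileNet 1×, and more than half of the budget is left for the layers at the smallest resolution.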
Overall Architecture. The overall architecture of FD-MobileNet is shown in Table 1. FD-MobileNet adopts a simple stacking architecture with 24 layers, including 1 standard convolutional layer, 11 depthwise separable convolutions, and 1 fully-connected layer. Batch normalization [23] and a ReLU activation are applied after each convolutional layer. To conveniently adapt FD-MobileNet to different computational budgets, we introduce a hyper-parameter α termed width multiplier, as in MobileNet, to uniformly adjust the width of FD-MobileNet. We use the simple notation “FD-MobileNet α×” to represent a network with a width multiplier α, and the network in Table 1 is denoted as “FD-MobileNet 1×”.
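A width multiplier can be sketched as a uniform scaling of every layer's channel count. The snippet below is a minimal illustration, not the authors' implementation: the helper name `scale_channels`, the parameter name `alpha`, and the truncating rounding rule are assumptions, and only the two layer widths stated in the text (512 and 1024) are used.

```python
def scale_channels(base_channels, alpha):
    """Scale each channel count by the width multiplier alpha.

    Truncation toward zero is an assumed rounding rule; at least one
    channel is always kept.
    """
    return [max(1, int(c * alpha)) for c in base_channels]


if __name__ == "__main__":
    final_stage_widths = [512, 1024]  # widths of the last depthwise separable convs at 1x
    for alpha in (1.0, 0.5, 0.25):
        print(alpha, scale_channels(final_stage_widths, alpha))
```

With a multiplier of 0.25 the last layer keeps 256 channels, twice the 128 channels that the text reports for MobileNet at the 12 MFLOPs budget.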
Inference Efficiency. Current deep learning frameworks accomplish the inference of a neural network by building an acyclic computing graph. For mobile or embedded devices, memory and cache resources are limited. As a result, complicated computing graphs can induce frequent memory/cache switches, which slows down the actual inference speed. FD-MobileNet inherits the simple architecture of the original MobileNet, and there is only one information path in the computing graph. This makes FD-MobileNet very friendly to engineering implementation and efficient on physical devices.
We first evaluate the effectiveness of FD-MobileNet on the ILSVRC 2012 dataset. The ILSVRC 2012 dataset is composed of 1.2 million training images and 50,000 validation images. In the experiments, the networks are trained on the training set using PyTorch with four GPUs for 90 epochs. The batch size is set to 256 and a momentum of 0.9 is used. The learning rate starts from 0.1 and decays by an order of magnitude every 30 epochs. As the networks are relatively small, a weight decay of 4e-5 is utilized. For data augmentation, we adopt a slightly less aggressive multi-scale augmentation scheme without color jittering. For evaluation, the center-crop top-1 accuracy rates on the validation set are reported. Each validation image is first resized so that its shorter edge is 256 pixels, and then evaluated using the center 224×224 crop. Table 2 compares the top-1 accuracy of FD-MobileNet, MobileNet and ShuffleNet under three computational budgets.
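The step schedule described above can be written as a small function. This is a sketch under the assumption that epochs are counted from zero; the function and parameter names are hypothetical and are not taken from the released code.

```python
def learning_rate(epoch, base_lr=0.1, decay_every=30, factor=0.1):
    """Step schedule: start at base_lr and multiply by factor every decay_every epochs."""
    return base_lr * factor ** (epoch // decay_every)


if __name__ == "__main__":
    for epoch in (0, 29, 30, 59, 60, 89):
        print(f"epoch {epoch:2d}: lr = {learning_rate(epoch):.4f}")
```

Over the 90 training epochs this yields three plateaus at 0.1, 0.01 and 0.001.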
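Table 2: Top-1 accuracy on the ILSVRC 2012 validation set under three computational budgets.

| Model | MFLOPs | Top-1 Acc. (%) |
|---|---|---|
| ShuffleNet 1× | 137 | 65.9 |
| 0.5 MobileNet-224 | 149 | 63.7 |
| FD-MobileNet 1× (ours) | 144 | 65.3 |
| ShuffleNet 0.5× | 38 | 57.3 |
| 0.25 MobileNet-224 | 41 | 50.6 |
| FD-MobileNet 0.5× (ours) | 40 | 56.2 |
| ShuffleNet 0.25× | 13 | 46.7 |
| 0.125 MobileNet-224 | 12 | 39.6 |
| FD-MobileNet 0.25× (ours) | 12 | 45.1 |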
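
Table 3: Detection results (%) on the PASCAL VOC 2007 test set.

| Model | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 MobileNet-224 | 53.8 | 59.0 | 66.8 | 52.3 | 33.5 | 29.8 | 56.9 | 71.4 | 61.5 | 29.8 | 59.2 | 51.3 | 59.3 | 69.9 | 64.6 | 63.5 | 29.5 | 48.9 | 51.8 | 65.0 | 52.1 |
| FD-MobileNet 1× (ours) | 55.4 | 58.1 | 67.1 | 49.4 | 32.7 | 28.8 | 62.2 | 71.1 | 67.2 | 32.6 | 59.4 | 58.0 | 63.0 | 72.3 | 65.7 | 65.8 | 26.9 | 53.5 | 51.9 | 65.0 | 56.7 |
| 0.25 MobileNet-224 | 42.3 | 47.5 | 53.8 | 35.0 | 24.0 | 18.5 | 43.9 | 60.2 | 51.3 | 17.5 | 47.6 | 47.5 | 47.4 | 60.0 | 58.7 | 55.5 | 19.2 | 38.3 | 36.3 | 48.9 | 34.2 |
| FD-MobileNet 0.5× (ours) | 45.1 | 46.4 | 53.2 | 38.2 | 29.3 | 16.8 | 47.1 | 63.0 | 56.2 | 22.3 | 48.8 | 49.4 | 47.3 | 66.9 | 60.6 | 56.8 | 20.0 | 44.5 | 40.9 | 57.0 | 38.0 |
| 0.125 MobileNet-224 | 29.1 | 33.4 | 38.6 | 20.7 | 16.0 | 2.9 | 31.4 | 48.5 | 42.7 | 13.2 | 26.8 | 28.2 | 34.5 | 46.9 | 45.4 | 42.3 | 13.4 | 29.3 | 21.7 | 29.6 | 16.1 |
| FD-MobileNet 0.25× (ours) | 32.7 | 40.6 | 43.6 | 21.4 | 16.2 | 8.2 | 33.7 | 50.7 | 41.2 | 15.6 | 37.2 | 33.4 | 36.2 | 54.7 | 50.4 | 41.4 | 8.1 | 29.6 | 25.3 | 47.7 | 18.2 |
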
From Table 2, FD-MobileNet achieves substantial improvements over MobileNet under different computational budgets. FD-MobileNet surpasses MobileNet by a margin of 1.6% under a complexity of 140 MFLOPs, and performs 5.6% and 5.5% better when the computational budget is 40 and 12 MFLOPs, respectively. It is noteworthy that FD-MobileNet provides significant improvements over MobileNet when the computational budget is very small (e.g., 40 and 12 MFLOPs). We attribute these improvements to the effectiveness of the fast downsampling strategy in FD-MobileNet. The original MobileNet adopts a slow downsampling strategy, thus more layers have relatively large feature maps and are more computationally intensive. Consequently, MobileNet must be relatively narrow to maintain computational efficiency, which limits the information capacity. On the other hand, FD-MobileNet exploits a much faster downsampling strategy, which allows more channels to be leveraged and alleviates the information capacity degradation. For instance, under 12 MFLOPs, the last layer of MobileNet outputs only 128 channels, whereas the number in FD-MobileNet is doubled. The increase in the information capacity significantly improves the performance of FD-MobileNet.
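The margins quoted above follow directly from the top-1 accuracies reported for MobileNet and FD-MobileNet in the ILSVRC 2012 comparison. The sketch below recomputes them; the dictionary layout and the budget labels are illustrative choices, while the accuracy values are copied from that table.

```python
# (MobileNet top-1, FD-MobileNet top-1) in percent, keyed by nominal budget in MFLOPs.
TOP1_ACCURACY = {
    140: (63.7, 65.3),
    40: (50.6, 56.2),
    12: (39.6, 45.1),
}


def top1_gain(budget):
    """Absolute top-1 improvement of FD-MobileNet over MobileNet, in percentage points."""
    mobilenet, fd_mobilenet = TOP1_ACCURACY[budget]
    return round(fd_mobilenet - mobilenet, 1)


if __name__ == "__main__":
    for budget in TOP1_ACCURACY:
        print(f"{budget:3d} MFLOPs: +{top1_gain(budget)} points")
```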
Compared with ShuffleNet, FD-MobileNet achieves comparable or slightly worse results. We conjecture that these differences are due to the effectiveness of the bypass connection structure of the ShuffleNet unit. The bypass connection structure has proven powerful in various computer vision tasks [3, 5, 8]. However, on low-power mobile or embedded devices, the bypass connection structure induces frequent memory/cache switches and harms the actual inference speed. In contrast, the simple architecture of FD-MobileNet contributes to an efficient utilization of memory and cache. Details are discussed in Section 3.3.
We further conduct extensive experiments on the PASCAL VOC 2007 detection dataset to examine the generalization ability of the proposed FD-MobileNet. The PASCAL VOC 2007 dataset consists of about 10,000 images split into three (train/val/test) sets. In the experiments, the detectors are trained on the VOC 2007 trainval set, and the single-model results on the VOC 2007 test set are reported. We adopt the Faster R-CNN detection pipeline and compare the performance of FD-MobileNet and MobileNet at 600× resolution under three computational budgets (140, 40 and 12 MFLOPs). The detectors are trained for 15 epochs with a batch size of 1. The learning rate starts from 1e-3 and is divided by 10 every 5 epochs. The weight decay is set to 4e-5. Other hyper-parameter settings follow the original Faster R-CNN. During testing, 300 proposals are sent to the R-CNN subnet to generate the final predictions.
The results are compared in Table 3. FD-MobileNet achieves significant improvements over MobileNet under different computational budgets. Under the computational budget of 140 MFLOPs, the FD-MobileNet detector surpasses the MobileNet detector by a margin of 1.6% on mAP. The gap is enlarged when the complexity is lower. When the complexity is restricted to 40 and 12 MFLOPs, FD-MobileNet outperforms MobileNet by 2.8% and 3.6% on mAP, respectively. More specifically, on the per-class results, FD-MobileNet performs better than MobileNet on most classes. From Table 3, FD-MobileNet provides more significant improvements over MobileNet when the computational budget is smaller. For instance, when the computational budget is 12 MFLOPs, FD-MobileNet achieves improvements on classes that are hard for MobileNet, such as bottle (5.3%), chair (2.4%) and boat (0.2%). These improvements indicate that FD-MobileNet has strong generalization ability for transfer learning.
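Table 4: Actual inference time on an ARM-based device under three computational budgets.

| Model | MFLOPs | Inference Time (ms) |
|---|---|---|
| ShuffleNet 1× | 137 | 522.27 |
| 0.5 MobileNet-224 | 149 | 431.73 |
| FD-MobileNet 1× (ours) | 144 | 391.66 |
| ShuffleNet 0.5× | 38 | 204.97 |
| 0.25 MobileNet-224 | 41 | 155.84 |
| FD-MobileNet 0.5× (ours) | 40 | 139.47 |
| ShuffleNet 0.25× | 13 | 103.79 |
| 0.125 MobileNet-224* | 12 | 63.73 |
| FD-MobileNet 0.25× (ours) | 12 | 57.17 |
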
To investigate the performance on physical devices, we further compare the actual inference time of FD-MobileNet, MobileNet and ShuffleNet on an ARM-based platform. The experiments are conducted using an optimized NCNN framework on an i.MX 6 series CPU (single-core, 800 MHz).
Table 4 shows the inference time of the three compact networks under computational budgets of 140, 40 and 12 MFLOPs, respectively. FD-MobileNet achieves about a 1.1× speedup over MobileNet under the three computational budgets. These improvements are attributed to the effectiveness of the fast downsampling architecture of FD-MobileNet. Compared with ShuffleNet, FD-MobileNet provides significantly faster inference speed. When the computational budgets are 140 and 40 MFLOPs, FD-MobileNet gains 1.33× and 1.47× speedups over ShuffleNet, respectively. The speedup is higher under a complexity of 12 MFLOPs: FD-MobileNet is 1.82× faster than ShuffleNet. It is noteworthy that under 140 and 40 MFLOPs, the ShuffleNet models have fewer FLOPs than the FD-MobileNet counterparts, but they are much slower. This slowdown is caused by the inefficiency of the bypass connection structure of the ShuffleNet unit. On low-power devices, the bypass connection structure leads to frequent memory and cache switches, which slow down the actual inference speed. In contrast, the simple stacking architecture allows FD-MobileNet to leverage memory and cache more efficiently, which contributes to a faster actual inference speed. These results indicate that FD-MobileNet is effective in actual mobile or embedded applications.
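The speedups quoted above are ratios of the inference times reported in Table 4. The sketch below recomputes them; the dictionary layout and function name are illustrative, and the timing values are copied from the table. Because a speedup is a ratio, the result does not depend on the time unit.

```python
# Inference times from Table 4, keyed by nominal budget in MFLOPs.
INFERENCE_TIME = {
    140: {"ShuffleNet": 522.27, "MobileNet": 431.73, "FD-MobileNet": 391.66},
    40: {"ShuffleNet": 204.97, "MobileNet": 155.84, "FD-MobileNet": 139.47},
    12: {"ShuffleNet": 103.79, "MobileNet": 63.73, "FD-MobileNet": 57.17},
}


def speedup(budget, baseline):
    """How many times faster FD-MobileNet is than the given baseline at this budget."""
    times = INFERENCE_TIME[budget]
    return times[baseline] / times["FD-MobileNet"]


if __name__ == "__main__":
    for budget in INFERENCE_TIME:
        over_mobilenet = speedup(budget, "MobileNet")
        over_shufflenet = speedup(budget, "ShuffleNet")
        print(f"{budget:3d} MFLOPs: {over_mobilenet:.2f}x over MobileNet, "
              f"{over_shufflenet:.2f}x over ShuffleNet")
```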
In this work, we present Fast-Downsampling MobileNet (FD-MobileNet), a highly efficient and accurate network for very limited computational budgets. FD-MobileNet is built by adopting a fast downsampling strategy in the state-of-the-art MobileNet framework. Compared with the original MobileNet, the fast downsampling scheme allows more channels, which increases the information capacity of the network and contributes to significant performance improvements. Experiments on the ILSVRC 2012 classification dataset and the PASCAL VOC 2007 detection dataset show that FD-MobileNet consistently outperforms MobileNet under different computational budgets. Evaluations of the actual inference time demonstrate that FD-MobileNet achieves significant speedup over ShuffleNet on an ARM-based device under the same complexity. For future work, we plan to adopt the fast downsampling strategy in other compact networks such as ShuffleNet for better performance.
This work is supported by the National Key Research and Development Program of China (2016YFB1000100).
“Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
International Conference on Machine Learning, 2015, pp. 448–456.