1 Introduction
Recent progress in efficient CNN architectures [16, 12, 31, 13, 47, 28, 35] has successfully decreased the computational cost of ImageNet classification by two orders of magnitude, from 3.8G FLOPs (ResNet-50 [11]) to about 40M FLOPs (e.g. MobileNet, ShuffleNet), with a reasonable performance drop. However, these architectures suffer a significant performance degradation when the computational cost is reduced further. For example, the top-1 accuracy of MobileNetV3 degrades substantially from 65.4% to 58.0% and 49.8% when the computational cost drops from 44M to 21M and 12M MAdds, respectively. In this paper, we aim to improve accuracy in the extremely low FLOP regime from 21M down to 4M MAdds, which marks a computational cost decrease of another order of magnitude (from 40M).
The problem of dealing with extremely low computational cost (4M–21M FLOPs) is very challenging, considering that 2.7M MAdds are already consumed by a thin stem layer that contains a single 3×3 convolution with 3 input channels and 8 output channels over a 112×112 grid (stride=2). The remaining resources are too limited to design the convolution layers and 1,000-class classifier required for effective classification. As shown in Figure
1, a common strategy of reducing the width or depth of existing efficient CNNs (e.g. MobileNet [12, 31, 13] and ShuffleNet [47, 28]) results in severe performance degradation. Note that we focus on new operator design while fixing the input resolution to 224×224, even for the budget of 4M FLOPs.

In this paper, we handle the extremely low FLOP regime from two perspectives: node connectivity and nonlinearity, which are related to network width and depth. First, we show that lowering node connectivity to enlarge network width provides a good tradeoff for a given computational budget. Second, we rely on improved layer nonlinearities to compensate for the reduced network depth. These two factors motivate the design of more efficient convolution and activation functions.
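The stem-cost figure quoted above can be verified with a one-line computation. A minimal sketch, assuming a 3×3 stem kernel and the 112×112 output grid produced by a stride-2 convolution on a 224×224 input:

```python
# MAdds of a convolution layer: one multiply-add per kernel tap,
# input channel, output channel, and output position.
def conv_madds(k, c_in, c_out, h, w):
    return k * k * c_in * c_out * h * w

# stem: assumed 3x3 kernel, 3 -> 8 channels, 112x112 output (224x224, stride 2)
stem = conv_madds(3, 3, 8, 112, 112)
print(stem)  # 2709504, i.e. about 2.7M MAdds
```

With a 4M-MAdd total budget, this single layer alone consumes roughly two thirds of it, which is the point the paragraph above makes.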
Regarding convolutions, we propose MicroFactorized convolution (MFConv) to factorize a pointwise convolution into two group convolution layers, where the group number G adapts to the number of channels C as:

G = √(C/R),

where R is the channel reduction ratio of the bottleneck in between. As analyzed in Section 3.1, this equation achieves a good tradeoff between the number of channels and node connectivity for a given computational cost. Mathematically, the pointwise convolution matrix is approximated by a block matrix (G×G blocks) whose blocks have rank 1. This guarantees minimal path redundancy (only one path between any input–output pair) and maximum input coverage (per output channel), enabling the network to implement more channels for a given computational budget.
With regard to nonlinearities, we propose a new activation function, named Dynamic ShiftMax (DYShiftMax), which nonlinearly fuses channels with dynamic coefficients. In particular, the new activation forces the network to learn to fuse different circular channel shifts of the input feature maps, using coefficients that adapt to the input, and to select the best among these fusions. This is shown to enhance the representation power of the group factorization at little computational cost.
Based upon the two new operators (MFConv and DYShiftMax), we obtain a family of models, called MicroNets. Figure 1 summarizes the ImageNet performance, where MicroNets outperform the state of the art by a large margin. In particular, our MicroNet models of 12M and 21M FLOPs outperform MobileNetV3 by 9.6% and 4.5% in top-1 accuracy, respectively. For the extremely challenging regime of 6M FLOPs, MicroNet achieves 51.4% top-1 accuracy, outperforming MobileNetV3 by 1.6%, even though the latter is twice as complex (12M FLOPs).
Even though MicroNet is manually designed to optimize theoretical FLOPs, it outperforms MobileNetV3 (which is searched over inference latency) and achieves fast inference on edge devices. Furthermore, MicroNet surpasses MobileNetV3 on object detection and keypoint detection while using substantially less computation.
2 Related Work
Efficient CNNs: MobileNets [12, 31, 13] decompose convolution into a depthwise and a pointwise convolution. ShuffleNets [47, 28] further simplify pointwise convolution by group convolution and channel shuffle. [34] uses MixConv to mix up multiple kernel sizes in a convolution. [38] uses butterfly transform to approximate pointwise convolution. EfficientNet [35, 36] proposes a compound scaling method to scale depth/width/resolution uniformly. AdderNet [2] trades massive multiplications for cheaper additions. GhostNet [9]
generates more feature maps from cheap linear transformations. Sandglass
[48] alleviates information loss by flipping the structure of the inverted residual block. [45, 1] train one network to support multiple sub-networks.

Dynamic Neural Networks:
Dynamic networks improve representation capability by adapting architectures or parameters to the input. [22, 26, 39, 41] perform dynamic routing within a super-network. [39] and [41] use reinforcement learning to learn a controller that skips parts of an existing model. MSDNet
[15] allows early exit for easy samples based on prediction confidence. [46] searches for the optimal MSDNet. [21] learns dynamic routing across scales for semantic segmentation. [44] adapts image resolution to achieve efficient inference. Another line of work keeps the architecture fixed but adapts parameters. HyperNet [8] uses another network to generate parameters for the main network. SENet [14] adapts channel weights by squeezing global context. SKNet [20] adapts attention over kernels of different sizes. Dynamic convolution [43, 4] aggregates multiple convolution kernels based on their attention. Dynamic ReLU
[5] adapts the slopes and intercepts of two linear functions in ReLU [29, 17]. [27] uses a grouped fully connected layer to generate convolutional weights directly. [3] presents spatial-aware dynamic convolution. [32] proposes dynamic group convolution. [37] applies dynamic convolution to instance segmentation.

3 MicroFactorized Convolution
The goal of MicroFactorized convolution is to optimize the tradeoff between the number of channels and node connectivity. Here, the connectivity of a layer is defined as the number of paths per output node, where a path connects an input node and an output node.
3.1 MicroFactorized Pointwise Convolution
We propose the use of group-adaptive convolution to factorize a pointwise convolution. For conciseness, we assume the kernel has the same number C of input and output channels and ignore bias terms. The kernel matrix W is factorized into two group-adaptive convolutions, where the number of groups G depends on the number of channels C, according to
W = P Φ Qᵀ,  (1)
where W is a C×C matrix, Qᵀ is a (C/R)×C matrix that compresses the number of channels by a factor of R, and P is a C×(C/R) matrix that expands the number of channels back to C. P and Q are block-diagonal matrices with G blocks, each implementing the convolution of a group of channels. Φ is a permutation matrix that shuffles channels, similarly to [47]. The computational complexity of the factorized layer is O = 2C²/(RG). Figure 2 (left) shows an example of the factorization, for C=18, R=2, and G=3.
The C/R intermediate channels (the outputs of Qᵀ) are denoted hidden channels. The grouping structure limits the number of these channels that are affected by (affect) each input (output) of the layer. Specifically, each hidden channel connects to C/G input channels and each output channel connects to C/(RG) hidden channels. The number of input–output connections per output channel, E = (C/(RG)) × (C/G) = C²/(RG²), denotes the connectivity of the layer. When the computational budget O and the compression factor R are fixed, the number of channels C and the connectivity E change with G in opposite directions,
C = √(ORG/2),  E = O/(2G).  (2)
This is illustrated in Figure 3. As the number of groups G increases, C increases but E decreases. The two curves intersect (C = E) when
G = √(C/R),  (3)
in which case each output channel connects to all input channels exactly once (E = C). This guarantees that no redundant paths exist between any input–output pair (minimum path redundancy), while guaranteeing the existence of a path between each pair (maximum input coverage). Eq. 3 is a defining property of MicroFactorized pointwise convolution. It implies that the number of groups is not fixed, but determined by the number of channels C and the compression factor R, according to a square root law that optimally balances the number of channels and input/output connectivity. Mathematically, the resulting convolution matrix W is divided into G×G rank-1 blocks, as shown in Figure 2 (left).
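The minimal-redundancy property can be checked numerically. The sketch below builds 0/1 connection patterns for the two block-diagonal group convolutions and a ShuffleNet-style permutation (the exact form of Φ is an assumption here), and verifies that every input–output pair of the product is connected by exactly one path, for the illustrative values C=18, R=2 (hence G=3 by Eq. 3):

```python
import math

def shuffle(rows, G):
    """ShuffleNet-style channel shuffle: view (G, n), transpose, flatten."""
    n = len(rows) // G
    return [rows[(i % G) * n + i // G] for i in range(len(rows))]

def connection_matrix(C, R):
    """0/1 connectivity pattern of W = P . Phi . Q^T, with G = sqrt(C/R) (Eq. 3)."""
    G = int(round(math.sqrt(C / R)))
    hid = C // R  # hidden channels between the two group convolutions
    # Q^T: block-diagonal, G groups; hidden channel h sees the C/G inputs of its group
    qt = [[1 if i // (C // G) == h // (hid // G) else 0 for i in range(C)]
          for h in range(hid)]
    qt = shuffle(qt, G)  # Phi: permute the hidden channels between the two convs
    # P: block-diagonal, G groups; output o sums the C/(RG) hidden channels of its group
    return [[sum(qt[h][i] for h in range(hid) if h // (hid // G) == o // (C // G))
             for i in range(C)]
            for o in range(C)]

W = connection_matrix(18, 2)
# exactly one path between every input-output pair: E = C, no path redundancy
assert all(v == 1 for row in W for v in row)
```

Without the shuffle Φ, the product would collapse back into a block-diagonal pattern, losing the maximum-input-coverage property.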
3.2 MicroFactorized Depthwise Convolution
Figure 2 (middle) shows how micro-factorization can be applied to a depthwise convolution. The k×k convolution kernel is factorized into a k×1 and a 1×k kernel. This follows Eq. 1, with a per-channel kernel matrix W of size k×k, a k×1 vector P, a 1×k vector Qᵀ, and a scalar Φ of value 1. This low-rank approximation reduces the computational complexity from O(k²C) to O(kC).
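Underlying this factorization is the identity that convolving with a rank-1 k×k kernel (the outer product of a k×1 and a 1×k filter) equals applying the two factors in sequence, at 2k instead of k² multiply-adds per output. A small pure-Python sketch with illustrative filter values:

```python
def conv2d_valid(img, ker):
    """Naive 'valid' 2D cross-correlation on nested lists."""
    H, W = len(img), len(img[0])
    kh, kw = len(ker), len(ker[0])
    return [[sum(ker[a][b] * img[y + a][x + b]
                 for a in range(kh) for b in range(kw))
             for x in range(W - kw + 1)] for y in range(H - kh + 1)]

u = [1.0, 2.0, 1.0]                           # k x 1 factor (illustrative values)
v = [0.5, 1.0, 0.5]                           # 1 x k factor
full = [[ui * vj for vj in v] for ui in u]    # rank-1 k x k kernel = outer(u, v)

img = [[float(x + 5 * y) for x in range(5)] for y in range(5)]
once = conv2d_valid(img, full)                               # k*k madds per output
twice = conv2d_valid(conv2d_valid(img, [[x] for x in u]), [v])  # 2k madds per output
assert all(abs(a - b) < 1e-9 for ra, rb in zip(once, twice) for a, b in zip(ra, rb))
```

A general k×k kernel has rank up to k, so the factorized form is a low-rank approximation rather than an exact replacement; the paper trades this expressiveness for the k²→2k cost reduction.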
Combining MicroFactorized Pointwise and Depthwise Convolutions: MicroFactorized pointwise and depthwise convolutions can be combined in two different ways: (a) regular combination and (b) lite combination. The former simply concatenates the two convolutions. The lite combination, shown in Figure 2 (right), uses MicroFactorized depthwise convolutions to expand the number of channels, by applying multiple spatial filters per channel. It then applies a single group-adaptive convolution to fuse and squeeze the number of channels. Compared to its regular counterpart, it spends more resources on learning spatial filters (depthwise) by saving on channel fusion (pointwise) computations, which is empirically shown to be more effective for implementing the lower network layers.
4 Dynamic ShiftMax
So far, we have discussed the design of efficient static networks, which do not change their weights according to the input. We now introduce dynamic ShiftMax (DYShiftMax), a new dynamic nonlinearity that strengthens connections between the groups created by microfactorization. This is complementary to MicroFactorized pointwise convolution, which focuses on connections within a group.
Let x = {x_i} (0 ≤ i < C) denote an input vector (or tensor) with C channels, divided into G groups of C/G channels each. The group circular shift of x (a shift by C/G channels) is the vector whose i-th element is x_{(i+C/G) mod C}. Dynamic ShiftMax outputs the maximum of K fusions, each of which combines J group shifts:

y_i = max_{1≤k≤K} Σ_{j=0}^{J−1} a^k_{i,j}(x) · x_{(i+jC/G) mod C},  (4)
where a^k_{i,j}(x) is a dynamic weight, i.e. a weight that depends on the input x. It is implemented as a hyper-function (with output dimension CJK) that consists of a sequence of average pooling, two fully connected layers, and a sigmoid layer, as in Squeeze-and-Excitation [16].
In this way, DYShiftMax implements two forms of nonlinearity: it (a) outputs the maximum of K fusions of J groups, and (b) weighs each fusion by a dynamic parameter a^k_{i,j}(x). The first nonlinearity is complementary to MicroFactorized pointwise convolution: where the latter focuses on connectivity within each group, DYShiftMax strengthens the connections between groups. The second enables the network to tailor this strengthening to the input x. The two operations increase the representation power of the network, compensating for the loss inherent to the reduced number of layers.
DYShiftMax synthesizes CJK weights from the input x. Its computational complexity is the sum of (a) the average pooling, O(HWC), (b) the generation of the weights by the two fully connected layers, and (c) the application of ShiftMax at each channel and spatial location, O(HWCJK). This leads to a lightweight model when J and K are small. Empirically, a good tradeoff between classification performance and complexity is achieved with J = 2 and K = 2.
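Eq. 4 can be sketched in a few lines. In this toy version the dynamic weights a are passed in directly rather than produced by the squeeze-and-excitation-style hyper-function, and the usage example illustrates the dynamic-ReLU special case (J = 1):

```python
def dy_shift_max(x, a, G):
    """Eq. 4: y_i = max_k sum_j a[k][i][j] * x[(i + j*C/G) mod C].
    x: list of C channel values; a[k][i][j]: weights (input-dependent in the
    paper, supplied directly here); G: number of channel groups."""
    C = len(x)
    K, J = len(a), len(a[0][0])
    return [max(sum(a[k][i][j] * x[(i + j * C // G) % C] for j in range(J))
                for k in range(K))
            for i in range(C)]

# J = 1 recovers dynamic ReLU: y_i = max_k a[k][i][0] * x_i (per-channel max of
# linear functions). Weights below are illustrative constants.
x = [1.0, -2.0, 3.0, -4.0]
a = [[[1.0]] * 4, [[0.1]] * 4]   # K = 2 fusions, J = 1 shift, C = 4 channels
y = dy_shift_max(x, a, G=2)
assert y == [max(v, 0.1 * v) for v in x]
```

With J = 2, each output additionally mixes in the channel shifted by C/G, which is what ties the groups of the factorized convolution together.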
Table 1: MicroNet architectures M0–M3. Each cell lists the block type followed by its hyperparameters (kernel size and channel settings; see Section 5.1); empty cells mark levels a model skips.

| Output | M0 | M1 | M2 | M3 |
| 112×112 | stem 3 4 2 | stem 3 6 3 | stem 3 8 4 | stem 3 12 4 |
| 56×56 | MicroA 3 16 8 | MicroA 3 24 8 | MicroA 3 32 12 | MicroA 3 48 16 |
| | MicroA 3 32 12 | MicroA 3 32 16 | MicroA 3 48 16 | MicroA 3 64 24 |
| 28×28 | | | MicroB 3 144 24 | MicroB 3 144 24 |
| | MicroB 5 64 16 | MicroB 5 96 16 | MicroC 5 192 32 | MicroC 3 192 32 |
| | MicroC 5 128 32 | MicroC 5 192 32 | MicroC 5 192 32 | MicroC 5 192 32 |
| 14×14 | | | MicroC 5 384 64 | MicroC 5 384 64 |
| | | | | MicroC 5 480 80 |
| | | | | MicroC 5 480 80 |
| | MicroC 5 256 64 | MicroC 5 384 64 | MicroC 5 576 96 | MicroC 5 720 120 |
| 7×7 | MicroC 3 384 96 | MicroC 3 576 96 | MicroC 3 768 128 | MicroC 3 720 120 |
| | | | | MicroC 3 864 144 |
| 1×1 | avg pool, 2 fc, softmax | | | |
| | 4M MAdds, 1.0M Param | 6M MAdds, 1.8M Param | 12M MAdds, 2.4M Param | 21M MAdds, 2.6M Param |
5 MicroNet
Below we describe in detail the design of MicroNet, using MicroFactorized convolution and dynamic ShiftMax.
5.1 MicroBlocks
MicroNet models are built from the three MicroBlocks of Figure 4, which combine MicroFactorized pointwise and depthwise convolutions in different ways. All MicroBlocks use the dynamic ShiftMax activation function.
MicroBlockA: The MicroBlockA of Figure 4a uses the lite combination of MicroFactorized pointwise and depthwise convolutions of Figure 2 (right). It expands the number of channels with a MicroFactorized depthwise convolution, and compresses them with a group-adaptive convolution. It is best suited for the lower network layers, where resolution is higher.
MicroBlockB: The MicroBlockB of Figure 4b is used to connect MicroBlockA and MicroBlockC. Different from MicroBlockA, it uses a full MicroFactorized pointwise convolution, which includes two group-adaptive convolutions. Hence, it both compresses and expands the number of channels. Each MicroNet model has a single MicroBlockB (see Table 1).
MicroBlockC: The MicroBlockC of Figure 4c implements the regular combination of MicroFactorized depthwise and pointwise convolutions. It is best suited for the higher network layers (see Table 1) since it assigns more computation to channel fusion (pointwise) than the lite combination. The skip connection is used when the input and output have the same dimension.
Each microblock has three hyperparameters: the kernel size k, the number of output channels C, and the compression factor R of the bottleneck of MicroFactorized pointwise convolution. Note that the number of groups in the two group-adaptive convolutions is determined by Eq. 3.
5.2 Architectures
All models are manually designed to optimize FLOPs, a theoretical, device-independent metric. We hope this can be leveraged by new hardware designs and optimizations for edge devices. We are aware that FLOPs are not equivalent to inference latency on existing hardware, and we will show experimentally that MicroNet also improves both accuracy and latency. We propose four models (M0, M1, M2, M3) of different computational cost (4M, 6M, 12M, 21M MAdds) based on the MicroBlocks above. Table 1 presents their full specification. These networks follow the same pattern from low to high layers: stem layer → MicroBlockA → MicroBlockB → MicroBlockC. All models are handcrafted, without network architecture search (NAS). The network hyperparameters are selected by simple rules: R is fixed (4 for M0, 6 for M1, M2, M3), C increases from low to high levels, and depth increases from M0 to M3. For the deepest model (M3), we only use one dynamic ShiftMax layer per block, placed after the depthwise convolution. The stem layer includes a convolution followed by a group convolution and a ReLU; the second convolution expands the number of channels.
5.3 Relation to Prior Work
MicroNet has various connections to the recent deep learning literature. It is related to the popular MobileNet
[12, 31, 13] and ShuffleNet [47, 28] models. It shares the inverted bottleneck structure with MobileNet and the use of group convolution with ShuffleNet. However, MicroNet differs from these models in both its convolutions and activation functions. First, it factorizes pointwise convolutions into group-adaptive convolutions, with a channel-adaptive number of groups that guarantees minimum path redundancy. Second, it factorizes depthwise convolutions. Third, it relies on a novel activation function, dynamic ShiftMax, to strengthen group connectivity in a nonlinear and input-dependent manner. Dynamic ShiftMax itself generalizes the recently proposed dynamic ReLU [5] (i.e. dynamic ReLU is the special case with J = 1, where each channel is activated alone).

6 Experiments
We evaluate MicroNet on three tasks: (a) image classification, (b) object detection, and (c) keypoint detection. In this section, the baseline MobileNetV3Small in [13] is denoted as MobileNetV3, for conciseness.
6.1 ImageNet Classification
We start by evaluating the four MicroNet models (M0–M3) on the task of ImageNet [6] classification. ImageNet has 1000 classes, including 1,281,167 images for training and 50,000 images for validation.
All models are trained using an SGD optimizer with 0.9 momentum. The image resolution is 224×224. Standard random cropping and flipping are used for data augmentation. We use a mini-batch size of 512 and a learning rate of 0.02. Each model is trained for 600 epochs with cosine learning rate decay. The weight decay is 3e-5 for the smaller MicroNets (M0, M1, M2) and 4e-5 for the largest model M3, each trained with dropout.

Table 2: From MobileNet to MicroNet.

| Model | MicroFac DW | MicroFac PW | Lite | static ShiftMax | dynamic ShiftMax | Param | MAdds | Top-1 |
| Mobile | | | | | | 1.3M | 10.6M | 44.9 |
| | ✓ | | | | | 1.7M | 10.6M | 46.4 |
| | ✓ | ✓ | | | | 1.7M | 10.6M | 50.0 |
| Micro | ✓ | ✓ | ✓ | | | 1.8M | 10.5M | 51.7 |
| | ✓ | ✓ | ✓ | ✓ | | 1.9M | 11.8M | 54.4 |
| | ✓ | ✓ | ✓ | | ✓ | 2.4M | 12.4M | 58.5 |
6.1.1 Ablation Studies
Several ablations were performed using MicroNetM2. All models are trained for 300 epochs. The default hyperparameters of DYShiftMax were set to J = 2 and K = 2.
From MobileNet to MicroNet: Table 2 shows the path from MobileNet to MicroNet. Both share the inverted bottleneck structure. Here, we modify MobileNetV2 (without SE [14]) such that it has complexity (10.6M MAdds) similar to the static MicroFactorized convolution variants of rows 2–4. The introduction of MicroFactorized depthwise convolutions improves performance by 1.5%. MicroFactorized pointwise convolutions add another 3.6%, and the lite combination at the lower layers adds a final gain of 1.7%. Altogether, the three factorizations boost the top-1 accuracy of the static network from 44.9% to 51.7%. The addition of static and dynamic ShiftMax further increases this gain by 2.7% and 6.8%, respectively, for a small increase in computation. This demonstrates that MicroFactorized convolutions and dynamic ShiftMax are effective and complementary mechanisms for implementing networks with extremely low computational cost.
Number of Groups G: MicroFactorized pointwise convolution includes two group-adaptive convolutions, with a number of groups equal to the integer closest to √(C/R). Table 3a compares this to networks of similar structure and FLOPs (about 10.5M MAdds) that use a fixed number of groups. Group-adaptive convolution achieves higher accuracy, demonstrating the importance of its optimal tradeoff between input/output connectivity and the number of channels.
This is further confirmed by Table 3b, which compares different options for the adaptive number of groups, controlled by a multiplier λ such that G = λ√(C/R). A larger λ corresponds to more channels but less input/output connectivity (see Figure 3). The optimal balance is achieved when λ is between 0.5 and 1. Top-1 accuracy drops when λ either increases (more channels but less connectivity) or decreases (fewer channels but more connectivity) away from this range. The default value λ = 1 is used in the remainder of the paper. Note that all models in Table 3b have similar computational cost (about 10.5M MAdds).
| G | Param | MAdds | Top-1 |
| 1 | 1.3M | 10.6M | 48.8 |
| 2 | 1.5M | 10.5M | 50.2 |
| 4 | 1.7M | 10.6M | 50.7 |
| 8 | 1.7M | 10.6M | 50.8 |
| √(C/R) | 1.8M | 10.5M | 51.7 |
(a) Fixed vs. adaptive number of groups G.

| λ | Param | MAdds | Top-1 |
| | 1.5M | 10.5M | 50.2 |
| | 1.7M | 10.6M | 51.6 |
| 1 ✶ | 1.8M | 10.5M | 51.7 |
| | 2.1M | 10.5M | 50.6 |
| | 2.2M | 10.7M | 47.6 |
(b) Adaptive number of groups G = λ√(C/R), for increasing λ.
| Lite at low levels | Lite at high levels | Param | MAdds | Top-1 |
| | | 1.7M | 10.6M | 50.0 |
| ✓ ✶ | | 1.8M | 10.5M | 51.7 |
| ✓ | ✓ | 2.0M | 10.6M | 51.2 |
(c) Lite combination at different levels.
Lite combination: Table 3c compares using the lite combination of MicroFactorized pointwise and depthwise convolutions (Figure 2Right) at different layers. The lite combination is more effective for lower layers. Compared to the regular combination, it saves computations from channel fusion (pointwise) to allow more spatial filters (depthwise).
Activation functions: Dynamic ShiftMax is compared to three previous activation functions: ReLU [29], SE+ReLU [14], and dynamic ReLU [5]. Table 4 shows that dynamic ShiftMax outperforms all three by a clear margin (at least 2.5%). Note that dynamic ReLU is the special case of dynamic ShiftMax with J = 1 (see Eq. 4).
| Activation | Param | MAdds | Top-1 | Top-5 |
| ReLU [29] | 1.8M | 10.5M | 51.7 | 74.3 |
| SE [14] + ReLU | 2.1M | 10.9M | 54.4 | 76.8 |
| Dynamic ReLU [5] | 2.4M | 11.8M | 56.0 | 78.0 |
| Dynamic ShiftMax | 2.4M | 12.4M | 58.5 | 80.1 |
Table 5: Location of DYShiftMax. The three ✓-columns mark the candidate activation layers of the microblocks of Figure 4, from after the depthwise convolution (left) to the block output (right).

| Activation | after DW | middle | output | Param | MAdds | Top-1 | Top-5 |
| ReLU | – | – | – | 1.8M | 10.5M | 51.7 | 74.3 |
| Dynamic ShiftMax | ✓ | – | – | 2.1M | 11.3M | 55.9 | 77.9 |
| | – | ✓ | – | 2.0M | 10.6M | 53.3 | 76.0 |
| | – | – | ✓ | 2.1M | 11.2M | 54.8 | 77.2 |
| | ✓ | ✓ | – | 2.2M | 11.5M | 56.6 | 78.3 |
| | ✓ | – | ✓ | 2.3M | 12.2M | 57.9 | 79.6 |
| | – | ✓ | ✓ | 2.2M | 11.4M | 55.5 | 77.8 |
| | ✓ | ✓ | ✓ | 2.4M | 12.4M | 58.5 | 80.1 |
Location of DYShiftMax: Table 5 shows the top-1 accuracy when dynamic ShiftMax is implemented in different combinations of the three layers of the microblocks of Figure 4. When used in a single layer, dynamic ShiftMax should be placed after the depthwise convolution. This improves the top-1 accuracy over a network with ReLU activations by 4.2%. Adding a dynamic ShiftMax activation at the MicroBlock output further improves performance by 2.0%. Finally, using all three layers of dynamic ShiftMax increases the gain over the ReLU network to 6.8%.
| J | K | Param | MAdds | Top-1 | Top-5 |
| 1 | 1 | 2.1M | 10.9M | 54.4 | 76.8 |
| 2 | 1 | 2.2M | 11.8M | 55.9 | 78.2 |
| 2 | 2 ✶ | 2.4M | 12.4M | 58.5 | 80.1 |
| 2 | 3 | 2.6M | 13.8M | 58.1 | 79.7 |
| 1 | 2 | 2.2M | 11.2M | 55.5 | 77.6 |
| 2 | 2 ✶ | 2.4M | 12.4M | 58.5 | 80.1 |
| 3 | 2 | 2.6M | 14.2M | 59.0 | 80.3 |
| 3 | 3 | 2.8M | 15.3M | 59.1 | 80.3 |
Hyperparameters in DYShiftMax: Table 6 shows the results of different combinations of J and K in Eq. 4. We add a ReLU when K = 1, as only one element is left in the max operator. The baseline of the first row (J = 1, K = 1) is equivalent to SE+ReLU [14]. For fixed J = 2 (fusion of two groups), taking the best of two fusions (K = 2) is better than a single fusion (K = 1), but adding a third fusion (K = 3) does not help, since it only adds path redundancy. When K is fixed at 2 (best of two fusions), fusing more groups (larger J) is consistently better but requires more FLOPs. A good tradeoff is achieved with J = 2 and K = 2, enabling a gain of 4.1% over the baseline for an additional 1.5M MAdds.
6.1.2 Comparison to Prior Networks
Table 7 compares MicroNet to state-of-the-art models with complexity below 24M FLOPs. As prior work lacks reported results within a 10M-FLOP budget, we extend the popular MobileNetV3 to 6M and 4M FLOPs as baselines, using width multipliers of 0.2 and 0.15, respectively. These baselines share the same training setup as MicroNet.
To make the comparison fair, two variants of each of M1–M3 are reported in Table 7. The first matches the model size of the corresponding baseline but uses fewer FLOPs; the second matches the FLOPs but allows more parameters (up to 1M more), best serving scenarios where FLOPs matter more than memory. This is due to the difficulty of matching both model size and FLOPs simultaneously, except for the smallest model (M0). The smaller variant has the same structure as the larger one, shrinking the model size only by reducing network width and the parameters in dynamic ShiftMax.
In all cases, MicroNet outperforms prior networks by a clear margin. For instance, the size-matched MicroNetM1, M2, and M3 outperform their MobileNetV3 counterparts by 8.3%, 8.4%, and 3.3% top-1 accuracy, respectively. Given another 1M budget of model size, the larger variants increase these gains by 2.0%, 1.2%, and 1.2%, respectively. MicroNetM0 outperforms MobileNetV3 0.15 by 12.9% (46.6% vs. 33.7%), showing that it handles the cut from 6M to 4M MAdds far more gracefully: top-1 accuracy drops by 4.8% from MicroNetM1 to M0, while MobileNetV3 degrades by 7.4% from 0.2 to 0.15. Compared to recent MobileNet and ShuffleNet improvements, such as ButterflyTransforms [38] and TinyNet [10], MicroNet models gain more than 2.6% top-1 accuracy while using fewer FLOPs. This demonstrates the effectiveness of MicroNet at extremely low FLOPs.
Table 7: Comparison with the state of the art on ImageNet.

| Model | #Param | MAdds | Top-1 | Top-5 |
| MobileNetV3 0.15 | 1.0M | 4M | 33.7 | 57.2 |
| MicroNetM0 | 1.0M | 4M | 46.6 | 70.6 |
| MobileNetV3 0.2 | 1.2M | 6M | 41.1 | 65.2 |
| MicroNetM1 | 1.2M | 5M | 49.4 | 72.9 |
| MicroNetM1 | 1.8M | 6M | 51.4 | 74.5 |
| ShuffleNetV1 0.25 [47] | – | 13M | 47.3 | – |
| MobileNetV3 0.35 [13] | 1.4M | 12M | 49.8 | – |
| HBONet (96×96) [19] | – | 12M | 50.3 | 73.8 |
| MobileNetV3+BFT 0.5 [38] | – | 15M | 55.2 | – |
| MicroNetM2 | 1.4M | 11M | 58.2 | 80.1 |
| MicroNetM2 | 2.4M | 12M | 59.4 | 80.9 |
| HBONet (128×128) [19] | – | 21M | 55.2 | 78.0 |
| ShuffleNetV2+BFT [38] | – | 21M | 57.8 | – |
| MobileNetV3 0.5 [13] | 1.6M | 21M | 58.0 | – |
| TinyNetE (106×106) [10] | 2.0M | 24M | 59.9 | 81.8 |
| MicroNetM3 | 1.6M | 20M | 61.3 | 82.9 |
| MicroNetM3 | 2.6M | 21M | 62.5 | 83.1 |
6.1.3 Inference Latency
We also measure the inference latency of MicroNet on an Intel(R) Xeon(R) CPU E5-2620 v4 (2.10GHz). Following the common settings in [31, 13], we test in single-threaded mode with batch size 1. The average inference latency over 5,000 images (at resolution 224×224) is reported. Figure 5 (right) compares MicroNet and MobileNetV3-Small. To achieve similar accuracy, MicroNet clearly consumes less runtime than MobileNetV3. For example, MicroNet with 55% accuracy has a latency below 7ms, while MobileNetV3 requires about 9.5ms. The accuracy–latency curve degrades slightly when using the MicroNet variants with fewer parameters (M1, M2, M3), but they still outperform MobileNetV3. Although the largest MicroNet model (M3) only slightly outperforms MobileNetV3 at the same latency, MicroNet gains significantly more over MobileNetV3 as the latency decreases. In particular, at a latency of 4ms, MicroNet improves over MobileNetV3 by 10%, demonstrating its strength at low computational cost.
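The measurement protocol above can be sketched as a small timing harness. This is an illustrative sketch, not the exact script used in the paper; the workload passed in below is a stand-in for a real model forward pass (batch size 1):

```python
import statistics
import time

def average_latency_ms(run_once, n_warmup=50, n_runs=5000):
    """Average latency in milliseconds of a single-input inference callable.
    Warm-up runs are excluded so caches and lazy allocations do not skew the mean."""
    for _ in range(n_warmup):
        run_once()
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        run_once()
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(times)

# usage sketch: a toy CPU workload in place of a model forward pass
lat = average_latency_ms(lambda: sum(i * i for i in range(10_000)), n_warmup=5, n_runs=50)
assert lat > 0.0
```

Averaging over many runs (5,000 images in the paper's setting) matters because single-run CPU timings fluctuate with scheduling and cache state.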
6.1.4 Discussion
As shown in Figure 5, MicroNet clearly outperforms MobileNetV3 at the same FLOPs, but the gap shrinks at the same latency. This is due to two reasons. First, unlike MobileNetV3, which is optimized for latency by search, MicroNet is manually designed based on theoretical FLOPs. Second, our implementations of group convolution and dynamic ShiftMax are not optimized (we use PyTorch). We observe that the latency of group convolution is not reduced proportionally as the number of groups increases, and that dynamic ShiftMax is significantly slower than a convolution with the same FLOPs.
We believe the runtime performance of MicroNet can be further improved by using hardware-aware architecture search to find latency-friendly combinations of MicroFactorized convolution and dynamic ShiftMax. MicroNet can also benefit from optimized implementations of group convolution [7] and dynamic ShiftMax. We leave these directions to future work.
6.2 Object Detection
We evaluate the generalization ability of MicroNet on COCO object detection
[25]. All models are trained on train2017 and evaluated in terms of mean Average Precision (mAP) on val2017. Following [9], MicroNet is used as a drop-in replacement for the backbone feature extractor in both the two-stage Faster R-CNN [30] with Feature Pyramid Networks (FPN) [23] and the one-stage RetinaNet [24]. All models are trained with SGD for 36 epochs (a 3× schedule) from ImageNet pre-trained weights, with the hyperparameters and data augmentation suggested in [40].

The detection results are shown in Table 8, where the backbone FLOPs are calculated at 224×224 image size, as is common practice. With significantly lower backbone FLOPs (21M vs 56M), MicroNetM3 achieves higher mAP than MobileNetV3-Small 1.0 in both the Faster R-CNN and RetinaNet frameworks, demonstrating its ability to transfer to the detection task.
6.3 Human Pose Estimation
We also evaluate MicroNet on COCO single-person keypoint detection. All models are trained on train2017, which includes images with person instances labeled with 17 keypoints, and evaluated on val2017, which contains 5,000 images, using the mean average precision (AP) over 10 object keypoint similarity (OKS) thresholds. As for object detection, two MicroNet models (M2, M3) are considered. The models are modified for the keypoint detection task by increasing the resolution (2×) of a select set of blocks (all blocks with stride 32). Each model contains a head with three microblocks (one of stride 8 and two of stride 4) and a pointwise convolution that generates heatmaps for the 17 keypoints. Bilinear upsampling is used to increase the head resolution, and the spatial attention mechanism of [5] is used. Both models are trained from scratch for 250 epochs using the Adam optimizer [18]. The human detection boxes are cropped and resized to 256×192. Training and testing follow the setup of [42, 33].
Table 9 compares MicroNetM3 and M2 with a strong efficient baseline, which requires only 726.9M MAdds and 2.1M parameters. The baseline uses MobileNetV3-Small 1.0 as the backbone and mobile blocks (inverted residual bottleneck blocks) in the head (see [4] for details). MicroNetM3 consumes only 22% (163.2M/726.9M) of the FLOPs used by the baseline yet achieves higher performance, demonstrating its effectiveness for low-complexity keypoint detection. MicroNetM2 provides a good option at even lower complexity (116.8M FLOPs).
7 Conclusion
In this paper, we have presented MicroNet to handle extremely low computational cost. It builds on two new operators: MicroFactorized convolution and dynamic ShiftMax. The former balances the number of channels against input/output connectivity via low-rank approximations of both pointwise and depthwise convolutions. The latter fuses consecutive channel groups dynamically, enhancing both node connectivity and nonlinearity to compensate for the reduced depth. The MicroNet family achieves solid improvements on three tasks (image classification, object detection, and human pose estimation) under extremely low FLOPs. We hope this work provides good baselines for efficient CNNs on multiple vision tasks.
Acknowledgement
This work was partially funded by NSF awards IIS-1924937 and IIS-2041009.
References
 [1] (2019) Once for all: train one network and specialize it for efficient deployment. ArXiv abs/1908.09791. Cited by: §2.

[2]
(202006)
AdderNet: do we really need multiplications in deep learning?.
In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
, Cited by: §2.  [3] (2020) Dynamic regionaware convolution. ArXiv abs/2003.12243. Cited by: §2.
[4] (2020) Dynamic convolution: attention over convolution kernels. In CVPR.
[5] (2020) Dynamic ReLU. In ECCV.
[6] (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255.
[7] (2020) Optimizing grouped convolutions on edge devices. In ASAP, pp. 189–196.
[8] (2017) HyperNetworks. In ICLR.
[9] (2020) GhostNet: more features from cheap operations. In CVPR.
[10] (2020) Model Rubik's cube: twisting resolution, depth and width for TinyNets. In NeurIPS, pp. 19353–19364.
[11] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[12] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[13] (2019) Searching for MobileNetV3. In ICCV.
[14] (2018) Squeeze-and-excitation networks. In CVPR.
[15] (2018) Multi-scale dense networks for resource efficient image classification. In ICLR.
[16] (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv abs/1602.07360.
[17] (2009) What is the best multi-stage architecture for object recognition? In ICCV.
[18] (2015) Adam: a method for stochastic optimization. In ICLR.
[19] (2019) HBONet: harmonious bottleneck on two orthogonal dimensions. In ICCV, pp. 3316–3325.
[20] (2019) Selective kernel networks. In CVPR.
[21] (2020) Learning dynamic routing for semantic segmentation. In CVPR.
[22] (2017) Runtime neural pruning. In NeurIPS, pp. 2181–2191.
[23] (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125.
[24] (2017) Focal loss for dense object detection. In ICCV, pp. 2980–2988.
[25] (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755.
[26] (2018) Dynamic deep neural networks: optimizing accuracy-efficiency trade-offs by selective execution. In AAAI.
[27] (2020) WeightNet: revisiting the design space of weight networks. arXiv abs/2007.11823.
[28] (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In ECCV.
[29] (2010) Rectified linear units improve restricted Boltzmann machines. In ICML.
[30] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497.
[31] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520.
[32] (2020) Dynamic group convolution for accelerating convolutional neural networks. In ECCV.
[33] (2019) Deep high-resolution representation learning for human pose estimation. In CVPR.
[34] (2019) MixConv: mixed depthwise convolutional kernels. In BMVC.
[35] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML, pp. 6105–6114.
[36] (2020) EfficientDet: scalable and efficient object detection. In CVPR.
[37] (2020) Conditional convolutions for instance segmentation. In ECCV.
[38] (2020) Butterfly transform: an efficient FFT based neural architecture design. In CVPR.
[39] (2018) SkipNet: learning dynamic routing in convolutional networks. In ECCV.
[40] (2019) Detectron2. https://github.com/facebookresearch/detectron2.
[41] (2018) BlockDrop: dynamic inference paths in residual networks. In CVPR.
[42] (2018) Simple baselines for human pose estimation and tracking. In ECCV.
[43] (2019) CondConv: conditionally parameterized convolutions for efficient inference. In NeurIPS.
[44] (2020) Resolution adaptive networks for efficient inference. In CVPR.
[45] (2019) Slimmable neural networks. In ICLR.
[46] (2019) S2DNAS: transforming static CNN model for dynamic inference via neural architecture search. arXiv abs/1911.07033.
[47] (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In CVPR.
[48] (2020) Rethinking bottleneck structure for efficient mobile network design. In ECCV.