1 Introduction
Interest in building lightweight and efficient neural networks has exploded recently. They not only enable new experiences on mobile devices, but also protect user privacy by avoiding sending personal information to the cloud. Recent works (e.g. MobileNet [11, 25, 12] and ShuffleNet [38, 22]) have shown that both efficient operator design (e.g. depthwise convolution, channel shuffle, squeeze-and-excitation [13], asymmetric convolution [5]) and architecture search [27, 7, 2] are important for designing efficient convolutional neural networks.
However, even state-of-the-art efficient CNNs (e.g. MobileNetV3 [12]) suffer significant performance degradation when the computational constraint becomes extremely low. For instance, when the computational cost is reduced from 219M to 66M Mult-Adds, the top-1 ImageNet classification accuracy of MobileNetV3 drops from 75.2% to 67.4%. This is because an extremely low computational budget severely constrains both the network depth (number of layers) and width (number of channels), which are crucial for network performance but proportional to the computational cost.
This paper proposes a new operator design, named dynamic convolution, that increases the representation capability with negligible extra FLOPs. Instead of using a single convolution kernel per layer, dynamic convolution uses a set of parallel convolution kernels (see Figure 2) and dynamically aggregates them for each individual input (e.g. image) via input-dependent attention. Dynamic convolution is a nonlinear function with more representation power than its static counterpart. Meanwhile, dynamic convolution is computationally efficient. It does not increase the depth or width of the network, as the parallel convolution kernels share the output channels by aggregation. It only introduces extra computational cost to compute the attention and aggregate the kernels, which is negligible compared to the convolution itself. The key insight is that, within a reasonable model-size budget (convolution kernels are small), dynamic kernel aggregation provides an efficient way (low extra FLOPs) to boost representation capability.
Dynamic convolutional neural networks (denoted as DY-CNNs) are more difficult to train, as they require joint optimization of all convolution kernels and the attention across multiple layers. The sparsity of the attention (a softmax output) only allows a small subset of kernels to be optimized simultaneously, making training inefficient. We solve this problem by using a temperature in the softmax to flatten the attention, so that more convolution kernels can be optimized simultaneously.
We demonstrate the effectiveness of dynamic convolution on both image classification (ImageNet) and keypoint detection (COCO). Without bells and whistles, simply replacing static convolution with dynamic convolution in MobileNet V2 and V3 achieves solid improvement with only a slight (4%) increase of computational cost (see Figure 1). For instance, with a 100M Mult-Adds budget, our method gains 4.0% and 2.3% top-1 accuracy on image classification for MobileNetV2 and MobileNetV3, respectively.
2 Related Work
Efficient CNNs: Recently, designing efficient CNN architectures [15, 11, 25, 12, 38, 22] has been an active research area. SqueezeNet [15] reduces the number of parameters by using 1×1 convolution extensively in its fire module. MobileNetV1 [11] substantially reduces FLOPs by decomposing 3×3 convolution into depthwise convolution and 1×1 pointwise convolution. Building upon this, MobileNetV2 [25] introduces the inverted residual and linear bottleneck. MobileNetV3 [12] applies squeeze-and-excitation [13] in the residual layers and employs a platform-aware neural architecture search [27] to find the global network structure. ShuffleNet [38, 22] further reduces the MAdds of 1×1 convolution by channel shuffle operations. ShiftNet [31] replaces expensive spatial convolution with the shift operation and pointwise convolutions. Compared with existing work, our dynamic convolution can replace any static convolution kernel (e.g. 1×1, 3×3, depthwise convolution, group convolution) and is complementary to other advanced operators like squeeze-and-excitation.
Model Compression and Quantization: Model compression [8, 21, 10] and quantization [3, 39, 37, 35, 28] approaches are also important for learning efficient neural networks. They are complementary to our work, helping reduce the model size for our dynamic convolution method.
Dynamic Deep Neural Networks: Our method is related to recent work on dynamic neural networks [17, 20, 29, 32, 36, 14] that skips parts of an existing model based on the input image. D2NN [20], SkipNet [29] and BlockDrop [32] learn an additional controller for the skipping decision by using reinforcement learning. MSDNet [14] allows early exit based on the current prediction confidence. Slimmable Nets [36] learn a single neural network executable at different widths. Once-for-all [1] proposes a progressive shrinking algorithm to train one network that supports multiple sub-networks, whose accuracy matches independently trained networks. Compared with these works, our method has two major differences. Firstly, all convolution layers in our method are dynamic, varying per input image, while existing works focus on dynamic network structure and leave the parameters of each layer static. Secondly, our method does not require an additional controller: the attention is embedded in each layer, enabling end-to-end training.

Neural Architecture Search: Recent research on neural architecture search (NAS) demonstrates its power in finding high-accuracy neural network architectures [40, 24, 41, 19, 34] as well as hardware-aware efficient architectures [2, 27, 30]. The hardware-aware NAS methods incorporate hardware latency into the architecture search process by making it differentiable. [7] proposes a single-path supernet to optimize all architectures in the search space simultaneously, and then performs an evolutionary architecture search to handle computational constraints. Based upon NAS, MobileNetV3 [12] shows significant improvements over human-designed baselines (e.g. MobileNetV2 [25]). Our dynamic convolution can be easily used in advanced architectures found by NAS. Later in this paper, we show that dynamic convolution not only improves the performance of human-designed networks (e.g. MobileNetV2), but also boosts the performance of automatically searched architectures (e.g. MobileNetV3), with low extra FLOPs. In addition, our method provides a new and effective component to enrich the search space.
3 Dynamic Convolutional Neural Networks
We describe dynamic convolutional neural networks (DY-CNNs) in this section. The goal is to provide a better trade-off between network performance and computational burden, within the scope of efficient neural networks. The two most popular strategies to boost performance are making neural networks “deeper” or “wider”. However, both incur heavy computational cost, and are thus not friendly to efficient neural networks.
We propose dynamic convolution, which increases neither the depth nor the width of the network, but increases the model capability by aggregating multiple convolution kernels via attention. Note that these kernels are assembled differently for different input images, which is where dynamic convolution gets its name. This section is organized as follows: we first define the generic dynamic perceptron and then apply it to convolution. Finally, we discuss the training strategy for dynamic convolutional neural networks (DY-CNNs).
3.1 Preliminary: Dynamic Perceptron
Definition: Let us denote the traditional or static perceptron as $y = g(W^T x + b)$, where $W$ and $b$ are the weight matrix and bias vector, and $g$ is a nonlinear activation function (e.g. ReLU). We define the dynamic perceptron by aggregating multiple ($K$) linear functions as follows:

$$y = g\big(\tilde{W}^T(x)\,x + \tilde{b}(x)\big), \quad \tilde{W}(x) = \sum_{k=1}^{K} \pi_k(x)\,\tilde{W}_k, \quad \tilde{b}(x) = \sum_{k=1}^{K} \pi_k(x)\,\tilde{b}_k,$$
$$\text{s.t.} \quad 0 \le \pi_k(x) \le 1, \quad \sum_{k=1}^{K} \pi_k(x) = 1, \qquad (1)$$

where $\pi_k(x)$ is the attention weight for the $k$-th linear function $\tilde{W}_k^T x + \tilde{b}_k$.
Attention: the attention weights $\{\pi_k(x)\}$ are not fixed but vary for each input $x$, assembling the linear models dynamically. They represent the optimal aggregation of linear models for a given input. Due to the nonlinearity embedded in the attention $\pi_k(x)$ and the activation $g$, the aggregated model is a nonlinear function. Thus, the dynamic perceptron has more representation power than its static counterpart.
Example (learning XOR): To make the idea of the dynamic perceptron more concrete, we begin with a simple task, i.e. learning the XOR function. In this example, we want the network to perform correctly on the four points $x \in \{(0,0), (0,1), (1,0), (1,1)\}$, with targets $0, 1, 1, 0$. Compared with the solution using two static perceptron layers [6] as follows:
$$y = w^T \max\!\big(0,\; W x + c\big), \quad W = \begin{bmatrix}1 & 1\\ 1 & 1\end{bmatrix}, \quad c = \begin{bmatrix}0\\ -1\end{bmatrix}, \quad w = \begin{bmatrix}1\\ -2\end{bmatrix}, \qquad (2)$$
the dynamic perceptron only needs a single layer, as follows:
$$y = \max\!\big(0,\; \tilde{W}^T(x)\,x\big), \quad \tilde{W}(x) = \pi_1(x) W_1 + \pi_2(x) W_2, \quad W_1 = \begin{bmatrix}1\\ 1\end{bmatrix}, \quad W_2 = \begin{bmatrix}-1\\ -1\end{bmatrix}, \qquad (3)$$

where the attentions are $\pi_1(x) = 1$ if $x_1 + x_2 \le 1$ (and $0$ otherwise), and $\pi_2(x) = 1 - \pi_1(x)$. This example demonstrates that the dynamic perceptron has more representation power due to the nonlinearity.
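To make the XOR example concrete, below is a minimal numeric sketch of a single-layer dynamic perceptron. The two kernels, biases, and indicator-style attention are our own illustrative choice (the paper's exact values may differ); any aggregation satisfying the constraints of Eq. 1 with input-dependent attention works the same way.

```python
import numpy as np

# Two "kernels" (rows) and biases; illustrative values chosen by us.
W = np.array([[1.0, 1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 2.0])

def attention(x):
    # input-dependent attention: select kernel 1 when x1 + x2 <= 1, else kernel 2
    return np.array([1.0, 0.0]) if x.sum() <= 1 else np.array([0.0, 1.0])

def dynamic_perceptron(x):
    pi = attention(x)
    w_agg = pi @ W                    # aggregated weights, shape (2,)
    b_agg = pi @ b                    # aggregated bias
    return max(0.0, float(w_agg @ x + b_agg))   # ReLU activation

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, dynamic_perceptron(np.array(x, dtype=float)))  # XOR: 0, 1, 1, 0
```

Each individual kernel is linear, yet the input-dependent switch between them makes the aggregated model nonlinear, which is exactly why a single dynamic layer suffices.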
Computational Constraint: compared with the static perceptron, the dynamic perceptron has the same number of output channels but a bigger model size. It also introduces two additional computations: (a) computing the attention weights $\{\pi_k(x)\}$, and (b) aggregating the parameters $\tilde{W}(x)$ and $\tilde{b}(x)$ based upon the attention. The additional computational cost should be significantly less than that of the linear model $\tilde{W}^T(x)x + \tilde{b}(x)$. Mathematically, the computational constraint can be represented as follows:

$$O(\pi(x)) + O\Big(\sum_{k=1}^{K} \pi_k(x)\tilde{W}_k\Big) + O\Big(\sum_{k=1}^{K} \pi_k(x)\tilde{b}_k\Big) \ll O\big(\tilde{W}^T(x)\,x + \tilde{b}(x)\big), \qquad (4)$$

where $O(\cdot)$ measures the computational cost (e.g. FLOPs). Note that a fully connected layer does not satisfy this constraint, while convolution is a proper fit.
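A quick back-of-envelope check of this constraint, with sizes chosen by us for illustration: a convolution reuses its aggregated weights at every spatial position, so aggregating $K$ kernels costs a tiny fraction of applying them, whereas a fully connected layer applies its weights only once, so aggregation costs as much as $K$ forward passes.

```python
K = 4  # number of parallel kernels

# 1x1 convolution over a 14x14 feature map with 96 in/out channels
c_in = c_out = 96
h = w = 14
conv_madds = h * w * c_in * c_out       # cost of applying the aggregated kernel
conv_agg = K * c_in * c_out             # cost of aggregating K kernels
print(conv_agg / conv_madds)            # ~0.02: negligible extra cost

# fully connected layer, 1024 -> 1024: weights are used only once per input,
# so aggregating K weight matrices costs as much as K forward passes
n = 1024
fc_madds = n * n
fc_agg = K * n * n
print(fc_agg / fc_madds)                # = 4.0: violates constraint (4)
```

The ratio for convolution is $K / (HW)$, which shrinks with the spatial size of the feature map; the ratio for a fully connected layer is exactly $K$.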
3.2 Dynamic Convolution
In this subsection, we showcase a specific dynamic perceptron, dynamic convolution, that satisfies the computational constraint (Eq. 4). Similar to the dynamic perceptron, dynamic convolution (Figure 3) has $K$ convolution kernels that share the same kernel size and input/output dimensions. They are aggregated by using the attention weights $\{\pi_k(x)\}$. Following the classic design in CNNs, we use batch normalization and an activation function (e.g. ReLU) after the aggregated convolution to build a dynamic convolution layer.
Attention: we apply lightweight squeeze-and-excitation [13] to compute the kernel attentions $\{\pi_k(x)\}$ (see Figure 3). The global spatial information is first squeezed by global average pooling. Then we use two fully connected layers (with a ReLU between them) and softmax to generate normalized attention weights for the $K$ convolution kernels. The first fully connected layer reduces the dimension by a factor of 4. Different from SENet [13], which computes attention over output channels, we compute attention over convolution kernels. The computational cost of the attention is cheap: for an input feature map of dimension $H \times W \times C_{in}$, the attention requires $O(HWC_{in} + C_{in}^2/4 + C_{in}K/4)$ Mult-Adds. This is much less than the computational cost of the convolution itself, i.e. $O(HWC_{in}C_{out}D_k^2)$ Mult-Adds, where $D_k$ is the kernel size and $C_{out}$ is the number of output channels.
Kernel Aggregation: aggregating convolution kernels is computationally efficient due to the small kernel size. Aggregating $K$ convolution kernels with kernel size $D_k \times D_k$, $C_{in}$ input channels and $C_{out}$ output channels introduces $O(KC_{in}C_{out}D_k^2)$ extra Mult-Adds. Compared with the computational cost of the convolution itself ($O(HWC_{in}C_{out}D_k^2)$), the extra cost is negligible as long as $K \ll HW$. Table 1 shows the computational cost of using dynamic convolution in MobileNetV2 and MobileNetV3. For instance, for MobileNetV2 (×1.0), dynamic convolution with $K = 4$ kernels only increases the computational cost by about 4%. Note that dynamic convolution increases the model size, which is acceptable as convolution kernels are small.
          V2 (×1.0)  V2 (×0.5)  V3-Large  V3-Small
static    300M       97M        219M      66M
K=2       309.5M     100.5M     224.9M    67.8M
K=4       312.9M     101.4M     227.3M    68.5M
K=6       316.3M     102.3M     229.8M    69.3M
K=8       319.8M     103.2M     232.2M    70.1M
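The attention branch and kernel aggregation described above can be sketched in a few lines of NumPy. This is our reading of the design, not the authors' reference implementation; layer sizes, initialization, and the naive convolution loop are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, tau=30.0):
    z = np.asarray(z, float) / tau
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()

class DynamicConv:
    """Dynamic convolution sketch: K kernels aggregated per input by
    squeeze-and-excitation style attention over kernels."""
    def __init__(self, c_in, c_out, k=3, K=4, reduction=4, tau=30.0):
        self.tau = tau
        # K convolution kernels sharing kernel size and I/O dimensions
        self.weight = rng.standard_normal((K, c_out, c_in, k, k)) * 0.1
        self.bias = np.zeros((K, c_out))
        hidden = max(c_in // reduction, 1)
        self.fc1 = rng.standard_normal((hidden, c_in)) * 0.1  # squeeze FC
        self.fc2 = rng.standard_normal((K, hidden)) * 0.1     # attention FC

    def attention(self, x):
        s = x.mean(axis=(1, 2))                  # global average pool: (c_in,)
        h = np.maximum(0.0, self.fc1 @ s)        # FC + ReLU
        return softmax(self.fc2 @ h, self.tau)   # (K,) attention over kernels

    def __call__(self, x):                       # x: (c_in, H, W)
        pi = self.attention(x)
        # aggregate kernels: W~(x) = sum_k pi_k W_k (cheap: no spatial term)
        w = np.einsum('k,koihw->oihw', pi, self.weight)
        b = pi @ self.bias
        c_out, _, kh, kw = w.shape
        H, W = x.shape[1] - kh + 1, x.shape[2] - kw + 1
        out = np.empty((c_out, H, W))
        for i in range(H):                       # naive 'valid' convolution
            for j in range(W):
                patch = x[:, i:i + kh, j:j + kw]
                out[:, i, j] = np.einsum('oihw,ihw->o', w, patch) + b
        return out

layer = DynamicConv(c_in=8, c_out=16, k=3, K=4)
x = rng.standard_normal((8, 12, 12))
y = layer(x)
print(y.shape)   # (16, 10, 10)
```

Note that only one convolution is executed per input: the aggregation happens in weight space before the spatial convolution, which is where the low extra FLOPs come from.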
From CNN to DY-CNN: dynamic convolution can be easily used as a drop-in replacement for any convolution (e.g. 1×1, 3×3, group convolution, depthwise convolution) in any CNN architecture. It is also complementary to other operators (like squeeze-and-excitation [13]) and activation functions (e.g. ReLU6, h-swish [12]). In the rest of the paper, we use the prefix DY- for networks that use dynamic convolution. For example, DY-MobileNetV2 refers to MobileNetV2 with dynamic convolution.
3.3 Training Strategy for DY-CNN
Training deep dynamic convolutional neural networks (DY-CNNs) is challenging, as it requires joint optimization of all convolution kernels and attention across multiple layers. In Figure 4 (right), the blue curves show the training and validation errors for DY-MobileNetV2 (width multiplier ×0.5) over 300 epochs. It converges slowly, and the final top-1 accuracy (64.8%) degrades from its static counterpart (65.4%).
We believe the sparsity of the attention (due to softmax) only allows a small subset of kernels across layers to be optimized simultaneously, making training inefficient. This inefficiency becomes more severe for deeper networks, as the number of combinations of activated convolution kernels (those with higher attention) across layers grows exponentially. To validate this, we train a variation of DY-MobileNetV2 that only uses dynamic convolution for the last 1×1 convolution in each inverted residual bottleneck block and keeps the other two convolution layers static. The training and validation errors are shown in Figure 4 (left). Training converges faster and reaches higher accuracy (65.9%).
We address this issue by flattening the attention so that more convolution kernels are optimized simultaneously. This is achieved by using a temperature $\tau$ in the softmax as follows:

$$\pi_k = \frac{\exp(z_k / \tau)}{\sum_{j=1}^{K} \exp(z_j / \tau)}, \qquad (5)$$

where $z_k$ is the logit. The original softmax is equivalent to $\tau = 1$. As $\tau$ increases, the output distribution becomes less sparse. We found that using a larger $\tau$ (e.g. $\tau = 30$) improves the training efficiency significantly (see the red curves in Figure 4 (right)). When changing $\tau$ from 1 to 30, the accuracy boosts from 64.8% to 69.4% for DY-MobileNetV2 with width multiplier ×0.5. Even the network that uses dynamic convolution only in the last 1×1 convolution of each block benefits from this (shown in Figure 4 (left)).

4 Experimental Results
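The effect of the temperature in Eq. 5 can be seen directly: at $\tau = 1$ the attention is nearly one-hot, while a large $\tau$ (e.g. 30) makes it nearly uniform, so every kernel receives gradient during training. A small sketch with logits chosen by us:

```python
import numpy as np

def softmax_T(z, tau=1.0):
    # Eq. (5): pi_k = exp(z_k / tau) / sum_j exp(z_j / tau)
    z = np.asarray(z, float) / tau
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

z = [3.0, 1.0, 0.5, 0.2]             # example logits for K = 4 kernels
p1 = softmax_T(z, tau=1.0)           # near one-hot: most mass on one kernel
p30 = softmax_T(z, tau=30.0)         # near uniform: all kernels get gradient
print(p1.round(3), p30.round(3))
```

With these logits, $\tau = 1$ puts roughly three quarters of the attention mass on a single kernel, while $\tau = 30$ spreads it almost evenly across all four.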
In this section, we present experimental results to demonstrate the effectiveness of our dynamic convolution. We report results on image classification and single person pose estimation. We also report ablation studies to analyze different components of our approach.
4.1 ImageNet Classification
We use ImageNet [4] for all classification experiments. ImageNet has 1000 object classes, with 1,281,167 training images and 50,000 validation images. We evaluate dynamic convolution on three CNN architectures (MobileNetV2 [25], MobileNetV3 [12] and ResNet [9]), by using dynamic convolution for all convolution layers except the first layer. All dynamic convolution layers have $K = 4$ convolution kernels. The softmax temperature is set to 30 when computing attention, and the batch size is 256. We use different training setups for the three architectures, as follows:
Training setup for DY-MobileNetV2: The initial learning rate is 0.05 and is scheduled to reach zero within a single cosine cycle. The weight decay is 4e-5. All models are trained with the SGD optimizer (momentum 0.9) for 300 epochs. We use dropout rates of 0.2 and 0.1 before the last layer for larger and smaller width multipliers, respectively.
Training setup for DY-MobileNetV3: The initial learning rate is 0.1 and is scheduled to reach zero within a single cosine cycle. The weight decay is 3e-5. We use the SGD optimizer (momentum 0.9) for 300 epochs and a dropout rate of 0.2 before the last layer. We use label smoothing for DY-MobileNetV3-Large.
Training setup for DY-ResNet: The initial learning rate is 0.1 and is divided by 10 at epochs 30, 60 and 90. The weight decay is 1e-4. All models are trained with the SGD optimizer (momentum 0.9) for 100 epochs. We use a dropout rate of 0.1 before the last layer of DY-ResNet-18.
Main Results: We compare dynamic convolution with its static counterpart on three CNN architectures (MobileNetV2, MobileNetV3 and ResNet) in Table 2. Although we focus on efficient CNNs, we evaluate dynamic convolution on two shallow ResNets (ResNet-10 and ResNet-18) to show its effectiveness on 3×3 convolution, which is used only in the first layer of MobileNet V2 and V3. Without bells and whistles, dynamic convolution outperforms its static counterpart by a clear margin on all three architectures, with small (about 4%) extra computational cost. DY-ResNet gains more than 2.3% top-1 accuracy and DY-MobileNetV2 gains more than 2.4% top-1 accuracy. DY-MobileNetV3-Small is 2.3% more accurate than the state-of-the-art MobileNetV3-Small.
For MobileNetV3-Large, we cannot reproduce the baseline performance of 75.2% with a small mini-batch; that result is achieved in [12] with a large mini-batch of 4096. As such a large mini-batch does not fit on our 4 GPUs, we report results with mini-batch 256. The top-1 accuracies of our implementations of MobileNetV3-Large and DY-MobileNetV3-Large are 73.7% and 74.7%, respectively.
Network                  #Param  MAdds   Top-1  Top-5
MobileNetV2 (×1.0)       3.5M    300.0M  72.0   91.0
DY-MobileNetV2 (×1.0)    11.1M   312.9M  74.4   91.6
MobileNetV2 (×0.75)      2.6M    209.0M  69.8   89.6
DY-MobileNetV2 (×0.75)   7.0M    217.5M  72.8   90.9
MobileNetV2 (×0.5)       2.0M    97.0M   65.4   86.4
DY-MobileNetV2 (×0.5)    4.0M    101.4M  69.4   88.6
MobileNetV2 (×0.35)      1.7M    59.2M   60.3   82.9
DY-MobileNetV2 (×0.35)   2.8M    62.0M   64.9   85.5
MobileNetV3-Small        2.9M    66.0M   67.4   86.4
DY-MobileNetV3-Small     4.8M    68.5M   69.7   88.5
ResNet-18                11.1M   1.81G   70.4   89.7
DY-ResNet-18             42.7M   1.85G   72.7   90.7
ResNet-10                5.2M    0.89G   63.5   85.0
DY-ResNet-10             18.6M   0.91G   67.7   87.6
4.2 Inspecting DY-CNN
Kernel Aggregation                  Top-1  Top-5
attention: learned π_k(x)           69.4   88.6
average: π_k = 1/K                  36.0   61.5
max: kernel with largest π_k(x)     0.1    –
shuffle π_k(x) per image            14.8   30.5
shuffle π_k(x) across images        27.3   48.4
Attention on/off by input resolution (✓ = attention enabled; otherwise kernels are averaged):

112²  56²  28²  14²  7²    Top-1  Top-5
–     –    –    –    ✓     57.3   79.9
–     –    –    ✓    ✓     67.0   87.2
–     –    ✓    ✓    ✓     67.5   87.4
–     ✓    ✓    ✓    ✓     69.1   88.4
✓     ✓    ✓    ✓    ✓     69.4   88.6
✓     ✓    ✓    ✓    –     50.9   76.2
✓     ✓    ✓    –    –     42.5   68.4
✓     ✓    –    –    –     41.2   67.0
✓     –    –    –    –     37.9   63.5
–     –    –    –    –     36.0   61.5
We inspect a well-trained DY-MobileNetV2 (width multiplier ×0.5) and expect two properties: (a) the convolution kernels are diverse per layer, and (b) the attention is input dependent. We examine these two properties by contradiction. Firstly, if the convolution kernels were not diverse, the performance would be stable under different attentions. Thus, we vary the kernel aggregation per layer in three ways: averaging the kernels ($\pi_k = 1/K$), choosing the single kernel with the maximum attention, and randomly shuffling the attention over kernels per image. Compared with using the original attention, the performance of these variations degrades significantly (Table 3). When choosing the kernel with the maximum attention, the top-1 accuracy (0.1%) is as low as randomly choosing a class. This significant instability confirms the diversity of the convolution kernels. In addition, we shuffle the attention across images to check whether the attention is input dependent. The poor performance (27.3% top-1 accuracy) indicates that it is crucial for each image to use its own attention.
Furthermore, we inspect the attention across layers and find that attention is flat at low levels and sparse at high levels. This helps explain why the variations in Table 3 have poor accuracy: averaging kernels at high levels (where attention is sparse) or picking the single kernel with the maximum attention at low levels (where attention is flat) is problematic. Table 4 shows how attention across layers affects the performance. We group layers by their input resolution and switch attention on/off per group. If attention is switched off for a resolution, each layer in that group aggregates kernels by averaging. When enabling attention at the higher levels alone, the top-1 accuracy is 67.5%, close to the 69.4% obtained when using attention in all layers. If attention is used at the lower levels alone, the top-1 accuracy is a poor 42.5%.
4.3 Ablation Studies on ImageNet
We run a number of ablations to analyze DY-MobileNetV2, and use DY-MobileNetV3-Small to compare dynamic convolution with squeeze-and-excitation [13].
The number of convolution kernels ($K$): the hyperparameter $K$ controls the model complexity. Figure 5 shows the classification accuracy and computational cost of dynamic convolution for different $K$. We compare DY-MobileNetV2 with MobileNetV2 across different depth/width multipliers. Firstly, dynamic convolution outperforms the static baseline for all depth/width multipliers, even with small $K$. This demonstrates the strength of our method. In addition, the accuracy stops increasing once $K$ is larger than 4. This is because, as $K$ increases, even though the model has more representation power, it becomes harder to optimize all convolution kernels and attention simultaneously, and the network becomes more prone to overfitting.
Dynamic convolution in shallower and thinner networks: Figure 6 shows that the shallower DY-MobileNetV2 (depth ×0.5) has a better trade-off between accuracy and computational cost than the deeper static MobileNetV2 (depth ×1.0), even though shallower networks (depth ×0.5) suffer performance degradation for both DY-MobileNetV2 and MobileNetV2. Improvement on shallow networks is useful, as they are friendly to parallel computation. Furthermore, dynamic convolution achieves larger improvements on thinner and shallower networks with small width/depth multipliers. This is because thinner and shallower networks are underfitted due to their limited model size, and dynamic convolution significantly improves their capability.
Dynamic convolution at different layers: Table 5 shows the classification accuracy when using dynamic convolution at the three layers (1×1 expansion, depthwise, 1×1 projection) of the inverted residual bottleneck block in MobileNetV2 (×0.5). The accuracy improves as dynamic convolution is used in more layers, and using it in all three layers yields the best accuracy. If only one layer is allowed to use dynamic convolution, using it for the last 1×1 convolution yields the best performance.
Temperature of Softmax: the temperature $\tau$ in the softmax controls the sparsity of the attention weights and is important for training DY-CNNs effectively. Table 6 shows the classification accuracy for different temperatures; $\tau = 30$ performs best.
Comparison with SENet: Table 7 compares dynamic convolution and squeeze-and-excitation (SE) [13] on MobileNetV3-Small [12], in which the locations of the SE layers are considered optimal, as they were found by neural architecture search (NAS). Without SE, the top-1 accuracy of MobileNetV3-Small drops 2%. However, DY-MobileNetV3-Small without SE outperforms MobileNetV3-Small with SE by 1.8% top-1 accuracy. Combining dynamic convolution and SE gains an additional 0.5% improvement. This suggests that attention over kernels and attention over output channels can work together.
Network          C1  C2  C3  Top-1  Top-5
MobileNetV2      1   1   1   65.4   86.4
DY-MobileNetV2   4   1   1   67.4   87.5
DY-MobileNetV2   1   4   1   67.4   87.3
DY-MobileNetV2   1   1   4   68.2   87.9
DY-MobileNetV2   4   1   4   68.7   88.0
DY-MobileNetV2   1   4   4   68.4   87.9
DY-MobileNetV2   4   4   1   68.6   88.0
DY-MobileNetV2   4   4   4   69.4   88.6
Network          Temperature τ  Top-1  Top-5
MobileNetV2      —              65.4   86.4
DY-MobileNetV2   1              64.8   85.5
DY-MobileNetV2                  65.7   85.8
DY-MobileNetV2                  67.5   87.4
DY-MobileNetV2                  69.4   88.5
DY-MobileNetV2   30             69.4   88.6
DY-MobileNetV2                  69.2   88.4
Network                       Top-1  Top-5
MobileNetV3-Small             67.4   86.4
MobileNetV3-Small w/o SE      65.4   85.2
DY-MobileNetV3-Small          69.7   88.5
DY-MobileNetV3-Small w/o SE   69.2   88.3
4.4 COCO Single-Person Keypoint Detection
Type  Backbone                #Param  MAdds   Head   #Param  MAdds   AP    AP50  AP75  APM   APL   AR
A     ResNet-18               10.6M   1.77G   dconv  8.4M    5.4G    67.0  87.9  74.8  63.6  73.5  73.1
A     DY-ResNet-18            42.2M   1.81G   dconv  8.4M    5.4G    68.6  88.4  76.1  65.3  75.1  74.6
A     MobileNetV2 (×1.0)      2.2M    292.6M  dconv  8.4M    5.4G    64.7  87.2  72.6  61.3  71.0  71.0
A     DY-MobileNetV2 (×1.0)   9.8M    305.3M  dconv  8.4M    5.4G    67.6  88.1  75.5  64.4  74.1  73.8
A     MobileNetV2 (×0.5)      0.7M    93.7M   dconv  8.4M    5.4G    57.0  83.7  63.1  53.9  63.1  63.7
A     DY-MobileNetV2 (×0.5)   2.7M    98.0M   dconv  8.4M    5.4G    61.9  85.8  69.7  58.9  67.9  68.4
A     MobileNetV3-Large       3.0M    212.1M  dconv  8.4M    5.4G    66.3  87.9  74.5  63.1  72.5  72.6
A     DY-MobileNetV3-Large    8.6M    220.2M  dconv  8.4M    5.4G    68.2  88.2  76.5  64.8  74.8  74.2
A     MobileNetV3-Small       1.1M    62.7M   dconv  8.4M    5.4G    57.1  83.7  63.8  54.9  62.3  64.1
A     DY-MobileNetV3-Small    2.8M    65.1M   dconv  8.4M    5.4G    59.3  84.7  66.7  56.9  64.7  66.1
B     MobileNetV2 (×1.0)      2.2M    292.6M  bneck  1.2M    701.1M  64.6  87.0  72.4  61.3  71.0  71.0
B     DY-MobileNetV2 (×1.0)   9.8M    305.3M  bneck  6.3M    709.4M  68.2  88.4  76.0  65.0  74.7  74.2
B     MobileNetV2 (×0.5)      0.7M    93.7M   bneck  1.2M    701.1M  59.2  84.3  66.4  56.2  65.0  65.6
B     DY-MobileNetV2 (×0.5)   2.7M    98.0M   bneck  6.3M    709.4M  62.8  86.1  70.4  59.9  68.6  69.1
B     MobileNetV3-Large       3.0M    212.1M  bneck  1.1M    684.3M  65.7  87.4  74.1  62.3  72.2  71.7
B     DY-MobileNetV3-Large    8.6M    220.2M  bneck  5.6M    691.9M  67.8  88.2  75.8  64.7  74.1  73.8
B     MobileNetV3-Small       1.1M    62.7M   bneck  1.0M    664.2M  57.1  83.8  63.7  55.0  62.2  64.1
B     DY-MobileNetV3-Small    2.8M    65.1M   bneck  4.9M    671.1M  60.0  85.0  67.8  57.6  65.4  66.7
We use the COCO 2017 dataset [18] to evaluate dynamic convolution on single-person keypoint detection. Our models are trained on train2017, on person instances labeled with 17 keypoints. We evaluate on val2017, which contains 5000 images, and use the mean average precision (AP) over 10 object keypoint similarity (OKS) thresholds as the metric.
Implementation Details: We implement two types of networks to evaluate dynamic convolution. Type-A follows SimpleBaseline [33] by using deconvolution in the head. We use MobileNet V2 and V3 as drop-in replacements for the backbone feature extractor and compare static and dynamic convolution in the backbone alone. Type-B still uses MobileNet V2 and V3 as the backbone, but uses upsampling and MobileNetV2's inverted residual bottleneck blocks in the head. Here we compare dynamic convolution with its static counterpart in both backbone and head. The details of the head structure are shown in Table 9. For both types, we use $K = 4$ kernels in each dynamic convolution layer.
Input  Operator  exp size  #out  s
       bneck     768       256   2
       bneck     768       128   1
       bneck     384       128   1
Training setup: We follow the training setup in [26]. The human detection boxes are cropped from the image and resized to the network input resolution. Data augmentation includes random rotation, random scaling, flipping, and half-body augmentation. All models are trained from scratch for 210 epochs using the Adam optimizer [16]. The initial learning rate is 1e-3 and is dropped to 1e-4 and then 1e-5 later in the schedule. The softmax temperature of the DY-CNNs is set to a large value to flatten attention, following Section 3.3.
Testing: We follow [33, 26] in using the two-stage top-down paradigm: detecting person instances with a person detector and then predicting keypoints. We use the same person detectors provided by [33]. The keypoints are predicted on the averaged heatmap of the original and flipped images, adjusting the highest heat-value location by a quarter offset from the highest response toward the second highest response.
Main Results and Ablations: First we compare dynamic and static convolution in the backbone (Type-A). The results are shown in the top half of Table 8. Dynamic convolution gains 1.6, 2.9+ and 1.9+ AP for ResNet-18, MobileNetV2 and V3, respectively.
Secondly, we replace the heavy deconvolution head with lightweight upsampling and MobileNetV2’s inverted residual bottleneck blocks (TypeB) to make the whole network small and efficient. Thus, we can compare dynamic convolution with its static counterpart in both backbone and head. The results are shown in the bottom half of Table 8. Similar to TypeA, dynamic convolution outperforms static baselines by a clear margin. It gains 3.6+ and 2.1+ AP for MobileNetV2 and V3, respectively. These results demonstrate that our method is also effective on keypoint detection.
We perform an ablation to investigate the effect of dynamic convolution in the backbone and in the head separately (Table 10). Even though most of the improvement comes from dynamic convolution in the backbone, dynamic convolution in the head is also helpful. This is mainly because the backbone has more convolution layers than the head.
Backbone  Head     AP    AP50  AP75
static    static   59.2  84.3  66.4
static    dynamic  60.3  84.9  67.3
dynamic   static   62.3  85.6  70.0
dynamic   dynamic  62.8  86.1  70.4
5 Conclusion
In this paper, we introduced dynamic convolution, which aggregates multiple convolution kernels dynamically, based upon their attention, for each input. Compared to its static counterpart (a single convolution kernel per layer), it significantly improves the representation capability with negligible extra computational cost, and is thus friendly to efficient CNNs. Dynamic convolution can be easily integrated into existing CNN architectures. By simply replacing each convolution kernel in MobileNet (V2 and V3) with dynamic convolution, we achieve solid improvements on both image classification and human pose estimation. We hope dynamic convolution becomes a useful component for efficient network architectures.
Appendix A Appendix
In this appendix, we report running time and perform additional analysis for our dynamic convolution method.
A.1 Inference Running Time
We report the running time of dynamic MobileNetV2 (DY-MobileNetV2) with four different width multipliers (×1.0, ×0.75, ×0.5, ×0.35) and compare with its static counterpart (MobileNetV2 [25]) in Table 11. We use a single-threaded core of an Intel Xeon CPU E5-2650 v3 (2.30 GHz) to measure the running time (in milliseconds). The running time is computed by averaging the inference time over 5,000 images with batch size 1. Both MobileNetV2 and DY-MobileNetV2 are implemented in PyTorch [23].

Compared with its static counterpart, DY-MobileNetV2 consumes about 10% more running time and about 4% more Mult-Adds. The running-time overhead is higher than the Mult-Adds overhead; we believe this is because global average pooling and small inner-product operations are not optimized as efficiently as convolution. With this small additional computational cost, our dynamic convolution method significantly improves the model performance.
Network                 Top-1  MAdds   CPU (ms)
MobileNetV2 (×1.0)      72.0   300.0M  127.9
DY-MobileNetV2 (×1.0)   74.4   312.9M  141.2
MobileNetV2 (×0.75)     69.8   209.0M  99.5
DY-MobileNetV2 (×0.75)  72.8   217.5M  110.5
MobileNetV2 (×0.5)      65.4   97.0M   69.6
DY-MobileNetV2 (×0.5)   69.4   101.4M  77.4
MobileNetV2 (×0.35)     60.3   59.2M   61.1
DY-MobileNetV2 (×0.35)  64.9   62.0M   67.4
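The measurement protocol above (single thread, batch size 1, latency averaged over many images, after a warm-up) can be sketched as follows; the toy model and sizes are our own stand-ins for an actual network.

```python
import time
import numpy as np

def measure_latency(model, inputs, warmup=10):
    """Average single-input latency in milliseconds for any callable model."""
    for x in inputs[:warmup]:
        model(x)                       # warm up caches/allocators
    t0 = time.perf_counter()
    for x in inputs:
        model(x)                       # batch size 1, one input at a time
    return (time.perf_counter() - t0) / len(inputs) * 1000.0

# toy model: a single matrix multiply standing in for a network
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
model = lambda x: W @ x
data = [rng.standard_normal(256) for _ in range(100)]
print(f"{measure_latency(model, data):.3f} ms/input")
```

Averaging over many inputs and discarding the warm-up runs reduces the variance from allocator and cache effects, which is especially important when the per-image overhead being measured (attention plus kernel aggregation) is only a few percent.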

A.2 Top-1 Classification Accuracy per Class
Figure 7 plots the top-1 classification accuracy of the dynamic convolution (DY-MobileNetV2) and the static convolution (MobileNetV2 [25]) over the 1000 classes of ImageNet [4]. The comparison is performed for four width multipliers (×1.0, ×0.75, ×0.5, ×0.35). Each dot corresponds to an image class, with one top-1 accuracy for each of the two models (DY-MobileNetV2 and MobileNetV2). Since each class only has 50 images in the validation set, multiple classes often have the same number of correctly predicted images; such classes have identical accuracy and overlap in Figure 7. We use dot opacity to indicate overlapping: the darker the dot, the more classes overlap at that position.
The dynamic convolution (DY-MobileNetV2) is more accurate than its static counterpart (MobileNetV2 [25]) in the majority of classes (above the red diagonal line), ranging from easier to harder classes.
References
 [1] (2019) Once for all: train one network and specialize it for efficient deployment. ArXiv abs/1908.09791. Cited by: §2.
 [2] (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.
 [3] (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 3123–3131. External Links: Link Cited by: §2.

 [4] (2009) ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
 [5] (2019) ACNet: strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In IEEE International Conference on Computer Vision (ICCV).
 [6] (2016) Deep learning. The MIT Press. ISBN 9780262035613.
 [7] (2019) Single path one-shot neural architecture search with uniform sampling. arXiv:1904.00420.
 [8] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR).
 [9] (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
 [10] (2018) AMC: AutoML for model compression and acceleration on mobile devices. In European Conference on Computer Vision (ECCV).
 [11] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
 [12] (2019) Searching for MobileNetV3. arXiv abs/1905.02244.
 [13] (2018) Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [14] (2018) Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations (ICLR).
 [15] (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv abs/1602.07360.
 [16] (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
 [17] (2017) Runtime neural pruning. In Advances in Neural Information Processing Systems, pp. 2181–2191.
 [18] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755.
 [19] (2019) DARTS: differentiable architecture search. In International Conference on Learning Representations (ICLR).

 [20] (2018) Dynamic deep neural networks: optimizing accuracy-efficiency trade-offs by selective execution. In AAAI Conference on Artificial Intelligence (AAAI).
 [21] (2017) Learning efficient convolutional networks through network slimming. In IEEE International Conference on Computer Vision (ICCV).
 [22] (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In European Conference on Computer Vision (ECCV).
 [23] (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

 [24] (2018) Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence (AAAI).
 [25] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520.
 [26] (2019) Deep high-resolution representation learning for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [27] (2019) MnasNet: platform-aware neural architecture search for mobile. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [28] (2019) HAQ: hardware-aware automated quantization with mixed precision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [29] (2018) SkipNet: learning dynamic routing in convolutional networks. In European Conference on Computer Vision (ECCV).
 [30] (2019) FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [31] (2017) Shift: a zero FLOP, zero parameter alternative to spatial convolutions.
 [32] (2018) BlockDrop: dynamic inference paths in residual networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [33] (2018) Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision (ECCV).
 [34] (2019) SNAS: stochastic neural architecture search. In International Conference on Learning Representations (ICLR).
 [35] (2019) Quantization networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [36] (2019) Slimmable neural networks. In International Conference on Learning Representations (ICLR).
 [37] (2018) LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In European Conference on Computer Vision (ECCV).
 [38] (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [39] (2017) Trained ternary quantization. In International Conference on Learning Representations (ICLR).
 [40] (2017) Neural architecture search with reinforcement learning. arXiv abs/1611.01578.
 [41] (2018) Learning transferable architectures for scalable image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).