Dynamic Convolution: Attention over Convolution Kernels

12/07/2019 ∙ Yinpeng Chen, et al.

Light-weight convolutional neural networks (CNNs) suffer performance degradation as their low computational budgets constrain both the depth (number of convolution layers) and width (number of channels) of CNNs, resulting in limited representation capability. To address this issue, we present dynamic convolution, a new design that increases model complexity without increasing the network depth or width. Instead of using a single convolution kernel per layer, dynamic convolution aggregates multiple parallel convolution kernels dynamically based upon their attentions, which are input dependent. Assembling multiple kernels is not only computationally efficient due to the small kernel size, but also has more representation power since these kernels are aggregated in a non-linear way via attention. By simply using dynamic convolution for the state-of-the-art architecture MobileNetV3-Small, the top-1 accuracy on ImageNet classification is boosted by 2.3% with only a small (about 4%) increase in computational cost, and a solid gain is achieved on COCO keypoint detection.


1 Introduction

Interest in building light-weight and efficient neural networks has exploded recently. It not only enables new experiences on mobile devices, but also protects users' privacy by keeping personal information from being sent to the cloud. Recent works (e.g. MobileNet [11, 25, 12] and ShuffleNet [38, 22]) have shown that both efficient operator design (e.g. depth-wise convolution, channel shuffle, squeeze-and-excitation [13], asymmetric convolution [5]) and architecture search ([27, 7, 2]) are important for designing efficient convolutional neural networks.

However, even the state-of-the-art efficient CNNs (e.g. MobileNetV3 [12]) suffer significant performance degradation when the computational constraint becomes extremely low. For instance, when the computational cost reduces from 219M to 66M Multi-Adds, the top-1 ImageNet classification accuracy for MobileNetV3 drops from 75.2% to 67.4%. This is because the extremely low computational cost severely constrains both the network depth (number of layers) and width (number of channels), which are crucial for network performance but proportional to the computational cost.

Figure 1: The trade-off between computational cost (MAdds) and top-1 ImageNet classification accuracy. Dynamic convolution significantly boosts the accuracy with a small amount of extra MAdds on MobileNet V2 and V3. Best viewed in color.
Figure 2: Dynamic perceptron. It aggregates multiple linear functions dynamically based upon their attentions {π_k(x)}, which are input dependent.

This paper proposes a new operator design, named dynamic convolution, to increase the representation ability with negligible extra FLOPs. Dynamic convolution uses a set of parallel convolution kernels instead of a single convolution kernel per layer (see Figure 2). It dynamically aggregates these convolution kernels for each individual input (e.g. image) via input-dependent attention. Dynamic convolution is a non-linear function with more representation power than its static counterpart. Meanwhile, dynamic convolution is computationally efficient. It does not increase the depth or width of the network, as the parallel convolution kernels share the output channels by aggregation. It only introduces extra computational cost to compute the attention and aggregate the kernels, which is negligible compared to convolution. The key insight is that, within a reasonable model size cost (as convolution kernels are small), dynamic kernel aggregation provides an efficient way (low extra FLOPs) to boost representation capability.

Dynamic convolutional neural networks (denoted as DY-CNNs) are more difficult to train, as they require joint optimization of all convolution kernels and the attention across multiple layers. The sparsity of the attention (softmax output) only allows a small subset of kernels to be optimized simultaneously, making training inefficient. We solve this problem by using a large temperature in the softmax to flatten the attention, so that more convolution kernels can be optimized simultaneously.

We demonstrate the effectiveness of dynamic convolution on both image classification (ImageNet) and keypoint detection (COCO). Without bells and whistles, simply replacing static convolution with dynamic convolution in MobileNet V2 and V3 achieves solid improvement with only a slight increase (about 4%) in computational cost (see Figure 1). For instance, with a 100M Multi-Adds budget, our method gains 4.0% and 2.3% top-1 accuracy on image classification for MobileNetV2 and MobileNetV3, respectively.

2 Related Work

Efficient CNNs: Recently, designing efficient CNN architectures [15, 11, 25, 12, 38, 22] has been an active research area. SqueezeNet [15] reduces the number of parameters by using 1×1 convolution extensively in the fire module. MobileNetV1 [11] substantially reduces FLOPs by decomposing a 3×3 convolution into a depth-wise convolution and a point-wise convolution. Based upon this, MobileNetV2 [25] introduces inverted residuals and linear bottlenecks. MobileNetV3 [12] applies squeeze-and-excitation [13] in the residual layers and employs a platform-aware neural architecture search [27] to find the global network structure. ShuffleNet [38] further reduces the MAdds of 1×1 convolution by channel shuffle operations. ShiftNet [31] replaces expensive spatial convolution with the shift operation and point-wise convolutions. Compared with existing work, our dynamic convolution can be used to replace any static convolution kernel (e.g. 1×1, 3×3, depth-wise convolution, group convolution) and is complementary to other advanced operators like squeeze-and-excitation.

Model Compression and Quantization: Model compression [8, 21, 10] and quantization [3, 39, 37, 35, 28] approaches are also important for learning efficient neural networks. They are complementary to our work, helping reduce the model size for our dynamic convolution method.

Dynamic Deep Neural Networks: Our method is related to recent works on dynamic neural networks [17, 20, 29, 32, 36, 14] that focus on skipping part of an existing model based on the input image. D2NN [20], SkipNet [29] and BlockDrop [32] learn an additional controller for the skipping decision by using reinforcement learning. MSDNet [14] allows early exit based on the current prediction confidence. Slimmable Nets [36] learn a single neural network executable at different widths. Once-for-all [1] proposes a progressive shrinking algorithm to train one network that supports multiple sub-networks, whose accuracy matches that of independently trained networks. Compared with these works, our method has two major differences. Firstly, all convolution layers in our method are dynamic, varying per input image, while existing works focus on dynamic network structure, leaving the parameters in each layer static. Secondly, our method does not require an additional controller; the attention is embedded in each layer, enabling end-to-end training.

Neural Architecture Search: Recent research in neural architecture search (NAS) demonstrates its power for finding high-accuracy neural network architectures [40, 24, 41, 19, 34] as well as hardware-aware efficient network architectures [2, 27, 30]. The hardware-aware NAS methods incorporate hardware latency into the architecture search process by making it differentiable. [7] proposes a single-path supernet to optimize all architectures in the search space simultaneously, and then performs evolutionary architecture search to handle computational constraints. Based upon NAS, MobileNetV3 [12] shows significant improvements over human-designed baselines (e.g. MobileNetV2 [25]). Our dynamic convolution method can be easily used in advanced architectures found by NAS. Later in this paper, we will show that dynamic convolution not only improves the performance of human-designed networks (e.g. MobileNetV2), but also boosts the performance of automatically searched architectures (e.g. MobileNetV3), with low extra FLOPs. In addition, our method provides a new and effective component to enrich the search space.

3 Dynamic Convolutional Neural Networks

We describe dynamic convolutional neural networks (DY-CNNs) in this section. The goal is to provide a better trade-off between network performance and computational burden, within the scope of efficient neural networks. The two most popular strategies to boost performance are making neural networks "deeper" or "wider". However, both incur heavy computational cost and are thus not friendly to efficient neural networks.

We propose dynamic convolution, which does not increase either the depth or the width of the network, but increases the model capability by aggregating multiple convolution kernels via attention. Note that these kernels are assembled differently for different input images, which is where dynamic convolution gets its name. This section is organized as follows. We first define the generic dynamic perceptron, and then apply it to convolution. Finally, we discuss the training strategy for dynamic convolutional neural networks (DY-CNNs).

3.1 Preliminary: Dynamic Perceptron

Definition: Let us denote the traditional or static perceptron as y = g(W^T x + b), where W and b are the weight matrix and bias vector, and g is a nonlinear activation function (e.g. ReLU). We define the dynamic perceptron by aggregating multiple (K) linear functions as follows:

$$y = g\big(\tilde{W}^T(x)\,x + \tilde{b}(x)\big), \qquad \tilde{W}(x) = \sum_{k=1}^{K}\pi_k(x)\,\tilde{W}_k, \qquad \tilde{b}(x) = \sum_{k=1}^{K}\pi_k(x)\,\tilde{b}_k,$$

$$\text{s.t.}\quad 0 \le \pi_k(x) \le 1, \quad \sum_{k=1}^{K}\pi_k(x) = 1, \tag{1}$$

where π_k(x) is the attention weight for the k-th linear function W̃_k^T x + b̃_k.

Attention: the attention weights {π_k(x)} are not fixed, but vary for each input x, assembling these linear models dynamically. They represent the optimal aggregation of linear models for a given input. Due to the non-linearity embedded in both the attention π_k(x) and the activation g, the aggregated model is a non-linear function. Thus, the dynamic perceptron has more representation power than its static counterpart.
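For concreteness, the following minimal PyTorch sketch (our illustration, not the paper's code) implements Eq. 1: K parallel linear functions whose attention weights π_k(x) are produced here, as an assumption, by a single fully connected layer followed by softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicPerceptron(nn.Module):
    """Aggregates K linear functions with input-dependent attention (Eq. 1)."""
    def __init__(self, in_dim, out_dim, K=4):
        super().__init__()
        self.K = K
        # K parallel weight matrices and bias vectors: W_k, b_k
        self.weight = nn.Parameter(torch.randn(K, out_dim, in_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(K, out_dim))
        # light-weight attention head producing pi_k(x); a single FC layer here (illustrative)
        self.attn = nn.Linear(in_dim, K)

    def forward(self, x):                                  # x: (batch, in_dim)
        pi = F.softmax(self.attn(x), dim=-1)               # (batch, K), in [0, 1], sums to 1
        W = torch.einsum('bk,koi->boi', pi, self.weight)   # per-sample aggregated weight
        b = pi @ self.bias                                 # per-sample aggregated bias
        y = torch.einsum('boi,bi->bo', W, x) + b
        return F.relu(y)                                   # g(W(x)^T x + b(x))

y = DynamicPerceptron(in_dim=8, out_dim=4, K=4)(torch.randn(2, 8))  # shape (2, 4)
```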

Example (learning XOR): To make the idea of the dynamic perceptron more concrete, we begin with a simple task, i.e. learning the XOR function. In this example, we want our network to perform correctly on the four points X = {(0,0), (0,1), (1,0), (1,1)}. Compared with the solution using two static perceptron layers [6], e.g.

$$y = w^T \max\big(0,\, W^T x + b\big), \qquad W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \tag{2}$$

a dynamic perceptron only needs a single layer, for instance

$$y = \max\big(0,\, (\pi_1(x)\,w_1 + \pi_2(x)\,w_2)^T x\big), \qquad w_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad w_2 = \begin{bmatrix} -1 \\ -1 \end{bmatrix}, \tag{3}$$

where the attentions are π_1(x) = 1 − x_1 x_2 and π_2(x) = x_1 x_2. This example demonstrates that the dynamic perceptron has more representation power due to the non-linearity.
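A quick numerical check of the single-layer construction in Eq. 3 as reconstructed above (the weights and attentions below follow that reconstruction, not a specific implementation from the paper):

```python
import numpy as np

# Verify the single-layer dynamic perceptron on XOR (weights and attentions as in Eq. 3 above).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
w1, w2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])

for x in X:
    pi1 = 1.0 - x[0] * x[1]                        # attention pi_1(x)
    pi2 = x[0] * x[1]                              # attention pi_2(x)
    y = max(0.0, (pi1 * w1 + pi2 * w2) @ x)        # g = ReLU
    print(x, '->', y)                              # prints 0, 1, 1, 0: the XOR function
```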

Computational Constraint: compared with the static perceptron, the dynamic perceptron has the same number of output channels but a bigger model size. It also introduces two additional computations: (a) computing the attention weights {π_k(x)}, and (b) aggregating the parameters {W̃_k, b̃_k} based upon the attention. This additional computational cost should be significantly less than that of the linear model W̃^T(x) x + b̃(x). Mathematically, the computational constraint can be represented as follows:

$$O\big(\pi(x)\big) + O\Big(\sum_k \pi_k(x)\,\tilde{W}_k\Big) + O\Big(\sum_k \pi_k(x)\,\tilde{b}_k\Big) \ll O\big(\tilde{W}^T(x)\,x + \tilde{b}(x)\big), \tag{4}$$

where O(·) measures the computational cost (e.g. FLOPs). Note that a fully connected layer does not satisfy this constraint, while convolution is a proper fit, as its kernels are small relative to the feature map.

3.2 Dynamic Convolution

In this subsection, we showcase a specific dynamic perceptron, dynamic convolution, that satisfies the computational constraint (Eq. 4). Similar to the dynamic perceptron, dynamic convolution (Figure 3) has K convolution kernels {W̃_k, b̃_k} that share the same kernel size and input/output dimensions. They are aggregated by using the attention weights {π_k(x)}. Following the classic design in CNNs, we use batch normalization and an activation function (e.g. ReLU) after the aggregated convolution to build a dynamic convolution layer.

Figure 3: A dynamic convolution layer.

Attention: we apply light-weight squeeze-and-excitation [13] to compute the kernel attentions {π_k(x)} (see Figure 3). The global spatial information is first squeezed by global average pooling. Then we use two fully connected layers (with a ReLU between them) and softmax to generate normalized attention weights for the K convolution kernels. The first fully connected layer reduces the dimension by a factor of 4. Different from SENet [13], which computes attention over output channels, we compute attention over convolution kernels. The computational cost of the attention is cheap. For an input feature map with dimension H × W × C_in, the attention requires roughly HW·C_in + C_in²/4 + C_in·K/4 Mult-Adds. This is much less than the computational cost of the convolution itself, i.e. HW·C_in·C_out·D_k² Mult-Adds, where D_k is the kernel size and C_out is the number of output channels.

Kernel Aggregation: aggregating convolution kernels is computationally efficient because the kernels are small. Aggregating K convolution kernels with kernel size D_k × D_k, C_in input channels and C_out output channels introduces K·C_in·C_out·D_k² + K·C_out extra Mult-Adds. Compared with the computational cost of the convolution itself (HW·C_in·C_out·D_k²), the extra cost is negligible if K ≪ HW. Table 1 shows the computational cost of using dynamic convolution in MobileNetV2 and MobileNetV3. For instance, for MobileNetV2 (×1.0), dynamic convolution with K = 4 kernels only increases the computational cost by about 4%. Note that dynamic convolution increases the model size, which is acceptable as the convolution kernels are small.

K        V2 (×1.0)   V2 (×0.5)   V3-Large   V3-Small
static   300M        97M         219M       66M
K = 2    309.5M      100.5M      224.9M     67.8M
K = 4    312.9M      101.4M      227.3M     68.5M
K = 6    316.3M      102.3M      229.8M     69.3M
K = 8    319.8M      103.2M      232.2M     70.1M
Table 1: Comparison of Mult-Adds between static convolution and dynamic convolution (with K kernels) for MobileNetV2 and MobileNetV3.
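The per-layer overhead behind the numbers in Table 1 can be approximated with the cost model above. The sketch below is our illustration; the single layer shape is made up and not taken from the paper's network tables.

```python
def conv_madds(h, w, c_in, c_out, k):
    """Mult-Adds of a standard convolution: H * W * C_in * C_out * Dk^2."""
    return h * w * c_in * c_out * k * k

def dynamic_extra_madds(h, w, c_in, c_out, k, K):
    """Extra Mult-Adds of one dynamic convolution layer:
    attention (global pooling + two FC layers) plus kernel aggregation."""
    attention = h * w * c_in + c_in * (c_in // 4) + (c_in // 4) * K
    aggregation = K * c_in * c_out * k * k + K * c_out
    return attention + aggregation

# Illustrative 1x1 convolution layer shape (hypothetical, for the sake of the estimate).
base = conv_madds(14, 14, 96, 576, 1)
extra = dynamic_extra_madds(14, 14, 96, 576, 1, K=4)
print(f"relative overhead: {100.0 * extra / base:.1f}%")   # a few percent for K = 4
```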

From CNN to DY-CNN: dynamic convolution can be easily used as a drop-in replacement for any convolution (e.g. 1×1, 3×3, group convolution, depth-wise convolution) in any CNN architecture. It is also complementary to other operators (like squeeze-and-excitation [13]) and activation functions (e.g. ReLU6, h-swish [12]). In the rest of the paper, we use the prefix DY- for networks that use dynamic convolution. For example, DY-MobileNetV2 refers to using dynamic convolution in MobileNetV2.
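A minimal PyTorch sketch of a dynamic convolution layer as described above (squeeze, two FC layers with a ReLU, temperature softmax over K kernels, then one convolution with the aggregated kernel). This is our illustration rather than the official implementation; the batch normalization and activation that follow the aggregated convolution are omitted, and the per-sample convolutions are run as a single grouped convolution over the batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Dynamic convolution: attention over K kernels (a sketch, not the authors' code)."""
    def __init__(self, c_in, c_out, kernel_size, stride=1, padding=0,
                 K=4, temperature=30, reduction=4):
        super().__init__()
        self.K, self.temperature = K, temperature
        self.c_out, self.stride, self.padding = c_out, stride, padding
        # K parallel kernels and biases
        self.weight = nn.Parameter(
            torch.randn(K, c_out, c_in, kernel_size, kernel_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(K, c_out))
        # attention branch: global average pool -> FC -> ReLU -> FC -> softmax
        self.fc1 = nn.Linear(c_in, c_in // reduction)
        self.fc2 = nn.Linear(c_in // reduction, K)

    def forward(self, x):
        B, C, H, W = x.shape
        s = x.mean(dim=(2, 3))                                  # squeeze: (B, C)
        logits = self.fc2(F.relu(self.fc1(s)))
        pi = F.softmax(logits / self.temperature, dim=1)        # (B, K) kernel attention
        # aggregate kernels per sample: (B, c_out, c_in, k, k)
        W_agg = torch.einsum('bk,koihw->boihw', pi, self.weight)
        b_agg = pi @ self.bias                                  # (B, c_out)
        # run all per-sample convolutions as one grouped convolution
        x = x.reshape(1, B * C, H, W)
        W_agg = W_agg.reshape(B * self.c_out, C, *W_agg.shape[-2:])
        out = F.conv2d(x, W_agg, b_agg.reshape(-1),
                       stride=self.stride, padding=self.padding, groups=B)
        return out.reshape(B, self.c_out, *out.shape[-2:])

# Usage sketch: y = DynamicConv2d(32, 64, 3, padding=1, K=4)(torch.randn(8, 32, 56, 56))
```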

3.3 Training Strategy for DY-CNN

Training deep dynamic convolutional neural networks (DY-CNNs) is more challenging, as it requires joint optimization of all convolution kernels and the attention across multiple layers. In Figure 4-Right, the blue curves show the training and validation errors for DY-MobileNetV2 with width multiplier ×0.5 over 300 epochs. It converges slowly and the final top-1 accuracy (64.8%) degrades from its static counterpart (65.4%).

We believe the sparsity of the attention (due to softmax) only allows a small subset of kernels across layers to be optimized simultaneously, making training inefficient. This inefficiency becomes more severe for deeper networks, as the number of combinations of activated convolution kernels (those with higher attention) across layers increases exponentially. To validate this, we train a variation of DY-MobileNetV2 that only uses dynamic convolution for the last convolution in each inverted residual bottleneck block and keeps the other two convolution layers static. The training and validation errors are shown in Figure 4-Left. The training converges faster and reaches higher accuracy (65.9%).

Figure 4: Training and validation error for different softmax temperatures. Left: using dynamic convolution only for the last layer in each inverted residual bottleneck block. Right: using dynamic convolution for all layers. We use DY-MobileNetV2 with width multiplier ×0.5, and each dynamic convolution layer has K = 4 convolution kernels. Best viewed in color.

We address this issue by flattening the attention so that more convolution kernels can be optimized simultaneously. This is achieved by using a temperature in the softmax as follows:

$$\pi_k = \frac{\exp(z_k/\tau)}{\sum_j \exp(z_j/\tau)}, \tag{5}$$

where z_k is the logit for the k-th kernel and τ is the temperature. The original softmax is equivalent to τ = 1. As τ increases, the output distribution becomes less sparse. We found that using a larger temperature (e.g. τ = 30) improves the training efficiency significantly (see the red curves in Figure 4-Right). When changing τ from 1 to 30, the accuracy is boosted from 64.8% to 69.4% for DY-MobileNetV2 with width multiplier ×0.5. Even the network that uses dynamic convolution only for the last layer of each block also benefits from this (shown in Figure 4-Left).

4 Experimental Results

In this section, we present experimental results to demonstrate the effectiveness of our dynamic convolution. We report results on image classification and single person pose estimation. We also report ablation studies to analyze different components of our approach.

4.1 ImageNet Classification

We use ImageNet [4] for all classification experiments. ImageNet has 1000 object classes, with 1,281,167 images for training and 50,000 images for validation. We evaluate dynamic convolution on three CNN architectures (MobileNetV2 [25], MobileNetV3 [12] and ResNet [9]) by using dynamic convolution for all convolution layers except the first layer. All dynamic convolution layers have K = 4 convolution kernels. The softmax temperature is set to 30 for computing attentions, and the batch size is 256. We use different training setups for the three architectures as follows:

Training setup for DY-MobileNetV2: The initial learning rate is 0.05 and is scheduled to reach zero within a single cosine cycle. The weight decay is 4e-5. All models are trained using the SGD optimizer with 0.9 momentum for 300 epochs. We use a dropout rate of 0.2 before the last layer for the larger width multipliers and 0.1 for the smaller ones. A sketch of this optimizer and schedule setup is given after the DY-ResNet setup below.

Training setup for DY-MobileNetV3: The initial learning rate is 0.1 and is scheduled to reach zero within a single cosine cycle. The weight decay is 3e-5. We use the SGD optimizer with 0.9 momentum for 300 epochs and a dropout rate of 0.2 before the last layer. We use label smoothing for DY-MobileNetV3-Large.

Training setup for DY-ResNet: The initial learning rate is 0.1 and is divided by 10 at epochs 30, 60 and 90. The weight decay is 1e-4. All models are trained using the SGD optimizer with 0.9 momentum for 100 epochs. We use a dropout rate of 0.1 before the last layer of DY-ResNet-18.
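As a sketch, the DY-MobileNetV2 optimizer and cosine schedule described above could be set up as follows in PyTorch; the model is a stand-in placeholder, not the actual network definition.

```python
import torch

model = torch.nn.Linear(10, 10)   # stand-in; replace with the DY-MobileNetV2 being trained
optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
                            momentum=0.9, weight_decay=4e-5)
# a single cosine cycle that decays the learning rate from 0.05 to zero over 300 epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300, eta_min=0.0)

for epoch in range(300):
    # ... one training epoch over ImageNet goes here ...
    scheduler.step()
```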

Main Results: We compare dynamic convolution with its static counterpart for three CNN architectures (MobileNetV2, MobileNetV3 and ResNet) in Table 2. Although we focus on efficient CNNs, we also evaluate dynamic convolution on two shallow ResNets (ResNet-10 and ResNet-18) to show its effectiveness on 3×3 convolution, which is only used in the first layer of MobileNet V2 and V3. Without bells and whistles, dynamic convolution outperforms its static counterpart by a clear margin for all three architectures, with a small extra computational cost (about 4%). DY-ResNet gains more than 2.3% top-1 accuracy and DY-MobileNetV2 gains more than 2.4% top-1 accuracy. DY-MobileNetV3-Small is 2.3% more accurate than the state-of-the-art MobileNetV3-Small.

For MobileNetV3-Large, we cannot reproduce the baseline performance of 75.2%, which is achieved in [12] with a large mini-batch of 4096. As such a large mini-batch is not feasible on our 4 GPUs, we report results with a mini-batch of 256. The top-1 accuracies of our implementation of MobileNetV3-Large and DY-MobileNetV3-Large are 73.7% and 74.7%, respectively.

Network                    #Param   MAdds     Top-1         Top-5
MobileNetV2 (×1.0)          3.5M    300.0M    72.0          91.0
DY-MobileNetV2 (×1.0)      11.1M    312.9M    74.4 (+2.4)   91.6
MobileNetV2 (×0.75)         2.6M    209.0M    69.8          89.6
DY-MobileNetV2 (×0.75)      7.0M    217.5M    72.8 (+3.0)   90.9
MobileNetV2 (×0.5)          2.0M     97.0M    65.4          86.4
DY-MobileNetV2 (×0.5)       4.0M    101.4M    69.4 (+4.0)   88.6
MobileNetV2 (×0.35)         1.7M     59.2M    60.3          82.9
DY-MobileNetV2 (×0.35)      2.8M     62.0M    64.9 (+4.6)   85.5
MobileNetV3-Small           2.9M     66.0M    67.4          86.4
DY-MobileNetV3-Small        4.8M     68.5M    69.7 (+2.3)   88.5
ResNet-18                  11.1M     1.81G    70.4          89.7
DY-ResNet-18               42.7M     1.85G    72.7 (+2.3)   90.7
ResNet-10                   5.2M     0.89G    63.5          85.0
DY-ResNet-10               18.6M     0.91G    67.7 (+4.2)   87.6
Table 2: Comparing DY-CNNs with CNNs on ImageNet [4] classification. We use dynamic convolution with K = 4 kernels for all convolution layers except the first layer. The numbers in brackets denote the top-1 improvement over the static baseline.

4.2 Inspecting DY-CNN

Kernel Aggregation                         Top-1   Top-5
attention: Σ_k π_k(x)·W̃_k                  69.4    88.6
average: (1/K)·Σ_k W̃_k                     36.0    61.5
max: W̃_k with k = argmax_k π_k(x)           0.1    –
shuffle attention per image                14.8    30.5
shuffle attention across images            27.3    48.4
Table 3: Classification results on ImageNet [4] for different kernel aggregations. We use DY-MobileNetV2 with width multiplier ×0.5, whose performance is shown in the first line. W̃_k refers to the k-th convolution kernel within a dynamic convolution layer. Shuffle per image means shuffling the attention weights of the same image over different kernels. Shuffle across images means using the attention of one image for another image, i.e. applying π(x_i) to x_j. The poor performance of the bottom four aggregations demonstrates that DY-CNN is dynamic in a specific way, which is encoded in the attention.
Input Resolution    Top-1   Top-5
57.3 79.9
67.0 87.2
67.5 87.4
69.1 88.4
69.4 88.6
50.9 76.2
42.5 68.4
41.2 67.0
37.9 63.5
36.0 61.5
Table 4: Classification results on ImageNet [4] for enabling/disabling attention at different input resolutions. Here we use DY-MobileNetV2 with width multiplier ×0.5. This model uses dynamic convolution for all convolution layers except the first layer. Each resolution has two options: ✗ and ✓. ✗ indicates that each layer at that resolution aggregates kernels by averaging them, while ✓ indicates that each layer at that resolution uses the attention π_k(x). We can see that attention is more effective at higher layers with lower resolution.

We inspect a well-trained DY-MobileNetV2 with width multiplier ×0.5 and expect two properties: (a) the convolution kernels are diverse per layer, and (b) the attention is input dependent. We examine these two properties by contradiction. Firstly, if the convolution kernels were not diverse, the performance would be stable under different attentions. Thus, we vary the kernel aggregation per layer in three different ways: averaging the kernels, choosing the single convolution kernel with the maximum attention, and randomly shuffling the attention weights over kernels per image. Compared with using the original attention, the performance of these variations degrades significantly (shown in Table 3). When choosing the convolution kernel with the maximum attention, the top-1 accuracy (0.1%) is as low as randomly choosing a class. This significant instability confirms the diversity of the convolution kernels. In addition, we shuffle attentions across images to check whether the attention is input dependent. The poor performance (27.3% top-1 accuracy) indicates that it is crucial for each image to use its own attention.
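The diagnostic aggregations in Table 3 can be probed at inference time along the following lines. This is an illustrative sketch that assumes access to a layer's attention weights and kernels; it is not the authors' evaluation code, and the function name is our own.

```python
import torch
import torch.nn.functional as F

def aggregate(pi, weight, mode='attention'):
    """Aggregate K kernels under the diagnostic modes of Table 3 (illustrative).
    pi: (B, K) attention weights, weight: (K, c_out, c_in, k, k)."""
    B, K = pi.shape
    if mode == 'average':                         # (1/K) * sum_k W_k, ignores attention
        pi = torch.full_like(pi, 1.0 / K)
    elif mode == 'max':                           # keep only the kernel with max attention
        pi = F.one_hot(pi.argmax(dim=1), K).to(pi.dtype)
    elif mode == 'shuffle_per_image':             # permute attention over kernels per image
        pi = torch.stack([p[torch.randperm(K)] for p in pi])
    elif mode == 'shuffle_across_images':         # use another image's attention
        pi = pi[torch.randperm(B)]
    return torch.einsum('bk,koihw->boihw', pi, weight)   # per-sample aggregated kernels
```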

Furthermore, we inspect the attention across layers and find that attentions are flat at low levels and sparse at high levels. This helps explain why the variations in Table 3 have poor accuracy: averaging kernels at high levels (where attention is sparse) or picking a single convolution kernel at low levels (where attention is flat) is problematic. Table 4 shows how attention at different layers affects the performance. We group layers by their input resolutions and switch attention on/off for these groups. If attention is switched off for a resolution, each layer at that resolution aggregates kernels by averaging. When enabling attention at the higher levels alone (the two lowest resolutions), the top-1 accuracy is 67.5%, close to the performance (69.4%) of using attention for all layers. If attention is used for the lower levels alone (the three highest resolutions), the top-1 accuracy is a poor 42.5%.

4.3 Ablation Studies on ImageNet

We run a number of ablations to analyze DY-MobileNetV2, and use DY-MobileNetV3-Small to compare dynamic convolution with squeeze-and-excitation [13].

Figure 5: Comparing DY-MobileNetV2 with MobileNetV2 for different depth and width multipliers. Left: depth multiplier 1.0, Middle: depth multiplier 0.7, Right: depth multiplier 0.5. Each curve has four different width multipliers 1.0, 0.75, 0.5, 0.35 (from right to left). Dynamic convolution outperforms its static counterpart by a clear margin for all width/depth multipliers. Best viewed in color.

The number of convolution kernels (K): the hyper-parameter K controls the model complexity. Figure 5 shows the classification accuracy and computational cost of dynamic convolution for different K. We compare DY-MobileNetV2 with MobileNetV2 for different depth/width multipliers. Firstly, dynamic convolution outperforms the static baseline for all depth/width multipliers, even with small K. This demonstrates the strength of our method. In addition, the accuracy stops increasing once K is larger than 4. This is because, as K increases, even though the model has more representation power, it becomes more difficult to optimize all convolution kernels and the attention simultaneously, and the network is more prone to over-fitting.

Figure 6: Comparison between a shallower DY-MobileNetV2 and a deeper MobileNetV2. The shallower DY-MobileNetV2 (depth ×0.5) has a better trade-off between accuracy and computational cost than the deeper MobileNetV2 (depth ×1.0). To make the comparison fair, we also plot the deeper DY-MobileNetV2 and the shallower MobileNetV2. For both DY-MobileNetV2 and MobileNetV2, deeper networks have better performance. Best viewed in color.

Dynamic convolution in shallower and thinner networks: Figure 6 shows that the shallower DY-MobileNetV2 (depth ×0.5) has a better trade-off between accuracy and computational cost than the deeper MobileNetV2 (depth ×1.0), even though shallower networks (depth ×0.5) suffer performance degradation for both DY-MobileNetV2 and MobileNetV2. Improvement on shallow networks is useful as they are friendly to parallel computation. Furthermore, dynamic convolution achieves more improvement for thinner and shallower networks with small width/depth multipliers. This is because thinner and shallower networks are under-fitted due to their limited model size, and dynamic convolution significantly improves their capability.

Dynamic convolution at different layers: Table 5 shows the classification accuracy of using dynamic convolution at the three different layers (1×1 expansion, depth-wise, 1×1 projection) of an inverted residual bottleneck block in MobileNetV2 ×0.5. The accuracy improves as dynamic convolution is used for more layers. Using dynamic convolution for all three layers yields the best accuracy. If only one layer is allowed to use dynamic convolution, using it for the last 1×1 convolution yields the best performance.

Temperature of Softmax: the temperature τ in the softmax controls the sparsity of the attention weights and is important for training DY-CNNs effectively. Table 6 shows the classification accuracy for different temperatures; a large temperature (around τ = 30) yields the best performance.

Comparison with SENet: Table 7 compares dynamic convolution with squeeze-and-excitation (SE [13]) on MobileNetV3-Small [12], in which the locations of the SE layers are considered optimal as they were found by neural architecture search (NAS). Without SE, the top-1 accuracy of MobileNetV3-Small drops by 2%. However, DY-MobileNetV3-Small without SE still outperforms MobileNetV3-Small with SE by 1.8% top-1 accuracy. Combining dynamic convolution and SE gains an additional 0.5% improvement. This suggests that attention over kernels and attention over output channels can work together.

Network              C1   C2   C3   Top-1         Top-5
MobileNetV2 (×0.5)   1    1    1    65.4          86.4
DY-MobileNetV2       4    1    1    67.4 (+2.0)   87.5
DY-MobileNetV2       1    4    1    67.4 (+2.0)   87.3
DY-MobileNetV2       1    1    4    68.2 (+2.8)   87.9
DY-MobileNetV2       4    1    4    68.7 (+3.3)   88.0
DY-MobileNetV2       1    4    4    68.4 (+3.0)   87.9
DY-MobileNetV2       4    4    1    68.6 (+3.2)   88.0
DY-MobileNetV2       4    4    4    69.4 (+4.0)   88.6
Table 5: Classification results of using dynamic convolution at different layers in MobileNetV2 ×0.5. C1, C2 and C3 indicate the 1×1 convolution that expands output channels, the depth-wise convolution, and the 1×1 convolution that shrinks output channels in the inverted residual bottleneck block, respectively. C1 = 1 indicates using static convolution, while C1 = 4 indicates using dynamic convolution with 4 kernels. The numbers in brackets denote the top-1 improvement over the baseline.
Network              Temperature   Top-1         Top-5
MobileNetV2 (×0.5)   –             65.4          86.4
DY-MobileNetV2       τ = 1         64.8 (−0.6)   85.5
DY-MobileNetV2                     65.7 (+0.3)   85.8
DY-MobileNetV2                     67.5 (+2.1)   87.4
DY-MobileNetV2                     69.4 (+4.0)   88.5
DY-MobileNetV2       τ = 30        69.4 (+4.0)   88.6
DY-MobileNetV2                     69.2 (+3.8)   88.4
Table 6: Classification results on ImageNet [4] for different softmax temperatures. The numbers in brackets denote the top-1 improvement over the static baseline.
Network                        Top-1         Top-5
MobileNetV3-Small              67.4          86.4
MobileNetV3-Small w/o SE       65.4          85.2
DY-MobileNetV3-Small           69.7 (+2.3)   88.5
DY-MobileNetV3-Small w/o SE    69.2 (+1.8)   88.3
Table 7: Comparing dynamic convolution with squeeze-and-excitation (SE [13]) on MobileNetV3-Small. The numbers in brackets denote the top-1 improvement over the baseline. Compared with static convolution with SE, using dynamic convolution without SE gains 1.8% top-1 accuracy.

4.4 COCO Single-Person Keypoint Detection

Type  Backbone Network          #Param   MAdds     Head    #Param   MAdds     AP            AP50   AP75   AP(M)   AP(L)   AR
A     ResNet-18                 10.6M    1.77G     dconv   8.4M     5.4G      67.0          87.9   74.8   63.6    73.5    73.1
A     DY-ResNet-18              42.2M    1.81G     dconv   8.4M     5.4G      68.6 (+1.6)   88.4   76.1   65.3    75.1    74.6
A     MobileNetV2 (×1.0)         2.2M    292.6M    dconv   8.4M     5.4G      64.7          87.2   72.6   61.3    71.0    71.0
A     DY-MobileNetV2 (×1.0)      9.8M    305.3M    dconv   8.4M     5.4G      67.6 (+2.9)   88.1   75.5   64.4    74.1    73.8
A     MobileNetV2 (×0.5)         0.7M     93.7M    dconv   8.4M     5.4G      57.0          83.7   63.1   53.9    63.1    63.7
A     DY-MobileNetV2 (×0.5)      2.7M     98.0M    dconv   8.4M     5.4G      61.9 (+4.9)   85.8   69.7   58.9    67.9    68.4
A     MobileNetV3-Large          3.0M    212.1M    dconv   8.4M     5.4G      66.3          87.9   74.5   63.1    72.5    72.6
A     DY-MobileNetV3-Large       8.6M    220.2M    dconv   8.4M     5.4G      68.2 (+1.9)   88.2   76.5   64.8    74.8    74.2
A     MobileNetV3-Small          1.1M     62.7M    dconv   8.4M     5.4G      57.1          83.7   63.8   54.9    62.3    64.1
A     DY-MobileNetV3-Small       2.8M     65.1M    dconv   8.4M     5.4G      59.3 (+2.2)   84.7   66.7   56.9    64.7    66.1
B     MobileNetV2 (×1.0)         2.2M    292.6M    bneck   1.2M     701.1M    64.6          87.0   72.4   61.3    71.0    71.0
B     DY-MobileNetV2 (×1.0)      9.8M    305.3M    bneck   6.3M     709.4M    68.2 (+3.6)   88.4   76.0   65.0    74.7    74.2
B     MobileNetV2 (×0.5)         0.7M     93.7M    bneck   1.2M     701.1M    59.2          84.3   66.4   56.2    65.0    65.6
B     DY-MobileNetV2 (×0.5)      2.7M     98.0M    bneck   6.3M     709.4M    62.8 (+3.6)   86.1   70.4   59.9    68.6    69.1
B     MobileNetV3-Large          3.0M    212.1M    bneck   1.1M     684.3M    65.7          87.4   74.1   62.3    72.2    71.7
B     DY-MobileNetV3-Large       8.6M    220.2M    bneck   5.6M     691.9M    67.8 (+2.1)   88.2   75.8   64.7    74.1    73.8
B     MobileNetV3-Small          1.1M     62.7M    bneck   1.0M     664.2M    57.1          83.8   63.7   55.0    62.2    64.1
B     DY-MobileNetV3-Small       2.8M     65.1M    bneck   4.9M     671.1M    60.0 (+2.9)   85.0   67.8   57.6    65.4    66.7
Table 8: Keypoint detection results on the COCO validation set. All models are trained from scratch. The top half uses dynamic convolution in the backbone and deconvolution in the head (Type A). The bottom half uses inverted residual bottleneck blocks in the head and dynamic convolution in both the backbone and head (Type B). Each dynamic convolution layer has K = 4 kernels. The numbers in brackets denote the AP improvement over the static baseline.

We use the COCO 2017 dataset [18] to evaluate dynamic convolution on single-person keypoint detection. Our models are trained on train2017 (around 57K images and 150K person instances labeled with 17 keypoints). We evaluate our method on val2017, which contains 5,000 images, and use the mean average precision (AP) over 10 object keypoint similarity (OKS) thresholds as the metric.

Implementation Details: We implement two types of networks to evaluate dynamic convolution. Type-A follows SimpleBaseline [33] by using deconvolution in the head. We use MobileNet V2 and V3 as drop-in replacements for the backbone feature extractor and compare static and dynamic convolution in the backbone alone. Type-B still uses MobileNet V2 and V3 as the backbone, but uses upsampling and MobileNetV2's inverted residual bottleneck blocks in the head. Here we compare dynamic convolution with its static counterpart in both the backbone and the head. The details of the head structure are shown in Table 9. For both types, we use K = 4 kernels in each dynamic convolution layer.

Input   Operator   exp size   #out   n
–       bneck      768        256    2
–       bneck      768        128    1
–       bneck      384        128    1
Table 9: Light-weight head structure for keypoint detection. We use MobileNetV2's inverted residual bottleneck block [25]. Each row corresponds to a stage, which starts with a bilinear upsampling operator that scales up the feature map by 2. #out denotes the number of output channels of that stage, and n denotes the number of inverted residual bottleneck blocks in that stage. bneck refers to MobileNetV2's inverted residual bottleneck block.

Training setup: We follow the training setup in [26]. The human detection boxes are cropped from the image and resized to the network input resolution. The data augmentation includes random rotation, random scale, flipping, and half-body data augmentation. All models are trained from scratch for 210 epochs using the Adam optimizer [16]. The initial learning rate is set to 1e-3 and is dropped to 1e-4 and 1e-5 at the 170th and 200th epoch, respectively. The softmax temperature in DY-CNNs is set to 30, as in the classification experiments.

Testing: We follow [33, 26] and use the two-stage top-down paradigm: detect person instances using a person detector and then predict keypoints. We use the same person detectors provided by [33]. The keypoints are predicted on the averaged heatmap of the original and flipped images, adjusting the highest-response location by a quarter offset in the direction from the highest response to the second highest response.
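The quarter-offset adjustment can be sketched along the lines of the common SimpleBaseline-style post-processing; this is our illustration under the assumption that the averaging of the original and flipped heatmaps has already happened upstream.

```python
import numpy as np

def heatmap_to_keypoint(hm):
    """Peak location with a quarter-pixel shift toward the second-highest neighbor.
    hm: (H, W) averaged heatmap for one joint."""
    H, W = hm.shape
    py, px = np.unravel_index(np.argmax(hm), hm.shape)   # highest response
    x, y = float(px), float(py)
    if 0 < px < W - 1 and 0 < py < H - 1:
        # shift by 0.25 pixel toward the larger of the two neighbors in each direction
        diff = np.array([hm[py, px + 1] - hm[py, px - 1],
                         hm[py + 1, px] - hm[py - 1, px]])
        x, y = x + 0.25 * np.sign(diff[0]), y + 0.25 * np.sign(diff[1])
    return x, y, hm[py, px]   # coordinates in heatmap space plus the peak score
```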

Main Results and Ablations: Firstly, we compare dynamic convolution and static convolution in the backbone (Type-A). The results are shown in the top half of Table 8. Dynamic convolution gains 1.6, 2.9+, and 1.9+ AP for ResNet-18, MobileNetV2 and MobileNetV3, respectively.

Secondly, we replace the heavy deconvolution head with light-weight upsampling and MobileNetV2’s inverted residual bottleneck blocks (Type-B) to make the whole network small and efficient. Thus, we can compare dynamic convolution with its static counterpart in both backbone and head. The results are shown in the bottom half of Table 8. Similar to Type-A, dynamic convolution outperforms static baselines by a clear margin. It gains 3.6+ and 2.1+ AP for MobileNetV2 and V3, respectively. These results demonstrate that our method is also effective on keypoint detection.

We perform an ablation to investigate the effect of dynamic convolution in the backbone and the head separately (Table 10). Even though most of the improvement comes from dynamic convolution in the backbone, dynamic convolution in the head is also helpful. This is mainly because the backbone has more convolution layers than the head.

Backbone   Head      AP            AP50   AP75
static     static    59.2          84.3   66.4
static     dynamic   60.3 (+1.1)   84.9   67.3
dynamic    static    62.3 (+3.1)   85.6   70.0
dynamic    dynamic   62.8 (+3.6)   86.1   70.4
Table 10: Comparing dynamic convolution with static convolution in the backbone and the head. We use MobileNetV2 with width multiplier ×0.5 as the backbone and use upsampling and inverted residual bottleneck blocks (see Table 9) in the head. The numbers in brackets denote the AP improvement over the static baseline. Dynamic convolution improves AP in both the backbone and the head.

5 Conclusion

In this paper, we introduce dynamic convolution, which aggregates multiple convolution kernels dynamically, based upon their attentions, for each input. Compared to its static counterpart (a single convolution kernel per layer), it significantly improves the representation capability with negligible extra computational cost and is thus more friendly to efficient CNNs. Our dynamic convolution can be easily integrated into existing CNN architectures. By simply replacing each convolution kernel in MobileNet (V2 and V3) with dynamic convolution, we achieve solid improvement for both image classification and human pose estimation. We hope dynamic convolution becomes a useful component for efficient network architectures.

Figure 7: Comparing DY-MobileNetV2 with MobileNetV2 per class on the ImageNet validation set [4]. We calculate the top-1 accuracy per class for both DY-MobileNetV2 and MobileNetV2 with four different width multipliers (0.35, 0.5, 0.75 and 1.0), and plot all 1000 classes. Each dot corresponds to an image class. Darker dots indicate that multiple classes overlap at these positions. DY-MobileNetV2 is more accurate than its static counterpart for the majority of classes (above the diagonal red line), ranging from easier classes to harder classes. Each dynamic convolution layer in DY-MobileNetV2 has K = 4 convolution kernels. Best viewed in color.

Appendix A Appendix

In this appendix, we report running time and perform additional analysis for our dynamic convolution method.

A.1 Inference Running Time

We report the running time of dynamic MobileNetV2 (DY-MobileNetV2) with four different width multipliers (×1.0, ×0.75, ×0.5, ×0.35) and compare with its static counterpart (MobileNetV2 [25]) in Table 11. We use a single-threaded core of an Intel Xeon CPU E5-2650 v3 (2.30GHz) to measure the running time (in milliseconds). The running time is calculated by averaging the inference time of 5,000 images with batch size 1. Both MobileNetV2 and DY-MobileNetV2 are implemented using PyTorch [23].

Compared with its static counterpart, DY-MobileNetV2 consumes about 10% more running time for about 4% more Multi-Adds. The running-time overhead is higher than the Multi-Adds overhead. We believe this is because global average pooling and the small inner-product operations are not as well optimized as convolution. With this small additional computational cost, our dynamic convolution method significantly improves the model performance.
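A sketch of how this single-threaded, batch-size-1 CPU measurement can be reproduced in PyTorch; the model below is a stand-in placeholder and absolute numbers depend on the machine.

```python
import time
import torch

torch.set_num_threads(1)                       # single-threaded CPU measurement
model = torch.nn.Sequential(                   # stand-in; use (DY-)MobileNetV2 here
    torch.nn.Conv2d(3, 32, 3, stride=2, padding=1), torch.nn.ReLU())
model.eval()

times = []
with torch.no_grad():
    for _ in range(100):                       # the paper averages over 5,000 images
        x = torch.randn(1, 3, 224, 224)        # batch size 1
        start = time.perf_counter()
        model(x)
        times.append((time.perf_counter() - start) * 1000.0)
print(f"avg latency: {sum(times) / len(times):.1f} ms")
```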

Network                   Top-1         MAdds     CPU (ms)
MobileNetV2 (×1.0)        72.0          300.0M    127.9
DY-MobileNetV2 (×1.0)     74.4 (+2.4)   312.9M    141.2
MobileNetV2 (×0.75)       69.8          209.0M     99.5
DY-MobileNetV2 (×0.75)    72.8 (+3.0)   217.5M    110.5
MobileNetV2 (×0.5)        65.4           97.0M     69.6
DY-MobileNetV2 (×0.5)     69.4 (+4.0)   101.4M     77.4
MobileNetV2 (×0.35)       60.3           59.2M     61.1
DY-MobileNetV2 (×0.35)    64.9 (+4.6)    62.0M     67.4

Table 11: Comparing DY-MobileNetV2 with MobileNetV2 [25] on ImageNet [4] classification. We use dynamic convolution with K = 4 kernels for all convolution layers in DY-MobileNetV2 except the first layer. CPU: CPU time in milliseconds measured on a single core of an Intel Xeon CPU E5-2650 v3 (2.30GHz). The running time is calculated by averaging the inference time of 5,000 images with batch size 1. The numbers in brackets denote the top-1 improvement over the baseline.

A.2 Top-1 Classification Accuracy per Class

Figure 7 plots the per-class top-1 classification accuracy for both dynamic convolution (DY-MobileNetV2) and static convolution (MobileNetV2 [25]) over the 1000 classes in ImageNet [4]. The comparison is performed for four different width multipliers (×1.0, ×0.75, ×0.5, ×0.35). Each dot corresponds to an image class, which has two top-1 accuracies, one for each model (DY-MobileNetV2 and MobileNetV2). Since each class only has 50 images in the validation set, it is likely that multiple classes have the same number of correctly predicted images. Thus, multiple classes may have the same accuracy and overlap in Figure 7. We use dot opacity to indicate overlapping: the darker the dot, the more classes overlap at that position.

Dynamic convolution (DY-MobileNetV2) is more accurate than its static counterpart (MobileNetV2 [25]) for the majority of classes (above the red diagonal line), ranging from easier classes to harder classes.

References

  • [1] H. Cai, C. Gan, and S. Han (2019) Once for all: train one network and specialize it for efficient deployment. ArXiv abs/1908.09791. Cited by: §2.
  • [2] H. Cai, L. Zhu, and S. Han (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.
  • [3] M. Courbariaux, Y. Bengio, and J. David (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 3123–3131. External Links: Link Cited by: §2.
  • [4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §A.2, Table 11, §4.1, Table 2, Table 3, Table 4, Table 6, Figure 7.
  • [5] X. Ding, Y. Guo, G. Ding, and J. Han (2019-10) ACNet: strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1.
  • [6] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. The MIT Press. External Links: ISBN 0262035618, 9780262035613 Cited by: §3.1.
  • [7] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun (2019) Single path one-shot neural architecture search with uniform sampling. External Links: 1904.00420 Cited by: §1, §2.
  • [8] S. Han, H. Mao, and W. Dally (2016-10) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
  • [10] Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018-09) AMC: automl for model compression and acceleration on mobile devices. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • [11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1, §2.
  • [12] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam (2019) Searching for mobilenetv3. CoRR abs/1905.02244. External Links: Link, 1905.02244 Cited by: §1, §1, §2, §2, §3.2, §4.1, §4.1, §4.3.
  • [13] J. Hu, L. Shen, and G. Sun (2018-06) Squeeze-and-excitation networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.2, §3.2, §4.3, Table 7.
  • [14] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Weinberger (2018) Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations, External Links: Link Cited by: §2, §4.3.
  • [15] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <1mb model size. CoRR abs/1602.07360. External Links: Link, 1602.07360 Cited by: §2.
  • [16] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §4.4.
  • [17] J. Lin, Y. Rao, J. Lu, and J. Zhou (2017) Runtime neural pruning. In Advances in Neural Information Processing Systems, pp. 2181–2191. External Links: Link Cited by: §2.
  • [18] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.4.
  • [19] H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [20] L. Liu and J. Deng (2018) Dynamic deep neural networks: optimizing accuracy-efficiency trade-offs by selective execution. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §2.
  • [21] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017-10) Learning efficient convolutional networks through network slimming. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [22] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018-09) ShuffleNet v2: practical guidelines for efficient cnn architecture design. In The European Conference on Computer Vision (ECCV), Cited by: §1, §2.
  • [23] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §A.1.
  • [24] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2018) Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §2.
  • [25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §A.1, §A.2, §A.2, Table 11, §1, §2, §2, §4.1, Table 9.
  • [26] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In CVPR, Cited by: §4.4, §4.4.
  • [27] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019-06) MnasNet: platform-aware neural architecture search for mobile. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §2.
  • [28] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019-06) HAQ: hardware-aware automated quantization with mixed precision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [29] X. Wang, F. Yu, Z. Dou, T. Darrell, and J. E. Gonzalez (2018-09) SkipNet: learning dynamic routing in convolutional networks. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • [30] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019-06) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [31] B. Wu, A. Wan, X. Yue, P. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, and K. Keutzer (2017) Shift: a zero flop, zero parameter alternative to spatial convolutions. Cited by: §2.
  • [32] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris (2018-06) BlockDrop: dynamic inference paths in residual networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [33] B. Xiao, H. Wu, and Y. Wei (2018-04) Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision, Cited by: §4.4, §4.4.
  • [34] S. Xie, H. Zheng, C. Liu, and L. Lin (2019) SNAS: stochastic neural architecture search. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [35] J. Yang, X. Shen, J. Xing, X. Tian, H. Li, B. Deng, J. Huang, and X. Hua (2019-06) Quantization networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [36] J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang (2019) Slimmable neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [37] D. Zhang, J. Yang, D. Ye, and G. Hua (2018-09) LQ-nets: learned quantization for highly accurate and compact deep neural networks. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • [38] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018-06) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • [39] C. Zhu, S. Han, H. Mao, and W. J. Dally (2017) Trained ternary quantization. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • [40] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. CoRR abs/1611.01578. Cited by: §2.
  • [41] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018-06) Learning transferable architectures for scalable image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.