1 Introduction
Interest in building light-weight and efficient neural networks has exploded recently. It not only enables new experiences on mobile devices, but also protects users' privacy by avoiding sending personal information to the cloud. Recent works (e.g. MobileNet [11, 25, 12] and ShuffleNet [38, 22]) have shown that both efficient operator design (e.g. depth-wise convolution, channel shuffle, squeeze-and-excitation, asymmetric convolution) and architecture search ([27, 7, 2]) are important for designing efficient convolutional neural networks.
However, even state-of-the-art efficient CNNs (e.g. MobileNetV3) suffer significant performance degradation when the computational constraint becomes extremely low. For instance, when the computational cost drops from 219M to 66M Multi-Adds, the top-1 ImageNet classification accuracy of MobileNetV3 falls from 75.2% to 67.4%. This is because an extremely low computational budget severely constrains both the network depth (number of layers) and width (number of channels), which are crucial for network performance but proportional to the computational cost.
This paper proposes a new operator design, named dynamic convolution, to increase representation ability with negligible extra FLOPs. Instead of using a single convolution kernel per layer, dynamic convolution uses a set of parallel convolution kernels (see Figure 2), dynamically aggregated for each individual input (e.g. image) via input-dependent attention. Dynamic convolution is a non-linear function with more representation power than its static counterpart. Meanwhile, dynamic convolution is computationally efficient: it increases neither the depth nor the width of the network, as the parallel convolution kernels share the output channels through aggregation. The only extra computation is for computing the attention and aggregating the kernels, which is negligible compared to the convolution itself. The key insight is that, within a reasonable model-size budget (convolution kernels are small), dynamic kernel aggregation provides an efficient way (low extra FLOPs) to boost representation capability.
Dynamic convolutional neural networks (denoted as DY-CNNs) are more difficult to train, as they require the joint optimization of all convolution kernels and the attention across multiple layers. The sparsity of the attention (a softmax output) allows only a small subset of kernels to be optimized simultaneously, making training inefficient. We solve this problem by using a temperature in the softmax to flatten the attention, so that more convolution kernels are optimized simultaneously.
We demonstrate the effectiveness of dynamic convolution on both image classification (ImageNet) and keypoint detection (COCO). Without bells and whistles, simply replacing static convolution with dynamic convolution in MobileNet V2 and V3 achieves solid improvement with only a slight increase (4%) of computational cost (see Figure 1). For instance, with 100M Multi-Adds budget, our method gains 4.0% and 2.3% top-1 accuracy on image classification for MobileNetV2 and MobileNetV3, respectively.
2 Related Work
Efficient CNNs: Recently, designing efficient CNN architectures [15, 11, 25, 12, 38, 22] has been an active research area. SqueezeNet reduces the number of parameters by using 1×1 convolution extensively in its fire module. MobileNetV1 substantially reduces FLOPs by decomposing a standard convolution into a depth-wise convolution and a point-wise convolution. Based upon this, MobileNetV2 introduces inverted residuals and linear bottlenecks. MobileNetV3 applies squeeze-and-excitation in the residual layers and employs a platform-aware neural architecture search to find the global network structure. ShuffleNet further reduces the MAdds of 1×1 convolution by channel shuffle operations. ShiftNet replaces expensive spatial convolution with a shift operation and point-wise convolutions. Compared with existing work, our dynamic convolution can replace any static convolution kernel (e.g. 1×1, 3×3, depth-wise convolution, group convolution) and is complementary to other advanced operators like squeeze-and-excitation.
Model Compression and Quantization: Model compression [8, 21, 10] and quantization [3, 39, 37, 35, 28] approaches are also important for learning efficient neural networks. They are complementary to our work, helping reduce the model size for our dynamic convolution method.
Dynamic Deep Neural Networks: Our method is related to recent work on dynamic neural networks [17, 20, 29, 32, 36, 14] that skips part of an existing model based on the input image. DNN, SkipNet and BlockDrop learn an additional controller for the skipping decision using reinforcement learning. MSDNet allows early exit based on the current prediction confidence. Slimmable Nets learns a single neural network executable at different widths. Once-for-all proposes a progressive shrinking algorithm to train one network that supports multiple sub-networks, whose accuracy matches that of independently trained networks. Compared with these works, our method has two major differences. Firstly, all convolution layers in our method are dynamic, varying per input image, whereas existing works focus on dynamic network structure, leaving the parameters in each layer static. Secondly, our method does not require an additional controller; the attention is embedded in each layer, enabling end-to-end training.
Neural Architecture Search: Recent work on neural architecture search (NAS) demonstrates its power in finding high-accuracy network architectures [40, 24, 41, 19, 34] as well as hardware-aware efficient architectures [2, 27, 30]. Hardware-aware NAS methods incorporate hardware latency into the architecture search process by making it differentiable. The single-path one-shot approach optimizes all architectures in the search space simultaneously via a supernet, and then performs evolutionary architecture search to handle computational constraints. Based upon NAS, MobileNetV3 shows significant improvements over human-designed baselines (e.g. MobileNetV2). Our dynamic convolution can be easily used in advanced architectures found by NAS. Later in this paper, we show that dynamic convolution not only improves the performance of human-designed networks (e.g. MobileNetV2), but also boosts automatically searched architectures (e.g. MobileNetV3), with low extra FLOPs. In addition, our method provides a new and effective component to enrich the search space.
3 Dynamic Convolutional Neural Networks
We describe dynamic convolutional neural networks (DY-CNNs) in this section. The goal is to provide a better trade-off between network performance and computational burden, within the scope of efficient neural networks. The two most popular strategies for boosting performance are making neural networks "deeper" or "wider". However, both incur heavy computational cost, making them unsuitable for efficient neural networks.
We propose dynamic convolution, which increases neither the depth nor the width of the network, but increases model capability by aggregating multiple convolution kernels via attention. These kernels are assembled differently for different input images, which is where dynamic convolution gets its name. This section is organized as follows: we first define the generic dynamic perceptron, then apply it to convolution, and finally discuss the training strategy for dynamic convolutional neural networks (DY-CNNs).
3.1 Preliminary: Dynamic Perceptron
Definition: Let us denote the traditional or static perceptron as $y = g(W^T x + b)$, where $W$ and $b$ are the weight matrix and bias vector, and $g$ is an activation function (e.g. ReLU). The dynamic perceptron aggregates $K$ linear functions as follows:

$$y = g\big(\tilde{W}(x)^T x + \tilde{b}(x)\big), \quad \tilde{W}(x) = \sum_{k=1}^{K} \pi_k(x)\,\tilde{W}_k, \quad \tilde{b}(x) = \sum_{k=1}^{K} \pi_k(x)\,\tilde{b}_k, \quad \text{s.t. } 0 \le \pi_k(x) \le 1, \; \sum_{k=1}^{K} \pi_k(x) = 1,$$

where $\pi_k(x)$ is the attention weight for the $k$-th linear function $\tilde{W}_k^T x + \tilde{b}_k$.
Attention: the attention weights $\{\pi_k(x)\}$ are not fixed; they vary for each input $x$, assembling the linear models dynamically, and represent the optimal aggregation of linear models for a given input. Due to the non-linearity of the attention with respect to the input, the aggregated model is a non-linear function. Thus, the dynamic perceptron has more representation power than its static counterpart.
Example (learning XOR): To make the idea of the dynamic perceptron concrete, we begin with a simple task: learning the XOR function. In this example, we want the network to output correctly on the four points $\{(0,0), (0,1), (1,0), (1,1)\}$. While the classic static solution requires two perceptron layers, a dynamic perceptron needs only a single layer: it aggregates two linear models with input-dependent attentions $\pi_1(x)$ and $\pi_2(x)$ that select the appropriate model for each point. This example demonstrates that the dynamic perceptron has more representation power due to its non-linearity.
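As a concrete sanity check, the following sketch instantiates a single-layer dynamic perceptron that solves XOR. The particular weights, biases, and the hard attention rule are hand-picked for illustration and are not necessarily the values used in the paper's example.

```python
import numpy as np

# Two linear models (weights and biases) aggregated by input-dependent attention.
# These values are one hand-picked solution, chosen only to illustrate the idea.
W = np.array([[1.0, 1.0],     # kernel 1
              [-1.0, -1.0]])  # kernel 2
b = np.array([0.0, 2.0])

def attention(x):
    # Hard, input-dependent attention: select kernel 2 only for (1, 1).
    pi2 = 1.0 if x.sum() > 1.5 else 0.0
    return np.array([1.0 - pi2, pi2])

def dynamic_perceptron(x):
    pi = attention(x)
    w_agg = pi @ W   # aggregate weights: sum_k pi_k * W_k
    b_agg = pi @ b   # aggregate biases:  sum_k pi_k * b_k
    return w_agg @ x + b_agg

for p in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(p, dynamic_perceptron(np.array(p, dtype=float)))
# -> 0.0, 1.0, 1.0, 0.0 (the XOR truth table)
```

Because the attention switches the effective linear model per input, a single aggregated layer realizes a function no single static linear layer can.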
Computational Constraint: compared with the static perceptron, the dynamic perceptron has the same number of output channels but a larger model size. It also introduces two additional computations: (a) computing the attention weights $\pi(x)$, and (b) aggregating the parameters $\tilde{W}(x)$ and $\tilde{b}(x)$ based upon the attention. The additional computational cost should be significantly less than that of the linear model itself. Mathematically, the constraint can be represented as

$$O\big(\tilde{W}^T x + \tilde{b}\big) \gg O\big(\pi(x)\big) + O\Big(\sum_k \pi_k \tilde{W}_k\Big) + O\Big(\sum_k \pi_k \tilde{b}_k\Big),$$

where $O(\cdot)$ measures the computational cost (e.g. FLOPs). Note that a fully connected layer does not satisfy this constraint, while convolution is a proper fit.
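A quick back-of-the-envelope comparison makes this constraint tangible. The layer sizes below are illustrative assumptions, not values from the paper:

```python
# Why convolution satisfies the constraint but a fully connected layer does not.
K = 4                    # number of parallel kernels
H = W = 56               # feature-map resolution (illustrative)
k, C_in, C_out = 3, 64, 64

conv_cost = H * W * k * k * C_in * C_out   # applying one static conv (Mult-Adds)
agg_cost = K * k * k * C_in * C_out        # aggregating K conv kernels
print(agg_cost / conv_cost)                # = K / (H*W), about 0.13% extra

fc_cost = C_in * C_out                     # applying one fully connected layer
fc_agg_cost = K * C_in * C_out             # aggregating K weight matrices
print(fc_agg_cost / fc_cost)               # = K, i.e. 4x the layer's own cost
```

For convolution, the kernel is reused at every spatial position, so aggregation is amortized over H×W; for a fully connected layer the weight matrix is used once, so aggregation costs K times the layer itself and violates the constraint.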
3.2 Dynamic Convolution
In this subsection, we present a specific dynamic perceptron, dynamic convolution, that satisfies the computational constraint (Eq. 4). Like the dynamic perceptron, dynamic convolution (Figure 3) has $K$ convolution kernels that share the same kernel size and input/output dimensions, aggregated using the attention weights $\pi_k(x)$. Following the classic design in CNNs, we use batch normalization and an activation function (e.g. ReLU) after the aggregated convolution to build a dynamic convolution layer.
Attention: we apply a light-weight squeeze-and-excitation module to compute the kernel attentions (see Figure 3). The global spatial information is first squeezed by global average pooling. We then use two fully connected layers (with a ReLU between them) and a softmax to generate normalized attention weights for the $K$ convolution kernels. The first fully connected layer reduces the dimension by 4. Different from SENet, which computes attention over output channels, we compute attention over convolution kernels. The attention is cheap to compute: for an input feature map of size $H \times W \times C_{in}$, it requires about $HWC_{in} + C_{in}^2/4 + C_{in}K/4$ Mult-Adds, much less than the computational cost of the convolution itself, $HWk^2C_{in}C_{out}$ Mult-Adds, where $k$ is the kernel size and $C_{out}$ is the number of output channels.
Kernel Aggregation: aggregating convolution kernels is computationally efficient due to their small size. Aggregating $K$ convolution kernels with kernel size $k \times k$, $C_{in}$ input channels and $C_{out}$ output channels introduces an extra $Kk^2C_{in}C_{out}$ Mult-Adds. Compared with the computational cost of the convolution itself ($HWk^2C_{in}C_{out}$), the extra cost is negligible since $K \ll HW$. Table 1 shows the computational cost of using dynamic convolution in MobileNetV2 and MobileNetV3. For instance, for MobileNetV2 (×1.0), dynamic convolution with $K = 4$ kernels increases the computational cost by only 4%. Note that dynamic convolution increases the model size, which is acceptable as convolution kernels are small.
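A minimal PyTorch sketch of a dynamic convolution layer, combining the attention branch and kernel aggregation described above. The layer names, initialization, reduction factor, and the trick of batching the per-image convolutions via a grouped convolution are our own assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Sketch: K parallel conv kernels aggregated per input by SE-style attention."""
    def __init__(self, in_ch, out_ch, kernel_size, K=4, temperature=30.0, reduction=4):
        super().__init__()
        self.K, self.temperature = K, temperature
        self.padding = kernel_size // 2
        # K parallel kernels and biases (toy init; a real model would use a proper scheme).
        self.weight = nn.Parameter(
            0.02 * torch.randn(K, out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(K, out_ch))
        hidden = max(in_ch // reduction, 4)
        self.fc1 = nn.Linear(in_ch, hidden)
        self.fc2 = nn.Linear(hidden, K)

    def forward(self, x):
        B, C, H, W = x.shape
        # Attention branch: global average pool -> FC -> ReLU -> FC -> softmax.
        s = x.mean(dim=(2, 3))                                 # (B, C)
        logits = self.fc2(F.relu(self.fc1(s)))                 # (B, K)
        pi = F.softmax(logits / self.temperature, dim=1)       # (B, K)
        # Aggregate the K kernels per input image.
        w = torch.einsum('bk,koihw->boihw', pi, self.weight)   # (B, out, in, k, k)
        b = pi @ self.bias                                     # (B, out)
        # Run one grouped conv over the whole batch (group = one image's kernel).
        out = F.conv2d(x.reshape(1, B * C, H, W),
                       w.reshape(-1, *w.shape[2:]),
                       b.reshape(-1),
                       padding=self.padding, groups=B)
        return out.reshape(B, -1, *out.shape[2:])
```

Note the aggregation happens on the small kernel tensors, not on feature maps, which is exactly why the extra FLOPs stay low.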
From CNN to DY-CNN: dynamic convolution can be easily used as a drop-in replacement for any convolution (e.g. 1×1, 3×3, group convolution, depth-wise convolution) in any CNN architecture. It is also complementary to other operators (like squeeze-and-excitation) and activation functions (e.g. ReLU6, h-swish). In the rest of the paper, we use the prefix DY- for networks that use dynamic convolution; for example, DY-MobileNetV2 refers to MobileNetV2 with dynamic convolution.
3.3 Training Strategy for DY-CNN
Training deep dynamic convolutional neural networks (DY-CNNs) is challenging, as it requires the joint optimization of all convolution kernels and the attention across multiple layers. In Figure 4-Right, the blue curves show the training and validation errors for DY-MobileNetV2 (width multiplier ×0.5) over 300 epochs. It converges slowly, and the final top-1 accuracy (64.8%) is below that of its static counterpart (65.4%).
We believe the sparsity of the attention (due to the softmax) allows only a small subset of kernels across layers to be optimized simultaneously, making training inefficient. This inefficiency becomes more severe for deeper networks, as the number of combinations of activated convolution kernels (those with higher attention) across layers grows exponentially. To validate this, we train a variation of DY-MobileNetV2 that uses dynamic convolution only for the last 1×1 convolution in each inverted residual bottleneck block and keeps the other two convolution layers static. The training and validation errors are shown in Figure 4-Left: training converges faster and reaches higher accuracy (65.9%).
We address this issue by flattening the attention so that more convolution kernels are optimized simultaneously. This is achieved by using a temperature $\tau$ in the softmax:

$$\pi_k = \frac{\exp(z_k / \tau)}{\sum_j \exp(z_j / \tau)},$$

where $z_k$ is the logit for the $k$-th kernel. The original softmax is equivalent to $\tau = 1$; as $\tau$ increases, the output distribution becomes less sparse. We found that a larger temperature (e.g. $\tau = 30$) improves training efficiency significantly (see the red curves in Figure 4-Right). When changing $\tau$ from 1 to 30, top-1 accuracy boosts from 64.8% to 69.4% for DY-MobileNetV2 with width multiplier ×0.5. Even the network using dynamic convolution only in the last 1×1 convolution of each block benefits from this (Figure 4-Left).
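The flattening effect of the temperature can be checked numerically; the logits below are made up for illustration:

```python
import numpy as np

def softmax(z, tau=1.0):
    # Softmax with temperature tau: larger tau -> flatter (less sparse) output.
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

logits = np.array([3.0, 1.0, 0.5, 0.2])   # made-up logits for K = 4 kernels
print(softmax(logits, tau=1))    # peaked: one kernel dominates the gradient
print(softmax(logits, tau=30))   # near uniform: all kernels receive gradient
```

With a near-uniform attention early in training, gradients flow to all K kernels in every layer, which is what makes the joint optimization tractable.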
4 Experimental Results
In this section, we present experimental results to demonstrate the effectiveness of our dynamic convolution. We report results on image classification and single person pose estimation. We also report ablation studies to analyze different components of our approach.
4.1 ImageNet Classification
We use ImageNet for all classification experiments. ImageNet has 1000 object classes, with 1,281,167 training images and 50,000 validation images. We evaluate dynamic convolution on three CNN architectures (MobileNetV2, MobileNetV3 and ResNet) by using dynamic convolution for all convolution layers except the first one. All dynamic convolution layers have $K = 4$ kernels. The softmax temperature is set to 30, and the batch size is 256. We use different training setups for the three architectures as follows:
Training setup for DY-MobileNetV2: The initial learning rate is 0.05 and decays to zero over a single cosine cycle. The weight decay is 4e-5. All models are trained with SGD (momentum 0.9) for 300 epochs. We use a dropout rate of 0.2 or 0.1 before the last layer, depending on the width multiplier.
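The single-cycle cosine decay can be sketched as follows (epoch granularity is our assumption; the schedule may equally be stepped per iteration):

```python
import math

def cosine_lr(epoch, total_epochs=300, lr0=0.05):
    # One cosine cycle decaying from lr0 at epoch 0 to zero at total_epochs.
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * epoch / total_epochs))

print(cosine_lr(0))    # 0.05 (initial learning rate)
print(cosine_lr(150))  # 0.025 (halfway point)
print(cosine_lr(300))  # 0.0 (end of training)
```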
Training setup for DY-MobileNetV3: The initial learning rate is 0.1 and decays to zero over a single cosine cycle. The weight decay is 3e-5. We use SGD (momentum 0.9) for 300 epochs with a dropout rate of 0.2 before the last layer, and label smoothing for DY-MobileNetV3-Large.
Training setup for DY-ResNet: The initial learning rate is 0.1 and drops by a factor of 10 at epochs 30, 60 and 90. The weight decay is 1e-4. All models are trained with SGD (momentum 0.9) for 100 epochs. We use a dropout rate of 0.1 before the last layer of DY-ResNet-18.
Main Results: We compare dynamic convolution with its static counterpart for three CNN architectures (MobileNetV2, MobileNetV3 and ResNet) in Table 2. Although we focus on efficient CNNs, we also evaluate dynamic convolution on two shallow ResNets (ResNet-10 and ResNet-18) to show its effectiveness on 3×3 convolution, which is only used in the first layer of MobileNet V2 and V3. Without bells and whistles, dynamic convolution outperforms its static counterpart by a clear margin for all three architectures, with small extra computational cost. DY-ResNet gains more than 2.3% top-1 accuracy, DY-MobileNetV2 gains more than 2.4% top-1 accuracy, and DY-MobileNetV3-Small is 2.3% more accurate than the state-of-the-art MobileNetV3-Small.
For MobileNetV3-Large, we cannot reproduce the baseline performance of 75.2% with a small mini-batch; that result was achieved with a large mini-batch of 4096. As a large mini-batch does not fit on our 4 GPUs, we report results with mini-batch 256: the top-1 accuracies of our implementations of MobileNetV3-Large and DY-MobileNetV3-Large are 73.7% and 74.7%, respectively.
4.2 Inspecting DY-CNN
| | Top-1 (%) | Top-5 (%) |
|---|---|---|
| shuffle per image | 14.8 | 30.5 |
| shuffle across images | 27.3 | 48.4 |
We inspect a well-trained DY-MobileNetV2 (width multiplier ×0.5) and expect two properties: (a) the convolution kernels are diverse per layer, and (b) the attention is input dependent. We examine both properties by contradiction. Firstly, if the convolution kernels were not diverse, performance would be stable under different attentions. Thus, we vary the kernel aggregation per layer in three ways: averaging the kernels, choosing the kernel with the maximum attention, and randomly shuffling the attention over kernels per image. Compared with using the original attention, these variations degrade performance significantly (Table 3). When choosing the kernel with the maximum attention, the top-1 accuracy (0.1%) is as low as randomly choosing a class. This significant instability confirms the diversity of the convolution kernels. In addition, we shuffle attentions across images to check whether the attention is input dependent. The poor performance (27.3% top-1 accuracy) indicates that it is crucial for each image to use its own attention.
Furthermore, we inspect the attention across layers and find that attentions are flat at low levels and sparse at high levels. This helps explain why the variations in Table 3 have poor accuracy: averaging kernels at high levels (where attention is sparse), or picking the single kernel with the maximum attention at low levels (where attention is flat), is problematic. Table 4 shows how attention at different layers affects performance. We group layers by their input resolutions and switch attention on or off per group; when attention is switched off for a resolution, each layer in that group aggregates kernels by averaging. When attention is enabled at the higher levels alone, the top-1 accuracy is 67.5%, close to the 69.4% obtained by using attention in all layers. If attention is used at the lower levels alone, the top-1 accuracy is a poor 42.5%.
4.3 Ablation Studies on ImageNet
We run a number of ablations to analyze DY-MobileNetV2, and use DY-MobileNetV3-Small to compare dynamic convolution with squeeze-and-excitation.
The number of convolution kernels ($K$): this hyper-parameter controls the model complexity. Figure 5 shows the classification accuracy and computational cost of dynamic convolution for different $K$. We compare DY-MobileNetV2 with MobileNetV2 across different depth/width multipliers. Firstly, dynamic convolution outperforms the static baseline for all depth/width multipliers, even with small $K$, demonstrating the strength of our method. In addition, the accuracy stops increasing once $K$ is larger than 4. This is because as $K$ increases, although the model has more representation power, it becomes harder to optimize all convolution kernels and attention simultaneously, and the network is more prone to over-fitting.
Dynamic convolution in shallower and thinner networks: Figure 6 shows that the shallower DY-MobileNetV2 has a better trade-off between accuracy and computational cost than the deeper MobileNetV2, even though shallower networks (depth multiplier 0.5) show performance degradation for both DY-MobileNetV2 and MobileNetV2. Improvement on shallow networks is useful, as they are friendly to parallel computation. Furthermore, dynamic convolution achieves larger gains for thinner and shallower networks with small width/depth multipliers. This is because such networks are under-fitted due to their limited model size, and dynamic convolution significantly improves their capability.
Dynamic convolution at different layers: Table 5 shows the classification accuracy when dynamic convolution is used at each of the three layers of MobileNetV2's inverted residual bottleneck block (the first 1×1 convolution, the depth-wise convolution, and the last 1×1 convolution). The accuracy improves as dynamic convolution is used for more layers, and using it for all three layers yields the best accuracy. If only one layer is allowed to use dynamic convolution, the last 1×1 convolution yields the best performance.
Temperature of Softmax: the temperature $\tau$ in the softmax controls the sparsity of the attention weights and is important for training DY-CNNs effectively. Table 6 shows the classification accuracy for different temperatures; $\tau = 30$ performs best.
Comparison with SENet: Table 7 compares dynamic convolution and squeeze-and-excitation (SE) on MobileNetV3-Small, in which the locations of the SE layers can be considered optimal, as they were found by neural architecture search (NAS). Without SE, the top-1 accuracy of MobileNetV3-Small drops by 2%. However, DY-MobileNetV3-Small without SE outperforms MobileNetV3-Small with SE by 1.8% top-1 accuracy. Combining dynamic convolution and SE gains an additional 0.5%, suggesting that attention over kernels and attention over output channels can work together.
| | Top-1 (%) | Top-5 (%) |
|---|---|---|
| MobileNetV3-Small w/o SE | 65.4 | 85.2 |
| DY-MobileNetV3-Small w/o SE | 69.2 | 88.3 |
4.4 COCO Single-Person Keypoint Detection
We use the COCO 2017 dataset to evaluate dynamic convolution on single-person keypoint detection. Our models are trained on train2017, where person instances are labeled with 17 keypoints. We evaluate on val2017, which contains 5000 images, using the mean average precision (AP) over 10 object keypoint similarity (OKS) thresholds as the metric.
Implementation Details: We implement two types of networks to evaluate dynamic convolution. Type-A follows SimpleBaseline, using deconvolution in the head; we use MobileNetV2 and V3 as drop-in replacements for the backbone feature extractor and compare static and dynamic convolution in the backbone alone. Type-B also uses MobileNetV2 and V3 as the backbone, but its head uses upsampling and MobileNetV2's inverted residual bottleneck blocks; here we compare dynamic convolution with its static counterpart in both backbone and head. The head structure is detailed in Table 9. For both types, we use $K = 4$ kernels in each dynamic convolution layer.
Training setup: We follow the training setup of SimpleBaseline. The human detection boxes are cropped from the image and resized. The data augmentation includes random rotation, random scaling, flipping, and half-body augmentation. All models are trained from scratch for 210 epochs using the Adam optimizer. The initial learning rate is 1e-3 and is dropped to 1e-4 and 1e-5 at the 170th and 200th epochs, respectively. The softmax temperature in DY-CNNs is set to 30.
Testing: We follow [33, 26] and use the two-stage top-down paradigm: detect person instances with a person detector, then predict keypoints. We use the same person detectors as prior work. The keypoints are predicted on the averaged heatmap of the original and flipped images, with the highest heat-value location adjusted by a quarter offset toward the second highest response.
Main Results and Ablations: Firstly, we compare dynamic convolution and static convolution in the backbone (Type-A). The results are shown in the top half of Table 8: dynamic convolution gains 1.6, 2.9 and 1.9 AP for ResNet-18, MobileNetV2 and V3, respectively.
Secondly, we replace the heavy deconvolution head with light-weight upsampling and MobileNetV2's inverted residual bottleneck blocks (Type-B) to make the whole network small and efficient, allowing us to compare dynamic convolution with its static counterpart in both backbone and head. The results are shown in the bottom half of Table 8. As with Type-A, dynamic convolution outperforms the static baselines by a clear margin, gaining 3.6 and 2.1 AP for MobileNetV2 and V3, respectively. These results demonstrate that our method is also effective for keypoint detection.
We perform an ablation to investigate the effect of dynamic convolution in the backbone and the head separately (Table 10). Most of the improvement comes from dynamic convolution in the backbone, mainly because the backbone has many more convolution layers than the head; nevertheless, dynamic convolution in the head is also helpful.
5 Conclusion
In this paper, we introduced dynamic convolution, which aggregates multiple convolution kernels dynamically, based upon their attention, for each input. Compared to its static counterpart (a single convolution kernel per layer), it significantly improves representation capability with negligible extra computational cost, and is thus well suited to efficient CNNs. Dynamic convolution can be easily integrated into existing CNN architectures: by simply replacing each convolution kernel in MobileNet (V2 and V3) with dynamic convolution, we achieve solid improvements on both image classification and human pose estimation. We hope dynamic convolution becomes a useful component for efficient network architectures.
Appendix A Appendix
In this appendix, we report running time and perform additional analysis for our dynamic convolution method.
A.1 Inference Running Time
We use a single-threaded core of an Intel Xeon CPU E5-2650 v3 (2.30GHz) to measure running time (in milliseconds), averaged over the inference of 5,000 images with batch size 1. Both MobileNetV2 and DY-MobileNetV2 are implemented in PyTorch.
Compared with its static counterpart, DY-MobileNetV2 consumes moderately more running time and slightly more Multi-Adds; the relative running-time overhead is higher than the Multi-Adds overhead. We believe this is because global average pooling and small inner-product operations are not as well optimized as convolution. With this small additional cost, dynamic convolution significantly improves model performance.
A.2 Top-1 Classification Accuracy per Class
Figure 7 plots the per-class top-1 accuracy of dynamic convolution (DY-MobileNetV2) and static convolution (MobileNetV2) over the 1000 ImageNet classes, for four width multipliers (×1.0, ×0.75, ×0.5, ×0.35). Each dot corresponds to an image class, with one top-1 accuracy per model. Since each class has only 50 images in the validation set, multiple classes often have the same number of correctly predicted images, and thus the same accuracy, overlapping in Figure 7. We use dot opacity to indicate overlap: the darker the dot, the more classes overlap at that position.
Dynamic convolution (DY-MobileNetV2) is more accurate than its static counterpart (MobileNetV2) in the majority of classes (above the red diagonal line), from easier classes to harder ones.
-  (2019) Once for all: train one network and specialize it for efficient deployment. ArXiv abs/1908.09791. Cited by: §2.
-  (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In International Conference on Learning Representations. Cited by: §1, §2.
-  (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28, pp. 3123–3131. Cited by: §2.
-  (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §A.2, Table 11, §4.1, Table 2, Table 3, Table 4, Table 6, Figure 7.
-  (2019-10) ACNet: strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1.
-  (2016) Deep learning. The MIT Press. Cited by: §3.1.
-  (2019) Single path one-shot neural architecture search with uniform sampling. Cited by: §1, §2.
-  (2016-10) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), Cited by: §2.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
-  (2018-09) AMC: automl for model compression and acceleration on mobile devices. In The European Conference on Computer Vision (ECCV), Cited by: §2.
-  (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1, §2.
-  (2019) Searching for MobileNetV3. CoRR abs/1905.02244. Cited by: §1, §1, §2, §2, §3.2, §4.1, §4.1, §4.3.
-  (2018-06) Squeeze-and-excitation networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.2, §3.2, §4.3, Table 7.
-  (2018) Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations. Cited by: §2, §4.3.
-  (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs/1602.07360. Cited by: §2.
-  (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §4.4.
-  (2017) Runtime neural pruning. In Advances in Neural Information Processing Systems, pp. 2181–2191. Cited by: §2.
-  (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.4.
-  (2019) DARTS: differentiable architecture search. In International Conference on Learning Representations. Cited by: §2.
- Dynamic deep neural networks: optimizing accuracy-efficiency trade-offs by selective execution. In AAAI Conference on Artificial Intelligence (AAAI). Cited by: §2.
-  (2017-10) Learning efficient convolutional networks through network slimming. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
-  (2018-09) ShuffleNet v2: practical guidelines for efficient cnn architecture design. In The European Conference on Computer Vision (ECCV), Cited by: §1, §2.
-  (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §A.1.
- Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence (AAAI). Cited by: §2.
-  (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §A.1, §A.2, §A.2, Table 11, §1, §2, §2, §4.1, Table 9.
-  (2019) Deep high-resolution representation learning for human pose estimation. In CVPR, Cited by: §4.4, §4.4.
-  (2019-06) MnasNet: platform-aware neural architecture search for mobile. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §2.
-  (2019-06) HAQ: hardware-aware automated quantization with mixed precision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2018-09) SkipNet: learning dynamic routing in convolutional networks. In The European Conference on Computer Vision (ECCV), Cited by: §2.
-  (2019-06) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2017) Shift: a zero flop, zero parameter alternative to spatial convolutions. Cited by: §2.
-  (2018-06) BlockDrop: dynamic inference paths in residual networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2018-04) Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision. Cited by: §4.4.
-  (2019) SNAS: stochastic neural architecture search. In International Conference on Learning Representations. Cited by: §2.
-  (2019-06) Quantization networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2019) Slimmable neural networks. In International Conference on Learning Representations. Cited by: §2.
-  (2018-09) LQ-nets: learned quantization for highly accurate and compact deep neural networks. In The European Conference on Computer Vision (ECCV), Cited by: §2.
-  (2018-06) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
-  (2017) Trained ternary quantization. In International Conference on Learning Representations (ICLR), Cited by: §2.
-  (2017) Neural architecture search with reinforcement learning. CoRR abs/1611.01578. Cited by: §2.
-  (2018-06) Learning transferable architectures for scalable image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.