I Introduction
Human Pose Estimation (HPE) aims to locate skeletal keypoints (e.g. ear, shoulder, elbow, etc.) of all persons in a given RGB image. It is fundamental to action recognition and has wide applications in human-computer interaction, animation, etc. This paper is interested in single-person pose estimation, which is the basis of multi-person pose estimation [19, 25].
HPE involves two subtasks: localization (determining where the keypoints are) and classification (determining which kinds of keypoints they are). Localization needs plenty of local details to achieve pixel-level accuracy, while classification requires a relatively larger receptive field to extract discriminative semantic representations [1]. Consequently, HPE methods have to fuse multi-scale information to strike a balance between these two subtasks [23]. Most current HPE methods [5, 22, 13, 17, 18] repeatedly downscale feature maps to enlarge the receptive fields. Feature maps of different spatial sizes are then resized and summed to exploit multi-scale information.
This strategy has achieved great success in HPE [13, 23, 17], but it still leaves something to be desired. In this strategy, feature maps are downscaled by strided convolution (or pooling). As shown in Figure 1, during the downscaling, multiple pixels on the larger feature maps are merged into the same pixel on the smaller ones, and the location information is destroyed in this process. During the upscaling, even if transposed convolution [8] is used, it is hard to recover the destroyed location information. Consequently, there are multiple possible corresponding positions on the upscaled feature maps for each original pixel. Although the final resized feature maps have the same spatial sizes, their pixels may not be well aligned. This spatial non-alignment potentially hurts the accuracy of localization. Thus, it may be preferable to fuse multi-scale features of the same spatial sizes.

An alternative is to use dilated convolution, instead of downscaling, to enlarge receptive fields. In [27, 3], multiple convolutional layers with different dilation rates are used to extract feature maps at different scales. These feature maps have the same spatial sizes and are well aligned spatially. They are concatenated and fused by convolution to exploit multi-scale information. However, these dilation rates are still manually set and fixed, which may restrict the generalization ability over various human sizes.
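The non-alignment introduced by downscaling can be seen in a toy example. The following sketch (our own illustration, not from the paper) places a keypoint response on a 1-D feature map, max-pools it by a factor of 2, and upsamples it back; the exact position can no longer be recovered:

```python
import numpy as np

# Hypothetical 1-D illustration: a keypoint response sits at position 5
# of an 8-pixel feature map.
x = np.zeros(8)
x[5] = 1.0

# Downscaling by 2 (max pooling): positions 4 and 5 merge into pooled cell 2,
# so the precise location information is destroyed.
pooled = x.reshape(4, 2).max(axis=1)

# Upscaling back (nearest-neighbour): the response now covers BOTH
# positions 4 and 5 -- multiple possible corresponding positions remain.
upsampled = np.repeat(pooled, 2)

print(np.nonzero(upsampled)[0])  # [4 5]
```

Summing such an upsampled map with a full-resolution map therefore mixes features whose pixels are not guaranteed to correspond.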
Towards these issues, we propose an adaptive dilated convolution (ADC) in this paper. As shown in Figure 2, it divides channels into different dilation groups and uses a dilation-rates regression module to adaptively generate dilation rates for these groups. Compared with previous multi-scale fusion methods, ADC has three advantages: i) Instead of using multiple independent dilated convolution layers, ADC directly assigns different dilation rates to its channels. In this way, ADC can generate and fuse multi-scale features in a single layer, which is more elegant and efficient. ii) ADC allows fractional dilation rates, which enables it to adjust receptive fields with finer granularity, instead of only four fixed integer scales. Thus ADC may be able to exploit richer and finer multi-scale information. iii) The dilation rates in ADC are adaptively generated, which could help ADC generalize better to various human sizes.
ADC can be easily plugged into existing HPE methods and trained end-to-end by standard backpropagation. Our contributions can be summarized into three points:

We attempt to address the spatial non-alignment and inflexibility problems in current multi-scale fusion methods for HPE. These problems are important to the accuracy of localization and the generalization ability over various human sizes.

We propose an adaptive dilated convolution (ADC), which can flexibly fuse well-aligned multi-scale features in a single convolutional layer by adaptively generating dilation rates for different channels.

The proposed ADC can be easily plugged into existing HPE methods, and extensive experiments show that ADC brings these methods consistent improvements.
II Related Works
II-A Multi-Scale Fusion
Multi-scale fusion is widely adopted in many high-level vision tasks, such as detection [14, 12, 9], semantic segmentation [16], etc. On the one hand, these tasks involve both localization and classification, and need multi-scale information to strike a balance between the two subtasks. On the other hand, these tasks need to handle objects of various sizes, and thus need scale-invariant representations to achieve more stable performance. In these tasks, most methods [13, 14, 16] first extract a feature pyramid, which contains feature maps of different spatial sizes, and then fuse these feature maps to obtain multi-scale information. However, as discussed above, the fused features may not be well aligned spatially. This non-alignment may hurt the accuracy of localization. For detection and segmentation, this influence can often be ignored, because their evaluation metric (IoU) is less sensitive to the accuracy of localization. HPE methods, however, are evaluated by OKS, which is heavily influenced by pixel-level errors. Thus the non-alignment may restrict the performance of HPE methods. In the proposed adaptive dilated convolution, multi-scale features are of the same spatial sizes, which may be more friendly to human pose estimation.
II-B Dilated Convolution
The main idea of dilated convolution is to insert zeros between pixels of convolution kernels. It is widely used in segmentation [4, 2] to enlarge the receptive fields while keeping the resolution of feature maps. As the size of its receptive field can be easily changed by adjusting its dilation rate, dilated convolution is also used to aggregate multi-scale context information. For example, in [27], the outputs of convolutional layers with different dilation rates are fused to exploit multi-scale context information. In [3], a similar idea is adopted in an atrous spatial pyramid pooling (ASPP) module. However, the dilation rates of these different layers are manually set and can only be integers, which is not flexible enough. Instead, the dilation rates in ADC can be fractional and are adaptively generated, which enables it to learn more suitable receptive fields for objects of various sizes. Besides, every dilation group in ADC can represent features at one scale, which enables ADC to fuse richer multi-scale information in a simpler way.
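The zero-insertion idea can be made concrete with a small sketch (our own illustration): dilating a $k \times k$ kernel with rate $d$ yields an effective kernel of size $d(k-1)+1$, with the original weights spaced $d$ pixels apart.

```python
import numpy as np

def dilate_kernel(kernel, d):
    """Insert (d - 1) zeros between the entries of a square kernel.

    A k x k kernel becomes an effective (d*(k-1)+1) x (d*(k-1)+1) kernel
    while keeping the same number of non-zero weights.
    """
    k = kernel.shape[0]
    size = d * (k - 1) + 1
    out = np.zeros((size, size), dtype=kernel.dtype)
    out[::d, ::d] = kernel  # place original weights d pixels apart
    return out

k = np.ones((3, 3))
print(dilate_kernel(k, 2).shape)  # (5, 5): a 3x3 kernel covers a 5x5 field at d=2
```

Note that the receptive field grows while the parameter count (nine non-zero weights) stays fixed.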
III Adaptive Dilated Convolution
III-A Constant Dilation Rates
As shown in Figure 3, the original dilated convolution can be decomposed into two steps: 1) sampling according to an index set $\mathcal{R}$ over the input feature map $X \in \mathbb{R}^{C \times H \times W}$; 2) matrix multiplication of the sampled values and the convolutional kernel $W$. The index set $\mathcal{R}$ is defined by the dilation rate $d$ and the size $k$ of the kernel:

$$\mathcal{R} = \left\{ (d \cdot i,\ d \cdot j) \ \middle|\ i, j \in \left\{ -\lfloor k/2 \rfloor, \dots, \lfloor k/2 \rfloor \right\} \right\} \quad (1)$$

where $C$ is the number of channels in $X$ and $\lfloor \cdot \rfloor$ denotes rounding down to the nearest integer. Specially, if $k = 3$ and $d = 1$, then

$$\mathcal{R} = \{ (-1,-1), (-1,0), (-1,1), (0,-1), (0,0), (0,1), (1,-1), (1,0), (1,1) \} \quad (2)$$
For the value at location $p_0$ of the output feature map $Y$, we have

$$Y_{c'}(p_0) = \sum_{p_n \in \mathcal{R}} W_{c'}(p_n) \cdot X(p_0 + p_n) \quad (3)$$

where $p_n$ enumerates the indexes in $\mathcal{R}$, and $W_{c'}$ denotes the corresponding convolutional kernel for the $c'$-th output channel.
The receptive field of each channel in a convolutional layer is defined as the square covered by the index set $\mathcal{R}$. In the original dilated convolution, the receptive fields of all channels are the same. Their sizes are:

$$\left( d \cdot (k-1) + 1 \right) \times \left( d \cdot (k-1) + 1 \right) \quad (4)$$

Specially, when $d = 1$, the size of the receptive field is $k \times k$.
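As a quick check of Eqs. (1), (2) and (4), the index set and receptive-field size can be enumerated directly (an illustrative sketch, with function names of our own choosing):

```python
def index_set(k, d):
    """Sampling offsets R for a k x k kernel with dilation rate d (cf. Eq. 1)."""
    half = k // 2  # floor(k / 2)
    return [(d * i, d * j)
            for i in range(-half, half + 1)
            for j in range(-half, half + 1)]

def receptive_field(k, d):
    """Side length of the square covered by R (cf. Eq. 4)."""
    return d * (k - 1) + 1

# k = 3, d = 1 reproduces the ordinary 3x3 neighbourhood of Eq. (2).
print(index_set(3, 1))
print(receptive_field(3, 1))  # 3
print(receptive_field(3, 2))  # 5
```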
III-B Adaptive Dilation Rates
In adaptive dilated convolution, the dilation rates are no longer manually set. As shown in Figure 2, we use a dilation-rates regression module (DRM) to adaptively generate the dilation rates for different channels. DRM consists of a global average pooling layer and two fully connected layers with nonlinear activations. Suppose DRM is denoted as a function $\mathcal{F}$; then the generated dilation rates are

$$d = \mathcal{F}(X) \quad (5)$$
We divide the $C$ input channels into $G$ dilation groups. Each group contains $C/G$ channels, and the channels in the same group share the same dilation rate. Thus the shape of $d$ is $G \times 1$, and the dilation rate of the $c$-th input channel is $d_{\lfloor cG/C \rfloor}$. If $G = C$, then each channel has its own dilation rate. If $G = 1$, then all channels share the same dilation rate.
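A rough NumPy sketch of the DRM and the group-to-channel mapping may clarify the shapes involved. This is not the paper's implementation: the hidden width, the ReLU activation, and the weight shapes are our assumptions; only the zero-weight / one-bias initialization of the last layer follows the instantiation described later.

```python
import numpy as np

rng = np.random.default_rng(0)

C, G = 8, 4            # input channels and dilation groups (illustrative sizes)
H, W = 16, 16
x = rng.standard_normal((C, H, W))

# DRM sketch: global average pooling followed by two fully connected layers.
hidden = 16                                       # assumed hidden width
W1, b1 = rng.standard_normal((hidden, C)) * 0.01, np.zeros(hidden)
W2 = np.zeros((G, hidden))                        # last layer: zero weights ...
b2 = np.ones(G)                                   # ... and bias one

pooled = x.mean(axis=(1, 2))                      # global average pooling -> (C,)
h = np.maximum(W1 @ pooled + b1, 0.0)             # ReLU (assumed activation)
d = W2 @ h + b2                                   # one rate per group -> (G,)

# Channel c uses the rate of its group: d[floor(c * G / C)].
rates = [d[c * G // C] for c in range(C)]
print(rates)  # at this initialisation every channel gets rate 1.0
```

With the last layer initialized to zeros and its bias to ones, every generated rate starts at 1, so training begins from an ordinary (undilated) convolution.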
Consequently, the sampling index set for the $c$-th channel becomes

$$\mathcal{R}_c = \left\{ \left( d_{\lfloor cG/C \rfloor} \cdot i,\ d_{\lfloor cG/C \rfloor} \cdot j \right) \ \middle|\ i, j \in \left\{ -\lfloor k/2 \rfloor, \dots, \lfloor k/2 \rfloor \right\} \right\} \quad (6)$$
In cases where $d$ is fractional, as shown in Figure 3, we use bilinear interpolation to get the sampling values. Suppose $X(p)$ denotes the interpolated value at fractional position $p = (p_x, p_y)$ on $X$; then we have:

$$X(p) = \sum_{q} \max(0,\ 1 - |p_x - q_x|) \cdot \max(0,\ 1 - |p_y - q_y|) \cdot X(q) \quad (7)$$

where $q$ enumerates all integral spatial positions on $X$.
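Bilinear interpolation at a fractional sampling position can be sketched as follows (an illustrative implementation of standard bilinear interpolation, with edge clamping as our own simplification):

```python
import numpy as np

def bilinear(X, y, x):
    """Bilinearly interpolated value of a 2-D map X at fractional (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    # Clamp the upper neighbours at the border (a simplifying assumption).
    y1, x1 = min(y0 + 1, X.shape[0] - 1), min(x0 + 1, X.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * X[y0, x0] + (1 - wy) * wx * X[y0, x1]
            + wy * (1 - wx) * X[y1, x0] + wy * wx * X[y1, x1])

X = np.arange(16.0).reshape(4, 4)
# Midpoint of X[1,1]=5, X[1,2]=6, X[2,1]=9, X[2,2]=10:
print(bilinear(X, 1.5, 1.5))  # 7.5
```

At integral positions the weights collapse to a single pixel, so the operation is a strict generalization of ordinary grid sampling.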
Similarly, in the $c$-th channel of adaptive dilated convolution, the size of the receptive field is:

$$\left( d_{\lfloor cG/C \rfloor} \cdot (k-1) + 1 \right) \times \left( d_{\lfloor cG/C \rfloor} \cdot (k-1) + 1 \right) \quad (8)$$
Consequently, different dilation groups have different sizes of receptive fields, and thus ADC can fuse multi-scale information in a single layer.
III-C Analysis and Discussion
Comparison with Yu et al. In [27], Yu et al. use multiple dilated convolutional layers with different dilation rates to extract features at different scales. ADC adopts a similar idea, but implements it in a simpler yet more efficient way. Firstly, ADC consists of only one convolutional layer; it does not use independent dilated convolutional layers or extra concatenation, and is thus much more computation-economical and time-saving. Secondly, every dilation group in ADC represents features at a different scale, which enables ADC to exploit richer multi-scale information than [27]. Thirdly, the dilation rates in ADC can be fractional and are adaptively generated, instead of being manually set integers. This helps ADC generalize better to persons of various sizes.
Comparison with deformable convolution. In [7], Dai et al. propose a deformable convolutional layer, which allows the sampling index set to be non-grid and irregular. It assigns an offset to each index in $\mathcal{R}$, instead of only modifying the dilation rates. Compared with adaptive dilated convolution, deformable convolution enjoys higher degrees of freedom, but it also has a much higher computational cost. More importantly, the offsets introduced in deformable convolution are completely unconstrained and independent. This may cause the input and output feature maps to lose their spatial correspondence, which also potentially hurts the accuracy of localization. We also experimentally show in Sec. IV-A5 that the proposed adaptive dilated convolution is more suitable for human pose estimation than deformable convolution.

Comparison with scale-adaptive convolution. In [28], Zhang et al. propose a scale-adaptive convolution (SAC) to address the inconsistent predictions of large objects and the invisibility of small objects in scene parsing. SAC adaptively generates pixel-wise dilation rates to acquire flexible-size receptive fields along the spatial dimensions. It works well for scene parsing, which needs to handle objects of various sizes within a single image. However, in single-person pose estimation, size inconsistency across different images plays a more important role, which can be better alleviated via multi-scale fusion along the channel dimension. In SAC, different pixels can have different sizes of receptive fields, but different channels share the same dilation rates. Consequently, ADC may be more suitable for single-person pose estimation than SAC. In Sec. IV-A5, we also experimentally show that ADC works better than SAC in HPE methods.
III-D Instantiation
We plug ADC into the backbones of frequently used HPE models, including the families of SimpleBaseline [26] and HRNet [23]. Their backbones are built up with residual blocks [10]. As shown in Figure 4, we replace one ordinary convolution layer in the original residual block with ADC. The weights of the last layer in the dilation-rates regression module are initialized as zeros and its biases are initialized as ones. Thus, the generated dilation rates in ADC are initialized as ones. The number of dilation groups is set as $G = C$, in which case each group contains only one channel. Thus every channel can exploit context information at a different scale, and ADC can fuse as rich multi-scale information as possible. We also experimentally demonstrate in Sec. IV-A4 that the performance is positively correlated with $G$.
IV Experiments
IV-A Experiments on COCO
Dataset. All of our experiments on human pose estimation are done on the COCO dataset [15], which contains a large number of annotated person instances. Our models are trained on COCO train2017 and evaluated on COCO val2017 and COCO test-dev.
Evaluation metric. We use the standard evaluation metric, Object Keypoint Similarity (OKS), to evaluate our models:

$$\mathrm{OKS} = \frac{\sum_i \exp\left( -d_i^2 / 2 s^2 k_i^2 \right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

where $d_i$ is the Euclidean distance between the $i$-th detected keypoint and its corresponding ground truth, $v_i$ is the visibility flag of the ground truth, $s$ denotes the person scale, and $k_i$ is a per-keypoint constant that controls falloff. We report the standard average precision (AP) and recall, including $AP^{50}$ (AP at OKS = 0.50), $AP^{75}$, $AP$ (the mean of AP scores from OKS = 0.50 to OKS = 0.95 with an increment of 0.05), $AP^{M}$ (AP for persons of medium sizes) and $AP^{L}$ (AP for persons of large sizes).
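The OKS computation above can be sketched directly (an illustrative implementation; the per-keypoint constants used here are placeholders, not the official COCO sigmas):

```python
import numpy as np

def oks(pred, gt, v, s, kappa):
    """Object Keypoint Similarity between predicted and ground-truth keypoints.

    pred, gt : (N, 2) arrays of keypoint coordinates
    v        : (N,) visibility flags (> 0 means labelled)
    s        : person scale
    kappa    : (N,) per-keypoint falloff constants
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)        # squared Euclidean distances
    sim = np.exp(-d2 / (2 * s**2 * kappa**2))    # per-keypoint similarity
    vis = v > 0
    return sim[vis].sum() / vis.sum()            # average over labelled keypoints

# Perfect predictions on all visible keypoints give OKS = 1.
gt = np.array([[10.0, 20.0], [30.0, 40.0]])
v = np.array([2, 2])
kappa = np.array([0.025, 0.025])                 # placeholder constants
print(oks(gt.copy(), gt, v, 1.0, kappa))         # 1.0
```

Because the similarity decays exponentially with squared distance, even small pixel-level errors lower OKS noticeably, which is why spatial alignment matters more here than for IoU-based metrics.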
Training. Following the setting of [23], we augment the data by random rotation, random scaling, random translation, random horizontal flip, and half-body transform [24]. Then we crop out each single person according to their ground-truth bounding boxes. These crops are resized to a fixed input resolution and fed to the HPE model.
The models are optimized by the Adam [11] optimizer. For the family of HRNet, each model is trained for 210 epochs, with the learning rate decayed twice during training. For the family of SimpleBaseline, each model is trained for 140 epochs, also with two learning-rate decay steps. All models are trained and tested on Tesla V100 GPUs. More details can be found in the GitHub repository https://github.com/leoxiaobin/deep-high-resolution-net.pytorch.git.

Testing. During testing, we use the same person detection results provided in [26], which are widely used by many single-person HPE models [23, 1]. Single persons are cropped out according to the detection results, then resized and input to the HPE models. The flip test [23] is also performed in all experiments. Each keypoint location is predicted by adjusting the highest heat-value location with a quarter offset in the direction from the highest response to the second-highest response [23].
IV-A1 Ablation Study
To fully demonstrate the superiority of ADC, we perform ablation studies on different models, including the families of SimpleBaseline [26] and HRNet [23]. The results are shown in Table I. As one can see, ADC brings consistent improvements for different models. For the smallest model, i.e. SimpleBaseline-Res50, ADC brings a clear improvement in the AP score. For the largest model, i.e. HRNet-W48, there is still a noticeable improvement. The increments decay as the baseline scores increase; this may be because it is harder to improve the performance of a more accurate model. From $AP^{M}$ and $AP^{L}$, we can see that the improvements on medium and large persons are roughly the same, indicating that ADC benefits the keypoint detection of large and medium persons equally.
IV-A2 Error Analysis
In this section, we use the error analysis tool in [20] to further explore how ADC helps HPE models achieve better results. We mainly study four types of errors: 1) jitter: small error around the correct keypoint location; 2) missing: large localization error, where the detected keypoint is not within the proximity of any body part; 3) inversion: confusion between semantically similar parts belonging to the same instance, where the detection is in the proximity of the true keypoint location of the wrong body part; 4) swap: confusion between semantically similar parts of different instances, where the detection is within the proximity of a body part belonging to a different person. We use SimpleBaseline-Res50 as the baseline model, and plot the error analysis results with and without ADC in Figure 5. As one can see, ADC reduces the proportion of all four types of errors. In particular, the proportion of missing errors is reduced notably, which suggests that ADC helps the model to be more robust and detect keypoints in more cases. This may be attributed to the fact that ADC can adaptively adjust the dilation rates. The jitter error and inversion error directly indicate the accuracy of localization and classification respectively. The proportions of these two errors are both reduced, which suggests that ADC can simultaneously benefit the localization and classification of keypoints.
IV-A3 Statistical Analysis
In this section, we make a statistical analysis to further investigate how the generated dilation rates in ADC are related to the sizes of test persons. We divide the test persons in COCO val2017 into three groups (small, medium, and large) according to the areas of their bounding boxes. We still use SimpleBaseline-Res50 with ADC as the studied model. The backbone, i.e. ResNet-50, has four stages, which contain 3, 4, 6, and 3 residual blocks respectively.
We plot the means and variances of the dilation rates of different channels in Figure 6. For example, Figure 6 (a) shows the mean dilation rates of ADC in the third block of the first stage. Figure 6 (especially the top row) suggests that the dilation rates are closely related to the sizes of test persons: the dilation rates for larger persons are more likely to be larger. This enables ADC to be more robust over various human sizes. Besides, the dilation rates in deeper blocks also tend to have larger means (bottom row). This may be because deeper blocks are more concerned with semantic features and need larger dilation rates to enlarge the receptive fields. Additionally, the mean dilation rates of different channels in the same layer are quite different. The large variance of these dilation rates also indicates that ADC can fuse rich multi-scale information via different channels.
IV-A4 Study of Dilation Groups
We perform comparative experiments to explore the influence of the number of dilation groups $G$. We use SimpleBaseline-Res50 as the baseline and gradually increase $G$. The results are shown in Table II. As one can see, the model performance becomes better as $G$ increases. It suggests that the number of different dilation rates matters, which also indicates the importance of multi-scale information fusion in HPE.
IV-A5 Comparison with Other Methods
In this section, we experimentally show that ADC is more suitable for human pose estimation than deformable convolution (DC) [7] and scale-adaptive convolution (SAC) [28]. Comparative experiments are performed on SimpleBaseline-Res50. As shown in Table III, although DC brings an improvement over the baseline, its performance is inferior to that of ADC. As discussed in Sec. III-C, the unconstrained and independent offsets of DC may cause the input and output feature maps to lose their spatial correspondence, which potentially hurts the accuracy of localization. SAC can alleviate size inconsistency along the spatial dimensions, but involves little multi-scale fusion along the channel dimension, which is more important in HPE. Consequently, the improvement of SAC is lower than those of both DC and ADC.
IV-B Experiments on Semantic Segmentation
Similar to human pose estimation, semantic segmentation also requires rich multi-scale information to strike a balance between local and semantic features. Thus the proposed ADC should also benefit the performance of semantic segmentation models. In this section, we plug ADC into different models to demonstrate its benefits for semantic segmentation.
We use Cityscapes [6] as our training (2975 images) and validation (500 images) datasets. We use FCN [21], PSANet [30], DeepLabV3 [3] and DeepLabV3+ [4] as our baseline models, following their default input sizes and numbers of training iterations. More details can be found in the GitHub repository mmsegmentation: https://github.com/open-mmlab/mmsegmentation.git. As shown in Table IV, ADC brings consistent improvements to different semantic segmentation models. For FCN, ADC brings a particularly clear mIoU improvement.

V Conclusion
In this paper, we mainly focus on multi-scale fusion methods in human pose estimation. Existing HPE methods usually fuse feature maps of different spatial sizes to exploit multi-scale information. However, the location information is irreversibly destroyed during the downscaling, and thus the upscaled feature maps may not be well aligned spatially. This non-alignment potentially hurts the accuracy of keypoint localization. Besides, the scales of these feature maps are fixed and inflexible, which may restrict their generalization over different human sizes. In this paper, we propose an adaptive dilated convolution (ADC), which exploits multi-scale information by fusing channels with different dilation rates. In this way, each channel in ADC can represent features at one scale, and thus ADC can exploit richer multi-scale information from features of the same spatial sizes. More importantly, the dilation rates for different channels in ADC are adaptively generated, which enables ADC to adjust the scales according to the sizes of test persons. As a result, ADC can help HPE fuse better-aligned and more generalized multi-scale features. Extensive experiments on both human pose estimation and semantic segmentation show that ADC brings consistent improvements to these methods.
Acknowledgements
This work is jointly supported by National Key Research and Development Program of China (2016YFB1001000), Key Research Program of Frontier Sciences, CAS (ZDBSLYJSC032), Shandong Provincial Key Research and Development Program (2019JZZY010119), and CASAIR.
References
 [1] (2020) Learning delicate local representations for multi-person pose estimation. In ECCV, pp. 455–472. Cited by: §I, §IV-A.
 [2] (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: §II-B.
 [3] (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §I, §II-B, §IV-B, TABLE IV.
 [4] (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV. Cited by: §II-B, §IV-B, TABLE IV.
 [5] (2018) Cascaded pyramid network for multi-person pose estimation. In CVPR, pp. 7103–7112. Cited by: §I.
 [6] (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223. Cited by: §IV-B.
 [7] (2017) Deformable convolutional networks. In ICCV, pp. 764–773. Cited by: §III-B, §III-C, §IV-A5, TABLE III.
 [8] (2016) A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285. Cited by: §I.
 [9] (2019) NAS-FPN: learning scalable feature pyramid architecture for object detection. In CVPR, pp. 7036–7045. Cited by: §II-A.
 [10] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §III-D.
 [11] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-A.
 [12] (2018) Deep feature pyramid reconfiguration for object detection. In ECCV, pp. 169–185. Cited by: §II-A.
 [13] (2019) Rethinking on multi-stage networks for human pose estimation. arXiv preprint arXiv:1901.00148. Cited by: §I, §II-A.
 [14] (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125. Cited by: §II-A.
 [15] (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755. Cited by: §IV-A.
 [16] (2018) Path aggregation network for instance segmentation. In CVPR, pp. 8759–8768. Cited by: §II-A.
 [17] (2020) Efficient human pose estimation by learning deeply aggregated representations. arXiv preprint arXiv:2012.07033. Cited by: §I.
 [18] (2020) Rethinking the heatmap regression for bottom-up human pose estimation. arXiv preprint arXiv:2012.15175. Cited by: §I.
 [19] (2016) Stacked hourglass networks for human pose estimation. In ECCV, pp. 483–499. Cited by: §I.
 [20] (2017) Benchmarking and error diagnosis in multi-instance pose estimation. In ICCV. Cited by: §IV-A2.
 [21] (2017) Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 640–651. Cited by: §IV-B, TABLE IV.
 [22] (2019) Multi-person pose estimation with enhanced channel-wise and spatial information. In CVPR, pp. 5674–5682. Cited by: §I.
 [23] (2019) Deep high-resolution representation learning for human pose estimation. In CVPR, pp. 5693–5703. Cited by: §I, §III-D, TABLE I, §IV-A1, §IV-A.
 [24] (2018) MS-COCO keypoints challenge 2018. In ECCVW, Vol. 5. Cited by: §IV-A.
 [25] (2016) Convolutional pose machines. In CVPR, pp. 4724–4732. Cited by: §I.
 [26] (2018) Simple baselines for human pose estimation and tracking. In ECCV, pp. 466–481. Cited by: §III-D, TABLE I, §IV-A1, §IV-A.
 [27] (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §I, §II-B, §III-C.
 [28] (2017) Scale-adaptive convolutions for scene parsing. In ICCV, pp. 2031–2039. Cited by: §III-C, §IV-A5, TABLE III.
 [30] (2018) PSANet: point-wise spatial attention network for scene parsing. In ECCV, pp. 267–283. Cited by: §IV-B, TABLE IV.