Human pose estimation aims to locate human body parts, such as keypoints on the arms, legs, and face. It is a fundamental yet challenging task in computer vision and plays an important role in many high-level vision tasks, such as activity understanding [chunyu2013approach] and person re-identification [zheng2017pose].
With the development of convolutional neural networks [tompson2014joint, toshev2014deeppose, wei2016convolutional, tompson2015efficient, newell2016stacked], great progress has been achieved in human pose estimation. For example, Newell et al. [newell2016stacked] propose the hourglass network to predict human keypoints and show that repeated bottom-up, top-down processing and intermediate supervision are critical to improving estimation performance. Yang et al. [yang2017learning] design a Pyramid Residual Module to explicitly learn convolutional filters for building feature pyramids, which enhances the robustness of keypoint estimation against scale variations of visual patterns. Although the stacked hourglass network [newell2016stacked] and its variants [ke2018multi, yang2017learning, chu2017multi, zhang2019human] have achieved strong performance, accurate localization remains an open problem because of occluded keypoints, overlapped limbs, and abnormal poses.
To locate keypoints accurately, a model has to take both high spatial resolution information and multi-level information (e.g., multi-stage and multi-scale information) into account. For example, the right ankle and right knee of the woman in Figure 1(a) are invisible keypoints, so high-level feature maps with a large receptive field are needed to infer them. At the same time, high spatial resolution information provides detailed features, which are useful for refining the positions of the other, visible joints. As another example, the human body in Figure 1(b) is blurred and in an abnormal pose, so multi-level features should be extracted and fused effectively to estimate it. Based on this analysis, a Dilated Hourglass Module is proposed to preserve high spatial resolution information along with a large receptive field. Moreover, a Selective Information Module is designed to fuse the features of different levels with a spatial content-aware mechanism. Based on these two components, we propose an efficient Spatial Preserve and Content-aware Network (SPCNet) for human pose estimation, as shown in Figure 2.
To preserve high spatial resolution information along with a large receptive field, we exploit the dilated convolution operation in the hourglass network. The classic hourglass network adopts a large downsampling factor to obtain a large receptive field. This is good for estimating occluded and invisible keypoints, but it compromises localization ability. Different from the hourglass network, we substitute the hourglass module with our proposed Dilated Hourglass Module (DHM) to obtain a large receptive field while avoiding the reduction of spatial dimension caused by the subsampling operation. As shown in Figure 2(a), the spatial size is fixed after only two subsampling operations. Dilated bottlenecks are then applied to keep the spatial resolution, which efficiently captures semantic information while maintaining detailed features.
Recently, there has been increasing interest in designing networks with attention mechanisms. Recent works focus on spatial attention, channel attention, and non-local attention over single-level information, and demonstrate the effectiveness of attention modules. However, little attention has been paid to adaptively fusing multi-level information under a visual attention mechanism. In this paper, a Selective Information Module (SIM) is proposed to effectively fuse features of different levels under the attention mechanism. As described in Figure 2(b), we adaptively assemble the multi-level features via a pixel-wise weighted summation in the spatial dimension, where the pixel-wise weights are produced by an attention-based method that takes the spatial content into account. At different spatial positions, multi-level features are fused in different proportions according to the content of the local region.
In this paper, we first gather the multi-stage and multi-scale information from the decoder layers of each Dilated Hourglass Module to compose four high-dimensional feature maps of different levels. Second, we use the proposed Selective Information Module to effectively fuse the four levels of information for predicting the human body keypoints. We then evaluate our proposed method on the MPII Human Pose dataset [andriluka20142d], the LSP dataset [johnson2010clustered], and the FLIC dataset [sapp2013modec]. Ablation studies demonstrate the effectiveness of the Dilated Hourglass Module and the Selective Information Module. In particular, our method exceeds prior methods and achieves state-of-the-art performance.
In summary, there are three contributions in our paper:
We explore a novel Dilated Hourglass Module which employs the dilated bottlenecks to preserve high spatial resolution and obtain large receptive field.
We propose an effective feature fusion method called Selective Information Module, which is able to adaptively assemble the multi-level spatial information.
Our proposed network outperforms the state-of-the-art methods on MPII Human Pose dataset, LSP dataset and FLIC dataset.
The rest of this paper is organized as follows. In Section 2, we review work related to ours. In Section 3, we present the main idea of our SPCNet. Ablation studies that measure the effects of the different parts of our system, together with the experimental results, are reported in Section 4, followed by a conclusion in Section 5.
2 Related Work
In this section, we review three parts related to our method: human pose estimation, dilated convolution and attention mechanism.
Human pose estimation. There are many application scenarios for human pose estimation, such as activity understanding [chunyu2013approach] and person re-identification [zheng2017pose]. Traditional methods, represented by pictorial structures [andriluka2009pictorial] and graphical models [chen2014articulated], rely on hand-crafted features to predict keypoint positions and are therefore susceptible to difficult cases such as occlusion. Recently, deep convolutional networks have surpassed traditional methods and achieved state-of-the-art results in pose estimation. DeepPose [toshev2014deeppose] is the first to apply deep learning to pose estimation, directly regressing the keypoint coordinates with multi-stage refinement. The methods of [newell2016stacked, wei2016convolutional, chen2018cascaded, Cao2017Realtime, Sun2019Deep] use fully convolutional neural networks to regress Gaussian heatmaps and infer the human keypoint coordinates from the Gaussian peaks. These methods can produce high-quality representations. Newell et al. [newell2016stacked] propose the hourglass network to predict human keypoints and show that repeated bottom-up, top-down processing and intermediate supervision are critical to improving estimation performance. PyraNet [yang2017learning] is a variant of the stacked hourglass network that designs a pyramid residual module to enhance the scale invariance of deep convolutional neural networks.
Dilated convolution. Recently, many approaches based on dilated convolution have achieved high performance on benchmarks for semantic segmentation [chen2017rethinking] and object detection [Li2019Scale]. DeepLab [chen2017rethinking] designs atrous spatial pyramid pooling (ASPP), which applies dilated convolutions with various dilation rates on multiple parallel branches to capture both detailed information and context information (i.e., multi-level information). One key advantage is that dilated convolution can effectively enlarge the receptive field to incorporate context without introducing extra parameters or computation cost. Similarly, a large receptive field is also needed to predict hard keypoints. Motivated by this, we propose a novel bottom-up and top-down hourglass module equipped with dilated convolution. In this way, we obtain a large receptive field and maintain high spatial resolution simultaneously, both of which are critical to the human pose estimation task.
Attention mechanism. Visual attention has achieved great success in various tasks, such as image classification, human pose estimation, and image segmentation. SENet [hu2018squeeze] proposes a "Squeeze-and-Excitation" block to recalibrate channel-wise features by using a channel attention operation. In [su2019multi], Su et al. design a Spatial Channel-wise Attention Residual Bottleneck to enhance the feature responses in both the spatial and channel-wise context. The above methods only study spatial attention and channel attention on single-level information. However, little attention has been paid to adaptively fusing multi-level information under a visual attention mechanism. Our work is inspired by the spatial-attention approaches: we adaptively assemble the multi-level spatial features by fusing the information of different levels via a pixel-wise weighted summation, where the weights are learned through the spatial attention mechanism.
In this section, we propose a novel Spatial Preserve and Content-aware Network (SPCNet) to preserve spatial resolution information and select relatively important features from different levels according to the local part content under the spatial attention mechanism. An overview of the proposed SPCNet is illustrated in Figure 2. First, we briefly review the stacked hourglass network [newell2016stacked]. Then, we introduce the structure of Dilated Hourglass Module (DHM) and Selective Information Module (SIM) in detail. Finally, the complete network architecture of SPCNet is presented as well as the training and inference processing details.
3.1 Revisiting Stacked Hourglass Network
Stacked hourglass network [newell2016stacked] is a classic approach for locating body keypoints from RGB images. The hourglass unit performs a bottom-up process by subsampling the feature maps, and then conducts a symmetric top-down process by upsampling the feature maps and combining them with the higher-resolution features from the bottom layers to generate high-resolution heatmaps. The hourglass units are then stacked to build the stacked hourglass network, and each hourglass unit is supervised with the ground-truth heatmap.
3.2 Dilated Hourglass Module
Motivation. In the task of single-person pose estimation, most modern methods treat the problem as dense regression. A large downsampling factor in the encoding process brings a large effective receptive field, which is beneficial to the inference of occluded, twisted, and overlapped limbs. However, it reduces the spatial resolution, which is critically important for refining the location of human body keypoints. Therefore, the trade-off between global features and detail features should be taken into account. Motivated by this, we propose a novel Dilated Hourglass Module that not only maintains the high-resolution spatial structure of the human body but also obtains a large receptive field, as shown in Figure 2(a).
In the bottom-up path, the original hourglass unit conducts four downsampling operations, reducing the spatial dimension from $64 \times 64$ to $4 \times 4$ pixels to obtain a large receptive field. The lowest-resolution feature map contains high-level semantic information that is critical for locating hard keypoints, but it sacrifices the spatial resolution information that provides the detail needed to refine keypoint positions. To leverage both abundant context information and high spatial resolution information, we propose a Dilated Hourglass Module (DHM).
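The resolution trade-off described above can be checked with a few lines of arithmetic. The sketch below compares four downsamplings (original hourglass unit) with a module that stops after two, as the proposed DHM does. It assumes that each downsampling halves the spatial size and that the module input is 64 pixels wide; the function name `encoder_resolutions` is illustrative and not taken from the original implementation.

```python
def encoder_resolutions(input_size, num_downsamples, factor=2):
    """Return the spatial sizes along the bottom-up path, starting with the input."""
    sizes = [input_size]
    for _ in range(num_downsamples):
        # Assumption: every downsampling divides the side length by `factor`.
        sizes.append(sizes[-1] // factor)
    return sizes


hourglass = encoder_resolutions(64, 4)  # original hourglass unit: 4 downsamplings
dhm = encoder_resolutions(64, 2)        # Dilated Hourglass Module: 2 downsamplings

print(hourglass)         # [64, 32, 16, 8, 4]
print(dhm)               # [64, 32, 16]
print(dhm[-1] / dhm[0])  # 0.25
```

Under these assumptions the lowest DHM resolution is 16 pixels, 1/4 of the module input, which agrees with the 16-, 32-, and 64-pixel decoder features mentioned in Section 3.3.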
On the one hand, the DHM uses only two downsampling operations to keep the spatial dimension at 1/4 of the input resolution, i.e., the spatial resolution is fixed at 4x downsampling after the two downsampling operations, which mitigates the aforementioned issue. On the other hand, to obtain a large receptive field, we replace the conventional convolutional layer of the original residual block with a convolutional layer with dilation rate $R$. The dilated convolution is formulated as follows:

$$y(m, n) = \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} x(m + R \cdot i,\; n + R \cdot j)\, w(i, j)$$

where $y(m, n)$ is the value of the output feature map at spatial position $(m, n)$, $x$ is the input feature map, $(i, j)$ represents the position index within the dilated convolutional kernel $w$, $k$ refers to the kernel size, and $R$ corresponds to the dilation rate. For instance, a $3 \times 3$ convolution filter with dilation rate $R$ covers the same area as a $(2R + 1) \times (2R + 1)$ kernel. There are three kinds of residual bottlenecks in our proposed DHM: two dilated residual blocks (Bottleneck_a and Bottleneck_b) and the conventional residual block (Bottleneck), as shown in Figure 3. $R$ is experimentally set to 2 for Bottleneck_a and Bottleneck_b.
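To make the dilated convolution concrete, the following is a minimal pure-Python sketch rather than the implementation used in the paper. It assumes a square kernel, no padding ("valid" output), stride 1, and zero-based kernel indices. The helper names `effective_kernel_size` and `dilated_conv2d` are hypothetical.

```python
def effective_kernel_size(k, rate):
    """Side length of the area covered by a k x k kernel with dilation `rate`."""
    return k + (k - 1) * (rate - 1)


def dilated_conv2d(x, w, rate):
    """'Valid' 2D dilated convolution (cross-correlation form) on nested lists."""
    k = len(w)
    span = effective_kernel_size(k, rate)
    out_h = len(x) - span + 1
    out_w = len(x[0]) - span + 1
    out = []
    for m in range(out_h):
        row = []
        for n in range(out_w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    # Kernel taps are spaced `rate` pixels apart in the input.
                    acc += x[m + rate * i][n + rate * j] * w[i][j]
            row.append(acc)
        out.append(row)
    return out


# 7 x 7 input whose value encodes its position, and a 3 x 3 kernel of ones.
x = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
w = [[1.0] * 3 for _ in range(3)]
y = dilated_conv2d(x, w, rate=2)

print(effective_kernel_size(3, 2))  # 5
print(len(y), len(y[0]))            # 3 3
print(y[0][0])                      # 144.0
```

With dilation rate 2, the nine weights of the 3 x 3 kernel cover a 5 x 5 area, so the receptive field grows while the number of parameters stays the same.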
3.3 Selective Information Module
Motivation. High-resolution feature maps contain more details and can localize keypoints precisely, but may fail to recognize twisted or overlapped keypoints because of their small receptive field. Low-resolution representations contain more semantic context and can handle difficult scenarios (e.g., invisible keypoints, twisted poses). How to efficiently aggregate multi-level features is still a challenging issue for pose estimation. In general, element-wise summation and channel-wise concatenation are the most commonly used methods to fuse multi-level features. Element-wise summation aggregates the features from multiple levels equally and is not a learnable process. Channel-wise concatenation followed by convolution can be considered a learnable process, but it applies the same convolutional kernel across the whole feature map regardless of the content of the local region. Intuitively, however, the levels should not be treated as equally important everywhere: we consider that different local regions require multi-level features in different proportions, depending on the content of the local region.
To address the aforementioned problem, we design the Selective Information Module (SIM) to adaptively assemble the spatial features from multiple levels, as illustrated in Figure 4. There are two main steps in our proposed SIM: Information Collection and Information Distribution. With Information Collection, we obtain multi-level feature information from the stacked DHMs. Through Information Distribution, the multi-level feature information is adaptively fused. In detail, our proposed module fuses the multi-level features via a weighted summation at each pixel position, where the weights are generated by a trainable process. We experimentally verify that our proposed module effectively fuses the spatial detail information and the context information.
Information Collection: We first extract the multi-level features (i.e., 16, 16, 32, 64 pixels) from each deconv layer of the Dilated Hourglass Module (DHM). Second, features with equivalent scales in the eight stacked DHMs are concatenated to obtain four high-dimensional feature maps. Then, we reduce the number of channels of the four high-dimensional feature maps from 2048 to 256 with convolutional layers, each followed by batch normalization and ReLU in sequence. Next, we upsample the multi-level features (i.e., 16, 16, 32, 64 pixels) to $64 \times 64$ and denote the resulting features with multiple receptive fields as $F_1, F_2, F_3, F_4$ (with the same resolution of $64 \times 64$ and the same 256 channels).
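The bookkeeping of Information Collection can be sketched as follows. This is an illustration only. It assumes that each DHM contributes 256 channels per level (so that eight stacks give the 2048 channels mentioned above), that all levels are brought to the largest size of 64 pixels, and that nearest-neighbour interpolation is used, which the text does not specify. The helper `upsample_nearest` is hypothetical.

```python
NUM_STACKS = 8                  # eight stacked DHMs (from the text)
CHANNELS_PER_STACK = 256        # assumption: 2048 concatenated channels / 8 stacks
LEVEL_SIZES = [16, 16, 32, 64]  # decoder feature sizes in pixels (from the text)
TARGET_SIZE = 64                # assumption: every level is upsampled to the largest size


def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2D list by an integer factor."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide))
    return out


concat_channels = NUM_STACKS * CHANNELS_PER_STACK
factors = [TARGET_SIZE // size for size in LEVEL_SIZES]
large = upsample_nearest([[1, 2], [3, 4]], 2)

print(concat_channels)  # 2048
print(factors)          # [4, 4, 2, 1]
print(large)            # [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```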
Information Distribution: We first simply fuse the four multi-level features $F_1, F_2, F_3, F_4$ produced by Information Collection via element-wise combination, which is considered as information aggregation from features of multiple levels. Then we employ convolutional filters to conduct the channel squeeze and compress the channels to 4. Here we consider the squeezed feature $A = [A_1, A_2, A_3, A_4]$, where $A_i$ corresponds to a single-channel feature map, and the index $p$ represents the pixel position. After that, a softmax operation is conducted across channels to rescale the activations, and the rescaled activations are sliced along the channel dimension to generate pixel-wise adaptive weights corresponding to the multi-level features (i.e., $W_1, W_2, W_3, W_4$). The process above is formulated as follows:

$$W_i(p) = \frac{\exp(A_i(p))}{\sum_{j=1}^{4} \exp(A_j(p))}, \quad i = 1, \dots, 4$$

in which $W_i(p)$ represents the relative importance of the multi-level features at pixel position $p$, and $W_1, W_2, W_3, W_4$ are the learned weights for $F_1, F_2, F_3, F_4$ respectively. The sum of the $W_i(p)$ along the channel dimension is normalized to 1 via the softmax operation at each spatial position $p$. Finally, the assembly of the features with multiple receptive fields is defined as:

$$F = \sum_{i=1}^{4} W_i \odot F_i$$

where $F$ is the final fused feature for predicting the human keypoints and $\odot$ means pixel-wise multiplication.
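The pixel-wise weighted summation can be illustrated with a small pure-Python sketch. It is not the authors' implementation: each level is reduced to a single-channel map for readability, the squeezed activations (`logits`) are given directly instead of being produced by a convolution, and the names `softmax` and `selective_fusion` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def selective_fusion(features, logits):
    """Fuse per-level maps with per-pixel softmax weights.

    features, logits: lists over levels of 2D maps [H][W].
    Returns the fused map and the per-level weight maps.
    """
    num_levels = len(features)
    height, width = len(features[0]), len(features[0][0])
    fused = [[0.0] * width for _ in range(height)]
    weights = [[[0.0] * width for _ in range(height)] for _ in range(num_levels)]
    for y in range(height):
        for x in range(width):
            # Softmax across levels at this pixel, so the weights sum to 1.
            w = softmax([logits[i][y][x] for i in range(num_levels)])
            for i in range(num_levels):
                weights[i][y][x] = w[i]
                fused[y][x] += w[i] * features[i][y][x]
    return fused, weights


# Four levels on a 1 x 2 map; level i has the constant value i + 1.
features = [[[float(i + 1)] * 2] for i in range(4)]
# Pixel 0: equal activations. Pixel 1: the last level dominates.
logits = [[[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 50.0]]]
fused, weights = selective_fusion(features, logits)

print(fused[0][0])  # 2.5
```

At the first pixel all levels contribute equally, so the result is the plain average. At the second pixel almost all weight goes to the last level, which is the content-dependent behaviour that plain summation or a shared convolution kernel cannot express.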
3.4 Network Architecture, Training and Inference
Network Architecture. With the Dilated Hourglass Module and Selective Information Module, we propose a novel Spatial Preserve and Content-aware Network (SPCNet) for human pose estimation as illustrated in Figure 2. First, we propose a Dilated Hourglass Module to preserve the spatial resolution along with large receptive field. Similar to Hourglass Network, we stack the DHMs to get the multi-stage and multi-scale information. Then, a Selective Information Module is designed to select relatively important features from different levels under a sufficient consideration of spatial content-aware mechanism. Finally, we use the selected information by the SIM to predict the human body keypoints. In our paper, we find that the two proposed modules are complementary to each other for higher performance on keypoint localization.
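Network Training. We use score maps to represent the ground-truth heatmaps of human body keypoints. Denote the ground-truth positions by $z = \{z_k\}_{k=1}^{K}$, where $K$ is the number of human body keypoints and $z_k = (x_k, y_k)$ is the coordinates of the $k$-th keypoint. In this paper, we use a Gaussian distribution with mean $z_k$ and variance $\Sigma$ to represent the ground-truth heatmap as follows:

$$S_k(p) \sim \mathcal{N}(z_k, \Sigma)$$

where $p$ denotes a pixel position and $S_k$ is the ground-truth score map of the $k$-th keypoint.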
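A squared error loss function is applied to minimize the loss between the predicted score maps (from each Dilated Hourglass Module and from the Selective Information Module) and the ground-truth heatmaps:

$$L = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left\lVert S_k - \hat{S}_k \right\rVert^2$$

where $S_k$ and $\hat{S}_k$ are the ground-truth and predicted score maps of the $k$-th keypoint, $K$ is the number of keypoints, and $N$ is the number of samples. In our paper, there are 8 auxiliary losses from the stacked Dilated Hourglass Modules and 1 supervision loss for the Selective Information Module.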
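The training target and loss can be sketched for one sample and one keypoint as follows. This is a simplified illustration under several assumptions: the heatmap is an unnormalized isotropic Gaussian with peak value 1, `sigma=1.0` is a placeholder value, the squared error carries a factor of 1/2, and the nine supervision terms are weighted equally. None of these details are confirmed by the text, and the function names are hypothetical.

```python
import math


def gaussian_heatmap(height, width, center, sigma=1.0):
    """Unnormalized 2D Gaussian with peak value 1 at center = (x, y)."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def squared_error(pred, target):
    """Half the sum of squared differences between two 2D maps."""
    return 0.5 * sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )


def total_loss(predictions, target):
    """Sum of the losses of all supervised outputs (8 DHMs + 1 SIM here)."""
    return sum(squared_error(pred, target) for pred in predictions)


target = gaussian_heatmap(5, 5, center=(2, 3))
zeros = [[0.0] * 5 for _ in range(5)]
# Eight perfect intermediate predictions and one all-zero final prediction.
outputs = [target] * 8 + [zeros]
loss = total_loss(outputs, target)

print(target[3][2])  # 1.0
```

In this toy case only the all-zero map contributes to the loss, since the other eight outputs match the target exactly.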
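Network Inference. During inference, we obtain the predicted body joint locations from the predicted score maps generated by the Selective Information Module by taking the locations with the maximum score as follows:

$$\hat{z}_k = \arg\max_{p} \hat{S}_k(p), \quad k = 1, \dots, K$$

where $\hat{S}_k(p)$ is the predicted score of the $k$-th keypoint at pixel position $p$ and $\hat{z}_k$ is its predicted location.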
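Taking the location of the maximum score can be sketched in a few lines. The example works on nested lists and returns heatmap coordinates only. Mapping them back to the original image (undoing the crop and resize) is not covered, and the function names are hypothetical.

```python
def decode_keypoint(score_map):
    """Return ((x, y), score) of the highest-scoring pixel in a 2D map."""
    best_x, best_y = 0, 0
    best_score = score_map[0][0]
    for y, row in enumerate(score_map):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return (best_x, best_y), best_score


def decode_pose(score_maps):
    """Decode one (x, y) location per keypoint score map."""
    return [decode_keypoint(score_map)[0] for score_map in score_maps]


score_maps = [
    [[0.1, 0.2, 0.0],
     [0.0, 0.9, 0.3]],
    [[0.7, 0.1, 0.0],
     [0.0, 0.2, 0.1]],
]
pose = decode_pose(score_maps)

print(pose)  # [(1, 1), (0, 0)]
```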
4 Experiments and Analysis
In this section, we first briefly introduce the datasets, evaluation metrics, and implementation details in Section 4.1. Next, we conduct a comprehensive ablation study to reveal the effectiveness of our proposed modules in Section 4.2. Finally, we compare our results with the prior state-of-the-art results on the MPII Human Pose dataset [andriluka20142d], the LSP dataset [johnson2010clustered], and the FLIC dataset [sapp2013modec].
4.1 Experimental Setup
Datasets and Evaluation Metrics. We evaluate the performance of our network on the three benchmark datasets mentioned above. The MPII Human Pose dataset is composed of around 25K images containing over 40K samples with annotated body joints; 28K samples are used for training, and the remaining 12K are used for testing. The LSP dataset and its extended training set include 12K sports images with annotations (11K images are used for training and 1K images are used for testing). The FLIC dataset consists of 5003 samples (3987 for training, 1016 for testing). The evaluation is conducted using the Percentage of Correct Keypoints (PCK) [andriluka20142d] metric, which reports the percentage of detections that fall within a normalized distance of the ground truth. For the MPII Human Pose evaluation, we use the modified PCK measure that uses a fraction of the head size as the normalization factor (denoted as PCKh [andriluka20142d]). For the LSP and FLIC evaluation, we use PCK, following previous work [yang2017learning, tang2018deeply].
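As a rough illustration of the metric, the sketch below counts a keypoint as correct when its distance to the ground truth is at most `threshold` times a per-sample normalization length (the head size for PCKh). It is a simplified version and not the official evaluation code: the inclusive comparison is an assumption, joints without annotation are not handled, and the function name `pck` is hypothetical.

```python
import math


def pck(preds, gts, norm_lengths, threshold):
    """Percentage of keypoints within threshold * norm_length of the ground truth.

    preds, gts: one list of (x, y) tuples per sample.
    norm_lengths: one normalization length per sample (e.g., head size).
    """
    correct = 0
    total = 0
    for pred, gt, norm in zip(preds, gts, norm_lengths):
        for (px, py), (gx, gy) in zip(pred, gt):
            distance = math.hypot(px - gx, py - gy)
            if distance <= threshold * norm:
                correct += 1
            total += 1
    return 100.0 * correct / total


# One sample with head size 10: the first joint is about 4.2 px off,
# the second is 10 px off.
preds = [[(0.0, 0.0), (0.0, 0.0)]]
gts = [[(3.0, 3.0), (6.0, 8.0)]]
score_05 = pck(preds, gts, norm_lengths=[10.0], threshold=0.5)
score_15 = pck(preds, gts, norm_lengths=[10.0], threshold=1.5)

print(score_05)  # 50.0
print(score_15)  # 100.0
```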
During training, we use random rotation, random flipping, and random scaling. The rotation range is (-60, 60) degrees, the flip probability is 0.5, and the scale range is (0.75, 1.25). Each input image is cropped around the target person according to the annotated body center and scale, and then resized to $256 \times 256$ pixels.
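The augmentation parameters above could be sampled as in the following sketch. Uniform sampling within the stated ranges is an assumption, since the text gives only the ranges. The function name and dictionary keys are hypothetical, and applying the transform to the image and keypoints is left out.

```python
import random


def sample_augmentation(rng, max_rotation=60.0, scale_range=(0.75, 1.25), flip_prob=0.5):
    """Draw one set of augmentation parameters (assumed uniform distributions)."""
    return {
        "rotation_deg": rng.uniform(-max_rotation, max_rotation),
        "scale": rng.uniform(scale_range[0], scale_range[1]),
        "flip": rng.random() < flip_prob,
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_augmentation(rng) for _ in range(1000)]

print(samples[0])
```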
We train our proposed network using the RMSProp [tieleman2012lecture] optimizer with a mini-batch size of 48 (12 per GPU) for 170 epochs on a workstation with four 12GB NVIDIA TITAN XP GPUs. The initial learning rate is 1e-3 and is dropped by a factor of 10 at the 120th and the 150th epoch. All code is implemented in PyTorch. A Mean Squared Error (MSE) loss is applied to compute the loss between the predicted heatmap and the ground-truth heatmap. Testing results are produced from six-scale image pyramids with flipping.
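The step schedule described above can be written as a small function. The sketch assumes 1-based epoch numbering, so the first drop takes effect in epoch 120. The name `learning_rate` is hypothetical, and the behaviour is similar in spirit to a multi-step scheduler such as PyTorch's `MultiStepLR`.

```python
NUM_GPUS = 4
BATCH_PER_GPU = 12
TOTAL_EPOCHS = 170


def learning_rate(epoch, base_lr=1e-3, milestones=(120, 150), gamma=0.1):
    """Learning rate for a 1-based epoch index with drops at the milestones."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


batch_size = NUM_GPUS * BATCH_PER_GPU
schedule = [learning_rate(epoch) for epoch in range(1, TOTAL_EPOCHS + 1)]

print(batch_size)  # 48
```

Under this assumption, epochs 1 to 119 use roughly 1e-3, epochs 120 to 149 roughly 1e-4, and epochs 150 to 170 roughly 1e-5 (up to floating-point rounding).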
4.2 Ablative Analysis
In this subsection, we conduct ablation experiments on the validation set of the MPII Human Pose to explore the effectiveness of the proposed Dilated Hourglass Module (DHM) and Selective Information Module (SIM). We define the 8-stack hourglass network as our baseline network. Based on the hourglass network, we first explore each proposed component and then conduct comprehensive analysis for the impact of each module (i.e. DHM and SIM) for the whole network by comparing the PCKh score.
Effect of Dilated Hourglass Module. In this experiment, we investigate the effect of our proposed Dilated Hourglass Module and the influence of the dilation rate R. The original hourglass module is replaced by the DHM with various dilation rates in a series of experiments. As shown in Table 1, we achieve the best performance when R is set to 2. Compared with the baseline network, the PCKh score improves from 88.9% to 89.7% with the Dilated Hourglass Module at dilation rate 2, a gain of 0.8 percentage points. This experimentally confirms that using dilated convolution preserves more spatial information, which is beneficial for refining the positions of joints.
Effect of Selective Information Module. There are two main steps in the Selective Information Module (SIM): Information Collection (IC) and Information Distribution (ID). To explore the effectiveness of the Selective Information Module, we first obtain the multi-level features from the stacked DHMs through the Information Collection process, as described in Section 3.3, and then conduct a series of experiments with three feature fusion methods: element-wise summation, channel-wise concatenation, and the Information Distribution process of our proposed SIM. In detail, for element-wise summation, we sum the multi-level features directly; for channel-wise concatenation, we first concatenate the multi-level features along the channel dimension and then add an extra convolutional layer after the concatenation to generate the fused feature by compressing the channels to 256. As shown in Table 2, the Information Collection process combined with any of the three feature fusion methods improves the performance of the baseline network. Furthermore, the Information Distribution process achieves the best PCKh score among the three feature fusion methods on the MPII validation set. We obtain 0.5% and 0.3% improvement by replacing element-wise summation and channel-wise concatenation, respectively, with the Information Distribution process, and achieve 0.7% improvement by adding the SIM to the original hourglass network, which demonstrates the superior performance of our Selective Information Module over the other feature fusion methods.
| Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean |
|---|---|---|---|---|---|---|---|---|
| Newell et al. [newell2016stacked] | 96.7 | 95.8 | 89.9 | 84.9 | 88.8 | 85.0 | 80.6 | 88.9 |
| single-scale with horizontal flip | | | | | | | | |
| Newell et al. [newell2016stacked] | 96.8 | 96.0 | 90.6 | 85.9 | 89.8 | 86.1 | 81.1 | 89.5 |
| Yang et al. [yang2017learning] | 96.8 | 96.0 | 90.4 | 86.0 | 89.5 | 85.2 | 82.3 | 89.6 |
| Tang et al. [tang2018deeply] | 95.6 | 95.9 | 90.7 | 86.5 | 89.9 | 86.6 | 82.5 | 89.8 |
| multi-scale with horizontal flip | | | | | | | | |
| Newell et al. [newell2016stacked] | 97.1 | 96.1 | 90.8 | 86.2 | 89.9 | 85.9 | 83.5 | 90.0 |
| Yang et al. [yang2017learning] | 97.4 | 96.2 | 91.1 | 86.9 | 90.1 | 86.0 | 83.9 | 90.3 |
| Tang et al. [tang2018deeply] | 97.4 | 96.2 | 91.0 | 86.9 | 90.6 | 86.8 | 84.5 | 90.5 |
Comprehensive Analysis. In this experiment, we explore the contribution of each module to the whole network. Besides adding each proposed module to the baseline separately, we also add the two proposed modules to the baseline simultaneously. The results are reported in Table 3. Compared with the 88.9% PCKh score of the baseline hourglass network, we achieve 0.8% improvement with only the Dilated Hourglass Module and 0.7% improvement with only the Selective Information Module. Finally, our method achieves a 90.0% PCKh score with the two proposed modules applied simultaneously, which is a 1.1% improvement over the baseline hourglass. The validation PCKh curves of the different architectures above and the validation PCKh curves of different keypoints at different thresholds for SPCNet are plotted in Figure 7(a) and (b) respectively.
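The gains quoted in this paragraph can be cross-checked with simple arithmetic. In the sketch below, the baseline, DHM-only, and combined scores are taken from the text, while the SIM-only score of 89.6 is derived here as 88.9 + 0.7 and is not read from Table 3.

```python
BASELINE = 88.9

scores = {
    "DHM only": 89.7,   # stated in the text
    "SIM only": 89.6,   # derived: baseline + 0.7
    "DHM + SIM": 90.0,  # stated in the text
}

# Gains in percentage points, rounded to one decimal to absorb float noise.
gains = {name: round(score - BASELINE, 1) for name, score in scores.items()}
sum_of_individual = round(gains["DHM only"] + gains["SIM only"], 1)

print(gains)              # {'DHM only': 0.8, 'SIM only': 0.7, 'DHM + SIM': 1.1}
print(sum_of_individual)  # 1.5
```

The combined gain of 1.1 points is larger than either individual gain but smaller than their sum of 1.5 points.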
| Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean |
|---|---|---|---|---|---|---|---|---|
| Lifshitz et al. [lifshitz2016human] | 97.8 | 93.3 | 85.7 | 80.4 | 85.3 | 76.6 | 70.2 | 85.0 |
| Rafi et al. [rafi2016efficient] | 97.2 | 93.9 | 86.4 | 81.3 | 86.8 | 80.6 | 73.4 | 86.3 |
| Insafutdinov et al. [insafutdinov2016deepercut] | 96.8 | 95.2 | 89.3 | 84.4 | 88.4 | 83.4 | 78.0 | 88.5 |
| Wei et al. [wei2016convolutional] | 97.8 | 95.0 | 88.7 | 84.0 | 88.4 | 82.8 | 79.4 | 88.5 |
| Newell et al. [newell2016stacked] | 98.2 | 96.3 | 91.2 | 87.1 | 90.1 | 87.4 | 83.6 | 90.9 |
| Sun et al. [sun2017human] | 98.1 | 96.2 | 91.2 | 87.2 | 89.8 | 87.4 | 84.1 | 91.0 |
| Chu et al. [chu2017multi] | 98.5 | 96.3 | 91.9 | 88.1 | 90.6 | 88.0 | 85.0 | 91.5 |
| Chen et al. [chen2017rethinking] | 98.1 | 96.5 | 92.5 | 88.5 | 90.2 | 89.6 | 86.0 | 91.9 |
| Yang et al. [yang2017learning] | 98.5 | 96.7 | 92.5 | 88.7 | 91.1 | 88.6 | 86.0 | 92.0 |
| Ke et al. [ke2018multi] | 98.5 | 96.8 | 92.7 | 88.4 | 90.6 | 89.3 | 86.3 | 92.1 |
| Tang et al. [tang2018deeply] | 98.4 | 96.9 | 92.6 | 88.7 | 91.8 | 89.4 | 86.2 | 92.3 |
| Zhang et al. [zhang2019human] | 98.6 | 97.0 | 92.8 | 88.8 | 91.7 | 89.8 | 86.6 | 92.5 |
Following prior methods [chu2017multi, yang2017learning], we further apply horizontal flipping and six-scale image pyramids on the validation set of MPII Human Pose. The results are reported in Table 4. With multi-scale image pyramids and horizontal flipping applied to the prediction, we achieve a 91.1% PCKh score, which is a 0.3% improvement over HRNet [Sun2019Deep].
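Flip testing is commonly implemented by averaging the heatmaps of the original image with the un-mirrored heatmaps of the horizontally flipped image. The sketch below illustrates that idea on nested lists. It is an assumption about the procedure and not a description of the authors' code: the scale pyramid is omitted, and the joint ordering and the `swap_pairs` argument are hypothetical.

```python
def hflip(heatmap):
    """Mirror a 2D map left to right."""
    return [row[::-1] for row in heatmap]


def flip_average(heatmaps, flipped_heatmaps, swap_pairs):
    """Average heatmaps from an image and from its mirrored copy.

    heatmaps: per-joint 2D maps predicted on the original image.
    flipped_heatmaps: per-joint 2D maps predicted on the mirrored image.
    swap_pairs: (left_index, right_index) pairs whose channels trade places
                when the image is mirrored.
    """
    restored = [hflip(h) for h in flipped_heatmaps]
    for left, right in swap_pairs:
        restored[left], restored[right] = restored[right], restored[left]
    return [
        [[(a + b) / 2.0 for a, b in zip(row_a, row_b)]
         for row_a, row_b in zip(map_a, map_b)]
        for map_a, map_b in zip(heatmaps, restored)
    ]


# Two joints (index 0: left wrist, index 1: right wrist) on a 1 x 3 map.
original = [[[1.0, 0.0, 0.0]], [[0.0, 0.0, 1.0]]]
# In the mirrored image the wrists keep their columns but trade channels:
# the true left wrist is now at column 2 and looks like a right wrist.
mirrored = [[[1.0, 0.0, 0.0]], [[0.0, 0.0, 1.0]]]
averaged = flip_average(original, mirrored, swap_pairs=[(0, 1)])

print(averaged)  # [[[1.0, 0.0, 0.0]], [[0.0, 0.0, 1.0]]]
```

Swapping the left and right channels after mirroring is what keeps the two predictions aligned; without it the averaged maps would blur symmetric joints together.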
4.3 Comparison with the state-of-the-art Methods
To evaluate the performance of our method, we compare our network with the prior state-of-the-art methods on three datasets: MPII Human Pose test dataset, LSP test dataset and FLIC test dataset. Moreover, we give some qualitative results generated by baseline network and our proposed network.
MPII Human Pose dataset. We report the PCKh scores of our approach and the previous state-of-the-art methods at the threshold of 0.5 in Table 5. Compared with the hourglass network, our approach improves the performance of total PCKh score from 90.9% to 92.6%. Specifically, as shown in Table 5, our method surpasses the HRNet[Sun2019Deep] across all keypoints except for the wrist. The final results demonstrate the superior performance of our proposed model over the prior state-of-the-art methods in terms of PCKh score.
LSP dataset. Table 6 summarizes the PCK scores at the threshold of 0.2 on the LSP dataset. We follow previous methods [yang2017learning, chu2017multi] and train our network by adding the training set of MPII Human Pose to the LSP dataset and its extended training set. Our method achieves the highest total score of 96.4% and exceeds the previous state-of-the-art results across all keypoints on the LSP test set. We observe that the proposed network improves the PCK scores by a large margin of 4.4% and 2.7% on the wrist and elbow compared with the closest competitor, and obtains a 1.3% improvement on average.
| Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean |
|---|---|---|---|---|---|---|---|---|
| Lifshitz et al. [lifshitz2016human] | 96.8 | 89.0 | 82.7 | 79.1 | 90.9 | 86.0 | 82.5 | 86.7 |
| Pishchulin et al. [pishchulin2016deepcut] | 97.0 | 91.0 | 83.8 | 78.1 | 91.0 | 86.7 | 82.0 | 87.1 |
| Insafutdinov et al. [insafutdinov2016deepercut] | | | | | | | | |
| Wei et al. [wei2016convolutional] | 97.8 | 92.5 | 87.0 | 83.9 | 91.5 | 90.8 | 89.9 | 90.5 |
| Sun et al. [sun2017human] | 97.9 | 93.6 | 89.0 | 85.8 | 92.9 | 91.2 | 90.5 | 91.6 |
| Chu et al. [chu2017multi] | 98.1 | 93.7 | 89.3 | 86.9 | 93.4 | 94.0 | 92.5 | 92.6 |
| Yang et al. [yang2017learning] | 98.3 | 94.5 | 92.2 | 88.9 | 94.4 | 95.0 | 93.7 | 93.9 |
| Zhang et al. [zhang2019human] | 98.4 | 94.8 | 92.0 | 89.4 | 94.4 | 94.8 | 93.8 | 94.0 |
| Tang et al. [tang2018deeply] | 98.3 | 95.9 | 93.5 | 90.7 | 95.0 | 96.6 | 95.7 | 95.1 |
FLIC dataset. Table 7 shows the PCK@0.2 scores on the FLIC dataset. Our proposed method achieves PCK@0.2 scores of 99.3% and 98.2% for the elbow and wrist, which are improvements of 0.3% and 1.2%, respectively, over the hourglass network.
| Method | Elbow | Wrist | Mean |
|---|---|---|---|
| Wei et al. [wei2016convolutional] | 97.8 | 95.0 | 96.4 |
| Newell et al. [newell2016stacked] | 99.0 | 97.0 | 98.0 |
| Ke et al. [ke2018multi] | 99.2 | 97.3 | 98.3 |
Qualitative results. We compare the baseline model (8-stack hourglass network) with our proposed model by visualizing the estimated heatmaps and skeletons on the test set of MPII Human Pose, as shown in Figure 5. We observe that our method outperforms the baseline model in challenging cases, such as occluded keypoints, invisible keypoints, cluttered backgrounds, and abnormal body postures. The 1st row shows a woman with a twisted pose. The 2nd row shows a person with overlapped limbs. The 3rd row shows a person whose limbs are hard to distinguish from the surroundings, and the 4th row shows a person with an abnormal pose. In these complex scenarios, the baseline model produces failed predictions, as shown in Figure 5(c), while our proposed model produces refined predictions, as shown in Figure 5(e). It is noteworthy that our proposed model generates higher responses in the predicted heatmaps for the blurred regions, as illustrated in Figure 5(d) compared with Figure 5(b). In addition, examples of estimated poses on the MPII test set and the LSP test set are shown in Figure 6.
Compared with the baseline hourglass, our method effectively improves pose estimation performance in difficult scenarios (e.g., occlusion, twisted and overlapped human bodies, and abnormal poses). By using the Dilated Hourglass Module, we preserve spatial resolution information along with a large receptive field. With the Selective Information Module, the multi-stage and multi-scale information is adaptively selected and enhanced for the final human keypoint localization. Leveraging the two proposed modules, we obtain high spatial resolution together with information from different receptive fields, both of which are critical to the human pose estimation task.
In this paper, we propose to incorporate a Dilated Hourglass Module and a Selective Information Module into an end-to-end architecture for human pose estimation. By stacking Dilated Hourglass Modules, we preserve spatial resolution information along with a large receptive field. Meanwhile, a Selective Information Module is designed to select relatively important features from different levels with a spatial content-aware mechanism. The effectiveness of the Dilated Hourglass Module and the Selective Information Module is evaluated on the validation set of MPII Human Pose. We experimentally observe that the proposed network alleviates the difficulties brought by occlusions, overlapped bodies, and abnormal poses. Overall, our approach achieves state-of-the-art results on the MPII Human Pose dataset, the LSP dataset, and the FLIC dataset.