In this paper, we propose a novel method called Residual Steps Network (RSN). RSN efficiently aggregates features with the same spatial size (intra-level features) to obtain delicate local representations, which retain rich low-level spatial information and result in precise keypoint localization. In addition, we propose an efficient attention mechanism, Pose Refine Machine (PRM), to further refine the keypoint locations. Our approach won 1st place in the COCO Keypoint Challenge 2019 and achieves state-of-the-art results on both the COCO and MPII benchmarks, without using extra training data or a pretrained model. Our single model achieves 78.6 on COCO test-dev and 93.0 on the MPII test dataset. Ensembled models achieve 79.2 on COCO test-dev and 77.1 on the COCO test-challenge dataset. The source code is publicly available for further research at https://github.com/caiyuanhao1998/RSN
The goal of multi-person pose estimation is to locate the keypoints of all persons in a single image. It is a fundamental task for human motion recognition, kinematics analysis, human-computer interaction, animation, etc. For years, human pose estimation was based on handcrafted features. Recently, it has made great progress with the development of deep convolutional neural networks.
The task of human pose estimation concerns both keypoint localization and classification. Spatial information benefits the localization task, while semantic information is good for the classification task. To extract these two kinds of information, current methods mainly focus on aggregating inter-level features. For instance, HRNet maintains spatial information in a high-resolution sub-network and gradually adds semantic information to it from low-resolution sub-networks. In this way, features of different levels are fully aggregated. In CPN, features of four different spatial levels are extracted by the backbone and combined by a head network. Although these methods differ in how they fuse features, the features to be aggregated always come from different levels. In contrast, feature fusion within the same level remains less explored in human pose estimation.
The comparison of intra-level feature fusion (level 1) and inter-level feature fusion is illustrated in Figure 1. In Figure 1(a), the feature maps are successively downsampled to 1/4, 1/8, 1/16 and 1/32 of the input image size. We define consecutive feature maps with the same spatial size as one level. As Figure 1(c) depicts, there is a big gap between the receptive fields of features from different levels, which are indicated by light blue bounding boxes. As a result, representations learned by inter-level feature fusion are relatively coarse, which prevents human pose from being localized precisely. As Figure 1(b) shows, the gap between the receptive fields of intra-level features, which are indicated by red bounding boxes, is relatively small. As shown in Figure 1(d), fusing intra-level features can extract much more delicate local representations that retain more precise spatial information, which is critical to keypoint localization.
To learn better local representations, we propose a novel network architecture, Residual Steps Network (RSN). The Residual Steps Block (RSB) of RSN fuses features inside each level using dense element-wise sum operations, as shown in Figure 2(c). The inner structure of RSB is deeply connected and is motivated by DenseNet, which performs well for human pose estimation because its deep connections retain rich low-level features. However, deep connections cause the network capacity to explode as the network goes deeper, so DenseNet performs poorly when the network becomes big. RSN is motivated by DenseNet but differs in that it uses element-wise sum rather than concatenation to circumvent the capacity explosion. RSN is also modestly less densely connected inside the block than DenseNet, which further improves efficiency. Additionally, to further refine the keypoint locations, we propose an attention module, Pose Refine Machine (PRM), to rebalance the output features of the network. The architecture of PRM is illustrated in Figure 3 and analyzed in Section 3.3. To better illustrate the advantages of our approach, we analyze the differences between RSN and current methods in Section 2.2.
In conclusion, our contributions can be summarized as three points:
1. We propose a novel network - RSN, which aims to learn delicate local representations by efficient intra-level feature fusion.
2. We propose an attention mechanism - PRM, which goes further to refine the aggregated features both channel-wise and spatially.
3. Our approach outperforms the state-of-the-art methods on both COCO and MPII datasets without using extra training data or a pretrained model.
Current methods of human pose estimation fall into two categories: top-down methods [23, 13, 10, 7, 27, 19, 26, 25, 22, 3, 33] and bottom-up methods [2, 16, 32, 20]. Top-down methods first detect the positions of all persons, then estimate the pose of each person. Bottom-up methods first detect all the human keypoints in an image and then assemble these points into groups to form different individuals. Since this paper mainly concentrates on feature fusion strategies, we discuss these methods in terms of feature fusion.
Recently, many methods [19, 33, 3, 35, 27] of human pose estimation use inter-level feature fusion to extract more spatial and semantic information. Newell et al. propose a U-shape convolutional neural network (CNN) named Hourglass. In a single stage of Hourglass, high-level features are added to low-level features after upsampling. Later works such as Yang et al. show the strong performance of inter-level feature fusion. Chen et al. combine inter-level features using a refine block. Sun et al. set up four parallel sub-networks, whose features are aggregated with each other in both high-to-low and low-to-high directions.
Though many methods have validated the effectiveness of inter-level feature fusion, intra-level feature fusion is rarely explored in human pose estimation. However, it has extensive applications in other tasks such as semantic segmentation and image classification [28, 9, 12, 5, 38, 34]. In an Inception block, features pass separately through several convolutional layers with different kernels and are then added up. DenseNet fuses intra-level features using successive concatenation operations. This design retains low-level features to improve performance. However, when the network goes deeper, the capacity increases sharply and much redundant information appears in the network, resulting in poor efficiency. Different from DenseNet, RSN uses element-wise sum rather than concatenation to circumvent network capacity explosion. In addition, RSN is modestly less densely connected in the constituent unit, which further improves efficiency.
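The capacity argument above can be made concrete with simple channel arithmetic. The following is a minimal sketch, not taken from the paper: the function name `channels_after_fusion` and the example widths (64 base channels, growth of 32) are illustrative assumptions.

```python
def channels_after_fusion(base_channels, num_layers, growth, mode):
    """Channel count seen by the next layer after `num_layers` fused layers."""
    if mode == "concat":
        # Concatenation (DenseNet-style): every layer's output is appended,
        # so the input width grows linearly with depth.
        return base_channels + num_layers * growth
    if mode == "sum":
        # Element-wise sum: operands must share a shape, so the width is fixed.
        return base_channels
    raise ValueError(f"unknown mode: {mode}")


for depth in (4, 16, 64):
    print(depth,
          channels_after_fusion(64, depth, 32, "concat"),
          channels_after_fusion(64, depth, 32, "sum"))
```

With concatenation the width, and hence the cost of every following convolution, keeps growing with depth, while with element-wise sum it stays constant.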
Res2Net and OSNet focus on multi-scale representations. Both of them lack dense connections between adjacent branches. Dense connections contribute sufficient gradients and make low-level features better supervised. Therefore, the lack of dense connections between adjacent branches results in less precise spatial information, which is essential to keypoint localization. Consequently, both Res2Net and OSNet are inferior to RSN in the task of human pose estimation. In Section 4.1.2, we compare the efficiency of DenseNet, Res2Net and RSN.
Attention mechanisms are used in almost all areas of computer vision. Current attention methods mainly fall into two categories: channel attention [31, 11, 26, 36] and spatial attention [26, 37, 31, 15, 8]. Woo et al. propose a channel attention module with global average pooling and max pooling. Kligvasser et al. propose a spatial activation function with depth-wise separable convolution. Other works such as Hu et al. show the advantages of using attention mechanisms.
The overall pipeline of our method is illustrated in Figure 2. The multi-stage network architecture cascades several single-stage modules, each a Residual Steps Network (RSN), as shown in Figure 2(a). As Figure 2(b) shows, RSN differs from ResNet in the architecture of its constituent unit. RSN consists of Residual Steps Blocks (RSBs), while ResNet is composed of "bottleneck" blocks. Figure 2(c) illustrates the structure of RSB. A Pose Refine Machine (PRM) is used in the last stage and is analyzed in Section 3.3.
Residual Steps Network is designed for learning delicate local representations by repeatedly enhancing efficient intra-level feature fusion inside RSB, which is the constituent unit of RSN. As shown in Figure 2(c), RSB first divides the features into four splits $f_i$ (i = 1, 2, 3, 4), then applies a conv 1×1 (convolutional layer with kernel size 1×1) to each split separately. Each feature output from a conv 1×1 undergoes an incremental number of conv 3×3. The output features $y_i$ (i = 1, 2, 3, 4) are then concatenated and passed through a conv 1×1. An identity connection is employed as in the ResNet bottleneck. Because the incremental numbers of conv 3×3 look like steps, the network is named Residual Steps Network.
The receptive fields of RSB range across several values, and the largest one is 15. Compared with the single receptive field of the ResNet bottleneck, as shown in Table 2, RSB lets the network view more delicate information. In addition, RSB is deeply connected inside. On the $i$-th branch, the first $i-1$ conv 3×3 receive the features output from the $(i-1)$-th branch. The $i$-th conv 3×3 is then designed to refine the fusion of the features output from the $(i-1)$-th conv 3×3. Benefiting from the dense connection structure, features with small-gap receptive fields are fully fused, resulting in delicate local representations that retain precise spatial and semantic information. Additionally, during training, the deeply connected structure contributes sufficient gradients, so the low-level features are better supervised, which benefits the keypoint localization task. We explore how the branch number of RSB influences the prediction results in Section 4. The four-branch architecture has the best performance.
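The data flow of one block can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: features are modelled as flat lists of channel values, the convolutions are caller-supplied callables, and the exact wiring of the lateral sums (the $j$-th conv 3×3 of a branch also receives the $j$-th conv 3×3 output of the previous branch) is our reading of the description above. The names `residual_steps_block`, `conv1x1_in`, `conv3x3` and `conv1x1_out` are hypothetical.

```python
def residual_steps_block(x, conv1x1_in, conv3x3, conv1x1_out, num_branches=4):
    """Sketch of the RSB wiring on a flat list of channel values."""
    def add(a, b):
        return [p + q for p, q in zip(a, b)]

    # Split the channels, then apply a conv 1x1 stand-in to each split.
    width = len(x) // num_branches
    splits = [conv1x1_in(x[i * width:(i + 1) * width]) for i in range(num_branches)]

    prev_branch = []  # conv 3x3 outputs of the previous branch
    fused = []        # concatenation of the final output of every branch
    for f_i in splits:
        cur_branch = []
        # Branch i (1-based) runs i conv 3x3 layers: the "steps".
        for j in range(len(prev_branch) + 1):
            inp = f_i if j == 0 else cur_branch[j - 1]
            if j < len(prev_branch):
                # Assumed dense connection: element-wise sum with the
                # matching conv 3x3 output of the previous branch.
                inp = add(inp, prev_branch[j])
            cur_branch.append(conv3x3(inp))
        fused.extend(cur_branch[-1])
        prev_branch = cur_branch

    # Final conv 1x1 over the concatenation, plus the identity connection.
    return add(conv1x1_out(fused), x)


calls = []

def identity(v):
    return list(v)

def counting_conv(v):
    calls.append(1)  # count how many conv 3x3 stand-ins are evaluated
    return list(v)

out = residual_steps_block([1.0] * 8, identity, counting_conv, identity)
print(len(calls), out)
```

With four branches the sketch evaluates 1 + 2 + 3 + 4 = 10 conv 3×3 stand-ins, and because all fusion is done by element-wise sum, the channel width of every branch stays constant.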
In this part, we analyze the receptive fields in RSB and other methods. First, the formula for calculating the receptive field of the $k$-th convolutional layer is written as Equation 1:

$$l_k = l_{k-1} + (f_k - 1)\prod_{i=1}^{k-1} s_i \tag{1}$$

where $l_k$ denotes the size of the receptive field corresponding to the $k$-th layer, $f_k$ denotes the kernel size of the $k$-th layer and $s_i$ denotes the stride of the $i$-th layer. In this part, we only focus on the change of relative receptive fields within a block. Every $f_k$ is 3 and every $s_i$ is 1. Thus, Equation 1 can be simplified to Equation 2:

$$l_k = l_{k-1} + 2 \tag{2}$$
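As a quick numerical check, the standard receptive-field recurrence (each layer adds its kernel size minus one, scaled by the product of all preceding strides) can be evaluated directly. The helper name `receptive_fields` is an illustrative assumption.

```python
def receptive_fields(kernels, strides):
    """Receptive field after each layer of a sequential stack of convolutions."""
    fields = []
    size = 1   # receptive field of the input itself
    jump = 1   # product of the strides of all preceding layers
    for kernel, stride in zip(kernels, strides):
        size += (kernel - 1) * jump
        fields.append(size)
        jump *= stride
    return fields


# Four stacked 3x3 convolutions with stride 1: every layer adds 2.
print(receptive_fields([3, 3, 3, 3], [1, 1, 1, 1]))  # [3, 5, 7, 9]
```

With kernel size 3 and stride 1 everywhere, the recurrence reduces to adding 2 per layer, which is the simplified form used for the relative receptive fields inside a block.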
Using this formula, we calculate the relative receptive fields of RSB and other methods. The detailed receptive fields of RSB are shown in Table 1. The receptive field comparison of RSB and other methods is shown in Table 2. It indicates that RSN covers a wider range of scales than ResNet, Res2Net and OSNet. The scale of human joints varies a lot. For instance, the scale of an eye is small while that of a hip is large. For this reason, an architecture with a wider range of receptive fields is better suited to extracting features related to different joints. A wider range of receptive fields also helps to learn more discriminative semantic representations, which benefits the keypoint classification task. More importantly, RSN builds dense connections between the features with small-gap receptive fields inside RSB. The deeply connected architecture contributes to learning delicate local representations, which are essential to precise human pose estimation.
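The relative receptive fields inside a block can be traced with a few lines of Python. This is a sketch under an assumed wiring, namely that the $j$-th conv 3×3 of a branch also sees the $j$-th conv 3×3 output of the previous branch; it tracks only the largest receptive field reaching each conv and is not a reproduction of Table 1. The name `rsb_receptive_fields` is hypothetical.

```python
def rsb_receptive_fields(num_branches=4, kernel=3):
    """rf[i][j]: largest relative receptive field after conv j+1 on branch i+1."""
    grow = kernel - 1  # with stride 1, each conv adds (kernel - 1)
    rf = []
    for i in range(num_branches):
        row = []
        for j in range(i + 1):
            # Input from the split itself (field 1) or the previous conv on this branch.
            inputs = [1] if j == 0 else [row[j - 1]]
            if i > 0 and j < i:
                # Assumed lateral connection from the previous branch.
                inputs.append(rf[i - 1][j])
            row.append(max(inputs) + grow)
        rf.append(row)
    return rf


for branch in rsb_receptive_fields():
    print(branch)
```

Under this wiring the four branches end at receptive fields 3, 7, 11 and 15, so the largest value agrees with the maximum of 15 stated for RSB.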
In the last module of the multi-stage network (Figure 2(a)), an attention mechanism, Pose Refine Machine (PRM), is used to reweight the output features, as shown in Figure 3. The first component of PRM is a conv 3×3, after which the features are fed into three paths. The top path is an identity connection. The middle path, motivated by SENet, passes through a global pooling, two conv 1×1 and a sigmoid activation to get a weight vector $\alpha$. The bottom path passes through a conv 1×1, a depth-wise separable conv 9×9 and a sigmoid activation to get an attention map $\beta$. Element-wise sum and multiplication are conducted among the three paths to get the output features. Define the input features of PRM as $f_{in}$, the output features as $f_{out}$, the first conv 3×3 as $K(\cdot)$ and element-wise multiplication as $\odot$. Then the mathematical expression of PRM can be written as Equation 3:

$$f_{out} = K(f_{in}) \odot (1 + \beta \odot \alpha) \tag{3}$$
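The reweighting step that combines the three paths can be sketched as below. This is an assumption-laden illustration rather than the authors' code: it takes the identity path plus the product of the channel weights and the spatial attention map as the combination rule, treats the learned layers as already evaluated (the caller passes their pre-sigmoid outputs), and uses nested lists of shape C×H×W. The name `prm_reweight` and its parameters are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def prm_reweight(feat, channel_logits, spatial_logits):
    """Combine the three PRM paths for features of shape C x H x W.

    feat:           output of the first conv 3x3 (identity path).
    channel_logits: length-C pre-sigmoid output of the middle path.
    spatial_logits: C x H x W pre-sigmoid output of the bottom path.
    """
    out = []
    for c, plane in enumerate(feat):
        alpha = sigmoid(channel_logits[c])  # channel weight
        out.append([
            [value * (1.0 + alpha * sigmoid(spatial_logits[c][y][x]))
             for x, value in enumerate(row)]
            for y, row in enumerate(plane)
        ])
    return out


# One channel, one pixel: zero logits give weights of 0.5 each.
print(prm_reweight([[[2.0]]], [0.0], [[[0.0]]]))
```

Because both attention terms lie in (0, 1), the identity path guarantees that features are only ever amplified by a factor between 1 and 2, never suppressed to zero.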
In the output of RSN, features after intra-level and inter-level aggregation are mixed together, containing both low-level precise spatial information and high-level discriminative semantic information. Spatial information is good for keypoint localization, while semantic information benefits keypoint classification. Thus, these features contribute differently to the final prediction. Therefore, to further refine the aggregated representations, optimize the parameters of the network and reduce the interference of noise, PRM is designed to reweight the output features of RSN. The middle path reweights the features channel-wise and the bottom path provides spatial attention.
Datasets and Evaluation Metric. The COCO dataset includes over 200K images and 250K person instances labeled with 17 joints. We use only the COCO train2017 dataset for training (about 57K images and 150K person instances). We evaluate our method on the COCO minival dataset (5K images) and on the testing datasets, including test-dev (20K images) and test-challenge (20K images). We use the standard OKS-based AP score as the evaluation metric.
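For readers unfamiliar with the metric, object keypoint similarity (OKS) for one predicted pose can be sketched as follows, based on the COCO keypoint evaluation definition: each labeled keypoint contributes a Gaussian of its distance to the ground truth, normalized by the object scale and a per-keypoint constant, and the contributions are averaged. The function name and argument layout are illustrative assumptions, and the per-keypoint constants are left to the caller.

```python
import math


def oks(pred, gt, labeled, area, k):
    """Object keypoint similarity for one person.

    pred, gt: lists of (x, y) keypoints.
    labeled:  list of bools, True where the ground-truth keypoint is annotated.
    area:     object segment area, used as the squared scale s^2.
    k:        per-keypoint constants controlling the falloff.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), is_labeled, k_i in zip(pred, gt, labeled, k):
        if not is_labeled:
            continue  # unlabeled keypoints do not affect the score
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * k_i ** 2))
        count += 1
    return total / count if count else 0.0


print(oks([(10.0, 10.0)], [(10.0, 10.0)], [True], area=100.0, k=[0.1]))
```

AP is then computed by thresholding OKS at a range of values, in the same way IoU is thresholded for object detection.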
Training Details. The network is trained on 8 V100 GPUs with a mini-batch size of 48 per GPU. There are 140k iterations per epoch and 200 epochs in total. The Adam optimizer is adopted and the learning rate decreases linearly from 5e-4 to 0. The weight decay is 1e-5. Each image goes through a series of data augmentation operations including cropping, flipping, rotation and scaling. Rotation and scaling are sampled from fixed ranges; the range of scaling is 0.7 to 1.35. The size of the input image is 256×192 or 384×288.
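The learning-rate schedule described above can be written as a one-line function. This is a minimal sketch assuming the decay is applied per optimization step; the name `linear_lr` and the clamping outside the training range are our own choices.

```python
def linear_lr(step, total_steps, base_lr=5e-4):
    """Learning rate that decays linearly from base_lr at step 0 to 0 at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return base_lr * (1.0 - progress)


for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, 1000))
```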
Testing Details. We apply a Gaussian filter to the estimated heatmaps as post-processing. Following the strategy of Hourglass, we average the predicted heatmaps of the original image with those of the corresponding flipped image. Then we apply a quarter offset from the highest response toward the second highest one to get the locations of keypoints. As in CPN, the pose score is the product of the average keypoint score and the bounding box score.
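The decoding steps can be sketched on a single 2-D heatmap stored as nested lists. Several simplifications are assumed and are not taken from the paper: the quarter offset is approximated by shifting 0.25 pixel toward the larger of the two neighbours along each axis (a common implementation), the Gaussian filter is omitted, and the left/right joint swap that normally accompanies flip testing is left out. All function names are hypothetical.

```python
def flip_average(heatmap, heatmap_from_flipped):
    """Average a heatmap with the one predicted on the mirrored image."""
    # Mirror the second heatmap back horizontally before averaging.
    return [[(a + b) / 2.0 for a, b in zip(row, reversed(flipped_row))]
            for row, flipped_row in zip(heatmap, heatmap_from_flipped)]


def decode_keypoint(heatmap):
    """Return (x, y, score) with a quarter-pixel shift toward the higher neighbour."""
    height, width = len(heatmap), len(heatmap[0])
    py, px = max(((y, x) for y in range(height) for x in range(width)),
                 key=lambda p: heatmap[p[0]][p[1]])
    x, y = float(px), float(py)
    if 0 < px < width - 1:
        diff = heatmap[py][px + 1] - heatmap[py][px - 1]
        x += 0.25 * ((diff > 0) - (diff < 0))  # sign of the difference
    if 0 < py < height - 1:
        diff = heatmap[py + 1][px] - heatmap[py - 1][px]
        y += 0.25 * ((diff > 0) - (diff < 0))
    return x, y, heatmap[py][px]


def pose_score(keypoint_scores, box_score):
    """Pose score: mean keypoint score multiplied by the detection box score."""
    return sum(keypoint_scores) / len(keypoint_scores) * box_score


heatmap = [[0.0, 0.0, 0.0],
           [0.1, 1.0, 0.5],
           [0.0, 0.2, 0.0]]
print(decode_keypoint(heatmap))
```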
In Section 2.2, we analyze the differences between RSN and current methods. In this part, we validate the effectiveness of the intra-level feature fusion method in RSN. Ablation experiments are conducted on ResNet, Res2Net and RSN based networks. PRM is left out for a fairer comparison. Note that the only variable is the architecture of the constituent unit of the backbone. The results are reported in Table 3.
As Table 3 shows, RSN improves performance by a considerably larger margin at an acceptable additional computation cost, while Res2Net obtains only a limited gain. For instance, RSN-18 is 2.9 AP higher than ResNet-18 while adding only 0.2 GFLOPs, and 2.3 AP higher than Res2Net-18 while adding only 0.3 GFLOPs. In contrast, Res2Net-18 gains only 0.6 AP over ResNet-18. RSN consistently works much better than ResNet and Res2Net at comparable GFLOPs. In addition, it is worth noting that RSN still performs remarkably well when model complexity is relatively low, which indicates that RSN is more compact and efficient. For instance, RSN-18 has an AP similar to that of ResNet-101 and Res2Net-101, but with only a third of the computation cost.
In conclusion, a wider range of receptive fields and denser connections make the intra-level feature fusion of RSN more effective. Delicate local representations that retain finer spatial information are critical to precise human pose estimation.
The dense connection idea of RSN comes from DenseNet. However, DenseNet becomes inefficient when too many concatenation operations are used. To circumvent the network capacity explosion, RSN uses element-wise sum to connect adjacent branches. To validate the efficiency of our approach, we adopt ResNet, Res2Net, DenseNet and RSN in turn as the backbone of the same multi-stage architecture shown in Figure 2(a) and compare their performance. PRM is left out for fair comparison. The results are shown in Figure 4. For relatively small models, RSN and DenseNet based networks both achieve good results, while Res2Net gets only a minor improvement over ResNet. However, as the model capacity increases, the improvements of DenseNet and Res2Net based networks decrease dramatically. Both end up with results close to ResNet when the model size becomes large, while RSN keeps its advantage throughout.
DenseNet has a high AP score at low complexity owing to its deep connections and frequent feature aggregation inside the same level by successive concatenation operations. This makes the low-level features sufficiently supervised, yielding delicate spatial texture information, which benefits keypoint localization. However, as the computation cost rises, the concatenation operations of DenseNet become redundant, combining a large amount of rarely used information. As for Res2Net, its narrower range of receptive fields and lack of efficient intra-level feature fusion for extracting delicate local representations make it much inferior to RSN.
To expose the differences between Res2Net, DenseNet and RSN more directly, we show the average absolute filter weights of the last conv 1×1 layers of each level in trained Res2Net-50, DenseNet-169 and RSN-50 in Figure 5. In DenseNet, the heavily used weights become fewer from level 1 to level 4. Overall, DenseNet has fewer useful weights than RSN, as can be seen from Figure 5(b) and (c), where the red area in each level of DenseNet is much smaller than that of RSN. According to the analysis in Section 3.2, RSN enhances the efficient fusion of intra-level features with dense element-wise sum connections. It has no accumulated concatenation operations like those of DenseNet. Thus, RSN is less occupied by redundant features with low utilization. On the other hand, compared with Res2Net, the more densely connected architecture and wider range of receptive fields make the intra-level feature fusion of RSN more effective. That is why the red area of RSN is much larger than that of Res2Net and the weights of RSN are more fully used, as shown in Figure 5(a) and (c). As a result, the RSN model maintains its high efficiency and considerable improvement across all model sizes, as shown in Figure 4.
Additionally, to highlight the advantages of RSN more intuitively, we conduct a visual analysis of Res2Net-50, DenseNet-169 and RSN-50, as shown in Figure 6. In Figure 6(b), the high-level responses of Res2Net and DenseNet to the human body either cover an incomplete body area or too large an area of background. Only RSN has a relatively complete and appropriate response area for the human body. As a result, in the final prediction, Res2Net is easily misled by background information and DenseNet ignores some keypoints such as shoulders, while RSN locates the keypoints better and reduces the interference of background information. As Figure 6(c) shows, the heatmaps of RSN are much clearer and the locations of the responses are much more accurate.
When designing the architecture of RSN, we first absorb the dense connection idea of DenseNet into RSB. Then, we explore how the branch number of RSB influences the performance. Ablation experiments are done on RSN-18 and RSN-50 separately. We gradually increase the number of branches while keeping the model capacity unchanged. As Table 4 shows, the performance first improves and attains its peak when there are 4 branches. However, when the branch number continues to grow, the results begin to get worse. Thus, 4 is the best choice.
In Section 3.3, we analyzed the effect of PRM. To validate the improvement brought by PRM, we perform ablation experiments on both single-stage and multi-stage network architectures. The results are shown in Table 5. When the model capacity is small, PRM brings a considerable improvement. Even for a baseline with relatively high AP, PRM still obtains a 0.4 AP gain.
We validate our approach on the COCO test-dev and test-challenge datasets. The results are shown in Table 6 and Table 7. For fair comparison, we focus on the performance of single models with comparable GFLOPs, without using extra training data. In Table 6, our method outperforms HRNet by 2.5 AP (78.0 vs. 75.5) and outperforms SimpleBaseline by 4.3 AP on the COCO test-dev dataset. Additionally, as Table 7 shows, our approach outperforms MSPN (the winner of the COCO Keypoint Challenge 2018) by 0.7 AP on the test-challenge dataset. Note that we do not even use a pretrained model.
|Method||Extra data||Pretrain||Backbone||Input Size||GFLOPs||AP||AP.5||AP.75||AP(M)||AP(L)||AR|
Results on COCO test-dev dataset. "*" means using ensembled models. Pretrain = pretrain the backbone on the ImageNet classification task.
|Method||Extra data||Pretrain||Backbone||Input Size||GFLOPs||AP||AP.5||AP.75||AP(M)||AP(L)||AR|
|Mask R-CNN ||-||ResX-101-FPN||-||-||68.9||89.2||75.2||63.7||76.8||75.4|
The MPII Human Pose Dataset is composed of about 25K images with 40K subjects taken from real-world activities. Each person instance is labeled with 16 skeletal joints. The training and data augmentation strategies are nearly the same as for the COCO keypoints dataset in Section 4.1, except that the input size is 256×256.
We follow the standard evaluation metric PCKh. The results on the MPII test dataset are shown in Table 8. Our method improves the state-of-the-art performance by 0.7 in PCKh@0.5, which is a significant improvement given that the MPII benchmark is becoming saturated.
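As a reminder of what PCKh@0.5 measures, a keypoint counts as correct when its distance to the ground truth is within half of the head size of that person. The sketch below assumes the head size is supplied by the caller and scores a single person; the name `pckh` is illustrative.

```python
import math


def pckh(pred, gt, head_size, thresh=0.5):
    """Percentage of keypoints within thresh * head_size of the ground truth."""
    correct = sum(
        1 for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= thresh * head_size
    )
    return 100.0 * correct / len(gt)


# First keypoint is 3 px off (within 5 px), second is 20 px off (outside).
print(pckh([(0.0, 0.0), (10.0, 0.0)], [(0.0, 3.0), (10.0, 20.0)], head_size=10.0))
```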
|Method||Head||Shoulder||Elbow||Wrist||Hip||Knee||Ankle||Mean|
|Newell et al.||98.2||96.3||91.2||87.1||90.1||87.4||83.6||90.9|
|Tang et al.||97.4||96.4||92.1||87.7||90.2||87.7||84.3||91.2|
|Ning et al.||98.1||96.3||92.2||87.8||90.6||87.6||82.7||91.2|
|Chu et al.||98.5||96.3||91.9||88.1||90.6||88.0||85.0||91.5|
|Bin et al.||98.5||96.6||91.9||87.6||91.1||88.1||84.1||91.5|
|Chen et al.||98.1||96.5||92.5||88.5||90.2||89.6||86.0||91.9|
|Yang et al.||98.5||96.7||92.5||88.7||91.1||88.6||86.0||92.0|
|Ke et al.||98.5||96.8||92.7||88.4||90.6||89.3||86.3||92.1|
|Tang et al.||98.4||96.9||92.6||88.7||91.8||89.4||86.2||92.3|
|Sun et al.||98.6||96.9||92.8||89.0||91.5||89.0||85.7||92.3|
In this paper, we propose a novel network, Residual Steps Network (RSN), for human pose estimation. RSN is designed to learn delicate local representations through efficient intra-level feature fusion. To further refine the keypoint locations, we propose an attention mechanism, Pose Refine Machine (PRM). Some prediction results of our method on the COCO and MPII validation datasets are shown in Figure 7.