Existing pose estimation approaches can be categorized into single-stage and multi-stage methods. While a multi-stage architecture is seemingly more suitable for the task, the performance of current multi-stage methods is not as competitive as that of single-stage ones. This work studies this issue. We argue that the current unsatisfactory performance comes from insufficient design choices in existing methods. We propose several improvements to the architecture design, feature flow, and loss function. The resulting multi-stage network outperforms all previous works and obtains the best performance on the COCO keypoint challenge 2018. The source code will be released.
Multi-Stage Pose Network
Human pose estimation has seen rapid progress in recent years with deep convolutional neural networks. Currently, the best performing methods [27, 14, 8, 39] are fairly simple, typically based on a single-stage backbone network transferred from the image classification task. For example, the COCO keypoint challenge 2017 winner is based on Res-Inception. The recent simple baseline approach uses ResNet. As pose estimation requires high spatial resolution, upsampling or deconvolution is usually appended after the backbone network to increase the spatial resolution of the deep features.
Another category of pose estimation methods adopts a multi-stage architecture. Each stage is a simple light-weight network containing its own down-sampling and up-sampling paths. The feature (and heat) maps between stages remain at high resolution. All stages are usually supervised simultaneously to facilitate coarse-to-fine, end-to-end training. Representative works include the convolutional pose machine and the Hourglass network.
Pose estimation is more challenging than image classification. At first glance, the multi-stage architecture appears better suited for the task because it naturally enables high spatial resolution, multi-level supervision, and coarse-to-fine refinement. However, existing multi-stage methods do not perform as well as single-stage methods on COCO.
This work aims to study this issue. We point out that the current unsatisfactory performance of multi-stage methods is mostly due to insufficiency in their design choices. With a number of improvements in the architecture, feature flow, and loss function, the potential advantage of a multi-stage architecture can be fully exploited. New state-of-the-art performance is achieved, by a large margin over all previous methods.
Specifically, we propose a multi-stage pose estimation network (MSPN) built on three observations and techniques. First, the single-stage module in current multi-stage methods is far from optimal. For example, Hourglass uses channels of equal width in all blocks for both down- and up-sampling. Such a design is inconsistent with the practice of modern network architecture design such as ResNet. We find that adopting a proven network structure for the down-sampling path, together with a simple up-sampling path, works much better. Second, due to the repeated down- and up-sampling steps, information is easily lost and optimization becomes more difficult. We propose to aggregate features across different stages to strengthen the information flow and mitigate the training difficulty. Last, observing that pose localization accuracy is gradually refined across stages, we adopt a coarse-to-fine supervision strategy accordingly. Supervision is also imposed over multiple scales to further improve training.
Each technique above introduces a performance improvement. When they work in synergy, the resulting multi-stage architecture significantly outperforms all previous works. This is exemplified in Figure 1. For the single-stage method, performance saturates as network capacity increases. For the representative multi-stage method Hourglass, only a small performance gain is obtained beyond two stages. For our proposed network, performance continues to improve with more stages.
On the COCO keypoint benchmark, the proposed method achieves 76.1 average precision (AP) on test-dev, significantly outperforming state-of-the-art algorithms. We obtain 78.1 AP on test-dev and 76.4 AP on the test-challenge dataset in the MS COCO 2018 keypoint challenge, a 4.3 AP improvement on the test-challenge benchmark over the MS COCO 2017 Challenge winner.
Pose estimation has come a long way as a fundamental research topic in computer vision. In the early days, hand-crafted features were widely used in classical methods [2, 33, 12, 32, 11, 42, 19, 29]. Recently, many approaches [4, 13, 30, 18, 6, 3] take advantage of deep convolutional neural networks (DCNNs) to improve pose estimation performance by a large margin. In terms of network architecture, current human pose estimation methods can be divided into two categories: single-stage [27, 14, 8, 39] and multi-stage [38, 5, 25, 26, 41, 20].
Single-Stage Approach. Single-stage methods [27, 14, 8, 39] are based on backbone networks that are well tuned on image classification tasks, such as VGG or ResNet. Papandreou et al. design a network that generates heat maps as well as relative offsets to obtain the final keypoint predictions. He et al. propose Mask R-CNN, which first generates person box proposals and then applies single-person pose estimation. Chen et al., winners of the COCO 2017 keypoint challenge, leverage a Cascaded Pyramid Network (CPN) to refine pose estimation; their online hard keypoints mining (OHKM) loss is used to deal with hard keypoints. Xiao et al. provide a simple and effective baseline for the pose estimation task. In spite of their good performance, these methods have encountered a common bottleneck: simply increasing the model capacity does not yield much performance improvement. This is illustrated in both Figure 1 and Table 2.
Bottom-up methods first predict individual joints in the image and then associate these joints into human instances. Cao et al. employ a VGG-19 network as a feature encoder; the output features then go through a multi-stage network that produces heat maps and keypoint associations. Newell et al. propose a network that simultaneously outputs keypoints and group assignments.
Top-down approaches first locate persons using detectors [31, 23, 22]; a single-person pose estimator is then used to predict the keypoint locations. Wei et al. employ deep convolutional neural networks as a feature encoder to estimate human pose, designing a sequential architecture of convolutional networks that implicitly models long-range dependencies between joints. The Hourglass network applies intermediate supervision to repeated down-sampling and up-sampling processing for the pose estimation task. A follow-up work adopts Hourglass and further designs a Pyramid Residual Module (PRM) to enhance invariance across scales. Many recent works [20, 7, 10, 37] are based on Hourglass and propose various improvements. While these multi-stage methods work well on MPII, they are not competitive on the more challenging COCO benchmark. For example, the winners of the COCO keypoint challenge in 2016 and 2017 are all single-stage based, as is the recent simple baseline work. In this work, we propose several modifications to the existing multi-stage architecture and show that a multi-stage architecture can be better.
We adopt the top-down pipeline. It first performs human detection and then runs a single-person pose estimator on each detected human instance. A Multi-Stage Pose Network (MSPN) is proposed for pose estimation. A two-stage network is shown in Figure 2.
Our Multi-Stage Pose Network introduces three novel designs to boost performance. First, we analyze the deficiency of previous single-stage modules and show that state-of-the-art image classification network design can be exploited. Second, to reduce information loss, a feature aggregation strategy is proposed to propagate information from early stages to later ones. Finally, we introduce coarse-to-fine supervision in our network. It gradually refines localization accuracy over the stages, makes full use of contextual information, and enables more discriminative representations across scales. In the following sections, we detail each design.
A good design of the single-stage module is crucial for a multi-stage network. Most recent multi-stage methods [41, 20, 7, 10, 37] are based on the Hourglass architecture. However, as shown in Table 1, Hourglass simply stacks convolutional layers, and the number of feature channels remains constant during the repeated down- and up-sampling within a single stage. This results in relatively poor performance, as seen in Figure 1. An effective single-stage module, as shown in Figure 2, is a U-shape architecture in which features extracted at multiple scales are utilized for prediction. Drawing on popular backbones [34, 15, 40, 17, 9, 16] such as VGG, we double the number of feature channels every time there is a down-sampling operation, which effectively reduces information loss. Besides, computing capacity is allocated mainly to the down-sampling unit rather than the up-sampling unit. This is reasonable: we aim to extract more representative features during down-sampling, since information lost there can hardly be recovered during up-sampling. Therefore, increasing the capacity of the down-sampling unit is usually more effective.
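As a minimal illustration of the channel-doubling rule above (the stem width and the number of down-sampling steps here are illustrative choices, not values taken from the paper):

```python
def channel_schedule(stem_channels=64, num_downsamples=4):
    """Feature widths along the down-sampling path of one stage.

    Mirrors the ResNet/VGG-style convention adopted above: the channel
    count doubles after every down-sampling step, so spatial resolution
    lost to striding is traded for representational width.
    """
    return [stem_channels * (2 ** i) for i in range(num_downsamples + 1)]

print(channel_schedule())  # [64, 128, 256, 512, 1024]
```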
A multi-stage network suffers from information loss during repeated down- and up-sampling. To mitigate this issue, a cross-stage feature aggregation strategy is used to efficiently propagate multi-scale features from earlier stages to the current stage.
As shown in Figure 2, for each scale, two separate information flows are introduced from the down-sampling and up-sampling units of the previous stage to the down-sampling procedure of the current stage. Note that a convolution is added on each flow, as shown in Figure 3. Together with the down-sampled features of the current stage, the three components are summed to produce the fused result. With this design, the current stage can take full advantage of prior information to extract more discriminative representations. In addition, the feature aggregation can be regarded as an extended residual design, which helps with the vanishing gradient problem.
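The three-way fusion can be sketched as follows. Here the two convolutions on the cross-stage flows are collapsed into scalar weights `w_down` and `w_up` (placeholders for learned parameters) and feature maps are flattened into plain lists, so this is a toy restatement of the idea in Figure 3 rather than the authors' implementation:

```python
def aggregate(down_prev, up_prev, down_curr, w_down=1.0, w_up=1.0):
    """Cross-stage feature aggregation at one scale.

    The previous stage contributes features from both its down-sampling
    and up-sampling units; each passes through a convolution (modelled
    here as a scalar weight) and is summed element-wise with the current
    stage's down-sampled features.
    """
    assert len(down_prev) == len(up_prev) == len(down_curr)
    return [w_down * d + w_up * u + c
            for d, u, c in zip(down_prev, up_prev, down_curr)]

fused = aggregate([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
print(fused)  # [9.0, 12.0]
```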
In the pose estimation task, context is crucial for locating challenging poses since it provides information for invisible joints. Besides, we notice that small localization errors can seriously affect performance. Accordingly, we design a coarse-to-fine supervision, as illustrated in Figure 2. In most previous works, the ground-truth heat map for each joint is realized as a Gaussian. In this work, we further propose to use different Gaussian kernel sizes in different stages: an early stage uses a large kernel and a late stage uses a small kernel. This strategy is based on the observation that the heat maps estimated across stages are refined in a similarly coarse-to-fine manner. Figure 4 shows an illustrative example, demonstrating that the proposed supervision refines localization accuracy gradually.
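A small sketch of such coarse-to-fine targets (the grid size and sigma values below are illustrative assumptions, not the paper's settings):

```python
import math

def gaussian_heatmap(h, w, cx, cy, sigma):
    """Ground-truth heat map for one joint: a 2-D Gaussian centred on (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(w)] for y in range(h)]

# Coarse-to-fine supervision: earlier stages are supervised with a
# broader Gaussian, later stages with a sharper one.
stage_sigmas = [3.0, 2.0, 1.0]  # stage 1 (coarse) ... stage 3 (fine)
targets = [gaussian_heatmap(16, 16, 8, 8, s) for s in stage_sigmas]

# A pixel three steps from the joint carries less target mass in the
# later (finer) stages, forcing more precise localization.
off_peak = [t[8][11] for t in targets]
assert off_peak[0] > off_peak[1] > off_peak[2]
```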
Besides, prior work has shown that intermediate supervision can play an essential role in improving the performance of deep neural networks. Therefore, we introduce multi-scale supervision, performing intermediate supervision at four different scales in each stage, which captures contextual information at various levels to help locate challenging poses. As shown in Figure 2, online hard keypoints mining (OHKM) is applied to the largest-scale supervision in each stage; an L2 loss is used for the heat maps at all scales.
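A toy version of the OHKM idea referenced above (not the authors' code): per-joint L2 losses are computed, and only the `top_k` hardest joints contribute to the final loss.

```python
def ohkm_l2_loss(pred, target, top_k):
    """Online hard keypoints mining over per-joint L2 heat-map losses.

    pred, target: one flattened heat map (list of floats) per joint.
    Only the top_k joints with the largest loss are averaged, so the
    gradient concentrates on the hardest keypoints.
    """
    per_joint = [sum((p - t) ** 2 for p, t in zip(pj, tj)) / len(pj)
                 for pj, tj in zip(pred, target)]
    hardest = sorted(per_joint, reverse=True)[:top_k]
    return sum(hardest) / top_k

# Three joints; the third has the largest error and dominates the loss.
pred   = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
target = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(ohkm_l2_loss(pred, target, top_k=2))  # 2.5 (mean of losses 4.0 and 1.0)
```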
Dataset. We aggregate the train and validation data and divide it into a trainval dataset (nearly 57K images and 150K person instances) and a minival dataset (5K images), used for training and evaluation respectively. OKS-based mAP (AP for short) is used as our evaluation metric.
Human Detector. We adopt the state-of-the-art object detector MegDet to generate human proposals. MegDet is trained on the full set of MS COCO categories. (We will release the detection results in the future.) Only the human boxes among the top 100 boxes over all categories are selected as input to the single-person pose estimator. All boxes are expanded to a fixed aspect ratio.
Training. The network is trained on 8 Nvidia GTX 1080Ti GPUs with a mini-batch size of 32 per GPU for 90k iterations. The Adam optimizer is adopted, and the learning rate decreases linearly from 5e-4 to 0. The weight decay is 1e-5.
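The schedule above amounts to a simple linear decay; a sketch using the values from the training description:

```python
def linear_lr(step, base_lr=5e-4, total_steps=90_000):
    """Learning rate at a given iteration, decayed linearly from
    base_lr at step 0 down to 0 at the final iteration."""
    return base_lr * (1.0 - step / total_steps)

print(linear_lr(0))       # 0.0005
print(linear_lr(45_000))  # 0.00025
print(linear_lr(90_000))  # 0.0
```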
Each image randomly goes through a series of data augmentation operations including cropping, flipping, rotation, and scaling. For cropping, instances with more than eight joints are cropped to the upper or lower body with equal probability. The rotation range is , and the scaling range is . The image size is set in Section 4.3, and in Section 4.4.
Testing. A Gaussian filter is applied to the estimated heat maps as post-processing. Following the same strategy as , we average the predicted heat maps of the original image with those of the corresponding flipped image. Then, a quarter offset in the direction from the highest response to the second highest response is applied to obtain the final keypoint locations. The pose score is the product of the box score and the average keypoint score, the same as in .
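The quarter-offset decoding step can be sketched as follows (a toy, single-joint version of the post-processing described above; real heat maps would first be Gaussian-filtered and flip-averaged):

```python
def decode_joint(heatmap):
    """Return (x, y) for one joint: the argmax location shifted a
    quarter pixel toward the second-highest response."""
    flat = [(v, x, y) for y, row in enumerate(heatmap)
            for x, v in enumerate(row)]
    flat.sort(reverse=True)
    (_, x1, y1), (_, x2, y2) = flat[0], flat[1]
    dx, dy = x2 - x1, y2 - y1
    norm = (dx * dx + dy * dy) ** 0.5 or 1.0  # guard against zero offset
    return (x1 + 0.25 * dx / norm, y1 + 0.25 * dy / norm)

heatmap = [[0.0, 0.0, 0.0],
           [0.0, 1.0, 0.5],
           [0.0, 0.0, 0.0]]
print(decode_joint(heatmap))  # (1.25, 1.0): nudged toward the 0.5 response
```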
In this section, we provide an in-depth analysis of each individual design in our framework.
In order to show the effectiveness of our method in a clear way, we also perform corresponding experiments on Hourglass . All results are reported on minival dataset. The input image size is .
First, we evaluate how backbone capacity affects pose estimation performance. For the single-stage network in Table 2, we observe that performance quickly saturates as backbone capacity grows. Res-101 outperforms Res-50 by 1.6 AP at a cost of an additional 3.1G FLOPs, but there is only a 0.5 gain from Res-101 to Res-152 at a cost of an additional 3.7G FLOPs. For further exploration, we train a Res-254 network by adding more residual blocks to Res-152. Although the network's FLOPs increase from 11.2G to 18.0G, there is only a 0.4 AP improvement. Therefore, it is not effective to adopt Res-152 or larger backbones for a single-stage network.
Then, we demonstrate the effectiveness of the multi-stage architecture based on the proposed single-stage module. From Table 3, we can see that the performance of single-stage Hourglass is poor. Adding one more stage yields a large AP margin, showing the potential of a multi-stage network. However, the improvement becomes small when four or eight stages are employed. This indicates the necessity of a more effective single-stage module. Our single-stage module, discussed in Section 3.1, reaches 71.5 AP on the minival dataset, demonstrating its superiority. Our two-stage network brings a further 3.0 improvement, obtaining 74.5 AP. Introducing a third and a fourth stage maintains the upward trend and eventually brings an impressive performance boost. These experiments indicate that MSPN pushes the upper bound of existing single-stage and multi-stage networks, obtaining noticeable performance gains with more network capacity.
| Method | Backbone | Input Size | AP | AP.5 | AP.75 | AP(M) | AP(L) | AR | AR.5 | AR.75 | AR(M) | AR(L) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CMU Pose | - | - | 61.8 | 84.9 | 67.5 | 57.1 | 68.2 | 66.5 | 87.2 | 71.8 | 60.6 | 74.6 |
| Mask R-CNN | Res-50-FPN | - | 63.1 | 87.3 | 68.7 | 57.8 | 71.4 | - | - | - | - | - |
| Simple Baseline | Res-152 | 384×288 | 73.8 | 91.7 | 81.2 | 70.3 | 80.0 | 79.1 | - | - | - | - |
Finally, we verify that our single-stage module can effectively adopt other backbones. We conduct further experiments on ResNet-18 and Xception architectures; results are shown in Table 4. The two-stage network based on Res-18 obtains a result comparable to Res-50 with fewer FLOPs. Moreover, we design two Xception backbones of different capacity, a large one (L-XCP) and a small one (S-XCP). The four-stage S-XCP outperforms the single large model by 1.0 AP at similar complexity. These results demonstrate the generality of our single-stage module.
To address the issue that a deep multi-stage architecture is vulnerable to information loss during repeated down- and up-sampling, we propose a cross-stage feature aggregation strategy. It fuses different-level features in adjacent stages and ensures more discriminative representations for the current stage. Table 6 shows that the proposed feature aggregation brings a 0.3 gain, from 74.2 to 74.5, for MSPN and a 0.5 improvement for Hourglass, demonstrating its effectiveness in dealing with the aforementioned problem. It also suggests that Hourglass loses more information during forward propagation, which our feature aggregation strategy effectively mitigates.
In this part, we evaluate our coarse-to-fine supervision for both MSPN and Hourglass. The results are shown in Table 6. This strategy improves the performance of our network by a large margin, from 73.3 to 74.2. The design realizes a coarse-to-fine detection procedure, and the result demonstrates its effectiveness in further improving keypoint localization accuracy. In addition, the intermediate supervisions take full advantage of contextual information across different scales. To demonstrate the applicability of this supervision in other multi-stage networks, we further apply the strategy to a four-stage Hourglass, comparable with our two-stage MSPN in complexity, and obtain a 1.2 AP improvement. In short, the proposed coarse-to-fine supervision largely boosts pose estimation performance and adapts well to other multi-stage networks.
Furthermore, we conduct several experiments to verify which level of supervision is most effective in our network. As described in Section 3.2, we apply a Gaussian blur to each point on a heat map, and a smaller kernel corresponds to a finer supervision. As shown in Table 7, either setting-1 or setting-2 degrades performance compared with the proposed coarse-to-fine supervision (setting-3). In particular, setting-2 performs even worse than setting-1, which indicates that an appropriate supervision makes a difference to the final result.
| | Setting-1 | Setting-2 | Setting-3 |
|---|---|---|---|
| Kernel Size 1 | 7 | 5 | 7 |
| Kernel Size 2 | 7 | 5 | 5 |
| Method | Backbone | Input Size | AP | AP.5 | AP.75 | AP(M) | AP(L) | AR | AR.5 | AR.75 | AR(M) | AR(L) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mask R-CNN* | Res-50-FPN | - | 68.9 | 89.2 | 75.2 | 63.7 | 76.8 | 75.4 | 93.2 | 81.2 | 70.2 | 82.6 |
| MSRA SB+* | Res-152 | 384×288 | 74.5 | 90.9 | 80.8 | 69.5 | 82.9 | 80.5 | 95.1 | 86.3 | 75.3 | 87.5 |
As shown in Table 5, our single model trained on COCO data alone achieves 76.1 AP on test-dev, outperforming other methods by a large margin in all metrics. With external data, MSPN gains a further 1.0 AP, reaching 77.1 AP, and the ensemble model finally obtains 78.1 AP. From Table 8, our framework obtains 76.4 AP on the test-challenge dataset, showing significant superiority over other state-of-the-art methods. Our method surpasses the COCO 2017 Challenge winner CPN and the MSRA Simple Baseline by 4.3 and 1.9 AP on the test-challenge dataset, respectively.
Finally, some pose estimation results generated by our method are shown in Figure 5. We can see that our MSPN handles crowd and occlusion situations as well as challenging poses effectively.
In this work, we propose a Multi-Stage Pose Network (MSPN) to perform multi-person pose estimation. It breaks the performance ceiling of the current methods and achieves state-of-the-art results on MS COCO datasets. We first verify the effectiveness of the multi-stage pipeline with well-designed single-stage modules in MSPN. Additionally, a coarse-to-fine supervision and a cross stage feature aggregation strategy are proposed to further boost the performance of our framework. Extensive experiments have been conducted to demonstrate its superiority over other current methods as well as its generalizability.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.
Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2017.
Articulated pose estimation using discriminative armlet classifiers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3342–3349, 2013.
Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.