Most top-down pose estimation models, such as HRNet[sun2019deep], CPN[chen2018cascaded], and Mask R-CNN[he2017mask], generate predictions under the assumption that each person detection box contains exactly one person. However, multiple persons can reside in a single box in some cases, as shown in Fig. 1, and it is often unclear which one is the dominant target. The problem worsens when a model processes images cropped from expanded detection boxes to capture context information. This makes single-person pose estimation an ill-posed problem with multiple valid solutions.
To alleviate this issue, we introduce two ideas: the first is to add an instance cue to the input that specifies the target person in a box; the second is to design a recurrent network so that the model can refine its predictions using the outputs of previous hops as a hint for the target person.
Figure 2 shows our network structure. We adopt HRNet[sun2019deep] as the baseline architecture since, to the best of our knowledge, it is the state-of-the-art model with open-sourced code. We add an input refinement block so that the model can handle an external instance cue for the target person, and a feedback connection from the output of the network to an intermediate feature map so that the model can refine its outputs using its previous predictions. Each modification adds only two simple convolutional layers with a residual connection, which lets us reuse the ImageNet pre-trained models provided by the official HRNet repository (https://github.com/leoxiaobin/deep-high-resolution-net.pytorch).
2.1 Instance Cue
As mentioned earlier, a bounding box from annotated data or from an object detector may contain two or more persons. To specify the target person explicitly, we feed both a cropped image and an instance cue embedding into our network. The embedding is a single-channel Gaussian heatmap whose peak is located on the target person.
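The cue embedding can be sketched as a plain Gaussian bump over the crop. The function name and the standard deviation (`sigma=3.0`) below are our own illustrative choices; the report does not specify them:

```python
import numpy as np

def gaussian_cue(height, width, cx, cy, sigma=3.0):
    """Single-channel Gaussian heatmap whose peak marks the target
    person at pixel (cx, cy) inside the cropped box."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# cue for a 64x48 crop whose target person sits at pixel (20, 32)
cue = gaussian_cue(64, 48, cx=20, cy=32)
```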
The input refinement block in HintPose aggregates the image features and the instance cue embedding and updates the feature maps with an element-wise summation.
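The report only states that the block consists of two convolutional layers with a residual connection; kernel sizes and the exact way the cue is merged are not specified. The following numpy sketch therefore assumes 1x1 convolutions and channel concatenation:

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in); a 1x1 convolution is a
    # per-pixel linear map over channels
    return np.einsum('oc,chw->ohw', w, x)

def input_refine(feat, cue, w1, w2):
    """Concatenate the cue heatmap to the image features, pass the
    result through two convs, and add it back residually
    (element-wise summation)."""
    x = np.concatenate([feat, cue[None]], axis=0)  # (C+1, H, W)
    hidden = np.maximum(conv1x1(x, w1), 0.0)       # first conv + ReLU
    return feat + conv1x1(hidden, w2)              # residual update

feat = np.random.randn(4, 8, 6)    # toy feature map, C=4
cue = np.ones((8, 6))              # toy instance cue heatmap
w1 = np.random.randn(4, 5) * 0.1   # (hidden, C+1)
w2 = np.random.randn(4, 4) * 0.1   # (C, hidden)
out = input_refine(feat, cue, w1, w2)
```

With the second conv zeroed out, the block reduces to the identity, which is what makes it safe to bolt onto a pre-trained backbone.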
Instance cues can be derived from ground-truth keypoints or instance segmentation maps at training time. At inference time, they can be generated from the outputs of other instance segmentation or keypoint estimation models, or a separate lightweight network can be trained to predict them directly.
2.2 Recurrent Refinement
In addition to providing an external instance cue, the model can use its own outputs as a hint for the target person. We adopted the structure of Feedback Network[li2019srfbn] and designed our model with recurrent connections so that it can refine its outputs using its previous predictions.
We added two convolutional blocks to the baseline network: one updates the feature maps using information from the previous output, and the other extracts meaningful information for the next hop. We feed the output back onto the features after layer1 of HRNet so that the improved features are processed at all scales of HRNet while keeping the memory footprint small.
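The control flow of the feedback loop can be sketched as follows. The `head`, `extract`, and `fold_back` callables are stand-ins for the HRNet head and the two added convolutional blocks (their internals are not specified beyond the description above); the toy lambdas exist only to make the loop runnable:

```python
import numpy as np

def recurrent_refine(image_feat, head, extract, fold_back, n_hops=3):
    """Run the pose head for several hops; each hop's heatmaps are
    distilled into a hint and folded back residually into the
    features after layer1, so the next hop sees its own prediction."""
    feat = image_feat
    outputs = []
    for _ in range(n_hops):
        heatmaps = head(feat)                # keypoint heatmaps, this hop
        outputs.append(heatmaps)
        hint = extract(heatmaps)             # block 2: info for next hop
        feat = image_feat + fold_back(hint)  # block 1: residual update
    return outputs

# toy stand-ins: head doubles features, extract halves heatmaps
x = np.full((2, 4, 4), 1.0)
outs = recurrent_refine(x, head=lambda f: 2 * f,
                        extract=lambda h: 0.5 * h,
                        fold_back=lambda g: g)
```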
3.1 Training & Evaluation Details
While training the network, we generated instance cues by randomly selecting one of the ground-truth joints and jittering its x, y position. For a model with recurrent refinement, the model runs for three hops, and the prediction from every hop is compared with the ground-truth heatmap to compute the mean squared error.
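Since every hop is compared against the same ground-truth heatmap, the loss can be sketched as below. How the per-hop losses are combined (sum vs. mean) is not stated in the report; this sketch averages them:

```python
import numpy as np

def multi_hop_mse(hop_outputs, target):
    """Mean squared error of each hop's heatmaps against the single
    ground-truth heatmap, averaged over hops, so that every hop
    receives supervision."""
    return float(np.mean([np.mean((out - target) ** 2)
                          for out in hop_outputs]))

target = np.zeros((17, 64, 48))          # 17 COCO keypoint heatmaps
hops = [target, target + 1.0, target]    # hop 2 is off by 1 everywhere
loss = multi_hop_mse(hops, target)
```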
We trained our models with the COCO training set only and used the same hyper-parameters provided by the official HRNet repository.
We used MMDetection toolbox[mmdetection] and Hybrid Task Cascade (HTC)[chen2019hybrid] + HRNetV2p-W48 model to generate detection boxes on the COCO17 validation and test sets. Its detection accuracy is 47.0 mAP (60.5 mAP for ‘person’ category) on the COCO17 validation set. We ignored bounding boxes smaller than .
To generate instance cues during evaluation, we produced image-level joint heatmaps with MultiPoseNet[kocabas18prn] and treated the local peaks inside each detected bounding box as instance cues. When a bounding box contains multiple cues, the same cropped image is fed into the model once per cue.
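Peak picking inside a box can be sketched as a simple 3x3 local-maximum scan. The function name and the confidence threshold are illustrative assumptions, not details from the report:

```python
import numpy as np

def peaks_in_box(heatmap, box, thresh=0.5):
    """Collect (x, y) local maxima of an image-level joint heatmap
    that fall inside a detection box; each peak becomes one
    instance cue, and the crop is run once per peak."""
    x0, y0, x1, y1 = box
    peaks = []
    for y in range(max(y0, 1), min(y1, heatmap.shape[0] - 1)):
        for x in range(max(x0, 1), min(x1, heatmap.shape[1] - 1)):
            patch = heatmap[y - 1:y + 2, x - 1:x + 2]
            if heatmap[y, x] >= thresh and heatmap[y, x] == patch.max():
                peaks.append((x, y))
    return peaks

hm = np.zeros((10, 10))
hm[3, 4] = 1.0   # first person's joint response
hm[7, 2] = 0.9   # second person's joint response
cues = peaks_in_box(hm, (0, 0, 10, 10))
```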
For models with recurrent refinement, the output heatmaps after three hops are used to compute the final predictions.
Other hyper-parameters and post-processing steps, including person scoring and OKS-based NMS, are kept the same as in the original HRNet.
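For reference, OKS-based NMS greedily keeps the highest-scoring pose and suppresses near-duplicates by Object Keypoint Similarity. The sketch below is simplified: visibility flags are ignored and the per-keypoint constants (`kappas`) are toy values, not COCO's actual sigmas:

```python
import numpy as np

def oks(kpts_a, kpts_b, area, kappas):
    """Object Keypoint Similarity between two (N, 2) keypoint arrays
    (visibility flags omitted for brevity)."""
    d2 = np.sum((kpts_a - kpts_b) ** 2, axis=1)
    return float(np.mean(np.exp(-d2 / (2.0 * area * kappas ** 2))))

def oks_nms(poses, scores, areas, kappas, thresh=0.9):
    """Greedy NMS: keep the highest-scoring pose, then drop any pose
    whose OKS with an already-kept pose exceeds the threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(oks(poses[i], poses[j], areas[j], kappas) < thresh
               for j in keep):
            keep.append(int(i))
    return keep

kappas = np.full(3, 0.1)              # toy per-keypoint constants
p0 = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
poses = [p0, p0.copy(), p0 + 50.0]    # a duplicate and a distinct pose
keep = oks_nms(poses, scores=[0.9, 0.8, 0.7],
               areas=[1.0, 1.0, 1.0], kappas=kappas)
```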
3.2 COCO Keypoints Detection
We evaluated our models with the COCO[lin2014coco] 2017 Keypoints validation and test-dev sets.
Figure 3 shows two different predictions from the same input image with two different instance cues. The predictions of our model change as the instance cue moves from one person to another.
When evaluated on the COCO validation set, both Instance Cue and Recurrent Refinement improved the performance of our pose estimation model, and the improvement grew when the two modifications were applied together.
| Method | AP | AR |
| --- | --- | --- |
| I.C. + R.R. (val) | 78.1 | 82.6 |
| I.C. + R.R. (test-dev) | 76.2 | 81.0 |
| Ensemble model + PoseFix (test-dev) | 77.8 | 82.2 |
Our model showed a significant improvement of +0.8 mAP over the baseline HRNet. Moreover, when we ensembled six different models (we averaged the heatmaps of two models with both instance cue and recurrent refinement, two with instance cue only, and two with recurrent refinement only before post-processing), our models achieved 77.3 mAP. We further refined the predictions with PoseFix[Moon_2019_CVPR_PoseFix], and the final predictions achieved 77.8 mAP on the COCO test-dev set.
One observation is that the improvements from the instance cue on the test-dev set are not as large as those on the COCO validation set. We hypothesize that crowded bounding boxes are less frequent in the test-dev set.
4 Discussions & Future Works
In this technical report, we introduced two methods to handle multiple overlapping person instances in one bounding box. Our modifications can be applied to existing top-down pose estimation models by adding a couple of convolutional blocks. When we evaluated our model on the COCO keypoints dataset, we observed non-negligible performance improvements from our HintPose method.
Our future work includes evaluating our model on other datasets and investigating how our method performs when applied to other top-down models. Crowded scenes are known to be rare in the COCO Keypoints dataset[li2018crowdpose]. Therefore, we expect more significant improvements when our model is evaluated on a dataset with heavier occlusion, such as CrowdPose[li2018crowdpose].
It may also be possible to improve our model with a better learning strategy, as [li2019srfbn] showed that curriculum learning is vital for training its feedback network structure. Another direction is to use a different type of instance cue, such as segmentation maps.