Simple Multi-Resolution Representation Learning for Human Pose Estimation

Human pose estimation - the task of recognizing human keypoints in a given image - is one of the most important problems in computer vision and has a wide range of applications including movement diagnostics, surveillance, and self-driving vehicles. The accuracy of human keypoint prediction has been increasingly improved thanks to the burgeoning development of deep learning. Most existing methods solve human pose estimation by generating heatmaps in which the i-th heatmap indicates the location confidence of the i-th keypoint. In this paper, we introduce novel network structures referred to as multi-resolution representation learning for human keypoint prediction. At different resolutions in the learning process, our networks branch off and use extra layers to learn heatmap generation. We first consider the architectures for generating the multi-resolution heatmaps after obtaining the lowest-resolution feature maps. Our second approach allows learning during the process of feature extraction, in which the heatmaps are generated at each resolution of the feature extractor. The first and second approaches are referred to as multi-resolution heatmap learning and multi-resolution feature map learning, respectively. Our architectures are simple yet effective, achieving good performance. We conducted experiments on two common benchmarks for human pose estimation: the MS-COCO and MPII datasets.


I Introduction

Human pose estimation is one of the vital tasks in computer vision and has received a great deal of attention from researchers over the past few decades. From the spatial aspect, this problem is divided into 2D and 3D human pose estimation. Geometrically, the 3D human pose can be predicted from the respective 2D human pose combined with 3D exemplar matching [5]. This paper focuses on the deep learning approach for 2D human pose estimation, which aims to localize human anatomical keypoints on the torso, face, arms, and legs.

The pioneer of deep learning methods formulated human pose estimation as a CNN-based regression towards body joints [30]. The model uses an AlexNet [16] backbone (consisting of 7 layers) and an extra final layer that directly outputs joint coordinates. Later state-of-the-art methods reshaped this problem by estimating heatmaps for all human keypoints, where the i-th heatmap represents the location confidence of the i-th keypoint [28, 31, 20, 6, 32]. Heatmap-based approaches consist of two major parts as shown in Fig. 1: the first part (encoder) works as a feature extractor responsible for understanding the image, while the second part (decoder) generates the heatmaps corresponding to the human keypoints. Convolutional pose machines (CPM) [31] used a multi-stage training scheme where the image features and the heatmaps produced by the previous stage are fed as the input; thus, the prediction is refined stage by stage. Commonly, the output of the feature extractor is a set of low-resolution feature maps. Stacked Hourglass [20] and the Cascaded pyramid network (CPN) [6] adopted a multi-resolution learning strategy to generate the heatmaps from feature maps at a variety of resolutions. Instead of independently processing multiple resolutions as CPN does, Hourglass uses skip layers to preserve spatial information at each resolution. However, these two methods were surpassed when Xiao et al. [32] proposed a simple yet effective baseline which utilizes ResNet [11] as the backbone of the feature extractor followed by a few deconvolutional layers as the heatmap generator (Fig. 2). SimpleBaseline [32] is the most effortless way to generate the heatmaps from the low-resolution feature maps, obtaining good performance on the MS-COCO 2017 benchmark [18] (improving AP by 3.5 and 1.0 points compared to Hourglass [20] and CPN [6] respectively, with a similar backbone and input size).

Fig. 1: Simple pipeline for human pose estimation using heatmaps.

In the feature extractor, the deeper the layer is, the more specific the learned features are. For example, the first layer may learn overall features by abstracting the pixels and encoding the edges; the second layer may learn how to arrange the edges; the third layer may encode a face; the fourth layer may encode the eyes. Intuitively, the model needs to learn such specialized features (eyes, nose, and so on) because they correspond directly to human keypoints. However, keypoints are often occluded: for example, a wrist hidden behind the back may not be detected directly, yet it can still be inferred from other keypoints such as the elbow and shoulder, or even from the overall human skeleton. This means the model needs not only specific features but also overall patterns.

This paper is inspired by the idea that the simple architecture could be improved if it learns features from multiple resolutions, since the high resolution allows capturing overall information while the low resolution aims to extract specific characteristics. We propose novel network architectures that build on the simple baseline [32] and combine it with a multi-resolution learning strategy. Our first approach generates the multi-resolution heatmaps after the lowest-resolution feature maps are obtained; to do so, we branch off at each resolution of the heatmap generator and add extra layers for heatmap generation. In our second approach, the networks directly learn heatmap generation at each resolution of the feature extractor. Our experiments were conducted on two common benchmarks for human pose estimation: MS-COCO [18] and MPII [1]. On the COCO val2017 dataset, our best model improves AP by 0.6 points compared to SimpleBaseline [32] with a similar backbone and input size. On the MPII dataset, our best model achieves a PCKh@0.5 of 89.8.

Contributions: Our main contributions are:


  • We introduce two novel approaches to achieve multi-resolution representation for both heatmap generation and feature map extraction.

  • Our architectures are simple yet effective, and experiments show the superiority of our approaches over numerous methods.

  • Our approaches could be applied to other tasks that have the architecture of encoder (feature extractor) - decoder (specific tasks) such as image captioning and image segmentation.

II Human pose estimation using deconvolutional layers as the heatmap generator

Fig. 2: Human pose estimation using deconvolutional layers as the heatmap generator.

This section presents the simple baseline [32], whose heatmap generator is composed of deconvolutional layers. The network structure is illustrated in Fig. 2. From the input image, the model uses residual blocks to learn the features of the image. After each residual block, the resolution is decreased by half while the number of output channels is doubled. In Fig. 2, four residual blocks work together as the feature extractor, with the number of output channels doubling from one block to the next; the same residual blocks are used in the later architectures.
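For illustration, the following is a minimal PyTorch-style sketch of this feature extractor, assuming torchvision's ResNet-50 and a 256×192 input crop; these choices are illustrative assumptions rather than the exact configuration of our experiments.

```python
import torch
import torchvision

# Minimal sketch: grab the four residual-block outputs of a torchvision ResNet-50.
# The backbone choice and the 256x192 input size are illustrative assumptions.
backbone = torchvision.models.resnet50()

def extract_features(img):
    x = backbone.relu(backbone.bn1(backbone.conv1(img)))   # stem
    x = backbone.maxpool(x)
    c1 = backbone.layer1(x)    # highest resolution, fewest channels
    c2 = backbone.layer2(c1)   # each block halves the resolution and doubles the channels
    c3 = backbone.layer3(c2)
    c4 = backbone.layer4(c3)   # lowest-resolution feature maps fed to the heatmap generator
    return c1, c2, c3, c4

c1, c2, c3, c4 = extract_features(torch.randn(1, 3, 256, 192))
print([t.shape for t in (c1, c2, c3, c4)])  # channel count doubles as the resolution halves
```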

After reaching the lowest-resolution feature maps, the network begins a top-down sequence of upsampling to obtain high-resolution feature maps. Instead of using upsampling algorithms, SimpleBaseline [32] leverages deconvolutional layers, each built out of a transposed convolutional layer [8], a batch normalization, and a ReLU activation. At last, a convolutional layer is added to generate $K$ high-resolution heatmaps representing the location confidence of the $K$ human keypoints. Mean Squared Error (MSE) is used as the loss function between the predicted and ground-truth heatmaps:

\[
\mathcal{L} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}\left(\hat{H}_k(x,y) - H_k(x,y)\right)^2
\tag{1}
\]

where $H_k$ and $\hat{H}_k$ are the ground-truth and predicted heatmaps of the $k$-th keypoint respectively, and $(W, H)$ is the size of the heatmap.
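As a concrete illustration of this decoder, below is a minimal PyTorch-style sketch of a deconvolutional heatmap head trained with the MSE loss of Eq. (1); the kernel size, number of deconvolutional layers, and channel counts are assumptions for illustration rather than the verified settings of the baseline.

```python
import torch
import torch.nn as nn

class DeconvHeatmapHead(nn.Module):
    """Heatmap generator built from deconvolutional layers (baseline-style sketch)."""
    def __init__(self, in_channels=2048, num_keypoints=17, num_deconv=3, channels=256):
        super().__init__()
        layers = []
        for _ in range(num_deconv):
            # One deconvolutional layer: transposed conv + batch norm + ReLU,
            # doubling the spatial resolution of its input.
            layers += [nn.ConvTranspose2d(in_channels, channels, kernel_size=4,
                                          stride=2, padding=1),
                       nn.BatchNorm2d(channels),
                       nn.ReLU(inplace=True)]
            in_channels = channels
        self.deconv = nn.Sequential(*layers)
        # Final convolution produces one heatmap per keypoint.
        self.final = nn.Conv2d(channels, num_keypoints, kernel_size=1)

    def forward(self, features):
        return self.final(self.deconv(features))

# Usage sketch: lowest-resolution feature maps -> K heatmaps, trained with MSE (Eq. 1).
feats = torch.randn(2, 2048, 8, 6)      # e.g. backbone output for a 256x192 crop
pred = DeconvHeatmapHead()(feats)       # -> (2, 17, 64, 48)
target = torch.zeros_like(pred)         # ground-truth Gaussian heatmaps in practice
loss = nn.functional.mse_loss(pred, target)
```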

III Our method

To investigate the impact of multi-resolution representation, in this section, we propose learning the multi-resolution representation for both the heatmap generator and the feature extractor. These two approaches are referred to as multi-resolution heatmap learning and multi-resolution feature map learning, respectively. We use ResNet [11] as our feature extractor because it is the most common backbone network for image feature extraction.

III-A Multi-resolution heatmap learning

(a) MRHeatNet1
(b) MRHeatNet2
Fig. 5: Multi-resolution heatmap learning. We propose two architectures for generating the heatmaps at each resolution of the deconvolutional layers. (a) The lowest-resolution heatmaps are upsampled and then combined with the higher-resolution heatmaps. (b) The heatmaps at each resolution are individually learned and then combined at the end. The residual block halves the resolution of the input. The deconvolutional layer doubles the resolution of the input.

We started designing this kind of architecture by assuming that the ResNet backbone [11] works very well for image feature extraction. The architectures for multi-resolution heatmap learning are illustrated in Fig. 5. The lowest-resolution feature maps are fed into the sequence of deconvolutional layers to obtain the higher resolutions. The number of output channels of these deconvolutional layers is kept unchanged and is set equal to the number of output channels of the first residual block.

In the baseline method, heatmaps are generated only after the highest resolution is reached. In our method, we branch off at each deconvolutional layer (excluding the highest-resolution deconvolutional layer) and add some convolutional layers to generate low-resolution heatmaps. Higher-resolution heatmaps can then be obtained from the low-resolution heatmaps by using extra deconvolutional layers. The reason we do so is that the high-resolution feature maps help generate the heatmaps with overall information while the low-resolution feature maps focus on specific characteristics. We propose two architectures with a slight difference, as shown in Fig. 5 (a minimal sketch follows the list below):

(a) MRFeaNet1
(b) MRFeaNet2
Fig. 8: Multi-resolution feature map learning. We propose two architectures for learning the features at each resolution of the residual blocks. (a) The number of output channels of deconvolutional layers is kept unchanged. (b) The number of output channels is different among the deconvolutional layers. The highest-resolution heatmaps are obtained from the feature maps at each resolution of the feature extractor. Notations in Fig. 5 are also used here. The residual block halves the resolution of the input. The deconvolutional layer doubles the resolution of the input.
  • In Fig. 5(a), the lowest-resolution heatmaps are upsampled to the higher resolution (called the medium resolution) and then combined with the heatmaps generated at this medium resolution. The result of this combination is fed into a deconvolutional layer to obtain the highest-resolution heatmaps.

  • With a small change, in Fig. 5(b), the heatmaps at each resolution are independently upsampled to the highest resolution and then combined at the end.
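The sketch below illustrates the second variant (MRHeatNet2, Fig. 5(b)) in PyTorch-style code; the channel counts, kernel sizes, and the exact composition of the upsampling branches are assumptions for illustration, not the verified configuration of our models.

```python
import torch
import torch.nn as nn

def deconv(cin, cout):
    """One deconvolutional layer: transposed conv + BN + ReLU, doubling the resolution."""
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class MRHeatNet2Sketch(nn.Module):
    """Multi-resolution heatmap learning, variant (b): heatmaps are predicted at every
    decoder resolution, upsampled independently, and summed at the highest resolution."""
    def __init__(self, in_channels=2048, num_keypoints=17, channels=256):
        super().__init__()
        self.deconv1 = deconv(in_channels, channels)   # main decoder path
        self.deconv2 = deconv(channels, channels)
        self.deconv3 = deconv(channels, channels)
        self.head1 = nn.Conv2d(channels, num_keypoints, 1)   # heatmap branch per resolution
        self.head2 = nn.Conv2d(channels, num_keypoints, 1)
        self.head3 = nn.Conv2d(channels, num_keypoints, 1)
        self.up1 = nn.Sequential(deconv(num_keypoints, num_keypoints),
                                 deconv(num_keypoints, num_keypoints))  # x4 upsampling
        self.up2 = deconv(num_keypoints, num_keypoints)                 # x2 upsampling

    def forward(self, features):
        x1 = self.deconv1(features)
        x2 = self.deconv2(x1)
        x3 = self.deconv3(x2)
        # Element-wise sum of the upsampled heatmaps from all three resolutions.
        return self.up1(self.head1(x1)) + self.up2(self.head2(x2)) + self.head3(x3)

heatmaps = MRHeatNet2Sketch()(torch.randn(1, 2048, 8, 6))   # -> (1, 17, 64, 48)
```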

III-B Multi-resolution feature map learning

Instead of learning at each resolution of the heatmap generator as in the multi-resolution heatmap learning strategy, multi-resolution feature map learning aims to directly learn how to generate the heatmaps at each resolution of the feature extractor (Fig. 8). At each residual block corresponding to a resolution of the feature extractor (excluding the lowest resolution), the network branches off and goes through respective deconvolutional layers to reach the highest resolution. Notably, the branch from the highest-resolution residual block does not go through any deconvolutional layer but goes directly to the element-wise sum component. At last, a convolutional layer is added to generate the predicted heatmaps for all keypoints.

Following this strategy, we propose two architectures as illustrated in Fig. 8(a) and Fig. 8(b). The main difference between these two architectures is the number of output channels of the deconvolutional layers. In the network shown in Fig. 8(a), the number of output channels of all deconvolutional layers is set equal to the number of output channels of the highest-resolution residual block, which may lead to information loss.

The feature extractor consists of four residual blocks: the first residual block outputs the highest-resolution feature maps, each subsequent block doubles the number of feature maps while halving the resolution, and the fourth residual block finally outputs the lowest-resolution feature maps. It is easy to see the principle of image feature extraction here: the number of feature maps is increased by a factor of 2 (more features are learned) while the resolution is halved. Therefore, in the top-down sequence of upsampling, when the resolution is doubled, the number of feature maps should be halved as well. For the network shown in Fig. 8(a), after the first deconvolutional layer in the main branch, the resolution of the feature maps is doubled, but the number of feature maps is decreased eight times. Therefore, some previously learned information may be lost. To overcome this, the architecture in Fig. 8(b) uses deconvolutional layers whose number of output channels depends on the number of feature maps extracted by the previously adjacent layer: starting from the lowest-resolution feature maps output by the fourth residual block, the number of output channels of the following deconvolutional layers is halved at each step (see the sketch below). The effectiveness of learning heatmap generation from multiple resolutions of the feature extractor will be clarified in Section IV.
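To make the channel-halving branches concrete, here is a rough PyTorch-style sketch of the MRFeaNet2 idea (Fig. 8(b)); the ResNet-style channel counts (256/512/1024/2048) and the use of an element-wise sum before a single final convolution are assumptions for illustration.

```python
import torch
import torch.nn as nn

def deconv(cin, cout):
    """One deconvolutional layer: transposed conv + BN + ReLU, doubling the resolution."""
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class MRFeaNet2Sketch(nn.Module):
    """Multi-resolution feature map learning, variant (b): every residual-block output is
    upsampled to the highest decoder resolution by its own branch, halving the channel
    count at each step, and the branches are fused by an element-wise sum."""
    def __init__(self, num_keypoints=17):
        super().__init__()
        self.branch4 = nn.Sequential(deconv(2048, 1024), deconv(1024, 512), deconv(512, 256))
        self.branch3 = nn.Sequential(deconv(1024, 512), deconv(512, 256))
        self.branch2 = deconv(512, 256)
        # The highest-resolution block output (256 channels) joins the sum directly.
        self.final = nn.Conv2d(256, num_keypoints, kernel_size=1)

    def forward(self, c1, c2, c3, c4):
        # c1..c4: residual-block outputs, from highest to lowest resolution.
        fused = c1 + self.branch2(c2) + self.branch3(c3) + self.branch4(c4)
        return self.final(fused)

c1, c2, c3, c4 = (torch.randn(1, c, 64 // s, 48 // s)
                  for c, s in [(256, 1), (512, 2), (1024, 4), (2048, 8)])
heatmaps = MRFeaNet2Sketch()(c1, c2, c3, c4)   # -> (1, 17, 64, 48)
```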

IV Experiment

Dataset

We evaluate our architectures on two common benchmarks for human pose estimation: MS-COCO [18] and MPII [1].

  • The COCO dataset contains more than 200k images and 250k person instances labeled with keypoints. Each person is annotated with 17 keypoints. We train our models on the COCO train2017 dataset with 57k images and 150k person instances. Our models are evaluated on the COCO val2017 and test-dev2017 datasets, with 5k and 20k images, respectively.

  • The MPII dataset contains around 25k images with over 40k person samples. Each person is annotated with 16 joints. MPII covers 410 human activities collected from YouTube videos where the contents are everyday human activities. Since the annotations of MPII test set are not available, we train our models on a subset of 22k training samples and evaluate our models on a validation set of 3k samples [28].

Evaluation metric

We use different metrics for our evaluation on the MS-COCO and MPII datasets:

  • In the COCO dataset, each person object has ground-truth keypoints of the form $(x, y, v)$, where $(x, y)$ is the keypoint location and $v$ is a visibility flag ($v=0$: not labeled, $v=1$: labeled but not visible, and $v=2$: labeled and visible). The standard evaluation metric is based on Object Keypoint Similarity (OKS) [7]:

    \[
    \text{OKS} = \frac{\sum_i \exp\!\left(-d_i^2 / (2 s^2 k_i^2)\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}
    \tag{2}
    \]

    where $d_i$ is the Euclidean distance between the detected keypoint and the corresponding ground-truth keypoint, $v_i$ is the visibility flag of the ground-truth keypoint, $s$ is the object scale, and $k_i$ is a per-keypoint constant that controls falloff. Predicted keypoints that are not labeled ($v_i = 0$) do not affect the OKS. The OKS plays the same role as the IoU in object detection, so the average precision (AP) and average recall (AR) scores can be computed given the OKS.

  • For the MPII dataset, we use the Percentage of Correct Keypoints with respect to head size (PCKh) metric [1]. First, we recall the Percentage of Correct Keypoints (PCK) metric [33]. PCK is the percentage of correct detections that fall within a tolerance range defined as a fraction of the torso diameter:

    \[
    \text{PCK@}r = \frac{1}{K}\sum_{k=1}^{K}\delta\!\left(\frac{\|\hat{p}_k - p_k\|}{\|p_{\text{rhip}} - p_{\text{lsho}}\|} \le r\right)
    \tag{3}
    \]

    where $p_k$ and $\hat{p}_k$ are the ground-truth and predicted locations of the $k$-th keypoint respectively, $p_{\text{rhip}}$ and $p_{\text{lsho}}$ are the ground-truth locations of the right hip and left shoulder respectively, and $r$ is a threshold bounded between 0 and 1. The term $\|p_{\text{rhip}} - p_{\text{lsho}}\|$ represents the torso diameter. For example, PCK@0.2 ($r = 0.2$) means that a keypoint is counted as correct if the distance between the predicted and ground-truth keypoint is at most 0.2 × the torso diameter. PCKh is almost the same as PCK except that the tolerance range is a fraction of the head size. A small sketch of both metrics follows this list.
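As a reference for how these metrics behave, the following is a small NumPy sketch of OKS (Eq. 2) and PCKh (Eq. 3); the helper names and argument conventions are ours, not part of the official evaluation code.

```python
import numpy as np

def oks(pred, gt, vis, area, k):
    """Object Keypoint Similarity (Eq. 2), sketched after the COCO definition.
    pred, gt : (K, 2) predicted / ground-truth keypoint coordinates
    vis      : (K,) visibility flags (0: not labeled, 1: labeled but occluded, 2: visible)
    area     : squared object scale s^2 (segment area of the person instance)
    k        : (K,) per-keypoint constants controlling falloff
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)   # squared Euclidean distances d_i^2
    labeled = vis > 0                       # unlabeled keypoints do not affect OKS
    if not labeled.any():
        return 0.0
    return float(np.mean(np.exp(-d2[labeled] / (2.0 * area * k[labeled] ** 2))))

def pckh(pred, gt, head_size, r=0.5):
    """PCKh@r: fraction of keypoints whose error is within r times the head size."""
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= r * head_size))
```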

Network parameter

For all our experiments, we use ResNet [11] as the backbone for image feature extraction, consisting of 4 residual blocks as shown in Fig. 5 and Fig. 8. Each deconvolutional layer and each convolutional layer use a fixed kernel size across all experiments. The numbers of output channels of the residual blocks, deconvolutional layers, and convolutional layers are as shown in Fig. 5 and Fig. 8. The number of output channels of each deconvolutional layer is set to 256. The number of generated heatmaps is set to 17 for the COCO dataset and 16 for the MPII dataset.

Method Backbone Pretrain AP AP^50 AP^75 AP^M AP^L AR AR^50 AR^75 AR^M AR^L
8-stage Hourglass [20] 8-stage Hourglass N 66.9 - - - - - - - - -
CPN [6] ResNet-50 Y 68.6 - - - - - - - - -
CPN + OHKM [6] ResNet-50 Y 69.4 - - - - - - - - -
SimpleBaseline [32] ResNet-50 Y 70.4 88.6 78.3 67.1 77.2 76.3 92.9 83.4 72.1 82.4
MRHeatNet1 ResNet-50 Y 70.2 88.5 77.6 66.8 77.2 76.2 92.8 83.0 71.8 82.4
MRHeatNet2 ResNet-50 Y 70.3 88.5 78.0 67.2 77.0 76.4 92.9 83.1 72.1 82.4
MRFeaNet1 ResNet-50 Y 70.6 88.7 78.1 67.3 77.5 76.5 92.9 83.3 72.1 82.7
MRFeaNet2 ResNet-50 Y 70.9 88.8 78.3 67.2 78.1 76.8 93.0 83.6 72.2 83.4
SimpleBaseline [32] ResNet-101 Y 71.4 89.3 79.3 68.1 78.1 77.1 93.4 84.0 73.0 83.2
MRFeaNet2 ResNet-101 Y 71.8 89.1 79.6 68.5 78.8 77.8 93.5 84.5 73.5 84.0
SimpleBaseline [32] ResNet-152 Y 72.0 89.3 79.8 68.7 78.9 77.8 93.4 84.6 73.6 83.9
MRFeaNet2 ResNet-152 Y 72.6 89.4 80.4 69.4 79.3 78.2 93.4 85.2 74.1 84.2
TABLE I: Comparisons on the COCO val2017 dataset. OHKM means Online Hard Keypoints Mining [6]. Pretrain means the backbone is pre-trained on the ImageNet classification task.

Method Backbone Input size AP AP^50 AP^75 AP^M AP^L AR AR^50 AR^75 AR^M AR^L
Bottom-up approach: keypoint detection and grouping
OpenPose [3] - - 61.8 84.9 67.5 57.1 68.2 - - - - -
Associative Embedding [19] - - 65.5 86.8 72.3 60.6 72.6 70.2 89.5 76.0 64.6 78.1
PersonLab [21] ResNet-152 - 68.7 89.0 75.4 64.1 75.5 75.4 92.7 81.2 69.7 83.0
MultiPoseNet [15] - - 69.6 86.3 76.6 65.0 76.3 73.5 88.1 79.5 68.6 80.3
Top-down approach: person detection and single-person keypoint detection
Mask-RCNN [10] ResNet-50-FPN - 63.1 87.3 68.7 57.8 71.4 - - - - -
G-RMI [22] ResNet-101 353×257 64.9 85.5 71.3 62.3 70.0 69.7 88.7 75.5 64.4 77.1
Integral Pose Regression [27] ResNet-101 256×256 67.8 88.2 74.8 63.9 74.0 - - - - -
G-RMI + extra data [22] ResNet-101 353×257 68.5 87.1 75.5 65.8 73.3 73.3 90.1 79.5 68.1 80.4
SimpleBaseline [32] ResNet-50 256×192 70.0 90.9 77.9 66.8 75.8 75.6 94.5 83.0 71.5 81.3
SimpleBaseline [32] ResNet-101 256×192 70.9 91.1 79.3 67.9 76.7 76.7 94.9 84.2 72.7 82.2
SimpleBaseline [32] ResNet-152 256×192 71.6 91.2 80.1 68.7 77.2 77.2 94.9 85.0 73.4 82.6
Our multi-resolution representation learning models
MRHeatNet1 ResNet-50 256×192 69.7 90.8 77.8 66.6 75.4 75.4 94.4 82.9 71.3 81.1
MRHeatNet2 ResNet-50 256×192 69.9 90.8 78.3 66.9 75.6 75.6 94.5 83.3 71.6 81.2
MRFeaNet1 ResNet-50 256×192 70.1 90.7 78.4 67.0 75.9 75.8 94.3 83.3 71.7 81.3
MRFeaNet2 ResNet-50 256×192 70.4 90.9 78.7 67.3 76.3 76.2 94.6 83.7 72.0 81.9
MRFeaNet2 ResNet-101 256×192 71.2 91.0 79.6 68.2 76.9 77.0 94.7 84.5 72.9 82.5
MRFeaNet2 ResNet-152 256×192 71.8 91.2 80.1 68.9 77.5 77.4 94.8 84.9 73.5 82.8
TABLE II: Comparisons on COCO test-dev dataset.

IV-A Experimental results on COCO dataset

Training. The data pre-processing and augmentation follow the settings in [32]. The ground-truth human bounding box is extended in height or width to a fixed aspect ratio, and the human box cropped from the image is resized to a fixed size of 256×192 for a fair comparison with [20, 6, 32]. The data augmentation includes random rotation, random scaling, and flipping. We use the Adam optimizer [14]. The batch size is 64. The base learning rate is dropped twice at fixed epochs during training, and the training process is terminated within 170 epochs.

Testing. We use the two-stage top-down paradigm, similar to [6, 32]. Keypoint locations are obtained by taking the location with the highest heat value in each predicted heatmap and applying a quarter offset in the direction from the highest response to the second-highest response (see the sketch below).
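A minimal sketch of this decoding step, assuming a single-channel NumPy heatmap; the returned coordinates are still in heatmap space and must be mapped back to the original image.

```python
import numpy as np

def decode_heatmap(hm):
    """Take the highest-response location and shift it a quarter pixel toward the
    neighboring direction with the larger response (heatmap-space coordinates)."""
    h, w = hm.shape
    y, x = np.unravel_index(np.argmax(hm), hm.shape)
    px, py = float(x), float(y)
    if 0 < x < w - 1:
        px += 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
    if 0 < y < h - 1:
        py += 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
    return px, py
```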

Comparisons on the COCO val2017 dataset. TABLE I reports our evaluation results compared to Hourglass [20], CPN [6], and SimpleBaseline [32]. Note that the results of Hourglass [20] are cited from [6]. For a fair comparison, we use the Faster R-CNN detector [26] with a detection AP of 56.4 (the same as that of SimpleBaseline [32]), while the person detection AP of Hourglass [20] and CPN [6] is 55.3.

As shown in TABLE I, both of our approaches outperform Hourglass [20] and CPN [6]. With the same ResNet-50 backbone, our MRFeaNet2 achieves an AP score of 70.9, improving AP by 4.0 and 2.3 points compared to Hourglass and CPN respectively. Online Hard Keypoints Mining (OHKM) proved its efficiency by helping CPN gain 0.8 AP points (from 68.6 to 69.4), but this is still 1.5 points lower than the AP of MRFeaNet2.

Compared to SimpleBaseline [32], our multi-resolution heatmap learning architectures have slightly worse performance. In the case of the ResNet-50 backbone, SimpleBaseline has an AP score of 70.4 while the AP scores of MRHeatNet1 and MRHeatNet2 are 70.2 and 70.3 respectively. This may be explained by the fact that the deconvolutional layers cannot completely recover all the information that the feature extractor has already learned, so learning only from the outputs of the deconvolutional layers is not sufficient to generate the heatmaps.

On the other hand, our multi-resolution feature map learning architectures perform better than SimpleBaseline [32]. With the ResNet-50 backbone, MRFeaNet1 gains 0.2 AP points while MRFeaNet2 gains 0.5 points. MRFeaNet2 still obtains AP improvements of 0.4 and 0.6 points over SimpleBaseline in the case of the ResNet-101 and ResNet-152 backbones, respectively. This shows that learning heatmap generation from multiple resolutions of the feature extractor can help improve the performance of keypoint prediction.

Comparisons on the COCO test-dev dataset. TABLE II shows the performance of our models and previous methods on the COCO test-dev dataset. Note that the results of SimpleBaseline [32] are reproduced by us using the provided models. We use a human detector with a person detection AP of 60.9 on COCO test-dev for SimpleBaseline and our models. Our networks outperform the bottom-up approaches. Our MRFeaNet2 achieves an AP improvement of 2.2 points compared to MultiPoseNet [15]. In comparison with top-down approaches, our models are better even with a smaller backbone and image size. Our MRFeaNet2 with the ResNet-50 backbone obtains an AP of 70.4 while the AP score of G-RMI [22] is 68.5, even though the latter uses a larger backbone network, a larger image size, and extra training data. Compared to SimpleBaseline [32], our MRFeaNet2 still improves AP by 0.4, 0.3, and 0.2 points in the case of the ResNet-50, ResNet-101, and ResNet-152 backbones, respectively.

IV-B Experimental results on MPII dataset

Training. The data pre-processing and augmentation are similar to the settings in the experiments on the COCO dataset. The input size of the human bounding box follows the setting of the compared methods for a fair comparison. The data augmentation includes random rotation, random scaling, and flipping. The Adam optimizer [14] is also used. The batch size is 64. The learning rate starts from a base value and is dropped twice at fixed epochs, and the training process is terminated after a fixed number of epochs.

Method Hea Sho Elb Wri Hip Kne Ank Total
 Pishchulin et al. [23] 74.3 49.0 40.8 34.1 36.5 34.4 35.2 44.1
 Tompson et al. [29] 95.8 90.3 80.5 74.3 77.6 69.7 62.8 79.6
 Carreira et al. [4] 95.7 91.7 81.7 72.4 82.8 73.2 66.4 81.3
 Tompson et al. [28] 96.1 91.9 83.9 77.8 80.9 72.3 64.8 82.0
 Hu et al. [12] 95.0 91.6 83.0 76.6 81.9 74.5 69.5 82.4
 Pishchulin et al. [24] 94.1 90.2 83.4 77.3 82.6 75.7 68.6 82.4
 Lifshitz et al. [17] 97.8 93.3 85.7 80.4 85.3 76.6 70.2 85.0
 Gkioxary et al. [9] 96.2 93.1 86.7 82.1 85.2 81.4 74.1 86.1
 Rafi et al. [25] 97.2 93.9 86.4 81.3 86.8 80.6 73.4 86.3
 Belagiannis et al. [2] 97.7 95.0 88.2 83.0 87.9 82.6 78.4 88.1
 Insafutdinov et al. [13] 96.8 95.2 89.3 84.4 88.4 83.4 78.0 88.5
 Wei et al. [31] 97.8 95.0 88.7 84.0 88.4 82.8 79.4 88.5
 SimpleBaseline [32] (ResNet-50) 96.4 95.3 89.0 83.2 88.4 84.0 79.6 88.5
 MRHeatNet1 (ResNet-50) 96.7 95.2 88.9 83.8 88.1 83.6 78.6 88.4
 MRHeatNet2 (ResNet-50) 96.8 95.5 88.6 83.8 88.5 83.6 78.7 88.5
 MRFeaNet1 (ResNet-50) 96.5 95.5 89.6 84.3 88.6 84.6 80.6 89.1
 MRFeaNet2 (ResNet-50) 96.6 95.4 88.9 83.9 88.5 84.6 80.9 88.9
 SimpleBaseline [32] (ResNet-101) 96.9 95.9 89.5 84.4 88.4 84.5 80.7 89.1
 MRHeatNet1 (ResNet-101) 96.7 95.7 89.7 84.4 89.1 84.7 81.4 89.3
 MRHeatNet2 (ResNet-101) 97.4 95.6 89.3 84.2 89.0 84.9 81.2 89.3
 MRFeaNet1 (ResNet-101) 96.8 95.6 89.4 84.6 89.2 85.2 81.2 89.4
 MRFeaNet2 (ResNet-101) 96.6 95.2 89.3 84.2 89.2 85.9 81.6 89.3
 SimpleBaseline [32] (ResNet-152) 97.0 95.9 90.0 85.0 89.2 85.3 81.3 89.6
 MRHeatNet1 (ResNet-152) 96.8 96.0 90.1 84.4 88.9 85.3 81.4 89.5
 MRHeatNet2 (ResNet-152) 96.9 95.6 89.9 84.6 88.9 86.0 81.2 89.5
 MRFeaNet1 (ResNet-152) 97.2 95.9 90.2 85.3 89.3 85.4 82.0 89.8
 MRFeaNet2 (ResNet-152) 96.7 95.4 89.9 85.1 88.8 85.7 81.8 89.5
TABLE III: Comparisons on the MPII dataset (PCKh@0.5). The backbone (ResNet-50, ResNet-101, or ResNet-152) used by each of our models and by SimpleBaseline is indicated in parentheses.
Fig. 9: PCKh@0.5 score of SimpleBaseline and our models on MPII dataset.
Fig. 10: Qualitative results of our proposed architectures on COCO test2017 dataset.
Fig. 11: Qualitative results of our MRFeaNet1 on MPII test set. Each prediction has 16 heatmaps corresponding to 16 human keypoints. From left to right, top to bottom, these 16 keypoints are right ankle, right knee, right hip, left hip, left knee, left ankle, pelvis, thorax, upper neck, head top, right wrist, right elbow, right shoulder, left shoulder, left elbow, and left wrist.

Testing. We use the human bounding boxes provided with the images. TABLE III shows the PCKh@0.5 scores of our architectures and previous methods. The results of SimpleBaseline [32] are reproduced by us using the provided models.

Similar to the experiments on the COCO dataset, our multi-resolution representation learning architectures outperform numerous previous methods. In comparison with SimpleBaseline [32], the multi-resolution feature map learning method achieves better performance. Our MRFeaNet1 improves the PCKh@0.5 score by 0.6, 0.3 and 0.2 points compared to SimpleBaseline in the case of the ResNet-50, ResNet-101, and ResNet-152 backbones, respectively.

On the other hand, the results also show that performance can be improved by using a larger backbone network. To make this clear, the PCKh@0.5 scores of SimpleBaseline [32] and our models are presented in a chart, as shown in Fig. 9. MRFeaNet1 with the ResNet-152 backbone, which is the best model on the MPII dataset, obtains score improvements of 0.4 and 0.7 points compared to its ResNet-101 and ResNet-50 counterparts respectively. MRHeatNet1 achieves the highest improvement, 1.1 points, when the backbone network is changed from ResNet-50 to ResNet-152.

IV-C Qualitative results

Qualitative results on the COCO test2017 dataset. We use our models trained on the COCO train2017 dataset with the ResNet-50 backbone to visualize human keypoint prediction. Our qualitative results on unseen images of the COCO test2017 dataset are shown in Fig. 10. Both of our approaches work well on the simple cases (the 1st and 2nd rows).

  • The figures in the 3rd and 4th rows are harder, with some occluded keypoints, but the multi-resolution feature map learning models still predict the human keypoints relatively precisely. The multi-resolution heatmap learning models do not work as well: MRHeatNet1 omits the right elbow in the 3rd row, and the eye detection of MRHeatNet2 is not reasonable in either of these two cases.

  • In the 5th row, both legs of the woman are hidden under the table, but all of our models still produce an estimate. The prediction results differ among the models; looking carefully at the hip predictions, the locations proposed by MRFeaNet2 are the most reasonable.

Qualitative results on the MPII dataset. We use our MRFeaNet1 model trained on a subset of the MPII training set with the ResNet-152 backbone to visualize human keypoint prediction. Fig. 11 shows the keypoint predictions and corresponding heatmaps on unseen images of the MPII test set. Each heatmap represents the location confidence of the respective keypoint. For the simple cases in the 1st and 2nd rows, all keypoints are predicted with high confidence.

  • The man in the 3rd row has his right leg and left ankle occluded, so the prediction of these keypoints has low confidence. However, all prediction results in this case are reasonable and acceptable.

  • Notably, both ankles of the man in the 4th row are not visible, so the ankle predictions are unreliable. However, the heatmaps corresponding to these two ankles are still meaningful: no location is predicted with high confidence.

V Conclusion

In this paper, we introduce two novel approaches for multi-resolution representation learning for human pose estimation. The first approach combines a multi-resolution representation learning strategy with the heatmap generator, where the heatmaps are generated at each resolution of the deconvolutional layers. The second approach generates heatmaps from each resolution of the feature extractor. While remaining relatively straightforward and easy to integrate, our multi-resolution feature map learning models outperform the baseline and many previous methods. Future work includes applications to other tasks that have an encoder-decoder architecture (feature extraction followed by task-specific decoding), such as image captioning and image segmentation.

References

  • [1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele (2014) 2D human pose estimation: new benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3686–3693. Cited by: §I, 2nd item, §IV.
  • [2] V. Belagiannis and A. Zisserman (2017) Recurrent human pose estimation. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 468–475. Cited by: TABLE III.
  • [3] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299. Cited by: TABLE II.
  • [4] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik (2016) Human pose estimation with iterative error feedback. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4733–4742. Cited by: TABLE III.
  • [5] C. Chen and D. Ramanan (2017) 3D human pose estimation = 2D pose estimation + matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7035–7043. Cited by: §I.
  • [6] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun (2018) Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7103–7112. Cited by: §I, §IV-A, §IV-A, §IV-A, §IV-A, TABLE I.
  • [7] COCO COCO - Common Objects in Context. Note: http://cocodataset.org/#keypoints-eval Cited by: 1st item.
  • [8] V. Dumoulin and F. Visin (2016) A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285. Cited by: §II.
  • [9] G. Gkioxari, A. Toshev, and N. Jaitly (2016) Chained predictions using convolutional neural networks. In European Conference on Computer Vision, pp. 728–743. Cited by: TABLE III.
  • [10] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: TABLE II.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I, §III-A, §III, §IV.
  • [12] P. Hu and D. Ramanan (2016) Bottom-up and top-down reasoning with hierarchical rectified gaussians. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5600–5609. Cited by: TABLE III.
  • [13] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele (2016) Deepercut: a deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pp. 34–50. Cited by: TABLE III.
  • [14] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-A, §IV-B.
  • [15] M. Kocabas, S. Karagoz, and E. Akbas (2018) Multiposenet: fast multi-person pose estimation using pose residual network. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 417–433. Cited by: §IV-A, TABLE II.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §I.
  • [17] I. Lifshitz, E. Fetaya, and S. Ullman (2016) Human pose estimation using deep consensus voting. In European Conference on Computer Vision, pp. 246–260. Cited by: TABLE III.
  • [18] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §I, §I, §IV.
  • [19] A. Newell, Z. Huang, and J. Deng (2017) Associative embedding: end-to-end learning for joint detection and grouping. In Advances in neural information processing systems, pp. 2277–2287. Cited by: TABLE II.
  • [20] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In European conference on computer vision, pp. 483–499. Cited by: §I, §IV-A, §IV-A, §IV-A, TABLE I.
  • [21] G. Papandreou, T. Zhu, L. Chen, S. Gidaris, J. Tompson, and K. Murphy (2018) Personlab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–286. Cited by: TABLE II.
  • [22] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy (2017) Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4903–4911. Cited by: §IV-A, TABLE II.
  • [23] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele (2013) Strong appearance and expressive spatial models for human pose estimation. In Proceedings of the IEEE international conference on Computer Vision, pp. 3487–3494. Cited by: TABLE III.
  • [24] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele (2016) Deepcut: joint subset partition and labeling for multi person pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4929–4937. Cited by: TABLE III.
  • [25] U. Rafi, B. Leibe, J. Gall, and I. Kostrikov (2016) An efficient convolutional network for human pose estimation.. In BMVC, Vol. 1, pp. 2. Cited by: TABLE III.
  • [26] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §IV-A.
  • [27] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei (2018) Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 529–545. Cited by: TABLE II.
  • [28] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler (2015) Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656. Cited by: §I, 2nd item, TABLE III.
  • [29] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler (2014) Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in neural information processing systems, pp. 1799–1807. Cited by: TABLE III.
  • [30] A. Toshev and C. Szegedy (2014) Deeppose: human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1653–1660. Cited by: §I.
  • [31] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh (2016) Convolutional pose machines. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 4724–4732. Cited by: §I, TABLE III.
  • [32] B. Xiao, H. Wu, and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV), pp. 466–481. Cited by: §I, §I, §II, §II, §IV-A, §IV-A, §IV-A, §IV-A, §IV-A, §IV-A, §IV-B, §IV-B, §IV-B, TABLE I, TABLE II, TABLE III.
  • [33] Y. Yang and D. Ramanan (2011) Articulated pose estimation with flexible mixtures-of-parts. In CVPR 2011, pp. 1385–1392. Cited by: 2nd item.