Deep High-Resolution Representation Learning for Human Pose Estimation

02/25/2019 ∙ by Ke Sun, et al. ∙ Microsoft, USTC

This is an official PyTorch implementation of Deep High-Resolution Representation Learning for Human Pose Estimation. In this work, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from the other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through superior pose estimation results on two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. The code and models are publicly available at <https://github.com/leoxiaobin/deep-high-resolution-net.pytorch>.

1 Introduction

2D human pose estimation has been a fundamental yet challenging problem in computer vision. The goal is to localize human anatomical keypoints (e.g., elbow, wrist, etc.) or parts. It has many applications, including human action recognition, human-computer interaction, animation, etc. This paper focuses on single-person pose estimation, which is the basis of other related problems, such as multi-person pose estimation [6, 27, 33, 39, 47, 57, 41, 46, 17, 71], video pose estimation and tracking [49, 72], etc.

Recent developments show that deep convolutional neural networks achieve state-of-the-art performance. Most existing methods pass the input through a network, typically consisting of high-to-low resolution subnetworks that are connected in series, and then raise the resolution. For instance, Hourglass [40] recovers the high resolution through a symmetric low-to-high process. SimpleBaseline [72] adopts a few transposed convolution layers for generating high-resolution representations. In addition, dilated convolutions are also used to blow up the later layers of a high-to-low resolution network (e.g., VGGNet or ResNet) [27, 77].


Figure 1: Illustrating the architecture of the proposed HRNet. It consists of parallel high-to-low resolution subnetworks with repeated information exchange across multi-resolution subnetworks (multi-scale fusion). The horizontal and vertical directions correspond to the depth of the network and the scale of the feature maps, respectively.

We present a novel architecture, namely High-Resolution Net (HRNet), which is able to maintain high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions by exchanging the information across the parallel multi-resolution subnetworks over and over through the whole process. We estimate the keypoints over the high-resolution representations output by our network. The resulting network is illustrated in Figure 1.

Figure 2: Illustration of representative pose estimation networks that rely on the high-to-low and low-to-high framework. (a) Hourglass [40]. (b) Cascaded pyramid networks [11]. (c) SimpleBaseline [72]: transposed convolutions for low-to-high processing. (d) Combination with dilated convolutions [27]. Bottom-right legend: reg. = regular convolution, dilated = dilated convolution, trans. = transposed convolution, strided = strided convolution, concat. = concatenation. In (a), the high-to-low and low-to-high processes are symmetric. In (b), (c) and (d), the high-to-low process, a part of a classification network (ResNet or VGGNet), is heavy, and the low-to-high process is light. In (a) and (b), the skip-connections (dashed lines) between the same-resolution layers of the high-to-low and low-to-high processes mainly aim to fuse low-level and high-level features. In (b), the right part, refinenet, combines the low-level and high-level features that are processed through convolutions.

Our network has two benefits in comparison to existing widely-used networks [40, 27, 77, 72] for pose estimation. (1) Our approach connects high-to-low resolution subnetworks in parallel rather than in series as done in most existing solutions. Thus, our approach is able to maintain the high resolution instead of recovering the resolution through a low-to-high process, and accordingly the predicted heatmap is potentially spatially more precise. (2) Most existing fusion schemes aggregate low-level and high-level representations. Instead, we perform repeated multi-scale fusions to boost the high-resolution representations with the help of the low-resolution representations of the same depth and similar level, and vice versa, so that the high-resolution representations are also rich for pose estimation. Consequently, our predicted heatmap is potentially more accurate.

We empirically demonstrate the superior keypoint detection performance over two benchmark datasets: the COCO keypoint detection dataset [36] and the MPII Human Pose dataset [2]. In addition, we show the superiority of our network in video pose tracking on the PoseTrack dataset [1].

2 Related Work

Most traditional solutions to single-person pose estimation adopt the probabilistic graphical model or the pictorial structure model [79, 50], which has recently been improved by exploiting deep learning to better model the unary and pair-wise energies [9, 65, 45] or to imitate the iterative inference process [13]. Nowadays, deep convolutional neural networks provide the dominant solutions [20, 35, 62, 42, 43, 48, 58, 16]. There are two mainstream methods: regressing the position of keypoints [66, 7], and estimating keypoint heatmaps [13, 14, 78] followed by choosing the locations with the highest heat values as the keypoints.

Most convolutional neural networks for keypoint heatmap estimation consist of a stem subnetwork similar to the classification network, which decreases the resolution, a main body producing the representations with the same resolution as its input, followed by a regressor estimating the heatmaps where the keypoint positions are estimated and then transformed to the full resolution. The main body mainly adopts the high-to-low and low-to-high framework, possibly augmented with multi-scale fusion and intermediate (deep) supervision.

High-to-low and low-to-high. The high-to-low process aims to generate low-resolution and high-level representations, and the low-to-high process aims to produce high-resolution representations [4, 11, 23, 72, 40, 62]. Both processes may be repeated several times to boost the performance [77, 40, 14].

Representative network design patterns include: (1) Symmetric high-to-low and low-to-high processes. Hourglass and its follow-ups [40, 14, 77, 31] design the low-to-high process as a mirror of the high-to-low process. (2) Heavy high-to-low and light low-to-high. The high-to-low process is based on the ImageNet classification network, e.g., ResNet adopted in [11, 72], and the low-to-high process is simply a few bilinear-upsampling [11] or transposed convolution [72] layers. (3) Combination with dilated convolutions. In [27, 51, 35], dilated convolutions are adopted in the last two stages in the ResNet or VGGNet to eliminate the spatial resolution loss, which is followed by a light low-to-high process to further increase the resolution, avoiding the expensive computation cost of using only dilated convolutions [11, 27, 51]. Figure 2 depicts four representative pose estimation networks.

Multi-scale fusion. The straightforward way is to feed multi-resolution images separately into multiple networks and aggregate the output response maps [64]. Hourglass [40] and its extensions [77, 31] combine low-level features in the high-to-low process into the same-resolution high-level features in the low-to-high process progressively through skip connections. In cascaded pyramid network [11], a globalnet combines low-to-high level features in the high-to-low process progressively into the low-to-high process, and then a refinenet combines the low-to-high level features that are processed through convolutions. Our approach repeats multi-scale fusion, which is partially inspired by deep fusion and its extensions [67, 73, 59, 80, 82].

Intermediate supervision. Intermediate supervision or deep supervision, early developed for image classification [34, 61], is also adopted to help train deep networks and to improve the heatmap estimation quality, e.g., [69, 40, 64, 3, 11]. The hourglass approach [40] and the convolutional pose machine approach [69] process the intermediate heatmaps as the input or a part of the input of the remaining subnetwork.

Our approach. Our network connects high-to-low subnetworks in parallel. It maintains high-resolution representations through the whole process for spatially precise heatmap estimation. It generates reliable high-resolution representations through repeatedly fusing the representations produced by the high-to-low subnetworks. Our approach is different from most existing works, which need a separate low-to-high upsampling process and aggregate low-level and high-level representations. Our approach, without using intermediate heatmap supervision, is superior in keypoint detection accuracy and efficient in computation complexity and parameters.

There are related multi-scale networks for classification and segmentation [5, 8, 74, 81, 30, 76, 55, 56, 24, 83, 55, 52, 18]. Our work is partially inspired by some of them [56, 24, 83, 55], but there are clear differences that make them not applicable to our problem. Convolutional neural fabrics [56] and interlinked CNN [83] fail to produce high-quality segmentation results because of a lack of proper design on each subnetwork (depth, batch normalization) and multi-scale fusion. The grid network [18], a combination of many weight-shared U-Nets, consists of two separate fusion processes across multi-resolution representations: in the first stage, information is only sent from high resolution to low resolution; in the second stage, information is only sent from low resolution to high resolution, and thus it is less competitive. Multi-scale densenets [24] do not target and cannot generate reliable high-resolution representations.

3 Approach

Human pose estimation, a.k.a. keypoint detection, aims to detect the locations of $K$ keypoints or parts (e.g., elbow, wrist, etc.) from an image $\mathbf{I}$ of size $W \times H \times 3$. The state-of-the-art methods transform this problem to estimating $K$ heatmaps of size $W' \times H'$, $\{\mathbf{H}_1, \mathbf{H}_2, \dots, \mathbf{H}_K\}$, where each heatmap $\mathbf{H}_k$ indicates the location confidence of the $k$th keypoint.

We follow the widely-adopted pipeline [40, 72, 11] to predict human keypoints using a convolutional network, which is composed of a stem consisting of two strided convolutions decreasing the resolution, a main body outputting the feature maps with the same resolution as its input feature maps, and a regressor estimating the heatmaps where the keypoint positions are chosen and transformed to the full resolution. We focus on the design of the main body and introduce our High-Resolution Net (HRNet) that is depicted in Figure 1.

Sequential multi-resolution subnetworks. Existing networks for pose estimation are built by connecting high-to-low resolution subnetworks in series, where each subnetwork, forming a stage, is composed of a sequence of convolutions and there is a down-sample layer across adjacent subnetworks to halve the resolution.

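Let $\mathcal{N}_{sr}$ be the subnetwork in the $s$th stage and $r$ be the resolution index (its resolution is $\frac{1}{2^{r-1}}$ of the resolution of the first subnetwork). The high-to-low network with $S$ (e.g., 4) stages can be denoted as:

$$\mathcal{N}_{11} \rightarrow \mathcal{N}_{22} \rightarrow \mathcal{N}_{33} \rightarrow \mathcal{N}_{44}. \tag{1}$$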

Parallel multi-resolution subnetworks. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one, forming new stages, and connect the multi-resolution subnetworks in parallel. As a result, the resolutions for the parallel subnetworks of a later stage consist of the resolutions from the previous stage and an extra lower one.

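An example network structure, containing 4 parallel subnetworks, is given as follows,

$$
\begin{array}{llll}
\mathcal{N}_{11} & \rightarrow \mathcal{N}_{21} & \rightarrow \mathcal{N}_{31} & \rightarrow \mathcal{N}_{41} \\
 & \searrow \mathcal{N}_{22} & \rightarrow \mathcal{N}_{32} & \rightarrow \mathcal{N}_{42} \\
 & & \searrow \mathcal{N}_{33} & \rightarrow \mathcal{N}_{43} \\
 & & & \searrow \mathcal{N}_{44}.
\end{array} \tag{2}
$$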

Figure 3: Illustrating how the exchange unit aggregates the information for high, medium and low resolutions from the left to the right, respectively. Right legend: strided = strided convolution, up samp. = nearest neighbor up-sampling following a convolution.

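Repeated multi-scale fusion. We introduce exchange units across parallel subnetworks such that each subnetwork repeatedly receives the information from other parallel subnetworks. Here is an example showing the scheme of exchanging information. We divide the third stage into several (e.g., 3) exchange blocks, and each block is composed of 3 parallel convolution units with an exchange unit across the parallel units, which is given as follows,

$$
\begin{array}{cccccc}
\mathcal{C}^{1}_{31} & \searrow & \nearrow \mathcal{C}^{2}_{31} & \searrow & \nearrow \mathcal{C}^{3}_{31} & \searrow \\
\mathcal{C}^{1}_{32} & \rightarrow \mathcal{E}^{1}_{3} & \rightarrow \mathcal{C}^{2}_{32} & \rightarrow \mathcal{E}^{2}_{3} & \rightarrow \mathcal{C}^{3}_{32} & \rightarrow \mathcal{E}^{3}_{3}, \\
\mathcal{C}^{1}_{33} & \nearrow & \searrow \mathcal{C}^{2}_{33} & \nearrow & \searrow \mathcal{C}^{3}_{33} & \nearrow
\end{array} \tag{3}
$$

where $\mathcal{C}^{b}_{sr}$ represents the convolution unit in the $r$th resolution of the $b$th block in the $s$th stage, and $\mathcal{E}^{b}_{s}$ is the corresponding exchange unit.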

We illustrate the exchange unit in Figure 3 and present the formulation in the following. We drop the subscript $s$ and the superscript $b$ for discussion convenience. The inputs are $s$ response maps: $\{\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_s\}$. The outputs are $s$ response maps: $\{\mathbf{Y}_1, \mathbf{Y}_2, \dots, \mathbf{Y}_s\}$, whose resolutions and widths are the same as those of the inputs. Each output is an aggregation of the input maps, $\mathbf{Y}_k = \sum_{i=1}^{s} a(\mathbf{X}_i, k)$. The exchange unit across stages has an extra output map $\mathbf{Y}_{s+1}$: $\mathbf{Y}_{s+1} = a(\mathbf{Y}_s, s+1)$.

The function $a(\mathbf{X}_i, k)$ consists of upsampling or downsampling $\mathbf{X}_i$ from resolution $i$ to resolution $k$. We adopt strided 3×3 convolutions for downsampling: for instance, one strided 3×3 convolution with stride 2 for 2× downsampling, and two consecutive strided 3×3 convolutions with stride 2 for 4× downsampling. For upsampling, we adopt simple nearest neighbor sampling following a 1×1 convolution that aligns the number of channels. If $i = k$, $a(\cdot, \cdot)$ is just an identity connection: $a(\mathbf{X}_i, k) = \mathbf{X}_i$.
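The resampling rule of the exchange unit can be sketched as a small planner that lists, for every input/output resolution pair, which operations bring the input to the output resolution. This is only an illustrative sketch in plain Python: the function names `fusion_ops` and `exchange_plan` and the operation labels are hypothetical, no convolution is actually executed, and kernel sizes are deliberately left out.

```python
def fusion_ops(i, k):
    """Describe how an input at resolution index i is brought to index k.

    A larger index means a lower resolution (each step halves it).
    The returned strings are labels only; they are not real layer objects.
    """
    if i == k:
        return ["identity"]
    if i < k:
        # Higher -> lower resolution: one stride-2 convolution per halving.
        return ["strided_conv(stride=2)"] * (k - i)
    # Lower -> higher resolution: align channels, then nearest-neighbor upsample.
    return ["conv(align_channels)", "nearest_upsample(x%d)" % (2 ** (i - k))]


def exchange_plan(s):
    """For s parallel branches, map each output k to the ops applied to each input i.

    Output k is the sum over i of the resampled inputs.
    """
    return {
        k: {i: fusion_ops(i, k) for i in range(1, s + 1)}
        for k in range(1, s + 1)
    }


if __name__ == "__main__":
    for k, sources in exchange_plan(3).items():
        print("output", k, sources)
```

With three branches, the highest-resolution output receives one identity input and two upsampled inputs, while the lowest-resolution output receives inputs that pass through one or two strided convolutions.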

Heatmap estimation. We regress the heatmaps simply from the high-resolution representations output by the last exchange unit, which empirically works well. The loss function, defined as the mean squared error, is applied for comparing the predicted heatmaps and the groundtruth heatmaps. The groundtruth heatmaps are generated by applying a 2D Gaussian with standard deviation of 1 pixel centered on the groundtruth location of each keypoint.
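As a rough illustration of the heatmap target and the loss, the sketch below builds one groundtruth heatmap from a 2D Gaussian and compares two heatmaps with the mean squared error. It uses plain Python lists instead of tensors; the function names, the unnormalized Gaussian (peak value 1), and the default `sigma` are assumptions made for the example, not details taken from the released code.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=1.0):
    """Return a height x width heatmap with a 2D Gaussian centered at (cx, cy).

    The Gaussian is unnormalized, so the value at the center is exactly 1.0
    (an assumption of this sketch).
    """
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def mse(pred, target):
    """Mean squared error between two heatmaps of identical shape."""
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    target = gaussian_heatmap(8, 6, cx=3, cy=2)
    blank = [[0.0] * 8 for _ in range(6)]
    print("loss of an all-zero prediction:", mse(blank, target))
```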

Network instantiation. We instantiate the network for keypoint heatmap estimation by following the design rule of ResNet to distribute the depth to each stage and the number of channels to each resolution.

The main body, i.e., our HRNet, contains four stages with four parallel subnetworks, whose resolution is gradually halved while the width (the number of channels) is accordingly doubled. The first stage contains 4 residual units where each unit, the same as in ResNet-50, is formed by a bottleneck with the width 64, and is followed by one 3×3 convolution reducing the width of feature maps to $C$. The 2nd, 3rd, and 4th stages contain 1, 4, and 3 exchange blocks, respectively. One exchange block contains 4 residual units, where each unit contains two 3×3 convolutions in each resolution, and an exchange unit across resolutions. In summary, there are 8 exchange units in total, i.e., 8 multi-scale fusions are conducted.

In our experiments, we study one small net and one big net: HRNet-W32 and HRNet-W48, where 32 and 48 represent the widths ($C$) of the high-resolution subnetworks in the last three stages, respectively. The widths of the other three parallel subnetworks are 64, 128, 256 for HRNet-W32, and 96, 192, 384 for HRNet-W48.
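The halve-the-resolution, double-the-width rule can be written down as a small layout table. The sketch below is a hypothetical helper (`hrnet_layout` is not a function of the released code); it applies the doubling rule uniformly to every stage and therefore ignores any stage-specific widths, such as the bottleneck used in the first stage. The base width passed in the usage example is only an illustrative value.

```python
def hrnet_layout(base_width, num_stages=4):
    """List the parallel branches of each stage.

    Stage s holds branches with resolution indices 1..s. Branch r runs at
    1 / 2**(r - 1) of the first branch's resolution and has
    base_width * 2**(r - 1) channels.
    """
    layout = {}
    for stage in range(1, num_stages + 1):
        layout[stage] = [
            {
                "resolution_index": r,
                "relative_resolution": 1.0 / 2 ** (r - 1),
                "width": base_width * 2 ** (r - 1),
            }
            for r in range(1, stage + 1)
        ]
    return layout


if __name__ == "__main__":
    for stage, branches in hrnet_layout(base_width=32).items():
        print(stage, [(b["relative_resolution"], b["width"]) for b in branches])
```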

4 Experiments

Method Backbone Pretrain Input size #Params GFLOPs
-stage Hourglass [40] -stage Hourglass N M
CPN [11] ResNet-50 Y M
CPN + OHKM [11] ResNet-50 Y M
SimpleBaseline [72] ResNet-50 Y M
SimpleBaseline [72] ResNet-101 Y M
SimpleBaseline [72] ResNet-152 Y M
HRNet-W HRNet-W N M
HRNet-W HRNet-W Y M
HRNet-W HRNet-W Y M
SimpleBaseline [72] ResNet-152 Y M
HRNet-W HRNet-W Y M
HRNet-W HRNet-W Y M 76.3 90.8 82.9 72.3 83.4 81.2
Table 1: Comparisons on the COCO validation set. Pretrain = pretrain the backbone on the ImageNet classification task. OHKM = online hard keypoints mining [11].
Method Backbone Input size #Params GFLOPs
Bottom-up: keypoint detection and grouping
OpenPose [6]
Associative Embedding [39]
PersonLab [46]
MultiPoseNet [33]
Top-down: human detection and single-person keypoint detection
Mask-RCNN [21] ResNet-50-FPN
G-RMI [47] ResNet-101 M
Integral Pose Regression [60] ResNet-101 M
G-RMI + extra data [47] ResNet-101 M
CPN [11] ResNet-Inception
RMPE [17] PyraNet [77] M
CFN [25]
CPN (ensemble) [11] ResNet-Inception
SimpleBaseline [72] ResNet-152 M
HRNet-W HRNet-W M
HRNet-W HRNet-W M 75.5 92.5 83.3 71.9 81.5 80.5
HRNet-W + extra data HRNet-W M 77.0 92.7 84.5 73.4 83.1 82.0
Table 2: Comparisons on the COCO test-dev set. #Params and FLOPs are calculated for the pose estimation network, and those for human detection and keypoint grouping are not included.

4.1 COCO Keypoint Detection

Dataset. The COCO dataset [36] contains over images and person instances labeled with keypoints. We train our model on COCO train dataset, including images and person instances. We evaluate our approach on the val set and test-dev set, containing images and images, respectively.

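Evaluation metric. The standard evaluation metric is based on Object Keypoint Similarity (OKS):

$$\mathrm{OKS} = \frac{\sum_i \exp\left(-d_i^2 / 2 s^2 k_i^2\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}.$$

Here $d_i$ is the Euclidean distance between the detected keypoint and the corresponding ground truth, $v_i$ is the visibility flag of the ground truth, $s$ is the object scale, and $k_i$ is a per-keypoint constant that controls falloff. We report standard average precision and recall scores (http://cocodataset.org/#keypoints-eval): $\mathrm{AP}^{50}$ (AP at OKS = 0.50), $\mathrm{AP}^{75}$, AP (the mean of AP scores at 10 positions, OKS = 0.50, 0.55, …, 0.90, 0.95), $\mathrm{AP}^{M}$ for medium objects, $\mathrm{AP}^{L}$ for large objects, and AR at OKS = 0.50, 0.55, …, 0.90, 0.95.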
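To make the metric concrete, here is a minimal sketch of OKS for a single person, assuming the commonly used form in which each labeled keypoint contributes exp(-d² / (2 s² k²)) and the result is averaged over keypoints with a positive visibility flag. The function name and argument layout are assumptions for this example; the official COCO evaluation code should be used for reported numbers.

```python
import math


def oks(pred, gt, visibility, scale, kappas):
    """Object Keypoint Similarity for one person instance.

    pred, gt   : lists of (x, y) keypoint coordinates
    visibility : ground-truth visibility flags (only v > 0 is counted)
    scale      : object scale s
    kappas     : per-keypoint falloff constants k_i
    """
    total = 0.0
    labeled = 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, visibility, kappas):
        if v > 0:
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            total += math.exp(-d2 / (2.0 * scale ** 2 * k ** 2))
            labeled += 1
    # With no labeled keypoints the similarity is undefined; return 0.0 here.
    return total / labeled if labeled else 0.0


if __name__ == "__main__":
    gt = [(10.0, 10.0), (20.0, 15.0)]
    pred = [(11.0, 10.0), (20.0, 18.0)]
    print(oks(pred, gt, visibility=[2, 1], scale=10.0, kappas=[0.1, 0.1]))
```

Unlabeled keypoints (visibility flag 0) are skipped entirely, so a prediction is never penalized for joints that have no annotation.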

Training. We extend the human detection box in height or width to a fixed aspect ratio: , and then crop the box from the image, which is resized to a fixed size, or . The data augmentation includes random rotation (), random scale (), and flipping. Following [68], half-body data augmentation is also applied.
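The box-extension step can be sketched as follows: the box is grown, around its center, in whichever dimension is too small for the target aspect ratio, so that no part of the original box is cut off. The helper name and the `(x, y, w, h)` top-left box convention are assumptions, and the ratio in the usage example is just an illustrative value.

```python
def fix_aspect_ratio(x, y, w, h, aspect):
    """Extend a box (top-left x, y, width w, height h) so that w / h == aspect.

    The box is only ever enlarged, and its center stays fixed.
    """
    cx = x + w / 2.0
    cy = y + h / 2.0
    if w > aspect * h:
        # Box is too wide for the target ratio: extend the height.
        h = w / aspect
    else:
        # Box is too tall (or already matches): extend the width.
        w = aspect * h
    return cx - w / 2.0, cy - h / 2.0, w, h


if __name__ == "__main__":
    # Hypothetical width:height ratio of 0.75, used here only as an example.
    print(fix_aspect_ratio(0, 0, 30, 80, aspect=0.75))
```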

We use the Adam optimizer [32]. The learning schedule follows the setting [72]. The base learning rate is set as , and is dropped to and at the th and th epochs, respectively. The training process is terminated within epochs.
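A step schedule of this kind, with a base learning rate that is dropped at fixed epochs, can be sketched in a few lines. The function name, the multiplicative drop factor `gamma`, and the milestone values in the usage example are all placeholders chosen for illustration, not the settings of the paper.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate at a given epoch for a multi-step schedule.

    The rate is multiplied by gamma once for every milestone already reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    # Placeholder values for illustration only.
    for epoch in (0, 9, 10, 19, 20):
        print(epoch, step_lr(epoch, base_lr=1.0, milestones=[10, 20]))
```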

Testing. A two-stage top-down paradigm similar to [47, 11, 72] is used: detect the person instance using a person detector, and then predict the keypoints of the detected person.

We use the same person detectors provided by SimpleBaseline [72] (https://github.com/Microsoft/human-pose-estimation.pytorch) for both the validation set and the test-dev set. Following the common practice [72, 40, 11], we compute the heatmap by averaging the heatmaps of the original and flipped images. Each keypoint location is predicted by adjusting the location of the highest heat value with a quarter offset in the direction from the highest response to the second highest response.
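The two test-time steps, flip averaging and the quarter-pixel adjustment, can be sketched on a single heatmap stored as a list of rows. Two simplifications are assumed here and are not taken from the paper: the flipped heatmap is only mirrored back horizontally (swapping of left/right joint channels is omitted), and the direction toward the second highest response is approximated per axis by comparing the two neighbors of the peak. Function names are hypothetical.

```python
def average_flip(heatmap, heatmap_from_flipped):
    """Average a heatmap with the one predicted from the mirrored image.

    The second heatmap is mirrored back along the x axis before averaging.
    """
    return [
        [(a + b) / 2.0 for a, b in zip(row, reversed(flipped_row))]
        for row, flipped_row in zip(heatmap, heatmap_from_flipped)
    ]


def decode_keypoint(heatmap):
    """Return (x, y) of the peak, shifted by a quarter pixel toward the larger neighbor."""
    height, width = len(heatmap), len(heatmap[0])
    y, x = max(
        ((r, c) for r in range(height) for c in range(width)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    fx, fy = float(x), float(y)
    if 0 < x < width - 1:
        diff = heatmap[y][x + 1] - heatmap[y][x - 1]
        fx += 0.25 * ((diff > 0) - (diff < 0))  # sign of the difference
    if 0 < y < height - 1:
        diff = heatmap[y + 1][x] - heatmap[y - 1][x]
        fy += 0.25 * ((diff > 0) - (diff < 0))
    return fx, fy


if __name__ == "__main__":
    hm = [
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.5, 0.0],
        [0.0, 0.2, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
    ]
    print(decode_keypoint(hm))
```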

Results on the validation set. We report the results of our method and other state-of-the-art methods in Table 1. Our small network - HRNet-W, trained from scratch with the input size , achieves an AP score, outperforming other methods with the same input size. (1) Compared to Hourglass [40], our small network improves AP by points, and the GFLOPs of our network are less than half, while the number of parameters is similar, with ours slightly larger. (2) Compared to CPN [11] w/o and w/ OHKM, our network, with slightly larger model size and slightly higher complexity, achieves and points gain, respectively. (3) Compared to the previous best-performing SimpleBaseline [72], our small net HRNet-W obtains significant improvements: points gain for the backbone ResNet- with a similar model size and GFLOPs, and points gain for the backbone ResNet- whose model size (#Params) and GFLOPs are twice as large as ours.

Our nets can benefit from (1) training from the model pretrained for the ImageNet classification problem: The gain is points for HRNet-W; (2) increasing the capacity by increasing the width: Our big net HRNet-W gets and improvements for the input sizes and , respectively.

Considering the input size , our HRNet-W and HRNet-W, get the and AP, which have and improvements compared to the input size . In comparison to the SimpleBaseline [72] that uses ResNet- as the backbone, our HRNet-W and HRNet-W attain and points gain in terms of AP at and computational cost, respectively.

Results on the test-dev set. Table 2 reports the pose estimation performances of our approach and the existing state-of-the-art approaches. Our approach is significantly better than bottom-up approaches. On the other hand, our small network, HRNet-W, achieves an AP of . It outperforms all the other top-down approaches, and is more efficient in terms of model size (#Params) and computation complexity (GFLOPs). Our big model, HRNet-W, achieves the highest AP. Compared to the SimpleBaseline [72] with the same input size, our small and big networks receive and improvements, respectively. With additional data from AI Challenger [70] for training, our single big network can obtain an AP of .

4.2 MPII Human Pose Estimation

Dataset. The MPII Human Pose dataset [2] consists of images taken from a wide range of real-world activities with full-body pose annotations. There are around images with subjects, where there are subjects for testing and the remaining subjects for the training set. The data augmentation and the training strategy are the same as for MS COCO, except that the input size is cropped to for fair comparison with other methods.

Testing. The testing procedure is almost the same as that for COCO, except that we adopt the standard testing strategy of using the provided person boxes instead of detected person boxes. Following [14, 77, 62], a six-scale pyramid testing procedure is performed.

Evaluation metric. The standard metric [2], the PCKh (head-normalized probability of correct keypoint) score, is used. A joint is correct if it falls within $\alpha l$ pixels of the groundtruth position, where $\alpha$ is a constant and $l$ is the head size that corresponds to 60% of the diagonal length of the ground-truth head bounding box. The PCKh@0.5 ($\alpha = 0.5$) score is reported.
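The PCKh criterion amounts to a distance threshold that scales with the head size. The sketch below counts a joint as correct when its distance to the groundtruth is at most `alpha * head_size`; the function name is hypothetical, and the default `alpha` follows the usual PCKh@0.5 convention, which is stated here as an assumption of the example.

```python
import math


def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of joints within alpha * head_size pixels of the groundtruth.

    pred, gt  : lists of (x, y) joint coordinates for one person
    head_size : head size in pixels used for normalization
    """
    threshold = alpha * head_size
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)


if __name__ == "__main__":
    pred = [(0.0, 0.0), (10.0, 0.0)]
    gt = [(0.0, 3.0), (0.0, 0.0)]
    print(pckh(pred, gt, head_size=10.0))
```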

Results on the test set. Tables 3 and 4 show the PCKh results, the model size and the GFLOPs of the top-performing methods. We reimplement the SimpleBaseline [72] by using ResNet- as the backbone with the input size . Our HRNet-W achieves a PCKh@ score, and outperforms the stacked hourglass approach [40] and its extensions [58, 14, 77, 31, 62]. Our result is the same as the best one [62] among the previously-published results on the leaderboard of Nov. th, (http://human-pose.mpi-inf.mpg.de/#results). We would like to point out that the approach [62], complementary to our approach, exploits the compositional model to learn the configuration of human bodies and adopts multi-level intermediate supervision, from which our approach can also benefit. We also tested our big network - HRNet-W and obtained the same result . The reason might be that the performance on this dataset tends to be saturated.

Method Hea Sho Elb Wri Hip Kne Ank Total
Insafutdinov et al. [27]
Wei et al. [69]
Bulat et al. [4]
Newell et al. [40]
Sun et al. [58]
Tang et al. [63]
Ning et al. [44]
Luvizon et al. [37]
Chu et al. [14]
Chou et al. [12]
Chen et al. [10] 89.6
Yang et al. [77]
Ke et al. [31] 86.3
Tang et al. [62] 96.9 91.8 92.3
SimpleBaseline [72]
HRNet-W 98.6 96.9 92.8 89.0 92.3
Table 3: Performance comparisons on the MPII test set (PCKh).
Method #Params GFLOPs PCKh@
Insafutdinov et al. [27] M
Newell et al. [40] M
Yang et al. [77] M
Tang et al. [62] M
SimpleBaseline [72] M
HRNet-W M
Table 4: #Params and GFLOPs of some top-performed methods reported in Table 3. The GFLOPs is computed with the input size .

4.3 Application to Pose Tracking

Dataset. PoseTrack [28] is a large-scale benchmark for human pose estimation and articulated tracking in video. The dataset, based on the raw videos provided by the popular MPII Human Pose dataset, contains video sequences with frames. The video sequences are split into , , videos for training, validation, and testing, respectively. The length of the training videos ranges between frames, and frames from the center of the video are densely annotated. The number of frames in the validation/testing videos ranges between frames. The frames around the keyframe from the MPII Pose dataset are densely annotated, and afterwards every fourth frame is annotated. In total, this constitutes roughly labeled frames and pose annotations.

Evaluation metric. We evaluate the results from two aspects: frame-wise multi-person pose estimation, and multi-person pose tracking. Pose estimation is evaluated by the mean Average Precision (mAP) as done in [51, 28]. Multi-person pose tracking is evaluated by the multi-object tracking accuracy (MOTA) [38, 28]. Details are given in [28].

Training. We train our HRNet-W for single person pose estimation on the PoseTrack training set, where the network is initialized by the model pre-trained on COCO dataset. We extract the person box, as the input of our network, from the annotated keypoints in the training frames by extending the bounding box of all the keypoints (for one single person) by in length. The training setup, including data augmentation, is almost the same as that for COCO except that the learning schedule is different (as now it is for fine-tuning): the learning rate starts from , drops to at the th epoch, and to at the th epoch; the iteration ends within epochs.

Testing. We follow [72] to track poses across frames. It consists of three steps: person box detection and propagation, human pose estimation, and pose association across nearby frames. We use the same person box detector as used in SimpleBaseline [72], and propagate the detected box into nearby frames by propagating the predicted keypoints according to the optical flows computed by FlowNet 2.0 [26] (https://github.com/NVIDIA/flownet2-pytorch), followed by non-maximum suppression to remove redundant boxes. The pose association scheme is based on the object keypoint similarity between the keypoints in one frame and the keypoints propagated from the nearby frame according to the optical flows. The greedy matching algorithm is then used to compute the correspondence between keypoints in nearby frames. More details are given in [72].
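The association step can be illustrated with a generic greedy matcher over a similarity matrix, where entry `sim[i][j]` would hold the object keypoint similarity between pose `i` in the current frame and propagated pose `j` from the nearby frame. This is a sketch of greedy matching in general, not the exact procedure of [72]; the function name and the optional `min_sim` cutoff are assumptions.

```python
def greedy_match(sim, min_sim=0.0):
    """Greedily pair rows and columns of a similarity matrix.

    Pairs are taken in order of decreasing similarity; each row and each
    column is used at most once, and pairs below min_sim are skipped.
    Returns a list of (row, column) index pairs.
    """
    candidates = sorted(
        ((sim[i][j], i, j) for i in range(len(sim)) for j in range(len(sim[i]))),
        reverse=True,
    )
    used_rows, used_cols, matches = set(), set(), []
    for score, i, j in candidates:
        if score < min_sim or i in used_rows or j in used_cols:
            continue
        used_rows.add(i)
        used_cols.add(j)
        matches.append((i, j))
    return matches


if __name__ == "__main__":
    similarity = [
        [0.90, 0.80],
        [0.85, 0.10],
    ]
    print(greedy_match(similarity))
    print(greedy_match(similarity, min_sim=0.5))
```

Greedy matching is not globally optimal: in the example, pose 1 ends up paired with its weaker candidate because its best candidate was already taken by a stronger pair.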

Results on the PoseTrack test set. Table 5 reports the results. Our big network - HRNet-W achieves the superior result, a mAP score and a MOTA score. Compared with the second best approach, the FlowTrack in SimpleBaseline [72], that uses ResNet- as the backbone, our approach gets and points gain in terms of mAP and MOTA, respectively. The superiority over the FlowTrack [72] is consistent with that on the COCO keypoint detection and MPII human pose estimation datasets. This further implies the effectiveness of our pose estimation network.

Entry Additional training Data mAP MOTA
ML-LAB [84] COCO+MPII-Pose
SOPT-PT [53] COCO+MPII-Pose
BUTD2 [29] COCO
MVIG [53] COCO+MPII-Pose
PoseFlow [53] COCO+MPII-Pose
ProTracker [19] COCO
HMPT [53] COCO+MPII-Pose
JointFlow [15] COCO
STAF [53] COCO+MPII-Pose
MIPAL [53] COCO
FlowTrack [72] COCO
HRNet-W COCO 74.9 57.9
Table 5: Results of pose tracking on the PoseTrack test set.

4.4 Ablation Study

We study the effect of each component in our approach on the COCO keypoint detection dataset. All results are obtained over the input size of except the study about the effect of the input size.

Method Final exchange Int. exchange across Int. exchange within
(a)
(b)
(c)
Table 6: Ablation study of exchange units that are used in repeated multi-scale fusion. Int. exchange across = intermediate exchange across stages, Int. exchange within = intermediate exchange within stages.
Figure 4: Qualitative results of some example images in the MPII (top) and COCO (bottom) datasets: containing viewpoint and appearance change, occlusion, multiple persons, and common imaging artifacts.
Figure 5: Ablation study of high and low representations. , , correspond to the representations of the high, medium, low resolutions, respectively.

Repeated multi-scale fusion. We empirically analyze the effect of the repeated multi-scale fusion. We study three variants of our network. (a) W/o intermediate exchange units ( fusion): There is no exchange between multi-resolution subnetworks except the last exchange unit. (b) W/ across-stage exchange units only ( fusions): There is no exchange between parallel subnetworks within each stage. (c) W/ both across-stage and within-stage exchange units (totally fusion): This is our proposed method. All the networks are trained from scratch. The results on the COCO validation set given in Table 6 show that the multi-scale fusion is helpful and more fusions lead to better performance.

Resolution maintenance. We study the performance of a variant of the HRNet: all the four high-to-low resolution subnetworks are added at the beginning and their depths are the same; the fusion schemes are the same as ours. Both our HRNet-W and the variant (with similar #Params and GFLOPs) are trained from scratch and tested on the COCO validation set. The variant achieves an AP of , which is lower than the AP of our small net, HRNet-W. We believe that the reason is that the low-level features extracted from the early stages over the low-resolution subnetworks are less helpful. In addition, the simple high-resolution network of similar parameter and computation complexities without low-resolution parallel subnetworks shows much lower performance.

Representation resolution. We study how the representation resolution affects the pose estimation performance from two aspects: check the quality of the heatmap estimated from the feature maps of each resolution from high to low, and study how the input size affects the quality.

We train our small and big networks initialized by the model pretrained for the ImageNet classification. Our network outputs four response maps from high to low resolutions. The quality of heatmap prediction over the lowest-resolution response map is too low and the AP score is below points. The AP scores over the other three maps are reported in Figure 5. The comparison implies that the resolution does impact the keypoint prediction quality.

Figure 6: Illustrating how the performances of our HRNet and SimpleBaseline [72] are affected by the input size.

Figure 6 shows how the input image size affects the performance in comparison with SimpleBaseline (ResNet-50) [72]. We can find that the improvement for the smaller input size is more significant than the larger input size, e.g., the improvement is points for and points for . The reason is that we maintain the high resolution through the whole process. This implies that our approach is more advantageous in the real applications where the computation cost is also an important factor. On the other hand, our approach with the input size outperforms the SimpleBaseline [72] with the large input size of .

5 Conclusion and Future Works

In this paper, we present a high-resolution network for human pose estimation, yielding accurate and spatially-precise keypoint heatmaps. The success stems from two aspects: (1) maintain the high resolution through the whole process without the need of recovering the high resolution; and (2) fuse multi-resolution representations repeatedly, rendering reliable high-resolution representations.

Future work includes applications to other dense prediction tasks, e.g., semantic segmentation, object detection, face alignment, and image translation, as well as the investigation of aggregating multi-resolution representations in a less light way. All of them are available at https://jingdongwang2017.github.io/Projects/HRNet/index.html.

Appendix

Results on the MPII Validation Set

We provide the results on the MPII validation set [2]. Our models are trained on a subset of the MPII training set and evaluated on a held-out validation set of 2975 images. The training procedure is the same as that for training on the whole MPII training set. For testing, the heatmap is computed as the average of the heatmaps of the original and flipped images. Following [77, 62], we also perform a six-scale pyramid testing procedure (multi-scale testing). The results are shown in Table 7.
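The flip-averaging step described above can be sketched as follows. This is a minimal illustration in plain Python rather than the released implementation: `predict` is a hypothetical stand-in for the network, images and heatmaps are represented as nested lists, and `flip_pairs` (the left/right joint indices) is an assumed input.

```python
def mirror(grid):
    """Flip a 2-D grid (list of rows) left-to-right."""
    return [row[::-1] for row in grid]


def flip_test(predict, image, flip_pairs):
    """Average the heatmaps of an image and of its mirrored copy.

    predict    -- hypothetical stand-in for the network: image -> list of K heatmaps
    flip_pairs -- (left, right) joint-index pairs that trade places under mirroring
    """
    heatmaps = predict(image)
    # Predict on the mirrored image, then mirror the result back into the original frame.
    flipped = [mirror(h) for h in predict(mirror(image))]
    # A mirrored left joint looks like a right joint, so swap the paired channels.
    for left, right in flip_pairs:
        flipped[left], flipped[right] = flipped[right], flipped[left]
    # Element-wise mean of the two predictions.
    return [
        [[(a + b) / 2.0 for a, b in zip(row_a, row_b)]
         for row_a, row_b in zip(h_a, h_b)]
        for h_a, h_b in zip(heatmaps, flipped)
    ]
```

The channel swap matters: without it, the mirrored prediction for a left joint would be averaged with the original prediction for the same left joint, even though it actually describes the right one.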

Method Hea Sho Elb Wri Hip Kne Ank Total
Single-scale testing
Newell et al. [40]
Yang et al. [77]
Tang et al. [62]
SimpleBaseline [72]
HRNet-W 90.3
Multi-scale testing
Newell et al. [40]
Yang et al. [77]
Tang et al. [62]
SimpleBaseline [72]
HRNet-W 90.8
Table 7: Performance comparisons on the MPII validation set (PCKh).
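For readers unfamiliar with the metric in Table 7, PCKh counts a predicted keypoint as correct when it lies within a fraction of the head size from the ground truth. The sketch below assumes the commonly used threshold of 0.5 and a simple list-based data layout; both are assumptions for illustration, and the official MPII evaluation code is not reproduced here.

```python
import math


def pckh(pred, gt, head_sizes, alpha=0.5):
    """Percentage of annotated keypoints within alpha * head size of the ground truth.

    pred, gt   -- per-person lists of (x, y) keypoints; a gt entry of None means unannotated
    head_sizes -- per-person head size in pixels
    alpha      -- assumed threshold fraction (0.5 is the usual choice)
    """
    correct = total = 0
    for p_person, g_person, head in zip(pred, gt, head_sizes):
        for p, g in zip(p_person, g_person):
            if g is None:
                continue  # skip joints without annotation
            total += 1
            if math.hypot(p[0] - g[0], p[1] - g[1]) <= alpha * head:
                correct += 1
    return 100.0 * correct / total if total else 0.0
```

Because the threshold scales with each person's head size, the score is comparable across people who appear at different scales in the image.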

More Results on the PoseTrack Dataset

We provide the results for all the keypoints on the PoseTrack dataset [1]. Table 8 shows the multi-person pose estimation performance on the PoseTrack dataset. Our HRNet-W achieves 77.3 and 74.9 points mAP on the validation and test sets, and outperforms the previous state-of-the-art method [72] by 0.6 points and 0.3 points, respectively. As a supplement to the results reported in the paper, Table 9 provides more detailed results of multi-person pose tracking performance on the PoseTrack2017 test set.

Method Head Sho. Elb. Wri. Hip Knee Ank. Total
PoseTrack validation set
Girdhar et al. [19]
Xiu et al. [75]
Xiao et al. [72]
HRNet-W 77.3
PoseTrack test set
Girdhar et al.* [19]
Xiu et al. [75]
Xiao et al.* [72]
HRNet-W* 74.9
Table 8: Multi-person pose estimation performance (mAP) on the PoseTrack2017 dataset. “*” means models trained on the train+validation set.
Method Head Sho. Elb. Wri. Hip Knee Ank. Total
Girdhar et al.* [19]
Xiu et al. [75]
Xiao et al.* [72]
HRNet-W* 57.9
Table 9: Multi-person pose tracking performance (MOTA) on the PoseTrack2017 test set. “*” means models trained on the train+validation set.
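As a reminder of what the MOTA score in Table 9 measures, the metric is commonly defined as one minus the ratio of all tracking errors (misses, false positives, identity switches) to the number of ground-truth objects, summed over frames. The sketch below follows that general definition; the per-frame tuple layout is an assumption, and the PoseTrack toolkit's per-keypoint protocol is not reproduced here.

```python
def mota(frames):
    """Multiple Object Tracking Accuracy over a sequence.

    frames -- iterable of (false_negatives, false_positives, id_switches, num_ground_truth)
              tuples, one per frame (assumed layout for this sketch)
    """
    errors = 0
    ground_truth = 0
    for fn, fp, idsw, gt in frames:
        errors += fn + fp + idsw
        ground_truth += gt
    if ground_truth == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    # The value can be negative when errors outnumber ground-truth objects.
    return 1.0 - errors / ground_truth
```

For example, two frames with 10 ground-truth objects each and 4 errors in total give a MOTA of 0.8, i.e., 80 points on the scale used in the table.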

Results on the ImageNet Validation Set

We apply our networks to the image classification task. The models are trained and evaluated on the ImageNet 2013 classification dataset [54]. We train our models for 100 epochs with a batch size of 256. The initial learning rate is set to 0.1 and is reduced by a factor of 10 at epochs 30, 60, and 90. Our models achieve performance comparable to that of networks specifically designed for image classification, such as ResNet [22]. Our HRNet-W has a single-model top-5 validation error of 6.5% and a single-model top-1 validation error of 22.7% with single-crop testing. Our HRNet-W gets better performance: 6.1% top-5 error and 22.1% top-1 error. We use the models trained on the ImageNet dataset to initialize the parameters of our pose estimation networks.
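The learning rate schedule described above can be written as a small function. This is a minimal sketch under one assumption: epochs are 0-indexed and the reduced rate already applies during the milestone epoch itself, which the text does not specify.

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Learning rate used during `epoch` under a step-decay schedule.

    The rate is multiplied by `gamma` once for every milestone already reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    # Print the rate at the start of each phase of the 100-epoch run.
    for epoch in (0, 30, 60, 90):
        print(epoch, step_lr(epoch))
```

In PyTorch, the same behavior is typically obtained with `torch.optim.lr_scheduler.MultiStepLR` using `milestones=[30, 60, 90]` and `gamma=0.1`; whether the authors used that scheduler is not stated in the text.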

Acknowledgements. The authors thank Dianqi Li and Lei Zhang for helpful discussions.

References

  • [1] M. Andriluka, U. Iqbal, A. Milan, E. Insafutdinov, L. Pishchulin, J. Gall, and B. Schiele. Posetrack: A benchmark for human pose estimation and tracking. In CVPR, pages 5167–5176, 2018.
  • [2] M. Andriluka, L. Pishchulin, P. V. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, pages 3686–3693, 2014.
  • [3] V. Belagiannis and A. Zisserman. Recurrent human pose estimation. In FG, pages 468–475, 2017.
  • [4] A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, volume 9911 of Lecture Notes in Computer Science, pages 717–732. Springer, 2016.
  • [5] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, pages 354–370, 2016.
  • [6] Z. Cao, T. Simon, S. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, pages 1302–1310, 2017.
  • [7] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In CVPR, pages 4733–4742, 2016.
  • [8] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2018.
  • [9] X. Chen and A. L. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS, pages 1736–1744, 2014.
  • [10] Y. Chen, C. Shen, X. Wei, L. Liu, and J. Yang. Adversarial posenet: A structure-aware convolutional network for human pose estimation. In ICCV, pages 1221–1230, 2017.
  • [11] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. CoRR, abs/1711.07319, 2017.
  • [12] C. Chou, J. Chien, and H. Chen. Self adversarial training for human pose estimation. CoRR, abs/1707.02439, 2017.
  • [13] X. Chu, W. Ouyang, H. Li, and X. Wang. Structured feature learning for pose estimation. In CVPR, pages 4715–4723, 2016.
  • [14] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang. Multi-context attention for human pose estimation. In CVPR, pages 5669–5678, 2017.
  • [15] A. Doering, U. Iqbal, and J. Gall. Joint flow: Temporal flow fields for multi person tracking, 2018.
  • [16] X. Fan, K. Zheng, Y. Lin, and S. Wang. Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation. In CVPR, pages 1347–1355, 2015.
  • [17] H. Fang, S. Xie, Y. Tai, and C. Lu. RMPE: regional multi-person pose estimation. In ICCV, pages 2353–2362, 2017.
  • [18] D. Fourure, R. Emonet, É. Fromont, D. Muselet, A. Trémeau, and C. Wolf. Residual conv-deconv grid network for semantic segmentation. In British Machine Vision Conference 2017, BMVC 2017, London, UK, September 4-7, 2017, 2017.
  • [19] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran. Detect-and-track: Efficient pose estimation in videos. In CVPR, pages 350–359, 2018.
  • [20] G. Gkioxari, A. Toshev, and N. Jaitly. Chained predictions using convolutional neural networks. In ECCV, pages 728–743, 2016.
  • [21] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. In ICCV, pages 2980–2988, 2017.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [23] P. Hu and D. Ramanan. Bottom-up and top-down reasoning with hierarchical rectified gaussians. In CVPR, pages 5600–5609, 2016.
  • [24] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense convolutional networks for efficient prediction. CoRR, abs/1703.09844, 2017.
  • [25] S. Huang, M. Gong, and D. Tao. A coarse-fine network for keypoint localization. In ICCV, pages 3047–3056. IEEE Computer Society, 2017.
  • [26] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, pages 1647–1655, 2017.
  • [27] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, pages 34–50, 2016.
  • [28] U. Iqbal, A. Milan, and J. Gall. Posetrack: Joint multi-person pose estimation and tracking. In CVPR, pages 4654–4663, 2017.
  • [29] S. Jin, X. Ma, Z. Han, Y. Wu, W. Yang, W. Liu, C. Qian, and W. Ouyang. Towards multi-person pose tracking: Bottom-up and top-down methods. In ICCV PoseTrack Workshop, 2017.
  • [30] A. Kanazawa, A. Sharma, and D. W. Jacobs. Locally scale-invariant convolutional neural networks. CoRR, abs/1412.5104, 2014.
  • [31] L. Ke, M. Chang, H. Qi, and S. Lyu. Multi-scale structure-aware network for human pose estimation. CoRR, abs/1803.09894, 2018.
  • [32] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • [33] M. Kocabas, S. Karagoz, and E. Akbas. Multiposenet: Fast multi-person pose estimation using pose residual network. In ECCV, volume 11215 of Lecture Notes in Computer Science, pages 437–453. Springer, 2018.
  • [34] C. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.
  • [35] I. Lifshitz, E. Fetaya, and S. Ullman. Human pose estimation using deep consensus voting. In ECCV, pages 246–260, 2016.
  • [36] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. In ECCV, pages 740–755, 2014.
  • [37] D. C. Luvizon, H. Tabia, and D. Picard. Human pose regression by combining indirect part detection and contextual information. CoRR, abs/1710.02322, 2017.
  • [38] A. Milan, L. Leal-Taixé, I. D. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. CoRR, abs/1603.00831, 2016.
  • [39] A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS, pages 2274–2284, 2017.
  • [40] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, pages 483–499, 2016.
  • [41] X. Nie, J. Feng, J. Xing, and S. Yan. Pose partition networks for multi-person pose estimation. In ECCV, September 2018.
  • [42] X. Nie, J. Feng, and S. Yan. Mutual learning to adapt for joint human parsing and pose estimation. In ECCV, September 2018.
  • [43] X. Nie, J. Feng, Y. Zuo, and S. Yan. Human pose estimation with parsing induced learner. In CVPR, June 2018.
  • [44] G. Ning, Z. Zhang, and Z. He. Knowledge-guided deep fractal neural networks for human pose estimation. IEEE Trans. Multimedia, 20(5):1246–1259, 2018.
  • [45] W. Ouyang, X. Chu, and X. Wang. Multi-source deep learning for human pose estimation. In CVPR, pages 2337–2344, 2014.
  • [46] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy. Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In ECCV, September 2018.
  • [47] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, pages 3711–3719, 2017.
  • [48] X. Peng, Z. Tang, F. Yang, R. S. Feris, and D. Metaxas. Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation. In CVPR, June 2018.
  • [49] T. Pfister, J. Charles, and A. Zisserman. Flowing convnets for human pose estimation in videos. In ICCV, pages 1913–1921, 2015.
  • [50] L. Pishchulin, M. Andriluka, P. V. Gehler, and B. Schiele. Poselet conditioned pictorial structures. In CVPR, pages 588–595, 2013.
  • [51] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In CVPR, pages 4929–4937, 2016.
  • [52] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In CVPR, 2017.
  • [53] PoseTrack. PoseTrack Leader Board. https://posetrack.net/leaderboard.php.
  • [54] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [55] M. Samy, K. Amer, K. Eissa, M. Shaker, and M. ElHelw. Nu-net: Deep residual wide field of view convolutional neural network for semantic segmentation. In CVPRW, June 2018.
  • [56] S. Saxena and J. Verbeek. Convolutional neural fabrics. In NIPS, pages 4053–4061, 2016.
  • [57] T. Sekii. Pose proposal networks. In ECCV, September 2018.
  • [58] K. Sun, C. Lan, J. Xing, W. Zeng, D. Liu, and J. Wang. Human pose estimation using global and local normalization. In ICCV, pages 5600–5608, 2017.
  • [59] K. Sun, M. Li, D. Liu, and J. Wang. IGCV3: interleaved low-rank group convolutions for efficient deep neural networks. In BMVC, page 101. BMVA Press, 2018.
  • [60] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei. Integral human pose regression. In ECCV, pages 536–553, 2018.
  • [61] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
  • [62] W. Tang, P. Yu, and Y. Wu. Deeply learned compositional models for human pose estimation. In ECCV, September 2018.
  • [63] Z. Tang, X. Peng, S. Geng, L. Wu, S. Zhang, and D. N. Metaxas. Quantized densely connected u-nets for efficient landmark localization. In ECCV, pages 348–364, 2018.
  • [64] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In CVPR, pages 648–656, 2015.
  • [65] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, pages 1799–1807, 2014.
  • [66] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In CVPR, pages 1653–1660, 2014.
  • [67] J. Wang, Z. Wei, T. Zhang, and W. Zeng. Deeply-fused nets. CoRR, abs/1605.07716, 2016.
  • [68] Z. Wang, W. Li, B. Yin, Q. Peng, T. Xiao, Y. Du, Z. Li, X. Zhang, G. Yu, and J. Sun. Mscoco keypoints challenge 2018. In Joint Recognition Challenge Workshop at ECCV 2018, 2018.
  • [69] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, pages 4724–4732, 2016.
  • [70] J. Wu, H. Zheng, B. Zhao, Y. Li, B. Yan, R. Liang, W. Wang, S. Zhou, G. Lin, Y. Fu, et al. Ai challenger: A large-scale dataset for going deeper in image understanding. arXiv preprint arXiv:1711.06475, 2017.
  • [71] F. Xia, P. Wang, X. Chen, and A. L. Yuille. Joint multi-person pose estimation and semantic part segmentation. In CVPR, pages 6080–6089, 2017.
  • [72] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In ECCV, pages 472–487, 2018.
  • [73] G. Xie, J. Wang, T. Zhang, J. Lai, R. Hong, and G. Qi. Interleaved structured sparse convolutional neural networks. In CVPR, pages 8847–8856. IEEE Computer Society, 2018.
  • [74] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, pages 1395–1403, 2015.
  • [75] Y. Xiu, J. Li, H. Wang, Y. Fang, and C. Lu. Pose flow: Efficient online pose tracking. In BMVC, page 53, 2018.
  • [76] Y. Xu, T. Xiao, J. Zhang, K. Yang, and Z. Zhang. Scale-invariant convolutional neural networks. CoRR, abs/1411.6369, 2014.
  • [77] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang. Learning feature pyramids for human pose estimation. In ICCV, pages 1290–1299, 2017.
  • [78] W. Yang, W. Ouyang, H. Li, and X. Wang. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In CVPR, pages 3073–3082, 2016.
  • [79] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, pages 1385–1392, 2011.
  • [80] T. Zhang, G. Qi, B. Xiao, and J. Wang. Interleaved group convolutions. In ICCV, pages 4383–4392, 2017.
  • [81] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, pages 6230–6239, 2017.
  • [82] L. Zhao, M. Li, D. Meng, X. Li, Z. Zhang, Y. Zhuang, Z. Tu, and J. Wang. Deep convolutional neural networks with merge-and-run mappings. In IJCAI, pages 3170–3176, 2018.
  • [83] Y. Zhou, X. Hu, and B. Zhang. Interlinked convolutional neural networks for face parsing. In ISNN, pages 222–231, 2015.
  • [84] X. Zhu, Y. Jiang, and Z. Luo. Multi-person pose estimation for posetrack with enhanced part affinity fields. In ICCV PoseTrack Workshop, 2017.