An Integral Pose Regression System for the ECCV2018 PoseTrack Challenge

09/17/2018 ∙ by Xiao Sun, et al. ∙ Microsoft

For the ECCV 2018 PoseTrack Challenge, we present a 3D human pose estimation system based mainly on the integral human pose regression method. We show a comprehensive ablation study to examine the key performance factors of the proposed system. Our system obtains 47mm MPJPE on the CHALL_H80K test dataset, placing second in the ECCV2018 3D human pose estimation challenge. Code will be released to facilitate future work.




1 Introduction

The ECCV2018 3D Human Pose Estimation Challenge evaluates proposed methods for estimating 3D key points of people from monocular RGB images. The challenge is based on the CHALL_H80K subset of the popular Human3.6M [1, 2] benchmark. CHALL_H80K contains 80K 3D human poses and corresponding images from 10 professional actors (6 male, 4 female) and 15 scenarios (discussion, smoking, taking photo, talking on the phone, etc.). Among them, 5 subjects (36K frames) are used for training, 2 subjects (20K frames) for validation and 4 subjects (24K frames) for testing. The images are captured in a controlled environment in which the subjects and background have a simple appearance. In CHALL_H80K, only 3D pose ground truth is provided, expressed in millimeters (mm) with the root (pelvis) joint as the origin. For evaluation, the mean per joint position error (MPJPE) [2] metric is used.
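To make the metric concrete, the following is a minimal NumPy sketch of MPJPE on root-relative poses. The array layout, the function name `mpjpe`, and the choice of index 0 for the pelvis are assumptions made for illustration; the official evaluation code may differ in details such as which joints are averaged.

```python
import numpy as np

def mpjpe(pred, gt, root=0):
    """Mean per joint position error, in the units of the inputs (mm here).

    pred, gt: arrays of shape (num_frames, num_joints, 3).
    root: index of the root (pelvis) joint; 0 is an assumption.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Express both poses relative to their own root joint.
    pred = pred - pred[:, root:root + 1]
    gt = gt - gt[:, root:root + 1]
    # Euclidean distance per joint, averaged over joints and frames.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```

Because both poses are re-centered on the root joint, a global translation of the prediction does not change the score.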

Human pose estimation has been extensively studied [2, 3, 4]. Recent years have seen significant progress on the problem due to advances in deep convolutional neural networks (CNNs). The best performing methods on 2D pose estimation are all detection-based [5]. They generate a likelihood heat map for each joint and locate the joint as the point with the maximum likelihood in the map. Promising extensions of the heat map approach have been presented for 3D pose estimation [6]. Most recently, Sun et al. [7] replaced the argmax post-processing with an integral formulation to tackle the quantization problem and enable end-to-end learning. This approach currently achieves the highest 3D pose estimation performance.

For the ECCV 2018 PoseTrack Challenge, we present a 3D human pose estimation system based mainly on the integral human pose regression [7] method. Within the rules of the challenge, some external data is used in our system besides CHALL_H80K. Specifically, 2D human pose data from MPII [3] and COCO [4] are used for both camera model estimation and 2D/3D mixed data training. COCO object bounding box data is used to train a person box detector for a two-stage top-down pose estimation paradigm similar to [7, 8]. ImageNet [9] classification data is employed for pre-training the backbone networks. No other external data is used. Our system obtains 47mm MPJPE on the CHALL_H80K test dataset, placing second in the ECCV2018 3D human pose estimation challenge [10]. Code will be released to facilitate future work.

2 Overview

Figure 1: Overview of 3D pose estimation framework.

Figure 1 provides an overview of our system. It contains three components. First, a person box detection component roughly localizes the person in the input RGB image. Second, a camera projection component is used to project 3D ground truth to the image coordinate system, as done in per-pixel/voxel classification based learning methods. These two components are used for data pre-processing. Third, the core CNN based learning component applies Integral Regression [7] to perform 3D human pose estimation.

Person Box Detection

Instead of predicting the 3D key points from the original image directly, we follow a two-stage top-down paradigm similar to [7, 8]. First, a person box detector is used to roughly localize the person. Then, a normalized local image patch is generated by cropping the original image with the box and resizing it to a fixed image resolution. The normalized local image patches are used as the final inputs of the CNN model. Since much of the background area is detected and removed in this first stage, the accuracy of the following key point detection model can be greatly improved. As shown in Table 2, we obtain a 25.4% relative improvement in MPJPE by using the person box detector.
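The patch normalization step can be sketched as a crop followed by a resize. The version below is an illustrative assumption rather than the system's actual pre-processing: it uses nearest-neighbour sampling to stay dependency-free, the function name `crop_and_resize` is hypothetical, and the default 256*256 output matches one of the patch sizes listed in Table 2.

```python
import numpy as np

def crop_and_resize(image, box, out_h=256, out_w=256):
    """Crop box = (x0, y0, x1, y1), in pixels, from an (H, W, C) image
    and resize it to (out_h, out_w) with nearest-neighbour sampling."""
    x0, y0, x1, y1 = box
    h, w = image.shape[:2]
    # Sample at the centres of the output pixels, mapped into the box.
    ys = y0 + (np.arange(out_h) + 0.5) * (y1 - y0) / out_h
    xs = x0 + (np.arange(out_w) + 0.5) * (x1 - x0) / out_w
    # Clamp so that boxes reaching past the border replicate edge pixels.
    yi = np.clip(np.floor(ys).astype(int), 0, h - 1)
    xi = np.clip(np.floor(xs).astype(int), 0, w - 1)
    return image[yi[:, None], xi[None, :]]
```

A production pipeline would more likely use bilinear interpolation and pad the box to the target aspect ratio before cropping; both are omitted here for brevity.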

Camera Projection Model

Following the detection based pose estimation paradigm, which is essentially a per pixel/voxel classification task, we first project the ground truth to the image coordinate system (in pixels). Since the CHALL_H80K dataset provides the 3D ground truth only in mm, we assume a weak perspective camera projection model and then estimate the camera model parameters by matching the 2D projection of the 3D ground truth with a coarse 2D key points estimation result, which is obtained from a pre-trained 2D model using external 2D data (MPII).
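One way to estimate such a weak perspective model is an ordinary least-squares fit of a single scale and a 2D translation. The text does not spell out the fitting procedure, so the closed-form solution below is an assumption, and the names `fit_weak_perspective` and `project` are hypothetical.

```python
import numpy as np

def fit_weak_perspective(joints_3d, joints_2d):
    """Least-squares fit of u = scale * X_xy + trans.

    joints_3d: (J, 3) root-relative ground truth in mm.
    joints_2d: (J, 2) coarse 2D key points in pixels.
    """
    xy = np.asarray(joints_3d, dtype=float)[:, :2]
    uv = np.asarray(joints_2d, dtype=float)
    xy_c = xy - xy.mean(axis=0)
    uv_c = uv - uv.mean(axis=0)
    # Closed-form optimum for a single isotropic scale.
    scale = (xy_c * uv_c).sum() / (xy_c ** 2).sum()
    trans = uv.mean(axis=0) - scale * xy.mean(axis=0)
    return scale, trans

def project(joints_3d, scale, trans):
    """Map mm coordinates to pixels; depth is scaled by the same factor."""
    out = np.asarray(joints_3d, dtype=float) * scale
    out[:, :2] += trans
    return out
```

Under this model the depth axis shares the image-plane scale, so the projected ground truth is expressed entirely in pixel units.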

At the testing phase, the camera model parameters are unknown. Hence, we are not able to recover the final 3D key points using camera back-projection. Instead, we scale the predicted 3D key points in pixels to conform to a particular average bone length to obtain the final prediction in mm. Different choices of bone length significantly affect the final performance, as shown in Table 1.

Bone Length Type | Per-frame | Avg Val | Avg Train | Avg Train+Val
MPJPE(mm)        | 60.0      | 64.6    | 65.6      | 65.2
Table 1: Effect of using different bone lengths at the testing phase.

Not surprisingly, using the per-frame ground truth average bone length substantially outperforms using other average bone lengths, such as those computed from the validation or training sets. In real applications, it is possible for users to provide this personalized information to strengthen the system. In this challenge, however, bone length information is not given on the test dataset, which leaves us with using an average bone length estimated from a given dataset. In practice, we use Avg Train+Val bone length in all of our experiments.
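The rescaling step can be sketched as follows. The skeleton `BONES` is a hypothetical three-bone example, not the CHALL_H80K joint definition, and the function names are illustrative; the target length would be whichever average bone length is chosen (Avg Train+Val in our experiments).

```python
import numpy as np

# Hypothetical (parent, child) joint index pairs, for illustration only.
BONES = [(0, 1), (1, 2), (0, 3)]

def mean_bone_length(pose, bones):
    """Average Euclidean length of the listed bones in a (J, 3) pose."""
    pose = np.asarray(pose, dtype=float)
    return float(np.mean([np.linalg.norm(pose[c] - pose[p]) for p, c in bones]))

def rescale_to_bone_length(pose_px, bones, target_mm, root=0):
    """Scale a root-relative pose in pixels so that its average bone
    length equals target_mm, giving a prediction in mm."""
    pose = np.asarray(pose_px, dtype=float)
    pose = pose - pose[root]
    return pose * (target_mm / mean_bone_length(pose, bones))
```

A single global scale factor is applied per pose, so body proportions predicted by the network are preserved and only the overall size is set by the chosen bone length.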

Integral Regression

Sun et al. [7] recently presented a simple and effective integral regression method that unifies the heat map representation and joint regression approaches in a manner that preserves the merits of both. It replaces the non-differentiable argmax post-processing with the differentiable integral operation, thus allowing end-to-end training and producing continuous output to solve the quantization problem. Moreover, they generalize this approach to the 3D human pose estimation problem for the first time and achieve the state of the art result on the Human3.6M dataset.
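The difference between the two read-outs can be illustrated with a short NumPy sketch. It is a simplified, forward-only illustration under assumed conventions (a single joint, a (D, H, W) volume of raw scores, softmax normalization over the whole volume); the function names are hypothetical and it is not the released implementation.

```python
import numpy as np

def argmax_joint(heatmap):
    """Quantized read-out: location of the maximum score, as (x, y, z)."""
    z, y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return np.array([x, y, z], dtype=float)

def integral_joint(heatmap):
    """Continuous read-out: expected location under the softmax-normalized
    heat map, as (x, y, z) in voxel coordinates."""
    d, h, w = heatmap.shape
    e = np.exp(heatmap - heatmap.max())   # subtract max for stability
    prob = e / e.sum()
    # Marginalize onto each axis, then take the expectation.
    x = (prob.sum(axis=(0, 1)) * np.arange(w)).sum()
    y = (prob.sum(axis=(0, 2)) * np.arange(h)).sum()
    z = (prob.sum(axis=(1, 2)) * np.arange(d)).sum()
    return np.array([x, y, z])
```

When two neighbouring voxels carry equal scores, `argmax_joint` must pick one of them, while `integral_joint` returns the point halfway between, which is the sub-voxel behaviour that addresses quantization.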

We use the Integral Regression method to train our 3D human pose estimation model. In [7], several variants of Integral Regression are proposed and investigated according to different heat map loss types and joint loss types. In their experiments, L1 joint loss without heat map loss pre-training achieves the best performance for the 3D pose estimation problem. We adopt this variant in all of our experiments.

The 3D pose estimation performance can be significantly improved by adding abundant 2D pose data to the training [7, 11]. This is feasible because the integral formulation generates predictions individually and maintains differentiability. In addition, our camera model estimation component projects 3D ground truth to the image coordinate system. As shown in Table 2, we get a 28.1% relative improvement in MPJPE by using external MPII 2D pose data for mixed 2D/3D training.
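A common way to combine the two kinds of supervision is to mask out the depth term of the joint loss for samples that only have 2D labels. The sketch below shows this with an L1 loss in NumPy; the masking scheme, the normalization by the number of supervised coordinates, and the name `masked_l1_loss` are assumptions for illustration.

```python
import numpy as np

def masked_l1_loss(pred, target, has_depth):
    """L1 joint loss for a mixed 2D/3D batch.

    pred, target: (batch, joints, 3) coordinates in pixels.
    has_depth: (batch,) booleans, False for 2D-only samples.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    weights = np.ones_like(pred)
    # 2D-only samples do not supervise the depth coordinate.
    weights[~np.asarray(has_depth, dtype=bool), :, 2] = 0.0
    # Average over the coordinates that are actually supervised.
    return float((np.abs(pred - target) * weights).sum() / weights.sum())
```

With this masking, whatever value is stored in the depth slot of a 2D-only label has no effect on the loss.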

3 Experiments


ResNet [12] is adopted as the backbone network and is pre-trained on the ImageNet classification dataset [9]. A normal distribution with 1e-3 standard deviation is used to initialize the head network parameters for integral regression. PyTorch [13] is used for implementation. Adam is employed for optimization. Data augmentation includes random translation (relative to the image size), scale, rotation (in degrees) and flip. In all experiments, the base learning rate is 1e-4. It drops to 1e-6 when the loss on the validation set saturates. The training iterations proceed until performance on the validation set saturates. Four GPUs are utilized. The mini-batch size is 64, and batch-normalization [14] is used. For our ablation study, the CHALL_H80K train and val datasets are used for training and evaluation, respectively. For the final challenge result on the CHALL_H80K test dataset, both train and val datasets are used for training. Other training details are provided with the individual experiments.

Ablation Study

Table 2 shows a comprehensive ablation study to examine the key performance factors of the proposed framework, including the two-stage paradigm using person box detection (25.4% relative improvement), and 2D and 3D mixed data training (28.1% relative improvement). Additionally, we investigate the effects of different training strategies, including deeper backbone networks (1.6% relative improvement) and larger image resolution (4.4% relative improvement), together with testing strategies, including flip testing (1.0% relative improvement) and model ensemble (2.8% relative improvement).

description       | box det. | dataset        | backbone   | patch size | flip test | ensemble | MPJPE(mm)
original baseline |          | HM36           | ResNet-50  | -          |           |          |
+person box det.  |          | HM36           | ResNet-50  | 256*256    |           |          |
+MPII data        |          | HM36+MPII      | ResNet-50  | 256*256    |           |          |
+deeper           |          | HM36+MPII      | ResNet-152 | 256*256    |           |          |
+larger image     |          | HM36+MPII      | ResNet-152 | 288*384    |           |          |
+COCO data        |          | HM36+MPII+COCO | ResNet-152 | 288*384    |           |          |
+flip test        |          | HM36+MPII+COCO | ResNet-152 | 288*384    |           |          |
+model ensemble   |          | HM36+MPII+COCO | ResNet-152 | 288*384    |           |          |
Table 2: Ablation study. All models are trained on CHALL_H80K train dataset and evaluated on CHALL_H80K val dataset.
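The two testing strategies in the ablation can be sketched as simple averaging schemes. The code below is an assumed formulation, not the system's actual test code: `FLIP_PAIRS` is a hypothetical list of left/right joint indices, `predict` stands for any function mapping an image patch to a (J, 3) pose in pixel-index coordinates, and all function names are illustrative.

```python
import numpy as np

# Hypothetical left/right joint index pairs, for illustration only.
FLIP_PAIRS = [(1, 2)]

def unflip_pose(pose_flipped, patch_width, flip_pairs):
    """Map a pose predicted on a mirrored patch back to the original."""
    pose = np.array(pose_flipped, dtype=float)
    pose[:, 0] = patch_width - 1 - pose[:, 0]   # mirror x back
    for left, right in flip_pairs:
        pose[[left, right]] = pose[[right, left]]   # swap left/right joints
    return pose

def flip_test(predict, image, flip_pairs):
    """Average the predictions on a patch and on its horizontal mirror."""
    pose = np.asarray(predict(image), dtype=float)
    pose_mirrored = predict(image[:, ::-1])
    restored = unflip_pose(pose_mirrored, image.shape[1], flip_pairs)
    return 0.5 * (pose + restored)

def ensemble(poses):
    """Average the (J, 3) predictions of several models."""
    return np.mean(np.stack(poses), axis=0)
```

Both strategies reduce prediction variance by averaging, at the price of additional forward passes at test time.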

Challenge Result

We use the best setting in Table 2, namely the last entry, to produce our final result on the CHALL_H80K test dataset. Both train and val datasets are used for training. Our system obtains 47mm MPJPE, giving it second place in the ECCV2018 3D human pose estimation challenge [10].