
View Invariant 3D Human Pose Estimation

01/30/2019
by Guoqiang Wei, et al. (USTC, Microsoft)

The recent success of deep networks has significantly advanced 3D human pose estimation from 2D images. The diversity of capturing viewpoints and the flexibility of human poses, however, remain significant challenges. In this paper, we propose a view invariant 3D human pose estimation module to alleviate the effects of viewpoint diversity. The framework consists of a base network, which provides an initial estimation of a 3D pose, a view-invariant hierarchical correction network (VI-HC) on top of that to learn the 3D pose refinement under consistent views, and a view-invariant discriminative network (VID) to enforce high-level constraints over body configurations. In VI-HC, the initial 3D pose inputs are automatically transformed to consistent views for further refinement at the global body and local body parts levels, respectively. For the VID, under consistent viewpoints, we use adversarial learning to differentiate between estimated poses and real poses to avoid implausible 3D poses. Experimental results demonstrate that the consistent viewpoints can dramatically enhance the performance. Our module shows robustness for different 3D pose base networks and achieves a significant improvement (about 9%) on the public 3D human pose estimation benchmark Human3.6M.


I Introduction

Estimating 3D human pose from an RGB image has been an active research field for many years. It facilitates a wide spectrum of applications such as human computer interaction, action recognition, sports performance analysis, and augmented reality [1]. This is still a challenging task, as it must not only overcome the barriers that exist in 2D pose estimation, such as the diversity in viewpoint, clothing, and lighting and the flexibility of body articulation, but also resolve the ambiguities in recovering depth from a 2D projection of 3D objects. The development of neural networks has advanced 3D human pose estimation [2, 3, 4, 5, 6, 7]. Generally, previous methods are categorized into two classes: i) training an end-to-end network to directly predict the 3D pose from an image [2, 3, 4], ii) estimating the 2D pose from an image first and then lifting the 2D pose to a 3D pose [5, 8, 6, 7].

Despite the general success of the end-to-end learning paradigm, two-step solutions, which consist of a convolutional neural network for predicting 2D joint locations from an image and a subsequent optimization step to recover the 3D pose, also achieve excellent performance [4, 6, 7]. One reason is that 2D pose estimation [9, 10, 11, 12] has made significant advances and can provide plausible estimates with high accuracy [8]. To infer a 3D pose from 2D pose input, many approaches have been designed, e.g., by memorization (i.e., matching) [13, 14, 8] or by regressing the 3D pose [5, 6, 7]. Even though promising results have been achieved, few approaches explore the challenge caused by the diversity of viewpoints. In practical scenarios, a person can be captured from an arbitrary viewpoint by a camera. Similarly, a person can perform an action towards different orientations with respect to a camera. These result in diverse viewpoints of the observed human poses. It is generally challenging for one model to tackle poses with such diverse viewpoints.

Fig. 1: The proposed framework for view invariant (VI) 3D pose estimation. Given a set of 2D joint locations of a person, the base network (a) predicts an initial 3D pose, which is then further corrected by the proposed (b) global view-invariant body correction and (c) local view-invariant body parts correction subnetworks (note that the Inv. Part Trans. module following Part Refine is not shown to save space) to generate refined 3D poses. Moreover, we enforce a high-level body configuration constraint during training by adversarial learning, where a view invariant discriminator is jointly trained with the generator (i.e., our 3D pose estimator) to distinguish the ground-truth poses from the generated ones under consistent views. Through Inv. Body Trans., the 3D pose is finally inversely transformed back to the original view as the final output.

In this work, we propose a general view invariant module to refine 3D poses generated from base networks. It alleviates the influence of viewpoint diversity by automatically transforming the intermediate 3D poses to a consistent view, facilitating 3D pose correction. As shown in Figure 1, we build our framework by stacking a base 3D pose estimation network and a view-invariant hierarchical correction network (VI-HC) to infer the 3D poses. To explore the global structure of the human body and the flexible configurations of local parts, VI-HC is constructed from a global view-invariant correction subnetwork and a local body parts view-invariant correction subnetwork, which efficiently refine the 3D poses by removing the influence of viewpoint diversity and the flexibility of body parts. Moreover, to enforce high-level constraints over body configurations, a simple discriminator is added under the consistent views as a high-level loss for training, which efficiently distinguishes plausible 3D poses from fake 3D poses. Note that the consistent view eases both the refinement in VI-HC and the discriminative correction from the discriminator. We validate the effectiveness of the view-invariant module on top of two different powerful base 3D pose estimation networks. Experimental results demonstrate that our proposed module improves the performance by about 9% and 3% on the two base networks, respectively, on the public 3D human pose estimation benchmark Human3.6M.

In summary, we make four main contributions:

  • We design a simple but powerful view invariant scheme to address the challenge of diverse viewpoints to enhance the 3D pose estimation performance.

  • Hierarchical view invariant correction subnetworks are designed to refine the 3D poses under consistent views with respect to the global body, and the five articulated body parts, respectively.

  • We propose the use of a simple view invariant discriminator to enforce a high-level constraint over body configurations to exclude implausible 3D poses.

  • Our module consistently improves the performance and shows robustness over different baselines.

II Related Work

II-A 3D Pose Estimation

Over recent years, we have witnessed tremendous progress in the field of 3D pose estimation. Tekin et al.[15] rely on an auto-encoder to learn a high-dimensional latent pose representation and regress 3D poses from 2D images. Pavlakos et al.[4] train a convolutional neural network to predict per-voxel likelihoods for each joint in a finely discretized 3D volumetric representation. Sun et al.[3] propose to regress joint locations and bone representations to exploit structural information. Extra 2D pose datasets are usually utilized to enhance the performance [16].

Different from those end-to-end 2D image to 3D pose regression approaches [15, 4, 3], Martinez et al.[6] show that a well-designed simple network for regressing 3D pose from 2D pose can perform quite competitively. One important reason is that the latest off-the-shelf 2D pose estimators, e.g., CPM [10] and the Stacked Hourglass Network [9], can provide 2D pose estimates with high accuracy. On top of this simple 2D pose to 3D pose estimation network [6], Fang et al.[7] propose a pose grammar that learns to refine the 3D pose using bi-directional RNNs, which is designed to explicitly incorporate a set of knowledge regarding human body configuration. To exploit temporal information for 3D human pose estimation, Hossain et al.[17] incorporate a sequence-to-sequence regression model using recurrent neural networks. Pavllo et al.[18] use dilated temporal convolutions to capture long-term information.

However, for 3D pose estimation, it is still challenging to handle poses with diverse viewpoints and highly flexible body parts. In the task of skeleton-based human action recognition [19], some attempts have been made to reduce the effect of viewpoint diversity. However, addressing the viewpoint variation challenge for 3D pose estimation is still overlooked and remains an open problem. In this work, we design a view invariant module to enhance the 3D pose estimation performance.

II-B Adversarial Learning

Goodfellow et al.[20] propose Generative Adversarial Networks (GAN) to learn effective generative models via an adversarial process. They simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample comes from the training data rather than from G. Motivated by this seminal work, adversarial approaches have been widely applied in various fields [21, 22].

Some recent works [11, 12, 23] adopt adversarial learning to encourage the deep network to acquire plausible human body configurations. For 2D pose estimation, Chou et al.[11] and Chen et al.[12] design discriminators to distinguish groundtruth keypoint heatmaps and 2D poses, respectively, from generated ones. Kanazawa et al.[23] add a discriminator to their 3D human body recovery network to determine whether the parameters of the generated 3D mesh correspond to real human bodies or not.

For 3D pose estimation, adding supervision on each joint (e.g., an L2 loss on joint locations) is widely used for optimizing the estimator without considering high-level human body configurations. This inevitably generates some implausible poses. Yang et al.[24] design a multi-source discriminator to distinguish the predicted 3D poses from the ground-truth. Three information sources, the image, a geometric descriptor, as well as the heatmaps and depth maps, are utilized for discrimination. In this work, we use a simple view invariant discriminator with 3D poses of consistent views as input to distinguish the fake 3D poses from real ones. Making the distinction under consistent views reduces the difficulty of adversarial learning.

III Proposed View Invariant Model

As shown in Figure 1, our framework consists of three modules: a base 3D pose estimation network, a view-invariant hierarchical correction network (VI-HC), and a view-invariant discriminator (VID). The base 3D pose estimation module generates an initial estimated 3D pose from 2D joint locations. The proposed VI-HC refines the estimated 3D poses under consistent views with global and local transformations. The view-invariant discriminator is used to enhance the performance of the generator, i.e., our pose estimator, by enforcing high-level constraints under consistent views during training.

Generally, the mapping function from a 2D pose to a 3D pose can be formulated as:

$\hat{Y} = \mathcal{F}(X; \theta), \quad X \in \mathbb{R}^{2 \times J}, \ \hat{Y} \in \mathbb{R}^{3 \times J},$   (1)

where $\theta$ denotes the learnable parameters of the model function $\mathcal{F}$, and $J$ denotes the number of joints. The objective of the whole model is to estimate each 3D pose $\hat{Y}$ as close to the ground-truth 3D pose $Y$ as possible.

III-A Base 3D Pose Estimation Network

Any network that provides 3D pose estimation could be taken as our base network. To demonstrate the robustness of our proposed view invariant module, we take two powerful, representative yet simple networks, proposed by [6] and [17], as our base networks. Of the two base networks, one is designed for single-frame 3D pose estimation while the other estimates 3D poses for a sequence by exploiting temporal information.

Baseline-1. This is the simple network proposed by Martinez et al.[6], which we use to obtain our initial 3D pose from the 2D joint locations input. Note that the 2D joint locations are obtained from a 2D pose estimator, i.e., the Stacked Hourglass network. This base network is built by stacking two fully connected blocks with residual connections. Each block consists of several linear fully connected layers, followed by batch normalization, a dropout layer, and a ReLU activation layer. The network first encodes the input 2D pose into high-dimensional discriminative features and then projects these representations to 3D space to provide an initial 3D pose $\hat{Y}$. We represent the coordinate of the $j$-th joint of $\hat{Y}$ as $\hat{y}_j \in \mathbb{R}^3$.
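For concreteness, the following PyTorch sketch shows one such fully connected residual architecture. The hidden size, dropout rate, and layer ordering here are illustrative assumptions rather than the exact configuration of [6]:

```python
import torch
import torch.nn as nn

class LinearResBlock(nn.Module):
    """One fully connected block with a residual connection (hidden size and
    dropout rate are illustrative choices)."""
    def __init__(self, hidden=1024, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.Dropout(p_drop), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.Dropout(p_drop), nn.ReLU(),
        )

    def forward(self, x):                  # x: (B, hidden)
        return x + self.net(x)             # residual connection

class Baseline1(nn.Module):
    """2D pose -> initial 3D pose, in the spirit of Baseline-1."""
    def __init__(self, n_joints=16, hidden=1024, n_blocks=2):
        super().__init__()
        self.n_joints = n_joints
        self.encode = nn.Linear(2 * n_joints, hidden)     # encode the 2D pose into features
        self.blocks = nn.Sequential(*[LinearResBlock(hidden) for _ in range(n_blocks)])
        self.decode = nn.Linear(hidden, 3 * n_joints)     # project features to 3D space

    def forward(self, pose_2d):            # pose_2d: (B, 2 * n_joints)
        feat = self.blocks(self.encode(pose_2d))
        return self.decode(feat).view(-1, self.n_joints, 3)
```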

Baseline-2. This is the network proposed by Hossain et al.[17], which we use to obtain our initial 3D poses from a sequence of 2D joint location inputs. To exploit the temporal information across a sequence of 2D joint locations for estimating 3D poses, it uses a sequence-to-sequence LSTM network with residual connections on the decoder side.

III-B View Invariant Hierarchical Correction

Fig. 2: Location distributions of different joints before and after transformation. The distribution of the head joint (with the root joint located at the coordinate origin) (a) before and (b) after the global transformation. (c) After the global transformation, the distribution of the flexible left wrist joint is still scattered (with the left shoulder joint located at the coordinate origin). (d) After the local transformation for the left arm part, the distribution of the left wrist is more concentrated (with the left shoulder joint located at the coordinate origin).

In real life, 3D human poses present great diversity in viewpoints. Such diversity results in scattered distributions for a joint, as shown by the examples in Figure 2, and requires a more powerful 3D pose estimation model to handle all the poses with diverse views. To that end, we propose a view-invariant hierarchical correction module which consists of a global body correction stage and a local body parts correction stage, as illustrated in Figure 1 (b) and (c).

In the global body correction stage, a global transformation module transforms the initial 3D poses from the base network to a consistent view, followed by a refinement subnetwork. Similarly, in the local body parts correction stage, we divide the body into five parts, following a partition similar to that of previous works [25, 26, 7]. For each part, a local part transformation module transforms that body part to a consistent view for further refinement.

Global body correction. In the global body correction stage, we focus on correcting 3D poses under a consistent viewpoint with respect to the global body. The body transformation module transforms the body pose to a consistent view, e.g., facing towards the camera with the upper body upright, as illustrated in Figure 3.

Fig. 3: Illustration of transformation for global body correction.

Specifically, we define a plane $\Pi$ determined by the locations of three joints: the left hip, the right hip, and the chest. The root point, which is the midpoint of the left and right hips, is defined as the coordinate origin. Within the network, a transformation is performed to make the normal vector $\vec{n}$ of $\Pi$ parallel with the $Z$ axis and the vector $\vec{v}$ from the root point to the chest parallel with the $Y$ axis. $\vec{n}$ and $\vec{v}$ are used to determine the rotation angles with respect to the axes. The transformed poses face the camera with the upper body upright.

The transformation of the global body in 3D space is formulated as:

$\hat{y}^{g}_{j} = R_x(\alpha)\,R_y(\beta)\,R_z(\gamma)\,\hat{y}_{j}, \quad j = 1, \dots, J,$   (2)

where $\hat{y}^{g}_{j}$ denotes the coordinate of the $j$-th joint after the transformation and $\hat{y}_{j}$ the coordinate before it (with the root point as the origin). $R_x(\alpha)$, $R_y(\beta)$, and $R_z(\gamma)$ denote the rotation matrices along the $X$, $Y$, and $Z$ axes, with rotation angles of $\alpha$, $\beta$, $\gamma$, respectively. For example, $R_x(\alpha)$ can be represented as:

$R_x(\alpha) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix}.$   (3)

Note that $\alpha$, $\beta$, $\gamma$ are obtained from the vectors $\vec{n}$ and $\vec{v}$.

After the transformation of the initial 3D poses, global body refinement with a fully connected linear subnetwork is performed under this consistent view to effectively correct the 3D poses.
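A small numpy sketch of this alignment is shown below. It builds the rotation directly from an orthonormal basis derived from $\vec{n}$ and $\vec{v}$, which is equivalent to composing the Euler rotations of Eqs. (2)-(3); the specific target-axis convention and the joint-index arguments are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def global_view_transform(pose, l_hip, r_hip, chest):
    """pose: (J, 3) array of joint coordinates; l_hip, r_hip, chest: joint indices.
    Returns the pose in the consistent view and the rotation matrix R used."""
    root = 0.5 * (pose[l_hip] + pose[r_hip])             # root point: midpoint of the hips
    pose = pose - root                                   # root point as the coordinate origin
    v = pose[chest]                                      # vector from the root point to the chest
    n = np.cross(pose[l_hip] - pose[chest], pose[r_hip] - pose[chest])  # normal of plane Pi
    z = n / np.linalg.norm(n)                            # plane normal -> Z axis (face the camera)
    y = v - np.dot(v, z) * z                             # root-to-chest -> Y axis (upright upper body)
    y = y / np.linalg.norm(y)
    x = np.cross(y, z)                                   # completes a right-handed basis
    R = np.stack([x, y, z])                              # rows are the new axes in the old frame
    return pose @ R.T, R                                 # rotated joints; invert later with R.T
```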

Local body parts correction. Human bodies are non-rigid and very flexible. The global body transformation reduces the viewpoint diversity and makes the distribution of a joint more concentrated (see Figure 2 (b) versus (a)). However, there is still significant flexibility and viewpoint diversity for some joints on local body parts. As illustrated in Figure 2 (c), some flexible joints such as the wrist still have scattered distributions after the global body transformation.

To further reduce the viewpoint diversity with respect to each body part and achieve view invariance, we propose local body part transformations for five body parts, i.e., the left/right arms, the left/right legs, and the chest-thorax-jaw-head joint part. After the local body part transformations, the views are much more consistent and thus the location distribution of each joint is more concentrated, as shown in Figure 2 (d) versus (c). For each body part, similar to the global transformation, we take three joints to form a plane and obtain its normal vector $\vec{n}_p$. Then, a transformation is performed to make $\vec{n}_p$ parallel to the $Z$ axis, and the vector $\vec{v}_p$ formed by two joints (e.g., the shoulder and elbow joints for the arm part) parallel to the $Y$ axis. Similarly, the transformation parameters are obtained from the vectors $\vec{n}_p$ and $\vec{v}_p$. Note that for the arm part, the upper arm (shoulder and elbow joints) is taken as the sub-part to form the vector $\vec{v}_p$. For the leg part, the upper leg (hip and knee joints) is taken as the sub-part to form the vector $\vec{v}_p$. For the chest-thorax-jaw-head joint chain, the connection of the chest and thorax joints is taken as the sub-part to form the vector $\vec{v}_p$.

For each body part, a refinement subnetwork similar to that for global body refinement is used to further correct that part under a consistent view.

Note that after the local body parts correction, the five refined parts are inversely transformed back based on the part transformation parameters and combined into a full body pose. Similarly, based on the global transformation parameters, the full body is finally inversely transformed back to the original view to obtain the final 3D pose, as illustrated in Figure 1.
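The following PyTorch sketch illustrates how the pieces compose: base network, global transform and refinement, per-part transforms and refinements, and the inverse transforms back to the original view. The module names, interfaces, and the residual-style corrections are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class VIPoseEstimator(nn.Module):
    """Sketch of the base + VI-HC pipeline. Refinement modules are assumed to map
    (B, J, 3) -> (B, J, 3); the transform callables return a (B, 3, 3) rotation."""

    def __init__(self, base_net, global_refine, part_refines, part_joint_ids):
        super().__init__()
        self.base_net = base_net                          # 2D pose -> initial 3D pose
        self.global_refine = global_refine                # correction under the consistent body view
        self.part_refines = nn.ModuleList(part_refines)   # one subnetwork per body part
        self.part_joint_ids = part_joint_ids              # joint indices of the five parts

    @staticmethod
    def rotate(pose, R):                                  # pose: (B, J, 3), R: (B, 3, 3)
        return torch.einsum('bij,bkj->bki', R, pose)

    def forward(self, pose_2d, body_transform, part_transforms):
        pose_3d = self.base_net(pose_2d)                  # (a) initial 3D pose
        R_body = body_transform(pose_3d)                  # (b) rotate to the consistent body view
        pose_c = self.rotate(pose_3d, R_body)
        pose_c = pose_c + self.global_refine(pose_c)      # global refinement (residual correction)
        refined = pose_c.clone()
        for ids, refine, part_tf in zip(self.part_joint_ids, self.part_refines, part_transforms):
            R_part = part_tf(pose_c[:, ids])              # (c) per-part consistent view
            part = self.rotate(pose_c[:, ids], R_part)
            part = part + refine(part)                    # local part refinement
            refined[:, ids] = self.rotate(part, R_part.transpose(1, 2))  # Inv. Part Trans.
        return self.rotate(refined, R_body.transpose(1, 2))              # Inv. Body Trans.
```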

III-C View Invariant Discriminator Module

We encourage our pose estimator to exploit prior knowledge of human body configurations and avoid implausible 3D poses via adversarial learning. Considering that it is easier to learn patterns from data of consistent viewpoints than from data of diverse viewpoints, we design a simple discriminator to distinguish generated 3D poses from real 3D poses under consistent views.

During adversarial learning, our 3D pose estimator, which is composed of the base network and the view-invariant hierarchical correction network, is taken as the generator G. The estimated 3D poses are treated as "fake" samples (label 0) while the groundtruth 3D poses are treated as "real" samples (label 1) to train the discriminator D. The goal of G is to generate poses whose distribution is similar to that of the groundtruth 3D poses, so that the discriminator cannot differentiate between the real samples and the estimated poses.

We propose to conduct the adversarial learning on the 3D poses under consistent viewpoints rather than the original views, as illustrated in Figure 1. The more concentrated the pose distribution becomes after removing viewpoint variation, the easier it is to optimize the discriminator. Otherwise, the discriminator needs to distinguish fake from real samples across various views, which is more challenging.

As the generator G, the pose estimator tries to predict poses that look as real as possible to fool the discriminator by optimizing the following additional loss:

$\mathcal{L}_{adv} = \mathcal{L}_{bce}(D(\hat{Y}^{g}), 1),$   (4)

where $\mathcal{L}_{bce}$ is the binary cross entropy loss, $\hat{Y}^{g}$ denotes the estimated 3D pose under the consistent view (ahead of the Inv. Body Trans. as shown in Figure 1), and $D(\cdot)$ represents the classification score of the discriminator.

The loss for training the discriminator is

$\mathcal{L}_{D} = \mathcal{L}_{bce}(D(Y^{g}), 1) + \mathcal{L}_{bce}(D(\hat{Y}^{g}), 0),$   (5)

where $Y^{g}$ denotes the groundtruth 3D pose under the consistent view, i.e., after the global body transformation.
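A compact PyTorch sketch of these two losses is given below. The discriminator here is a small MLP over the flattened consistent-view pose, mirroring the simple four-fully-connected-layer design mentioned in Sec. IV-C; hidden sizes and function names are illustrative assumptions.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def make_discriminator(n_joints=16, hidden=128):
    """Simple discriminator over a flattened 3D pose under the consistent view."""
    return nn.Sequential(
        nn.Linear(3 * n_joints, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1), nn.Sigmoid(),               # classification score in [0, 1]
    )

def generator_adv_loss(D, pose_fake_cv):
    """Eq. (4): the pose estimator tries to make D label its poses as real (1)."""
    score = D(pose_fake_cv.flatten(1))
    return bce(score, torch.ones_like(score))

def discriminator_loss(D, pose_real_cv, pose_fake_cv):
    """Eq. (5): D labels groundtruth poses as 1 and estimated poses as 0."""
    s_real = D(pose_real_cv.flatten(1))
    s_fake = D(pose_fake_cv.detach().flatten(1))          # do not backpropagate into the estimator
    return bce(s_real, torch.ones_like(s_real)) + bce(s_fake, torch.zeros_like(s_fake))
```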

III-D Joint Learning

Our view invariant 3D pose estimation network is an end-to-end network. We jointly train it based on the loss

$\mathcal{L} = \mathcal{L}_{3D} + \lambda\,\mathcal{L}_{adv},$   (6)

where $\mathcal{L}_{3D}$ is the loss on the joints for both the global body correction and local body parts correction stages, $\mathcal{L}_{adv}$ is the cross-entropy loss from the discriminator (Eq. (4)), and $\lambda$ is a hyperparameter which is experimentally determined.
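A joint-training step might look like the following sketch, reusing the loss functions from the previous code block. The `estimator`, discriminator `D`, data loader, learning rates, and the convention that the estimator returns both the final pose and its consistent-view version are illustrative assumptions, and the joint-position loss $\mathcal{L}_{3D}$ is collapsed to a single MSE term for brevity.

```python
import torch
import torch.nn.functional as F

lam = 0.001                                               # the weight lambda from Sec. IV-C
opt_g = torch.optim.Adam(estimator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

for pose_2d, gt_3d, gt_3d_cv in loader:                   # gt_3d_cv: groundtruth in the consistent view
    pred_3d, pred_3d_cv = estimator(pose_2d)              # final pose and its consistent-view version
    # update the discriminator with Eq. (5)
    opt_d.zero_grad()
    discriminator_loss(D, gt_3d_cv, pred_3d_cv).backward()
    opt_d.step()
    # update the pose estimator with the joint loss of Eq. (6)
    opt_g.zero_grad()
    loss = F.mse_loss(pred_3d, gt_3d) + lam * generator_adv_loss(D, pred_3d_cv)
    loss.backward()
    opt_g.step()
```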

IV Experiments

IV-A Datasets

Similar to previous works [2, 3, 6], we conduct our experiments on the popular Human3.6M dataset [27]. Human3.6M is currently the largest publicly available dataset for 3D human pose estimation. It consists of 3.6 million images in which 7 professional actors perform 15 everyday activities such as purchasing, walking, and sitting down. Four cameras are used for capture. We also show qualitative results on the MPII 2D pose estimation dataset [28], for which ground-truth 3D poses are not available and the images are captured in the wild.

IV-B Evaluation Metrics and Protocols

To fully validate the effectiveness of our proposed method, we use several commonly used evaluation metrics as follows.

Joint Error: the mean per joint position error (MPJPE). Most previous 3D pose estimation works use this metric [2, 3, 6, 15, 14, 29].

PA Joint Error: the MPJPE is calculated after aligning the predicted 3D pose and groundtruth 3D pose via a rigid transformation using Procrustes Analysis [30].

Bone Error: the mean per bone position error. It measures the relative joint location accuracy compared with the groundtruth [3].

Bone Std: the bone length standard deviation. It measures the stability of the bone length [3].
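Minimal numpy sketches of the two joint-error metrics are given below. The Procrustes alignment follows the standard similarity-transform (rotation, translation, scale) solution and is an illustration, not the exact evaluation code of the benchmark.

```python
import numpy as np

def mpjpe(pred, gt):
    """Joint Error: mean Euclidean distance per joint; pred, gt: (J, 3) in mm."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def pa_mpjpe(pred, gt):
    """PA Joint Error: MPJPE after aligning pred to gt with Procrustes analysis."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    X, Y = pred - mu_p, gt - mu_g                        # centre both poses
    U, S, Vt = np.linalg.svd(X.T @ Y)                    # optimal rotation via SVD
    M = U @ Vt
    if np.linalg.det(M) < 0:                             # avoid an improper reflection
        U[:, -1] *= -1
        S[-1] *= -1
        M = U @ Vt
    s = S.sum() / (X ** 2).sum()                         # optimal scale
    return mpjpe(s * X @ M + mu_g, gt)
```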

We report performance of our proposed method using three protocols.

Protocol #1: For Human3.6M, the standard Protocol #1 is to use all four camera views of subjects S1, S5, S6, S7 and S8 for training and the subjects S9 and S11 for testing. We report the MPJPE in millimeters between the groundtruth and our prediction across all joints.

Protocol #2: Under the same setting as Protocol #1, the evaluation by PA Joint Error is referred to as Protocol #2 [6, 7, 4].

Protocol #3: With only a limited number of camera views, it is easy for a pose estimation model to overfit to them. To validate the generalization ability of a model, Fang et al. [7] propose a cross-view protocol, where only 3 camera views are used for training and the remaining one is used for testing. We refer to this protocol as Protocol #3.
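For reference, the three protocols can be summarized as configuration entries such as the sketch below. The camera indices and dictionary keys are placeholders for illustration only; Protocol #2 differs from Protocol #1 solely in the evaluation metric.

```python
# Hypothetical split configuration for the three protocols described above.
HUMAN36M_PROTOCOLS = {
    "P1": {"train_subjects": ["S1", "S5", "S6", "S7", "S8"],
           "test_subjects": ["S9", "S11"],
           "train_cameras": [0, 1, 2, 3], "test_cameras": [0, 1, 2, 3],
           "metric": "MPJPE"},
    "P2": {"train_subjects": ["S1", "S5", "S6", "S7", "S8"],
           "test_subjects": ["S9", "S11"],
           "train_cameras": [0, 1, 2, 3], "test_cameras": [0, 1, 2, 3],
           "metric": "PA-MPJPE"},
    "P3": {"train_subjects": ["S1", "S5", "S6", "S7", "S8"],
           "test_subjects": ["S9", "S11"],
           "train_cameras": [0, 1, 2], "test_cameras": [3],   # cross-view split
           "metric": "MPJPE"},
}
```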

Fig. 4: Qualitative results of our proposed method on Human3.6M (first two rows) and MPII (last row). The last example in the second row is a failure case because of the lack of appearance information.

IV-C Implementation Details

We implement our method using PyTorch [31]. In our base networks, similar to [6, 7, 17], the state-of-the-art stacked hourglass network [9], pre-trained on MPII and fine-tuned on Human3.6M, is used to estimate 2D poses on Human3.6M.

Our refinement network VI-HC has a similar architecture to the base network [6], where linear blocks and residual connections are used. Differently, the numbers of neurons of our fully connected layers are 400 and 800, respectively. For the discriminator, we adopt a simple design composed of four fully connected layers. We set $\lambda$ to 0.001. The entire network is trained end-to-end, with the base network pretrained first.

IV-D Ablation Studies

Method Avg. Δ
Baseline-1 (Martinez et al. ICCV’17) 62.8 -
B+HC 61.8 -1.0
B+VI-LC 60.2 -2.6
B+VI-GC 59.8 -3.0
B+VI-HC 58.4 -4.4
B+VI-HC-D 57.8 -5.0
B+VI-HC-VID (ours) 57.1 -5.7
TABLE I: Ablation studies of different components in our design on top of Baseline-1 on Human3.6M under Protocol #1, in terms of MPJPE (mm).
Method Avg. Δ
Baseline-1 (Martinez et al. [6] ICCV’17) 74.2 -
B+VI-HC-VID (ours) 68.9 -5.3
TABLE II: Our full design (B+VI-HC-VID) on top of a weaker variant of Baseline-1 on Human3.6M under Protocol #1, in terms of MPJPE (mm).

To demonstrate the effectiveness of each component of our proposed model, we perform various experiments on Human3.6M under Protocol #1 and report the MPJPE results in Table I. Without loss of generality, we perform the ablation study on top of Baseline-1 [6].

  • Baseline-1: the baseline 3D pose estimation network [6] which we take as our base network.

  • B+HC: the Base network [6] followed by Hierarchical Correction without view transformations.

  • B+VI-GC (or B+VI-LC): the Base network followed by View Invariant Global (or Local) Correction only.

  • B+VI-HC: the Base network followed by View Invariant Hierarchical Correction.

  • B+VI-HC-D: the B+VI-HC scheme with a Discriminator during training, where the discriminator operates under the original views.

  • B+VI-HC-VID: the B+VI-HC scheme with the View Invariant Discriminator. This is our final scheme. It is the same as B+VI-HC-D except that the discriminator operates under the transformed views rather than the original views.

  • Baseline-1 (weaker variant): a weaker baseline 3D pose estimation network which uses only one fully connected block rather than two [6] (Table II).

  • B+VI-HC-VID (on the weaker variant): the weaker Baseline-1 followed by our proposed VI-HC and VID (Table II).

Metric Joint Error PA Joint Error Bone Error Bone Std
Method Baseline Ours Baseline Ours Baseline Ours Baseline Ours
Knee( Hip) 64.6 50.8 63.4 21.1
Ankle( Knee) 87.6 64.7 83.5 26.5
Wrist( Elbow) 112.9 81.8 76.7 28.9
Elbow( Shoulder) 86.4 57.1 61.7 19.8
Shoulder( Thorax) 60.7 35.8 40.5 9.2
Avg. 68.1 51.3 51.6 16.0
TABLE III: Detailed results for some joints and bones for Baseline-1 [6] and Ours based on Baseline-1, under Protocol #2 on Human3.6M.
Bone pairs U.Arm L.Arm U.Leg L.Leg Avg.
Baseline 17.2 26.2 13.1 16.1 18.2
Ours 11.3 17.2 6.8 6.9 10.6
TABLE IV: Evaluations of the symmetry of limbs for Baseline-1 [6] and Ours based on Baseline-1, under Protocol #2 on Human3.6M.

As shown in Table I, the view invariant global body correction (B+VI-GC) and the view invariant local body parts correction (B+VI-LC) reduce the MPJPE by 3.0mm and 2.6mm, respectively. Combining the two levels of correction achieves a 4.4mm error reduction. Moreover, the adversarial learning under consistent views further decreases the error by 1.3mm, and our final scheme achieves a 5.7mm reduction.

Consistent viewpoint helps effectively refine 3D poses. To verify that correcting the initial 3D poses under consistent viewpoints is more effective, we also evaluate the scheme with two-stage 3D pose correction but without view transformations, i.e., B+HC. Compared with the Baseline, this scheme only reduces the MPJPE by 1.0mm, while the correction with view transformations reduces the MPJPE by 4.4mm. There are two main reasons. i) The additional correction introduces more parameters, so it becomes harder to train the model. Martinez et al.[6] also report that a deeper network does not improve the performance. ii) The poses come from various viewpoints, and it is not easy for a model to handle such high diversity. In contrast, our model with view transformations, B+VI-HC, transforms the poses to consistent views and makes the learning easier.

B+VI-GC produces superior results to B+VI-LC. In B+VI-LC, each type of body part has a subnetwork which only makes use of intra-part rather than cross-part information for refinement. In contrast, B+VI-GC takes all the joints as input for refinement. First, more context information (all the joints) is used for refinement in the global correction B+VI-GC than in the local part correction B+VI-LC. Second, local corrections can only improve the local joint details relative to the part, but have difficulty correcting the position errors of the part as a whole, due to the lack of context information. The global errors could contribute more to the total error.

Consistent viewpoint helps the discriminator. If we feed the poses under the original views to the discriminator (B+VI-HC-D), the adversarial learning gains 0.6mm compared with the scheme not using adversarial learning (B+VI-HC). In contrast, the gain increases to 1.3mm if we feed the poses transformed to consistent views to the discriminator (B+VI-HC-VID). Poses with consistent views help better train the discriminator and further improve the performance of the pose estimator.

To validate the robustness of our view invariant design on different base networks, we also conduct experiments on a weaker base network, which is built using only one fully connected block rather than two [6], referred to as the weaker Baseline-1. Similarly, our view invariant scheme achieves an error reduction of 5.3mm, as shown in Table II.

Method Direct. Discuss Eating Greet Phone Photo Pose Purch. Sitting SittingD. Smoke Wait WalkD. Walk WalkT. Avg.
Zhou et al.[32] (CVPR’16) 87.4 109.3 87.1 103.2 116.2 143.3 106.9 99.8 124.5 199.2 107.4 118.1 114.2 79.4 97.7 113.0
Du et al.[33] (ECCV’16) 85.1 112.7 104.9 122.1 139.1 135.9 105.9 166.2 117.5 226.9 120.0 117.7 137.4 99.3 106.5 126.5
Park et al.[34] (ECCVW’16) 100.3 116.2 90.0 116.5 115.3 149.5 117.6 106.9 137.2 190.8 105.8 125.1 131.9 62.6 96.2 117.3
Pavlakos et al.[4] (CVPR’17) 67.4 71.9 66.7 69.1 72.0 77.0 65.0 68.3 83.7 96.5 71.7 65.8 74.9 59.1 63.2 71.9
Zhou et al.[2] (ICCV’17) 54.8 60.7 58.2 71.4 62.0 65.5 53.8 55.6 75.2 111.6 64.1 66.0 51.4 63.2 55.3 64.9
Sun et al.[3] (ICCV’17) 52.8 54.8 54.2 54.3 61.8 53.1 53.6 71.7 86.7 61.5 67.2 53.4 47.1 61.6 53.4 59.1
Fang et al.[7] (AAAI’18) 50.1 54.3 57.0 57.1 66.6 73.3 53.4 55.7 72.8 88.6 60.3 57.7 62.7 47.5 50.6 60.4
Hossain et al.[17] (ECCV’18) 48.4 50.7 57.2 55.2 63.1 72.6 53.0 51.7 66.1 80.9 59.0 57.3 62.4 46.6 49.6 58.3
Pavlakos et al.[35] (CVPR’18) (wo/ Ord) 59.1
Pavlakos et al.[35]* (CVPR’18) 48.5 54.4 54.4 52.0 59.4 65.3 49.9 52.9 65.8 71.1 56.6 52.9 60.9 44.7 47.8 56.2
Yang et al.[24] (CVPR’18) 51.5 58.9 50.4 57.0 62.1 65.4 49.8 52.7 69.2 85.2 57.4 58.4 43.6 60.1 47.7 58.6
Martinez et al.[6] (ICCV’17) (Baseline-1) 51.8 56.2 58.1 59.0 69.5 78.4 55.2 58.1 74.0 94.6 62.3 59.1 65.1 49.5 52.4 62.9
Ours (with Baseline-1) 46.6 54.0 55.1 55.2 61.4 69.8 52.0 52.6 68.1 75.0 56.7 56.0 60.5 44.5 48.7 57.1
Hossain et al.[17] (ECCV’18) (Baseline-2) 48.4 50.7 57.2 55.2 63.1 72.6 53.0 51.7 66.1 80.9 59.0 57.3 62.4 46.6 49.6 58.3
Ours (with Baseline-2) 48.5 49.5 55.0 52.5 62.1 69.5 52.7 49.6 63.9 76.6 57.4 55.8 60.3 46.5 49.3 56.6
TABLE V: Comparisons on Human3.6M under Protocol #1 in terms of MPJPE (mm). The underlined numbers represent the better results between ours and the baseline. Note that the 2D inputs are obtained with a fine-tuned stacked hourglass 2D pose detector for both baselines and our schemes. For the work of Pavlakos et al.[35] marked by (*), additional annotations of the ordinal depth on the 2D human pose datasets are utilized. Pavlakos et al.[35] (wo/ Ord) denotes the results without using the ordinal depth annotations.
Method Direct. Discuss Eating Greet Phone Photo Pose Purch. Sitting SittingD. Smoke Wait WalkD. Walk WalkT. Avg.
Zhou et al.[32] (CVPR’16) 99.7 95.8 87.9 116.8 108.3 107.3 93.5 95.3 109.1 137.5 106.0 102.2 106.5 110.4 115.2 106.7
Bogo et al.[36] (ECCV’16) 62.0 60.2 67.8 76.5 92.1 77.0 73.0 75.3 100.3 137.3 83.4 77.3 86.8 79.7 87.7 82.3
Nie et al.[29] (ICCV’17) 62.8 69.2 79.6 78.8 80.8 72.5 73.9 96.1 106.9 88.0 86.9 70.7 71.9 76.5 73.2 79.5
Moreno-Noguer[5] (CVPR’17) 66.1 61.7 84.5 73.7 65.2 67.2 60.9 67.3 103.5 74.6 92.6 69.6 71.5 78.0 73.2 74.0
Pavlakos et al.[4] (CVPR’17) 51.9
Fang et al.[7] (AAAI’18) 38.2 41.7 43.7 44.9 48.5 55.3 40.2 38.2 54.5 64.4 47.2 44.3 47.3 36.7 41.7 45.7
Hossain et al.[17] (ECCV’18) 35.7 39.3 44.6 43.0 47.2 54.0 38.3 37.5 51.6 61.3 46.5 41.4 47.3 34.2 39.4 44.1
Pavlakos et al. [35]* (CVPR’18) 34.7 39.8 41.8 38.6 42.5 47.5 38.0 36.6 50.7 56.8 42.6 39.6 43.9 32.1 36.5 41.8
Yang et al.[24] (CVPR’18) 26.9 30.9 36.3 39.9 43.9 47.4 28.8 29.4 36.9 58.4 41.5 30.5 29.5 42.5 32.2 37.7
Martinez et al.[6] (ICCV’17) (Baseline-1) 39.5 43.2 46.4 47.0 51.0 56.0 41.4 40.6 56.5 69.4 49.2 45.0 49.5 38.0 43.1 47.7
Ours (with Baseline-1) 35.8 41.0 42.4 44.1 45.9 50.6 39.5 37.7 52.2 56.6 45.5 41.7 46.6 33.7 38.6 43.4
Hossain et al.[17] (ECCV’18) (Baseline-2) 35.7 39.3 44.6 43.0 47.2 54.0 38.3 37.5 51.6 61.3 46.5 41.4 47.3 34.2 39.4 44.1
Ours (with Baseline-2) 35.9 38.2 42.8 41.9 45.6 51.5 38.1 36.9 50.1 58.1 45.5 39.6 45.2 34.6 39.1 42.8
TABLE VI: Comparisons on Human3.6M under Protocol #2 in terms of MPJPE (mm) using PA Joint Error metric. Note that for the work Pavlakos et al.[35] marked by (*), additional annotations of the ordinal depth on the 2D human pose datasets are utilized.
Method Direct. Discuss Eating Greet Phone Photo Pose Purch. Sitting SittingD. Smoke Wait WalkD. Walk WalkT. Avg.
Pavlakos et al.[4] (CVPR’17) 79.2 85.2 78.3 89.9 86.3 87.9 75.8 81.8 106.4 137.6 86.2 92.3 72.9 82.3 77.5 88.6
Nie et al.[29] (ICCV’17) 103.9 103.6 101.1 111.0 118.6 105.2 105.1 133.5 150.9 113.5 117.7 108.1 100.3 103.8 104.4 112.1
Zhou et al.[2] (ICCV’17) 61.4 70.7 62.2 76.9 71.0 81.2 67.3 71.6 96.7 126.1 68.1 76.7 63.3 72.1 68.9 75.6
Fang et al.[7] (AAAI’18) 57.5 57.8 81.6 68.8 75.1 85.8 61.6 70.4 95.8 106.9 68.5 70.4 73.8 58.5 59.6 72.8
Martinez et al.[6] (ICCV’17) (Baseline-1) 65.7 68.8 92.6 79.9 84.5 100.4 72.3 88.2 109.5 130.8 76.9 81.4 85.5 69.1 68.2 84.9
Ours (with Baseline-1) 56.4 60.9 69.1 70.0 72.4 84.1 60.3 71.3 82.9 89.0 67.1 70.9 74.0 66.4 65.2 70.8
TABLE VII: Comparisons on Human3.6M under Protocol #3 in terms of MPJPE (mm). Samples of three camera views are used for training and those of the other one are used for testing.
Fig. 5: Our method can correct the implausible pose.

IV-E Plausibility Analysis

Table III shows the performance improvement of our proposed method for all joints and bones. The columns with the Joint Error and PA Joint Error results show that the joints far away from the root joint (e.g., wrist, elbow) are more difficult to estimate, since these joints are more flexible and have more scattered spatial distributions. Our proposed method dramatically improves the accuracy of these joints thanks to the hierarchical correction design, in which the view transformations reduce the diversity of viewpoints and ease the refinements. We have similar observations on the bone errors (in terms of Bone Error and Bone Std), where the bones of the four limbs have larger errors.

To evaluate the robustness of our scheme in terms of body structure preservation, we present in Table IV the symmetry evaluation results of the Baseline and Ours. The symmetry metric is defined as the difference between the left and right limb lengths. The smaller the difference, the better the body structure is preserved. Our method predicts plausible 3D poses with more symmetrical structures.
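A tiny numpy sketch of this symmetry measure follows; the joint indices and pair names are illustrative assumptions.

```python
import numpy as np

def bone_length(pose, j_a, j_b):
    """Length of the bone between joints j_a and j_b; pose: (J, 3)."""
    return np.linalg.norm(pose[j_a] - pose[j_b])

def limb_symmetry_error(pose, left_pair, right_pair):
    """Absolute difference between left and right limb lengths,
    e.g. left_pair = (l_shoulder, l_elbow) vs right_pair = (r_shoulder, r_elbow)."""
    return abs(bone_length(pose, *left_pair) - bone_length(pose, *right_pair))
```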

We show some visualization results on Human3.6M and MPII in Figure 4. The results show that our method predicts plausible 3D poses well. Our model can also correct some implausible poses thanks to the view invariant correction and adversarial learning. Figure 5 shows a case where the baseline model estimates a pose with an incorrect body structure, while our model provides a much better prediction.

IV-F Comparison with the State-of-the-Art

We compare our schemes based on the two base networks with the state-of-the-art approaches on Human3.6M.

Table V presents the comparisons under Protocol #1. Our proposed method achieves an MPJPE of 57.1 mm. Compared with the powerful baseline networks [6] and [17], marked as Baseline-1 and Baseline-2, our schemes achieve about 9% and 3% improvements, respectively, and obtain gains on nearly all the action classes. For some challenging actions with diverse postures and viewpoints, such as sitting down, our method gains about 21% over Baseline-1. These results validate that our proposed model is effective by taking viewpoint consistency into account. Based on Baseline-2 [17], our scheme achieves the best performance (MPJPE 56.6 mm) among the state-of-the-art approaches except Pavlakos et al.[35]* (CVPR’18), which uses additional annotations of the ordinal depth on the 2D human pose datasets. Our performance is superior to that of Pavlakos et al.[35] (CVPR’18) (wo/ Ord), i.e., when the ordinal depth annotations are not used. Note that the relative gain on Baseline-2 is smaller than that on Baseline-1 because the higher the performance of a baseline, the smaller the remaining room for improvement.

Under Protocol #2, as shown in Table VI, our proposed method outperforms the corresponding baseline models on nearly all the action classes. The improvement is about 9% over Baseline-1 [6] and 3% over Baseline-2. Our performance is inferior to that of Yang et al.[24]. One possible reason is that our base networks take offline-estimated 2D poses to estimate the 3D pose and thus lose the opportunity to further exploit the image information. Our performance is comparable to that of the other state-of-the-art approaches.

Under Protocol #3, as shown in Table VII, our model improves the performance significantly, i.e., by about 16%, compared with Baseline-1 [6], and outperforms the recently published best results. Our scheme is robust to inputs of unseen viewpoints, since the initial poses are transformed to consistent views within the network.

V Conclusions

In this paper, we propose a view invariant 3D human pose estimation framework to advance the state of the art. The VI-HC subnetwork, which transforms the initial 3D poses to consistent views, is designed to efficiently correct the 3D poses. A view invariant discriminator is introduced to impose high-level constraints over body configurations and further improve the performance. Experimental results demonstrate that our proposed framework improves the performance significantly compared with powerful baseline methods and is robust across different baselines.

References

  • [1] N. Sarafianos, B. Boteanu, B. Ionescu, and I. A. Kakadiaris, “3D human pose estimation: A review of the literature and analysis of covariates.” Computer Vision and Image Understanding, vol. 152, pp. 1–20, 2016.
  • [2] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei, “Towards 3D human pose estimation in the wild: A weakly-supervised approach.” in International Conference on Computer Vision, 2017, pp. 398–407.
  • [3] X. Sun, J. Shang, S. Liang, and Y. Wei, “Compositional human pose regression.” in International Conference on Computer Vision, 2017, pp. 2621–2630.
  • [4] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis, “Coarse-to-fine volumetric prediction for single-image 3D human pose,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1263–1272.
  • [5] F. Moreno-Noguer, “3D human pose estimation from a single image via distance matrix regression.” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1561–1570.
  • [6] J. Martinez, R. Hossain, J. Romero, and J. J. Little, “A simple yet effective baseline for 3d human pose estimation.” in International Conference on Computer Vision, 2017, pp. 2659–2668.
  • [7] H. Fang, Y. Xu, W. Wang, X. Liu, and S. Zhu, “Learning pose grammar to encode human body configuration for 3D pose estimation,” in AAAI Conference on Artificial Intelligence, 2018.
  • [8] C. Chen and D. Ramanan, “3D human pose estimation = 2D pose estimation + matching.” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5759–5767.
  • [9] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation.” in European Conference on Computer Vision, vol. 9912, 2016, pp. 483–499.
  • [10] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines.” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
  • [11] C. Chou, J. Chien, and H. Chen, “Self adversarial training for human pose estimation.” in IEEE Conference on Computer Vision and Pattern RecognitionW, 2017.
  • [12] Y. Chen, C. Shen, X. Wei, L. Liu, and J. Yang, “Adversarial posenet: A structure-aware convolutional network for human pose estimation.” in International Conference on Computer Vision, 2017, pp. 1221–1230.
  • [13] H. Jiang, “3D human pose reconstruction using millions of exemplars.” in International Conference on Pattern Recognition, 2010, pp. 1674–1677.
  • [14] H. Yasin, U. Iqbal, B. Krüger, A. Weber, and J. Gall, “A dual-source approach for 3D pose estimation from a single image.” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4948–4956.
  • [15] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua, “Direct prediction of 3D body poses from motion compensated sequences.” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 991–1000.
  • [16] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt, “Monocular 3d human pose estimation in the wild using improved cnn supervision,” in 3DV, 2017, pp. 506–516.
  • [17] M. R. I. Hossain and J. J. Little, “Exploiting temporal information for 3d human pose estimation,” in European Conference on Computer Vision.   Springer, 2018, pp. 69–86.
  • [18] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli, “3d human pose estimation in video with temporal convolutions and semi-supervised training,” 2018.
  • [19] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng, “View adaptive recurrent neural networks for high performance human action recognition from skeleton data.” in International Conference on Computer Vision, 2017, pp. 2136–2145.
  • [20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” in Advances in Neural Information Processing Systems, Jun. 2014.
  • [21] M. Mirza and S. Osindero, “Conditional generative adversarial nets.” CoRR, vol. abs/1411.1784, 2014.
  • [22] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” arXiv preprint arXiv:1605.05396, 2016.
  • [23] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, “End-to-end recovery of human shape and pose.” CoRR, vol. abs/1712.06584, 2017.
  • [24] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang, “3d human pose estimation in the wild by adversarial learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2018.
  • [25] X. Chu, W. Ouyang, H. Li, and X. Wang, “Structured feature learning for pose estimation.” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4715–4723.
  • [26] M. F. Ghezelghieh, R. Kasturi, and S. Sarkar, “Learning camera viewpoint using CNN to improve 3D body pose estimation,” in International Conference on 3D Vision (3DV), 2016, pp. 685–693.
  • [27] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6m: Large scale datasets and predictive methods for 3D human sensing in natural environments.” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1325–1339, 2014.
  • [28] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2D human pose estimation: New benchmark and state of the art analysis,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [29] B. X. Nie, P. Wei, and S. Zhu, “Monocular 3D human pose estimation by predicting depth on joints.” in International Conference on Computer Vision, 2017, pp. 3467–3475.
  • [30] J. C. Gower, “Generalized procrustes analysis,” Psychometrika, vol. 40, no. 1, pp. 33–51, Mar 1975.
  • [31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch.” 2017.
  • [32] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis, “Sparseness meets deepness: 3D human pose estimation from monocular video,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4966–4975.
  • [33] Y. Du, Y. Wong, Y. Liu, F. Han, Y. Gui, Z. Wang, M. Kankanhalli, and W. Geng, “Marker-less 3D human motion capture with monocular image sequence and height-maps.” in European Conference on Computer Vision, vol. 9908, 2016, pp. 20–36.
  • [34] S. Park, J. Hwang, and N. Kwak, “3D human pose estimation using convolutional neural networks with 2D pose information,” in European Conference on Computer Vision Workshops (3), vol. 9915, 2016, pp. 156–169.
  • [35] G. Pavlakos, X. Zhou, and K. Daniilidis, “Ordinal depth supervision for 3D human pose estimation,” in CVPR, 2018.
  • [36] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black, “Keep it smpl: Automatic estimation of 3D human pose and shape from a single image.” in European Conference on Computer Vision, vol. 9909, 2016, pp. 561–578.