Inference Stage Optimization for Cross-scenario 3D Human Pose Estimation

by Jianfeng Zhang, et al.

Existing 3D human pose estimation models suffer a performance drop when applied to new scenarios with unseen poses, owing to their limited generalizability. In this work, we propose a novel framework, Inference Stage Optimization (ISO), for improving the generalizability of 3D pose models when source and target data come from different pose distributions. Our main insight is that the target data, even though unlabeled, carry valuable priors about their underlying distribution. To exploit such information, ISO performs geometry-aware self-supervised learning (SSL) on each single target instance and updates the 3D pose model before making a prediction. In this way, the model can mine distributional knowledge about the target scenario and quickly adapt to it with enhanced generalization performance. In addition, to handle sequential target data, we propose an online mode for implementing our ISO framework via streaming the SSL, which substantially enhances its effectiveness. We systematically analyze why and how our ISO framework works on diverse benchmarks under the cross-scenario setup. Remarkably, it yields a new state-of-the-art of 83.6% 3D PCK on MPI-INF-3DHP, improving upon the previous best result by 9.7%.




1 Introduction

3D human pose estimation aims to localize 3D human body joints in images or videos. As a fundamental task in computer vision, it is widely applied to human-robot interaction Errity (2016), action recognition Yan et al. (2018), human tracking Mehta et al. (2017b), etc. This task is commonly solved in a fully-supervised manner with gold-standard annotations Martinez et al. (2017); Zhou et al. (2017); Mehta et al. (2017b); Zhao et al. (2019) that are collected in well-controlled laboratory environments Ionescu et al. (2014). Despite their success in constrained scenarios, these methods generalize poorly to new scenarios (e.g., in-the-wild scenes), due to severe differences in the underlying distributions (e.g., varying poses, camera viewpoints, body sizes and appearances).

Recent works address this generalization challenge by leveraging either data augmentation strategies such as image composition Mehta et al. (2017b) and synthesis Chen et al. (2016); Varol et al. (2017), or more complicated model learning strategies such as introducing kinematics priors Zhou et al. (2017); Dabral et al. (2018), separating 2D and depth features Sun et al. (2017); Mehta et al. (2017a); Sun et al. (2018); Habibie et al. (2019) or adopting adversarial learning Yang et al. (2018); Drover et al. (2018); Wandt and Rosenhahn (2019). However, they are still limited to cases where training and test samples have similar poses, and otherwise tend to suffer a large performance drop, since the trained model is commonly biased toward the training distribution and generalizes poorly to an unseen, very different one.

In this work, we propose a novel scheme named Inference Stage Optimization (ISO). Instead of focusing on improving model training, ISO improves and adapts the model at its inference stage before making predictions (Fig. 1). Our insight is that the target samples, although not labeled, carry valuable information about their distribution, which could be exploited to help adapt the model in the inference stage for correcting unfavorable training bias and improving generalization performance.

However, exploiting such priors from unlabeled data is highly non-trivial. Inspired by the recent success of self-supervised learning (SSL) techniques in learning good representations from unlabeled samples in other domains, we propose to leverage SSL to explore the underlying prior from unlabeled target instances. Unlike general objects, human poses present a clear and informative geometric structure. We therefore deploy two different SSL methods, namely random projection adversary Drover et al. (2018) and geometric cycle consistency Chen et al. (2019a), which are simple but effective at learning geometry-aware representations. ISO thus enables the model to mine both geometric and distributional information from target instances and quickly adapt to the target scenario. As such, the model can estimate 3D poses more reliably across different scenarios, even in the presence of severe distribution shifts.

Concretely, when training on labeled source data, instead of only performing fully-supervised learning (FSL) Martinez et al. (2017); Sun et al. (2017); Zhao et al. (2019), our proposed ISO trains the model with FSL and SSL jointly. Such a training scheme enables the model to leverage geometry-wise feedback from SSL to learn representations and estimate 3D poses. This also facilitates model optimization in the inference stage. During inference, ISO adapts the model parameters to the new scenario and distribution via performing SSL on each target instance. Equipped with such instance-specific adaptation, the model can estimate 3D pose for each sample from the new target scenario accurately. In addition, we also develop an online ISO to accumulate the learned adaptation knowledge from a sequence of target samples, which would speed up model adaptation and reduce computational overhead.

We conduct extensive experiments under the cross-scenario setup: training a model on Human3.6M Ionescu et al. (2014) and evaluating it on MPI-INF-3DHP Mehta et al. (2017a) and 3DPW von Marcard et al. (2018). Notably, ISO achieves new state-of-the-art accuracy, 83.6% 3D PCK on MPI-INF-3DHP, improving upon the previous best result by 9.7%.

Our contributions are four-fold. 1) To the best of our knowledge, we are among the first to explore the practical cross-scenario 3D pose estimation task and develop an effective solution (i.e., ISO). Unlike existing works, we explore how to effectively adapt the models during the inference stage. 2) We identify and investigate two simple SSL techniques suitable for 3D pose estimation under the ISO framework, to exploit geometric and distributional knowledge from unlabeled target data. 3) We develop an online ISO framework, which handles sequential data effectively and naturally applies to practical scenes where data usually arrive online and sequentially. 4) We explain why and how ISO works for cross-scenario generalization through systematic analysis, which may inspire future work on improving the generalization of human pose estimation.

Figure 1: Illustration of our main idea. We consider the cross-scenario setting where the model is trained on the source scenario (e.g., indoor scenes) but applied to a new target scenario (e.g., in-the-wild scenes). Existing methods (upper panel) usually use the trained model for predictions directly, which suffers a performance drop under such a cross-scenario setup. In contrast, ISO adapts the model at its inference stage by performing self-supervised learning (SSL) on unlabeled target samples before making predictions (bottom panel), which largely improves its generalizability. The red arrow represents the back-propagation based model update. Errors are marked with black arrows.

2 Related work

3D pose estimation.

Many deep learning methods have been proposed for 3D pose estimation from 2D representations (e.g., images or poses) Tekin et al. (2016); Chen et al. (2016); Varol et al. (2017); Martinez et al. (2017); Sun et al. (2017, 2018); Fang et al. (2018); Zhao et al. (2019); Nibali et al. (2019); Cai et al. (2019); Sharma et al. (2019); Zhou et al. (2019), and they rely heavily on well-annotated datasets. These methods easily overfit to distribution-specific patterns such as camera views and pose subjects, and can hardly generalize to new scenarios. To improve their generalizability, semi- and weakly-supervised methods Tung (2017); Zhou et al. (2017); Yang et al. (2018); Dabral et al. (2018); Wandt and Rosenhahn (2019); Habibie et al. (2019); Wang et al. (2019); Chen et al. (2019b); Pavllo et al. (2019) have been developed. Some Zhou et al. (2017); Dabral et al. (2018); Pavllo et al. (2019) use kinematics priors for regularization or post-processing; others Yang et al. (2018); Wandt and Rosenhahn (2019) leverage adversarial training or separate 2D and depth features Sun et al. (2017); Mehta et al. (2017a); Sun et al. (2018); Habibie et al. (2019) for domain adaptation. Despite encouraging results, the applicability of these methods is still restricted to the scope defined by the datasets they are trained on. Recently, several geometry-driven self-supervised methods Rhodin et al. (2018); Drover et al. (2018); Chen et al. (2019a); Kocabas et al. (2019); Pirinen et al. (2019); Li et al. (2020) use geometric consistency or epipolar constraints to generate 3D poses automatically. Different from all the above methods, we are the first to learn distributional information from target instances at the inference stage via SSL, which proves effective for out-of-distribution 3D pose estimation.

Learning on target instances.

Learning on target instances has emerged as a powerful technique for mining complex data distributions and priors. Bau et al. (2019) improve photo manipulation performance by adapting image priors to the statistics of an individual target image. Sun et al. (2019b) leverage a rotation prediction pretext task for solving domain shift in image classification. Shocher et al. (2018) perform super-resolution of a target image via learning to recover it from its downsampled counterpart. However, these methods cannot be directly applied to 3D pose estimation. In this work, we propose a novel ISO framework to improve 3D pose estimation under the cross-scenario setup by mining geometric and distributional knowledge from target instances.

3 Method

3.1 Problem formulation

Let $I$ denote an image and $x$ denote the 2D spatial coordinates of the keypoints of the human in the image; $X$ denotes the corresponding 3D joint positions. We consider the following cross-scenario setup: the model is trained on a source scenario $\mathcal{S}$ (e.g., indoor scenes) with pose distribution $P_{\mathcal{S}}$, and applied to a new scenario $\mathcal{T}$ (e.g., in-the-wild scenes) with unseen poses, viewpoints, body sizes and appearances drawn from a different distribution $P_{\mathcal{T}}$.

Empirically, a pose distribution can be disentangled into appearance and geometry factors Rhodin et al. (2018). The cross-scenario setup faces drift of the pose distribution w.r.t. both of them. However, drift of the appearance distribution is handled well by powerful off-the-shelf 2D pose estimators. We therefore focus on addressing the drift w.r.t. pose geometry (i.e., poses, viewpoints, etc.). We work directly with skeleton data and aim to obtain a 3D pose model that lifts 2D poses to 3D ones and adapts well to a new scenario.

Suppose we have pairs $(x, X)$ of 2D and corresponding 3D poses drawn i.i.d. from the source distribution $P_{\mathcal{S}}$. Existing methods usually train a 3D pose model on these training samples and apply it directly to target samples drawn from the target distribution $P_{\mathcal{T}}$. In particular, the model with parameters $\theta$ is trained in a fully-supervised learning (FSL) scheme:

$$\min_{\theta} \; \mathbb{E}_{(x, X) \sim P_{\mathcal{S}}} \, \mathcal{L}_{F}(x, X; \theta), \tag{1}$$

where $\mathcal{L}_{F}$ is a fully-supervised loss. Generally, $\mathcal{L}_{F}$ is defined as the mean squared error (MSE) between the predicted and ground-truth (GT) poses Martinez et al. (2017). Several earlier works complement such a loss with a bone supervision loss Sun et al. (2017); Zhao et al. (2019). Accordingly, $\mathcal{L}_{F}$ is formulated as

$$\mathcal{L}_{F} = \| X - \hat{X} \|_2^2 + \| B - \hat{B} \|_2^2. \tag{2}$$

Here $X$ and $\hat{X}$ denote the GT and predicted 3D poses, respectively; $B$ and $\hat{B}$ denote the GT and predicted bone vectors computed from $X$ and $\hat{X}$, respectively Sun et al. (2017). The obtained model is typically biased toward the training samples and thus suffers from limited generalizability.
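As a minimal sketch of the fully-supervised loss described above, the following Python snippet sums a pose MSE term and a bone MSE term. It rests on assumptions that are not stated in the text: the two terms are weighted equally, and the skeleton is a hypothetical 4-joint chain described by a parent list (`PARENTS`, `bones`, `mse` and `fsl_loss` are illustrative names only).

```python
# Toy skeleton: PARENTS[j] is the parent joint of j (-1 marks the root).
# This 4-joint chain is a hypothetical example, not the Human3.6M skeleton.
PARENTS = [-1, 0, 1, 2]


def bones(pose):
    """Bone vectors: child joint minus parent joint, for every non-root joint."""
    return [tuple(c - p for c, p in zip(pose[j], pose[PARENTS[j]]))
            for j in range(len(pose)) if PARENTS[j] >= 0]


def mse(a, b):
    """Mean squared Euclidean distance between two lists of 3D vectors."""
    return sum(sum((u - v) ** 2 for u, v in zip(p, q))
               for p, q in zip(a, b)) / len(a)


def fsl_loss(pred, gt):
    """Pose MSE plus bone MSE (equal weighting is an assumption)."""
    return mse(pred, gt) + mse(bones(pred), bones(gt))


gt = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0), (0.0, 3.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0), (1.0, 3.0, 0.0)]
print(fsl_loss(pred, gt))
```

In this toy case only the last joint is displaced, so both the joint term and the bone term that touches it contribute to the loss, which shows how the bone term penalizes skeleton deformation in addition to joint position error.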

Figure 2: Overall pipeline of ISO. (a) We first train our model by solving optimization of both FSL and SSL tasks in the source scenario with labeled data. During inference, given each unlabeled target sample, (b) we first perform SSL on it to update network parameters and (c) exploit the adapted network for final pose estimation.

3.2 Inference stage optimization

We introduce our Inference Stage Optimization (ISO) framework, which allows a 3D pose model to mine geometric and distributional knowledge from target instances during the inference stage and adapt to new scenarios with improved generalization performance. For simplicity, we consider a 3D pose model implemented by a $K$-layer neural network with parameters $\theta_k$ for layer $k$. The stacked parameter vector $\theta = (\theta_1, \dots, \theta_K)$ specifies the entire model for 3D pose estimation. The overall pipeline is illustrated in Fig. 2.

3.2.1 Training

Similar to existing methods, when training on the source scenario $\mathcal{S}$, our model parameters $\theta$ can be updated by solving the optimization problem in Eqn. (1). We call this the fully-supervised learning (FSL) task. However, our ISO also performs a self-supervised learning (SSL) task with self-supervised loss $\mathcal{L}_{S}$ to train the pose estimation model, so that it can learn to adapt via SSL feedback in the inference stage.

We choose two geometry-aware SSL methods to exploit pose geometry information from the skeleton data: random projection adversary Drover et al. (2018) and geometric cycle consistency Chen et al. (2019a), which are effective at geometry adaptation. Note that more SSL methods can be explored within our framework in the future.


The idea of random projection adversary SSL (ISO-Adversary) is that if a 2D pose is lifted to 3D accurately, then rotated and projected with a randomly generated camera view, the resulting 'synthetic' 2D pose should lie within the distribution of valid 2D poses. We build a pose discriminator $D$ to classify each input 2D pose as real or fake (randomly projected from 3D poses). The loss is defined as

$$\mathcal{L}_{adv} = \mathbb{E}_{x}\big[\log D(x)\big] + \mathbb{E}_{\tilde{x}}\big[\log\big(1 - D(\tilde{x})\big)\big], \tag{3}$$

where $x$ and $\tilde{x}$ denote real and fake 2D poses, respectively. We follow Drover et al. (2018) to generate the random camera view by sampling an azimuth angle and an elevation angle, each from a fixed range.
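The random-projection step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the projection is orthographic for brevity, the default angle ranges are placeholders (the text only says that an azimuth and an elevation angle are sampled), and `rotate`, `project` and `random_fake_2d` are hypothetical helper names.

```python
import math
import random


def rotate(pose, azimuth, elevation):
    """Rotate joints about the y-axis (azimuth), then about the x-axis (elevation)."""
    ca, sa = math.cos(azimuth), math.sin(azimuth)
    ce, se = math.cos(elevation), math.sin(elevation)
    out = []
    for x, y, z in pose:
        x1, z1 = ca * x + sa * z, -sa * x + ca * z   # rotation about y
        y2, z2 = ce * y - se * z1, se * y + ce * z1  # rotation about x
        out.append((x1, y2, z2))
    return out


def project(pose):
    """Orthographic projection (an assumption made for brevity): drop the depth."""
    return [(x, y) for x, y, _ in pose]


def random_fake_2d(pose3d, rng,
                   az_range=(-math.pi, math.pi),          # placeholder range
                   el_range=(-math.pi / 9, math.pi / 9)):  # placeholder range
    """Produce a 'fake' 2D pose by viewing a lifted 3D pose from a random camera."""
    az = rng.uniform(*az_range)
    el = rng.uniform(*el_range)
    return project(rotate(pose3d, az, el))


pose3d = [(0.0, 0.0, 0.0), (0.1, 0.5, 0.2), (0.3, 1.0, -0.1)]
fake2d = random_fake_2d(pose3d, random.Random(0))
print(fake2d)
```

The fake 2D poses produced this way would be fed to the discriminator together with real 2D poses; the discriminator itself is omitted here.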


The geometric cycle consistency SSL (ISO-Cycle) complements ISO-Adversary with cycle consistency between the 2D and 3D spaces. Specifically, by lifting the randomly projected 2D pose back to 3D and then re-projecting it to the original camera view, the resulting 3D and 2D poses should be consistent with the original ones. The training can thus be supervised by exploiting the cycle consistency of the lift-project-lift process. Combined with the adversarial loss in Eqn. (3), the loss is

$$\mathcal{L}_{cyc} = \mathcal{L}_{adv} + \lambda_{2D} \, \| x - x' \|_2^2 + \lambda_{3D} \, \| \hat{X} - \hat{X}' \|_2^2, \tag{4}$$

where $x$ and $x'$ denote the original and re-projected 2D poses, $\hat{X}$ and $\hat{X}'$ denote the lifted and re-lifted 3D poses, and $\lambda_{2D}$ and $\lambda_{3D}$ are the weights for the 2D and 3D loss terms, respectively.
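The lift-project-lift cycle can be illustrated with a small sketch. The assumptions are loud ones: the camera change is a single rotation about the vertical axis, the projection is orthographic, and `flat_lift` is a deliberately naive stand-in for the lifting network that assigns zero depth to every joint. All function names are hypothetical, and the adversarial term is left out.

```python
import math


def rot_y(pose, angle):
    """Rotate 3D joints about the vertical (y) axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in pose]


def project(pose):
    """Orthographic projection (assumption): keep x and y, drop depth."""
    return [(x, y) for x, y, _ in pose]


def sq_err(a, b):
    """Mean squared Euclidean distance between two lists of vectors."""
    return sum(sum((u - v) ** 2 for u, v in zip(p, q))
               for p, q in zip(a, b)) / len(a)


def cycle_loss(x2d, lift, angle, w2d=1.0, w3d=1.0):
    """2D and 3D consistency terms of the lift-project-lift cycle."""
    X = lift(x2d)                       # lift the original 2D pose
    x_fake = project(rot_y(X, angle))   # view it from a random camera
    X_fake = lift(x_fake)               # lift the randomly projected pose
    X_back = rot_y(X_fake, -angle)      # return to the original camera view
    x_back = project(X_back)            # re-project to 2D
    return w2d * sq_err(x2d, x_back) + w3d * sq_err(X, X_back)


def flat_lift(pose2d):
    """Naive placeholder lifter: every joint gets depth 0."""
    return [(x, y, 0.0) for x, y in pose2d]


print(cycle_loss([(1.0, 0.0)], flat_lift, math.pi / 2))
```

With no camera change the cycle is trivially consistent, while a quarter turn exposes the depth error of the naive lifter, which is the signal the SSL task uses to improve the lifting network.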

During training, we optimize both the FSL and SSL tasks to update network parameters. Following the standard multi-task learning framework Caruana (1997), the SSL task shares some of the network parameters $\theta_e = (\theta_1, \dots, \theta_\kappa)$ with the FSL task, where $\kappa \in \{1, \dots, K\}$ indexes the last shared layer of the $K$-layer network. We call these shared layers the shared feature extractor. The SSL task uses its task-specific parameters $\theta_s$. We call these unshared parameters $\theta_s$ the SSL head, and $\theta_f$ the FSL head. As shown in Fig. 2 (a), the joint architecture has a shared bottom and two heads. Both heads output a vector representing the 3D pose prediction. The only difference between them is that their network parameters are updated by solving different optimization problems.

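We train the model in a multi-task learning fashion on the same data drawn from $P_{\mathcal{S}}$. The joint-training problem is formulated as

$$\min_{\theta_e, \theta_f, \theta_s} \max_{\theta_d} \; \mathcal{L}_{F}(x, X; \theta_e, \theta_f) + \lambda \, \mathcal{L}_{S}(x; \theta_e, \theta_s, \theta_d), \tag{5}$$

where $\theta_e$, $\theta_f$ and $\theta_s$ are the parameters of the shared feature extractor, FSL head and SSL head, $\theta_d$ denotes the network parameters of the pose discriminator, and $\lambda$ is a relative weight for balancing the loss terms. Here $\mathcal{L}_{S}$ denotes the self-supervised loss in Eqn. (3) or Eqn. (4).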

3.2.2 Inference

After minimizing Eqn. (5) on data from $\mathcal{S}$ with distribution $P_{\mathcal{S}}$, we obtain the network parameters $\theta_e^{0}$, $\theta_f^{0}$, $\theta_s^{0}$ and $\theta_d^{0}$ for the shared feature extractor, FSL head, SSL head and pose discriminator, respectively. During inference, ISO performs SSL on each single target instance $x_t$ to update the shared feature extractor, SSL head and pose discriminator (Fig. 2 (b)), which can be formulated as

$$\min_{\theta_e, \theta_s} \max_{\theta_d} \; \mathcal{L}_{S}(x_t; \theta_e, \theta_s, \theta_d). \tag{6}$$

The SSL process uses standard gradient descent (or a variant) with learning rate $\eta$ for $T$ iterations. Additionally, a mini-batch contains several copies of $x_t$ so that a single optimization iteration involves as many adversarial samples (i.e., randomly projected 2D poses) as possible, which improves performance. After optimizing Eqn. (6), we obtain the updated parameters $\theta_e^{t}$ of the shared feature extractor, and make a prediction using $(\theta_e^{t}, \theta_f^{0})$ (Fig. 2 (c)). The motivation behind this formulation is that the joint training scheme (FSL+SSL at the training phase) makes the FSL head adaptive to the representations learned from SSL. In this way, the FSL head, though frozen, can be directly applied to make accurate predictions over the representations updated by the SSL branch during inference.

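We implement ISO in a vanilla mode, i.e., performing SSL on each target instance individually before making a prediction on it. For vanilla ISO, the optimization problem in Eqn. (6) is always initialized with the pre-trained parameters $\theta_e^{0}$, $\theta_s^{0}$ and $\theta_d^{0}$. After performing $T$ iterations of SSL on instance $x_t$, we obtain the updated parameters $\theta_e^{t}$, $\theta_s^{t}$ and $\theta_d^{t}$. After making a prediction on $x_t$, the updated parameters $\theta_e^{t}$, $\theta_s^{t}$ and $\theta_d^{t}$ are discarded.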

Besides vanilla ISO, when the target instances arrive sequentially, we propose a corresponding online ISO that streams the SSL to continuously exploit distributional knowledge among them. Specifically, the online ISO solves the same optimization problem to update network parameters. However, when learning on $x_t$, the parameters $\theta_e$, $\theta_s$ and $\theta_d$ are instead initialized with $\theta_e^{t-1}$, $\theta_s^{t-1}$ and $\theta_d^{t-1}$, i.e., those updated on the previous instance $x_{t-1}$. This allows the model to benefit from the distributional information available in instances $x_1, \dots, x_{t-1}$ as well as $x_t$, and thus speeds up model adaptation.

The summary of both vanilla and online ISO on target instances during inference is illustrated in Algorithm 1.

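Input: target instances $\{x_t\}_{t=1}^{M}$, pre-trained network parameters $\theta_e^{0}, \theta_f^{0}, \theta_s^{0}, \theta_d^{0}$, learning rate $\eta$, training iteration $T$.
Output: 3D pose estimations $\{\hat{X}_t\}_{t=1}^{M}$.
Initialization: $(\theta_e, \theta_s, \theta_d) \leftarrow (\theta_e^{0}, \theta_s^{0}, \theta_d^{0})$
for $t = 1$ to $M$ do
        if vanilla ISO then
               $(\theta_e, \theta_s, \theta_d) \leftarrow (\theta_e^{0}, \theta_s^{0}, \theta_d^{0})$
        else  // online ISO
               keep $(\theta_e, \theta_s, \theta_d)$ as updated on the previous instance
        end if
        for $i = 1$ to $T$ do
               Compute gradients $\nabla_{\phi} \mathcal{L}_{S}(x_t)$ (Eqn. (6)) where $\phi \in \{\theta_e, \theta_s, \theta_d\}$.
               Update parameters: $\phi \leftarrow \phi - \eta \nabla_{\phi} \mathcal{L}_{S}(x_t)$ where $\phi \in \{\theta_e, \theta_s\}$, with the opposite sign (ascent) for $\theta_d$.
        end for
        Predict 3D pose $\hat{X}_t$ using the network parameters $(\theta_e, \theta_f^{0})$.
end for
Algorithm 1 Inference Stage Optimization.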
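The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: `iso_inference`, `ssl_grad` and `predict` are hypothetical names, the adapted parameters are a plain list of floats, `predict` stands in for the frozen FSL head applied to the adapted parameters, and the adversarial (max) update of the discriminator is omitted for brevity.

```python
def iso_inference(instances, theta0, ssl_grad, predict, lr, iters, online):
    """Adapt parameters on each target instance with SSL, then predict.

    vanilla (online=False): restart from the pre-trained theta0 every time.
    online  (online=True):  continue from the parameters of the previous instance.
    """
    theta = list(theta0)
    outputs = []
    for x in instances:
        if not online:
            theta = list(theta0)  # vanilla ISO: discard earlier updates
        for _ in range(iters):
            grad = ssl_grad(theta, x)
            theta = [p - lr * g for p, g in zip(theta, grad)]
        outputs.append(predict(theta, x))
    return outputs


# Toy SSL task: a single parameter b with loss (b - x)^2, so the gradient is 2 (b - x).
def toy_grad(theta, x):
    return [2.0 * (theta[0] - x)]


def toy_predict(theta, x):
    return theta[0]


stream = [1.0, 1.0, 1.0]
vanilla = iso_inference(stream, [0.0], toy_grad, toy_predict, lr=0.25, iters=1, online=False)
online = iso_inference(stream, [0.0], toy_grad, toy_predict, lr=0.25, iters=1, online=True)
print(vanilla, online)
```

In the toy stream every instance comes from the same target distribution, so the online variant keeps moving toward it with a single step per instance, whereas the vanilla variant repeats the same first step each time. This mirrors why online ISO needs fewer iterations per sample.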

3.3 Network details

Our 3D pose estimation model primarily consists of the residual block (RB) proposed in Martinez et al. (2017). Each RB consists of two linear layers, Batch Normalization (BN) Ioffe and Szegedy (2015), leaky ReLU He et al. (2015) and dropout Srivastava et al. (2014), with a residual connection He et al. (2016). The feature dimension and dropout probability are set to 1,024 and 0.5, respectively. Specifically, the shared feature extractor consists of a linear layer followed by three stacked RBs. It first transforms the input 2D pose vector to a 1,024-dimensional vector, which is then fed to the FSL and SSL heads separately. Both the FSL and SSL heads contain an unshared RB followed by a linear layer for 3D pose estimation. The pose discriminator takes the 2D pose vector as input and outputs a classification result (real or fake). It uses three stacked RBs with all BN layers removed. For the 2-way classifier used for the representation learning analysis (Sec. 4.3), we use the same architecture as the pose discriminator, except for the first layer, since it takes 3D poses as inputs. The hidden feature used for visualization is extracted from the final residual block of the classifier (a 1,024-dimensional vector).

4 Experiments

We aim to answer the following questions through experiments: 1) Is ISO able to improve cross-scenario generalization performance of 3D pose estimation? 2) How does ISO take effect to boost generalization performance? 3) Does ISO introduce too much overhead in the inference stage?

4.1 Experiment setup

We quantitatively evaluate the generalizability of our method in the cross-scenario setup, i.e., training a model on Human3.6M and evaluating its performance on the more challenging 3D pose benchmarks MPI-INF-3DHP Mehta et al. (2017a) and 3DPW von Marcard et al. (2018), which feature more diverse motions and scenes. We train our model on subjects S1, S5, S6, S7 and S8 of Human3.6M Martinez et al. (2017); Zhou et al. (2017) and evaluate it on the official test sets of MPI-INF-3DHP and 3DPW. For MPI-INF-3DHP, we use Mean Per Joint Position Error (MPJPE), 3D Percentage of Correct Keypoints (PCK) with a threshold of 150mm and the corresponding Area Under Curve (AUC) as metrics, and adopt three evaluation protocols Habibie et al. (2019): (i) unscaled (US); (ii) globally scaled (GS); (iii) Procrustes-aligned (PA). For 3DPW, we follow Kanazawa et al. (2019) and use Procrustes Aligned MPJPE (PA-MPJPE) and 3D PCK as metrics. In addition, we use MPII Andriluka et al. (2014) and LSP Johnson and Everingham (2010), the standard 2D pose benchmarks with diverse scenes that reflect challenging real-world factors such as strong pose deformations and abundant viewpoints, to qualitatively verify the effectiveness of our method.
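For reference, the three MPI-INF-3DHP metrics can be computed per pose as in the sketch below. The 150mm PCK threshold comes from the text; the inclusive comparison and the AUC grid of 31 evenly spaced thresholds between 0 and 150mm are assumptions, and the function names are illustrative.

```python
import math


def joint_errors(pred, gt):
    """Euclidean distance (in mm) between each predicted and GT joint."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def mpjpe(errors):
    """Mean per joint position error."""
    return sum(errors) / len(errors)


def pck(errors, threshold=150.0):
    """Percentage of joints whose error is within the threshold (mm)."""
    return 100.0 * sum(e <= threshold for e in errors) / len(errors)


def auc(errors, max_threshold=150.0, steps=31):
    """Mean PCK over evenly spaced thresholds in [0, max_threshold].

    The 31-step grid is an assumption; the text does not specify it.
    """
    thresholds = [max_threshold * i / (steps - 1) for i in range(steps)]
    return sum(pck(errors, t) for t in thresholds) / steps


gt = [(0.0, 0.0, 0.0)] * 4
pred = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 150.0, 0.0), (0.0, 0.0, 200.0)]
errs = joint_errors(pred, gt)
print(mpjpe(errs), pck(errs), auc(errs))
```

The US, GS and PA protocols differ only in how the prediction is aligned to the ground truth before these errors are measured; that alignment step is not shown here.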

We train our model for 200 epochs on Human3.6M, adopting Adam Kingma and Ba (2015) as the optimizer with an exponentially decayed learning rate and a mini-batch size of 64. We use horizontal flip augmentation at both training and inference. During inference, for both 3DHP and 3DPW, we freeze the batch normalization layers and perform SSL on each single target instance before making a prediction. Specifically, for both vanilla and online ISO, we adopt the Adam optimizer with the same learning rate $\eta$. We set the iteration $T$ to 10 and 1 for vanilla and online ISO, respectively. In the following experiments, unless otherwise stated, we use the ISO-Cycle SSL technique.

4.2 Does ISO boost generalization?

We compare ISO (online) with several state-of-the-art approaches on 3DHP and 3DPW datasets. Some methods consider domain adaptation Yang et al. (2018), or use complex network architectures Martinez et al. (2017); Kanazawa et al. (2018); Dabral et al. (2018); Ci et al. (2019); Zhao et al. (2019); Chang et al. (2019); Doersch and Zisserman (2019) and training schemes Wandt and Rosenhahn (2019); Arnab et al. (2019); Sun et al. (2019c). We use Baseline to denote the plain model trained with only FSL task; Joint is the model trained with FSL and SSL tasks jointly; Vanilla refers to the model adapted using vanilla ISO; Online is the model adapted using online ISO.

Results on 3DHP.

We compare ISO against the methods in Yang et al. (2018); Ci et al. (2019); Chang et al. (2019) under the cross-scenario setup. We directly report their results from the original papers. Note that some of them have missing metrics or do not specify evaluation protocols. Additionally, we implement and compare with the methods of Zhao et al. (2019) and Wandt and Rosenhahn (2019) based on their released code (the SemGCN and RepNet source code, respectively). Table 1 shows the results under different metrics and protocols. Our method achieves the highest accuracy in terms of 3D PCK and MPJPE across all evaluation protocols, outperforming the second best by a large margin. This verifies the generalizability of our approach.

Results on 3DPW.

We also compare ISO with state-of-the-art approaches on 3DPW. Some methods exploit temporal information Dabral et al. (2018); Kanazawa et al. (2019); Doersch and Zisserman (2019), while some others are trained on the training set of 3DPW Arnab et al. (2019); Sun et al. (2019c). Table 2 reports the results. Our method outperforms several approaches in terms of PA-MPJPE and even achieves comparable results with the fully-supervised method Sun et al. (2019c). This shows the generalization capability of our method.

Table 1: Results on 3DHP. Some entries are obtained with our implementation (see text). US, GS and PA denote different protocols.
Method                               PCK    AUC    MPJPE
Yang et al. (2018)                   69.0   32.0   -
Ci et al. (2019)                     74.0   36.7   -
Chang et al. (2019)                  76.5   40.2   -
Wandt and Rosenhahn (2019)           81.8   54.8   92.5
Zhao et al. (2019) (US)              76.2   42.8   126.1
Ours (US)                            83.6   48.2   92.2
Zhao et al. (2019) (GS)              77.1   45.5   108.0
Ours (GS)                            84.5   50.9   88.4
Wandt and Rosenhahn (2019) (PA)      81.6   47.0   95.4
Zhao et al. (2019) (PA)              86.0   46.7   96.8
Ours (PA)                            91.3   54.0   75.8

Table 2: Our results (14-joints) on 3DPW. Some methods are trained using GT data.
Method                               PCK    PA-MPJPE
Martinez et al. (2017)               -      157.0
Dabral et al. (2018)                 -      92.3
Kanazawa et al. (2018)               84.1   76.7
Kanazawa et al. (2019)               86.4   80.1
Arnab et al. (2019)                  -      77.2
Doersch and Zisserman (2019)         -      74.7
Sun et al. (2019c)                   -      69.5
Ours                                 82.0   70.8

Qualitative results.

We visualize some 3D pose estimations of ISO on the challenging LSP, MPII, 3DHP and 3DPW datasets in Fig. 3. Most of the poses and camera views involved are unseen by our model. However, ISO still achieves good results even in the presence of self-occlusion (1st column), large pose variations (2nd and 3rd columns), and unusual views (4th column). Additionally, compared with Baseline, ISO produces more geometrically plausible results. These results verify the superior generalizability of ISO to challenging new scenarios.

Table 3: Ablation of different SSL techniques on MPI-INF-3DHP.
Method         PCK    AUC    MPJPE
Baseline       78.9   43.7   103.8
Joint-Adv      80.9   46.1   97.0
Vanilla-Adv    82.1   47.2   95.3
Online-Adv     83.0   47.6   93.1
Joint-Cyc      81.3   46.9   96.2
Vanilla-Cyc    82.5   47.6   94.1
Online-Cyc     83.6   48.2   92.2

How does the choice of self-supervised learning technique impact accuracy?

We first study the influence of different SSL techniques on the model's generalizability. We use Adv and Cyc to represent the ISO-Adversary and ISO-Cycle SSL techniques, respectively. The results are shown in Table 3. We observe that Adv (in the Joint, Vanilla and Online settings) improves accuracy upon Baseline by a large margin. In addition, Cyc achieves even better results than Adv in all three settings by adding the geometric cycle consistency constraint. These observations demonstrate the importance of adversarial learning and geometric knowledge to cross-scenario 3D pose estimation, which may motivate more SSL techniques in the future.

Figure 3: Example 3D pose estimations from LSP, MPII (top row) and 3DHP, 3DPW (bottom row). ISO results are shown in the left four columns. The rightmost column shows results of Baseline. Errors are labeled in black arrows. Please refer to supplement for more qualitative results.
How do hyper-parameters impact accuracy?

We then analyze the sensitivity of our method to the hyper-parameters, i.e., the learning rate $\eta$ and the training iteration $T$ used when performing ISO (Cyc). Specifically, we report 3D PCK for both vanilla and online ISO, and show the results in Fig. 4. We first analyze the impact of $T$ by varying it while keeping $\eta$ fixed. From Fig. 4 (Left) we observe that increasing $T$ from 1 to 10 for vanilla ISO consistently raises accuracy from 81.5% PCK to 82.5% PCK, owing to the geometric knowledge mined from the target instances. However, further increasing $T$ degrades performance because the model overfits to the SSL task. The model adapted under online ISO achieves its best performance, 83.6% PCK, when $T = 1$, and performance decreases with a larger $T$. The main reason is that performing SSL in online mode with $T > 1$ makes the model quickly overfit to the SSL task, which hampers 3D pose estimation. We then fix $T$ to 10 and 1 for vanilla and online ISO, respectively, and vary $\eta$ to study the influence of the learning rate on performance. Fig. 4 (Right) shows that both modes achieve their best performance at the same learning rate. As the learning rate decreases further, the performance of both modes gradually degrades and approaches that of Joint (i.e., the model without adaptation) at 81.3% PCK. With a large $\eta$, however, accuracy drops quickly, especially in online mode (70.9% PCK), since a large learning rate makes the model easily overfit to the SSL task and thus limits performance.

Figure 4: Analysis of hyper-parameters. Left: Training iteration $T$. Right: Learning rate $\eta$. The best $T$ is 1 for online ISO and 10 for vanilla ISO. Increasing $T$ or $\eta$ beyond their best values causes poor performance.
Figure 5: Distribution of limb length ratios produced by ISO and Baseline on 3DHP. Left: Ratio of upper to lower arm. Middle: Ratio of upper to lower leg. Right: Ratio of upper to lower torso. Ground truth ratios are 1.3, 1.3 and 1.0 for arm, leg and torso, respectively. L and R indicate left and right body parts, respectively.
Figure 6: Visualization of hidden features using t-SNE.
Figure 7: Per body-part accuracy on 3DHP. PCK of each part is computed as the PCK of the corresponding skeleton joints.

4.3 Why does ISO perform well?

We investigate why and how ISO can improve cross-scenario generalization. All experiments below are conducted on 3DHP using Online under the unscaled protocol, unless otherwise specified.

Geometric distribution alignment.

Our main insight is that performing ISO on target instances enables the model to mine geometric knowledge (e.g., limb length ratios and body part symmetry) about the target distribution. To verify this, we inspect the geometric distribution alignment of the output poses from Baseline and ISO (Online). Specifically, we compute the limb length ratios of upper to lower arms and legs (for both left and right sides), and of the torso Zhou et al. (2017); Chen et al. (2019a). The results are shown in Fig. 5. We observe that the ratio distributions generated by ISO are sharper and closer to the real ratio distributions of 3DHP than those of Baseline. Additionally, ISO produces more symmetric ratio distributions for the left and right sides of arms and legs than Baseline, which verifies its ability to capture the symmetry of body parts. All these results clearly demonstrate that the model adapted via ISO can mine geometric knowledge about the target distribution and thus generalize well to it, without requiring any prior for regularization Dabral et al. (2018) or post-processing Zhou et al. (2017).
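The limb length ratio used in this analysis is straightforward to compute from a predicted skeleton. The sketch below assumes a hypothetical joint indexing for a single arm (0 shoulder, 1 elbow, 2 wrist) and coordinates in millimetres; `limb_ratio` is an illustrative name, not part of any released code.

```python
import math


def limb_ratio(pose, upper, lower):
    """Length of the upper segment divided by the length of the lower segment.

    `upper` and `lower` are (joint_a, joint_b) index pairs into `pose`.
    """
    def length(segment):
        a, b = segment
        return math.dist(pose[a], pose[b])

    return length(upper) / length(lower)


# Hypothetical joint order for one arm: 0 shoulder, 1 elbow, 2 wrist.
arm = [(0.0, 0.0, 0.0), (0.0, -260.0, 0.0), (0.0, -260.0, 200.0)]
ratio = limb_ratio(arm, upper=(0, 1), lower=(1, 2))
print(ratio)
```

Collecting this ratio over all predicted poses, separately for the left and right sides, gives histograms of the kind compared against the ground truth ratios in Fig. 5.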

Representation learning.

To further analyze how ISO helps during inference, we train a 2-way classifier to predict which dataset (Human3.6M or 3DHP) a given 3D pose comes from. Once trained, the classifier achieves 99.5% accuracy on average over both datasets, demonstrating its ability to accurately capture the inter-dataset difference in geometry and to judge which dataset (or distribution) a 3D pose comes from. Then, we apply this classifier to test whether the 3D poses estimated by Baseline and ISO are close to the distribution of GT 3D poses from 3DHP. The classifier identifies only 52.6% of the 3D poses estimated by Baseline as drawn from the target 3DHP distribution, whereas it identifies 83.4% of those estimated by ISO as drawn from 3DHP. This demonstrates that the representations adapted by ISO are more similar to the target ones. Additionally, we visualize the hidden feature (a 1,024-dimensional vector) of the classifier by t-SNE Long et al. (2015) in Fig. 6. We can see that performing SSL on target instances draws the feature distribution of the generated 3D poses closer to that of the GTs (blue and green circles). All these results clearly demonstrate that ISO enables the model to adapt to the real distribution of 3DHP during the inference stage.

Per body-part improvement.

In addition to distribution alignment, we also study the performance improvement of our method on each body part. We first divide all skeleton joints into eight parts: Hip, Spine, Shoulder, Head, Elbow, Wrist, Knee and Ankle. We then compute the mean 3D PCK for each part and present the results in Fig. 7. We can see that Online improves over Baseline by a large margin for Head, Elbow, Wrist and Ankle. These parts are difficult to estimate, especially for samples from new scenarios, because of their high flexibility. Nevertheless, Online still estimates these parts well, which demonstrates the effectiveness of our method for cross-scenario generalization.
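A per-part PCK computation along these lines might look as follows. This is a sketch under stated assumptions: the 150 mm threshold is the value commonly used for 3D PCK on 3DHP and is assumed here rather than taken from this section, the `parts` mapping from part names to joint indices is supplied by the caller and is hypothetical, and whether the comparison is strict or inclusive is a detail we choose arbitrarily.

```python
import numpy as np


def per_part_pck(pred, gt, parts, threshold=150.0):
    """Mean 3D PCK (in percent) per body part.

    pred, gt: arrays of shape (N, J, 3) in millimetres.
    parts: dict mapping a part name to a list of joint indices (caller-defined).
    threshold: distance in millimetres below which a joint counts as correct (assumed).
    """
    errors = np.linalg.norm(pred - gt, axis=-1)  # per-joint error, shape (N, J)
    correct = errors <= threshold
    return {name: 100.0 * correct[:, idx].mean() for name, idx in parts.items()}
```

For example, `per_part_pck(pred, gt, {"Head": [0], "Knee": [1, 2]})` returns one percentage per listed part, which can be plotted side by side for Baseline and Online.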

4.4 Is ISO costly or sensitive to noise?

Inference time analysis.

Our ISO scheme is slower than a regular inference scheme, which only performs a single forward pass for each sample. Here, we provide two potential solutions to improve the computational efficiency. For vanilla ISO, we set the number of iterations to 1 (instead of 10) and adjust the learning rate accordingly. The new setup is denoted as Vanilla-lr. For online ISO, since the number of iterations is already 1, we propose to perform SSL once per 10 samples, denoted as Online-skip. For all settings, we measure the average per-sample inference time in seconds and show the results in Table 4 (timings are measured on a single TITAN X GPU and an Intel i7-5820K 3.3GHz CPU). With the new inference setup, the per-sample time drops from 0.244 s to 0.027 s for vanilla ISO and from 0.027 s to 0.004 s for online ISO, with performance almost the same as the original. Notably, Online-skip achieves almost the same efficiency as the regular inference scheme (0.004 s vs. 0.003 s) while improving the performance by a large margin.
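The Online-skip schedule can be sketched as a simple loop over the target stream. This is only an illustration of the scheduling idea: `predict` and `ssl_update` are hypothetical callables standing in for the model's forward pass and one self-supervised update step, and updating on the first sample of every window of `skip` samples is our assumption.

```python
def online_skip_inference(samples, predict, ssl_update, skip=10):
    """Predict on a stream, running one SSL update once every `skip` samples.

    samples: iterable of target inputs (e.g., 2D poses).
    predict: callable returning the model output for one input (hypothetical).
    ssl_update: callable performing one self-supervised update on one input (hypothetical).
    """
    outputs = []
    for i, x in enumerate(samples):
        if i % skip == 0:
            ssl_update(x)           # adapt the model before predicting on this sample
        outputs.append(predict(x))  # plain forward pass for every sample
    return outputs
```

With `skip=1` the loop reduces to the online mode in which every sample triggers an update, so the two modes differ only in how often `ssl_update` is called.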

Robustness to noisy observations.

We evaluate the robustness of our method under different levels of noise added to the input 2D poses. Specifically, we add Gaussian noise with standard deviation $\sigma$ (in pixels) to the GT 2D poses, following Martinez et al. (2017). The results are shown in Table 5. The accuracy decreases linearly with $\sigma$, which indicates that the noise of the 2D poses has a major impact on the results. However, this issue can be alleviated by using state-of-the-art 2D pose estimators Nie et al. (2019); Sun et al. (2019a); Cheng et al. (2019) or by training with synthetic error Moon et al. (2019); Chang et al. (2019). Note that the maximum person size from head to foot is approximately 200px in the input data, so the largest noise level we test is extremely large. Even under such large noise, ISO produces a better result (79.6% 3D PCK) than Baseline (78.9% 3D PCK with GT 2D inputs), which verifies its robustness.
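The perturbation used in this experiment can be sketched in a few lines. This is a minimal illustration, assuming zero-mean noise applied independently to every 2D coordinate; the function name and the array layout (N, J, 2) in pixels are our own choices.

```python
import numpy as np


def add_gaussian_noise(poses_2d, sigma, rng):
    """Add i.i.d. zero-mean Gaussian noise with standard deviation `sigma` (pixels).

    poses_2d: array of shape (N, J, 2) holding 2D joint coordinates in pixels.
    rng: a numpy.random.Generator, passed in so that runs are reproducible.
    """
    return poses_2d + rng.normal(loc=0.0, scale=sigma, size=poses_2d.shape)
```

Feeding `add_gaussian_noise(gt_2d, sigma, np.random.default_rng(0))` to the model in place of the clean GT 2D poses reproduces the kind of input used for the noisy rows of the comparison.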

Table 4: Inference time analysis of different inference modes of ISO.
Method       PCK   AUC   MPJPE  Time[s]
Vanilla      82.5  47.6  94.1   0.244
Vanilla-lr   82.1  47.3  94.6   0.027
Online       83.6  48.2  92.2   0.027
Online-skip  83.0  48.0  92.7   0.004
Baseline     78.9  43.7  103.8  0.003

Table 5: Performance under different levels of 2D pose noise.
Method              PCK   AUC   MPJPE
ISO                 83.6  48.2  92.2
ISO (lower noise)   82.5  47.4  94.0
ISO (higher noise)  79.6  43.7  103.4
Baseline            78.9  43.7  103.8

5 Conclusion

We propose a new ISO framework for improving the generalizability of 3D pose estimation models. It explores underlying priors in target instances and leverages SSL techniques to mine such knowledge, so that 3D poses can be estimated accurately even under strong distribution shifts between source and target scenarios. ISO achieves state-of-the-art performance on the challenging MPI-INF-3DHP benchmark under the cross-scenario setting. In the future, we plan to investigate more SSL techniques in our framework.

Broader impact

We propose the Inference Stage Optimization (ISO) framework for cross-scenario 3D human pose estimation, which enables a 3D pose estimation model to mine distributional knowledge about the target scenario and quickly adapt to it with enhanced generalization. It can be applied to many applications related to 3D pose estimation, including human-robot interaction, action recognition and human tracking, which are all important research topics in artificial intelligence. More generally, improving generalization performance for 3D human pose estimation may enable many applications whose impact could be positive, negative or more complicated, depending on the nature of the organizations using them and the tasks they are used for.


  • [1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele (2014) 2d human pose estimation: new benchmark and state of the art analysis. In CVPR, Cited by: §4.1.
  • [2] A. Arnab, C. Doersch, and A. Zisserman (2019) Exploiting temporal context for 3d human pose estimation in the wild. In CVPR, Cited by: §4.2, §4.2, Table 2.
  • [3] D. Bau, H. Strobelt, W. Peebles, J. Wulff, B. Zhou, J. Zhu, and A. Torralba (2019) Semantic photo manipulation with a generative image prior. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–11. Cited by: §2.
  • [4] Y. Cai, L. Ge, J. Liu, J. Cai, T. Cham, J. Yuan, and N. M. Thalmann (2019) Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In ICCV, Cited by: §2.
  • [5] R. Caruana (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §3.2.1.
  • [6] J. Chang, G. Moon, and K. M. Lee (2019) AbsPoseLifter: absolute 3d human pose lifting network from a single noisy 2d human pose. In ICCV, Cited by: §4.2, §4.2, §4.4, Table 2.
  • [7] C. Chen, A. Tyagi, A. Agrawal, D. Drover, R. MV, S. Stojanov, and J. M. Rehg (2019) Unsupervised 3d pose estimation with geometric self-supervision. In CVPR, Cited by: §1, §2, §3.2.1, §4.3.
  • [8] W. Chen, H. Wang, Y. Li, H. Su, Z. Wang, C. Tu, D. Lischinski, D. Cohen-Or, and B. Chen (2016) Synthesizing training images for boosting human 3d pose estimation. In 3DV, Cited by: §1, §2.
  • [9] X. Chen, K. Lin, W. Liu, C. Qian, and L. Lin (2019) Weakly-supervised discovery of geometry-aware representation for 3d human pose estimation. In CVPR, Cited by: §2.
  • [10] B. Cheng, B. Xiao, J. Wang, H. Shi, T. S. Huang, and L. Zhang (2019) Bottom-up higher-resolution networks for multi-person pose estimation. In CoRR, Cited by: §4.4.
  • [11] H. Ci, C. Wang, X. Ma, and Y. Wang (2019) Optimizing network structure for 3d human pose estimation. In ICCV, Cited by: §4.2, §4.2, Table 2.
  • [12] R. Dabral, A. Mundhada, U. Kusupati, S. Afaque, A. Sharma, and A. Jain (2018) Learning 3d human pose from structure and motion. In ECCV, Cited by: §1, §2, §4.2, §4.2, §4.3, Table 2.
  • [13] C. Doersch and A. Zisserman (2019) Sim2real transfer learning for 3d human pose estimation: motion to the rescue. In NIPS, Cited by: §4.2, §4.2, Table 2.
  • [14] D. Drover, R. MV, C. Chen, A. Agrawal, A. Tyagi, and C. Phuoc Huynh (2018) Can 3d pose be learned from 2d projections alone?. In ECCVw, Cited by: §1, §1, §2, §3.2.1, §3.2.1.
  • [15] A. Errity (2016) Human–computer interaction. An Introduction to Cyberpsychology 241. Cited by: §1.
  • [16] H. Fang, Y. Xu, W. Wang, X. Liu, and S. Zhu (2018) Learning pose grammar to encode human body configuration for 3d pose estimation. In AAAI, Cited by: §2.
  • [17] I. Habibie, W. Xu, D. Mehta, G. Pons-Moll, and C. Theobalt (2019) In the wild human pose estimation using explicit 2d features and intermediate 3d representations. In CVPR, Cited by: §1, §2, §4.1.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In ICCV, Cited by: §3.3.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.3.
  • [20] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv. Cited by: §3.3.
  • [21] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2014) Human3.6m: large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans. on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339. Cited by: §1, §1.
  • [22] S. Johnson and M. Everingham (2010) Clustered pose and nonlinear appearance models for human pose estimation.. In BMVC, Cited by: §4.1.
  • [23] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2018) End-to-end recovery of human shape and pose. In CVPR, Cited by: §4.2, Table 2.
  • [24] A. Kanazawa, J. Y. Zhang, P. Felsen, and J. Malik (2019) Learning 3d human dynamics from video. In CVPR, Cited by: §4.1, §4.2, Table 2.
  • [25] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.
  • [26] M. Kocabas, S. Karagoz, and E. Akbas (2019) Self-supervised learning of 3d human pose using multi-view geometry. In CVPR, Cited by: §2.
  • [27] Y. Li, K. Li, S. Jiang, Z. Zhang, C. Huang, and R. Y. D. Xu (2020) Geometry-driven self-supervised method for 3d human pose estimation. In AAAI, Cited by: §2.
  • [28] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. arXiv. Cited by: §4.3.
  • [29] J. Martinez, R. Hossain, J. Romero, and J. J. Little (2017) A simple yet effective baseline for 3d human pose estimation. In ICCV, Cited by: §1, §1, §2, §3.1, §3.3, §4.1, §4.2, §4.4, Table 2.
  • [30] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt (2017) Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3DV, Cited by: §1, §1, §2, §4.1.
  • [31] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H. Seidel, W. Xu, D. Casas, and C. Theobalt (2017) Vnect: real-time 3d human pose estimation with a single rgb camera. ACM Trans. on Graphics 36 (4), pp. 44. Cited by: §1, §1.
  • [32] G. Moon, J. Chang, and K. M. Lee (2019) PoseFix: model-agnostic general human pose refinement network. In CVPR, Cited by: §4.4.
  • [33] A. Nibali, Z. He, S. Morgan, and L. Prendergast (2019) 3d human pose estimation with 2d marginal heatmaps. In WACV, Cited by: §2.
  • [34] X. Nie, J. Zhang, S. Yan, and J. Feng (2019) Single-stage multi-person pose machines. In ICCV, Cited by: §4.4.
  • [35] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli (2019) 3D human pose estimation in video with temporal convolutions and semi-supervised training. In CVPR, Cited by: §2.
  • [36] A. Pirinen, E. Gärtner, and C. Sminchisescu (2019) Domes to drones: self-supervised active triangulation for 3d human pose reconstruction. In NIPS, Cited by: §2.
  • [37] H. Rhodin, M. Salzmann, and P. Fua (2018) Unsupervised geometry-aware representation for 3d human pose estimation. In ECCV, Cited by: §2, §3.1.
  • [38] S. Sharma, P. T. Varigonda, P. Bindal, A. Sharma, and A. Jain (2019) Monocular 3d human pose estimation by generation and ordinal ranking. In ICCV, Cited by: §2.
  • [39] A. Shocher, N. Cohen, and M. Irani (2018) “Zero-shot” super-resolution using deep internal learning. In CVPR, Cited by: §2.
  • [40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. J. of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §3.3.
  • [41] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In CVPR, Cited by: §4.4.
  • [42] X. Sun, J. Shang, S. Liang, and Y. Wei (2017) Compositional human pose regression. In ICCV, Cited by: §1, §1, §2, §3.1.
  • [43] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei (2018) Integral human pose regression. In ECCV, Cited by: §1, §2.
  • [44] Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt (2019) Test-time training for out-of-distribution generalization. arXiv. Cited by: §2.
  • [45] Y. Sun, Y. Ye, W. Liu, W. Gao, Y. Fu, and T. Mei (2019) Human mesh recovery from monocular images via a skeleton-disentangled representation. In ICCV, Cited by: §4.2, §4.2, Table 2.
  • [46] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua (2016) Direct prediction of 3d body poses from motion compensated sequences. In CVPR, Cited by: §2.
  • [47] H. F. Tung (2017) Adversarial inverse graphics networks: learning 2d-to-3d lifting and image-to-image translation from unpaired supervision. In ICCV, Cited by: §2.
  • [48] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid (2017) Learning from synthetic humans. In CVPR, Cited by: §1, §2.
  • [49] T. von Marcard, R. Henschel, M. Black, B. Rosenhahn, and G. Pons-Moll (2018) Recovering accurate 3d human pose in the wild using imus and a moving camera. In ECCV, Cited by: §1, §4.1.
  • [50] B. Wandt and B. Rosenhahn (2019) RepNet: weakly supervised training of an adversarial reprojection network for 3d human pose estimation. In CVPR, Cited by: §1, §2, §4.2, §4.2, Table 2, footnote 1.
  • [51] L. Wang, Y. Chen, Z. Guo, K. Qian, M. Lin, H. Li, and J. S. Ren (2019) Generalizing monocular 3d human pose estimation in the wild. In ICCVw, Cited by: §2.
  • [52] S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, Cited by: §1.
  • [53] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang (2018) 3d human pose estimation in the wild by adversarial learning. In CVPR, Cited by: §1, §2, §4.2, §4.2, Table 2.
  • [54] L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. N. Metaxas (2019) Semantic graph convolutional networks for 3d human pose regression. In CVPR, Cited by: §1, §1, §2, §3.1, §4.2, §4.2, Table 2, footnote 1.
  • [55] K. Zhou, X. Han, N. Jiang, K. Jia, and J. Lu (2019) HEMlets pose: learning part-centric heatmap triplets for accurate 3d human pose estimation. In ICCV, Cited by: §2.
  • [56] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei (2017) Towards 3d human pose estimation in the wild: a weakly-supervised approach. In ICCV, Cited by: §1, §1, §2, §4.1, §4.3.