PoseTriplet: Co-evolving 3D Human Pose Estimation, Imitation, and Hallucination under Self-supervision

by Kehong Gong, et al.
National University of Singapore

Existing self-supervised 3D human pose estimation schemes have largely relied on weak supervisions like consistency loss to guide the learning, which, inevitably, leads to inferior results in real-world scenarios with unseen poses. In this paper, we propose a novel self-supervised approach that allows us to explicitly generate 2D-3D pose pairs for augmenting supervision, through a self-enhancing dual-loop learning framework. This is made possible via introducing a reinforcement-learning-based imitator, which is learned jointly with a pose estimator alongside a pose hallucinator; the three components form two loops during the training process, complementing and strengthening one another. Specifically, the pose estimator transforms an input 2D pose sequence to a low-fidelity 3D output, which is then enhanced by the imitator that enforces physical constraints. The refined 3D poses are subsequently fed to the hallucinator for producing even more diverse data, which are, in turn, strengthened by the imitator and further utilized to train the pose estimator. Such a co-evolution scheme, in practice, enables training a pose estimator on self-generated motion data without relying on any given 3D data. Extensive experiments across various benchmarks demonstrate that our approach yields encouraging results significantly outperforming the state of the art and, in some cases, even on par with results of fully-supervised methods. Notably, it achieves 89.1% 3D PCK on MPI-INF-3DHP under the self-supervised cross-dataset evaluation setup, improving upon the previous best self-supervised methods by 8.6%.




Code Repositories


[CVPR 2022] PoseTriplet: Co-evolving 3D Human Pose Estimation, Imitation, and Hallucination under Self-supervision (Oral)


1 Introduction

Figure 1: Overview of our PoseTriplet framework. The pose estimator, imitator, and hallucinator are trained jointly in a dual-loop strategy. In the first loop, the estimator provides physically implausible motion information, which is then enhanced by the imitator, which enforces physical constraints to generate physically plausible motion. In the second loop, the hallucinator generates more diverse motion patterns given the motion sequences from the previous loop, and sends them to the imitator again for further refinement. This dual-loop paradigm facilitates tight co-evolution of the three components and enables iterative self-improving training of the estimator with the generated diverse and plausible motion data.

Video-based 3D human pose estimation aims to infer 3D pose sequences from videos, and therefore plays a crucial role in many applications such as action recognition [yan2018spatial, si2019attention], virtual try-on [liu2021spatt], and mixed reality [mehta2017vnect, joo2020exemplar, chengpnerf]. Existing methods [martinez2017simple, sun2018integral, nie2019spm, kocabas2020vibe, pavllo2019videopose3d] mainly rely on fully-supervised paradigms, in which ground truth 3D data are given as input. However, capturing 3D pose data is cost-intensive and time-consuming, as it typically requires a multi-view setup or a motion capture system [mehta2017vnect, ionescu2014human3], making it infeasible in in-the-wild scenarios.

To this end, two categories of methods have been introduced to alleviate the 3D data availability issue. The first category explores semi-supervised settings, in which only a small amount of 3D annotation is given [zhou2017towards, li2019boosting, mitra2020multiview]. The second category, on the other hand, assumes no 3D data are available at all and only 2D poses are provided. Under this setup, state-of-the-art methods have mainly focused on imposing weak supervision signals to guide the training, such as aligning the projection of an inferred 3D pose with a 2D pose [chen2019unsupervised, yu2021towards, hu2021unsupervised]. Due to the lack of 3D data, and hence the absence of 2D-3D pairs, these methods are by nature brittle in challenging scenarios, such as the unseen poses inherent to in-the-wild tasks.

In this paper, we propose a novel self-supervised approach termed PoseTriplet, which allows for explicitly generating physically and semantically plausible 2D-3D pose pairs, so that full supervision can be imposed to significantly strengthen the self-learning process. This is made possible by introducing a reinforcement-learning-based imitator, which is jointly optimized with the pose estimator alongside a pose hallucinator. Specifically, the imitator takes the form of a physics simulator with non-differentiable dynamics to ensure physical plausibility, while the hallucinator helps generate more diverse motion via generative motion completion. These three key components are integrated into a self-contained framework and co-evolve via a dual-loop strategy as the training proceeds. With only 2D pose data as input, PoseTriplet progressively generates, refines and hallucinates 3D data, which in turn reinforces every component in the loop. Once trained, each component of PoseTriplet can be readily taken out and serve as an off-the-shelf tool for its dedicated task, such as pose estimation or imitation.

The key motivation behind co-evolving the pose estimator, imitator and hallucinator lies in their complementary natures. In particular, the pose estimator takes 2D poses as input and generates 3D poses with reasonable semantics (e.g., natural behaviors) but implausible dynamics; such derived 3D poses are then refined by the physics-based imitator, which enforces physical constraints. Conversely, the reinforcement-learning-based imitator may generate unnatural behaviors (e.g., overly energetic movements), which can be rectified through the pose estimator to ensure semantic plausibility. The pose hallucinator, on the other hand, enhances data diversity by producing realistic 3D pose sequences under both semantic and physical guidance, which further strengthens data synthesis and hence improves generalization performance.

We show the overall workflow of PoseTriplet in Fig. 1, which effectively aligns with the aforementioned motivation. Unlike prior endeavors that rely on self-consistency-based supervision or 3D sequences as input, PoseTriplet, through the dual-loop scheme, turns the input 2D poses into dependable 3D poses with realistic semantics and dynamics, thereby lending itself to much stronger supervision and consequently the co-evolution of the pose estimator, imitator and hallucinator. Experimental results across the H36M, 3DHP, and 3DPW datasets demonstrate that PoseTriplet gives rise to pose estimation results significantly superior to the state-of-the-art self-supervised methods, and sometimes even on par with results from fully-supervised ones. Notably, it achieves 89.1% 3D PCK on MPI-INF-3DHP under the self-supervised cross-dataset evaluation setup, improving upon the previous best self-supervised methods [hu2021unsupervised, kundu2020self] by 8.6%.

Our contribution is therefore a novel scheme dedicated to self-supervised 3D pose estimation, achieved through the co-evolution of a pose estimator, imitator, and hallucinator. The three components complement and benefit one another, together forming a self-contained system that produces realistic 3D pose sequences and, in turn, 2D-3D augmented supervision. Taking only 2D poses as input, PoseTriplet delivers truly encouraging results across various benchmarks, largely outperforming the state of the art and even approaching fully-supervised results.

2 Related works

3D pose estimation 3D pose estimation has been widely explored under fully supervised, semi-supervised, and self-supervised settings. Various approaches have been proposed under the fully supervised setting [martinez2017simple, mehta2017vnect, sun2018integral, nie2019spm, Yang2020Distill, pavllo2019videopose3d, kocabas2020vibe, Yang2021Pose, zhang2021bmp, wang2021mvp, WangJueICCV19]. Though offering impressive results, these approaches rely heavily on accurate motion capture data, which are hard to collect. To address the high cost of data collection, semi-supervised methods [zhou2017towards, li2019boosting, mitra2020multiview] have been proposed to utilize information from unlabeled data. Besides semi-supervised approaches, augmentation-based methods [Li_2020_CVPR, gong2021poseaug] enlarge the amount of data through an evolution strategy [Li_2020_CVPR] or a learnable approach [gong2021poseaug].

Different from the above schemes, self-supervised methods with multi-view data explore intrinsic supervision for model training without requiring ground truth 3D poses [kocabas2019epipolar, wandt2021canonpose, iqbal2020weakly]. For instance, Kocabas et al. [kocabas2019epipolar] utilize epipolar geometry to generate pseudo labels, while [wandt2021canonpose, iqbal2020weakly] utilize 3D pose consistency across different views. Though effective, these approaches require synchronized multiple cameras, which are uncommon in real scenarios. Other methods [drover2018can, chen2019unsupervised, yu2021towards, hu2021unsupervised] explore the more challenging single-view setting. For example, Drover et al. [drover2018can] exploit, via adversarial training, the prior that a random projection of a plausible 3D pose should fall within the 2D pose distribution. Chen et al. [chen2019unsupervised] improve this idea by adding cycle consistency. Yu et al. [yu2021towards] further introduce scale steps for 2D poses to resolve the ambiguity issue. Zhang et al. [zhang2020inference] apply self-supervised learning on test data to adapt the model to new scenarios.

Our method belongs to the self-supervised approaches under the single-view setting. Different from previous self-supervised approaches that implement weak supervision signals through consistency [chen2019unsupervised] or adversarial learning [drover2018can, yu2021towards], our method directly uses strong supervision signals from self-generated data, resulting in more accurate and stable model performance. The pseudo-label strategy [li2019boosting] in the semi-supervised category is close to our approach. However, our approach does not require ground truth data for model pretraining, and it introduces physical plausibility refinement and diversity enhancement to achieve better performance, both of which are absent in [li2019boosting].

Physics-based pose estimation The above methods are all kinematics-based. Though providing impressive results, they do not consider physical constraints and thus suffer from physically implausible artifacts (e.g., foot skating and ground penetration). To ensure physical plausibility, recent works explore physical constraints. Rempe et al. [rempe2020contact] introduce physical laws for foot contact and human dynamics, but their iterative optimization is highly time-consuming (e.g., 30 minutes for a 2s clip). Later works [shimada2020physcap, shimada2021neural, xie2021physics] propose differentiable physical constraints to reduce the time cost, but they only consider foot contact, making them less effective in scenarios with other important contacts (e.g., lying down, sitting on a chair).

Different from optimization-based approaches, physics-simulation-based methods use physics simulators to provide realistic physical constraints. DeepMimic [peng2018deepmimic] imitates various motions from reference mocap data in a physics engine via reinforcement learning. SFV [peng2018sfv] proposes to refine low-fidelity motion data from video-based pose estimation through imitation learning. However, the adopted imitation learning requires days of training for just one clip. Later, SimPoE [yuan2021simpoe] addresses this issue by introducing RFC [yuan2020residual] to effectively reduce the time consumption, training one policy for all motion clips. Our method is built upon SimPoE [yuan2021simpoe] for better generalization and lower time cost. However, different from those methods, which only use physical constraints for post-processing, our method involves them in the learning loop. As such, no mocap data is required for pose estimation training or imitation learning.

Motion synthesis Motion synthesis includes non-learning-based and learning-based approaches. Among non-learning-based approaches, the motion graph method [kovar2008motion] first builds transition edges between different motion points based on their similarity, and then generates new motion data by traversing the graph. Motion matching [buttner2015motion] searches for proper future frames in motion data based on motion states in real time. Among learning-based approaches, motion-prediction-based methods [martinez2017human, zhang2020we, li2018convolutional, gui2018adversarial, barsoum2018hp, yuan2020dlow, pavllo2018quaternet] aim to predict future poses conditioned on previous poses. Action generation [wang2020learning, butepage2017deep, battan2021glocalnet] aims to generate pose sequences conditioned on action labels. Motion completion [hernandez2019human, kaufmann2020convolutional, holden2016deep, duan2021ssmc, harvey2018recurrent, harvey2020robust] generates realistic transitions between key-frames, which is most relevant to our work in its aim. The pose hallucination in our framework also aims to generate novel motion sequences; motion graph and motion matching methods are not applicable here due to the tight restrictions on their generated data. We therefore choose motion completion, considering that it can generate longer sequences from continuously supplied key-frames.

3 Methodology

Given a 2D pose sequence $\mathbf{x} = \{x_t\}_{t=1}^{T}$ of length $T$, where $x_t \in \mathbb{R}^{J \times 2}$ is the 2D spatial coordinate of the $J$ body joints at time $t$, our goal is to estimate the 3D pose sequence $\mathbf{X} = \{X_t\}_{t=1}^{T}$, where $X_t \in \mathbb{R}^{J \times 3}$ is the corresponding 3D joint position under the camera coordinate system. Conventionally, a pose estimator $\mathcal{E}$ with parameters $\theta$ is trained with a large set of paired 2D and 3D pose data through fully-supervised learning approaches [martinez2017simple, mehta2017vnect, sun2018integral, kocabas2020vibe, pavllo2019videopose3d]:

$$\theta^{*} = \arg\min_{\theta} \mathcal{L}\big(\mathcal{E}_{\theta}(\mathbf{x}), \mathbf{X}\big),$$

where $\mathcal{L}$ denotes the loss function, typically defined as the mean square error (MSE) between predicted and ground truth 3D pose sequences. However, ground truth 3D pose data are expensive to capture, which limits the applicability of these approaches. To avoid using 3D data, previous self-supervised approaches typically apply a weak 2D re-projection loss [drover2018can, chen2019unsupervised, yu2021towards, hu2021unsupervised] to learn the estimator:

$$\theta^{*} = \arg\min_{\theta} \mathcal{L}\big(\Pi(\mathcal{E}_{\theta}(\mathbf{x})), \mathbf{x}\big),$$
where $\Pi$ is the perspective projection function. The re-projection loss only provides weak supervision, which tends to induce unstable or unnatural estimations. In this work, we aim to design a self-supervised learning framework whose core is an iterative self-improving paradigm. Specifically, we propose to enhance the current estimation with a specifically designed transformation $\mathcal{T}$ (e.g., to produce smoother and more diverse motion):

$$\tilde{\mathbf{X}} = \mathcal{T}\big(\mathcal{E}_{\theta_{k}}(\mathbf{x})\big).$$
The enhanced estimates $\tilde{\mathbf{X}}$ are then projected to 2D poses to obtain paired training data $(\Pi(\tilde{\mathbf{X}}), \tilde{\mathbf{X}})$, which are used to improve the pose estimator:

$$\theta_{k+1} = \arg\min_{\theta} \mathcal{L}\big(\mathcal{E}_{\theta}(\Pi(\tilde{\mathbf{X}})), \tilde{\mathbf{X}}\big).$$
Here $\theta_{k}$ and $\theta_{k+1}$ denote the parameters of the current estimator and the improved estimator, respectively. The improved estimator can then be utilized to start a new iteration of data enhancement and training. Building on this self-improving paradigm, we can train a superior pose estimator starting from only a set of 2D pose sequences.
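The self-improving paradigm above can be sketched as a short training loop. This is a minimal illustration, not the paper's implementation: the `estimator`, `enhance`, and `fit` callables are placeholders for the learned modules, and the pinhole projection with unit focal length is an assumption.

```python
import numpy as np

def project(X, f=1.0):
    """Perspective projection of 3D joints (T, J, 3) to 2D (T, J, 2).
    Assumes a pinhole camera looking down +z with focal length f."""
    return f * X[..., :2] / X[..., 2:3]

def self_improve(x2d, estimator, enhance, fit, n_rounds=3):
    """Abstract view of the iterative paradigm:
    estimate -> enhance -> project -> retrain, repeated."""
    for _ in range(n_rounds):
        X3d = estimator(x2d)          # low-fidelity 3D estimates
        X3d = enhance(X3d)            # e.g. refinement / diversification
        pairs = (project(X3d), X3d)   # self-generated 2D-3D supervision
        estimator = fit(pairs)        # retrain on the generated pairs
    return estimator
```

In the full framework, `enhance` corresponds to the imitator/hallucinator pipeline and `fit` to supervised training of the estimator on the synthesized pairs.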

Figure 2: Detail of our PoseTriplet framework. Given an available 2D pose sequence $\mathbf{x}$, the pose estimator transforms it to a low-fidelity 3D pose sequence $\hat{\mathbf{X}}$, which then serves as the semantic guidance signal (i.e., reference motion) for the imitator to obtain physically plausible motion $\hat{\mathbf{X}}^{\mathcal{I}}$. The hallucinator then generates novel and diverse motion $\hat{\mathbf{X}}^{\mathcal{H}}$ from $\hat{\mathbf{X}}^{\mathcal{I}}$, which is refined by the imitator again to obtain the final enhanced diverse and plausible motion $\tilde{\mathbf{X}}$. $\tilde{\mathbf{X}}$ is then projected to form 2D-3D pairs to train the estimator. The improved estimator takes the available 2D pose sequence and starts another round of dual-loop optimization.

3.1 PoseTriplet

To construct an effective self-improving framework, we identify two challenges in enhancing the 3D motion sequence: 1) the pose estimates from the estimator may not be physically plausible due to the absence of force, mass and contact modeling; 2) the existing 2D motion may be limited in diversity, so the learned model cannot generalize well. To address these challenges, we introduce a pose imitator based on reinforcement-learning-aided human motion modeling and a pose hallucinator based on generative motion interpolation to refine and diversify the 3D motion, respectively. The former corrects physical artifacts, while the latter generates novel pose sequences based on the existing estimates. We find these two aspects of motion complementary and thus combine them. The resulting pipeline yields 3D motion data with significantly improved physical plausibility and motion diversity. Nevertheless, we find that a naive two-step combination of the two approaches generates inferior-quality 3D pose sequences: performing motion diversification first can be ineffective due to implausible estimates, while conducting it later can introduce physical artifacts. Therefore, we further introduce a dual-loop scheme and unify the two components with the pose estimator into a novel self-supervised framework named PoseTriplet.

Dual-loop architecture Concretely, as shown in Fig. 2, our PoseTriplet introduces a dual-loop architecture to integrate the three modules: a pose estimator $\mathcal{E}_{\theta}$, a pose imitator $\mathcal{I}_{\phi}$, and a pose hallucinator $\mathcal{H}_{\psi}$. Given the set of available 2D pose sequences, the pose estimator first transforms them to low-fidelity 3D pose sequences:

$$\hat{\mathbf{X}} = \mathcal{E}_{\theta}(\mathbf{x}).$$
$\hat{\mathbf{X}}$ is converted to a low-fidelity reference motion and serves as the semantic guidance signal to the pose imitator, which imposes physical human motion dynamics modeling and obtains a physically plausible motion sequence:

$$\hat{\mathbf{X}}^{\mathcal{I}} = \mathcal{I}_{\phi}(\hat{\mathbf{X}}).$$
By learning a generative motion completion model, the pose hallucinator then generates novel and diverse motion sequences based on the improved plausible motion from the imitator:

$$\hat{\mathbf{X}}^{\mathcal{H}} = \mathcal{H}_{\psi}(\hat{\mathbf{X}}^{\mathcal{I}}).$$
Afterwards, instead of closing the loop by treating $\hat{\mathbf{X}}^{\mathcal{H}}$ as augmented data for the estimator, we introduce another loop. We feed $\hat{\mathbf{X}}^{\mathcal{H}}$ back into the imitator to correct the induced physical artifacts and obtain the final expected plausible and diverse motion sequences:

$$\tilde{\mathbf{X}} = \mathcal{I}_{\phi}(\hat{\mathbf{X}}^{\mathcal{H}}).$$
$\tilde{\mathbf{X}}$ is then projected to 2D to obtain paired data $(\Pi(\tilde{\mathbf{X}}), \tilde{\mathbf{X}})$ for training the pose estimator.

By jointly optimizing this dual-loop architecture, the three components form a tight co-evolving paradigm: 1) the estimator benefits from the diverse and plausible augmented data to learn more accurate estimation; 2) the imitator learns more robust and physically natural motion based on the improved estimation and the diverse data generated by the hallucinator; 3) the hallucinator generates diverse pose sequences of higher quality based on the improved data from the imitator.
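Structurally, one round of the dual loop is simply a composition of the three modules, with the imitator applied twice. A minimal sketch, with the modules stubbed out as callables:

```python
def dual_loop_enhance(x2d, estimator, imitator, hallucinator):
    """One round of the dual loop: estimate, physically refine,
    diversify, then refine again before the result is used as
    training supervision. The callables are placeholders for the
    learned estimator/imitator/hallucinator modules."""
    X_raw = estimator(x2d)         # loop 1: kinematic 3D estimate
    X_phys = imitator(X_raw)       # loop 1: enforce physical plausibility
    X_div = hallucinator(X_phys)   # loop 2: hallucinate diverse motion
    return imitator(X_div)         # loop 2: remove hallucination artifacts
```

This makes the ordering constraint explicit: hallucination happens only after a first physical refinement, and its output is refined once more before training.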

Loop starting Another challenging aspect of this self-improving learning paradigm is starting the loop. Without access to 3D motion data, the whole framework cannot start learning. Recalling that our pose imitator employs a physics-based human motion model, we develop a zero-data generation strategy that produces initial 3D pose sequences for starting the dual-loop learning. Specifically, we generate a root trajectory in the horizontal plane with random direction and proper velocity. This trajectory is then used as the guidance signal for the RL agent. By controlling the agent to follow the generated trajectory, we obtain motion sequences that are physically plausible. These motion sequences are then projected to obtain 2D-3D pose pairs and used to train an initial pose estimator. In this way, the whole dual-loop learning can be started.
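A trajectory of the kind described above can be sketched as a random walk in heading with a fixed walking speed. The speed range, turn noise, and frame rate below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def random_root_trajectory(T=100, dt=1/30, v_range=(0.5, 1.5),
                           turn_std=0.1, seed=0):
    """Generate a smooth 2D root trajectory in the horizontal plane:
    a random initial heading that drifts slowly, with a constant
    speed drawn from a rough walking-pace range (m/s)."""
    rng = np.random.default_rng(seed)
    heading = rng.uniform(0, 2 * np.pi)   # random initial direction
    speed = rng.uniform(*v_range)         # constant speed for the clip
    pos = np.zeros((T, 2))
    for t in range(1, T):
        heading += rng.normal(0, turn_std)   # small random turns
        step = speed * dt * np.array([np.cos(heading), np.sin(heading)])
        pos[t] = pos[t - 1] + step
    return pos
```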

3.2 Module detail

3.2.1 Pose estimator

The pose estimator estimates the 3D pose sequence from the input 2D sequence. Specifically, we adopt an estimator architecture similar to VideoPose [pavllo2019videopose3d], which predicts both the root trajectory and the root-relative joint locations. The trajectory can be used as an additional movement signal for the pose imitator. Meanwhile, the noise in the root movement can be corrected by the pose imitator, which in turn helps the pose estimator. We use a Mean Square Error (MSE) loss for the root-relative pose estimation and a weighted L1 loss for the trajectory estimation, following [pavllo2019videopose3d].
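As a rough sketch of these two losses: MSE over root-relative joints, and an L1 trajectory error weighted per frame. The inverse-depth weighting below follows the trajectory loss of VideoPose3D, where distant roots (whose absolute position is harder to observe) are down-weighted; `eps` is an illustrative stabilizer.

```python
import numpy as np

def pose_loss(pred, gt):
    """MSE over root-relative joint positions, arrays of shape (T, J, 3)."""
    return np.mean((pred - gt) ** 2)

def traj_loss(pred_traj, gt_traj, eps=1e-6):
    """L1 root-trajectory error, weighted by inverse ground-truth depth
    (z component) so that distant frames contribute less."""
    w = 1.0 / (np.abs(gt_traj[..., 2:3]) + eps)
    return np.mean(w * np.abs(pred_traj - gt_traj))
```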

Projection for training estimator Given the generated motion sequence data, we project them to 2D to obtain paired training data. We consider two strategies for the projection: 1) Heuristic random projection. We set a virtual camera with certain elevation and azimuth ranges, and height and distance ranges that match the indoor capture environment. This is similar to the projection strategy for 3D pose data synthesis of Chen et al. [chen2016synthesizing]; 2) Generative-adversarial-learning-based projection [gong2021poseaug]. A generator regresses the camera orientation and position for each motion sequence. The regression is learned through a discriminator that distinguishes real 2D pose sequences from ones projected with the generated camera parameters. In this way, a reasonable camera viewpoint distribution can be extracted from real 2D pose data, improving the plausibility of the generated 2D-3D paired data. The two strategies are combined in our framework to ensure the diversity of camera viewpoints.
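The heuristic random projection can be sketched as sampling a camera pose within fixed ranges and applying a pinhole projection. All ranges and the focal length below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def random_camera(rng, elev_range=(-10, 10), azim_range=(0, 360),
                  height_range=(0.8, 1.6), dist_range=(3.0, 5.0)):
    """Sample a virtual camera (rotation, translation) within ranges
    meant to mimic an indoor capture setup (angles in degrees)."""
    elev = np.deg2rad(rng.uniform(*elev_range))
    azim = np.deg2rad(rng.uniform(*azim_range))
    Ry = np.array([[np.cos(azim), 0, np.sin(azim)],
                   [0, 1, 0],
                   [-np.sin(azim), 0, np.cos(azim)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(elev), -np.sin(elev)],
                   [0, np.sin(elev), np.cos(elev)]])
    R = Rx @ Ry                       # azimuth then elevation
    t = np.array([0.0, rng.uniform(*height_range), rng.uniform(*dist_range)])
    return R, t

def project_points(X, R, t, f=1.0):
    """Transform world-frame joints (..., 3) into the camera frame
    and apply perspective division."""
    Xc = X @ R.T + t
    return f * Xc[..., :2] / Xc[..., 2:3]
```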

3.2.2 Pose imitator

The 3D pose sequences predicted by the pose estimator, due to the lack of physical constraints, suffer from unnatural artifacts such as foot skating, floating, and floor penetration. These artifacts prevent them from being used directly as training data for the estimator or hallucinator. To address this issue, motivated by [peng2018deepmimic, peng2018sfv, yuan2021simpoe], we introduce a reinforcement-learning-based pose imitator that imitates the low-fidelity 3D pose sequences from the pose estimator to generate more physically plausible motion sequences.


The imitation process can be seen as a Markov decision process. Given a reference motion and the current state $s_t$, the agent interacts with the simulation environment with action $a_t$ and receives reward $r_t$. The action is determined by a policy $\pi(a_t \mid s_t)$ conditioned on the state $s_t$; the reward is determined by how similarly the agent behaves to the reference motion. When an action is taken, the current state changes to the next state $s_{t+1}$ through the transition function $\mathcal{P}(s_{t+1} \mid s_t, a_t)$. The goal is to learn a policy that maximizes the average cumulative reward $\mathbb{E}\big[\sum_{t} \gamma^{t} r_t\big]$ (i.e., performing similar behavior in the physics simulator as the reference motion), where $\gamma$ is the discounting factor. The state, action and rewards are detailed below.

State The state includes the current pose, the current velocity, and the target pose from the reference motion. To deal with the noisy reference motion from the pose estimator, we introduce an extra encoded feature by concatenating and fusing past and future motion information. In this way, the control policy is aware of the past and future reference motion, and is thus more robust to the noise.

Action The action involves two kinds of forces: internal force and external force. The internal force is applied by actuators on the non-root joints (e.g., elbow, knee). Following previous work [peng2017learning], we use PD (proportional-derivative) control for the internal force. The external force is a virtual force applied on the root joint (i.e., hip) [yuan2020residual] for extra interaction (e.g., sitting on a chair) and is regressed by the policy network.
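For reference, PD control computes a torque from the gap between the current and target joint angles, damped by the joint velocity. The gains below are illustrative; in practice they are tuned per joint:

```python
def pd_torque(q, qd, q_target, kp=300.0, kd=30.0):
    """Proportional-derivative control: drive joint angle q toward
    q_target (proportional term) while damping velocity qd
    (derivative term). Works elementwise on scalars or arrays."""
    return kp * (q_target - q) - kd * qd
```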

Rewards The rewards measure the motion differences between the agent and the reference motion. These differences cover pose-related factors (pose, velocity), root-related factors (root height, root velocity) and body-end factors (position, velocity). Besides, a regularization loss on the virtual force is applied to avoid unnecessary external force, following [yuan2020residual]. As we find the agent struggles to move under the above setting due to the noisy reference motion, we further introduce the feet's relative position into the motion characteristics to enhance the feet motion.
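Imitation rewards of this family are commonly built DeepMimic-style: each error term is mapped through exp(-scale * error), giving a value in (0, 1], and the terms are combined with a weighted sum. A minimal sketch with illustrative term names, weights, and scales (not the paper's exact reward):

```python
import numpy as np

def imitation_reward(agent, ref, weights=(0.5, 0.3, 0.2),
                     scales=(2.0, 0.1, 10.0)):
    """DeepMimic-style reward: squared errors between agent and
    reference for pose, velocity and end-effector terms, each
    exponentiated and combined with a weighted sum."""
    errs = [np.sum((agent[k] - ref[k]) ** 2) for k in ("pose", "vel", "end")]
    terms = [np.exp(-s * e) for s, e in zip(scales, errs)]
    return float(sum(w * t for w, t in zip(weights, terms)))
```

With weights summing to one, perfect imitation yields a reward of exactly 1.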

3.2.3 Pose hallucinator

The pose hallucinator aims to generate novel and diverse motion sequences based on the refined data from the pose imitator. In this work, we adopt a motion interpolation technique to generate novel pose motions. Specifically, we sample key-frames from the refined pose sequence and interpolate the missing frames via neural networks to generate new motion data. In detail, the pose hallucinator is built on a recurrent neural network (RNN) structure. The inputs are the sampled temporal key-frames (sampled at a fixed frame interval). Conditioned on these sampled key-frames, the model predicts the intermediate frames in a sequential manner. A reconstruction loss and an adversarial loss are used to train this model. The reconstruction loss measures the distance between the ground truth and predicted poses. The adversarial loss provides temporal supervision to avoid RNN collapse (i.e., predicting average motion). At inference, we randomly select frames from different motion clips and generate novel motion sequences based on these sampled key-frames.
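The key-frame sampling and completion step can be illustrated with a linear fill standing in for the RNN completion model; the `interval` value is arbitrary:

```python
import numpy as np

def keyframe_interpolate(seq, interval=8):
    """Keep every `interval`-th frame of seq (T, D) as a key-frame,
    then fill the missing frames. Linear interpolation here is a
    stand-in for the learned RNN completion model."""
    T = seq.shape[0]
    keys = np.arange(0, T, interval)
    out = np.empty_like(seq)
    for a, b in zip(keys[:-1], keys[1:]):
        for t in range(a, b + 1):
            w = (t - a) / (b - a)
            out[t] = (1 - w) * seq[a] + w * seq[b]
    out[keys[-1]:] = seq[keys[-1]]   # hold the last key-frame
    return out
```

Drawing the key-frames from different clips (as done at inference) is what turns this completion step into a source of genuinely novel sequences.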

4 Experiments

We study three questions in our experiments. 1) Is PoseTriplet able to improve the performance of a video pose estimator in both intra- and cross-dataset scenarios? 2) How does the performance improve with the rounds of the co-evolving process? 3) How does the amount of training data affect model performance? We conduct experiments with H36M (source dataset) and 3DHP/3DPW (for cross-dataset evaluation). Throughout the experiments, we adopt VideoPose [pavllo2019videopose3d] (T=27) as our pose estimator, and report results from the estimator for comparison. Please refer to the supplementary material for more implementation details.

4.1 Dataset

H36M [ionescu2014human3] is the most popular 3D pose benchmark, captured by a marker-based motion capture system. It contains 3.6 million video frames covering 11 subjects and 15 scenarios. Following previous works [chen2019unsupervised, yu2021towards], we use the 2D poses of subjects S1, S5, S6, S7 and S8 as our training set and evaluate performance on S9 and S11. Two standard metrics, Mean Per Joint Position Error (MPJPE) in millimeters and Procrustes-Aligned Mean Per Joint Position Error (PA-MPJPE), are used for evaluation.
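For concreteness, the two metrics can be computed as below: MPJPE is the mean per-joint Euclidean distance, and PA-MPJPE first fits a similarity transform (scale, rotation, translation) from prediction to ground truth via SVD before measuring the residual. This is a standard implementation sketch, not code from the paper:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error over (J, 3) arrays."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment: best similarity transform
    mapping pred onto gt, then the residual joint error."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g            # center both point sets
    U, S, Vt = np.linalg.svd(P.T @ G)
    R = (U @ Vt).T                           # optimal rotation (Kabsch)
    if np.linalg.det(R) < 0:                 # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = (U @ Vt).T
    s = S.sum() / (P ** 2).sum()             # optimal scale (Umeyama)
    return mpjpe(s * P @ R.T + mu_g, gt)
```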

3DHP [mehta2017vnect] is a large 3D pose dataset. It contains both indoor and outdoor scenarios. Following previous works [kolotouros2019spin, chen2019unsupervised], we report the metrics of MPJPE, Percentage of Correct Keypoints (PCK) and Area Under the Curve (AUC) after scale and rigid alignment for evaluation. We only use its test set to evaluate the model’s generalization performance.

3DPW [vonMarcard2018] is a more challenging in-the-wild dataset containing more complicated activities and scenarios. As with 3DHP, we only use its test set to evaluate the model's generalization performance. Following previous work [kolotouros2019spin], we report MPJPE and PA-MPJPE for 3DPW.

4.2 Quantitative results

Results on H36M We compare our PoseTriplet with other state-of-the-art self-supervised methods [rhodin2018unsupervised, chen2019unsupervised, kundu2020self, yu2021towards, hu2021unsupervised] under the GT (ground truth 2D poses) and Det (detected 2D poses) settings, as shown in Table 1. Among them, [rhodin2018unsupervised, chen2019unsupervised, kundu2020self] implement weak supervision (i.e., consistency supervision), while [yu2021towards, hu2021unsupervised] utilize temporal information through adversarial learning [yu2021towards] and smoothness constraints [hu2021unsupervised]. Our method outperforms the best of them by a large margin in MPJPE under both the GT (85.3 vs. 68.2) and Det (82.1 vs. 78.0) settings. The results verify that our method, with its co-evolving strategy and augmented supervision, performs better than previous approaches. Moreover, our method also outperforms several weakly-supervised approaches [wu2016single, tung_aign, li2019boosting, iqbal2020weakly] that involve ground truth data during training. In particular, compared with Li et al. [li2019boosting], which implements a low-rank representation and temporal smoothing for pseudo 3D label generation, our approach leverages the physics simulator to provide better refinement and outperforms [li2019boosting] by a large margin in MPJPE (88.8 vs. 78.8), even though the latter uses ground truth data (i.e., subject S1). This verifies the effectiveness of our co-evolving strategy in reducing reliance on 3D data.

Mode Method GT Det
P1 (↓) P2 (↓) P1 (↓) P2 (↓)
Full Martinez et al. [martinez2017simple] 45.5 37.1 62.9 47.7
Full Pavllo et al. [pavllo2019videopose3d] 37.2 27.2 46.8 36.5
Weak 3DInterpreter [wu2016single] - 88.6 - 98.4
Weak AIGN [tung_aign] - 79.0 - 97.4
Weak Drover et al. [drover2018can] - 38.2 - 64.6
Weak Li et al. [li2019boosting] - - 88.8 66.5
Weak Umar et al. [iqbal2020weakly] - - - 55.9
Self Rhodin et al. [rhodin2018unsupervised]† - - 131.7 98.2
Self Chen et al. [chen2019unsupervised] - 51.0 - 68.0
Self Kundu et al. [kundu2020self] - - - 62.4
Self Kundu et al. [kundu2020kinematic] - - - 63.8
Self Yu et al. [yu2021towards] 85.3 42.0 92.4 52.3
Self Hu et al. [hu2021unsupervised] - - 82.1 -
Self Wandt et al. [wandt2021canonpose]† - - 81.9 53.0
Self Ours 68.2 45.1 78.0 51.8
Table 1: Results on H36M in terms of MPJPE (P1) and PA-MPJPE (P2). † uses a multi-view setting. Best results are shown in bold under the self-supervised setting.

Results on 3DHP We then evaluate the generalization performance of our method on the cross-dataset 3DHP benchmark. We compare our PoseTriplet with state-of-the-art methods, including fully supervised, weakly supervised, and self-supervised approaches [mehta2017vnect, sun2019human, kolotouros2019spin, chen2019unsupervised, kundu2020kinematic, kundu2020self, yu2021towards]. As shown in Table 2, under cross-dataset evaluation, our method significantly outperforms previous self-supervised methods [chen2019unsupervised, kundu2020self, yu2021towards] in PCK (82.2 vs. 89.1) and MPJPE (103.8 vs. 79.5). The results indicate that the diverse and plausible motion generated by our PoseTriplet improves generalization. The exception is Kundu et al. [kundu2020self], which uses extra data and unpaired 3D poses for model training and thus achieves slightly better performance in AUC (56.3 vs. 53.1). Our method also outperforms self-supervised methods [chen2019unsupervised, kundu2020kinematic, kundu2020self, yu2021towards] trained on the 3DHP dataset directly. In addition, our method achieves better performance than weakly supervised approaches [sun2019human, kolotouros2019spin] in all metrics, even though they use unpaired images and 3D poses for supervision during training. Notably, our method even achieves performance comparable to fully supervised approaches [mehta2017vnect, sun2019human, kolotouros2019spin]. In summary, the cross-dataset performance of our self-supervised PoseTriplet framework is comparable with intra-dataset results from full/semi-supervision, indicating good generalization.

Mode Method CE PCK (↑) AUC (↑) MPJPE (↓)
Full VNect [mehta2017vnect] 83.9 47.3 98.0
Full HMR [sun2019human] 86.3 47.8 89.8
Full SPIN [kolotouros2019spin] 92.5 55.6 67.5
Weak HMR [sun2019human] 77.1 40.7 113.2
Weak SPIN [kolotouros2019spin] 87.0 48.5 80.4
Self Chen et al. [chen2019unsupervised] 71.1 36.3 -
Self Kundu et al. [kundu2020kinematic] 80.2 44.8 97.1
Self Kundu et al. [kundu2020self] 84.6 60.8 93.9
Self Yu et al. [yu2021towards] 86.2 51.7 -
Self Chen et al. [chen2019unsupervised] ✓ 64.3 31.6 -
Self Kundu et al. [kundu2020self] ✓ 82.1 56.3 103.8
Self Yu et al. [yu2021towards] ✓ 82.2 46.6 -
Self Ours ✓ 89.1 53.1 79.5
Table 2: Results on 3DHP in terms of PCK, AUC, and MPJPE. CE denotes cross-data evaluation. Kundu et al. [kundu2020self] uses an extra unpaired 2D/3D dataset for training. Best results are shown in bold.
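The 3DHP metrics can be sketched similarly. PCK counts the fraction of joints within a 150 mm error threshold, and AUC averages PCK over a sweep of thresholds from 0 to 150 mm; the sketch below is an illustration of that convention (the exact step size is an assumption and may differ from the official 3DHP evaluation script).

```python
import numpy as np

def pck(pred, gt, thresh=150.0):
    """PCK: percentage of joints whose 3D error is below `thresh` mm.
    pred, gt: (..., J, 3) joint coordinates in millimetres."""
    err = np.linalg.norm(pred - gt, axis=-1)
    return 100.0 * (err < thresh).mean()

def auc(pred, gt, max_thresh=150.0, steps=31):
    """AUC: PCK averaged over thresholds swept from 0 to `max_thresh` mm."""
    return float(np.mean([pck(pred, gt, t)
                          for t in np.linspace(0.0, max_thresh, steps)]))
```

Because AUC rewards small errors at every threshold, a method can lead in PCK at 150 mm yet trail in AUC, which explains the mixed ordering between our method and Kundu et al. [kundu2020self] in Table 2.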

Results on 3DPW We further evaluate the generalization performance of our method on the in-the-wild 3DPW dataset. Note that few works have been evaluated on 3DPW under the self-supervised cross-dataset setting; we therefore compare directly against supervised approaches [kolotouros2019spin, pavllo2019videopose3d, sun2019human, wang2020predicting]. As shown in Table 3, our method achieves results comparable to the fully supervised baselines without relying on any 3D data, demonstrating that it performs well in complicated and challenging in-the-wild scenarios.

Mode Method CE MPJPE (↓) P-MPJPE (↓)
Full Wang et al.[wang2020predicting] 124.2 -
Full DSD-SATN [sun2019human] - 69.5
Full CRMH [jiang2020coherent] 105.3 62.3
Full BMP [zhang2021bmp] 104.1 63.8
Full VideoPose [pavllo2019videopose3d] 101.8 63.0
Self Ours 115.0 69.5
Table 3: Results on 3DPW in terms of MPJPE and PA-MPJPE. CE denotes cross-data evaluation.

4.3 Qualitative results

While previous self-supervised methods rely on weak supervision signals (e.g., consistency loss), our method trains the pose estimator with augmented supervision from self-generated data, resulting in more stable, plausible, and accurate estimation (Figs. 3-8 are video figures best viewed in Adobe Reader, click and play; the videos are included in the supplementary materials). As shown in Fig. 3, although Hu et al. [hu2021unsupervised] apply a temporal-smoothness prior during training, jittering is still obvious, while our result, learned through the co-evolving approach, is much smoother. Yu et al. [yu2021towards] introduce a scale-estimation strategy for 2D poses to reduce scale ambiguity; even with weak supervision from bone-length consistency and the scale distribution, their result still contains scale ambiguity (i.e., the body size varies), as shown in Fig. 4, whereas ours remains stable and accurate in terms of body size. We further show results on 3DHP (Fig. 5) and 3DPW (Fig. 6), demonstrating that our method performs well on unseen poses in in-the-wild scenarios. More in-the-wild examples can be viewed in the supplementary material in video format.


Figure 3: Results on UID [hu2021unsupervised] compared with Hu et al. [hu2021unsupervised]: input (left), ours (middle), Hu et al. [hu2021unsupervised] (right).


Figure 4: Results on H36M compared with Yu et al. [yu2021towards]: input (left), ours (middle), Yu et al. [yu2021towards] (right). The red skeleton is the prediction; the green skeleton is the ground truth.

4.4 Ablation study

4.4.1 Ablation on rounds of co-evolution

We then analyze how the number of co-evolution rounds improves the performance of each component (estimator, imitator, hallucinator). To measure this, we select an evaluation metric for each component. For the estimator, we evaluate the trained model on the H36M test set and report MPJPE. For the imitator, we evaluate the trained policy on GT 3D reference motion (H36M) and count the number of terminations (e.g., falling down). For the hallucinator, we evaluate the trained model on GT 3D data (the Walking scenario [harvey2020robust] in H36M) for intermediate pose completion, measuring the MPJPE of pose and root position. We also include an oracle, obtained by training each model directly on GT data, shown in the last row of Table 4. Through iterative co-evolution, the performance of the estimator, imitator, and hallucinator improves and approaches that of the GT-trained models. We further provide visualization results for the imitator (Fig. 7) and the hallucinator (Fig. 8), showing that the imitator and hallucinator co-evolved by our PoseTriplet, without using any 3D data, achieve performance comparable to the oracle trained with GT 3D data.

Round Estimator P1 (pose) Imitator Termination Num. Hallucinator P1 (pose) Hallucinator P1 (root)
0 193.6 - - -
1 112.2 928 - -
2 77.8 280 71.4 62.6
3 68.2 132 67.3 54.0
Oracle 37.2 81 53.0 33.7
Table 4: Results of co-evolution for the estimator, imitator, and hallucinator. Round 0 is the start of the loop; the hallucinator is introduced after round one to ensure the quality of the initial pose estimation.
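To make the round schedule in Table 4 concrete, the dual-loop co-evolution can be summarized in pseudocode. All names below are illustrative stand-ins, not the authors' API; the schedule follows the caption above, with the hallucinator joining only after round one.

```
# One training run of PoseTriplet (illustrative pseudocode)
for round in 0, 1, ..., N:
    rough3d   <- estimator(pose2d)            # Loop 1: low-fidelity 3D estimate
    refined3d <- imitator.imitate(rough3d)    # RL imitator enforces physical constraints
    pairs     <- project(refined3d)           # self-generated 2D-3D training pairs
    if round > 1:                             # hallucinator joins after round one
        dreamed3d <- hallucinator(refined3d)  # Loop 2: even more diverse motions
        pairs     <- pairs + project(imitator.imitate(dreamed3d))
        retrain hallucinator on refined3d
    retrain estimator on pairs                # augmented supervision, no given 3D data
    retrain imitator  on refined3d            # refined poses as reference motion
```

Each round thus enlarges and diversifies the pool of physically plausible 2D-3D pairs, which is consistent with the monotonic improvement of all three components across rounds in Table 4.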


Figure 5: Result on 3DHP compared with Ground Truth. The figure includes: input (left), ours (middle), ground truth (right).


Figure 6: Result on 3DPW compared with Ground Truth. The figure includes: input (left), ours (middle), ground truth (right).


Figure 7: Results of co-evolution for the imitator: video source (left), our co-evolved result (middle), oracle trained with ground-truth data (right).


Figure 8: Results of co-evolution for the hallucinator: ground truth (left), our co-evolved result (middle), oracle trained with ground-truth data (right).

4.4.2 Ablation on amount of data usage

To study how the amount of data affects performance, we conduct an ablation experiment with limited 2D pose data. As shown in Table 5, we gradually add more subjects' data to our method (i.e., S1, S1+S5, S1+S5+S6+S7+S8). The results show that the performance of PoseTriplet improves steadily as more 2D pose data is added, in both intra- and cross-dataset scenarios.

Mode Sub H36M 3DHP 3DPW
Self S1 89.2 94.0 135.8
Self S1,S5 81.9 83.5 128.6
Self S1,S5,S6,S7,S8 68.2 79.5 115.0
Table 5: Ablation on the amount of training data, in terms of MPJPE.

5 Conclusion

In this work we present PoseTriplet, a novel framework for self-supervised 3D pose estimation, achieved through a co-evolution strategy over a pose estimator, an imitator, and a hallucinator. These three components complement and strengthen one another through a dual-loop training procedure. The framework enables generating diverse and plausible motion data, which in turn helps train a superior pose estimator. Experiments on various benchmarks demonstrate that PoseTriplet yields encouraging results: it outperforms state-of-the-art self-supervised approaches and even competes with fully supervised ones.

Limitations The major limitation is that our pipeline suffers from low training efficiency; e.g., it takes 7 days to train for 3 rounds on a machine with an Intel Xeon Gold 6278C CPU and a Tesla T4 GPU. The reason is that the imitator is implemented with CPU-based reinforcement learning (RL) and the hallucinator is instantiated with an RNN architecture. In the future, we will explore GPU-based RL implementations and more efficient hallucinator architectures (e.g., transformers) to speed up training.

Acknowledgement This project is supported by NUS Faculty Research Committee Grant (WBS: A-0009440-00-00) and NUS Advanced Research and Technology Innovation Centre (Project Reference ECT-RP2). Kehong would like to thank Ye Yuan for the discussion.