PoseAug: A Differentiable Pose Augmentation Framework for 3D Human Pose Estimation

by   Kehong Gong, et al.
National University of Singapore

Existing 3D human pose estimators generalize poorly to new datasets, largely due to the limited diversity of 2D-3D pose pairs in the training data. To address this problem, we present PoseAug, a new auto-augmentation framework that learns to augment the available training poses towards a greater diversity and thus improves generalization of the trained 2D-to-3D pose estimator. Specifically, PoseAug introduces a novel pose augmentor that learns to adjust various geometry factors (e.g., posture, body size, view point and position) of a pose through differentiable operations. With such differentiable capacity, the augmentor can be jointly optimized with the 3D pose estimator and take the estimation error as feedback to generate more diverse and harder poses in an online manner. Moreover, PoseAug introduces a novel part-aware Kinematic Chain Space for evaluating local joint-angle plausibility and develops a discriminative module accordingly to ensure the plausibility of the augmented poses. These designs enable PoseAug to generate more diverse yet plausible poses than existing offline augmentation methods, and thus yield better generalization of the pose estimator. PoseAug is generic and easy to apply to various 3D pose estimators. Extensive experiments demonstrate that PoseAug brings clear improvements on both intra-scenario and cross-scenario datasets. Notably, it achieves 88.6% 3D PCK on MPI-INF-3DHP under the cross-dataset evaluation setup, improving upon the previous best data augmentation based method by 9.1%. Code is available at https://github.com/jfzhang95/PoseAug.




1 Introduction

(a)     Source dataset: H36M
(b)     Cross dataset: 3DHP
Figure 1: Estimation error (in MPJPE) on H36M (intra-dataset evaluation) and 3DHP (cross-dataset evaluation) of four well established models [zhao2019semantic, martinez2017simple, pavllo2019videopose3d, cai2019exploiting] trained with and without PoseAug. PoseAug significantly improves their performance for both the intra- and cross-dataset settings.

3D human pose estimation aims to estimate 3D body joints in images or videos. It is a fundamental task with broad applications in action recognition [yan2018spatial, si2019attention], human-robot interaction [errity2016human], human tracking [mehta2017vnect], etc. This task is typically solved using learning-based methods [martinez2017simple, zhao2019semantic, cai2019exploiting, nie2019spm] with ground truth annotations that are collected in laboratory environments [ionescu2014human3]. Despite their success in indoor scenarios, these methods hardly generalize to cross-scenario datasets (e.g., an in-the-wild dataset). We argue that their poor generalization is mainly due to the limited diversity of training data, such as limited variations in human posture, body size, camera view point and position.

Recent works explore data augmentation to improve the training data diversity and enhance the generalization of their trained models. They either generate data through image composition [rogez2016mocap, mehta2017vnect, singleshotmultiperson2018] and synthesis [chen2016synthesizing, varol2017learning], or directly generate 2D-3D pose pairs from the available training data by applying pre-defined transformations [Li_2020_CVPR]. However, all of these works regard data augmentation and model training as two separate phases, and conduct data augmentation in an offline manner without interaction with model training. Consequently, they tend to generate ineffective augmented data that are too easy for model training, leading to marginal boost to the model generalization. Moreover, these methods heavily rely on pre-defined rules such as joint angle limitations [akhter2015joint_angle_limit] and kinematics constraints [rogez2016mocap] for data augmentation, which limit the diversity of the generated data and make the resulting model hardly generalize to more challenging in-the-wild scenes.

To improve the diversity of augmented data, we propose PoseAug, a novel auto-augmentation framework for 3D human pose estimation. Instead of conducting data augmentation and network training separately, PoseAug jointly optimizes the augmentation process with network training end-to-end in an online manner. Our main insight is that the feedback from the training process can be used as effective guidance signals to adapt and improve the data augmentation. Specifically, PoseAug exploits a differentiable augmentation module (the ‘augmentor’) implemented by a neural network to directly augment 2D-3D pose pairs in the training data. Considering the potential domain shift with respect to geometry in pose pairs (e.g., postures, view points) [rhodin2018unsupervised, Li_2020_CVPR, zhang2020inference], the augmentor learns to perform three types of augmentation operations to respectively control 1) the skeleton joint angle, 2) the body size, and 3) the view point and human position. In this way, the augmentor is able to produce augmented poses with more diverse geometric features and thus relieves the diversity limitation issue. With its differentiable capacity, the augmentor can be optimized together with the pose estimator end-to-end via an error feedback strategy. Concretely, by taking an increase in the training loss of the estimator as its learning target, the augmentor can learn to enrich the input pose pairs by enlarging data variations and difficulties; in turn, by combating such increasing difficulties, the pose estimator becomes increasingly powerful during the training process.

To ensure the plausibility of the augmented poses, we use a pose discriminator module to guide the augmentation, to avoid generating implausible joint angles [akhter2015joint_angle_limit], unreasonable positions or view points that may hamper model training. In particular, the module consists of a 3D pose discriminator for enhancing the joint angle plausibility and a 2D pose discriminator for guiding the body size, view point and position plausibility. The 3D pose discriminator adopts the Kinematic Chain Space (KCS) [wandt2019repnet] representation and extends it into a part-aware KCS for local-wise supervision. More concretely, it splits skeleton joints into several parts and focuses on joint angles in each part separately instead of the whole body pose, which yields greater flexibility of the augmented poses. By jointly training the pose augmentor, estimator and discriminator in an end-to-end manner (Fig. 2), PoseAug can largely improve the training data diversity, and thus boost model performance on both source and more challenging cross-scenario datasets.

Our PoseAug framework is flexible regarding the choice of the 3D human pose estimator. This is demonstrated by the clear improvements made with PoseAug on four representative 3D pose estimation models [zhao2019semantic, martinez2017simple, pavllo2019videopose3d, cai2019exploiting] over both source (H36M) [ionescu2014human3] and cross-scenario (3DHP) [mehta2017vnect] datasets (Fig. 1). Remarkably, it brings more than 13.1% average improvement w.r.t. MPJPE for all models on 3DHP. Moreover, it achieves 88.6% 3D PCK on 3DHP under cross-dataset evaluation setup, improving upon the previous best data augmentation based method [Li_2020_CVPR] by 9.1%.

Our contributions are three-fold. 1) To the best of our knowledge, we are the first to investigate differentiable data augmentation on 3D human pose estimation. 2) We propose a differentiable pose augmentor, together with the error feedback design, which generates diverse and realistic 2D-3D pose pairs for training the 3D pose estimator, and largely enhances the model’s generalization ability. 3) We propose a new part-aware 3D discriminator, which enlarges the feasible region of augmented poses via local-wise supervision, ensuring both data plausibility and diversity.

Figure 2: Overview of our PoseAug framework. The augmentor, estimator and discriminator are jointly trained end-to-end with an error-feedback training strategy. As such, the augmentor learns to augment data with guidance from the estimator and discriminator.

2 Related Work

3D human pose estimation Recent progress of 3D human pose estimation is largely driven by the deployment of various deep neural network models [tekin2016direct, martinez2017simple, fang2018learning, zhao2019semantic, nibali20193d, cai2019exploiting, sharma2019monocular, zhou2019hemlets]. However, they all rely heavily on well-annotated data for fully-supervised model training and hardly generalize to new scenarios that present patterns unseen in the training dataset, such as new camera views and subject poses. Thus some recent works explore leveraging external information to improve their generalization ability. For example, some methods [zhou2017towards, yang20183d, dabral2018learning, wandt2019repnet, inthewild3d_2019, wang2019generalizing, chen2019weakly, pavllo2019videopose3d, kolotouros2019spin] utilize 2D pose data collected in the wild for model training, e.g., through exploring kinematics priors for regularization or post-processing [zhou2017towards, dabral2018learning, pavllo2019videopose3d], and adversarial training [yang20183d, wandt2019repnet]. More recently, geometry-based self-supervised learning [rhodin2018unsupervised, drover2018can, chen2019unsupervised, kocabas2019epipolar, pirinen2019domes, ligeometry, rhodin2019neural] has been used to train models with unlabeled data. Though effective, applying these methods is largely constrained by the availability of suitable external datasets. Instead of focusing on complex network architectures and learning schemes, we explore a learnable pose augmentation framework to enrich the 3D pose data at hand directly. Specifically, the proposed framework can generate 2D-3D pose pairs with both diversity and plausibility for training pose estimation models. In addition, our framework is generic and can be applied to those methods to further improve their performance.

Data augmentation on 3D human poses Data augmentation is widely used to alleviate the bottleneck of training data diversity and improve model generalization ability. Some works augment data by stitching image patches [rogez2016mocap, mehta2017vnect, zhang2021bmp], and some generate new data with graphics engines [chen2016synthesizing, varol2017learning]. More recently, Li et al.[Li_2020_CVPR] directly augment 2D-3D pose pairs through randomly applying partial skeleton recombination and joint angle perturbation on source datasets. To ensure data plausibility, several constraints are imposed, including joint angle limitation [akhter2015joint_angle_limit] and fixed augmentation range on view point and human position. Despite the good results on source data, these pre-defined rules limit the data diversity expansion and harm the model applicability to more challenging in-the-wild scenarios. Unlike all these methods, we make the first attempt to explore learnable data augmentation on 3D human pose estimation, which is shown effective for improving model generalization ability.

3 Method

3.1 Problem Definition

Let $x \in \mathbb{R}^{2 \times J}$ denote the 2D spatial coordinates of the $J$ keypoints of the human in the image, and $X \in \mathbb{R}^{3 \times J}$ denote the corresponding 3D joint positions in the camera coordinate system. We aim to obtain a 3D pose estimator $\mathcal{P}: x \mapsto X$ to recover the 3D pose information from the input 2D pose. Conventionally, the estimator $\mathcal{P}$, with parameters $\theta$, is trained on a well-annotated source dataset (e.g., one captured in a well-controlled indoor environment [ionescu2014human3]) by solving the following optimization problem:

$$\min_{\theta} \; \mathcal{L}_P\big(\mathcal{P}_{\theta}(x), X\big), \tag{1}$$

where $\{x, X\}$ denotes paired 2D-3D poses from the source training dataset, and the loss function $\mathcal{L}_P$ is typically defined as the mean squared error (MSE) between predicted and ground truth 3D poses. However, it is often observed that the pose estimator trained on such an indoor dataset can hardly generalize to a new dataset (e.g., an in-the-wild scenario) which features more diverse poses, body sizes, view points or human positions [inthewild3d_2019, zhang2020inference, wang2020predicting].

To improve the generalization ability of the model, we propose to design a pose augmentor $\mathcal{A}$ to augment the training pose pair $\{x, X\}$ into a more diverse one $\{x', X'\}$ for training the model $\mathcal{P}$:

$$\min_{\theta} \; \mathcal{L}_P\big(\mathcal{P}_{\theta}(x'), X'\big), \quad \{x', X'\} = \mathcal{A}(\{x, X\}). \tag{2}$$

There are several strategies to construct the augmentor in an offline manner, e.g., random [chen2016synthesizing, mehta2017vnect, varol2017learning] or evolution-based augmentations [Li_2020_CVPR]. Differently, we propose to implement the augmentor via a neural network with parameters $\theta_A$ and train it jointly with the estimator in an online manner, such that the pose estimator loss can be fully exploited as a surrogate for the augmentation diversity and effectively guide the augmentor learning. In particular, the augmentor is trained to generate harder augmented samples that could increase the training loss of the current pose estimator:

$$\max_{\theta_A} \; \mathcal{L}_P\big(\mathcal{P}_{\theta}(x'), X'\big), \quad \{x', X'\} = \mathcal{A}_{\theta_A}(\{x, X\}). \tag{3}$$

3.2 PoseAug Formulation

Our proposed framework aims to generate diverse training data, with proper difficulties for the pose estimator, to improve model generalization performance. Two challenges thus need to be tackled: how to make the augmented data diverse and beneficial for model training; and how to make them natural and realistic. To address them, we propose two novel ideas in training the augmentor.

Error feedback learning for online pose augmentation Instead of performing random pose augmentation in an offline manner [rogez2016mocap, chen2016synthesizing, Li_2020_CVPR], the proposed pose augmentor deploys a differentiable design which enables online joint training with the pose estimator. Using the training error from the pose estimator as feedback (see Eqn. (3)), the pose augmentor learns to generate poses that are most suitable for the current pose estimator: the augmented poses present proper difficulties and diversity due to online augmentation, thus maximally benefiting generalization of the trained 3D pose estimation model.

Discriminative learning for plausible pose augmentation Purely pursuing error-maximized augmentations may result in implausible training poses that violate the bio-mechanical structure of the human body and may hurt model performance. Previous augmentation methods [rogez2016mocap, chen2016synthesizing, Li_2020_CVPR] mostly rely on pre-defined rules for ensuring plausibility (e.g., joint angle constraints [akhter2015joint_angle_limit]), which however severely limit the diversity of generated poses. For example, some harder yet plausible poses may fail to pass their rule-based plausibility check [Li_2020_CVPR] and will not be adopted for model training. To address this issue, we deploy a pose discriminator module over the local relation of body joints [wandt2019repnet] to assist training the augmentor, thus ensuring the plausibility of augmented poses without sacrificing the diversity.

3.3 Architecture

Fig. 2 summarizes our PoseAug architecture design. It includes 1) a pose augmentor that augments the input pose pair to an augmented one for pose estimator training; 2) a pose discriminator module with two discriminators in 3D and 2D spaces, to ensure the plausibility of the augmented data; and 3) a 3D pose estimator, that provides pose estimation error feedback.

Augmentor Given a 3D pose $X \in \mathbb{R}^{3 \times J}$, the augmentor first obtains its bone vector $B \in \mathbb{R}^{3 \times (J-1)}$ via a hierarchical transformation $B = \mathcal{H}(X)$ [wandt2019repnet, Li_2020_CVPR], which can be further decomposed into a bone direction vector $\hat{B}$ (representing the joint angle) and a bone length vector $\|B\|$ (representing the body size). (Footnote 1: The hierarchical transformation converts the $J$ joints of $X$ into $J-1$ column vectors of $B$, each of which represents a line segment connecting two adjacent joints.)

Then the augmentor applies a multi-layer perceptron (MLP) for feature extraction from the input 3D pose $X$. Additionally, a noise vector drawn from a Gaussian distribution is concatenated with $X$ in the feature extraction process to introduce sufficient randomness for enhancing the feature diversity. The extracted features are then used for regressing three operation parameters ($\gamma_{ba}$, $\gamma_{bl}$ and $(R, t)$) to change the joint angles, body size, as well as view point and position as illustrated in Fig. 3. Among these parameters,

  1. $\gamma_{ba}$ is the bone angle residual vector that is used for adjusting the Bone Angle (BA) as follows:

    $\hat{B}' = \hat{B} + \gamma_{ba}$ (BA operation). (4)

    Specifically, the BA operation rotates the input bone direction vector $\hat{B}$ by $\gamma_{ba}$, generating a new bone direction vector $\hat{B}'$.

  2. $\gamma_{bl}$ represents the bone length ratio vector that is used for adjusting the Bone Length (BL):

    $\|B'\| = \|B\| \times (1 + \gamma_{bl})$ (BL operation). (5)

    The BL operation modifies the input bone length vector $\|B\|$ by $\gamma_{bl}$ to adjust the body size. Notably, to ensure bio-mechanical symmetry, the left and right body parts share the same parameters.

  3. $R$ and $t$ denote the rotation and translation parameters respectively for the Rigid Transformation (RT) operation to control pose view point and position:

    $X' = R\,[\mathcal{H}^{-1}(B')] + t$ (RT operation), (6)

    where $B' = \|B'\|\,\hat{B}'$ is the augmented bone vector from the above BA and BL operations, and $\mathcal{H}^{-1}$ is the inverse hierarchical conversion that transforms $B'$ back to a 3D pose [wandt2019repnet, Li_2020_CVPR].

By applying these operations, the augmentor can generate the augmented 3D pose $X'$ with a more challenging posture, body size, view point and position from the original 3D pose $X$ (Fig. 3). The augmented pose is then re-projected to 2D with $x' = \Pi(X')$, where $\Pi$ denotes perspective projection [Hartley2003MVG] via the camera parameters from the original data. The augmented 2D-3D pair $\{x', X'\}$ is then used for further training the pose estimator.
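The three augmentation operations and the re-projection step can be sketched in NumPy as below. This is a minimal illustration rather than the authors' implementation: the 5-joint chain `PARENTS`, the renormalization of the bone direction after adding the BA residual, the convention that the inverse hierarchical transform places the root joint at the origin (so that `t` acts as the new root position), and the camera intrinsics `f` and `c` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical 5-joint chain: joint 0 is the root, PARENTS[j] is the parent of joint j.
PARENTS = [-1, 0, 1, 2, 3]

def joints_to_bones(X, parents=PARENTS):
    """Hierarchical transform: 3xJ joints -> 3x(J-1) bone vectors (child minus parent)."""
    return np.stack([X[:, j] - X[:, parents[j]] for j in range(1, X.shape[1])], axis=1)

def bones_to_joints(B, parents=PARENTS):
    """Inverse transform: rebuild a pose from bones, with the root at the origin (assumption)."""
    X = np.zeros((3, B.shape[1] + 1))
    for j in range(1, X.shape[1]):
        X[:, j] = X[:, parents[j]] + B[:, j - 1]
    return X

def augment(X, gamma_ba, gamma_bl, R, t):
    """Apply the BA, BL and RT operations to a 3xJ pose."""
    B = joints_to_bones(X)
    length = np.linalg.norm(B, axis=0, keepdims=True)            # bone lengths, 1x(J-1)
    direction = B / length                                       # unit bone directions
    new_dir = direction + gamma_ba                               # BA: add angle residual
    new_dir /= np.linalg.norm(new_dir, axis=0, keepdims=True)    # keep unit length (assumption)
    new_len = length * (1.0 + gamma_bl)                          # BL: rescale bone lengths
    return R @ bones_to_joints(new_dir * new_len) + t.reshape(3, 1)  # RT: rotate, translate

def project(X, f=1000.0, c=(500.0, 500.0)):
    """Perspective projection of a 3xJ pose with hypothetical intrinsics f and c."""
    return f * X[:2] / X[2:3] + np.asarray(c).reshape(2, 1)
```

With zero residuals, an identity rotation and `t` equal to the original root position, `augment` returns the input pose unchanged, which is a convenient sanity check before plugging in learned parameters.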

Figure 3: Augmentation operations with PoseAug. A source 3D pose is augmented by modifying its posture (via BA operation), body size (via BL operation) and view point and position (via RT operation).
Figure 4: Illustration of the difference between the original and the part-aware KCS based discriminators. Given a novel and valid augmented pose, the original KCS based discriminator would wrongly classify it as fake since it does not appear in the source data (H36M), while the part-aware KCS based discriminator would recognize it as real and approve it, since it inspects local joint relations. The part-aware KCS based discriminator can thus help the augmentor generate more diverse and plausible pose augmentations.

Discriminator Due to lacking priors in the augmentation procedure, the augmented poses may present implausible joint angles that violate the bio-mechanical structure [akhter2015joint_angle_limit], or unreasonable positions and view points. Though such poses are indeed harder cases for the estimator, training on them would not benefit the model generalization ability.

To ensure the plausibility of the augmented poses, we introduce a pose discriminator module to guide the augmentation. Specifically, the module consists of a 3D pose discriminator for evaluating the joint angle plausibility and a 2D discriminator for evaluating the body size, viewpoint and position plausibility.

The key to the 3D pose discriminator design is to ensure the pose plausibility without sacrificing the diversity. Inspired by the Kinematic Chain Space (KCS) [wandt2019repnet], we design a part-aware KCS as input to the discriminator. Instead of taking the whole body pose into consideration as in the original KCS, our part-aware KCS only focuses on local joint angle and thus enlarges the feasible region of the augmented pose, ensuring both plausibility and diversity (Fig. 4).

Specifically, to compute the part-aware KCS of an input pose, either $X$ or its augmentation $X'$, we convert the pose to its bone direction vector $\hat{B}$ as above and separate it into 5 parts (torso and left/right arm/leg) [akhter2015joint_angle_limit], denoted as $\hat{B}_i$, $i = 1, \dots, 5$, respectively. We then calculate the following local joint angle matrix for the $i$-th part:

$$KCS^{i}_{local} = \hat{B}_i^{\top} \hat{B}_i, \tag{7}$$

which encapsulates the inter-joint angle information within the $i$-th part. Based on the above local KCS representation, a 3D pose discriminator $D_{3D}$ is constructed which takes $KCS^{i}_{local}$ as input and is trained to distinguish the original and augmented 3D poses.
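A part-aware KCS computation can be sketched as follows, assuming unit bone directions stored as the columns of a 3xN array. The grouping of 15 bones into five parts in `PARTS` is purely illustrative and does not reproduce the skeleton layout used in the paper.

```python
import numpy as np

# Hypothetical assignment of 15 bone indices to the five body parts.
PARTS = {
    "torso": [0, 1, 2],
    "left_arm": [3, 4, 5],
    "right_arm": [6, 7, 8],
    "left_leg": [9, 10, 11],
    "right_leg": [12, 13, 14],
}

def part_aware_kcs(B_hat, parts=PARTS):
    """Return one local matrix B_i^T B_i per part for unit bone directions B_hat (3xN)."""
    return {name: B_hat[:, idx].T @ B_hat[:, idx] for name, idx in parts.items()}
```

Because the columns are unit vectors, entry (m, n) of each matrix is the cosine of the angle between bones m and n of that part, and the diagonal is all ones. The matrices are unchanged by a global rotation of the pose, so a discriminator fed with them judges joint angles within each part rather than the view point or the whole-body configuration.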

Besides the 3D discriminator, we also introduce a 2D discriminator $D_{2D}$ to guide the augmentor to generate realistic body sizes, view points and positions. As the 2D poses contain information such as view point (rotation), position (translation), and body size (bone length), the 2D discriminator can learn such information through adversarial training and guide the pose augmentor in generating a realistic rotation $R$, translation $t$, and bone length ratio $\gamma_{bl}$.

Estimator The pose estimator $\mathcal{P}$ estimates 3D poses from 2D poses. We use the original and augmented 2D-3D pose pairs $\{x, X\}$ and $\{x', X'\}$ to train the pose estimator. The pose estimator contains a feature extractor to capture internal features from 2D poses, and a regression layer to estimate the corresponding 3D poses. Moreover, any existing effective estimator can be used in our PoseAug framework. In Sec. 4.3, we conduct experiments to check the robustness of PoseAug with different estimators, and the results show PoseAug brings noticeable improvements on both source and cross-scenario datasets for all models.

3.4 Training Loss

Pose estimation loss We adopt the mean squared error (MSE) between the ground truth (GT) pose $X$ and the predicted pose $\tilde{X}$ as the pose estimation loss, which is formulated as

$$\mathcal{L}_P = \|X - \tilde{X}\|_2^2. \tag{8}$$

We train the pose estimator using $\mathcal{L}_P$ with both original and augmented pose pairs jointly, which can significantly boost performance for the challenging in-the-wild scenes.

Pose augmentation loss To facilitate model training, the augmented data should be harder than the original data, i.e., $\mathcal{L}_P(X') > \mathcal{L}_P(X)$, but not so hard that they hurt the training process. A simple way to design the loss function is to keep the difference between the pose estimation loss on augmented and original data within a proper range. Inspired by [mao2017least, li2020pointaugment], we implement a controllable feedback loss as

$$\mathcal{L}_{fb} = \Big| 1.0 - \exp\big[\mathcal{L}_P(X') - \beta\,\mathcal{L}_P(X)\big] \Big|, \tag{9}$$

where $\beta > 1$ controls the difficulty level of the generated poses, making the value of $\mathcal{L}_P(X')$ stay within a certain range w.r.t. $\mathcal{L}_P(X)$. During training, as the pose estimator becomes increasingly powerful, we accordingly increase the value of $\beta$ to generate more challenging augmented data for training it.
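To make the behaviour of the feedback loss concrete, here is a small sketch that evaluates a loss of the form |1 - exp(L_aug - beta * L_orig)| on scalar inputs. The function name and the example values are hypothetical; the point is only that the loss vanishes when the augmented-pose loss equals beta times the original-pose loss, and grows when the augmented pose is either too easy or too hard.

```python
import numpy as np

def feedback_loss(loss_aug, loss_orig, beta):
    """|1 - exp(loss_aug - beta * loss_orig)|: zero when loss_aug == beta * loss_orig."""
    return float(np.abs(1.0 - np.exp(loss_aug - beta * loss_orig)))

on_target = feedback_loss(2.0, 1.0, beta=2.0)   # augmented loss exactly beta times the original
too_easy = feedback_loss(1.0, 1.0, beta=2.0)    # augmented pose no harder than the original
too_hard = feedback_loss(3.0, 1.0, beta=2.0)    # augmented pose far harder than the target
```

The exponential makes the penalty asymmetric: overshooting the target difficulty is punished more strongly than undershooting it by the same margin.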

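Additionally, to prevent extremely hard cases from causing training collapse, we introduce a rectified L2 loss for regularizing the augmentation parameters $\gamma_{ba}$ and $\gamma_{bl}$:

$$\mathcal{L}_{reg}(\gamma) = \begin{cases} 0, & \bar{\gamma} < k, \\ \|\bar{\gamma}\|^2, & \text{otherwise}, \end{cases} \tag{10}$$

where $\gamma$ denotes $\gamma_{ba}$ and $\gamma_{bl}$, $\bar{\gamma}$ denotes the mean value over all of its elements, and $k$ is a threshold. Combining Eqn. (9) and Eqn. (10), the overall augmentation loss is formulated as

$$\mathcal{L}_A = \mathcal{L}_{fb} + \mathcal{L}_{reg}. \tag{11}$$
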
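Pose discrimination loss For the discrimination loss $\mathcal{L}_D$, we adopt the LS-GAN loss [mao2017least] for both 3D and 2D spaces:

$$\mathcal{L}_D = \tfrac{1}{2}\,\mathbb{E}\big[(D(X) - 1)^2\big] + \tfrac{1}{2}\,\mathbb{E}\big[D(X')^2\big], \tag{12}$$

where $X$ and $X'$ denote the original (real) and the augmented (fake) pose pairs, respectively.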
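A least-squares GAN discriminator loss of this kind can be sketched in a few lines. The sketch assumes the standard LS-GAN targets of 1 for real samples and 0 for fake samples, and takes arrays of raw discriminator scores as input; the function name is hypothetical.

```python
import numpy as np

def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares loss pushing scores on real poses to 1 and on augmented poses to 0."""
    d_real, d_fake = np.asarray(d_real, dtype=float), np.asarray(d_fake, dtype=float)
    return float(0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2))

perfect = lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0])    # ideal discriminator
confused = lsgan_discriminator_loss([0.5, 0.5], [0.5, 0.5])   # cannot tell real from fake
```

Compared with a cross-entropy objective, the quadratic penalty keeps gradients informative even for samples the discriminator already classifies confidently, which is the usual motivation for LS-GAN.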

End-to-end training strategy With the differentiable design, the pose augmentor, discriminator and estimator can be jointly trained end-to-end. We update them alternately by minimizing the losses in Eqn. (11), Eqn. (12) and Eqn. (8), respectively. In addition, we first pre-train the pose estimator before training the whole framework end-to-end, which ensures stable training and produces better performance.

4 Experiments

We study four questions in experiments. 1) Is PoseAug able to improve performance of 3D pose estimator for both intra-dataset and cross-dataset scenarios? 2) Is PoseAug effective at enhancing diversity of training data? 3) Is PoseAug consistently effective for different pose estimators and cases with limited training data? 4) How does each component of PoseAug take effect? We experiment on H36M, 3DHP and 3DPW. Throughout the experiments, unless otherwise stated we adopt single-frame version of VPose [pavllo2019videopose3d] as pose estimator.

4.1 Datasets

Human3.6M (H36M) [ionescu2014human3] Following previous works [martinez2017simple, zhao2019semantic], we train our model on subjects S1, S5, S6, S7 and S8 of H36M and evaluate on subjects S9 and S11. We use two evaluation metrics: Mean Per Joint Position Error (MPJPE) in millimeters, and MPJPE computed after aligning the predictions with the GT 3D poses by a rigid transformation (PA-MPJPE).
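The two metrics can be sketched as below for a single pose stored as a (J, 3) array in millimeters. The alignment follows the common Procrustes convention, which solves for a rotation, a translation and also a uniform scale; whether scale is included in the paper's PA-MPJPE is an assumption here, and the function names are illustrative.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance per joint; pred and gt are (J, 3) arrays in mm."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def procrustes_align(pred, gt):
    """Align pred to gt with the best similarity transform (scale, rotation, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    d = np.sign(np.linalg.det(U @ Vt))     # -1 if the optimum would be a reflection
    D = np.array([1.0, 1.0, d])
    R = (U * D) @ Vt                       # proper rotation such that P @ R approximates G
    scale = (S * D).sum() / (P ** 2).sum()
    return scale * P @ R + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

A prediction that differs from the ground truth only by a global rotation, scale and offset has a large MPJPE but a PA-MPJPE of essentially zero, which is why PA-MPJPE isolates errors in the articulated posture itself.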

MPI-INF-3DHP (3DHP) [mehta2017vnect] It is a large 3D pose dataset with 1.3 million frames, presenting more diverse motions than H36M. We use its test set to evaluate the model’s generalization ability to unseen environments, using metrics of MPJPE, Percentage of Correct Keypoints (PCK) and Area Under the Curve (AUC).
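For reference, PCK and AUC can be sketched as follows. The 150 mm PCK threshold and the AUC range of 0 to 150 mm in 5 mm steps follow common practice for 3DHP evaluation and are assumptions here, since the text does not state them.

```python
import numpy as np

def pck(pred, gt, threshold=150.0):
    """Percentage of joints whose 3D error is within the threshold (mm)."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float(100.0 * np.mean(dist <= threshold))

def auc(pred, gt, thresholds=None):
    """Mean PCK over a range of thresholds (assumed: 0 to 150 mm in 5 mm steps)."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 150.0, 31)
    return float(np.mean([pck(pred, gt, t) for t in thresholds]))
```

Unlike MPJPE, both metrics are bounded and higher is better, so a few grossly wrong joints affect them less than they affect the mean error.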

3DPW [vonMarcard2018] It is an in-the-wild dataset with more complicated motions and scenes. To verify generalization of the proposed method to challenging in-the-wild scenarios, we use its test set for evaluation with PA-MPJPE as metric.

MPII [andriluka20142d] and LSP [johnson2010clustered] They are in-the-wild datasets with only 2D body joint annotations and used for qualitatively evaluating model generalization for unseen poses.

4.2 Results

Results on H36M We compare PoseAug with state-of-the-art methods [zhao2019semantic, sharma2019monocular, pavllo2019videopose3d, moon2019camera, Li_2020_CVPR] on H36M. Similar to [Li_2020_CVPR], we use 2D poses from HR-Net [sun2019deep] as inputs. As shown in Table 1, our method outperforms SOTA methods [zhao2019semantic, sharma2019monocular, pavllo2019videopose3d, moon2019camera] by a large margin, indicating its effectiveness. Notably, compared with the previous best augmentation method [Li_2020_CVPR], our PoseAug achieves lower MPJPE even though it uses external bone length data for data augmentation and nearly more data than ours for model training. This clearly verifies advantages of PoseAug’s online augmentation scheme—it can generate more diverse and informative data that better benefit model training.

Method                                               MPJPE (↓)   PA-MPJPE (↓)
SemGCN (CVPR’19) [zhao2019semantic]                  57.6        -
Sharma et al. (CVPR’19) [sharma2019monocular]        58.0        40.9
VPose (CVPR’19) [pavllo2019videopose3d] (1-frame)    52.7        40.9
Moon et al. (ICCV’19) [moon2019camera]               54.4        -
Li et al. (CVPR’20) [Li_2020_CVPR]                   50.9        38.0
Ours                                                 50.2        39.1
Table 1: Results on H36M in terms of MPJPE and PA-MPJPE. Best results are shown in bold.
Method                                      CE   PCK (↑)   AUC (↑)   MPJPE (↓)
Mehta et al. [mono3dhp2017]                      76.5      40.8      117.6
VNect [mehta2017vnect]                           76.6      40.4      124.7
Multi Person [singleshotmultiperson2018]         75.2      37.8      122.2
OriNet [luo2018orinet]                           81.8      45.2      89.4
LCN [ci2019optimizing]                      ✓    74.0      36.7      -
HMR [hmrKanazawa17]                         ✓    77.1      40.7      113.2
SRNet [zeng2020srnet]                       ✓    77.6      43.8      -
Li et al. [Li_2020_CVPR]                    ✓    81.2      46.1      99.7
RepNet [wandt2019repnet]                    ✓    81.8      54.8      92.5
Ours                                        ✓    88.6      57.3      73.0
Ours(+Extra2D)                              ✓    89.2      57.9      71.1
Table 2: Results on 3DHP. CE denotes cross-scenario evaluation. PCK, AUC and MPJPE are used for evaluation.
Figure 5: Example 3D pose estimations from LSP, MPII, 3DHP and 3DPW. Our results are shown in the left four columns. The rightmost column shows results of Baseline—VPose [pavllo2019videopose3d] trained w/o PoseAug. Errors are highlighted by black arrows.

Results on 3DHP (cross-scenario) We then evaluate how PoseAug facilitates model generalization to cross-scenario datasets. We compare PoseAug against various state-of-the-art methods, including the latest one using offline data augmentation [Li_2020_CVPR], the ones exploiting complex network architecture [ci2019optimizing, zeng2020srnet] and weakly-supervised learning [hmrKanazawa17, wandt2019repnet] and the ones trained on the training set of 3DHP [mono3dhp2017, mehta2017vnect, singleshotmultiperson2018, luo2018orinet]. From Table 2, we can observe our method achieves the best performance w.r.t. all the metrics, outperforming previous approaches by a large margin. This verifies the effectiveness of PoseAug in improving model generalization to unseen scenarios. Moreover, PoseAug can further improve the performance (from 73.0 to 71.1 in MPJPE) by using additional in-the-wild 2D poses (MPII) to train the 2D discriminator. This demonstrates its extensibility in leveraging extra 2D poses to further enrich the diversity of augmented data.
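As a quick check on the headline number, the 9.1% gain over [Li_2020_CVPR] quoted in the abstract and introduction appears to be a relative improvement in PCK. Assuming it is computed from the two PCK entries in Table 2 (88.6 for PoseAug and 81.2 for Li et al.), the arithmetic is:

```latex
\frac{88.6 - 81.2}{81.2} \;=\; \frac{7.4}{81.2} \;\approx\; 0.091 \;=\; 9.1\%
```

The corresponding absolute gain is 7.4 PCK points.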

Results on 3DPW (cross-scenario) We train four 3D pose estimators [martinez2017simple, zhao2019semantic, cai2019exploiting, pavllo2019videopose3d] without and with PoseAug on H36M and compare their generalization performance on 3DPW. As shown in Table 3, PoseAug brings improvements for all the models.

Method                                     PA-MPJPE (↓)
SemGCN [zhao2019semantic]                  102.0
+ PoseAug                                  82.2 (-19.8)
SimpleBaseline [martinez2017simple]        89.4
+ PoseAug                                  78.1 (-11.3)
ST-GCN [cai2019exploiting] (1-frame)       98.0
+ PoseAug                                  73.2 (-24.8)
VPose [pavllo2019videopose3d] (1-frame)    94.6
+ PoseAug                                  81.6 (-13.0)
Table 3: Results in PA-MPJPE for four estimators on 3DPW.

Qualitative results For subjective evaluation, we choose four challenging datasets, i.e., LSP, MPII, 3DHP and 3DPW, whose data differ substantially from H36M in postures, body sizes, and view points. Results are shown in Fig. 5. We can see our method performs fairly well, even for those unseen difficult poses.

                                           H36M                                                    3DHP
Method                                     DET           CPN           HR            GT            DET           CPN           HR            GT
SemGCN [zhao2019semantic]                  67.5          64.7          57.5          44.4          101.9         98.7          95.6          97.4
+ PoseAug                                  65.2 (-2.3)   60.0 (-4.8)   55.0 (-2.5)   41.5 (-2.8)   89.9 (-11.9)  89.3 (-9.4)   89.1 (-6.5)   86.1 (-11.2)
SimpleBaseline [martinez2017simple]        60.5          55.6          53.0          43.3          91.1          88.8          86.4          85.3
+ PoseAug                                  58.0 (-2.5)   53.4 (-2.2)   51.3 (-1.7)   39.4 (-3.9)   78.7 (-12.4)  78.7 (-10.1)  76.4 (-10.1)  76.2 (-9.1)
ST-GCN [cai2019exploiting] (1-frame)       61.3          56.9          52.2          41.7          95.5          91.3          87.9          87.8
+ PoseAug                                  59.8 (-1.5)   54.5 (-2.4)   50.8 (-1.5)   36.9 (-4.8)   83.5 (-12.1)  77.7 (-13.6)  76.6 (-11.3)  74.9 (-12.9)
VPose [pavllo2019videopose3d] (1-frame)    60.0          55.2          52.7          41.8          92.6          89.8          85.6          86.6
+ PoseAug                                  57.8 (-2.2)   52.9 (-2.3)   50.2 (-2.5)   38.2 (-3.6)   78.3 (-14.4)  78.4 (-11.4)  73.2 (-12.4)  73.0 (-13.6)
Table 4: Performance comparison in MPJPE for various pose estimators trained w/o and with PoseAug on H36M and 3DHP datasets. DET, CPN, HR and GT denote 3D pose estimation model trained on different 2D pose sources, respectively. We evaluate the model on H36M test set with the corresponding 2D pose sources. On 3DHP test set, we use GT 2D poses as input for evaluating model’s generalization. We can observe PoseAug consistently decreases errors for all datasets and estimators.

4.3 Analysis on PoseAug

Applicability to different estimators Our PoseAug framework is generic and applicable to different 3D pose estimators. To demonstrate this, we employ four representative 3D pose estimators as backbones: 1) SemGCN [zhao2019semantic], a graph-based 3D pose estimation network; 2) SimpleBaseline [martinez2017simple], an effective MLP-based network; 3) ST-GCN [cai2019exploiting] (1-frame), a pioneer network that uses GCN-based architecture to encode global and local joint relations; and 4) VPose [pavllo2019videopose3d] (1-frame), a fully-convolutional network with SOTA performance. We train these models on the H36M dataset using 2D poses from four different 2D pose detectors, including CPN [chen2018cpn], DET [Detectron2018], HR-Net [sun2019deep] and groundtruth (GT). We evaluate these models on the test set of H36M and 3DHP w.r.t. MPJPE metric. On H36M, we use the corresponding 2D poses for evaluation; while on 3DHP, we evaluate these models with GT 2D poses to filter out the influence of 2D pose detectors. The results are shown in Table 4. We can see PoseAug brings clear improvements to all models on both H36M and more challenging 3DHP datasets. Notably, they obtain more than 13.1% average improvement on 3DHP when trained with PoseAug.

(a)     Source dataset: H36M
(b)     Cross dataset: 3DHP
Figure 6: Ablation study on limited data setup. We report MPJPE for evaluation. Best viewed in color.

Effectiveness for limited training data cases. 3D pose annotations are expensive to collect, making limited training data a common challenge. To demonstrate the effectiveness of our method in such cases, we train the model on pose data from H36M S1 and S1+S5, which contain only 16% and 41% of the training samples, respectively. The results in Fig. 6 show that PoseAug consistently improves model performance with varying amounts of training data, on both H36M and 3DHP. The improvements are more significant when less training data is available (e.g., MPJPE on 3DHP, S1: 116.4 → 90.3, Full: 86.6 → 73.0). Moreover, in cross-scenario generalization, our method trained with only S1 achieves a result (MPJPE: 90.3) comparable to the baseline trained on the full dataset (MPJPE: 86.6), and our method trained with S1+S5 outperforms the baseline trained on the full dataset by a large margin (77.9 vs 86.6 in MPJPE).
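The claim that the gain is larger with less data can be checked directly from the 3DHP numbers quoted above. The short sketch below computes the absolute and relative error reductions; the helper name is our own, and only the four MPJPE values come from the text.

```python
def relative_reduction(baseline, augmented):
    """Error reduction as a percentage of the baseline error."""
    return 100.0 * (baseline - augmented) / baseline


# MPJPE on 3DHP as quoted in the text: (baseline, with PoseAug).
results_3dhp = {"S1": (116.4, 90.3), "Full": (86.6, 73.0)}

for split, (base, aug) in results_3dhp.items():
    print(f"{split}: {base} -> {aug}, "
          f"absolute {base - aug:.1f}, relative {relative_reduction(base, aug):.1f}%")
```

With these inputs, the S1 setting shows a reduction of 26.1 (about 22.4%), against 13.6 (about 15.7%) for the full training set.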

Analysis on the augmentor. We then check the effectiveness of each module in the augmentor. Table 5 summarizes the results. By gradually adding the BA, RT and BL operations, the pose estimation error decreases monotonically from 41.8/86.6 to 38.8/73.5 (on H36M/3DHP). Moreover, incorporating the error feedback guidance further reduces the error to 38.2 on H36M and 73.0 on 3DHP. These results verify the effectiveness of each module of the augmentor in producing more effective augmented samples. Among these modules, RT contributes the most to cross-scenario performance, which implies that it benefits data diversity most effectively.

Method BA RT BL Feedback H36M (↓) 3DHP (↓)
Baseline 41.8 86.6
Variant A 39.7 (-2.1) 85.2 (-1.4)
Variant B 39.2 (-2.6) 75.9 (-10.7)
Variant C 39.1 (-2.7) 75.5 (-11.1)
Variant D 38.8 (-3.0) 73.5 (-13.1)
PoseAug 38.2 (-3.6) 73.0 (-13.6)
Table 5: Ablation study on components of the augmentor. We report MPJPE on H36M and 3DHP datasets.
Figure 7: Distributions of view point (top row) and position (bottom row) for the original H36M data, and for augmented data from Li et al. [Li_2020_CVPR], PoseAug (3rd column) and PoseAug with extra 2D poses. The distributions show that PoseAug significantly improves the diversity of view point and position.

Analysis on diversity improvement. As Table 5 shows, the RT operation, which augments view point and position, contributes the most to cross-scenario performance. We therefore analyze diversity in terms of the view point and position distributions. Fig. 7 shows these distributions for H36M and for the augmented data generated by Li et al. [Li_2020_CVPR] and by our method. For H36M, the view points concentrate near the xz-plane with limited diversity along the y-axis, and the positions form a small, concentrated cluster, which also indicates limited diversity. This explains why a model trained on H36M hardly generalizes to in-the-wild scenarios. Similarly, the view point and position distributions of the augmented data from Li et al. [Li_2020_CVPR] show only a small divergence, which implies that the diversity improvement from handcrafted rules is limited. In comparison, PoseAug offers plausible view points and positions with much greater diversity using the learnable augmentor. In addition, the diversity of human positions can be further improved with extra 2D poses, which also explains the improved generalization ability reported in Table 2.
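The comparison above is qualitative and is read off the plotted distributions. One simple way to put a number on position diversity is the per-axis standard deviation of the root-joint positions, sketched below. This proxy is our own assumption and not a measure used in the paper, and the coordinates are made-up values, not samples from H36M or from PoseAug.

```python
import statistics


def per_axis_spread(positions):
    """Population standard deviation of root positions along x, y and z."""
    xs, ys, zs = zip(*positions)
    return tuple(statistics.pstdev(axis) for axis in (xs, ys, zs))


# Hypothetical root positions (x, y, z); illustrative values only.
original = [(0.0, 0.0, 4.0), (0.2, 0.0, 4.2), (-0.2, 0.0, 3.8), (0.0, 0.0, 4.0)]
augmented = [(1.0, 0.5, 3.0), (-1.0, -0.5, 6.0), (0.5, 1.0, 8.0), (-0.5, -1.0, 3.0)]

print("original :", per_axis_spread(original))
print("augmented:", per_axis_spread(augmented))
```

A tightly clustered set such as `original` gives small values on every axis, while a more widely scattered set gives larger ones. A single spread value cannot tell plausible positions from implausible ones, which is the role the discriminators play.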

Analysis on the discriminator. We here demonstrate the effectiveness of the plausibility guidance from the 2D and 3D discriminators. Table 6 summarizes the results. Adding only the 2D or only the 3D discriminator reduces the baseline error by 2.2/5.8 and 2.2/7.0 on H36M/3DHP, respectively. Including both discriminators in PoseAug training raises the reduction to 3.6/13.6 on H36M/3DHP, which clearly verifies the effectiveness of both discriminators and the importance of plausibility (in augmented poses) for estimator performance.

Method H36M (↓) 3DHP (↓)
Baseline 41.8 86.6
Variant A 39.6 (-2.2) 80.8 (-5.8)
Variant B 39.6 (-2.2) 79.6 (-7.0)
PoseAug 38.2 (-3.6) 73.0 (-13.6)
Table 6: Ablation study on the 2D and 3D discriminators on H36M and 3DHP. MPJPE is used for evaluation.

Analysis on part-aware KCS (PA-KCS). To verify the effectiveness of PA-KCS, we replace it in PoseAug with the original KCS [wandt2019repnet]. Table 7 summarizes the results. PA-KCS clearly outperforms KCS on both 3DHP (73.0 vs 77.7) and 3DPW (81.6 vs 88.4). This verifies that PA-KCS provides better guidance than KCS during training.

Method KCS PA-KCS 3DHP (↓) 3DPW (↓)
Baseline 86.6 94.6
Variant A 77.7 (-8.9) 88.4 (-6.2)
PoseAug 73.0 (-13.6) 81.6 (-13.0)
Table 7: Ablation study on part-aware KCS (PA-KCS). We report MPJPE on 3DHP and PA-MPJPE on 3DPW.
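To make the KCS versus PA-KCS comparison more concrete, the sketch below computes a KCS matrix in the sense of [wandt2019repnet], i.e. the Gram matrix of bone vectors, together with a part-restricted variant that keeps only the bones of one body part. The toy skeleton, the bone list and the choice of part are hypothetical, and this section does not describe how the part-wise matrices are fed to the discriminator, so the code should be read as an illustration of the representation only.

```python
def bone_vectors(joints, bones):
    """One 3D vector per bone, pointing from the parent joint to the child joint."""
    return [tuple(c - p for p, c in zip(joints[parent], joints[child]))
            for parent, child in bones]


def kcs(vectors):
    """Gram matrix of bone vectors.

    Diagonal entries are squared bone lengths; off-diagonal entries are dot
    products, which encode the angles between pairs of bones.
    """
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def part_kcs(joints, bones, part):
    """KCS restricted to the bones whose indices are listed in `part`."""
    vectors = bone_vectors(joints, bones)
    return kcs([vectors[i] for i in part])


# Hypothetical 4-joint chain with three mutually orthogonal bones.
joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0), (1.0, 2.0, 3.0)]
bones = [(0, 1), (1, 2), (2, 3)]

full = kcs(bone_vectors(joints, bones))       # 3x3 matrix over all bones
local = part_kcs(joints, bones, part=[1, 2])  # 2x2 matrix over one "part"

print(full)
print(local)
```

The part-restricted matrix contains only the relations among bones of the same part, which matches the motivation of evaluating local joint-angle plausibility rather than relations across the whole body.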

5 Conclusion

In this paper, we develop an auto-augmentation framework, PoseAug, that learns to enrich the diversity of training data and improves the performance of the trained pose estimation models. PoseAug integrates three components, the augmentor, the estimator and the discriminator, and lets them fully interact with each other. Specifically, the augmentor is designed to be differentiable, so it can learn to change the major geometry factors of a 2D-3D pose pair to better suit the estimator by taking the estimator's training error as feedback. The discriminator ensures the plausibility of the augmented data based on a novel part-aware KCS representation. Extensive experiments show that PoseAug can generate diverse and informative augmented data that boosts estimation performance for various 3D pose estimators.

Acknowledgement This research was partially supported by AISG-100E-2019-035, MOE2017-T2-2-151, NUS_ECRA_FY17_P08 and CRP20-2017-0006. JZ would like to acknowledge the support of NVIDIA AI Tech Center (NVAITC) to this research project.