Adversarial 3D Human Pose Estimation via Multimodal Depth Supervision

09/21/2018 ∙ by Kun Zhou, et al.

In this paper, a novel deep-learning-based framework is proposed to infer 3D human poses from a single image. Specifically, a two-phase approach is developed. We first employ a generator with two branches to extract explicit and implicit depth information, respectively. During training, an adversarial scheme is also employed to further improve performance. In the second step, the implicit and explicit depth information, together with 2D joints estimated by a widely used detector, are fed into a deep 3D pose regressor for the final pose generation. Our method achieves an MPJPE of 58.68 mm on the ECCV 2018 3D Human Pose Estimation Challenge.




1 Introduction

Estimating the 3D human pose from a single RGB image [1, 2, 3, 4, 5, 6, 7, 8] has drawn intensive research attention over the last decade due to its broad applications. Thanks to powerful deep convolutional neural networks (DCNNs), significant advances have been made in this area. Nevertheless, a large gap remains between images and 3D poses in in-the-wild scenarios, largely because annotating groundtruth 3D positions for skeleton joints is challenging.

Many previous works tackle this problem by decomposing the task into two stages, each trainable with easily obtained annotations: (a) 2D pose estimation; (b) direct recovery of the 3D pose from the 2D pose. Although this decomposition eases data annotation, since 2D poses can be labelled readily on in-the-wild images, it also discards pictorial information needed to resolve the ambiguity of 3D pose recovery. Depth ordering information, e.g., [4, 5], has been shown to be effective at resolving this ambiguity, and its annotation can be done efficiently.

Consider [4]: its FBI (forward-or-backward information) only partially reflects absolute depth. To advance further along this line, we propose an architecture for extracting multimodal depth information. Specifically, both FBI, as implicit depth information, and explicit depth information are exploited to supervise the learning procedure. In addition, we improve explicit depth estimation with a conditional adversarial learning scheme. In the last stage, a linear deep regressor with a novel loss function maps FBI, the explicit depth, and the corresponding 2D human pose to the estimated 3D human pose.

Figure 1: The framework of our method. A multimodal generator is learned to predict a coarse 3D pose and FBI in a supervised way. Adversarial training is adopted to boost the performance of the 3D pose inference branch. The estimated coarse 3D pose and FBI are then fed together into a deep regressor for further 3D pose refinement.

2 Method

The overview of the proposed network architecture is shown in Fig. 1. A coarse 3D pose and FBI are estimated by our multimodal generator. A conditional adversarial learning architecture is employed to fine-tune the coarse 3D pose module. Finally, the coarse 3D pose and FBI are fed into the linear regressor to infer the 3D human pose.

2.1 Multimodal Depth Estimation

2.1.1 Multimodal Generator.

The multimodal generator consists of two parallel convNets: one estimates a coarse 3D pose (an explicit depth representation), while the other generates FBI (an implicit depth representation).

2.1.2 Explicit Depth Supervision.

Our explicit depth is represented by assigning an extra depth coordinate to each 2D joint location. In this work, the 2D joint locations are detected by a widely used estimator [9]. The explicit depth, together with the corresponding 2D joint locations, can be viewed as a coarse 3D pose aligned in the camera coordinate system. Such information can be easily extracted from the 3D pose groundtruth and used to supervise the learning process. We use the same convNet architecture as in [8] for our coarse 3D pose prediction.
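
As a concrete illustration, the explicit-depth targets can be derived from a camera-space 3D groundtruth pose as below. This is a minimal numpy sketch: the root-relative depth normalization and the joint layout are our assumptions, not details taken from the paper.

```python
import numpy as np

def explicit_depth_targets(pose_3d, root=0):
    """Split a camera-space 3D pose (J, 3) into 2D joint locations
    plus one extra depth coordinate per joint. The (x, y, z_rel)
    triples serve as the coarse-3D-pose supervision target.
    Root-relative depth is an illustrative assumption."""
    xy = pose_3d[:, :2]                       # 2D joint locations
    z_rel = pose_3d[:, 2] - pose_3d[root, 2]  # depth relative to the root joint
    return np.concatenate([xy, z_rel[:, None]], axis=1)

# Toy example: 3 joints in camera coordinates (millimetres).
pose = np.array([[0.0, 0.0, 5000.0],
                 [100.0, 50.0, 5100.0],
                 [120.0, 300.0, 4950.0]])
coarse = explicit_depth_targets(pose)  # shape (3, 3)
```
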

2.1.3 Implicit Depth Supervision.

FBI is implicit depth information indicating whether a bone points forward or backward with respect to the camera view. We select a set of FBI relationships from the human skeleton. Each bone vector has one of three statuses: definitely forward, definitely backward, or possibly parallel to the image plane. The FBI of an image can be defined as a matrix F = [f_1, ..., f_m]^T, where each f_i is a one-hot 3-dimensional vector, i.e., f_i(j) = 1 means the i-th bone has the j-th status. Interested readers are referred to [4] for more details on FBI. It is worth noting that such information is very easy to annotate: the annotator only needs to make a binary selection for each skeleton bone, taking around 20 seconds per image. We use the same convNets as in [4] for FBI estimation.

Figure 2: (a) The architecture of conditional adversarial learning. The discriminator is trained to distinguish real samples from generated ones, while the generator learns to produce anthropometrically valid 3D poses that fool the discriminator. (b) Visual results of estimated coarse 3D poses for out-of-domain images. Second/third row: results without/with conditional adversarial learning.

2.1.4 Adversarial Learning.

It has been shown in [10] that adversarial learning helps to predict more realistic poses. The whole process of our conditional adversarial learning is depicted in Fig. 2(a). In the pre-training stage, a coarse 3D pose is predicted by our generator. Subsequently, the coarse 3D pose is refined by conditional adversarial learning in the fine-tuning stage. The discriminator outputs real/fake judgements, which in turn push the generator toward plausible 3D poses. Thanks to the generalization power of conditional adversarial learning, our coarse 3D module is robust. Visual results on images with very different domain characteristics are shown in Fig. 2(b).
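
The adversarial objective can be sketched with standard GAN losses on the discriminator's real/fake scores. The paper does not spell out its exact loss, so the binary cross-entropy formulation below is an assumption, chosen because it is the common default.

```python
import numpy as np

def bce(scores, targets, eps=1e-7):
    """Binary cross-entropy on sigmoid scores in (0, 1)."""
    s = np.clip(scores, eps, 1 - eps)
    return float(np.mean(-(targets * np.log(s) + (1 - targets) * np.log(1 - s))))

def discriminator_loss(d_real, d_fake):
    # Real 3D poses are labelled 1, generated poses 0: the
    # discriminator learns to tell the two apart.
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    # The generator tries to make the discriminator output 1
    # on its samples, i.e. to pass them off as real poses.
    return bce(d_fake, np.ones_like(d_fake))
```

During fine-tuning the two losses are minimized alternately: one step on the discriminator's parameters, one step on the generator's.
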

2.2 3D Pose Refinement

Feeding multimodal depth features into the network effectively improves the accuracy of 3D pose estimation. Specifically, the coarse 3D pose and FBI are concatenated and then mapped to the final 3D pose by two cascaded linear regression blocks as used in [11].
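
A minimal numpy sketch of this refinement step is given below. The hidden width, the joint and bone counts, and the omission of batch normalization and dropout are illustrative simplifications of the residual linear block of Martinez et al. [11].

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_block(x, w1, w2):
    """One residual linear block in the style of [11]: two fully
    connected layers with a ReLU in between, plus a skip connection.
    (Batch norm and dropout from [11] are omitted for brevity.)"""
    h = np.maximum(w1 @ x, 0.0)
    return x + w2 @ h

J, B, H = 17, 16, 64  # joints, bones, hidden width (illustrative sizes)
coarse_pose = rng.normal(size=J * 3)    # flattened coarse 3D pose
fbi = rng.normal(size=B * 3)            # flattened one-hot FBI matrix
x = np.concatenate([coarse_pose, fbi])  # multimodal regressor input

d = x.size
w1a, w2a = rng.normal(size=(H, d)) * 0.1, rng.normal(size=(d, H)) * 0.1
w1b, w2b = rng.normal(size=(H, d)) * 0.1, rng.normal(size=(d, H)) * 0.1
y = linear_block(linear_block(x, w1a, w2a), w1b, w2b)  # two cascaded blocks
```

In practice a final linear layer would project `y` down to the J*3 output pose; here the blocks simply preserve the input dimensionality, as residual blocks must.
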


2.2.1 Weighted Regression Loss function.

Let p be the predicted 3D pose and P be the groundtruth 3D pose. The loss function for 3D pose regression is:

L = L_2 + λ·L_w,  with  L_w = (L_2 / L̄_2)·L_2

Here, L_2 = ||p − P||² is the basic L2 loss and L̄_2 is the mean of L_2 over the training dataset. λ is a hyperparameter that adjusts the trade-off between L_2 and L_w, and is set to 0.001 in the experiments. Actions with extreme poses are commonly hard to learn, and such hard samples should receive more attention. To this end, L_w is designed to complement L_2, giving different samples an adaptive supervision focus.

3 Experiments

3.1 Training

3.1.1 Dataset.

Only 3D human pose data from the ECCV Challenge dataset, a subset of the large-scale Human3.6M dataset [1, 12], are used in the training process. FBI and coarse 3D poses derived from the ECCV Challenge dataset are used to train our multimodal generator.

3.1.2 Implementation Details.

The whole framework is implemented in TensorFlow. The linear 3D pose regressor requires less than six hours to train.

The ECCV 3D Pose Challenge only provides RGB images and the corresponding groundtruth 3D pose coordinates. We use the 2D pose estimator [9] to crop the full human body in each image and resize it to 256×256.
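
The cropping step can be sketched as below; the margin value and the nearest-neighbour resize are illustrative stand-ins for whatever preprocessing the authors actually used.

```python
import numpy as np

def crop_box(joints_2d, margin=0.2):
    """Square bounding box around the detected 2D joints, expanded
    by `margin` on each side, to be resized to 256x256 before
    entering the network. Returns (top-left, bottom-right) corners."""
    lo, hi = joints_2d.min(axis=0), joints_2d.max(axis=0)
    center = (lo + hi) / 2.0
    side = (hi - lo).max() * (1.0 + 2.0 * margin)
    half = side / 2.0
    return center - half, center + half

def resize_nearest(img, size=256):
    """Nearest-neighbour resize of an (H, W, C) crop to (size, size, C)."""
    H, W = img.shape[:2]
    rows = np.arange(size) * H // size
    cols = np.arange(size) * W // size
    return img[rows][:, cols]
```
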

3.1.3 Results.

All the results in Table 1 are obtained from the evaluation server. The method that infers 3D poses only from 2D joint locations, without the coarse 3D pose and FBI, is denoted "Base"; the full method is denoted "Final".

4 Analysis

It is clear that the most challenging part of 3D human pose estimation is learning depth. We proposed to simultaneously infer explicit and implicit depth, in a supervised manner, using a convNet architecture with two independent branches. Although FBI lacks explicit depth groundtruth for in-the-wild images, it still provides useful depth supervision and is very easy to annotate. We exploit the complementary advantages of implicit and explicit depth supervision and feed the learned features together into the final regressor for 3D pose inference. A weighted regression loss function provides adaptive feedback for different pose samples. Thanks to these designs, our method achieves competitive 3D human pose estimation.

MPJPE A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 Avg
Base 54 56 55 53 59 55 69 74 89 63 61 60 53 65 59 62
Final 53 54 54 52 56 55 58 70 78 60 59 57 48 61 56 58.68
Table 1: Results on the official evaluation server (measured in millimeters).

We also find that the 2D joint coordinates of our coarse 3D pose are not reliable enough (left-right joint pairs sometimes flip). As future work, we will explore combining a stronger 2D pose detector with a more effective depth feature extractor for 3D human pose estimation.


  • [1] Ionescu, C., Li, F., Sminchisescu, C.: Latent structured models for human pose estimation. In: International Conference on Computer Vision. (2011)

  • [2] Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P.V., Schiele, B.: Deepcut: Joint subset partition and labeling for multi person pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4929–4937

  • [3] Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In: European Conference on Computer Vision, Springer (2016) 561–578
  • [4] Shi, Y., Han, X., Jiang, N., Zhou, K., Jia, K., Lu, J.: Fbi-pose: Towards bridging the gap between 2d images and 3d human poses using forward-or-backward information. arXiv preprint arXiv:1806.09241 (2018)
  • [5] Pavlakos, G., Zhou, X., Daniilidis, K.: Ordinal depth supervision for 3D human pose estimation. In: Computer Vision and Pattern Recognition (CVPR). (2018)
  • [6] Zanfir, A., Marinoiu, E., Sminchisescu, C.: Monocular 3d pose and shape estimation of multiple people in natural scenes–the importance of multiple scene constraints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 2148–2157
  • [7] Marinoiu, E., Zanfir, M., Olaru, V., Sminchisescu, C.: 3d human sensing, action and emotion recognition in robot assisted therapy of children with autism. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 2158–2167
  • [8] Zhou, X., Huang, Q., Sun, X., Xue, X., Wei, Y.: Towards 3d human pose estimation in the wild: a weakly-supervised approach. In: IEEE International Conference on Computer Vision. (2017)
  • [9] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision, Springer (2016) 483–499
  • [10] Yang, W., Ouyang, W., Wang, X., Ren, J., Li, H., Wang, X.: 3d human pose estimation in the wild by adversarial learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Volume 1. (2018)
  • [11] Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3d human pose estimation. In: International Conference on Computer Vision. Volume 1. (2017)
  • [12] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (2014)