Human 3D keypoints via spatial uncertainty modeling

12/18/2020 ∙ Francis Williams, et al.

We introduce a technique for 3D human keypoint estimation that directly models the notion of spatial uncertainty of a keypoint. Our technique employs a principled approach to modelling spatial uncertainty inspired by robust statistics. Furthermore, our pipeline requires no 3D ground truth labels, relying instead on (possibly noisy) 2D image-level keypoints. Our method achieves near state-of-the-art performance on Human3.6M while being efficient to evaluate and straightforward to implement.


1 Introduction

The ability to identify semantic human keypoints is a classical problem in computer vision with many real-world applications spanning gaming [15], athletics [14], and robotics [12], and it has now entered our households by allowing us to control smart devices with our bodies (e.g. Google Nest Hub Max and Facebook Portal). In the supervised setting, the identification of 2D keypoints is typically considered a "solved" problem [10, 1], with most recent research efforts either targeting 3D keypoint estimation [12, 9, 17] or moving to the more challenging unsupervised learning setting [5].

While the success of 2D keypoint estimation pipelines is largely due to the ease of generating 2D annotations, specifying ground truth 3D keypoints from a single image is ill-posed. Hence, researchers have proposed circumventing this issue by leveraging statistical body models [12], motion-capture data [17], or resorting to multi-camera setups [16, 4] where 3D is recovered from 2D estimates via triangulation. In this paper, we make several core technical contributions to the latter, towards the general objective of enabling high-quality motion capture in "in-the-wild" settings. In particular, existing multi-camera techniques [4, 6, 11, 13, 14] suffer from two significant shortcomings: ① they assume a training set with a sufficiently large set of cameras to minimize self-occlusion and enable accurate extraction of ground truth 3D labels; ② they are not equipped with spatial uncertainty estimates about their predictions.

If we hope to generalize the performance of multi-view setups and be robust to outliers "in the wild" (i.e. outside capture studios such as [3]), then both of these shortcomings ought to be addressed: ① our architecture requires only 2D labels, which may contain noise, and ② our model produces an interpretable notion of keypoint uncertainty. We realize these improvements via a principled approach to modelling uncertainty: representing a keypoint as a parameterized probability distribution in 3D space that can be marginalized onto an image plane to compute a loss in 2D. Our modelling approach is motivated by robust statistics to enable training with large outliers in the labels. Finally, while we use explicit 2D labels for our experiments, we remark that our method could be trained on the output of an ensemble of off-the-shelf 2D keypoint models, leading to a fully unsupervised 3D human pose pipeline.

2 Method

Figure 1: Model architecture – We predict keypoints as probability distributions in 3D parameterized by a mean $\mu$ and covariance $\Sigma$ representing the position and spatial uncertainty of a keypoint. As input, our model accepts multiple images of a subject taken from multiple (known) camera views. We feed these images through a 2D backbone using the architecture from Xiao et al. [21], which predicts one heatmap per keypoint within each view. To predict $\mu$, we triangulate the spatial mean of each heatmap using the known camera parameters. To predict $\Sigma$, we aggregate the bottleneck features of the backbone network and use a fully connected network to predict a decomposition of $\Sigma$ that is guaranteed to be positive definite.

As input, we are given a multi-view setup consisting of $C$ cameras with known extrinsics and intrinsics, capturing images $I_c \in \mathbb{R}^{W \times H \times 3}$, where $W$ and $H$ are the width and height of the images. We then seek to predict spatial distributions (over $\mathbb{R}^3$) corresponding to the $K$ keypoints of a human subject. We assume as a prior that each keypoint $k$ follows a multivariate t-distribution with mean $\mu_k$ and scale matrix $\Sigma_k$:

$$p(x \mid \mu_k, \Sigma_k) = \frac{\Gamma\!\left(\frac{\nu+3}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\,(\nu\pi)^{3/2}\,|\Sigma_k|^{1/2}} \left(1 + \frac{1}{\nu}\,(x - \mu_k)^{\top}\Sigma_k^{-1}(x - \mu_k)\right)^{-\frac{\nu+3}{2}} \qquad (1)$$

where $\nu$ is a hyperparameter which controls the falloff of the t-distribution. A large value of $\nu$ leads to a distribution which is more robust to outliers (due to a faster falloff), but slower to train; we found a value of $\nu$ that led to good results via a parameter sweep. We remark that the covariance of our distribution is a constant multiple of the scale matrix, and therefore refer to $\Sigma_k$ as the covariance in what follows. Our choice of parameterization is motivated by three factors: ① it leads to interpretable keypoints, since, similar to a Gaussian, we can view the mean as the most likely spatial location of a keypoint and the covariance as defining a spatial uncertainty; ② the log-likelihood loss of a multivariate t-distribution is robust to outliers, due to its rapidly decaying density; ③ a multivariate t-distribution can be projected with perspective onto a 2D version of itself, enabling 2D supervision.
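To make the prior concrete, below is a minimal NumPy sketch of the 3D multivariate-t log-density in Equation (1). The function and variable names (`mvt_logpdf`, `mu`, `Sigma`, `nu`) are ours, and the value of `nu` in the example is arbitrary, since the paper selects its value via a parameter sweep.

```python
import numpy as np
from math import lgamma

def mvt_logpdf(x, mu, Sigma, nu, d=3):
    """Log-density of a d-dimensional multivariate t-distribution with
    mean `mu`, scale matrix `Sigma`, and `nu` degrees of freedom."""
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)          # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(Sigma)
    log_norm = (lgamma((nu + d) / 2.0) - lgamma(nu / 2.0)
                - 0.5 * d * np.log(nu * np.pi) - 0.5 * logdet)
    return log_norm - 0.5 * (nu + d) * np.log1p(maha / nu)

# Example: log-density of a candidate 3D location under a tight keypoint prior.
print(mvt_logpdf(np.array([0.1, -0.2, 1.5]), np.zeros(3), 0.01 * np.eye(3), nu=4.0))
```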

Similarly to Algebraic Triangulation proposed by Iskakov et al. [4], our method consists of a 2D backbone network applied to each view, followed by a differentiable triangulation step. Unlike Iskakov et al. [4], our method directly models keypoint uncertainty, outputting a 3D distribution (as opposed to a point-wise quantity) while requiring only 2D labels which can contain noise; see Figure 1 for a schematic of our architecture.

Predicting keypoint distributions

To output a multivariate t-distribution, we must predict its two parameters $\mu$ and $\Sigma$. To predict $\mu$, we follow the algebraic triangulation method in [4]. First, a 2D backbone network accepts the $C$ views of a subject as input and produces a collection of heatmaps $H_{k,c}$, one per keypoint $k$ in each view $c$. We apply a global softmax activation to each of these heatmaps to convert it into a probability distribution, and take the spatial mean under this distribution to generate a 2D keypoint prediction

$$\hat{x}_{k,c} = \sum_{i=1}^{W_h}\sum_{j=1}^{H_h} \begin{pmatrix} i \\ j \end{pmatrix} H_{k,c}(i, j) \qquad (2)$$

where $H_{k,c}(i, j)$ is the pixel at coordinate $(i, j)$ of the heatmap for keypoint $k$ in view $c$, and $W_h$ and $H_h$ are the width and height of the heatmap images. Again following Iskakov et al. [4], we use the architecture from Xiao et al. [21] as our 2D backbone, which is initialized with pretrained weights.
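For illustration, here is a minimal NumPy sketch of the global softmax followed by the spatial expectation in Equation (2); the heatmap size in the example and the (x, y) ordering are our assumptions.

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Turn one raw heatmap (H_h x W_h) into a 2D keypoint by applying a
    global softmax and taking the spatial expectation (Equation (2))."""
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())      # global softmax, numerically stable
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]              # pixel coordinate grids
    return np.array([(xs * p).sum(), (ys * p).sum()])   # expected (x, y) location

# One call per keypoint and per view; the 2D estimates then feed triangulation.
keypoint_2d = soft_argmax_2d(np.random.randn(64, 64))
```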

Similarly to Iskakov et al. [4], to output 3D mean estimates, we triangulate the 2D predictions by solving a total least squares problem [2] which amounts to minimizing a quadratic energy formed by the intersection of points in projective space subject to a unit norm constraint.
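Below is a sketch of the standard DLT-style total least squares triangulation from [2] that this step refers to, solving min ‖Ay‖ subject to ‖y‖ = 1 with an SVD. The variable names and array layout are ours; in an autodiff framework the SVD can be differentiated through, as in Iskakov et al. [4].

```python
import numpy as np

def triangulate_dlt(points_2d, projections):
    """Triangulate a single 3D point from its 2D observations in C views.

    points_2d  : (C, 2) pixel coordinates, one row per camera.
    projections: (C, 3, 4) full projection matrices (intrinsics @ extrinsics).
    Solves min ||A y|| s.t. ||y|| = 1 for the homogeneous point y via SVD.
    """
    rows = []
    for (x, y), P in zip(points_2d, projections):
        rows.append(x * P[2] - P[0])   # each view contributes two linear
        rows.append(y * P[2] - P[1])   # constraints on the homogeneous point
    A = np.stack(rows)                 # (2C, 4)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                         # right singular vector of the smallest singular value
    return X[:3] / X[3]                # de-homogenize
```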

Similarly to Kumar et al. [7], to predict the covariance $\Sigma$, we use the bottleneck features $z_c$ of the backbone network for each image. Since the uncertainty about a keypoint should be a function of all views of a person, we aggregate these bottleneck features and predict a lower triangular matrix $L$ as

$$L = g\!\left(\operatorname*{agg}_{c=1,\dots,C} f(z_c)\right) \qquad (3)$$

where $f$ and $g$ are both fully connected networks and $\operatorname{agg}$ denotes the aggregation over views. We apply a shifted ELU activation ($x \mapsto \mathrm{ELU}(x) + 1$) to the diagonal entries of $L$ to ensure they are strictly positive, and then compute $\Sigma$ as

$$\Sigma = L L^{\top} \qquad (4)$$

which is positive definite by construction; note we omit the keypoint index for brevity of notation.
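A minimal NumPy sketch of how an unconstrained network output can be mapped to a positive definite Σ as in Equations (3)–(4). The ELU(x)+1 form is our reading of "shifted ELU", and packing the lower triangle into a 6-vector is an assumption.

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def build_covariance(raw6):
    """Map an unconstrained 6-vector (the output of the fully connected head)
    to a 3x3 positive definite covariance via a Cholesky-style factor L."""
    L = np.zeros((3, 3))
    L[np.tril_indices(3)] = raw6           # fill the lower triangle
    diag = np.diag_indices(3)
    L[diag] = elu(L[diag]) + 1.0           # shifted ELU -> strictly positive diagonal
    return L @ L.T                         # Sigma = L L^T, positive definite by construction

Sigma = build_covariance(np.random.randn(6))
assert np.all(np.linalg.eigvalsh(Sigma) > 0)
```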

Supervision

3D keypoint labels are often acquired via some kind of triangulation process, which may introduce bias into the labels that is then learned downstream by the network. Thus, we propose to learn the triangulation directly from a collection of (possibly noisy) 2D keypoint labels in each camera view. To train our model, we maximize the likelihood of these 2D labels under a projected version of our 3D keypoint distribution. In contrast to training directly on 3D data, our method aims to produce 3D predictions whose projections agree with the 2D keypoint labels under perspective projection. Moreover, training in this way allows our method to naturally consume an ensemble of 2D labels (given, for example, by multiple 2D keypoint models or multiple labellers), aggregating these results in a way that is robust to noise and outliers.

Projecting keypoint distributions

To supervise on 2D labels, we project our keypoint distributions from 3D onto the image plane of each camera. Furthermore, we require that the projections of our 3D distributions have an explicit form so we can evaluate a loss in each image. We first observe that, similar to a Gaussian, for any point sampled from the multivariate t-distribution (1), its image under an affine transformation $x \mapsto Ax + b$ also follows a multivariate t-distribution, with parameters $A\mu + b$ and $A\Sigma A^{\top}$. While perspective projection is not an affine transformation, we can rely on para-perspective projection [2], an affine approximation to perspective projection, to map our distributions from 3D to 2D. We remark that [22] use a similar technique for projecting Gaussians from 3D to 2D. Assuming that $\mu$ and $\Sigma$ are expressed in the coordinate system of the camera (where the positive $z$-axis points in the "look-at" direction of the camera, and the origin is the camera center), we can write the para-perspective operation as

$$\pi(x) = \frac{1}{\mu_z}\begin{pmatrix} 1 & 0 & -\mu_x/\mu_z \\ 0 & 1 & -\mu_y/\mu_z \end{pmatrix}(x - \mu) + \frac{1}{\mu_z}\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix} \qquad (5)$$

which is equivalent to first projecting the point onto a plane centered at $\mu$ and parallel to the image plane, then scaling the projection by the distance from the mean to the camera along its look-at direction. We remark that, because of the assumed camera orientation, this distance $\mu_z$ is simply the depth of the keypoint. Applying the projection (5) to each keypoint and each view, we obtain multivariate t-distributions on each image plane. To train, we minimize the negative log-likelihood of the 2D keypoint labels under these distributions. Taking the negative log of (1) yields, up to an additive constant, the following loss for each predicted keypoint $k$ and 2D label $y_{k,c}$ within a given camera view $c$:

$$\mathcal{L}_{k,c} = \frac{\nu + 2}{2}\,\log\!\left(1 + \frac{1}{\nu}\,\big\|y_{k,c} - \hat\mu_{k,c}\big\|^2_{\hat\Sigma_{k,c}^{-1}}\right) + \frac{1}{2}\log\big|\hat\Sigma_{k,c}\big| \qquad (6)$$

where $\|v\|^2_{\Sigma^{-1}} = v^{\top}\Sigma^{-1}v$ denotes the squared Mahalanobis norm as a shorthand, and $\hat\mu_{k,c}$ and $\hat\Sigma_{k,c}$ are the para-perspective projections of $\mu_k$ and $\Sigma_k$ onto the image plane of camera $c$:

$$\hat\mu_{k,c} = A_c\,\mu_k + b_c, \qquad \hat\Sigma_{k,c} = A_c\,\Sigma_k\,A_c^{\top} \qquad (7)$$

where $A_c$ and $b_c$ denote the linear and translation parts of the para-perspective map (5) for camera $c$.
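To make the projection step concrete, here is a sketch of one common para-perspective-style affine approximation (linearized around the keypoint mean, in normalized camera coordinates with intrinsics omitted) and the induced pushforward of the scale matrix via the affine-transformation property stated above. The exact affine map used in the paper may differ, so treat this as illustrative.

```python
import numpy as np

def paraperspective_affine(mu):
    """Affine approximation to perspective projection, linearized around the
    keypoint mean `mu` (camera coordinates, z = look-at axis, intrinsics omitted).
    Returns (A, b) such that pi(x) ~= A @ x + b."""
    x, y, z = mu
    A = np.array([[1.0, 0.0, -x / z],
                  [0.0, 1.0, -y / z]]) / z
    b = np.array([x / z, y / z]) - A @ mu      # so that pi(mu) is the perspective projection of mu
    return A, b

def project_distribution(mu, Sigma):
    """Push the 3D keypoint distribution (mu, Sigma) onto the image plane.
    Because the map is affine, a multivariate t stays a multivariate t
    with parameters (A mu + b, A Sigma A^T)."""
    A, b = paraperspective_affine(mu)
    return A @ mu + b, A @ Sigma @ A.T

mu = np.array([0.3, -0.1, 4.0])            # keypoint mean in camera coordinates (z = depth)
Sigma = np.diag([0.02, 0.02, 0.05])
mu_2d, Sigma_2d = project_distribution(mu, Sigma)
```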

Our final loss function is simply the average of the losses of each keypoint in each camera:

$$\mathcal{L} = \frac{1}{KC}\sum_{k=1}^{K}\sum_{c=1}^{C} \mathcal{L}_{k,c} \qquad (8)$$
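Putting the pieces together, here is a sketch of the per-keypoint, per-view negative log-likelihood of Equation (6) and the averaged loss of Equation (8), with additive constants independent of the predictions dropped; all names and array layouts are ours.

```python
import numpy as np

def t_nll_2d(label_2d, mu_2d, Sigma_2d, nu):
    """Negative log-likelihood of a 2D label under a 2D multivariate t with
    parameters (mu_2d, Sigma_2d); additive constants are dropped (Equation (6))."""
    diff = label_2d - mu_2d
    maha = diff @ np.linalg.solve(Sigma_2d, diff)      # squared Mahalanobis norm
    _, logdet = np.linalg.slogdet(Sigma_2d)
    return 0.5 * (nu + 2.0) * np.log1p(maha / nu) + 0.5 * logdet

def total_loss(labels, mus, Sigmas, nu):
    """Average the per-keypoint, per-camera losses (Equation (8)).
    labels, mus: (K, C, 2) arrays; Sigmas: (K, C, 2, 2)."""
    K, C, _ = labels.shape
    return sum(t_nll_2d(labels[k, c], mus[k, c], Sigmas[k, c], nu)
               for k in range(K) for c in range(C)) / (K * C)
```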
Figure 2: Results from the test set of Human3.6M, trained on 2 cameras. The green circles represent the 95% confidence ellipsoids of the predicted keypoint distributions, and the red points are the ground truth labels. The wrist keypoints have greater spatial uncertainty since the model makes larger errors in these areas. Also note that the ground truth keypoints in Human3.6M used for training are extremely precise, as they are obtained by tracking a skeleton with a motion-capture system; this justifies the small spatial uncertainty predicted by the model.

Method        Avg   Dir   Disc  Eat   Greet Phone Pose  Purch Sit   SitD  Smoke Photo Wait  Walk  WalkD WalkT
AT-Conf [4]   22.6  20.4  22.6  20.5  19.7  22.1  19.5  23.0  25.8  33.0  23.0  20.6  21.6  23.7  20.7  21.3
VT-Conf [4]   20.8  19.9  20.0  18.9  18.5  20.5  18.4  22.1  22.5  28.7  21.2  19.4  20.8  22.1  19.7  20.2
Ours          21.6  18.3  21.6  20.0  18.8  21.1  18.9  22.4  25.2  31.1  21.9  20.6  20.7  22.7  19.6  20.1

Table 1: MPJPE (mm) using all 4 cameras on Human3.6M.

Method        Avg   Dir   Disc  Eat   Greet Phone Pose  Purch Sit   SitD  Smoke Photo Wait  Walk  WalkD WalkT
AT-Conf [4]   30.9  24.3  31.2  26.7  27.7  33.2  24.2  30.3  39.0  48.4  31.8  33.0  26.6  26.5  29.7  25.8
VT-Conf [4]   27.8  21.4  27.2  24.3  24.1  29.1  22.2  27.7  35.3  46.4  28.9  28.3  24.1  23.2  26.9  23.2
Ours          30.8  24.0  30.8  26.1  28.1  33.6  23.9  30.1  39.1  49.5  30.9  32.5  26.0  26.5  29.3  25.9

Table 2: MPJPE (mm) for a model trained on 2 cameras (and tested on 2 different cameras) on Human3.6M.

Method        Avg    Dir    Disc   Eat   Greet   Phone Pose  Purch Sit   SitD   Smoke Photo Wait  Walk  WalkD WalkT
AT-Conf [4]   158.4  105.4  173.4  49.9  1029.9  67.2  55.0  68.6  70.2  505.1  61.6  76.7  75.2  40.6  61.4  50.18
VT-Conf [4]   39.4   34.3   36.3   37.3  43.0    40.9  31.9  41.1  44.3  54.3   40.5  39.5  41.1  32.4  37.6  34.0
Ours          90.7   118.4  108.7  56.5  154.5   69.2  86.3  66.6  75.6  183.7  69.8  96.4  90.7  45.7  79.8  59.7

Table 3: MPJPE (mm) for a model trained on 2 antipodal cameras (and tested on 2 antipodal cameras) on Human3.6M.

3 Experiments and results

We evaluate our method on the widely used Human3.6M dataset [3]. Human3.6M consists of video sequences of 7 subjects performing 15 actions, captured from 4 cameras. Of these, 5 subjects are used for training and 2 are reserved for evaluation. We evaluate our technique against the two state-of-the-art methods proposed in Iskakov et al. [4]. While our method is similar in spirit to algebraic triangulation (AT) from Iskakov et al. [4], we outperform it in our experiments. In particular, our method is more robust to camera configurations where the triangulation problem is ill-conditioned; see Section 3.2. Our method performs slightly worse than volumetric triangulation (VT) from Iskakov et al. [4]; however, it is much faster to evaluate, conceptually simpler, and requires no data preprocessing to compute bounding boxes, making it easier to integrate into practical vision pipelines. Furthermore, our method provides additional explainability in the form of explicit geometric uncertainty; see Figure 2 for example predictions from our pipeline.

3.1 Comparison with 4 cameras – Table 1

As a baseline, we first compare our method to Iskakov et al. [4] using the full 4 cameras available in the dataset. As in Iskakov et al. [4], we report the MPJPE (Mean Per Joint Position Error), which measures the mean L2 distance between the predicted and ground truth keypoints, both expressed relative to the pelvis. The results are reported in Table 1. Our method slightly outperforms the algebraic triangulation method and performs slightly worse than the volumetric triangulation method from Iskakov et al. [4].
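For clarity, here is a small sketch of how pelvis-relative MPJPE is typically computed; the joint ordering, pelvis index, and millimeter units are our assumptions about the evaluation code, not details given in the text.

```python
import numpy as np

def mpjpe(pred, gt, pelvis_idx=6):
    """Mean per-joint position error (mm).
    pred, gt: (K, 3) arrays of predicted / ground truth joints in mm.
    Both poses are re-centered on the pelvis joint before comparison."""
    pred_rel = pred - pred[pelvis_idx]
    gt_rel = gt - gt[pelvis_idx]
    return float(np.mean(np.linalg.norm(pred_rel - gt_rel, axis=1)))
```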

3.2 Comparison with 2 cameras – Table 2 & 3

We now evaluate our model on the same train/test split as in the four-camera case, except trained on a subset of 2 cameras and evaluated on 2 different cameras. In both the training and evaluation setups the cameras point in the same direction, thus introducing possible occlusions of body parts which can increase error. We find that our method outperforms the algebraic method but underperforms the volumetric method; see Table 2 for quantitative results.

Robustness Stress Test

Finally, as a robustness stress test, we consider a 2-camera train/test split as above, but where the cameras are antipodal. This camera configuration leads to many predictions where the triangulation problem is ill-posed, causing large outliers. We see that in this case our method is much more robust than algebraic triangulation, but much less robust than volumetric triangulation; see Table 3 for quantitative results.

4 Conclusions and Future Work

In this technical report, we presented a technique for learning 3D keypoints from (possibly noisy) 2D image labels. Unlike prior work, our method requires only 2D supervision and produces outputs equipped with an interpretable notion of spatial uncertainty. Our method achieves near state-of-the-art results on a standard benchmark while remaining extremely simple to implement and fast to evaluate. We believe that such a method paves the way for important future work on the problem of human keypoint detection.

Fully Unsupervised Keypoint Detection

We remark that many methods exist in the literature for image-level keypoint estimation (e.g. [18, 19, 20]). Since our pipeline requires only noisy 2D annotations, we could leverage an ensemble of 2D predictions from existing models to generate labels on the fly. Our model would then learn to predict not only the mean of this ensemble, but also its covariance, providing an explicit measurement of the consensus between labellers. Furthermore, the backbone network in our model is, in general, not pretrained on the same dataset on which it is evaluated (for example, our backbone was pretrained on COCO [8]). Thus, such an ensemble prediction would be fully unsupervised, and could be used, for example, to construct very large datasets using only a few calibrated cameras.

Temporal Predictions

Spatial uncertainty is a core requirement for many temporal models such as Kalman filters. A natural extension of our pipeline is to aggregate keypoint predictions over multiple frames to predict the next frame in a way that minimizes uncertainty. Furthermore, by leveraging the ensemble predictions described in the previous paragraph, our pipeline could be extended to perform fully unsupervised spatio-temporal keypoint prediction.
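As an illustration of why an explicit Σ is useful temporally, here is a hedged sketch of a standard Kalman measurement update in which the predicted keypoint covariance plays the role of the measurement noise; this is not part of the paper's pipeline, only a pointer to the kind of extension described above.

```python
import numpy as np

def kalman_update(x, P, z, R):
    """Standard Kalman measurement update for a 3D keypoint state.
    x, P : prior state mean (3,) and covariance (3, 3) from a motion model.
    z, R : measured keypoint mean and its predicted covariance Sigma.
    The observation model is the identity (we observe the keypoint directly)."""
    S = P + R                                  # innovation covariance
    K = P @ np.linalg.inv(S)                   # Kalman gain
    x_new = x + K @ (z - x)                    # corrected mean
    P_new = (np.eye(3) - K) @ P                # corrected covariance
    return x_new, P_new
```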

References

  • Cao et al. [2019] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • Hartley and Zisserman [2003] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
  • Ionescu et al. [2014] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
  • Iskakov et al. [2019] K. Iskakov, E. Burkov, V. Lempitsky, and Y. Malkov. Learnable triangulation of human pose. In International Conference on Computer Vision, 2019.
  • Jakab et al. [2018] T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi. Unsupervised learning of object landmarks through conditional image generation. Advances in Neural Information Processing Systems, 2018.
  • Kocabas et al. [2019] M. Kocabas, S. Karagoz, and E. Akbas. Self-supervised learning of 3D human pose using multi-view geometry. In Conference on Computer Vision and Pattern Recognition, 2019.
  • Kumar et al. [2019] A. Kumar, T. K. Marks, W. Mou, C. Feng, and X. Liu. Uglli face alignment: Estimating uncertainty with gaussian log-likelihood loss. In Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  • Lin et al. [2014] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • Martinez et al. [2017] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation. In International Conference on Computer Vision, 2017.
  • Papandreou et al. [2018] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy. Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In European Conference on Computer Vision, 2018.
  • Pavlakos et al. [2017] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Harvesting multiple views for marker-less 3d human pose annotations. In Conference on Computer Vision and Pattern Recognition, 2017.
  • Peng et al. [2018] X. B. Peng, A. Kanazawa, J. Malik, P. Abbeel, and S. Levine. SFV: Reinforcement learning of physical skills from videos. ACM Transactions on Graphics, 2018.
  • Rhodin et al. [2018a] H. Rhodin, M. Salzmann, and P. Fua. Unsupervised geometry-aware representation for 3d human pose estimation. In European Conference on Computer Vision, 2018a.
  • Rhodin et al. [2018b] H. Rhodin, J. Spörri, I. Katircioglu, V. Constantin, F. Meyer, E. Müller, M. Salzmann, and P. Fua. Learning monocular 3d human pose estimation from multi-view images. In Conference on Computer Vision and Pattern Recognition, 2018b.
  • Shotton et al. [2011] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Conference on Computer Vision and Pattern Recognition, 2011.
  • Simon et al. [2017] T. Simon, H. Joo, I. Matthews, and Y. Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In Conference on Computer Vision and Pattern Recognition, 2017.
  • Tome et al. [2017] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. In Conference on Computer Vision and Pattern Recognition, 2017.
  • Tompson et al. [2015] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In Conference on Computer Vision and Pattern Recognition, 2015.
  • Toshev and Szegedy [2014] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In Conference on Computer Vision and Pattern Recognition, 2014.
  • Wei et al. [2016] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In Conference on Computer Vision and Pattern Recognition, 2016.
  • Xiao et al. [2018] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision, 2018.
  • Yamashita et al. [2019] K. Yamashita, S. Nobuhara, and K. Nishino. 3d-gmnet: Single-view 3d shape recovery as a gaussian mixture, 2019. arXiv:1912.04663.