1 Introduction
The ability to identify semantic human keypoints is a classical problem in computer vision with many real-world applications, spanning gaming [15], athletics [14], and robotics [12], and it has now entered our households by allowing us to control smart devices with our bodies (e.g. Google Nest Hub Max and Facebook Portal). In the supervised setting, the identification of 2D keypoints is typically considered a "solved" problem [10, 1], with most recent research efforts either targeting 3D keypoint estimation [12, 9, 17], or moving to the more challenging unsupervised learning setting
[5]. While the success of 2D keypoint estimation pipelines is largely due to the ease of generating 2D annotations, specifying ground-truth 3D keypoints from a single image is ill-posed. Hence, researchers have proposed circumventing this issue by leveraging statistical body models [12], motion-capture data [17], or resorting to multi-camera setups [16, 4] where 3D is recovered from 2D estimates via triangulation. In this paper, we make several core technical contributions to the latter, towards the general objective of bringing high-quality motion capture to "in-the-wild" settings. In particular, existing multi-camera techniques [4, 6, 11, 13, 14] suffer from two significant shortcomings: ① they assume a training set with a sufficiently large set of cameras to minimize self-occlusion and enable accurate extraction of ground-truth 3D labels; ② they are not equipped with spatial uncertainty estimates about their predictions.
If we hope to generalize the performance of multi-view setups and be robust to outliers "in the wild" (i.e. outside capture-studio-like environments [3]), then both of these shortcomings ought to be addressed: ① our architecture requires only 2D labels, which may contain noise, and ② our model produces an interpretable notion of keypoint uncertainty. We realize these improvements via a principled approach to modelling uncertainty: representing a keypoint as a parameterized probability distribution in 3D space that can be marginalized onto an image plane to compute a loss in 2D. Our modelling approach is motivated by robust statistics, enabling training with large outliers in the labels. Finally, while we use explicit 2D labels for our experiments, we remark that our method could be trained on the output of an ensemble of off-the-shelf 2D keypoint models, leading to a fully unsupervised 3D human pose pipeline.
2 Method
As input, we are given a multi-view setup consisting of $C$ cameras with known extrinsics and intrinsics capturing images $\{I_c\}_{c=1}^{C}$, $I_c \in \mathbb{R}^{W \times H \times 3}$, where $W$ and $H$ are the width and height of each image. We then seek to predict spatial distributions (over $\mathbb{R}^3$) corresponding to the $J$ keypoints of a human subject. We assume as a prior that each keypoint $j$ follows a multivariate t-distribution with mean $\boldsymbol{\mu}_j \in \mathbb{R}^3$ and scale matrix $\Sigma_j \in \mathbb{R}^{3 \times 3}$:
\[
p(\mathbf{x}; \boldsymbol{\mu}_j, \Sigma_j) = \frac{\Gamma\!\left(\frac{\nu+3}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right) (\nu\pi)^{3/2} |\Sigma_j|^{1/2}} \left(1 + \frac{1}{\nu} (\mathbf{x} - \boldsymbol{\mu}_j)^{\mathsf{T}} \Sigma_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j)\right)^{-\frac{\nu+3}{2}} \tag{1}
\]
where $\nu > 0$ is a hyperparameter which controls the falloff of the t-distribution. A large value of $\nu$ leads to a distribution which is more robust to outliers (due to a faster falloff), but slower to train; we found a value of $\nu$ that led to good results via a parameter sweep. We remark that the covariance of our distribution is a constant multiple of the scale matrix, and we therefore refer to $\Sigma_j$ as the covariance in what follows. Our choice of parameterization is motivated by three factors: ① it leads to interpretable keypoints, since, as with a Gaussian, we can view the mean as the most likely spatial location of a keypoint and the covariance as defining a spatial uncertainty; ② the log-likelihood loss of a multivariate t-distribution is robust to outliers, due to its rapidly decaying density; ③ a multivariate t-distribution can be projected with perspective onto a 2D version of itself, enabling 2D supervision.

Similarly to the Algebraic Triangulation proposed by Iskakov et al. [4], our method consists of a 2D backbone network applied to each view, followed by a differentiable triangulation step. Unlike Iskakov et al. [4], our method directly models keypoint uncertainty, outputting a 3D distribution (as opposed to a pointwise quantity) while requiring only 2D labels which can contain noise; see Figure 1 for a schematic of our architecture.
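To make the robustness property concrete, the following sketch evaluates the negative log-likelihood of the 3D t-distribution of (1) against a Gaussian with the same mean and scale. The value $\nu = 10$ and all numerical values are purely illustrative assumptions; the report tunes $\nu$ via a parameter sweep.

```python
import numpy as np
from math import lgamma, log, pi

def mvt_logpdf(x, mu, Sigma, nu):
    """Log-density of a d-dimensional multivariate t-distribution (Eq. 1)."""
    d = mu.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)      # (x - mu)^T Sigma^{-1} (x - mu)
    _, logdet = np.linalg.slogdet(Sigma)
    log_norm = lgamma((nu + d) / 2) - lgamma(nu / 2) - (d / 2) * log(nu * pi) - 0.5 * logdet
    return log_norm - ((nu + d) / 2) * np.log1p(maha / nu)

def gaussian_logpdf(x, mu, Sigma):
    """Log-density of a multivariate Gaussian, for comparison."""
    d = mu.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * log(2 * pi) + logdet + maha)

nu = 10.0                                  # illustrative; the report tunes nu by sweep
mu, Sigma = np.zeros(3), np.eye(3)
outlier = np.array([8.0, 0.0, 0.0])

# The t NLL grows logarithmically in the residual while the Gaussian NLL grows
# quadratically, so a large outlier incurs a far smaller penalty under the t prior.
nll_t = -mvt_logpdf(outlier, mu, Sigma, nu)
nll_g = -gaussian_logpdf(outlier, mu, Sigma)
```

This is the mechanism behind factor ②: gradients from outlying labels are bounded under the t-loss, whereas a Gaussian loss lets a single bad label dominate training.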
Predicting keypoint distributions
To output a multivariate t-distribution, we must predict its two parameters $\boldsymbol{\mu}$ and $\Sigma$. To predict $\boldsymbol{\mu}$, we follow the algebraic triangulation method in [4]. First, a 2D backbone network accepts the $C$ views of a subject as input and produces a collection of heatmaps $H_{c,j}$, one for each keypoint $j$ in each view $c$. We apply a global softmax activation to each of these heatmaps to convert them to probability distributions, and take the spatial mean to generate a 2D keypoint prediction $\mathbf{x}_{c,j}$
\[
\mathbf{x}_{c,j} = \sum_{u=1}^{W'} \sum_{v=1}^{H'} \begin{pmatrix} u \\ v \end{pmatrix} \frac{\exp\!\left(H_{c,j}(u,v)\right)}{\sum_{u'=1}^{W'} \sum_{v'=1}^{H'} \exp\!\left(H_{c,j}(u',v')\right)} \tag{2}
\]
where $H_{c,j}(u,v)$ is the pixel at coordinate $(u,v)$ of the heatmap for keypoint $j$ in view $c$, and $W'$ and $H'$ are the width and height of the heatmap images. Again following Iskakov et al. [4], we use the architecture from Xiao et al. [21] as our 2D backbone, which is initialized with pretrained weights.
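The soft-argmax of (2) can be sketched as follows; the heatmap size and peak placement are illustrative assumptions.

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Global softmax over a heatmap followed by the spatial mean (Eq. 2)."""
    H, W = heatmap.shape
    probs = np.exp(heatmap - heatmap.max())            # numerically stable global softmax
    probs /= probs.sum()
    us, vs = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinate grids
    return np.array([(probs * us).sum(), (probs * vs).sum()])

# A sharply peaked heatmap yields a keypoint prediction at the peak location.
hm = np.zeros((64, 64))
hm[20, 37] = 20.0                                      # row v = 20, column u = 37
kp = soft_argmax_2d(hm)
```

Because the spatial mean is a differentiable function of the heatmap, gradients flow from the triangulation loss back into the 2D backbone.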
Similarly to Iskakov et al. [4], to output the 3D mean estimates $\boldsymbol{\mu}_j$, we triangulate the 2D predictions by solving a total least squares problem [2], which amounts to minimizing a quadratic energy formed by the intersection of points in projective space subject to a unit-norm constraint.
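A minimal sketch of this homogeneous total-least-squares triangulation, assuming standard $3 \times 4$ projection matrices; the two-camera setup and numerical values are illustrative assumptions.

```python
import numpy as np

def triangulate_dlt(points_2d, proj_mats):
    """Triangulate one 3D point from 2D observations by minimizing the
    quadratic energy ||A y||^2 subject to ||y|| = 1 (total least squares)."""
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])    # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)
    y = Vt[-1]                          # unit-norm minimizer: last right singular vector
    return y[:3] / y[3]                 # de-homogenize

# Synthetic sanity check: two translated cameras observing a known point.
X = np.array([0.3, -0.2, 4.0, 1.0])                              # homogeneous 3D point
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
obs = [(P @ X)[:2] / (P @ X)[2] for P in (P1, P2)]
X_hat = triangulate_dlt(obs, [P1, P2])
```

With noiseless observations the constraint matrix has an exact one-dimensional null space, so the SVD recovers the point exactly; with noisy predictions the same solve returns the total-least-squares minimizer.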
Similarly to Kumar et al. [7], to predict the covariance $\Sigma$, we use the bottleneck layer of the backbone network for each image, $\mathbf{z}_c$. Since the uncertainty about a keypoint should be a function of all views of a person, we aggregate these bottleneck features and predict a lower-triangular matrix $L$ as
\[
L = g\!\left(\sum_{c=1}^{C} f(\mathbf{z}_c)\right) \tag{3}
\]
where $f$ and $g$ are both fully connected networks. We apply a shifted ELU activation ($x \mapsto \mathrm{ELU}(x) + 1$) to the diagonal entries of $L$ to ensure they are strictly positive, and then compute $\Sigma$ as
\[
\Sigma = L L^{\mathsf{T}} \tag{4}
\]
which is positive definite by construction; note that we omit the keypoint index $j$ for brevity of notation.
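The covariance construction of (3)-(4) can be sketched as follows. The 6-vector standing in for the output of the fully connected head, and its entry ordering, are assumptions for illustration.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def build_covariance(raw6):
    """Map a 6-vector (a stand-in for the network head's output) to a 3x3
    positive-definite covariance via a lower-triangular factor (Eqs. 3-4)."""
    L = np.zeros((3, 3))
    L[np.tril_indices(3)] = raw6           # fill the lower triangle row by row
    d = np.diag_indices(3)
    L[d] = elu(L[d]) + 1.0                 # shifted ELU: diagonal strictly positive
    return L @ L.T                         # Sigma = L L^T, PD by construction

Sigma = build_covariance(np.array([-2.0, 0.3, 0.1, -0.5, 0.2, 1.5]))
eigvals = np.linalg.eigvalsh(Sigma)
```

Since $\mathrm{ELU}(x) > -1$ for all $x$, the shifted diagonal is strictly positive, so $L$ is invertible and $\Sigma = LL^{\mathsf{T}}$ is symmetric positive definite regardless of the raw network output.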
Supervision
3D keypoint labels are often acquired via some kind of triangulation process which may introduce bias into the labels; this bias will then be learned downstream by the network. Thus, we propose to learn the triangulation directly from a collection of (possibly noisy) 2D keypoint labels in each camera view. To train our model, we maximize the likelihood of these 2D labels under a projected version of our 3D keypoint distribution. In contrast to training directly on 3D data, our method aims to produce 3D predictions whose projections agree with 2D keypoint labels under perspective projection. Moreover, training in this way allows our method to naturally consider an ensemble of 2D labels (given, for example, by multiple 2D keypoint models or multiple labellers), aggregating these results in a way that is robust to noise and outliers.
Projecting keypoint distributions
To supervise on 2D labels, we project our keypoint distributions from 3D onto the image plane of each camera. Furthermore, we require that the projections of our 3D distributions have an explicit form, so that we can evaluate a loss in each image. We first observe that, similar to a Gaussian, for any point $\mathbf{x}$ sampled from the multivariate t-distribution (1), its image under an affine transformation $\mathbf{x} \mapsto A\mathbf{x} + \mathbf{b}$ also follows a multivariate t-distribution, with parameters $A\boldsymbol{\mu} + \mathbf{b}$ and $A \Sigma A^{\mathsf{T}}$. While perspective projection is not an affine transformation, we can rely on paraperspective projection [2], an affine approximation to perspective, to map our distributions from 3D to 2D. We remark that [22] use a similar technique for projecting Gaussians from 3D to 2D. Assuming that $\boldsymbol{\mu}$ and $\Sigma$ are expressed in the same coordinate system as the camera (where the positive $z$-axis points in the "look-at" direction of the camera, and the origin is the camera center), we can write the paraperspective operation as
\[
\Pi(\mathbf{x}) = \frac{1}{\mu_3} \begin{pmatrix} x_1 + \mu_1 \left(1 - \frac{x_3}{\mu_3}\right) \\[2pt] x_2 + \mu_2 \left(1 - \frac{x_3}{\mu_3}\right) \end{pmatrix} \tag{5}
\]
which is equivalent to first projecting the point $\mathbf{x}$ onto the plane through $\boldsymbol{\mu}$ parallel to the image plane, then scaling this projection by the inverse depth of the mean; we remark that, because of the assumed camera orientation, the third coordinate $\mu_3$ is the depth of the keypoint. Applying the projection (5) to each keypoint and each view, we obtain multivariate t-distributions on each image plane. To train, we minimize the negative log-likelihood of the 2D keypoint labels under these distributions. Taking the negative log of (1) yields the following loss for each predicted keypoint $j$ and 2D label $\mathbf{y}_{c,j}$ within a given camera view $c$:
\[
\mathcal{L}_{c,j} = \frac{\nu + 2}{2} \log\!\left(1 + \frac{1}{\nu} \|\mathbf{y}_{c,j} - \boldsymbol{\mu}_{c,j}\|^2_{\Sigma_{c,j}^{-1}}\right) + \frac{1}{2} \log |\Sigma_{c,j}| \tag{6}
\]
where we use Mahalanobis norms $\|\mathbf{v}\|^2_{\Sigma^{-1}} = \mathbf{v}^{\mathsf{T}} \Sigma^{-1} \mathbf{v}$ to shorten notation, and $\boldsymbol{\mu}_{c,j}$ and $\Sigma_{c,j}$ are the paraperspective projections of $\boldsymbol{\mu}_j$ and $\Sigma_j$ onto the image plane of camera $c$:
\[
\boldsymbol{\mu}_{c,j} = \Pi_c(\boldsymbol{\mu}_j), \qquad \Sigma_{c,j} = A_c \Sigma_j A_c^{\mathsf{T}} \tag{7}
\]
where $A_c$ denotes the linear part of the affine paraperspective map $\Pi_c$.
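The paraperspective projection of a mean and covariance, as in (5) and (7), can be sketched in camera coordinates as follows; all numerical values are illustrative assumptions.

```python
import numpy as np

def paraperspective_affine(mu):
    """Affine map (A, b) approximating perspective projection about the mean mu,
    in camera coordinates with +z as the look-at direction (Eq. 5)."""
    mx, my, mz = mu
    A = np.array([[1.0, 0.0, -mx / mz],
                  [0.0, 1.0, -my / mz]]) / mz
    b = np.array([mx / mz, my / mz])
    return A, b

mu = np.array([0.5, -0.3, 4.0])            # keypoint mean, depth mu_3 = 4
Sigma = np.diag([0.04, 0.02, 0.09])        # keypoint covariance
A, b = paraperspective_affine(mu)
mu_2d = A @ mu + b                          # equals the exact perspective projection of mu
Sigma_2d = A @ Sigma @ A.T                  # 2D scale matrix of the projected t-distribution
```

Note that the approximation is exact at the mean itself: the affine map sends $\boldsymbol{\mu}$ to its true perspective projection $(\mu_1/\mu_3, \mu_2/\mu_3)$, and only points away from the mean incur approximation error.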
Our final loss function is simply the average of the losses of each keypoint in each camera:
\[
\mathcal{L} = \frac{1}{CJ} \sum_{c=1}^{C} \sum_{j=1}^{J} \mathcal{L}_{c,j} \tag{8}
\]
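A sketch of the per-keypoint 2D negative log-likelihood of (6), dropping the Gamma-function normalizers that depend only on $\nu$ and are therefore constant during training; $\nu$ and the numerical values are illustrative assumptions.

```python
import numpy as np

def t_nll_2d(y, mu2d, Sigma2d, nu):
    """Per-keypoint NLL of a 2D label under the projected t-distribution (Eq. 6),
    dropping terms constant in mu and Sigma."""
    diff = y - mu2d
    maha = diff @ np.linalg.solve(Sigma2d, diff)   # Mahalanobis norm ||y - mu||^2_{Sigma^{-1}}
    _, logdet = np.linalg.slogdet(Sigma2d)
    return ((nu + 2) / 2) * np.log1p(maha / nu) + 0.5 * logdet

nu = 10.0                                          # illustrative; the report tunes nu by sweep
label = np.array([0.10, -0.04])                    # a (possibly noisy) 2D keypoint label
mu2d = np.array([0.12, -0.05])                     # projected mean
Sigma2d = 1e-3 * np.eye(2)                         # projected covariance
loss = t_nll_2d(label, mu2d, Sigma2d, nu)
```

The $\frac{1}{2}\log|\Sigma|$ term prevents the network from inflating the covariance to trivially flatten the likelihood, while the logarithmic residual term caps the influence of outlying labels.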
3 Experiments and results
We evaluate our method on the widely used Human3.6M dataset [3]. Human3.6M consists of video sequences of 7 subjects performing 15 actions, captured from 4 cameras. Of these subjects, 5 are reserved for training and 2 for testing. We evaluate our technique against the two state-of-the-art methods proposed in Iskakov et al. [4]. While our method is similar in spirit to the algebraic triangulation (AT) of Iskakov et al. [4], we outperform it in our experiments. In particular, our method is more robust to camera configurations where the triangulation problem is ill-conditioned; see Section 3.2. Our method performs slightly worse than the volumetric triangulation (VT) of Iskakov et al. [4]; however, it is much faster to evaluate, conceptually simpler, and requires no data preprocessing to compute bounding boxes, making it easier to integrate into practical vision pipelines. Furthermore, our method provides additional explainability in the form of explicit geometric uncertainty; see Figure 2 for example predictions from our pipeline.
3.1 Comparison with 4 cameras – Table 1
As a baseline, we first compare our method to Iskakov et al. [4] using all 4 cameras available in the dataset. As in Iskakov et al. [4], we report the MPJPE (Mean Per Joint Position Error), which measures the mean L2 distance between the predicted and ground-truth keypoints relative to the pelvis. The results are reported in Table 1. Our method slightly outperforms the algebraic triangulation method, and performs slightly worse than the volumetric triangulation method from Iskakov et al. [4].
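The MPJPE metric can be computed as in the following sketch; the pelvis index and the toy joint arrays are assumptions for illustration.

```python
import numpy as np

def mpjpe(pred, gt, pelvis_idx=0):
    """Mean per-joint position error after root-centering at the pelvis.
    pred, gt: (J, 3) arrays of 3D joints; pelvis_idx is dataset-dependent."""
    pred_c = pred - pred[pelvis_idx]            # express joints relative to the pelvis
    gt_c = gt - gt[pelvis_idx]
    return np.linalg.norm(pred_c - gt_c, axis=1).mean()

# Toy example: one joint is off by a 3D offset of norm 0.05, the rest are exact.
gt = np.array([[0.0, 0.0, 0.0], [0.1, 0.2, 0.3], [0.4, 0.1, 0.0]])
pred = gt + np.array([[0.0, 0.0, 0.0], [0.03, 0.0, 0.04], [0.0, 0.0, 0.0]])
err = mpjpe(pred, gt)
```

Root-centering removes global translation error, so MPJPE isolates the quality of the estimated pose rather than the subject's absolute position.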
3.2 Comparison with 2 cameras – Table 2 & 3
We now compare our model on the same train/test split as in the four-camera case, except trained on a subset of 2 cameras and evaluated on 2 different cameras. In both the training and evaluation setups, the cameras point in the same direction, introducing possible occlusions of body parts which could increase error. We find that our method outperforms the algebraic method but underperforms the volumetric method; see Table 2 for quantitative results.
Robustness Stress Test
Finally, as a robustness stress test, we consider a 2-camera train/test split as above, but where the cameras are antipodal. This camera configuration leads to many predictions where the triangulation problem is ill-posed, causing large outliers. In this case, our method is much more robust than algebraic triangulation, but much less robust than volumetric triangulation; see Table 3 for quantitative results.
4 Conclusions and Future Work
In this technical report, we presented a technique for learning 3D keypoints from (possibly noisy) 2D image labels. Unlike prior work, our method requires only 2D supervision and produces outputs equipped with an interpretable notion of spatial uncertainty. Our method achieves near state-of-the-art results on a standard benchmark while remaining extremely simple to implement and fast to evaluate. We believe that such a method paves the way for important future work on the problem of human keypoint detection.
Fully Unsupervised Keypoint Detection
We remark that many methods exist in the literature for image-level keypoint estimation (e.g. [18, 19, 20]). Since our pipeline requires only noisy 2D annotations, we could leverage an ensemble of 2D predictions from existing models to generate labels on the fly. Our model would not only learn to predict the mean of this ensemble, but also the covariance, providing an explicit measurement of the consensus between labellers. Furthermore, the backbone network in our model is, in general, not pretrained on the same dataset as it is evaluated on (for example, our backbone was pretrained on COCO [8]). Thus, an ensemble prediction would be fully unsupervised, and could be used, for example, to construct very large datasets using only a few calibrated cameras.

Temporal Predictions
Spatial uncertainty is a core requirement for many temporal models such as Kalman filters. A natural extension of our pipeline is to aggregate keypoint predictions over multiple frames to predict the next frame in a way that minimizes uncertainty. Furthermore, leveraging the ensemble predictions described in the previous paragraph, our pipeline could be extended to perform fully unsupervised spatiotemporal keypoint prediction.
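As a sketch of this direction, a single Kalman measurement update can consume the predicted keypoint covariance $\Sigma$ as its measurement noise. The constant-position motion model and all numerical values here are assumptions for illustration, not part of the report's method.

```python
import numpy as np

def kalman_update(x, P, z, R):
    """One measurement update of a constant-position Kalman filter, where the
    measurement covariance R plays the role of the per-frame keypoint covariance."""
    H = np.eye(3)                            # we observe the 3D keypoint directly
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x + K @ (z - H @ x)              # blend prior state with the measurement
    P_new = (np.eye(3) - K @ H) @ P          # posterior covariance shrinks
    return x_new, P_new

x, P = np.zeros(3), np.eye(3)                # prior keypoint state and covariance
z = np.array([0.2, 0.1, 1.0])                # per-frame predicted keypoint mean
R = 0.5 * np.eye(3)                          # per-frame predicted keypoint covariance
x_new, P_new = kalman_update(x, P, z, R)
```

Frames with a confident (small-covariance) prediction pull the filtered trajectory strongly toward the measurement, while uncertain frames are largely ignored, which is exactly the behavior that pointwise keypoint predictors cannot provide.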
References

[1] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[2] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[3] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[4] K. Iskakov, E. Burkov, V. Lempitsky, and Y. Malkov. Learnable triangulation of human pose. In International Conference on Computer Vision, 2019.
[5] T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi. Unsupervised learning of object landmarks through conditional image generation. In Advances in Neural Information Processing Systems, 2018.
[6] M. Kocabas, S. Karagoz, and E. Akbas. Self-supervised learning of 3D human pose using multi-view geometry. In Conference on Computer Vision and Pattern Recognition, 2019.
[7] A. Kumar, T. K. Marks, W. Mou, C. Feng, and X. Liu. UGLLI face alignment: Estimating uncertainty with Gaussian log-likelihood loss. In Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[8] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
[9] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3D human pose estimation. In International Conference on Computer Vision, 2017.
[10] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy. PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In European Conference on Computer Vision, 2018.
[11] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Harvesting multiple views for marker-less 3D human pose annotations. In Conference on Computer Vision and Pattern Recognition, 2017.
[12] X. B. Peng, A. Kanazawa, J. Malik, P. Abbeel, and S. Levine. SFV: Reinforcement learning of physical skills from videos. ACM Transactions on Graphics, 2018.
[13] H. Rhodin, M. Salzmann, and P. Fua. Unsupervised geometry-aware representation for 3D human pose estimation. In European Conference on Computer Vision, 2018.
[14] H. Rhodin, J. Spörri, I. Katircioglu, V. Constantin, F. Meyer, E. Müller, M. Salzmann, and P. Fua. Learning monocular 3D human pose estimation from multi-view images. In Conference on Computer Vision and Pattern Recognition, 2018.
[15] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Conference on Computer Vision and Pattern Recognition, 2011.
[16] T. Simon, H. Joo, I. Matthews, and Y. Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In Conference on Computer Vision and Pattern Recognition, 2017.
[17] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3D pose estimation from a single image. In Conference on Computer Vision and Pattern Recognition, 2017.
[18] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In Conference on Computer Vision and Pattern Recognition, 2015.
[19] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In Conference on Computer Vision and Pattern Recognition, 2014.
[20] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In Conference on Computer Vision and Pattern Recognition, 2016.
[21] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision, 2018.
[22] K. Yamashita, S. Nobuhara, and K. Nishino. 3D-GMNet: Single-view 3D shape recovery as a Gaussian mixture. arXiv:1912.04663, 2019.