1 Introduction
In this paper, we tackle the problem of estimating the 3D coordinates of human joints from RGB images captured by synchronized (potentially moving) cameras with unknown positions, orientations, and intrinsic parameters, using only the 2D positions of these joints in the captured images for supervision.
Historically, real-time capture of the human 3D pose has been undertaken only by large enterprises that could afford expensive specialized motion capture equipment [gleicher1999animation]. In principle, spatial coordinates of human body joints can be triangulated directly from camera-space observations if all intrinsic and extrinsic parameters of the cameras (called "camera calibration" for short) are available [iskakov2019learnable, karashchuk2020anipose]. Techniques for acquiring accurate calibration in controlled environments with fixed cameras, or in short-baseline stereo captures, where self-calibration from cross-view correspondences is possible, have been extensively studied in prior work [building_rome_10.1145/2001269.2001293]. One scenario in which the aforementioned constraints do not apply is sports capture, in which close-ups of players in extreme poses are captured in front of low-texture backgrounds using widely spaced moving cameras: plain backgrounds preclude self-calibration, as few feature correspondences can be detected across views (e.g., examples in Figure 1). Without known camera calibration, though, we have no practical way of relating position and uncertainty predictions from different views, which is critical for accurate 3D pose estimation, since, in most cases, only a subset of joints can be accurately localized from a single view due to (self-)occlusion. Considering all of the above, a practical solution needs to aggregate location and uncertainty estimates from multiple views, but should not rely on prohibitively expensive 3D joint or camera supervision. As we discuss below, none of the existing approaches fully satisfies these requirements (see the top half of Figure 1 for illustrations).
Fully-supervised 3D pose estimation approaches yield the lowest estimation error, but make use of known 3D camera specifications during either training [xie2020metafuse] or both training and inference [iskakov2019learnable]. However, the prohibitively high cost of 3D joint annotation and full camera calibration (with moving cameras) in-the-wild makes it difficult to acquire labeled datasets large enough to be representative of specific environments [rhodin2018skipose, joo2020exemplar], rendering supervised methods inapplicable in this setup.
Weakly-supervised monocular 3D methods [iqbal2020weakly, kocabas2019epipolar, wandt2020canonpose], as well as 2D-to-3D lifting networks [chen2019unsupervised, wandt2019repnet], relax data constraints to enable 3D pose inference using just multi-view 2D data without calibration at train time. Unfortunately, at inference time, these methods can only be applied to a single view at a time, and are therefore unable to leverage cross-view information and uncertainty. A straightforward extension of these methods, performing rigid alignment of monocular predictions at inference time, yields suboptimal estimation error, as we show in Section 5.
Classical structure-from-motion approaches to 3D pose estimation [karashchuk2020anipose] iteratively refine both the camera parameters and the 3D pose from noisy 2D observations. However, these methods are often slower than their neural counterparts, since they have to perform several optimization steps during inference, and, more importantly, most of them do not use uncertainty estimates or inductive biases to speed up inference, resulting in subpar performance and higher sensitivity to noise, especially with fewer cameras, as we show in Section 5.
To overcome these limitations, we propose "MetaPose" (see Figure 1). Our method for 3D pose estimation aggregates pose predictions and uncertainty estimates across multiple views, requires no 3D joint annotations or camera parameters at either train or inference time, and adds very little latency to the resulting pipeline. Our approach uses an off-the-shelf weakly-supervised 3D network to form an initial guess about the pose and the camera setup, and iteratively refines this guess using 2D joint location probability heatmaps generated by an off-the-shelf 2D pose estimation network. To speed up inference, and to compensate for errors in the off-the-shelf networks, we train a neural optimizer to mimic the slow iterative refinement. This modular approach not only yields low estimation error with low latency, but also enables us to easily analyze the performance of individual components and to plug in new models, priors, or losses if needed. Our main contributions can be summarized as follows:


To the best of our knowledge, we are the first to show that a general-purpose feed-forward neural network can accurately estimate the 3D human pose and the camera configuration from multiple views, taking into account joint occlusions and prediction uncertainties.

We propose a three-stage modular training pipeline that uses only 2D supervision at train time, and a fast two-stage inference pipeline.

On the well-established Human3.6M [ionescu2013human36m] dataset, our method yields the lowest pose estimation error among models that do not use 3D ground truth annotations (more than 40% improvement in PMPJPE over the state of the art) when using all four cameras, and degrades gracefully when restricted to just two cameras (more than 40% improvement over the best baseline).

On the challenging in-the-wild SkiPose-PTZ [rhodin2018skipose] dataset with moving cameras, our method yields error comparable to the bundle adjustment baseline [karashchuk2020anipose] when using all six cameras, but has much faster inference, and massively outperforms bundle adjustment when restricted to just two cameras.

We conduct a detailed ablation study dissecting the remaining error into contributions from the weak camera model, imperfect initialization, imperfect 2D heatmaps, and the additional error introduced by the neural optimizer, suggesting areas where further progress can be made.
2 Related Work
In this section, we briefly cover prior work on multi-view 3D human pose estimation. We focus on approaches that exploit multi-view information; see [joo2020exemplar] for a survey of 3D human pose estimation in the wild.
Full supervision
Supervised methods [iskakov2019learnable, tu2020voxelpose, Chen_2020_CVPR_crossview] yield the lowest 3D pose estimation errors on multi-view single-person [ionescu2013human36m] and multi-person [Joo_2019_panoptic, belagiannis20143d, Chen_2020_CVPR_crossview] datasets, but require precise camera calibration during both training and inference. Other approaches [xie2020metafuse] use datasets with full 3D annotations and a large number of annotated cameras to train methods that can adapt to novel camera setups in visually similar environments, somewhat relaxing camera calibration requirements for inference. martinez2017simple use pretrained 2D pose networks [newell2016stacked] to take advantage of existing datasets with 2D pose annotations. Epipolar transformers [epipolartransformers] use only 2D keypoint supervision, but require camera calibration to incorporate 3D information into the 2D feature extractors. In general, fully-supervised methods require some form of 3D supervision, which is prohibitively expensive to collect in-the-wild, as in the sports capture setting we consider in this paper.
Weak and selfsupervision
Several approaches exist that do not use paired 3D ground truth data. Many augment limited 3D annotations with 2D labels [Zhou_2017_weakly_supervised, hmrKanazawa17, Mitra_2020_CVPR, zanfir2020weakly]. Fitting-based methods [hmrKanazawa17, kocabas2020vibe, kolotouros2019spin, zanfir2020weakly] jointly fit a statistical 3D human body model and a 3D human pose to monocular images or videos. Analysis-by-synthesis methods [Rhodin_2018_ECCV, Kundu_2020_nvs, Jakab_2020_CVPR_unlabelled] learn to predict 3D human pose by estimating appearance in a novel view. Most related to our work are approaches that exploit the structure of multi-view image capture. EpipolarPose [kocabas2019epipolar] uses epipolar geometry to obtain 3D pose estimates from multi-view 2D predictions, and subsequently uses them to directly supervise 3D pose regression. iqbal2020weakly propose a weakly-supervised baseline that predicts pixel coordinates of joints and their depth in each view and penalizes the discrepancy between rigidly aligned predictions for different views during training. The self-supervised CanonPose [wandt2020canonpose] further advances the state of the art by decoupling 3D pose estimation in the "canonical" frame from the estimation of the camera frame. drover2018can learn a "dictionary" mapping 2D pose projections into corresponding realistic 3D poses, using a large collection of simulated 3D-to-2D projections. RepNet [wandt2019repnet] and chen2019unsupervised train similar "2D-to-3D lifting networks" with more realistic data constraints. While all the aforementioned methods use multi-view consistency for training, they do not allow pose inference from multiple images. A straightforward extension of these methods that, at inference time, applies rigid alignment to predicted monocular 3D estimates and ignores the uncertainty in these predictions, does not perform well, as we show in Section 5.
Iterative refinement
Estimating camera and pose simultaneously is a long-standing problem in vision [rosales2001estimating]. One of the more recent successful attempts is the work of bridgeman2019multi, who proposed an end-to-end network that refines the initial calibration guess using center points of multiple players on the field. In the absence of such external calibration signals, takahashi2018human perform bundle adjustment with bone length constraints, but do not report results on a public benchmark. AniPose [karashchuk2020anipose] performs joint 3D pose and camera refinement using a modified version of the robust 3D registration algorithm of zhou2016fast. Such methods ignore predicted uncertainty for faster inference, but robustly estimate outlier 2D observations and discard them during refinement. In Section 5, we show that these classical approaches struggle in ill-defined settings, such as when only a small number of cameras is available. More recently, SPIN [kolotouros2019spin] and HoloPose [Guler_2019_CVPR_holopose] incorporate iterative pose refinement for monocular inputs; however, the refinement is tightly integrated into the pose estimation network. MetaPose effectively regularizes the pose estimation problem with a finite-capacity neural network, resulting in both faster inference and higher precision.
3 Method
Our method involves three stages; see Figure 2:


We first acquire single-view estimates of the full 3D pose using a pretrained monocular 3D network (EpipolarPose [kocabas2019epipolar] for H36M and CanonPose [wandt2020canonpose] for SkiPose-PTZ) and infer an initial guess for the camera configuration by applying closed-form rigid alignment to these single-view 3D pose estimates.

We compute detailed single-view heatmaps using a state-of-the-art monocular 2D network (Stacked Hourglass [newell2016stacked] for both datasets) pretrained on available 2D labels, and refine our initial guess for the 3D pose and cameras by maximizing the likelihood of reprojected points under these heatmaps via gradient descent.

We approximate this iterative refinement given an initial guess and multiview heatmaps via a forward pass of a neural network.
This modular multi-stage approach lets us prime the gradient descent in Stage 2 with a "good enough" initialization so that it starts in the correct basin of the highly non-convex multi-view heatmap likelihood objective. Moreover, it lets us swap in better pretrained monocular 2D and 3D modules without retraining the entire pipeline whenever they become available. Finally, the neural optimizer in Stage 3 provides orders-of-magnitude faster inference than the iterative refinement.
Setup
We assume access to a labeled 2D training dataset with $T$ timestamps and $V$ cameras, where each entry consists of an RGB image $I_{tv}$ and a matrix $u_{tv} \in \mathbb{R}^{J \times 2}$ of image-space locations of $J$ joints in pixels. We also assume that for each timestamp $t$ there exists a ground truth 3D pose $p^*_t \in \mathbb{R}^{J \times 3}$ and corresponding camera parameters $c^*_{tv}$, such that the projection of $p^*_t$ onto camera $v$ is close to the ground truth image-space coordinates $u_{tv}$. We have no access to ground truth 3D poses or camera parameters during training, but we want to learn a function that, for a fresh set of multi-view images, yields a good guess for the 3D pose estimate $\hat{p}_t$ and the vectors of camera parameters $\hat{c}_{tv}$ in terms of reprojection error. Ultimately, we would like the mean per-joint position error between the ground truth 3D pose and our prediction to be as small as possible, without ever accessing the ground truth; hence the method is "weakly-supervised". In what follows, for improved readability, we sometimes omit the timestamp index $t$ when discussing a single entry. Our notation is also summarized in Table 5 in the supplementary.
Prerequisites
We assume access to two "external" networks that can be trained with 2D supervision: 1) a monocular 2D pose model that yields per-joint heatmaps, and 2) a self-supervised monocular 3D model that yields camera-frame 3D pose estimates, such that the first two coordinates of each joint contain the estimated pixel coordinates of that joint in the image, and the last coordinate contains the estimated joint depth in the "same" units (i.e., distances between all points are correct up to scale). We approximate and store each heatmap as a 2-dimensional $M$-component Gaussian mixture model with means, spherical covariances, and weights. This "compressed" format serves two main goals: it turns the joint position likelihood into a smooth function that is easier to optimize over than, for example, pixel-level heatmap interpolation, and it reduces the dimensionality of the input to the neural optimizer, making it possible to train smaller models.
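For intuition, heatmap "compression" can be sketched as follows; this is a deliberately crude illustration (greedy peak picking with a fixed spherical covariance, and hypothetical names `compress_heatmap`, `n_components`, `sigma`), not the fitting procedure used in the paper:

```python
import numpy as np

def compress_heatmap(heatmap, n_components=3, sigma=2.0):
    """Crudely compress a 2D joint heatmap into an M-component spherical
    Gaussian mixture: means at the M strongest (greedily suppressed) peaks,
    a fixed spherical std, and weights proportional to peak intensity."""
    h = heatmap.copy()
    means, weights = [], []
    for _ in range(n_components):
        y, x = np.unravel_index(np.argmax(h), h.shape)
        means.append((float(x), float(y)))
        weights.append(max(h[y, x], 1e-12))
        # suppress a neighborhood around the chosen peak
        y0, y1 = max(0, y - 5), min(h.shape[0], y + 6)
        x0, x1 = max(0, x - 5), min(h.shape[1], x + 6)
        h[y0:y1, x0:x1] = 0.0
    w = np.asarray(weights)
    return np.asarray(means), np.full(n_components, sigma), w / w.sum()
```

Any mixture-fitting scheme with the same output format (means, spherical covariances, normalized weights) would serve the same purpose.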
, and the projection operator is defined as: .3.1 Initial estimate – Stage 1.
The purpose of this stage is to acquire an initial guess using the monocular 3D pose model. We provide the exact expression for the initial guess in Supp. 8.3 and describe the intuitive idea behind it below. We choose the first camera to be the canonical frame, and use orthogonal Procrustes alignment (via SVD of the outer product of mean-centered poses) to find the rotation relating the 3D monocular prediction for camera $v$ to the monocular 3D estimate for the first camera. Similarly, we find the optimal shifts and scales that minimize the discrepancy between the first-camera estimate and the rotated, scaled, and shifted estimates for each of the other cameras. The initial guess for the 3D pose is then the average of the monocular 3D pose predictions from all cameras, rigidly aligned to the first camera frame by the corresponding optimal rotations, scales, and shifts.
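The closed-form rigid alignment above can be sketched in NumPy; this follows the standard Kabsch/Umeyama solution (function name and exact conventions are our illustrative choices, not the paper's implementation):

```python
import numpy as np

def rigid_align(source, target):
    """Find rotation R, scale s, shift t minimizing ||s * R @ p + t - q||^2
    over all point pairs (p, q), via SVD of the mean-centered covariance."""
    mu_s, mu_t = source.mean(0), target.mean(0)
    A, B = source - mu_s, target - mu_t        # mean-centered (N, 3) point sets
    U, S, Vt = np.linalg.svd(A.T @ B)          # covariance between the two sets
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                         # optimal rotation (source -> target)
    s = (S * np.diag(D)).sum() / (A ** 2).sum()  # optimal scale (Umeyama)
    t = mu_t - s * (R @ mu_s)                  # optimal shift
    return R, s, t
```

Applying `rigid_align` between each camera's monocular 3D estimate and the first camera's estimate yields the per-camera rotations, scales, and shifts described above.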
3.2 Iterative refinement – Stage 2.
In the second stage, we use the monocular heatmaps, in the form of Gaussian mixtures with parameters $\theta_{vj}$, to get a refined guess $(p^{(R)}, c^{(R)})$ starting from the initial estimate $(p^{(0)}, c^{(0)})$. The refinement optimization problem over compressed heatmaps is as follows:
(1)   $p^{(R)}, c^{(R)} = \arg\min_{p,\,c}\; -\sum_{v=1}^{V} \sum_{j=1}^{J} \mathrm{GMM}_s\big(\pi(p_j, c_v);\, \theta_{vj}\big)$
where $\pi$ denotes the weak projection of joint $j$ onto camera $v$, and $\mathrm{GMM}_s$ is the numerically stable log-likelihood of a point given a Gaussian mixture, computed with numerically stable versions of $\exp$ and $\log$ (Supp. 8.5). During optimization, we reparameterize the scale (always positive) as log-scale, and the rotation with a 6D vector as in zhou2019continuity (Supp. 8.4).
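A minimal NumPy sketch of this refinement objective, combining a weak projection with a stable (log-sum-exp) Gaussian mixture log-likelihood; array shapes and function names here are our own illustrative conventions:

```python
import numpy as np

def weak_project(pose3d, R, t, s):
    """Weak projection: rotate, keep xy, scale, shift.  pose3d: (J, 3) -> (J, 2)."""
    return s * (pose3d @ R.T)[:, :2] + t

def gmm_loglik(points, means, sigmas, weights):
    """Numerically stable log-likelihood of 2D points under a spherical Gaussian
    mixture, via log-sum-exp.  points: (J, 2), means: (J, M, 2)."""
    d2 = ((points[:, None, :] - means) ** 2).sum(-1)          # (J, M) squared dists
    logp = (np.log(weights) - d2 / (2 * sigmas ** 2)
            - np.log(2 * np.pi * sigmas ** 2))                # component log-densities
    m = logp.max(-1, keepdims=True)                           # stabilize the sum
    return (m[..., 0] + np.log(np.exp(logp - m).sum(-1))).sum()

def refinement_loss(pose3d, cams, hm_params):
    """Negative reprojected log-likelihood, summed over all views and joints."""
    return -sum(gmm_loglik(weak_project(pose3d, R, t, s), mu, sg, w)
                for (R, t, s), (mu, sg, w) in zip(cams, hm_params))
```

Gradient descent on `refinement_loss` with respect to the pose and (reparameterized) camera parameters corresponds to the Stage-2 refinement.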
3.3 Neural optimizer – Stage 3.
In this final stage, we train an optimizer as a neural network $f_\phi$ to predict the optimal update to the current guess for the pose and cameras, similar to ma2020deepoptimizer for solving inverse problems. Specifically, the update is computed from the heatmap mixture parameters $\theta$, the current guess $(p_k, c_k)$, the projections of the current guess onto each camera, and the current value of the refinement loss from Eq. (1):
(2)   $(p_{k+1}, c_{k+1}) = (p_k, c_k) + f_\phi\big(\theta,\; p_k,\; c_k,\; \{\pi(p_{k,j}, c_{k,v})\}_{v,j},\; \mathcal{L}^{(1)}(p_k, c_k)\big)$
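For concreteness, one such feed-forward update step can be sketched as below; the layer sizes follow the architecture described later (Section 4), while the class name, feature packing, and random stand-in weights are our own illustrative assumptions:

```python
import numpy as np

def selu(x):
    # SELU nonlinearity (Klambauer et al.) with its standard constants
    a, l = 1.6732632423543772, 1.0507009873554805
    return l * np.where(x > 0, x, a * (np.exp(x) - 1.0))

class NeuralOptimizerStep:
    """One update step: the flattened current guess plus auxiliary features
    (heatmap parameters, reprojections, loss value) go in; an additive update
    to the (pose, cameras) vector comes out.  Weights here are random."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, in_dim ** -0.5, (in_dim, 512))
        self.W2 = rng.normal(0, 512 ** -0.5, (512, 128))
        self.W3 = rng.normal(0, 128 ** -0.5, (128, out_dim))

    def __call__(self, guess, features):
        x = np.concatenate([guess, features])
        h = selu(selu(x @ self.W1) @ self.W2)
        return guess + h @ self.W3          # additive refinement of the guess
```

A trained stack of a few such steps replaces the slow iterative refinement at inference time.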
The neural network is optimized to minimize the reprojection loss between the reprojected neural estimate and the ground truth 2D joint locations, while ensuring the overall pose and camera estimate remains close to the one obtained via iterative refinement:
(3)   $\min_{\phi}\; \mathcal{L}_{\mathrm{proj}} + \mathcal{L}_{\mathrm{teach}}$
where the reprojection loss is defined as
(4)   $\mathcal{L}_{\mathrm{proj}} = \sum_{v=1}^{V} \sum_{j=1}^{J} \big\| \pi(p^{(N)}_j, c^{(N)}_v) - u_{vj} \big\|_2^2$
and the teacher loss measures the distance between $(p^{(R)}, c^{(R)})$ (the estimate for the camera and pose given by the iterative refinement in Stage 2) and the neural estimate $(p^{(N)}, c^{(N)})$:
(5)   $\mathcal{L}_{\mathrm{teach}} = \big\|p^{(N)} - p^{(R)}\big\|_2^2 + \sum_{v=1}^{V} \Big( \big\|t^{(N)}_v - t^{(R)}_v\big\|_2^2 + \big(\log s^{(N)}_v - \log s^{(R)}_v\big)^2 + \big\|I - R^{(N)}_v (R^{(R)}_v)^{\top}\big\|_F^2 \Big)$
We use this distance between rotations since huynh2009metrics suggest that it is better suited for direct optimization than the L2 distance between rotation parameterizations. We need both losses because $\mathcal{L}_{\mathrm{teach}}$ draws the neural optimizer towards the neighborhood of the correct solution, and $\mathcal{L}_{\mathrm{proj}}$ penalizes small (in terms of $\mathcal{L}_{\mathrm{teach}}$) deviations from that solution that result in large reprojection errors.
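The teacher-distance computation can be sketched as follows, with the rotation discrepancy measured as $\|I - R_1 R_2^\top\|_F$; the decomposition into pose, shift, log-scale, and rotation terms is our illustrative reading of the loss described above:

```python
import numpy as np

def rotation_dist(R1, R2):
    """Rotation discrepancy ||I - R1 R2^T||_F, which vanishes iff R1 == R2."""
    return np.linalg.norm(np.eye(3) - R1 @ R2.T)

def teacher_loss(pose_n, cams_n, pose_r, cams_r):
    """Distance between the neural estimate (pose_n, cams_n) and the Stage-2
    iterative estimate (pose_r, cams_r): L2 on poses and shifts, squared
    log-ratio on scales, and the squared rotation metric per camera."""
    loss = ((pose_n - pose_r) ** 2).sum()
    for (Rn, tn, sn), (Rr, tr, sr) in zip(cams_n, cams_r):
        loss += ((tn - tr) ** 2).sum() + np.log(sn / sr) ** 2
        loss += rotation_dist(Rn, Rr) ** 2
    return loss
```

Note that every term is zero exactly when the two estimates coincide, so the teacher loss anchors the neural optimizer to the iterative solution without biasing it elsewhere.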
Plug-in Priors
Stages 2 and 3 can be easily extended with existing priors over human poses. For example, we were able to make our predictor more "personalized" by conditioning our model on the vector $r$ of normalized bone length ratios of a specific individual [takahashi2018human]. More specifically, we added a limb length component to the refinement loss:
(6)   $\mathcal{L}_{\mathrm{limb}} = \sum_{(i,j) \in \mathcal{E}} \Big( \tfrac{\|p_i - p_j\|_2}{\sum_{(i',j') \in \mathcal{E}} \|p_{i'} - p_{j'}\|_2} - r_{ij} \Big)^2$
where $\mathcal{E}$ is the set of skeleton bones. We passed $r$ as an input to each optimizer step and trained the neural optimizer to minimize the refinement loss augmented with $\mathcal{L}_{\mathrm{limb}}$.
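A bone-length prior of this kind can be sketched in a few lines; here ratios are normalized by the total skeleton length, which is one plausible convention (the function name and normalization are illustrative assumptions):

```python
import numpy as np

def bone_length_loss(pose3d, bones, target_ratios):
    """Penalize deviation of predicted bone-length ratios from a person's
    known normalized ratios.  `bones` is a list of joint-index pairs;
    ratios are normalized by the total skeleton length."""
    lengths = np.array([np.linalg.norm(pose3d[i] - pose3d[j]) for i, j in bones])
    ratios = lengths / lengths.sum()
    return ((ratios - target_ratios) ** 2).sum()
```

Because the ratios are scale-free, the prior constrains limb proportions without fighting the unobservable global scale.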
Inference
Note that during inference we only need to perform rigid alignment over the monocular 3D estimates (Stage 1) to acquire an initial guess, and then apply the neural optimizer to this guess in a feed-forward manner (Stage 3); no iterative refinement (Stage 2) is necessary. We provide pseudocode for both training and inference in Supp. 8.1.
4 Experiments
In this section, we describe the datasets and metrics used to validate the performance of the proposed method, as well as the baselines and ablation experiments we conducted to evaluate the improvement in error provided by each stage and each supervision signal.
Data
We evaluated our method on the Human3.6M [ionescu2013human36m] (H36M) dataset with four fixed cameras and the more challenging SkiPose-PTZ [rhodin2018skipose] (SkiPose) dataset with six moving pan-tilt-zoom cameras. We used the standard train-test evaluation protocol for H36M [iskakov2019learnable, kocabas2019epipolar], with subjects 1, 5, 6, 7, and 8 used for training, and 9 and 11 for testing. We additionally pruned the H36M dataset by taking every 16th frame, resulting in 24443 train and 8516 test examples, each containing information from four cameras. We evaluated our method on two subsets of SkiPose: the "full" dataset used by rhodin2018skipose (1315 train / 284 test), and a "clean" subset (1035 train / 230 test) used by wandt2020canonpose that excludes views where visibility is heavily obstructed by the winter storm, each example containing information from six cameras. We additionally augment the SkiPose dataset by randomly shuffling camera order during training, to prevent the neural optimizer from overfitting on the small training set. In each dataset, we used the first 64 examples from the train split as a validation set.
Metrics
We report the Procrustes-aligned Mean Per Joint Position Error (PMPJPE), which measures the L2 error of 3D joint estimates after optimal rigid alignment between the full predicted 3D pose and the ground truth 3D pose, and therefore ignores the overall pose orientation error; and the Normalized Mean Per Joint Position Error (NMPJPE), which measures the error after optimally scaling and shifting the prediction in the given camera frame to the ground truth pose in that frame, and is therefore sensitive to pose orientation but ignores errors in scale. We do not report the non-normalized pose estimation error because, in the absence of intrinsic camera parameters, the correct absolute scale cannot be estimated from the data. We additionally report the order of magnitude of the additional time it takes to perform multi-view 3D pose inference with different methods, on top of the monocular pose estimation time for each view.
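The two metrics can be sketched in NumPy as follows (standard definitions; function names are our own, and the Procrustes step reuses the classic SVD-based alignment):

```python
import numpy as np

def p_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: mean per-joint L2 error after the optimal
    rigid (rotation + scale + shift) alignment of prediction to ground truth."""
    A, B = pred - pred.mean(0), gt - gt.mean(0)
    U, S, Vt = np.linalg.svd(A.T @ B)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                           # optimal rotation
    s = (S * np.diag(D)).sum() / (A ** 2).sum()  # optimal scale
    aligned = s * A @ R.T + gt.mean(0)
    return np.linalg.norm(aligned - gt, axis=1).mean()

def n_mpjpe(pred, gt):
    """Normalized MPJPE: only optimal scale and shift, so orientation
    errors still count."""
    A, B = pred - pred.mean(0), gt - gt.mean(0)
    s = (A * B).sum() / (A ** 2).sum()           # least-squares scale
    return np.linalg.norm(s * A - B, axis=1).mean()
```

A rotated, scaled, and shifted copy of the ground truth thus has zero PMPJPE but nonzero NMPJPE, which is exactly the distinction described above.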
Method  supervision  multi-view  uncertainty  PMPJPE  NMPJPE  time [s]
Iskakov et al. [iskakov2019learnable]  3D  ✓  ✓  20     
EpipolarPose (EP) [kocabas2019epipolar]  S  ✗  n/a  71  78   
Rhodin et al. [rhodin2018skipose]  2/3D  ✗  n/a  65  80   
Iqbal et al. [iqbal2020weakly]  2D  ✗  n/a  55  66   
CanonPose [wandt2020canonpose]  S  ✗  n/a  53  82   
AniPose [karashchuk2020anipose] + GT  2D  ✓  ✗  75  103  5 
Per-view EP [kocabas2019epipolar]  S  ✗  n/a  86  97  
Average EP  S  ✓  ✓  74  88  
Iterative Ref. + EP  2D  ✓  ✓  45  70  10 
MetaPose + EP  2D  ✓  ✓  40  57  
MetaPose () + EP  S  ✓  ✓  43  58 
Method  supervision  uncertainty  multi-view  PMPJPE  NMPJPE  time [s]
Rhodin [rhodin2018skipose]  2/3D  ✗  n/a    85   
CanonPose (CP) [wandt2020canonpose]  S  ✗  n/a  90  128   
AniPose [karashchuk2020anipose] + GT  2D  ✓  ✗  50  62  5 
Per-view CP [wandt2020canonpose]  S  ✗  n/a  86  115  
Average CP  S  ✓  ✓  80  147  
Iterative Ref. + CP  2D  ✓  ✓  31  63  10 
MetaPose + CP  2D  ✓  ✓  53  86  
MetaPose () + CP  2D  ✓  ✓  48  60 
Baselines
On H36M, we lower-bound the error with the state-of-the-art fully-supervised baseline of iskakov2019learnable, which uses ground truth camera parameters to aggregate multi-view predictions during inference. We also compare the performance of our method to methods that use multi-view 2D supervision during training but only perform inference on a single view at a time: the self-supervised EpipolarPose (EP) [kocabas2019epipolar] and CanonPose (CP) [wandt2020canonpose], as well as the weakly-supervised baselines of iqbal2020weakly and rhodin2018skipose. We applied EpipolarPose to our own set of human bounding boxes on H36M, so our per-view results differ from those reported by kocabas2019epipolar. On SkiPose, we compared our model with the only two baselines available in the literature: CanonPose [wandt2020canonpose] and rhodin2018skipose. CanonPose is preferable to EpipolarPose on the SkiPose dataset because it does not assume fixed cameras. We include results on both the "clean" and "full" subsets of SkiPose (described previously) for completeness; however, CanonPose was trained and evaluated only on the "clean" subset. In the absence of monocular 3D methods trained and evaluated on the "full" subset, we trained and evaluated our method on it using ground truth data with Gaussian noise in place of CanonPose predictions.
We also compared our method against "classical" bundle adjustment initialized with ground truth extrinsic camera parameters and fixed ground truth intrinsic parameters for all cameras, thereby putting it in unrealistically favorable conditions. We used the well-tested implementation of bundle adjustment in AniPose [karashchuk2020anipose], which uses an adapted version of the 3D registration algorithm of zhou2016fast. This approach takes point estimates of joint locations as input (i.e., no uncertainty) and iteratively refines camera parameters and joint location estimates.
Ablation experiments
We additionally measured different sources of the remaining error in our predictions by replacing each component discussed in Section 3 with either its ground truth equivalent or a completely random guess, as well as by constraining the supervision signals we pass to the neural optimizer. More specifically, we measured the effect of: 1) varying the number of cameras used for training and evaluation; 2) priming different refinement methods with random and ground truth initializations; 3) using "fake heatmaps" centered around ground truth joints projected back into the image plane using different camera models; and 4) disabling and enabling losses and priors when training our neural optimizer. We did not measure the performance of the neural optimizer primed with the ground truth initialization, because that would have merely shown the ability of the neural network to approximate the identity function.
Architecture
For monocular 2D pose estimation, we used the stacked hourglass network [newell2016stacked] pretrained on the COCO pose dataset [guler2018densepose]. For monocular 3D estimation in Stage 1, we applied EpipolarPose [kocabas2019epipolar] on Human3.6M and CanonPose [wandt2020canonpose] on SkiPose-PTZ. We note that differences between the joint labeling schemes used by these monocular 3D methods and our evaluation set do not affect the quality of the camera initialization we acquire via rigid alignment, as long as the monocular 3D estimates for all views follow the same labeling scheme. Similar to prior work [ma2020deepoptimizer], each "neural optimizer step" is trained separately, and a stop-gradient is applied to all inputs. We used the same architecture across all experiments: fully-connected 512-dimensional layers followed by a fully-connected 128-dimensional layer, all with SELU nonlinearities [klambauer2017selu], followed by a dense output of the size corresponding to the optimization space (flattened 3D pose and weak camera model parameters). We retrained each stage multiple times until the validation PMPJPE improved or we ran out of "stage training attempts". We refer readers to Section 8.2 in the supplementary for a more detailed description of all components used to train our neural optimizer and their reference performance on train and test.
5 Results
Our experiments lead us to the following observations. In what follows, “(4.4)” stands for “Table 4, row #4”.
State-of-the-art performance – Tables 1 and 2
In terms of both PMPJPE and NMPJPE, our neural optimizer, primed with outputs of EpipolarPose (EP) on H36M and CanonPose (CP) on SkiPose, outperforms the bundle-adjustment baseline (AniPose [karashchuk2020anipose]) initialized with ground truth (GT) by +35mm on H36M with four cameras and +2mm on SkiPose with six cameras. Our method also outperforms weakly-supervised baselines by +13mm on H36M and by +30mm on SkiPose. It also outperforms the iterative refiner that was used to "teach" it by 5mm on H36M, while being 10x faster than the iterative refiner.
Few-camera performance
Self-supervised performance
On H36M, the neural optimizer is able to learn in a self-supervised fashion – using the heatmap log-probability loss alone, without 2D ground truth supervision – and still outperform the iterative refiner that uses the same initial guess (4.4 vs 4.4). On the smaller SkiPose dataset, self-supervised training fails (4.4).
Analysis
We hypothesize that the weakly-supervised neural optimizer was able to outperform the iterative refiner by learning to compensate for errors in the other pretrained components using ground truth 2D joint locations in the reprojection loss, since removing the reprojection loss decreases its performance (4.4 vs 4.4). We believe that the finite-capacity neural optimizer additionally regularizes the pose estimation problem, allowing the self-supervised neural optimizer to outperform the iterative refiner that minimized the very same log-probability loss (4.4 vs 4.4). This also hints at why the analogous setup fails on SkiPose – the smaller number of training examples led to overfitting and, consequently, poor performance on test. This agrees with the measured discrepancy between train and test errors on SkiPose (Supp. 8.6). See Supp. 8.8 for qualitative examples of poses output at Stages 1–3.
5.1 Ablations
We additionally studied the sources of remaining error (40mm on H36M and 48mm on SkiPose).
Heatmap error
We performed iterative refinement on "fake" heatmaps centered around reprojected ground truth joint locations (4.4 vs 4.4; 4.4 vs 4.4). This reveals that imperfections in the heatmaps generated by the 2D pose estimation network contribute at least 20mm of error on both H36M and SkiPose across all views. Moreover, the poor performance of iterative refinement with ground truth heatmaps in the two-camera setup on SkiPose hints at why AniPose fails in that setting – the first two cameras are too close together relative to the distance to the subjects (see Figure 4 in rhodin2018skipose), rendering the problem ill-defined.
Camera model
Initialization
We primed AniPose [karashchuk2020anipose] and our iterative refiner with ground truth and random initial guesses. This reveals that both AniPose and the iterative refinement fail to converge to a good solution without a good initialization (4.4, 4.4, 4.4), but a "perfect" initialization only improves their error by an additional 3–5mm (4.4 vs 4.4), so the initialization we get from Stage 1 is already "good enough". We also primed our neural optimizer with random initial guesses (both during training and inference). This reveals that, when provided with enough training data (H36M) and enough cameras, our simple neural optimizer was able to quickly (in 3–5 steps) converge to a high-quality solution, only 3–5mm worse than when primed with outputs from Stage 1 (4.4 vs 4.4). Moreover, when the training set is too small (SkiPose), random initialization actually improves the final results in the few-camera setup (4.4 vs 4.4).
We conclude that the fully-supervised baseline of iskakov2019learnable likely approaches the limits of what can be done with a weak camera model on H36M, and that further progress might come from better heatmaps and a better camera model. The good performance of MetaPose in the ill-defined two-camera setup on SkiPose, which improves even further when primed with a random initial guess, supports our hypothesis that MetaPose regularizes the ill-defined problem of few-camera pose estimation with a finite-capacity neural network.
Design decisions
The remaining experiments show that only the full model using all proposed components (the reprojection loss, the teacher loss, and, optionally, the bone length prior) was able to keep the pose estimation error low with fewer cameras on both the H36M (4.4–4.4) and SkiPose (4.4–4.4) datasets.
Setup  Inputs  # cam  
#  Method  Loss  2D  init  4  3  2 
1*  AniPose    GT  75  106  167  
2  AniPose    GT  GT  0.3  0.5  2.7 
3  AniPose    rand  198  215  269  
4  AniPose    GT  rand  154  153  245 
5*  Average      EP  74  76  83 
6*  Iterative  EP  45  53  55  
7  Iterative  rand  248  322  419  
8  Iterative  GT  42  49  48  
9  Iterative  GT  EP  17  20  24  
10  Iterative  GT  GT  14  16  20  
11  Iterative  WGT  EP  4  6  18  
12  Iterative  WGT  GT  1.4  1.7  2  
13*  MetaPose  EP  40  44  46  
14  MetaPose  rand  43  47  47  
15  MetaPose  EP  31  40  51  
16  MetaPose  EP  47  53  57  
17  MetaPose  EP  39  44  50  
18  MetaPose  rand  43  49  54  
19  Iterative  EP  41  44  46  
20  MetaPose  EP  39  42  44 
Setup  Inputs  # cam  
#  Method  Sub.  Loss  2D  init  6  4  2 
1*  AniPose  C    GT  50  52  221  
2*  Average  C      CP  81  82  86 
3*  Iterative  C  CP  31  35  78  
4  Iterative  C  GT  28  30  40  
5  Iterative  C  GT  CP  8  8  29  
6  Iterative  C  WGT  CP  8  7  28  
7*  MetaPose  C  CP  53  53  75  
8*  MetaPose  C  CP  48  45  55  
9  MetaPose  C  rand  51  48  53  
10  MetaPose  C  CP  240  241  240  
11  MetaPose  C  CP  55  53  89  
12  Iterative  C  CP  27  29  34  
13  MetaPose  C  CP  50  49  58  
14  MetaPose  C  CP  48  46  49  
15  AniPose  F    GT  56  60  231  
16  Average  F      GT  102  131  197 
17  Iterative  F  GT  32  41  134  
18  MetaPose  F  GT  48  52  73  

6 Conclusions
In this paper, we propose a new modular approach to 3D pose estimation that requires only 2D supervision for training and significantly improves upon the state of the art by fusing per-view outputs of single-view modules with a simple three-layer fully-connected neural network. Our modular approach not only enables practitioners to analyze and improve the performance of each component in isolation, and to channel future improvements in the respective subtasks into improved 3D pose estimation "for free", but also provides a common "bridge" that enables easy interoperation of different schools of thought in 3D pose estimation – enriching both the "end-to-end neural world" with better model-based priors and improved interpretability, and the "iterative refinement world" with better-conditioned optimization problems, transfer learning, and faster inference times. We provide a detailed ablation study dissecting the sources of the remaining error, suggesting that future progress in this task may come from the adoption of a full camera model, further improvements in 2D pose localization, better pose priors, and incorporating temporal signals from video data.
7 Acknowledgement
We would like to thank Bastian Wandt, Nori Kanazawa and Diego Ruspini for help with CanonPose [wandt2020canonpose], stacked hourglass pose estimator, and interfacing with AniPose, respectively.
References
8 Supplementary
[Notation: timestamp; camera; joint; mixture component; input image; GT 2D pose; training dataset; joint 3D pose and camera guess; (3D pose, cameras' parameters); initial, refined, and neural estimates; projection onto a camera; Gaussian mixture parameters; weak projection parameters; monocular 3D pose estimate.]
Metric               Train              Validation         Test
                     init  ref   neur   init  ref   neur   init  ref   neur
GT log-probability         5.12               5.56               5.08
Log-probability      4.03  6.42  4.90   4.38  6.69  5.10   3.72  6.30  5.40
PMPJPE [mm]          69    40    28     65    36    29     74    45    40
NMPJPE [mm]          78    61    47     69    56    49     88    70    57
MSE 2D               15    5     1      6     2     2      20    7     8
(a) H36M
Metric               Train              Validation         Test
                     init  ref   neur   init  ref   neur   init  ref   neur
GT log-probability         5.58               5.57               4.85
Log-probability      2.79  6.08  5.20   2.84  6.03  4.45   3.02  5.84  4.05
PMPJPE [mm]          71    19    21     72    18    35     80    31    53
NMPJPE [mm]          109   32    34     108   33    54     147   63    86
MSE 2D               34    7     5      37    12    12     30    9     14
(b) SkiPose
# cams  100%  90%  80%  70%  60%  47%  35%  25%  15%  5% 

4  40  40  41  42  43  43  44  46  48  61 
3  44  45  46  47  45  48  48  48  52  63 
2  46  45  47  47  48  47  49  49  51  65 
# cams  100%  90%  80%  70%  60%  47%  35%  25%  15%  5% 

4  44  46  46  43  53  48  71  80  75  138 
3  50  51  53  54  58  70  74  82  86  137 
2  54  54  56  54  54  54  72  70  85  476 
Videos with test predictions: http://bit.ly/iccv2206.
8.1 Pseudocode
See Algorithms 1 and 2 for the training and inference pseudocode. Note that Stage 2 is not needed at inference time. ReParam and UnReParam in these listings stand for reparameterizing (and un-reparameterizing) camera rotations as 6D vectors (described in Subsection 8.4 below) and scales as log-scales, which makes the optimization problem unconstrained. RigidAlignment is described in Subsection 8.3 below.
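For concreteness, the scale part of ReParam/UnReParam can be sketched as follows; the function names are ours, and this is only a minimal illustration of why the log-scale parameterization makes the problem unconstrained:

```python
import numpy as np

def reparam_scale(scale):
    """Map a strictly positive scale to an unconstrained real parameter."""
    return np.log(scale)

def unreparam_scale(log_scale):
    """Map an unconstrained parameter back to a strictly positive scale."""
    return np.exp(log_scale)
```

Any real value of the optimization variable maps back to a valid positive scale, so the optimizer never has to enforce a positivity constraint explicitly.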
8.2 Extended model description
Architecture
For monocular 2D pose estimation we used the stacked hourglass network [newell2016stacked] pretrained on the COCO pose dataset [guler2018densepose]. We additionally trained a linear regression adapter to convert between the COCO and H36M label formats (see supplementary Figure 3 for a comparison of the labeling formats); this yielded better generalization than fine-tuning the pretrained network on H36M directly, as shown in supplementary Table 6. The COCO-pretrained network generalized very poorly to the SkiPosePTZ dataset because of the visual domain shift, so we fine-tuned the stacked hourglass network using ground-truth 2D labels. For the monocular 3D estimates used in Stage 1, we applied EpipolarPose [kocabas2019epipolar] on Human3.6M and CanonPose [wandt2020canonpose] on SkiPosePTZ. We note that, despite the significant shift in labeling format between the predictions of these monocular 3D methods and the format used in the datasets we evaluated on, this does not affect the quality of the camera initialization acquired via rigid alignment. Similar to prior work [ma2020deepoptimizer], each "neural optimizer step" is trained separately: a fresh neural network is used at each stage, and stop-gradient is applied to all of its inputs. We used the same architecture for the stage network across all experiments: fully-connected 512-dimensional layers followed by a fully-connected 128-dimensional layer, all with SELU nonlinearities [klambauer2017selu], followed by a dense output of the size corresponding to the optimization space (flattened 3D pose and weak camera model parameters). We retrained each stage multiple times until the validation PMPJPE improved or we ran out of "stage training attempts".
Hyperparameters
We used the Adam optimizer [kingma2014adam] with learning rate 1e-2 for 100 steps for exact refinement, and 1e-4 for training the stages of the neural optimizer for 3000 epochs. We set the "stage retraining budget" to 100 attempts in total. We used the number of dense layers per neural-optimizer stage for all experiments on Human3.6M, for all experiments on SkiPose (because the dataset is much smaller), and for experiments where the neural optimizer was additionally conditioned on bone lengths. We set in the main experiments, and add losses and to the neural optimizer with weights and zero others in the corresponding ablations.
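As a concrete illustration of the stage architecture described above (512- and 128-unit fully-connected layers with SELU nonlinearities, plus a linear head over the flattened optimization variables), here is a minimal numpy sketch; the input/output dimensions and the initialization scheme are illustrative assumptions, not the paper's exact values:

```python
import numpy as np

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    """SELU nonlinearity with its standard constants."""
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def init_stage_net(in_dim, out_dim, rng):
    """One stage network: 512-dim FC, 128-dim FC, linear output head."""
    dims = [in_dim, 512, 128, out_dim]
    return [(rng.normal(0.0, np.sqrt(1.0 / d_in), (d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def stage_net_forward(params, x):
    """Forward pass; SELU on hidden layers, linear output layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = selu(x)
    return x

# Illustrative output size: 17 joints x 3 coords plus 4 cameras x 9
# weak-camera parameters (these counts are assumptions for the example).
out_dim = 17 * 3 + 4 * 9
params = init_stage_net(64, out_dim, np.random.default_rng(0))
```

A fresh set of `params` would be trained per stage, with stop-gradient applied to the stage inputs, matching the staging scheme described above.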
8.3 Closed Form Expressions for Stage 1
Below we describe how we performed the rigid alignment of the monocular 3D pose estimates and how we inferred weak camera parameters from them. Assume that we have monocular 3D predictions x_c ∈ R^{J×3} over J joints in the frame of camera c. The parameters of the first camera are assumed to be known and fixed, whereas the rotations of the other cameras are inferred using optimal rigid alignment

R_c = argmin_{R ∈ SO(3)} Σ_j ‖ R x_{c,j} − x_{1,j} ‖² ,

whose minimizer is given in closed form via the SVD of the cross-covariance of the two centered point sets (the Kabsch algorithm). The scale s_c and shift t_c can be acquired by comparing the original monocular 2D predictions u_c in pixels to the pose rotated back into each camera frame, for example:

μ_c = (1/J) Σ_j x_{c,j}  (7)
s_c = Σ_j ‖ u_{c,j} − ū_c ‖ / Σ_j ‖ Π(x_{c,j} − μ_c) ‖  (8)
t_c = ū_c − s_c Π(μ_c)  (9)

where μ_c is the center of the 3D pose, ū_c is the center of the 2D prediction, and Π retains the first two coordinates. The initial guess for the pose is the average of all monocular poses rotated into the first camera frame:

p̂ = (1/C) Σ_c R_c x_c .  (10)
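The rigid-alignment and weak-camera initialization described above can be sketched in numpy as follows; `kabsch_rotation` and `weak_camera_from_2d` are our names, and the scale/shift estimator is one reasonable choice rather than necessarily the paper's exact one:

```python
import numpy as np

def kabsch_rotation(src, dst):
    """Rotation R minimizing sum_j ||R @ src[j] - dst[j]||^2 (rows are joints)."""
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    # SVD of the cross-covariance of the centered point sets.
    u, _, vt = np.linalg.svd(dst_c.T @ src_c)
    # Sign correction keeps the result a proper rotation (det = +1).
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt

def weak_camera_from_2d(pose_cam, kp2d):
    """Weak-perspective scale/shift matching pose_cam[:, :2] to 2D keypoints."""
    xy = pose_cam[:, :2]
    xy_c = xy - xy.mean(axis=0)
    kp_c = kp2d - kp2d.mean(axis=0)
    # Ratio of centered 2D spreads gives the scale; shift aligns the centers.
    scale = np.linalg.norm(kp_c, axis=1).sum() / np.linalg.norm(xy_c, axis=1).sum()
    shift = kp2d.mean(axis=0) - scale * xy.mean(axis=0)
    return scale, shift
```

Given rotations for every camera, the initial pose guess is then simply the average of the rotated monocular predictions, as in the equation above.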
8.4 6D rotation reparameterization
We used the following parameterization: given a = (a₁, a₂) ∈ R³ × R³, set b₁ = N(a₁), b₂ = N(a₂ − ⟨b₁, a₂⟩ b₁), and b₃ = b₁ × b₂, where N(v) = v/‖v‖ is a normalization operation and × is the vector (cross) product. This is essentially Gram-Schmidt orthogonalization: the rows b₁, b₂, b₃ of the resulting matrix are guaranteed to form an orthonormal basis. This rotation representation was shown to be better suited for optimization [zhou2019continuity].
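A minimal numpy sketch of this 6D-to-rotation map (function name ours):

```python
import numpy as np

def rotation_from_6d(a):
    """Map an unconstrained 6-vector to a 3x3 rotation matrix via Gram-Schmidt."""
    a1, a2 = a[:3], a[3:]
    b1 = a1 / np.linalg.norm(a1)
    # Remove the b1 component from a2, then normalize.
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    # Cross product completes the right-handed orthonormal basis.
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3])
```

Because any (non-degenerate) 6-vector maps to a valid rotation, the optimizer can move freely in R⁶ without constraint handling.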
8.5 Stable Gaussian Mixture
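A standard way to keep Gaussian-mixture log-likelihoods numerically stable is to evaluate them in log space with the log-sum-exp trick; the sketch below shows this general technique for an isotropic mixture and is not necessarily the paper's exact formulation (function name and shapes are ours):

```python
import numpy as np

def gmm_log_prob(x, weights, means, sigmas):
    """Stable log-density of an isotropic Gaussian mixture.

    x: (N, D) query points; weights: (K,); means: (K, D); sigmas: (K,).
    """
    x = np.atleast_2d(x)
    d = x.shape[1]
    # Squared distance of every point to every component mean, shape (N, K).
    sq = ((x[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    # Per-component log densities, including the mixture weights.
    log_comp = (np.log(weights)[None, :]
                - 0.5 * d * np.log(2 * np.pi * sigmas ** 2)[None, :]
                - 0.5 * sq / (sigmas ** 2)[None, :])
    # Log-sum-exp over components, shifted by the max to avoid underflow.
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True)))[:, 0]
```

Subtracting the per-point maximum before exponentiating keeps the density finite even for points far in the tails, where a naive implementation would underflow to zero and return −inf.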
8.6 Reference 2D performance
Table 6 shows the performance of the 2D pose prediction networks and of the resulting MetaPose network on different splits of the two datasets. It shows that both the 2D network and MetaPose overfit to SkiPose to a certain degree because of its smaller size.
8.7 Data efficiency
Results reported in Table 7 further confirm this, and suggest that we need at least 2-3k samples (each sample containing several cameras) to train MetaPose to within 10-20% of its performance on the full data. Results in Table 8 confirm that we need at least 2-5k samples to train MetaPose in the self-supervised mode, i.e. without using GT 2D supervision.
8.8 Images
We provide selected qualitative examples (failure cases, success cases) on the test sets of H36M and SkiPose (full dataset and only the first two cameras) in Figures 4-13; videos containing all test prediction visualizations are available at: http://bit.ly/iccv2206. Circles around joints in the 2D views represent the absolute reprojection error of that joint in that view.
We additionally note that:

- MetaPose considerably improves over the initial guess when a lot of self-occlusion is present (see Figures 4-6);
- MetaPose fails on extreme poses for which monocular estimation fails, e.g. somersaults (see Figures 9-10);
- in the two-camera Ski setup, AniPose yields a smaller reprojection error while producing very bad 3D pose results (Figure 13).