In this paper, we tackle the problem of estimating 3D coordinates of human joints from RGB images captured using synchronized (potentially moving) cameras with unknown positions, orientations, and intrinsic parameters using only 2D positions of these joints on captured images for supervision.
Historically, real-time capture of the human 3D pose has been undertaken only by large enterprises that could afford expensive specialized motion capture equipment [gleicher1999animation]. In principle, spatial coordinates of human body joints can be triangulated directly from camera-space observations, if all intrinsic and extrinsic parameters of these cameras (called “camera calibration” for short) are available [iskakov2019learnable, karashchuk2020anipose]. Techniques for acquiring accurate calibration in controlled environments with fixed cameras, or in short-baseline stereo captures, where self-calibration from cross-view correspondences is possible, have been extensively studied in prior work [building_rome_10.1145/2001269.2001293]. One scenario in which the aforementioned constraints do not apply is sports capture, in which close-ups of players in extreme poses are captured in front of low-texture backgrounds using widely spaced moving cameras: plain backgrounds preclude calibration, as few feature correspondences can be detected across views (e.g. the examples in Figure 1). Without known camera calibration, though, we have no practical way of relating position and uncertainty predictions from different views, which is critical for accurate 3D pose estimation, since, in most cases, only a subset of joints can be accurately localized from a single view due to (self-) occlusion. Considering all of the above, a practical solution needs to aggregate location and uncertainty estimates from multiple views without relying on prohibitively expensive 3D joint or camera supervision. As we discuss below, none of the existing approaches fully satisfies these requirements (see the top half of Figure 1 for illustrations).
Fully-supervised 3D pose estimation approaches yield the lowest estimation error, but make use of known 3D camera specification during either training [xie2020metafuse] or both training and inference [iskakov2019learnable]. However, the prohibitively high cost of 3D joint annotation and full camera calibration (with moving cameras) in-the-wild makes it difficult to acquire large enough labeled datasets representative of specific environments [rhodin2018skipose, joo2020exemplar], therefore rendering supervised methods inapplicable in this setup.
Weakly-supervised monocular 3D methods, [iqbal2020weakly, kocabas2019epipolar, wandt2020canonpose] as well as 2D-to-3D lifting networks [chen2019unsupervised, wandt2019repnet], relax data constraints to enable 3D pose inference using just multi-view 2D data without calibration at train time. Unfortunately, at inference time, these methods can only be applied to a single view at a time, therefore unable to leverage cross-view information and uncertainty. A straightforward extension of these methods, performing rigid alignment of monocular predictions at inference time, yields sub-optimal estimation error, as we show in Section 5.
Classical structure-from-motion approaches to 3D pose estimation [karashchuk2020anipose] iteratively refine both the camera and the 3D pose from noisy 2D observations. However, these methods are often slower than their neural counterparts, since they have to perform several optimization steps during inference, and, more importantly, most of them do not consider uncertainty estimates and inductive biases to speed up the inference, resulting in sub-par performance and higher sensitivity to noise, especially with fewer cameras, as we show in Section 5.
To overcome these limitations we propose “MetaPose” (see Figure 1). Our method for 3D pose estimation aggregates pose predictions and uncertainty estimates across multiple views, requires no 3D joint annotations or camera parameters at either train or inference time, and adds very little latency to the resulting pipeline. Our approach uses an off-the-shelf weakly-supervised 3D network to form an initial guess about the pose and the camera setup, and iteratively refines this guess using 2D joint location probability heatmaps generated by an off-the-shelf 2D pose estimation network. To speed up the inference, and to compensate for errors in the off-the-shelf networks, we train a neural optimizer to mimic the slow iterative optimization refinement. This modular approach not only yields low estimation error with low latency, but also enables us to easily analyze the performance of individual components and plug in new models, priors, or losses if needed. Our main contributions can be summarized as follows:
To the best of our knowledge, we are the first to show that a general-purpose feed-forward neural network can accurately estimate the 3D human pose and the camera configuration from multiple views, taking into account joint occlusions and prediction uncertainties.
We propose a 3-stage modular training pipeline that uses only 2D supervision at train time, and a fast 2-stage inference pipeline.
On the well-established Human3.6M [ionescu2013human36m] dataset our method yields the lowest pose estimation error across models that do not use 3D ground truth annotations (more than 40% improvement in PMPJPE over the SotA) when using all four cameras and degrades gracefully when restricted to just two cameras (more than 40% improvement over the best baseline).
On the challenging in-the-wild Ski-Pose PTZ [rhodin2018skipose] dataset with moving cameras, our method yields error comparable to the bundle adjustment baseline [karashchuk2020anipose] when using all six cameras, but has much faster inference time, and massively outperforms bundle adjustment when restricted to just two cameras.
We conduct a detailed ablation study dissecting sources of error into the error due to the weak camera model, imperfect initialization, imperfect 2D heatmaps, and the additional error introduced by the neural optimizer, suggesting areas where further progress can be made.
2 Related Work
In this section, we briefly cover prior work on multi-view 3D human pose estimation. We focus on approaches that exploit multi-view information, see [joo2020exemplar] for a survey of 3D human pose estimation in the wild.
Supervised methods [iskakov2019learnable, tu2020voxelpose, Chen_2020_CVPR_crossview] yield the lowest 3D pose estimation errors on multi-view single person [ionescu2013human36m] and multi-person [Joo_2019_panoptic, belagiannis20143d, Chen_2020_CVPR_crossview] datasets, but require precise camera calibration during both training and inference. Other approaches [xie2020metafuse] use datasets with full 3D annotations and a large number of annotated cameras to train methods that can adapt to novel camera setups in visually similar environments, therefore somewhat relaxing camera calibration requirements for inference. martinez2017simple use pre-trained 2D pose networks [newell2016stacked] to take advantage of existing datasets with 2D pose annotations. Epipolar transformers [epipolartransformers] use only 2D keypoint supervision, but require camera calibration to incorporate 3D information in the 2D feature extractors. In general, fully-supervised methods require some form of 3D supervision which is prohibitively expensive to collect in-the-wild, like the sports capture setting we consider in this paper.
Weak and self-supervision
Several approaches exist that do not use paired 3D ground truth data. Many augment limited 3D annotations with 2D labels [Zhou_2017_weakly_supervised, hmrKanazawa17, Mitra_2020_CVPR, zanfir2020weakly]. Fitting-based methods [hmrKanazawa17, kocabas2020vibe, kolotouros2019spin, zanfir2020weakly] jointly fit a statistical 3D human body model and 3D human pose to monocular images or videos. Analysis-by-synthesis methods [Rhodin_2018_ECCV, Kundu_2020_nvs, Jakab_2020_CVPR_unlabelled] learn to predict 3D human pose by estimating appearance in a novel view. Most related to our work are approaches that exploit the structure of multi-view image capture. EpipolarPose [kocabas2019epipolar] uses epipolar geometry to obtain 3D pose estimates from multi-view 2D predictions, and subsequently uses them to directly supervise 3D pose regression. iqbal2020weakly propose a weakly-supervised baseline that predicts pixel coordinates of joints and their depth in each view, and penalizes the discrepancy between rigidly aligned predictions from different views during training. The self-supervised CanonPose [wandt2020canonpose] further advances the state of the art by decoupling 3D pose estimation in the “canonical” frame from the estimation of the camera frame. drover2018can learn a “dictionary” mapping 2D pose projections into corresponding realistic 3D poses, using a large collection of simulated 3D-to-2D projections. RepNet [wandt2019repnet] and chen2019unsupervised train similar “2D-to-3D lifting networks” with more realistic data constraints. While all the aforementioned methods use multi-view consistency for training, they do not allow pose inference from multiple images. A straightforward extension of these methods that, at inference time, applies rigid alignment to predicted monocular 3D estimates and ignores the uncertainty in these predictions, does not perform well, as we show in Section 5.
Estimating camera and pose simultaneously is a long-standing problem in vision [rosales2001estimating]. One of the more recent successful attempts is the work of bridgeman2019multi, who proposed an end-to-end network that refines the initial calibration guess using center points of multiple players in the field. In the absence of such external calibration signals, takahashi2018human performs bundle adjustment with bone length constraints, but does not report results on a public benchmark. AniPose [karashchuk2020anipose] performs joint 3D pose and camera refinement using a modified version of the robust 3D registration algorithm of zhou2016fast. Such methods ignore predicted uncertainty for faster inference, instead iteratively detecting outlier 2D observations and discarding them during refinement. In Section 5, we show that these classical approaches struggle in ill-defined settings, such as when we have a small number of cameras. More recently, SPIN [kolotouros2019spin] and Holopose [Guler_2019_CVPR_holopose] incorporate iterative pose refinement for monocular inputs; however, the refinement is tightly integrated into the pose estimation network. MetaPose effectively regularizes the pose estimation problem with a finite-capacity neural network, resulting in both faster inference and higher precision.
Our method involves three stages; see Figure 2:
We first acquire single-view estimates of the full 3D pose using a pre-trained monocular 3D network (EpipolarPose [kocabas2019epipolar] for H36M and CanonPose [wandt2020canonpose] for SkiPose-PTZ) and infer an initial guess for the camera configuration by applying closed-form rigid alignment to these single-view 3D pose estimates.
We compute detailed single-view heatmaps using a state-of-the-art monocular 2D network (Stacked Hourglass [newell2016stacked] for both datasets) pre-trained on available 2D labels and refine our initial guess for the 3D pose and cameras by maximizing the likelihood of re-projected points given these heatmaps via gradient descent.
We approximate this iterative refinement given an initial guess and multi-view heatmaps via a forward pass of a neural network.
This modular multi-stage approach lets us prime gradient descent in stage 2 with a “good enough” initialization to start from the correct basin of the highly non-convex multi-view heatmap likelihood objective. Moreover, it lets us swap pre-trained modules for monocular 2D and 3D without re-training the entire pipeline whenever a better approach becomes available. Finally, the neural optimizer in stage 3 provides orders of magnitude faster inference than the iterative refinement.
We assume that we have access to a labeled 2D training dataset with timestamps and cameras, where each entry consists of an RGB image and a matrix of pixel-space locations of joints in that image. We also assume that for each timestamp there exists a ground truth 3D pose and corresponding camera parameters such that the projection of that pose onto each camera — a matrix of joint locations in pixels — is close to the ground truth image-space coordinates. We assume that we have no access to ground truth 3D poses and camera parameters during training, but we want to learn a function that, for a fresh set of multi-view images, yields a good guess for the 3D pose estimate and the vectors of camera parameters in terms of the re-projection error. Ultimately, we would like the mean per-joint position error between the ground truth 3D pose and our predictions to be as small as possible, without access to that ground truth; hence the method is “weakly-supervised”. In what follows, for improved readability, we sometimes omit the entry index when talking about a single entry. Our notation is also summarized in Table 5 in the supplementary.
We assume that we have access to two “external” networks that can be trained with 2D supervision: 1) a monocular 2D pose model that yields per-joint heatmaps, and 2) a self-supervised monocular 3D model that yields camera-frame 3D pose estimates, such that the first two coordinates of each joint contain the estimated pixel coordinates of that joint in the image, and the last coordinate contains the estimated joint depth in the “same” units (i.e. distances between all points are correct up to scale). We approximate and store each heatmap as a 2-dimensional Gaussian mixture model with a fixed number of components, each parameterized by a mean, a spherical covariance, and a weight. This “compressed” format serves two main goals: it turns the joint position likelihood into a smooth function that is easier to optimize over than, for example, pixel-level heatmap interpolation, and it reduces the dimensionality of the input to the neural optimizer, therefore making it possible to train smaller models. We use a weak-projection camera model, so each camera is defined by a tuple of rotation matrix, pixel shift, and scale parameters; the projection operator rotates the pose into the camera frame, keeps the first two coordinates, and applies the scale and shift.
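As a concrete illustration, the weak-projection operator described above can be sketched as follows (a minimal NumPy sketch with assumed shapes — a (J, 3) pose, a 3×3 rotation, a 2-vector pixel shift, and a scalar scale; not the authors' exact implementation):

```python
import numpy as np

def weak_project(pose3d, R, shift, scale):
    """Weak-perspective projection of a (J, 3) pose: rotate into the
    camera frame, keep the first two coordinates, then scale and shift."""
    cam = pose3d @ R.T                 # (J, 3) pose in camera coordinates
    return scale * cam[:, :2] + shift  # (J, 2) pixel coordinates
```

Under this model the absolute depth and focal length are entangled into the single scale parameter, which is why the absolute metric scale of the pose cannot be recovered.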
3.1 Initial estimate – Stage 1.
The purpose of this stage is to acquire an initial guess using the monocular 3D pose model. We provide the exact expression for the initial guess in Supp. 8.3 and describe the intuitive idea behind it below. We choose the first camera to be the canonical frame, and use orthogonal Procrustes alignment (via SVD of the outer product of mean-centered poses) to find the rotation relating the 3D monocular prediction for each camera to the monocular 3D estimate for the first camera. Similarly, we find the optimal shifts and scales that minimize the discrepancy between the first-camera estimate and the rotated, scaled, and shifted estimates from each of the other cameras. The initial guess for the 3D pose is then the average of the monocular 3D pose predictions from all cameras, rigidly aligned to the first camera frame by the corresponding optimal rotations, scales, and shifts.
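The orthogonal Procrustes step used to relate per-camera monocular predictions can be sketched as follows (a minimal NumPy sketch of the standard SVD-based solution; variable names are our own):

```python
import numpy as np

def procrustes_rotation(source, target):
    """Optimal rotation aligning mean-centered `source` (J, 3) to
    `target` (J, 3) via SVD of their cross-covariance."""
    src = source - source.mean(axis=0)
    tgt = target - target.mean(axis=0)
    U, _, Vt = np.linalg.svd(tgt.T @ src)
    d = np.sign(np.linalg.det(U @ Vt))  # guard against reflections
    D = np.diag([1.0, 1.0, d])
    return U @ D @ Vt                   # R such that src @ R.T ≈ tgt (centered)
```

The per-camera scales and shifts can then be solved for in closed form by comparing the rotated estimates, as described in Supp. 8.3.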
3.2 Iterative refinement – Stage 2.
In the second stage we use monocular heatmaps, in the form of Gaussian mixtures with parameters , to get a refined guess starting from the initial estimate . The refinement optimization problem over compressed heatmaps is as follows:
where and are numerically stable versions of and , and is the numerically stable log-likelihood of a point given a Gaussian mixture (Supp. 8.5). During optimization, we re-parameterize the (always positive) scale as log-scale, and the rotation with a 6D vector as in zhou2019continuity (Supp. 8.4).
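The heatmap term of the refinement objective can be illustrated with a minimal numerically stable Gaussian-mixture log-likelihood (log-sum-exp over spherical 2D components; shapes and names are our assumptions, not the authors' code):

```python
import numpy as np

def gmm_log_prob(point, means, sigmas, weights):
    """Stable log p(point) under a 2D spherical-Gaussian mixture:
    log sum_k w_k N(point | mu_k, sigma_k^2 I), via log-sum-exp.
    `means` is (K, 2); `sigmas` and `weights` are (K,)."""
    d2 = np.sum((means - point) ** 2, axis=-1)      # (K,) squared distances
    log_comp = (np.log(weights)
                - np.log(2.0 * np.pi * sigmas ** 2)
                - 0.5 * d2 / sigmas ** 2)           # (K,) component log-densities
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m))) # log-sum-exp trick
```

During refinement, this log-probability would be evaluated at every reprojected joint in every view and maximized by gradient descent over the pose and camera parameters.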
3.3 Neural optimizer – Stage 3.
In this final stage, we train an optimizer as a neural network to predict the optimal update to the current guess for the pose and cameras, similar to ma2020deepoptimizer for solving inverse problems. Specifically, the update is computed from heatmap mixture parameters , the current guess , the projections of the current guess onto each camera , and the current value of the refinement loss (we omit the dependency of on in the first line for readability):
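One such neural-optimizer step could be sketched as follows, using the layer sizes reported in our implementation details but with hypothetical input packing (a toy NumPy sketch with random weights; the real step is trained):

```python
import numpy as np

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    """Scaled exponential linear unit [klambauer2017selu]."""
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1))

class NeuralOptStep:
    """One feed-forward optimizer step (sketch): maps flattened features
    (mixture parameters, current guess, reprojections, current loss value)
    to an additive update on the pose-and-camera parameter vector."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, 512))
        self.W2 = rng.normal(0.0, 1.0 / np.sqrt(512), (512, 128))
        self.W3 = rng.normal(0.0, 1.0 / np.sqrt(128), (128, out_dim))

    def __call__(self, features, guess):
        h = selu(features @ self.W1)   # 512-dim hidden layer
        h = selu(h @ self.W2)          # 128-dim hidden layer
        return guess + h @ self.W3     # dense output added to the current guess
```

Each such step is trained separately with stop-gradient on its inputs, so a few stacked steps mimic a few iterations of the slow refiner.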
The neural network is optimized to minimize the re-projection loss between re-projected neural estimate and ground truth 2D joint locations while ensuring the overall pose and camera estimate remains close to the one obtained via iterative refinement:
where the reprojection loss is defined as
and the teacher loss that measures the distance between (the estimate for camera and pose given by the iterative refinement in Stage 2) and the neural estimate :
We use this distance between rotations since huynh2009metrics suggest that it is better suited for direct optimization than the L2-distance. We need both losses because draws the neural optimizer towards the neighborhood of the correct solution, and penalizes small (in terms of ) deviations from that result in big reprojection errors.
Stages 2 and 3 can be easily extended with existing priors over human poses. For example, we were able to make our predictor more “personalized” by conditioning our model on vectors of normalized bone length ratios of a specific individual [takahashi2018human]. More specifically, we added a limb length component to the refinement loss:
We passed these bone length ratios as an input to each optimizer step, and trained the neural optimizer to minimize the resulting combined loss.
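One plausible form of such a limb-length component, assuming a squared-error penalty on normalized bone-length ratios (the exact form of our loss may differ), is:

```python
import numpy as np

def bone_length_prior(pose3d, edges, target_ratios):
    """Penalize deviation of the predicted normalized bone-length ratios
    from a subject's known ratios. `edges` is a list of (parent, child)
    joint index pairs; `target_ratios` sums to one."""
    lengths = np.array([np.linalg.norm(pose3d[i] - pose3d[j]) for i, j in edges])
    ratios = lengths / lengths.sum()   # scale-invariant normalization
    return np.sum((ratios - target_ratios) ** 2)
```

Normalizing by the total skeleton length keeps the prior scale-invariant, which matters because the absolute scale is unobservable under the weak camera model.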
Note, during inference we only need to perform rigid alignment over monocular 3D estimates (Stage 1) to acquire an initial guess , and apply the neural optimizer to this guess in a feed-forward manner (Stage 3), no iterative refinement (Stage 2) is necessary. We provide pseudo-code for both training and inference in Supp. 8.1.
In this section, we specify datasets and metrics we used to validate the performance of the proposed method and a set of baselines and ablation experiments we conducted to evaluate the improvement in error provided by each stage and each supervision signal.
We evaluated our method on the Human3.6M [ionescu2013human36m] (H36M) dataset with four fixed cameras and the more challenging SkiPose-PTZ [rhodin2018skipose] (SkiPose) dataset with six moving pan-tilt-zoom cameras. We used the standard train-test evaluation protocol for H36M [iskakov2019learnable, kocabas2019epipolar], with subjects 1, 5, 6, 7, and 8 used for training, and 9 and 11 used for testing. We additionally pruned the H36M dataset by taking every 16th frame, resulting in 24443 train and 8516 test examples, each example containing information from four cameras. We evaluated our method on two subsets of SkiPose: the “full” dataset used by rhodin2018skipose (1315 train / 284 test), and a “clean” subset (1035 train / 230 test) used by wandt2020canonpose that excludes views where visibility is heavily obstructed by the winter storm, each example containing information from six cameras. We additionally augment the SkiPose dataset by randomly shuffling camera order during training to prevent the neural optimizer from overfitting on the small training set. In each dataset, we used the first 64 examples from the train split as a validation set.
We report Procrustes aligned Mean Per Joint Position Error (PMPJPE) that measures the L2-error of 3D joint estimates after applying the optimal rigid alignment to the full predicted 3D pose and the ground truth 3D pose, and therefore ignores the overall pose orientation error, as well as the Normalized Mean Per Joint Position Error (NMPJPE) that measures the error after optimally scaling and shifting the prediction for the given camera frame to the ground truth pose in that camera frame, so it is sensitive to the pose orientation, but ignores the errors in “scale”. We do not report the non-normalized pose estimation error because, in the absence of intrinsic camera parameters, the correct absolute scale can not be estimated from data. We additionally report the “order-of-magnitude” of additional time () it takes to perform multi-view 3D pose inference using different methods on top of monocular pose estimation time for each view.
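For concreteness, PMPJPE can be computed with a standard similarity-Procrustes alignment followed by the mean joint error (a NumPy sketch of the standard procedure, not our evaluation code):

```python
import numpy as np

def pmpjpe(pred, gt):
    """Mean per-joint error (same units as input) after optimal similarity
    (rotation + scale + translation) alignment of `pred` (J, 3) onto `gt` (J, 3)."""
    p = pred - pred.mean(axis=0)
    g = gt - gt.mean(axis=0)
    U, S, Vt = np.linalg.svd(g.T @ p)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # no reflections
    R = U @ D @ Vt                                  # optimal rotation
    s = np.trace(np.diag(S) @ D) / np.sum(p ** 2)   # optimal scale
    aligned = s * p @ R.T + gt.mean(axis=0)
    return np.linalg.norm(aligned - gt, axis=1).mean()
```

NMPJPE differs in that only scale and shift (not rotation) are optimized per camera frame, so it remains sensitive to orientation errors.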
| Method | Sup. | MV inf. | Uncert. | PMPJPE | NMPJPE | +Δt |
|---|---|---|---|---|---|---|
| Iskakov et al. [iskakov2019learnable] | 3D | ✓ | ✓ | 20 | - | - |
| EpipolarPose (EP) [kocabas2019epipolar] | S | ✗ | n/a | 71 | 78 | - |
| Rhodin et al. [rhodin2018skipose] | 2/3D | ✗ | n/a | 65 | 80 | - |
| Iqbal et al. [iqbal2020weakly] | 2D | ✗ | n/a | 55 | 66 | - |
| AniPose [karashchuk2020anipose] + GT | 2D | ✓ | ✗ | 75 | 103 | 5 |
| Per-view EP [kocabas2019epipolar] | S | ✗ | n/a | 86 | 97 | |
| Iterative Ref. + EP | 2D | ✓ | ✓ | 45 | 70 | 10 |
| MetaPose + EP | 2D | ✓ | ✓ | 40 | 57 | |
| MetaPose () + EP | S | ✓ | ✓ | 43 | 58 | |
| Method | Sup. | MV inf. | Uncert. | PMPJPE | NMPJPE | +Δt |
|---|---|---|---|---|---|---|
| CanonPose (CP) [wandt2020canonpose] | S | ✗ | n/a | 90 | 128 | - |
| AniPose [karashchuk2020anipose] + GT | 2D | ✓ | ✗ | 50 | 62 | 5 |
| Per-view CP [wandt2020canonpose] | S | ✗ | n/a | 86 | 115 | |
| Iterative Ref. + CP | 2D | ✓ | ✓ | 31 | 63 | 10 |
| MetaPose + CP | 2D | ✓ | ✓ | 53 | 86 | |
| MetaPose () + CP | 2D | ✓ | ✓ | 48 | 60 | |
On H36M we lower-bound the error with the state-of-the-art fully-supervised baseline of iskakov2019learnable, which uses ground truth camera parameters to aggregate multi-view predictions during inference. We also compare the performance of our method to methods that use multi-view 2D supervision during training but only perform inference on a single view at a time: self-supervised EpipolarPose (EP) [kocabas2019epipolar] and CanonPose (CP) [wandt2020canonpose], as well as the weakly supervised baselines of iqbal2020weakly and rhodin2018skipose. We applied EpipolarPose to our own set of human bounding boxes on H36M, so our per-view results differ from those reported by kocabas2019epipolar. On SkiPose we compared our model with the only two baselines available in the literature: CanonPose [wandt2020canonpose] and rhodin2018skipose. CanonPose is preferable to EpipolarPose on the SkiPose dataset because it does not assume fixed cameras. We include results on both the “clean” and “full” subsets of SkiPose (described previously) for completeness; however, CanonPose was trained and evaluated only on the “clean” subset, so in the absence of monocular 3D methods trained and evaluated on the “full” SkiPose dataset, when evaluating on the “full” subset we trained and evaluated our method using ground truth data with Gaussian noise in place of CanonPose predictions.
We also compared our method against “classical” bundle adjustment initialized with ground truth extrinsic camera parameters, and fixed ground truth intrinsic parameters of all cameras, therefore putting it into unrealistically favorable conditions. We used the well-tested implementation of bundle adjustment in AniPose [karashchuk2020anipose] that uses an adapted version of the 3D registration algorithm of zhou2016fast. This approach takes point estimates of joint locations as an input (i.e. no uncertainty) and iteratively refines camera parameters and joint location estimates.
We additionally measured different sources of remaining error in our predictions by replacing each component discussed in Section 3 with either its ground truth equivalent or a completely random guess, as well as constraining supervision signals we pass to the neural optimizer. More specifically, we measured the effect of 1) varying the number of cameras used for training and evaluation; 2) priming different refinement methods with random and ground truth initialization; 3) using “fake heatmaps” centered around ground truth joints projected back into the image plane using different camera models; and 4) disabling and enabling losses and priors when training our neural optimizer. We did not measure the performance of the neural optimizer primed with the ground truth initialization because that would have just shown the ability of the neural network to approximate the identity function.
For monocular 2D pose estimation, we used the stacked hourglass network [newell2016stacked] pre-trained on the COCO pose dataset [guler2018densepose]. For monocular 3D estimation in Stage 1, we applied EpipolarPose [kocabas2019epipolar] on Human3.6M and CanonPose [wandt2020canonpose] on SkiPose-PTZ. We note that differences in the joint labeling schemes used by these monocular 3D methods and our evaluation set do not affect the quality of the camera initialization we acquire via rigid alignment, as long as monocular 3D estimates for all views follow the same labeling scheme. Similar to prior work [ma2020deepoptimizer], each “neural optimizer step” is trained separately, and stop-gradient is applied to all inputs. We used the same architecture across all experiments: fully-connected 512-dimensional layers followed by a fully-connected 128-dimensional layer, all with selu nonlinearities [klambauer2017selu], followed by a dense output of the size corresponding to the optimization space (the flattened 3D pose and weak camera model parameters). We re-trained each stage multiple times until the validation PMPJPE improved or we ran out of “stage training attempts”. We refer our readers to Section 8.2 in the supplementary for a more detailed description of all components we used to train our neural optimizer and their reference performance on train and test.
In terms of both PMPJPE and NMPJPE, our neural optimizer, primed with outputs of EpipolarPose (EP) on H36M and CanonPose (CP) on SkiPose, outperforms the bundle-adjustment baseline (AniPose [karashchuk2020anipose]) initialized with ground truth (GT) by +35mm on H36M with four cameras and +2mm on SkiPose with six cameras. Our method also outperforms weakly-supervised baselines by +13mm on H36M and by +30mm on SkiPose. It also outperforms the iterative refiner that was used to “teach it” by 3-5mm on H36M, while being 10x faster than the iterative refiner.
On H36M, the neural optimizer is able to learn in a self-supervised fashion – using the heatmap log-probability loss alone, without 2D ground truth supervision – and still outperform the iterative refiner that uses the same initial guess (4.4 vs 4.4). On the smaller SkiPose dataset, self-supervised training fails (4.4).
We hypothesize that the weakly-supervised neural optimizer was able to outperform the iterative refiner by learning to compensate for errors in the other pre-trained components using ground truth 2D joint locations in the reprojection loss, since removing the reprojection loss decreases its performance (4.4 vs 4.4). We believe that the finite-capacity neural optimizer additionally regularizes the pose estimation problem, allowing the self-supervised neural optimizer to outperform the iterative refiner that minimized the very same log-probability loss (4.4 vs 4.4). This also hints at why the analogous setup fails on SkiPose – the smaller number of training examples led to overfitting and, consequently, poor performance on test. This also agrees with the measured discrepancy between train and test errors on SkiPose (Supp. 8.6). See Supp. 8.8 for qualitative examples of poses output at Stages 1-3.
We additionally studied the sources of remaining error (40mm on H36M and 48mm on SkiPose).
We performed iterative refinement on “fake” heatmaps centered around reprojected ground truth joint locations (4.4 vs 4.4; 4.4 vs 4.4). This reveals that imperfections in heatmaps generated by the 2D pose estimation network contribute to at least 20mm error on both H36M and SkiPose across all views. Moreover, poor performance of the iterative refinement with ground truth heatmaps in the two-camera setup on SkiPose hints at why AniPose fails in that setting – first two cameras are too close relative to the distance to subjects (see Figure 4 in rhodin2018skipose), rendering this problem ill-defined.
We primed AniPose [karashchuk2020anipose] and our iterative refiner with ground truth and random initial guesses. This reveals that both AniPose and iterative refinement fail to converge to a good solution without a good initialization (4.4, 4.4, 4.4), but a “perfect” initialization only improves their error by an additional 3-5mm (4.4 vs 4.4), so the initialization we get from Stage 1 is already “good enough”. We also primed our neural optimizer with random initial guesses (both during training and inference). This reveals that, when provided with enough training data (H36M) and enough cameras, our simple neural optimizer was able to quickly (in 3-5 steps) converge to a high-quality solution, only 3-5mm worse than when primed with outputs from Stage 1 (4.4 vs 4.4). Moreover, when the training set is too small (SkiPose), random initialization actually improves final results in the few-camera setup (4.4 vs 4.4).
We conclude that the fully-supervised baseline of iskakov2019learnable likely approaches the limits of what can be done with a weak camera model on H36M, and that further progress might come from better heatmaps and a better camera model. Good performance of MetaPose in the ill-defined two-camera setup (SkiPose), which improves even further when primed with a random initial guess, supports our hypothesis about MetaPose regularizing the ill-defined problem of few-camera pose estimation with a finite-capacity neural network.
The remaining experiments show that only the full model that used all proposed components (reprojection loss , teacher loss , and, optionally, bone length prior ) was able to keep the pose estimation error low with fewer cameras on both H36M (4.4-4.4) and SkiPose datasets (4.4-4.4).
In this paper, we propose a new modular approach to 3D pose estimation that requires only 2D supervision for training and significantly improves upon the state-of-the-art by fusing per-view outputs of single-view modules with a simple three-layer fully-connected neural network. Our modular approach not only enables practitioners to analyze and improve the performance of each component in isolation, and to channel future improvements in the respective sub-tasks into improved 3D pose estimation “for free”, but also provides a common “bridge” that enables easy inter-operation of different schools of thought in 3D pose estimation – enriching both the “end-to-end neural world” with better model-based priors and improved interpretability, and the “iterative refinement world” with better-conditioned optimization problems, transfer learning, and faster inference times. We provide a detailed ablation study dissecting different sources of the remaining error, suggesting that future progress on this task might come from the adoption of a full camera model, further improvements in 2D pose localization, better pose priors, and incorporating temporal signals from video data.
We would like to thank Bastian Wandt, Nori Kanazawa and Diego Ruspini for help with CanonPose [wandt2020canonpose], stacked hourglass pose estimator, and interfacing with AniPose, respectively.
| Symbol | Meaning |
|---|---|
| | GT 2D pose |
| | joint 3D pose and camera guess |
| | (3D pose, cameras’ parameters) |
| | initial, refined and neural estimates |
| | projection of onto camera |
| | Gaussian mixture parameters |
| | weak projection parameters |
| | monocular 3D pose estimate |
|MSE 2D ()||15||5||1||6||2||2||20||7||8|
|MSE 2D ()||34||7||5||37||12||12||30||9||14|
Videos with test predictions: http://bit.ly/iccv2206.
See Algorithms 1 and 2 for train and inference pseudo-code. Note that we do not need Stage 2 at inference time. ReParam and UnReParam in these listings stand for re-parameterizing (and un-re-parameterizing) camera rotation as 6D vectors (described in Subsection 8.4 below) and scale as log-scale to make the optimization problem unconstrained. RigidAlignment is described in Subsection 8.3 below.
8.2 Extended model description
For monocular 2D pose estimation we used the stacked hourglass network [newell2016stacked] pre-trained on the COCO pose dataset [guler2018densepose]. We additionally trained a linear regression adapter to convert between COCO and H36M label formats (see supplementary Figure 3 for a labeling format comparison); this yielded better generalization than fine-tuning the pre-trained network on H36M directly, as shown in supplementary Table 6. The COCO-pretrained network generalized very poorly to the SkiPose-PTZ dataset because of the visual domain shift, so we fine-tuned the stacked hourglass network using ground truth 2D labels. For monocular 3D estimates used in Stage 1, we applied EpipolarPose [kocabas2019epipolar] on Human3.6M and CanonPose [wandt2020canonpose] on SkiPose-PTZ. We would like to note that, despite the significant shift in labeling format between predictions of these monocular 3D methods and the format used in the datasets we used for evaluation, this does not affect the quality of the camera initialization we acquired via rigid alignment. Similar to prior work [ma2020deepoptimizer], each “neural optimizer step” is trained separately, a fresh network is used at each stage, and stop-gradient is applied to all inputs. We used the same architecture for the stage network across all experiments: fully-connected 512-dimensional layers followed by a fully-connected 128-dimensional layer, all with selu nonlinearities [klambauer2017selu], followed by a dense output of the size corresponding to the optimization space (the flattened 3D pose and weak camera model parameters). We re-trained each stage multiple times until the validation PMPJPE improved or we ran out of “stage training attempts”.
We used the Adam optimizer [kingma2014adam] with a learning rate of 1e-2 for 100 steps for the exact refinement, and of 1e-4 for 3000 epochs for training the stages of the neural optimizer. We set the “stage re-training budget” to 100 attempts in total. We used the number of dense layers per neural optimizer stage for all experiments on Human3.6M, for all experiments on SkiPose (because the dataset is much smaller), and for experiments when the neural optimizer was additionally conditioned on bone lengths. We set in main experiments, and add losses and to the neural optimizer with weights and zero others in corresponding ablations.
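The stage re-training schedule described above can be sketched as follows (a hypothetical helper, with our own names for the training and validation callbacks): each stage is re-trained with a fresh network until the validation PMPJPE improves over the previous stage, or the re-training budget is exhausted.

```python
def train_stage_with_retries(train_fn, val_pmpjpe_fn, prev_pmpjpe, budget):
    """train_fn() returns a freshly trained stage network;
    val_pmpjpe_fn(net) returns its validation PMPJPE (lower is better)."""
    attempts = 0
    while attempts < budget:
        attempts += 1
        net = train_fn()                  # fresh network for this attempt
        score = val_pmpjpe_fn(net)
        if score < prev_pmpjpe:           # validation error improved: accept
            return net, score, attempts
    return None, prev_pmpjpe, attempts    # budget exhausted: keep previous stage
```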
8.3 Closed Form Expressions for Stage 1
Below we describe how we performed the rigid alignment of the monocular 3D pose estimates and inferred weak camera parameters from them. Assume that we have a monocular 3D prediction in the frame of each camera. The parameters of the first camera are assumed to be known and fixed, whereas the rotations of the other cameras are inferred via optimal rigid alignment of their (centered) predictions to the prediction in the first camera frame. The scale and shift of each camera can then be acquired by comparing the original monocular prediction (in pixels) to the aligned prediction rotated back into that camera's frame, after subtracting the center of the 3D pose. The initial guess for the pose is the average of all monocular poses rotated into the first camera frame.
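A minimal numpy sketch of the optimal rigid alignment step (orthogonal Procrustes / Kabsch, solved in closed form via SVD); the variable names are ours, and the point sets are assumed to be 3 x J arrays of joint coordinates:

```python
import numpy as np

def optimal_rotation(p_src, p_ref):
    """Rotation R minimizing ||R @ p_src - p_ref||_F over centered (3, J)
    point sets, via SVD of the cross-covariance (Kabsch algorithm)."""
    p_src = p_src - p_src.mean(axis=1, keepdims=True)   # center both sets
    p_ref = p_ref - p_ref.mean(axis=1, keepdims=True)
    u, _, vt = np.linalg.svd(p_ref @ p_src.T)
    d = np.sign(np.linalg.det(u @ vt))                  # avoid reflections
    return u @ np.diag([1.0, 1.0, d]) @ vt
```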
8.4 6D rotation re-parameterization
We used the following parameterization: given $a = (a_1, a_2) \in \mathbb{R}^6$ with $a_1, a_2 \in \mathbb{R}^3$, we set $b_1 = N(a_1)$, $b_2 = N(a_2 - \langle b_1, a_2 \rangle\, b_1)$, and $b_3 = b_1 \times b_2$, where $N(\cdot)$ is a normalization operation and $\times$ is the vector (cross) product; $b_1, b_2, b_3$ form the rows of the rotation matrix. This is essentially Gram-Schmidt orthogonalization, so the rows of the resulting matrix are guaranteed to form an orthonormal basis. This rotation representation was shown to be better suited for optimization [zhou2019continuity].
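The Gram-Schmidt construction above can be sketched in a few lines of numpy (function name is ours):

```python
import numpy as np

def rot6d_to_matrix(a):
    """Map a 6D vector a = (a1, a2) to a rotation matrix whose rows form an
    orthonormal basis, via Gram-Schmidt [zhou2019continuity]."""
    a1, a2 = a[:3], a[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1          # remove the component along b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)                  # completes a right-handed basis
    return np.stack([b1, b2, b3])          # rows are orthonormal
```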
8.5 Stable Gaussian Mixture
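A hedged sketch of what we take a numerically stable Gaussian mixture evaluation to mean here (our assumption, illustrated for 1D isotropic components): computing the mixture log-likelihood entirely in log-space with the log-sum-exp trick avoids underflow when individual component densities are vanishingly small.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, sigmas):
    """log sum_k w_k N(x; mu_k, sigma_k^2) for a scalar x and 1D components,
    computed stably in log-space."""
    log_comp = (np.log(weights)
                - 0.5 * np.log(2 * np.pi * sigmas ** 2)
                - 0.5 * ((x - means) / sigmas) ** 2)
    m = np.max(log_comp)                   # log-sum-exp stabilization
    return m + np.log(np.sum(np.exp(log_comp - m)))
```

Evaluating the same mixture by exponentiating each component density directly would underflow to `log(0)` for points far from all component means.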
8.6 Reference 2D performance
Table 6 shows the performance of the 2D pose prediction networks and the resulting MetaPose network on different splits of both datasets. It shows that both the 2D network and MetaPose overfit to SkiPose to a certain degree because of its smaller size.
8.7 Data efficiency
Results reported in Table 7 further confirm this, and suggest that we need at least 2-3k samples (each sample containing several cameras) to train MetaPose to within 10-20% of its performance on the full data. Results in Table 8 confirm that we need at least 25k samples to train MetaPose in the self-supervised mode, i.e. without using 2D ground-truth supervision.
We provide selected qualitative examples (failure and success cases) on the test sets of H36M and SkiPose (the full dataset and only the first two cameras) in Figures 4-13; videos containing all test prediction visualizations are available at http://bit.ly/iccv2206. Circles around joints in the 2D views represent the absolute reprojection error of that joint in that view.
We additionally note that:
- MetaPose considerably improves over the initial guess when a lot of self-occlusion is present (see Figures 4-6);
- MetaPose fails on extreme poses for which monocular estimation fails, e.g. somersaults (see Figures 9-10);
- in the two-camera Ski setup, AniPose yields a smaller reprojection error while producing very poor 3D poses (Figure 13).