1 Introduction
Most recent work on 3D human pose capture has focused on monocular reconstruction, even though multi-view reconstruction is much easier, because multi-camera setups are perceived as being too cumbersome. The appearance of Virtual/Augmented Reality headsets with multiple integrated cameras challenges this perception and has the potential to bring multi-camera techniques back to the fore, but only if multi-view approaches can be made sufficiently lightweight to fit within the limits of low-compute headsets.
Unfortunately, state-of-the-art multi-camera 3D pose estimation algorithms tend to be computationally expensive because they rely on deep networks that operate on volumetric grids [15], or on volumetric Pictorial Structures [23, 22], to combine features coming from different views in accordance with epipolar geometry. Fig. 1(a) illustrates these approaches.
In this paper, we demonstrate that the expense of using a 3D grid is not required. Fig. 1(b) depicts our approach. We encode each input image into a latent representation, which is then efficiently transformed from image coordinates into world coordinates by conditioning on the appropriate camera transformation using feature transform layers [32]. This yields feature maps that live in a canonical frame of reference and are disentangled from the camera poses. The feature maps are fused using 1D convolutions into a unified latent representation, which makes it possible to reason jointly about the extracted 2D poses across camera views. We then condition this latent code on the known camera transformations to decode it back to 2D image locations using a shallow 2D CNN. The proposed fusion technique, which we refer to as Canonical Fusion, enables us to drastically improve the accuracy of the 2D detections compared to those obtained from each image independently, so much so that we can lift these 2D detections to 3D reliably using the simple Direct Linear Transform (DLT) method [11]. Because standard DLT implementations that rely on Singular Value Decomposition (SVD) are rarely efficient on GPUs, we designed a faster alternative implementation based on the Shifted Iterations method [24].

In short, our contributions are: (1) a novel multi-camera fusion technique that exploits 3D geometry in latent space to efficiently and jointly reason about different views and drastically improve the accuracy of 2D detectors, and (2) a new GPU-friendly implementation of the DLT method, which is hundreds of times faster than standard implementations.
We evaluate our approach on two large-scale multi-view datasets, Human3.6M [14] and TotalCapture [30]: we outperform the state-of-the-art methods, both in terms of speed and accuracy, when additional training data is not available. When additional 2D annotations can be used [18, 2], our accuracy remains comparable to that of the state-of-the-art methods, while being faster. Finally, we demonstrate that our approach can handle viewpoints that were never seen during training. In short, we can achieve real-time performance without sacrificing prediction accuracy or viewpoint flexibility, while other approaches cannot.
2 Related Work
Pose estimation is a longstanding problem in the computer vision community. In this section, we first review the multi-view pose estimation literature in detail. We then focus on approaches that lift 2D detections to 3D via triangulation.
Pose estimation from multi-view input images. Early attempts [19, 9, 4, 3] tackled pose estimation from multi-view inputs by optimizing simple parametric models of the human body to match hand-crafted image features in each view, achieving limited success outside of controlled settings. With the advent of deep learning, the dominant paradigm has shifted towards estimating 2D poses from each view separately, through efficient monocular pose estimation architectures [21, 29, 31, 27], and then recovering the 3D pose from the single-view detections.

Most approaches use 3D volumes to aggregate 2D predictions. Pavlakos et al. [22] project 2D keypoint heatmaps to 3D grids and use Pictorial Structures aggregation to estimate 3D poses. Similarly, [23] proposes to use Recurrent Pictorial Structures to efficiently refine 3D pose estimates step by step. Improving upon these approaches, [15] projects 2D heatmaps to a 3D volume using a differentiable model and regresses the estimated root-centered 3D pose through a learnable 3D convolutional neural network. This allows them to train their system end-to-end by optimizing directly the 3D metric of interest through the predictions of the 2D pose estimator network. Despite recovering 3D poses reliably, volumetric approaches are computationally demanding, and simple triangulation of 2D detections is still the de facto standard when seeking real-time performance [17, 5].
Few models have focused on developing lightweight solutions to reason about multi-view inputs. In particular, [16] proposes to concatenate precomputed 2D detections and pass them as input to a fully connected network to predict global 3D joint coordinates. Similarly, [23] refines 2D heatmap detections jointly by using a fully connected layer before aggregating them on 3D volumes.
Although these methods, like our proposed approach, fuse information from different views without using volumetric grids, they do not leverage camera information and thus overfit to a specific camera setting. We will show that our approach can handle different cameras flexibly and even generalize to unseen ones.
Triangulating 2D detections.
Computing the position of a point in 3D space given its images in multiple views and the camera matrices of those views is one of the most studied computer vision problems. We refer the reader to [11] for an overview of existing methods.
In our work, we use the Direct Linear Triangulation (DLT) method because it is simple and differentiable. We propose a novel GPU-friendly implementation of this method, which is up to two orders of magnitude faster than existing implementations based on SVD factorization. We provide a more detailed overview of this algorithm in Section 3.4.
Several methods lift 2D detections efficiently to 3D by means of triangulation [1, 17, 10, 5]. More closely related to our work, [15] proposes to back-propagate through an SVD-based differentiable triangulation layer when lifting 2D detections to 3D keypoints. Unlike our approach, these methods do not perform any explicit reasoning about multi-view inputs and therefore struggle with large self-occlusions.
3 Method
We consider a setting in which n spatially calibrated and temporally synchronized cameras capture the performance of a single individual in the scene. We denote by {I_i} the set of multi-view input images, each captured from a camera with known projection matrix P_i. Our goal is to estimate the performer's 3D pose in absolute world coordinates; we parameterize it as a fixed-size set of 3D point locations, one per joint.
Consider as an example the input images on the left of Figure 2. Although they exhibit different appearances, the frames share the same 3D pose information, up to a perspective projection and view-dependent occlusions. Building on this observation, we design our architecture (depicted in Figure 2) to learn a unified view-independent representation of 3D pose from multi-view input images. This allows us to reason efficiently about occlusions and produce accurate 2D detections, which can then simply be lifted to 3D absolute coordinates by means of triangulation. Below, we first introduce baseline methods for pose estimation from multi-view inputs. We then describe our approach in detail and explain how we train our model.
3.1 Lightweight pose estimation from multi-view inputs
Given the input images I_i, we use a convolutional neural network backbone to extract features from each input image separately. Denoting our encoder network by e, the feature map z_i is computed as

z_i = e(I_i).  (1)
Note that, at this stage, the feature map z_i contains a representation of the 3D pose of the performer that is fully entangled with the camera viewpoint, expressed by the camera projection operator P_i.
We first propose a baseline approach, similar to [17, 10], to estimate the 3D pose from multi-view inputs. Here, we simply decode the latent codes z_i to 2D detections, and lift the 2D detections to 3D by means of triangulation. We refer to this approach as Baseline. Although efficient, we argue that this approach is limited because it processes each view independently and therefore cannot handle self-occlusions.
An intuitive way to jointly reason across different views is to use a learnable neural network to share information across the embeddings z_i, by concatenating features from different views and processing them through convolutional layers into view-dependent features, similar in spirit to the recent models [16, 23]. In Section 4 we refer to this general approach as Fusion. Although computationally lightweight and effective, we argue that this approach is limited for two reasons: (1) it does not make use of known camera information, relying on the network to learn the spatial configuration of the multi-view setting from the data itself, and (2) it cannot generalize to different camera settings by design. We will provide evidence for this in Section 4.
3.2 Learning a view-independent representation
To alleviate the aforementioned limitations, we propose a method to jointly reason across views, leveraging the observation that the 3D pose information contained in the feature maps z_i is the same across all views, up to camera projective transforms and occlusions, as discussed above. We will refer to this approach as Canonical Fusion.
To achieve this goal, we leverage feature transform layers (FTL) [32], which were originally proposed as a technique to condition latent embeddings on a target transformation so as to learn interpretable representations. Internally, an FTL has no learnable parameters and is computationally efficient. It simply reshapes the input feature map into a point set, applies the target transformation, and then reshapes the point set back to its original dimensions. This technique forces the learned latent feature space to preserve the structure of the transformation, in practice resulting in a disentanglement between the learned representation and the transformation. To make this paper more self-contained, we review FTL in detail in the Supplementary Section.
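To make the reshape-transform-reshape mechanics concrete, the following is a minimal NumPy sketch of an FTL; it is our own illustration, not the authors' implementation, and the grouping of channels into homogeneous 4-vectors is an assumption made for the example:

```python
import numpy as np

def feature_transform_layer(z, T):
    """Sketch of a feature transform layer (FTL) in the spirit of [32].

    z: feature map of shape (C, H, W), with C divisible by 4 so that
       channels can be viewed as homogeneous 3D points (an assumption
       of this sketch).
    T: a 4x4 homogeneous transformation (e.g. a camera transform).

    The layer has no learnable parameters: it reshapes the features
    into a point set, applies T to every point, and reshapes back.
    """
    C, H, W = z.shape
    assert C % 4 == 0, "channels must form groups of 4 (homogeneous points)"
    pts = z.reshape(C // 4, 4, H * W)       # point set: (N_points, 4, H*W)
    pts = np.einsum('ij,njk->nik', T, pts)  # apply T to every point
    return pts.reshape(C, H, W)             # back to a feature map
```

Because the operation is a fixed linear map, gradients flow through it unchanged, which is what lets the surrounding encoder and decoder learn a representation structured by the transformation.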
Several approaches have used FTL for novel view synthesis to map the latent representation of images or poses from one view to another [26, 25, 7, 6]. In this work, we leverage FTL to map images from multiple views to a unified latent representation of 3D pose. In particular, we use FTL to project the feature maps z_i to a common canonical representation by explicitly conditioning them on the inverse camera projection operator P_i^{-1}, which maps image coordinates to world coordinates:

z_i^w = FTL(z_i | P_i^{-1}).  (2)
Now that the feature maps have been mapped to the same canonical representation, they can simply be concatenated and fused into a unified representation of 3D pose via a shallow 1D convolutional neural network f, i.e.

p_3D = f(z_1^w, ..., z_n^w).  (3)
We now force the learned representation to be disentangled from the camera viewpoint by transforming the shared features p_3D back into view-specific representations via

z_i^v = FTL(p_3D | P_i).  (4)
In Section 4 we show both qualitatively and quantitatively that the representation of 3D pose we learn is effectively disentangled from the camera viewpoint.
Unlike the Fusion baseline, Canonical Fusion makes explicit use of the camera projection operators to simplify the task of jointly reasoning about views. The convolutional block, in fact, no longer has to figure out the geometrical disposition of the multi-camera setting and can focus solely on reasoning about occlusions. Moreover, as we will show, Canonical Fusion can handle different cameras flexibly, and even generalize to unseen ones.
3.3 Decoding latent codes to 2D detections
This component of our architecture proceeds as a monocular pose estimation model that maps the view-specific representations to 2D heatmaps via a shallow convolutional decoder d, i.e.

H_{i,j} = d(z_i^v),  (5)
where H_{i,j} is the heatmap prediction for joint j in image I_i. Finally, we compute the 2D location u_{i,j} of each joint by simply integrating the heatmaps across the spatial axes:

u_{i,j} = ( Σ_p p H_{i,j}(p) ) / ( Σ_p H_{i,j}(p) ),  (6)
Note that this operation is differentiable with respect to the heatmap H_{i,j}, allowing us to back-propagate through it. In the next section, we explain in detail how we lift multi-view 2D detections to 3D.
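The spatial integration above is a soft-argmax. A minimal sketch of it follows; this is our own illustration, and the softmax normalization of the heatmap is an assumption of the sketch rather than a detail confirmed by the text:

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Differentiable 2D joint location from a heatmap (a sketch).

    Softmax-normalizes the heatmap into a distribution, then returns
    the expected (x, y) coordinate, i.e. the coordinate grid
    integrated against the normalized heatmap. Every step is
    differentiable w.r.t. the heatmap values.
    """
    H, W = heatmap.shape
    p = np.exp(heatmap - heatmap.max())  # stable softmax numerator
    p /= p.sum()
    ys, xs = np.mgrid[0:H, 0:W]          # pixel coordinate grids
    return np.array([(p * xs).sum(), (p * ys).sum()])  # (u, v)
```

Unlike a hard argmax, this keeps gradients flowing from the 2D (and, downstream, 3D) losses back into the heatmap decoder.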
3.4 Efficient Direct Linear Transformation
In this section we focus on finding the position of a 3D point in space given a set of 2D detections {u_i}. To ease the notation, we drop the joint index, as the derivations that follow are carried out independently for each landmark.
Assuming a pinhole camera model, we can write λ_i u_i = P_i x, where λ_i is an unknown scale factor. Note that here, with a slight abuse of notation, we express both the 2D detections u_i = (u_i, v_i) and the 3D landmark x in homogeneous coordinates. Expanding into components, we get

λ_i u_i = p_i^{1T} x,   λ_i v_i = p_i^{2T} x,   λ_i = p_i^{3T} x,  (7)
where p_i^{kT} denotes the k-th row of the i-th camera projection matrix. Eliminating λ_i using the third relation in (7), we obtain
(u_i p_i^{3T} - p_i^{1T}) x = 0,  (8)

(v_i p_i^{3T} - p_i^{2T}) x = 0.  (9)
Finally, accumulating over all n available views yields a total of 2n linear equations in the unknown 3D position x, which we write compactly as

A x = 0.  (10)
Note that A is a function of the 2D detections u_i, as specified in Equations (8) and (9). We refer to A as the DLT matrix. These equations define x up to a scale factor, and we seek a non-zero solution. In the absence of noise, Equation (10) admits a unique non-trivial solution, corresponding to the 3D intersection of the camera rays passing through each 2D observation (i.e. matrix A does not have full rank). However, for noisy 2D observations such as those predicted by a neural network, Equation (10) does not admit a solution, so we have to seek an approximate one. A common choice, known as the Direct Linear Transform (DLT) method [11], is the following relaxed version of Equation (10):
min_{||x|| = 1} ||A x||.  (11)
Clearly, the solution to the above optimization problem is the eigenvector of A^T A associated with its smallest eigenvalue. In practice, this eigenvector is computed by means of Singular Value Decomposition (SVD) [11]. We argue that this approach is suboptimal, as we in fact only care about one of the eigenvectors of A^T A. Inspired by the observation that the smallest eigenvalue of A^T A is zero for noise-free observations, we derive a bound on the smallest singular value of the matrix A in the presence of Gaussian noise. We prove this estimate in the Supplementary Section.
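For reference, the standard SVD-based DLT solver discussed above can be sketched as follows. This is our own minimal NumPy illustration under the notation of Equations (8)-(10), not the authors' code:

```python
import numpy as np

def dlt_triangulate(P_list, uv_list):
    """Standard DLT triangulation via SVD (cf. [11]); a sketch.

    P_list: list of 3x4 camera projection matrices.
    uv_list: list of (u, v) 2D detections, one per view.

    Builds the two rows of the DLT matrix A per view, as in
    Eqs. (8)-(9), and returns the right singular vector of A with
    the smallest singular value, dehomogenized to a 3D point.
    """
    rows = []
    for P, (u, v) in zip(P_list, uv_list):
        rows.append(u * P[2] - P[0])  # Eq. (8)
        rows.append(v * P[2] - P[1])  # Eq. (9)
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)       # full factorization
    x = Vt[-1]                        # smallest singular vector
    return x[:3] / x[3]               # dehomogenize
```

Note that the full SVD factors out all singular vectors of A, even though only the last one is needed, which is precisely the inefficiency the SII formulation below avoids.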
Theorem 1
Let A_0 be the DLT matrix associated with the non-perturbed case, i.e. A_0 x = 0. Let us assume i.i.d. Gaussian noise in our 2D observations, and let A denote the DLT matrix associated with the perturbed system. Then, it follows that:
(12) 
In Figure 3(a) we reproduce this setting by considering Gaussian perturbations of the 2D observations, and find experimental confirmation that the expected smallest singular value increases linearly with the 2D joint measurement error, specified by the 2D MPJPE (see Equation (13) for its formal definition).
The bound above, in practice, allows us to compute the smallest singular vector of A reliably by means of Shifted Inverse Iterations (SII) [24]: we can set the shift to a small constant and know that the iterations will converge to the correct eigenvector. For more insight on why this is the case, we refer the reader to the Supplementary Section.

SII can be implemented extremely efficiently on GPUs. As outlined in Algorithm 1, it consists of one matrix inversion followed by several matrix multiplications and vector normalizations, operations that can be trivially parallelized. In Figure 3(b) we compare our SII-based implementation of DLT to an SVD-based one, such as the one proposed in [15]. For a reasonable range of 2D observation errors, our formulation requires as little as two iterations to achieve the same accuracy as a full SVD factorization, while being significantly faster on both CPU and GPU, as evidenced by our profiling in Figures 3(c,d).
3.5 Loss function
In this section, we explain how we train our model. Since our DLT implementation is differentiable with respect to the 2D joint locations, we can let gradients with respect to the 3D landmarks flow all the way back to the input images, making our approach trainable end-to-end. However, in practice, to make training more stable in its early stages, we found it helpful to first train our model by minimizing a 2D Mean Per Joint Position Error (MPJPE) of the form

L_2D = (1 / (nJ)) Σ_{i=1}^{n} Σ_{j=1}^{J} || u_{i,j} - u_{i,j}^{gt} ||_2,  (13)
where u_{i,j}^{gt} denotes the ground-truth 2D position of the j-th joint in the i-th image. In our experiments, we pre-train our models by minimizing L_2D for 20 epochs. Then, we fine-tune our model by minimizing the 3D MPJPE, which is also our test metric:
L_3D = (1/J) Σ_{j=1}^{J} || x_j - x_j^{gt} ||_2,  (14)

where x_j^{gt} denotes the ground-truth 3D position of the j-th joint in world coordinates. We evaluate the benefits of fine-tuning with L_3D in Section 4.
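Both losses reduce to the same mean-per-joint Euclidean distance; a short sketch of the shared computation (our own illustration) is:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: the mean Euclidean distance
    between predicted and ground-truth joints. With (J, 2) inputs it
    computes the 2D loss of Eq. (13) for one image; with (J, 3)
    inputs it computes the 3D loss of Eq. (14).
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()
```

Since the same function serves as training loss and test metric, fine-tuning with L_3D optimizes the evaluation criterion directly.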
4 Experiments
We conduct our evaluation on two available large-scale multi-view datasets, TotalCapture [30] and Human3.6M [14, 13]. We crop each input image around the performer, using the ground-truth bounding boxes provided by each dataset. Input crops are undistorted, resampled so that the virtual cameras point at the center of the crop, and normalized. We augment our training set by performing random rotations (note that image rotations correspond to camera rotations along the z-axis) and standard color augmentation. In our experiments, we use a ResNet-152 [12] pre-trained on ImageNet [8] as the backbone architecture for our encoder. Our fusion block consists of two convolutional layers. Our decoder consists of 4 transposed convolutional layers, followed by a convolution to produce heatmaps. More details on our architecture are provided in the Supplementary Section. The networks are trained using a Stochastic Gradient Descent optimizer.

4.1 Dataset specifications
TotalCapture: The TotalCapture dataset [30] has been recently introduced to the community.
It consists of 1.9 million frames, captured from 8 calibrated full HD video cameras recording at 60Hz.
It features 4 male and 1 female subjects, each performing five diverse performances repeated 3 times: ROM, Walking, Acting, Running, and Freestyle.
Accurate 3D human joint locations are obtained from a markerbased motion capture system.
Following previous work [30], the training set consists of “ROM1,2,3”,
“Walking1,3”, “Freestyle1,2”, “Acting1,2”, “Running1” on
subjects 1,2 and 3. The testing set consists of “Walking2 (W2)”, “Freestyle3 (FS3)”, and “Acting3 (A3)” on subjects 1, 2, 3, 4, and 5. The number following each action indicates the video of that action being used, for example Freestyle has three videos of the same action of which 1 and 2 are used for training and 3 for testing. This setup allows for testing on unseen and seen subjects but always unseen performances.
Following [23], we use the data of four cameras (1,3,5,7) to train and test our models.
However, to illustrate the generalization ability of our approach to new camera settings, we propose an experiment where we train on cameras (1,3,5,7) and test on the unseen cameras (2,4,6,8).
Human 3.6M: The Human3.6M dataset [14, 13] is the largest publicly available 3D human pose estimation benchmark.
It consists of 3.6 million frames, captured from 4 synchronized 50Hz digital cameras.
Accurate 3D human joint locations are obtained from a markerbased motion capture system utilizing 10 additional IR sensors.
It contains a total of 11 subjects (5 females and 6 males) performing 15 different activities.
For evaluation, we follow the most popular protocol, by training on subjects 1, 5, 6, 7, 8 and using unseen subjects 9, 11 for testing. Similar to other methods [20, 22, 28, 16, 23], we use all available views during training and inference.
Table 1: MPJPE (in mm) on TotalCapture, training and testing on cameras (1,3,5,7).
Methods  Seen Subjects (S1,S2,S3)  Unseen Subjects (S4,S5)  Mean
Walking  Freestyle  Acting  Walking  Freestyle  Acting
Qiu et al. [23] Baseline + RPSM  28  42  30  45  74  46  41
Qiu et al. [23] Fusion + RPSM  19  28  21  32  54  33  29
Ours, Baseline  31.8  36.4  24.0  43.0  75.7  43.0  39.3 
Ours, Fusion  14.6  35.3  20.7  28.8  71.8  37.3  31.8 
Ours, Canonical Fusion(no DLT)  10.9  32.2  16.7  27.6  67.9  35.1  28.6 
Ours, Canonical Fusion  10.6  30.4  16.3  27.0  65.0  34.2  27.5 
Table 2: MPJPE (in mm) on TotalCapture when testing on the unseen cameras (2,4,6,8).
Methods  Seen Subjects (S1,S2,S3)  Unseen Subjects (S4,S5)  Mean
Walking  Freestyle  Acting  Walking  Freestyle  Acting
Ours, Baseline  28.9  53.7  42.4  46.7  75.9  51.3  48.2 
Ours, Fusion  73.9  71.5  71.5  72.0  108.4  58.4  78.9 
Ours, Canonical Fusion  22.4  47.1  27.8  39.1  75.7  43.1  38.2 
4.2 Qualitative evaluation of disentanglement
We evaluate the quality of our latent representation by showing that 3D pose information is effectively disentangled from the camera viewpoint. Recall from Section 3 that our encoder maps input images to latent codes, which are transformed from camera coordinates to world coordinates and later fused into a unified representation that is meant to be disentangled from the camera viewpoint. To verify that this is indeed the case, we decode our representation to different 2D poses by using different camera transformations, producing views of the same pose from novel camera viewpoints. We refer the reader to Figure 5 for a visualization of the synthesized poses. In the top row, we rotate one of the cameras about the z-axis, presenting the network with projection operators that have been seen at training time. In the bottom row we consider a more challenging scenario, where we synthesize novel views by rotating the camera around the plane passing through consecutive cameras. Despite being presented with unseen projection operators, our decoder is still able to synthesize correct 2D poses. This experiment shows that our approach has effectively learned a representation of the 3D pose that is disentangled from the camera viewpoint. We evaluate this quantitatively in Section 4.4.
Table 3: MPJPE (in mm) on Human3.6M, with no additional training data.
Methods  Dir.  Disc.  Eat  Greet  Phone  Photo  Pose  Purch.  Sit  SitD.  Smoke  Wait  WalkD.  Walk  WalkT.  Mean
Martinez et al. [20]  46.5  48.6  54.0  51.5  67.5  70.7  48.5  49.1  69.8  79.4  57.8  53.1  56.7  42.2  45.4  57.0 
Pavlakos et al. [22]  41.2  49.2  42.8  43.4  55.6  46.9  40.3  63.7  97.6  119.0  52.1  42.7  51.9  41.8  39.4  56.9 
Tome et al. [28]  43.3  49.6  42.0  48.8  51.1  64.3  40.3  43.3  66.0  95.2  50.2  52.2  51.1  43.9  45.3  52.8 
Kadkhodamohammadi et al. [16]  39.4  46.9  41.0  42.7  53.6  54.8  41.4  50.0  59.9  78.8  49.8  46.2  51.1  40.5  41.0  49.1 
Qiu et al. [23]  34.8  35.8  32.7  33.5  34.5  38.2  29.7  60.7  53.1  35.2  41.0  41.6  31.9  31.4  34.6  38.3 
Qiu et al. [23] + RPSM  28.9  32.5  26.6  28.1  28.3  29.3  28.0  36.8  41.0  30.5  35.6  30.0  28.3  30.0  30.5  31.2
Ours, Baseline  39.1  46.5  31.6  40.9  39.3  45.5  47.3  44.6  45.6  37.1  42.4  46.7  34.5  45.2  64.8  43.2 
Ours, Fusion  31.3  37.3  29.4  29.5  34.6  46.5  30.2  43.5  44.2  32.4  35.7  33.4  31.0  38.3  32.4  35.4 
Ours, Canonical Fusion (no DLT)  31.0  35.1  28.6  29.2  32.2  34.8  33.4  32.1  35.8  34.8  33.3  32.2  29.9  35.1  34.8  32.5 
Ours, Canonical Fusion  27.3  32.1  25.0  26.5  29.3  35.4  28.8  31.6  36.4  31.7  31.2  29.9  26.9  33.7  30.4  30.2 
4.3 Quantitative evaluation on TotalCapture
We begin by evaluating the different components of our approach and comparing them to the state-of-the-art volumetric method of [23] on the TotalCapture dataset. We report our results in Table 1. We observe that using the feature fusion technique (Fusion) brings a significant improvement over our Baseline, showing that, although simple, this fusion technique is effective. Our more sophisticated Canonical Fusion (no DLT) achieves a further improvement, showcasing that our method can effectively use camera projection operators to better reason across views. Finally, training our architecture by back-propagating through the triangulation layer (Canonical Fusion) improves our accuracy even further. This is not surprising, as we optimize directly for the target metric when training our network. Our best-performing model outperforms the state-of-the-art volumetric model of [23]. Note that their method lifts 2D detections to 3D using Recurrent Pictorial Structures (RPSM), which uses a predefined skeleton as a strong prior to lift 2D heatmaps to 3D detections. Our method does not use any such priors and still outperforms theirs. Moreover, our approach is orders of magnitude faster, as we will show in Section 4.6. We show some uncurated test samples from our model in Figure 4(a).
4.4 Generalization to unseen cameras
To assess the flexibility of our approach, we evaluate its performance on images captured from unseen views. To do so, we take the trained network of Section 4.3 and test it on cameras (2,4,6,8). Note that this setting is particularly challenging not only because of the novel camera views, but also because the performer is often out of the field of view in some of the cameras. For this reason, we discard frames where the performer is out of the field of view when evaluating our Baseline. We report the results in Table 2. We observe that Fusion fails to generalize to novel views: its accuracy drops considerably when the network is presented with new views. This is not surprising, as this fusion technique overfits to the camera setting by design. On the other hand, the accuracy drop of Canonical Fusion is similar to that of Baseline. Note that our comparison favors Baseline, since we discard frames in which the performer is occluded. This experiment validates that our model copes effectively with challenging unseen views.
4.5 Quantitative evaluation on Human 3.6M
We now turn to the Human3.6M dataset, where we first evaluate the different components of our approach, and then compare to the state-of-the-art multi-view methods. Note that here we consider a setting where no additional data is used to train our models. We report the results in Table 3. Considering the ablation study, we obtain results that are consistent with what we observed on the TotalCapture dataset: performing simple feature fusion (Fusion) yields an improvement over the monocular baseline. A further improvement is reached by using Canonical Fusion (no DLT). Finally, training our architecture by back-propagating through the triangulation layer (Canonical Fusion) improves our accuracy even further. We show some uncurated test samples from our model in Figure 4(b).
We then compare our model to the state-of-the-art methods. Here we can compare our method to that of [23] just by comparing fusion techniques (see Canonical Fusion (no DLT) vs. Qiu et al. [23] in Table 3). We see that our method outperforms theirs, which indicates the superiority of our fusion technique. Similar to what we observed in Section 4.3, our best-performing method is even superior to the offline volumetric approach of [23], which uses a strong bone-length prior (Qiu et al. [23] Fusion + RPSM). Our method outperforms all other multi-view approaches by a large margin. Note that in this setting we cannot compare to [15], as they do not report results without using additional data.
4.6 Exploiting additional data
Table 4: Model size, inference time, and MPJPE (in mm) on Human3.6M when exploiting additional training data.
Methods  Model size  Inference Time  MPJPE
Qiu et al. [23] Fusion + RPSM  2.1GB  8.4s  26.2
Iskakov et al. [15] Algebraic  320MB  2.00s  22.6 
Iskakov et al. [15] Volumetric  643MB  2.30s  20.8 
Ours, Baseline  244MB  0.04s  34.2 
Ours, Canonical Fusion  251MB  0.04s  21.0 
To compare to the concurrent model of [15], we consider a setting in which we exploit additional training data. We adopt the same pre-training strategy as [15]: we pre-train a monocular pose estimation network on the COCO dataset [18], and fine-tune it jointly on the Human3.6M and MPII [2] datasets. We then simply use these pre-trained weights to initialize our network. We also report results for [23], which trains its detector jointly on MPII and Human3.6M. The results are reported in Table 4.
First of all, we observe that Canonical Fusion outperforms our monocular baseline by a large margin. As remarked in the previous section, our method also outperforms [23]; the gap, however, is somewhat larger in this case. Our approach also outperforms the triangulation baseline of Iskakov et al. [15] (Algebraic), indicating that our fusion technique is effective in reasoning about multi-view input images. Finally, we observe that our method reaches accuracy comparable to the volumetric approach of Iskakov et al. [15] (Volumetric).
To give insight into the computational efficiency of our method, in Table 4 we report the size of the trained models in memory, and also measure their inference time (we consider a set of 4 images, measure the time of a forward pass on a Pascal TITAN X GPU, and report the average over 100 forward passes). Comparing model sizes, Canonical Fusion is much smaller than the other models and introduces only a negligible overhead compared to our monocular Baseline. Comparing inference times, both our models yield real-time performance in their unoptimized versions, which is much faster than the other methods. In particular, our method is about 50 times faster than Iskakov et al. [15] (Algebraic) thanks to our efficient implementation of DLT, and about 57 times faster than Iskakov et al. [15] (Volumetric) thanks to using DLT plus 2D CNNs instead of a 3D volumetric approach.
5 Conclusions
We propose a new multi-view fusion technique for 3D pose estimation that is capable of reasoning across multi-view geometry effectively, while introducing negligible computational overhead with respect to monocular methods. Combined with our novel formulation of the DLT transformation, this results in a real-time approach to 3D pose estimation from multiple cameras. We report state-of-the-art performance on standard benchmarks when using no additional data, flexibility to unseen camera settings, and accuracy comparable to far more computationally intensive volumetric methods when allowing for additional 2D annotations.
6 Acknowledgments
We would like to thank Giacomo Garegnani for the numerous and insightful discussions on singular value decomposition. This work was completed during an internship at Facebook Reality Labs and supported in part by the Swiss National Science Foundation.
References
[1] (2013) Multi-view pictorial structures for 3D human pose estimation. In BMVC, Vol. 2, pp. 7.
[2] (2014) 2D human pose estimation: new benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[3] (2014) 3D pictorial structures for multiple human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1669–1676.
[4] (2013) 3D pictorial structures for multiple view articulated pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3618–3625.
[5] (2017) Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299.
[6] (2019) Weakly-supervised discovery of geometry-aware representation for 3D human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10895–10904.
[7] (2019) Monocular neural image based rendering with continuous view control. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4090–4100.
[8] (2009) ImageNet: a large-scale hierarchical image database. In CVPR 2009.
[9] (2010) Optimization and filtering for human motion capture. International Journal of Computer Vision 87 (1-2), pp. 75.
[10] (2019) DeepFly3D: a deep learning-based approach for 3D limb and appendage tracking in tethered, adult Drosophila. bioRxiv, pp. 640375.
[11] (2003) Multiple View Geometry in Computer Vision. Cambridge University Press.
[12] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[13] (2011) Latent structured models for human pose estimation. In 2011 International Conference on Computer Vision, pp. 2220–2227.
[14] (2014) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339.
[15] (2019) Learnable triangulation of human pose. arXiv preprint arXiv:1905.05754.
[16] (2018) A generalizable approach for multi-view 3D human pose regression. arXiv preprint arXiv:1804.10462.
[17] (2019) 3D pose detection of closely interactive humans using multi-view cameras. Sensors 19 (12), pp. 2831.
[18] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
[19] (2011) Markerless motion capture of interacting characters using multi-view image segmentation. In CVPR 2011, pp. 1249–1256.
[20] (2017) A simple yet effective baseline for 3D human pose estimation. In ICCV.
[21] (2016) Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pp. 483–499.
[22] (2017) Harvesting multiple views for marker-less 3D human pose annotations. In Computer Vision and Pattern Recognition (CVPR).
[23] (2019) Cross view fusion for 3D human pose estimation. In The IEEE International Conference on Computer Vision (ICCV).
[24] (2010) Numerical Mathematics. Vol. 37, Springer Science & Business Media.
[25] (2019) Neural scene decomposition for multi-person motion capture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7703–7713.
[26] (2018) Unsupervised geometry-aware representation for 3D human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 750–767.
[27] (2019) Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212.
[28] (2018) Rethinking pose in 3D: multi-stage refinement and recovery for markerless motion capture. In 2018 International Conference on 3D Vision (3DV), pp. 474–483.
[29] (2015) Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656.
[30] (2017) Total capture: 3D human pose estimation fusing video and inertial sensors. In BMVC, Vol. 2, pp. 3.
[31] (2016) Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732.
[32] (2017) Interpretable transformations with encoder-decoder networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5726–5735.