Light3DPose: Real-time Multi-Person 3D Pose Estimation from Multiple Views

04/06/2020 ∙ by Alessio Elmi, et al. ∙ 0

We present an approach to perform 3D pose estimation of multiple people from a few calibrated camera views. Our architecture, leveraging the recently proposed unprojection layer, aggregates feature maps from a 2D pose estimator backbone into a comprehensive representation of the 3D scene. This intermediate representation is then elaborated by a fully-convolutional volumetric network and a decoding stage to extract 3D skeletons with sub-voxel accuracy. Our method achieves state-of-the-art MPJPE on the CMU Panoptic dataset using a few unseen views and obtains competitive results even with a single input view. We also assess the transfer learning capabilities of the model by testing it against the publicly available Shelf dataset, obtaining good performance metrics. The proposed method is inherently efficient: as a pure bottom-up approach, it is computationally independent of the number of people in the scene. Furthermore, even though the computational burden of the 2D part scales linearly with the number of input views, the overall architecture is able to exploit a very lightweight 2D backbone which is orders of magnitude faster than the volumetric counterpart, resulting in fast inference time. The system can run at 6 FPS, processing up to 10 camera views on a single 1080Ti GPU.




I Introduction

Multi-person 3D pose estimation is a complex problem with many applications in different fields of computer vision, such as people tracking and augmented reality. It is usually tackled with a two-step approach. First, every view is processed independently in order to produce a set of 2D poses, or possibly some intermediate feature representation; in this stage, all the advances in the 2D pose estimation field can be exploited (see [7] for a survey). Next, these poses must be matched across views and eventually triangulated, in order to produce a final estimate of the 3D scene. Occlusions between people, or even self-occlusions, are usually the main difficulty: crowded scenes and complex poses produce noisy 2D detections, which are hard to filter out or recover in the matching and triangulation phase.

Hence the idea of creating a system that handles occlusions globally and is not affected by the limitations of single-view inference. Inspired by [12], [14] and [5], we developed a multi-person 3D reconstruction system that takes a set of images capturing the scene from different views and outputs a set of 3D pose reconstructions in a global reference frame. Its main building block is a fully convolutional neural network in which low-level features of the input views are unprojected, fused and transformed to produce a 3D representation of the probabilistic space. Following the general bottom-up approach to pose estimation, we extended the notion of part affinity fields to three dimensions, making pose reconstruction from density maps quick and agile. By doing this, we avoided the limitations of top-down strategies, whose scalability degrades as the number of people grows, and where inter-person occlusions and self-occlusions cannot be encoded - and recovered - by the network in a global way. On the contrary, thanks to the huge variety of pose configurations available in the CMU Panoptic dataset and to clever view-point augmentation strategies, we could show that our system is not affected by those limitations: it can exploit activations and “shadows” in the feature space to estimate occlusions. Moreover, it does not depend on sophisticated detection-to-view assignment algorithms, and it does not pay the computational burden of adding more views and subjects in the reconstruction process. Furthermore, we found that our system can produce good results even with a single view, suggesting that this approach could also be investigated for monocular multi-person depth estimation tasks.

We conducted several experiments, which demonstrate the feasibility of our approach and compare it with other state-of-the-art methods.

Fig. 1: This work proposes a fast and scalable approach for multi-person 3D pose estimation. Feature representations extracted from each view are aggregated and exploited to perform unified triangulation and pose estimation.
Fig. 2: An overall view of the complete processing pipeline. 2D pose backbone replicas process each view separately. Feature maps are then aggregated by the Unprojection layer into a 3D input representation of the scene. A volumetric network produces an output representation. A final decoding stage produces the 3D pose estimations.

Our main contributions are the following:

  • As far as we know, this is the first complete bottom-up approach adopted in this context. In particular, it is capable of handling crowded scenes with good accuracy and computation time.

  • We show that even a very light backbone can produce good results. This implies that adding more views is almost computationally free.

  • We introduce 3D-data augmentation policies that greatly enhance the number of samples seen by the volumetric network.

  • Our post-processing strategy leads to a sub-voxel localization, overcoming the issue of a quantized 3D space.

II Related Works

Multi-view, multi-person 3D pose estimation tries to fuse the achievements coming from the 2D pose estimation, structure-from-motion and monocular depth estimation research fields, all of which are well-studied and active topics.

Pose estimation from a single image is usually tackled following one of two main strategies: bottom-up or top-down. The former infer all keypoints (i.e. parts) and/or limbs simultaneously and eventually aggregate them using specific post-processing logic. These methods can claim higher speed over their competitors, since neural inference is performed only once; at the same time, they usually have to deal with down-scaled feature maps, which limit the accuracy in terms of localization. In this group, we cannot omit [5]: inspired by the work of [39], they introduced the notion of part affinity fields. Their work has been extended by [23, 28], leading to better part association, and by [18], where stronger descriptors led to finer sub-pixel resolution. Further insights on the resolution issue were provided by [42], with heatmap encoding/decoding refinements. On the contrary, top-down approaches [24, 6, 40], possibly combined with multi-scale strategies [16, 34], rely on object detectors to identify the humans in the scene, and then perform a single-person neural inference for each of them. These techniques generally outperform their bottom-up competitors on public challenges, while suffering in scalability as the number of subjects increases. Some hybrid approaches have emerged as well [17]. Finally, we mention some attempts [27, 43] to reduce the computational burden of pose estimation networks.

Three-dimensional pose estimation has emerged along two different tracks. The first aims to recover the third dimension from a monocular view [37, 30, 41, 35, 25, 15, 10]. These methods usually start from 2D pose estimates, which are lifted in a second stage to obtain their depth, and they all deal with single-pose scenarios. We mention two attempts to extend this task to multiple poses: Moon et al. [21] adopted a top-down strategy; Rogez et al. [33] introduced pose proposals (from anchor-poses) in the spirit of the Faster R-CNN approach. The second research track takes advantage of multiple views to reconstruct 3D poses in a global reference frame. Sometimes this is the initial step of detection-to-track pipelines, as in [4, 26], where temporal evolution can be exploited to refine predictions. Multiple-view pose reconstruction may focus on single [20, 29, 38, 32, 12] or multiple poses [31], and may exploit geometric constraints [1, 36] together with visual features [8]. In particular, we highlight two works where multi-view projections have been combined with deep learning: [32] exploited epipolar geometry to refine a 2D pose estimation model and consequently improve the final single-pose 3D reconstruction; [12] showed that the 2D features of each view can be fused and processed into a volumetric representation, which is analyzed to achieve a neat 3D reconstruction of the pose - again, for a single subject. However, to the best of our knowledge, there has been no attempt to extend this approach to a multi-person scenario.

III Method

We call our system Light3DPose. In this section, we outline its architecture, followed by a detailed explanation of all its components.

We are given a detection space with fixed boundaries and a set of N fixed setup cameras c_1, …, c_N whose intrinsic and extrinsic parameters are known. In particular, the projections π_i are known, where π_i maps world coordinates to the frame of the camera c_i. The cameras are synchronized, so for any time t we have a set of images I_1^t, …, I_N^t, one for each camera. We will assume the time fixed throughout the paper, and omit the superscript t.

The input of Light3DPose is a set of pairs (I_i, c_i), where I_i is an image and c_i is one of the setup cameras. The number of input pairs is variable and can range from 1 to N. In Section V we study the impact of the number of input views from both the accuracy and the computational standpoints.

The output of Light3DPose is a set of 3D human poses, of arbitrary cardinality. A 3D human pose is a list of joints. Each joint is a pair composed of a point in space and a label identifying the joint type, but when no confusion arises we identify the joint with its underlying 3D point. The joint type ranges over a pose layout named CMU14, described in Section IV-C.

The internal pipeline of Light3DPose is composed of three main stages (see Figure 2):

  • a 2D Views Processing stage which returns a 2D feature map for each camera;

  • an Unprojection layer [12] which aggregates the information coming from all the 2D views into a 3D features space representation;

  • a Volumetric Processing stage that processes the aggregated 3D representation and produces the output;

Each of these stages is composed of a number of modules.

III-A 2D Views Processing

This processing stage takes as input one image and produces a 2D activation. When Light3DPose processes a set of pairs, each image is fed independently to the 2D Views Processing stage. The different 2D Views Processing replicas share the same weights.

The stage is composed of two modules: a 2D Pose Backbone followed by a Reduction module.

III-A1 2D Backbone

The input to the 2D Backbone module is an image, and the output is a 2D feature map. The 2D backbone is a MobileNet V1 [11] with some modifications on the latest layers, following [27]: the stride of conv4_2dw has been removed and all succeeding convolutions have been set to dilation 2. This operation reduces the network's global stride relative to the value common for classification networks. We used weights pretrained on the COCO dataset from [27].

III-A2 Reduction Module

The input to the reduction module is the 2D feature map produced by the backbone, and the output is a reduced 2D feature map. The purpose of this module is to project the feature space produced by the 2D Backbone to a lower-dimensional feature space. This module is crucial for encoding the information of the backbone into a lighter feature map, keeping the computations performed by the Volumetric Network feasible. Our Reduction Module is essentially a residual block composed of three depth-wise convolutions with ReLUs. We borrow this architecture from [27].
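A minimal sketch of such a residual reduction block; the channel widths and exact layer composition are our assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn

class ReductionModule(nn.Module):
    """Residual block of three depthwise-separable convs + ReLU,
    projecting backbone features into a narrower space."""
    def __init__(self, c_in=512, c_out=64):
        super().__init__()
        self.proj = nn.Conv2d(c_in, c_out, 1)  # channel reduction
        def dw(c):
            return nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1, groups=c),  # depthwise
                nn.Conv2d(c, c, 1),                       # pointwise
                nn.ReLU(inplace=True))
        self.body = nn.Sequential(dw(c_out), dw(c_out), dw(c_out))

    def forward(self, x):
        x = self.proj(x)
        return x + self.body(x)  # residual connection

f = ReductionModule()(torch.randn(1, 512, 28, 28))
```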

III-B Unprojection

This processing stage represents the contact point between the 2D feature maps and the 3D model of the scene, collecting the result of the 2D Processing stage into a 3D feature map representation. This is the only stage of Light3DPose that uses the calibration parameters of the cameras, and it has no trainable parameters.

Fix integers X, Y, Z and a positive real number L. Construct a cube composed of X × Y × Z voxels, each with edge length L. On the cube one has the integer coordinate system corresponding to the indices of the voxels, and we denote by e the embedding of voxel indices into world coordinates.

The input to this stage is a set of pairs (F_i, c_i), where each F_i is the output of one of the 2D Views Processing modules and c_i is one of the setup cameras.

The output of the unprojection stage is a 3D feature map with shape K × X × Y × Z, where K is the number of channels of the 2D feature maps F_i.

To compute the value of the k-th feature of the voxel v, we aggregate the 2D features sampled at the projections of the voxel center:

V_k(v) = (1/n) · Σ_{i=1…n} F_{i,k}( π_i(x_v) ),

where x_v denotes the world coordinates of the center of voxel v, π_i denotes the projection associated with the camera c_i, and F_{i,k}(p) denotes the k-th channel of the 2D feature map F_i at the point p in frame coordinates.

This layer is a generalization of the Unprojection introduced in [12], where a cube is built around each person. It can be efficiently implemented using vectorized operations and a differentiable sampling operator.
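A possible implementation sketch of the unprojection, assuming nearest-neighbour sampling and average aggregation over views (the paper uses a differentiable sampler; function and argument names are ours):

```python
import numpy as np

def unproject(feature_maps, projections, grid_size, voxel_len, origin):
    """Average each view's 2D features sampled at the projection of every
    voxel center. feature_maps: list of (C, H, W) arrays; projections:
    list of 3x4 matrices mapping homogeneous world points to pixels."""
    C = feature_maps[0].shape[0]
    X, Y, Z = grid_size
    # world coordinates of all voxel centers, shape (3, X*Y*Z)
    idx = np.stack(np.meshgrid(range(X), range(Y), range(Z), indexing="ij"))
    pts = origin[:, None] + voxel_len * (idx.reshape(3, -1) + 0.5)
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])  # homogeneous
    vol = np.zeros((C, X, Y, Z))
    for F, P in zip(feature_maps, projections):
        uvw = P @ pts_h
        u = np.round(uvw[0] / uvw[2]).astype(int)   # pixel column
        v = np.round(uvw[1] / uvw[2]).astype(int)   # pixel row
        H, W = F.shape[1:]
        ok = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (uvw[2] > 0)
        samp = np.zeros((C, pts.shape[1]))
        samp[:, ok] = F[:, v[ok], u[ok]]            # nearest-neighbour lookup
        vol += samp.reshape(C, X, Y, Z)
    return vol / len(feature_maps)

# toy check: identity-like camera looking down the z axis
F = np.arange(1, 5, dtype=float).reshape(1, 2, 2)
P = np.array([[1., 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]])
vol = unproject([F], [P], (2, 2, 1), 1.0, np.array([-.5, -.5, -.5]))
```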


III-C Volumetric Processing

The input to this stage is the 3D feature map produced by the Unprojection layer.

The output of this stage is a set of 3D human poses .

The stage is composed of three modules:

  • the Volumetric Network,

  • the Sub-voxel Joint Detection

  • the Skeleton Decoder.

The approach is similar to OpenPose [5]: the neural part of the network is trained to predict a Gaussian centered on each joint, together with a set of Part Affinity Fields (PAFs) that the decoder uses to efficiently build the skeletons. Since our method directly predicts 3D poses, the main differences between our volumetric processing and OpenPose are the use of a different neural architecture to handle 3D volume data, and an adaptation of the decoding of the Volumetric Network output to the 3D setting. Moreover, we introduce a Sub-voxel Joint Detector module to increase the accuracy of the joint predictions.

III-C1 Volumetric Network

This is the trainable neural part of the volumetric processing. Its purpose is to predict a set of 3D Gaussians centered on every joint and a set of 3D PAFs for skeleton reconstruction.

The input to this module is the 3D activation output by the unprojection layer.

The output of this module is a 3D activation with J + 3P channels, where J is the number of joints of the pose layout and P is the number of PAFs. This output can also be seen as a pair of collections, where each of the J maps is a 3D feature map corresponding to a heatmap and each PAF is a collection of 3 feature maps (one per spatial direction of the vector) corresponding to a vectormap.

We adopt a V2V network from [22], but we set the number of channels of the earliest and latest layers to 64 wherever the original network has 32 channels. We name this modified V2V network V2V64. We also experimented with 32- and 96-channel architectures; results are reported in Section V.

The output of the module is then compared with the ground truth using an appropriate loss function, which is used to train Light3DPose. The dataset labels are lists of poses of the persons in 3D space. The procedure to create ground-truth heatmaps and vectormaps is a generalization to 3D of the one in [5], so we omit the details. We opted for a SmoothL1 loss function and weight equally the losses coming from the heatmap and the vectormap. We experimented with different loss functions and weightings between heatmap and vectormap; the results are reported in Section V.
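A sketch of the 3D ground-truth heatmap generation described above, assuming a max over per-joint Gaussians as in the 2D OpenPose targets (function name and sigma are illustrative):

```python
import numpy as np

def gaussian_heatmap_3d(shape, centers, sigma=1.0):
    """Ground-truth heatmap for one joint type: pointwise max of 3D
    Gaussians centered on each person's joint (voxel coordinates)."""
    zz, yy, xx = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    hm = np.zeros(shape)
    for cz, cy, cx in centers:
        d2 = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2
        hm = np.maximum(hm, np.exp(-d2 / (2 * sigma ** 2)))
    return hm

hm = gaussian_heatmap_3d((8, 8, 8), [(2, 2, 2), (6, 5, 4)], sigma=1.0)
```

Taking the max rather than the sum keeps the peak value at 1 even when two people's joints are close.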

III-C2 Sub-voxel Joint Detector

Several state-of-the-art works on single-person pose estimation are based on variations of the Integral Regression Framework [35, 12, 34], which unifies heatmap-based and regression-based methods. The Integral Pose Regression framework assumes that the point to be localized follows a unimodal distribution. This is not the case in multi-person scenarios, where more than one peak needs to be estimated. We present an alternative formulation of the framework which, under the right assumptions, can be used in a multi-person setup.

The sub-voxel joint detector module takes as input one heatmap output by the Volumetric Network, and outputs a list of peaks. The module is applied to each joint heatmap separately, obtaining a set of peaks for each joint type.

In order to simplify the notation, we discuss the 1D case; the operators extend intuitively to 2D and 3D. Given a learned heatmap H, for each spatial location p the value H(p) represents the probability of that location being a joint. We fix a neighbour function N that associates to each point p a neighbourhood N(p) of it (typically, an interval of a given radius r centered at p). Define the non-local maxima suppression via the formula:

NMS(p) = H(p) · δ( H(p) = max_{q ∈ N(p)} H(q) ),

where δ is a Dirac function: NMS(p) ≠ 0 if and only if p is a maximum of H over N(p). Define the pixel-peaks as the set P = { p : NMS(p) > 0 }. For each p ∈ P, define the localized heatmap H_p as the restriction of H to N(p). Finally, define the sub-pixel peaks as

p* = ( Σ_{q ∈ N(p)} q · H_p(q) ) / ( Σ_{q ∈ N(p)} H_p(q) ).

The assumption we rely on is that for every p ∈ P, the neighbourhood N(p) contains at most one local maximum. In general, this assumption holds if the radius r is small enough w.r.t. the quantization constant. In practice, we obtain good results by choosing N(p) to be a 1- or 2-voxel radius interval centered at p, see Section V.
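The 1D detector above can be sketched as follows; the confidence threshold and function names are ours:

```python
import numpy as np

def subvoxel_peaks(h, radius=1, thresh=0.3):
    """Non-local-maxima suppression keeps the points that are the maximum
    of their neighbourhood; each surviving peak is then refined with a
    local center of mass (integral regression restricted to N(p))."""
    peaks = []
    for p in range(len(h)):
        lo, hi = max(0, p - radius), min(len(h), p + radius + 1)
        if h[p] >= thresh and h[p] == h[lo:hi].max():
            w = h[lo:hi]
            xs = np.arange(lo, hi)
            peaks.append(float((xs * w).sum() / w.sum()))  # sub-pixel position
    return peaks

h = np.array([0.0, 0.2, 1.0, 0.6, 0.0, 0.1, 0.9, 0.1, 0.0])
ps = subvoxel_peaks(h)
```

With this input, the two local maxima at indices 2 and 6 survive the suppression, and the first is refined toward index 3 because of the mass at h[3].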

III-C3 Skeleton Decoder

This module takes as input the peaks from the sub-voxel joint detector and the vectormaps output by the Volumetric Network, and outputs a list of 3D poses. Our algorithm is a direct extension of the one proposed by OpenPose [5], with the only difference that line integrals are computed over three-dimensional vector fields.
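A sketch of the 3D line-integral scoring for one candidate limb, assuming a fixed number of samples and nearest-voxel lookup (the actual decoder follows [5]; names are ours):

```python
import numpy as np

def paf_score(paf, p0, p1, n_samples=10):
    """Average alignment between a 3D vector field `paf` (3, X, Y, Z)
    and the unit direction of the candidate limb p0 -> p1, sampled at
    points along the segment."""
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    d = p1 - p0
    u = d / (np.linalg.norm(d) + 1e-8)          # unit limb direction
    score = 0.0
    for t in np.linspace(0, 1, n_samples):
        x, y, z = np.round(p0 + t * d).astype(int)
        score += float(paf[:, x, y, z] @ u)      # dot with field vector
    return score / n_samples

# field pointing everywhere along +x: a limb along x scores ~1,
# a perpendicular limb scores ~0
paf = np.zeros((3, 5, 5, 5))
paf[0] = 1.0
```

Candidate limbs are then greedily matched between joint types in decreasing order of this score.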




                MPJPE   per-part PCP (%)                     Avg PCP
3D Augmentations
                8.236   99.1  99.3  87.8  65.4  96.9  88.3   89.2
                4.598   99.6  99.7  98.5  90.1  99.3  98.5   97.7
                5.350   99.6  99.7  98.6  91.1  99.0  94.9   97.3
                3.859   99.7  99.7  99.5  95.6  99.3  98.8   98.8
Number of Volumetric Features
          32    4.760   99.6  99.7  97.1  78.9  99.5  98.6   95.9
          64    3.859   99.7  99.7  99.5  95.6  99.3  98.8   98.8
          96    3.975   99.7  99.7  99.5  96.2  99.3  98.7   98.9
Loss Type
          L1        4.106   99.6  99.7  99.2  96.2  99.0  98.0   98.7
          L2        4.125   99.6  99.7  99.5  96.6  99.4  98.9   99.0
          SmoothL1  3.859   99.7  99.7  99.5  95.6  99.3  98.8   98.8
Heatmap / Vectormap Loss Ratio
          1     3.859   99.7  99.7  99.5  95.6  99.3  98.8   98.8
          3     4.074   99.7  99.7  99.1  96.6  99.5  98.6   98.9
          10    3.935   99.7  99.7  98.0  90.9  99.5  98.8   97.9
Sub-voxel refinement
          off   4.899   99.7  99.7  99.4  94.9  99.3  98.8   98.6
          on    3.859   99.7  99.7  99.5  95.6  99.3  98.8   98.8
TABLE I: Ablation studies on the PanopticD2D validation set for different aspects of our architecture: 3D augmentations, number of volumetric features, loss type, heatmap/vectormap loss weight ratio, and sub-voxel refinement. Each row reports MPJPE followed by per-part PCP (the part columns include Up Arm, Lo Arm, Up Leg and Lo Leg) and the average PCP.
Fig. 3: Accuracy vs number of views. 1to4-pool4-cross_view: training with 1 to 4 simultaneous views from a pool of 4; 1to4-pool10-cross_view: 1 to 4 from a pool of 10; 1to10-pool20-cross_view: 1 to 10 from a pool of 20; 1to10-all_views: 1 to 10 from all available views. Configurations 1-2-3 are cross-view.

IV Experimental setup

IV-A Datasets

IV-A1 CMU Panoptic dataset [14]

It consists of 31 Full-HD and 480 VGA synchronized video streams at 29.97 FPS, covering various scenes (65 sequences with multiple people, social interactions, and a wide range of actions) for a total duration of 5.5 hours. The dataset includes robustly labeled 3D poses, computed using all the camera views. This dataset is perhaps the most complete, open and free-to-use dataset available for the task of 3D pose estimation. However, since the annotations were released fairly recently, most works in the literature use it only for qualitative evaluations [8] or for single-person pose detection [12], discarding multi-person scenes. To the best of our knowledge, only [31] uses the CMU Panoptic dataset to train and evaluate multi-person 3D pose estimation. We adopt the same subset of scenes and the same train/val/test split as [31]: 20 scenes (343k images), of which 10, 4 and 6 scenes are used for training, validation and test respectively. Only the HD cameras are used, with the frame rate downsampled to 2 FPS. Since one of our concerns is to assess the cross-view generalization of our model, we split the dataset both by scene and by view: the val and test splits use cameras 2, 13, 16 and 18, while the train split uses all (or a subset) of the remaining 27 cameras. This is the same camera split used by [12]. We name this dataset PanopticD2D.

IV-A2 Shelf [1]

We adopt this dataset to evaluate the ability of our model to transfer to a completely unseen setup. It consists of a single scene of four people disassembling a shelf at close range. Video streams come from five calibrated cameras, and the dataset includes 3D annotated ground-truth skeletons.

IV-B Evaluation metrics

We employed two commonly used metrics that capture different types of errors in the model's predictions:

  • MPJPE: Mean Per Joint Position Error. Given a pair of skeletons, MPJPE is defined as the average Euclidean distance between the predicted joints and the corresponding ground-truth joints.

  • PCP: Percentage of Correct estimated Parts. We implemented this metric according to [8]. A body part is considered correct if the average distance of its two joints from the corresponding ground-truth joint locations is less than a threshold, computed as 50% of the length of the ground-truth body part.

Before computing these metrics, we associate, for each scene, the predicted skeletons to the ground-truth skeletons using linear assignment.
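The matching and scoring step can be sketched as follows; we use brute-force matching over permutations where the paper uses linear assignment, and all names are ours:

```python
import itertools
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance between
    corresponding joints of two skeletons of shape (J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def match_and_score(preds, gts):
    """Assign predicted skeletons to ground truth by minimising the total
    MPJPE (brute force; fine for a handful of people per scene)."""
    best = None
    for perm in itertools.permutations(range(len(preds))):
        errs = [mpjpe(preds[i], gts[j]) for j, i in enumerate(perm)]
        if best is None or sum(errs) < sum(best):
            best = errs
    return best  # per-skeleton MPJPE under the best assignment

gt0 = np.zeros((2, 3))
gt1 = np.full((2, 3), 10.0)
off = np.array([1.0, 0.0, 0.0])
errs = match_and_score([gt1 + off, gt0], [gt0, gt1])
```

In the toy example the predictions arrive in swapped order; the matching recovers the correct correspondence before the errors are reported.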

IV-C Implementation details

IV-C1 Pose Layout

We used a simplified pose layout of 14 keypoints: apart from the canonical 12 joints of arms and legs, we only added neck and nose. A layout conversion was sometimes needed across different datasets and labeling standards. Moreover, we defined 13 PAFs, arranged in a tree structure rooted at the neck and extending along the arms, the legs and the nose. In our setup, excessively increasing the number of joints or PAFs would not make sense, given the limitations of our quantized space.

IV-C2 Skeleton Decoding

Parameters have been found by performing a grid search on the PanopticD2D validation set. Eventually, we opted for interpolation over a small voxel region. Then, all local maxima with a score lower than 0.3 are discarded; every PAF whose average line integral is lower than 0.2 is also removed. Finally, only candidate poses with more than 7 keypoints are retained.

IV-C3 3D space quantization

We set the size of the quantization voxel to 7.5 cm. This allows us to maintain a voxel grid that efficiently covers the whole Panoptic scene, the last grid dimension being the vertical axis.

IV-C4 Training recipe

Models have been trained with the Adam optimizer. We set the initial learning rate to 0.002 and used a step decay policy of 0.3 every 50 epochs. All models have been trained for 200 epochs with a batch size of 8. We implemented the architecture in PyTorch.
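This recipe maps directly onto standard PyTorch components; a minimal sketch, where the model is a stand-in for the full network:

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the full Light3DPose network
opt = torch.optim.Adam(model.parameters(), lr=0.002)
# step decay policy: multiply the learning rate by 0.3 every 50 epochs
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.3)

for epoch in range(200):
    # ... train one epoch (SmoothL1 loss, batch size 8), then:
    opt.step()   # placeholder step so the optimizer/scheduler order is valid
    sched.step()

lr = opt.param_groups[0]["lr"]
```

After 200 epochs the learning rate has decayed four times, to 0.002 · 0.3^4.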

V Ablation analysis

V-A 3D Augmentation

We applied 3D data augmentation techniques to the 3D feature space between the Unprojection layer and the Volumetric Network. In particular, we implemented the following:

V-A1 Random cube embedding

During training, we consider a cube strictly smaller than the full detection space, randomly embedded within it. This corresponds to taking a random 3D crop of the scene for the parameter update.

From the volumetric network's point of view, this acts as a data augmentation strategy, since moving the cube inside the detection space corresponds to a change of the observed scene and of the extrinsic parameters of the cameras.

We fix the number of voxels of the smaller cube, and change the embedding at the start of each epoch.
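The random cube embedding amounts to a random 3D crop of the unprojected volume; a sketch with illustrative shapes:

```python
import numpy as np

def random_cube_crop(vol, crop, rng):
    """Randomly embed a smaller training cube inside the full unprojected
    volume of shape (C, X, Y, Z). The ground-truth heatmaps/vectormaps
    must be cropped with the same offset."""
    _, X, Y, Z = vol.shape
    x = rng.integers(0, X - crop[0] + 1)
    y = rng.integers(0, Y - crop[1] + 1)
    z = rng.integers(0, Z - crop[2] + 1)
    sub = vol[:, x:x + crop[0], y:y + crop[1], z:z + crop[2]]
    return sub, (x, y, z)

rng = np.random.default_rng(0)
sub, offset = random_cube_crop(np.zeros((64, 20, 20, 10)), (16, 16, 8), rng)
```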

V-A2 Random rotation

We implement rotations about the vertical axis, restricted so as to allow a fast implementation. One should take care that a rotation of the 3D space is not reflected in an image transformation, so when a rotation is applied we cut the back-propagation graph just before the unprojection layer. In our specific architecture, this sparse back-prop signal does not drastically affect training, since the only trainable part before the volumetric network is the Reduction module, which has a limited number of parameters.
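If the rotations are restricted to multiples of 90 degrees (one fast option; the exact angle set is an assumption here), the augmentation reduces to a cheap axis permutation of the feature volume:

```python
import numpy as np

def rot90_vertical(vol, k):
    """Rotate a (C, X, Y, Z) feature volume by k * 90 degrees about the
    vertical (Z) axis. Requires a square horizontal grid (X == Y).
    Note: the x/y channels of the ground-truth vectormaps must be
    rotated consistently as well."""
    return np.rot90(vol, k=k, axes=(1, 2)).copy()

vol = np.zeros((1, 2, 2, 1))
vol[0, 1, 0, 0] = 1.0
out = rot90_vertical(vol, 1)
```

Because right-angle rotation is exact on the voxel grid, no interpolation (and no blurring of the targets) is introduced.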

Model                                          MPJPE single  MPJPE multi  MPJPE avg  PCP avg
ACTOR [31] (2 views)*                          17.21         50.24        33.72      -
ACTOR (4 views)*                               8.19          20.10        14.14      -
ACTOR (10 views)*                              6.13          12.21        9.17       -
Oracle [31] (using GT to select cameras)*      4.24          9.19         6.71       -
Ours (1 unseen view)                           10.34         9.32         9.43       80.8
Ours (2 to 4 unseen views depending on scene)  5.30          4.09         4.22       98.2
Ours (10 views, from training view pool)       3.50          3.56         3.55       98.6
*ACTOR: the number in brackets refers to the maximum number of views to choose from. Oracle: the best views to triangulate are selected using the ground truth.
TABLE II: Method comparison on the PanopticD2D test set. MPJPE and PCP metrics for scenes with a single person and with multiple people.
Model                    Actor 1  Actor 2  Actor 3  Avg   Speed (s)
Belagiannis et al. [1]   66.1     65.0     83.2     71.4  -
Belagiannis et al. [3]   75.0     67.0     86.0     76.0  -
Belagiannis et al. [2]   75.3     69.7     87.6     77.5  -
Ershadi et al. [9]       93.3     75.9     94.8     88.0  -
Dong et al. [8]          98.8     94.1     97.8     96.9  .465
Ours                     94.3     78.4     96.8     89.8  .146
TABLE III: Quantitative comparison on the Shelf dataset. The metric is PCP.
Fig. 4: Adding more views increases computational time by a linear factor. However, only a few modules are affected by this growth: the main CNN block (in red) has constant complexity in both the number of views and the number of people. Inference time is measured on a single NVIDIA GeForce GTX 1080Ti.

V-B Architecture

In Table I we report the results of different experiments evaluating the contribution of our architectural choices.

V-B1 Number of volumetric features

It refers to the number of channels of the volumetric input, involving the 2D feature maps, the input/output of the unprojection, and the volumetric network. For 32 features we used the original V2V network, whereas for 64 and 96 we modified it as described in Section III-C1. Models with 64 and 96 channels achieve similar MPJPE and PCP values, but 64 is the obvious choice for being computationally lighter.

V-B2 Loss

We ran experiments with different loss types and different weightings of the heatmap and vectormap losses. By evaluating PAF and peak quality separately, we noticed that good peaks have a stronger impact on the final metrics than good PAFs, and thus tried weighting the peak part of the loss more. The results seem to suggest that the task of predicting good peaks should be tackled with a more elaborate approach than simply adjusting loss weights.

V-B3 Sub-voxel refinement

By activating it we achieve a lower MPJPE. It has almost no effect on PCP, since it improves sub-voxel localization but does not reduce false positives.

V-C Study on the number of input views

These experiments have a two-fold goal. On one side, we wanted to better understand the impact on accuracy of a small or large number of views in the training pool; on the other, we wanted to check how well our augmentation strategies could compensate for unseen angles. In Figure 3 we report four experiments in which we varied the number of available views and the number of simultaneous views used in each training inference. In particular, they show that even a few cameras can produce reasonably good results; also, beyond a certain number, adding more views gives negligible improvements.

VI Comparison with state-of-the-art

In Table II we report a comparison between our method and the results in [31] (ACTOR and Oracle). We remark that the task the authors of [31] address is different from ours: they train an agent to find the best views to use to triangulate a particular scene. We nonetheless consider it a good baseline, even though the core contribution of [31] is not the triangulation algorithm itself. We select 4 fixed validation views and never train on them. Since some recordings have fewer views available, only 36.2% of the test set has 4 views, 31.3% has 3 and 32.5% has just two angles available. The evaluation metric is MPJPE. The MPJPE of our method is more than 3 times lower than ACTOR with 4 views, and on average lower than the Oracle's.

We also ran our model on the Shelf dataset in order to test it in a completely new environment, with unseen views, camera parameters, sensors, and any other detail that could bias the evaluation. Results are reported in Table III. Our method obtains good results, even if not on par with the work by Dong et al. [8]; however, their approach is much slower, being based on top-down 2D backbones. We detail a speed comparison between the two methods in Section VI-A.

Fig. 5: Left: geometric triangulation using 2D poses from Lightweight OpenPose [27] (same 2D backbone weights as ours) and iterative greedy matching [36]. Right: direct 3D pose estimation with our model.
Fig. 6: Direct 3D pose predictions by our model from a single camera view. On the frame we projected in red the groundtruth, in white our predictions. In the 3D plot: predictions in color, groundtruth in dashed-black. The network “hallucinates” straight legs of non visible body parts relying on a strong learned prior.

VI-A Inference speed

Being a pure bottom-up approach, our method scales well with an increasing number of views and subjects. Even though our complexity is linear in the number of views and constant in the number of people, adding more cameras affects only the Backbone, Reduction and Unprojection modules, which account for a small fraction of the cumulative computational burden (e.g., for 10 views they take, all together, only 45 ms; see Figure 4). On the other hand, post-processing the CNN output costs even less: Cao et al. [5] implemented an optimized version which takes 0.58 ms for a 9-person image. For reference, we can compare our method with the one presented in [8], see Table III. Their approach starts with a person detector [19], which takes around 10 ms per view. Then each detection is forwarded to two branches, of which the 2D pose estimation [6] is the most expensive (we measured 67 ms). From there, the final 3D pose inference takes around 80 ms. We can estimate that a 5-view scenario with 5 people will take about 0.465 s, which is about 3.2 times our implementation.

VI-B Qualitative results

In Figure 5 we show a comparison between our model, which performs direct 3D estimation, and the result of geometric triangulation using the 2D skeletons predicted by Lightweight OpenPose [27] and the iterative greedy matching of [36]. Notice that our 2D backbone has exactly the same weights as the backbone of [27], since we neither train nor finetune that part of the network. This highlights the power of estimating 3D poses directly: our volumetric architecture can learn strong pose priors and implicitly discard false detections. By exploiting the 3D representation of the space, it is less prone to occlusion-related errors and can better deal with crowded scenes. This behavior is even more evident in Figure 6, where our method correctly predicts all 3D poses from a monocular view. In particular, notice that even the legs of the blue skeleton are predicted, although they are not visible from that particular view (view and scene from the validation set). We suppose the model hallucinates straight legs because most people in the PanopticD2D training set are standing.

VII Conclusion

We present a method for multi-person human pose estimation from calibrated views. Our neural architecture is able to predict 3D pose representations directly from raw camera views. To the best of our knowledge, this is the first attempt to tackle such a task in a completely bottom-up fashion. The proposed method exhibits good computational scalability properties: in particular, it is essentially independent of the number of people in the scene. Moreover, it scales linearly with the number of input views.

Our experiments show state-of-the-art performance on the Panoptic D2D dataset as well as good generalization to the unseen Shelf dataset. We hope that our work can open new research lines and scenarios. The method visibly benefits from a wide variety of configurations of people, cameras, and environments during training. Simple 3D data augmentation techniques have been explored and proven effective in enhancing performance; larger datasets, both real and synthetic, could further increase the model's capabilities.


Acknowledgments

The authors would like to thank Igor Moiseev and all the Checkout Technologies team for the fruitful discussions and their friendly support.


  • [1] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic (2014) 3D pictorial structures for multiple human pose estimation. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    pp. 1669–1676. Cited by: §II, §IV-A2, TABLE III.
  • [2] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic (2015) 3d pictorial structures revisited: multiple human pose estimation. IEEE transactions on pattern analysis and machine intelligence 38 (10), pp. 1929–1942. Cited by: TABLE III.
  • [3] V. Belagiannis, X. Wang, B. Schiele, P. Fua, S. Ilic, and N. Navab (2014) Multiple human pose estimation with temporally consistent 3d pictorial structures. In European Conference on Computer Vision, pp. 742–754. Cited by: TABLE III.
  • [4] L. Bridgeman, M. Volino, J. Guillemaut, and A. Hilton (2019) Multi-person 3d pose estimation and tracking in sports. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §II.
  • [5] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299. Cited by: §I, §II, §III-C1, §III-C3, §III-C, §VI-A.
  • [6] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun (2018) Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7103–7112. Cited by: §II, §VI-A.
  • [7] Q. Dang, J. Yin, B. Wang, and W. Zheng (2019) Deep learning based 2d human pose estimation: a survey. Tsinghua Science and Technology 24 (6), pp. 663–676. Cited by: §I.
  • [8] J. Dong, W. Jiang, Q. Huang, H. Bao, and X. Zhou (2019) Fast and robust multi-person 3d pose estimation from multiple views. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7792–7801. Cited by: §II, 2nd item, §IV-A1, TABLE III, §VI-A, §VI.
  • [9] S. Ershadi-Nasab, E. Noury, S. Kasaei, and E. Sanaei (2018) Multiple human 3d pose estimation from multiview images. Multimedia Tools and Applications 77 (12), pp. 15573–15601. Cited by: TABLE III.
  • [10] I. Habibie, W. Xu, D. Mehta, G. Pons-Moll, and C. Theobalt (2019) In the wild human pose estimation using explicit 2d features and intermediate 3d representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10905–10914. Cited by: §II.
  • [11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §III-A1.
  • [12] K. Iskakov, E. Burkov, V. Lempitsky, and Y. Malkov (2019) Learnable triangulation of human pose. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7718–7727. Cited by: §I, §II, 2nd item, §III-B, §III-C2, §IV-A1.
  • [13] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §III-B.
  • [14] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh (2015) Panoptic studio: a massively multiview system for social motion capture. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3334–3342. Cited by: §I, §IV-A1.
  • [15] A. Kanazawa, J. Y. Zhang, P. Felsen, and J. Malik (2019) Learning 3d human dynamics from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5614–5623. Cited by: §II.
  • [16] L. Ke, M. Chang, H. Qi, and S. Lyu (2018) Multi-scale structure-aware network for human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 713–728. Cited by: §II.
  • [17] M. Kocabas, S. Karagoz, and E. Akbas (2018) Multiposenet: fast multi-person pose estimation using pose residual network. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 417–433. Cited by: §II.
  • [18] S. Kreiss, L. Bertoni, and A. Alahi (2019) Pifpaf: composite fields for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11977–11986. Cited by: §II.
  • [19] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun (2017) Light-head r-cnn: in defense of two-stage object detector. arXiv preprint arXiv:1711.07264. Cited by: §VI-A.
  • [20] J. Martinez, R. Hossain, J. Romero, and J. J. Little (2017) A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2640–2649. Cited by: §II.
  • [21] G. Moon, J. Y. Chang, and K. M. Lee (2019) Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10133–10142. Cited by: §II.
  • [22] G. Moon, J. Yong Chang, and K. Mu Lee (2018) V2v-posenet: voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map. In Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 5079–5088. Cited by: §III-C1.
  • [23] A. Newell, Z. Huang, and J. Deng (2017) Associative embedding: end-to-end learning for joint detection and grouping. In Advances in neural information processing systems, pp. 2277–2287. Cited by: §II.
  • [24] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In European conference on computer vision, pp. 483–499. Cited by: §II.
  • [25] A. Nibali, Z. He, S. Morgan, and L. Prendergast (2019) 3d human pose estimation with 2d marginal heatmaps. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1477–1485. Cited by: §II.
  • [26] T. Ohashi, Y. Ikegami, and Y. Nakamura (2020) Synergetic reconstruction from 2d pose and 3d motion for wide-space multi-person video motion capture in the wild. arXiv preprint arXiv:2001.05613. Cited by: §II.
  • [27] D. Osokin (2018) Real-time 2d multi-person pose estimation on cpu: lightweight openpose. In arXiv preprint arXiv:1811.12004, Cited by: §II, §III-A1, §III-A2, Fig. 5, §VI-B.
  • [28] G. Papandreou, T. Zhu, L. Chen, S. Gidaris, J. Tompson, and K. Murphy (2018) Personlab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–286. Cited by: §II.
  • [29] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis (2017) Harvesting multiple views for marker-less 3d human pose annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6988–6997. Cited by: §II.
  • [30] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli (2019) 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7753–7762. Cited by: §II.
  • [31] A. Pirinen, E. Gärtner, and C. Sminchisescu (2019) Domes to drones: self-supervised active triangulation for 3d human pose reconstruction. In Advances in Neural Information Processing Systems, pp. 3907–3917. Cited by: §II, §IV-A1, TABLE II, §VI.
  • [32] H. Qiu, C. Wang, J. Wang, N. Wang, and W. Zeng (2019) Cross view fusion for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4342–4351. Cited by: §II.
  • [33] G. Rogez, P. Weinzaepfel, and C. Schmid (2019) Lcr-net++: multi-person 2d and 3d pose detection in natural images. IEEE transactions on pattern analysis and machine intelligence. Cited by: §II.
  • [34] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5693–5703. Cited by: §II, §III-C2.
  • [35] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei (2018) Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 529–545. Cited by: §II, §III-C2.
  • [36] J. Tanke and J. Gall (2019) Iterative greedy matching for 3d human pose tracking from multiple views. In German Conference on Pattern Recognition, pp. 537–550. Cited by: §II, Fig. 5, §VI-B.
  • [37] D. Tome, C. Russell, and L. Agapito (2017) Lifting from the deep: convolutional 3d pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2500–2509. Cited by: §II.
  • [38] D. Tome, M. Toso, L. Agapito, and C. Russell (2018) Rethinking pose in 3d: multi-stage refinement and recovery for markerless motion capture. In 2018 international conference on 3D vision (3DV), pp. 474–483. Cited by: §II.
  • [39] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh (2016) Convolutional pose machines. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 4724–4732. Cited by: §II.
  • [40] B. Xiao, H. Wu, and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV), pp. 466–481. Cited by: §II.
  • [41] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang (2018) 3d human pose estimation in the wild by adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5255–5264. Cited by: §II.
  • [42] F. Zhang, X. Zhu, H. Dai, M. Ye, and C. Zhu (2019) Distribution-aware coordinate representation for human pose estimation. External Links: arXiv:1910.06278 Cited by: §II.
  • [43] F. Zhang, X. Zhu, and M. Ye (2019-06) Fast human pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II.