Direct Multi-view Multi-person 3D Pose Estimation

by Tao Wang, et al.
National University of Singapore

We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images. Instead of estimating 3D joint locations from a costly volumetric representation or reconstructing the per-person 3D pose from multiple detected 2D poses as in previous methods, MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks. Specifically, MvP represents skeleton joints as learnable query embeddings and lets them progressively attend to and reason over the multi-view information from the input images to directly regress the actual 3D joint locations. To improve the accuracy of such a simple pipeline, MvP presents a hierarchical scheme to concisely represent query embeddings of multi-person skeleton joints and introduces an input-dependent query adaptation approach. Further, MvP designs a novel geometrically guided attention mechanism, called projective attention, to more precisely fuse the cross-view information for each joint. MvP also introduces a RayConv operation to integrate the view-dependent camera geometry into the feature representations for augmenting the projective attention. We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient. Notably, it achieves 92.3% AP on the challenging Panoptic dataset, improving upon the previous best approach VoxelPose Tu et al. (2020) by 9.8%. MvP is general and also extendable to recovering human mesh represented by the SMPL model, thus useful for modeling multi-person body shapes. Code and models are available at




1 Introduction

Multi-view multi-person 3D pose estimation aims to localize 3D skeleton joints for each person instance in a scene from multi-view camera inputs. It is a fundamental task that benefits many real-world applications (such as surveillance, sportscast, gaming and mixed reality) and is mainly tackled by reconstruction-based Dong et al. (2019); Huang et al. (2020); Chen et al. (2020) and volumetric Tu et al. (2020) approaches in previous literature, as shown in Fig. 1 (a) and (b). The former first estimates 2D poses in each view independently, then aggregates them and reconstructs their 3D counterparts via triangulation or a 3D pictorial structure model. The volumetric approach Tu et al. (2020) first builds a 3D feature volume through heatmap estimation and 2D-to-3D un-projection, on which instance localization and 3D pose estimation are then performed for each person instance individually. Despite their notable accuracy, these paradigms are inefficient because they rely heavily on those intermediate tasks. Moreover, they estimate the 3D pose of each person separately, making the computation cost grow linearly with the number of persons.

Aiming at a simpler and more efficient pipeline, we ask: is it possible to directly regress 3D poses from multi-view images without relying on any intermediate task? Though conceptually attractive, adopting such a direct mapping paradigm is highly non-trivial, as it remains unclear how to perform skeleton joint detection and association for multiple persons within a single stage. In this work, we address these challenges by developing a novel Multi-view Pose transformer (MvP) model which significantly simplifies multi-person 3D pose estimation. Specifically, MvP represents each skeleton joint as a learnable positional embedding, named joint query, which is fed into the model and mapped into the final 3D pose estimation directly (Fig. 1 (c)), via a specifically designed attention mechanism that fuses multi-view information and globally reasons over the joint predictions to assign them to the corresponding person instances.

We develop a novel hierarchical query embedding scheme to represent the multi-person joint queries. It shares joint embedding across different persons and introduces person-level query embedding to help the model in learning both person-level and joint-level priors. Benefiting from exploiting the person-joint relation, the model can more accurately localize the 3D joints. Further, we propose to update the joint queries with input-dependent scene-level information (i.e., globally pooled image features from multi-view inputs) such that the learnt joint queries can adapt to the target scene with better generalization performance.

To effectively fuse the multi-view information, we propose a geometrically-guided projective attention mechanism. Instead of applying full attention to densely aggregate features across spaces and views, it projects the estimated 3D joint into 2D anchor points for different views, and then selectively fuses the multi-view local features near to these anchors to precisely refine the 3D joint location. Additionally, we propose to encode the camera rays into the multi-view feature representations via a novel RayConv operation to integrate multi-view positional information into the projective attention. In this way, the strong multi-view geometrical priors can be exploited by projective attention to obtain more accurate 3D pose estimation.

Comprehensive experiments on the 3D pose benchmarks Panoptic Joo et al. (2015), Shelf and Campus Belagiannis et al. (2014) demonstrate that MvP works very well. Notably, it obtains 92.3% AP on the challenging Panoptic dataset, improving upon the previous best approach VoxelPose Tu et al. (2020) by 9.8%, while achieving a nearly 2× speed-up (170 ms versus 320 ms per multi-view input, Table 1). Moreover, the design ethos of MvP can be easily extended to more complex tasks: we show that a simple body mesh branch with SMPL representation Loper et al. (2015) trained on top of a pre-trained MvP can achieve competitive qualitative results.

Our contributions are summarized as follows: 1) We strive for simplicity in addressing the challenging multi-view multi-person 3D pose estimation problem by casting it as a direct regression problem and accordingly develop a novel Multi-view Pose transformer (MvP) model, which achieves state-of-the-art results on the challenging Panoptic benchmark. 2) Different from query embedding designs in most transformer models, we propose a more tailored and concise hierarchical joint query embedding scheme to enable the model to effectively encode person-joint relation. Additionally, we mitigate the commonly faced generalization issue by a simple query adaptation strategy. 3) We propose a novel projective attention module along with a RayConv operation for fusing multi-view information effectively, which we believe are also inspiring for model designs in other multi-view 3D tasks.

Figure 1: Difference between our method and others for multi-view multi-person 3D pose estimation. Existing methods adopt complex multi-stage pipelines that are either (a) reconstruction-based or (b) volumetric representation based, which incur heavy computation burden. (c) Our method solves this task as a direct regression problem without relying on any intermediate task by a novel Multi-view Pose Transformer, and largely simplifies the pipeline and boosts the efficiency.

2 Related Works

3D Human Pose Estimation

3D pose estimation from monocular inputs Martinez et al. (2017); Mehta et al. (2017); Zhou et al. (2017); Popa et al. (2017); Sun et al. (2018); Nie et al. (2019); Zhang et al. (2020); Gong et al. (2021); Zhang et al. (2021) is an ill-posed problem, as multiple 3D poses may result in the same 2D projection. To alleviate such projective ambiguities, multi-view methods have been explored. Research works on single-person scenes use either multi-view geometry Hartley and Zisserman (2003) for feature fusion Qiu et al. (2019); He et al. (2020) and triangulation Iskakov et al. (2019); Remelli et al. (2020), or pictorial structure models for fast and robust 3D pose reconstruction Pavlakos et al. (2017); Qiu et al. (2019), achieving promising results. However, the task becomes more challenging in multi-person scenes. Current approaches mainly exploit a multi-stage pipeline for multi-person tasks, including reconstruction-based Dong et al. (2019); Chen et al. (2020); Huang et al. (2020); Kadkhodamohammadi and Padoy (2021); Lin and Lee (2021) and volumetric Tu et al. (2020) paradigms. Despite their notable accuracy, these methods incur expensive computation from the intermediate tasks, such as cross-view matching and heatmap back-projection. Moreover, the total computation cost grows linearly with the number of persons in the scene, making them hard to scale to larger scenes. Different from all previous approaches that rely on a multi-stage pipeline with computation redundancy, our method views multi-person 3D pose estimation as a direct regression problem based on a novel Multi-view Pose transformer model, enabling a single-stage solution free of intermediate tasks.

Attention and Transformers

Driven by the recent success in natural language fields, there has been growing interest in exploring Transformers for computer vision tasks, such as image recognition Dosovitskiy et al. (2020b) and generation Jiang et al. (2021), as well as more complicated object detection Carion et al. (2020); Zhu et al. (2020) and video instance segmentation Wang et al. (2020). However, multi-person 3D pose estimation has not been explored along this direction. In this study, we propose a novel Multi-view Pose Transformer architecture with a joint query embedding scheme and a projective attention module to regress 3D skeleton joints from multi-view images directly, delivering a simplified and effective pipeline.

3 Multi-view Pose Transformer (MvP)

To build a direct multi-person 3D pose estimation framework from multi-view images, we introduce a novel Multi-view Pose transformer (MvP). MvP takes in the multi-view feature representations, and transforms them into groups of 3D joint locations directly (Fig. 2 (a)), delivering multi-person 3D pose results, with the following carefully designed query embedding and attention schemes for detecting and grouping the skeleton joints.

3.1 Joint Query Embedding Scheme

Inspired by transformers Vaswani et al. (2017), MvP represents each skeleton joint as a learnable positional embedding, which is fed into the transformer decoder and mapped into the final 3D joint location by jointly attending to other joints and the multi-view information (Fig. 2 (a)). The learnt embeddings encode prior knowledge about the skeleton joints, and we call them joint queries. MvP develops the following concise query embedding scheme.

Hierarchical Query Embeddings

The most straightforward way for designing joint query embeddings is to maintain a learnable query vector for each joint per person. However, we empirically find this scheme does not work well, likely because such a naive strategy cannot share the joint-level knowledge between different persons.

To tackle this problem, we develop a hierarchical query embedding scheme to explicitly encode the person-joint relation for better generalization to different scenes. The hierarchical embedding offers joint-level information sharing across different persons and reduces the learnable parameters, helping the model to learn useful knowledge from the training data, and thus generalize better. Concretely, instead of using a set of independent joint queries $\{q_m\}_{m=1}^{M} \subset \mathbb{R}^{C}$, we employ a set of person-level queries $\{h_n\}_{n=1}^{N} \subset \mathbb{R}^{C}$ and a set of joint-level queries $\{l_j\}_{j=1}^{J} \subset \mathbb{R}^{C}$ to represent different persons and different skeleton joints, where $C$ denotes the feature dimension, $N$ is the number of persons, $J$ is the number of joints per person, and $M = NJ$. Then the query of joint $j$ of person $n$ can be hierarchically formulated as

$$q_n^j = h_n + l_j. \tag{1}$$

With such a hierarchical embedding scheme, the number of learnable query embedding parameters is reduced from $NJC$ to $(N + J)C$.
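The parameter saving can be illustrated with a minimal sketch. It assumes that a person-level vector and a joint-level vector are combined by element-wise addition; the sizes (10 persons, 15 joints, 256 channels) and the random initial values are placeholders, not the paper's settings.

```python
import random

def build_joint_queries(person_queries, joint_queries):
    """Return queries[n][j] = person_queries[n] + joint_queries[j] (element-wise)."""
    return [
        [[h + l for h, l in zip(person_vec, joint_vec)] for joint_vec in joint_queries]
        for person_vec in person_queries
    ]

rng = random.Random(0)
N, J, C = 10, 15, 256  # illustrative sizes only
person_queries = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(N)]
joint_queries = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(J)]
queries = build_joint_queries(person_queries, joint_queries)

flat_params = N * J * C     # one independent vector per joint per person
hier_params = (N + J) * C   # shared joint vectors plus person vectors
print(flat_params, hier_params)
```

With these placeholder sizes the learnable query parameters shrink by a factor of six, while every (person, joint) pair still receives a distinct query.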

Input-dependent Query Adaptation

In the above, the learned joint query embeddings are shared for all the input images, independent of their contents, and thus may not generalize well to novel target data. To address this limitation, we propose to augment the joint queries with input-dependent scene-level information in both model training and deployment, such that the learnt joint queries can adapt to the target data and generalize better. Concretely, we augment the above joint queries with a globally pooled feature vector $g$ from the multi-view image feature representations:

$$q_n^j = g + h_n + l_j. \tag{2}$$

Here $g = \mathrm{Concat}(\mathrm{Pool}(Z_1), \dots, \mathrm{Pool}(Z_V))\, W^g$, where $Z_v$ denotes the image feature from the $v$-th view and $V$ is the total number of camera views; Concat and Pool denote concatenation and pooling operations, and $W^g$ is a learnable linear weight.
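A minimal NumPy sketch of this adaptation step follows. It assumes global average pooling per view and a single linear map; the function name `adapt_queries`, the tensor shapes, and the random weights are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def adapt_queries(queries, view_features, weight):
    """Add a scene-level vector, pooled from all views, to every joint query.

    queries:       (M, C) joint queries
    view_features: list of V arrays shaped (C_feat, H, W)
    weight:        (V * C_feat, C) linear weight (random here, learned in practice)
    """
    pooled = [feat.mean(axis=(1, 2)) for feat in view_features]  # assumed: global average pooling
    scene = np.concatenate(pooled) @ weight                      # scene-level vector, shape (C,)
    return queries + scene                                       # broadcast over all M queries

rng = np.random.default_rng(0)
V, C_feat, C, M = 5, 8, 16, 30  # illustrative sizes only
view_features = [rng.normal(size=(C_feat, 12, 20)) for _ in range(V)]
queries = rng.normal(size=(M, C))
weight = rng.normal(size=(V * C_feat, C))
adapted = adapt_queries(queries, view_features, weight)
shift = adapted - queries  # the same scene vector is added to every query
```

Because the same scene vector is added to all queries, the person-joint structure of the embeddings is preserved while their absolute values depend on the input images.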

Figure 2: (a) Overview of the proposed MvP model. Upon the multi-view image features from several convolution layers, it deploys a transformer decoder with a stack of decoder layers to map the input joint queries and the multi-view features to 3D poses directly. (b) The projective attention of MvP projects 3D skeleton joints to anchor points (the green dots) on different views and samples deformable points (the red dots) surrounding these anchors to aggregate local contextual features via learned weights (the brighter color density means larger weights).

3.2 Projective Attention for Multi-view Feature Fusion

It is crucial to aggregate complementary multi-view information to transform the joint embeddings into accurate 3D joint locations. We consider the dot product attention mechanism of transformers Vaswani et al. (2017) to fuse the multi-view image features. However, naively applying such dot product attention densely over all spatial locations and camera views will incur enormous computation cost. Moreover, such dense attention is difficult to optimize and delivers poor performance empirically since it does not exploit any 3D geometric knowledge.

Therefore, we propose a geometrically-guided multi-view projective attention scheme, named projective attention. The core idea is to take the 2D projection of the estimated 3D joint location as the anchor point in each view, and only fuse the local features near those projected 2D locations from different views. Motivated by the deformable convolution Dai et al. (2017); Zhu et al. (2019), we adopt an adaptive deformable sampling strategy to gather the localized context information in each camera view, as shown in Fig. 2 (b). Other local attention operations Zhao et al. (2020); Wu et al. (2020, 2019) can also be adopted as an alternative. Formally, given joint query feature $q$ and 3D joint position $y$, the projective attention is defined as

$$\mathrm{PAttention}(q, y, \{Z_v\}_{v=1}^{V}) = \mathrm{Concat}(f_1, f_2, \dots, f_V)\, W^P, \quad \text{where } f_v = \sum_{k=1}^{K} a(k) \cdot Z_v\big(\Pi(y, C_v) + \Delta p(k)\big)\, W^f. \tag{3}$$

Here the view-specific feature $f_v$ is obtained by aggregating features from $K$ discrete offsetted sampling points around an anchor point $p = \Pi(y, C_v)$, located by projecting the current 3D joint location $y$ to 2D, where $\Pi$ denotes perspective projection Hartley and Zisserman (2003) and $C_v$ the corresponding camera parameters. $W^P$ and $W^f$ are learnable linear weights. The attention weight $a$ and the offset $\Delta p$ to the projected anchor point are estimated from the fusion of query feature $q$ and the view-dependent feature at the projected anchor point $Z_v(p)$, i.e., $a = \mathrm{Softmax}\big((q + Z_v(p))\, W^a\big)$ and $\Delta p = (q + Z_v(p))\, W^p$, where $W^a$ and $W^p$ are learnable linear weights. If the projected location and the offset are fractional, we use bilinear interpolation to obtain the corresponding feature $Z_v(p)$ or $Z_v(p + \Delta p(k))$.

The projective attention incorporates two geometrical cues, i.e., the corresponding 2D spatial locations across views from the 3D to 2D projection and the deformed neighborhood of the anchors from the learned offsets to gather view-adaptive contextual information. Unlike naive attention where the query feature densely interacts with the multi-view key features across all the spatial locations, the projective attention is more selective for the interaction between the query and each view—only the features from locations near to the projected anchors are aggregated, and thus is much more efficient.
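The following NumPy sketch walks through one projective-attention step for a single joint query: project the 3D estimate into each view, sample the anchor feature, predict attention weights and offsets from the fused query, and aggregate the sampled features. All names, shapes, the 3×4 camera matrices, and the border clamping are assumptions made for illustration; for simplicity, pixel and feature-map coordinates are treated as identical.

```python
import numpy as np

def project(point_3d, camera):
    """Perspective projection with a 3x4 camera matrix; returns (x, y)."""
    u = camera @ np.append(point_3d, 1.0)
    return u[:2] / u[2]

def bilinear_sample(feat, xy):
    """Sample a (C, H, W) map at a fractional (x, y), clamped to the border."""
    _, h, w = feat.shape
    x = float(np.clip(xy[0], 0, w - 1))
    y = float(np.clip(xy[1], 0, h - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feat[:, y0, x0] + dx * feat[:, y0, x1]
    bottom = (1 - dx) * feat[:, y1, x0] + dx * feat[:, y1, x1]
    return (1 - dy) * top + dy * bottom

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def projective_attention(q, y, feats, cameras, W_a, W_p, W_f, W_P):
    """q: (C,), y: (3,), feats: V maps of (C, H, W), cameras: V matrices of (3, 4)."""
    per_view = []
    for feat, cam in zip(feats, cameras):
        anchor = project(y, cam)                      # 2D anchor in this view
        fused = q + bilinear_sample(feat, anchor)     # query fused with anchor feature
        weights = softmax(fused @ W_a)                # (K,) attention weights
        offsets = (fused @ W_p).reshape(-1, 2)        # (K, 2) sampling offsets
        sampled = np.stack([bilinear_sample(feat, anchor + d) for d in offsets])
        per_view.append((weights @ sampled) @ W_f)    # (C,) view-specific feature
    return np.concatenate(per_view) @ W_P             # fuse all views into (C,)

rng = np.random.default_rng(0)
C, K, V, H, W = 8, 4, 3, 16, 16  # illustrative sizes only
intrinsics = np.array([[10.0, 0.0, 8.0], [0.0, 10.0, 8.0], [0.0, 0.0, 1.0]])
cameras = [
    intrinsics @ np.hstack([np.eye(3), np.array([[tx], [0.0], [5.0]])])
    for tx in (-1.0, 0.0, 1.0)
]
feats = [rng.normal(size=(C, H, W)) for _ in range(V)]
q, y = rng.normal(size=C), np.array([0.1, -0.2, 1.0])
out = projective_attention(
    q, y, feats, cameras,
    W_a=rng.normal(size=(C, K)), W_p=rng.normal(size=(C, 2 * K)),
    W_f=rng.normal(size=(C, C)), W_P=rng.normal(size=(V * C, C)),
)
```

Each view contributes only K sampled locations instead of all H×W positions, which is where the efficiency gain over dense attention comes from.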

Encoding Multi-view Positional Information with RayConv

The positional encoding Vaswani et al. (2017) is an important component of the transformer, which provides positional information of the input sequence. However, a simple per-view 2D positional encoding scheme cannot encode the multi-view geometrical information. To tackle this limitation, we propose to encode the camera ray directions, which represent positional information in 3D space, into the multi-view feature representations. Concretely, the camera ray direction $R_v$, generated with the view-specific camera parameters, is concatenated channel-wise to the corresponding image feature representation $Z_v$. Then a standard convolution is applied to obtain the updated feature representation $\hat{Z}_v$ with the view-dependent geometric information:

$$\hat{Z}_v = \mathrm{Conv}\big(\mathrm{Concat}(Z_v, R_v)\big). \tag{4}$$

We name this operation RayConv. With it, the obtained feature representation $\hat{Z}_v$ is used for the projective attention by replacing $Z_v$ in Eqn. (3). Such a drop-in replacement introduces negligible computation while injecting a strong multi-view geometrical prior to augment the projective attention scheme, thus helping to predict the refined 3D joint position more precisely.
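As a rough sketch of the idea, the code below computes a unit ray direction for every pixel and mixes it into the features with a 1×1 convolution. The camera convention (x_cam = R · x_world + t), the half-pixel offset, the 1×1 kernel size, and all names are assumptions chosen for brevity; the text only specifies a standard convolution over the channel-wise concatenation.

```python
import numpy as np

def camera_rays(intrinsics, rotation, height, width):
    """Unit ray direction in world coordinates for every pixel; returns (3, H, W)."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    pixels = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs)], axis=0).reshape(3, -1)
    rays = rotation.T @ (np.linalg.inv(intrinsics) @ pixels)  # camera frame -> world frame
    rays /= np.linalg.norm(rays, axis=0, keepdims=True)
    return rays.reshape(3, height, width)

def ray_conv(feat, rays, weight, bias):
    """1x1 convolution over the concatenation of features and ray directions.

    feat: (C, H, W); rays: (3, H, W); weight: (C_out, C + 3); bias: (C_out,)
    """
    stacked = np.concatenate([feat, rays], axis=0)
    return np.einsum("oc,chw->ohw", weight, stacked) + bias[:, None, None]

rng = np.random.default_rng(0)
C, H, W = 8, 6, 8  # illustrative sizes only
intrinsics = np.array([[20.0, 0.0, 4.5], [0.0, 20.0, 3.5], [0.0, 0.0, 1.0]])
rays = camera_rays(intrinsics, np.eye(3), H, W)
feat = rng.normal(size=(C, H, W))
updated = ray_conv(feat, rays, rng.normal(size=(C, C + 3)), np.zeros(C))
```

The output keeps the feature map's shape, so it can stand in for the original features wherever they are sampled.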

3.3 Architecture

Our overall architecture (Fig. 2 (a)) is pleasantly simple. It adopts a convolutional neural network designed for 2D pose estimation Xiao et al. (2018) to obtain high-resolution image features $\{Z_v\}_{v=1}^{V}$ from the multi-view inputs. The features are then fed into the transformer decoder, which consists of multiple decoder layers, to predict the 3D joint locations. Each layer conducts a self-attention to perform pair-wise interaction between all the joints from all the persons in the scene; a projective attention to selectively gather the complementary multi-view information; and a feed-forward regression to predict the 3D joint positions and their confidence scores. Specifically, the transformer decoder applies a multi-layer progressive regression scheme, i.e., each decoder layer outputs 3D joint offsets to refine the input 3D joint positions from the previous layer.
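The progressive regression scheme can be sketched as a loop in which every layer predicts offsets that are added to the current joint positions. The toy "layer" below, which simply moves each joint halfway towards a fixed target, is a stand-in for a real decoder layer and is purely an assumption for illustration.

```python
def progressive_regression(initial_positions, layer_offset_fns):
    """Refine 3D joint positions layer by layer; each layer returns per-joint offsets."""
    positions = [list(p) for p in initial_positions]
    history = []
    for predict_offsets in layer_offset_fns:
        offsets = predict_offsets(positions)
        positions = [[c + d for c, d in zip(p, o)] for p, o in zip(positions, offsets)]
        history.append(positions)
    return positions, history

target = [1.0, 2.0, 3.0]  # hypothetical true joint location

def halfway_layer(positions):
    """Toy stand-in for a decoder layer: offset that halves the remaining error."""
    return [[0.5 * (t - c) for c, t in zip(p, target)] for p in positions]

final, history = progressive_regression([[0.0, 0.0, 0.0]], [halfway_layer] * 6)
errors = [abs(step[0][0] - target[0]) for step in history]
print(final[0], errors)
```

In this toy setting the error halves at every layer, which mirrors the intended behaviour: later layers make smaller corrections to an already reasonable estimate.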

Extending to Body Mesh Recovery

MvP learns skeleton joint feature representations and is extendable to recovering human mesh with a parametric body mesh model Loper et al. (2015). Specifically, after average pooling the joint features into a per-person feature, a feed-forward network is used to predict the corresponding body mesh represented by the parametric SMPL model Loper et al. (2015). Similar to the joint location prediction, the SMPL parameters follow the multi-layer progressive regression scheme.

3.4 Training

MvP infers a fixed set of $M$ joint locations for $N$ different persons, where $M = NJ$. The main training challenge is how to associate the skeleton joints correctly for different person instances. Unlike the post-hoc grouping of detected skeleton joints as in bottom-up pose estimation methods Papandreou et al. (2018); Kreiss et al. (2019), MvP learns to directly predict the multi-joint 3D human pose in a group-wise fashion, as shown in Fig. 2 (a). This is achieved by a grouped matching strategy during model training.

Grouped Matching

Given the predicted joint positions $\{y_m\}_{m=1}^{M} \subset \mathbb{R}^{3}$ and associated confidence scores $\{s_m\}_{m=1}^{M}$, we group every $J$ consecutive joint predictions into a per-person pose estimation $\{Y_n\}_{n=1}^{N} \subset \mathbb{R}^{J \times 3}$, and average their corresponding confidence scores to obtain the per-person confidence scores $\{p_n\}_{n=1}^{N}$. The same grouping strategy is used during inference.
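The grouping step is simple enough to sketch directly. The helper below slices a flat list of joint predictions into per-person poses and averages the joint confidences; the function name and the toy values are illustrative.

```python
def group_predictions(joint_positions, joint_scores, joints_per_person):
    """Split M = N * J flat joint predictions into N poses with averaged confidences."""
    if len(joint_positions) != len(joint_scores):
        raise ValueError("positions and scores must have the same length")
    if len(joint_positions) % joints_per_person != 0:
        raise ValueError("number of joints must be a multiple of joints_per_person")
    poses, scores = [], []
    for start in range(0, len(joint_positions), joints_per_person):
        end = start + joints_per_person
        poses.append(joint_positions[start:end])                       # J consecutive joints
        scores.append(sum(joint_scores[start:end]) / joints_per_person)  # mean confidence
    return poses, scores

# Two persons with three joints each (toy numbers).
flat_positions = [(0, 0, 0), (0, 1, 0), (0, 2, 0), (5, 0, 0), (5, 1, 0), (5, 2, 0)]
flat_scores = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3]
poses, person_scores = group_predictions(flat_positions, flat_scores, joints_per_person=3)
```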

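The ground truth set $Y^*$ of 3D poses of different person instances is smaller than the prediction set of size $N$, so it is padded to size $N$ with the empty element $\varnothing$. Then we find a bipartite matching between the prediction set and the ground truth set by searching for a permutation $\hat{\sigma} \in \mathfrak{S}_N$ of $N$ elements that achieves the lowest matching cost:

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{n=1}^{N} \mathcal{L}_{\text{match}}\big(Y^*_n, Y_{\sigma(n)}, p_{\sigma(n)}\big). \tag{5}$$

We consider both the regressed 3D joint position and confidence score for the matching cost:

$$\mathcal{L}_{\text{match}}\big(Y^*_n, Y_{\sigma(n)}, p_{\sigma(n)}\big) = -p_{\sigma(n)} + \mathcal{L}_1\big(Y^*_n, Y_{\sigma(n)}\big), \tag{6}$$

where $Y^*_n \neq \varnothing$, and $\mathcal{L}_1$ computes the $L_1$ loss error. Following Carion et al. (2020); Sutskever et al. (2014), we employ the Hungarian algorithm Kuhn (1955) to compute the optimal assignment with the above matching cost.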
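To make the matching concrete, here is a small sketch that finds the lowest-cost assignment by brute-force enumeration of permutations. This is a stand-in for the Hungarian algorithm that is only practical for tiny sets; the cost used here (negative confidence plus a summed L1 distance, and zero cost for padded slots) is an assumption about the exact form, and all names are illustrative.

```python
from itertools import permutations

def l1_distance(pose_a, pose_b):
    """Summed absolute coordinate difference between two poses (lists of 3D joints)."""
    return sum(abs(a - b) for ja, jb in zip(pose_a, pose_b) for a, b in zip(ja, jb))

def match_cost(gt_pose, pred_pose, pred_score):
    """Cost of pairing one ground-truth slot with one prediction; padding costs nothing."""
    if gt_pose is None:
        return 0.0
    return -pred_score + l1_distance(gt_pose, pred_pose)

def best_assignment(gt_poses, pred_poses, pred_scores):
    """Brute-force search; perm[i] is the prediction matched to ground-truth slot i."""
    n = len(pred_poses)
    padded = list(gt_poses) + [None] * (n - len(gt_poses))  # pad with empty elements
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(
            match_cost(padded[i], pred_poses[perm[i]], pred_scores[perm[i]])
            for i in range(n)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

# Two ground-truth persons, three predictions, one joint per pose for brevity.
gt_poses = [[(0.0, 0.0, 0.0)], [(1.0, 1.0, 1.1)]]
pred_poses = [[(5.0, 5.0, 5.0)], [(0.0, 0.0, 0.1)], [(1.0, 1.0, 1.0)]]
pred_scores = [0.2, 0.9, 0.8]
perm, cost = best_assignment(gt_poses, pred_poses, pred_scores)
print(perm, cost)
```

In practice a polynomial-time solver replaces the enumeration, since the number of permutations grows factorially with the number of predictions.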

Objective Function

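We compute the Hungarian loss with the obtained optimal assignment $\hat{\sigma}$:

$$\mathcal{L}_{\text{Hungarian}}(Y^*, Y, P) = \sum_{n=1}^{N} \Big[ \mathcal{L}_{\text{conf}}\big(Y^*_n, p_{\hat{\sigma}(n)}\big) + \mathbb{1}_{\{Y^*_n \neq \varnothing\}}\, \lambda\, \mathcal{L}_{\text{pose}}\big(Y^*_n, Y_{\hat{\sigma}(n)}\big) \Big]. \tag{7}$$

Here $\mathcal{L}_{\text{conf}}$ and $\mathcal{L}_{\text{pose}}$ are losses for confidence score and pose regression, respectively, and $\lambda$ balances the two loss terms. We use focal loss Lin et al. (2017) for confidence prediction, which adaptively balances the positive and negative samples. For pose regression, we compute the $L_1$ loss for 3D joints and their projected 2D joints in different views. To learn multi-layer progressive regression, the above matching and loss are applied for each decoder layer. The total loss is thus $\mathcal{L}_{\text{total}} = \sum_{l=1}^{L} \mathcal{L}^{l}_{\text{Hungarian}}$, where $\mathcal{L}^{l}_{\text{Hungarian}}$ denotes the loss of the $l$-th decoder layer and $L$ is the number of decoder layers. When extending MvP to body mesh recovery, we apply the $L_1$ loss for 3D joints from the SMPL model and their 2D projections, as well as an adversarial loss following HMR Kanazawa et al. (2018); Jiang et al. (2020); Zhang et al. (2021), due to the lack of ground-truth SMPL parameters.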
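A simplified sketch of the objective is given below. It combines a binary focal loss on the confidence with an L1 pose term for matched predictions and sums the result over decoder layers. The focal-loss constants (alpha = 0.25, gamma = 2), the balancing weight of 1.0, and the omission of the projected 2D term are assumptions made to keep the example short.

```python
import math

def focal_loss(p, is_positive, alpha=0.25, gamma=2.0):
    """Binary focal loss for one confidence score p in (0, 1)."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # avoid log(0)
    if is_positive:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)

def l1_pose_loss(gt_pose, pred_pose):
    """Summed absolute coordinate error over all joints (3D term only)."""
    return sum(abs(g - p) for gj, pj in zip(gt_pose, pred_pose) for g, p in zip(gj, pj))

def layer_loss(gt_poses, pred_poses, pred_scores, assignment, lam=1.0):
    """assignment[i] is the prediction matched to slot i; slots >= len(gt_poses) are padding."""
    total = 0.0
    for slot, pred_idx in enumerate(assignment):
        matched = slot < len(gt_poses)
        total += focal_loss(pred_scores[pred_idx], matched)   # confidence term for every slot
        if matched:
            total += lam * l1_pose_loss(gt_poses[slot], pred_poses[pred_idx])
    return total

def total_loss(layers, gt_poses, lam=1.0):
    """Sum per-layer losses; each layer supplies (poses, scores, assignment)."""
    return sum(layer_loss(gt_poses, poses, scores, assign, lam) for poses, scores, assign in layers)

gt_poses = [[(0.0, 0.0, 0.0)]]
good = ([[(0.0, 0.0, 0.1)], [(5.0, 5.0, 5.0)]], [0.9, 0.1], (0, 1))
poor = ([[(0.0, 0.0, 1.0)], [(5.0, 5.0, 5.0)]], [0.5, 0.5], (0, 1))
good_loss = layer_loss(gt_poses, *good)
poor_loss = layer_loss(gt_poses, *poor)
two_layer_loss = total_loss([good, good], gt_poses)
```

Unmatched predictions only contribute through the confidence term, which pushes their scores towards zero so they can be filtered out at inference.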

4 Experiments

In this section, we aim to answer the following questions. 1) Can MvP provide both efficient and accurate multi-person 3D pose estimation? 2) How does the proposed attention mechanism help fuse multi-view multi-person skeleton joint information? 3) How does each individual design choice affect model performance? To this end, we conduct extensive experiments on several benchmark datasets.

| Methods | AP25 | AP50 | AP100 | AP150 | Recall | MPJPE [mm] | Time [ms] |
|---|---|---|---|---|---|---|---|
| VoxelPose Tu et al. (2020) | 84.0 | 96.4 | 97.5 | 97.8 | 98.1 | 17.8 | 320 |
| MvP (Ours) | 92.3 | 96.6 | 97.5 | 97.7 | 98.2 | 15.8 | 170 |

Table 1: Results on the Panoptic dataset. MvP is more accurate and faster than VoxelPose.


Panoptic Joo et al. (2017) is a large-scale benchmark with 3D skeleton joint annotations. It captures daily social activities in an indoor environment. We conduct extensive experiments on Panoptic to evaluate and analyze our approach. Following VoxelPose Tu et al. (2020), we use the same data sequences except ‘160906_band3’ in the training set due to broken images. Unless otherwise stated, we use five HD cameras (3, 6, 12, 13, 23) in our experiments. All results reported in the experiments follow the same data setup. We use Average Precision (AP) and Recall Tu et al. (2020), as well as Mean Per Joint Position Error (MPJPE), as evaluation metrics.
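For reference, MPJPE is commonly computed as the mean Euclidean distance between predicted and ground-truth joints. The sketch below follows that standard definition; it assumes the pose has already been matched to its ground truth and that coordinates are in millimetres.

```python
import math

def mpjpe(pred_pose, gt_pose):
    """Mean Euclidean distance between corresponding joints of two poses."""
    if len(pred_pose) != len(gt_pose):
        raise ValueError("poses must have the same number of joints")
    distances = [math.dist(p, g) for p, g in zip(pred_pose, gt_pose)]
    return sum(distances) / len(distances)

# Toy two-joint pose, coordinates in millimetres.
gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (100.0, 0.0, 12.0)]
error_mm = mpjpe(pred, gt)  # joint errors are 5 mm and 12 mm
```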

Shelf and Campus Belagiannis et al. (2014) are two multi-person datasets capturing indoor and outdoor environments, respectively. We split them into training and testing sets following Belagiannis et al. (2014); Dong et al. (2019); Tu et al. (2020). We report Percentage of Correct Parts (PCP) for these two datasets.

Implementation Details

Following VoxelPose Tu et al. (2020), we adopt a pose estimation model Xiao et al. (2018) built upon ResNet-50 He et al. (2016) for multi-view image feature extraction. Unless otherwise stated, we use a stack of six transformer decoder layers. The model is trained for 40 epochs with the Adam optimizer. During inference, a confidence threshold of 0.1 is used to filter out redundant predictions. Please refer to the supplementary material for more implementation details.

4.1 Main Results

Figure 3: Inference time versus the number of person instances. Benefiting from its direct inference framework, MvP maintains almost constant inference time regardless of the number of persons.


We first evaluate our MvP model on the challenging Panoptic dataset and compare it with the state-of-the-art VoxelPose model Tu et al. (2020). As shown in Table 1, MvP achieves 92.3 AP, improving upon VoxelPose by 9.8%, and achieves a lower MPJPE (15.8 mm vs. 17.8 mm). Moreover, MvP only requires 170 ms to process a multi-view input, nearly 2× faster than VoxelPose at 320 ms (inference time is averaged per sample on the Panoptic test set; for all methods it is measured on a GeForce RTX 2080 Ti GPU and an Intel i7-6900K CPU @ 3.20GHz). These results demonstrate both the accuracy and efficiency advantages of estimating the 3D poses of multiple persons in a direct regression paradigm. To further demonstrate the efficiency of MvP, we compare its inference time with that of VoxelPose when processing different numbers of person instances. As shown in Fig. 3, the inference time of VoxelPose grows linearly with the number of persons in the scene due to its per-person regression paradigm. In contrast, MvP keeps an almost constant inference time regardless of the number of instances in the scene. Notably, it takes only 185 ms for MvP to process scenes even with 100 person instances (the blue line), demonstrating its great potential to handle crowded scenarios.
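The headline numbers can be checked with a few lines of arithmetic on the Table 1 values. The reading of the 9.8% figure as a relative (not absolute) AP improvement is an inference from those values, not something the text states explicitly.

```python
voxelpose = {"ap": 84.0, "mpjpe_mm": 17.8, "time_ms": 320}
mvp = {"ap": 92.3, "mpjpe_mm": 15.8, "time_ms": 170}

absolute_gain = mvp["ap"] - voxelpose["ap"]              # about 8.3 AP points
relative_gain = absolute_gain / voxelpose["ap"] * 100    # about 9.9 percent
speedup = voxelpose["time_ms"] / mvp["time_ms"]          # about 1.9x
mpjpe_drop = voxelpose["mpjpe_mm"] - mvp["mpjpe_mm"]     # about 2.0 mm

print(f"absolute AP gain: {absolute_gain:.1f}")
print(f"relative AP gain: {relative_gain:.2f}%")
print(f"speed-up: {speedup:.2f}x")
print(f"MPJPE reduction: {mpjpe_drop:.1f} mm")
```

The absolute gain is 8.3 AP points, while the relative gain is roughly 9.9%, which matches the quoted 9.8% once truncated.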

Shelf and Campus

We further compare MvP with state-of-the-art approaches on the Shelf and Campus datasets. The reconstruction-based methods Belagiannis et al. (2015); Ershadi-Nasab et al. (2018); Dong et al. (2019) use a 3D pictorial model Belagiannis et al. (2015); Dong et al. (2019) or a conditional random field Ershadi-Nasab et al. (2018) within a multi-stage paradigm, and the volumetric approach VoxelPose Tu et al. (2020) relies heavily on computationally intensive intermediate tasks. As shown in Table 2, MvP achieves the best or tied-best PCP for every actor on the Shelf dataset and the best average (97.4). Moreover, it obtains a result comparable to VoxelPose Tu et al. (2020) on the Campus dataset (96.6 vs. 96.7 on average) without relying on any intermediate task. These results further confirm the effectiveness of MvP for estimating 3D poses of multiple persons directly.

| Methods | Shelf Actor 1 | Shelf Actor 2 | Shelf Actor 3 | Shelf Average | Campus Actor 1 | Campus Actor 2 | Campus Actor 3 | Campus Average |
|---|---|---|---|---|---|---|---|---|
| Belagiannis et al. (2015) | 75.3 | 69.7 | 87.6 | 77.5 | 93.5 | 75.7 | 84.4 | 84.5 |
| Ershadi-Nasab et al. (2018) | 93.3 | 75.9 | 94.8 | 88.0 | 94.2 | 92.9 | 84.6 | 90.6 |
| Dong et al. (2019) | 98.8 | 94.1 | 97.8 | 96.9 | 97.6 | 93.3 | 98.0 | 96.3 |
| VoxelPose Tu et al. (2020) | 99.3 | 94.1 | 97.6 | 97.0 | 97.6 | 93.8 | 98.8 | 96.7 |
| MvP (Ours) | 99.3 | 95.1 | 97.8 | 97.4 | 98.2 | 94.1 | 97.4 | 96.6 |

Table 2: Results (in PCP) on Shelf and Campus datasets.

4.2 Visualization

3D Pose and Body Mesh Estimation

We visualize some 3D pose estimations of MvP on the challenging Panoptic dataset in Fig. 4. It can be observed that MvP is robust to large pose deformation (the 1st example) and severe occlusion (the 2nd example), and can achieve geometrically plausible results w.r.t. different viewpoints (the rightmost column). Moreover, MvP is extendable to body mesh recovery and can achieve fairly good reconstruction results (the 2nd and 4th rows). All these results verify both effectiveness and extendability of MvP. Please see supplementary for more examples.

Attention Mechanism

We visualize the projective attention and the self-attention in Fig. 5. Benefiting from the 3D-to-2D projection, the projective attention can accurately locate the skeleton joint in each camera view (the green point) based on the current estimated 3D joint location. We observe it learns to gather adaptive local context information (the red points) with the deformable sampling operation. For instance, when regressing the 3D position of mid-hip (the 1st example), the projective attention selectively attends to informative joints such as the left and right hips as well as thorax, which offers sufficient contextual information for accurate estimation. We also visualize the self-attention, which learns pair-wise interaction between all the skeleton joints in the scene. From the 3D plot in Fig. 5, we can observe a certain skeleton joint mainly attends to other joints of the same person instance (more opaque). It also attends to joints from other person instances, but with less attention (more transparent). This phenomenon is reasonable as the skeleton joints of a human body are strongly correlated to each other, e.g., with certain pose priors and bone length.

Figure 4: Example 3D pose estimations from the Panoptic dataset. The left four columns show the multi-view inputs and the corresponding body mesh estimations. The rightmost column shows the estimated 3D poses from two different viewpoints. Best viewed in color.
Figure 5: Visualization of projective attention and self-attention on example skeleton joints. The attention weights are obtained from a single decoder layer of a trained model. Projective attention (in the cropped image triplets): the green points denote the projected anchor points in each camera view, and the red points denote the offsetted spatial locations, with brighter color for stronger attention. Self-attention (in the 3D skeleton plots): attention from an example skeleton joint (green) to all the other skeleton joints (red) in the scene. The color density indicates attention weight. Best viewed in color and zoomed in.

4.3 Ablation

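| RayConv | AP25 | AP100 | MPJPE [mm] |
|---|---|---|---|
| w/ | 92.3 | 97.5 | 15.8 |
| w/o | 87.5 | 96.2 | 17.4 |

(a) The effect of RayConv. w/o means removing RayConv.

| Query embedding | AP25 | AP100 | MPJPE [mm] |
|---|---|---|---|
| Per-joint | 67.4 | 84.7 | 41.2 |
| Hier. | 82.5 | 93.2 | 19.5 |
| Hier.+ad. | 92.3 | 97.5 | 15.8 |

(b) Different joint query embedding schemes.

| Threshold | AP25 | AP100 | MPJPE [mm] |
|---|---|---|---|
| 0.0 | 93.1 | 98.5 | 16.3 |
| 0.1 | 92.3 | 97.5 | 15.8 |
| 0.2 | 91.1 | 96.2 | 15.5 |
| 0.4 | 89.2 | 93.7 | 15.0 |

(c) Different confidence thresholds during evaluation.

| Decoder layers | AP25 | AP100 | MPJPE [mm] |
|---|---|---|---|
| 2 | 6.3 | 92.5 | 49.6 |
| 3 | 63.4 | 95.6 | 22.8 |
| 4 | 86.8 | 96.8 | 17.5 |
| 5 | 91.8 | 97.6 | 16.2 |
| 6 | 92.3 | 97.5 | 15.8 |
| 7 | 92.0 | 97.5 | 15.9 |

(d) Number of decoder layers.

| Camera views | AP25 | AP100 | MPJPE [mm] |
|---|---|---|---|
| 1 | 4.7 | 61.0 | 93.8 |
| 2 | 37.7 | 93.0 | 34.8 |
| 3 | 71.8 | 95.1 | 21.1 |
| 4 | 84.1 | 96.7 | 19.3 |
| 5 | 92.3 | 97.5 | 15.8 |

(e) Number of camera views.

| Deformable points | AP25 | AP100 | MPJPE [mm] |
|---|---|---|---|
| 1 | 88.6 | 96.3 | 18.2 |
| 2 | 89.3 | 97.5 | 17.4 |
| 4 | 92.3 | 97.7 | 15.8 |
| 8 | 84.4 | 91.1 | 20.3 |

(f) Number of deformable points $K$.

Table 3: Ablations on Panoptic. In (b), Hier. denotes the hierarchical query embedding scheme, and Hier.+ad. means further adding the adaptation strategy. Please see the supplement for more ablations.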

Importance of RayConv

MvP introduces RayConv to encode multi-view geometric information, i.e., camera ray directions, into the image feature representations. As shown in Table 3(a), removing RayConv degrades performance significantly: AP drops by 4.8 and MPJPE increases by 1.6 mm. This indicates that the multi-view geometrical information is important for the model to precisely localize the skeleton joints in 3D space. Without RayConv, the transformer decoder cannot accurately capture positional information in 3D space, resulting in the performance drop.

Importance of Hierarchical Query Embedding

As shown in Table 3(b), compared with the straightforward and unstructured per-joint query embedding scheme, the proposed hierarchical query embedding boosts the performance sharply, raising AP from 67.4 to 82.5 and reducing MPJPE from 41.2 to 19.5. This clearly verifies that introducing the person-level queries to collaborate with the joint-level queries better exploits human body structural information and helps the model localize the joints. On top of the hierarchical query embedding scheme, adding the query adaptation strategy further improves the performance significantly, reaching an AP of 92.3 and an MPJPE of 15.8. This shows the proposed approach effectively adapts the query embeddings to the target scene, and such adaptation is indeed beneficial for the generalization of MvP to novel scenes.

Different Model Designs

We also examine the effects of varying the following designs of the MvP model to gain a better understanding of them.

Confidence Threshold. During inference, a confidence threshold is used to filter out low-confidence and erroneous pose predictions and obtain the final result. A higher confidence threshold selects predictions more restrictively. As shown in Table 3(c), a higher confidence threshold yields lower MPJPE because it selects more accurate predictions, but it also filters out some true positive predictions and thus reduces the average precision.
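The thresholding step described above can be illustrated with a minimal sketch. This is not the released implementation: the `PosePrediction` container, the `filter_by_confidence` helper and the example scores are hypothetical and only show how a stricter threshold keeps fewer predictions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Joint = Tuple[float, float, float]


@dataclass
class PosePrediction:
    """Hypothetical container for one predicted person."""
    joints: List[Joint]   # 3D joint locations
    confidence: float     # predicted confidence score in [0, 1]


def filter_by_confidence(preds: List[PosePrediction],
                         threshold: float) -> List[PosePrediction]:
    """Keep only predictions whose confidence is at least `threshold`."""
    return [p for p in preds if p.confidence >= threshold]


# Made-up predictions, for illustration only.
preds = [
    PosePrediction(joints=[(0.0, 0.0, 1.0)], confidence=0.95),
    PosePrediction(joints=[(1.0, 0.5, 1.1)], confidence=0.40),
    PosePrediction(joints=[(2.0, 1.0, 0.9)], confidence=0.08),
]

kept_loose = filter_by_confidence(preds, 0.1)   # permissive threshold
kept_strict = filter_by_confidence(preds, 0.5)  # restrictive threshold
print(len(kept_loose), len(kept_strict))
```

With the stricter threshold only the most confident prediction survives, which mirrors the trade-off in the text: the retained poses tend to be more accurate, but correct low-confidence poses are discarded as well.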

Number of Decoder Layers. Decoder layers are used for refining the pose estimation, so stacking more decoder layers gives better performance (Table 3(d)). For instance, the MPJPE is as high as 49.6 when using only two decoder layers, but it drops to 22.8 with three decoder layers. This confirms that the progressive refinement strategy of MvP is effective. However, the benefit of adding decoder layers diminishes once the number of layers is large enough (the results with six and seven layers are nearly identical), implying the model has reached the ceiling of its capacity.

Number of Camera Views. Multi-view inputs provide complementary information to each other, which is extremely useful for handling challenging factors in 3D pose estimation such as occlusions. We vary the number of camera views to examine whether MvP can effectively fuse and leverage multi-view information to continuously improve the pose estimation quality (Table 3(e)). As expected, the 3D pose estimation accuracy increases monotonically with more camera views, demonstrating the capacity of MvP to fuse multi-view information.

Number of Deformable Sampling Points. Table 3(f) shows the effect of the number of deformable sampling points used in the projective attention. With only one deformable point, MvP already achieves a respectable result, i.e., 88.6 in AP and 18.2 in MPJPE. Using more sampling points further improves the performance, demonstrating that the projective attention is effective at aggregating information from useful locations. With four sampling points, the model gives the best result. Increasing the number further to 8 causes the performance to drop, likely because too many deformable points introduce redundant information and make the model more difficult to optimize.

5 Conclusion

We introduced a direct and efficient model, named Multi-view Pose transformer (MvP), to address the challenging multi-view multi-person 3D human pose estimation problem. Unlike existing methods that rely on tedious intermediate tasks, MvP substantially simplifies the pipeline into direct regression through a carefully designed transformer-like architecture with a novel hierarchical joint query embedding scheme and a projective attention mechanism. We conducted extensive experiments to verify its superior performance and speed over well-established baselines.

We empirically found that MvP needs sufficient training data since it learns the 3D geometry implicitly. In the future, we will study how to improve the data efficiency of MvP with strategies such as self-supervised pre-training or more advanced approaches. Similar to prior works, we also found that MvP suffers a performance drop in cross-camera generalization, that is, when generalizing to novel camera views. We will explore approaches such as disentangling camera parameters from multi-view feature learning to improve this aspect. Besides, we will explore large-scale applications of MvP and further extend it to other relevant tasks. Thanks to its efficiency, MvP should scale to very crowded scenes with many persons. Moreover, the framework of MvP is general and thus extensible to other 3D modeling tasks such as dense mesh recovery of common objects.


  • [1] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic (2014) 3D pictorial structures for multiple human pose estimation. In CVPR, Cited by: §1, §4, Qualitative Result.
  • [2] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic (2015) 3d pictorial structures revisited: multiple human pose estimation. IEEE transactions on pattern analysis and machine intelligence 38 (10), pp. 1929–1942. Cited by: §4.1, Table 2.
  • [3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In ECCV, Cited by: §2, §3.4, Replacing Camera Ray Directions with 2D Spatial Coordinates.
  • [4] H. Chen, P. Guo, P. Li, G. H. Lee, and G. Chirikjian (2020) Multi-person 3d pose estimation in crowded scenes based on multi-view geometry. In ECCV, Cited by: §1, §2.
  • [5] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In ICCV, Cited by: §3.2.
  • [6] J. Dong, W. Jiang, Q. Huang, H. Bao, and X. Zhou (2019) Fast and robust multi-person 3d pose estimation from multiple views. In CVPR, Cited by: §1, §2, §4, §4.1, Table 2.
  • [7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv. Cited by: Replacing Camera Ray Directions with 2D Spatial Coordinates.
  • [8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv. Cited by: §2.
  • [9] S. Ershadi-Nasab, E. Noury, S. Kasaei, and E. Sanaei (2018) Multiple human 3d pose estimation from multiview images. Multimedia Tools and Applications 77 (12), pp. 15573–15601. Cited by: §4.1, Table 2.
  • [10] K. Gong, J. Zhang, and J. Feng (2021) PoseAug: a differentiable pose augmentation framework for 3d human pose estimation. In CVPR, Cited by: §2.
  • [11] R. Hartley and A. Zisserman (2003) Multiple view geometry in computer vision. 2 edition, Cambridge University Press, New York, NY, USA. External Links: ISBN 0521540518 Cited by: §2, §3.2.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.
  • [13] Y. He, R. Yan, K. Fragkiadaki, and S. Yu (2020) Epipolar transformers. In CVPR, Cited by: §2.
  • [14] C. Huang, S. Jiang, Y. Li, Z. Zhang, J. Traish, C. Deng, S. Ferguson, and R. Y. Da Xu (2020) End-to-end dynamic matching network for multi-view multi-person 3d pose estimation. In ECCV, Cited by: §1, §2.
  • [15] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2014) Human3.6M: large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans. on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339. Cited by: Quantitative Result.
  • [16] K. Iskakov, E. Burkov, V. Lempitsky, and Y. Malkov (2019) Learnable triangulation of human pose. In ICCV, Cited by: §2, Quantitative Result.
  • [17] W. Jiang, N. Kolotouros, G. Pavlakos, X. Zhou, and K. Daniilidis (2020) Coherent reconstruction of multiple humans from a single image. In CVPR, Cited by: §3.4.
  • [18] Y. Jiang, S. Chang, and Z. Wang (2021) Transgan: two transformers can make one strong gan. arXiv. Cited by: §2.
  • [19] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh (2015) Panoptic studio: a massively multiview system for social motion capture. In ICCV, Cited by: §1.
  • [20] H. Joo, T. Simon, X. Li, H. Liu, L. Tan, L. Gui, S. Banerjee, T. Godisart, B. Nabbe, I. Matthews, et al. (2017) Panoptic studio: a massively multiview system for social interaction capture. IEEE transactions on pattern analysis and machine intelligence 41 (1), pp. 190–204. Cited by: §4, Qualitative Result.
  • [21] A. Kadkhodamohammadi and N. Padoy (2021) A generalizable approach for multi-view 3d human pose regression. Machine Vision and Applications 32 (1), pp. 1–14. Cited by: §2.
  • [22] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2018) End-to-end recovery of human shape and pose. In CVPR, Cited by: §3.4.
  • [23] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: More Implementation Details.
  • [24] S. Kreiss, L. Bertoni, and A. Alahi (2019) Pifpaf: composite fields for human pose estimation. In CVPR, Cited by: §3.4.
  • [25] H. W. Kuhn (1955) The hungarian method for the assignment problem. Naval research logistics quarterly 2 (1-2), pp. 83–97. Cited by: §3.4.
  • [26] J. Lin and G. H. Lee (2021) Multi-view multi-person 3d pose estimation with plane sweep stereo. In CVPR, Cited by: §2.
  • [27] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, Cited by: §3.4.
  • [28] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM transactions on graphics (TOG) 34 (6), pp. 1–16. Cited by: §1, §3.3.
  • [29] J. Martinez, R. Hossain, J. Romero, and J. J. Little (2017) A simple yet effective baseline for 3d human pose estimation. In ICCV, Cited by: §2, Quantitative Result.
  • [30] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H. Seidel, W. Xu, D. Casas, and C. Theobalt (2017) Vnect: real-time 3d human pose estimation with a single rgb camera. ACM Trans. on Graphics 36 (4), pp. 44. Cited by: §2.
  • [31] X. Nie, J. Zhang, S. Yan, and J. Feng (2019) Single-stage multi-person pose machines. In ICCV, Cited by: §2.
  • [32] G. Papandreou, T. Zhu, L. Chen, S. Gidaris, J. Tompson, and K. Murphy (2018) Personlab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In ECCV, Cited by: §3.4.
  • [33] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NeurIPSw, Cited by: More Implementation Details.
  • [34] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis (2017) Harvesting multiple views for marker-less 3d human pose annotations. In CVPR, Cited by: §2.
  • [35] A. Popa, M. Zanfir, and C. Sminchisescu (2017) Deep multitask architecture for integrated 2d and 3d human sensing. In CVPR, Cited by: §2.
  • [36] H. Qiu, C. Wang, J. Wang, N. Wang, and W. Zeng (2019) Cross view fusion for 3d human pose estimation. In ICCV, Cited by: §2.
  • [37] E. Remelli, S. Han, S. Honari, P. Fua, and R. Wang (2020) Lightweight multi-view 3d pose estimation through camera-disentangled representation. In CVPR, Cited by: §2.
  • [38] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei (2018) Integral human pose regression. In ECCV, Cited by: §2.
  • [39] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. arXiv. Cited by: §3.4.
  • [40] H. Tu, C. Wang, and W. Zeng (2020) VoxelPose: towards multi-camera 3d human pose estimation in wild environment. In ECCV, Cited by: 3D-Stitching Transformer, §1, §1, §2, §4, §4, §4.1, §4.1, Table 1, Table 2, Quantitative Result.
  • [41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv. Cited by: §3.1, §3.2, §3.2.
  • [42] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia (2020) End-to-end video instance segmentation with transformers. arXiv. Cited by: §2.
  • [43] F. Wu, A. Fan, A. Baevski, Y. N. Dauphin, and M. Auli (2019) Pay less attention with lightweight and dynamic convolutions. arXiv. Cited by: §3.2.
  • [44] Z. Wu, Z. Liu, J. Lin, Y. Lin, and S. Han (2020) Lite transformer with long-short range attention. arXiv. Cited by: §3.2.
  • [45] B. Xiao, H. Wu, and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In ECCV, Cited by: §3.3, §4, More Implementation Details.
  • [46] J. Zhang, X. Nie, and J. Feng (2020) Inference stage optimization for cross-scenario 3d human pose estimation. In NeurIPS, Cited by: §2.
  • [47] J. Zhang, D. Yu, J. H. Liew, X. Nie, and J. Feng (2021) Body meshes as points. In CVPR, Cited by: §2, §3.4.
  • [48] H. Zhao, J. Jia, and V. Koltun (2020) Exploring self-attention for image recognition. In CVPR, Cited by: §3.2.
  • [49] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei (2017) Towards 3d human pose estimation in the wild: a weakly-supervised approach. In ICCV, Cited by: §2.
  • [50] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets v2: more deformable, better results. In CVPR, Cited by: §3.2.
  • [51] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2020) Deformable detr: deformable transformers for end-to-end object detection. In ICLR, Cited by: §2.

More Implementation Details

We use PyTorch Paszke et al. (2017) to implement the proposed Multi-view Pose transformer (MvP) model. Our MvP model is trained on 8 Nvidia RTX 2080 Ti GPUs, with a batch size of 1 per GPU and a total batch size of 8. We use the Adam optimizer Kingma and Ba (2015) with an initial learning rate of 1e-4 and decrease the learning rate by a factor of 0.1 at 20 epochs during training. The hyper-parameter for balancing the confidence score and pose regression losses is set to 2.5. We use the image feature representations (256-d) from the de-convolution layer of the 2D pose estimator PoseResNet Xiao et al. (2018) for multi-view inputs. Additionally, we provide the code of MvP, including the implementation of the model architecture, training and inference, in the folder “./mvp” for a better understanding of our method.
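The optimization settings above can be summarized in a small, framework-free sketch. The numeric values come from the text; the function names, the 0-indexed epoch convention, and the choice of which loss term the balancing weight multiplies are assumptions made for illustration and are not confirmed by the text.

```python
# Values stated in the implementation details.
BASE_LR = 1e-4          # initial Adam learning rate
LR_DECAY_FACTOR = 0.1   # multiplicative learning-rate decay
LR_DECAY_EPOCH = 20     # epoch at which the decay is applied
NUM_GPUS = 8
BATCH_PER_GPU = 1
LOSS_BALANCE = 2.5      # balances confidence and pose regression losses


def learning_rate(epoch: int) -> float:
    """Step schedule; ASSUMES 0-indexed epochs, so epochs 0..19 use BASE_LR."""
    return BASE_LR * (LR_DECAY_FACTOR if epoch >= LR_DECAY_EPOCH else 1.0)


def total_loss(conf_loss: float, pose_loss: float) -> float:
    """ASSUMPTION: the balancing weight multiplies the pose regression term."""
    return conf_loss + LOSS_BALANCE * pose_loss


total_batch_size = NUM_GPUS * BATCH_PER_GPU
print(total_batch_size, learning_rate(0), learning_rate(20))
```

In PyTorch, a schedule of this shape is commonly expressed with `torch.optim.Adam` together with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)`; whether the released code uses exactly this scheduler is not stated here.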

Architecture Details

Figure 6: (a) Illustration of the proposed hierarchical query embedding and the input-dependent query adaptation schemes. (b) Architecture of MvP’s decoder layer. It consists of a self-attention module, a projective attention module and a feed-forward network (FFN) with residual connections. Add means addition and Norm means normalization. Best viewed in color.

Hierarchical Joint Query Embedding

Fig. 6 (a) illustrates our proposed hierarchical query embedding scheme. As shown in Eqn. (1), each person-level query is added individually to the same set of joint-level queries to obtain the per-person customized joint queries. This scheme shares the joint-level queries across different persons and thus reduces the number of parameters (the joint embeddings) to learn, and helps the model generalize better. The generated per-person joint query embedding is further augmented by adding the scene-level feature extracted from the input images.
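The following sketch illustrates the sharing idea in plain Python. It is not the authors' code: the helper names, the random initialization, the sizes (10 persons, 15 joints, 256 dimensions) and the way the scene-level feature is supplied as a ready-made vector are all hypothetical. It only shows that adding each person-level embedding to a shared set of joint-level embeddings yields one query per (person, joint) pair while learning far fewer parameters than an unstructured per-joint table.

```python
import random
from typing import List, Optional

Vector = List[float]


def make_embedding(n: int, dim: int, rng: random.Random) -> List[Vector]:
    """Stand-in for a learnable embedding table of shape (n, dim)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def hierarchical_queries(person_q: List[Vector],
                         joint_q: List[Vector],
                         scene_feat: Optional[Vector] = None
                         ) -> List[List[Vector]]:
    """Return queries[p][j] = person_q[p] + joint_q[j] (+ scene_feat)."""
    queries = []
    for p in person_q:
        per_person = []
        for j in joint_q:
            q = [pv + jv for pv, jv in zip(p, j)]
            if scene_feat is not None:
                # Input-dependent adaptation: add a scene-level feature.
                q = [qv + sv for qv, sv in zip(q, scene_feat)]
            per_person.append(q)
        queries.append(per_person)
    return queries


rng = random.Random(0)
NUM_PERSONS, NUM_JOINTS, DIM = 10, 15, 256   # hypothetical sizes
person_q = make_embedding(NUM_PERSONS, DIM, rng)
joint_q = make_embedding(NUM_JOINTS, DIM, rng)
scene_feat = [0.0] * DIM                     # placeholder scene feature

queries = hierarchical_queries(person_q, joint_q, scene_feat)

hier_params = (NUM_PERSONS + NUM_JOINTS) * DIM   # shared joint table
flat_params = NUM_PERSONS * NUM_JOINTS * DIM     # one vector per joint
print(hier_params, flat_params)
```

Under these hypothetical sizes the hierarchical scheme stores 6400 embedding values instead of 38400, which is the parameter reduction the paragraph above refers to.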

Decoder Layer

The decoder of the MvP transformer consists of multiple decoder layers that regress the 3D joint locations progressively. Fig. 6 (b) shows the detailed architecture of a decoder layer, which contains a self-attention module to perform pair-wise interaction between all the joints from multiple persons in the scene; a projective attention module to selectively gather the complementary multi-view information; and a feed-forward network (FFN) to predict the 3D joint locations and their confidence scores.

More Ablation Studies

Replacing Camera Ray Directions with 2D Spatial Coordinates

MvP encodes camera ray directions into the multi-view image feature representations via RayConv. We also compare with a simple positional embedding baseline that uses 2D coordinates as the positional information to embed, similar to previous transformer-based models for vision tasks Carion et al. (2020); Dosovitskiy et al. (2020a). Specifically, we replace the camera ray directions with the 2D spatial coordinates of the input images in RayConv. Results are shown in Table 4. Using the 2D coordinates in RayConv results in much worse performance, i.e., 83.3 in AP and 18.1 in MPJPE. This demonstrates that such view-agnostic 2D coordinate information cannot encode multi-view geometric information well, whereas camera ray directions effectively encode the positional information of each view in 3D space and thus lead to better performance.

Positional Input        AP    AP    MPJPE
Camera Ray Directions   92.3  97.5  15.8
2D Spatial Coordinates  83.3  93.0  18.1
Table 4: Results of replacing camera ray directions with 2D coordinates in RayConv.
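To make the difference between the two positional inputs in Table 4 concrete, the sketch below computes a per-pixel camera ray direction with a standard pinhole camera model. The function name, the intrinsics and the two rotations are hypothetical, and the way RayConv fuses these directions with image features is not reproduced here. The same pixel coordinate yields different world-space rays in different cameras, whereas its 2D coordinate is identical in every view.

```python
import math
from typing import List, Sequence


def ray_direction(u: float, v: float,
                  fx: float, fy: float, cx: float, cy: float,
                  rot_world_to_cam: Sequence[Sequence[float]]) -> List[float]:
    """Unit ray direction in world coordinates for pixel (u, v).

    Assumes a pinhole camera without lens distortion and a 3x3
    world-to-camera rotation matrix given as nested sequences.
    """
    # Back-project the pixel onto the z = 1 plane in the camera frame.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Camera-to-world rotation is the transpose of world-to-camera.
    d_world = [sum(rot_world_to_cam[k][i] * d_cam[k] for k in range(3))
               for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    return [c / norm for c in d_world]


# Hypothetical intrinsics shared by both cameras.
FX = FY = 1000.0
CX, CY = 960.0, 540.0

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical second camera, rotated by 90 degrees about the y axis.
rotated = [[0.0, 0.0, -1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

ray_a = ray_direction(CX, CY, FX, FY, CX, CY, identity)
ray_b = ray_direction(CX, CY, FX, FY, CX, CY, rotated)
print(ray_a, ray_b)
```

Both calls use the same pixel (the principal point), yet the first camera looks along the world z axis and the second along the world x axis. This view dependence is the information that plain 2D coordinates cannot carry.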

Replacing Projective Attention with Dense Attention

We further investigate the effectiveness of the proposed projective attention by comparing it with dense dot-product attention, i.e., conducting attention densely over all spatial locations and camera views for multi-view information gathering. Results are given in Table 5. MvP with the dense attention (MvP-Dense) delivers very poor performance (0.0 AP and 114.5 MPJPE) since it does not exploit any 3D geometry and is thus difficult to optimize. Moreover, dense dot-product attention incurs a significantly higher computation cost than the proposed projective attention: MvP-Dense uses 31.0 GB of GPU memory, more than 5× the 6.1 GB used by MvP with the projective attention.

Models     AP    AP    MPJPE  GPU Memory (GB)
MvP-Dense  0.0   16.1  114.5  31.0
MvP        92.3  97.5  15.8   6.1
Table 5: Comparison between the dense attention and the proposed projective attention. MvP-Dense means replacing the projective attention with the dense attention. We report GPU memory cost with a batch size of 1 during training.
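A rough back-of-the-envelope count helps explain the gap in Table 5. The sketch below compares how many feature locations a single joint query attends to under each scheme. The feature-map resolution is a hypothetical value, and the count assumes that projective attention samples a fixed number of points per view; only the memory figures are taken from Table 5.

```python
NUM_VIEWS = 5               # largest view count in Table 3(e)
NUM_POINTS = 4              # best number of sampling points in Table 3(f)
FEAT_H, FEAT_W = 128, 240   # HYPOTHETICAL feature-map resolution

# Dense attention: every query attends to every location in every view.
dense_locations = NUM_VIEWS * FEAT_H * FEAT_W

# Projective attention (as assumed here): a few sampled points per view.
projective_locations = NUM_VIEWS * NUM_POINTS

# Memory ratio computed from the values reported in Table 5.
memory_ratio = 31.0 / 6.1

print(dense_locations, projective_locations)
print(round(memory_ratio, 2))
```

The number of attended locations differs by several orders of magnitude under these assumptions, while the measured memory ratio is about 5.08, since the attention maps are only one part of the total training memory.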

More Results

Quantitative Result

We also evaluate our MvP model on Human3.6M Ionescu et al. (2014), the most widely used single-person dataset, which was collected in an indoor environment. We follow the standard training and evaluation protocol Martinez et al. (2017); Iskakov et al. (2019); Tu et al. (2020) and use MPJPE as the evaluation metric. Our MvP model achieves 18.6 MPJPE, which is comparable to state-of-the-art approaches (18.6 vs. 17.7 and 19.0) Iskakov et al. (2019); Tu et al. (2020).
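For reference, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below is a minimal implementation for a single pose; the function name and the example coordinates are made up, and benchmark protocols may additionally specify details such as alignment or averaging over frames that are not covered here.

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]


def mpjpe(pred: Sequence[Joint], gt: Sequence[Joint]) -> float:
    """Mean Euclidean distance between corresponding 3D joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and equally long")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Made-up two-joint example (units are whatever the inputs use).
gt_pose = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
pred_pose = [(3.0, 4.0, 0.0), (100.0, 0.0, 12.0)]

error = mpjpe(pred_pose, gt_pose)
print(error)
```

In this example the per-joint errors are 5 and 12, so the mean is 8.5.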

Qualitative Result

Here we present more qualitative results of MvP on the Panoptic Joo et al. (2017) (Fig. 7), and Shelf and Campus Belagiannis et al. (2014) (Fig. 8) datasets. From Fig. 7 we can observe that MvP produces satisfactory 3D pose and body mesh estimations even in cases of strong pose deformation (the 1st example) and large occlusion (the 2nd and 3rd examples). Moreover, the performance of MvP is robust even in the challenging crowded scenario, as shown in the 1st example in Fig. 8.

Figure 7: Example 3D pose estimations from Panoptic dataset. The left four columns show the multi-view inputs and the corresponding body mesh estimations from MvP. The rightmost column shows the estimated 3D poses from two different views. Best viewed in color.
Figure 8: Example results of 3D pose estimation from MvP on the Shelf (the 1st and 2nd examples) and Campus (the 3rd example) datasets. The left three columns show the multi-view inputs. The rightmost column shows the estimated 3D poses from two different views. Best viewed in color.