Direct Multi-view Multi-person 3D Human Pose Estimation
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images. Instead of estimating 3D joint locations from costly volumetric representations or reconstructing the per-person 3D pose from multiple detected 2D poses as in previous methods, MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks. Specifically, MvP represents skeleton joints as learnable query embeddings and lets them progressively attend to and reason over the multi-view information from the input images to directly regress the actual 3D joint locations. To improve the accuracy of such a simple pipeline, MvP presents a hierarchical scheme to concisely represent query embeddings of multi-person skeleton joints and introduces an input-dependent query adaptation approach. Further, MvP designs a novel geometrically guided attention mechanism, called projective attention, to more precisely fuse the cross-view information for each joint. MvP also introduces a RayConv operation to integrate the view-dependent camera geometry into the feature representations for augmenting the projective attention. We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient. Notably, it achieves 92.3% AP on the challenging Panoptic dataset, improving upon the previous best approach by 9.8% AP. MvP is general and also extendable to recovering human mesh represented by the SMPL model, thus useful for modeling multi-person body shapes. Code and models are available at https://github.com/sail-sg/mvp.
Multi-view multi-person 3D pose estimation aims to localize 3D skeleton joints for each person instance in a scene from multi-view camera inputs. It is a fundamental task that benefits many real-world applications (such as surveillance, sports broadcasting, gaming and mixed reality) and has mainly been tackled by reconstruction-based Dong et al. (2019); Huang et al. (2020); Chen et al. (2020) and volumetric Tu et al. (2020) approaches in previous literature, as shown in Fig. 1 (a) and (b). The former first estimates 2D poses in each view independently, then aggregates them and reconstructs their 3D counterparts via triangulation or a 3D pictorial structure model. The volumetric approach Tu et al. (2020) first builds a 3D feature volume through heatmap estimation and 2D-to-3D un-projection, based on which instance localization and 3D pose estimation are performed for each person instance individually. Though notably accurate, both paradigms are inefficient because they rely heavily on these intermediate tasks. Moreover, they estimate the 3D pose of each person separately, making the computation cost grow linearly with the number of persons.
Aiming at a simpler and more efficient pipeline, we ask: is it possible to directly regress 3D poses from multi-view images without relying on any intermediate task? Though conceptually attractive, adopting such a direct mapping paradigm is highly non-trivial, as it remains unclear how to perform skeleton joint detection and association for multiple persons within a single stage. In this work, we address these challenges by developing a novel Multi-view Pose transformer (MvP) model which significantly simplifies multi-person 3D pose estimation. Specifically, MvP represents each skeleton joint as a learnable positional embedding, named joint query, which is fed into the model and mapped into the final 3D pose estimate directly (Fig. 1 (c)), via a specifically designed attention mechanism that fuses multi-view information and globally reasons over the joint predictions to assign them to the corresponding person instances.
We develop a novel hierarchical query embedding scheme to represent the multi-person joint queries. It shares joint embeddings across different persons and introduces person-level query embeddings to help the model learn both person-level and joint-level priors. Benefiting from exploiting the person-joint relation, the model can more accurately localize the 3D joints. Further, we propose to update the joint queries with input-dependent scene-level information (i.e., globally pooled image features from multi-view inputs) such that the learnt joint queries can adapt to the target scene with better generalization performance.
To effectively fuse the multi-view information, we propose a geometrically-guided projective attention mechanism. Instead of applying full attention to densely aggregate features across spaces and views, it projects the estimated 3D joint into 2D anchor points for different views, and then selectively fuses the multi-view local features near to these anchors to precisely refine the 3D joint location. Additionally, we propose to encode the camera rays into the multi-view feature representations via a novel RayConv operation to integrate multi-view positional information into the projective attention. In this way, the strong multi-view geometrical priors can be exploited by projective attention to obtain more accurate 3D pose estimation.
Comprehensive experiments on the 3D pose benchmark Panoptic Joo et al. (2015), as well as Shelf and Campus Belagiannis et al. (2014), demonstrate our MvP works very well. Notably, it obtains 92.3% AP on the challenging Panoptic dataset, improving upon the previous best approach VoxelPose Tu et al. (2020) by 9.8%, while achieving a nearly 2× speed-up. Moreover, the design ethos of our MvP can be easily extended to more complex tasks: we show that a simple body mesh branch with the SMPL representation Loper et al. (2015) trained on top of a pre-trained MvP can achieve competitive qualitative results.
Our contributions are summarized as follows: 1) We strive for simplicity in addressing the challenging multi-view multi-person 3D pose estimation problem by casting it as a direct regression problem and accordingly develop a novel Multi-view Pose transformer (MvP) model, which achieves state-of-the-art results on the challenging Panoptic benchmark. 2) Different from query embedding designs in most transformer models, we propose a more tailored and concise hierarchical joint query embedding scheme to enable the model to effectively encode person-joint relation. Additionally, we mitigate the commonly faced generalization issue by a simple query adaptation strategy. 3) We propose a novel projective attention module along with a RayConv operation for fusing multi-view information effectively, which we believe are also inspiring for model designs in other multi-view 3D tasks.
3D pose estimation from monocular inputs Martinez et al. (2017); Mehta et al. (2017); Zhou et al. (2017); Popa et al. (2017); Sun et al. (2018); Nie et al. (2019); Zhang et al. (2020); Gong et al. (2021); Zhang et al. (2021) is an ill-posed problem as multiple 3D predictions may result in the same 2D projection. To alleviate such projective ambiguities, multi-view methods have been explored. Research works on single-person scenes use either multi-view geometry Hartley and Zisserman (2003) for feature fusion Qiu et al. (2019); He et al. (2020) and triangulation Iskakov et al. (2019); Remelli et al. (2020), or pictorial structure models for fast and robust 3D pose reconstruction Pavlakos et al. (2017); Qiu et al. (2019), achieving promising results. However, the problem becomes more challenging as we progress towards multi-person scenes. Current approaches mainly exploit a multi-stage pipeline for multi-person tasks, including reconstruction-based Dong et al. (2019); Chen et al. (2020); Huang et al. (2020); Kadkhodamohammadi and Padoy (2021); Lin and Lee (2021) and volumetric Tu et al. (2020) paradigms. Despite their notable accuracy, these methods suffer from the expensive computation cost of the intermediate tasks, such as cross-view matching and heatmap back-projection. Moreover, the total computation cost grows linearly with the number of persons in the scene, making them hardly scalable to larger scenes. Different from all previous approaches that rely on a multi-stage pipeline with computation redundancy, our method treats multi-person 3D pose estimation as a direct regression problem based on a novel Multi-view Pose transformer model, enabling an intermediate-task-free, single-stage solution.
Driven by the recent success in natural language fields, there has been growing interest in exploring Transformers for computer vision tasks, such as image recognition Dosovitskiy et al. (2020b) and generation Jiang et al. (2021), as well as more complicated object detection Carion et al. (2020); Zhu et al. (2020) and video instance segmentation Wang et al. (2020). However, multi-person 3D pose estimation has not been explored along this direction. In this study, we propose a novel Multi-view Pose Transformer architecture with a joint query embedding scheme and a projective attention module to regress 3D skeleton joints from multi-view images directly, delivering a simplified and effective pipeline.
To build a direct multi-person 3D pose estimation framework from multi-view images, we introduce a novel Multi-view Pose transformer (MvP). MvP takes in the multi-view feature representations, and transforms them into groups of 3D joint locations directly (Fig. 2 (a)), delivering multi-person 3D pose results, with the following carefully designed query embedding and attention schemes for detecting and grouping the skeleton joints.
Inspired by transformers Vaswani et al. (2017), MvP represents each skeleton joint as a learnable positional embedding, which is fed into the transformer decoder and mapped into the final 3D joint location by jointly attending to other joints and the multi-view information (Fig. 2 (a)). The learnt embeddings encode prior knowledge about the skeleton joints and we name them joint queries. MvP develops the following concise query embedding scheme.
The most straightforward way for designing joint query embeddings is to maintain a learnable query vector for each joint per person. However, we empirically find this scheme does not work well, likely because such a naive strategy cannot share the joint-level knowledge between different persons.
To tackle this problem, we develop a hierarchical query embedding scheme to explicitly encode the person-joint relation for better generalization to different scenes. The hierarchical embedding offers joint-level information sharing across different persons and reduces the learnable parameters, helping the model learn useful knowledge from the training data and thus generalize better. Concretely, instead of using a set of independent joint queries $\{q_{n,j}\}$, we employ a set of person-level queries $\{p_n\}_{n=1}^{N} \subset \mathbb{R}^C$ and a set of joint-level queries $\{e_j\}_{j=1}^{J} \subset \mathbb{R}^C$ to represent different persons and different skeleton joints, where $C$ denotes the feature dimension, $N$ is the number of persons, and $J$ is the number of joints per person. Then the query of joint $j$ of person $n$ can be hierarchically formulated as
$$q_{n,j} = p_n + e_j. \qquad (1)$$
With such a hierarchical embedding scheme, the number of learnable query embedding parameters is reduced from $N \times J \times C$ to $(N + J) \times C$.
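The hierarchical scheme amounts to a broadcast sum of two small embedding tables. A minimal NumPy sketch (sizes and variable names are illustrative, not from the released code; real queries would be learned parameters rather than random arrays):

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, C = 10, 15, 256  # max persons, joints per person, feature dim (illustrative)

person_q = rng.standard_normal((N, C))  # person-level queries p_n
joint_q = rng.standard_normal((J, C))   # joint-level queries e_j, shared across persons

# q[n, j] = p[n] + e[j]: every person reuses the same joint-level embeddings
queries = person_q[:, None, :] + joint_q[None, :, :]  # shape (N, J, C)

# learnable query parameters shrink from N*J*C to (N + J)*C
print(queries.shape, (N + J) * C, N * J * C)
```

The broadcast makes the sharing explicit: joint-level structure learned from one person's annotations benefits every other person slot.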
In the above, the learned joint query embeddings are shared for all the input images, independent of their contents, and thus may not generalize well to novel target data. To address this limitation, we propose to augment the joint queries with input-dependent scene-level information in both model training and deployment, such that the learnt joint queries can adapt to the target data and generalize better. Concretely, we augment the above joint queries with a globally pooled feature vector from the multi-view image feature representations:
$$\tilde{q}_{n,j} = q_{n,j} + W_s\, g. \qquad (2)$$
Here $g = \mathrm{Pool}\big(\mathrm{Concat}(Z_1, \ldots, Z_V)\big)$, where $Z_v$ denotes the image feature from the $v$-th view and $V$ is the total number of camera views; $\mathrm{Concat}$ and $\mathrm{Pool}$ denote concatenation and pooling operations, and $W_s$ is a learnable linear weight.
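A minimal sketch of this adaptation step, under the assumption that the pooling is a global average over all views; the linear weight is a random stand-in for a learned layer, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, W, C = 5, 16, 24, 256  # views, feature height/width, channels (toy sizes)

feats = [rng.standard_normal((H, W, C)) for _ in range(V)]  # per-view features Z_v
queries = rng.standard_normal((10, 15, C))                  # hierarchical joint queries

# g = Pool(Concat(Z_1, ..., Z_V)): concatenate all views, then global-average pool
g = np.concatenate([f.reshape(-1, C) for f in feats], axis=0).mean(axis=0)

W_s = rng.standard_normal((C, C)) * 0.02  # learnable linear weight (random stand-in)
adapted = queries + g @ W_s               # the same scene vector updates every query
```

Because the same pooled vector is added to every query, the adaptation shifts all queries toward the current scene without changing their relative structure.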
It is crucial to aggregate complementary multi-view information to transform the joint embeddings into accurate 3D joint locations. We consider the dot product attention mechanism of transformers Vaswani et al. (2017) to fuse the multi-view image features. However, naively applying such dot product attention densely over all spatial locations and camera views will incur enormous computation cost. Moreover, such dense attention is difficult to optimize and delivers poor performance empirically since it does not exploit any 3D geometric knowledge.
Therefore, we propose a geometrically-guided multi-view projective attention scheme, named projective attention. The core idea is to take the 2D projection of the estimated 3D joint location as the anchor point in each view, and only fuse the local features near those projected 2D locations from different views. Motivated by the deformable convolution Dai et al. (2017); Zhu et al. (2019), we adopt an adaptive deformable sampling strategy to gather the localized context information in each camera view, as shown in Fig. 2 (b). Other local attention operations Zhao et al. (2020); Wu et al. (2020, 2019) can also be adopted as an alternative. Formally, given the joint query feature $\mathbf{q}$ and the 3D joint position $\mathbf{y}$, the projective attention is defined as
$$\mathrm{ProjAttn}(\mathbf{q}, \mathbf{y}) = \sum_{v=1}^{V} W_v \mathbf{f}_v, \quad \mathbf{f}_v = \sum_{k=1}^{K} a_{v,k} \cdot W'_v Z_v\big(\mathbf{p}_v + \Delta \mathbf{p}_{v,k}\big). \qquad (3)$$
Here the view-specific feature $\mathbf{f}_v$ is obtained by aggregating features from $K$ discrete offsetted sampling points around an anchor point $\mathbf{p}_v = \pi_v(\mathbf{y})$, located by projecting the current 3D joint location $\mathbf{y}$ to 2D, where $\pi_v$ denotes the perspective projection Hartley and Zisserman (2003) with the corresponding camera parameters. $W_v$ and $W'_v$ are learnable linear weights. The attention weight $a_{v,k}$ and the offset $\Delta \mathbf{p}_{v,k}$ to the projected anchor point are estimated from the fusion of the query feature $\mathbf{q}$ and the view-dependent feature $Z_v(\mathbf{p}_v)$ at the projected anchor point, using learnable linear weights. If the projected location or the offsetted location is fractional, we use bilinear interpolation to obtain the corresponding feature $Z_v(\mathbf{p}_v)$ or $Z_v(\mathbf{p}_v + \Delta \mathbf{p}_{v,k})$.
The projective attention incorporates two geometrical cues, i.e., the corresponding 2D spatial locations across views from the 3D to 2D projection and the deformed neighborhood of the anchors from the learned offsets to gather view-adaptive contextual information. Unlike naive attention where the query feature densely interacts with the multi-view key features across all the spatial locations, the projective attention is more selective for the interaction between the query and each view—only the features from locations near to the projected anchors are aggregated, and thus is much more efficient.
The positional encoding Vaswani et al. (2017) is an important component of the transformer, which provides positional information of the input sequence. However, a simple per-view 2D positional encoding scheme cannot encode the multi-view geometrical information. To tackle this limitation, we propose to encode the camera ray directions, which represent positional information in 3D space, into the multi-view feature representations. Concretely, the camera ray direction map $R_v$, generated with the view-specific camera parameters, is concatenated channel-wise to the corresponding image feature representation $Z_v$. Then a standard convolution is applied to obtain the updated feature representation $Z'_v$ with the view-dependent geometric information:
$$Z'_v = \mathrm{Conv}\big(\mathrm{Concat}(Z_v, R_v)\big). \qquad (4)$$
We name this operation RayConv. With it, the obtained feature representation $Z'_v$ is used for the projective attention by replacing $Z_v$ in Eqn. (3). Such a drop-in replacement introduces negligible computation while injecting a strong multi-view geometrical prior into the projective attention scheme, thus helping predict the refined 3D joint position more precisely.
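RayConv reduces to "concatenate per-pixel ray directions, then convolve". A minimal NumPy sketch with a toy pinhole camera and a 1×1 convolution written as a matrix multiply; the camera parameters and weights are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 40, 60, 32

def ray_dirs(H, W, K, R):
    """Unit ray direction per pixel in the world frame: d = R^T K^{-1} [u, v, 1]^T."""
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3) homogeneous pixels
    d = pix @ np.linalg.inv(K).T @ R                  # apply K^-1, then rotate by R^T
    return d / np.linalg.norm(d, axis=-1, keepdims=True)

K = np.array([[50.0, 0, 30.0], [0, 50.0, 20.0], [0, 0, 1.0]])  # toy intrinsics
R = np.eye(3)                                                   # toy rotation

Z = rng.standard_normal((H, W, C))          # view feature map Z_v
Rv = ray_dirs(H, W, K, R)                   # (H, W, 3) ray direction map
aug = np.concatenate([Z, Rv], axis=-1)      # channel-wise concat, (H, W, C + 3)

conv_w = rng.standard_normal((C + 3, C)) * 0.02  # 1x1 conv kernel as a matrix
Z_new = aug @ conv_w                             # updated features with geometry baked in
```

Unlike a per-view 2D positional encoding, the ray map changes with the camera pose, so the same pixel coordinate in two views carries different 3D positional information.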
Our overall architecture (Fig. 2 (a)) is pleasantly simple. It adopts a convolutional neural network, designed for 2D pose estimation Xiao et al. (2018), to obtain high-resolution image features from the multi-view inputs. The features are then fed into the transformer decoder, consisting of multiple decoder layers, to predict the 3D joint locations. Each layer conducts a self-attention to perform pair-wise interaction between all the joints from all the persons in the scene; a projective attention to selectively gather the complementary multi-view information; and a feed-forward regression to predict the 3D joint positions and their confidence scores. Specifically, the transformer decoder applies a multi-layer progressive regression scheme, i.e., each decoder layer outputs 3D joint offsets to refine the input 3D joint positions from the previous layer.
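The progressive regression can be summarized as a loop over decoder layers, each refining the previous 3D estimates. The three modules below are simplified placeholders standing in for the learned self-attention, projective attention, and FFN blocks; only the control flow mirrors the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
K_joints, C, L = 150, 32, 4  # total joint queries (N*J), channels, decoder layers

def self_attention(q):
    """Toy dot-product attention over all joint queries (no learned weights)."""
    a = q @ q.T / np.sqrt(C)                     # pairwise joint-joint scores
    a = np.exp(a - a.max(axis=1, keepdims=True))
    return (a / a.sum(axis=1, keepdims=True)) @ q

def projective_attention(q, y):
    """Stand-in for the multi-view fusion described in Eqn. (3)."""
    return q + 0.01 * rng.standard_normal(q.shape)

def ffn_offset(q, W_out):
    """Regress a 3D offset per joint from its query feature."""
    return q @ W_out

q = rng.standard_normal((K_joints, C))  # joint queries
y = np.zeros((K_joints, 3))             # initial 3D joint positions
W_out = rng.standard_normal((C, 3)) * 0.01

for _ in range(L):                      # each layer outputs offsets that
    q = self_attention(q)               # refine the previous layer's estimates
    q = projective_attention(q, y)
    y = y + ffn_offset(q, W_out)
```

The final `y` after L layers plays the role of the model's 3D joint predictions; in training, each intermediate `y` is also supervised.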
MvP learns skeleton joint feature representations and is extendable to recovering human mesh with a parametric body mesh model Loper et al. (2015). Specifically, after average-pooling the joint features into a per-person feature, a feed-forward network is used to predict the corresponding body mesh represented by the parametric SMPL model Loper et al. (2015). Similar to the joint location prediction, the SMPL parameters follow the multi-layer progressive regression scheme.
MvP infers a fixed set of $K$ joint locations for different persons, where $K = N \times J$. The main training challenge is how to associate the skeleton joints correctly with different person instances. Unlike the post-hoc grouping of detected skeleton joints in bottom-up pose estimation methods Papandreou et al. (2018); Kreiss et al. (2019), MvP learns to directly predict the multi-joint 3D human poses in a group-wise fashion, as shown in Fig. 2 (a). This is achieved by a grouped matching strategy during model training.
Given the predicted joint positions $\{\hat{\mathbf{y}}_k\}_{k=1}^{K}$ and associated confidence scores $\{\hat{s}_k\}_{k=1}^{K}$, we group every consecutive $J$-joint predictions into a per-person pose estimation $\hat{Y}_n$, and average their corresponding confidence scores to obtain the per-person confidence score $\hat{c}_n$. The same grouping strategy is used during inference.
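Since each person's joints occupy a consecutive block of the prediction sequence, the grouping amounts to a reshape plus a mean. A minimal sketch (sizes and the 0.5 threshold are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, J = 10, 15                         # person slots, joints per person
K = N * J                             # total predicted joints

joints = rng.standard_normal((K, 3))  # predicted 3D joint positions
scores = rng.random(K)                # per-joint confidence scores

poses = joints.reshape(N, J, 3)                # consecutive J joints -> one person
person_scores = scores.reshape(N, J).mean(1)   # average joint confidences per person

# at inference, person slots below a confidence threshold are discarded
keep = person_scores > 0.5
final_poses = poses[keep]
```

No learned association step is needed at inference time: the grouping is fixed by construction, and the confidence scores decide which person slots are real detections.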
The ground-truth set of 3D poses $\{Y_m\}_{m=1}^{M}$ of different person instances is smaller than the prediction set of size $N$, so it is padded to size $N$ with the empty element $\varnothing$. Then we find a bipartite matching between the prediction set and the ground-truth set by searching for the permutation $\hat{\sigma}$ of $N$ elements that achieves the lowest matching cost:
$$\hat{\sigma} = \arg\min_{\sigma} \sum_{m=1}^{N} \mathcal{C}_{\mathrm{match}}\big(Y_m, \hat{Y}_{\sigma(m)}\big). \qquad (5)$$
We consider both the regressed 3D joint positions and the confidence score for the matching cost:
$$\mathcal{C}_{\mathrm{match}}\big(Y_m, \hat{Y}_{\sigma(m)}\big) = \mathcal{L}\big(Y_m, \hat{Y}_{\sigma(m)}\big) - \hat{c}_{\sigma(m)},$$
where $\hat{c}_{\sigma(m)}$ is the per-person confidence score and $\mathcal{L}$ computes the pose regression error. Following Carion et al. (2020); Sutskever et al. (2014), we employ the Hungarian algorithm Kuhn (1955) to compute the optimal assignment with the above matching cost.
We compute the Hungarian loss with the obtained optimal assignment $\hat{\sigma}$:
$$\mathcal{L}_{\mathrm{Hung}} = \sum_{m} \Big[ \mathcal{L}_{\mathrm{conf}}\big(\hat{c}_{\hat{\sigma}(m)}\big) + \lambda\, \mathcal{L}_{\mathrm{pose}}\big(Y_m, \hat{Y}_{\hat{\sigma}(m)}\big) \Big]. \qquad (6)$$
Here $\mathcal{L}_{\mathrm{conf}}$ and $\mathcal{L}_{\mathrm{pose}}$ are losses for confidence score and pose regression, respectively, and $\lambda$ balances the two loss terms. We use focal loss Lin et al. (2017) for confidence prediction, which adaptively balances the positive and negative samples. For pose regression, we compute an $L_1$ loss for the 3D joints and their projected 2D joints in different views. To learn multi-layer progressive regression, the above matching and loss are applied to each decoder layer. The total loss is thus $\sum_{l=1}^{L} \mathcal{L}_{\mathrm{Hung}}^{(l)}$, where $\mathcal{L}_{\mathrm{Hung}}^{(l)}$ denotes the loss of the $l$-th decoder layer and $L$ is the number of decoder layers. When extending MvP to body mesh recovery, we apply an $L_1$ loss for the 3D joints from the SMPL model and their 2D projections, as well as an adversarial loss following HMR Kanazawa et al. (2018); Jiang et al. (2020); Zhang et al. (2021) due to the lack of ground-truth SMPL parameters.
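The matching step can be sketched with SciPy's Hungarian solver. The cost below, a mean L1 joint error discounted by a scaled confidence, is an illustrative stand-in for the exact cost above, and the 0.1 weight is arbitrary:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
M, N, J = 3, 10, 15                      # ground-truth persons, predictions, joints

gt = rng.standard_normal((M, J, 3))      # ground-truth 3D poses
pred = rng.standard_normal((N, J, 3))    # predicted per-person poses
conf = rng.random(N)                     # per-person confidence scores

# cost[m, n]: mean L1 joint error of matching gt m to prediction n,
# discounted by confidence (0.1 is an arbitrary illustrative weight)
cost = np.abs(gt[:, None] - pred[None]).mean(axis=(2, 3)) - 0.1 * conf[None]

rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm (Kuhn, 1955)
# each ground-truth pose gets a distinct prediction; the remaining
# predictions are supervised toward low confidence ("no person")
matched = dict(zip(rows.tolist(), cols.tolist()))
```

With a rectangular cost matrix, `linear_sum_assignment` matches every ground-truth row to a distinct prediction column; unmatched prediction slots only receive the confidence loss.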
In this section, we aim to answer the following questions. 1) Can MvP provide both efficient and accurate multi-person 3D pose estimation? 2) How does the proposed attention mechanism help fuse multi-view multi-person skeleton joint information? 3) How does each individual design choice affect model performance? To this end, we conduct extensive experiments on several benchmark datasets.
| Method | AP25 | AP50 | AP100 | AP150 | Recall | MPJPE | Time (ms) |
|---|---|---|---|---|---|---|---|
| VoxelPose Tu et al. (2020) | 84.0 | 96.4 | 97.5 | 97.8 | 98.1 | 17.8 | 320 |
Panoptic Joo et al. (2017) is a large-scale benchmark with 3D skeleton joint annotations. It captures daily social activities in an indoor environment. We conduct extensive experiments on Panoptic to evaluate and analyze our approach. Following VoxelPose Tu et al. (2020), we use the same data sequences except ‘160906_band3’ in the training set due to broken images. Unless otherwise stated, we use five HD cameras (3, 6, 12, 13, 23) in our experiments. All results reported in the experiments follow the same data setup. We use Average Precision (AP) and Recall Tu et al. (2020)
, as well as Mean Per Joint Position Error (MPJPE), as evaluation metrics. Shelf and Campus Belagiannis et al. (2014) are two multi-person datasets capturing indoor and outdoor environments, respectively. We split them into training and testing sets following Belagiannis et al. (2014); Dong et al. (2019); Tu et al. (2020). We report Percentage of Correct Parts (PCP) for these two datasets.
We first evaluate our MvP model on the challenging Panoptic dataset and compare it with the state-of-the-art VoxelPose model Tu et al. (2020). As shown in Table 1, our MvP achieves 92.3% AP, improving upon VoxelPose by 9.8%, and achieves a much lower MPJPE (15.8 vs. 17.8). Moreover, MvP only requires 170ms to process a multi-view input, nearly 2× faster than VoxelPose (we count averaged per-sample inference time in milliseconds on the Panoptic test set; for all methods, the time is measured on a GeForce RTX 2080 Ti GPU and an Intel i7-6900K CPU @ 3.20GHz). These results demonstrate both the accuracy and efficiency advantages of MvP from estimating 3D poses of multiple persons in a direct regression paradigm. To further demonstrate the efficiency of MvP, we compare its inference time with VoxelPose's when processing different numbers of person instances. As shown in Fig. 3, the inference time of VoxelPose grows linearly with the number of persons in the scene due to its per-person regression paradigm. In contrast, MvP keeps a constant inference time regardless of the number of instances in the scene. Notably, MvP takes only 185ms to process scenes even with 100 person instances (the blue line), demonstrating its great potential to handle crowded scenarios.
We further compare our MvP with state-of-the-art approaches on the Shelf and Campus datasets. The reconstruction-based methods Belagiannis et al. (2015); Ershadi-Nasab et al. (2018); Dong et al. (2019) use a 3D pictorial model Belagiannis et al. (2015); Dong et al. (2019) or a conditional random field Ershadi-Nasab et al. (2018) within a multi-stage paradigm, and the volumetric approach VoxelPose Tu et al. (2020) relies heavily on computationally intensive intermediate tasks. As shown in Table 2, our MvP achieves the best performance for all actors on the Shelf dataset. Moreover, it obtains a result on the Campus dataset comparable to VoxelPose Tu et al. (2020) without relying on any intermediate task. These results further confirm the effectiveness of MvP for estimating 3D poses of multiple persons directly.
| Method | Shelf: Actor 1 | Actor 2 | Actor 3 | Average | Campus: Actor 1 | Actor 2 | Actor 3 | Average |
|---|---|---|---|---|---|---|---|---|
| Belagiannis et al. Belagiannis et al. (2015) | 75.3 | 69.7 | 87.6 | 77.5 | 93.5 | 75.7 | 84.4 | 84.5 |
| Ershadi et al. Ershadi-Nasab et al. (2018) | 93.3 | 75.9 | 94.8 | 88.0 | 94.2 | 92.9 | 84.6 | 90.6 |
| Dong et al. Dong et al. (2019) | 98.8 | 94.1 | 97.8 | 96.9 | 97.6 | 93.3 | 98.0 | 96.3 |
| VoxelPose Tu et al. (2020) | 99.3 | 94.1 | 97.6 | 97.0 | 97.6 | 93.8 | 98.8 | 96.7 |
We visualize some 3D pose estimations of MvP on the challenging Panoptic dataset in Fig. 4. It can be observed that MvP is robust to large pose deformation (the 1st example) and severe occlusion (the 2nd example), and can achieve geometrically plausible results w.r.t. different viewpoints (the rightmost column). Moreover, MvP is extendable to body mesh recovery and can achieve fairly good reconstruction results (the 2nd and 4th rows). All these results verify both effectiveness and extendability of MvP. Please see supplementary for more examples.
We visualize the projective attention and the self-attention in Fig. 5. Benefiting from the 3D-to-2D projection, the projective attention can accurately locate the skeleton joint in each camera view (the green point) based on the current estimated 3D joint location. We observe it learns to gather adaptive local context information (the red points) with the deformable sampling operation. For instance, when regressing the 3D position of mid-hip (the 1st example), the projective attention selectively attends to informative joints such as the left and right hips as well as thorax, which offers sufficient contextual information for accurate estimation. We also visualize the self-attention, which learns pair-wise interaction between all the skeleton joints in the scene. From the 3D plot in Fig. 5, we can observe a certain skeleton joint mainly attends to other joints of the same person instance (more opaque). It also attends to joints from other person instances, but with less attention (more transparent). This phenomenon is reasonable as the skeleton joints of a human body are strongly correlated to each other, e.g., with certain pose priors and bone length.
MvP introduces RayConv to encode multi-view geometric information, i.e., camera ray directions, into the image feature representations. As shown in Table 2(a), removing RayConv degrades performance significantly: a 4.8% decrease in AP and a 1.6mm increase in MPJPE. This indicates the multi-view geometrical information is important for the model to localize the skeleton joints in 3D space more precisely. Without RayConv, the transformer decoder cannot accurately capture positional information in 3D space, resulting in the performance drop.
As shown in Table 2(b), compared with the straightforward and unstructured per-joint query embedding scheme, the proposed hierarchical query embedding boosts the performance sharply: a 14.1% increase in AP and a 23.4mm decrease in MPJPE. This advantage clearly verifies that introducing person-level queries to collaborate with the joint-level queries better exploits human body structural information and helps the model localize the joints more accurately. On top of the hierarchical query embedding scheme, adding the query adaptation strategy further improves the performance significantly, reaching an AP of 92.3 and an MPJPE of 15.8. This shows the proposed approach effectively adapts the query embeddings to the target scene, and such adaptation is indeed beneficial for the generalization of MvP to novel scenes.
We also examine the effects of varying the following designs of the MvP model to gain a better understanding of them.
Confidence Threshold During inference, a confidence threshold is used to filter out low-confidence and erroneous pose predictions and obtain the final result. Adopting a higher threshold selects predictions in a more restrictive way. As shown in Table 2(c), a higher confidence threshold brings lower MPJPE as it selects more accurate predictions; but it also filters out some true positive predictions and thus reduces the average precision.
Number of Decoder Layers Decoder layers are used for refining the pose estimation, so stacking more decoder layers gives better performance (Table 2(d)). For instance, the MPJPE is as high as 49.6 when using only two decoder layers, but is significantly reduced to 22.8 when using three. This clearly shows that the progressive refinement strategy of our MvP model is effective. However, the benefit of using more decoder layers diminishes when the number of layers is large enough, implying the model has reached its capacity ceiling.
Number of Camera Views Multi-view inputs provide complementary information to each other which is extremely useful when handling some challenging environment factors in 3D pose estimation like occlusions. We vary the number of camera views to examine whether MvP can effectively fuse and leverage multi-view information to continuously improve the pose estimation quality (Table 2(e)). As expected, with more camera views, the 3D pose estimation accuracy monotonically increases, demonstrating the capacity of MvP in fusing multi-view information.
Number of Deformable Sampling Points Table 2(f) shows the effect of the number of deformable sampling points used in the projective attention. With only one deformable point, MvP already achieves a respectable result, i.e., 88.6 in AP and 17.4 in MPJPE. Using more sampling points further improves the performance, demonstrating that the projective attention is effective at aggregating information from useful locations. The model gives the best result at an intermediate number of sampling points; further increasing the number to 8, the performance starts to drop, likely because too many deformable points introduce redundant information and thus make the model more difficult to optimize.
We introduced a direct and efficient model, named Multi-view Pose transformer (MvP), to address the challenging multi-view multi-person 3D human pose estimation problem. Different from existing methods relying on tedious intermediate tasks, MvP substantially simplifies the pipeline into a direct regression one by carefully designing a transformer-like model architecture with a novel hierarchical joint query embedding scheme and projective attention mechanism. We conducted extensive experiments to verify its superior performance and speed over well-established baselines.
We empirically found MvP needs sufficient data for model training since it learns the 3D geometry implicitly. In the future, we will study how to enhance the data-efficiency of MvP by leveraging strategies like self-supervised pre-training or exploring more advanced approaches. Similar to prior works, we also found MvP suffers a performance drop in cross-camera generalization, that is, generalizing to novel camera views. We will explore approaches like disentangling camera parameters from multi-view feature learning to improve this aspect. Besides, we will explore large-scale applications of MvP and further extend it to other relevant tasks. Thanks to its efficiency, MvP should scale to very crowded scenes with many persons. Moreover, the framework of MvP is general and thus extensible to other 3D modeling tasks like dense mesh recovery of common objects.
Paszke et al. (2017). Automatic differentiation in PyTorch. NeurIPS Workshop. Cited by: More Implementation Details.
Sutskever et al. (2014). Sequence to sequence learning with neural networks. arXiv. Cited by: §3.4.
We use PyTorch Paszke et al. (2017) to implement the proposed Multi-view Pose transformer (MvP) model. Our MvP model is trained on 8 Nvidia RTX 2080 Ti GPUs, with a batch size of 1 per GPU and a total batch size of 8. We use the Adam optimizer Kingma and Ba (2015) with an initial learning rate of 1e-4 and decrease the learning rate by a factor of 0.1 at 20 epochs during training. The hyper-parameter for balancing the confidence score and pose regression losses is set to 2.5. We use the image feature representations (256-d) from the de-convolution layer of the 2D pose estimator PoseResNet Xiao et al. (2018) for the multi-view inputs. Additionally, we provide the code of MvP, including the implementation of the model architecture, training and inference, in the “./mvp” folder to facilitate understanding of our method.
Fig. 6 (a) illustrates our proposed hierarchical query embedding scheme. As shown in Eqn. (1), each person-level query is added individually to the same set of joint-level queries to obtain the per-person customized joint queries. This scheme shares the joint-level queries across different persons and thus reduces the number of parameters (the joint embeddings) to learn, and helps the model generalize better. The generated per-person joint query embedding is further augmented by adding the scene-level feature extracted from the input images.
The decoder of MvP transformer consists of multiple decoder layers for regressing 3D joint locations progressively. Fig. 6 (b) demonstrates the detailed architecture of a decoder layer, which contains a self-attention module to perform pair-wise interaction between all the joints from multiple persons in the scene; a projective attention module to selectively gather the complementary multi-view information; and a feed-forward network (FFN) to predict the 3D joint locations and their confidence scores.
MvP encodes camera ray directions into the multi-view image feature representations via RayConv. We also compare with a simple positional embedding baseline that uses 2D coordinates as the positional information to embed, similar to previous transformer-based models for vision tasks Carion et al. (2020); Dosovitskiy et al. (2020a). Specifically, we replace the camera ray directions with the 2D spatial coordinates of the input images in RayConv. Results are shown in Table 4. Using the 2D coordinates in RayConv results in much worse performance, i.e., 83.3 in AP and 18.1 in MPJPE. This demonstrates that such view-agnostic 2D coordinates cannot effectively encode multi-view geometrical information into the model, while camera ray directions effectively encode the positional information of each view in 3D space, thus leading to better performance.
| Positional information | AP | Recall | MPJPE |
|---|---|---|---|
| Camera Ray Directions | 92.3 | 97.5 | 15.8 |
| 2D Spatial Coordinates | 83.3 | 93.0 | 18.1 |
We further investigate the effectiveness of the proposed projective attention by comparing it with dense dot product attention, i.e., conducting attention densely over all spatial locations and camera views for multi-view information gathering. Results are given in Table 5. We observe MvP with the dense attention (MvP-Dense) delivers very poor performance (0.0 AP and 114.5 MPJPE) since it does not exploit any 3D geometry and thus is difficult to optimize. Moreover, such dense dot product attention incurs significantly higher computation cost than the proposed projective attention: MvP-Dense costs 31 GB of GPU memory, more than 5× that of MvP with the projective attention, which only costs 6.1 GB.
We also evaluate our MvP model on the most widely used single-person dataset, Human3.6M Ionescu et al. (2014), collected in an indoor environment. We follow the standard training and evaluation protocol Martinez et al. (2017); Iskakov et al. (2019); Tu et al. (2020) and use MPJPE as the evaluation metric. Our MvP model achieves 18.6 MPJPE, which is comparable to state-of-the-art approaches (18.6 vs. 17.7 and 19.0) Iskakov et al. (2019); Tu et al. (2020).
Here we present more qualitative results of MvP on the Panoptic Joo et al. (2017) (Fig. 7), Shelf and Campus Belagiannis et al. (2014) (Fig. 8) datasets. From Fig. 7 we can observe that MvP can produce satisfactory 3D pose and body mesh estimations even in cases of strong pose deformations (the 1st example) and large occlusion (the 2nd and 3rd examples). Moreover, the performance of MvP is robust even in the challenging crowded scenario, as shown in the 1st example in Fig. 8.