Pose2Room: Understanding 3D Scenes from Human Activities

by Yinyu Nie, et al.

With wearable IMU sensors, one can estimate human poses without requiring visual input. In this work, we pose the question: Can we reason about object structure in real-world environments solely from human trajectory information? Crucially, we observe that human motion and interactions tend to give strong information about the objects in a scene; for instance, a person sitting indicates the likely presence of a chair or sofa. To this end, we propose P2R-Net to learn a probabilistic 3D model of the objects in a scene characterized by their class categories and oriented 3D bounding boxes, based on an input observed human trajectory in the environment. P2R-Net models the probability distribution of object class as well as a deep Gaussian mixture model for object boxes, enabling sampling of multiple, diverse, likely modes of object configurations from an observed human trajectory. In our experiments we demonstrate that P2R-Net can effectively learn multi-modal distributions of likely objects for human motions, and produce a variety of plausible object structures of the environment, even without any visual information.





1 Introduction

Understanding the structure of real-world 3D environments is fundamental to many computer vision tasks, and there has been a well-studied history of research into 3D reconstruction from various visual input media, such as RGB video [mur2015orb, engel2014lsd, runz2020frodo, qian2020associative3d], RGB-D video [newcombe2011kinectfusion, choi2015robust, niessner2013real, whelan2015elasticfusion, dai2017bundlefusion], or single images [huang2018holistic, Nie_2020_CVPR, popov2020corenet, Zhang_2021_CVPR, engelmann2021points, kuo2020mask2cad, kuo2021patch2cad, dahnert2021panoptic]. Such approaches have shown impressive capture of geometric structures leveraging strong visual signals. We consider an unconventional view of 3D perception: in the absence of any visual signal, we look at human pose data, which for instance can be estimated from wearable IMU sensors [von2017sparse, glauser2019interactive, huang2018deep], and ask “What can we learn about a 3D environment from only human pose trajectory information?”

In particular, we observe that a person moving through a 3D environment often interacts both passively and actively with objects in it, giving strong cues about likely objects and their locations. For instance, walking around a room indicates where empty floor space is available, a sitting motion indicates a high likelihood of a chair or sofa supporting the sitting pose, and a single outstretched arm suggests picking up an object from, or putting it down on, a supporting piece of furniture. We thus propose to address a new scene estimation task: from only an observed sequence of 3D human poses, estimate the arrangement of the objects the person has interacted with, as a set of object class categories and 3D oriented bounding boxes.

As there are inherent ambiguities that lie in 3D object localization from only a human pose trajectory in the scene, we propose P2R-Net to learn a probabilistic model of the most likely modes of object configurations in the scene. From the sequence of poses, P2R-Net leverages the pose joint locations to vote for potential object centers that participate in the observed pose interactions. We then introduce a probabilistic decoder that learns a Gaussian mixture model for object box parameters, from which we can sample multiple diverse hypotheses of object arrangements.

In summary, we present the following contributions:

  • We propose a new perspective on 3D scene understanding by studying estimation of 3D object configurations from solely observing 3D human pose sequences of interactions in an environment, without any visual input, and predicting the object class categories and 3D oriented bounding boxes of the interacted objects in the scene.

  • To address this task, we introduce a new, end-to-end, learned probabilistic model that estimates probability distributions for the object class categories and bounding box parameters.

  • We demonstrate that our model captures complex, multi-modal distributions of likely object configurations, which can be sampled to produce diverse hypotheses that have accurate coverage over the ground truth object arrangement.

2 Related Work

Predicting Human Interactions in Scenes.

Capturing and modeling interactions between humans and scenes has seen impressive progress in recent years, following significant advances in 3D reconstruction and 3D deep learning. From a visual observation of a scene, interactions and human-object relations are estimated. Several methods have been proposed for understanding the relations between scene and human poses via object functionality prediction [grabner2011makes, zhu2015understanding, pieropan2013functional, hu2016learning] and affordance analysis [gupta20113d, savva2016pigraphs, sawatzky2017weakly, wang2019geometric, deng20213d, ruiz2018can].

By parsing the physics and semantics in human interactions, further works have been proposed towards synthesizing static human poses or human body models into 3D scenes [grabner2011makes, gupta20113d, savva2014scenegrok, kim2014shape2pose, savva2016pigraphs, hassan2019resolving, zhang2020place, zhang2020generating, hassan2021populating]. These works focus on how to place human avatars into a 3D scene with semantic and physical plausibility (e.g., support or occlusion constraints). Various approaches have additionally explored synthesizing dynamic motions for a given scene geometry. Early methods retrieve and integrate existing avatar motions from a database to make them compatible with scene geometry [lee2002interactive, agrawal2016task, kapadia2016precision, lee2006motion, shum2008interaction]. Given a goal pose or a task, more recent works leverage deep learning methods to search for a possible motion path and estimate plausible contact motions [starke2019neural, chao2019learning, corona2020context, merel2020catch, wang2021synthesizing, hassan2021stochastic]. These methods all explore human-scene interaction understanding by estimating object functionalities or human interactions as poses in a given 3D scene environment. In contrast, we take the converse perspective, and aim to estimate the 3D scene arrangement from human pose trajectory observations.

Figure 2: Overview of P2R-Net. Given a pose trajectory with $N$ frames and $J$ joints, a position encoder decouples each skeleton frame into a relative position encoding (from its root joint as the hip centroid) and a position-agnostic pose. After combining them, a pose encoder learns local pose features from both body joints per skeleton (spatial encoding) and their changes in consecutive frames (temporal encoding). Root joints as seeds are then used to vote for the center of a nearby object that each pose is potentially interacting with. A probabilistic mixture network learns likely object box distributions, from which object class labels and oriented 3D boxes can be sampled.

Scene Understanding with Human Priors.

As many environments, particularly indoor scenes, have been designed for people’s daily usage, human behavioral priors can be leveraged to additionally reason about 2D or 3D scene observations. Various methods have been proposed to leverage human context as extra signal towards holistic perception to improve performance in scene understanding tasks such as semantic segmentation [delaitre2012scene], layout detection from images [fouhey2012people, shoaib2014estimating], 3D object labeling [jiang2013hallucinated], 3D object detection and segmentation [wei2016modeling], and 3D reconstruction [fowler2017towards, fowler2018human].

Additionally, several methods learn joint distributions of human interactions with 3D scenes or RGB video that can be leveraged to re-synthesize the observed scene as an arranged set of synthetic, labeled CAD models [jiang2012learning, fisher2015activity, jiang2015modeling, savva2016pigraphs, monszpart2019imapper]. Recently, HPS (Human POSEitioning System) [guzov2021human] introduced an approach to simultaneously estimate human pose trajectory and 3D scene reconstruction from a set of wearable visual and inertial sensors on a person. We also aim to understand 3D scenes as arrangements of objects, but do not require any labeled interactions nor consider any visual (RGB, RGB-D, etc.) information as input. The recent approach of Mura et al. [mura2021walk2map] poses the task of floor plan estimation from 2D human walk trajectories, and proposes to predict occupancy-based floor plans that indicate structure and object footprints; however, it does not distinguish object instances and its prediction is fully deterministic. To the best of our knowledge, we introduce the first method to learn 3D object arrangement distributions from only 3D human pose trajectories, without any visual input.

Pose Tracking with IMUs.

Our method takes human pose trajectories as input, and thus builds on the success of motion tracking techniques. Seminal works have demonstrated effective pose estimation from wearable sensors, such as optical markers [chai2005performance, hassan2021stochastic] or IMUs [liu2011realtime, von2017sparse, huang2018deep, kaufmann2021pose, glauser2019interactive]. Our work is motivated by the capability of reliably estimating human pose from these simple sensor setups without any visual data, from which we aim to learn about human-object interaction priors to estimate scene object configurations.

3 Method

From only a human pose trajectory as input, we aim to estimate a distribution of likely object configurations, from which we can sample plausible hypotheses of objects in the scene as sets of class category labels and oriented 3D bounding boxes. We observe that most human interactions in an environment are targeted towards specific objects, and that general motion behavior is often influenced by the object arrangement in the scene. We thus aim to discover potential objects that each pose may be interacting with.

We first extract meaningful features from the human pose sequence with a position encoder to disentangle each frame into a relative position encoding and a position-agnostic pose, as well as a pose encoder to learn the local spatio-temporal feature for each pose in consecutive frames. We then leverage these features to vote for a potential interacting object for each pose. From these votes, we learn a probabilistic mixture decoder to propose box proposals for each object, characterizing likely modes for objectness, class label, and box parameters. An illustration of our approach is shown in Figure 2.

3.1 Relative Position Encoding

We consider an input pose trajectory with $N$ frames and $J$ joints as the sequence of 3D locations $T \in \mathbb{R}^{N \times J \times 3}$. We also denote the root joint of each pose by $r \in \mathbb{R}^{N \times 3}$, where the root joint of a pose is the centroid of the joints corresponding to the body hip (for the full skeleton configuration, we refer to the supplemental). To learn informative pose features, we first disentangle for each frame the absolute pose joint coordinates into a relative position encoding $Q \in \mathbb{R}^{N \times d_1}$ and a position-agnostic pose feature $P \in \mathbb{R}^{N \times J \times d_1}$, which are formulated as follows:

$$Q = \mathrm{Pool}\big(f_1(\mathcal{N}(r) - r)\big), \qquad P = f_2(T - r), \tag{1}$$

where $f_1(\cdot)$, $f_2(\cdot)$ are point-wise MLP layers. $\mathcal{N}(r)$ is the set of temporal neighbors to each root joint in $r$, and $\mathrm{Pool}(\cdot)$ denotes neighbor-wise average pooling. By broadcast summation, we output $P^r = P + Q$ for further spatio-temporal pose encoding. Understanding relative positions rather than absolute positions helps to provide more generalizable features to understand common pose motion in various object interactions, as these human-object interactions typically occur locally in a scene.
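The disentangling step described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function name `relative_position_encoding`, the window size `k`, the choice of a symmetric temporal window, and the identity stand-ins for the two point-wise MLPs are all assumptions made here, and the two stand-ins are assumed to output features of the same dimension so that they can be summed.

```python
from statistics import fmean

def relative_position_encoding(joints, root_ids, k=2,
                               f1=lambda v: v, f2=lambda v: v):
    """joints: list over N frames, each a list of J [x, y, z] points.
    root_ids: indices of the hip joints whose centroid is the root joint.
    f1, f2: stand-ins for the point-wise MLPs (identity by default, assumed)."""
    n = len(joints)
    # Root joint per frame: centroid of the hip joints.
    roots = [[fmean(frame[j][a] for j in root_ids) for a in range(3)]
             for frame in joints]
    out = []
    for t, frame in enumerate(joints):
        # Temporal neighbors: up to k frames on each side (assumed window).
        nbrs = [roots[u] for u in range(max(0, t - k), min(n, t + k + 1))]
        # Relative position encoding: average-pooled features of the
        # neighbor roots expressed relative to the current root.
        feats = [f1([nb[a] - roots[t][a] for a in range(3)]) for nb in nbrs]
        q = [fmean(f[a] for f in feats) for a in range(len(feats[0]))]
        # Position-agnostic pose: joints expressed relative to the root.
        p = [f2([pt[a] - roots[t][a] for a in range(3)]) for pt in frame]
        # Broadcast sum of the two encodings over all joints of the frame.
        out.append([[pj[a] + q[a] for a in range(len(q))] for pj in p])
    return out
```

Because both terms depend only on differences to the root joint, translating the whole trajectory leaves the output unchanged, which is the property the paragraph above relies on for generalization.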

3.2 Spatio-Temporal Pose Encoding

The combined encoding $P^r$ provides signal for the relative pose trajectory of a person. We then further encode these features to capture the joint movement to understand local human-object interactions. That is, from $P^r$, we learn joint movement in the spatio-temporal domain: (1) in the spatial domain, we learn from intra-skeleton joints to capture per-frame pose features; (2) in the temporal domain, we learn from inter-frame relations to perceive each joint’s movement.

Figure 3: Pose encoding with spatio-temporal convolutions.

Inspired by [yan2018spatial] in 2D pose recognition, we first use a graph convolution layer to learn intra-skeleton joint features. Edges in the graph convolution are constructed following the skeleton bones, which encodes skeleton-wise spatial information. For each joint, we then use a 1-D convolution layer to capture temporal features from its inter-frame neighbors. A graph layer and a 1-D convolution layer are linked into a block with a residual connection to process the input $P^r$ (see Figure 3). By stacking six blocks, we obtain a deeper spatio-temporal pose encoder with a wider receptive field in the temporal domain, enabling reasoning over more temporal neighbors for object box estimation. Finally, we adopt an MLP to process all joints per skeleton to obtain pose features $P^{st} \in \mathbb{R}^{N \times d_2}$.
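The block structure described above (a graph layer over skeleton bones, a 1-D convolution over time, and a residual connection, stacked six times) can be illustrated with a small sketch. It is a toy under stated assumptions: each joint carries a single scalar feature, fixed averaging weights stand in for the learned graph and convolution weights, the kernel size of 3 and edge replication are choices made here, and all function names are hypothetical.

```python
def graph_layer(x, bones):
    """x[t][j]: scalar feature of joint j at frame t. Each joint is averaged
    with its bone-connected neighbors (fixed weights, not learned ones)."""
    n_joints = len(x[0])
    nbrs = {j: [j] for j in range(n_joints)}
    for a, b in bones:
        nbrs[a].append(b)
        nbrs[b].append(a)
    return [[sum(frame[i] for i in nbrs[j]) / len(nbrs[j])
             for j in range(n_joints)] for frame in x]

def temporal_conv(x, kernel=(0.25, 0.5, 0.25)):
    """1-D convolution along time for each joint, replicating edge frames."""
    n, half = len(x), len(kernel) // 2
    return [[sum(w * x[min(max(t + o - half, 0), n - 1)][j]
                 for o, w in enumerate(kernel))
             for j in range(len(x[0]))] for t in range(n)]

def st_block(x, bones):
    """Graph layer, then temporal convolution, plus a residual connection."""
    y = temporal_conv(graph_layer(x, bones))
    return [[a + b for a, b in zip(fx, fy)] for fx, fy in zip(x, y)]

def encode(x, bones, depth=6):
    """Stack `depth` blocks; six follows the text above."""
    for _ in range(depth):
        x = st_block(x, bones)
    return x
```

With a kernel of size 3, every block widens the temporal receptive field by one frame on each side, so a feature placed at frame 0 influences frames 0 to 6 after six blocks and nothing beyond. This is the sense in which stacking blocks lets each pose reason over more temporal neighbors.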

3.3 Locality-Sensitive Voting

With our extracted pose features $P^{st}$, we then learn to vote for all the objects a person could have interacted with in a trajectory (see Figure 2). For each pose frame, we predict the center of an object it potentially interacts with. Since we do not know when interactions begin or end, each pose votes for a potential object interaction. As human motion in a scene tends to be either an active interaction with an object or a movement towards an object, we aim to learn these patterns by encouraging votes for objects close to the respective pose, which favors locality-based reasoning.

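For each pose feature in $P^{st}$, we use the corresponding root joint as a seed location, and vote for an object center by learning the displacement from the seed:

$$v = r_s + f_3(P^{st}_s), \qquad P^v = P^{st}_s + f_4(P^{st}_s), \tag{2}$$

where $r_s \in \mathbb{R}^{M \times 3}$ are the $M$ seeds uniformly sampled from $r$ to ensure an even spatial distribution; $P^{st}_s \in \mathbb{R}^{M \times d_2}$ are the corresponding pose features of $r_s$; $f_3(\cdot)$, $f_4(\cdot)$ are MLP layers; $v \in \mathbb{R}^{M \times 3}$ and $P^v \in \mathbb{R}^{M \times d_2}$ denote the vote coordinates and features learned from $P^{st}_s$.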

Since there are several objects in a scene, each seed votes for the center of the nearest one (see Figure 4). The nearest object is likely both to participate in a nearby interaction and, if not directly participating, to affect motion around it. This strategy helps to reduce the ambiguities in the task of scene object configuration estimation from a pose trajectory by capturing both direct and indirect effects of object location on pose movement.

For the seeds which vote for the same object, we group their votes to a cluster following [Qi_2019_ICCV]. This outputs cluster centers $c^v \in \mathbb{R}^{V \times 3}$ with aggregated cluster features $P^c \in \mathbb{R}^{V \times d_2}$, where $V$ denotes the number of vote clusters. We then decode $P^c$ into distributions that characterize semantic 3D boxes, as described in Section 3.4. For poses whose root joint is not close to any object during training (beyond a distance threshold $t_d$), we consider them to have little connection with the objects, and do not train them to vote for any object.
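The locality rule used to supervise the votes (each seed targets its nearest object, and seeds far from every object are left unsupervised) can be sketched as follows. The function name `vote_targets` is hypothetical, the 1 m default follows the pose distance threshold reported in the training details, and measuring the distance to the object center rather than to the box surface is an assumption made here.

```python
import math

def vote_targets(seeds, object_centers, t_d=1.0):
    """For each seed (root joint), return the offset to the nearest object
    center, or None when that object is farther than t_d meters away."""
    targets = []
    for s in seeds:
        nearest = min(object_centers, key=lambda c: math.dist(s, c))
        if math.dist(s, nearest) > t_d:
            # Pose too far from any object: no vote supervision.
            targets.append(None)
        else:
            # Displacement the voting layers would be trained to regress.
            targets.append(tuple(c - p for c, p in zip(nearest, s)))
    return targets

seeds = [(0.0, 0.0, 0.0), (2.1, 0.0, 0.0), (10.0, 0.0, 0.0)]
centers = [(0.5, 0.0, 0.0), (2.5, 0.0, 0.0)]
targets = vote_targets(seeds, centers)
```

In this toy call the first two seeds are assigned to different objects and the third, ten meters away, receives no target.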

Figure 4: Voting to objects that potentially influence the motion trajectory in approaching the target.

3.4 Probabilistic Mixture Decoder

We decode vote clusters to propose oriented 3D bounding boxes for each object, along with their class label and objectness score. Each box is represented by a 3D center $c$, 3D size $s$ and 1D orientation $\theta$, where we represent the size by $\log(s)$ and orientation by $(\sin\theta, \cos\theta)$ for regression, similar to [yin2021center]. Since the nature of our task is inherently ambiguous (e.g., it may be unclear from observing a person sit if they are sitting on a chair or a sofa, or the size of the sofa), we propose to learn a probabilistic mixture decoder to predict the box centroid, size and orientation with multiple modes, from a vote cluster feature $P^c$:

$$y_\tau = \sum_{k=1}^{K} f^k_\tau(P^c) \cdot y^k_\tau, \qquad y^k_\tau \sim \mathcal{N}(\mu^k_\tau, \Sigma^k_\tau), \qquad \tau \in \{c, s, \theta\}, \tag{3}$$

where $\tau \in \{c, s, \theta\}$ denotes the regression targets for center, size, and orientation; $\mathcal{N}(\mu^k_\tau, \Sigma^k_\tau)$ is the learned multivariate Gaussian distribution of the $k$-th mode for $\tau$, from which $y^k_\tau \in \mathbb{R}^{V \times d_\tau}$ is sampled; $K$ is the number of Gaussian distributions (i.e., modes); $f^k_\tau(\cdot)$ is the learned score for the $k$-th mode; $y_\tau$ is the weighted sum of the samples from all modes, which is the prediction of the center/size/orientation; and $d_\tau$ is their output dimension ($d_c$=3, $d_s$=3, $d_\theta$=2). Note that the box center $c$ is obtained by regressing the offset $y_c$ from cluster center $c^v$. We predict the proposal objectness and the probability distribution for class category directly from $P^c$, using an MLP.
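As an illustration of the decoder described above, the sketch below draws one sample per mode, combines the samples with the mode scores, and converts the regressed values back into a box. It is a sketch under assumptions, not the authors' code: the names `mixture_predict` and `decode_box` are hypothetical, a diagonal covariance replaces the full multivariate Gaussian, the scores and Gaussian parameters are passed in directly instead of being produced by a network, and the size and orientation encodings are taken to be log-size and a (sin, cos) pair.

```python
import math
import random

def mixture_predict(scores, means, stds, rng):
    """Weighted sum over modes of one Gaussian sample per mode.
    scores[k]: mode score in (0, 1); means[k], stds[k]: per-dimension lists
    (diagonal covariance is an assumption of this sketch)."""
    y = [0.0] * len(means[0])
    for w, mu, sd in zip(scores, means, stds):
        sample = [rng.gauss(m, s) for m, s in zip(mu, sd)]
        y = [acc + w * v for acc, v in zip(y, sample)]
    return y

def decode_box(cluster_center, y_c, y_s, y_theta):
    """Turn regressed values into box parameters."""
    center = [c + o for c, o in zip(cluster_center, y_c)]  # offset from cluster center
    size = [math.exp(v) for v in y_s]                      # size regressed as log(s)
    angle = math.atan2(y_theta[0], y_theta[1])             # (sin, cos) pair to angle
    return center, size, angle

rng = random.Random(0)
y_c = mixture_predict([0.5, 1.0], [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
                      [[0.0] * 3, [0.0] * 3], rng)
box = decode_box([1.0, 1.0, 1.0], y_c, [0.0, 0.0, 0.0], [1.0, 0.0])
```

Setting the standard deviations to zero, as in this call, makes the output deterministic and equal to the score-weighted sum of the mode means, which is convenient for checking the arithmetic.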

Multi-modal Prediction.

In Eq. 3, the learnable parameters are $f^k_\tau(\cdot)$ and $(\mu^k_\tau, \Sigma^k_\tau)$. $f^k_\tau(\cdot)$ is realized with an MLP followed by a sigmoid function, and $(\mu^k_\tau, \Sigma^k_\tau)$ are learned embeddings shared among all samples. During training, we sample $y^k_\tau$ from each mode and predict $y_\tau$ using Eq. 3. To generate diverse and plausible hypotheses during inference, we not only sample $y^k_\tau$, but also sample different modes by randomly disregarding mixture elements based on their probabilities $f^k_\tau(\cdot)$. Then we obtain $y_\tau$ over the $K$ modes as follows:

$$y_\tau = \sum_{k=1}^{K} I^k_\tau \cdot y^k_\tau, \qquad I^k_\tau \sim \mathrm{Bern}(f^k_\tau), \qquad y^k_\tau \sim \mathcal{N}(\mu^k_\tau, \Sigma^k_\tau), \tag{4}$$

where $I^k_\tau$ is sampled from a Bernoulli distribution with probability $f^k_\tau(\cdot)$. We also sample the object classes by the predicted classification probabilities, and discard proposed object boxes with low objectness (below a threshold $t_o$) after 3D NMS.

We can then generate $N_h$ hypotheses in a scene; each hypothesis is an average of $N_s$ samples of $y_\tau$, which empirically strikes a good balance between diversity and accuracy of the set of hypotheses. To obtain the maximum likelihood prediction, we use $f^k_\tau(\cdot)$ and the mean value $\mu^k_\tau$ instead of $I^k_\tau$ and $y^k_\tau$ to estimate the boxes with Eq. 3.
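The inference procedure described above can be sketched as two small functions: one that averages several masked samples into a hypothesis, and one that returns the maximum likelihood estimate. As before, this is a hedged sketch: the function names are hypothetical, a diagonal covariance is assumed, and the number of samples per hypothesis is left as an explicit argument because its value is not stated in the text.

```python
import random

def sample_hypothesis(scores, means, stds, rng, n_samples):
    """Average of n_samples draws in which each mode is kept with
    probability equal to its score (a Bernoulli mask) and then sampled."""
    dim = len(means[0])
    total = [0.0] * dim
    for _ in range(n_samples):
        for w, mu, sd in zip(scores, means, stds):
            if rng.random() < w:  # Bernoulli draw with probability w
                for d in range(dim):
                    total[d] += rng.gauss(mu[d], sd[d])
    return [t / n_samples for t in total]

def max_likelihood(scores, means):
    """Score-weighted sum of the mode means: the scores replace the
    Bernoulli mask and the means replace the Gaussian samples."""
    dim = len(means[0])
    return [sum(w * mu[d] for w, mu in zip(scores, means)) for d in range(dim)]

rng = random.Random(0)
hyp = sample_hypothesis([1.0, 0.0], [[2.0, -1.0], [5.0, 5.0]],
                        [[0.0, 0.0], [0.0, 0.0]], rng, n_samples=4)
ml = max_likelihood([0.5, 0.25], [[2.0, 4.0], [4.0, 8.0]])
```

Repeated calls to `sample_hypothesis` with non-zero standard deviations and intermediate scores give different box parameters each time, which is where the diversity of hypotheses comes from, while `max_likelihood` always returns the same estimate.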

3.5 Loss Function

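The loss function consists of classification losses for objectness $\mathcal{L}_{obj}$ and class label $\mathcal{L}_{cls}$, and regression losses for votes $\mathcal{L}_v$, box center $\mathcal{L}_c$, size $\mathcal{L}_s$ and orientation angle $\mathcal{L}_\theta$.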

Classification Losses. Similar to [Qi_2019_ICCV], $\mathcal{L}_{obj}$ and $\mathcal{L}_{cls}$ are supervised by cross entropy losses, wherein the objectness score is used to classify whether a vote cluster center is close to (positive) or far from (negative) the ground truth, according to distance thresholds in meters. Proposals from the clusters with positive objectness are further supervised with box regression losses.

Regression Losses. We supervise all the predicted votes, box centers, sizes and orientations with a Huber loss. For poses that are located within the distance threshold $t_d$ of an object, we use the closest object center to supervise their vote. Votes from those poses that are far from all objects are not considered. For center predictions, we use their nearest ground-truth center to calculate $\mathcal{L}_c$. Since box sizes and orientations are predicted from vote clusters, we use the counterpart from the ground-truth box that is nearest to the vote cluster for supervision. Then the final loss function is $\mathcal{L} = \sum_\tau \lambda_\tau \mathcal{L}_\tau$, where $\mathcal{L}_\tau \in \{\mathcal{L}_{obj}, \mathcal{L}_{cls}, \mathcal{L}_v, \mathcal{L}_c, \mathcal{L}_s, \mathcal{L}_\theta\}$ and $\lambda_\tau$ are constant weights that balance the losses.
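To make the combination of loss terms concrete, the sketch below implements a Huber loss and a weighted sum of named loss terms. The weight values are those reported in the training details (5, 1, and four times 10); their assignment to the individual terms is inferred from the order in which the losses are listed, and the Huber threshold `delta` and all names are assumptions of this sketch.

```python
def huber(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)

# Loss weights; the mapping of values to terms is an assumption.
WEIGHTS = {"obj": 5.0, "cls": 1.0, "vote": 10.0,
           "center": 10.0, "size": 10.0, "theta": 10.0}

def total_loss(terms, weights=WEIGHTS):
    """Weighted sum of the individual loss terms present in `terms`."""
    return sum(weights[name] * value for name, value in terms.items())

small = huber([0.5], [0.0])   # quadratic regime
large = huber([3.0], [0.0])   # linear regime
loss = total_loss({"obj": 1.0, "cls": 2.0, "vote": 0.5})
```

The Huber loss behaves like a squared error near the target and like an absolute error far from it, which keeps poorly matched votes or boxes from dominating the gradient.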

4 Experiment Setup

Dataset. We introduce a new dataset for the task of estimating scene object configuration from a human pose trajectory observation, as existing interaction-based datasets are limited to single objects, focus on 2D data without 3D information, rely on imprecise pose tracking, or lack dynamics with only static interactions. We construct our dataset using the simulation environment VirtualHome [puig2018virtualhome], which is built on the Unity3D game engine. It consists of 29 rooms, with each room containing 88 objects on average; each object is annotated with available interaction types. VirtualHome allows customization of action scripts and control over humanoid agents to execute a series of complex interactive tasks. We refer readers to [puig2018virtualhome] for the details of the scene and action types. In our work, we focus on the static, interactable objects under 17 common class categories. In each room, we select up to 10 random objects to define the scene, and script the agent to interact with each of the objects in a sequential fashion. For each object, we also select a random interaction type associated with the object class category. Then we randomly sample 14K different sequences with corresponding object boxes to construct the dataset. During training, we also randomly flip, rotate and translate the scenes and poses for data augmentation. For additional detail about data generation, we refer to the supplemental.


Implementation. We train P2R-Net end-to-end from scratch with a batch size of 32 on 4 NVIDIA 2080 Ti GPUs for 180 epochs until convergence, using Adam as the optimizer. The learning rate is 1e-3 for the first 80 epochs and is then decayed every 40 epochs. The losses are weighted by $\lambda_{obj}$=5, $\lambda_{cls}$=1, $\lambda_v$=$\lambda_c$=$\lambda_s$=$\lambda_\theta$=10 to balance the loss values. During training, we use a pose distance threshold of $t_d$=1 m. At inference time, we output box predictions after 3D NMS with an IoU threshold of 0.1. We use an objectness threshold of $t_o$=0.5. For more detailed architecture and data specifications, we refer to the supplemental.

Evaluation. To evaluate our task, we consider two types of evaluation splits: a sequence-level split across different interaction sequences, and a room-level split across different rooms as well as interaction sequences. Note that sequences are trained with and evaluated against only the objects that are interacted with during the input observation, resulting in different variants of each room under different interaction sequences. For the sequence-level split, the train/test split ratio is 4:1 over the generated sequences. The room-level split is a more challenging setup, with 27 train rooms and 2 test rooms, resulting in 13K/1K sequences. Since the task is inherently ambiguous, and only a single ground truth configuration of each room is available, we evaluate multi-modal predictions by several metrics: mAP@0.5 evaluates the mean average precision with the IoU threshold at 0.5 of the maximum likelihood prediction; MMD evaluates the Minimal Matching Distance [achlioptas2018learning] of the best matching prediction with the ground truth out of 10 sampled hypotheses to measure their quality; TMD evaluates the Total Mutual Diversity [wu2020multimodal] to measure the diversity of the 10 hypotheses. We provide additional detail about MMD and TMD in the supplemental.

Figure 5: Qualitative results of object detection from a pose trajectory on the sequence-level split (unseen interaction sequences). Panels: (a) Input, (b) Pose-VoteNet, (c) Pose-VN, (d) Motion Attention, (e) Ours, (f) GT.

Figure 6: Qualitative results of object detection from a pose trajectory on the room-level split (unseen interaction sequences and rooms). Panels: (a) Input, (b) Pose-VoteNet, (c) Pose-VN, (d) Motion Attention, (e) Ours, (f) GT.

5 Results and Analysis

We evaluate our approach on the task of scene object configuration estimation from a pose trajectory observation, in comparison with baselines constructed from state-of-the-art 3D detection and pose understanding methods, as well as an ablation analysis of our multi-modal prediction.

5.1 Baselines

Since there are no prior works that tackle the task of predicting the object configuration of a scene from solely a 3D pose trajectory, we construct several baselines leveraging state-of-the-art techniques as well as various approaches to estimate multi-modal distributions. We consider the following baselines: 1) Pose-VoteNet [Qi_2019_ICCV]. Since VoteNet is designed for detection from point clouds, we replace its PointNet++ encoder with our position encoder + MLPs to learn joint features for seeds. 2) Pose-VN, a Pose-VoteNet variant based on Vector Neurons [deng2021vn], which replaces MLP layers in Pose-VoteNet with SO(3)-equivariant operators that can capture arbitrary rotations of poses to estimate objects. 3) Motion Attention [mao2020history]. Since our task can also be regarded as a sequence-to-sequence problem, we adopt a frame-wise attention encoder to extract repetitive pose patterns in the temporal domain, which then inform a VoteNet decoder to regress boxes. Additionally, we ablate our probabilistic mixture decoder against other alternatives: 4) Deterministic P2R-Net (P2R-Net-D), where we use the VoteNet decoder [Qi_2019_ICCV] in our method for box regression to produce deterministic results; 5) Generative P2R-Net (P2R-Net-G), where our P2R-Net decoder is designed with a probabilistic generative model [wu2016learning] to decode boxes from a learned latent variable; 6) Heatmap P2R-Net (P2R-Net-H), where the box center, size and orientation are discretized into binary heatmaps, and the box regression is converted into a classification task. Detailed architecture specifications for these networks are given in the supplemental material.

Figure 7: Multi-modal predictions of P2R-Net. By sampling our probabilistic decoder multiple times, we can obtain different plausible box predictions. Here, we show three randomly sampled hypotheses and the maximum likelihood prediction for each input.
Method            bed    bench  cabinet  chair  desk   dishwasher  fridge  lamp  sofa   stove  toilet  computer  mAP@0.5
Pose-VoteNet      2.90   15.00  33.14    18.77  58.52  32.14       0.00    6.07  62.32  49.82  0.00    3.06      25.70
Pose-VN           20.81  18.13  49.76    18.68  70.92  33.56       0.00    5.60  67.24  46.76  0.00    6.11      29.90
Motion Attention  36.42  7.54   23.35    19.50  77.71  15.59       17.13   2.35  78.61  51.03  14.81   5.50      28.39
P2R-Net-D         93.77  12.63  11.98    5.77   95.93  61.80       73.95   0.58  88.44  70.42  0.00    14.17     34.91
P2R-Net-G         91.69  7.56   36.61    10.05  93.47  67.53       77.45   1.21  92.97  64.86  5.56    18.67     37.48
P2R-Net-H         85.84  8.04   22.04    10.91  76.08  55.20       55.15   0.00  83.92  57.60  5.00    4.33      31.41
P2R-Net           94.21  10.12  54.72    8.02   93.32  56.33       59.89   3.25  90.92  57.86  61.11   19.94     42.20

Table 1: Quantitative evaluation on the sequence-level split. For P2R-Net-G, P2R-Net-H and ours, we use the maximum likelihood predictions to calculate mAP scores. The mAP@0.5 is averaged over all 17 classes (see the full table with all classes in the supplementary file).

5.2 Qualitative Comparisons

Comparisons on the sequence-level split. Figure 5 visualizes predictions on the test set of unseen interaction sequences. Pose-VoteNet struggles to identify the existence of an object, leading to many missed detections, but can estimate reasonable object locations when an object is predicted. Pose-VN alleviates this problem of under-detection, but struggles to estimate object box sizes (rows 1, 3). These baselines indicate the difficulty in detecting objects without sharing pose features among temporal neighbors. Motion Attention [mao2020history] addresses this by involving global context with inter-frame attention. However, it does not take advantage of the skeleton’s spatial layout in continuous frames and struggles to detect the existence of objects (rows 1, 2). In contrast, our method leverages both target-dependent poses and object occupancy context, learning the implicit interrelations between poses and objects to infer object boxes, and achieves a better estimate of the scene configuration.

Comparisons on the room-level split. In Figure 6, we illustrate the qualitative comparisons on the test set of unseen interaction sequences in unknown rooms. In this scenario, most baselines fail to localize objects, while our method can nonetheless produce plausible object layouts.

Multi-modal Predictions. We additionally visualize various sampled hypotheses from our model in Figure 7, showing that our method is able to deduce the spatial occupancy of objects from motion trajectories, and enables diverse, plausible estimation of object locations, orientation, and sizes for interactions.

5.3 Quantitative Comparisons

We use mAP@0.5 to measure object detection accuracy for all baselines, and evaluate the accuracy and diversity of a set of output hypotheses using minimal matching distance (MMD) and total mutual diversity (TMD).

Detection Accuracy. Table 1 shows a quantitative comparison on the sequence-level split, where we observe that Pose-VoteNet and Pose-VN struggle to recognize some object categories (e.g., bed, fridge and toilet). By leveraging the inter-frame connections, Motion Attention achieves improved performance in recognizing object classes, but struggles with detecting objectness and predicting object sizes. In contrast, our position and pose encoder learns both the spatial and temporal signals from motions to estimate likely object locations by leveraging the potential connections between humans and objects, with our probabilistic mixture decoder better capturing likely modalities in various challenging ambiguous interactions (e.g., toilet and cabinet). In Table 3, we compare mAP scores on the room-level split, with increased relative improvement in the challenging scenario of scene object configuration estimation in new rooms.

Method     mAP@0.5  MMD    TMD
P2R-Net-G  37.48    37.43  1.73
P2R-Net-H  31.41    21.10  2.39
P2R-Net    42.20    38.28  3.34

Table 2: Comparisons on detection accuracy, and multi-modal quality and diversity on the sequence-level split. TMD=1 indicates no diversity.

Method            mAP@0.5  MMD    TMD
Pose-VoteNet      10.23    -      -
Pose-VN           6.12     -      -
Motion Attention  6.53     -      -
P2R-Net-D         31.56    -      -
P2R-Net-G         31.59    31.83  1.98
P2R-Net-H         27.96    20.71  3.28
P2R-Net           35.34    32.47  4.35

Table 3: Comparisons on detection accuracy, and multi-modal quality and diversity on the room-level split. TMD=1 indicates no diversity.

Quality and Diversity of Multi-modal Predictions.

We study the multi-modal predictions with our ablation variants P2R-Net-G and P2R-Net-H, and use MMD and TMD to evaluate the quality and diversity of 10 randomly sampled predictions for each method. From the 10 predictions, MMD records the best detection score (mAP@0.5), and TMD measures the average variance of the 10 semantic boxes per object. Tables 2 and 3 present the MMD and TMD scores on the sequence-level and room-level splits, respectively. P2R-Net-G tends to predict with low diversity, as seen in its low TMD score (TMD=1 indicates identical samples) and its similar mAP@0.5 and MMD scores. P2R-Net-H shows better diversity but with lower accuracy in both splits. Our final model not only achieves the best detection accuracy, it also provides reasonable and diverse object configurations, with a notable performance improvement in the challenging room-level split.

5.4 Ablations

In Table 4, we explore the effects of each individual module (relative position encoder, spatio-temporal pose encoder and probabilistic mixture decoder). We ablate P2R-Net by gradually adding components to a baseline that has none of them: without relative position encoding (Pstn-Enc), we use joints’ global coordinates relative to the room center; without spatio-temporal pose encoding (Pose-Enc), we use MLPs to learn pose features from joint coordinates; without the probabilistic mixture decoder (P-Dec), we use 2-layer MLPs to regress box parameters.

Pstn-Enc  Pose-Enc  P-Dec  mAP@0.5 (sequence-level / room-level)
-         -         -      8.71 / 3.12
✓         -         -      34.43 / 19.07
✓         ✓         -      39.98 / 29.25
✓         ✓         ✓      42.20 / 35.34 (Full)

Table 4: Ablation study of our design choices.

From the comparisons, we observe that relative position encoding plays the most significant role. It allows our model to pick up on local pose signal, as many human-object interactions present with strong locality. The spatio-temporal pose encoder then enhances the pose feature learning, and enables our model to learn the joint changes both in spatial and temporal domains. This design largely improves our generalization ability, particularly in unseen rooms (from 19.07 to 29.25). The probabilistic decoder further alleviates the ambiguity of this problem, and combining all the modules together produces the best results.

Tolerance to noise. As real data often contain noise, we additionally study the effect of adding Gaussian noise (std=5 cm) to the xyz coordinates of all joints in training and testing. Table 5 shows the effect of different noise levels in evaluation. We also visualize some sampled predictions at a noise level of 10 in Figure 7(b), where our method shows compelling tolerance to very noisy inputs.
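The noise perturbation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `add_joint_noise`, the (frames, joints, 3) array layout and the use of metres as the unit are assumptions made for the example.

```python
import numpy as np

def add_joint_noise(trajectory, std_m=0.05, rng=None):
    """Return a copy of a (N, J, 3) joint trajectory with i.i.d. Gaussian noise.

    std_m is the noise standard deviation in metres (0.05 m = 5 cm).
    The array layout and the unit are assumptions of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    trajectory = np.asarray(trajectory, dtype=np.float64)
    return trajectory + rng.normal(0.0, std_m, size=trajectory.shape)

# Toy trajectory: 200 frames, 5 joints, all at the origin.
clean = np.zeros((200, 5, 3))
noisy = add_joint_noise(clean, std_m=0.05, rng=np.random.default_rng(0))
print(noisy.shape, float(noisy.std()))
```

Applying the same function with a larger `std_m` at test time would correspond to evaluating under a higher noise level.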

Noise level
38.58 38.77 38.36 37.16 31.71
27.16 26.13 27.72 29.18 26.18
Table 5: mAP@0.5 under varying levels of noise (=1 cm).
Figure 8: Predictions on noisy inputs (std=10 cm). Panels: (a) Prediction, (b) GT, (c) Prediction, (d) GT.

Limitations. Although P2R-Net achieves plausible scene estimations from only pose trajectories, it operates on several assumptions that can lead to potential limitations: (1) Objects should be interactable, e.g., our method may not detect objects that do not induce strong pose interactions, such as a mirror or a picture; (2) Interactions occur at close range, e.g., we may struggle to detect a TV from a person switching it on with a remote control. Additionally, we currently focus on estimating static object boxes. We believe an interesting avenue for future work is to characterize objects in motion (e.g., due to grabbing) or articulated objects (e.g., laptops).

Broader Impact. Our approach learns to infer scene configurations from human behavioral data, for which real-world data collection and use requires careful consideration to preserve privacy. Additionally, possible biases learned by our model could potentially be analyzed with uncertainty estimation techniques to discover model or data biases.

6 Conclusion

We have presented a first exploration of the ill-posed problem of estimating the 3D object configuration in a scene from only a 3D pose trajectory observation of a person interacting with the scene. Our proposed model P2R-Net leverages spatio-temporal features from the pose trajectory to vote for likely object positions and inform a new probabilistic mixture decoder that captures multi-modal distributions of object box parameters. We demonstrate that such a probabilistic approach can effectively model likely object configurations in a scene, producing plausible object layout hypotheses from an input pose trajectory. We hope that this establishes a step towards object-based 3D understanding of environments using non-visual signals and opens up new possibilities in leveraging ego-centric motion for 3D perception and understanding.


This project is funded by the Bavarian State Ministry of Science and the Arts and coordinated by the Bavarian Research Institute for Digital Transformation (bidt), the TUM Institute of Advanced Studies (TUM-IAS), the ERC Starting Grant Scan2CAD (804724), and the German Research Foundation (DFG) Grant Making Machine Learning on Static and Dynamic 3D Data Practical.



In the supplementary material, we describe network parameters and specifications (Sec. A), details of our data generation and distribution (Sec. B), evaluation metrics in our experiments (Sec. C), additional quantitative results (Sec. D), additional qualitative results (Sec. E), and qualitative results on real data (Sec. F).

Appendix A Network Specifications

We detail the full list of network parameters and specifications in this section. MLP layers used in our network are uniformly denoted by MLP[$l_1, \dots, l_d$], where $l_i$ is the number of neurons in the $i$-th layer. Each layer is followed by a batch normalization and a ReLU layer except the final one. We also report the efficiency and memory usage during inference at the end. Our code will be publicly available.
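To make the MLP notation concrete, the following NumPy sketch expands a list of layer sizes into a forward pass in which every layer except the last is followed by normalization and a ReLU. It is only an illustration under stated assumptions: the helper names `init_mlp` and `mlp_forward`, the input dimension of 128, the random initialization, and the use of plain batch statistics without learnable scale and shift are choices made for this example, not details from the paper.

```python
import numpy as np

def init_mlp(in_dim, layer_sizes, rng):
    """Create (weight, bias) pairs for an MLP with the given output sizes."""
    dims = [in_dim] + list(layer_sizes)
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def mlp_forward(x, params, eps=1e-5):
    """Apply the MLP to a batch x of shape (B, in_dim)."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:  # the final layer has no normalization / ReLU
            # Simplified batch normalization: batch statistics only.
            x = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
            x = np.maximum(x, 0.0)
    return x

rng = np.random.default_rng(0)
params = init_mlp(128, [128, 128, 128, 100], rng)  # an MLP with four layers
out = mlp_forward(rng.standard_normal((8, 128)), params)
print(out.shape)
```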

a.1 Skeleton Configuration

In P2R-Net, the input is a pose trajectory, given as the sequence of 3D joint locations over all frames. For each trajectory, the humanoid agent interacts with up to 10 different objects in a scene, with the frame number varying among sequences (depending on the object interactions). To enable mini-batch training, we uniformly sample a fixed number of frames per sequence for training.
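A fixed-length resampling of variable-length trajectories could look like the sketch below. It reads "uniformly sample" as evenly spaced frame indices, which is an assumption; the function name `sample_frames_uniformly` and the target of 64 frames are hypothetical values chosen for illustration only.

```python
import numpy as np

def sample_frames_uniformly(trajectory, num_frames):
    """Pick num_frames evenly spaced frames from a (N, J, 3) trajectory.

    If the trajectory has fewer than num_frames frames, some frames repeat.
    """
    trajectory = np.asarray(trajectory)
    idx = np.linspace(0, len(trajectory) - 1, num_frames).round().astype(int)
    return trajectory[idx], idx

# Two toy sequences of different lengths become stackable into one batch.
seq_a = np.random.default_rng(0).standard_normal((509, 5, 3))
seq_b = np.random.default_rng(1).standard_normal((312, 5, 3))
batch = np.stack([sample_frames_uniformly(s, 64)[0] for s in (seq_a, seq_b)])
print(batch.shape)
```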

For the human skeleton structure, we use the predefined human body model in VirtualHome [puig2018virtualhome], which uses the Unity3D human body template. We refer readers to [unity3dskeleton] for the detailed definition of the skeleton specifications. The root joint that we use is the centroid of the hips, as illustrated in Figure 9.

Figure 9: Human skeleton configuration, following VirtualHome [puig2018virtualhome].

a.2 Relative Position Encoder

We list the layer details of our relative position encoder in Figure 10. In Section 3.1, the output pose feature dimension , the number of temporal neighbors . From the input pose trajectory , it outputs the relative pose features for spatio-temporal encoding.

Figure 10: Relative position encoder.

a.3 Spatio-Temporal Pose Encoder

Figure 11 illustrates the layer details of our spatio-temporal pose encoder in Section 3.2. From the input pose features, we use a graph convolutional layer to learn intra-skeleton joint features. Edges in the graph convolution are constructed following the skeleton bones (as in Figure 9), which encodes skeleton-wise spatial information. We use the method of [yan2018spatial] to build the edge connections from skeleton bones. With the learned joint features on each skeleton, we then adopt a 1-D convolutional layer (feature dimension 64, kernel size 3, padding size 1) to process joint features in the temporal domain. The kernel size determines the receptive field over neighboring frames. We connect a graph layer and a 1-D convolutional layer into a block with a residual summation from the input (see Figure 11). We duplicate the block six times and stack the copies in sequence to construct the spatio-temporal pose encoder. After the spatio-temporal layers, we flatten all joint features in a skeleton into a single feature vector per pose, followed by an MLP[256] that processes each pose and produces the final spatio-temporal pose features.
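The temporal part of such a block can be sketched in NumPy as below. With kernel size 3, padding 1 and stride 1, the output keeps the same number of frames as the input, which is what makes the residual summation possible. The graph convolution is left out, and the function names, the random weights and the exact placement of the residual are assumptions of this sketch rather than the paper's implementation.

```python
import numpy as np

def temporal_conv1d(x, weight, bias):
    """1-D convolution over frames with kernel size 3 and zero padding 1.

    x: (N, C_in) features of one joint over N frames.
    weight: (3, C_in, C_out), bias: (C_out,).
    """
    n = x.shape[0]
    padded = np.pad(x, ((1, 1), (0, 0)))   # one zero frame on each side
    out = np.zeros((n, weight.shape[2]))
    for k in range(3):                     # taps on frames t-1, t, t+1
        out += padded[k:k + n] @ weight[k]
    return out + bias

def residual_temporal_block(x, weight, bias):
    """Temporal convolution with a residual summation from the input."""
    return x + temporal_conv1d(x, weight, bias)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 64))          # 10 frames, 64 channels
blocks = [(rng.standard_normal((3, 64, 64)) * 0.01, np.zeros(64)) for _ in range(6)]
for w, b in blocks:                        # six stacked blocks
    x = residual_temporal_block(x, w, b)
print(x.shape)
```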

Figure 11: Spatio-temporal pose encoder.

a.4 Locality-Sensitive Voting

In Section 3.3 of the main paper, we sample seeds from the root joints. The seeds are uniformly sampled along the trajectory of root joints to ensure an even spatial distribution, and each seed carries the corresponding pose feature. In Eq. 2 of the paper, we use two MLPs to learn the vote locations and vote features from the seeds; the two MLPs share their first two layers and predict their respective targets from the last layer. We illustrate them in Figure 12.

From the 512 votes , we group them into clusters following [Qi_2019_ICCV], which results in cluster centers and features, where , , =128. For poses whose root joint is not close to any object during training (beyond a distance threshold =1 m), we do not consider them to vote for any object.

Figure 12: Locality-sensitive voting.

a.5 Probabilistic Mixture Decoder

In Section 3.4 of the main paper, we learn 100 Gaussian distributions for each regression target, namely the box center, size and orientation. For each distribution, we also learn a mode score that serves as its weight in predicting the target (see Eq. 3 in the paper). Each proposal box is predicted from a cluster center and its feature.

The learnable parameters are the means and the covariance matrices of the Gaussian distributions for each target. The dimension of each target is 3 for the box center, 3 for the box size and 2 for the box orientation. We assume that the variables in each Gaussian are independently distributed, resulting in a diagonal covariance matrix. We formulate these parameters as learnable embeddings shared among all samples in training and testing. Since each covariance matrix is diagonal and non-negative, we use an exponential function to process its diagonal elements.

The mode scores for each target are predicted with an MLP, where we use MLP[128,128,128,100] followed by a sigmoid layer.

For the proposal objectness and the probability distribution over class categories, we predict them directly from the cluster feature with MLP[128,128,2+17], where 2 is the number of objectness outputs and 17 is the number of object categories, followed by a softmax layer to obtain their probability scores in [0, 1].

To generate a hypothesis for each object during inference, we sample box parameters with Eq. 4 in our paper. Each hypothesis is an average of several samples of the box parameters, where the number of samples is itself a random number. Additionally, we sample the object class based on the predicted classification probability distribution.
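As a rough illustration of sampling from a learned mixture with diagonal covariances, consider the sketch below. It is not Eq. 4 of the paper, which is not reproduced in this section: normalizing the mode scores into selection probabilities, storing log-variances, and the names `sample_target`, `log_vars` and `num_draws` are all assumptions made for this example.

```python
import numpy as np

def sample_target(means, log_vars, mode_scores, rng, num_draws=1):
    """Draw one hypothesis for a regression target from a Gaussian mixture.

    means, log_vars: (K, D) embeddings for K modes of a D-dimensional target.
    mode_scores: (K,) scores in [0, 1], e.g. sigmoid outputs.
    The hypothesis is the average of num_draws samples.
    """
    probs = mode_scores / mode_scores.sum()   # assumption: scores act as weights
    std = np.sqrt(np.exp(log_vars))           # exp keeps diagonal variances positive
    modes = rng.choice(len(means), size=num_draws, p=probs)
    noise = rng.standard_normal((num_draws, means.shape[1]))
    draws = means[modes] + std[modes] * noise
    return draws.mean(axis=0)

rng = np.random.default_rng(0)
K, D = 100, 3                                  # 100 modes, 3-D box center
means = rng.standard_normal((K, D))
log_vars = np.full((K, D), -2.0)
scores = rng.uniform(size=K)
hypotheses = np.stack([sample_target(means, log_vars, scores, rng, num_draws=3)
                       for _ in range(10)])
print(hypotheses.shape)
```

Repeating the call yields different hypotheses for the same input, which is the behavior the multi-modal evaluation relies on.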

After a 3D non-maximum suppression, we discard the proposed object boxes that have low objectness scores, which yields the final hypotheses in a scene.
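The filtering step can be sketched as follows. For brevity the sketch uses axis-aligned boxes, whereas the paper predicts oriented boxes, and the two thresholds are hypothetical placeholder values because the actual ones are not given in this section. The function names are also illustrative.

```python
import numpy as np

def iou_3d_axis_aligned(a, b):
    """IoU of two axis-aligned boxes given as [cx, cy, cz, sx, sy, sz]."""
    a_min, a_max = a[:3] - a[3:] / 2, a[:3] + a[3:] / 2
    b_min, b_max = b[:3] - b[3:] / 2, b[:3] + b[3:] / 2
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter = overlap.prod()
    union = a[3:].prod() + b[3:].prod() - inter
    return inter / union

def nms_then_filter(boxes, scores, iou_thresh=0.25, score_thresh=0.5):
    """Greedy 3D NMS followed by removal of low-objectness boxes.

    Both thresholds are placeholder values for this sketch.
    Returns the indices of the kept boxes, highest score first.
    """
    keep = []
    for i in np.argsort(-scores):
        if all(iou_3d_axis_aligned(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return [int(i) for i in keep if scores[i] >= score_thresh]

boxes = np.array([[0.0, 0, 0, 1, 1, 1],
                  [0.1, 0, 0, 1, 1, 1],    # heavily overlaps the first box
                  [5.0, 5, 5, 1, 1, 1]])   # isolated but low objectness
scores = np.array([0.9, 0.8, 0.3])
kept = nms_then_filter(boxes, scores)
print(kept)
```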

a.6 Efficiency and Memory in Inference

We train our network with a batch size of 32 on 4 NVIDIA RTX 2080 GPUs using PyTorch 1.7.1, and test it on a single GPU. We report the model size, inference time and allocated GPU memory for a single forward pass.

Model size Avg. time Peak time Avg. memory Peak memory
2.04 M 0.092 s 0.582 s 11.64 MB 260.82 MB
Table 6: Model size, efficiency and memory usage of P2R-Net.

a.7 Specifications of Baselines

We explain the network details of each baseline below.


Pose-VoteNet is a variant of VoteNet [Qi_2019_ICCV] adapted to vote for object centers from human poses. In our experiments, we replace the PointNet++ in the original VoteNet with our position encoder, which produces relative pose features. We then flatten all joint features for each pose and learn the seed feature with MLP[256, 256, 256]. As in our method, each seed is located at the root joint. We keep the remaining structures and loss functions consistent with VoteNet.


Pose-VN

To enable Pose-VoteNet to capture the rotation information of poses, we augment it by replacing the MLP layers in the Pose-VoteNet encoder with vector neurons [deng2021vn]. Vector neurons are a set of SO(3)-equivariant operators that capture arbitrary rotations of object poses to estimate objects. We replace each MLP layer in the Pose-VoteNet encoder with a ‘VNLinearLeakyReLU’ layer, and the final MLP layer with a ‘VNLinear’ layer, with an equal number of parameters. For the details of the layer design in vector neurons, we refer readers to [deng2021vn].

Motion Attention

We replace the MLP layers (i.e., MLP[256, 256, 256]) in the Pose-VoteNet encoder with the attention module in [mao2020history] to learn inter-frame pose features over the entire temporal domain. Specifically, we use each pose feature to query similar features among all frames, which assembles repetitive pose patterns to regress object boxes. For the layer details of motion attention, we refer readers to [mao2020history].


P2R-Net-D

We replace our probabilistic decoder with the VoteNet decoder [Qi_2019_ICCV], along with its loss functions, to produce deterministic results.


P2R-Net-G

We ablate the P2R-Net decoder with a probabilistic generative model [wu2016learning, nie2021rfd], where we first learn a latent code from the cluster features. By decoding from the sum of the latent code and the cluster feature, we can predict box parameters in a probabilistic generative way.


P2R-Net-H

We discretize each regression target into a binary heatmap: box centers are discretized into bins in [-0.3 m, 0.3 m]^3, centered at the cluster centers; box sizes are discretized into bins in [0.05 m, 3 m]^3; box orientations are discretized into 12 bins in [-π, π]. Box regression is thereby converted into a classification task, which we train with cross-entropy losses. In testing, we sample the heatmaps to produce different predictions.
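The discretization used by this heatmap variant can be sketched as below for a single scalar target. The example uses the 12 orientation bins stated above and assumes the orientation range is [-π, π]; equal-width bins, decoding to bin centers, and the function names are further assumptions of this sketch.

```python
import numpy as np

def to_bin(value, low, high, num_bins):
    """Index of the equal-width bin that contains value in [low, high]."""
    idx = np.floor((np.asarray(value) - low) / (high - low) * num_bins).astype(int)
    return np.clip(idx, 0, num_bins - 1)   # the upper edge falls into the last bin

def bin_center(idx, low, high, num_bins):
    """Continuous value at the center of bin idx."""
    return low + (np.asarray(idx) + 0.5) * (high - low) / num_bins

def sample_from_heatmap(probs, low, high, rng):
    """Sample a bin from a 1-D probability heatmap and decode it to a value."""
    idx = rng.choice(len(probs), p=probs)
    return float(bin_center(idx, low, high, len(probs)))

low, high, num_bins = -np.pi, np.pi, 12
label = int(to_bin(0.3, low, high, num_bins))   # classification target for 0.3 rad
probs = np.full(num_bins, 1.0 / num_bins)       # a uniform heatmap for illustration
angle = sample_from_heatmap(probs, low, high, np.random.default_rng(0))
print(label, angle)
```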

Appendix B Data Generation and Data Statistics

We create our dataset using the VirtualHome simulation environment [puig2018virtualhome], which is built on the Unity3D game engine. It consists of 29 rooms, where each room has 88 objects on average. Each object is annotated with available interaction types. For the detailed specification of interaction manners for different object categories, we refer to [puig2018virtualhome, vh_homepage].

VirtualHome allows users to customize action scripts to direct humanoid agents to execute a series of complex interactive tasks. In our work, we focus on the static, interactable objects under 17 common class categories (i.e., bed, bench, bookshelf, cabinet, chair, desk, dishwasher, faucet, fridge, lamp, microwave, monitor, nightstand, sofa, stove, toilet, computer).

In each room, we select up to 10 random objects in the scene, and script the agent to interact with each of the objects in a sequential fashion. For each object, we also select a random interaction type associated with the object class category. Sequences are trained with and evaluated against only the objects that are interacted with during the input observation, resulting in different variants of each room under different interaction sequences.

Figure 13: Distribution over number of objects in a pose trajectory.

We then randomly sample 13,913 different sequences with corresponding object boxes to construct the dataset. The human pose trajectories are recorded at a frame rate of 5 frames per second. Over the sequences, the average number of objects is 7.86, and the average frame length is 509.34. The distributions of the frame length and the object number in an interaction trajectory are shown in Figure 14 and Figure 13, respectively. The interaction frequency for each object class category is illustrated in Figure 15, and we also list the frequency of each interaction type in Figure 16.

Figure 14: Distribution of frame lengths in pose trajectories.
Figure 15: Frequency of object class categories among generated interactions.
Figure 16: Frequency of interaction types.

Appendix C Evaluation Metrics

In our method, we use mAP@0.5 to evaluate the detection accuracy by comparing our maximum likelihood prediction with the ground-truth. We also use Minimal Matching Distance (MMD) and Total Mutual Diversity (TMD) to respectively evaluate the quality and diversity of our multi-modal predictions. For each input sequence, we sample our probabilistic mixture decoder and produce 10 hypotheses of object arrangements. We adapt MMD from [achlioptas2018learning] to evaluate our task, and measure the best matching mAP@0.5 score with the ground-truth out of the 10 hypotheses. For TMD, we follow [wu2020multimodal] to evaluate the multi-modality of our predictions, formulated as


In Eq. 5, TMD is defined at the object level: for each object in a scene, we have 10 hypotheses, which are represented by 10 bounding boxes with the corresponding 10 predicted class labels. Each box hypothesis can be represented by a point set of its eight box corners. The Shannon entropy term measures the variance of the class labels. The box diversity term measures the diversity of the predicted bounding boxes among hypotheses, and is defined as the average, over hypotheses, of the summed distance from a hypothesis to all other hypotheses, where the distance between two hypotheses is the average Euclidean distance between their pair-wise points.

TMD=1 if all hypotheses are the same, in which case both the entropy term and the box diversity term are 0, which indicates no diversity. In Section 5.3 of the main paper, we report the average TMD score over all objects.
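The two ingredients of TMD described above, the label entropy and the box diversity, can be computed as in the following sketch. How Eq. 5 combines them is not reproduced here, so the sketch deliberately stops at the two terms. Matching corners by index and using the natural logarithm are assumptions, as are the function names.

```python
import numpy as np

def label_entropy(labels):
    """Shannon entropy (natural log) of the predicted class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def box_diversity(corners):
    """Average, over hypotheses, of the summed distance to all other hypotheses.

    corners: (M, 8, 3) box corners of M hypotheses for one object.
    The distance between two hypotheses is the mean Euclidean distance
    between corners with the same index (an assumption of this sketch).
    """
    diff = corners[:, None] - corners[None, :]          # (M, M, 8, 3)
    dist = np.linalg.norm(diff, axis=-1).mean(axis=-1)  # (M, M), zero diagonal
    return float(dist.sum() / len(corners))

# Ten identical hypotheses: both terms vanish, i.e. no diversity.
unit_box = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], float)
same = np.repeat(unit_box[None], 10, axis=0)
print(label_entropy([3] * 10), box_diversity(same))
```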

Appendix D Additional Quantitative Results

We list the mAP@0.5 scores on all object categories for the sequence-level and room-level splits in Table 7 and Table 8, respectively. P2R-Net variants are evaluated by their maximum likelihood predictions to calculate mAP scores. Note that there are fewer categories in the test set of the room-level split.

bed bench bkshlf cabnt chair desk dishws faucet fridge lamp microw monitor nstand sofa stove toilet cmpter mAP@0.5
Pose-VoteNet 2.90 15.00 73.79 33.14 18.77 58.52 32.14 0.00 0.00 6.07 6.49 23.55 51.30 62.32 49.82 0.00 3.06 25.70
Pose-VoteNet + VN 20.81 18.13 78.25 49.76 18.68 70.92 33.56 0.00 0.00 5.60 5.23 22.71 64.46 67.24 46.76 0.00 6.11 29.90
Motion Attention 36.42 7.54 66.25 23.35 19.50 77.71 15.59 0.03 17.13 2.35 6.42 29.87 30.59 78.61 51.03 14.81 5.50 28.39
P2R-Net-D 93.77 12.63 37.21 11.98 5.77 95.93 61.80 0.10 73.95 0.58 7.39 9.68 9.62 88.44 70.42 0.00 14.17 34.91
P2R-Net-G 91.69 7.56 41.27 36.61 10.05 93.47 67.53 0.10 77.45 1.21 6.36 10.51 11.32 92.97 64.86 5.56 18.67 37.48
P2R-Net-H 85.84 8.04 43.78 22.04 10.91 76.08 55.20 0.00 55.15 0.00 1.57 4.24 20.31 83.92 57.60 5.00 4.33 31.41
Ours 94.21 10.12 41.75 54.72 8.02 93.32 56.33 0.06 59.89 3.25 6.49 12.84 46.53 90.92 57.86 61.11 19.94 42.20
Table 7: Quantitative evaluation on the sequence-level split.
bed bench bookshelf cabinet chair desk dishwasher faucet fridge microwave nightstand stove mAP@0.5
Pose-VoteNet 0.00 4.48 22.84 2.30 0.56 4.05 14.64 0.00 0.00 0.61 0.00 73.33 10.23
Pose-VN 0.00 0.96 19.88 7.99 0.51 0.14 10.61 0.00 0.00 0.00 0.01 33.33 6.12
Motion Attention 28.16 1.81 24.19 1.43 0.00 0.00 0.00 0.00 0.00 0.00 22.77 0.00 6.53
P2R-Net-D 39.86 25.17 26.78 3.29 0.18 56.53 76.39 0.00 86.50 30.75 0.00 33.33 31.56
P2R-Net-G 50.65 17.89 13.86 0.54 0.04 56.97 87.61 0.06 70.91 13.76 0.00 66.67 31.59
P2R-Net-H 48.22 10.43 39.12 15.12 0.02 41.26 71.18 0.00 76.10 0.69 0.01 33.33 27.96
P2R-Net 81.48 12.05 17.18 6.43 0.10 72.07 100.00 0.11 63.39 4.61 0.04 66.66 35.34
Table 8: Quantitative evaluation on the room-level split.

Appendix E Additional Qualitative Results

We show additional qualitative results on the sequence-level and room-level test splits in Figures 17-19 and Figure 20, respectively. We additionally visualize various multi-modal predictions from our model in Figure 21.

Figure 17: Additional results on estimating object layouts from a pose trajectory on the sequence-level split (unseen interaction sequences). Panels: (a) Input, (b) Prediction, (c) GT.
Figure 18: Additional results on estimating object layouts from a pose trajectory on the sequence-level split (unseen interaction sequences). Panels: (a) Input, (b) Prediction, (c) GT.
Figure 19: Additional results on estimating object layouts from a pose trajectory on the sequence-level split (unseen interaction sequences). Panels: (a) Input, (b) Prediction, (c) GT.
Figure 20: Additional results on estimating object layouts from a pose trajectory on the room-level split (unseen interaction sequences and rooms). Panels: (a) Input, (b) Prediction, (c) GT.
Figure 21: Additional multi-modal predictions of P2R-Net. By sampling our probabilistic decoder multiple times, we can obtain various plausible box predictions. Panels: (a) Input, (b) Sample 1, (c) Sample 2, (d) Sample 3, (e) Max. likelihood prediction, (f) GT.

Appendix F Qualitative Results on Real Data

To the best of our knowledge, there are no real datasets labeled with multi-object scenes and various human-scene interaction trajectories that we could use for training. To explore our generalization ability on real data, we qualitatively evaluate P2R-Net by training it on our dataset and applying it to the real-world human pose trajectory data provided by [hassan2021stochastic], which describes human interactions with single objects. The qualitative results are illustrated in Figure 22, where we see that our method can still provide plausible object explanations from natural and diverse real human poses.

Figure 22: Multi-modal predictions on the real human pose trajectory input of [hassan2021stochastic]. Panels: (a) Input, (b) Multi-modal sample, (c) Max. likelihood prediction, (d) w/ CAD model from [hassan2021stochastic].