In recent years, many high-quality datasets of 3D indoor scenes have emerged, such as Matterport3D, Replica and Gibson, which employ 3D scanning and reconstruction technologies to create digital 3D environments. In addition, virtual robotic agents exist inside 3D environments such as Gibson and the Habitat simulator. These are used to develop scene understanding methods from embodied views, thus providing platforms for indoor robot navigation, AR/VR, computer games and many other applications. Despite this progress, a significant limitation of these environments is that they do not contain people. The reason is that there are no automated tools for generating realistic people that interact realistically with 3D scenes, and doing this manually requires significant artist effort. Consequently, our goal is to automatically generate natural and realistic 3D human bodies in the scene. The generated human bodies are expected to be physically plausible (e.g. neither floating nor interpenetrating), diverse, and posed naturally within the scene. This is a step towards equipping high-quality 3D scenes and simulators (e.g. Matterport3D and Habitat) with semantically and physically plausible 3D humans, and is essential for numerous applications such as creating synthetic datasets, VR/AR and computer games.
Our solution is inspired by how humans infer plausible interactions with the environment. According to prior studies, humans tend to propose interaction plans depending on the structure and the semantics of objects. Afterwards, to realize the interaction plan, physical rules apply to determine the detailed human-object configuration, while guaranteeing that the human body neither floats in the air nor collides with the objects. Therefore, our method has two steps: (1) We propose a generative model of human-scene interaction using a conditional variational autoencoder (CVAE) framework. Given scene depth and semantics, we can sample from the CVAE to obtain various human bodies. (2) Next, we transform the generated 3D human body to world coordinates and perform scene geometry-aware fitting, so as to refine the human-scene interaction and eliminate physically implausible configurations (e.g. floating and collision).
We argue that realistically modeling human-scene interaction requires a realistic model of the body. Previous studies on scene affordance inference and human body synthesis, like [26, 43, 53], represent the body as a 3D stick figure or coarse volume. This prevents detailed reasoning about contact, such as how the leg surface contacts the sofa surface. Without a model of body shape, it is not clear whether the estimated body poses correspond to plausible human poses. To overcome these issues, we use the SMPL-X model, which takes a set of low-dimensional body pose and shape parameters and outputs a 3D body mesh with important details like the fingers. Since SMPL-X is differentiable, it enables straightforward optimization of human-scene contact and collision prevention. In addition, we incorporate body shape variation in our approach, so that our generated human bodies have various poses and shapes.
To train our method we exploit the PROX-Qualitative dataset, which includes 3D people captured moving in 3D scenes. We extend this by rendering images, scene depth, and semantic segmentation of the scene from many virtual cameras. We conduct extensive experiments to evaluate the performance of different models for scene-aware 3D body mesh generation. For testing, we extract 7 different rooms from the Matterport3D dataset and use a virtual agent in the Habitat simulator to capture scene depth and semantics from different views. Based on prior work, e.g. [26, 43], we propose three metrics to evaluate the diversity, the physical plausibility, and the semantic plausibility of our results. The experimental results show that our solution effectively generates 3D body meshes in the scene, and outperforms the modified version of a state-of-the-art body generation method. We will make our datasets and evaluation metrics available to establish a benchmark.
Our trained model learns about the ways in which 3D people interact with 3D scenes. We show how to leverage this in the form of a scene-dependent body pose prior, and how to use this prior to improve 3D body pose estimation from RGB images. In summary, our contributions are as follows: (1) We present a solution to generating 3D human bodies in scenes, using a CVAE to generate a body mesh with semantically plausible poses. We follow this with a scene geometry-aware fitting to refine the human-scene interaction. (2) We extend and modify two datasets, and propose three evaluation metrics for scene-aware human body generation. We also modify the method of Li et al. to generate body meshes as the baseline (see Sec. 4.1.2). The experimental results show that our method outperforms the baseline. (3) We show that our human-scene interaction prior is able to improve 3D pose estimation from RGB images.
2 Related work
Multiple studies focus on placing objects in an image so that they appear natural [10, 25, 27, 32]. For example, [10, 40, 42] use contextual information to predict which objects are likely to appear at a given location in the image. Lin et al. apply homography transformations to 2D objects to approximate the perspectives of the object and background. Tan et al. predict likely positions for people in an input image and retrieve a person that semantically fits into the scene from a database. Ouyang et al. use a GAN framework to synthesize pedestrians in urban scenes. Lee et al. learn where to place objects or people in a semantic map and then determine the pose and shape of the respective object. However, all these methods are limited to 2D image compositing or inpainting. Furthermore, the methods that add synthetic humans do not take interactions between the humans and the world into account.
To model human-object or human-scene interactions it is beneficial to know which interactions are possible with a given object. Such opportunities for interaction are referred to as affordances, and numerous works in computer vision have made use of this concept [6, 7, 14, 16, 19, 23, 22, 26, 36, 44, 52, 53]. Object affordance is often represented by a human pose when interacting with a given object [7, 14, 16, 19, 26, 36, 44, 52, 53]. For example, [14, 16, 53] search for valid positions of human poses in 3D scenes. Delaitre et al. learn associations between objects and human poses in order to improve object recognition. Given a 3D model of an object, Kim et al. predict human poses interacting with the given object. Given an image of an object, Zhu et al. learn a knowledge base to predict a likely human pose and a rough relative location of the object with respect to the pose. Savva et al. learn a model connecting human poses and the arrangement of objects in a 3D scene that can generate snapshots of object interaction given a corpus of 3D objects and a verb-noun pair. Monszpart et al. use captured human motion to infer the objects in the scene and their arrangement. Savva et al. predict action heat maps which highlight the likelihood of an action in the scene. Recently, Chen et al. propose to tackle scene parsing and 3D pose estimation jointly and to leverage their coupled nature to improve scene understanding. Chao et al. propose to train multiple controllers to imitate simple motions from mocap, and then use hierarchical reinforcement learning (RL) to achieve higher-level interactive tasks. Unfortunately, none of these methods use a realistic body model, limiting the naturalness and detail of the interactions.
Recently, Wang et al. published an affordance dataset large enough to obtain reliable estimates of the probabilities of poses and to train neural networks on affordance prediction. The data is collected from multiple sitcoms and contains images of scenes with and without humans. The images with humans contain rich behavior of humans interacting with various objects. Given an image and a location as input, Wang et al. first predict the most likely pose from a set of 30 poses. This pose is deformed and scaled by a second network to fit it into the scene. Li et al. extend this work to automatically estimate where to put people and to predict 3D poses. To acquire 3D training data they map 2D poses to 3D poses and place them in 3D scenes from the SUNCG dataset [38, 48]. This synthesized dataset is cleaned by removing all predictions that intersect with the 3D scene or lack sufficient support for the body. The resulting dataset is then used to train a network that directly predicts 3D poses for RGB, RGB-D or depth images. Physical plausibility of poses is encouraged by an adversarial loss. The methods of [26, 44] are limited in their generalization, since they require a large amount of paired data and manual cleaning of the pose detections. Such a large amount of data might be hard to acquire for scenes that are less frequently covered by sitcoms or, in the case of Li et al., by 3D scene datasets. Furthermore, both methods only predict poses represented as stick figures. Such a representation is hard to validate visually, lacks details, and cannot directly be used to generate realistic synthetic data of humans interacting with an environment.
Our solution comprises two parts. The first generates 3D bodies using a conditional variational autoencoder (CVAE) such that the scene context is taken into account. We train this model using data of natural human-scene interaction. Given the scene depth and semantic segmentation as input, the model learns how people interact with scenes; that is, it learns the affordances of the scene and how bodies are posed to take advantage of these affordances. The second part refines the body using scene-aware fitting to ensure physical plausibility.
3D scene representation. We represent the scene from the view of an embodied agent, as in the Habitat simulator. As investigated in , among different intermediate scene representations, the depth map and semantic segmentation are the most valuable for an agent to understand the scene. Thus, we capture scene depth and semantics from diverse views as our scene representation. Given the camera extrinsic and intrinsic parameters, we can recover the 3D scene structure from the depth maps. For each view, we denote the stack of depth and semantics as , the camera perspective projection from 3D to 2D as , and its inverse operation as for 3D recovery. Our training data, , are generated from Habitat and we resize this to for compatibility with our network; we retain the aspect ratio and pad with zeros where needed. The 3D-2D projection normalizes the 3D coordinates to the range of , using the camera intrinsics and the maximal depth value.
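The recovery of 3D structure from a depth map can be illustrated with a minimal sketch, assuming an ideal pinhole camera without lens distortion. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and introduced only for this example; the normalization of coordinates described above is not reproduced here.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera coordinates.

    Assumes a pinhole camera: fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point. These names are illustrative only.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def project(point, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


# Round trip: a pixel is lifted to 3D and projected back to the same pixel.
point = unproject(420.0, 140.0, 2.0, 500.0, 500.0, 320.0, 240.0)
pixel = project(point, 500.0, 500.0, 320.0, 240.0)
```

Applying `unproject` to every pixel of a depth map yields a point cloud in camera coordinates, which the camera extrinsics can then move into world coordinates.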
3D human body representation. We use SMPL-X to represent the 3D human body. The SMPL-X model can be regarded as a function , which maps a group of low-dimensional body features to a 3D body mesh. The 3D body mesh has 10475 vertices and a fixed topology. In our study, we use the body shape feature , the body pose feature , and the hand pose feature . The body pose feature is represented in the latent space of VPoser, which is a variational autoencoder trained on the large-scale motion capture dataset AMASS. The global rotation , i.e., the rotation of the pelvis, is represented by a 6D continuous rotation feature , which facilitates back-propagation in our trials. The global translation is represented by a 3D vector in meters. The global rotation and translation are with respect to the camera coordinates. Based on the camera extrinsics, one can transform the 3D body mesh to the world coordinates.
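A 6D rotation feature is commonly converted to a rotation matrix by Gram-Schmidt orthogonalization of two 3D vectors. The sketch below assumes that convention, with the first three numbers forming the first vector and the resulting basis vectors stored as matrix columns. The function name and this layout are assumptions for illustration, not details confirmed by the text.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return [c / norm for c in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_from_6d(r6):
    """Map six numbers to a 3x3 rotation matrix (list of rows).

    Assumed layout: r6[:3] and r6[3:] are two non-parallel 3D vectors.
    Gram-Schmidt makes them orthonormal; the third axis is their cross
    product, so the result is a proper rotation (determinant +1).
    """
    a1, a2 = list(r6[:3]), list(r6[3:])
    b1 = _normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])
    b3 = _cross(b1, b2)
    # Basis vectors become the columns of the matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


identity = rotation_from_6d([2.0, 0.0, 0.0, 1.0, 3.0, 0.0])
```

Because every step is differentiable away from degenerate inputs, such a representation avoids the discontinuities of Euler angles or axis-angle vectors during gradient-based training.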
We denote the joint body representation as ; i.e., the concatenation of individual body features. When processing the global and the local features separately as in , we denote the global translation as , and the other body features as .
3.2 Scene context-aware Human Body Generator
3.2.1 Network architecture
We employ a CVAE framework to model the probability . When inferring all body features jointly, we propose a one-stage (S1) network. When inferring and successively, we factorize the probability as and use a two-stage (S2) network. The network architectures are illustrated in Fig. 2. Referring to , our scene encoder is fine-tuned from the first 6 convolutional layers in ResNet18, which is pre-trained on the 1000-class ImageNet dataset and hence is helpful for scene representation generalization. The human feature is first lifted to a high dimension (256 in our study) via a fully-connected layer, and then concatenated with the encoded scene feature. In the two-stage model, the two scene encoders are both fine-tuned from ResNet18, but do not share parameters. After the first stage, the reconstructed body global feature is further encoded, and is used in the second stage to infer the body local features.
3.2.2 Training loss
The training loss incorporates several components, and can be formulated as
where the terms denote the reconstruction loss, the Kullback–Leibler divergence, the VPoser loss, the human-scene contact loss and the human-scene collision loss, respectively. The set of’s denotes the loss weights. For simplicity, we denote as , implying the loss for human-scene interaction.
Reconstruction loss : It is given by
in which the global translation, the projected and normalized global translation, and the other body features are considered separately. We apply this reconstruction loss in both our S1 model and our S2 model.
KL-Divergence : Denoting our VAE encoder as , the KL-divergence loss is given by
Correspondingly, in our S2 model the KL-divergence loss is given by
We use the re-parameterization trick in  so that the KL divergence is closed form.
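For reference, if the approximate posterior is a diagonal Gaussian and the prior is a standard normal (a common CVAE choice, assumed here rather than stated above), the KL term takes the familiar closed form below. The symbols $\mu_j$, $\sigma_j$, $d$ and $\epsilon$ are introduced only for this illustration and are not the paper's notation.

```latex
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right),
\qquad
z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).
```

The second expression is the re-parameterized sample: the randomness is moved into $\epsilon$, so gradients can flow through $\mu$ and $\sigma$.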
VPoser loss : Since VPoser attempts to encode natural poses with a normal distribution in its latent space, like in and , we employ the VPoser loss, i.e.
to guarantee that our generated bodies have natural poses.
Collision loss : Based on the model output , we generate the body mesh and transform it to world coordinates. Then, according to the negative signed distance function (SDF) of the scene, i.e. , we compute the negative SDF values at the body mesh vertices, and minimize
indicating the average of absolute values of negative SDFs on the body. Thus, inter-penetrations between the body surface and the scene surface are penalized.
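The collision term can be sketched in a few lines, assuming the SDF has already been sampled at the body vertices and that negative values mean the vertex lies inside scene geometry. Averaging over all vertices rather than only the penetrating ones is an assumption of this sketch, and the function name is hypothetical.

```python
def collision_loss(sdf_values):
    """Mean absolute value of the negative SDF part over body vertices.

    sdf_values: signed distances sampled at the body mesh vertices,
    negative where a vertex penetrates the scene (assumed convention).
    Vertices in free space contribute zero.
    """
    if not sdf_values:
        raise ValueError("sdf_values must not be empty")
    penetration = [-min(s, 0.0) for s in sdf_values]
    return sum(penetration) / len(penetration)


# Two of four vertices penetrate the scene, by 0.2 m and 0.4 m.
loss = collision_loss([0.5, -0.2, 0.1, -0.4])
```

In a training setting the same computation would be written with differentiable tensor operations so that gradients reach the body parameters.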
Contact loss : Following , we encourage contact between the body mesh and the scene mesh. Hence, the contact loss is written
in which denotes selecting the body mesh vertices for contact according to the annotation in , denotes the scene mesh, and denotes the Geman-McClure error function  for down-weighting the influence of scene vertices far away from the body mesh.
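A brute-force sketch of such a contact term is given below. It assumes the Geman-McClure function in the form r^2 / (r^2 + sigma^2) and sums the robustified nearest-neighbor distance over the selected contact vertices. The scale `sigma`, the function names, and the exhaustive nearest-neighbor search are illustrative assumptions; a practical implementation would use a batched Chamfer distance.

```python
import math


def geman_mcclure(r, sigma=1.0):
    """Robust error that saturates below 1 for large residuals r."""
    r2 = r * r
    return r2 / (r2 + sigma * sigma)


def contact_loss(contact_vertices, scene_vertices, sigma=1.0):
    """Sum of robustified distances from contact vertices to the scene.

    contact_vertices: 3D points on the body annotated as likely contact.
    scene_vertices: 3D points of the scene mesh.
    """
    if not scene_vertices:
        raise ValueError("scene_vertices must not be empty")
    total = 0.0
    for vc in contact_vertices:
        nearest = min(math.dist(vc, vs) for vs in scene_vertices)
        total += geman_mcclure(nearest, sigma)
    return total


scene = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
body_contacts = [(0.0, 0.0, 1.0), (5.0, 0.0, 0.0)]
loss = contact_loss(body_contacts, scene)
```

Because the robust function saturates, a contact vertex that is far from any surface adds a bounded penalty, so distant geometry does not dominate the optimization.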
3.3 Scene geometry-aware Fitting
Based on the scene context, our CVAE models generate various human body configurations representing plausible human-scene interactions, but it is hard for the network to enforce strong physical constraints. Consequently, we refine the pose with an optimization step similar to . This applies physical constraints, which encourage contact and help avoid inter-penetration between the body surface and the scene surface, while not deviating much from the generated pose. Let the generated human body configuration be . To refine this, we minimize a fitting loss taking into account the scene geometry, i.e.
in which the ’s denote the loss weights, and the loss terms are defined above.
Our implementation is based on PyTorch v1.2. For the Chamfer distance in the contact loss we use the same implementation as [9, 15]. For training, we set in Eq. 1 for both our S1 and S2 models, in which increases linearly in an annealing scheme . When additionally using , we set , and enable it after 75% of the training epochs to improve the interaction modeling. We use the Adam optimizer with the learning rate , and terminate training after 30 epochs. For the scene geometry-aware fitting, we set in all cases. Our data, code and models will be available for research purposes.
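The schedule described above can be sketched as follows. The ramp length, the final weight, and zero-based epoch indexing are assumptions made for illustration, since the exact values are not recoverable from the text; both function names are hypothetical.

```python
def kl_weight(step, total_steps, max_weight=1.0):
    """Linearly anneal the KL weight from 0 to max_weight.

    Assumes the ramp spans total_steps and the weight is then held
    constant; the actual ramp length is not specified here.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return max_weight * fraction


def interaction_loss_enabled(epoch, num_epochs, start_fraction=0.75):
    """Return True once the given fraction of training has elapsed.

    Epochs are assumed to be zero-based.
    """
    return epoch >= start_fraction * num_epochs


# With 30 epochs, the extra loss switches on from epoch index 23.
schedule = [interaction_loss_enabled(e, 30) for e in range(30)]
```

Annealing the KL weight is a common way to keep the decoder from ignoring the latent code early in training, while delaying the interaction loss lets the model first learn plausible poses.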
4 Experiments
Please see the supplementary materials for more experimental details.
4.1 Scene-aware 3D Body Mesh Generation
PROX-E: This dataset is extended from the PROX-Qualitative dataset, which records how different people interact with various indoor environments. In PROX-Qualitative, 3D human body meshes in individual frames are estimated by fitting the SMPL-X body model to the RGB-D data subject to scene constraints. We use this data as pseudo-ground truth in our study, and extend PROX-Qualitative in three ways: (1) We build virtual walls, floors and ceilings to enclose the original open scans and simulate real indoor environments. (2) We annotate the mesh semantics following the object categorization of Matterport3D. (3) We down-sample the original recordings and extract one frame every 0.5 seconds. In each frame, we set up virtual cameras with various poses to capture scene depth and semantics. The optical axis of each virtual camera points towards the human body, and then Gaussian noise is applied to the camera translation. To avoid severe occlusion, all the virtual cameras are located above half of the room height and below the virtual ceiling. Compared to PROX-Qualitative, which makes recordings only from a single view, we represent a scene with depth and semantic maps from many viewpoints. As a result, we obtain about 70K frames in total. We use ‘MPH16’, ‘MPH1Library’, ‘N0SittingBooth’ and ‘N3OpenArea’ as test scenes, and use samples from the remaining scenes for training. Our PROX-E dataset (pronounced “proxy”) is illustrated in Fig. 3.
MP3D-R: This name denotes “rooms in Matterport3D”. From the architecture scans of the Matterport3D dataset, we extract 7 different rooms according to the annotated bounding boxes. In addition, we create a virtual agent using the Habitat simulator, and manipulate it to capture snapshots from various views in each room. We employ the RGB, depth and semantic sensors on the agent. These sensors are placed 1.8m above the ground and look down at the scene, in a similar range to the virtual cameras in PROX-E. For each snapshot, we also record the extrinsic and intrinsic parameters of the sensors. As a result, we obtain 32 snapshots in total across all 7 rooms. Moreover, we follow the same procedure as in PROX-Qualitative to calculate the SDF of the scene mesh. Our MP3D-R is illustrated in Fig. 4.
To our knowledge, the most related work is Li et al., which proposes a generative model to put 3D body stick figures into images. (The data and the pre-trained model of Li et al. are based on SUNCG and are not publicly available.) For fair comparison, we modify their method to use SMPL-X to generate body meshes in 3D scenes. Specifically, we make the following modifications: (1) We change the scene representation from RGB (or RGB-D) to depth and semantics, like ours, to improve generalization. (2) During training, we perform K-means to cluster the VPoser pose features of the training samples to generate the pose class. (3) The where module is used to infer the global translation, and the what module infers the other SMPL-X parameters. (4) For training the geometry-aware discriminator, we project the body mesh vertices, rather than the stick figures, to the scene depth maps. We train the modified baseline model using PROX-E with the default architecture and loss weights of Li et al. Moreover, we combine the modified baseline method with our scene geometry-aware fitting in our experiments.
4.1.3 Evaluation: representation power
Here we investigate how well the proposed network architectures represent human-scene interaction. We use the PROX-E dataset for the evaluation. We train all models using samples from virtual cameras in training scenes, validate them using samples from real cameras in training scenes, and test them using samples from real cameras in test scenes. For quantitative evaluation, we feed individual test samples to our models, and report the mean of the reconstruction errors and the negative evidence lower bound (ELBO), i.e. , which is the sum of the reconstruction error and the KL divergence. For fair comparison, the reconstruction error of all models is based on in Eq. 2. As shown in Tab. 1, our models outperform the baseline model on both the validation and the test set by large margins. In addition, the metrics on the validation and the test sets are comparable, indicating that our virtual camera approach is effective in enriching the scene representation and preventing severe over-fitting to the seen environments.
4.1.4 Evaluation: 3D body mesh generation
Given a 3D scene, our goal is to generate diverse, physically and semantically plausible 3D human bodies. Based on [26, 43], we propose to quantitatively evaluate our method using a diversity metric and a physical metric. Additionally, we perform a user perceptual study to measure the semantic plausibility of the generated human bodies.
We perform the quantitative evaluation both on the PROX-E and the MP3D-R dataset. When testing on PROX-E, we train our models using all samples in the training scenes, and generate body meshes using the real camera snapshots in the testing scenes. For each individual model and each test scene, we randomly generate 1200 samples, and hence obtain 4800 samples. When testing on MP3D-R, we use all samples from PROX-E to train the models. For each snapshot and each individual model, we randomly generate 200 samples, and hence obtain 6400 samples.
(1) Diversity metric: This metric aims to evaluate how diverse the generated human bodies are. Specifically, we perform K-means to cluster the SMPL-X parameters of all the generated human bodies into 20 clusters, a number chosen empirically. Then, we compute the entropy (a.k.a. the Shannon index, a type of diversity index) of the cluster ID histogram of all the samples. We also compute the average size of all the clusters.
A higher value indicates that the generated human bodies are more diverse in terms of their global positions, their body shapes and poses. We argue that this metric is essential for evaluating the quality of the generated bodies and should always be considered together with other metrics. For instance, a posterior-collapsed VAE, which always generates an identical body mesh, could lead to a low diversity score but superior performance according to the physical metric and the semantic metric.
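The entropy part of the diversity metric can be sketched as below, taking already computed cluster assignments as input. Using the natural logarithm is an assumption; it is at least consistent with reported entropies just under ln 20, which is roughly 3.0 and is the maximum attainable with 20 clusters. The function name is hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(cluster_ids):
    """Shannon entropy (in nats) of the cluster ID histogram."""
    if not cluster_ids:
        raise ValueError("cluster_ids must not be empty")
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Samples spread evenly over 20 clusters reach the maximum, ln(20).
uniform_ids = [i % 20 for i in range(4800)]
max_entropy = cluster_entropy(uniform_ids)

# A collapsed generator that always hits one cluster has zero entropy.
collapsed_entropy = cluster_entropy([7] * 4800)
```

The two extremes illustrate why the metric penalizes a posterior-collapsed model: identical outputs all land in a single cluster.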
The results are shown in Tab. 2. Overall, our methods consistently outperform the baseline. Notably, our methods increase the average cluster size of the generated samples by large margins, indicating that the generated human bodies are much more diverse than those from the baseline.
(2) Physical metric:
From the physical perspective, we evaluate the collision and the contact between the body mesh and the scene mesh. Given a scene SDF and an SMPL-X body mesh, we propose a non-collision score, which is calculated as the number of body mesh vertices with positive SDF values divided by the number of all body mesh vertices (10475 for SMPL-X). In addition, if any body mesh vertex has a non-positive SDF value, we regard the body as being in contact with the scene. Then, over all generated body meshes, the non-collision score is the ratio of all body vertices in free space, and the contact ratio is calculated as the number of body meshes with contact divided by the number of all generated body meshes. Due to the physical constraints, a higher non-collision score and a higher contact ratio indicate a better generation, in analogy with precision and recall in an object detection task.
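Both physical metrics follow directly from the SDF values sampled at the body vertices. The sketch below assumes each generated body is given as a list of per-vertex SDF values; the function names are hypothetical.

```python
def non_collision_score(bodies):
    """Fraction of all body vertices with a strictly positive SDF value.

    bodies: list of bodies, each a list of per-vertex SDF values.
    """
    total = sum(len(body) for body in bodies)
    if total == 0:
        raise ValueError("no vertices given")
    free = sum(1 for body in bodies for s in body if s > 0)
    return free / total


def contact_ratio(bodies):
    """Fraction of bodies with at least one non-positive SDF value."""
    if not bodies:
        raise ValueError("no bodies given")
    touching = sum(1 for body in bodies if any(s <= 0 for s in body))
    return touching / len(bodies)


bodies = [
    [0.2, 0.1, 0.0, 0.3],   # touches the scene at one vertex
    [0.5, 0.4, 0.6, 0.7],   # floats: no contact, no collision
    [-0.1, 0.2, 0.3, 0.1],  # slightly penetrates the scene
]
score = non_collision_score(bodies)
ratio = contact_ratio(bodies)
```

The floating body raises the non-collision score but lowers the contact ratio, which is why the two numbers are read together.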
The results are presented in Tab. 3. First, one can see that our proposed methods consistently outperform the baseline for the physical metric. The influence of the loss on 3D body generation is not as obvious as on the interaction modeling task (see Tab. 1). Additionally, one can see that the scene geometry-aware fitting consistently improves the physical metric, since the fitting process aims to improve the physical plausibility. Fig. 7 shows some generated examples before and after the fitting.
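Table 2: Diversity metric on PROX-E and MP3D-R (two columns per metric).
|model||cluster ID entropy||cluster ID entropy||cluster size average||cluster size average|
|S1 + +||2.94||2.96||2.43||2.79|
|S2 + +||2.91||2.90||2.26||2.95|
Table 3: Physical metric on PROX-E and MP3D-R (two columns per metric).
|model||non-collision score||non-collision score||contact score||contact score|
|S1 + +||0.92||0.98||0.99||0.81|
|S2 + +||0.93||0.97||0.99||0.81|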
(3) User study: Another important metric measures how natural and semantically meaningful the human-scene interactions are. In our study, we render our generated results on images, and upload them to Amazon Mechanical Turk (AMT) for a user study. Due to the superior performance of our S1 model without , we compare it with the baseline, as well as ground truth if it exists. On both the PROX-E and MP3D-R datasets, for each scene and each model we generate 100 bodies, and ask Turkers to give a score between 1 (strongly not natural) and 5 (strongly natural) to each individual result. The user study details are in the supplementary material. Also, for each scene in the PROX-E dataset, we randomly select 100 frames from the ground truth , and ask Turkers to evaluate them as well.
The results are presented in Tab. 4. Not surprisingly, the ground-truth samples achieve the best score from the user study. We observe that the geometry-aware fitting improves the performance both for the baseline and our model, most likely due to the improvement of the physical plausibility. Note that, although the baseline and our model achieve similar average scores, the diversity of our generated samples is much higher (Tab. 2). This indicates that, compared to the baseline, our method generates more diverse 3D human bodies, while being equally good in terms of semantic plausibility given a 3D scene.
Table 4: User study score (mean ± std).
|model||PROX-E||MP3D-R|
|baseline||3.31 ± 1.39||3.02 ± 1.46|
|baseline +||3.32 ± 1.35||3.32 ± 1.42|
|S1||3.29 ± 1.36||3.10 ± 1.38|
|S1 +||3.49 ± 1.26||3.24 ± 1.31|
|ground truth||4.04 ± 1.03||n/a|
4.2 Scene-aware 3D Body Pose Estimation
We have proposed generative models of human-scene interaction. Beyond generating 3D human bodies in the scene, here we perform a down-stream application, and show how our method improves 3D human pose estimation from single images. Given an RGB image of a scene without people, we estimate a depth map using a pre-trained model , and perform semantic segmentation using the model of  pre-trained on ADE20K . To unify the semantics, we create a look-up table to convert the object IDs from ADE20K to Matterport3D. Next, we feed the estimated depth and semantics to our S1 model with and randomly generate 100 bodies. We compute the average of their pose feature in the VPoser latent space, and denote it as .
When performing 3D pose estimation in the same scene, we follow the optimization framework of SMPLify-X and PROX. In contrast to these two methods, we use our derived to initialize the optimization, and change the VPoser term in [17, Eq. 7] from to . We evaluate the performance using the PROX-Quantitative dataset. We derive the 2D keypoints from the frames via AlphaPose [11, 46], and obtain a from a background image without people. Then, we use the same optimization methods and the evaluation metric in for fair comparison. The results are shown in Tab. 5. We find that our method improves 3D pose estimation on the PROX-Quantitative dataset for all the metrics. This suggests that our model learns about the ways in which 3D people interact with 3D scenes, and that leveraging it as a scene-dependent body pose prior can improve 3D body pose estimation from RGB images.
Table 5: 3D pose estimation error (in millimeters).
In this work, we introduce a generative framework to produce 3D human bodies that are posed naturally in a 3D environment. Our method consists of two steps: (1) a scene context-aware human body generator learns a distribution of 3D human pose and shape, conditioned on the scene depth and semantics; (2) a geometry-aware fitting imposes physical plausibility on the human-scene interaction. Our experiments demonstrate that the automatically synthesized 3D human bodies are realistic and expressive, and interact with the 3D environment in a semantically and physically plausible way.
[1] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
[2] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV), 2017.
[3] Yu-Wei Chao, Jimei Yang, Weifeng Chen, and Jia Deng. Learning to sit: Synthesizing human-chair interactions via hierarchical control. arXiv preprint arXiv:1908.07423, 2019.
[4] Yixin Chen, Siyuan Huang, Tao Yuan, Siyuan Qi, Yixin Zhu, and Song-Chun Zhu. Holistic++ scene understanding: Single-view 3D holistic scene parsing and human pose estimation with human-object interaction and physical commonsense. arXiv preprint arXiv:1909.01507, 2019.
[5] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
[6] Ching-Yao Chuang, Jiaman Li, Antonio Torralba, and Sanja Fidler. Learning to act properly: Predicting and explaining affordances from images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 975–983, 2018.
[7] Vincent Delaitre, David F Fouhey, Ivan Laptev, Josef Sivic, Abhinav Gupta, and Alexei A Efros. Scene semantics from long-term observation of people. In European Conference on Computer Vision, pages 284–298. Springer, 2012.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[9] Theo Deprelle, Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. Learning elementary structures for 3D shape generation and matching. In NeurIPS, 2019.
[10] Nikita Dvornik, Julien Mairal, and Cordelia Schmid. On the importance of visual context for data augmentation in scene understanding. arXiv preprint arXiv:1809.02492, 2018.
[11] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional multi-person pose estimation. In ICCV, 2017.
[12] Stuart Geman and Donald E. McClure. Statistical methods for tomographic image reconstruction. In Proceedings of the 46th Session of the International Statistical Institute, Bulletin of the ISI, volume 52, 1987.
[13] James J Gibson. The ecological approach to visual perception: classic edition. Psychology Press, 2014.
[14] Helmut Grabner, Juergen Gall, and Luc Van Gool. What makes a chair a chair? In CVPR 2011, pages 1529–1536. IEEE, 2011.
[15] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan Russell, and Mathieu Aubry. 3D-CODED: 3D correspondences by deep deformation. In ECCV, 2018.
[16] Abhinav Gupta, Scott Satkin, Alexei A Efros, and Martial Hebert. From 3D scene geometry to human workspace. In CVPR 2011, pages 1961–1968. IEEE, 2011.
[17] Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J. Black. Resolving 3D human pose ambiguities with 3D scene constraints. In International Conference on Computer Vision, Oct. 2019.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[19] Vladimir G Kim, Siddhartha Chaudhuri, Leonidas Guibas, and Thomas Funkhouser. Shape2Pose: Human-centric shape analysis. ACM Transactions on Graphics (TOG), 33(4):120, 2014.
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[22] Hema Swetha Koppula, Rudhir Gupta, and Ashutosh Saxena. Learning human activities and object affordances from RGB-D videos. The International Journal of Robotics Research, 32(8):951–970, 2013.
[23] Hema S Koppula and Ashutosh Saxena. Physically grounded spatio-temporal object affordances. In European Conference on Computer Vision, pages 831–847. Springer, 2014.
[24] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth International Conference on 3D Vision (3DV), pages 239–248. IEEE, 2016.
[25] Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. Context-aware synthesis and placement of object instances. In Advances in Neural Information Processing Systems, pages 10414–10424, 2018.
[26] Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, and Jan Kautz. Putting humans in a scene: Learning affordance in 3D indoor environments. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[27] Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey. ST-GAN: Spatial transformer generative adversarial networks for image compositing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9455–9464, 2018.
[28] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.
[29] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. In The IEEE International Conference on Computer Vision (ICCV), Oct 2019.
[30] Manolis Savva*, Abhishek Kadian*, Oleksandr Maksymets*, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
-  Aron Monszpart, Paul Guerrero, Duygu Ceylan, Ersin Yumer, and Niloy J Mitra. imapper: Interaction-guided joint scene and human motion mapping from monocular videos. arXiv preprint arXiv:1806.07889, 2018.
-  Xi Ouyang, Yu Cheng, Yifan Jiang, Chun-Liang Li, and Pan Zhou. Pedestrian-synthesis-gan: Generating pedestrian data in real scene and beyond. arXiv preprint arXiv:1804.02047, 2018.
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
-  Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2019.
-  Manolis Savva, Angel X Chang, Pat Hanrahan, Matthew Fisher, and Matthias Nießner. SceneGrok: Inferring action maps in 3D environments. ACM Transactions on graphics (TOG), 33(6):212, 2014.
-  Manolis Savva, Angel X Chang, Pat Hanrahan, Matthew Fisher, and Matthias Nießner. Pigraphs: learning interaction snapshots from observations. ACM Transactions on Graphics (TOG), 35(4):139, 2016.
-  Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pages 3483–3491, 2015.
-  Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1746–1754, 2017.
-  Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke M. Strasdat, Renzo De Nardi, Michael Goesele, Steven Lovegrove, and Richard Newcombe. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.
-  Jin Sun and David W Jacobs. Seeing what is not there: Learning context to determine where objects are missing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5716–5724, 2017.
-  Fuwen Tan, Crispin Bernier, Benjamin Cohen, Vicente Ordonez, and Connelly Barnes. Where and who? automatic semantic-aware person composition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1519–1528. IEEE, 2018.
-  Antonio Torralba. Contextual priming for object detection. International journal of computer vision, 53(2):169–191, 2003.
-  Xiaolong Wang, Rohit Girdhar, and Abhinav Gupta. Binge watching: Scaling affordance learning from sitcoms. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2596–2605, 2017.
-  Xiaolong Wang, Rohit Girdhar, and Abhinav Gupta. Binge watching: Scaling affordance learning from sitcoms. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: real-world perception for embodied agents. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018.
-  Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, and Cewu Lu. Pose Flow: Efficient online pose tracking. In BMVC, 2018.
-  Hsiao-chen You and Kuohsiang Chen. Applications of affordance and semantics in product design. Design Studies, 28(1):23–38, 2007.
-  Yinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, and Thomas Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5287–5295, 2017.
-  Brady Zhou, Philipp Krähenbühl, and Vladlen Koltun. Does computer vision matter for action? arXiv preprint arXiv:1905.12887, 2019.
-  Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019.
-  Yuke Zhu, Alireza Fathi, and Li Fei-Fei. Reasoning about object affordances in a knowledge base representation. In European conference on computer vision, pages 408–424. Springer, 2014.
-  Yixin Zhu, Chenfanfu Jiang, Yibiao Zhao, Demetri Terzopoulos, and Song-Chun Zhu. Inferring forces and learning human utilities from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3823–3833, 2016.
Appendix A Experiment Details
A.1 From PROX-Qualitative to PROX-E
The PROX-Qualitative (or PROX-Q for short) dataset comprises recordings of 20 subjects in 12 indoor scenes, including 3 bedrooms, 5 living rooms, 2 sitting booths and 2 offices. The 3D scenes were scanned with a commercial Structure Sensor RGB-D camera and reconstructed by the accompanying 3D reconstruction solution Skanect. We refer to  for more details of how the PROX-Q was established. Note that the scene meshes of PROX-Q do not form valid rooms, i.e. there is no ceiling and some walls are missing. Furthermore, the meshes are not semantically segmented.
To our knowledge, PROX-Q is the largest dataset capturing real human-scene interactions at the 3D mesh level. However, due to the incomplete room scans and lack of mesh semantics, we extend PROX-Q from the following perspectives, so as to serve our purposes of human-scene interaction modeling and generation from the viewpoint of an embodied agent:
(1) Building up virtual walls, floors, ceilings.
To achieve this goal, we import the scene meshes of PROX-Q into Blender, which we use to enclose the original scene meshes to create rooms. When using a camera to capture the scene, we can always get a completed depth map. The completed depth maps are illustrated in Fig. 3.
(2) Semantic annotation of the scene meshes.
The mesh semantics follow the Matterport3D dataset , which incorporates 40 categories of common indoor objects (the object categorization is available at https://github.com/niessner/Matterport/blob/master/metadata/mpcat40.tsv). Our annotation is performed manually, and the mesh vertex colors denote the object labels. Therefore, we are able to capture the depth and the semantics from multiple views.
(3) Setting up virtual cameras.
The original PROX-Q dataset only incorporates video recordings from a single view in each scene. This implies that we would have only 12 depth-semantics pairs to use, which can lead to severe overfitting when using PROX-Q for training. To overcome this drawback, for each individual frame captured by the real camera, we create a set of virtual cameras in the scene to capture the human behavior. The virtual cameras are posed according to the room structure and the human body position. Specifically, we create a 3D grid of camera positions according to the room size: the ranges of width and length are determined by the size of the room, and the range of height lies between the pelvis of the human body and the height of the ceiling that we have created. For each camera, the X-axis is parallel to the ground, and the Z-axis points towards the human body center. Next, we add Gaussian noise to the camera translations, and discard views with no human bodies or with strong body occlusions; i.e., we keep a view only if, in the image, the body part around the pelvis ( 10 pixels) is not occluded by any object in the scene. We argue that such noise is essential; otherwise, the generated human bodies would always be located in the center of the depth-semantics maps. Furthermore, we only keep the virtual cameras whose distance to the human body is between 1.65m and 6.5m, so that the projected body sizes in the virtual cameras are similar to the body sizes captured by real cameras. Fig. S1 shows a set of virtual cameras before and after applying the Gaussian noise to the camera translations. Moreover, the resolution of depth and semantics is set to 480×270, and the camera intrinsic parameters are
which is a default setting in Open3D [Q. Zhou, J. Park and V. Koltun, 2018] after specifying the depth/semantics resolution.
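The camera placement described above can be illustrated with a minimal sketch. It is not the authors' implementation: the grid density `steps`, the noise level `noise_std`, the choice of world +Z as the up direction, and the use of the pelvis as the body center are all assumptions made for illustration, and the occlusion test (which requires rendering) is omitted. The orientation is computed from the unperturbed grid position, so that the noise on the translation moves the body away from the image center.

```python
import numpy as np

def look_at_rotation(cam_pos, target, up=(0.0, 0.0, 1.0)):
    """Camera-to-world rotation (columns = camera X, Y, Z axes in world coordinates).

    The Z-axis points from the camera to the target and the X-axis is
    perpendicular to `up`, i.e. parallel to the ground plane.
    Returns None when the viewing direction is degenerate.
    """
    z = np.asarray(target, float) - np.asarray(cam_pos, float)
    norm_z = np.linalg.norm(z)
    if norm_z < 1e-8:
        return None
    z = z / norm_z
    x = np.cross(z, np.asarray(up, float))
    norm_x = np.linalg.norm(x)
    if norm_x < 1e-8:  # camera is directly above or below the target
        return None
    x = x / norm_x
    y = np.cross(z, x)  # completes a right-handed frame (Y points downwards)
    return np.stack([x, y, z], axis=1)

def sample_virtual_cameras(room_min, room_max, pelvis, ceiling_z,
                           steps=(4, 4, 3), noise_std=0.2,
                           d_min=1.65, d_max=6.5, seed=0):
    """Return a list of (position, rotation) pairs for candidate virtual cameras."""
    rng = np.random.default_rng(seed)
    room_min = np.asarray(room_min, float)
    room_max = np.asarray(room_max, float)
    pelvis = np.asarray(pelvis, float)

    # 3D grid: width/length from the room size, height from pelvis to ceiling.
    xs = np.linspace(room_min[0], room_max[0], steps[0])
    ys = np.linspace(room_min[1], room_max[1], steps[1])
    zs = np.linspace(pelvis[2], ceiling_z, steps[2])
    grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1).reshape(-1, 3)

    cameras = []
    for pos in grid:
        rot = look_at_rotation(pos, pelvis)
        if rot is None:
            continue
        # Perturb only the translation, keeping the orientation fixed.
        noisy_pos = pos + rng.normal(scale=noise_std, size=3)
        dist = np.linalg.norm(pelvis - noisy_pos)
        if d_min <= dist <= d_max:
            cameras.append((noisy_pos, rot))
    return cameras
```

For example, `sample_virtual_cameras([0, 0, 0], [5, 4, 2.5], [2.5, 2.0, 0.9], 2.5)` returns the candidate views for a hypothetical 5m by 4m room; each candidate would still need to pass the visibility and occlusion checks described in the text.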
Table S1 columns: scan ID, region ID, room type.
A.2 Creating the MP3D-R dataset
Our MP3D-R dataset is extracted from the Matterport3D dataset . We extract 7 rooms by annotating bounding boxes of regions, as shown in Tab. S1. These 7 rooms have room types that are similar to those in PROX-E.
When trimming the rooms according to the annotation, we expand the annotated bounding box size by 0.5 meters to ensure that walls, ceilings and floors are incorporated. Note that, the Habitat simulator and the original Matterport3D dataset have different gravity directions. The Habitat simulator assumes that the gravity direction is along . Thus, after loading the scene meshes from Matterport3D, we rotate the scene mesh by degree w.r.t. the X-axis to match the bounding box annotation from Habitat. Fig. S2 shows some retrieved room meshes with the world coordinate origins.
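The trimming step can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the 0.5m expansion is assumed to be applied on every side of the box, and the rotation angle is left as a parameter because its value is not stated here.

```python
import numpy as np

def rotate_about_x(vertices, angle_deg):
    """Rotate an (N, 3) vertex array about the X-axis by angle_deg degrees."""
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a), np.sin(a)
    rot = np.array([[1.0, 0.0, 0.0],
                    [0.0, c, -s],
                    [0.0, s, c]])
    return np.asarray(vertices, float) @ rot.T

def crop_room(vertices, faces, bbox_min, bbox_max, margin=0.5):
    """Keep the part of a triangle mesh inside the expanded bounding box.

    A face is kept only if all three of its vertices lie inside the box.
    Returns the kept vertices and the faces re-indexed to them.
    """
    vertices = np.asarray(vertices, float)
    faces = np.asarray(faces, int)
    lo = np.asarray(bbox_min, float) - margin  # assumption: margin on each side
    hi = np.asarray(bbox_max, float) + margin
    inside = np.all((vertices >= lo) & (vertices <= hi), axis=1)
    kept_faces = faces[np.all(inside[faces], axis=1)]

    # Map old vertex indices to indices in the cropped vertex array.
    new_index = -np.ones(len(vertices), dtype=int)
    new_index[inside] = np.arange(int(inside.sum()))
    return vertices[inside], new_index[kept_faces]
```

Rotating the mesh first and cropping afterwards keeps the mesh and the bounding-box annotation in the same coordinate frame.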
Next, we use the Habitat simulator  to create a virtual agent in the room. In each scene, we first put the agent in the room center, and then steer the virtual agent to cruise around the room. According to the ranges of the virtual cameras in PROX-E, we set the height of the agent sensor to 1.8 meters from the ground. For each snapshot, we record the RGB image, the scene depth, the scene semantics, as well as the camera extrinsic parameters. The frame resolution and the camera intrinsics are identical to our settings in PROX-E.
Following the pipeline for creating PROX-Q, we also compute scene signed distance functions (SDFs) of the MP3D-R scenes. For each room, we first use Poisson surface reconstruction to convert the meshes to be watertight. Fig. S3 shows an example of the reconstructed scene mesh. Next, similar to [17, Sec. 3.6] we compute the SDF in a uniform voxel grid of size which spans a padded bounding box of the scene mesh.
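To make the voxel-grid representation concrete, the sketch below builds a uniform grid over a padded bounding box and looks up SDF values for query points. It makes several assumptions that are not taken from the paper: the grid `resolution` and `padding` are free parameters, an analytic function `sdf_fn` stands in for the mesh-based distance computation, negative values are taken to mean "inside", and the lookup uses the nearest voxel rather than interpolation.

```python
import numpy as np

def build_sdf_grid(sdf_fn, bbox_min, bbox_max, resolution, padding=0.1):
    """Evaluate sdf_fn on a uniform grid spanning the padded bounding box.

    sdf_fn maps an (N, 3) array of points to N signed distances.
    Returns the (R, R, R) grid and the padded box corners.
    """
    lo = np.asarray(bbox_min, float) - padding
    hi = np.asarray(bbox_max, float) + padding
    axes = [np.linspace(lo[i], hi[i], resolution) for i in range(3)]
    points = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    grid = sdf_fn(points).reshape(resolution, resolution, resolution)
    return grid, lo, hi

def query_sdf(grid, lo, hi, points):
    """Nearest-voxel lookup of the SDF for an (N, 3) array of query points."""
    res = grid.shape[0]
    rel = (np.asarray(points, float) - lo) / (hi - lo)
    idx = np.clip(np.rint(rel * (res - 1)).astype(int), 0, res - 1)
    return grid[idx[:, 0], idx[:, 1], idx[:, 2]]

# Stand-in for a scene: a unit sphere centered at the origin.
def unit_sphere_sdf(p):
    return np.linalg.norm(p, axis=1) - 1.0
```

Under these assumptions, body vertices whose looked-up value is negative would be flagged as penetrating the scene, which is the kind of signal a collision term can use.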
A.3 Details of the baseline method
To our knowledge, the most related work is Li et al. , which aims to put humans in a scene and infer the affordance of 3D indoor environments. In this work, the authors first propose an efficient and fully-automatic 3D human pose synthesizer to generate stick figures, using a pose prior learned from a large-scale 2D dataset  and the physical constraints from the target 3D scenes. With this pose synthesizer, the authors create a dataset incorporating synthesized human-scene interactions. Next, based on the synthesized dataset, the authors develop a generative model for 3D affordance prediction, which is able to generate body stick figures based on the scene images.
Compared to the method of Li et al. , our solution has the following key differences: (1) Our PROX-E dataset contains real human-scene interactions rather than synthesized ones. This is highly beneficial for modeling the distribution of human-scene interactions in the real world. (2) Our solution generates body meshes rather than 3D body stick figures (see Fig. 5 in ). Therefore, the results can be directly used in applications like VR, AR and others. (3) We use the SMPL-X model  in our work, hence our methods can generate various body shapes and fine-grained hand poses, beyond the body global configurations and local poses. (4) SMPL-X can be regarded as a differentiable function mapping from human body features to human body meshes, so the physical constraints applied on the body mesh surfaces can be back-propagated to the body features like in . (5) We use scene depth and semantics to represent the scene, rather than using RGB (or RGBD) images as in . In our study, the RGB images are only available from the real cameras of PROX-E, hence using RGB images can increase the risk of overfitting. In addition, the benefits of scene depth and semantics are revealed in .
Therefore, in our work, we modify the method of Li et al.  as mentioned in Sec. 4.1.2, so that their model can generate body meshes like our methods, and a fair comparison can be conducted. We treat the modified version of  as our baseline. We train the baseline model with the PROX-E dataset like our methods. After generating body meshes in test scenes, we also apply our scene geometry-aware fitting to refine the results of the baseline model. The qualitative results of the baseline with fitting are shown in MP3D-R. We argue that our modification is necessary and favorable to the baseline to produce high quality 3D human bodies. For the quantitative comparison, please refer to Tab. 2, Tab. 3 and Tab. 4.
A.4 Details of the user study
To evaluate how naturally the human body meshes are posed in the scenes, we perform a user study via Amazon Mechanical Turk (AMT). For each generated body pose, we render images of the body-and-scene mesh from two different views. Fig. S4 shows our AMT user interface. We propose a hypothesis that the human body interacts with the scene in a very natural manner, and then let users judge this hypothesis. Their judgements are recorded on a 5-point Likert scale.
Unlike , we do not show pairs of results from different methods in the user interface. Instead, since we have multiple methods to compare, we let the Turkers evaluate each individual result in order to keep the user interface clear. We also report the standard errors of the user study results in addition to the mean values, which indicate how reliable the semantic scores are. One can see in Tab. 4 that the scene geometry-aware fitting reduces the standard error, indicating that users tend to give more consistent judgements. The ground truth has the lowest standard error, which indicates that Turkers are able to judge when the human-scene interaction is natural.
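As a reminder of how the two reported statistics relate, the sketch below computes the mean and the standard error of the mean for a list of Likert ratings. It assumes the standard error is the sample standard deviation (with the n-1 denominator) divided by the square root of the number of ratings; the paper does not state which estimator was used, and the ratings shown are hypothetical.

```python
import math
import statistics

def mean_and_standard_error(scores):
    """Return (mean, standard error of the mean) for a list of ratings."""
    n = len(scores)
    if n < 2:
        raise ValueError("at least two ratings are needed for a standard error")
    mean = statistics.fmean(scores)
    # Assumption: sample standard deviation (n - 1 denominator).
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, sem

# Hypothetical ratings on a 5-point Likert scale, for illustration only.
example_ratings = [4, 5, 3, 4, 4, 2, 5, 4]
example_mean, example_sem = mean_and_standard_error(example_ratings)
```

A smaller standard error for the same number of ratings means the individual judgements are spread less widely around the mean, which is the sense in which the text describes judgements as more consistent.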
Appendix B More Qualitative Results and Failure Cases
We find that failure cases fall into two categories. First, the generative model is not always reliable in test scenes, since samples from the model are not always plausible. Some results sampled from the generative model do not match the geometric structures in the test scenes, and hence the body floats in the air or collides with the scene mesh. See Fig. S9 for examples. Such failure cases can occur in both the baseline and our methods. Second, although the scene geometry-aware fitting can effectively resolve floating and collision, its optimization process cannot simulate all real physics, such as gravity and elasticity. Therefore, it can hurt the human-scene interaction semantics of the results produced by the generative model. Fig. S10 shows some examples of such failure cases, which contain abnormal global body configurations and human-scene contact caused by our scene geometry-aware fitting.
Appendix C Details of Scene-aware 3D Body Pose Estimation
the notations of which are referred to . In our work, we only modify the Vposer regularizer, i.e., , and leave the other terms unchanged. Specifically, we change it to