ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation

07/09/2020 ∙ by Chuang Gan, et al. ∙ 10

We introduce ThreeDWorld (TDW), a platform for interactive multi-modal physical simulation. With TDW, users can simulate high-fidelity sensory data and physical interactions between mobile agents and objects in a wide variety of rich 3D environments. TDW has several unique properties: 1) realtime near photo-realistic image rendering quality; 2) a library of objects and environments with materials for high-quality rendering, and routines enabling user customization of the asset library; 3) generative procedures for efficiently building classes of new environments; 4) high-fidelity audio rendering; 5) believable and realistic physical interactions for a wide variety of material types, including cloth, liquids, and deformable objects; 6) a range of "avatar" types that serve as embodiments of AI agents, with the option for user avatar customization; and 7) support for human interactions with VR devices. TDW also provides a rich API enabling multiple agents to interact within a simulation and return a range of sensor and physics data representing the state of the world. We present initial experiments enabled by the platform around emerging research directions in computer vision, machine learning, and cognitive science, including multi-modal physical scene understanding, multi-agent interactions, models that "learn like a child", and attention studies in humans and neural networks. The simulation platform will be made publicly available.




1 Introduction

A major challenge for developing and benchmarking approaches to physical scene understanding is the logistical difficulty of operating a mobile agent in the real world. Moreover, in order to ensure proper generalization, a control system must be evaluated in a wide variety of environments, and its perceptual models require large amounts of labeled data for training, which is laborious (and expensive) to acquire. Furthermore, physical quantities such as mass are not readily apparent to human observers and are therefore difficult to label. For these reasons, sensorimotor control models are often developed and benchmarked in simulation. By generating scenes synthetically, we gain complete control over data generation, with full access to all generative parameters.

ThreeDWorld (TDW) is a general-purpose virtual world simulation platform that is designed to support multi-modal physical interactions between objects and agents. Using TDW’s flexible and general framework, we can train embodied agents to perform tasks in a 3D physical world and collect behavioral data in a simulated environment that mimics the sensory and interactive richness of the real world. Transfer of the resultant models to the physical world is more likely to succeed when the simulator provides a wide range of realistic 3D environments. TDW is defined in part by the following properties and capabilities:

  • Experimental stimuli using multiple modalities: By affording equal status to visual and auditory modalities, we can create synthetic imagery at near-photoreal levels while also generating realistic sounds and acoustic environment simulations.

  • Complete generality: TDW does not impose any constraints on the use-cases it can support. TDW can generate room environments filled with furniture, complex outdoor scenes or customized object configurations for experiments involving physical prediction and inference, object recognition or simulating infant play behavior.

  • Believable and realistic physical interactions: The system utilizes several physics engines including a uniform particle-based representation that allows objects of different types (rigid, soft-body, cloth and fluids) to collide and interact in physically realistic ways.

  • Direct interaction with 3D objects: Using API commands, we can perform actions such as applying physical forces to move objects, or change their material, physics behavior or acoustic properties (e.g. metal objects that bounce like rubber and sound like wood).

  • Indirect API interaction via avatars: Avatars act as the embodiment of an AI agent; for example, one avatar type uses articulated arms to transport objects around the environment.

  • Human interaction in VR: Using controllers in conjunction with an Oculus Rift S headset, users can directly grasp and manipulate objects in a 3D virtual environment.

Figure 1: TDW’s general, flexible design supports a broad range of use-cases at a high level of multi-modal fidelity: a-c) Indoor and outdoor scene rendering; d) Advanced physics – cloth draping over a rigid body; e) Avatar picking up object; f) Multi-agent scene – “parent” and “baby” avatars interacting; g) Human user interacting with virtual objects in VR; h) Multi-modal scene – speaker icons show playback locations of synthesized impact sounds.

In this paper we describe the TDW platform and its key distinguishing features, as well as several example applications that illustrate its use as a data-generation tool for helping researchers advance progress in AI. A download of TDW’s codebase is available at:; the benchmark datasets described below are available at:, and

Related Simulation Environments. Recently, several simulation platforms have been developed to support research into embodied AI, scene understanding, and physical inference. These include AI2-THOR Kolve et al. (2017), HoME Wu et al. (2018), VirtualHome Puig et al. (2018), Habitat Savva et al. (2019), Gibson Xia et al. (2018), iGibson Xia et al. (2020), Sapien Xiang et al. (2020), PyBullet Coumans and Bai (2016), MuJoCo Todorov et al. (2012), and Deepmind Lab Beattie et al. (2016). However, none of them has TDW’s range of features and diversity of potential use cases.

TDW is unique in its support of: a) Real-time near-photorealistic rendering of both indoor and outdoor environments; b) A physics-based model for generating situational sounds from object-object interactions (Fig. 1h); c) Creating custom environments procedurally and populating them with custom object configurations for specialized use-cases; d) Realistic interactions between objects, due to the unique combination of high-res object geometry and fast-but-accurate high-res rigid bodies (denoted “R” in Table 1); e) Complex non-rigid physics, based on the NVIDIA Flex engine; f) A range of user-selectable embodied agent representations; g) A user-extensible model library. We believe this suite of features is critical for training the next generation of intelligent agents.

As shown in Table 1, TDW differs from these frameworks in its support for different types of:

  • Photorealistic scenes: indoor (I) and outdoor (O)

  • Physics simulation: just rigid body (R) or improved fast-but-accurate rigid body (R), soft body (S), cloth (C) and fluids (F)

  • Acoustic simulation: environmental (E) and physics-based (P)

  • User interaction: direct API-based (D), avatar-based (A) and human-centric using VR (H)

  • Model library support: built-in (L) and user-extensible (E)

| Platform | Scene (I, O) | Physics (R/R, S, C, F) | Acoustic (E, P) | Interaction (D, A, H) | Models (L, E) |
|---|---|---|---|---|---|
| Deepmind Lab Beattie et al. (2016) | | | | D, A | |
| MuJoCo Todorov et al. (2012) | | R, C, S | | D, A | |
| PyBullet Coumans and Bai (2016) | | R | | D, A | |
| HoME Wu et al. (2018) | | R | E | | |
| VirtualHome Puig et al. (2018) | I | | | D, A | |
| Gibson Xia et al. (2018) | I | | | | |
| iGibson Xia et al. (2020) | I | R | | D, A | L |
| Sapien Xiang et al. (2020) | I | R | | D, A | L |
| Habitat Savva et al. (2019) | I | | | | |
| AI2-THOR Kolve et al. (2017) | I | R | | D | L |
| ThreeDWorld | I, O | R, C, S, F | E, P | D, A, H | L, E |

Table 1: Comparison of TDW’s capabilities with those of related virtual simulation frameworks.

Ongoing research with TDW.

The variation in use-case projects currently using TDW reflects the platform’s flexibility and generality (for details, see Section 3): 1) A learned visual feature representation, trained on a TDW image classification dataset comparable to ImageNet, transferred to fine-grained image classification and object detection tasks; 2) A synthetic dataset of impact sounds generated via TDW’s audio impact synthesis and used to test material and mass classification, using TDW’s ability to handle complex physical collisions and non-rigid deformations; 3) Agents learning to predict physical dynamics in novel settings; 4) Intrinsically-motivated agents based on TDW’s combination of high-quality image rendering and flexible avatar models, exhibiting rudimentary aspects of self-awareness, curiosity and novelty seeking; 5) Sophisticated multi-agent interactions and social behaviors enabled by TDW’s support for multiple avatars; 6) Experiments on animate attention in which both human observers in VR and a neural network agent embodying concepts of intrinsic curiosity find animacy to be more “interesting”.

2 ThreeDWorld Platform

System overview and API. The TDW simulation consists of two basic components: (i) the Build, a compiled executable running on the Unity3D Engine, which is responsible for image rendering, audio synthesis and physics simulations; and (ii) the Controller, an external Python interface to communicate with the build. Users can define their own tasks through it. Running a simulation follows a cycle in which: 1) The controller sends commands to the build; 2) The build executes those commands and sends simulation output data back to the controller. TDW commands can be sent in a list per simulation step rather than one at a time, enabling arbitrarily complex behavior.
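The send-and-receive cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `MockBuild`, `MockController`, and the `add_object` and `move_object` command names are hypothetical stand-ins invented for this example, not TDW's actual classes or commands. The sketch shows a list of commands being executed within a single simulation step, followed by output data returning to the controller.

```python
from typing import Any, Dict, List

Command = Dict[str, Any]


class MockBuild:
    """Hypothetical stand-in for the build; it only tracks object positions."""

    def __init__(self) -> None:
        self.positions: Dict[int, List[float]] = {}
        self.frame = 0

    def step(self, commands: List[Command]) -> Dict[str, Any]:
        # Execute every command in the list, then advance one simulation step.
        for cmd in commands:
            if cmd["type"] == "add_object":
                self.positions[cmd["id"]] = list(cmd["position"])
            elif cmd["type"] == "move_object":
                position = self.positions[cmd["id"]]
                for axis, delta in enumerate(cmd["delta"]):
                    position[axis] += delta
            else:
                raise ValueError(f"unknown command: {cmd['type']}")
        self.frame += 1
        # Return output data describing the current state of the world.
        return {
            "frame": self.frame,
            "positions": {k: tuple(v) for k, v in self.positions.items()},
        }


class MockController:
    """Hypothetical stand-in for the controller side of the cycle."""

    def __init__(self, build: MockBuild) -> None:
        self.build = build

    def communicate(self, commands: List[Command]) -> Dict[str, Any]:
        # 1) Send commands to the build; 2) receive its output data.
        return self.build.step(commands)


controller = MockController(MockBuild())
controller.communicate(
    [{"type": "add_object", "id": 1, "position": (0.0, 1.0, 0.0)}]
)
# Two commands sent together are both applied within the same step.
output = controller.communicate([
    {"type": "move_object", "id": 1, "delta": (0.5, 0.0, 0.0)},
    {"type": "move_object", "id": 1, "delta": (0.0, 0.0, 2.0)},
])
print(output)
```

Batching commands per step keeps the controller and the build in lockstep: everything in one list takes effect before the next frame of output data is produced.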

Photo-realistic Rendering. TDW uses Unity’s underlying game-engine technology for image rendering, adding a custom lighting approach to achieve near-photo realistic rendering quality for both indoor and outdoor scenes.

Lighting Model. TDW uses two types of lighting: a single light source simulates direct light coming from the sun, while indirect environment lighting comes from “skyboxes” that utilize High Dynamic Range (HDRI) images. For details, see Figure 1(a-c) and the Supplementary Material. Additional post-processing is applied to the virtual camera, including exposure compensation, tone mapping and dynamic depth-of-field. This further enhances the final image, achieving a degree of photorealism instrumental in enabling visual recognition transfer. Example video:

3D Model Library. To maximize control over image quality we have created a library of 3D model “assets” optimized from high-resolution 3D models. Using Physically-Based Rendering (PBR) materials, these models respond to light in a physically-correct manner. The library contains around 2000 objects spanning 200 categories such as furniture and appliances, animals, vehicles, toys etc.

Procedural Generation of New Environments. In TDW, a run-time virtual world, or “scene”, is created using our 3D model library assets. Environment models (interior or exterior) are populated with object models in various ways, from completely procedural (i.e. rule-based) to thematically organized (i.e. explicitly scripted). TDW places no restrictions on which models can be used with which environments, which allows for unlimited numbers and types of scene configurations.

High-fidelity Audio Rendering. Multi-modal rendering that includes flexible and efficient audio support is a key aspect of TDW that differentiates it from other frameworks.

Generation of Impact Sounds. TDW’s physics-driven audio support includes PyImpact, a Python library that uses modal synthesis to generate impact sounds Traer et al. (2019). PyImpact uses information about physical events such as material types, as well as velocities, normal vectors and masses of colliding objects to synthesize sounds that are played at the time of impact. This “round-trip” process is real-time and carries no appreciable latency. Example video:
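As a rough illustration of the general idea behind modal synthesis (not PyImpact's implementation), an impact sound can be approximated as a sum of exponentially decaying sinusoids, one per resonant mode. The mode frequencies, decay rates and amplitudes below are made-up values, and scaling loudness linearly with impact speed is a simplifying assumption made only for this sketch.

```python
import math
from typing import List, Tuple

# Each mode is (frequency in Hz, decay rate in 1/s, relative amplitude).
# These values are hypothetical and do not describe any real material.
Mode = Tuple[float, float, float]
MODES: List[Mode] = [(440.0, 20.0, 1.0), (1130.0, 35.0, 0.5), (2890.0, 60.0, 0.25)]


def synthesize_impact(modes: List[Mode], impact_speed: float,
                      duration: float = 0.5,
                      sample_rate: int = 44100) -> List[float]:
    """Return audio samples as a sum of exponentially decaying sinusoids."""
    samples = []
    for i in range(int(duration * sample_rate)):
        t = i / sample_rate
        value = sum(
            amplitude * math.exp(-decay * t) * math.sin(2 * math.pi * freq * t)
            for freq, decay, amplitude in modes
        )
        # Assumption: output scales linearly with the speed of the impact.
        samples.append(impact_speed * value)
    return samples


clip = synthesize_impact(MODES, impact_speed=1.5)
print(len(clip), "samples")
```

In a physics-driven pipeline, the mode set would be chosen from the materials of the colliding objects and the scaling from the collision data, which is the kind of information the paragraph above says PyImpact consumes.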

Environmental Audio and Reverberation. For all placed sounds, TDW provides high-quality simulated environment-specific reverberation for interior environments, with 3D spatialization and attenuation by distance and/or occlusion. Reverberation automatically varies with the geometry of the space, the virtual materials applied to walls, floor and ceiling, and the percentage of room volume occupied by solid objects such as furniture.
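To give a feel for why reverberation depends on room geometry, surface materials and occupied volume, the sketch below uses Sabine's classical approximation, RT60 ≈ 0.161 V / A, where V is the air volume in cubic metres and A is the total absorption (surface area times absorption coefficient, summed over surfaces). This is a textbook formula used purely for illustration; it is not a description of how TDW computes reverberation, and the absorption coefficients and the `occupied_fraction` parameter are assumptions of this example.

```python
from typing import Dict


def sabine_rt60(width: float, depth: float, height: float,
                absorption: Dict[str, float],
                occupied_fraction: float = 0.0) -> float:
    """Estimate reverberation time (seconds) of a box-shaped room.

    absorption maps "floor", "ceiling" and "walls" to coefficients in (0, 1].
    occupied_fraction is the share of room volume filled by solid objects.
    """
    if not 0.0 <= occupied_fraction < 1.0:
        raise ValueError("occupied_fraction must be in [0, 1)")
    # Only the air volume contributes; furniture reduces it.
    volume = width * depth * height * (1.0 - occupied_fraction)
    floor_area = width * depth
    wall_area = 2.0 * height * (width + depth)
    total_absorption = (
        floor_area * absorption["floor"]
        + floor_area * absorption["ceiling"]
        + wall_area * absorption["walls"]
    )
    return 0.161 * volume / total_absorption


# Hypothetical coefficients: hard reflective surfaces versus soft ones.
hard_room = {"floor": 0.02, "ceiling": 0.02, "walls": 0.02}
soft_room = {"floor": 0.30, "ceiling": 0.30, "walls": 0.30}

print(sabine_rt60(5.0, 4.0, 3.0, hard_room))
print(sabine_rt60(5.0, 4.0, 3.0, soft_room))
print(sabine_rt60(5.0, 4.0, 3.0, hard_room, occupied_fraction=0.2))
```

Under this approximation, softer wall materials and a more crowded room both shorten the reverberation tail, matching the qualitative dependencies listed above.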

Physical Simulation. In TDW, object behavior and interactions are handled by a physics engine. TDW supports two physics engines, providing both rigid-body physics and more advanced soft-body, cloth and fluid simulations.

Figure 2: Green outlines around objects indicate auto-computed convex colliders for fast but accurate rigid-body physics.

Rigid-body physics. Unity’s rigid body physics engine (PhysX) handles basic physics behavior involving collisions between rigid bodies. To achieve accurate but efficient collisions, we use the powerful V-HACD algorithm Mamou and Ghorbel (2009) to compute “form-fitting” convex hull colliders around each library object’s mesh, used to simplify collision calculations (see Figure 2). In addition, an object’s mass is automatically calculated from its volume and material density upon import. However, using API commands it is also possible to dynamically adjust mass or friction, as well as visual material appearance, on a per-object basis enabling potential disconnection of visual appearance from physical behavior (e.g. objects that look like concrete but bounce like rubber).
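The mass computation mentioned above (volume times material density) can be sketched as follows. This is a minimal example, assuming a closed triangle mesh with consistently outward-facing triangles; the density table holds approximate illustrative values, and none of the names here come from TDW.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[int, int, int]

# Approximate illustrative densities in kg/m^3; not values taken from TDW.
DENSITY: Dict[str, float] = {"wood": 700.0, "glass": 2500.0}


def mesh_volume(vertices: List[Vec3], triangles: List[Triangle]) -> float:
    """Volume of a closed, consistently oriented triangle mesh.

    Sums the signed volumes of tetrahedra formed by the origin and each
    triangle, using the scalar triple product a . (b x c).
    """
    total = 0.0
    for i, j, k in triangles:
        a, b, c = vertices[i], vertices[j], vertices[k]
        total += (
            a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0])
        )
    return abs(total) / 6.0


def mass_from_mesh(vertices: List[Vec3], triangles: List[Triangle],
                   material: str) -> float:
    """Mass in kg: mesh volume (m^3) times the material density (kg/m^3)."""
    return mesh_volume(vertices, triangles) * DENSITY[material]


# A right-angled tetrahedron with 0.3 m legs; faces are wound outward.
TETRA_VERTICES: List[Vec3] = [
    (0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.0, 0.3, 0.0), (0.0, 0.0, 0.3),
]
TETRA_TRIANGLES: List[Triangle] = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]

print(mass_from_mesh(TETRA_VERTICES, TETRA_TRIANGLES, "wood"))
```

Because the calculation depends only on geometry and a density lookup, swapping the material changes the mass without touching the mesh, which mirrors the separation of visual appearance from physical behavior described above.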

Advanced Physics Simulations. TDW’s second physics engine – Nvidia Flex – uses a particle-based representation to manage collisions between different object types. TDW supports rigid body, soft body (deformable), cloth and fluid simulations (Figure 1d). This unified representation helps machine learning models use underlying physics and rendered images to learn a physical and visual representation of the world through interactions with objects in the world.

Interactions and Avatars. TDW provides three paradigms for interacting with 3D objects: 1) Direct control of object behavior using API commands. 2) Indirect control through an “avatar” or embodiment of an AI agent. 3) Direct interaction by a human user, in virtual reality (VR).

Direct Control. Default object behavior in TDW is completely physics-based via commands in the API; there is no scripted animation of any kind. Using physics-based commands, users can move an object by applying an impulse force of a given magnitude along its forward vector, dynamically alter its mass or friction, or alter gravity altogether. Commands also exist to dynamically change an object’s visual material or its acoustic properties.
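The effect of an impulse applied along an object's forward vector follows from basic mechanics: the velocity changes by the impulse divided by the mass. The sketch below works through that arithmetic. It assumes a y-up coordinate system in which an unrotated object faces +z and yaw is a rotation about the vertical axis; the function names are hypothetical and are not TDW commands.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def forward_vector(yaw_degrees: float) -> Vec3:
    """Unit forward vector for a yaw about the vertical (y) axis.

    Assumption: an unrotated object faces +z.
    """
    yaw = math.radians(yaw_degrees)
    return (math.sin(yaw), 0.0, math.cos(yaw))


def apply_impulse(velocity: Vec3, mass: float, magnitude: float,
                  yaw_degrees: float) -> Vec3:
    """New velocity after an impulse (in N*s) along the forward vector."""
    if mass <= 0.0:
        raise ValueError("mass must be positive")
    forward = forward_vector(yaw_degrees)
    # Impulse J changes velocity by J / m in the direction of the impulse.
    return tuple(v + magnitude * f / mass for v, f in zip(velocity, forward))


light = apply_impulse((0.0, 0.0, 0.0), mass=2.0, magnitude=4.0, yaw_degrees=0.0)
heavy = apply_impulse((0.0, 0.0, 0.0), mass=4.0, magnitude=4.0, yaw_degrees=0.0)
print(light, heavy)
```

The same impulse moves the heavier object half as fast, which is why dynamically altering an object's mass changes how it responds to identical commands.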

Avatar Agents. Avatars serve as the embodiment of AI agents. Avatar types include:

  • Disembodied cameras for generating first-person rendered images, segmentation and depth maps.

  • Basic embodied agents whose avatars are geometric primitives such as spheres or capsules that can move around the environment and are often used for basic algorithm prototyping.

  • More complex embodied avatars with user-defined physical structures and associated physically-mapped action spaces. Complex robotic or biomechanical bodies can be modelled this way, including articulated arms that can open boxes or pick up and place objects.

Avatars can perform typical AI agent functions such as focusing attention on objects in the scene, moving around the environment while responding to physics, and basic path finding and obstacle avoidance (Fig. 1e). Multiple avatars can interact with each other in a scene; for example a “baby” agent can learn about the world by observing the behavior of a “parent” agent (Fig. 1f).

Human Interactions with VR devices. The TDW environment also supports users interacting directly with 3D objects in the scene using virtual reality (VR). Users see a 3D representation of their hands that tracks the actions of their own hands (Fig. 1g). Using API commands, objects are made “graspable” such that any collision between object and virtual hands allows the user to pick it up, place it or throw it. This functionality enables the collection of human behavior data, and could inspire more human-like machine learning models. Example video:

3 Example Applications

Figure 3: Examples from the TDW pre-training dataset, to be released as part of the TDW package.

Visual and Sound Recognition Transfer. We quantitatively examine how well feature representations learned using TDW-generated images and audio data transfer to real world scenarios.

Visual recognition transfer. First, we generated a TDW image classification dataset comparable in size to ImageNet; 1.3M images were generated by randomly placing one of TDW’s 2,000 object models in an environment with random conditions (weather, time of day) and taking a snapshot while pointing the randomly positioned virtual camera at the object (see Supplement for full details). Some examples of this dataset can be found in Figure 3. We pre-trained four ResNet-50 models He et al. (2016) on ImageNet Deng et al. (2009), SceneNet Handa et al. (2016), AI2-THOR Kolve et al. (2017) and the TDW-image dataset respectively. We then evaluated the learned representations by fine-tuning on downstream fine-grained image classification tasks using Aircraft Maji et al. (2013), Birds Van Horn et al. (2015), CUB Wah et al. (2011), Cars Krause et al. (2013), Dogs Khosla et al. (2011), Flowers Nilsback and Zisserman (2006), and Food datasets Bossard et al. (2014). Table 2 shows that the feature representations learned from TDW-generated images are substantially better than the ones learned from SceneNet Handa et al. (2016) and AI2-THOR Kolve et al. (2017), and have begun to approach the quality of those learned from ImageNet. These experiments suggest that though significant work remains to be done, TDW has taken meaningful steps towards mimicking the use of large-scale real-world datasets in model pre-training. We will release our pre-training dataset as part of the TDW package.

| Dataset | Aircraft | Bird | Car | Cub | Dog | Flower | Food | Mean |
|---|---|---|---|---|---|---|---|---|
| ImageNet | 0.74 | 0.70 | 0.86 | 0.72 | 0.72 | 0.92 | 0.83 | 0.78 |
| SceneNet | 0.06 | 0.43 | 0.30 | 0.27 | 0.38 | 0.62 | 0.77 | 0.40 |
| AI2-THOR | 0.57 | 0.59 | 0.69 | 0.56 | 0.56 | 0.62 | 0.79 | 0.63 |
| TDW | 0.73 | 0.69 | 0.86 | 0.70 | 0.67 | 0.89 | 0.81 | 0.76 |

Table 2: Visual representation transfer for fine-grained image classification.

Sound recognition transfer. We also created an audio dataset to test material classification from impact sounds. We recorded 300 sound clips of 5 different materials (cardboard, wood, metal, ceramic, and glass; between 4 and 15 different objects for each material), each struck by a selection of pellets (of wood, plastic, and metal, in a range of sizes for each material) dropped from a range of heights between 2 and 75 cm. The pellets themselves produced negligible sound compared to the objects, but because each pellet preferentially excited different resonant modes, the impact sounds depend upon the mass and material of the pellets, and the location and force of impact, as well as the material, shape, and size of the resonant objects Traer et al. (2019). Given the variability in other factors, material classification from this dataset is nontrivial. We trained material classification models on simulated audio from both TDW and the Sound-20K dataset Zhang et al. (2017), and tested their ability to classify object material from the real-world audio. As shown in Table 3, the model trained on the TDW audio dataset achieves an accuracy gain of more than 30 percentage points over the model trained on Sound-20K. This improvement is plausibly because TDW produces a more diverse range of sounds than Sound-20K, which prevents the network from overfitting to specific features of the synthetic audio set. Example output:

| Dataset | Accuracy |
|---|---|
| Sound-20K | 0.34 |
| TDW | 0.66 |

Table 3: Sound perception transfer on material recognition.

| Method | Material | Mass |
|---|---|---|
| Vision only | 0.72 | 0.42 |
| Audio only | 0.92 | 0.78 |
| Vision + Audio | 0.96 | 0.83 |

Table 4: Comparison of the effect of physical scene understanding on material and mass classification.

Multi-modal physical scene understanding. We used the TDW graphics engine, physics simulation and the sound synthesis technique described in Sec 2 to generate videos and impact sounds of objects dropped on flat surfaces (table tops and benches). The surfaces were rendered to have the visual appearance of one of the 5 materials for which we can support audio synthesis. The high degree of variation over object and material appearance, as well as physical properties such as trajectories and elasticity, prevents the network from memorizing features (i.e. that objects bounce more on metal than cardboard). We trained networks with simulated TDW scenes to identify the mass of the dropped object and the table material from visual, audio, and audio-visual information. These networks were then tested on simulated scenes from an independent test set. The results (Table 4) show that audio is more diagnostic than video for both classification tasks, and that the best performance requires audiovisual (i.e. multi-modal) information, underscoring the utility of realistic multi-modal rendering.

Training and Testing Physical Understanding.

End-to-end differentiable forward predictors of physical dynamics have emerged as being of great importance for enabling deep-learning based approaches to model-based planning and control applications Lerer et al. (2016); Mottaghi et al. (2016); Fragkiadaki et al. (2015); Battaglia et al. (2016); Agrawal et al. (2016); Shao et al. (2014); Fire and Zhu (2016); Pearl (2009); Ye et al. (2018), and for computational cognitive science models of physical understanding Battaglia et al. (2013); Chang et al. (2016). While traditional physics engines constructed for computer graphics (such as PhysX and Flex) have made great strides, such routines are often hard-wired and thus challenging to integrate as components of larger learnable systems. On the other hand, the quality and scalability of end-to-end learned physics predictors has been limited, in part by the availability of effective evaluation benchmarks and training data. This area has thus afforded a compelling use case for TDW and its advanced physical simulation capabilities.

Advanced Physical Prediction Benchmark. Using the TDW platform, we have created a comprehensive benchmark for training and evaluation of physically-realistic forward prediction algorithms, which will be released as part of the TDW package. This dataset contains a large and varied collection of physical scene trajectories, including all data from visual, depth, audio, and force sensors, high-level semantic label information for each frame, as well as latent generative parameters and code controllers for all situations. This dataset goes well beyond existing related benchmarks, such as IntPhys Riochet et al. (2018), providing scenarios with large numbers of complex real-world object geometries, photo-realistic textures, as well as a variety of rigid, soft-body, cloth, and fluid materials. Example scenarios from this dataset are seen in Fig 4a, and shown in this video:, and are grouped into subsets highlighting important issues in physical scene understanding, including:

  • Object Permanence: Object Permanence is a core feature of human intuitive physics Spelke (1990), and agents must learn that objects continue to exist when out of sight.

  • Shadows: TDW’s lighting models allow agents to distinguish both object intrinsic properties (e.g. reflectance, texture) and extrinsic ones (what color it appears), which is key to understanding that appearance can change depending on context, while underlying physical properties do not.

  • Sliding vs Rolling: Predicting the difference between an object rolling or sliding – an easy task for adult humans – requires a sophisticated mental model of physics. Agents must understand how object geometry affects motion, plus some rudimentary aspects of friction.

  • Stability: Most real-world tasks involve some understanding of object stability and balance. Unlike simulation frameworks where object interactions have predetermined stable outcomes, using TDW agents can learn to understand how geometry and mass distribution are affected by gravity.

  • Simple Collisions: Agents must understand how momentum and geometry affect collisions, because what happens when objects come into contact affects how we interact with them.

  • Complex Collisions: Momentum and high resolution object geometry help agents understand that large surfaces, like objects, can take part in collisions but are unlikely to move.

  • Draping & Folding: By modeling how cloth and rigid bodies behave differently, TDW allows agents to learn that soft materials are manipulated into different forms depending on what they are in contact with.

  • Submerging: Fluid behavior is different than solid object behavior, and interactions where fluid takes on the shape of a container and objects displace fluid are important for many real-world tasks.
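The sliding-versus-rolling distinction in the list above has a compact textbook form that shows why geometry matters. On an incline of angle θ, a frictionless sliding body accelerates at g·sin θ, while a body rolling without slipping accelerates at g·sin θ / (1 + β), where β = I / (m r²) depends on its shape (2/5 for a solid sphere, 1 for a thin hoop). The sketch below applies these standard formulas; it is an idealized illustration and is not derived from the TDW benchmark itself.

```python
import math
from typing import Optional

G = 9.81  # gravitational acceleration in m/s^2


def incline_acceleration(angle_degrees: float,
                         inertia_factor: Optional[float] = None) -> float:
    """Acceleration down an incline.

    inertia_factor is beta = I / (m r^2) for rolling without slipping;
    None means frictionless sliding.
    """
    sliding = G * math.sin(math.radians(angle_degrees))
    if inertia_factor is None:
        return sliding
    # Part of the energy goes into rotation, so rolling is slower.
    return sliding / (1.0 + inertia_factor)


def travel_time(length: float, acceleration: float) -> float:
    """Time to cover `length` metres from rest at constant acceleration."""
    return math.sqrt(2.0 * length / acceleration)


slide = incline_acceleration(30.0)
sphere = incline_acceleration(30.0, inertia_factor=2.0 / 5.0)
hoop = incline_acceleration(30.0, inertia_factor=1.0)

for name, acc in [("sliding block", slide), ("solid sphere", sphere), ("hoop", hoop)]:
    print(f"{name}: {travel_time(1.0, acc):.3f} s to travel 1 m")
```

Two objects of the same mass therefore reach the bottom at different times purely because of how their mass is distributed, which is the kind of regularity a learned predictor has to pick up from observation.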

Figure 4: Physics Evaluation and Training. a) Scenarios for training and evaluating advanced physical understanding in end-to-end differentiable physics predictors. These are part of a benchmark dataset that will be released along with TDW. Each panel of four images is in order of top-left, top-right, bottom-left, bottom-right. b) Hierarchical Relation Network (HRN) architecture from Mrowca et al. (2018). c) Quantitative comparison of accuracy of physical predictions over time for HRN compared to no-collision ablation (green), simple multi-layer perceptron (magenta) and Interaction Network Battaglia et al. (2016) (red). d) Examples of prediction rollouts for a variety of physical scenarios.

Learning a Differentiable Physics Model. The Hierarchical Relation Network (HRN) is a recently published end-to-end differentiable neural network based on hierarchical graph convolution Mrowca et al. (2018). The HRN relies on a hierarchical particle-based object representation that covers a wide variety of three-dimensional objects, including arbitrary rigid geometrical shapes, deformable materials, cloth, and fluids, and it learns to predict physical dynamics in this representation.

Training the HRN at scale is enabled by TDW’s integration of both Flex and PhysX, providing a complex and varied environment for learning rich physical interactions. By virtue of combining the physical scenarios described by these two physics engines, the HRN’s learned solution goes beyond either one alone. For example, the HRN can predict physical interactions with objects composed of non-homogeneous materials (e.g. a cone whose rigidity varies from stiff at the base to wobbly at the top). Future work can build on these ideas to train more sophisticated physics models, and use them in complex model-based learning environments with real-world objects.

Figure 5: Multi-Agent and VR Capabilities. a) Illustration of TDW’s VR capabilities in an experiment measuring spontaneous patterns of attention to agents executing spatiotemporal kinematics typical of real-world inanimate and animate agents. By design, the stimuli are devoid of surface features, so that both humans and intrinsically-motivated neural network agents must discover which agents are interesting and thus worth paying attention to, based on the behavior of the actor agents. Example timecourses (panel b) and aggregate attention (panel c) for different agents, from humans over real time, and from intrinsically-motivated neural network agents over learning time.

Social Agents and Virtual Reality. Moving beyond single agents, social interactions are a critical aspect of human life, but an area where current approaches in AI and robotics are especially limited. Improvements in creating AI agents that model and mimic social behavior, and learn efficiently from social interactions, are thus an important area for cutting-edge technical development. Using the flexibility of the multi-avatar API, we have created implementations of a variety of multi-agent interactive settings using TDW (Fig. 1f). These include scenarios in which an “observer” agent is placed in a room with multiple inanimate objects, together with several differentially-controlled “actor” agents (Fig. 5a). The actor agents are controlled by either hard-coded or interactive policies implementing behaviors such as object manipulation, chasing and hiding, and motion imitation. The observer seeks to maximize its ability to predict the behaviors of the actors, allocating its attention based on a metric of “progress curiosity” Baranes and Oudeyer (2013) that seeks to estimate which observations are most likely to increase the observer’s ability to make actor predictions. Intriguingly, in recent work using TDW, these socially-curious agents have been shown to outperform a variety of existing alternative curiosity metrics in producing better predictions, both in terms of final performance and in substantially reducing the sample complexity required to learn actor behavior patterns Kim et al. (2020).
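One simple way to make the idea of progress-based attention concrete is to track each actor's recent prediction errors and attend to the actor whose error is falling fastest. The sketch below is a deliberately simplified, hypothetical rule written for illustration; the class name, the window size and the example error sequences are all assumptions, and it is not the method of Kim et al. (2020) or Baranes and Oudeyer (2013).

```python
from collections import deque
from statistics import mean
from typing import Deque, Dict, Iterable


class ProgressCuriosity:
    """Attend to the actor whose prediction error is decreasing fastest."""

    def __init__(self, actors: Iterable[str], window: int = 4) -> None:
        self.window = window
        # Keep the last 2 * window errors per actor: an older and a newer half.
        self.errors: Dict[str, Deque[float]] = {
            actor: deque(maxlen=2 * window) for actor in actors
        }

    def record(self, actor: str, error: float) -> None:
        self.errors[actor].append(error)

    def progress(self, actor: str) -> float:
        history = list(self.errors[actor])
        if len(history) < 2 * self.window:
            return float("inf")  # too little data: explore this actor first
        older, newer = history[:self.window], history[self.window:]
        # Positive when recent errors are lower than older ones.
        return mean(older) - mean(newer)

    def choose(self) -> str:
        return max(self.errors, key=self.progress)


observer = ProgressCuriosity(["static", "noisy", "chaser"])
# Hypothetical error histories: already learned, unlearnable, and learnable.
histories = {
    "static": [0.1] * 8,
    "noisy": [0.9, 1.1] * 4,
    "chaser": [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3],
}
for actor, errors in histories.items():
    for error in errors:
        observer.record(actor, error)

print(observer.choose())
```

A rule of this kind ignores both what is already predictable and what stays unpredictable, and concentrates on behavior that is in the process of being learned.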

The VR integration in TDW enables humans to directly observe and manipulate objects in responsive virtual environments. Figure 5 illustrates an experiment investigating the patterns of attention that human observers exhibit in an environment with multiple animate agents and static objects Johnson (2003); Frankenhuis et al. (2013). Observers wear a GPU-powered Oculus Rift S while watching a virtual display containing multiple robots. Head movements from the Oculus are mapped to a sensor camera within TDW, and camera images are paired with meta-data about the image-segmented objects, in order to determine which set of robots people are gazing at. Interestingly, the socially-curious neural network agents produce an aggregate attentional gaze pattern that is quite similar to that of human adults measured in the VR environment (Fig. 5b), arising from the agent’s discovery of the inherent relative “interestingness” of animacy, without having to build it into the network architecture Kim et al. (2020). These results are just one illustration of TDW’s extensive VR capabilities in bridging AI and human behaviors.

4 Future Directions

We are actively working to develop new capabilities for robotic systems integration and articulable object interaction for higher-level task planning and execution.

Articulable Objects. Currently only a small number of TDW objects are modifiable by user interaction, and we are actively expanding the number of library models that support such behaviors, including containers with lids that open, chests with removable drawers and doors with functional handles.

Humanoid Avatars. Interacting with actionable objects or performing fine-motor control tasks such as solving a jigsaw puzzle requires avatars with a fully-articulated body and hands. We plan to develop a set of humanoid avatar types that fulfill these requirements, driven by motion capture data and with fully-articulated hands controlled by a separate gesture control system.

Robotic Systems Integration. We plan to extend our avatar types to allow real-world transfer to robotic arms, through a new physics-based articulation system and support for standard URDF robot specification files.

Broader Impact

As we have illustrated, TDW is a completely general and flexible simulation platform, and as such can benefit research that sits at the intersection of neuroscience, cognitive science, psychology, engineering and machine learning / AI. We feel the broad scope of the platform will support research into understanding how the brain processes a range of sensory data – visual, auditory and even tactile – as well as physical inference and scene understanding. We envision TDW and PyImpact supporting research into human – and machine – audio perception, that can lead to a better understanding of the computational principles underlying human audition. This understanding can, for example, ultimately help to create better assistive technology for the hearing-impaired. We recognize that the diversity of “audio materials” used in PyImpact is not yet adequate to meet this longer-term goal, but we are actively addressing that and plan to increase the scope significantly. We also believe the wide range of physics behaviors and interaction scenarios TDW supports will greatly benefit research into understanding how we as humans learn so much about the world, so rapidly and flexibly, given minimal input data. While we have made significant strides in the accuracy of physics behavior in TDW, TDW is not yet able to adequately support robotic simulation tasks. To support visual object recognition and image understanding we constantly strive to make TDW’s image generation as photoreal as possible using today’s real-time 3D technology. However, we are not yet at the level we would like to be. We plan to continue improving our rendering and image generation capability, taking advantage of any relevant technology advances (e.g. real-time hardware-assisted ray tracing) while continuing to explore the relative importance of object variance, background variability and overall image quality to vision transfer results.


  • [1] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine (2016) Learning to poke by poking: experiential learning of intuitive physics. CoRR abs/1606.07419. External Links: Link, 1606.07419 Cited by: Appendix D, §3.
  • [2] A. Baranes and P. Oudeyer (2013) Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems 61 (1), pp. 49–73. Cited by: §3.
  • [3] P. Battaglia, J. Hamrick, and J. Tenenbaum (2013-10) Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences of the United States of America 110. External Links: Document Cited by: Appendix D, §3.
  • [4] P. W. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, and K. Kavukcuoglu (2016) Interaction networks for learning about objects, relations and physics. CoRR abs/1612.00222. External Links: Link, 1612.00222 Cited by: Appendix D, Figure 4, §3.
  • [5] C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, et al. (2016) Deepmind lab. arXiv preprint arXiv:1612.03801. Cited by: Appendix D, Appendix D, Table 1, §1.
  • [6] L. Bossard, M. Guillaumin, and L. Van Gool (2014) Food-101–mining discriminative components with random forests. In ECCV, pp. 446–461. Cited by: §3.
  • [7] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: Appendix D.
  • [8] M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum (2016) A compositional object-based approach to learning physical dynamics. CoRR abs/1612.00341. External Links: Link, 1612.00341 Cited by: Appendix D, §3.
  • [9] C. Chen, U. Jain, C. Schissler, S. V. A. Gari, Z. Al-Halah, V. K. Ithapu, P. Robinson, and K. Grauman (2020) Audio-visual embodied navigation. ECCV. Cited by: Appendix D.
  • [10] E. Coumans and Y. Bai (2016) Pybullet, a python module for physics simulation for games, robotics and machine learning. GitHub repository. Cited by: Appendix D, Table 1, §1.
  • [11] V. Dean, S. Tulsiani, and A. Gupta (2020) See, hear, explore: curiosity via audio-visual association. arXiv preprint arXiv:2007.03669. Cited by: Appendix D.
  • [12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §3.
  • [13] A. Fire and S. Zhu (2016) Learning perceptual causality from video. ACM Transactions on Intelligent Systems and Technology (TIST) 7 (2), pp. 23. Cited by: Appendix D, §3.
  • [14] K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik (2015-11) Learning visual predictive models of physics for playing billiards. Cited by: Appendix D, §3.
  • [15] W. E. Frankenhuis, B. House, H. C. Barrett, and S. P. Johnson (2013) Infants’ perception of chasing. Cognition 126 (2), pp. 224–233. Cited by: §3.
  • [16] C. Gan, D. Huang, H. Zhao, J. B. Tenenbaum, and A. Torralba (2020) Music gesture for visual sound separation. In CVPR, pp. 10478–10487. Cited by: Appendix D.
  • [17] C. Gan, Y. Zhang, J. Wu, B. Gong, and J. B. Tenenbaum (2020) Look, listen, and act: towards audio-visual embodied navigation. ICRA. Cited by: Appendix D.
  • [18] C. Gan, H. Zhao, P. Chen, D. Cox, and A. Torralba (2019) Self-supervised moving vehicle tracking with stereo sound. In ICCV, pp. 7053–7062. Cited by: Appendix D.
  • [19] A. Handa, V. Patraucean, V. Badrinarayanan, S. Stent, and R. Cipolla (2016) Understanding real world indoor scenes with synthetic data. In CVPR, pp. 4077–4085. Cited by: §3.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.
  • [21] D. L. James, J. Barbič, and D. K. Pai (2006) Precomputed acoustic transfer: output-sensitive, accurate sound generation for geometrically complex vibration sources. In ACM Transactions on Graphics (TOG), Vol. 25, pp. 987–995. Cited by: Appendix D.
  • [22] S. C. Johnson (2003) Detecting agents. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 358 (1431), pp. 549–559. Cited by: §3.
  • [23] A. Khosla, N. Jayadevaprakash, B. Yao, and F. Li (2011) Novel dataset for fine-grained image categorization: stanford dogs. In CVPR-FGVC, Vol. 2. Cited by: §3.
  • [24] K. H. Kim, M. Sano, J. De Freitas, N. Haber, and D. L. K. Yamins (2020) Active world model learning in agent-rich environments with progress curiosity. In Proceedings of the International Conference on Machine Learning. Cited by: §3.
  • [25] E. Kolve, R. Mottaghi, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi (2017) Ai2-thor: an interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474. Cited by: Appendix D, Table 1, §1, §3.
  • [26] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3d object representations for fine-grained categorization. In CVPRW, pp. 554–561. Cited by: §3.
  • [27] M. Kuhlo and E. Eggert (2010) Architectural rendering with 3ds max and v-ray. Elsevier. Cited by: Appendix D.
  • [28] A. Lerer, S. Gross, and R. Fergus (2016) Learning physical intuition of block towers by example. CoRR abs/1603.01312. External Links: Link, 1603.01312 Cited by: §3.
  • [29] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi (2013) Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151. Cited by: §3.
  • [30] K. Mamou and F. Ghorbel (2009) A simple and efficient approach for 3d mesh approximate convex decomposition. In 2009 16th IEEE international conference on image processing (ICIP), pp. 3501–3504. Cited by: §2.
  • [31] R. Mottaghi, M. Rastegari, A. Gupta, and A. Farhadi (2016) “What happens if…” learning to predict the effect of forces in images. CoRR abs/1603.05600. External Links: Link, 1603.05600 Cited by: Appendix D, §3.
  • [32] D. Mrowca, C. Zhuang, E. Wang, N. Haber, L. F. Fei-Fei, J. Tenenbaum, and D. L. Yamins (2018) Flexible neural representation for physics prediction. In Advances in Neural Information Processing Systems, pp. 8799–8810. Cited by: §A.4, Figure 4, §3.
  • [33] M. Nilsback and A. Zisserman (2006) A visual vocabulary for flower classification. In CVPR, Vol. 2, pp. 1447–1454. Cited by: §3.
  • [34] J. Pearl (2009) Causality. Cambridge University Press. Cited by: Appendix D, §3.
  • [35] X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba (2018) Virtualhome: simulating household activities via programs. In CVPR, pp. 8494–8502. Cited by: Appendix D, Table 1, §1.
  • [36] R. Riochet, M. Y. Castro, M. Bernard, A. Lerer, R. Fergus, V. Izard, and E. Dupoux (2018) Intphys: a framework and benchmark for visual intuitive physics reasoning. arXiv preprint arXiv:1803.07616. Cited by: §3.
  • [37] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al. (2019) Habitat: a platform for embodied ai research. ICCV. Cited by: Appendix D, Table 1, §1.
  • [38] T. Shao*, A. Monszpart*, Y. Zheng, B. Koo, W. Xu, K. Zhou, and N. Mitra (2014) Imagining the unseen: stability-based cuboid arrangements for scene understanding. ACM SIGGRAPH Asia 2014. Note: * Joint first authors Cited by: Appendix D, §3.
  • [39] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser (2017) Semantic scene completion from a single depth image. CVPR. Cited by: Appendix D.
  • [40] E. S. Spelke (1990) Principles of object perception. Cognitive science 14 (1), pp. 29–56. Cited by: 1st item.
  • [41] E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: Appendix D, Table 1, §1.
  • [42] J. Traer, M. Cusimano, and J. H. McDermott (2019) A perceptually inspired generative model of rigid-body contact sounds. Digital Audio Effects (DAFx). Cited by: Appendix D, §2, §3.
  • [43] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie (2015) Building a bird recognition app and large scale dataset with citizen scientists: the fine print in fine-grained dataset collection. In CVPR, pp. 595–604. Cited by: §3.
  • [44] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §3.
  • [45] Y. Wang, C. Gan, M. H. Siegel, Z. Zhang, J. Wu, and J. B. Tenenbaum (2017) A computational model for combinatorial generalization in physical auditory perception. CCN. Cited by: Appendix D.
  • [46] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian (2018) Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209. Cited by: Appendix D, Table 1, §1.
  • [47] F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. E. Tchapmi, A. Toshev, R. Martín-Martín, and S. Savarese (2020) Interactive gibson benchmark: a benchmark for interactive navigation in cluttered environments. IEEE Robotics and Automation Letters 5 (2), pp. 713–720. Cited by: Appendix D, Table 1, §1.
  • [48] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese (2018) Gibson env: real-world perception for embodied agents. In CVPR, pp. 9068–9079. Cited by: Appendix D, Table 1, §1.
  • [49] F. Xiang, Y. Qin, K. Mo, Y. Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y. Yuan, H. Wang, et al. (2020) SAPIEN: a simulated part-based interactive environment. CVPR. Cited by: Appendix D, Table 1, §1.
  • [50] T. Ye, X. Wang, J. Davidson, and A. Gupta (2018-09) Interpretable intuitive physics model. In The European Conference on Computer Vision (ECCV). Cited by: Appendix D, §3.
  • [51] K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, and J. B. Tenenbaum (2020) CLEVRER: collision events for video representation and reasoning. ICLR. Cited by: Appendix D.
  • [52] Z. Zhang, Q. Li, Z. Huang, J. Wu, J. Tenenbaum, and B. Freeman (2017) Shape and material from sound. In NIPS, pp. 1278–1288. Cited by: Appendix D, §3.
  • [53] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba (2018) The sound of pixels. In ECCV, pp. 570–586. Cited by: Appendix D.

Appendix A Model Details

A.1 Visual Recognition Transfer

We use a ResNet-18 network architecture as the backbone for all the visual perception transfer experiments. For pre-training, we set the initial learning rate to 0.1 with cosine decay and train for 100 epochs. We then take the pre-trained weights as initialization and fine-tune on fine-grained image recognition tasks. Concretely, we use an initial learning rate of 0.01 with cosine decay and train for 10 epochs on the fine-grained image recognition datasets.
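The two schedules above can be sketched as follows. The text only says "cosine decay", so the exact form used here (per-epoch updates, decay to zero, no warm-up) is an assumption, and `cosine_decay_lr` is a hypothetical helper name rather than anything from the TDW code base.

```python
import math


def cosine_decay_lr(initial_lr: float, epoch: int, total_epochs: int) -> float:
    """Learning rate at the start of `epoch` under plain cosine decay to zero.

    Assumed form: lr = 0.5 * lr0 * (1 + cos(pi * epoch / total_epochs)).
    """
    progress = epoch / total_epochs
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * progress))


# Pre-training: initial rate 0.1, 100 epochs (values from the text).
pretrain_schedule = [cosine_decay_lr(0.1, e, 100) for e in range(100)]

# Fine-tuning: initial rate 0.01, 10 epochs (values from the text).
finetune_schedule = [cosine_decay_lr(0.01, e, 10) for e in range(10)]
```

Under this form the rate starts at its initial value, reaches half of it at the midpoint of training, and approaches zero at the end.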

A.2 Sound Recognition Transfer

For the sound recognition transfer experiment, we first convert the raw audio waveform to a sound spectrogram representation and feed it to a VGG-16 pre-trained on AudioSet for material classification. For training, we set the initial learning rate to 0.01 with cosine decay and train for 50 epochs.

A.3 Multi-modal Physical Scene Understanding

The training and test data sets for multi-modal physical scene understanding were generated with TDW and PyImpact audio synthesis and have the same material and mass categories for classification. However, the test-set videos contained objects, tables, motion patterns, and impact sounds that were different from any video in the training set. Across all videos, the identity, size, initial location, and initial angular momentum of the dropped object were randomized to ensure every video had a unique pattern of motion and bounces. The shape, size, and orientation of the table were randomized, as were the surface texture renderings (e.g., a wooden table could be rendered as “cedar,” “pine,” “oak,” “teak,” etc.), to ensure every table appearance was unique. PyImpact uses a random sampling of resonant modes to create an impact sound, such that the impacts in every video had a unique spectro-temporal structure; only the statistics of the resonant modes are constrained to indicate material.

For the video-only baseline, we first extract visual features from each video frame using ResNet-18 pre-trained on ImageNet. To aggregate a video-level representation of a 2048-dimensional feature vector, we apply average pooling over 25 video frames. For the audio-only baseline, we first convert the raw audio waveform to a sound spectrogram representation. Then we take the sound spectrogram as input for a VGG-16 pre-trained on AudioSet. Each audio clip is then represented as a 4096-dimensional feature vector. Finally, we take the visual feature only, the sound feature only, or the concatenation of the visual and sound features as input to train a 2-layer MLP classifier for material and mass classification.
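The aggregation and fusion steps described above can be illustrated with a minimal sketch. It covers only the average pooling and concatenation; the feature extractors and the MLP are omitted, and the helper names `average_pool` and `fuse` are hypothetical. Toy dimensions are used so the values are easy to check by hand.

```python
from typing import List, Sequence


def average_pool(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors into one video-level vector."""
    n_frames = len(frame_features)
    return [sum(column) / n_frames for column in zip(*frame_features)]


def fuse(visual: Sequence[float], audio: Sequence[float]) -> List[float]:
    """Concatenate the video-level and audio-level feature vectors."""
    return list(visual) + list(audio)


# Toy sizes for readability; the text uses 25 frames, a 2048-d visual
# vector, and a 4096-d audio vector.
frames = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
video_vector = average_pool(frames)
audio_vector = [0.5, 0.25]
fused_vector = fuse(video_vector, audio_vector)
```

With the dimensions given in the text, the fused input to the classifier would have 2048 + 4096 = 6144 entries.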

A.4 Learning Physical Dynamics Model

In this experiment, we assume that the Hierarchical Relation Network (HRN) model [32] has access to the FLeX particle representation of each object, which is provided at every simulation step by the environment. From this particle representation, we construct a hierarchical particle relationship scene graph representation. Graph nodes correspond to either particles or groupings of other nodes and are arranged in a hierarchy, whereas edges represent constraints between nodes. The HRN, as the dynamics model, takes a history of hierarchical graphs as input and predicts the future particle states. The model first computes collision effects between particles, effects of external forces, and effects of past particles on current particles using pairwise graph convolutions. The effects are then propagated through the particle hierarchy using a hierarchical graph convolution module whose message-passing order is depicted in Figure 2(a). First, effects are propagated from leaf to ancestor particles (L2A), then within siblings (WG), and finally from ancestors to descendants (A2D). Finally, the fully-connected module computes the next particle states from the summed effects and past particle states.
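The message-passing order (L2A, then WG, then A2D) can be illustrated with a toy sketch. This is not the HRN implementation: the learned graph convolutions are replaced by plain summation of scalar "effects", and the `parent` map and `propagate` function are hypothetical names chosen for illustration.

```python
from typing import Dict, Optional


def depth(node: str, parent: Dict[str, Optional[str]]) -> int:
    """Number of edges between `node` and the root of the hierarchy."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def propagate(effects: Dict[str, float],
              parent: Dict[str, Optional[str]]) -> Dict[str, float]:
    """Propagate scalar effects in the order L2A -> WG -> A2D."""
    out = dict(effects)
    by_depth = sorted(out, key=lambda n: depth(n, parent))

    # L2A: leaves to ancestors (deepest nodes first).
    for node in reversed(by_depth):
        p = parent[node]
        if p is not None:
            out[p] += out[node]

    # WG: each node receives the effects of its siblings.
    snapshot = dict(out)
    for node in out:
        p = parent[node]
        if p is not None:
            out[node] += sum(v for s, v in snapshot.items()
                             if s != node and parent[s] == p)

    # A2D: ancestors to descendants (shallowest nodes first).
    for node in by_depth:
        p = parent[node]
        if p is not None:
            out[node] += out[p]
    return out


parent = {"root": None, "a": "root", "b": "root"}
result = propagate({"root": 0.0, "a": 1.0, "b": 2.0}, parent)
```

In this toy hierarchy the root first collects both leaf effects, the two leaves then exchange effects as siblings, and finally each leaf receives the root's accumulated effect.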

Appendix B Dataset Details

B.1 TDW-image Dataset

To generate images, the controller runs each model through two loops. The first loop captures camera and object positions, rotations, etc. Then, these cached positions are played back in the second loop to generate images. Image capture is divided this way because the first loop will “reject” a lot of images with poor composition; this rejection system doesn’t require image data, and so sending image data would slow down the entire controller.

The controller relies on IdPassGrayscale data to determine whether an image has good composition. This data reduces the rendered frame of a segmentation color pass to a single pixel and returns the grayscale value of that pixel. To start the positional loop, the entire window is resized to 32 × 32 and render quality is set to minimal, in order to speed up the overall process. There are then two grayscale passes: one without occluding objects (by moving the camera and object high above the scene) and one with occluding scenery, but with the exact same relative positions and rotations. The difference in grayscale must exceed 0.55 for the camera and object positions and rotations to be “accepted”. This data is then cached. In a third pass, the screen is resized back to 256 × 256, images and high-quality rendering are enabled, and the controller uses the cached positional/rotational data to iterate rapidly through the dataset.
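The accept/reject step of the positional loop might look roughly like the sketch below. The 0.55 threshold comes from the text; everything else is an assumption for illustration. In particular, the text does not say how the difference is signed or normalized (an absolute difference is assumed here), and `measure` is a hypothetical callback standing in for the two IdPassGrayscale passes.

```python
from typing import Callable, Iterable, List, Tuple

Pose = Tuple[float, float, float]
GRAYSCALE_THRESHOLD = 0.55  # value given in the text


def is_accepted(unoccluded: float, occluded: float,
                threshold: float = GRAYSCALE_THRESHOLD) -> bool:
    """Accept a pose if the two grayscale readings differ by more than the threshold."""
    return abs(unoccluded - occluded) > threshold


def collect_poses(candidates: Iterable[Pose],
                  measure: Callable[[Pose], Tuple[float, float]]) -> List[Pose]:
    """First loop: cache only the poses that pass the composition test."""
    cached = []
    for pose in candidates:
        unoccluded, occluded = measure(pose)
        if is_accepted(unoccluded, occluded):
            cached.append(pose)
    return cached


# Made-up readings: pose -> (unoccluded grayscale, occluded grayscale).
fake_readings = {(0.0, 1.0, 0.0): (0.9, 0.2), (1.0, 1.0, 1.0): (0.6, 0.4)}
accepted = collect_poses(fake_readings, fake_readings.__getitem__)
```

The cached list is what the later high-quality pass would replay, so no image data needs to be transferred while poses are being rejected.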

B.2 Advanced Physical Prediction Benchmark

Below are individual descriptions of each of the physics dataset scenarios mentioned in the paper and shown in the Supplementary Material video. Note that additional scenarios are included here that were not mentioned in the paper; some are included in the video.

Binary Collisions. Randomly-selected “toys” are created with random physics values. A force of randomized magnitude is applied to one toy, aimed at another.

Complex Collisions. Multiple objects are dropped onto the floor from a height, with randomized starting positions and orientations.

Object Occlusion. Random “big” and “small” models are added. The small object is at a random distance and angle from the big object. The camera is placed at a random distance and rotated such that the “big” model occludes the “small” model in some frames. Note – not included in video.

Object Permanence. A ball rolls behind an occluding object and then reemerges. The occluder is randomly chosen from a list. The ball has a random starting distance, visual material, physics values, and initial force.

Shadows. A ball is added in a scene with a randomized lighting setup. The ball has a random initial position, force vector, physics values, and visual materials. The force vectors are such that the ball typically rolls through differently-lit areas, i.e. a bright spot to a shadowy spot.

Stability. A stack of 4-7 objects is created. The objects are all simple shapes with random colors. The stack is built according to a “stability” algorithm; some algorithms yield more balanced stacks than others. The stack falls down, or doesn’t.

Containment. A small object is contained and rattles around in a larger object, such as a basket or bowl. The small object has random physics values. The bowl has random force vectors.

Sliding/Rolling. Objects are placed on a table. A random force is applied at a random point on the table. The objects slide or roll down.

Bouncing. Four “ramp” objects are placed randomly in a room. Two to six “toy” objects are added to the room in mid-air and given random physics values and force vectors, such that they will bounce around the scene. Note – not included in video.

Draping/Folding. A cloth falls, 80 percent of the time onto another rigid body object. The cloth has random physics values.

Dragging. A rigid object is dragged or moved by pulling on a cloth under it. The cloth and the object have random physics values. The cloth is pulled in by a random force vector.

Squishing. Squishy objects deform and are restored to original shape depending on applied forces (e.g. squished when something else is on top of them or when they impact a barrier). Note – not included in video.

Submerging. Objects sink or float in fluid. Values for viscosity, adhesion and cohesion vary by fluid type, as does the visual appearance of the fluid. Fluids represented in the video include water, chocolate, honey, oil and glycerin.

Appendix C TDW Lighting Model

The lighting model for both interior and exterior environments utilizes a single primary light source that simulates the sun and provides direct lighting, affecting the casting of shadows. In most interior environments, additional point or spot lights are also used to simulate the light coming from lighting fixtures in the space.

General environment (indirect) lighting comes from “skyboxes” that utilize High Dynamic Range images (HDRI). Skyboxes are conceptually similar to a planetarium projection, while HDRI images are a special type of photographic digital image that contain more information than a standard digital image. Photographed at real-world locations, they capture lighting information for a given latitude and hour of the day. This technique is widely used in movie special-effects, when integrating live-action photography with CGI elements.

TDW’s implementation of HDRI lighting automatically adjusts:

  • The elevation of the “sun” light source to match the time of day in the original image; this affects the length of shadows.

  • The intensity of the “sun” light, to match the shadow strength in the original image.

  • The rotation angle of the “sun” light, to match the direction shadows are pointing in the original image.
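As a worked illustration of the first adjustment, the relationship between sun elevation and shadow length follows from basic trigonometry. The sketch below assumes flat ground and a single directional light; it is not TDW code, and `shadow_length` is a hypothetical helper.

```python
import math


def shadow_length(object_height: float, sun_elevation_deg: float) -> float:
    """Length of the shadow cast on flat ground by an object of the given height.

    Uses length = height / tan(elevation), valid for 0 < elevation <= 90 degrees.
    """
    if not 0.0 < sun_elevation_deg <= 90.0:
        raise ValueError("elevation must be in (0, 90] degrees")
    return object_height / math.tan(math.radians(sun_elevation_deg))


# A 1 m object under a low sun and under a high sun.
low_sun = shadow_length(1.0, 15.0)
high_sun = shadow_length(1.0, 60.0)
```

A lower elevation (early morning or late afternoon) gives a longer shadow, which is why matching the elevation to the time of day in the original image matters.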

By rotating the HDRI image, we can realistically simulate different viewing positions, with corresponding changes in lighting, reflections and shadowing in the scene (see the Supplementary Material video for an example).

TDW currently provides over 100 HDRI images captured at various locations around the world and at different times of the day, from sunrise to sunset. These images are evenly divided between indoor and outdoor locations.

Appendix D Related Simulation Environments

Recently, several simulation platforms have been developed to support research into embodied AI, scene understanding, and physical inference. These include AI2-THOR [25], HoME [46], VirtualHome [35], Habitat [37], Gibson [48], iGibson [47], SAPIEN [49], PyBullet [10], MuJoCo [41], and Deepmind Lab [5]. However, none of them approaches TDW’s range of features and diversity of potential use cases.

Rendering and Scene Types. Research in computer graphics (CG) has developed extremely photorealistic rendering pipelines [27]. However, the most advanced techniques (e.g. ray tracing) have yet to be fully integrated into real-time rendering engines. Some popular simulation platforms, including Deepmind Lab [5] and OpenAI Gym [7], do not target realism in their rendering or physics and are better suited to prototyping than exploring realistic situations. Others use a variety of approaches for more realistic visual scene creation – scanned from actual environments (Gibson, Habitat), artist-created (AI2-THOR) or using existing datasets such as SUNCG [39] (HoME). However, all are limited to the single paradigm of rooms in a building, populated by furniture, whereas TDW supports real-time near-photorealistic rendering of both indoor and outdoor environments. Only TDW allows users to create custom environments procedurally, as well as populate them with custom object configurations for specialized use-cases. For example, it is equally straightforward with TDW to arrange a living room full of furniture (see Fig. 1a-b), to generate photorealistic images of outdoor scenes (Fig. 1c) to train networks for transfer to real-world images, or to construct a “Rube Goldberg” machine for physical inference experiments (Fig. 1h).

Physical Dynamics. Several stand-alone physics engines are widely used in AI training, including PyBullet, which supports rigid object interactions, and MuJoCo, which supports a range of accurate and complex physical interactions. However, these engines do not generate high-quality images or audio output. Conversely, platforms with real-world scanned environments, such as Gibson and Habitat, do not support free interaction with objects. HoME does not provide photorealistic rendering but does support rigid-body interactions with scene objects, using either simplified (but inaccurate) “box-collider” bounding-box approximations or the highly inefficient full object mesh. AI2-THOR provides better rendering than HoME or VirtualHome, with similar rigid-body physics to HoME. In contrast, TDW automatically computes convex hull colliders that provide mesh-level accuracy with box-collider-like performance (Fig. 2). This fast-but-accurate high-res rigid body (denoted “RF” in Table 1) appears unique among integrated training platforms. Also unique is TDW’s support for complex non-rigid physics, based on the NVIDIA FLeX engine (Fig. 1d). Taken together, TDW is substantially more full-featured for supporting future development in rapidly-expanding research areas such as learning scene dynamics for physical reasoning [51, 50] and model-predictive planning and control [3, 31, 14, 4, 8, 1, 38, 13, 34].

Audio. As with CG, advanced work in computer simulation has developed powerful methods for physics-based sound synthesis [21] based on object material and object-environment interactions. In general, however, such physics-based audio synthesis has not been integrated into real-time simulation platforms. HoME is the only other platform to provide audio output, generated by user-specified pre-placed sounds. TDW, on the other hand, implements a physics-based model to generate situational sounds from object-object interactions (Fig. 1h). TDW’s PyImpact Python library computes impact sounds via modal synthesis with mode properties sampled from distributions conditioned upon properties of the sounding object [42]. The mode distributions were measured from recordings of impacts. The stochastic sound generation prevents overfitting to specific audio sequences. In human perceptual experiments, listeners could not distinguish our synthetic impact sounds from real impact sounds, and could accurately judge physical properties from the synthetic audio [42]. For this reason, TDW is substantially more useful for multi-modal inference problems such as learning shape and material from sound [45, 52], audio-visual learning [18, 53, 16] and multi-modality embodied navigation [17, 9, 11].
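The general idea of modal synthesis with randomly sampled modes can be sketched as a sum of exponentially decaying sinusoids. This is not PyImpact's code or API: the function names are hypothetical, and the uniform ranges below are invented placeholders for the measured, material-conditioned distributions described in the text.

```python
import math
import random
from typing import List, Tuple

Mode = Tuple[float, float, float]  # (frequency in Hz, decay rate in 1/s, amplitude)


def sample_modes(rng: random.Random, n_modes: int = 5) -> List[Mode]:
    """Draw modes from made-up uniform ranges (stand-ins for measured distributions)."""
    return [(rng.uniform(200.0, 4000.0),   # frequency
             rng.uniform(5.0, 60.0),       # decay rate
             rng.uniform(0.1, 1.0))        # amplitude
            for _ in range(n_modes)]


def synthesize_impact(modes: List[Mode], duration: float = 0.5,
                      sample_rate: int = 44100) -> List[float]:
    """Sum of damped sinusoids: a * exp(-d * t) * sin(2 * pi * f * t) per mode."""
    n_samples = int(duration * sample_rate)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        samples.append(sum(a * math.exp(-d * t) * math.sin(2.0 * math.pi * f * t)
                           for f, d, a in modes))
    return samples


# Two impacts drawn from the same distributions have different waveforms.
sound_a = synthesize_impact(sample_modes(random.Random(0)))
sound_b = synthesize_impact(sample_modes(random.Random(1)))
```

Because every call samples new modes, no two impacts share a waveform even when the underlying distributions are identical, which is the property the text credits with preventing overfitting to specific audio sequences.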

Interaction and API. All the simulation platforms discussed so far require some form of API to control an agent, receive state-of-the-world data or interact with scene objects. However, not all support interaction with objects within that environment. Habitat focuses on navigation within indoor scenes, and its Python API is comparable to TDW’s but lacks capabilities for interaction with scene objects via physics (Fig. 1e), or multi-modal sound and visual rendering (Fig. 1h). VirtualHome, iGibson and AI2-THOR’s interaction capabilities are closer to TDW’s. In VirtualHome and AI2-THOR, interactions with objects are explicitly animated, not controlled by physics. TDW’s API, with its multiple paradigms for true physics-based interaction with scene objects, provides a set of tools that enable the broadest range of use cases of any available simulation platform.

Appendix E System overview and API

E.1 Core components

  • The build is the 3D environment application. It is available as a compiled executable.

  • The controller is an external Python script created by the user, which communicates with the build.

  • The S3 server is a remote server. It contains the binary files of each model, material, etc. that can be added to the build at runtime.

  • The records databases are a set of local .json metadata files with records corresponding to each asset bundle.

  • A librarian is a Python wrapper class to easily query metadata in a records database file.

E.2 The simulation pattern

  • The controller communicates with the build by sending a list of commands.

  • The build receives the list of serialized Commands, deserializes them, and executes them.

  • The build advances 1 physics frame (simulation step).

Output data is always sent as a list, with the last element of the list being the frame number:

[data, data, data, frame]
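The pattern above can be mimicked with a small mock. It is purely illustrative and does not use the real TDW build, which exchanges serialized byte arrays. `MockBuild`, the free function `communicate`, and the `send_object_ids` command are hypothetical; only the overall shape (a serialized command list in, a list ending in the frame number out) follows the text.

```python
import json
from typing import Any, Dict, List


class MockBuild:
    """Stand-in for the build: executes commands, then advances one frame."""

    def __init__(self) -> None:
        self.frame = 0
        self.objects: Dict[int, Dict[str, Any]] = {}

    def receive(self, message: str) -> List[Any]:
        commands = json.loads(message)          # deserialize the command list
        output: List[Any] = []
        for command in commands:                # execute the commands in order
            if command["$type"] == "add_object":
                self.objects[command["id"]] = command["position"]
            elif command["$type"] == "send_object_ids":  # hypothetical command
                output.append(sorted(self.objects))
        self.frame += 1                         # advance one physics frame
        return output + [self.frame]            # the frame number is always last


def communicate(build: MockBuild, commands: List[Dict[str, Any]]) -> List[Any]:
    """Serialize a list of commands, send it, and return the response list."""
    return build.receive(json.dumps(commands))


build = MockBuild()
first = communicate(build, [{"$type": "add_object", "id": 0,
                             "position": {"x": 0, "y": 0, "z": 0}}])
second = communicate(build, [{"$type": "send_object_ids"}])
```

In the mock, the first response contains only the frame number because no output was requested, and the second contains one data element followed by the frame number.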

E.3 The controller

All controllers are sub-classes of the Controller class. Controllers send and receive data via the communicate function:

from tdw.controller import Controller

c = Controller()

# resp will be a list with one element: [frame]
resp = c.communicate({"$type": "load_scene", "scene_name": "ProcGenScene"})

Commands can be sent in lists of arbitrary length, allowing for arbitrarily complex instructions per frame. The user must explicitly request any other output data:

from tdw.controller import Controller
from tdw.tdw_utils import TDWUtils
from tdw.librarian import ModelLibrarian
from tdw.output_data import OutputData, Bounds, Images

lib = ModelLibrarian("models_full.json")
# Get the record for the table.
table_record = lib.get_record("small_table_green_marble")

c = Controller()

table_id = 0

# 1. Load the scene.
# 2. Create an empty room (using a wrapper function)
# 3. Add the table.
# 4. Request Bounds data.
resp = c.communicate([{"$type": "load_scene",
                       "scene_name": "ProcGenScene"},
                      TDWUtils.create_empty_room(12, 12),
                      {"$type": "add_object",
                       "url": table_record.get_url(),
                       "scale_factor": table_record.scale_factor,
                       "position": {"x": 0, "y": 0, "z": 0},
                       "rotation": {"x": 0, "y": 0, "z": 0},
                       "category": table_record.wcategory,
                       "id": table_id},
                      {"$type": "send_bounds",
                       "frequency": "once"}])

The resp object is a list of byte arrays that can be deserialized into output data:

# Get the top of the table.
top_y = 0
for r in resp[:-1]:
    r_id = OutputData.get_data_type_id(r)
    # Find the bounds data.
    if r_id == "boun":
        b = Bounds(r)
        # Find the table in the bounds data.
        for i in range(b.get_num()):
            if b.get_id(i) == table_id:
                top_y = b.get_top(i)

The variable top_y can be used to place an object on the table:

box_record = lib.get_record("iron_box")
box_id = 1
c.communicate({"$type": "add_object",
               "url": box_record.get_url(),
               "scale_factor": box_record.scale_factor,
               "position": {"x": 0, "y": top_y, "z": 0},
               "rotation": {"x": 0, "y": 0, "z": 0},
               "category": box_record.wcategory,
               "id": box_id})

Then, an “avatar” can be added to the scene. In this case, the avatar is just a camera. The avatar can then send an image:

avatar_id = "a"
resp = c.communicate([{"$type": "create_avatar",
                       "type": "A_Img_Caps_Kinematic",
                       "avatar_id": avatar_id},
                      {"$type": "teleport_avatar_to",
                       "avatar_id": avatar_id,
                       "position": {"x": 1, "y": 2.5, "z": 2}},
                      {"$type": "look_at",
                       "avatar_id": avatar_id,
                       "object_id": box_id},
                      {"$type": "set_pass_masks",
                       "avatar_id": avatar_id,
                       "pass_masks": ["_img"]},
                      {"$type": "send_images",
                       "frequency": "once",
                       "avatar_id": avatar_id}])

# Get the image.
for r in resp[:-1]:
    r_id = OutputData.get_data_type_id(r)
    # Find the image data.
    if r_id == "imag":
        img = Images(r)

This image is a numpy array that can be either saved to disk or fed directly into an ML system. Put together, the example code will create this image:

[Figure: image rendered by the example code]

E.4 Benchmarks

CPU: Intel i7-7700K @ 4.2GHz
GPU: NVIDIA GeForce GTX 1080

Benchmark     Quality   Size      FPS
Object data   N/A       N/A       850
Images        low       256x256   380
Images        high      256x256   168

E.5 Command API Backend

E.5.1 Implementation Overview

Every command in the Command API is a subclass of Command.

/// <summary>
/// Abstract class for a message sent from the controller to the build.
/// </summary>
public abstract class Command
{
    /// <summary>
    /// True if command is done.
    /// </summary>
    protected bool isDone = false;

    /// <summary>
    /// Do the action.
    /// </summary>
    public abstract void Do();

    /// <summary>
    /// Returns true if this command is done.
    /// </summary>
    public bool IsDone()
    {
        return isDone;
    }
}

Every command must override Command.Do(). Because some commands require multiple frames to finish, they announce that they are “done” via Command.IsDone().

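/// <summary>
/// This is an example command.
/// </summary>
public class ExampleCommand : Command
{
    /// <summary>
    /// This integer will be output to the console.
    /// </summary>
    public int integer;

    public override void Do()
    {
        // Output the integer to the console (Unity's Debug.Log is assumed here).
        Debug.Log(integer);
        isDone = true;
    }
}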

Commands are automatically serialized and deserialized as JSON dictionaries. In a user-made controller script, ExampleCommand looks like this:

{"$type": "example_command", "integer": 15}

If the user sends that JSON object from the controller, the build will deserialize it to an ExampleCommand-type object and call ExampleCommand.Do(), which will output 15 to the console.
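The round trip from a "$type" field to a typed command object can be sketched in Python. The actual backend is C# and relies on its JSON converter; the registry, the snake_case naming rule (inferred from the single `example_command` / `ExampleCommand` pair in the text), and the `deserialize` helper below are assumptions made for illustration only.

```python
import json
import re
from abc import ABC, abstractmethod
from typing import Dict, Type


class Command(ABC):
    """Base class; subclasses register themselves under a snake_case name."""

    registry: Dict[str, Type["Command"]] = {}

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # ExampleCommand -> example_command (assumed naming rule).
        name = re.sub(r"(?<!^)(?=[A-Z])", "_", cls.__name__).lower()
        Command.registry[name] = cls

    def __init__(self) -> None:
        self.is_done = False

    @abstractmethod
    def do(self) -> None:
        """Do the action."""


class ExampleCommand(Command):
    def __init__(self, integer: int = 0) -> None:
        super().__init__()
        self.integer = integer

    def do(self) -> None:
        print(self.integer)
        self.is_done = True


def deserialize(message: str) -> Command:
    """Build the command object named by the "$type" field of a JSON message."""
    fields = json.loads(message)
    command_class = Command.registry[fields.pop("$type")]
    return command_class(**fields)


command = deserialize('{"$type": "example_command", "integer": 15}')
command.do()  # prints 15
```

Registering classes by name keeps the dispatch table in step with the class hierarchy, which loosely mirrors how a new command type can be added without touching the rest of the API.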

E.5.2 Type Inheritance

The Command API relies heavily on type inheritance, which is handled automatically by the JSON converter. Accordingly, new commands can easily be created without affecting the rest of the API, and bugs affecting multiple commands are easy to identify and fix.

/// <summary>
/// Manipulate an object that is already in the scene.
/// </summary>
public abstract class ObjectCommand : Command
{
    /// <summary>
    /// The unique object ID.
    /// </summary>
    public int id;

    public override void Do()
    {
        // Look up the cached object by ID and apply the command to it.
        DoObject(GetObject());
        isDone = true;
    }

    /// <summary>
    /// Apply command to the object.
    /// </summary>
    /// <param name="co">The model associated with the ID.</param>
    protected abstract void DoObject(CachedObject co);

    /// <summary>
    /// Returns a cached model, given the ID.
    /// </summary>
    protected CachedObject GetObject()
    {
        // Additional code here.
    }
}

/// <summary>
/// Set the object’s rotation such that its forward directional vector points
/// towards another position.
/// </summary>
public class ObjectLookAtPosition : ObjectCommand
{
    /// <summary>
    /// The target position that the object will look at.
    /// </summary>
    public Vector3 position;

    protected override void DoObject(CachedObject co)
    {
        // Additional code here.
    }
}

The TDW backend includes a suite of auto-documentation scripts that scrape the <summary> comments to generate a markdown API page complete with example JSON per command, like this: