SAPIEN: A SimulAted Part-based Interactive ENvironment

by Fanbo Xiang, et al.

Building home assistant robots has long been a pursuit for vision and robotics researchers. To achieve this task, a simulated environment with physically realistic simulation, sufficient articulated objects, and transferability to the real robot is indispensable. Existing environments achieve these requirements for robotics simulation with different levels of simplification and focus. We take one step further in constructing an environment that supports household tasks for training robot learning algorithms. Our work, SAPIEN, is a realistic and physics-rich simulated environment that hosts a large-scale set of articulated objects. SAPIEN enables various robotic vision and interaction tasks that require detailed part-level understanding. We evaluate state-of-the-art vision algorithms for part detection and motion attribute recognition, and demonstrate robotic interaction tasks using heuristic approaches and reinforcement learning algorithms. We hope that SAPIEN can open many research directions yet to be explored, including learning cognition through interaction, part motion discovery, and construction of robotics-ready simulated game environments.




1 Introduction

Figure 1: Robot-object Interaction in SAPIEN. We show the ray-traced scene (top) and robot camera views (bottom): RGB image, surface normals, depth and semantic segmentation of motion parts, while a robot is learning to operate a dishwasher.
Environment | Level | Physics | Rendering | Tasks | Interface
Habitat [43]* | Scene | Static+ | Real Photo | Navigation, Vision | Python, C++
AI2-THOR [25]* | Scene-Object | Dynamic | Unity | Navigation+, Vision | Python, Unity
OpenAI Gym MuJoCo [2] | Scene-Object | Dynamic | OpenGL (fixed) | Learning, Robotics | Python
RLBench [22] | Scene-Object | Dynamic | V-REP, PyRep | Learning, Vision, Robotics | Python, V-REP
SAPIEN | Scene-Object-Part | Dynamic | Customizable | Learning, Vision, Robotics | Python, C++
Table 1: Comparison to other Simulation Environments. Habitat [43] is a representative for navigation environments, which include Gibson [56, 55], Minos [42]; they primarily use static physics but are starting to add interactions very recently. AI2-THOR [25] is a representative for game-like interactive environments; these environments usually support navigation with limited object interactions. OpenAI Gym [2] and RLBench [22] provide interactive environments, but the use of commercial software limits their customizability.

To achieve human-level perception and interaction with the 3D world, home-assistant robots must have the capability to use perception to interact with 3D objects [11, 59]. For a robot to help put away groceries, it must be able to open the refrigerator by locating the door handle, pulling the door and fetching the target objects.

One direct way to address the problem is to train robots by interacting with the real environment [29, 4, 26]. However, training robots in the real world can be very time consuming, costly, unstable, and potentially unsafe. Moreover, a slight perturbation in hardware or environment setup can result in different outcomes in the real world, thus inhibiting reproducible research. Researchers, therefore, have long been pursuing simulated environments for tasks such as navigation [42, 56, 43, 1, 3, 54, 14] and control [24, 41, 49, 10].

Constructing simulated environments for robot learning with transferability to the real world is a non-trivial task. It faces challenges from four major aspects: 1) The environment needs to reproduce real-world physics to some level. As it is still infeasible to simulate real-world physics exactly, any physical simulator needs to decide the level of detail and accuracy it operates on. Some approximate physics by simulating rigid bodies and joints [35, 49, 10]; some handle soft deformable objects [49, 10]; and others simulate fluids [49, 44]. 2) The environment should incorporate the simulation of real robots, being able to reproduce the behaviors of real robotic manipulators, sensors and controllers [34]. Only then can it enable seamless transfer to the real world after training. 3) The environment needs to produce physically accurate renderings to mitigate the visual domain gap. 4) Most importantly, the environment requires sufficient content, scenes and objects for the robot to interact with, since data diversity is always critical for training and evaluating learning-based algorithms. The content also determines how far the preceding challenges must be addressed: data with soft objects such as cloth requires deformable body simulation; translucent objects require special rendering techniques; and a specific robot requires a specific interface.

Existing environments achieve these requirements for robotics simulation with different levels of simplification and focus. For example, OpenAI Gym [2] provides an interactive and easy-to-use interface; Gibson [56] and AI Habitat [43] use photorealistic rendering for semantic navigation tasks. A more detailed discussion of popular environment features can be found in Sec 2. These environments can support the benchmarking and training of downstream tasks such as navigation, low-level control, and grasping. However, from the perspective of tasks, there is still a lack of environments that target the manipulation of daily objects, a basic skill of household robots. In a household environment, a great portion of daily objects are articulated and require manipulation: bottles with caps, ovens with doors, electronics with switches and buttons. Notably, RLBench [22] (unpublished) provides well-defined robotics tasks and a realistic controller interface with detailed manipulation demonstrations, but it lacks diversity in its simulated scenarios.

We take one step further in constructing an environment that supports the manipulation of diverse articulated objects. Our system, SAPIEN, is a realistic and physics-rich simulated environment that hosts a large set of articulated objects. At the core of SAPIEN are three main components: 1) SAPIEN Engine, an interaction-rich and physics-realistic simulation environment integrating the PhysX physical engine and a ROS control interface; this engine supports accurate simulation of rigid bodies and joint constraints for simulating articulated objects. 2) SAPIEN Asset, including the PartNet-Mobility dataset, which contains 14K movable parts over 2,346 3D articulated models from 46 common indoor object categories, richly annotated with kinematic part motions and dynamic interactive attributes. 3) SAPIEN Renderer, with both a fast-frame-rate OpenGL rasterizer and more photorealistic ray-tracing options. We demonstrate that SAPIEN enables a large variety of robotic perception and interaction tasks by benchmarking state-of-the-art vision algorithms for part detection and motion attribute recognition. We also show a variety of robotic interaction tasks that SAPIEN supports by demonstrating heuristic approaches and reinforcement learning algorithms.

2 Related Work

Simulation Environments.

In recent years, there has been a proliferation of indoor simulation environments primarily designed for navigation, visual recognition and reasoning [42, 54, 56, 43, 1]. Static environments, based on synthetic scenes [54] or real-world RGB-D scans [1] and reconstructions [42, 56, 43], are able to provide images that closely resemble reality, minimizing the domain gap in the visual aspect. However, they usually offer very limited or no object interactions, failing to capture the dynamic and interactive nature of the real world.

In order to add more interactive features to the environment, researchers leverage partial functionalities of game engines or physics engines to provide photorealistic rendering together with interactions [39, 33, 25, 37, 3, 57, 13]. When agents interact with objects in these environments, it is via high-level state changes triggered by explicit commands (e.g. “open refrigerator”) or proximity (e.g. the refrigerator door opens when the robot or robot arm is next to the trigger region). In addition, the underlying physics is often over-simplified, such as direct exertion of force and torque. While they enable research on high-level object interactions, they cannot close the gap between high-level instructions and low-level dynamics because, by design, they do not include accurate simulation of articulated robots and objects. This limits the use of such simulators for learning detailed low-level robot-object interactions.

Finally, there are environments that integrate full-featured physics engines. These environments are favored in continuous control and reinforcement learning tasks. OpenAI Gym [2], RLLAB [12], DeepMind Control Suite [48] and DoorGym [50] integrate the MuJoCo physical engine to provide RL environments. Arena [46], a platform that supports multi-agent environments, is built on top of Unity [23]. PyBullet [10], a real-time physics engine with a Python interface, powers a series of projects focusing on robotics tasks [60, 27]. Gazebo [24], a high-level visualization and modeling package, is widely used in the robotics community [31, 20]. Recently, RLBench [22], a benchmark and physical environment for robot learning, uses V-REP [41] as the backend to provide diverse tasks for robot manipulation. Our environment, SAPIEN engine, is directly based on the open-source Nvidia PhysX API [35], which has comparable performance and interface with PyBullet, avoiding the unnecessary complication introduced by game engine infrastructures, or any barriers from commercial software such as MuJoCo and V-REP. Table 1 provides a brief summary of several representative environments.

One bottleneck of these robotic simulators is their limited rendering capability, which causes a gap between simulation and the real world. Another constraint of many of these environments, including RLBench [22] and DoorGym [50], is that they are very task-centric, designed to work for only a few predefined tasks. Our SAPIEN simulator, equipped with 2,346 3D interactive models from 46 object categories and flexible rendering pipelines, provides robot agents a virtual environment for learning a large set of complex, diverse and customizable robotic interaction tasks.

Dataset | #Categories | #Models | #Motion Parts
Shape2Motion [52] | 45 | 2,440 | 6,762
RPM-Net [58] | 43 | 949 | 1,420
Hu et al. [18] | - | 368 | 368
RBO* [30] | 14 | 14 | 21
Ours | 46 | 2,346 | 14,068
Table 2: Comparison of Articulated Part Datasets. *RBO is collected in real-world with long video sequences.

Simulation Content.

Navigation environments typically use datasets providing real-world RGB-D scans [56, 6, 47], and/or high-quality synthetic scenes [45]. Simulation environments that leverage game engines [39, 33, 3, 13, 25] come with manually designed or procedurally generated game scenes. For environments with detailed physics and reinforcement learning support [2, 48, 12], they usually support very few scenarios with simple objects and robot agents. Notably, RLBench [22] provides a relatively large robot learning dataset with varied tasks. To address the lack-of-content problem, our work provides a large-scale simulation-ready dataset, PartNet-Mobility dataset, that is constructed from 3D model datasets including PartNet [32] and ShapeNet [7].

There are also shape part datasets with part articulation annotations. Table 2 summarizes recent part mobility datasets. The RBO dataset [30] is a collection of 358 RGB-D video sequences of humans manipulating 14 objects, which are reconstructed as articulated 3D meshes. The meshes have detailed part motion parameters and realistic textures. Other datasets annotate 3D synthetic CAD models with articulation information. Hu et al. [18] introduced a dataset of 368 mobility part articulations with diverse types. RPM-Net [58] provides another dataset with 969 objects and 1,420 mobility units. Shape2Motion [52] provides a dataset of 2,440 objects and 6,762 movable parts for mobility analysis, but it does not provide RGB textures or motion limits, which hinders physical simulation. Compared to these datasets, our dataset contains a comparable number of objects (2,346), but with many more movable part annotations (14,068). In addition, our models have textures and motion range limits, which are crucial for the dataset to be simulatable.

3 SAPIEN Simulation Environment

Figure 2: SAPIEN Simulator Overview. The left box shows SAPIEN Renderer, which takes custom shaders and scene information to produce images such as RGB-D and segmentation. The middle box shows SAPIEN Engine, which integrates PhysX simulator and ROS control interface that enables various robot actions and utilities. The right box shows SAPIEN Asset, which contains the large-scale PartNet-Mobility dataset that provides simulatable models with part-level mobility.

SAPIEN aims to integrate state-of-the-art physical simulators, modern graphics rendering engines, and user-friendly robotic interfaces into a unified framework (Figure 2), to support a diverse set of robotic perception and interaction tasks. We develop the environment in C++ for efficiency and provide a Python wrapper API for ease of use at the user end. Below we introduce the three main components in detail: SAPIEN Engine, SAPIEN Asset and SAPIEN Renderer.

3.1 SAPIEN Engine

We use the open-source Nvidia PhysX physical engine to provide detailed robot-object interaction simulation. The system provides Robot Operating System (ROS) support that is easy to use for downstream robotic research. We provide both synchronous and asynchronous modes of simulation to support reinforcement learning training and robotics tasks.

Physical Simulation.

We choose PhysX 4.1 [35] to provide rigid body kinematics and dynamics simulation, since it is open-source, simple, and provides functionalities designed for robotics. To simulate articulated bodies, we provide 3 different body-joint systems: kinematic joint system, dynamic joint system, and PhysX articulation. The kinematic joint system provides kinematic objects with parent-child relations, suitable for simulating very heavy objects that are not affected by small forces. The dynamic joint system uses PhysX joints to drive rigid bodies towards constraints, suitable for simulating complicated objects that do not require accurate control. PhysX articulation is a system specifically designed for robot simulation. It natively supports accurate force control, P-D control and inverse dynamics at the cost of relatively low speed.
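As a rough illustration of the P-D control mentioned above, the following sketch drives a single hinge joint toward a target angle with a proportional-derivative torque law. It is a self-contained toy integrator, not PhysX or the SAPIEN API; the gains, inertia, time step, and the function name `simulate_pd_hinge` are all assumptions chosen for illustration.

```python
import math


def simulate_pd_hinge(target, kp=50.0, kd=10.0, inertia=1.0, dt=0.001, steps=5000):
    """Drive a 1-DoF hinge joint to `target` (radians) with a PD torque law.

    All parameter values are illustrative assumptions, not SAPIEN defaults.
    """
    angle, velocity = 0.0, 0.0
    for _ in range(steps):
        # Proportional term pulls toward the target, derivative term damps motion.
        torque = kp * (target - angle) - kd * velocity
        # Semi-implicit Euler: update velocity first, then position.
        velocity += (torque / inertia) * dt
        angle += velocity * dt
    return angle, velocity


final_angle, final_velocity = simulate_pd_hinge(target=math.pi / 2)
print(f"angle = {final_angle:.4f} rad, velocity = {final_velocity:.6f} rad/s")
```

With these gains the joint behaves like a damped spring, so it settles near the target within the simulated five seconds.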

ROS Interfaces.

Robot Operating System (ROS) [40] is a generic and widely-used framework for building robot applications. Our ROS interface bridges the gap between ROS and physical simulator, as well as provides a set of high-level APIs for interacting with robots in the physics world. It supports three levels of abstractions: direct force control, ROS controllers and motion planning interface.

In the lowest-level control, forces and torques are directly applied on joints, similar to OpenAI Gym [2]. This control method is simple and intuitive, but rather difficult to transfer to real environments, since real-world dynamics are quite different from the simulated ones, and the continuous nature of real robots is fundamentally different from the discretized approximation in simulations. For high-level control, we provide joint space and Cartesian coordinate space control APIs. We build various controllers (Figure 2) based upon [8] and implement a standard interface. A typical use case is to move the robot arm to a desired 6-DoF pose with specific path constraints. Thus, at the highest level, we provide motion planning support based on the popular MoveIt framework [9], which can generate motion plans that effectively move the robot around without collision.

Synchronous and Asynchronous Modes.

Our SAPIEN Engine (see Figure 2 middle) can support both synchronous and asynchronous simulation modes. In synchronous mode, the simulation step is controlled by the client, which is common in training reinforcement learning algorithms [2]. For example, the agent receives observations from simulated environments and uses a customized policy model, often a neural network, to generate the corresponding action. Then the simulation runs forward for a step. In this synchronous mode, the simulation and client algorithms are integrated together.
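The synchronous pattern described above can be sketched as a loop in which the client alone decides when the simulation advances. The environment below is a hypothetical stand-in (a drawer on a slider joint), not the SAPIEN API; `ToyDrawerEnv`, `policy`, and all numeric values are assumptions made for this example.

```python
class ToyDrawerEnv:
    """Hypothetical stand-in for a simulator: a drawer on a slider joint."""

    def __init__(self, limit=0.3, dt=0.01):
        self.limit = limit      # joint limit in meters (assumed value)
        self.dt = dt            # simulated time per step (assumed value)
        self.position = 0.0

    def reset(self):
        self.position = 0.0
        return self.position

    def step(self, velocity_command):
        # The simulation advances only when the client calls step().
        moved = self.position + velocity_command * self.dt
        self.position = min(self.limit, max(0.0, moved))
        done = self.position >= self.limit
        return self.position, done


def policy(observation, limit=0.3):
    """Toy policy: pull faster when the drawer is far from its limit."""
    return 2.0 * (limit - observation) + 0.05


def run_episode(env, max_steps=1000):
    observation = env.reset()
    for step in range(1, max_steps + 1):
        action = policy(observation)            # client computes the action
        observation, done = env.step(action)    # then the simulation steps once
        if done:
            return step, observation
    return max_steps, observation


steps_taken, final_position = run_episode(ToyDrawerEnv())
print(steps_taken, final_position)
```

In a learning setup, the hand-written `policy` would be replaced by the trained model, while the observe-act-step structure stays the same.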

However, for real-world robotics, the simulation and client response need to be asynchronous [24] and separated. The simulation should run independently, like the real world, while the client uses the same API as a real robot to interact with the simulation backend. To build such a framework, we create multiple sensors and controllers following the ROS API. After simulation starts, the client receives information from sensors and uses the controller interface (see Figure 2) to command robots via TCP/IP communication. The timestamp is synchronized from simulation to the client side, acting as a proxy for the real-world clock time. Under the framework, the simulated robots can use the same code as their real counterparts because most real robot controllers and sensors have exactly the same interface as our simulator API. This provides one important advantage: it enables robot researchers to migrate between simulated robots and real robots without any extra setup.

Figure 3: SAPIEN Enables Many Robotic Interaction Tasks. From left to right, we show five examples: faucet manipulation, object fetching, object lifting, chair folding, and object placing.
Category | All | Bottle | Box | Bucket | Cabinet | Camera | Cart | Chair | Clock | Coffee | DishWsh. | Dispenser | Door | Eyegls | Fan | Faucet
#M | 2,346 | 57 | 28 | 36 | 345 | 37 | 61 | 80 | 31 | 55 | 48 | 57 | 36 | 65 | 81 | 84
#P | 14,068 | 114 | 94 | 74 | 1,174 | 341 | 232 | 1,235 | 106 | 374 | 112 | 162 | 103 | 195 | 172 | 228

Category | Chair | Fridge | Globe | Kettle | Keybrd | Knife | Lamp | Laptop | Lighter | MicWav | Monitor | Mouse | Oven | Pen | Phone | Pliers
#M | 26 | 44 | 61 | 29 | 37 | 44 | 45 | 56 | 28 | 16 | 37 | 14 | 30 | 48 | 17 | 25
#P | 58 | 118 | 130 | 66 | 3,593 | 149 | 165 | 112 | 86 | 85 | 93 | 61 | 214 | 97 | 271 | 59

Category | Pot | Printer | Remote | Safe | Scissors | Stapler | Stcase | Switch | Table | Toaster | Toilet | TrashCan | USB | Washer | Window
#M | 25 | 29 | 49 | 30 | 47 | 23 | 24 | 70 | 101 | 25 | 69 | 70 | 51 | 17 | 58
#P | 53 | 376 | 1,490 | 202 | 94 | 69 | 101 | 195 | 420 | 116 | 229 | 208 | 103 | 144 | 195
Table 3: Statistics of PartNet-Mobility Dataset. #M and #P show the number of models and movable parts respectively.

3.2 SAPIEN Asset

SAPIEN Asset is our simulation content, shown in the right box in Figure 2. It contains the large-scale ready-to-simulate PartNet-Mobility dataset, the simulated robot models and scene layouts.

PartNet-Mobility Dataset.

We propose a large-scale 3D interactive model dataset that contains over 14K articulated parts across 2,346 object models from 46 common indoor object categories. All models are collected from 3D Warehouse and organized as in ShapeNet [7] and PartNet [32]. We annotate 3 types of motions: hinge, slider, and screw, where hinge indicates rotation around an axis (e.g. doors), slider indicates translation along an axis (e.g. drawers), and screw indicates a combined hinge and slider (e.g. bottle caps, swivel chairs). For hinge and slider joints, we annotate the motion limit (i.e., angles, lengths). For screw joints, we annotate the motion limits and whether the 2 degrees of freedom are coupled. Each joint has a parent and a child, and the collection of connected bodies and joints is called an articulation. We require the joints of an articulation to follow a tree structure with a single root, since most physical simulators handle tree-structured joint systems well. Next, for each movable part, we assign a category-specific semantic label. Table 3 summarizes the dataset statistics. Please see the supplementary for more details about the data annotation pipeline.
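The tree-structure requirement on articulations can be checked mechanically. The sketch below represents an articulation as a list of (parent, child, joint type) triples and verifies that every link has at most one parent, that exactly one root exists, and that every link is reachable from it. The data layout and the function name `find_articulation_root` are assumptions for illustration; only the three joint types come from the text.

```python
from collections import defaultdict, deque

JOINT_TYPES = {"hinge", "slider", "screw"}


def find_articulation_root(joints):
    """Return the root link if `joints` form a single-rooted tree, else raise.

    `joints` is a list of (parent_link, child_link, joint_type) triples,
    a hypothetical layout chosen for this example.
    """
    parent_of = {}
    children_of = defaultdict(list)
    links = set()
    for parent, child, joint_type in joints:
        if joint_type not in JOINT_TYPES:
            raise ValueError(f"unknown joint type {joint_type!r}")
        if child in parent_of:
            raise ValueError(f"link {child!r} has more than one parent")
        parent_of[child] = parent
        children_of[parent].append(child)
        links.update((parent, child))

    roots = [link for link in links if link not in parent_of]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")

    # Breadth-first traversal: every link must be reachable from the root,
    # which rules out detached cycles.
    seen = {roots[0]}
    queue = deque(roots)
    while queue:
        link = queue.popleft()
        for child in children_of[link]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    if seen != links:
        raise ValueError("some links are not connected to the root")
    return roots[0]


cabinet = [
    ("body", "door", "hinge"),
    ("body", "drawer", "slider"),
    ("door", "handle", "hinge"),
]
print(find_articulation_root(cabinet))
```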

SAPIEN Asset Loader

Unified Robot Description Format (URDF) is a common format for representing a physical model. For each object in the SAPIEN Asset, including PartNet-Mobility models and robot models, we provide an associated URDF file, which can be loaded in simulation. For accurate simulation of contact, we decompose meshes into convex parts [28, 19]. We randomize or manually set the physical properties, e.g. friction, damping, density, to appropriate ranges. For robot models, we also provide C++/Python APIs to create a robot piece by piece to avoid complications introduced by URDF.
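To make the URDF representation concrete, here is a minimal sketch that parses a small hand-written URDF with Python's standard XML library and lists each joint with its parent, child, and motion limits. The toy cabinet, its link and joint names, and the limit values are invented for this example and are not taken from PartNet-Mobility; mapping hinge to `revolute` and slider to `prismatic` is likewise an assumption about how such annotations would be expressed.

```python
import xml.etree.ElementTree as ET

# A hand-written toy model; names and numbers are illustrative only.
URDF_TEXT = """<robot name="toy_cabinet">
  <link name="body"/>
  <link name="door"/>
  <link name="drawer"/>
  <joint name="door_hinge" type="revolute">
    <parent link="body"/>
    <child link="door"/>
    <axis xyz="0 0 1"/>
    <limit lower="0.0" upper="1.57" effort="10.0" velocity="1.0"/>
  </joint>
  <joint name="drawer_slider" type="prismatic">
    <parent link="body"/>
    <child link="drawer"/>
    <axis xyz="1 0 0"/>
    <limit lower="0.0" upper="0.3" effort="10.0" velocity="1.0"/>
  </joint>
</robot>
"""


def list_joints(urdf_text):
    """Extract joint name, type, parent, child, axis and limits from URDF text."""
    root = ET.fromstring(urdf_text)
    joints = []
    for joint in root.findall("joint"):
        limit = joint.find("limit")
        joints.append({
            "name": joint.get("name"),
            "type": joint.get("type"),
            "parent": joint.find("parent").get("link"),
            "child": joint.find("child").get("link"),
            "axis": tuple(float(v) for v in joint.find("axis").get("xyz").split()),
            "limits": (float(limit.get("lower")), float(limit.get("upper"))),
        })
    return joints


for joint in list_joints(URDF_TEXT):
    print(joint["name"], joint["type"], joint["parent"], "->", joint["child"], joint["limits"])
```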

3.3 SAPIEN Renderer

SAPIEN Renderer, shown in the left box of Figure 2, renders simulated scenes with OpenGL 4.5 and GLSL shaders, which are exposed to the client application for maximal customizability. By default, the rendering module uses a deferred lighting pipeline to provide RGB, albedo, normal, depth, and segmentation from camera space, where lighting is computed with Oren–Nayar diffuse model [53] and GGX specular model [51]. Our customizable rendering interface can suit special rendering needs, and even allow completely different rendering pipelines. We demonstrate this by replacing the fast OpenGL framework with our ray tracer coded with Nvidia OptiX [36] to produce physically accurate images at the cost of rendering time (see Figure 1).

3.4 Profiling Analysis

Our SAPIEN engine can run at about 5000Hz on the manipulation task we will describe in Sec. 4.2 and can render at about 700Hz with OpenGL mode. Tests were performed on a laptop with Ubuntu 18.04, on 2.2 GHz Intel i7-8750 CPU and an Nvidia GeForce RTX 2070 GPU.

4 Tasks and Benchmarks

We showcase the versatility of our simulator through robotic perception and interaction tasks.

4.1 Robotic Perception

SAPIEN simulator, equipped with the PartNet-Mobility dataset, provides a platform for several robotic perception tasks. In this paper, we study the tasks of movable part detection and part motion estimation, which are two important vision tasks supporting downstream robotic interaction.

Cabinet Table Faucet Fan All
Algorithm Inputs
body drawer
drawer body wheel door caster switch base spout rotor frame mAP
Mask R-CNN [16] | 2D (RGB) | 62.0 94.2 66.4 27.7 54.3 88.0 3.4 6.3 0.0 52.5 47.9 99.7 54.4 67.5 53.0
Mask R-CNN [16] | 2D (RGB-D) | 61.7 93.0 63.0 26.3 58.6 89.9 1.4 13.2 0.0 52.1 55.8 98.9 39.4 67.4 52.8
PartNet-InsSeg [32] | PC (XYZ) | 20.6 65.9 35.1 9.8 15.7 71.3 1.7 1.0 0.0 34.4 55.9 64.2 50.9 74.8 36.1
PartNet-InsSeg [32] | PC (XYZRGB) | 17.4 64.3 23.6 5.0 16.4 81.8 1.3 2.0 1.0 29.9 64.1 78.0 42.0 63.5 37.1
Table 4: Movable Part Detection Results (AP% with IoU threshold 0.5). 2D and PC denote 2D images and point clouds as different input modalities for the two algorithms. We show the detailed results for four object categories and summarize the mAP over all categories. See supplementary for the full table.

Movable Part Detection

Before interacting with objects by parts, robotic agents need to first detect the parts of interest. Therefore, we define the task of movable part detection as follows. Given a single 2D image snapshot or 3D RGB-D partial scan of an object as input, an algorithm should produce several disjoint part masks associated with their semantic labels, each of which corresponds to an individual movable part of the object.
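Detection quality for a predicted part is typically judged by its overlap with a ground-truth part. As a minimal sketch, the function below computes the intersection-over-union (IoU) of two binary masks stored as nested lists; a prediction would count as a match when its IoU reaches a threshold such as the 0.5 used in Table 4. Whether the benchmark measures IoU on masks or on bounding boxes is not stated here, so the mask-based form is an assumption.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two same-sized binary masks (nested lists)."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += bool(a and b)
            union += bool(a or b)
    # Two empty masks have no overlap to measure; report 0.0 by convention.
    return intersection / union if union else 0.0


predicted = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
ground_truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
iou = mask_iou(predicted, ground_truth)
is_match = iou >= 0.5
print(iou, is_match)
```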

Figure 4: Movable Part Detection Results. The left column shows the results of Mask R-CNN [16], where each bounding box indicates a detected movable part. The middle and the right columns show the results of PartNet InsSeg [32] and the ground truth point clouds respectively, where different parts are in different color.

Leveraging the rich assets from the PartNet-Mobility dataset and the SAPIEN rendering pipeline, we evaluate two state-of-the-art perception algorithms for object or part detection in literature. Mask R-CNN [16] takes a 2D image as input and uses a region proposal network to detect a set of 2D part mask candidates. PartNet-InsSeg [32] is a 3D shape part instance segmentation approach that uses PointNet++ [38] to extract geometric features and proposes panoptic segmentation over shape point clouds.

We render each object in the PartNet-Mobility dataset into RGB and RGB-D images from 20 randomly sampled views, with resolution . The camera positions are randomly sampled over the upper hemisphere to ensure space coverage. Simple ambient and directional lighting without shadows are provided for RGB rendering. With known camera intrinsics, we lift the 2.5D RGB-D images into 3D partial scans for PartNet-InsSeg experiments. We use all 2,346 objects over 46 categories from the PartNet-Mobility dataset for this task. We use 75% of data (1,772 shapes) for training and 25% (574 shapes) for testing. For quantitative evaluation, we report per-part-category Average Precision (AP) scores as commonly used for object detection tasks and average across all part categories to compute mAP for each algorithm.
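Lifting a depth image into a 3D partial scan with known intrinsics follows the standard pinhole back-projection. The sketch below applies it to a tiny depth map in pure Python. The axis convention (x right, y down, z forward), the use of zero to mark missing depth, and the parameter names `fx`, `fy`, `cx`, `cy` are assumptions for illustration, not details taken from the SAPIEN pipeline.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (nested lists, one row per image row) to 3D.

    Uses the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    Pixels with non-positive depth are treated as missing and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# A 2x2 depth image with two valid pixels (values are illustrative).
depth_image = [
    [0.0, 2.0],
    [1.0, 0.0],
]
cloud = depth_to_points(depth_image, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(cloud)
```

Per-pixel colors can be carried along in the same loop to obtain an XYZRGB point cloud.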

Table 4 shows the quantitative results of Mask R-CNN on RGB and RGB-D settings and PartNet-InsSeg on the XYZ (depth-only) and XYZRGB (RGB-D images) settings. We observe that both methods perform poorly on detecting small parts (e.g., table wheel and table caster), and the phenomenon is less severe for object categories whose parts have relatively balanced sizes (e.g., fan and faucet). Small movable parts (e.g., button, switch, and handle) often play critical roles in robot-object interaction, and will demand better-designed algorithms in the future. Figure 4 visualizes the Mask R-CNN and PartNet-InsSeg part detection results on two example RGB-D partial scans.

Setting | Algorithm | hinge acc. | slider acc. | hinge origin err. | hinge axis err. | slider axis err. | door err. | drawer err.
RGB-D | ResNet50 | 95.5% | 95.5% | 0.168 | 18.9 | 6.35 | 14.4 | 0.0645
RGB-pc | PointNet++ | 95.4% | 95.5% | 0.195 | 18.5 | 7.75 | 20.8 | 0.0918
Table 5: Motion recognition results. The two acc. columns denote classification accuracy for hinge and slider respectively. Hinge origin err. denotes the average distance from the predicted hinge origin to the ground-truth axis. Hinge and slider axis err. denote the average angle difference between the predicted axis and the ground truth. Door err. is the average angle difference between the predicted door pose and the ground truth. Drawer err. is the average length difference between the predicted drawer pose and the ground truth.

Motion Attributes Estimation

Estimating motion attributes for articulated parts gives strong priors for robots before interacting with objects. In this section, we perform the motion attributes estimation task that jointly predicts the motion type, motion axis, and part state for articulated parts.

We consider two types of rigid part motions: 3D rotation and translation. Some parts, such as bottle cap, may have both rotation and translation motions. For translation motions, we use a 3-dim vector to represent the direction. For rotation motions, we parameterize the outputs as two 3-dim vectors to specify rotation axis direction and a pivot point on the axis. We define relative positions of the articulated part with respect to its semantic rest positions as part states. For example, the rest position for drawers and doors is when they are closed. However, defining part rest states has intrinsic ambiguities. For example, round knobs with rotation symmetry do not present a detectable rest position. Thus, we use a subset of 640 models over 10 categories, which consists of 779 doors and 529 drawers for this task, following the same train and test splits used in the previous section.

We evaluate two baseline algorithms, ResNet-50 [17] and PointNet++ [38], that deal with the input RGB-D partial scans in either 2D or 3D format. For ResNet-50, we input RGB-D images augmented with the target part mask (5 channels in total). For PointNet++, we substitute the 5-channel image with its camera-space RGB point cloud. We train both networks to output a 14-dim motion vector whose components give the probabilities of the joint being rotational and translational, the pivot point and rotation axis for hinge joints, the direction of a proposed slider axis, and finally the regressed part poses for doors and drawers. The part pose is a normalized number indicating the current joint position. See supplementary for more details about network architectures, loss designs, and training protocols.

We summarize the experimental results in Table 5. The classification of different motion types achieves quite high accuracy, and the axis prediction for sliders (translational joints) achieves lower error than for hinges (rotational joints). In our experiments, ResNet50 achieves better performance than PointNet++. This could be explained by the much higher number of network parameters in ResNet. However, intuition suggests that such 3D information should be more easily predicted directly on 3D data. Future research should focus more on how to improve 3D axis prediction with 3D grounding.
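The two geometric errors reported for axes in Table 5 can be sketched directly from their definitions: the angle between a predicted and a ground-truth axis direction, and the distance from a predicted hinge origin to the ground-truth axis line. The helper names below are hypothetical, and the angle is computed without treating opposite directions as equivalent, which is an assumption since the exact evaluation convention is not given here.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _norm(a):
    return math.sqrt(_dot(a, a))


def axis_angle_error_deg(pred_dir, gt_dir):
    """Angle in degrees between two 3D axis directions."""
    cosine = _dot(pred_dir, gt_dir) / (_norm(pred_dir) * _norm(gt_dir))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def origin_to_axis_distance(pred_origin, gt_point, gt_dir):
    """Distance from a predicted pivot point to the ground-truth axis line."""
    offset = tuple(p - q for p, q in zip(pred_origin, gt_point))
    return _norm(_cross(offset, gt_dir)) / _norm(gt_dir)


print(axis_angle_error_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
print(origin_to_axis_distance((0.0, 1.0, 5.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)))
```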

4.2 Robotic Interaction

Figure 5: Robotic Interaction tasks. We study two robotic interaction tasks: door-opening and drawer-pulling.

With the large-scale PartNet-Mobility dataset, SAPIEN also supports various robotic interaction tasks, including solving low-level control tasks, such as button pushing, handle grasping, and drawer pulling, and planning tasks that require long-horizon logical planning and low-level controls, e.g., removing the mug from a microwave oven and then putting it on a table. Having both diverse object categories and rich intra-class instance variations allows us to perform such tasks on multiple object instances at category levels. Figure 3 shows a rich variety of robotic interaction tasks that SAPIEN enables.

In SAPIEN, we enable two modes for robotic interaction tasks: 1) using perception ground-truth (e.g., part mask, part motion information, and 3D locations) to accomplish the task. In this way, we factor out the perception module and allow algorithms to focus on robotic control and interaction tasks; 2) using the raw image/point-cloud as inputs, the method needs to develop its own perception, planning and control modules, which is our end-goal for the home-assistant robots to achieve. Also, this mode enables end-to-end learning for perception and interactions (e.g., learning perception with a specific interaction target).

Door-opening and Drawer-pulling.

We perform two manipulation tasks: door-opening and drawer-pulling, as shown in Figure 5. We use a flying gripper (Kinova Gripper 3 [5]) that can move freely in the workspace. All dynamics properties except gravity (e.g., contact, friction, and damping) are simulated in our environment. We perform our drawer-pulling tasks on cabinet instances and door-opening tasks on cabinet instances.

In our tasks, if the gripper can move a given joint (e.g., slider joint of the drawer, hinge joint of the door) through of its motion range, then it will be regarded as a success. If the agent cannot move the joint to the given threshold or moves it in the opposite direction, then it fails. The input of the agent consists of point clouds, normal maps and segmentation masks captured by three fixed cameras mounted on the left, right and front of the arena respectively. The agent can also access all information about itself (e.g., 6 DoF pose).

Heuristic Based Manipulation.

To demonstrate our simulator in manipulation tasks, we first use manually designed heuristic pipelines to solve the tasks. For drawer-pulling, we use the point cloud with ground-truth segmentation to detect a valid grasp pose for the drawer handle. Then we use a velocity controller to pull the drawer to the joint limit. Using ground-truth visual information, we can achieve a success rate. As for the door-opening task, we first open the door by a small angle using a similar approach (grasping the handle first). Then we use Position Based Visual Servoing (PBVS) [21] to track and clamp the edge of the door. Finally, the door is opened by rotating the edge. This method (PBVS) achieves an success rate for door opening. A more detailed illustration of this heuristic-based pipeline can be found in our supplementary video.
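To give a feel for the servoing step, the sketch below shows the translational part of a generic position-based visual servoing law: the commanded velocity is proportional to the error between the current gripper position and a target position estimated from vision, with a speed limit. It is an assumption-laden illustration rather than the pipeline used in the paper; the function names, the gain, the speed limit and the time step are all hypothetical, and orientation control is omitted.

```python
import math


def pbvs_velocity(p_current, p_target, gain=1.0, v_max=0.1):
    """Proportional velocity command toward a vision-estimated target position."""
    error = [t - c for c, t in zip(p_current, p_target)]
    v = [gain * e for e in error]
    speed = math.sqrt(sum(x * x for x in v))
    if speed > v_max:
        # Clamp the commanded speed while keeping the direction.
        scale = v_max / speed
        v = [x * scale for x in v]
    return v


def servo(p_start, p_target, dt=0.05, steps=200):
    """Integrate the velocity command for a free-flying gripper (no dynamics)."""
    p = list(p_start)
    for _ in range(steps):
        v = pbvs_velocity(p, p_target)
        p = [x + vx * dt for x, vx in zip(p, v)]
    return p
```

In a real loop the target position would be re-estimated from the cameras at every step, which is what lets the gripper track the door edge as it moves.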

Learning Based Manipulation.

We also demonstrate the above two tasks using reinforcement learning. We test the generalizability of the RL agent by training on a limited set of objects and testing on unseen objects with different size, density, and motion properties. We adopt Soft Actor-Critic (SAC) [15], one of the state-of-the-art reinforcement learning algorithms, train it on doors or drawers, and test it on the remaining unseen models.

We provide three different state representations: 1) raw state of the whole scene (raw-exp), consisting of the current positions and velocities of all the parts; 2) mobility-based representation (mobility-exp), with the 6D pose of the motion axis and average normal, and the current joint angles and velocities of the target part; 3) visual inputs (visual-exp), where a front-view camera captures RGB-D images of the object at every time step, augmented with a segmentation mask for the target part.

We use the same flying gripper and initialize it on the handle. The grasp pose is generated by the heuristic method as described in the above section. During training, agents receive positive rewards when the target part approaches the joint limit with the opening door/drawer, while obtaining negative rewards when the gripper falls off the handle. We interact with multiple objects simultaneously during training, and use a shared replay buffer to collect experiences to train SAC. After 1M interaction steps, we evaluate the performance on the unseen objects, each for 20 episodes.
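The training signal and data collection described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the reward used in the paper: the progress-based reward form, the penalty magnitude, and the names `shaped_reward` and `SharedReplayBuffer` are all hypothetical.

```python
import random
from collections import deque


def shaped_reward(q_prev, q_curr, q_limit, grasp_ok, drop_penalty=1.0):
    """Positive when the joint gets closer to its limit, negative on a lost grasp.

    Assumed form: reward equals the reduction in distance to the joint limit
    over one step; `drop_penalty` is an illustrative constant.
    """
    if not grasp_ok:
        return -drop_penalty
    return abs(q_limit - q_prev) - abs(q_limit - q_curr)


class SharedReplayBuffer:
    """One FIFO buffer that collects transitions from several objects."""

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)

    def add(self, transition):
        self._data.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement across all stored transitions.
        items = list(self._data)
        return rng.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self._data)
```

Because every environment writes into the same buffer, each SAC update draws a batch that mixes experience from different object instances, which is one simple way to expose the agent to instance variation during training.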

For doors, the evaluation metric is the average achieved degree. For drawers, we report the success rate of opening 80% of joint limits. Table 6 shows our experimental results. For door-opening, the RL agent tends to overfit when few training objects are available: as the number of training objects grows, training performance drops while test performance increases, indicating that training on more scenarios improves generalization. For drawer-pulling, although the performance follows the same pattern as for doors, it is relatively stable across the number of training objects. This is because drawers are relatively easier to pull out, as the movement of the gripper follows almost the same pattern at every time step.
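The two evaluation metrics can be written down concretely. The sketch below is illustrative only: the function names are hypothetical, final hinge angles are assumed to be available in radians, and the 0.8 fraction corresponds to the 80% of joint limits mentioned in the text.

```python
import math
from statistics import mean


def average_final_angle_deg(final_angles_rad):
    """Door metric: mean opening angle (in degrees) over evaluation episodes."""
    return mean(math.degrees(a) for a in final_angles_rad)


def success_rate(final_positions, joint_limits, fraction=0.8):
    """Drawer metric: share of episodes that reach `fraction` of the joint limit."""
    successes = [
        q >= fraction * limit
        for q, limit in zip(final_positions, joint_limits)
    ]
    return sum(successes) / len(successes)
```

With 20 evaluation episodes per unseen object, each metric would be averaged over all episodes of all test objects.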

Among all the representations, mobility-exp gives the best performance. For doors, the visual-exp representation performs close to mobility-exp; for drawers, however, raw-exp is better than visual-exp. This is because the camera is fixed during the interaction. For drawer-opening, the visual features seen from the front view remain almost the same at every time step, so they provide little information about state changes. These observations lead us to some interesting future work. First, we need proper vision methods to encode the geometric information of the scene, which may change during interaction. Second, although these tasks are not hard for heuristic algorithms, RL-based approaches fail to perform well on all the objects. Future work may study how to enhance the transferability and efficiency of RL on these tasks.

                      Door (Final Angle Degree)       Drawer (Success Rate)
# training objects    2      4      8      16         2      4      8      16
raw-exp       train   85.4   70.5   50.5   38.4       0.84   0.82   0.77   0.75
              test    14.7   18.7   21.2   27.3       0.61   0.63   0.66   0.66
mobility-exp  train   88.7   78.6   59.2   41.1       0.83   0.81   0.79   0.78
              test    22.9   27.3   27.5   32.8       0.65   0.65   0.69   0.68
visual-exp    train   90.2   65.2   56.7   32.1       0.80   0.72   0.69   0.63
              test    21.7   24.5   28.1   29.6       0.59   0.60   0.61   0.60
Table 6: SAC results on door and drawer opening.
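As a worked reading of Table 6, the short script below computes the train-test gap for the door task (final angle in degrees) as a function of the number of training objects. The numbers are copied from the table; the variable names and the reading of the column headers 2, 4, 8, 16 as the number of training objects are assumptions of this sketch.

```python
# Door-opening results from Table 6 (final angle in degrees).
NUM_TRAIN_OBJECTS = [2, 4, 8, 16]
DOOR_RESULTS = {
    "raw-exp": {"train": [85.4, 70.5, 50.5, 38.4], "test": [14.7, 18.7, 21.2, 27.3]},
    "mobility-exp": {"train": [88.7, 78.6, 59.2, 41.1], "test": [22.9, 27.3, 27.5, 32.8]},
    "visual-exp": {"train": [90.2, 65.2, 56.7, 32.1], "test": [21.7, 24.5, 28.1, 29.6]},
}


def generalization_gap(results):
    """Train minus test performance, per representation and training-set size."""
    return {
        name: [round(tr - te, 1) for tr, te in zip(r["train"], r["test"])]
        for name, r in results.items()
    }


if __name__ == "__main__":
    for name, gaps in generalization_gap(DOOR_RESULTS).items():
        pairs = ", ".join(f"{n}: {g}" for n, g in zip(NUM_TRAIN_OBJECTS, gaps))
        print(f"{name:13s} {pairs}")
```

For mobility-exp, the gap shrinks from 65.8 degrees with 2 training objects to 8.3 degrees with 16, which is the overfitting-then-generalization trend discussed in the text.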

5 Conclusion

We present SAPIEN, a simulation environment for robotic vision and interaction tasks, which provides detailed part-level physical simulation, hierarchical robotics controllers and versatile rendering options. We demonstrate that our SAPIEN enables a large variety of robotic perception and interaction tasks.


This research was supported by NSF grant IIS-1764078, NSF grant IIS-1763268, a Vannevar Bush Faculty Fellowship, the Canada CIFAR AI Chair program, gifts from Qualcomm, Adobe, and Kuaishou Technology, and grants from the Samsung GRO program and the SAIL Toyota Research Center.


  • [1] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel (2018) Vision-and-Language Navigation: interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • [2] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: Table 1, §1, §2, §2, §3.1, §3.1.
  • [3] S. Brodeur, E. Perez, A. Anand, F. Golemo, L. Celotti, F. Strub, J. Rouat, H. Larochelle, and A. Courville (2017) HoME: a household multimodal environment. arXiv preprint arXiv:1711.11017. Cited by: §1, §2, §2.
  • [4] B. Calli, A. Singh, J. Bruce, A. Walsman, K. Konolige, S. Srinivasa, P. Abbeel, and A. M. Dollar (2017) Yale-cmu-berkeley dataset for robotic manipulation research. The International Journal of Robotics Research 36 (3), pp. 261–268. Cited by: §1.
  • [5] A. Campeau-Lecours, H. Lamontagne, S. Latour, P. Fauteux, V. Maheu, F. Boucher, C. Deguire, and L. C. L’Ecuyer (2019) Kinova modular robot arms for service robotics applications. In Rapid Automation: Concepts, Methodologies, Tools, and Applications, pp. 693–719. Cited by: §4.2.
  • [6] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV). Cited by: §2.
  • [7] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §2, §3.2.
  • [8] S. Chitta, E. Marder-Eppstein, W. Meeussen, V. Pradeep, A. R. Tsouroukdissian, J. Bohren, D. Coleman, B. Magyar, G. Raiola, M. Lüdtke, et al. (2017) Ros_control: a generic and simple control framework for ros. Cited by: §3.1.
  • [9] S. Chitta, I. Sucan, and S. Cousins (2012) Moveit![ros topics]. IEEE Robotics & Automation Magazine 19 (1), pp. 18–19. Cited by: §3.1.
  • [10] E. Coumans and Y. Bai (2016) Pybullet, a python module for physics simulation for games, robotics and machine learning. GitHub repository. Cited by: §1, §1, §2.
  • [11] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2054–2063. Cited by: §1.
  • [12] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016) Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338. Cited by: §2, §2.
  • [13] X. Gao, R. Gong, T. Shu, X. Xie, S. Wang, and S. Zhu (2019) VRKitchen: an interactive 3D virtual environment for task-oriented learning. arXiv abs/1903.05757. Cited by: §2, §2.
  • [14] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik (2017) Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625. Cited by: §1.
  • [15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §4.2.
  • [16] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: Figure 4, §4.1, Table 4.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
  • [18] R. Hu, W. Li, O. Van Kaick, A. Shamir, H. Zhang, and H. Huang (2017) Learning to predict part mobility from a single static snapshot. ACM Transactions on Graphics (TOG) 36 (6), pp. 227. Cited by: §2, Table 2.
  • [19] J. Huang, H. Su, and L. Guibas (2018) Robust watertight manifold surface generation method for shapenet models. arXiv preprint arXiv:1802.01698. Cited by: §3.2.
  • [20] L. Hugues and N. Bredeche (2006) Simbad: an autonomous robot simulation package for education and research. In International Conference on Simulation of Adaptive Behavior, pp. 831–842. Cited by: §2.
  • [21] S. Hutchinson, G. D. Hager, and P. I. Corke (1996) A tutorial on visual servo control. IEEE transactions on robotics and automation 12 (5), pp. 651–670. Cited by: §4.2.
  • [22] S. James, Z. Ma, D. R. Arrojo, and A. J. Davison (2019) RLBench: the robot learning benchmark & learning environment. arXiv preprint arXiv:1909.12271. Cited by: Table 1, §1, §2, §2, §2.
  • [23] A. Juliani, V. Berges, E. Vckay, Y. Gao, H. Henry, M. Mattar, and D. Lange (2018) Unity: a general platform for intelligent agents. arXiv preprint arXiv:1809.02627. Cited by: §2.
  • [24] N. Koenig and A. Howard (2004) Design and use paradigms for Gazebo, an open-source multi-robot simulator. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)(IEEE Cat. No. 04CH37566), Vol. 3, pp. 2149–2154. Cited by: §1, §2, §3.1.
  • [25] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi (2017) AI2-THOR: an interactive 3D environment for visual AI. arXiv:1712.05474. Cited by: Table 1, §2, §2.
  • [26] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436. Cited by: §1.
  • [27] M. Lutter, C. Ritter, and J. Peters (2019) Deep lagrangian networks: using physics as model prior for deep learning. arXiv preprint arXiv:1907.04490. Cited by: §2.
  • [28] K. Mamou, E. Lengyel, and E. A. Peters (2016) Volumetric hierarchical approximate convex decomposition. Game Engine Gems 3, pp. 141–158. Cited by: §3.2.
  • [29] A. Mandlekar, Y. Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, et al. (2018) ROBOTURK: a crowdsourcing platform for robotic skill learning through imitation. arXiv preprint arXiv:1811.02790. Cited by: §1.
  • [30] R. Martín-Martín, C. Eppner, and O. Brock (2019) The RBO dataset of articulated objects and interactions. The International Journal of Robotics Research 38 (9), pp. 1013–1019. Cited by: §2, Table 2.
  • [31] J. Meyer, A. Sendobry, S. Kohlbrecher, U. Klingauf, and O. Von Stryk (2012) Comprehensive simulation of quadrotor uavs using ROS and Gazebo. In International conference on simulation, modeling, and programming for autonomous robots, pp. 400–411. Cited by: §2.
  • [32] K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su (2019-06) PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.2, Figure 4, §4.1, Table 4.
  • [33] M. Müller, V. Casser, J. Lahoud, N. Smith, and B. Ghanem (2018) Sim4CV: a photo-realistic simulator for computer vision applications. International Journal of Computer Vision 126 (9), pp. 902–919. Cited by: §2, §2.
  • [34] A. Murali, T. Chen, K. V. Alwala, D. Gandhi, L. Pinto, S. Gupta, and A. Gupta (2019) PyRobot: an open-source robotics framework for research and benchmarking. arXiv preprint arXiv:1906.08236. Cited by: §1.
  • [35] Nvidia PhysX physics engine. Cited by: §1, §2, §3.1.
  • [36] S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison, et al. (2010) OptiX: a general purpose ray tracing engine. In Acm transactions on graphics (tog), Vol. 29, pp. 66. Cited by: §3.3.
  • [37] X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba (2018) VirtualHome: simulating household activities via programs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [38] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108. Cited by: §4.1, §4.1.
  • [39] W. Qiu (2017) UnrealCV: virtual worlds for computer vision. ACM Multimedia Open Source Software Competition. Cited by: §2, §2.
  • [40] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng (2009) ROS: an open-source robot operating system. In ICRA workshop on open source software, Vol. 3, pp. 5. Cited by: §3.1.
  • [41] E. Rohmer, S. P. Singh, and M. Freese (2013) V-REP: a versatile and scalable robot simulation framework. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1321–1326. Cited by: §1, §2.
  • [42] M. Savva, A. X. Chang, A. Dosovitskiy, T. Funkhouser, and V. Koltun (2017) MINOS: multimodal indoor simulator for navigation in complex environments. arXiv:1712.03931. Cited by: Table 1, §1, §2.
  • [43] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra (2019) Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: Table 1, §1, §1, §2.
  • [44] C. Schenck and D. Fox (2018) Spnets: differentiable fluid dynamics for deep neural networks. arXiv preprint arXiv:1806.06094. Cited by: §1.
  • [45] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser (2017) Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [46] Y. Song, J. Wang, T. Lukasiewicz, Z. Xu, M. Xu, Z. Ding, and L. Wu (2019) Arena: a general evaluation platform and building toolkit for multi-agent intelligence. arXiv preprint arXiv:1905.08085. Cited by: §2.
  • [47] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. Newcombe (2019) The Replica dataset: a digital replica of indoor spaces. arXiv preprint arXiv:1906.05797. Cited by: §2.
  • [48] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. Lillicrap, and M. Riedmiller (2018-01) DeepMind control suite. Technical report Vol. abs/1504.04804, DeepMind. Cited by: §2, §2.
  • [49] E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §1, §1.
  • [50] Y. Urakami, A. Hodgkinson, C. Carlin, R. Leu, L. Rigazio, and P. Abbeel (2019) DoorGym: a scalable door opening environment and baseline agent. arXiv preprint arXiv:1908.01887. Cited by: §2, §2.
  • [51] B. Walter, S. R. Marschner, H. Li, and K. E. Torrance (2007) Microfacet models for refraction through rough surfaces. In Proceedings of the 18th Eurographics conference on Rendering Techniques, pp. 195–206. Cited by: §3.3.
  • [52] X. Wang, B. Zhou, Y. Shi, X. Chen, Q. Zhao, and K. Xu (2019) Shape2Motion: joint analysis of motion parts and attributes from 3D shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8876–8884. Cited by: §2, Table 2.
  • [53] L. B. Wolff, S. K. Nayar, and M. Oren (1998) Improved diffuse reflection models for computer vision. International Journal of Computer Vision 30 (1), pp. 55–71. Cited by: §3.3.
  • [54] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian (2018) Building generalizable agents with a realistic and rich 3D environment. arXiv preprint arXiv:1801.02209. Cited by: §1, §2.
  • [55] F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. Tchapmi, A. Toshev, R. Martín-Martín, and S. Savarese (2019) Interactive Gibson: a benchmark for interactive navigation in cluttered environments. arXiv preprint arXiv:1910.14442. Cited by: Table 1.
  • [56] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese (2018) Gibson Env: real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9068–9079. Cited by: Table 1, §1, §1, §2, §2.
  • [57] C. Yan, D. Misra, A. Bennett, A. Walsman, Y. Bisk, and Y. Artzi (2018) CHALET: Cornell house agent learning environment. arXiv:1801.07357. Cited by: §2.
  • [58] Z. Yan, R. Hu, X. Yan, L. Chen, O. van Kaick, H. Zhang, and H. Huang (2019) RPM-Net: recurrent prediction of motion and parts from point cloud. ACM Trans. on Graphics (Proc. SIGGRAPH Asia). Cited by: §2, Table 2.
  • [59] J. Yang, Z. Ren, M. Xu, X. Chen, D. J. Crandall, D. Parikh, and D. Batra (2019) Embodied amodal recognition: learning to move to perceive objects. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2040–2050. Cited by: §1.
  • [60] A. Zeng, S. Song, J. Lee, A. Rodriguez, and T. Funkhouser (2019) TossingBot: learning to throw arbitrary objects with residual physics. arXiv preprint arXiv:1903.11239. Cited by: §2.