ManiSkill: Learning-from-Demonstrations Benchmark for Generalizable Manipulation Skills

by Tongzhou Mu et al.
University of California, San Diego

Learning generalizable manipulation skills is central for robots to achieve task automation in environments with endless scene and object variations. However, existing robot learning environments are limited in both scale and diversity of 3D assets (especially of articulated objects), making it difficult to train and evaluate the generalization ability of agents over novel objects. In this work, we focus on object-level generalization and propose the SAPIEN Manipulation Skill Benchmark (abbreviated as ManiSkill), a large-scale learning-from-demonstrations benchmark for articulated object manipulation with 3D visual input (point cloud and RGB-D image). ManiSkill supports object-level variations by utilizing a rich and diverse set of articulated objects, and each task is carefully designed for learning manipulations on a single category of objects. We equip ManiSkill with a large number of high-quality demonstrations to facilitate learning-from-demonstrations approaches and perform evaluations on baseline algorithms. We believe that ManiSkill can encourage the robot learning community to further explore learning generalizable object manipulation skills.




1 Introduction

One of the great promises of robotics is to build intelligent robots that can interact with the surrounding environments to achieve task automation. With the recent development of various high-quality and affordable robot arms as well as dedicated software platforms [1, 2, 3, 4, 5, 6, 7], research in object manipulations has attracted a great deal of attention. The broad applicability of robot manipulation skills in fields such as assembly [8], object rearrangement [9], healthcare [10], and food preparation [11] entails the need to build robots to deal with unseen and unstructured environments. In such environments, learning is critical to achieving an acceptable level of automation since it is difficult to manually design manipulation skills that can handle the endless variations of scenes and objects encountered in the real world.

Typically, reinforcement learning (RL), imitation learning, or combined approaches [12, 13, 14, 15, 16] are adopted to learn these manipulation skills. Recent advances in RL mostly focus on improving sample efficiency rather than the generalizability of the learned policies. On the other hand, imitation learning (or learning from demonstrations) is a promising direction towards generalizability, as it distills a policy from rich demonstrations. However, as is typical of data-driven approaches, the scale and diversity of the demonstrations are key to generalizable manipulation skills.

Among several types of generalization for object manipulation skills, we focus on object-level generalizability. For instance, when training a robot to open cabinets, we want it to be capable of opening unseen cabinets during inference. Thus, we define object-level generalizability as the ability of a learned manipulation skill to generalize to novel object instances within the same category of objects. In real applications such as household environments, manipulating unseen objects is a very common scenario.

In the past, reusable skills have been studied for basic skill primitives such as moving, pushing, and grasping [17], or for combined skill primitives such as push with grasp, or pick and place [18]. These skill primitives do not emphasize generalizability over objects of the same category. To achieve object-level generalizability, we believe that combining low-level primitive skills can be highly non-trivial on different objects with different geometries and physics. For example, to open a door, the agent needs to first grasp the handle, then move along a trajectory while avoiding collisions. The handle grasping, the trajectory following, and the collision avoidance processes are all highly coupled with each other and unique to the object being manipulated, and thus learning them separately might not lead to an object-level generalizable policy. Therefore, we believe that learning an entire short-horizon task as a whole, such as opening a door, can help solve more realistic problems and may facilitate future real-world manipulation tasks.

A major challenge of learning object-level generalizable skills is the lack of rich 3D (articulated) object assets in existing benchmarks [19, 20, 21, 22]. Policies trained with inadequate variations in object assets might generalize over slightly different locations of objects and goals from their training distributions, yet their performance tends to degrade drastically on novel objects. Moreover, the complexity of articulated objects forbids augmenting objects by simply varying physical parameters as in sim2real [23, 24]. The recently proposed DoorGym [25], a door-opening benchmark, contains different door instances, but it is limited to a single skill and does not incorporate a variety of primitive manipulation skills.

To facilitate the learning of manipulation skills with object-level generalizability, we propose the SAPIEN Manipulation Skill Benchmark (abbreviated as ManiSkill), a large-scale learning-from-demonstrations benchmark for learning manipulation skills over a diverse set of articulated objects. ManiSkill currently includes four tasks (with an ongoing effort to include more), each of which requires a robot to perform a manipulation skill that is common in household environments (illustrated in Figure 1). Each task has variations in initial states, physical parameters, object instances, and target parameters. These variations, along with the robot models and observation modes (point clouds, RGB-D images, and states), are fully configurable. ManiSkill is built on SAPIEN [26] and the PartNet-Mobility dataset [27, 28], and supports both velocity-based and position-based controllers. Moreover, we take a divide-and-conquer approach to create high-quality demonstrations: with a meticulous effort on tuning a population of RL agents, one per object instance of each manipulation task, we are able to collect a large number of expert trajectories. We find alternative approaches, such as training a single RL agent jointly on multiple object instances, infeasible.

In summary, below are the key points that distinguish ManiSkill from existing benchmarks:

  • We designed four manipulation tasks in ManiSkill, where each task supports a wide range of object-level variations by utilizing a rich and diverse set of articulated objects, easing both the training and evaluation of generalizable manipulation policies.

  • Each task in ManiSkill is carefully designed for learning short-horizon control policies to manipulate a single category of objects. This setup eliminates the need for high-level planning, while ensuring that each task still requires complicated skills. The object-level variations encourage an agent to learn transferable skills, which might be used to constitute more complicated tasks.

  • We equip each task environment with a large number of high-quality demonstrations to facilitate generalizable skill learning via learning-from-demonstrations approaches. The demonstrations are collected by using carefully tuned RL agents trained with well-shaped rewards.

2 Related Work

Robot learning environments

Recently, several simulation environments have been developed to facilitate research on embodied AI, visual reasoning, and robot learning. These environments usually embed physics simulators [29, 30, 31, 32] and support tasks with different levels of abstraction. AI2-THOR [1], House3D [2], VirtualHome [3], Gibson [4], iGibson [5], AI Habitat [6], and TDW [7] simulate mobile robots in household environments to solve semantic or interactive navigation tasks. These environments focus on navigation in complex scenes with a simplified interaction model and are not tailored towards object manipulation tasks that involve complex rigid-body contacts. Robot control environments like OpenAI Gym [33], DeepMind Control Suite [34], RoboSuite [20], DoorGym [25], MetaWorld [19], and RLBench [22] integrate full-featured physics engines for continuous control and reinforcement learning. Along with the rise of offline reinforcement learning, benchmarks like D4RL [35] have been proposed to measure the progress of offline reinforcement learning algorithms. However, except for DoorGym [25], existing benchmarks are not suitable for studying object-level generalizability due to insufficient asset variety, and DoorGym is limited to a single object category. Our ManiSkill benefits from the large-scale articulated object collection in SAPIEN assets [26] and from the variety of object categories with physically realistic interactions, making it a proper testbed for object-level generalizable policy learning.

Manipulation skill learning

Object manipulation is a long-standing task in robotics. Recently, reinforcement learning and imitation learning have shown success in learning object grasping [36] and manipulation [37, 13, 14, 15, 38, 16, 39, 40]. Approaches like domain randomization [23, 24, 41, 42, 43] have shown great potential in transferring policies learned in simulation to the real world. Despite these works' success in solving challenging control tasks, they do not aim at learning a policy that generalizes over variations in articulated objects. On the other hand, learning-based grasping [44, 45, 46, 47, 48, 49, 50, 51] has gained increasing popularity and can propose novel grasp poses on novel objects. However, grasping policies are limited to solving pick-and-place tasks, and few of them have explored the grasping of articulated objects. ManiSkill, as a complement to previous works, contains multiple manipulation tasks and supports dexterous manipulation with multi-arm robots over multiple categories of 3D articulated objects, thereby enriching the learning environments for generalizable manipulation skills.

Interaction with articulated objects

Articulated objects, such as cabinets, drawers, and ovens, are common in household environments. Traditional planning and control approaches [52, 53] that rely on expert strategies and object models are not capable of handling diverse scenes and objects. One solution is to incorporate a perception module that estimates articulation models into planning approaches [54, 55, 56]. Recently, there has been widespread interest in adopting learned models for perception. For example, works such as [57, 58, 59] identify articulated objects' kinematic constraints, parts, and poses through neural networks. These approaches can usually generalize to novel objects and novel categories. However, how to incorporate perception models into direct policy learning [12, 14, 60, 61, 62], or how to generalize learned policies to novel objects, is underexplored. DoorGym [25], a door-opening environment with randomly sampled doors, joints, knobs, and visual appearances, helps train policies that can transfer to the real world. To extend prior works, ManiSkill provides more diverse articulated objects. We hope that ManiSkill can encourage interdisciplinary research across perception, planning, and policy learning.

Learning from demonstrations

Imitation learning [63, 64, 14] and offline RL [65, 66, 67] have shown promising progress in robot policy learning. Learning-from-demonstrations algorithms are able to leverage existing large-scale datasets of past interaction experiences [68, 69] to mine optimal policies or guide exploration in further online policy learning. ManiSkill provides high-quality demonstrations to enable the training of these learning-from-demonstrations algorithms and the evaluation of object-level generalizability.

3 ManiSkill: SAPIEN Manipulation Skill Benchmark

Figure 2: Overall architecture of ManiSkill. We sample objects from the PartNet-Mobility dataset, split them into training and test sets, and then build the corresponding training and test environments in the SAPIEN simulator. We then generate successful demonstration trajectories on the training environments. Users are expected to build policies based on the demonstration trajectories and the training environments, then evaluate the generalization performance on the test environments with the provided evaluation kit (see B.3).

The goal of building ManiSkill can be best described as facilitating the learning of generalizable manipulation skills from demonstrations. “Generalizable” requires the learned policy to be able to transfer to unseen objects; “manipulation” involves low-level physical interactions between robot agents and objects; “skills” refer to policies that can solve short-horizon tasks, which can be viewed as basic building blocks of more complicated policies; “demonstrations” are the provided high-quality trajectories that solve the tasks successfully.

Next, we will describe the components of ManiSkill in detail, including basic benchmark setup, design of tasks, and demonstration trajectories. The overall architecture of the benchmark suite is also summarized in Figure 2.

3.1 Basic Benchmark Setup

We first define some terminologies used in ManiSkill.

  • Task: In ManiSkill, we define a set of MDPs that require a specific type of skill (e.g. opening cabinet doors) as a task. In a task, the objects to be manipulated are sampled from the corresponding object dataset, and will change from episode to episode.

  • Environment: Each task includes multiple environments. We define a set of MDPs from the same task and containing the same object as an environment. In an environment, the object to be manipulated does not change, but some environment configuration parameters are randomized. These parameters include initial poses of objects and robots, along with physical parameters such as friction.

  • Level: Each environment includes an infinite number of levels, and each level has fixed environment parameters, i.e., everything is fixed in a specific level. This is similar to the notion of “level” in the ProcGen benchmark [70]. Note that each level is associated with a random seed, and a larger random seed does not imply higher difficulty.
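To make the task/environment/level hierarchy concrete, here is a minimal sketch (with hypothetical class and attribute names, not the actual ManiSkill API) of how a level's random seed deterministically fixes an environment's randomized parameters:

```python
import numpy as np

class Environment:
    """One (task, object) pair; a 'level' fixes all randomized parameters."""
    def __init__(self, object_id):
        self.object_id = object_id  # the object does not change across levels

    def reset(self, level):
        # Each level is identified by a random seed; the seed deterministically
        # fixes initial poses and physical parameters such as friction.
        rng = np.random.RandomState(level)
        robot_xy = rng.uniform(-0.5, 0.5, size=2)   # randomized initial robot pose
        friction = rng.uniform(0.1, 1.0)            # randomized physical parameter
        return robot_xy, friction

env = Environment(object_id="cabinet_1018")  # hypothetical object id
obs_a = env.reset(level=7)
obs_b = env.reset(level=7)   # same level -> identical configuration
```

Resetting to the same level twice reproduces exactly the same configuration, which is what makes levels useful for controlled evaluation.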

ManiSkill currently comes with four tasks: OpenCabinetDoor, OpenCabinetDrawer, PushChair, and MoveBucket, each designed to exemplify a real-world manipulation challenge.

For each task, we first divide all relevant objects into a training set and a test set, and we build the environments accordingly. We will refer to them as the training environments and the test environments. We then collect successful trajectories for each training environment, which are called demonstrations.

Users are expected to develop policies based on the training environments and the corresponding demonstrations. The learned policies are executed and evaluated on the test environments with unseen objects within the same category. Take the OpenCabinetDoor task as an example. This task requires a robot to open a designated door of a cabinet. We provide 42 different cabinets in the training environments, and the trained policy will be evaluated on the held-out 10 test environments (another 10 cabinets). Some examples of training and test objects are shown in Figure 3.

Figure 3: A subset of training and test objects for the OpenCabinetDoor task. ManiSkill requires a policy to learn on the training objects and generalize to the test objects. (Note: This figure shows a random train/test split and does not reflect the actual split.)

3.2 Robots, Actions, Observations and Rewards

All the tasks in ManiSkill use similar robots, which are composed of three parts: moving platform, Sciurus robot body, and one or two Franka Panda arm(s). The moving platform can move and rotate on the ground plane, and its height is adjustable. The robot body is fixed on top of the platform, providing support for the arms. Depending on the task, one or two robot arm(s) are connected to the robot body. There are 22 joints in a dual-arm robot and 13 for a single-arm robot. To match realistic robotics setups, we use PID controllers to control the joints of robots. The robot fingers use position controllers, while all other joints, including the moving platform joints and the arm joints, use velocity controllers. The controllers are internally implemented as augmented PD and PID controllers. The action space corresponds to the normalized target values of all controllers.

The observations from our environments consist of three components: 1) a vector that describes the current state of the robot, including pose, velocity, and angular velocity of the moving platform of the robot, joint angles and joint velocities of all robot joints, positions and velocities of end effectors, as well as states of all controllers; 2) a vector that describes task-relevant information, if necessary; 3) perception of the scene, which has different representations according to the observation modes. ManiSkill supports three observation modes: state, RGB-depth (RGB-D), and point cloud. In state mode, the agent receives a vector that specifies the current ground-truth state of all objects in the scene, including pose, velocity, and angular velocity of the object being manipulated, as well as joint angles and joint velocities if it is an articulated object (e.g., cabinet). State mode is commonly used when training and testing on the same environment, but is not suitable for studying generalization to unseen objects, as ground-truth information about the non-robot objects in the scene is not available in realistic setups; such information has to be estimated based on some form of visual input that is universally obtainable. Therefore, we provide RGB-D and point cloud observation modes to enable the learning of generalizable policies. The RGB-D and point cloud observations are captured from a set of cameras mounted on the robot, resembling common real-world robotics setups. Specifically, three cameras are mounted on the robot 120° apart from each other and look 45° downwards. Each camera has a 110° field of view along the x-axis and a 60° field of view along the y-axis. The resolution of the cameras is 160×400. The observations from all cameras are combined to form a final observation. Visualizations of RGB-D/point cloud observations are shown in Figure 4. In addition, we provide some task-relevant segmentation masks in both RGB-D and point cloud observation modes; more details can be found in Section B.2 of the supplementary materials.
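As an illustration of how the three mounted cameras can produce one fused point cloud, the sketch below back-projects depth images through a pinhole intrinsic matrix and transforms the per-camera clouds into a common frame. The intrinsics and extrinsics used here are toy values, not the benchmark's actual camera parameters:

```python
import numpy as np

def depth_to_points(depth, K):
    """Back-project a depth image (H, W) into camera-frame 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]   # pinhole model: x = (u - cx) * z / fx
    y = (v - K[1, 2]) * z / K[1, 1]   # pinhole model: y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def fuse_point_clouds(depths, intrinsics, cam_to_world):
    """Combine per-camera clouds into one cloud in a common (robot) frame."""
    clouds = []
    for depth, K, T in zip(depths, intrinsics, cam_to_world):
        pts = depth_to_points(depth, K)
        pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
        clouds.append((pts_h @ T.T)[:, :3])  # apply 4x4 extrinsic transform
    return np.concatenate(clouds, axis=0)

# Toy example: three tiny 2x2 depth maps with identity extrinsics.
depths = [np.ones((2, 2))] * 3
K = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
cloud = fuse_point_clouds(depths, [K] * 3, [np.eye(4)] * 3)
```

In practice the extrinsic matrices would come from the mounting poses of the three robot cameras, so all points land in the robot's frame.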

Figure 4: Some observations provided by our ManiSkill benchmark. Left two images: RGB/Depth from one of the cameras; right image: fused point cloud from all three cameras mounted on the robot.

ManiSkill supports two kinds of rewards: sparse and dense. A sparse reward is a binary signal which is equivalent to a task-specific success condition. Learning with sparse rewards is very difficult. To alleviate such difficulty, we carefully designed well-shaped dense reward functions for each task. Details can be found in Section C.1 of the supplementary materials.
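The two reward modes can be contrasted with a small sketch. The dense-reward terms below (approach the handle, then open the joint toward 90% of its limit) are only a hypothetical shaping pattern for an OpenCabinetDrawer-style task, not the actual reward functions from Section C.1:

```python
def sparse_reward(info):
    """Sparse mode: a binary signal equivalent to the task's success condition."""
    return 1.0 if info["success"] else 0.0

def dense_reward(dist_to_handle, joint_pos, joint_limit):
    """Hypothetical shaped reward: stage-wise terms for reaching and opening."""
    reach = -dist_to_handle                                 # encourage approaching the handle
    progress = min(joint_pos / (0.9 * joint_limit), 1.0)    # fraction of the 90% open target
    return reach + progress
```

With only the sparse signal, the agent gets no gradient of feedback until success; the shaped version gives informative rewards at every step, which is why dense rewards were designed per task.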

                      # object instances                       # robot arms
                      All       Train     Test      Reserved
OpenCabinetDoor       68 (101)  42 (66)   10 (16)   16 (19)    1
OpenCabinetDrawer     38 (73)   25 (49)   10 (21)   3 (3)      1
PushChair             77        26        10        41         2
MoveBucket            58        29        10        19         2
Table 1: Dataset statistics for ManiSkill. For OpenCabinetDoor and OpenCabinetDrawer, numbers outside the parentheses indicate the number of unique cabinets, where each cabinet may have more than one door/drawer. Numbers inside the parentheses indicate the total number of doors/drawers. Numbers in the “Reserved” column indicate the number of objects without demonstrations.

3.3 Tasks

Next, we describe task-specific properties, setups, and success conditions. Some statistics for the tasks can be found in Table 1.

OpenCabinetDoor and OpenCabinetDrawer are examples of manipulating articulated objects with revolute and prismatic joints respectively. An agent is required to open the target door or drawer through the coordination between arm and body. Since one cabinet may have several doors/drawers, we use a mask in RGB-D/point cloud observations to indicate the target door/drawer. Success is marked by opening the joint to 90% of its limit and keeping it static for a period of time afterwards. Mastering these tasks is essential for manipulating daily objects, as these joints are very common in indoor environments.

PushChair exemplifies the ability to manipulate complex underactuated systems. An agent needs to push a swivel chair to a target location. Each chair is typically equipped with several omni-directional wheels and a rotating seat. The task is successful if the chair (1) is close enough (within 15 centimeters) to the target location; (2) is kept static for a period of time after being close enough to the target; and (3) does not fall over. A unique challenge of this task is that it is very hard to model the environment with traditional methods since there are a large number of joints and contacts.

MoveBucket is an example of manipulation that heavily relies on two-arm coordination. An agent is required to lift a bucket of balls from the ground above a platform. The task is successful if (1) the bucket is placed on or above the platform at the upright position and kept static for a period of time, and (2) all the balls remain in the bucket. Since agents need to lift and move the bucket steadily, the collaboration between arms becomes critical.

3.4 Demonstration Trajectories

Learning purely in a trial-and-error manner (e.g., reinforcement learning) simultaneously over many environments can be very difficult and time-consuming. Thus, we provide a large number of demonstration trajectories for each training environment, which facilitate generalizable skill learning via learning-from-demonstrations algorithms. For each training environment, we first collect demonstration trajectories under the state mode by running SAC [71] with the well-shaped dense reward signals we have designed (details in Sec. 4.2). Each trajectory is a sequence of observations, actions, rewards, and other information necessary to recover the context. Using this information, we can then render the corresponding point cloud or RGB-D demonstration. When we release our dataset, we will provide the full state-based demonstration trajectories along with the rendering functions to render the corresponding images and point clouds, since the rendered results can take up hundreds of gigabytes of space.
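The storage scheme above can be sketched as follows: a state-mode trajectory records simulator states and actions, and visual observations are rendered on demand by replaying those states. The `set_state` and `render` methods here are hypothetical environment methods, not the actual ManiSkill interface:

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Trajectory:
    """State-mode demonstration: enough information to replay and re-render."""
    env_states: List[Any] = field(default_factory=list)  # full simulator states
    actions: List[Any] = field(default_factory=list)
    rewards: List[float] = field(default_factory=list)

def render_demo(env, traj, obs_mode="pointcloud"):
    """Replay a state-mode trajectory and render visual observations on the fly,
    so images/point clouds never need to be stored on disk."""
    observations = []
    for state in traj.env_states:
        env.set_state(state)                 # restore the recorded simulator state
        observations.append(env.render(obs_mode))
    return observations

# Stub environment, only to make the sketch self-contained.
class _StubEnv:
    def set_state(self, s): self.s = s
    def render(self, mode): return (mode, self.s)

demo = Trajectory(env_states=[0, 1], actions=[None, None], rewards=[0.0, 1.0])
obs = render_demo(_StubEnv(), demo)
```

Storing compact states and rendering lazily is what keeps the released dataset small relative to the hundreds of gigabytes the rendered observations would occupy.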

4 Benchmark Construction

4.1 Simulation Environment Construction

4.1.1 System Design

We configure our simulation environments by a YAML-based configuration system. This system is mainly used to configure physical properties, rendering properties, and scene layouts that can be reused across tasks. It allows benchmark designers to specify simulation frequencies, physical solver parameters, lighting conditions, camera placement, randomized object/robot layouts, robot controller parameters, object surface materials, and other common properties shared across all environments. After preparing the configurations, designers can load the configurations as SAPIEN scenes and perform further specific customization with Python scripts. In our task design, after we build the environments, we manually validate them to make sure they behave as expected (see B.4 for details).

4.1.2 Controller Design

The joints in our robots are controlled by velocity or position controllers. For velocity controllers, we use the built-in inverse dynamics functions in PhysX to compute the balancing forces for a robot. We then apply the internal PD controllers of PhysX, setting stiffness to 0 and damping to a positive constant, where the damping term drives a joint to a given velocity. We additionally add a first-order low-pass filter, implemented as an exponential moving average, to the input velocity signal, which is a common practice in real robotics systems [72]. Position controllers are built on top of velocity controllers: the input position signal is passed into a PID controller, which outputs a velocity signal for a velocity controller.
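A minimal sketch of this controller stack for a single scalar joint, with illustrative gains rather than the benchmark's tuned parameters: the velocity controller applies an exponential-moving-average low-pass filter and a damping-only (zero-stiffness) drive, and the position controller wraps it with a PID loop:

```python
class VelocityController:
    """Damping-only PD drive (stiffness = 0) with a first-order low-pass
    filter (exponential moving average) on the commanded velocity."""
    def __init__(self, damping=20.0, alpha=0.3):
        self.damping = damping
        self.alpha = alpha        # EMA filter coefficient in (0, 1]
        self.filtered = 0.0

    def drive(self, target_vel, current_vel):
        # Low-pass filter the input velocity signal.
        self.filtered = self.alpha * target_vel + (1 - self.alpha) * self.filtered
        # With stiffness 0, the drive force depends only on the velocity error.
        return self.damping * (self.filtered - current_vel)

class PositionController:
    """Position control built on velocity control: a PID loop on position
    error outputs the velocity command."""
    def __init__(self, kp=5.0, ki=0.1, kd=0.5, dt=0.01):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = 0.0
        self.vel_ctrl = VelocityController()

    def drive(self, target_pos, current_pos, current_vel):
        err = target_pos - current_pos
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        vel_cmd = self.kp * err + self.ki * self.integral + self.kd * deriv
        return self.vel_ctrl.drive(vel_cmd, current_vel)
```

In the actual benchmark, the balancing forces from PhysX inverse dynamics are added on top of this drive force; that term is omitted here for brevity.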

4.1.3 Asset Preparation

For all of our tasks, the target objects to manipulate come from the PartNet-Mobility dataset in SAPIEN. The PartNet-Mobility dataset does not come with convex collision shapes, but SAPIEN requires convex collision shapes for dynamic objects to get better physical simulation results. Therefore, we need to pre-process some objects. Our first strategy was to adopt the VHACD [73] algorithm for convex decomposition, which is standard in physical simulation. However, we found that VHACD often generated convex shapes that are visually different from the original shapes. Therefore, we eventually decided to manually decompose the objects from some categories in modeling software (see B.5).

4.2 Demonstration Collection

# Different Cabinets    1      5      10     20
Success Rate          100%    82%    2%     0%
Table 2: The success rates of SAC [71] agents on OpenCabinetDrawer trained from scratch for 1M timesteps on different numbers of cabinets. The SAC agents are trained with manually designed states and rewards. Jointly training one single RL agent on a large number of environments (objects) from scratch to collect demonstrations is infeasible.

We collect demonstrations by training RL agents with manually designed states and rewards. Another option for collecting demonstrations is human annotation, i.e., having humans control the robot to solve the tasks. However, manually controlling a robot with many degrees of freedom is challenging. More importantly, manual annotation is not scalable, while RL agents can generate an arbitrary number of demonstrations at scale.

Since different environments of a task contain different objects of the same category, training a single agent from scratch directly on all training environments through trial and error to collect demonstrations might seem feasible at first glance. However, as shown in Table 2, even with carefully designed states and rewards, the performance of such an approach drops sharply as the number of different objects increases.

While directly training one single RL agent on many environments of a task is very challenging, training an agent to solve a single specific environment is feasible and well-studied. Therefore, we collect demonstrations in a divide-and-conquer way: We train a population of RL agents on different environments of a task and ensure that each agent is able to solve a specific environment well through careful reward engineering. These agents are then used to interact with their corresponding environments to generate successful trajectories. In this way, we can generate an arbitrary number of demonstration trajectories. Details of reward design can be found in C.1 in supplementary materials.
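The divide-and-conquer pipeline can be sketched as below, where `train_sac` stands in for per-environment SAC training with the shaped dense reward, and the toy environment exists only to make the sketch self-contained:

```python
def rollout(env, agent, max_steps=200):
    """Run one episode and record whether it ended in success."""
    obs, done, steps, info = env.reset(), False, 0, {}
    traj = {"obs": [], "actions": [], "success": False}
    while not done and steps < max_steps:
        action = agent(obs)
        traj["obs"].append(obs)
        traj["actions"].append(action)
        obs, reward, done, info = env.step(action)
        steps += 1
    traj["success"] = info.get("success", False)
    return traj

def collect_demonstrations(training_envs, n_demos_per_env, train_sac):
    """Divide and conquer: one RL agent per environment (object instance),
    each tuned to solve its own environment, rolled out until enough
    successful trajectories are collected."""
    demos = {}
    for env in training_envs:
        agent = train_sac(env)  # state-based SAC with a shaped dense reward
        trajectories = []
        while len(trajectories) < n_demos_per_env:
            traj = rollout(env, agent)
            if traj["success"]:          # keep only successful trajectories
                trajectories.append(traj)
        demos[env.name] = trajectories
    return demos

# Toy environment that always succeeds after 3 steps, just for the sketch.
class _ToyEnv:
    name = "toy"
    def reset(self): self.t = 0; return 0.0
    def step(self, a):
        self.t += 1
        done = self.t >= 3
        return float(self.t), 0.0, done, {"success": done}

demos = collect_demonstrations([_ToyEnv()], 2, train_sac=lambda env: (lambda obs: 0.0))
```

Because each agent only needs to master its own environment, the demonstration count per environment is limited only by rollout time.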

5 Baselines and Experiments

After collecting the demonstrations, we are able to train agents using learning-from-demonstrations algorithms to learn generalizable manipulation skills. We benchmark two types of common learning-from-demonstrations approaches: Imitation Learning (IL) and Offline/Batch Reinforcement Learning (Offline/Batch RL). For imitation learning, we choose a simple and widely adopted algorithm: behavior cloning (BC), which directly matches predicted and ground-truth actions by minimizing their distance. For offline RL, we benchmark Batch-Constrained Q-Learning (BCQ) [65] and Twin-Delayed DDPG with Behavior Cloning (TD3+BC) [74]. We follow their original implementations and briefly tune the hyperparameters. Details of the algorithm implementations are presented in Section D of the supplementary material.
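As a reference point for the simplest baseline, behavior cloning reduces to regressing demonstrated actions from observations. The sketch below fits a toy linear policy with the BC mean-squared-error objective; the actual baselines use PointNet-based networks, not a linear model:

```python
import numpy as np

def bc_loss_and_grad(W, states, actions):
    """Behavior cloning with a linear policy pi(s) = W s: minimize the mean
    squared distance between predicted and demonstrated actions."""
    preds = states @ W.T                     # (N, action_dim)
    err = preds - actions
    loss = np.mean(err ** 2)
    grad = 2.0 * err.T @ states / err.size   # d(loss)/dW
    return loss, grad

# Fit a toy 1-D policy a = 2 s from synthetic "demonstrations".
rng = np.random.RandomState(0)
states = rng.randn(256, 1)
actions = 2.0 * states
W = np.zeros((1, 1))
for _ in range(200):
    loss, grad = bc_loss_and_grad(W, states, actions)
    W -= 0.5 * grad
```

Since the provided demonstrations are all successful, this pure regression objective is a surprisingly strong baseline, which is consistent with the experimental comparison against offline RL below.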

The input to the learning agents consists of the robot state and visual perception of the scene, as previously mentioned in Section 3.2. For visual input, we use point cloud observations by default because point cloud features contain explicit and accurate positional information, which can be challenging to infer purely from RGB-D. Point clouds are obtained from three cameras of 160×400 resolution (as previously mentioned in Sec. 3.2), with position, RGB, and segmentation masks as features (for the details of segmentation masks, see Sec. B.2 in the supplementary).

To process the point cloud input and accelerate computation, we sample 800 points where any segmentation mask is true, along with 400 points without any segmentation mask (further details in Section D.1 of the supplementary). We benchmark two architectures. The first architecture uses a single PointNet [75] to extract a global feature for the entire point cloud, which is fed into the final MLP. The second architecture uses different PointNets to process points belonging to different segmentation masks. The global features from the PointNets are then fed into a Transformer [76], after which a final attention pooling layer extracts the final representation and feeds it into the final MLP. This architecture allows the model to capture the relations between different objects and possibly provides better performance. In addition, we found it crucial to also concatenate the robot state to the pixel/point features. Intuitively, this allows the extracted feature to capture not only the geometric information of objects but also the relation between the robot and each individual object, such as the closest point to the robot, which is very difficult to learn without such concatenation.
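The point-sampling step can be sketched as follows; this is a simplified version under the stated 800/400 budget, and the exact procedure is described in Section D.1 of the supplementary:

```python
import numpy as np

def sample_points(points, masks, n_masked=800, n_unmasked=400, rng=None):
    """Downsample a fused point cloud: n_masked points where any task-relevant
    segmentation mask is true, plus n_unmasked points where none is."""
    rng = rng if rng is not None else np.random.RandomState(0)
    any_mask = masks.any(axis=1)                 # (N,) true if any mask is set
    fg = np.flatnonzero(any_mask)
    bg = np.flatnonzero(~any_mask)
    # Sample with replacement only if a group has fewer points than requested.
    fg_idx = rng.choice(fg, size=n_masked, replace=len(fg) < n_masked)
    bg_idx = rng.choice(bg, size=n_unmasked, replace=len(bg) < n_unmasked)
    idx = np.concatenate([fg_idx, bg_idx])
    return points[idx], masks[idx]

# Toy cloud: 5000 points, 2 masks, the first 1000 points carry a mask
# (e.g. the target drawer).
points = np.random.RandomState(1).randn(5000, 3)
masks = np.zeros((5000, 2), dtype=bool)
masks[:1000, 0] = True
sampled_pts, sampled_masks = sample_points(points, masks)
```

Biasing the sample toward masked points keeps the task-relevant geometry well represented while bounding the network's input size.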

Details of the architectures are presented in Sec. D.2 of the supplementary material, and a detailed architecture diagram of PointNet + Transformer is presented in Fig. 6 of the supplementary material.

5.1 Single Environment Results

#Demo Trajectories                   10     30    100    300   1000
#Gradient Steps                    2000   4000  10000  20000  40000
PointNet, BC                       0.13   0.23   0.37   0.68   0.76
PointNet + Transformer, BC         0.16   0.35   0.51   0.85   0.90
PointNet + Transformer, BCQ        0.02   0.05   0.23   0.45   0.55
PointNet + Transformer, TD3+BC     0.03   0.13   0.22   0.31   0.57
Table 3: The average success rates of different agents on one single environment (fixed object instance) of OpenCabinetDrawer with different numbers of demonstration trajectories. The average success rates are calculated over 100 evaluation trajectories. While network architectures and algorithms play an important role in the performance, learning manipulation skills from demonstrations is challenging without a large number of trajectories, even in one single environment.

As a glimpse into the difficulty of learning manipulation skills from demonstrations in our benchmark, we first present the results with an increasing number of demonstration trajectories on one single environment of OpenCabinetDrawer in Table 3. We observe that the success rate gradually increases as the number of demonstration trajectories increases, which shows the agents can indeed benefit from more demonstrations. We also observe that inductive bias in network architecture plays an important role in the performance, as PointNet + Transformer is more sample efficient than PointNet. Interestingly, we did not find offline RL algorithms to outperform BC. We conjecture that this is because the provided demonstrations are all successful ones, so an agent is able to learn a good policy through BC. In addition, our robot’s high degree of freedom and the difficulty of the task itself pose a challenge to offline RL algorithms. Further discussions on this observation are presented in Section D.3 of the supplementary material.

5.2 Object-Level Generalization Results

Algorithm            BC               BC                       BCQ                      TD3+BC
Architecture         PointNet         PointNet + Transformer   PointNet + Transformer   PointNet + Transformer
Split                Training  Test   Training  Test           Training  Test           Training  Test
OpenCabinetDoor      0.19      0.01   0.27      0.10           0.11      0.04           0.14      0.02
OpenCabinetDrawer    0.28      0.08   0.47      0.14           0.25      0.13           0.20      0.13
PushChair            0.11      0.08   0.13      0.08           0.11      0.07           0.09      0.07
MoveBucket           0.05      0.03   0.14      0.06           0.08      0.06           0.03      0.03
Table 4: Average success rates on training and test environments of each task under the point cloud observation, with 300 demonstration trajectories per environment. For each task, the average test success rates are calculated over the 10 test environments and 50 evaluation trajectories per environment. Obtaining one single agent capable of learning manipulation skills across multiple objects and generalizing the learnt skills to novel objects is challenging.

We now present results on object-level generalization. For each training environment of each task, we generate 300 successful demonstration trajectories. We train each model for 150k gradient steps, which takes about 5 hours for BC, 35 hours for BCQ, and 9 hours for TD3+BC using the PointNet + Transformer architecture on one NVIDIA RTX 2080Ti GPU. As shown in Table 4, even with our best agent (BC with PointNet + Transformer), the overall success rates on both training and test environments are low, which suggests that 1) it is challenging to train agents over object variations within a task through learning-from-demonstration algorithms; 2) generalizing to test environments with unseen objects is even more difficult; 3) similar to the observations in Table 3, algorithms and architectures matter for both training and generalization performance. Therefore, we believe there is ample room for improvement, and our benchmark poses interesting and challenging problems for the community.

6 Conclusion

In this work, we propose ManiSkill, a large-scale learning-from-demonstrations benchmark for generalizable manipulation skills. As a complement to existing robot learning benchmarks, ManiSkill focuses on object-level skill generalization by incorporating a rich and diverse set of articulated objects and a large number of high-quality demonstrations. We expect that ManiSkill will encourage the community to investigate the object-level generalizability of manipulation skills, specifically by combining cutting-edge research from multiple fields. In the future, we plan to add more tasks to ManiSkill to form a complete set of commonly used primitive skills.


We thank Qualcomm for sponsoring the associated challenge, Sergey Levine and Ashvin Nair for insightful discussions during the whole development process, Yuzhe Qin for the suggestions on building robots, Jiayuan Gu for providing technical support on SAPIEN, and Rui Chen, Songfang Han, Wei Jiang for testing our system.


Supplementary Material

Appendix A Overview

This supplementary material includes detailed experimental results and analysis on all four tasks, as well as more details on implementations and system design.

Section B and C provide more details on the system, tasks, and demonstration collection.

Section D.1 provides implementation details of point cloud subsampling in our baselines.

Section D.2 provides implementation details of our point cloud-based baseline network architectures, along with a diagram of our PointNet + Transformer model.

Section D.3 provides implementation details of learning-from-demonstrations algorithms, specifically imitation learning (Behavior Cloning) and offline RL.

Appendix B Further Details of Tasks and System

b.1 The Speed of Environments

We provide the running speeds of our environments in Table 5. Note that these numbers cannot be directly compared with those of other environments, such as the MuJoCo tasks in OpenAI Gym, for the following reasons.

  1. In our environments, one environment step corresponds to 5 control steps. This “frame-skipping” technique can make the horizon of our tasks shorter, which is also a common practice in reinforcement learning [77]. We can make the environment FPS 5x larger by simply disabling frame-skipping, but this will make the tasks more difficult as the agent needs to make more decisions.

  2. Compared to other environments, such as the MuJoCo tasks in OpenAI Gym, our articulated objects and robots are much more complicated. For example, our chairs contain up to 20 joints and tens of thousands of collision meshes. Therefore, the physical simulation process is inherently slow.

  3. Many other robotics/control environments do not provide visual observations, while ManiSkill does. When generating visual observations, rendering is a very time-consuming process, especially when we are using three cameras simultaneously.
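The frame-skipping in item 1 amounts to a thin wrapper that repeats each agent action for several control steps. A minimal sketch, assuming a hypothetical gym-style `step()`/`reset()` API rather than ManiSkill's actual implementation:

```python
class FrameSkip:
    """Repeat each agent action for `skip` consecutive control steps.

    A sketch assuming a gym-style env interface; not ManiSkill's actual code.
    """

    def __init__(self, env, skip=5):
        self.env = env
        self.skip = skip

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward  # accumulate reward over skipped steps
            if done:
                break
        return obs, total_reward, done, info

    def reset(self):
        return self.env.reset()
```

With `skip=5`, one environment step corresponds to 5 control steps, shortening the effective horizon at the cost of coarser-grained decisions.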

Observation Mode: state / pointcloud
Table 5: Mean and standard deviation of FPS (frames per second) of the environments in ManiSkill. In state mode, most of the computation is spent on physical simulation; in pointcloud mode, most of the computation is spent on rendering. All numbers are measured on a single Intel i9-9960X CPU and a single NVIDIA RTX TITAN GPU.

b.2 Segmentation Masks

As mentioned in Sec 3.2, we provide task-relevant segmentation masks in pointcloud and rgbd modes. Each mask is a binary array indicating a part or an object. Here are the details about our segmentation masks for each task:

  • OpenCabinetDoor: handle of the target door, target door, robot (3 masks in total)

  • OpenCabinetDrawer: handle of the target drawer, target drawer, robot (3 masks in total)

  • PushChair: robot (1 mask in total)

  • MoveBucket: robot (1 mask in total)

Basically, we provide the robot mask and any mask that is necessary for specifying the target. For example, in the OpenCabinetDoor/Drawer environments, a cabinet might have many doors/drawers, so we provide the door/drawer mask such that users know which door/drawer to open. We also provide the handle mask such that users know from which direction the door/drawer should be opened.

b.3 Evaluation Kit

ManiSkill provides a straightforward evaluation script. The script takes a task name, an observation mode (RGB-D or point cloud), and a solution file as input. The solution file is expected to contain a single policy function that takes observations as input and outputs an action. The evaluation kit takes the policy function and evaluates it on the test environments. For each environment, it reports the average success rate and the average satisfaction rate for each success condition (e.g., whether the ball is inside the bucket and whether the bucket is on or above the platform in MoveBucket).
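The evaluation loop around such a policy function can be sketched as follows; the gym-style env interface and the `"success"` info key are assumptions for illustration, not the kit's actual API:

```python
def evaluate(policy_fn, env, n_episodes=50, max_steps=200):
    """Roll out `policy_fn` (observation -> action) and report the average
    success rate over `n_episodes`. Assumes a hypothetical gym-style env
    whose final info dict carries a boolean 'success' flag."""
    n_success = 0
    for _ in range(n_episodes):
        obs = env.reset()
        done, info, steps = False, {}, 0
        while not done and steps < max_steps:
            obs, _, done, info = env.step(policy_fn(obs))
            steps += 1
        n_success += int(info.get("success", False))
    return n_success / n_episodes
```

In the benchmark, the same loop is repeated over the 10 test environments with 50 evaluation trajectories each.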

b.4 Environment Validation

Environment construction is not complete without testing. After modeling an environment, we need to ensure it has the following properties: (1) The environment is correctly modeled with realistic parameters. Since the environment contains hundreds of parameters, including friction coefficients, controller parameters, object scales, etc., its correctness needs comprehensive checking. (2) The environment is solvable. We need to check that the task can be completed within the allocated time steps, and that the physical parameters allow task completion. For example, the weight of a target object to be grasped must be smaller than the maximum lifting force the robot gripper is allowed to exert. (3) The physical simulation is free from significant artifacts. Physical simulation faces a trade-off between stability and speed: when the simulation frequency and the contact solver iterations are too small, artifacts such as jittering and interpenetration can occur. (4) There are no undesired exploits or shortcuts.

To inspect our environments after modeling them, we first use the SAPIEN viewer to visually inspect the appearance of all assets, and we inspect the physical properties of crucial components in several sampled environments. Next, we design a mouse-and-keyboard based robot controller that drives the robot gripper in Cartesian coordinates using inverse kinematics of the robot arm, and we try to solve the tasks manually to identify potential problems in the environments. This manual process catches most solvability issues and physical artifacts. However, an agent could still learn unexpected exploits, which requires us to improve the environments iteratively: we execute the demonstration collection algorithm on the current environment and record videos of sampled demonstration trajectories. We then watch the videos to identify causes of success and failure, potentially spotting unreasonable behaviors. We finally investigate and improve the environment.

b.5 Manually Processed Collision Shapes

As described in Section 4.1.3, we manually decompose the collision shapes into convex shapes. This manual process is performed on the Bucket objects. We justify this choice in Fig. 5.

Figure 5: When decomposing a bucket (a), the standard VHACD [73] algorithm (b, 2340 faces) misses details and tends to produce artifacts, such as bumps and seams, that make the visual appearance quite different from the collision shape, so we process the mesh manually (c, 1445 faces).

Appendix C Details of Demonstration Collection

c.1 Dense Reward Design

In dense reward mode, we use a multistage reward function for all of the environments. In each stage, we guarantee that the reward in the next stage is strictly larger than the reward in the current stage to prevent the agent from staying in an intermediate stage forever. We also carefully design the rewards at stage-transition states, such that the rewards are smooth.
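The strict ordering between stages can be enforced by adding a per-stage offset that dominates any in-stage shaping reward. A toy sketch (the offset value is illustrative, not the actual reward scale used in ManiSkill):

```python
def multistage_reward(stage, in_stage_reward, offset=10.0):
    """Combine a stage index with an in-stage shaping reward.

    `in_stage_reward` is assumed to lie in [0, offset), so any reward in
    stage k+1 strictly exceeds every reward attainable in stage k, removing
    the incentive to stay in an intermediate stage forever."""
    assert 0.0 <= in_stage_reward < offset
    return stage * offset + in_stage_reward
```

Smoothness at stage transitions then amounts to shaping `in_stage_reward` so it approaches `offset` as the stage's completion condition is met.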

c.1.1 OpenCabinetDoor and OpenCabinetDrawer

For environments in OpenCabinetDoor and OpenCabinetDrawer, our reward function contains three stages. In the first stage, the agent receives rewards from being close to the handle on the target link (door or drawer). To encourage contact with the target link, we penalize the Euclidean distance between the handle and the gripper. When the gripper’s distance to the target link is less than a threshold, the agent enters the second stage. In this stage, the agent gets a reward from the opening angle of the door or the opening distance of the drawer. When the agent opens the door or drawer enough, the agent enters the final stage. In the final stage, the agent receives a negative reward based on the speed of the target link to encourage the scene to be static at the end of the trajectory.

c.1.2 PushChair

The reward function for PushChair contains three stages. In the first stage, the agent receives reward from moving towards the chair. To encourage contact with the chair, we compute the distance between the robot end effectors and the chair and take its logarithm as reward. When the robot end effectors are close enough to the chair, the agent enters the second stage. In the second stage, the agent receives rewards based on the distance between the chair’s current location and the target location. The agent receives additional rewards based on the angle between the chair’s velocity vector and the vector pointing towards the target location. In our experiments, we find that this term is critical. When the chair is close enough to the target location, the agent enters the final stage. In the final stage, the agent is penalized based on the linear and angular velocity of the chair, such that the agent learns to keep the chair static. In all stages, the agent is penalized based on the chair’s degree of tilt in order to keep the chair upright.

c.1.3 MoveBucket

The reward function for MoveBucket consists of four stages. In the first stage, the agent receives rewards from moving towards the bucket. To encourage contact with the bucket, we compute the distance between the robot end effectors (grippers) and the bucket and take its log value as reward. When the robot end effectors are close enough to the bucket, the agent enters the second stage. In the second stage, the agent is required to lift the bucket to a specific height. The agent receives a position-based reward and a velocity-based reward that encourage the agent to lift the bucket. In our experiments, we find that it is very difficult for the agent to learn how to lift the bucket without any domain knowledge. To ease the difficulty, we use the angle between the two vectors pointing from the two grippers to the bucket’s center of mass as a reward. This term encourages the agent to place the two grippers on opposite sides of the bucket. We also penalize the agent based on the grippers’ height difference in the bucket frame so that the grasp pose is more stable. Once the bucket is lifted high enough, the agent enters the third stage. In this stage, the agent receives a position-based reward and a velocity-based reward that encourages the agent to move the bucket towards the target platform. When the bucket is on top of the platform, the agent enters the final stage, and it is penalized based on the linear and angular velocity of the bucket, such that the agent learns to hold the bucket steadily. In all stages, the agent is also penalized based on the bucket’s degree of tilt to keep the bucket upright. Since it is harder to keep the bucket upright in MoveBucket than in PushChair, we take the log value of the bucket’s degree of tilt as an additional penalty term so that the reward is more sensitive at near-upright poses.

c.2 RL Agents

Hyperparameters Value
Optimizer Adam
Learning rate
Discount (γ) 0.95
Replay buffer size
Number of hidden layers (all networks) 3
Number of hidden units per layer 256
Number of threads for collecting samples 4
Number of samples per minibatch 1024
Nonlinearity ReLU
Target smoothing coefficient (τ) 0.005
Target update interval 1
Gradient steps 1
Total Simulation Steps
Table 6: The hyperparameters of SAC for demonstration generation.

For all environments, we use SAC [71] with the manually designed dense reward functions described above to train the agent. To speed up training, we use 4 parallel processes to collect samples. For better demonstration quality, during training we remove the early-done signal when the agent succeeds. While this potentially lowers the success rate of the agent, we find that it leads to a more robust policy at near-end states. However, during demonstration collection, we stop at the first success signal. The detailed hyperparameters of SAC can be found in Table 6.

For all environments, we train our agents for the total number of simulation steps listed in Table 6. To ensure the quality of our demonstrations, we set the success rate threshold to 0.3 for every agent, i.e., we collect demonstrations only with agents whose success rate is at least 0.3. We then uniformly sample the initial states and use the trained agents to collect successful trajectories for each environment of each task.

Appendix D Implementation Details of Point Cloud-Based Models and Baselines

d.1 Point Cloud Subsampling

To process the point cloud input as mentioned in Section 5 “Baselines and Experiments” of our main paper, we first sample 50 points for each segmentation mask (if there are fewer than 50 points, we keep all of them). We then randomly sample from the remaining points where at least one of the segmentation masks is true, such that the total number of such points is 800 (if there are fewer than 800, we keep all of them). Finally, we randomly sample from the points where none of the segmentation masks are true and which are not on the ground (i.e., have a positive z-coordinate), such that we obtain a total of 1200 points.
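The three-stage sampling above can be sketched in numpy; the function name, column layout (xyz in the first three columns), and fixed seed are assumptions for illustration:

```python
import numpy as np

def subsample_pointcloud(points, masks, n_per_mask=50, n_fg=800, n_total=1200):
    """Sketch of the three-stage subsampling described above.
    points: (N, F) array with xyz in the first three columns.
    masks:  (N, K) boolean segmentation masks."""
    rng = np.random.default_rng(0)
    taken = np.zeros(len(points), dtype=bool)

    def pick(candidates, n):
        n = min(n, len(candidates))
        if n > 0:
            taken[rng.choice(candidates, size=n, replace=False)] = True

    # 1) up to n_per_mask points from each segmentation mask
    for k in range(masks.shape[1]):
        pick(np.flatnonzero(masks[:, k] & ~taken), n_per_mask)
    # 2) fill with remaining masked ("foreground") points up to n_fg total
    pick(np.flatnonzero(masks.any(axis=1) & ~taken), n_fg - int(taken.sum()))
    # 3) fill with unmasked points above the ground (z > 0) up to n_total
    above_ground = (points[:, 2] > 0) & ~masks.any(axis=1)
    pick(np.flatnonzero(above_ground & ~taken), n_total - int(taken.sum()))
    return points[taken]
```

Sampling per-mask points first guarantees that small but task-critical parts (e.g. handles) survive the downsampling.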

d.2 Network Architectures

Figure 6: Architecture diagram for our “PointNet + Transformer” model.

For all of our PointNet policy network models, we concatenate the features of each point (position, RGB, and segmentation masks) with the robot state (as mentioned in Section 3.2 “Robots, Actions, Observations and Rewards” of our main paper) to form new point input features. For the position feature, we first compute the mean coordinates of the point cloud (or sub point cloud), then concatenate the mean with the original positions minus the mean. We found that such a normalized position feature significantly improves performance.

In our vanilla PointNet model, we feed all point features into one single PointNet. The PointNet has hidden layer dimensions [256, 512], and the global feature is passed through an MLP with layer sizes [512, 256, action_dim] to output actions.

For our PointNet + Transformer model, we use different PointNets to process points having different segmentation masks. If the masks have dimension k, then we use k + 2 PointNets (one for each of the k segmentation masks, one for the points without any segmentation mask, and one for the entire point cloud) with hidden dimension 256 to extract global features. We also use an additional MLP to output a 256-d hidden vector for the robot state alone (i.e., the robot state is not only concatenated with the point features and fed into the PointNets, but also processed alone through this MLP). The global point features, the processed robot state vector, and an additional trainable embedding vector (serving as a bias for the task) are fed into a Transformer [76]. We did not add position encoding to the Transformer, as we found it significantly hurts performance. The output vectors are passed through global attention pooling to extract a representation of dimension 256, which is then fed into a final MLP with layer sizes [256, 128, action_dim] to output actions. A diagram of this architecture is presented in Figure 6. All of our models use ReLU activations.
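The core of each PointNet branch — a shared per-point MLP followed by a symmetric max-pool — can be sketched in a few lines of numpy. The weights below are random placeholders, not trained parameters; the sketch only illustrates the permutation-invariant global feature:

```python
import numpy as np

def pointnet_global_feature(points, weights, biases):
    """points: (N, F). Apply the same MLP to every point independently,
    then max-pool over points, yielding a permutation-invariant
    global feature (the essence of PointNet)."""
    h = points
    for W, b in zip(weights, biases):
        h = np.maximum(h @ W + b, 0.0)  # shared layer + ReLU
    return h.max(axis=0)  # symmetric aggregation over points
```

Because the max-pool is symmetric, reordering the input points leaves the global feature unchanged, which is why the subsampling order in Sec. D.1 does not matter.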

d.3 Implementation Details of Learning-from-Demonstration Algorithms

We benchmark imitation learning with Behavior Cloning (BC), along with two offline-RL algorithms: Batch-Constrained Q-Learning (BCQ) [65] and Twin-Delayed DDPG with Behavior Cloning (TD3+BC) [74]. Unlike BC, BCQ does not directly clone the demonstration actions given an input; instead, it uses a VAE to fit the distribution of actions in the demonstrations. It then learns a Q-function that estimates the expected return of actions given an input, and at inference time selects the action with the highest estimated return among sampled candidates. TD3+BC [74] adds a weighted BC loss to the TD3 loss to constrain the output action to the demonstration data. The original paper also normalizes the features of every state in the demonstration dataset, but this trick is not applicable in our case since our inputs are visual. There are also other offline-RL algorithms, such as CQL [66], which we leave for future work.
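The TD3+BC actor objective amounts to a Q-term scaled by λ = α / mean|Q| plus a mean-squared behavior-cloning term. A numpy sketch of the loss (function name and the small epsilon are our additions; see Sec. D.3 for how α is chosen):

```python
import numpy as np

def td3_bc_actor_loss(q_values, policy_actions, demo_actions, alpha=2.5):
    """q_values: Q(s, pi(s)) for a batch; actions: (B, action_dim) arrays.
    With alpha = 0 the Q-term vanishes and the loss reduces to plain BC."""
    lam = alpha / (np.mean(np.abs(q_values)) + 1e-8)  # normalize Q scale
    bc_loss = np.mean((policy_actions - demo_actions) ** 2)
    return -lam * np.mean(q_values) + bc_loss
```

The λ normalization keeps the two terms on comparable scales regardless of the magnitude of the learned Q-values.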

For the Q-networks in BCQ [65] and TD3+BC [74], when using the PointNet + Transformer model, the action is concatenated with the point features and the state vector and fed into the model. The final feature from the model is fed into an MLP with layer sizes [256, 128, 1] to output Q-values. The VAE encoder and decoder in BCQ use a similar architecture, and the dimension of the latent vector is twice the action-space dimension. The hyperparameters for BCQ and TD3+BC are shown in Table 7 and Table 8.

Note that in TD3+BC, the policy is trained to maximize λ·Q(s, π(s)) − (π(s) − a)², where λ = α / ((1/N) Σ_{(s_i, a_i)} |Q(s_i, a_i)|). In the original paper, α = 2.5, and the algorithm is equivalent to BC if α = 0. Interestingly, as shown in Table 9, we found that when α is non-zero, the performance of TD3+BC is always worse than BC, even when α is decreased 100 times from the value in the original paper. However, in our previously reported results, we used a non-zero α to illustrate the performance comparison between TD3+BC and BC, since setting α = 0 is not interesting and does not distinguish TD3+BC from BC.

Hyperparameters Value
Batch size 64
Perturbation limit 0.00
Action samples during evaluation 100
Action samples during training 10
Learning rate
Discount (γ) 0.95
Nonlinearity ReLU
Target smoothing coefficient (τ) 0.005
Table 7: The hyperparameters of BCQ.
Hyperparameters Value
Learning rate
Action noise 0.2
Noise clip 0.5
Discount (γ) 0.95
Nonlinearity ReLU
Target smoothing coefficient (τ) 0.005
Table 8: The hyperparameters of TD3+BC.
α            0.00  0.02  0.2   2.5
Success Rate 0.85  0.31  0.01  0.00
Table 9: The success rates of TD3+BC trained with different values of α on one environment of OpenCabinetDrawer with 300 demonstration trajectories. The algorithm becomes equivalent to BC if α = 0.