
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can we still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action". Unlike frameworks that operate on 2D images, the voxelized observation and action space provides a strong structural prior for efficiently learning 6-DoF policies. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.





1 Introduction

Transformers [2] have become prevalent in natural language processing and computer vision. By framing problems as sequence modeling tasks, and training on large amounts of diverse data, Transformers have achieved groundbreaking results in several domains [3, 4, 5, 6]. Even in domains that do not conventionally involve sequence modeling [7, 8], Transformers have been adopted as a general architecture [9]. But in robotic manipulation, data is both limited and expensive. Can we still bring the power of Transformers to 6-DoF manipulation with the right problem formulation?

Language models operate on sequences of tokens [10], and vision transformers operate on sequences of image patches [4]. While pixel transformers [11, 1] exist, they are not as data efficient as approaches that use convolutions or patches to exploit the 2D structure of images. Thus, while Transformers may be domain agnostic, they still require the right problem formulation to be data efficient. A similar efficiency issue is apparent in behavior-cloning (BC) agents that directly map 2D images to 6-DoF actions. Agents like Gato [9] and BC-Z [12, 13] have shown impressive multi-task capabilities, but they require several weeks or even months of data collection.

In contrast, recent works in reinforcement learning like C2FARM [14] construct a voxelized observation and action space to efficiently learn visual representations of 3D actions with 3D ConvNets. Similarly, in this work, we aim to exploit the 3D structure of voxel patches for efficient 6-DoF behavior-cloning with Transformers (analogous to how vision transformers [4] exploit the 2D structure of image patches).

Figure 1: Language-Conditioned Manipulation Tasks: PerAct is a language-conditioned multi-task agent capable of imitating a wide range of 6-DoF manipulation tasks. We conduct experiments on 18 simulated tasks in RLBench [15] (a-j; only 10 shown), with several pose and semantic variations. We also demonstrate our approach with a Franka Panda on 7 real-world tasks (k-o; only 5 shown) with a multi-task agent trained with just 53 demonstrations. See the supplementary video for simulated and real-world rollouts.

To this end, we present PerAct (short for Perceiver-Actor), a language-conditioned BC agent that can learn to imitate a wide variety of 6-DoF manipulation tasks with just a few demonstrations per task. PerAct encodes a sequence of RGB-D voxel patches and predicts discretized translations, rotations, and gripper actions that are executed with a motion-planner in an observe-act loop. PerAct is essentially a classifier trained with supervised learning to detect actions akin to prior work like CLIPort [16], except our observations and actions are represented with 3D voxels instead of 2D image pixels. Voxel grids are less prevalent than images in end-to-end BC approaches, often due to scaling issues with high-dimensional inputs. But in PerAct, we use a Perceiver Transformer [1] (throughout the paper we refer to PerceiverIO [1] as Perceiver for brevity) to encode very high-dimensional input of up to 1 million voxels with only a small set of latent vectors. This voxel-based formulation provides a strong structural prior with several benefits: a natural method for fusing multi-view observations, learning robust action-centric representations [17] (action-centric refers to a perception system that learns visual representations of actions; see Appendix J), and enabling data augmentation in 6-DoF, all of which help learn generalizable skills by focusing on diverse rather than narrow multi-task data.

To study the effectiveness of this formulation, we conduct large-scale experiments in the RLBench [15] environment. We train a single multi-task agent on 18 diverse tasks with 249 variations that involve a range of prehensile and non-prehensile behaviors like placing wine bottles on a rack and dragging objects with a stick (see Figure 1 a-j). Each task also includes several pose and semantic variations with objects that differ in placement, color, shape, size, and category. Our results show that PerAct significantly outperforms image-to-action agents (by ) and 3D ConvNet baselines (by ), without using any explicit representations of instance segmentations, object poses, memory, or symbolic states. We also validate our approach on a Franka Panda with a multi-task agent trained from scratch on 7 real-world tasks with a total of just 53 demonstrations (see Figure 1 k-o).

In summary, our contributions are as follows:

  • A novel problem formulation for perceiving, acting, and specifying goals with Transformers.

  • An efficient action-centric framework for grounding language in 6-DoF actions.

  • Empirical results investigating multi-task agents on a range of simulated and real-world tasks.

The code and pre-trained models will be made available at

2 Related Work

Vision for Manipulation. Traditionally, methods in robot perception have used explicit “object” representations like instance segmentations, object classes, poses [18, 19, 20, 21, 22, 23]. Such methods struggle with deformable and granular items like cloths and beans that are hard to represent with geometric models or segmentations. In contrast, recent methods [16, 24, 25] learn action-centric representations without any “objectness” assumptions, but they are limited to top-down 2D settings with simple pick-and-place primitives. In 3D, James et al. proposed C2FARM [14], an action-centric reinforcement learning (RL) agent with a coarse-to-fine-grain 3D-UNet backbone. The coarse-to-fine-grain scheme has a limited receptive field that cannot look at the entire scene at the finest level. In contrast, PerAct learns action-centric representations with a global-receptive field through a Transformer backbone. Also, PerAct does BC instead of RL, which enables us to easily train a multi-task agent for several tasks by conditioning it with language goals.

End-to-End Manipulation approaches [26, 27, 28, 29] make the fewest assumptions about objects and tasks, but are often formulated as an image-to-action prediction task. Training directly on RGB images for 6-DoF tasks is often inefficient, generally requiring several demonstrations or episodes just to learn basic skills like rearranging objects. In contrast, PerAct uses a voxelized observation and action space, which is dramatically more efficient and robust in 6-DoF settings. While other works in 6-DoF grasping [30, 31, 32, 33, 34] have used RGB-D and pointcloud input, they have not been applied to sequential tasks or used with language-conditioning. Another line of work tackles data inefficiency by using pre-trained image representations [16, 35, 36] to bootstrap BC. Although our framework is trained from scratch, such pre-training approaches could be integrated in future work for even greater efficiency and generalization to unseen objects.

Transformers for Agents and Robots. Transformers have become the prevalent architecture in several domains, starting with NLP [2, 3, 37], more recently in vision [4, 38], and even in RL [8, 39, 40]. In robotics, Transformers have been applied to assistive teleop [41], legged locomotion [42], path-planning [43, 44], imitation learning [45, 46], and grasping [47]. Transformers have also achieved impressive results in multi-domain settings like in Gato [9], where a single Transformer was trained for 16 domains such as captioning, language-grounding, and robotic control. However, Gato relies on extremely large datasets like 15K episodes for block stacking and 94K episodes for Meta-World [48] tasks. Our voxel-based approach might complement agents like Gato, which could use our problem formulation for greater efficiency and robustness in 6-DoF manipulation settings.

Language Grounding for Manipulation. Several works have proposed methods for grounding language in robot actions [49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]. However, these methods use disentangled pipelines for perception and action, with the language primarily being used to guide perception. Recently, a number of end-to-end approaches [13, 12, 61] have been proposed for conditioning BC agents with language instructions. These methods require thousands of human teleoperated demonstrations that are collected over several days or even months. In contrast, PerAct can learn a robust multi-task policy with just a few minutes of training data. For benchmarking, several simulation environments exist [62, 24, 48], but we use RLBench [15] for its diversity of 6-DoF tasks and ease of generating expert demonstrations with templated language goals.

3 Perceiver-Actor 

PerAct is a language-conditioned behavior-cloning agent for 6-DoF manipulation. The key idea is to learn perceptual representations of actions conditioned on language goals. Given a voxelized reconstruction of a scene, we use a Perceiver Transformer [1] to learn per-voxel features. Despite the extremely large input space (), Perceiver uses a small set of latent vectors to encode the input. The per-voxel features are then used to predict the next best action in terms of discretized translation, rotation, and gripper state at each timestep. PerAct relies purely on the current observation to determine what to do next in sequential tasks. See Figure 2 for an overview.

Section 3.1 and Section 3.2 describe our dataset setup. Section 3.3 describes our problem formulation for PerAct, and Section 3.4 provides details on training PerAct.

Figure 2: PerAct Overview. PerAct is a language-conditioned behavior-cloning agent trained with supervised learning to detect actions. PerAct takes as input a language goal and a voxel grid reconstructed from RGB-D sensors. The voxels are split into 3D patches, and the language goal is encoded with a pre-trained language model. These language and voxel features are appended together as a sequence and encoded with a Perceiver transformer [1]. Despite the extremely long input sequence, Perceiver uses a small set of latent vectors to encode the input (see Appendix Figure 6 for an illustration). These encodings are upsampled back to the original voxel dimensions with a decoder and reshaped with linear layers to predict a discretized translation, rotation, gripper open, and collision avoidance action. This action is executed with a motion-planner after which the new observation is used to predict the next discrete action in an observe-act loop until termination.

3.1 Demonstrations

We assume access to a dataset of expert demonstrations, each paired with English language goals . These demonstrations are collected by an expert with the aid of a motion-planner to reach intermediate poses. Each demonstration is a sequence of continuous actions paired with observations . An action consists of the 6-DoF pose, gripper open state, and whether the motion-planner used collision avoidance to reach an intermediate pose: . An observation consists of RGB-D images from any number of cameras. We use four cameras for simulated experiments , but just a single camera for real-world experiments .

3.2 Keyframes and Voxelization

Following prior work by James et al. [14], we construct a structured observation and action space through keyframe extraction and voxelization.

Training our agent to directly predict continuous actions is inefficient and noisy. So instead, for each demonstration , we extract a set of keyframe actions that capture bottleneck end-effector poses [63] in the action sequence with a simple heuristic: an action is a keyframe if (1) the joint-velocities are near zero and (2) the gripper open state has not changed. Each datapoint in the demonstration can then be cast as a “predict the next (best) keyframe action” task [14, 64, 65]. See Appendix Figure F for an illustration of this process.
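The two-condition heuristic above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `extract_keyframes`, the threshold `velocity_tol`, and the choice to skip the very first timestep are all assumptions made for the example.

```python
from typing import List, Sequence


def extract_keyframes(
    joint_velocities: Sequence[Sequence[float]],
    gripper_open: Sequence[bool],
    velocity_tol: float = 1e-2,  # assumed threshold for "near zero"
) -> List[int]:
    """Return the timesteps that satisfy both conditions stated in the text."""
    keyframes = []
    # Start at t=1 so the gripper state can be compared with the previous step
    # (assumption: the initial rest pose is not treated as a keyframe).
    for t in range(1, len(gripper_open)):
        stopped = all(abs(v) < velocity_tol for v in joint_velocities[t])
        gripper_unchanged = gripper_open[t] == gripper_open[t - 1]
        if stopped and gripper_unchanged:
            keyframes.append(t)
    return keyframes


# Toy demonstration with two joints and five timesteps.
velocities = [[0.0, 0.0], [0.5, 0.2], [0.0, 0.001], [0.3, 0.1], [0.0, 0.0]]
gripper = [True, True, True, False, False]
keyframe_indices = extract_keyframes(velocities, gripper)
```

In this toy demonstration the arm comes to rest at timesteps 2 and 4 without a simultaneous gripper change, so those two timesteps are selected.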

To learn action-centric representations [17] in 3D, we use a voxel grid [66, 67] to represent both the observation and action space. The observation voxels are reconstructed from RGB-D observations fused through triangulation from known camera extrinsics and intrinsics. By default, we use a voxel grid of , which corresponds to a volume of in metric scale. The keyframe actions are discretized such that training our BC agent can be formulated as a “next best action” classification task [14]. Translation is simply the closest voxel to the center of the gripper fingers. Rotation is discretized into 5 degree bins for each of the three rotation axes. Gripper open state is a binary value. Collide is also a binary value that indicates if the motion-planner should avoid everything in the voxel grid or nothing at all; switching between these two modes of collision avoidance is crucial as tasks often involve both contact based (e.g., pulling the drawer open) and non-contact based motions (e.g., reaching the handle without colliding into anything).
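As a rough illustration of the discretization described above, the sketch below maps a continuous keyframe action to class indices. The workspace corner `bounds_min`, the `voxel_size`, and the `grid_size` are hypothetical values chosen for the example, and the angle convention (degrees in [-180, 180)) is an assumption; only the 5-degree bin width and the binary gripper and collide values come from the text.

```python
import math
from typing import Sequence, Tuple


def discretize_action(
    position: Sequence[float],    # gripper-finger center (x, y, z) in meters
    euler_deg: Sequence[float],   # one angle per axis, assumed in [-180, 180)
    gripper_open: bool,
    collide: bool,
    bounds_min: Sequence[float],  # assumed lower corner of the voxelized volume
    voxel_size: float,            # assumed voxel edge length in meters
    grid_size: int,               # assumed number of voxels per axis
    rot_bin_deg: float = 5.0,     # 5-degree bins, as stated in the text
) -> Tuple[Tuple[int, ...], Tuple[int, ...], int, int]:
    # The voxel containing the point is also the voxel whose center is closest.
    voxel = tuple(
        min(grid_size - 1, max(0, math.floor((p - lo) / voxel_size)))
        for p, lo in zip(position, bounds_min)
    )
    n_bins = int(round(360.0 / rot_bin_deg))
    rot_bins = tuple(
        int(math.floor((angle + 180.0) / rot_bin_deg)) % n_bins
        for angle in euler_deg
    )
    return voxel, rot_bins, int(gripper_open), int(collide)


voxel, rot_bins, open_bit, collide_bit = discretize_action(
    position=(0.123, -0.247, 0.505),
    euler_deg=(0.0, 90.0, -180.0),
    gripper_open=True,
    collide=False,
    bounds_min=(-0.5, -0.5, 0.0),
    voxel_size=0.01,
    grid_size=100,
)
```

With 5-degree bins each rotation axis becomes a 72-way classification problem, and the translation target becomes a single voxel index.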

3.3 PerAct Agent

PerAct is a Transformer-based [2] agent that takes in a voxel observation and language goal , and outputs a discretized translation, rotation, and gripper open action. This action is executed with a motion-planner, after which this process is repeated until the goal is reached.

The language goal is encoded with a pre-trained language model. We use CLIP’s [68] language encoder, but any pre-trained language model would suffice [13, 61]. Our choice of CLIP opens up possibilities for future work to use pre-trained vision features that are aligned with the language for better generalization to unseen semantic categories and instances [16].

The voxel observation is split into 3D patches of size 5 × 5 × 5 (akin to vision-transformers like ViT [4]). In implementation, these patches are extracted with a 3D convolution layer with a kernel-size and stride of 5, and then flattened into a sequence of voxel encodings. The language encodings are fine-tuned with a linear layer and then appended with the voxel encodings to form the input sequence. We also add learned positional embeddings to the sequence to incorporate voxel and token positions.
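A convolution whose kernel size equals its stride is equivalent to cutting the grid into non-overlapping patches and applying one shared linear projection to each. The NumPy sketch below illustrates that equivalence on a small grid; the grid size, channel count, embedding width, and the helper name `patchify_3d` are assumptions for the example, and the bias term and positional embeddings are omitted.

```python
import numpy as np


def patchify_3d(voxels: np.ndarray, patch: int = 5) -> np.ndarray:
    """Split a (D, H, W, C) grid into flattened non-overlapping patches."""
    d, h, w, c = voxels.shape
    assert d % patch == 0 and h % patch == 0 and w % patch == 0
    x = voxels.reshape(d // patch, patch, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # bring the three patch axes together
    return x.reshape(-1, patch * patch * patch * c)


rng = np.random.default_rng(0)
grid = rng.normal(size=(20, 20, 20, 4))            # assumed small grid, 4 channels
projection = rng.normal(size=(5 * 5 * 5 * 4, 64))  # assumed 64-dim patch embedding

patches = patchify_3d(grid)      # one row per 5x5x5 patch
tokens = patches @ projection    # sequence of voxel encodings
```

Here a 20 × 20 × 20 grid yields 4 × 4 × 4 = 64 tokens, so the sequence length shrinks by a factor of 125 relative to the number of voxels.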

The input sequence of language and voxel encodings is extremely long. A standard Transformer with self-attention connections and an input of patches is hard to fit on the memory of a commodity GPU. Instead, we use the Perceiver [1] Transformer. Perceiver is a latent-space Transformer, where instead of attending to the entire input, it first computes cross-attention between the input and a much smaller set of latent vectors (which are randomly initialized and trained). These latents are encoded with self-attention layers, and for the final output, the latents are again cross-attended with the input to match the input-size. See Appendix Figure 6 for an illustration. By default, we use latents of dimension 512 : , but in Appendix G we experiment with different latent sizes.
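The latent bottleneck can be illustrated with a stripped-down NumPy sketch. It is not the Perceiver architecture itself: it uses a single head, no learned projections, no residual connections or MLPs, and it assumes the input tokens themselves serve as the output queries. All sizes are hypothetical. The point is only that every attention matrix has one side equal to the small number of latents.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries: np.ndarray, keys_values: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention without learned projections (simplification)."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys_values.T * scale)
    return weights @ keys_values


rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 32))   # assumed: 1000 input tokens of width 32
latents = rng.normal(size=(16, 32))    # assumed: 16 randomly initialized latents

latents = attend(latents, inputs)      # encode: 16 x 1000 attention matrix
latents = attend(latents, latents)     # process: 16 x 16 self-attention
outputs = attend(inputs, latents)      # decode: 1000 x 16, back to input size
```

Full self-attention over the same input would need a 1000 × 1000 attention matrix, whereas here the largest one is 1000 × 16.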

The Perceiver Transformer uses 6 self-attention layers to encode the latents and outputs a sequence of patch encodings from the output cross-attention layer. These patch encodings are upsampled with a 3D convolution layer and tri-linear upsampling to decode 64-dimensional voxel features. The decoder includes a skip-connection from the encoder (like in UNets [69]). The per-voxel features are then used to predict discretized actions [14]. For translation, the voxel features are reshaped into the original voxel grid () to form a 3D Q-function of action-values. For rotation, gripper open, and collide, the features are max-pooled and then decoded with linear layers to form their respective Q-functions. The best action is chosen by simply maximizing the Q-functions:

where is the voxel location in the grid, are discrete rotations in Euler angles, is the gripper open state and is the collide variable. See Figure 5 for examples of Q-predictions.
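The maximization described above can be written out as follows. The symbols were lost in this copy of the text, so every name here is an assumption introduced for illustration: $(x, y, z)$ for the voxel location, $(\psi, \theta, \phi)$ for the discrete Euler angles, $\omega$ for the gripper open state, $\kappa$ for the collide variable, and $\mathbf{v}$, $\mathbf{l}$ for the voxel observation and language goal.

```latex
\begin{aligned}
\mathcal{T} &= \operatorname*{argmax}_{(x,\,y,\,z)} \; Q_{\mathrm{trans}}\big((x, y, z) \mid \mathbf{v}, \mathbf{l}\big), \\
\mathcal{R} &= \operatorname*{argmax}_{(\psi,\,\theta,\,\phi)} \; Q_{\mathrm{rot}}\big((\psi, \theta, \phi) \mid \mathbf{v}, \mathbf{l}\big), \\
\mathcal{O} &= \operatorname*{argmax}_{\omega} \; Q_{\mathrm{open}}\big(\omega \mid \mathbf{v}, \mathbf{l}\big), \\
\mathcal{C} &= \operatorname*{argmax}_{\kappa} \; Q_{\mathrm{collide}}\big(\kappa \mid \mathbf{v}, \mathbf{l}\big).
\end{aligned}
```

Each line is an independent argmax over one output head, which is what makes the prediction a set of classification problems rather than a regression.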

3.4 Training Details

PerAct is trained through supervised learning with discrete-time input-action tuples from a dataset of demonstrations. These tuples are composed of voxel observations, language goals, and keyframe actions . During training, we randomly sample a tuple and supervise the agent to predict the keyframe action given the observation and goal . For translation, the ground-truth action is represented as a one-hot voxel encoding . Rotations are also represented with a one-hot encoding per rotation axis with rotation bins (5 degrees for all experiments). Similarly, open and collide variables are binary one-hot vectors , . The agent is trained with cross-entropy loss like a classifier:

where , , , respectively. For robustness, we also augment and with translation and rotation perturbations. See Appendix E for more details.
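To make the classifier-style objective concrete, here is a minimal pure-Python sketch. It assumes the total loss is an unweighted sum of softmax cross-entropy terms, one per output head (translation, three rotation axes, open, collide); the head names, the equal weighting, and the tiny 8-voxel translation head are assumptions for the example.

```python
import math
from typing import Dict, Sequence


def cross_entropy(logits: Sequence[float], target_index: int) -> float:
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(q - m) for q in logits))
    return log_norm - logits[target_index]


def total_loss(q_values: Dict[str, Sequence[float]], targets: Dict[str, int]) -> float:
    """Assumed form: unweighted sum of per-head cross-entropy terms."""
    return sum(cross_entropy(q_values[head], targets[head]) for head in q_values)


# Untrained agent: all Q-values equal, so each head predicts a uniform distribution.
q_values = {
    "trans": [0.0] * 8,    # flattened voxel grid (tiny, for illustration only)
    "rot_x": [0.0] * 72,   # 360 / 5 = 72 bins per axis
    "rot_y": [0.0] * 72,
    "rot_z": [0.0] * 72,
    "open": [0.0, 0.0],
    "collide": [0.0, 0.0],
}
targets = {"trans": 3, "rot_x": 36, "rot_y": 54, "rot_z": 0, "open": 1, "collide": 0}
loss = total_loss(q_values, targets)
```

For uniform predictions each head contributes the log of its number of classes, which gives a convenient sanity check on the initial loss value.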

By default, we use a voxel grid size of . We conducted validation tests by replaying expert demonstrations with discretized actions to ensure that is a sufficient resolution for execution. The agent was trained with a batch-size of 16 on 8 NVIDIA V100 GPUs for 16 days (600K iterations). We use the LAMB [70] optimizer following Perceiver [1].

For multi-task training, we simply sample input-action tuples from all tasks in the dataset. To ensure that tasks with longer horizons are not over-represented during sampling, each batch contains a uniform distribution of tasks. That is, we first uniformly sample a set of tasks of batch-size length, then pick a random input-action tuple for each of the sampled tasks. With this strategy, longer-horizon tasks need more training steps for full coverage of input-action pairs, but all tasks are given equal weighting during gradient updates.
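The sampling strategy above can be sketched as a two-stage draw: first a task, then a tuple within that task. This is a minimal sketch under assumptions; in particular, sampling tasks with replacement and the names `sample_batch` and `dataset` are choices made for the example, and strings stand in for real input-action tuples.

```python
import random
from typing import Dict, List, Tuple


def sample_batch(
    dataset: Dict[str, List[str]], batch_size: int, rng: random.Random
) -> List[Tuple[str, str]]:
    """Draw each batch element by picking a task uniformly, then a tuple uniformly."""
    tasks = sorted(dataset)
    batch = []
    for _ in range(batch_size):
        task = rng.choice(tasks)            # uniform over tasks (with replacement)
        item = rng.choice(dataset[task])    # uniform within the chosen task
        batch.append((task, item))
    return batch


# A long-horizon task contributes more tuples per demonstration than a short one.
dataset = {
    "open drawer": [f"open_drawer_{i}" for i in range(3)],
    "stack blocks": [f"stack_blocks_{i}" for i in range(15)],
}
batch = sample_batch(dataset, batch_size=16, rng=random.Random(0))
```

Sampling tuples directly from the pooled dataset would instead draw "stack blocks" about five times as often as "open drawer" in this toy example.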

4 Results

We perform experiments to answer the following questions: (1) How effective is PerAct compared to unstructured image-to-action frameworks and standard architectures like 3D ConvNets? And what are the factors that affect PerAct’s performance? (2) Is the global receptive field of Transformers actually beneficial over methods with local receptive fields? (3) Can PerAct be trained on real-world tasks with noisy data?

4.1 Simulation Setup

We conduct our primary experiments in simulation for the sake of reproducibility and benchmarking.

Environment. The simulation is set in CoppeliaSim [71] and interfaced through PyRep [72]. All experiments use a Franka Panda robot with a parallel gripper. The input observations are captured from four RGB-D cameras positioned at the front, left shoulder, right shoulder, and on the wrist, as shown in Appendix Figure 7. All cameras are noiseless and have a resolution of .

Language-Conditioned Tasks. We train and evaluate on 18 RLBench [15] tasks. See Figure 1 for examples and Appendix A for details on individual tasks. Each task includes several variations, ranging from 2 to 60 possibilities; e.g., in the stack blocks task, “stack 2 red blocks” and “stack 4 purple blocks” are two variants. These variants are randomly sampled during data generation, but kept consistent during evaluations for one-to-one comparisons. Some RLBench tasks were modified to include additional variations to stress-test multi-task and language-grounding capabilities. There are a total of 249 variations across 18 tasks, and the number of extracted keyframes ranges from 2 to 17. All keyframes from an episode have the same language goal, which is constructed from templates (but human-annotated for real-world tasks). Note that in all experiments, we do not test for generalization to unseen objects, i.e., our train and test objects are the same. However, during test time, the agent has to handle novel object poses, randomly sampled goals, and randomly sampled scenes with different semantic instantiations of object colors, shapes, sizes, and categories. The focus here is to evaluate the performance of a single multi-task agent trained on all tasks and variants.

Evaluation Metric. Each multi-task agent is evaluated independently on all 18 tasks. Evaluations are scored either 0 for failures or 100 for complete successes. There are no partial credits. We report average success rates on 25 evaluation episodes per task (25 × 18 = 450 total episodes) for agents trained with 10 or 100 demonstrations per task. During evaluation, an agent keeps taking actions until an oracle indicates task completion or the agent reaches a maximum of 25 steps.
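The scoring rule can be sketched as a small loop. Everything about the interfaces here is hypothetical: `act` stands in for the agent, `step` for an environment whose second return value plays the role of the success oracle, and the integer "counter" environment exists only to make the example runnable.

```python
from typing import Callable, List, Tuple


def run_episode(
    act: Callable[[int], int],
    step: Callable[[int, int], Tuple[int, bool]],
    first_obs: int,
    max_steps: int = 25,  # step limit stated in the text
) -> int:
    """Return 100 if the oracle reports success within max_steps, else 0."""
    obs = first_obs
    for _ in range(max_steps):
        obs, done = step(obs, act(obs))
        if done:
            return 100
    return 0  # no partial credit


def success_rate(scores: List[int]) -> float:
    return sum(scores) / len(scores)


def make_counter_env(goal: int) -> Callable[[int, int], Tuple[int, bool]]:
    """Toy environment: the task succeeds once the counter reaches the goal."""
    def step(obs: int, action: int) -> Tuple[int, bool]:
        new_obs = obs + action
        return new_obs, new_obs >= goal
    return step


def always_increment(obs: int) -> int:
    return 1


scores = [
    run_episode(always_increment, make_counter_env(goal=3), first_obs=0),
    run_episode(always_increment, make_counter_env(goal=30), first_obs=0),
]
rate = success_rate(scores)
```

The second toy episode needs 30 steps and therefore times out at the 25-step limit, so it scores 0 even though the policy is making progress.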

4.2 Simulation Results

Table 1 reports success rates of multi-task agents trained on all 18 tasks. We could not investigate single-task agents due to resource constraints of training 18 individual agents.

Baseline Methods. We study the effectiveness of our problem formulation by benchmarking against two language-conditioned baselines: Image-BC and C2FARM-BC. Image-BC is an image-to-action agent similar to BC-Z [12]. Following BC-Z, we use FiLM [73] for conditioning with CLIP [68] language features, but the vision encoders take in RGB-D images instead of just RGB. We also study both CNN and ViT vision encoders. C2FARM-BC is a 3D fully-convolutional network by James et al. [14] that has achieved state-of-the-art results on RLBench tasks. Similar to our agent, C2FARM-BC also detects actions in a voxelized space, however it uses a coarse-to-fine-grain scheme to detect actions at two-levels of voxelization: voxels with a m grid, and voxels with a m grid after “zooming in” from the first level. Note that at the finest level, C2FARM-BC has a higher resolution (cm) than PerAct (cm). We use the same 3D ConvNet architecture as James et al. [14], but instead of training it with RL, we do BC with cross-entropy loss (from Section 3.4). We also condition it with CLIP [68] language features at the bottleneck like in LingUNets [74, 16].

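Method | open drawer | slide block | sweep to dustpan | meat off grill | turn tap | put in drawer | close jar | drag stick | stack blocks
# demos | 10 100 | 10 100 | 10 100 | 10 100 | 10 100 | 10 100 | 10 100 | 10 100 | 10 100
Image-BC (CNN) | 4 4 | 4 0 | 0 0 | 0 0 | 20 8 | 0 8 | 0 0 | 0 0 | 0 0
Image-BC (ViT) | 16 0 | 8 0 | 8 0 | 0 0 | 24 16 | 0 0 | 0 0 | 0 0 | 0 0
C2FARM-BC | 28 20 | 12 16 | 4 0 | 40 20 | 60 68 | 12 4 | 28 24 | 72 24 | 4 0
PerAct (w/o Lang) | 20 28 | 8 12 | 20 16 | 40 48 | 36 60 | 16 16 | 16 12 | 48 60 | 0 0
PerAct | 68 80 | 32 72 | 72 56 | 68 84 | 72 80 | 16 68 | 32 60 | 36 68 | 12 36

Method | screw bulb | put in safe | place wine | put in cupboard | sort shape | push buttons | insert peg | stack cups | place cups
# demos | 10 100 | 10 100 | 10 100 | 10 100 | 10 100 | 10 100 | 10 100 | 10 100 | 10 100
Image-BC (CNN) | 0 0 | 0 4 | 0 0 | 0 0 | 0 0 | 4 0 | 0 0 | 0 0 | 0 0
Image-BC (ViT) | 0 0 | 0 0 | 4 0 | 4 0 | 0 0 | 16 0 | 0 0 | 0 0 | 0 0
C2FARM-BC | 12 8 | 0 12 | 36 8 | 4 0 | 8 8 | 88 72 | 0 4 | 0 0 | 0 0
PerAct (w/o Lang) | 0 24 | 8 20 | 8 20 | 0 0 | 0 0 | 60 68 | 4 0 | 0 0 | 0 0
PerAct | 28 24 | 16 44 | 20 12 | 0 16 | 16 20 | 56 48 | 4 0 | 0 0 | 0 0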
Table 1: Multi-Task Test Results. Success rates (mean %) of various multi-task agents trained with either 10 or 100 demonstrations per task and evaluated on 25 episodes per task. Each evaluation episode is scored either a 0 for failure or 100 for success. PerAct outperforms C2FARM-BC [14], the most competitive baseline, with an average improvement of with 10 demos and with 100 demos.
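As a worked example of reading Table 1, the sketch below averages the PerAct and C2FARM-BC rows over all 18 tasks. The numbers are copied from the two rows of the table; the helper name `mean_by_demos` is hypothetical, and averaging over tasks is just one possible summary, not necessarily the statistic behind the improvement factors mentioned in the caption.

```python
from typing import Dict, List

# Values alternate between the 10-demo and 100-demo columns, in table order.
peract: List[int] = [
    68, 80, 32, 72, 72, 56, 68, 84, 72, 80, 16, 68, 32, 60, 36, 68, 12, 36,
    28, 24, 16, 44, 20, 12, 0, 16, 16, 20, 56, 48, 4, 0, 0, 0, 0, 0,
]
c2farm_bc: List[int] = [
    28, 20, 12, 16, 4, 0, 40, 20, 60, 68, 12, 4, 28, 24, 72, 24, 4, 0,
    12, 8, 0, 12, 36, 8, 4, 0, 8, 8, 88, 72, 0, 4, 0, 0, 0, 0,
]


def mean_by_demos(row: List[int]) -> Dict[int, float]:
    """Average success rate over tasks, separately for 10 and 100 demos."""
    ten, hundred = row[0::2], row[1::2]
    return {10: sum(ten) / len(ten), 100: sum(hundred) / len(hundred)}


peract_mean = mean_by_demos(peract)
c2farm_mean = mean_by_demos(c2farm_bc)
```

Comparing the two dictionaries shows PerAct's task-averaged success rising with more demonstrations while C2FARM-BC's falls, which matches the observation in the text that C2FARM-BC sometimes performs worse with more data.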

Multi-Task Performance. Table 1 compares the performance of Image-BC and C2FARM-BC against PerAct. With insufficient demonstrations, Image-BC has near zero performance on most tasks. Image-BC is disadvantaged with single-view observations and has to learn hand-eye coordination from scratch. In contrast, PerAct’s voxel-based formulation naturally allows for integrating multi-view observations, learning 6-DoF action representations, and data-augmentation in 3D, all of which are non-trivial to achieve in image-based methods. C2FARM-BC is the most competitive baseline, but it has a limited receptive field mostly because of the coarse-to-fine-grain scheme and partly due to the convolution-only architecture. PerAct outperforms C2FARM-BC in  evaluations in Table 1 with an average improvement of with 10 demonstrations and with 100 demonstrations. For a number of tasks, C2FARM-BC actually performs worse with more demonstrations, likely due to insufficient capacity. Since additional training demonstrations include additional task variants to optimize for, they might end up hurting performance.

In general, 10 demonstrations are sufficient for PerAct to achieve success on tasks with limited variations like open drawer (3 variations). But tasks with more variations like stack blocks (60 variations) need substantially more data, sometimes to simply cover all possible concepts like “teal color block” that might have not appeared in the training data. See the simulation rollouts in the supplementary video to get a sense of the complexity of these evaluations. For three tasks: insert peg, stack cups, and place cups, all agents achieve near zero success. These are very high-precision tasks where being off by a few centimeters or degrees could lead to unrecoverable failures. But in Appendix H we find that training single-task agents, specifically for these tasks, slightly alleviates this issue.

Figure 3: Ablation Experiments. Success rate of PerAct after ablating key components.

Ablations. Table 1 reports PerAct w/o Lang, an agent without any language conditioning. Without a language goal, the agent does not know the underlying task and performs at chance. We also report additional ablation results on the open drawer task in Figure 3. To summarize these results: (1) the skip connection helps train the agent slightly faster, (2) the Perceiver Transformer is crucial for achieving good performance with the global receptive field, and (3) extracting good keyframe actions is essential for supervised training, as randomly chosen or fixed-interval keyframes lead to zero performance.

Figure 4: Global vs. Local Receptive Field Experiments. Success rates of PerAct against various C2FARM-BC [14] baselines.

Sensitivity Analysis. In Appendix G we investigate factors that affect PerAct’s performance: the number of Perceiver latents, voxelization resolution, and data augmentation. We find that more latent vectors generally improve the capacity of the agent to model more tasks, but for simple short-horizon tasks, fewer latents are sufficient. Similarly, with different voxelization resolutions, some tasks are solvable with coarse voxel grids like , but some high-precision tasks require the full grid. Finally, rotation perturbations in the data augmentation generally help in improving robustness essentially by exposing the agent to more rotation variations of objects.

Figure 5: Q-Prediction Examples: Qualitative examples of translation Q-predictions from PerAct along with expert actions, highlighted with dotted circles. The left two are simulated tasks, and the right two are real-world tasks. See Appendix J for more examples.

4.3 Global vs. Local Receptive Fields

To further investigate our Transformer agent’s global receptive field, we conduct additional experiments on the open drawer task. The open drawer task has three variants: “open the top drawer”, “open the middle drawer”, and “open the bottom drawer”, and with a limited receptive field it is hard to distinguish the drawer handles, which are all visually identical. Figure 4 reports PerAct and C2FARM-BC agents trained with 100 demonstrations. Although the open drawer tasks can be solved with fewer demonstrations, here we want to ensure that insufficient data is not an issue. We include several versions of C2FARM-BC with different voxelization schemes. For instance, indicates two levels of voxel grids at and , respectively. And indicates a single level of a voxel grid without the coarse-to-fine-grain scheme. PerAct is the only agent that achieves success, whereas all C2FARM-BC versions perform at chance with , indicating that the global receptive field of the Transformer is crucial for solving the task.

Task | # Train | # Test | Succ. %
Press Handsan | 5 | 10 | 90
Put Marker | 8 | 10 | 70
Place Food | 8 | 10 | 60
Put in Drawer | 8 | 10 | 40
Hit Ball | 8 | 10 | 60
Stack Blocks | 10 | 10 | 40
Sweep Beans | 8 | 5 | 20
Table 2: Success rates (mean %) of a multi-task model trained and evaluated on 7 real-world tasks (see Figure 1).

4.4 Real-Robot Results

We also validated our results with real-robot experiments on a Franka Emika Panda. See Appendix D for setup details. Without any sim-to-real transfer or pre-training, we trained a multi-task PerAct agent from scratch on 7 tasks (with 18 unique variations) from a total of just 53 demonstrations. See the supplementary video for qualitative results that showcase the diversity of tasks and robustness to scene changes. Table 2 reports success rates from small-scale evaluations. Similar to the simulation results, we find that PerAct is able to achieve success on simple short-horizon tasks like pressing hand-sanitizers from just a handful of demonstrations. The most common failures involved predicting incorrect gripper open actions, which often led the agent into unseen states. This could be addressed in future work by using HG-DAgger style approaches to correct the agent [12]. Other issues included the agent exploiting biases in the dataset, as in prior work [16]. This could be addressed by scaling up expert data with more diverse tasks and task variants.

5 Limitations and Conclusion

We presented PerAct, a Transformer-based multi-task agent for 6-DoF manipulation. Our experiments with both simulated and real-world tasks indicate that the right problem formulation, i.e., detecting voxel actions, makes a substantial difference in terms of data efficiency and robustness.

While PerAct is quite capable, extending it to dexterous continuous control remains a challenge. PerAct is at the mercy of a sampling-based motion-planner to execute discretized actions, and is not easily extendable to N-DoF actuators like multi-fingered hands. See Appendix L for an extended discussion on PerAct’s limitations. But overall, we are excited about scaling up robot learning with Transformers by focusing on diverse rather than narrow multi-task data for robotic manipulation.

We thank Selest Nashef and Karthik Desingh for their help with the Franka setup at UW. We thank Stephen James for helping with RLBench and ARM issues. We are also grateful to Zoey Chen, Markus Grotz, Aaron Walsman, and Kevin Zakka, for providing feedback on the initial draft. This work was funded in part by ONR under award #1140209-405780. Mohit Shridhar is supported by the NVIDIA Graduate Fellowship, and was also a part-time intern at NVIDIA throughout the duration of this project.


Appendix A Task Details

Task | Variation Type | # of Variations | Avg. Keyframes | Language Template
open drawer | placement | 3 | 3.0 | “open the ___ drawer”
slide block | color | 4 | 4.7 | “slide the block to ___ target”
sweep to dustpan | size | 2 | 4.6 | “sweep dirt to the ___ dustpan”
meat off grill | category | 2 | 5.0 | “take the ___ off the grill”
turn tap | placement | 2 | 2.0 | “turn ___ tap”
put in drawer | placement | 3 | 12.0 | “put the item in the ___ drawer”
close jar | color | 20 | 6.0 | “close the ___ jar”
drag stick | color | 20 | 6.0 | “use the stick to drag the cube onto the ___ target”
stack blocks | color, count | 60 | 14.6 | “stack ___ ___ blocks”
screw bulb | color | 20 | 7.0 | “screw in the ___ light bulb”
put in safe | placement | 3 | 5.0 | “put the money away in the safe on the ___ shelf”
place wine | placement | 3 | 5.0 | “stack the wine bottle to the ___ of the rack”
put in cupboard | category | 9 | 5.0 | “put the ___ in the cupboard”
sort shape | shape | 5 | 5.0 | “put the ___ in the shape sorter”
push buttons | color | 50 | 3.8 | “push the ___ button, [then the ___ button]”
insert peg | color | 20 | 5.0 | “put the ring on the ___ spoke”
stack cups | color | 20 | 10.0 | “stack the other cups on top of the ___ cup”
place cups | count | 3 | 11.5 | “place ___ cups on the cup holder”
Table 3: Language-Conditioned Tasks in RLBench [15].

Setup. Our simulated experiments are set in RLBench [15]. We select 18 out of 100 tasks that involve at least two variations to evaluate the multi-task capabilities of agents. While PerAct could be easily applied to more RLBench tasks, in our experiments we were specifically interested in grounding diverse language instructions, rather than learning one-off policies for single-variation tasks like “[always] take off the saucepan lid”. Some tasks were modified to include additional variations. See Table 3 for an overview. We report the average number of keyframes extracted with the method described in Section 3.2.

Variations. Task variations include randomly sampled colors, sizes, shapes, counts, placements, and categories of objects. The set of colors includes 20 instances: colors = red, maroon, lime, green, blue, navy, yellow, cyan, magenta, silver, gray, orange, olive, purple, teal, azure, violet, rose, black, white. The set of sizes includes 2 instances: sizes = short, tall. The set of shapes includes 5 instances: shapes = cube, cylinder, triangle, star, moon. The set of counts includes 3 instances: counts = 1, 2, 3. The placements and object categories are specific to each task. For instance, open drawer has 3 placement locations: top, middle, and bottom, and put in cupboard includes 9 YCB objects. In addition to these semantic variations, objects are placed on the tabletop at random poses. Some large objects like drawers have constrained pose variations [15] to ensure that manipulating them is kinematically feasible with the Franka arm.
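To make the combinatorics concrete, the sketch below enumerates the stack blocks variations, whose count in Table 3 (60) matches 20 colors times 3 counts. The template string and the order of its two slots are assumptions made for illustration; this is not the RLBench implementation.

```python
from itertools import product

COLORS = ["red", "maroon", "lime", "green", "blue", "navy", "yellow", "cyan",
          "magenta", "silver", "gray", "orange", "olive", "purple", "teal",
          "azure", "violet", "rose", "black", "white"]
COUNTS = [1, 2, 3]

# Hypothetical template: the slot order (count, then color) is an assumption.
STACK_BLOCKS_TEMPLATE = "stack {count} {color} blocks"


def stack_blocks_variations():
    """Enumerate every (color, count) variation together with its language goal."""
    return [
        {"color": color, "count": count,
         "goal": STACK_BLOCKS_TEMPLATE.format(count=count, color=color)}
        for color, count in product(COLORS, COUNTS)
    ]


variations = stack_blocks_variations()
print(len(variations))        # 60
print(variations[0]["goal"])  # stack 1 red blocks
```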

In the following sections, we describe each of the 18 tasks in detail. We highlight tasks that were modified from the original RLBench [15] codebase and describe what exactly was modified.

a.1 Open Drawer


Task: Open one of the three drawers: top, middle, or bottom.

Modified: No.

Objects: 1 drawer.

Success Metric: The prismatic joint of the specified drawer is fully extended.

a.2 Slide Block


Task: Slide the block on to one of the colored square targets. The target colors are limited to red, blue, pink, and yellow.

Modified: Yes. The original task contained only one target. Three other targets were added to make a total of 4 variations.

Objects: 1 block and 4 colored target squares.

Success Metric: Some part of the block is inside the specified target area.

a.3 Sweep to Dustpan


Task: Sweep the dirt particles to either the short or tall dustpan.

Modified: Yes. The original task contained only one dustpan. One other dustpan was added to make a total of 2 variations.

Objects: 5 dirt particles and 2 dustpans.

Success Metric: All 5 dirt particles are inside the specified dustpan.

a.4 Meat Off Grill


Task: Take either the chicken or steak off the grill and put it on the side.

Modified: No.

Objects: 1 piece of chicken, 1 piece of steak, and 1 grill.

Success Metric: The specified meat is on the side, away from the grill.

a.5 Turn Tap


Task: Turn either the left or right handle of the tap. Left and right are defined with respect to the faucet orientation.

Modified: No.

Objects: 1 faucet with 2 handles.

Success Metric: The revolute joint of the specified handle is at least 90° off from the starting position.

a.6 Put in Drawer


Task: Put the block in one of the three drawers: top, middle, or bottom.

Modified: No.

Objects: 1 block and 1 drawer.

Success Metric: The block is inside the specified drawer.

a.7 Close Jar


Task: Put the lid on the jar with the specified color and screw the lid in. The jar colors are sampled from the full set of 20 color instances.

Modified: No.

Objects: 1 block and 2 colored jars.

Success Metric: The lid is on top of the specified jar and the Franka gripper is not grasping anything.

a.8 Drag Stick


Task: Grab the stick and use it to drag the cube on to the specified colored target square. The target colors are sampled from the full set of 20 color instances.

Modified: Yes. The original task contained only one target. Three other targets were added with randomized colors.

Objects: 1 block, 1 stick, and 4 colored target squares.

Success Metric: Some part of the block is inside the specified target area.

a.9 Stack Blocks


Task: Stack blocks of the specified color on the green platform. There are always 4 blocks of the specified color, and 4 distractor blocks of another color. The block colors are sampled from the full set of 20 color instances.

Modified: No.

Objects: 8 color blocks (4 are distractors), and 1 green platform.

Success Metric: The specified number of blocks are inside the area of the green platform.

a.10 Screw Bulb


Task: Pick up the light bulb from the specified holder, and screw it into the lamp stand. The holder colors are sampled from the full set of 20 color instances. There are always two holders in the scene: one specified holder and one distractor holder.

Modified: No.

Objects: 2 light bulbs, 2 holders, and 1 lamp stand.

Success Metric: The bulb from the specified holder is inside the lamp stand dock.

a.11 Put in Safe


Task: Pick up the stack of money and put it inside the safe on the specified shelf. The shelf has three placement locations: top, middle, bottom.

Modified: No.

Objects: 1 stack of money, and 1 safe.

Success Metric: The stack of money is on the specified shelf inside the safe.

a.12 Place Wine


Task: Grab the wine bottle and put it on the wooden rack at one of the three specified locations: left, middle, right. The locations are defined with respect to the orientation of the wooden rack.

Modified: Yes. The original task had only one placement location. Two other locations were added to make a total of 3 variations.

Objects: 1 wine bottle, and 1 wooden rack.

Success Metric: The wine bottle is at the specified placement location on the wooden rack.

a.13 Put in Cupboard


Task: Grab the specified object and put it in the cupboard above. The scene always contains 9 YCB objects that are randomly placed on the tabletop.

Modified: No.

Objects: 9 YCB objects, and 1 cupboard (that hovers in the air like magic).

Success Metric: The specified object is inside the cupboard.

a.14 Sort Shape


Task: Pick up the specified shape and place it inside the correct hole in the sorter. There are always 4 distractor shapes, and 1 correct shape in the scene.

Modified: Yes. The sizes of the shapes and sorter were enlarged so that they are distinguishable in the RGB-D input.

Objects: 5 shapes, and 1 sorter.

Success Metric: The specified shape is inside the sorter.

a.15 Push Buttons


Task: Push the colored buttons in the specified sequence. The button colors are sampled from the full set of 20 color instances. There are always three buttons in the scene.

Modified: No.

Objects: 3 buttons.

Success Metric: All the specified buttons were pressed.

a.16 Insert Peg


Task: Pick up the square and put it on the specified color spoke. The spoke colors are sampled from the full set of 20 color instances.

Modified: No.

Objects: 1 square, and 1 spoke platform with three color spokes.

Success Metric: The square is on the specified spoke.

a.17 Stack Cups


Task: Stack all cups on top of the specified color cup. The cup colors are sampled from the full set of 20 color instances. The scene always contains three cups.

Modified: No.

Objects: 3 tall cups.

Success Metric: All other cups are inside the specified cup.

a.18 Place Cups


Task: Place cups on the cup holder. This is a very high precision task where the handle of the cup has to be exactly aligned with the spoke of the cup holder for the placement to succeed.

Modified: No.

Objects: 3 cups with handles, and 1 cup holder with three spokes.

Success Metric: The specified number of cups are on the cup holder, each on a separate spoke.

Appendix B PerAct Details

In this section, we provide implementation details for PerAct. See this Colab tutorial for a PyTorch implementation.

Input Observation. Following James et al. [14], our input voxel observation is a $100^3$ voxel grid with 10 channels: $\mathbb{R}^{100 \times 100 \times 100 \times 10}$. The grid is constructed by fusing calibrated pointclouds with PyTorch’s scatter_ function. The 10 channels are composed of: 3 RGB, 3 point, 1 occupancy, and 3 position index values. The RGB values are normalized to a zero-mean distribution. The point values are Cartesian coordinates in the robot’s coordinate frame. The occupancy value indicates if a voxel is occupied or empty. The position index values represent the 3D location of the voxel with respect to the grid. In addition to the voxel observation, the input also includes proprioception data with 4 scalar values: gripper open, left finger joint position, right finger joint position, and timestep (of the action sequence).
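The fusion step can be illustrated with a minimal pure-Python sketch. It is not the scatter_-based implementation: it averages the colors and coordinates of all points that fall into the same voxel, which is an assumed pooling rule, and the function name, workspace bounds, and grid size are hypothetical. RGB normalization is omitted.

```python
def voxelize(points, colors, bounds_min, bounds_max, grid_side):
    """Fuse a colored point cloud into a sparse voxel grid.

    Returns a dict mapping (i, j, k) voxel indices to the mean RGB, the mean
    point coordinate, an occupancy flag, and the voxel's position index.
    Voxels that are absent from the dict are empty (occupancy 0).
    """
    sums = {}
    for xyz, rgb in zip(points, colors):
        idx = []
        for value, lo, hi in zip(xyz, bounds_min, bounds_max):
            if not lo <= value < hi:
                break  # the point lies outside the workspace bounds
            cell = int((value - lo) / (hi - lo) * grid_side)
            idx.append(min(cell, grid_side - 1))  # guard against rounding
        else:
            acc = sums.setdefault(tuple(idx), [0, [0.0] * 3, [0.0] * 3])
            acc[0] += 1
            acc[1] = [a + c for a, c in zip(acc[1], rgb)]
            acc[2] = [a + p for a, p in zip(acc[2], xyz)]
    return {
        idx: {"rgb": [c / n for c in rgb_sum],
              "point": [p / n for p in xyz_sum],
              "occupancy": 1.0,
              "index": idx}
        for idx, (n, rgb_sum, xyz_sum) in sums.items()
    }


grid = voxelize(
    points=[(0.15, 0.15, 0.15), (0.17, 0.15, 0.15),
            (0.95, 0.95, 0.95), (2.0, 0.0, 0.0)],
    colors=[(1.0, 0.0, 0.0), (0.0, 0.0, 1.0),
            (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)],
    bounds_min=(0.0, 0.0, 0.0), bounds_max=(1.0, 1.0, 1.0), grid_side=10)
print(sorted(grid))            # [(1, 1, 1), (9, 9, 9)]
print(grid[(1, 1, 1)]["rgb"])  # [0.5, 0.0, 0.5]
```

The first two points share a voxel and are averaged, while the last point is dropped because it falls outside the bounds.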

Input Language. The language goals are encoded with CLIP’s language encoder [68]. We use CLIP’s tokenizer to preprocess the sentence, which always results in an input sequence of 77 tokens (with zero-padding). These tokens are encoded with the language encoder to produce a sequence of dimensions $\mathbb{R}^{77 \times 512}$.

Preprocessing. The voxel grid is encoded with a 3D convolution layer with a $1 \times 1$ kernel to upsample the channel dimension from 10 to 64. Similarly, the proprioception data is encoded with a linear layer to upsample the input dimension from 4 to 64. The encoded voxel grid is split into $5^3$ patches through a 3D convolution layer with a kernel-size and stride of 5, which results in a patch tensor of dimensions $\mathbb{R}^{20 \times 20 \times 20 \times 64}$. The proprioception features are tiled in 3D to match the dimensions of the patch tensor, and concatenated along the channel dimension to form a tensor of dimensions $\mathbb{R}^{20 \times 20 \times 20 \times 128}$. This tensor is flattened into a sequence of dimensions $\mathbb{R}^{8000 \times 128}$. The language features are downsampled with a linear layer from 512 to 128 dimensions, and then appended to the tensor to form the final input sequence to the Perceiver Transformer, which is of dimensions $\mathbb{R}^{8077 \times 128}$. We also add learned positional embeddings to the input sequence. These embeddings are represented with trainable nn.Parameter(s) in PyTorch.
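The shape bookkeeping of this stage can be checked with a few lines of arithmetic. The sketch below assumes a cubic grid with 100 voxels per side and 77 language tokens for illustration; only the patch size of 5 is taken directly from the description above, and the function name is hypothetical.

```python
def perceiver_input_length(grid_side, patch_size, num_language_tokens):
    """Return (patches per axis, number of patches, input sequence length)."""
    if grid_side % patch_size != 0:
        raise ValueError("grid_side must be divisible by patch_size")
    patches_per_axis = grid_side // patch_size
    num_patches = patches_per_axis ** 3
    # Language tokens are appended to the flattened patch sequence.
    return patches_per_axis, num_patches, num_patches + num_language_tokens


print(perceiver_input_length(grid_side=100, patch_size=5,
                             num_language_tokens=77))  # (20, 8000, 8077)
```

A sequence of this length is what motivates a latent-space model: full self-attention over it would need a pairwise matrix with tens of millions of entries per layer.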

Figure 6: Perceiver Transformer Architecture. Perceiver is a latent-space transformer. Q, K, V represent queries, keys, and values, respectively. We use 6 self-attention layers in our implementation.

Perceiver Transformer is a latent-space Transformer [1] that uses a small set of latent vectors to encode extremely long input sequences. See Figure 6 for an illustration of this process. Perceiver first computes cross-attention between the input sequence and the set of latent vectors of dimensions $\mathbb{R}^{2048 \times 512}$. These latents are randomly initialized and trained end-to-end. The latents are encoded with 6 self-attention layers, and then cross-attended with the input to output a sequence that matches the input dimensions. This output is upsampled with a 3D convolution layer and tri-linear upsampling to form a voxel feature grid with 64 channels: $\mathbb{R}^{100 \times 100 \times 100 \times 64}$. This feature grid is concatenated with the initial 64-dimensional feature grid from the preprocessing stage as a skip connection to the encoding layers. Finally, a 3D convolution layer with a $1 \times 1$ kernel downsamples the channels from 128 back to 64 dimensions. Our implementation of Perceiver is based on an existing open-source repository.
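The encode, process, decode pattern described here can be sketched without any deep-learning library. The example below is a deliberately simplified illustration: it uses single-head attention with no learned query, key, or value projections, no residual connections, and no feed-forward layers, so it only demonstrates how the sequence lengths change. The function names are hypothetical.

```python
import math


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def perceiver_pass(inputs, latents, num_self_attention_layers=6):
    """Latents read from the inputs, mix among themselves, then write back."""
    latents = attend(latents, inputs, inputs)        # encode: cross-attention
    for _ in range(num_self_attention_layers):
        latents = attend(latents, latents, latents)  # latent self-attention
    return attend(inputs, latents, latents)          # decode: cross-attention


inputs = [[float(i + d) for d in range(4)] for i in range(10)]
latents = [[0.1 * d for d in range(4)] for _ in range(3)]
outputs = perceiver_pass(inputs, latents)
print(len(outputs), len(outputs[0]))  # 10 4
```

The self-attention cost depends only on the number of latents (3 here), while the two cross-attention steps scale linearly with the input length.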

Decoding. For translation, the voxel feature grid is decoded with a 3D convolution layer with a $1 \times 1$ kernel to downsample the channel dimension from 64 to 1. This tensor is the translation Q-function of dimensions $\mathbb{R}^{100 \times 100 \times 100 \times 1}$. For rotation, gripper open, and collision avoidance actions, the voxel feature grid is max-pooled along the 3D dimensions to form a vector of dimensions $\mathbb{R}^{1 \times 64}$. This vector is decoded with three independent linear layers to form the respective Q-functions for rotation, gripper open, and collision avoidance. The rotation linear layer outputs logits of dimensions $\mathbb{R}^{216}$ (72 bins of 5 degree increments for each of the three axes). The gripper open and collide linear layers output logits of dimensions $\mathbb{R}^{2}$.
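As a rough illustration of how such discretized outputs map back to an action, the sketch below takes the argmax of a flattened translation grid and of three per-axis rotation blocks. The row-major memory layout, the use of voxel centers, and the function names are assumptions for this example rather than details from the implementation.

```python
def argmax(values):
    """Index of the first maximum in a list."""
    return max(range(len(values)), key=values.__getitem__)


def decode_translation(q_trans, grid_side, bounds_min, bounds_max):
    """q_trans: flat list of grid_side**3 scores in row-major (x, y, z) order."""
    flat = argmax(q_trans)
    i, rest = divmod(flat, grid_side * grid_side)
    j, k = divmod(rest, grid_side)
    sizes = [(hi - lo) / grid_side for lo, hi in zip(bounds_min, bounds_max)]
    # Report the center of the selected voxel in workspace coordinates.
    center = [lo + (n + 0.5) * s
              for lo, n, s in zip(bounds_min, (i, j, k), sizes)]
    return (i, j, k), center


def decode_rotation(q_rot, resolution_deg=5):
    """q_rot: three concatenated blocks of 360 / resolution_deg logits."""
    bins = 360 // resolution_deg
    return [argmax(q_rot[axis * bins:(axis + 1) * bins]) * resolution_deg
            for axis in range(3)]


q_trans = [0.0] * 4 ** 3
q_trans[27] = 1.0  # flat index of voxel (1, 2, 3) in a 4 x 4 x 4 grid
print(decode_translation(q_trans, 4, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)))
# ((1, 2, 3), [0.375, 0.625, 0.875])

q_rot = [0.0] * 216
q_rot[18] = q_rot[72 + 36] = q_rot[144 + 71] = 1.0
print(decode_rotation(q_rot))  # [90, 180, 355]
```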

Our codebase is built on the ARM repository by James et al. [14].

Appendix C Evaluation Workflow

c.1 Simulation

Simulated experiments in Section 4.2 follow a four-phase workflow: (1) Generate a dataset with train, validation, and test sets, each containing 100, 25, and 25 demonstrations, respectively. (2) Train an agent on the train set and save checkpoints at intervals of 10K iterations. (3) Evaluate all saved checkpoints on the validation set, and mark the best performing checkpoint. (4) Evaluate the best performing checkpoint on the test set. While this workflow follows a standard train-val-test paradigm from supervised learning, it is not the most feasible workflow for real-robot settings. With real robots, collecting a validation set and evaluating all checkpoints could be very expensive.
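Phases (3) and (4) amount to a small model-selection loop, sketched below. The `evaluate` callable and the success rates in the toy lookup table are hypothetical placeholders chosen to exercise the code; they are not results from the paper.

```python
def select_and_test(checkpoints, evaluate):
    """Pick the checkpoint with the best validation score, then test it once.

    `evaluate(checkpoint, split)` is a hypothetical callable that returns a
    success rate for the given checkpoint on the "val" or "test" split.
    """
    val_scores = {ckpt: evaluate(ckpt, "val") for ckpt in checkpoints}
    best = max(val_scores, key=val_scores.get)
    return best, val_scores[best], evaluate(best, "test")


# Toy numbers for illustration only.
toy_scores = {
    (10000, "val"): 0.32, (10000, "test"): 0.30,
    (20000, "val"): 0.48, (20000, "test"): 0.44,
    (30000, "val"): 0.44, (30000, "test"): 0.46,
}
result = select_and_test([10000, 20000, 30000],
                         lambda ckpt, split: toy_scores[(ckpt, split)])
print(result)  # (20000, 0.48, 0.44)
```

The test split is touched only once, for the selected checkpoint, so the reported number is not biased by the selection itself.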

c.2 Real-Robot

For real-robot experiments in Section 4.4, we simply pick the last checkpoint from training. We check if the agent has been sufficiently trained by visualizing Q-predictions on training examples with swapped or modified language goals. While evaluating a trained agent, the agent keeps acting until a human user stops the execution. We also visualize the Q-predictions live to ensure that the agent’s upcoming action is safe to execute.

Appendix D Robot Setup

d.1 Simulation

Figure 7: Simulated Setup. The four camera setup: front, left shoulder, right shoulder, and on the wrist.

All simulated experiments use the four camera setup illustrated in Figure 7. The front, left shoulder, and right shoulder cameras are static, but the wrist camera moves with the end-effector. We did not modify the default camera poses from RLBench [15]. These poses maximize coverage of the tabletop, while minimizing occlusions caused by the moving arm. The wrist camera in particular is able to provide high-resolution observations of small objects like handles.

d.2 Real-Robot

Hardware Setup. The real-robot experiments use a Franka Panda manipulator with a parallel-gripper. For perception, we use a Kinect-2 RGB-D camera mounted on a tripod, at an angle, pointing towards the tabletop. See Figure 8 for reference. We tried setting up multiple Kinects for multi-view observations, but we could not fix the interference issue caused by multiple Time-of-Flight sensors. The Kinect-2 provides RGB-D images of resolution $512 \times 424$ at 30Hz. The extrinsics between the camera and robot base-frame are calibrated with the easy_handeye package. We use an ARUCO AR marker mounted on the gripper to aid the calibration process.

Figure 8: Real-Robot Setup with Kinect-2 and Franka Panda.

Data Collection. We collect demonstrations with an HTC Vive controller. The controller is a 6-DoF tracker that provides accurate poses with respect to a static base-station. These poses are displayed as a marker on RViz along with the real-time RGB-D pointcloud from the Kinect-2. A user specifies target poses by using the marker and pointcloud as reference. These target poses are executed with a motion-planner. We use Franka ROS and MoveIt, which by default uses an RRT-Connect planner.

Training and Execution. We train a PerAct agent from scratch with 53 demonstrations. The training samples are augmented with ±0.125 m translation perturbations and ±45° yaw rotation perturbations. We train on 8 NVIDIA P100 GPUs for 2 days. During evaluation, we simply chose the last checkpoint from training (since we did not collect a validation set for optimization). Inference is done on a single Titan X GPU.

Appendix E Data Augmentation

PerAct’s voxel-based formulation naturally allows for data augmentation with SE(3) transformations. During training, samples of voxelized observations and their corresponding keyframe actions are perturbed with random translations and rotations. Translation perturbations have a range of ±0.125 m along each axis. Rotation perturbations are limited to the yaw axis and have a range of ±45°. The 45° limit ensures that the perturbed rotations do not go beyond what is kinematically reachable for the Franka arm. We did experiment with pitch and roll perturbations, but they substantially lengthened the training time. Any perturbation that pushed the discretized action outside the observation voxel grid was discarded. See the bottom row of Figure 10 for examples of data augmentation.
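A minimal sketch of this augmentation is shown below. It applies one shared rigid transform (yaw about the vertical axis, then translation) to the observation points and the keyframe action, and discards the sample if the action leaves the workspace. The choice of the workspace center as the rotation pivot, the default ranges of 0.125 m and 45 degrees, and the function names are assumptions made for illustration.

```python
import math
import random


def rigid_transform(point, center, shift, yaw_rad):
    """Rotate `point` about the vertical axis through `center`, then translate."""
    dx, dy, dz = (p - c for p, c in zip(point, center))
    cos_y, sin_y = math.cos(yaw_rad), math.sin(yaw_rad)
    rotated = (cos_y * dx - sin_y * dy, sin_y * dx + cos_y * dy, dz)
    return tuple(r + c + s for r, c, s in zip(rotated, center, shift))


def augment_sample(points, action_xyz, action_yaw_deg, bounds_min, bounds_max,
                   max_shift=0.125, max_yaw_deg=45.0, rng=random):
    """Perturb an observation and its action together; None means discarded."""
    center = tuple((lo + hi) / 2 for lo, hi in zip(bounds_min, bounds_max))
    shift = tuple(rng.uniform(-max_shift, max_shift) for _ in range(3))
    yaw_deg = rng.uniform(-max_yaw_deg, max_yaw_deg)
    yaw_rad = math.radians(yaw_deg)
    new_xyz = rigid_transform(action_xyz, center, shift, yaw_rad)
    inside = all(lo <= v < hi
                 for v, lo, hi in zip(new_xyz, bounds_min, bounds_max))
    if not inside:
        return None  # the action would fall outside the voxel grid
    new_points = [rigid_transform(p, center, shift, yaw_rad) for p in points]
    return new_points, new_xyz, (action_yaw_deg + yaw_deg) % 360.0
```

Because the same transform is applied to the points and the action, their relative geometry is preserved, which is what makes the perturbed pair a valid training sample.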

Appendix F Demo Augmentation

Figure 9: Keyframes and Demo Augmentation.

Following James et al. [15], we cast every datapoint in a demonstration as a “predict the next (best) keyframe action” task. See Figure 9 for an illustration of this process. The illustration shows two keyframes that were extracted with the method described in Section 3.2. The orange circles indicate datapoints whose RGB-D observations are paired with the next keyframe action.
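The pairing can be written as a short loop. The sketch below assumes keyframes are given as timestep indices and pairs each timestep with the first keyframe strictly after it; the function name and the toy demonstration length are hypothetical.

```python
def next_keyframe_pairs(num_steps, keyframes):
    """Pair every timestep t with the first keyframe index strictly after t."""
    ordered = sorted(keyframes)
    pairs = []
    for t in range(num_steps):
        upcoming = [k for k in ordered if k > t]
        if upcoming:  # steps at or after the last keyframe have no target
            pairs.append((t, upcoming[0]))
    return pairs


print(next_keyframe_pairs(num_steps=6, keyframes=[2, 5]))
# [(0, 2), (1, 2), (2, 5), (3, 5), (4, 5)]
```

Each pair supplies one training sample: the observation at the first index and the action recorded at the second.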

Appendix G Sensitivity Analysis

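Agent | open drawer | slide block | sweep to dustpan | meat off grill | turn tap | put in drawer | close jar | drag stick | stack blocks
PerAct | 80 | 72 | 56 | 84 | 80 | 68 | 60 | 68 | 36
PerAct w/o Rot Aug | 92 | 72 | 56 | 92 | 96 | 60 | 56 | 100 | 8
PerAct  latents | 84 | 88 | 44 | 68 | 84 | 48 | 48 | 84 | 12
PerAct  latents | 84 | 48 | 52 | 84 | 84 | 52 | 32 | 92 | 12
PerAct  latents | 92 | 84 | 48 | 100 | 92 | 32 | 32 | 100 | 20
PerAct  voxels | 88 | 72 | 80 | 60 | 84 | 36 | 40 | 84 | 32
PerAct  voxels | 28 | 44 | 100 | 60 | 72 | 24 | 0 | 24 | 0
PerAct  patches | 72 | 48 | 96 | 92 | 76 | 76 | 36 | 96 | 32
PerAct  patches | 68 | 64 | 56 | 52 | 96 | 56 | 36 | 92 | 20

Agent | screw bulb | put in safe | place wine | put in cupboard | sort shape | push buttons | insert peg | stack cups | place cups
PerAct | 24 | 44 | 12 | 16 | 20 | 48 | 0 | 0 | 0
PerAct w/o Rot Aug | 20 | 32 | 48 | 8 | 8 | 56 | 8 | 4 | 0
PerAct  latents | 32 | 44 | 52 | 8 | 12 | 72 | 4 | 4 | 0
PerAct  latents | 24 | 32 | 36 | 8 | 20 | 40 | 8 | 4 | 0
PerAct  latents | 48 | 40 | 36 | 24 | 16 | 32 | 12 | 0 | 4
PerAct  voxels | 24 | 48 | 44 | 12 | 4 | 32 | 0 | 4 | 0
PerAct  voxels | 12 | 20 | 52 | 0 | 0 | 60 | 0 | 0 | 0
PerAct  patches | 8 | 48 | 76 | 0 | 12 | 16 | 0 | 0 | 0
PerAct  patches | 12 | 36 | 72 | 12 | 0 | 20 | 0 | 0 | 0
Table 4: Sensitivity Analysis. Success rates (mean %) of various PerAct agents trained with 100 demonstrations per task. We investigate three factors that affect PerAct’s performance: rotation augmentation, number of Perceiver latents, and voxel resolution.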

In Table 4, we investigate three factors that affect PerAct’s performance: rotation data augmentation, number of Perceiver latents, and voxelization resolution. All multi-task agents were trained with 100 demonstrations per task and evaluated on 25 episodes per task. To briefly summarize these results: (1) yaw perturbations improve performance on tasks with lots of rotation variations like stack blocks, but also worsen performance on tasks with constrained rotations like place wine. (2) PerAct with just latents is competitive with (and sometimes even better than) the default agent with latents, which showcases the compression capability of the Perceiver architecture. (3) Coarse grids like are sufficient for some tasks, but high-precision tasks like sort shape need higher resolution voxelization. (4) Large patch-sizes reduce memory usage, but they might affect tasks that need sub-patch precision.

Appendix H High-Precision Tasks

Task | Multi | Single
place cups | 0 | 24
stack cups | 0 | 32
insert peg | 0 | 16
Table 5: Success rates (mean %) of multi-task and single-task PerAct agents trained with 100 demos and evaluated on 25 episodes.

In Table 1, PerAct achieves zero performance on three high-precision tasks: place cups, stack cups, and insert peg. To investigate if multi-task optimization is itself one of the factors affecting performance, we train 3 separate single-task agents, one for each task. We find that single-task agents are able to achieve non-zero performance, indicating that better multi-task optimization methods might improve performance on certain tasks.

Appendix I Additional Related Work

In this section, we briefly discuss additional works that were not mentioned in Section 2.

Concurrent Work. Recently, Mandi et al. [75] found that pre-training and fine-tuning on new tasks is competitive with, or even better than, meta-learning approaches for RLBench tasks in multi-task (but single-variation) settings. This pre-training and fine-tuning paradigm might be directly applicable to PerAct, where a pre-trained PerAct agent could be quickly adapted to new tasks without the explicit use of meta-learning algorithms.

Multi-Task Learning. In the context of RLBench, Auto-λ [65] presents a multi-task optimization framework that goes beyond the uniform task weighting from Section 3.4. The method dynamically tunes task weights based on the validation loss. Future works with PerAct could replace uniform task weighting with Auto-λ for better multi-task performance. In the context of Meta-World [48], Sodhani et al. [76] found that language-conditioning leads to performance gains for multi-task RL on 50 task variations.

Language-based Planning. In this paper, we only investigated single-goal settings where the language instruction does not change throughout the episode. However, language-conditioning naturally allows for composing several instructions in a sequential manner [61]. As such, several prior works [77, 13, 78, 79] have used language as a medium for planning high-level actions, which can then be executed with pre-trained low-level skills. Future works could incorporate language-based planning for grounding more abstract goals like “make dinner”.

Task and Motion Planning. In the sub-field of Task and Motion Planning (TAMP) [80, 81], Konidaris et al. [82] present an action-centric approach to symbolic planning. Given a set of predefined action-skills, an agent interacts with its environment to construct a set of symbols, which can then be used for planning.

Voxel Representations. Voxel-based representations have been used in several domains that specifically benefit from 3D understanding. In object detection [83, 84] and vision-language grounding [85, 86], voxel maps have been used to build persistent scene representations. In Neural Radiance Fields (NeRFs), voxel feature grids have dramatically reduced training and rendering times [87, 88]. Similarly, other works in robotics have used voxelized representations to embed viewpoint-invariance for driving [89] and manipulation [90]. The use of latent vectors in Perceiver [1] is broadly related to voxel hashing [91] from computer graphics. Instead of using a location-based hashing function to map voxels to fixed-size memory, PerceiverIO uses cross-attention to map the input to fixed-size latent vectors, which are trained end-to-end. Another major difference is the treatment of unoccupied space. In graphics, unoccupied space does not affect rendering, but in PerAct, unoccupied space is where a lot of “action detections” happen. Thus the relationship between unoccupied and occupied space, i.e., scene, objects, robot, is crucial for learning action representations.

Appendix J Additional Q-Prediction Examples

Figure 10 showcases additional Q-prediction examples from trained PerAct agents. Traditional object-centric representations like poses and instance-segmentations struggle to represent piles of beans or tomato vines with high precision. In contrast, action-centric agents like PerAct focus on learning perceptual representations of actions, which alleviates the need for practitioners to define what should be an object.

Figure 10: Additional Q-Prediction Examples. Translation Q-prediction examples from PerAct. The top two rows are from simulated tasks without any data augmentation perturbations, and the bottom row is from real-world tasks with translation and yaw-rotation perturbations.

Appendix K Things that did not work

In this section, we describe things we tried, but did not work or caused issues in practice.

Real-world multi-camera setup. We tried setting up multiple Kinect-2s for real-world multi-view observations, but we could not solve interference issues with multiple Time-of-Flight sensors. Particularly, the depth frames became very noisy and had lots of holes. Future works could try turning the cameras on-and-off in a rapid sequence, or use better Time-of-Flight cameras with minimal interference.

Fourier features for positional embeddings. Instead of the learned positional embeddings, we also experimented with concatenating Fourier features to the input sequence like in some Perceiver models [1]. The Fourier features led to substantially worse performance.

Pre-trained vision features. Following CLIPort [16], we tried using pre-trained vision features from CLIP [68], instead of raw RGB values, to bootstrap learning and also to improve generalization to unseen objects. We ran CLIP’s ResNet50 on each of the 4 RGB frames, and upsampled features with shared decoder layers in a UNet fashion. But we found this to be extremely slow, especially since the ResNet50 and decoder layers need to be run on 4 independent RGB frames. With this additional overhead, training multi-task agents would have taken substantially longer than 16 days. Future works could experiment with methods for pre-training the decoder layers on auxiliary tasks, and pre-extracting features for faster training.

Upsampling at multiple self-attention layers. Inspired by Dense Prediction Transformers (DPT) [92], we tried upsampling features at multiple self-attention layers in the Perceiver Transformer. But this did not work at all; perhaps the latent-space self-attention layers of Perceiver are substantially different to the full-input self-attention layers of ViT [4] and DPT [92].

Extreme rotation augmentation. In addition to yaw rotation perturbations, we also tried perturbing the pitch and roll. While PerAct was still able to learn policies, it took substantially longer to train. It is also unclear if the default latent size of $\mathbb{R}^{2048 \times 512}$ is appropriate for learning 6-DoF policies with such extreme rotation perturbations.

Using Adam instead of LAMB. We tried training PerAct with the Adam [93] optimizer instead of LAMB [70], but this led to worse performance in both simulated and real-world experiments.

Appendix L Limitations and Risks

While PerAct is quite capable, it is not without limitations. In the following sections, we discuss some of these limitations and potential risks for real-world deployment.

Sampling-Based Motion Planner. PerAct relies on a sampling-based motion planner to execute discretized actions. This puts PerAct at the mercy of a randomized planner to reach poses. While this issue did not cause any major problems with the tasks in our experiments, a lot of other tasks are sensitive to the paths taken to reach poses. For instance, pouring water into a cup would require a smooth path for tilting the water container appropriately. This could be addressed in future works by using a combination of learned and sampled motion paths [94].

Dynamic Manipulation. Another issue with discrete-time discretized actions is that they are not easily applicable to dynamic tasks that require real-time closed-loop maneuvering. This could be addressed with a separate visuo-servoing mechanism that can reach target poses with closed-loop control. Alternatively, instead of predicting just one action, PerAct could be extended to predict a sequence of discretized actions. Here, the Transformer-based architecture could be particularly advantageous.

Dexterous Manipulation. Using discretized actions with N-DoF robots like multi-fingered hands is also non-trivial. Specifically for multi-fingered hands, PerAct could be modified to predict finger-tip poses that can be reached with an IK (Inverse Kinematics) solver. But it is unclear how feasible or robust such an approach would be with under-actuated systems like multi-fingered hands.

Figure 11: Perturbation Tests. Results from a multi-task PerAct agent trained on a single drawer and evaluated on several instances of perturbed drawers. Each perturbation consists of 25 evaluation episodes, and reported successes are relative to the training drawer.

Generalization to Novel Instances and Objects. In Figure 11, we report results from small-scale perturbation experiments on the open drawer task. We observe that changing the shape of the handles does not affect performance. However, handles with randomized textures and colors confuse the agent since it has only seen one type of drawer color and texture during training. Going beyond this one-shot setting and training on several instances of drawers might improve generalization performance. Although we did not explicitly study generalization to unseen objects, it might be feasible to train PerAct’s action-detector on a broad range of objects and evaluate its ability to handle novel objects, akin to how language-conditioned instance-segmentors and object-detectors are used [95]. Alternatively, pre-trained vision features from multi-modal encoders like CLIP [68] or R3M [35] could be used to bootstrap learning.

Scope of Language Grounding. Like with prior work [16], PerAct’s understanding of verb-noun phrases is closely grounded in demonstrations and tasks. For example, “cleaning” in “clean the beans on the table with a dustpan” is specifically associated with the action sequence of pushing beans on to a dustpan, and not “cleaning” in general, which could be applied to other tasks like cleaning the table with a cloth.

Predicting Task Completion. For both real-world and simulated evaluations, an oracle indicates whether the desired goal has been reached. This oracle could be replaced with a success classifier that can be pre-trained to predict task completion from RGB-D observations.

Data Augmentation with Kinematic Feasibility. The data augmentation method described in Appendix E does not consider the kinematic feasibility of reaching perturbed actions with the Franka arm. Future works could pre-compute unreachable poses in the discretized action space, and discard any augmentation perturbations that push actions into unreachable zones.

Balanced Datasets. Since PerAct is trained with just a few demonstrations, it occasionally tends to exploit biases in the training data. For instance, PerAct might have a tendency to always “place blue blocks on yellow blocks” if such an example is over-represented in the training data. Such issues could potentially be fixed by scaling datasets to include more diverse examples of objects and attributes. Additionally, data visualization methods could be used to identify and fix these biases.

Multi-Task Optimization. The uniform task sampling strategy presented in Section 3.4 might sometimes hurt performance. Since all tasks are weighted equally, optimizing for certain tasks with common elements (e.g., moving blocks) might adversely affect the performance on other dissimilar tasks (e.g., turning taps). Future works could use dynamic task-weighting methods like Auto-λ [65] for better multi-task optimization.
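The difference between uniform and weighted task sampling can be sketched as follows. This is a generic illustration with hypothetical function names and toy demonstration identifiers; the weighted variant only shows where externally supplied task weights would plug in, and it does not implement Auto-λ.

```python
import random


def sample_uniform_over_tasks(demos_by_task, batch_size, rng=random):
    """Pick a task uniformly first, so tasks with more demos are not favored."""
    tasks = sorted(demos_by_task)
    chosen = [rng.choice(tasks) for _ in range(batch_size)]
    return [(task, rng.choice(demos_by_task[task])) for task in chosen]


def sample_weighted_over_tasks(demos_by_task, weights, batch_size, rng=random):
    """Same as above, but tasks are drawn in proportion to `weights`."""
    tasks = sorted(demos_by_task)
    chosen = rng.choices(tasks, weights=[weights[t] for t in tasks],
                         k=batch_size)
    return [(task, rng.choice(demos_by_task[task])) for task in chosen]


demos = {"open drawer": ["demo_0", "demo_1"], "turn tap": ["demo_0"]}
batch = sample_uniform_over_tasks(demos, batch_size=8, rng=random.Random(0))
print(len(batch))  # 8
```

A dynamic weighting method would update `weights` during training instead of keeping them fixed.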

Deployment Risks. PerAct is an end-to-end framework for 6-DoF manipulation. Unlike some methods in Task-and-Motion-Planning that can sometimes provide theoretical guarantees on task completion, PerAct is a purely reactive system whose performance can only be evaluated through empirical means. Also, unlike prior works [16], we do not use internet pre-trained vision encoders that might contain harmful biases [96, 97]. Even so, it is prudent to thoroughly study and mitigate any biases before deployment. As such, for real-world applications, keeping humans in the loop during both training and testing might help. Usage with unseen objects and observations with people is not recommended for safety-critical systems.

Appendix M Emergent Properties

In this section, we present some preliminary findings on the emergent properties of PerAct.

Figure 12: Object Tracker. Tracking an unseen hand sanitizer instance.

m.1 Object Tracking

Although PerAct was not explicitly trained for 6-DoF object-tracking, our action detection framework can be used to localize objects in cluttered scenes. In this video, we show an agent that was trained with one hand sanitizer instance on just 5 “press the handsan” demos, and then evaluated on tracking an unseen sanitizer instance. PerAct does not need to build a complete representation of hand sanitizers, and only has to learn where to press them. Our implementation runs at an inference speed of 2.23 FPS (or 0.45 seconds per frame), allowing for near real-time closed-loop behaviors.

Figure 13: Examples of Multi-Modal Predictions.

m.2 Multi-Modal Actions

PerAct’s problem formulation allows for modeling multi-modal action distributions, i.e., scenarios where multiple actions are valid given a specific goal. Figure 13 presents some selected examples of multi-modal action predictions from PerAct. Since there are several “yellow blocks” and “cups” to choose from, the Q-prediction distributions have several modes. In practice, we observe that the agent has a tendency to prefer certain object instances over others (like the front mug in Figure 13) due to preference biases in the training dataset. We also note that the cross-entropy based training method from Section 3.4 is closely related to Energy-Based Models (EBMs) [98, 99]. In a way, the cross-entropy loss is pulling up expert 6-DoF actions, while pushing down every other action in the discretized action space. At test time, we simply maximize the learned Q-predictions, instead of minimizing an energy function with optimization. Future works could look into EBM [99] training and inference methods for better generalization and execution performance.
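The "pulling up" and "pushing down" behavior can be read off the gradient of the softmax cross-entropy loss with respect to the logits, which is the softmax probability minus the one-hot expert label. The sketch below computes it for a toy action space of four discretized actions; the function names and sizes are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_and_grad(logits, expert_index):
    """Loss and its gradient with respect to the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[expert_index])
    grad = [p - (1.0 if i == expert_index else 0.0)
            for i, p in enumerate(probs)]
    return loss, grad


loss, grad = cross_entropy_and_grad([0.0, 0.0, 0.0, 0.0], expert_index=2)
print(grad)  # [0.25, 0.25, -0.75, 0.25]
```

A gradient-descent step subtracts this gradient, so the expert action's score rises while every other action's score falls, in proportion to the probability currently assigned to it.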