Act3D: Infinite Resolution Action Detection Transformer for Robotic Manipulation

06/30/2023
by Théophile Gervet et al.

3D perceptual representations are well suited for robot manipulation as they easily encode occlusions and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D perceptual grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, foregoing 3D inductive biases. In this paper, we propose Act3D, a manipulation policy Transformer that casts 6-DoF keypose prediction as 3D detection with adaptive spatial computation. It takes as input 3D feature clouds unprojected from one or more camera views, iteratively samples 3D point grids in free space in a coarse-to-fine manner, featurizes them using relative spatial attention to the physical feature cloud, and selects the best feature point for end-effector pose prediction. Act3D sets a new state of the art on RLBench, an established manipulation benchmark: it achieves a 10% absolute improvement over the previous SOTA 2D multi-view policy on 74 RLBench tasks, and a 22% absolute improvement with less compute over the previous SOTA 3D policy. In thorough ablations, we show the importance of relative spatial attention, large-scale vision-language pre-trained 2D backbones, and weight tying across coarse-to-fine attentions. Code and videos are available at our project site: https://act3d.github.io/.
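The coarse-to-fine detection loop described above can be sketched in a few lines. This is a minimal, non-learned illustration, not Act3D's implementation: the function names (`relative_attention_features`, `coarse_to_fine_select`), the distance-based attention logits standing in for learned relative spatial attention, and the externally supplied `score_fn` are all assumptions made for the sketch.

```python
import numpy as np

def relative_attention_features(query_pts, cloud_pts, cloud_feats, tau=0.05):
    """Featurize sampled 3D points by attending to a physical feature cloud.

    Attention logits here are negative squared relative distances scaled by a
    temperature tau -- a simplified, non-learned stand-in for Act3D's
    relative spatial attention.
    """
    # (Q, C) pairwise squared distances between query and cloud points
    d2 = ((query_pts[:, None, :] - cloud_pts[None, :, :]) ** 2).sum(-1)
    logits = -d2 / tau
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ cloud_feats  # (Q, F) attended features

def coarse_to_fine_select(cloud_pts, cloud_feats, score_fn, center,
                          radius=1.0, levels=3, n=64, rng=None):
    """Iteratively sample 3D point grids around the current best point,
    halving the sampling radius each level (coarse-to-fine detection)."""
    rng = np.random.default_rng(rng)
    best = np.asarray(center, dtype=float)
    for _ in range(levels):
        # sample a random point grid in free space around the current estimate
        samples = best + rng.uniform(-radius, radius, size=(n, 3))
        feats = relative_attention_features(samples, cloud_pts, cloud_feats)
        best = samples[np.argmax(score_fn(feats))]  # keep highest-scoring point
        radius *= 0.5  # refine around the winner at the next level
    return best
```

As a toy check, if the cloud features are simply the point coordinates and `score_fn` rewards proximity to a goal position, the selected point converges toward that goal over successive levels, which is the adaptive spatial computation the abstract refers to: resolution is spent only near the current best estimate rather than on a dense global grid.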


Related research

- 06/26/2023 · RVT: Robotic View Transformer for 3D Object Manipulation
  For 3D object manipulation, methods that build an explicit 3D representa...

- 06/23/2021 · Coarse-to-Fine Q-attention: Efficient Learning for Visual Robotic Manipulation via Discretisation
  Reflecting on the last few years, the biggest breakthroughs in deep rein...

- 02/05/2023 · Multi-View Masked World Models for Visual Robotic Manipulation
  Visual robotic manipulation research and applications often use multiple...

- 04/26/2022 · Coarse-to-fine Q-attention with Tree Expansion
  Coarse-to-fine Q-attention enables sample-efficient robot manipulation b...

- 07/27/2023 · IML-ViT: Image Manipulation Localization by Vision Transformer
  Advanced image tampering techniques are increasingly challenging the tru...

- 08/18/2022 · The 8-Point Algorithm as an Inductive Bias for Relative Pose Prediction by ViTs
  We present a simple baseline for directly estimating the relative pose (...

- 04/03/2023 · RePAST: Relative Pose Attention Scene Representation Transformer
  The Scene Representation Transformer (SRT) is a recent method to render ...
