CLIPort: What and Where Pathways for Robotic Manipulation

by Mohit Shridhar, et al.

How can we imbue robots with the ability not only to manipulate objects precisely but also to reason about them in terms of abstract concepts? Recent works in manipulation have shown that end-to-end networks can learn dexterous skills that require precise spatial reasoning, but these methods often fail to generalize to new goals or to quickly learn transferable concepts across tasks. In parallel, there has been great progress in learning generalizable semantic representations for vision and language by training on large-scale internet data; however, these representations lack the spatial understanding necessary for fine-grained manipulation. To this end, we propose a framework that combines the best of both worlds: a two-stream architecture with semantic and spatial pathways for vision-based manipulation. Specifically, we present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2]. Our end-to-end framework solves a variety of language-specified tabletop tasks, from packing unseen objects to folding cloths, all without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures. Experiments in simulated and real-world settings show that our approach is data efficient in few-shot settings and generalizes effectively to seen and unseen semantic concepts. We even learn a single multi-task policy for 10 simulated and 9 real-world tasks that performs better than or comparably to single-task policies.
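The two-stream idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy "semantic" and "spatial" features, and the additive fusion followed by a pixel-wise softmax are all illustrative assumptions, standing in for the CLIP-based semantic pathway and the Transporter-style spatial pathway that the real agent uses to produce dense pick/place affordances.

```python
import numpy as np

def semantic_stream(rgb, lang_embedding):
    # Toy stand-in for the CLIP-based "what" pathway: a coarse per-pixel
    # map modulated by a language embedding (hypothetical conditioning).
    feat = rgb.mean(axis=-1)                    # (H, W) coarse semantic map
    return feat * float(lang_embedding.mean())  # language conditioning

def spatial_stream(rgbd):
    # Toy stand-in for the Transporter-style "where" pathway: keeps full
    # spatial resolution; here it simply reads the depth channel.
    return rgbd[..., 3]                         # (H, W) fine-grained map

def two_stream_affordance(rgbd, lang_embedding):
    # Fuse semantic ("what") and spatial ("where") maps, then softmax over
    # all pixels to get a distribution over candidate pick locations.
    fused = semantic_stream(rgbd[..., :3], lang_embedding) + spatial_stream(rgbd)
    logits = fused.reshape(-1)
    probs = np.exp(logits - logits.max())       # numerically stable softmax
    probs /= probs.sum()
    return probs.reshape(rgbd.shape[:2])

rgbd = np.random.rand(4, 4, 4)   # toy RGB-D observation (H, W, RGB+depth)
lang = np.random.rand(8)         # toy language-instruction embedding
aff = two_stream_affordance(rgbd, lang)
```

The key design point this sketch mirrors is that the spatial pathway never loses pixel-level resolution, while the semantic pathway injects language-conditioned meaning; the fused output is a dense per-pixel affordance rather than an explicit object pose.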



Related Research

- Transporter Networks: Rearranging the Visual World for Robotic Manipulation — "Robotic manipulation can be formulated as inducing a sequence of spatial..."
- Grounding Object Relations in Language-Conditioned Robotic Manipulation with Semantic-Spatial Reasoning — "Grounded understanding of natural language in physical scenes can greatl..."
- SORNet: Spatial Object-Centric Representations for Sequential Manipulation — "Sequential manipulation tasks require a robot to perceive the state of a..."
- Spatial-Language Attention Policies for Efficient Robot Learning — "We investigate how to build and train spatial representations for robot ..."
- Vision-Based Manipulators Need to Also See from Their Hands — "We study how the choice of visual perspective affects learning and gener..."
- End-to-End Egospheric Spatial Memory — "Spatial memory, or the ability to remember and recall specific locations..."
- TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation — "How do we imbue robots with the ability to efficiently manipulate unseen..."
