RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

07/28/2023 ∙ by Anthony Brohan, et al.
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to this category of models as vision-language-action (VLA) models and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
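The central recipe above, expressing robot actions as text tokens so they share a vocabulary with natural language, can be illustrated with a short sketch. The snippet below assumes a uniform discretization of each continuous action dimension into 256 bins and an illustrative 7-dimensional action (translation, rotation, gripper); the bounds, dimension layout, and helper names are hypothetical and not the exact RT-2 configuration.

```python
import numpy as np

# Sketch of the action-as-text idea: each continuous action dimension is
# discretized into a fixed number of uniform bins, and the bin indices are
# written out as a plain text string so a vision-language model can emit
# actions with its ordinary language vocabulary. Bin count and bounds here
# are illustrative assumptions.

NUM_BINS = 256

def action_to_text(action, low, high, num_bins=NUM_BINS):
    """Map a continuous action vector to a space-separated string of bin indices."""
    action = np.clip(action, low, high)
    # Scale each dimension to [0, num_bins - 1] and round to the nearest bin.
    bins = np.round((action - low) / (high - low) * (num_bins - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def text_to_action(token_string, low, high, num_bins=NUM_BINS):
    """Invert the mapping: decode a string of bin indices back to a continuous action."""
    bins = np.array([int(t) for t in token_string.split()], dtype=float)
    return low + bins / (num_bins - 1) * (high - low)

if __name__ == "__main__":
    # Hypothetical 7-DoF action: 3D translation, 3D rotation, gripper opening.
    low = np.array([-0.1] * 6 + [0.0])
    high = np.array([0.1] * 6 + [1.0])
    action = np.array([0.02, -0.05, 0.0, 0.01, 0.0, -0.03, 1.0])

    encoded = action_to_text(action, low, high)
    decoded = text_to_action(encoded, low, high)
    print(encoded)  # "153 64 128 140 128 89 255"
    print(decoded)  # approximately recovers the original action
```

Because the instruction, the visual context, and the encoded action all live in the same token space, the same model can be co-fine-tuned on robot trajectories and web-scale vision-language data without architectural changes, which is what enables the transfer and emergent capabilities described in the abstract.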

research ∙ 03/06/2023
PaLM-E: An Embodied Multimodal Language Model
Large language models excel at a wide range of complex tasks. However, e...

research ∙ 02/12/2020
Deep compositional robotic planners that follow natural language commands
We demonstrate how a sampling-based robotic planner can be augmented to ...

research ∙ 09/02/2023
Developmental Scaffolding with Large Language Models
Exploration and self-observation are key mechanisms of infant sensor...

research ∙ 04/19/2023
SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery
Advances in GPT-based large language models (LLMs) are revolutionizing n...

research ∙ 09/14/2022
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Effective scaling and a flexible task interface enable large language mo...

research ∙ 03/10/2023
Robotic Applications of Pre-Trained Vision-Language Models to Various Recognition Behaviors
In recent years, a number of models that learn the relations between vis...

research ∙ 01/18/2023
Effective End-to-End Vision Language Pretraining with Semantic Visual Loss
Current vision language pretraining models are dominated by methods usin...
