Distilling Internet-Scale Vision-Language Models into Embodied Agents

01/29/2023
by Theodore Sumers, et al.

Instruction-following agents must ground language into their observation and action spaces. Learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To address this challenge, we propose using pretrained vision-language models (VLMs) to supervise embodied agents. We combine ideas from model distillation and hindsight experience replay (HER), using a VLM to retroactively generate language describing the agent's behavior. Simple prompting allows us to control the supervision signal, teaching an agent to interact with novel objects based on their names (e.g., planes) or their features (e.g., colors) in a 3D rendered environment. Few-shot prompting lets us teach abstract category membership, including pre-existing categories (food vs. toys) and ad-hoc ones (arbitrary preferences over objects). Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.
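The core idea, combining HER-style relabeling with a VLM, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `vlm_caption` function is a hypothetical stand-in for a pretrained VLM, and the relabeling here simply captions the trajectory's final frame and treats that caption as the instruction the agent was retroactively "following".

```python
def vlm_caption(frame):
    # Hypothetical stand-in for a pretrained vision-language model
    # that describes the object the agent interacted with.
    return "a red plane"

def hindsight_relabel(trajectory, prompt="Lift {desc}."):
    """Turn an unlabeled trajectory into supervised (instruction,
    observation, action) triples: caption the final frame, then pair
    that hindsight instruction with every step (as in HER)."""
    final_frame = trajectory["frames"][-1]
    instruction = prompt.format(desc=vlm_caption(final_frame))
    return [(instruction, obs, act)
            for obs, act in zip(trajectory["frames"], trajectory["actions"])]

# Example: a 2-step trajectory yields 2 training examples, all labeled
# with the same retroactively generated instruction.
traj = {"frames": ["f0", "f1"], "actions": ["a0", "a1"]}
pairs = hindsight_relabel(traj)
```

Changing the `prompt` template (or few-shot examples passed to the VLM) is what lets the supervision signal target names, features, or category membership.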


Related research

- Cognitive Architectures for Language Agents (09/05/2023)
  Recent efforts have incorporated large language models (LLMs) with exter...

- Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models (10/24/2022)
  Humans are excellent at understanding language and vision to accomplish ...

- C3: Continued Pretraining with Contrastive Weak Supervision for Cross Language Ad-Hoc Retrieval (04/25/2022)
  Pretrained language models have improved effectiveness on numerous tasks...

- Behavioral Analysis of Vision-and-Language Navigation Agents (07/20/2023)
  To be successful, Vision-and-Language Navigation (VLN) agents must be ab...

- DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following (02/27/2022)
  Language-guided Embodied AI benchmarks requiring an agent to navigate an...

- LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models (12/08/2022)
  This study focuses on embodied agents that can follow natural language i...

- Plan, Eliminate, and Track: Language Models are Good Teachers for Embodied Agents (05/03/2023)
  Pre-trained large language models (LLMs) capture procedural knowledge ab...
