Vision-Language Models as Success Detectors

03/13/2023
by Yuqing Du, et al.

Detecting successful behaviour is crucial for training intelligent agents. As such, generalisable reward models are a prerequisite for agents that can learn to generalise their behaviour. In this work we focus on developing robust success detectors that leverage large, pretrained vision-language models (Flamingo; Alayrac et al., 2022) and human reward annotations. Concretely, we treat success detection as a visual question answering (VQA) problem, denoted SuccessVQA. We study success detection across three vastly different domains: (i) interactive language-conditioned agents in a simulated household, (ii) real-world robotic manipulation, and (iii) "in-the-wild" human egocentric videos. We investigate the generalisation properties of a Flamingo-based success detection model across unseen language and visual changes in the first two domains, and find that the proposed method outperforms bespoke reward models in out-of-distribution test scenarios with either variation. In the last domain of "in-the-wild" human videos, we show that success detection on unseen real videos presents an even more challenging generalisation task that warrants future work. We hope our initial results encourage further work in real-world success detection and reward modelling.
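To make the SuccessVQA framing concrete, the sketch below shows one way a trajectory clip and task description might be packaged into a yes/no VQA query and scored with a vision-language model. This is a minimal sketch, not the paper's implementation: the `VisionLanguageModel` protocol, its `score_answer` method, and the question template are assumptions introduced for illustration, since the Flamingo-based model is not exposed through a public API.

```python
from dataclasses import dataclass
from typing import Any, Protocol, Sequence


class VisionLanguageModel(Protocol):
    """Hypothetical interface standing in for a Flamingo-style VLM."""

    def score_answer(
        self, frames: Sequence[Any], question: str, answer: str
    ) -> float:
        """Return the model's log-likelihood of `answer` given the frames and question."""
        ...


@dataclass
class SuccessVQAExample:
    frames: Sequence[Any]  # sampled video frames from a trajectory clip
    task: str              # natural-language task, e.g. "put the cup in the sink"


def detect_success(vlm: VisionLanguageModel, example: SuccessVQAExample) -> bool:
    # Frame success detection as binary VQA: ask a yes/no question about
    # the clip and pick whichever answer the model scores higher.
    question = f"Did the agent successfully complete the task: {example.task}?"
    yes_score = vlm.score_answer(example.frames, question, "yes")
    no_score = vlm.score_answer(example.frames, question, "no")
    return yes_score > no_score
```

Under this framing, each (clip, task) pair labelled with a human yes/no reward annotation becomes a supervised training example for the detector, and the yes/no comparison above becomes the reward signal at evaluation time.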


Related research

- 03/31/2021 · Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos
- 11/21/2022 · Improving Multimodal Interactive Agents with Reinforcement Learning from Human Feedback
- 07/18/2023 · Towards A Unified Agent with Foundation Models
- 09/14/2022 · Finetuning Pretrained Vision-Language Models with Correlation Information Bottleneck for Robust Visual Question Answering
- 05/18/2023 · Large Language Models can be Guided to Evade AI-Generated Text Detection
- 02/11/2021 · Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
- 11/22/2017 · Building Machines that Learn and Think for Themselves: Commentary on Lake et al., Behavioral and Brain Sciences, 2017
