Multimodal Web Navigation with Instruction-Finetuned Foundation Models

05/19/2023
by Hiroki Furuta, et al.

Progress in autonomous web navigation has been hindered by a dependence on billions of exploratory interactions via online reinforcement learning, and by domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision transformer on a large corpus of demonstrations. We empirically demonstrate that this recipe improves the agent's capabilities in grounded visual perception, HTML comprehension, and multi-step reasoning, outperforming prior work by a significant margin. On the MiniWoB benchmark, we improve over the previous best offline methods by more than 31.9%, closely approaching the online-finetuned SoTA. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. We also collect 347K high-quality demonstrations using our trained models, a dataset 38 times larger than prior work, and make them available to promote future research in this direction.
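To make the recipe described above concrete, the sketch below shows one hypothetical way such an agent could be wired together: screenshot patch embeddings from a vision transformer are projected into the language model's embedding space, concatenated with the tokenized instruction and HTML, and decoded into an action string. The specific checkpoints (google/flan-t5-base, google/vit-base-patch16-224-in21k), the linear projection, the prompt format, and the action-string format are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a WebGUM-style multimodal web agent (illustrative, not the authors' code).
# Assumptions: Flan-T5 as the instruction-finetuned LM, a ViT image encoder, and a learned
# linear projection mapping screenshot patch embeddings into the LM's token-embedding space.
import torch
from PIL import Image
from transformers import AutoTokenizer, T5ForConditionalGeneration, ViTImageProcessor, ViTModel

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
lm = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Projection from the ViT hidden size to the LM embedding size. In the paper's recipe the
# vision and language backbones are finetuned jointly on demonstrations; here the projection
# is randomly initialized, so outputs are only structurally meaningful.
proj = torch.nn.Linear(vit.config.hidden_size, lm.config.d_model)

def predict_action(instruction: str, html: str, screenshot: Image.Image) -> str:
    """Map (instruction, HTML, screenshot) to an action string, e.g. 'click id=...' or 'type ...'."""
    # Text stream: the instruction plus (truncated) HTML, embedded with the LM token embeddings.
    text = f"instruction: {instruction} html: {html}"
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    text_emb = lm.get_input_embeddings()(ids)

    # Vision stream: ViT patch tokens from the page screenshot, projected into the LM space.
    pixels = processor(images=screenshot, return_tensors="pt").pixel_values
    patch_emb = proj(vit(pixel_values=pixels).last_hidden_state)

    # Early fusion: prepend image tokens to text tokens before the LM encoder, then decode an action.
    inputs_embeds = torch.cat([patch_emb, text_emb], dim=1)
    mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
    out = lm.generate(inputs_embeds=inputs_embeds, attention_mask=mask, max_new_tokens=32)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Example call (hypothetical inputs):
# action = predict_action("Click the Submit button",
#                         "<button id='submit'>Submit</button>",
#                         Image.open("page.png"))
```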

Related research

04/30/2020  Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
04/06/2020  Sub-Instruction Aware Vision-and-Language Navigation
05/08/2023  Accessible Instruction-Following Agent
01/10/2019  Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
11/10/2021  Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation
09/20/2023  Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions
03/30/2021  Diagnosing Vision-and-Language Navigation: What Really Matters
