VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

07/12/2023
by Raphael Schumann et al.

Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN), which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment like Street View. Despite the impressive results of LLMs in other research areas, it remains an open problem how to best connect them to an interactive visual environment. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as the contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human-written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve a 25% increase in task completion over the previous state-of-the-art for two datasets.
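To make the mechanism concrete, below is a minimal Python sketch of the two verbalization steps the abstract describes: scoring landmark visibility in the current panorama with an off-the-shelf CLIP model, and rendering the trajectory so far as a text prompt whose completion is the agent's next action. This is an illustration only, not the authors' released code; the CLIP checkpoint, the similarity threshold, the prompt template, and all helper names are assumptions made for the example.

```python
# Minimal sketch of the verbalization idea, assuming the Hugging Face
# `transformers` CLIP API. The threshold, templates, and helpers below are
# illustrative choices, not the values or code used in the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)


def visible_landmarks(panorama: Image.Image, landmarks: list[str],
                      threshold: float = 0.22) -> list[str]:
    """Return the landmark phrases CLIP considers visible in the current view.

    The cosine-similarity threshold is an illustrative assumption.
    """
    inputs = processor(
        text=[f"a photo of a {lm}" for lm in landmarks],
        images=panorama,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings and take image-text cosine similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(0)  # one similarity score per landmark
    return [lm for lm, s in zip(landmarks, sims.tolist()) if s > threshold]


def verbalize_step(panorama: Image.Image, landmarks: list[str]) -> str:
    """Turn one environment observation into a line of text for the prompt."""
    seen = visible_landmarks(panorama, landmarks)
    if not seen:
        return "You walk down the street."
    return " ".join(f"There is a {lm} ahead." for lm in seen)


def build_prompt(instructions: str, history: list[tuple[str, str]]) -> str:
    """Concatenate the instructions and the verbalized trajectory.

    The LLM is asked to complete the final "Action:" line with its next
    action token (e.g. forward, left, right, stop).
    """
    lines = [f"Navigation instructions: {instructions}", ""]
    for observation, action in history:
        lines.append(observation)
        lines.append(f"Action: {action}")
    lines.append("Action:")  # the completion is the agent's next action
    return "\n".join(lines)
```

In the paper, such a prompt is consumed either by an LLM with two in-context examples or by a finetuned LLM agent; the sketch deliberately leaves the LLM call itself abstract.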

Related research

11/29/2018
Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments
We study the problem of jointly reasoning about language and vision thro...

03/01/2019
Learning To Follow Directions in Street View
Navigating and understanding the real world remains a key challenge in m...

06/13/2019
Cross-View Policy Learning for Street Navigation
The ability to navigate from visual observations in unfamiliar environme...

01/10/2020
Retouchdown: Adding Touchdown to StreetLearn as a Shareable Resource for Language Grounding Tasks in Street View
The Touchdown dataset (Chen et al., 2019) provides instructions by human...

09/15/2023
LASER: LLM Agent with State-Space Exploration for Web Navigation
Large language models (LLMs) have been successfully adapted for interact...

10/04/2019
Talk2Nav: Long-Range Vision-and-Language Navigation in Cities
Autonomous driving models often consider the goal as fixed at the start ...

07/11/2020
Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation
The ability to perform effective planning is crucial for building an ins...
