Learning To Follow Directions in Street View

03/01/2019
by Karl Moritz Hermann, et al.

Navigating and understanding the real world remains a key challenge in machine learning and inspires a great variety of research in areas such as language grounding, planning, navigation and computer vision. We propose an instruction-following task that requires all of the above, and which combines the practicality of simulated environments with the challenges of ambiguous, noisy real-world data. StreetNav is built on top of Google Street View and provides visually accurate environments representing real places. Agents are given driving instructions which they must learn to interpret in order to navigate successfully in this environment. Since humans equipped with driving instructions can readily navigate in previously unseen cities, we set a high bar and test our trained agents for similar cognitive capabilities. Although deep reinforcement learning (RL) methods are frequently evaluated only on data that closely follow the training distribution, our dataset extends to multiple cities and has a clean train/test separation, allowing for thorough testing of generalisation ability. This paper presents the StreetNav environment and tasks, a set of novel models that establish strong baselines, and an analysis of the task and the trained agents.
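To make the task setup concrete, below is a minimal, hypothetical sketch of an instruction-following episode loop in the spirit of StreetNav: an environment exposes a street-level observation together with a natural-language instruction, an agent chooses movement actions along a street graph, and reward is given only on reaching the goal. All names here (StreetNavLikeEnv, Observation, run_episode) and the toy graph are illustrative assumptions, not the actual StreetNav or StreetLearn API.

# Hypothetical sketch only: the classes and functions below are illustrative
# assumptions, not the actual StreetNav / StreetLearn API.
from dataclasses import dataclass
import random

@dataclass
class Observation:
    panorama: bytes     # street-level imagery at the current node (placeholder)
    instruction: str    # natural-language driving instructions for the episode

class StreetNavLikeEnv:
    """Toy stand-in: a graph of panorama nodes, a goal node, one instruction string."""
    def __init__(self, graph, start, goal, instruction):
        self.graph, self.start, self.goal = graph, start, goal
        self.instruction = instruction
        self.node = start

    def reset(self):
        self.node = self.start
        return Observation(panorama=b"", instruction=self.instruction)

    def step(self, action):
        # Actions name a neighbouring node; reward 1.0 only on reaching the goal.
        if action in self.graph[self.node]:
            self.node = action
        done = self.node == self.goal
        return Observation(b"", self.instruction), float(done), done

def run_episode(env, policy):
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total

# A random walk over a tiny three-node street graph, just to exercise the interface.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
env = StreetNavLikeEnv(graph, start="A", goal="C",
                       instruction="Head straight, then turn right at the second light.")
print(run_episode(env, policy=lambda obs: random.choice(graph[env.node])))

A trained agent would replace the random policy with a model that conditions on both the panorama and the instruction, and generalisation would be tested by evaluating it on street graphs built from held-out cities, as described in the abstract.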

Related research

VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View (07/12/2023)
Incremental decision making in real-world environments is one of the mos...

Cross-View Policy Learning for Street Navigation (06/13/2019)
The ability to navigate from visual observations in unfamiliar environme...

Human Instruction-Following with Deep Reinforcement Learning via Transfer-Learning from Text (05/19/2020)
Recent work has described neural-network-based agents that are trained w...

Talk2Nav: Long-Range Vision-and-Language Navigation in Cities (10/04/2019)
Autonomous driving models often consider the goal as fixed at the start ...

Multi-modal Discriminative Model for Vision-and-Language Navigation (05/31/2019)
Vision-and-Language Navigation (VLN) is a natural language grounding tas...

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents (07/04/2022)
Existing benchmarks for grounding language in interactive environments e...

Retouchdown: Adding Touchdown to StreetLearn as a Shareable Resource for Language Grounding Tasks in Street View (01/10/2020)
The Touchdown dataset (Chen et al., 2019) provides instructions by human...
