R2H: Building Multimodal Navigation Helpers that Respond to Help

by Yue Fan et al.

The ability to assist humans during a navigation task in a supportive role is crucial for intelligent agents. Such agents, equipped with environment knowledge and conversational abilities, can guide individuals through unfamiliar terrain by generating natural language responses to their inquiries, grounded in the visual information of their surroundings. However, these multimodal conversational navigation helpers are still underdeveloped. This paper proposes a new benchmark, Respond to Help (R2H), for building multimodal navigation helpers that can respond to requests for help, based on existing dialog-based embodied datasets. R2H mainly includes two tasks: (1) Respond to Dialog History (RDH), which assesses the helper agent's ability to generate informative responses based on a given dialog history, and (2) Respond during Interaction (RdI), which evaluates the helper agent's ability to maintain effective and consistent real-time cooperation with a task performer agent during navigation. Furthermore, we propose a novel task-oriented multimodal response generation model that can see and respond, named SeeRee, as the navigation helper that guides the task performer in embodied tasks. Through both automatic and human evaluations, we show that SeeRee produces more effective and informative responses than baseline methods in assisting the task performer with different navigation tasks. Project website: https://sites.google.com/view/respond2help/home.
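To make the Respond-during-Interaction (RdI) setting concrete, the loop below sketches how a helper and a task performer might alternate turns: the performer issues an inquiry, and the helper generates a response grounded in what it can currently observe plus the accumulated dialog history. This is a minimal illustrative stub, not the SeeRee model; all class and method names are hypothetical, and the "visual observation" is reduced to a simple landmark lookup.

```python
# Hypothetical sketch of an RdI-style interaction loop. A real helper
# (e.g. SeeRee) would condition on images and dialog history with a
# multimodal model; this stub substitutes a keyword lookup over the
# landmarks the helper can currently "see".
from dataclasses import dataclass


@dataclass
class DialogTurn:
    speaker: str  # "performer" or "helper"
    text: str


@dataclass
class HelperAgent:
    """Stub helper that answers from a fixed map of visible landmarks."""
    visible_landmarks: dict  # landmark name -> direction hint

    def respond(self, inquiry: str, history: list) -> str:
        # Stand-in for multimodal grounding: match a landmark mentioned
        # in the inquiry against the current (simulated) observation.
        for landmark, hint in self.visible_landmarks.items():
            if landmark in inquiry.lower():
                return f"The {landmark} is {hint}."
        return "Keep going straight; I don't see that from here."


def rdi_episode(helper: HelperAgent, inquiries: list) -> list:
    """Run one interaction episode, returning the full dialog history."""
    history = []
    for inquiry in inquiries:
        history.append(DialogTurn("performer", inquiry))
        history.append(DialogTurn("helper", helper.respond(inquiry, history)))
    return history


helper = HelperAgent({"staircase": "on your left",
                      "exit": "behind the red door"})
dialog = rdi_episode(helper, ["Where is the staircase?",
                              "How do I reach the exit?"])
```

The point of the sketch is the turn-taking structure RdI evaluates: unlike RDH, where the dialog history is fixed in advance, here each helper response feeds back into the history that conditions the next turn.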



