AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments

10/14/2022
by Sudipta Paul, et al.

Recent years have seen embodied visual navigation advance in two distinct directions: (i) equipping the AI agent to follow natural language instructions, and (ii) making the navigable world multimodal, e.g., audio-visual navigation. However, the real world is not only multimodal but also often complex, and thus, in spite of these advances, agents still need to understand the uncertainty in their actions and seek instructions to navigate. To this end, we present AVLEN, an interactive agent for Audio-Visual-Language Embodied Navigation. As in audio-visual navigation tasks, the goal of our embodied agent is to localize an audio event by navigating the 3D visual world; however, the agent may also seek help from a human (oracle), where the assistance is provided in free-form natural language. To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone that learns: (a) high-level policies to choose either audio cues for navigation or to query the oracle, and (b) lower-level policies to select navigation actions based on its audio-visual and language inputs. The policies are trained by rewarding success on the navigation task while minimizing the number of queries to the oracle. To empirically evaluate AVLEN, we present experiments on the SoundSpaces framework for semantic audio-visual navigation tasks. Our results show that equipping the agent to ask for help leads to a clear improvement in performance, especially in challenging cases, e.g., when the sound is unheard during training or in the presence of distractor sounds.
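The hierarchical decision loop described in the abstract, a high-level policy choosing between following audio cues and querying the oracle, with a reward that trades navigation success against query cost, can be illustrated with a minimal sketch. All names, thresholds, and reward magnitudes below are illustrative assumptions, not details from the paper's implementation, and the scalar "uncertainty" stands in for what would be a learned score over audio-visual features.

```python
# Hypothetical sketch of AVLEN's hierarchical decision structure.
# QUERY_PENALTY and SUCCESS_REWARD are assumed values for illustration.
QUERY_PENALTY = 0.2    # assumed cost per oracle query
SUCCESS_REWARD = 10.0  # assumed reward for reaching the audio goal

def high_level_policy(uncertainty, threshold=0.7):
    """Choose an option: follow audio cues, or query the oracle.

    A learned policy would score options from audio-visual features;
    here a scalar 'uncertainty' is a stand-in for that score.
    """
    return "query_oracle" if uncertainty > threshold else "audio_nav"

def episode_reward(reached_goal, num_queries):
    """Reward success on the navigation task while penalizing
    each query made to the oracle."""
    base = SUCCESS_REWARD if reached_goal else 0.0
    return base - QUERY_PENALTY * num_queries
```

The key design point this sketch captures is that querying is an option competing with navigation at the high level, so the agent learns to ask only when its own uncertainty makes the expected query cost worthwhile.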


Related research

06/06/2023  Active Sparse Conversations for Improved Audio-Visual Embodied Navigation
Efficient navigation towards an audio-goal necessitates an embodied agen...

05/16/2018  FollowNet: Robot Navigation by Following Natural Language Directions with Deep Reinforcement Learning
Understanding and following directions provided by humans can enable rob...

02/06/2016  End-to-End Goal-Driven Web Navigation
We propose a goal-driven web navigation as a benchmark task for evaluati...

10/30/2022  Towards Versatile Embodied Navigation
With the emergence of varied visual navigation tasks (e.g, image-/object...

06/20/2022  Good Time to Ask: A Learning Framework for Asking for Help in Embodied Visual Navigation
In reality, it is often more efficient to ask for help than to search th...

08/20/2023  Omnidirectional Information Gathering for Knowledge Transfer-based Audio-Visual Navigation
Audio-visual navigation is an audio-targeted wayfinding task where a rob...

03/13/2023  Audio Visual Language Maps for Robot Navigation
While interacting in the world is a multi-sensory experience, many robot...
