Learning to Set Waypoints for Audio-Visual Navigation

by Changan Chen et al.

In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements: 1) waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves. Both new ideas capitalize on the synergy of audio and visual data for revealing the geometry of an unmapped space. We demonstrate our approach on two challenging datasets of real-world 3D scenes, Replica and Matterport3D. Our model improves the state of the art by a substantial margin, and our experiments reveal that learning the links between sights, sounds, and space is essential for audio-visual navigation.
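To make the acoustic memory idea concrete, here is a minimal sketch of one plausible form such a structure could take: an egocentric top-down grid that records the audio intensity heard at each visited cell, which a policy could then query for coarse directional cues. This is an illustrative assumption, not the authors' implementation; the class name, grid representation, and `loudest_heard` helper are all hypothetical.

```python
import numpy as np

class AcousticMemory:
    """Hypothetical sketch of a spatially grounded acoustic memory.

    Stores the audio intensity heard at each visited cell of a
    top-down grid, assuming discrete agent positions. Illustrative
    only; not the paper's actual architecture.
    """

    def __init__(self, size=64):
        self.size = size
        self.intensity = np.zeros((size, size), dtype=np.float32)
        self.visited = np.zeros((size, size), dtype=bool)

    def update(self, position, audio_intensity):
        """Record the intensity heard at grid cell (row, col)."""
        r, c = position
        self.intensity[r, c] = audio_intensity
        self.visited[r, c] = True

    def loudest_heard(self):
        """Return the visited cell with the highest recorded intensity,
        a crude cue toward the sound source's likely direction."""
        if not self.visited.any():
            return None
        masked = np.where(self.visited, self.intensity, -np.inf)
        return tuple(np.unravel_index(np.argmax(masked), masked.shape))
```

In this toy form, a waypoint-setting policy could bias candidate waypoints toward the cell returned by `loudest_heard()`, grounding what the agent has heard in the space it has explored.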


Dynamical Audio-Visual Navigation: Catching Unheard Moving Sound Sources in Unmapped 3D Environments

Recent work on audio-visual navigation targets a single static sound in ...

Look, Listen, and Act: Towards Audio-Visual Embodied Navigation

A crucial aspect of mobile intelligent agents is their ability to integr...

Semantic Audio-Visual Navigation

Recent work on audio-visual navigation assumes a constantly-sounding tar...

Audio-Visual Embodied Navigation

Moving around in the world is naturally a multisensory experience, but t...

Sound Adversarial Audio-Visual Navigation

Audio-visual navigation task requires an agent to find a sound source in...

SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning

We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based a...

A Deep Reinforcement Learning Approach to Audio-Based Navigation in a Multi-Speaker Environment

In this work we use deep reinforcement learning to create an autonomous ...