Audio-Visual Embodied Navigation

by Changan Chen, et al.

Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf - restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to an audio-based target. We develop a multi-modal deep reinforcement learning pipeline to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to (1) discover elements of the geometry of the physical space indicated by the reverberating audio and (2) detect and follow sound-emitting targets. We further introduce audio renderings based on geometrical acoustic simulations for a set of publicly available 3D assets and instrument AI-Habitat to support the new sensor, making it possible to insert arbitrary sound sources in an array of apartment, office, and hotel environments. Our results show that audio greatly benefits embodied visual navigation in 3D spaces.



1 Introduction

Embodied agents perceive and act in the world around them, with a constant loop between their sensed surroundings and their selected movements. Both sights and sounds constantly drive our activity: the laundry machine buzzes to indicate it is done, a crying child draws our attention, the sound of breaking glass may require urgent help.

Figure 1: Pressure field of audio simulation overlaid on the top-down map of apartment 1 from Replica [85]. Our audio-enabled agent gets rich directional information about the goal, since the pressure field variation is correlated with the shortest distance. The acoustics also reveal the room’s geometry, major structures, and materials. Notice the discontinuities across walls and the gradient of the field along the geodesic path an agent must use to reach the goal (different from the shortest Euclidean path). As a result, to an agent standing in the bottom right room, the audio reveals the door as a good intermediate goal.

In embodied AI, the navigation task is of particular importance, with applications in search and rescue or service robotics, among many others. Navigation has a long history in robotics, where a premium is placed on rigorous geometric maps [91, 44]. More recently, researchers in computer vision are exploring models that loosen the metricity of maps in favor of end-to-end policy learning and learned spatial memories that can generalize to visual cues in novel environments [117, 42, 41, 81, 5, 66, 61, 98].

However, while current navigation models tightly integrate seeing and moving, they are deaf to the world around them. This poses a significant sensory hardship: sound is key to (1) understanding a physical space and (2) localizing sound-emitting targets. As leveraged by blind people and animals who perform sonic navigation, acoustic feedback partially reveals the geometry of a space, the presence of occluding objects, and the materials of major surfaces in the room [75, 30]—all of which can complement the visual stream. Meanwhile, targets currently outside the visual range may be detectable initially only by their sound (e.g., a person calling for help upstairs, the ringing phone occluded by the sofa, footsteps approaching from behind). See Figure 1. Finally, aural cues become critical when visual cues are either unreliable (e.g., the lights flicker off) or orthogonal to the agent’s task (e.g., a rescue site with rubble that breaks previously learned visual context).

Motivated by these factors, we introduce audio-visual navigation for complex, visually realistic 3D environments. The agent can both see and hear while attempting to reach its target. We consider two variants of the navigation task: (1) AudioGoal, where the target is indicated by the sound it emits, and (2) AudioPointGoal, where the agent is additionally directed towards the goal location at the onset. The former captures scenarios where a target initially out of view makes itself known aurally (e.g., person yelling for help, TV playing). The latter augments the standard PointGoal navigation benchmark [5] and captures scenarios where the agent has a GPS pointer towards the target, but should leverage audio-visual cues to navigate the unfamiliar environment and reach it faster.

We develop a multi-modal deep reinforcement learning (RL) pipeline to train navigation policies end-to-end from a stream of audio-visual observations. Importantly, the audio observations are generated with respect to both the agent’s current position and orientation as well as the physical properties of the 3D environment. To do so, we introduce pre-computed audio renderings for Replica [85], a public dataset of 3D environments, and integrate them with the open source Habitat platform [61] for fast 3D simulation (essential for scalable RL). The proposed agent learns a policy to choose motions in a novel, yet-unmapped environment that will bring it efficiently to the target while discovering relevant aspects of the latent environment map.

Our results show that both audio and vision are valuable for efficient navigation. The agent learns to blend the two modalities to extract the cues needed to map environments in a generalizable manner, and doing so typically yields faster learning at training time and faster, more accurate navigation at inference time. Furthermore, we demonstrate that for an audio goal, the audio stream is competitive with the dynamic goal displacement vectors often used in navigation [5, 61, 38, 55, 15], while having the advantage of not assuming perfect odometry. Finally, we explore the agent’s ability to generalize to not only unseen environments, but also unheard sounds.

Our main contributions are:

  1. We introduce the task of audio-visual navigation in complex visually realistic 3D environments.

  2. We generalize a state-of-the-art deep RL-based navigation pipeline to accommodate audio observations and demonstrate its positive impact for navigation.

  3. We instrument the Replica environments [85] on the Habitat platform [61] with acoustically correct sound renderings. This allows insertion of an arbitrary sound source and proper sensing of it from arbitrary agent receiver positions.

  4. We create a benchmark suite of tasks for audio-visual navigation to facilitate future work in this direction.

2 Related Work

Audio-visual learning.

The recent surge of research in audio-visual learning focuses on video rather than embodied perception. This includes interesting directions for synthesizing sounds for video [73, 18, 116], spatializing monaural sounds [33, 67], sound source separation [32, 72, 115, 29, 34], cross-modal feature learning [112, 113, 74], audio-visual target tracking [35, 9, 10, 3], and learning material properties [73]. Unlike prior work that localizes pixels in video frames associated with sounds [92, 84, 7, 47], our goal is to learn navigation policies to actively locate an audio target in a 3D environment. Unlike any of the above, our work addresses embodied audio-visual navigation, not learning from human-captured video.

Vision-based navigation.

The importance of vision in human navigation is well studied in neuroscience [27]: humans use their vision to build an internal representation, a cognitive map, and extract landmark information that facilitates navigation [93, 28, 1, 45]. Following these findings, recent visual AI agents aggregate egocentric visual inputs [118, 117, 65, 90], often with a spatio-temporal memory [41, 81, 46, 105, 98]. Visual navigation can be tied with other tasks to attain intelligent behavior, such as question answering [23, 25, 39, 99], active visual recognition [50, 110, 109], instruction following [88, 6, 86, 17, 16] and collaborative tasks with visual inputs [49, 48, 24]. Our work goes beyond visual perception to incorporate hearing, offering a novel perspective on navigation.

Audio-based navigation.

Cognitive science also confirms that audio is a strong navigational signal. Blind individuals may exhibit enhanced perceptual abilities to compensate for the loss of vision [89, 64]. Both blind and sighted individuals show comparable behavioral skills on cognitive tasks related to spatial navigation [31] and sound localization [40, 60, 79, 96]. Consequently, audio-based AR/VR equipment has been devised for auditory sensory substitution for obstacle avoidance and navigation [21, 100, 62, 37]. Additionally, caricature-like virtual 2D and 3D audio-visual environments help evaluate human learning of audio cues [20, 101, 63]. Unlike our proposed platform, these environments are non-photorealistic and do not support AI agent training. Prior studies with autonomous agents in simulated environments are restricted to human-constructed game boards, do not use acoustically correct sound models (e.g., they manually code the distance to the target as the pitch [103]), and train and test on the same environment [97, 103].

Sound localization in robotics.

In robotics, multi-microphone arrays are often used for sound source localization [70, 78, 69, 71]. Past studies fuse audio and visual cues for household surveillance [104, 76], speech recognition [111], human robot interaction [2, 95], and certain robotic manipulation tasks [80]. To our knowledge, ours is the first work to demonstrate improved navigation by an audio-visual agent in a visually and acoustically realistic 3D environment, and the first to introduce an end-to-end policy learning approach for the problem.

3D environments.

Recent research in embodied perception is greatly facilitated by new 3D environments and simulation platforms. Compared to artificial environments like video games [53, 59, 52, 106, 87], photorealistic environments portray 3D scenes in which real people would interact. Their realistic meshes can be rendered from agent-selected viewpoints to train and test RL policies for navigation in a reproducible manner [4, 13, 108, 56, 8, 85, 11, 107, 61]. None of the commonly used environments and simulators provide audio rendering. While the HoME simulator [11] report states support for acoustics, there are no instructions for its use and the environment is no longer actively maintained. To our knowledge, we present the first audio-visual simulator for AI agent training and the first study of audio-visual embodied agents in realistic 3D environments.

3 Audio Simulation Platform

Our audio platform augments the recently released AI-Habitat simulator [61], particularly the Replica Dataset [85] hosted within it. Habitat is an open-source 3D simulator released with a user-friendly API that supports RGB, depth, and semantic rendering. The API offers fast (over 10K fps) rendering and support for multiple datasets [85, 108, 14, 68, 23]. This has incentivized many embodied AI works to embrace it as the 3D simulator for training navigation and question answering agents [61, 15, 55, 38, 99].

Replica [85] is a Habitat-compatible dataset of 18 apartment, hotel, office, and room scenes with 3D meshes, high dynamic range (HDR) textures, and renderable reflector information. See Fig. 2. By extending this 3D platform with our audio simulator, we enable researchers to take advantage of the familiar and efficient API and easily adopt the audio modality for AI agent training along with other modalities that are readily available in the environment. Our audio platform and data will be shared publicly.

Room impulse response.

For each scene in the Replica dataset, we simulate the acoustics of the environment by pre-computing room impulse responses (RIR). The RIR is the transfer function between a sound source and microphone, which varies as a function of the room geometry, materials, and the sound source location [57].

Figure 2: Acoustic simulation. Room impulse responses are captured between each location pair within the illustrated grid (image shows ‘frl_apartment_0’ scene). In our platform agents can experience audio at densely sampled locations marked with yellow dots.

Let $\mathcal{S}$ denote the set of possible sound source positions, and let $\mathcal{L}$ denote the set of possible listener positions (i.e., agent microphones). We densely sample a grid of locations for both, then simulate the RIR for each possible source and listener placement at these locations, i.e., for every pair in $\mathcal{S} \times \mathcal{L}$. Having done so, we can look up any source-listener pair on-the-fly and render the sound by convolving the desired waveform with the selected RIR. The process of generating these RIRs entails constructing the grid, augmenting the scene mesh with materials, calculating sound propagation paths, and creating a connected graph for the navigational task, as we define next.
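The lookup-and-convolve step can be illustrated with a minimal sketch. The `rir_table` dictionary, its integer keys, and the toy impulse responses are hypothetical stand-ins for the pre-computed data; they are not the platform's actual API or file format.

```python
def convolve(signal, kernel):
    """Full discrete convolution of two 1-D sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

# Hypothetical table of pre-computed RIRs keyed by (source_index, listener_index).
rir_table = {
    (0, 1): [1.0, 0.0, 0.5],       # direct sound plus one echo two samples later
    (0, 2): [0.0, 0.6, 0.0, 0.2],  # farther listener: delayed and weaker
}

def render(waveform, source, listener):
    """Look up the RIR for a source-listener pair and apply it to a dry waveform."""
    return convolve(waveform, rir_table[(source, listener)])

dry = [1.0, 2.0]
heard = render(dry, source=0, listener=1)
```

Because the RIRs are fixed per pair, changing the emitted sound only requires a new convolution, not a new acoustic simulation.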

Grid construction.

We use an automatic point placement algorithm to determine the locations where the simulated sound sources and listeners are located. First, we compute an axis-aligned 3D bounding box of the environment. We sample points from a regular 2D square grid with resolution 0.5 m that slices the bounding box in the horizontal plane at a distance of 1.5 m from the floor (representing the height of a humanoid robot). Second, we prune points to remove those outside of the environment or in inaccessible locations (see Supp. for details). The outcome of this point placement algorithm is a set of points given by their 3D Cartesian coordinates. The Replica scenes range in area from 9.5 to 141.5 m² and yield correspondingly varying numbers of points.
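A minimal sketch of the two placement steps follows. The function name, the choice of y as the vertical axis, and the `is_valid` pruning predicate are assumptions for illustration; the actual pruning rules are described in the Supp.

```python
def sample_grid(bbox_min, bbox_max, resolution=0.5, height=1.5,
                is_valid=lambda p: True):
    """Sample a regular horizontal grid inside an axis-aligned bounding box.

    bbox_min and bbox_max are (x, y, z) corners, with y assumed vertical.
    is_valid is a hypothetical predicate that rejects points outside the
    environment or in inaccessible locations.
    """
    (x0, y0, z0), (x1, _, z1) = bbox_min, bbox_max
    nx = int((x1 - x0) / resolution) + 1
    nz = int((z1 - z0) / resolution) + 1
    points = []
    for i in range(nx):
        for k in range(nz):
            p = (x0 + i * resolution, y0 + height, z0 + k * resolution)
            if is_valid(p):
                points.append(p)
    return points

# A 1 m x 1 m footprint yields a 3 x 3 grid at 0.5 m resolution.
grid = sample_grid((0.0, 0.0, 0.0), (1.0, 3.0, 1.0))
```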

Mesh upgrades.

In addition to its geometry, a room’s materials affect the RIR. For example, imagine the difference in perceived sound of high heels walking across a room with smooth marble floors and tall glass windows, versus a room with shaggy carpet and heavy drapes on the windows. To capture this aspect, we use the semantic labels provided in Replica to determine the acoustic material properties of the geometry. For each semantic class that was deemed to be acoustically relevant, we provide a mapping to an equivalent acoustic material from an existing material database [26]. For the floor, wall, and ceiling classes, we assume acoustic materials of carpet, gypsum board, and acoustic tile, respectively. This helps simulate more realistic sounds than if a single material were assumed for all surfaces. In addition, we add a ceiling to those Replica scenes that lack one, which is necessary to simulate the acoustics accurately.

Acoustic simulation technique.

During the simulations, we compute the room impulse responses between all pairs of points, producing one RIR per source-listener pair. The simulation technique stems from the theory of geometric acoustics (GA), which supposes sound can be treated as a particle or ray rather than a wave [82]. This class of simulation methods is capable of accurately predicting the behavior of sound at high frequencies, but requires special modeling of wave phenomena (e.g., diffraction) that occur at lower frequencies. Specifically, our acoustic simulation is based on a bidirectional path tracing algorithm [94] modified for room acoustics applications [12]. Additionally, it uses a recursive formulation of multiple importance sampling (MIS) to improve the convergence of the simulation [36].

The simulation begins by tracing rays from each source location. These source rays are propagated through the scene up to a maximum number of bounces. At each ray-scene intersection of a source path, information about the intersected geometry, incoming and outgoing ray directions, and probabilities is cached. After all source rays are traced, the simulation traces rays from each listener location. These rays are again propagated through the scene up to a maximum number of bounces. At each ray-scene intersection of a listener path, rays are traced to connect the current path vertex to the path vertices previously generated from all sources. If a connection ray is not blocked by scene geometry, a path from the source to listener has been found. The energy throughput along that path is multiplied by a MIS weight and is accumulated to the impulse response for that source-listener pair. After all rays have been traced, the simulation is finished.

We perform the simulation in parallel for four logarithmically-distributed frequency bands: [0 Hz, 176 Hz], [176 Hz, 775 Hz], [775 Hz, 3409 Hz], and [3409 Hz, 20 kHz]. These bands cover the human hearing range and are uniform in their distribution from a perceptual standpoint. For each band, the simulation output is a histogram of sound energy with respect to propagation delay time at audio sample rate (44.1 kHz). Spatial information is also accumulated in the form of low-order spherical harmonics for each histogram bin. After ray tracing, these energy histograms are converted to pressure IR envelopes by applying the square root, and the envelopes are multiplied by bandpass-filtered white noise and summed to generate the frequency-dependent reverberant part of the monaural room impulse response.

Ambisonic signals (roughly speaking, the audio equivalent of a 360° image) are generated by decomposing a sound field into a set of spherical harmonic bases. We generate ambisonics by multiplying the monaural RIR by the spherical harmonic coefficients for each time sample. Early reflections (ER, i.e., low-order paths) are handled specially to ensure they are properly reproduced. ER are not accumulated to the main energy histogram, but are instead clustered together based on the plane equation of the geometry involved in the reflection(s). Then, each ER cluster is added to the final pressure IR with frequency-dependent filtering corresponding to the ER energy and its spherical harmonic coefficients.

The result of this process is second-order ambisonic pressure impulse responses that can be convolved with arbitrary new monaural source audios to generate the ambisonic audio heard at a particular listener location. We convert the ambisonics to binaural audio [114] in order to represent an agent with two human-like ears, for whom perceived sound depends on the body’s relative orientation in the scene.

Connectivity graph.

While a listener and source can be placed at any of the points of the aforementioned grid, an agent might not be able to stand at each of these locations due to embodiment constraints. Hence we create a graph capturing the reachability and connectivity of these locations. The graph is constructed in two steps: first we remove nodes that are non-navigable, then for each node pair (A, B), we consider the edge as valid if and only if the Euclidean distance between A and B equals the grid resolution of 0.5 m (i.e., nodes A and B are immediate neighbors) and the geodesic distance between them is also 0.5 m (i.e., no obstacle in between).
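The two-step construction can be sketched as follows. The `navigable` and `blocked` predicates are hypothetical and supplied by the caller; `blocked` stands in for the geodesic-distance check, which in practice would query the simulator.

```python
import math

def build_graph(points, navigable, blocked, step=0.5, tol=1e-6):
    """Return {node_index: [neighbor indices]} over the navigable grid points.

    navigable(p) and blocked(p, q) are hypothetical predicates; blocked
    replaces the geodesic-distance test described in the text.
    """
    nodes = [i for i, p in enumerate(points) if navigable(p)]
    graph = {i: [] for i in nodes}
    for a in nodes:
        for b in nodes:
            if a >= b:
                continue
            is_neighbor = abs(math.dist(points[a], points[b]) - step) < tol
            if is_neighbor and not blocked(points[a], points[b]):
                graph[a].append(b)
                graph[b].append(a)
    return graph

# Three points in a row; a wall separates the last two.
pts = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
wall = lambda p, q: {p, q} == {(0.5, 0.0), (1.0, 0.0)}
graph = build_graph(pts, navigable=lambda p: True, blocked=wall)
```

The pairwise loop is quadratic in the number of points, which is acceptable for a sketch; a grid-indexed lookup of the four neighbors would scale better.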

Please see Supp. section for additional implementation details, including all simulation parameters.

4 Task Definitions: Audio-Visual Navigation

In this section, we propose two novel navigation tasks: AudioGoal Navigation and AudioPointGoal Navigation. In AudioGoal, the agent hears an audio source located at the goal—such as a phone ringing—but receives no direct position information about the goal. AudioPointGoal is an audio extension of the PointGoal task studied often in the literature [5, 61, 38, 107, 55, 15] where the agent hears the source and is told its displacement from the starting position.

Figure 3: Navigation network architecture

Task definitions.

For PointGoal, as described in [5, 61], a randomly initialized agent is tasked with navigating to a point goal defined by a static displacement vector relative to the starting position of the agent. (Static means the cue is received once at the start of the episode; dynamic means the cue is received at every time step [61].) To successfully navigate to the target and avoid obstacles, the agent needs to reach the target using sensory inputs alone, i.e., no map of the scene is provided to the agent. For AudioGoal, the target is instead defined by only a dynamic short audio clip; the agent does not receive a displacement vector pointing to the target. This audio signal is updated as a function of the location of the agent, the location of the goal, and the structure and materials of the room. In AudioPointGoal, the agent receives the union of information received in the PointGoal and AudioGoal tasks, i.e., dynamic audio as well as a static point vector. Note that physical obstacles (walls, furniture) typically exist along the displacement vector; the agent must sense them while navigating.

Agent and goal embodiment.

We build our platform on AI-Habitat, and hence share the same agent embodiment of a cylinder. The target has a fixed diameter and height and, just as in PointGoal, no visual presence. While the goal itself does not have a visible embodiment, vision (particularly in understanding depth) is essential to detect and avoid obstacles and move towards the target. Hence, all the tasks have a crucial vision component.

Action space.

The action space is: MoveForward, TurnLeft, TurnRight, and Stop. The last three actions are always valid. The MoveForward action is invalid when the agent attempts to traverse from one node to another without an edge connecting them (as per the graph defined in Sec. 3). If valid, the MoveForward action takes the agent forward by 0.5 m (one grid step). There is no noise in the actuation, i.e., a step executes perfectly or does not execute at all.
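A minimal sketch of these noiseless dynamics on the grid graph is given below. The 90-degree turn angle, the integer cell coordinates, and the edge representation are assumptions made for illustration and are not stated in the text above.

```python
# Grid offsets for four assumed headings, in clockwise order.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def step(state, action, edges):
    """Apply one action. state = ((col, row), heading_index).

    edges is a set of frozenset({cell_a, cell_b}) pairs, a hypothetical
    encoding of the connectivity graph.
    """
    pos, heading = state
    if action == "TurnLeft":
        return pos, (heading - 1) % 4
    if action == "TurnRight":
        return pos, (heading + 1) % 4
    if action == "MoveForward":
        dx, dy = HEADINGS[heading]
        target = (pos[0] + dx, pos[1] + dy)
        if frozenset((pos, target)) in edges:
            return target, heading
    # Stop, or an invalid MoveForward, leaves the state unchanged.
    return state

edges = {frozenset(((0, 0), (0, 1)))}
moved = step(((0, 0), 0), "MoveForward", edges)
stuck = step(((0, 0), 1), "MoveForward", edges)
```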


Sensors.

The sensory inputs are binaural sound (absent in PointGoal), GPS, RGB, and depth. To capture binaural spatial sound, the agent emulates two microphones placed at human head height. We assume an idealized GPS sensor, following prior work [61, 15, 38, 55]. However, as we will demonstrate in the results, our audio-based learning provides a steady navigation signal that makes it feasible to disable the GPS sensor altogether for the proposed AudioGoal task.

Episode specification.

An episode of PointGoal is defined by an arbitrary 1) scene, 2) agent start location, 3) agent start rotation, and 4) goal location. In each episode the agent can reach the target if it navigates successfully. An episode for AudioGoal and AudioPointGoal additionally includes a source audio waveform. The waveform is convolved with the RIR corresponding to the specific scene, goal, agent location and orientation to generate dynamic audio for the agent. We consider a variety of audio sources, both familiar and unfamiliar to the agent (defined below). An episode is considered successful if an agent executes the Stop action while being exactly at the location of the goal. Agents are allowed a fixed time horizon of actions for all tasks, similar to [61, 49, 15, 38, 55]. Evaluation metrics and episode dataset preparation are detailed in Sec. 6.

5 Navigation Network and Training

To navigate autonomously, the agent must be able to enter a new yet-unmapped space, accumulate partial observations of the environment over time, and efficiently transport itself to a goal location. Consistent with recent embodied visual navigation work [117, 42, 41, 5, 66, 61], we take a deep reinforcement learning approach but introduce audio to the observation. During training, the agent is rewarded for correctly and efficiently navigating to the target. This yields a policy that maps new observations to agent actions.

Sensory inputs.

The audio input is a spectrogram, following literature in audio learning [74, 115, 33]. Specifically, to represent the agent’s binaural audio inputs (corresponding to the left and right ear), we first compute the Short-Time Fourier Transform (STFT) with a windowed signal length of 2048 samples, which corresponds to a physical duration of 93 milliseconds at a sample rate of 22050 Hz. Using the first 1000 milliseconds of audio as input, the STFT gives a complex-valued matrix; we take its magnitude and downsample along the frequency axis. For better contrast we take its logarithm. Finally, we stack the left and right audio channel matrices to obtain a two-channel tensor. The visual input is the RGB and/or depth image, 128 × 128 × 3 and 128 × 128 × 1 tensors, respectively, where 128 is the image resolution for the agent’s field of view. The relative displacement vector points from the agent to the goal in the 2D ground plane of the scene.
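The audio preprocessing can be sketched in plain Python. The hop length, the Hann window, the decimation used for frequency downsampling, and the `eps` offset are all assumptions, since the text above does not specify them; a practical implementation would use an FFT library rather than this direct DFT.

```python
import cmath
import math

def log_spectrogram(samples, win=2048, hop=512, freq_stride=1, eps=1e-8):
    """Log-magnitude STFT for one audio channel.

    Returns a list of frames; each frame holds the kept frequency bins.
    hop, freq_stride and eps are illustrative assumptions.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win) for n in range(win)]
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        seg = [samples[start + n] * window[n] for n in range(win)]
        bins = []
        # Decimate the one-sided spectrum, then take the log of the magnitude.
        for k in range(0, win // 2 + 1, freq_stride):
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win)
                      for n in range(win))
            bins.append(math.log(abs(acc) + eps))
        frames.append(bins)
    return frames

# Duration of one 2048-sample window at 22050 Hz, in milliseconds.
window_ms = 1000 * 2048 / 22050

# Tiny demonstration: a tone centered on bin 2 of an 8-sample window.
tone = [math.cos(2 * math.pi * 2 * n / 8) for n in range(16)]
frames = log_spectrogram(tone, win=8, hop=4)
```

Running the function once per ear and stacking the two results gives the two-channel audio tensor described above.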

Which specific subset of these three inputs (audio, visual, vector) the agent receives depends on the agent’s sensors and the goal’s characterization (cf. Sec. 4). The sensory inputs are transformed to a distribution over the action space by the policy network, as we describe next.

Network architecture.

Next we define the parameterization of the agent’s policy $\pi_\theta(a_t \mid o_t, h_{t-1})$, which selects action $a_t$ given the current observation $o_t$ and aggregated past states $h_{t-1}$, and the value function $V_\theta(o_t, h_{t-1})$, which scores how good the current state is. Here $\theta$ refers to all trainable weights of the network.

We extend the network architecture used in [61, 102, 23, 49, 24, 98] to also process audio input. As highlighted in Figure 3, we transform the audio and visual inputs with corresponding CNNs. The CNNs have separate weights but the same architecture of three convolutional layers and a linear layer, with ReLU activations between each layer. The outputs of the CNNs are fixed-length feature vectors. These are concatenated with the relative displacement vector and transformed by a gated recurrent unit (GRU) [19]. The GRU operates on the current step’s input as well as the accumulated history of states $h_{t-1}$. The GRU updates the history to $h_t$ and outputs the representation of the agent’s state $s_t$. Finally, the value of the state and the policy distribution are estimated using the critic and actor heads of the model. Both are linear layers.


Training.

We train the full network using Proximal Policy Optimization (PPO) [83]. The agent is rewarded for reaching the goal quickly. Specifically, the agent receives a positive reward for executing Stop at the goal location, a negative reward per time step, a positive reward for reducing the geodesic distance to the goal, and the equivalent penalty for increasing it. We add an entropy maximization term to the cumulative reward optimization, for better exploration of the action space [43, 83, 61].
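The reward structure can be written as a small function. The three constants below are hypothetical placeholders chosen only to make the sketch runnable; they should not be read as the values used in the experiments.

```python
def step_reward(prev_geodesic, new_geodesic, stopped_at_goal,
                success_reward=10.0, time_penalty=0.01, distance_reward=1.0):
    """Per-step reward following the description above.

    All three constants are hypothetical placeholder values.
    """
    reward = -time_penalty               # cost of taking one more step
    if new_geodesic < prev_geodesic:     # moved closer along the geodesic
        reward += distance_reward
    elif new_geodesic > prev_geodesic:   # moved farther away
        reward -= distance_reward
    if stopped_at_goal:                  # Stop executed at the goal location
        reward += success_reward
    return reward

closer = step_reward(2.0, 1.5, stopped_at_goal=False)
farther = step_reward(1.5, 2.0, stopped_at_goal=False)
```

The per-step penalty is what makes shorter successful paths score higher than longer ones.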

Role of audio for navigation.

Because our agent can both hear and see, it has the potential to not only better localize the target (which emits sound), but also better plan its movements in the environment (whose major structures, walls, furniture, etc. all affect how the sound is perceived). See Figure 1. The optimal policy would trace a path corresponding to monotonically decreasing geodesic distance to the goal. Notably, the displacement vector does not specify the optimal policy: moving along it decreases the Euclidean distance but may decrease or increase the geodesic distance to the goal at each time step. For example, if the goal is behind the sofa, the agent must move around the sofa to reach it. Importantly, the audio stream has complementary and potentially stronger information than the displacement vector in this regard. Not only does the intensity of the audio source suggest the Euclidean distance to the target, but the geometry of the room captured in the acoustics also reveals geodesic distances.
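The gap between Euclidean and geodesic distance in the sofa example can be made concrete with a toy grid. The 3 x 3 room, the cell layout, and the 4-connected movement model are assumptions for illustration only.

```python
import math
from collections import deque

def geodesic_distance(free_cells, start, goal, step=0.5):
    """Shortest 4-connected path length over free grid cells, in meters."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (x, y), hops = queue.popleft()
        if (x, y) == goal:
            return hops * step
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in free_cells and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return math.inf

# 3 x 3 room; cells (1, 0) and (1, 1) are occupied by a "sofa".
free = {(x, y) for x in range(3) for y in range(3)} - {(1, 0), (1, 1)}
start, goal = (0, 0), (2, 0)
euclidean = 0.5 * math.dist(start, goal)
geodesic = geodesic_distance(free, start, goal)
```

Here the straight-line distance is 1.0 m while the shortest walkable path is 3.0 m, and the first step of that path moves away from the straight line to the goal.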

Implementation details.

The audio, visual, and displacement-vector features and the final state each have a fixed length. The GRU has one recurrent layer. We optimize using Adam [54] with PyTorch defaults for the momentum coefficients. We discount rewards with a fixed decay factor. Training the network amounts to 315 GPU hours.

6 Experiments

Our main objectives are to show:

  1. Tackling navigation with both sight and sound (i.e., the proposed AudioGoal and AudioPointGoal tasks) leads to better navigation and faster learning. In particular, AudioPointGoal exceeds PointGoal, demonstrating that audio has complementary information beyond goal coordinates that facilitates better navigation.

  2. Listening for an audio target in a 3D environment serves as a viable alternative to GPS-based cues. Not only does the AudioGoal agent navigate better than the PointGoal agent, it does so without PointGoal’s assumption of perfect odometry.

  3. Audio-visual navigation can generalize to both new environments and new sound sources. In particular, audio-visual agents can navigate better with audio even when the sound sources are unfamiliar.

Episodes and dataset splits.

We divide the 18 newly audio-enabled scenes in the Replica dataset into train/validation/test splits to obtain a fairly uniform distribution of each category (apartments, office, room, and hotel) per split. See Table 1.


Scene IDs by type

Split   Apt.   FRL_Apt.   Office   Room   Hotel   # Episodes
Train   0      0,1,2,3    0,4      0      0       107,362
Val     2      5          3        2      -       250
Test    1      4          1,2      1      -       500

Table 1: The scene IDs for each of the 5 types in Replica for each data split and the number of episodes per split. E.g., the only hotel scene, hotel_0, is included in the train split.

Each episode consists of the tuple: scene, agent start location, agent start rotation, goal location, audio waveform. Note that PointGoal ignores the audio. We generate episodes by choosing a scene and a random start and goal location. To eliminate easier episodes, we prune those that are either too short (geodesic distance less than 4) or can be completed by moving mostly in a straight line (ratio of geodesic to Euclidean distance below a fixed threshold). For computational tractability, the number of episodes is on par with previous embodied agent research [23, 49, 61].
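The two pruning rules amount to a simple filter. The function name and the `min_ratio` default are hypothetical; only the minimum geodesic distance of 4 comes from the text above.

```python
def keep_episode(geodesic, euclidean, min_geodesic=4.0, min_ratio=1.1):
    """Return True if an episode is long enough and not nearly a straight line.

    min_ratio is a hypothetical placeholder for the ratio threshold.
    """
    if geodesic < min_geodesic:
        return False  # too short
    if euclidean > 0 and geodesic / euclidean < min_ratio:
        return False  # almost a straight line
    return True

candidates = [(3.0, 1.0), (5.0, 5.0), (6.0, 4.0)]
kept = [pair for pair in candidates if keep_episode(*pair)]
```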

Sound sources.

Recall that the RIRs can be convolved with an arbitrary input waveform, which gives the option to vary the sounds across episodes. We use 12 copyright-free natural sounds, including a telephone, music, a fan, and others (see the project webpage video for examples). Unless otherwise specified, the sound source is the telephone ringing. We stress that in all experiments, the environment (scene) at test time has never been seen previously in training. It is valid for sounds heard in training to also be heard at test time, e.g., a phone ringing in multiple environments will sound different depending on both the 3D space and the goal and agent positions. Experiments for objective 3 examine the impact of varied train/test sounds.


Metrics.

We use the success rate normalized by inverse path length (SPL), the standard evaluation metric for navigation [5]. For the success rate, we consider an episode successful only if the agent reaches the goal and executes the Stop action. To analyze agent behaviors we also examine the distance and relative angle to the goal upon stopping.
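For reference, SPL as defined in [5] averages, over episodes, the success indicator weighted by the ratio of the shortest path length to the longer of the agent's path and the shortest path. A minimal sketch, with a hypothetical tuple layout for the episode records:

```python
def spl(episodes):
    """Compute SPL from (success, shortest_path_length, agent_path_length) tuples."""
    total, count = 0.0, 0
    for success, shortest, taken in episodes:
        count += 1
        if success:
            total += shortest / max(taken, shortest)
    return total / count

# One optimal success, one success with a path twice as long, one failure.
score = spl([(True, 5.0, 5.0), (True, 5.0, 10.0), (False, 5.0, 3.0)])
```

A failed episode contributes zero regardless of how short the agent's path was, so SPL can never exceed the plain success rate.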

         PointGoal   AudioPointGoal
Blind    0.451       0.647
RGB      0.465       0.735
Depth    0.592       0.749
Table 2: Adding sound to sight and GPS sensing improves navigation performance significantly. Values are SPL; higher is better.
Figure 4: Navigation trajectories on top-down maps. Panels: (a, d) PointGoal; (b, e) AudioGoal; (c, f) AudioPointGoal. The blue square and arrow denote the agent’s starting and ending positions, respectively. Yellow squares denote the goal, and the agent turns yellow once it reaches the goal. The agent path color fades from dark blue to light blue as time goes by. The pink path indicates the shortest geodesic path in continuous space. Top: the PointGoal agent bumps into the wall several times trying to move towards the target, unable to figure out that the target is actually located in another room. In contrast, the AudioGoal and AudioPointGoal agents better sense the target: the sound travels through the door and the agent leaves the starting room immediately. Bottom: all three models successfully reach the goal, but the PointGoal agent first roams around before gradually moving closer to the target position. In comparison, the AudioGoal agent finds the goal more quickly, and AudioPointGoal is even faster by taking the shortest path. Best viewed in color.

1: Does audio help navigation?

First we evaluate the impact of adding audio sensing to visual navigation by comparing PointGoal and AudioPointGoal agents. Table 2 compares the navigation performance (in SPL) for both agents on the test environments. We consider three visual sensing capabilities: no visual input (Blind), raw RGB images, or depth images. (RGB+D was no better than depth alone.)

Audio improves accuracy significantly, showing the clear value in multi-modal perception for navigation. We see that both agents do better with stronger visual inputs (depth being the strongest), though AudioPointGoal performs similarly with either RGB or depth, suggesting that audio-visual learning captures geometric structure from the raw images more easily than a model equipped with RGB vision alone. We also find that the audio-visual agents train more quickly than the visual PointGoal agents (see Supp.), indicating that the richer observations are successfully integrated in our model.

To see how audio influences navigation behavior, Figure 4 shows example trajectories. See the project webpage video for more.

2: Can audio supplant GPS for an audio target?

Next we explore the extent to which audio supplies the spatial cues available from perfect or noisy GPS sensing during (audio-)visual navigation. This test requires comparing PointGoal to AudioGoal. Recall that unlike (Audio)PointGoal, AudioGoal receives no displacement vector pointing to the goal; it can only hear and see.

Figure 5(a) reports the navigation accuracy as a function of GPS sensing quality. The leftmost point corresponds to perfect GPS that tells the PointGoal agents (but not the AudioGoal agent) the exact direction of the goal; for subsequent points, Gaussian noise of increasing variance is added to the vectors up to m. While AudioGoal’s accuracy is by definition independent of GPS failures, the others suffer noticeably. This is evidence that the audio signal gives similar or even better spatial cues than the PointGoal displacements, which are potentially overly optimistic given the unreliability of GPS in practice.
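The noisy-GPS condition can be sketched as adding zero-mean Gaussian noise to the displacement vector before it is given to the agent. This is an assumption-laden illustration: the noise is taken to be independent per coordinate, the helper `noisy_displacement` is a hypothetical name, and the standard deviations in the loop are illustrative values rather than the levels used in the experiment.

```python
import numpy as np


def noisy_displacement(displacement, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma (in meters)
    independently to each coordinate of the goal displacement vector."""
    displacement = np.asarray(displacement, dtype=float)
    return displacement + rng.normal(0.0, sigma, size=displacement.shape)


rng = np.random.default_rng(0)
true_vec = np.array([3.0, -1.0])  # toy displacement to the goal, in meters

# sigma = 0 reproduces perfect GPS; larger values degrade the goal vector.
for sigma in (0.0, 0.5, 1.0):  # illustrative noise levels only
    print(sigma, noisy_displacement(true_vec, sigma, rng))
```

An agent that relies on this vector alone inherits the noise directly, whereas the audio observation is unaffected by it.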

Figure 6 reinforces this finding: our learned audio features naturally encode the distance and angles to the goal.

Figure 5(b) shows the distances to the goal achieved by each agent for those episodes in which the agent failed to stop at the goal. Interestingly, the audio-enabled agents, particularly the AudioGoal agent, often still finish close to the goal. This is encouraging, and it also suggests that the agent relying on audio alone to sense the goal has trouble declaring Stop at exactly the right place, because the goal sounds similar once the agent is quite close to it. Having visible goals would address this issue, but is not yet supported in Habitat. In contrast, failure episodes for PointGoal often end much farther from the goal.

Figure 5: (a) From perfect to noisy GPS: navigation accuracy as GPS becomes increasingly noisy. (b) Distribution of distances to the goal for failed episodes per agent. Please see text.
Figure 6: 2D t-SNE projection of audio features learned by the AudioGoal agent, color coded to reveal their correlation with the goal location. Audio helps our agent capture (a) the distance to the goal, i.e., whether the source is far (red) or near (violet), and (b) the direction of incoming audio, i.e., whether the source is to the left of the agent (blue) or to the right (red). The visualization suggests the distance cue is more reliable than the orientation cue, as expected.

3: What is the effect of different sound sources?

Finally, we analyze the impact of the sound source, both its familiarity and its type. First, we explore generalization to novel sounds. We train the AudioGoal (AG) and AudioPointGoal (APG) agents using a mix of sounds (fan, canon, horn, engine, radio, telephone), then validate and test on two disjoint splits of sounds unheard in training (see Supp. for details). In all cases, the test environments are unseen. Table 3 shows the results. As we move left to right in the table, the sound generalization task gets harder: from a single heard sound, to variable heard sounds, to variable unheard sounds. Our APG agents always outperform the PointGoal agent, even for unheard test sounds, strengthening the conclusions from Table 2. APG performs similarly on heard and unheard sounds, suggesting it has learned well to balance all three modalities.

On the other hand, AG’s accuracy declines somewhat with varied heard sounds, and declines substantially with varied unheard sounds. While it makes sense that the task of following an unfamiliar sound is harder, we also expect that larger training repositories of more sounds will resolve much of this decline.

Finally, we study how audio content affects the AudioGoal learning performance. We take the AudioGoal model above that is trained on varied sounds and test it using instances of any one of those sounds in new environments. Figure 7 shows the resulting SPL accuracies and the spectrograms per test sound. Notably, the sounds with higher entropy that activate a wider frequency band (like telephone and radio) provide a stronger learning signal than those without (like fan or canon). This is consistent with the properties known in the audio literature to best reveal perceptual differences [51].
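One way to quantify how widely a sound spreads its energy over frequency is the Shannon entropy of its normalized power spectrum. The sketch below is only an illustration of that idea under our own assumptions: the paper does not state which entropy measure it has in mind, and the tone and noise signals are synthetic stand-ins, not the sounds used in the experiments.

```python
import numpy as np


def spectral_entropy(waveform):
    """Shannon entropy (in bits) of the normalized power spectrum."""
    power = np.abs(np.fft.rfft(waveform)) ** 2
    total = power.sum()
    if total == 0.0:
        return 0.0  # silent signal: no spectral content
    p = power / total
    p = p[p > 0]  # drop empty bins to avoid log2(0)
    return float(-(p * np.log2(p)).sum())


rng = np.random.default_rng(0)
t = np.arange(4096) / 16000.0            # 4096 samples at an assumed 16 kHz rate
tone = np.sin(2 * np.pi * 500.0 * t)     # narrow-band: energy in a single bin
noise = rng.standard_normal(t.size)      # broadband: energy spread over all bins

tone_entropy = spectral_entropy(tone)
noise_entropy = spectral_entropy(noise)
```

The broadband signal has a much higher spectral entropy than the pure tone, mirroring the distinction drawn above between wide-band sounds and narrow-band ones.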

         Same sound                 Varied heard sounds   Varied unheard sounds
         PG       AG       APG      AG       APG          AG       APG
Blind    0.451    0.574    0.647    0.499    0.711        0.215    0.659
RGB      0.465    0.598    0.735    0.589    0.600        0.196    0.546
Depth    0.592    0.742    0.749    0.620    0.747        0.244    0.737
Table 3: Navigation performance (SPL) when generalizing to unheard sounds. PG, AG, and APG denote the PointGoal, AudioGoal, and AudioPointGoal agents. Higher is better. Please see text. Results are averaged over 7 test runs, and all standard deviations are

Fan Canon Horn Engine Radio Telephone
0.531 0.562 0.615 0.654 0.658 0.678

Figure 7: Navigation accuracy (SPL) for different target sounds (top) and their source spectrograms (bottom). Sounds with wider frequency distributions provide richer audio input for navigation.

7 Conclusion

We introduced audio-visual navigation in complex 3D environments. Generalizing a state-of-the-art deep RL navigation engine for this task, we presented encouraging results for audio’s role in the embodied visual navigation task. Our work also enables audio rendering for the publicly available Replica environments, which can facilitate future work in the field. In future work it will be interesting to consider multi-agent scenarios, moving sound-emitting targets, and navigating in the context of dynamic audio events.

8 Acknowledgements

The authors are grateful to Alexander Schwing, Dhruv Batra, Erik Wijmans, Oleksandr Maksymets, Ruohan Gao, and Svetlana Lazebnik for valuable discussions and support with the AI-Habitat platform.


  • [1] G. K. Aguirre, J. A. Detre, D. C. Alsop, and M. D’Esposito. The parahippocampus subserves topographical learning in man. Cerebral cortex, 1996.
  • [2] X. Alameda-Pineda and R. Horaud. Vision-guided robot hearing. The International Journal of Robotics Research, 2015.
  • [3] X. Alameda-Pineda, J. Staiano, R. Subramanian, L. Batrinca, E. Ricci, B. Lepri, O. Lanz, and N. Sebe. Salsa: A novel dataset for multimodal group behavior analysis. TPAMI, 2015.
  • [4] P. Ammirato, P. Poirson, E. Park, J. Kosecka, and A. Berg. A dataset for developing and benchmarking active vision. In ICRA, 2016.
  • [5] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, and A. R. Zamir. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
  • [6] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.
  • [7] R. Arandjelovic and A. Zisserman. Objects that sound. In ECCV, 2018.
  • [8] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese. Joint 2D-3D-Semantic Data for Indoor Scene Understanding. ArXiv e-prints, Feb. 2017.
  • [9] Y. Ban, L. Girin, X. Alameda-Pineda, and R. Horaud. Exploiting the Complementarity of Audio and Visual Data in Multi-Speaker Tracking. In ICCV Workshop on Computer Vision for Audio-Visual Media, 2017.
  • [10] Y. Ban, X. Li, X. Alameda-Pineda, L. Girin, and R. Horaud. Accounting for room acoustics in audio-visual multi-speaker tracking. In ICASSP, 2018.
  • [11] S. Brodeur, E. Perez, A. Anand, F. Golemo, L. Celotti, F. Strub, J. Rouat, H. Larochelle, and A. Courville. HoME: a Household Multimodal Environment, 2017.
  • [12] C. Cao, Z. Ren, C. Schissler, D. Manocha, and K. Zhou. Interactive sound propagation with bidirectional path tracing. ACM Transactions on Graphics (TOG), 35(6):180, 2016.
  • [13] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In 3DV, 2017.
  • [14] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In 3DV, 2017.
  • [15] D. S. Chaplot, S. Gupta, A. Gupta, and R. Salakhutdinov. Learning to explore using active neural mapping, 2020.
  • [16] D. S. Chaplot, K. M. Sathyendra, R. K. Pasumarthi, D. Rajagopal, and R. R. Salakhutdinov. Gated-attention architectures for task-oriented language grounding. In AAAI, 2018.
  • [17] H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In CVPR, 2019.
  • [18] L. Chen, S. Srivastava, Z. Duan, and C. Xu. Deep cross-modal audio-visual generation. In Proceedings of the on Thematic Workshops of ACM Multimedia 2017. ACM, 2017.
  • [19] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In NeurIPS, 2015.
  • [20] E. C. Connors, L. A. Yazzolino, J. Sánchez, and L. B. Merabet. Development of an audio-based virtual gaming environment to assist with navigation skills in the blind. JoVE (Journal of Visualized Experiments), 2013.
  • [21] C. Cruz-Neira, D. J. Sandin, T. A. DeFanti, R. V. Kenyon, and J. C. Hart. The cave: audio visual experience automatic virtual environment. Communications of the ACM, 1992.
  • [22] J. Daniel. Spatial sound encoding including near field effect: Introducing distance coding filters and a viable, new ambisonic format. In Audio Engineering Society Conference: 23rd International Conference: Signal Processing in Audio Recording and Reproduction. Audio Engineering Society, 2003.
  • [23] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied Question Answering. In CVPR, 2018.
  • [24] A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau. Tarmac: Targeted multi-agent communication. In ICML, 2019.
  • [25] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Neural Modular Control for Embodied Question Answering. In ECCV, 2018.
  • [26] M. D. Egan, J. Quirt, and M. Rousseau. Architectural acoustics, 1989.
  • [27] A. D. Ekstrom. Why vision is important to how we navigate. Hippocampus, 2015.
  • [28] A. D. Ekstrom, M. J. Kahana, J. B. Caplan, T. A. Fields, E. A. Isham, E. L. Newman, and I. Fried. Cellular networks underlying human spatial navigation. Nature, 2003.
  • [29] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. In SIGGRAPH, 2018.
  • [30] C. Evers and P. Naylor. Acoustic slam. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
  • [31] M. Fortin, P. Voss, C. Lord, M. Lassonde, J. Pruessner, D. Saint-Amour, C. Rainville, and F. Lepore. Wayfinding in the blind: larger hippocampal volume and supranormal spatial navigation. Brain, 2008.
  • [32] R. Gao, R. Feris, and K. Grauman. Learning to separate object sounds by watching unlabeled video. In ECCV, 2018.
  • [33] R. Gao and K. Grauman. 2.5D visual sound. In CVPR, 2019.
  • [34] R. Gao and K. Grauman. Co-separating sounds of visual objects. In ICCV, 2019.
  • [35] I. D. Gebru, S. Ba, G. Evangelidis, and R. Horaud. Tracking the active speaker based on a joint audio-visual observation model. In ICCV Workshops, pages 15–21, 2015.
  • [36] I. Georgiev. Implementing vertex connection and merging. Technical Report, Saarland University, 2012. Accessed May 22, 2018.
  • [37] A. R. Golding and N. Lesh. Indoor navigation using a diverse set of cheap, wearable sensors. In Digest of Papers. Third International Symposium on Wearable Computers. IEEE, 1999.
  • [38] D. Gordon, A. Kadian, D. Parikh, J. Hoffman, and D. Batra. Splitnet: Sim2sim and task2task transfer for embodied visual navigation. ICCV, 2019.
  • [39] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. IQA: Visual Question Answering in Interactive Environments. In CVPR, 2018.
  • [40] F. Gougoux, R. J. Zatorre, M. Lassonde, P. Voss, and F. Lepore. A functional neuroimaging study of sound localization: visual cortex activity predicts performance in early-blind individuals. PLoS biology, 2005.
  • [41] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2017.
  • [42] S. Gupta, D. Fouhey, S. Levine, and J. Malik. Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125, 2017.
  • [43] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
  • [44] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2004.
  • [45] T. Hartley, E. A. Maguire, H. J. Spiers, and N. Burgess. The well-worn route and the path less traveled: distinct neural bases of route following and wayfinding in humans. Neuron, 2003.
  • [46] J. F. Henriques and A. Vedaldi. Mapnet: An allocentric spatial memory for mapping environments. In CVPR, 2018.
  • [47] J. R. Hershey and J. R. Movellan. Audio vision: Using audio-visual synchrony to locate sounds. In NeurIPS, 2000.
  • [48] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, et al. Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 2019.
  • [49] U. Jain, L. Weihs, E. Kolve, M. Rastegari, S. Lazebnik, A. Farhadi, A. G. Schwing, and A. Kembhavi. Two body problem: Collaborative visual task completion. In CVPR, 2019.
  • [50] D. Jayaraman and K. Grauman. End-to-end policy learning for active visual categorization. TPAMI, 2018.
  • [51] H. Jeong and Y. Lam. Source implementation to eliminate low-frequency artifacts in finite difference time domain room acoustic simulation. Journal of the Acoustical Society of America, 131(1):258–268, 2012.
  • [52] M. Johnson, K. Hofmann, T. Hutton, and D. Bignell. The malmo platform for artificial intelligence experimentation. In Intl. Joint Conference on AI, 2016.
  • [53] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In Proc. IEEE Conf. on Computational Intelligence and Games, 2016.
  • [54] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [55] N. Kojima and J. Deng. To learn or not to learn: Analyzing the role of learning for navigation in virtual environments. arXiv preprint arXiv:1907.11770, 2019.
  • [56] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017.
  • [57] H. Kuttruff. Room acoustics. CRC Press, 2016.
  • [58] K. H. Kuttruff. Auralization of impulse responses modeled on the basis of ray-tracing results. Journal of the Audio Engineering Society, 41(11):876–880, 1993.
  • [59] A. Lerer, S. Gross, and R. Fergus. Learning physical intuition of block towers by example. In ICML, 2016.
  • [60] N. Lessard, M. Paré, F. Lepore, and M. Lassonde. Early-blind human subjects localize sound sources better than sighted subjects. Nature, 1998.
  • [61] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra. Habitat: A Platform for Embodied AI Research. In ICCV, 2019.
  • [62] D. Massiceti, S. L. Hicks, and J. J. van Rheede. Stereosonic vision: Exploring visual-to-auditory sensory substitution mappings in an immersive virtual reality navigation paradigm. PloS one, 2018.
  • [63] L. Merabet and J. Sanchez. Audio-based navigation using virtual environments: combining technology and neuroscience. AER Journal: Research and Practice in Visual Impairment and Blindness, 2009.
  • [64] L. B. Merabet and A. Pascual-Leone. Neural reorganization following sensory loss: the opportunity of change. Nature Reviews Neuroscience, 2010.
  • [65] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al. Learning to navigate in complex environments. In ICLR, 2017.
  • [66] D. Mishkin, A. Dosovitskiy, and V. Koltun. Benchmarking classic and learned navigation in complex 3d environments. arXiv preprint arXiv:1901.10915, 2019.
  • [67] P. Morgado, N. Vasconcelos, T. Langlois, and O. Wang. Self-supervised generation of spatial audio for 360 video. In NeurIPS, 2018.
  • [68] A. Murali, T. Chen, K. V. Alwala, D. Gandhi, L. Pinto, S. Gupta, and A. Gupta. Pyrobot: An open-source robotics framework for research and benchmarking. arXiv preprint arXiv:1906.08236, 2019.
  • [69] K. Nakadai, T. Lourens, H. G. Okuno, and H. Kitano. Active audition for humanoid. In AAAI, 2000.
  • [70] K. Nakadai and K. Nakamura. Sound source localization and separation. Wiley Encyclopedia of Electrical and Electronics Engineering, 1999.
  • [71] K. Nakadai, H. G. Okuno, and H. Kitano. Epipolar geometry based sound localization and extraction for humanoid audition. In IROS Workshops, 2001.
  • [72] A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In ECCV, 2018.
  • [73] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In CVPR, 2016.
  • [74] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In ECCV, 2016.
  • [75] L. Picinali, A. Afonso, M. Denis, and B. Katz. Exploration of architectural spaces by blind people using auditory virtual reality for the construction of spatial knowledge. International Journal of Human-Computer Studies, 72(4):393–407, 2014.
  • [76] J. Qin, J. Cheng, X. Wu, and Y. Xu. A learning based approach to audio surveillance in household environment. International Journal of Information Acquisition, 2006.
  • [77] B. Rafaely. Fundamentals of spherical array processing, volume 8. Springer, 2015.
  • [78] C. Rascon and I. Meza. Localization of sound sources in robotics: A review. Robotics and Autonomous Systems, 2017.
  • [79] B. Röder, W. Teder-Sälejärvi, A. Sterr, F. Rösler, S. A. Hillyard, and H. J. Neville. Improved auditory spatial tuning in blind humans. Nature, 1999.
  • [80] J. M. Romano, J. P. Brindza, and K. J. Kuchenbecker. Ros open-source audio recognizer: Roar environmental sound detection tools for robot programming. Autonomous robots, 2013.
  • [81] N. Savinov, A. Dosovitskiy, and V. Koltun. Semi-parametric topological memory for navigation. In ICLR, 2018.
  • [82] L. Savioja and U. P. Svensson. Overview of geometrical room acoustic modeling techniques. The Journal of the Acoustical Society of America, 138(2):708–730, 2015.
  • [83] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [84] A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. So Kweon. Learning to localize sound source in visual scenes. In CVPR, 2018.
  • [85] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, et al. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.
  • [86] A. Suhr, C. Yan, J. Schluger, S. Yu, H. Khader, M. Mouallem, I. Zhang, and Y. Artzi. Executing instructions in situated collaborative interactions. In EMNLP, 2019.
  • [87] S. Sukhbaatar, A. Szlam, G. Synnaeve, S. Chintala, and R. Fergus. Mazebase: A sandbox for learning from games. CoRR, abs/1511.07401, 2015.
  • [88] S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. Teller, and N. Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI, 2011.
  • [89] C. Thinus-Blanc and F. Gaunet. Representation of space in blind persons: vision as a spatial sense? Psychological bulletin, 1997.
  • [90] J. Thomason, D. Gordon, and Y. Bisk. Shifting the baseline: Single modality performance on visual navigation & qa. In NAACL-HLT, 2019.
  • [91] S. Thrun, W. Burgard, and D. Fox. Probabilistic robotics. MIT Press, 2005.
  • [92] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu. Audio-visual event localization in unconstrained videos. In ECCV, 2018.
  • [93] E. C. Tolman. Cognitive maps in rats and men. Psychological review, 1948.
  • [94] E. Veach and L. Guibas. Bidirectional estimators for light transport. In Photorealistic Rendering Techniques, pages 145–167. Springer, 1995.
  • [95] R. Viciana-Abad, R. Marfil, J. Perez-Lorenzo, J. Bandera, A. Romero-Garces, and P. Reche-Lopez. Audio-visual perception system for a humanoid robotic head. Sensors, 2014.
  • [96] P. Voss, M. Lassonde, F. Gougoux, M. Fortin, J.-P. Guillemot, and F. Lepore. Early-and late-onset blind individuals show supra-normal auditory abilities in far-space. Current Biology, 2004.
  • [97] Y. Wang, M. Kapadia, P. Huang, L. Kavan, and N. Badler. Sound localization and multi-modal steering for autonomous virtual agents. In Symposium on Interactive 3D Graphics and Games, 2014.
  • [98] L. Weihs, A. Kembhavi, W. Han, A. Herrasti, E. Kolve, D. Schwenk, R. Mottaghi, and A. Farhadi. Artificial agents learn flexible visual representations by playing a hiding game, 2019.
  • [99] E. Wijmans, S. Datta, O. Maksymets, A. Das, G. Gkioxari, S. Lee, I. Essa, D. Parikh, and D. Batra. Embodied Question Answering in Photorealistic Environments with Point Cloud Perception. In CVPR, 2019.
  • [100] J. Wilson, B. N. Walker, J. Lindsay, C. Cambias, and F. Dellaert. Swan: System for wearable audio navigation. In 2007 11th IEEE international symposium on wearable computers. IEEE, 2007.
  • [101] J. Wood, M. Magennis, E. F. C. Arias, T. Gutierrez, H. Graupp, and M. Bergamasco. The design and evaluation of a computer game for the blind in the grab haptic audio virtual environment. Proceedings of Eurohaptics, 2003.
  • [102] M. Wortsman, K. Ehsani, M. Rastegari, A. Farhadi, and R. Mottaghi. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In CVPR, 2019.
  • [103] A. Woubie, A. Kanervisto, J. Karttunen, and V. Hautamaki. Do autonomous agents benefit from hearing? arXiv preprint arXiv:1905.04192, 2019.
  • [104] X. Wu, H. Gong, P. Chen, Z. Zhong, and Y. Xu. Surveillance robot utilizing video and audio information. Journal of Intelligent and Robotic Systems, 2009.
  • [105] Y. Wu, Y. Wu, A. Tamar, S. Russell, G. Gkioxari, and Y. Tian. Bayesian relational memory for semantic visual navigation. ICCV, 2019.
  • [106] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner. Torcs, the open racing car simulator, 2013.
  • [107] F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. Tchapmi, A. Toshev, R. Martín-Martín, and S. Savarese. Interactive gibson: A benchmark for interactive navigation in cluttered environments. arXiv preprint arXiv:1910.14442, 2019.
  • [108] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese. Gibson env: Real-world perception for embodied agents. In CVPR, 2018.
  • [109] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Visual curiosity: Learning to ask questions to learn visual recognition. CoRL, 2018.
  • [110] J. Yang, Z. Ren, M. Xu, X. Chen, D. Crandall, D. Parikh, and D. Batra. Embodied amodal recognition: Learning to move to perceive objects. ICCV, 2019.
  • [111] T. Yoshida, K. Nakadai, and H. G. Okuno. Automatic speech recognition improved by two-layered audio-visual integration for robot audition. In 2009 9th IEEE-RAS International Conference on Humanoid Robots, pages 604–609. IEEE, 2009.
  • [112] Y. Aytar, C. Vondrick, and A. Torralba. Learning sound representations from unlabeled video. In NeurIPS, 2016.
  • [113] Y. Aytar, C. Vondrick, and A. Torralba. See, hear, and read: Deep aligned representations. arXiv preprint arXiv:1706.00932, 2017.
  • [114] M. Zaunschirm, C. Schörkhuber, and R. Höldrich. Binaural rendering of ambisonic signals by head-related impulse response time alignment and a diffuseness constraint. The Journal of the Acoustical Society of America, 143(6):3616–3627, 2018.
  • [115] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba. The sound of pixels. In ECCV, 2018.
  • [116] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg. Visual to sound: Generating natural sound for videos in the wild. In CVPR, 2018.
  • [117] Y. Zhu, D. Gordon, E. Kolve, D. Fox, L. Fei-Fei, A. Gupta, R. Mottaghi, and A. Farhadi. Visual Semantic Planning using Deep Successor Representations. In ICCV, 2017.
  • [118] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning. In ICRA, 2017.
  • [119] F. Zotter and M. Frank. Ambisonics: A Practical 3D Audio Theory for Recording, Studio Production, Sound Reinforcement and Virtual Reality, volume 19. Springer, 2019.

9 Supplementary Material

In this supplementary material we provide additional details in the form of:

  1. Video (with audio) for qualitative assessment of our audio simulations and agent performance. In particular, we demonstrate intuitive variations of intensity and directional information as an agent moves in a scene and trajectory roll-outs of audio-visual navigation agents. Please listen with headphones to hear the binaural sound properly. The minute video is available at

  2. Details on methodology to prune grid points in reference to ‘grid construction’ (sec:platform).

  3. RL notation and training utilized in the network description (sec:approach)

  4. Training curves demonstrating quicker training of AudioPointGoal over PointGoal and AudioGoal (as referenced in sec:experiment).

  5. Additional illustrations of pressure fields from the audio simulation (in addition to the room shown in fig:concept) and the sampled grid (in addition to the room shown in fig:grid).

  6. Heard/Unheard sounds referenced in the experiments section, Table 2, and Table 3.

Figure 8: Pressure field of audio simulation overlaid on the top-down map of apartment 2 from Replica [85]. Our audio-enabled agent gets rich directional information about the goal, since the pressure field variation is correlated with the shortest distance. Notice the discontinuities across walls and the gradient of the field along the geodesic path an agent must use to reach the goal (different from shortest Euclidean path). As a result, to an agent standing in the top right or bottom rooms, the audio reveals the door as a good intermediate goal. In other words, the audio stream signals to the agent that it must leave the current room to get to the target. In contrast, the GPS displacement vector would point through the wall and to the goal, which is a path the agent would discover it cannot traverse.
Figure 9: 3D view of pressure field and grid for FRL apartment 0 in Replica.

9.1 Video for qualitative assessment

The supplementary video introduces the proposed audio simulation platform, shows some examples of audio-based navigation as well as the qualitative results of our navigation models.

9.2 Pruning grid points

As explained in ‘grid construction’ (sec:platform), we use an automatic point placement algorithm to determine the locations where the simulated sound sources and listeners are placed in a two-step procedure: adding points on a regular grid and then pruning. To add points on a regular grid, we first compute an axis-aligned 3D bounding box of the scene. Within this box we sample points from a regular 2D square grid with resolution 0.5 m that slices the bounding box in the horizontal plane at a distance of 1.5 m from the floor (representing the height of a humanoid robot). We now provide details of the second step, i.e., pruning grid points in inaccessible locations.

To prune, we compute how closed the region surrounding a particular point is. This entails tracing uniformly distributed random rays in all directions from the point, then letting them diffusely reflect through the scene up to a fixed number of bounces using a path tracing algorithm. Simultaneously, we count the total number of “hits”: the number of rays that intersect the scene. After all rays are traced, the closed-ness of a point is computed from the hit count, and a point is declared outside the scene if its closed-ness is below a threshold. Finally, we remove points that are within a certain distance from the nearest geometry, as identified using the shortest length of the initial rays traced from the point in the previous pruning step.

For all scenes we use , and cm. This value of was chosen to avoid placement of points inside walls or in small inaccessible areas. We find works for most scenes. The exceptions are scenes with open patio areas, where we found works best to provide a sufficient number of points on the patio.
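The first step of the procedure above (regular grid sampling inside the scene's bounding box) can be sketched as follows. This is a minimal illustration under stated assumptions: the scene is given as an array of vertex positions, the z axis points up, and the floor is taken to be the bottom of the bounding box. The function name `sample_grid` is hypothetical, and the pruning step is not shown.

```python
import numpy as np


def sample_grid(vertices, resolution=0.5, height=1.5):
    """Sample a regular 2D grid of candidate points inside a scene's bounding box.

    vertices:   (N, 3) array of scene vertex positions, z up (assumption)
    resolution: grid spacing in meters
    height:     distance of the grid plane above the floor, in meters; the
                floor is assumed to be the minimum z of the bounding box
    returns:    (M, 3) array of grid points
    """
    vertices = np.asarray(vertices, dtype=float)
    lo = vertices.min(axis=0)  # axis-aligned bounding box, lower corner
    hi = vertices.max(axis=0)  # axis-aligned bounding box, upper corner
    xs = np.arange(lo[0], hi[0] + 1e-9, resolution)
    ys = np.arange(lo[1], hi[1] + 1e-9, resolution)
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    gz = np.full_like(gx, lo[2] + height)
    return np.stack([gx.ravel(), gy.ravel(), gz.ravel()], axis=1)


# Toy "scene": two opposite corners of a 2 m x 1 m x 3 m box.
box = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 3.0]])
points = sample_grid(box)
```

For the toy box this yields a 5 x 3 grid of candidate points, all at 1.5 m above the floor; in the full procedure, candidates in inaccessible locations would then be pruned.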

9.3 RL notation and training details

In the following we provide a brief background on reinforcement learning (RL), particularly the objective optimized by PPO. This notation links to sec:approach and fig:model in the main paper.

An agent embedded in an environment must take actions from an action space to accomplish an end goal. For our tasks, the actions are navigation motions. At every time step the environment is in some state, but the agent obtains only a partial observation of it. The maximal time horizon corresponds to 500 actions for our task. The observation is the combination of the audio, visual, and displacement vector inputs.

Using information about the previous time steps and the current observation, the agent develops a policy that gives the probability that the agent chooses to take a given action at a given time. We use a shorthand for the policy to show the feed-forward nature of the actor head. After the agent acts, the environment goes into a new state and the agent receives an individual reward.

The agent optimizes its return, i.e. the expected discounted, cumulative rewards


where the discount factor modulates the emphasis on recent versus long-term rewards. The value function is the expected return. Note that the GRU in our model is rolled out for 150 steps and the hidden state is repackaged after these steps, so that the gradients fit in GPU memory. The particular reinforcement learning objective we optimize directly follows from Proximal Policy Optimization. We refer the reader to [83] for additional details on optimization.
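As a small illustration of the quantities above, the sketch below computes the discounted return of a single episode and splits an episode into fixed-length rollout windows. The names `rewards`, `gamma`, and `rollout_length` are assumptions for this example, and the code shows a plain discounted sum rather than the PPO objective itself. The window boundaries mark where a recurrent hidden state would be repackaged under the 150-step rollout described above.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each step reuses the
    # return of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def rollout_segments(episode_length, rollout_length=150):
    """Split an episode into consecutive [start, end) rollout windows."""
    return [(start, min(start + rollout_length, episode_length))
            for start in range(0, episode_length, rollout_length)]
```

With the 500-action horizon used for our task, `rollout_segments(500)` gives three windows of 150 steps followed by one of 50 steps.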

9.4 Quicker training of AudioPointGoal

Figure 10: Training curves for different tasks for the natural sound of ‘telephone’. AudioPointGoal models learn faster than PointGoal and AudioGoal models. In addition, AudioGoal learns faster than PointGoal. Audio helps the agent discover good policies more quickly.

In fig:train_curves_telephone we plot navigation performance (SPL) as agents train on the different tasks with the telephone sound (other sounds exhibit similar behavior). We observe that AudioPointGoal is more sample efficient, and hence trains more quickly, than its audio-only and point-only counterparts. This holds both with and without visual sensing, as shown by training agents with blind and depth sensing capabilities. In addition, the AudioGoal agent trains more quickly than the PointGoal agent. Audio helps the agent discover good policies more quickly.

9.5 Visualizing audio simulations

Analogous to fig:concept, we visualize the pressure field in two other scenes from the Replica dataset. In fig:pressure_apt2, we display another large apartment (apartment_1) with four rooms, with the audio source inside one of the rooms. Notice how the pressure decreases from the source along geodesic paths, so that doors act as secondary sources or intermediate goals that lead the agent in the right direction. fig:pressure_frlapt0 illustrates the pressure field of frl_apartment_0 from a 3D perspective view. Notice that the sampled grid where room impulse responses are stored lies in a 2D plane at a human height of approximately 1.5 m.

Figure 11: Visualizing ambisonics. We visualize the ambisonics components (blue lobes) of the impulse response. Notice that the ambisonics sound fields characterize direction and intensity of the incoming energy.

In fig:audio_lobes, we include a second-order ambisonics representation showing the direction and intensity of the incoming direct sound, to demonstrate the spatial properties of the audio simulation at two receiver locations. Recall that we render impulse responses for source and receiver positions sampled from a grid in each scene. These impulse responses are stored in ambisonics and converted to binaural to mimic the signals received by a human at the entrance of the ear canal. We create fig:audio_lobes by evaluating the incoming energy of the direct sound (excluding reflections and reverberation) in the horizontal plane. The greater the energy, the bigger the lobe, and the orientation depicts the angular distribution of energy. At Location 1, energy comes predominantly from the right. Since Location 1 is closer to the audio source, its directional sound field has more energy than that of Location 2.

Footnote 4: The minor side lobes pointing in directions other than the source are a result of representing the sound field as a second-order ambisonics signal, thus using only 9 spherical harmonics. We refer the reader to [22, 77, 119] for more details on ambisonics sound field representation.
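The count of 9 spherical harmonics follows from the general rule that an ambisonics signal of order N has (N + 1)^2 channels. The sketch below enumerates them; the function name is hypothetical, and the ACN channel ordering (channel number = degree^2 + degree + index) is used here only as a common convention, not necessarily the one used in our renderings.

```python
def ambisonic_channels(order):
    """List the (degree, index) pairs of all spherical harmonics up to
    `order`, in ACN order (channel number = degree**2 + degree + index)."""
    return [(degree, index)
            for degree in range(order + 1)
            for index in range(-degree, degree + 1)]
```

Calling `ambisonic_channels(2)` returns nine pairs, one per channel of a second-order signal.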

9.6 Heard/unheard dataset splits

In the following we provide details about the sounds used in sec:experiment. We utilize 12 copyright-free natural sounds across six categories: telephone, canon (Canon in D Major), fan, engine, horn, and static radio. Engine, static radio, and canon have three examples each, which differ in content but have similar audio characteristics (similar audio spectrograms). Consequently, we refer to these 12 sounds as: {telephone, fan, horn, engine_1, radio_1, canon_1, engine_2, radio_2, canon_2, engine_3, radio_3, canon_3}.

For tab:pointgoal_vs_audiopointgoal and the same sound experiment in table:test, we use the ‘telephone’ sound source. In table:test, for the varied heard sounds experiment we train using the set {telephone, fan, horn, engine_1, radio_1, canon_1} and test on unseen scenes with the same sounds. Recall that the audio observations vary not only according to the audio file but also the 3D environment. For the varied unheard sounds experiment, we use the same set {telephone, fan, horn, engine_1, radio_1, canon_1} for training scenes, and generalize to unseen scenes as well as unheard sounds. Specifically, we utilize {engine_2, radio_2, canon_2} for validation scenes, and {engine_3, radio_3, canon_3} for test scenes.
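The splits above can be written down as a small configuration, sketched below. The dictionary layout and names are hypothetical. The training set listed for the varied unheard sounds experiment is an assumption: it is taken to be the six sounds not reserved for validation or test.

```python
HEARD = ["telephone", "fan", "horn", "engine_1", "radio_1", "canon_1"]
UNHEARD_VAL = ["engine_2", "radio_2", "canon_2"]
UNHEARD_TEST = ["engine_3", "radio_3", "canon_3"]

SOUND_SPLITS = {
    # Same sound: a single source is used throughout.
    "same_sound": {"train": ["telephone"], "test": ["telephone"]},
    # Varied heard: test scenes are unseen, but the sounds are the same.
    "varied_heard": {"train": HEARD, "test": HEARD},
    # Varied unheard: the training set here is an assumption (see lead-in).
    "varied_unheard": {"train": HEARD, "val": UNHEARD_VAL, "test": UNHEARD_TEST},
}


def is_unheard(split):
    """True if no evaluation sound in the split was used for training."""
    train = set(split["train"])
    evaluation = set(split.get("val", [])) | set(split["test"])
    return train.isdisjoint(evaluation)
```

A quick check such as `is_unheard(SOUND_SPLITS["varied_unheard"])` confirms that the validation and test sounds never overlap with the training sounds.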