Few-Shot Audio-Visual Learning of Environment Acoustics

by Sagnik Majumder, et al.

Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed in the space. Towards that goal, we introduce a transformer-based method that uses self-attention to build a rich acoustic context, then predicts RIRs of arbitrary query source-receiver locations through cross-attention. Additionally, we design a novel training objective that improves the match in acoustic signature between the RIR predictions and the targets. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and, in a major departure from traditional methods, generalizing to novel environments in a few-shot manner. Project: http://vision.cs.utexas.edu/projects/fs_rir.





1 Introduction

Sound is central to our perceptual experience—people talk, doorbells ring, music plays, knives chop, dishwashers hum. A sound carries information not only about the semantics of its source, but also the physical space around it. For instance, compare listening to your favorite song in a big auditorium to hearing the same song in your cozy bedroom: the auditory experience changes drastically due to differences in the environment. In particular, on its way to our ears, sound undergoes various acoustic phenomena: direct sound, early reflections, and late reverberations. Consequently, we hear spatial sound shaped by the environment’s geometry, the materials of constituent surfaces and objects, and the relative locations of the sound source and the listener. These factors together comprise the room impulse response (RIR)—the transfer function that maps an original sound to the sound that meets our ears or microphones.

Learning to model RIRs would have far-reaching implications for augmented reality (AR), virtual reality (VR), and robotics. In AR/VR, a truly immersive experience demands that the user hear sounds that are acoustically matched with the surrounding augmented/virtual space (app11031150). In mobile robotics, an agent cognizant of environment acoustics could better solve important embodied tasks, like localizing sounds or separating out target sounds of interest. In any such application, one must be able to anticipate the environment effects for arbitrarily positioned sources observed from arbitrary receiver poses.

Figure 1: Given few-shot audio-visual observations from a 3D scene (blue boxes), we aim to learn an acoustic model for the entire environment such that we can generate a Room Impulse Response (RIR) for any arbitrary query of source (S) and receiver (R) locations in the scene—without images/echoes at those locations.

Traditional approaches to model room acoustics require extensive access to the physical environment. They either assume a full 3D mesh of the space is available in order to simulate sound propagation patterns (chen2020soundspaces; 4117929), or else require densely sampling sounds at many source-microphone position pairs throughout the environment in order to measure the RIRs (Holters2009IMPULSERM; stan2002comparison)—both of which are expensive if not impractical. Recent work attempts to lighten these requirements by predicting (sometimes implicitly) the RIR from an image (singh2021image2reverb; kon2019estimation; gao20192; DBLP:journals/corr/abs-2007-09902; Xu_2021_CVPR; Rachavarapu_2021_ICCV; chen2022visual), but their output is specific to a single receiver position for which the photo exists, prohibiting generalization to other positions in the space.

Mindful of these limitations, we propose to infer RIRs in novel environments using only few-shot audio-visual observations. Motivated by how humans anticipate the overall structure of a 3D space by looking at a few parts of it, we hypothesize that imagery and echoes captured from a few different locations in a 3D scene can suggest its overall geometry and material composition, which in turn can facilitate interpolation of an RIR to arbitrary (unobserved) locations. See Figure 1.


To realize this idea, we propose a transformer-based model called Few-ShotRIR along with a novel training objective that facilitates high-quality prediction of RIRs by matching the energy decay between the predicted and the ground truth RIRs. Few-ShotRIR directly attends to the egocentric audio-visual observations to build an acoustic context of the environment. During training, our model learns the association between what is seen and heard in a variety of environments. Then, given a novel environment (e.g., a previously unseen multi-room home), the input is a sparse (few-shot) set of images together with the true RIRs at those image positions. In particular, each true RIR corresponds to positioning both the source and receiver where the image is captured, as obtained by emitting a short frequency sweep and recording the echoes. The output is an environment-specific function that can predict the RIR for arbitrary new source/receiver poses in that space—importantly, without traveling there or sampling any further images or echoes.

Our design has three key advantages: 1) the few-shot sparsity of the observations (on the order of tens, compared to the thousands that would be needed for dense visual or geometric coverage) means that RIR inference in a new space has low overhead; 2) the use of egocentric echoes means that the preliminary observations are simple to obtain, as opposed to repeatedly moving a sound source (e.g., a speaker) and a microphone independently to different relative positions; and 3) our novel differentiable training loss encourages predictions that capture the room acoustics, distinct from existing models that rely on non-differentiable RT60 losses (singh2021image2reverb; ratnarajah2022fast).

We evaluate Few-ShotRIR with realistic audio-visual simulations from SoundSpaces (chen2020soundspaces) comprising 83 real-world Matterport3D (chang2017matterport3d) environment scans. Our model successfully learns environment acoustics, outperforming the state-of-the-art models in addition to several baselines. We also demonstrate the impact on two downstream tasks that rely on the spatialization accuracy of RIRs: sound source localization and depth prediction. Our margin of improvement over a state-of-the-art model is as high as 23% on RIR prediction and 67% on downstream evaluation.

2 Related Work

Audio Spatialization and Impulse Response Generation.

Convolving an RIR with a waveform yields the sound of that source in the context of the surrounding physical space and the receiver location (10.1109/TASL.2013.2256897; 10.1145/2980179.2982431; Funkhouser03abeam; Savioja2015OverviewOG). Since traditional methods for measuring RIRs (Holters2009IMPULSERM; stan2002comparison) or simulating them with sound propagation models (imageMethod79; chen2020soundspaces; 4117929) are expensive or computationally prohibitive, recent approaches indirectly generate RIRs by first estimating acoustic parameters (ratnarajah2022fast; 7486010; 8521241; Klein_Neidhardt_Seipel_2019; DBLP:conf/interspeech/MackDH20; 8506462)—such as the reverberation time (RT60), the time an RIR takes to decay by 60 dB, and the direct-to-reverberant ratio (DRR), the energy ratio of direct and reflected sound—or matching the distributions of such acoustic parameters in real-world RIRs (ratnarajah2020ir). Whereas Fast-RIR (ratnarajah2022fast) assumes that the environment size and reverberation characteristics are given, our model relies on learning directly from low-level multi-modal sensory information, thus generalizing to novel environments.
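As a concrete illustration of the first sentence above, spatializing a dry (anechoic) source signal amounts to a per-channel convolution with the binaural RIR. A minimal numpy sketch (the function name and the toy two-tap "room" are our own, not from the paper):

```python
import numpy as np

def render_at_receiver(dry: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a dry mono signal with a binaural RIR.

    dry: (n,) mono source waveform
    rir: (2, m) binaural room impulse response
    returns: (2, n + m - 1) spatialized binaural waveform
    """
    return np.stack([np.convolve(dry, rir[c]) for c in range(rir.shape[0])])

# toy example: a click source and a two-tap "room" (direct path + one echo per ear)
dry = np.zeros(100); dry[0] = 1.0
rir = np.zeros((2, 50)); rir[:, 0] = 1.0; rir[0, 30] = 0.4; rir[1, 40] = 0.4
wet = render_at_receiver(dry, rir)
```

Because the source is a unit click, the wet signal simply reproduces each RIR channel, making the direct path and the per-ear echoes easy to inspect.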

Alternately, some methods use images to predict RIRs of the target environment (singh2021image2reverb; kon2019estimation) by implicitly inferring scene geometry (remaggi2019reproducing) and acoustic parameters (kon2019estimation), or directly synthesizing spatial audio (gao20192; DBLP:journals/corr/abs-2007-09902; Xu_2021_CVPR; Rachavarapu_2021_ICCV; chen2022visual). Although such image-based methods have the flexibility to extract diverse acoustic cues, their predictions are agnostic to the exact source and receiver locations, making them unsuitable for tasks where this mapping is important (e.g., sound source localization, audio goal navigation, or fine-grained acoustic matching).

Audio field coding approaches (10.1145/2601097.2601184; 10.1109/TVCG.2014.38; 10.1145/3197517.3201339; 10.1145/3386569.3392459) are able to model exact source and receiver locations, but to improve efficiency they rely on handcrafting features rather than learning them, which adversely impacts generation fidelity luo2022learning. The recently proposed Neural Acoustic Fields (NAF) (luo2022learning) tackles this by learning an implicit representation (mildenhall2020nerf; sitzmann2020implicit) of the RIRs and additionally conditioning on geometrically-grounded learned embeddings. While NAF can generalize to unseen source-receiver pairs from the same environment, it requires training one model per environment. Consequently, NAF is unable to generalize to novel environments, and both its training time and model storage cost scale with the number of environments. On the contrary, given a few egocentric audio-visual observations from a novel environment, our model learns an implicit acoustic representation of the scene and predicts high-quality RIRs for arbitrary source-receiver pairs.

Audio-Visual Learning.

Advances in audio-visual learning have influenced many tasks, like audio-visual source separation and speech enhancement (afouras2018conversation; afouras2020self; chen2021learning; ephrat2018looking; hou2018audio; michelsanti2021overview; owens2018audio; sadeghi2020audio; zhao2019sound; zhou2019vision; majumder2021move2hear; majumder2022active), object/speaker localization (hu2020discriminative; jiang2022egocentric), and audio-visual navigation (chen2020soundspaces; DBLP:journals/corr/abs-2008-09622; DBLP:journals/corr/abs-2012-11583; gan2020look; NEURIPS2020_ab6b331e). Using echo responses along with vision to learn a better spatial representation (gao2020visualechoes), infer depth (christensen2020batvision), or predict the floorplan (purushwalkam2021audio) of a 3D environment has also been explored. In contrast, our model leverages the synergy of egocentric vision and echo responses to infer environment acoustics for predicting RIRs. Results show that both the visual and audio modalities play a vital role in our model's training.

3 Few-Shot Learning of Environment Acoustics

We propose a novel task: few-shot audio-visual learning of environment acoustics. The objective is to predict RIRs on the basis of egocentric audio-visual observations captured in a 3D environment. In particular, for a few randomly drawn locations in the 3D scene, we are given egocentric RGB and depth images, and the echoes heard at those positions (from which the corresponding RIRs can be computed, as detailed below). Using those samples, we model the scene’s acoustic space, in order to predict RIRs for arbitrary pairings of sound source location and receiver pose (i.e., microphone location and orientation).

Task Definition.

Specifically, let $\mathcal{O} = \{O_1, \ldots, O_N\}$ be a set of $N$ observations randomly sampled from a 3D environment, such that $O_i = (V_i, E_i, P_i)$, where $V_i$ is the egocentric RGB-D view captured with a fixed field of view (FoV), $E_i$ is the RIR of the binaural echo response, and $P_i = (x_i, \theta_i)$ is the pose with location $x_i$ and orientation $\theta_i$. Given a query $q = (s, r)$ for an arbitrary source and receiver pair, where $s$ is the omnidirectional sound source location and $r$ is the receiver microphone pose, which includes both its location and orientation, the goal is to predict the binaural RIR $R^q$ for the query $q$. Thus, our goal is to learn a function $f$ to predict the RIR for an arbitrary query given the egocentric audio-visual context $\mathcal{O}$, such that $f(q \mid \mathcal{O}) = R^q$. Note that the query contains neither images nor echoes.
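The task interface can be pinned down with a small sketch. All names and array shapes below are illustrative assumptions (the spectrogram shape follows the STFT dimensions reported in the experiments section), and the predictor is a stub that only fixes the function signature, not the paper's model:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Pose:
    location: tuple       # (x, y, z) position in the scene
    orientation: float    # heading angle in radians (matters for receivers)

@dataclass
class Observation:
    rgb_d: np.ndarray     # egocentric RGB-D view, shape (4, H, W)
    echo_rir: np.ndarray  # binaural RIR spectrogram from the echo, shape (2, F, T)
    pose: Pose            # where the snapshot was taken

@dataclass
class Query:
    source: Pose          # omnidirectional source: only its location matters
    receiver: Pose        # microphone location and orientation

def predict_rir(context: list, query: Query) -> np.ndarray:
    """Stub for the learned function f(query | context) -> binaural RIR.
    A real model attends over `context`; this stub only fixes the interface."""
    return np.zeros((2, 256, 259))  # spectrogram-shaped output (channels, freq, time)
```

Note that `Query` carries only poses, mirroring the paper's point that queries contain neither images nor echoes.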

This task requires learning from both visual and audio cues. While the visual signal conveys information about the local scene geometry and the material composition of visible surfaces and objects, the audio signal, in the form of echo responses, is more long-range in nature and additionally carries cues about acoustic properties, the global geometric structure, and material distribution in the environment—beyond what’s visible in the image. Our hypothesis is that sampling and aggregating these two complementary signals from a sparse set of locations in a 3D scene can facilitate inference of the full acoustic manifold and, consequently, enable high-quality prediction of arbitrary RIRs.

4 Approach

Figure 2: Our model predicts room impulse responses (RIR) for arbitrary source-receiver pairs in a 3D environment, including novel scenes, by building its implicit acoustic representation in the few-shot context of egocentric audio-visual observations. We train our model with a novel energy decay matching loss that helps capture desirable acoustic properties in its predictions.

We introduce a novel approach called Few-ShotRIR for arbitrary RIR prediction in a 3D environment based on a few-shot context of egocentric audio-visual observations. Our model has two main components (see Fig. 2): 1) an audio-visual (AV) context encoder, and 2) a conditional RIR predictor. The AV context encoder builds an implicit model of the environment’s acoustic properties by extracting multimodal cues from the input AV context (Sec. 4.1). The RIR predictor uses this implicit representation of the scene and, conditioned on a query for an arbitrary source-receiver pair, predicts the corresponding RIR (Sec. 4.2).

Our model is trained end-to-end to reduce the error of the predicted RIR relative to the ground truth, using a novel training objective (Sec. 4.3). The objective not only encourages our model’s predictions to match the target RIRs at the spectrogram level, but also ensures that the predictions and the targets agree on important high-level acoustic parameters, thereby improving prediction quality. Next, we describe the two model components and the proposed training objective in detail.

4.1 Audio-Visual Context Encoder

Our AV context encoder (Fig. 2a) extracts features from the observations $\mathcal{O}$. This context is sampled from the unmapped environment by an agent (a person or robot) traversing the scene and taking a small set of AV snapshots at random locations. Our model starts by embedding each observation using visual, acoustic, and pose networks, followed by a multi-layer transformer encoder (vaswani2017attention) that learns an implicit representation of the scene’s acoustic properties.


We encode the visual component $V_i$ by first normalizing its RGB and depth images to a common range. We then concatenate the images along the channel dimension and encode them with a network (a ResNet-18 (he2016deep)) into visual features $f^V_i$.


To measure the RIRs for the echo inputs, we use the standard sine-sweep technique (farina2000simultaneous). We first generate a “chirp” in the form of a sinusoidal sweep signal from 20 Hz to 20 kHz (the human audible range) at the sound source and then retrieve the RIR at the receiver by convolving the recorded spatial sound with the inverse of the sweep signal. We then use the short-time Fourier transform (STFT) to represent all RIRs as magnitude spectrograms (singh2021image2reverb; luo2022learning) of size $2 \times F \times T$, where $F$ is the number of frequency bins, $T$ is the number of overlapping time windows, and the two channels correspond to the binaural RIR. Finally, having converted the observed binaural echoes to an RIR $E_i$, we compute its log magnitude spectrogram and encode it with a network (a ResNet-18 (he2016deep)) into audio features $f^E_i$.
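The sweep-based measurement can be sketched as follows. For simplicity, this illustration recovers the RIR by regularized frequency-domain deconvolution rather than explicit convolution with an inverse sweep filter, and uses an illustrative band, length, and toy "room"; none of these specifics are from the paper:

```python
import numpy as np

fs = 16_000                    # sample rate (Hz)
dur = 1.0                      # sweep duration (s)
f0, f1 = 20.0, 8_000.0         # sweep band (the paper sweeps the full audible range)

t = np.arange(int(fs * dur)) / fs
# linear sine sweep ("chirp") emitted at the source
sweep = np.sin(2 * np.pi * (f0 * t + 0.5 * (f1 - f0) / dur * t ** 2))

# simulated room: direct path at sample 32 plus one weaker reflection at sample 200
h_true = np.zeros(400); h_true[32] = 1.0; h_true[200] = 0.5
recorded = np.convolve(sweep, h_true)      # what the receiver hears

# recover the RIR by regularized frequency-domain deconvolution
L = len(recorded)
S, R = np.fft.rfft(sweep, L), np.fft.rfft(recorded, L)
eps = 1e-8 * np.max(np.abs(S)) ** 2        # guards against division by near-zero bins
h_est = np.fft.irfft(R * np.conj(S) / (np.abs(S) ** 2 + eps), L)
```

Since the recorded signal is exactly the sweep convolved with the room response, the deconvolution recovers the two taps up to small regularization error.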


To embed the camera pose $P_i$ into a feature $f^P_i$, we first normalize all poses in $\mathcal{O}$ to be relative to the first pose in the context and then represent each with a sinusoidal positional encoding (vaswani2017attention).
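A minimal sketch of this pose-encoding step, assuming a simple (x, y, heading) pose and a hypothetical feature width (the paper does not specify these details):

```python
import numpy as np

def sinusoidal_encoding(pose: np.ndarray, dims: int = 16) -> np.ndarray:
    """Encode each pose coordinate with sin/cos features at geometric frequencies,
    as in the transformer positional encoding (vaswani2017attention)."""
    freqs = 1.0 / (10_000.0 ** (np.arange(dims // 2) / (dims // 2)))
    angles = pose[:, None] * freqs[None, :]             # (coords, dims / 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).ravel()

# normalize context poses relative to the first pose, then encode
poses = np.array([[2.0, 3.0, 0.5], [4.0, 1.0, 1.5]])   # (x, y, heading) per snapshot
relative = poses - poses[0]
codes = np.stack([sinusoidal_encoding(p) for p in relative])
```

Normalizing relative to the first pose makes the encoding translation-invariant across environments: the first snapshot always maps to the all-zero pose.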


To enable our model to distinguish between the visual and audio modalities in the context, we introduce a modality token, with separate visual and acoustic modality embeddings learned during training. While the visual modality (RGB-D images) reveals the local geometric and semantic structure of the environment, the acoustic modality (the echoes) carries more global acoustic information about regions of the environment both within and outside the field of view. Our modality-based embedding allows our model to attend within and across modalities to capture both modality-specific and complementary environmental cues for a comprehensive model of the scene's acoustic properties.

Context Encoder.

For each visual observation in $\mathcal{O}$, we concatenate its embedding $f^V_i$ with its pose and modality embeddings and project this representation with a single linear layer to get a visual input token; we do the same with each audio embedding $f^E_i$ to get an audio token. This creates a multimodal memory $\mathcal{M}$ of size $2N$. Next, our context encoder attends to the embeddings in $\mathcal{M}$ with self-attention through multiple layers, capturing short- and long-range correlations within and across modalities, to learn an implicit representation $\mathcal{C}$ that models the acoustic properties of the 3D scene. This representation is then fed to the next module, which generates the RIR for an arbitrary source-receiver query, as we describe next.
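The memory-assembly step can be sketched with stand-in embeddings; all dimensions and the random "encoder outputs" below are illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_feat, d_pose, d_mod, d_model = 4, 512, 48, 8, 256

# per-observation embeddings from the visual / acoustic / pose encoders (stand-ins)
visual_feats = rng.normal(size=(N, d_feat))
audio_feats  = rng.normal(size=(N, d_feat))
pose_codes   = rng.normal(size=(N, d_pose))
mod_visual, mod_audio = rng.normal(size=(d_mod,)), rng.normal(size=(d_mod,))

W = rng.normal(size=(d_feat + d_pose + d_mod, d_model))  # shared linear projection

def tokens(feats: np.ndarray, mod: np.ndarray) -> np.ndarray:
    """Concatenate feature, pose, and modality embeddings, then project."""
    x = np.concatenate([feats, pose_codes, np.tile(mod, (N, 1))], axis=1)
    return x @ W

# the memory holds one token per modality per observation: size 2N x d_model
memory = np.concatenate([tokens(visual_feats, mod_visual),
                         tokens(audio_feats, mod_audio)], axis=0)
```

A transformer encoder would then apply self-attention over these 2N tokens, letting visual and acoustic tokens exchange information.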

4.2 Conditional RIR Predictor

Given an arbitrary source-receiver query $q = (s, r)$, we first normalize the poses of $s$ and $r$ relative to the first pose in the context and encode each with a sinusoidal positional encoding, as in Sec. 4.1, to generate the pose encodings $f^s$ and $f^r$, respectively. Then, we concatenate $f^s$ and $f^r$ and project them with a single linear layer to get the query encoding $f^q$. Next, our RIR predictor, conditioned on $f^q$, performs cross-attention on the learned implicit representation $\mathcal{C}$ using a transformer decoder (vaswani2017attention), and generates an encoding that is representative of the target RIR for query $q$. Again, we stress that the query consists of only poses—no images or echoes.

We upsample this encoding with a multi-layer network of transpose convolutions to predict the magnitude spectrogram of the RIR in log space. Finally, we transform this log magnitude spectrogram back to linear space to obtain our model’s RIR prediction $\tilde{R}^q$ for query $q$.

4.3 Model Training

Our model is optimized during training in a supervised manner to predict the target RIR for a given query, using a loss that captures both the spectrogram-level prediction accuracy and the high-level acoustic properties of the predicted $\tilde{R}^q$ compared to the ground truth $R^q$. Our loss contains two terms: 1) an $L_1$ reconstruction loss $\mathcal{L}_{\text{STFT}}$ (singh2021image2reverb; luo2022learning) on the magnitude spectrogram of the RIR, and 2) a novel energy decay matching loss $\mathcal{L}_{ED}$.

For a target binaural spectrogram $R^q$ with $F$ frequency levels and $T$ temporal windows, the $\mathcal{L}_{\text{STFT}}$ loss reduces the average prediction error in the time-frequency domain:

$$\mathcal{L}_{\text{STFT}} = \frac{1}{2FT} \sum_{c=1}^{2} \sum_{f=1}^{F} \sum_{t=1}^{T} \big|\, \tilde{R}^q(c, f, t) - R^q(c, f, t) \,\big|.$$

On the other hand, the $\mathcal{L}_{ED}$ loss captures the reverberation quality of the RIR by matching the temporal energy decay of the predicted RIR with that of the target. $\mathcal{L}_{ED}$ allows our model to reduce errors in important reverberation parameters that depend on the energy decay of the RIR, like RT60, the time taken by an impulse to decay by 60 dB, and DRR, the direct-to-reverberant energy ratio (cf. Table 1). Although past approaches (singh2021image2reverb; ratnarajah2022fast) have tried to minimize the RT60 error directly, incorporating an RT60 loss in a training objective is not viable due to the non-differentiable nature of the RT60 function. On the contrary, our proposed $\mathcal{L}_{ED}$ is completely differentiable and can be combined with any other RIR training objective.

To compute $\mathcal{L}_{ED}$, we first (similar to (singh2021image2reverb)) obtain the energy decay curve $D$ of an RIR by summing its spectrogram along the frequency axis to retrieve a full-band amplitude envelope, and then use Schroeder’s backward integration algorithm to compute the decay. However, unlike (singh2021image2reverb), which computes the RT60 value from the decay curve through a series of non-differentiable operations, our $\mathcal{L}_{ED}$ directly measures the error in the energy decay curve between the prediction and the target, which not only makes it completely differentiable but also useful for capturing energy-based acoustic properties other than RT60, like DRR and early decay time (EDT) (ratnarajah2021ts). Towards that goal, we compute the absolute error between the predicted decay curve $\tilde{D}$ and the target $D$ at the set $\mathcal{T}$ of temporal positions at which the target energy decay is non-zero. This lets our model ignore optimizing for the all-zero tails in shorter RIRs. $\mathcal{L}_{ED}$ is defined as follows:

$$\mathcal{L}_{ED} = \frac{1}{2\,|\mathcal{T}|} \sum_{c=1}^{2} \sum_{t \in \mathcal{T}} \big|\, \tilde{D}(c, t) - D(c, t) \,\big|.$$
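A minimal numpy sketch of the decay curve and the matching loss, under our reading of the text (the envelope, masking, and normalization details are assumptions; the sketch is single-channel, with the binaural case applying it per channel). Every step is composed of differentiable operations, so the same computation ports directly to an autodiff framework:

```python
import numpy as np

def energy_decay_db(spec: np.ndarray) -> np.ndarray:
    """Schroeder backward integration on a magnitude spectrogram (F, T):
    sum over frequency to get a full-band amplitude envelope, then integrate
    the remaining energy from each time step to the end, in dB."""
    envelope = spec.sum(axis=0)                       # (T,) full-band envelope
    energy = np.cumsum(envelope[::-1] ** 2)[::-1]     # tail energy at each step
    return 10.0 * np.log10(energy / (energy[0] + 1e-12) + 1e-12)

def decay_matching_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error between predicted and target decay curves,
    evaluated only where the target still carries energy (non-zero tail mask)."""
    d_pred, d_tgt = energy_decay_db(pred), energy_decay_db(target)
    mask = target.sum(axis=0) > 0
    return float(np.abs(d_pred - d_tgt)[mask].mean())

# sanity check on a synthetic exponentially decaying spectrogram
T = 200
spec = np.exp(-np.arange(T) / 40.0)[None, :] * np.ones((64, 1))
```

By construction the decay curve is monotonically non-increasing, and the loss is zero for a perfect prediction.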

Our final training objective is $\mathcal{L} = \mathcal{L}_{\text{STFT}} + \lambda\,\mathcal{L}_{ED}$, where $\lambda$ is the weight for $\mathcal{L}_{ED}$. We train our model using Adam (kingma2014adam).

5 Experiments

Evaluation setup.

We evaluate our task using a state-of-the-art, perceptually realistic 3D audio-visual simulator. In particular, we use the AI-Habitat simulator (habitat19iccv) with SoundSpaces (chen2020soundspaces) audio and the Matterport3D scenes (chang2017matterport3d). While Matterport3D contains dense 3D meshes and image scans of real-world houses and other indoor spaces, SoundSpaces provides pre-computed RIRs for Matterport3D that render spatial audio at a spatial resolution of 1 meter. These RIRs capture all major real-world acoustic phenomena (see (chen2020soundspaces) for details). This framework enables us to evaluate our task on a large number of environments, test on diverse scene types, compare methods under the same settings, and report reproducible results. To our knowledge, there is no existing public dataset with both imagery and dense physically measured RIRs. Furthermore, due to the popularity of this framework (e.g., (gao2020visualechoes; DBLP:journals/corr/abs-2008-09622; NEURIPS2020_ab6b331e; majumder2021move2hear; purushwalkam2021audio; chen2022visual)), we can test our model on important downstream tasks for RIR generation that are relevant to the larger community.

Dataset splits.

We evaluate with 83 Matterport3D scenes, of which we treat 56 randomly sampled ones as seen and the remaining 27 as unseen. Unseen environments are only used for testing. For the seen environments, we hold out a subset of queries for testing and use the rest for training and validation. Our test set consists of 14 sets of 50 arbitrary queries for each environment, where our model uses the same randomly chosen observation set for all queries in a set. This results in a train-val split with 8,107,904 queries, and a test split with 39,900 queries for seen and 18,200 queries for unseen. Our testing strategy allows us to evaluate a model on two different aspects: 1) for a given observation set from an environment, how the RIR prediction quality varies as a function of the query, and 2) how well the model generalizes to environments previously unseen during training.


We render all RGB-D images for our model input at a fixed resolution and sample binaural RIRs at a rate of 16 kHz. To generate the RIR spectrograms, we compute the STFT with a Hann window of 15.5 ms, hop length of 3.875 ms, and FFT size of 511. This results in two-channel spectrograms, where each channel has 256 frequency bins and 259 overlapping temporal windows. Unless otherwise specified, for both training and evaluation, we use egocentric observation sets of a fixed size $N$ for our model.
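The stated STFT parameters can be checked with a minimal STFT: a 15.5 ms Hann window at 16 kHz is 248 samples, the 3.875 ms hop is 62 samples, and a 511-point FFT yields 256 frequency bins. The exact number of time windows depends on the RIR length and padding convention, which the text does not specify, so the sketch below only verifies the frequency axis:

```python
import numpy as np

fs = 16_000
win = round(0.0155 * fs)     # 15.5 ms Hann window -> 248 samples
hop = round(0.003875 * fs)   # 3.875 ms hop        -> 62 samples
n_fft = 511                  # odd FFT size        -> 511 // 2 + 1 = 256 bins

def magnitude_spectrogram(x: np.ndarray) -> np.ndarray:
    """Minimal STFT magnitude with a Hann window (no edge padding)."""
    w = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] * w for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n_fft, axis=-1)).T   # (freq, time)

rir = np.random.default_rng(0).normal(size=fs)             # a 1 s mono RIR channel
spec = magnitude_spectrogram(rir)
```

A binaural RIR would produce one such spectrogram per channel, giving the two-channel representation described above.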

Existing methods and baselines.

We compare our approach to the following baselines and state-of-the-art methods (see Supp. for implementation and training details):


  • Nearest Neighbor: a naive baseline that outputs the input echo RIR whose pose is closest to the query receiver pose.

  • Linear Interpolation: a naive baseline that computes the top four closest observation poses for the query receiver, and outputs the linear interpolation of the corresponding echoes’ RIRs.

  • AnalyticalRIR++: We modify our model to predict RT60 and DRR for the query using the egocentric observations. The modification uses the same transformer encoder-decoder pair, but replaces the transposed convolutions with fully-connected layers for RT60 or DRR prediction. It then analytically shapes an exponentially decaying white noise (rir_from_white_noise) on the basis of these two parameters to estimate the target RIR.

  • Fast-RIR (ratnarajah2022fast)++: Fast-RIR (ratnarajah2022fast) is a state-of-the-art model that trains a GAN (NIPS2014_5ca3e9b1) to synthesize RIRs for rectangular rooms on the basis of environment and acoustic attributes, like scene size and RT60, which it assumes to be known a priori. Since Few-ShotRIR makes no such assumptions and is not restricted to rectangular rooms, we improve this method into Fast-RIR++: we use our modified AnalyticalRIR++ model to estimate both the target RT60 and DRR, and use panoramic depth images at the query source and receiver to infer the scene size. We also train this model by augmenting the originally proposed objective with our energy decay matching loss to further improve its performance.

  • Neural Acoustic Fields (NAF) (luo2022learning): a state-of-the-art model that uses an implicit scene representation (mildenhall2020nerf) to model RIRs. As discussed above, a NAF model can only predict new RIRs in the same training scene; it cannot generalize to an unseen environment without retraining a new model from scratch to fit the new scene.
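The two naive baselines above can be sketched as follows. We use inverse-distance weighting for the interpolation, which is one plausible reading of "linear interpolation"; the pose format and toy RIRs are illustrative assumptions:

```python
import numpy as np

def nearest_neighbor_rir(ctx_poses, ctx_rirs, query_receiver):
    """Return the context echo RIR whose pose is closest to the query receiver."""
    d = np.linalg.norm(ctx_poses - query_receiver, axis=1)
    return ctx_rirs[np.argmin(d)]

def linear_interp_rir(ctx_poses, ctx_rirs, query_receiver, k=4):
    """Inverse-distance-weighted average of the k closest context RIRs."""
    d = np.linalg.norm(ctx_poses - query_receiver, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-6)
    w /= w.sum()
    return np.tensordot(w, ctx_rirs[idx], axes=1)

# toy context: 2D positions with constant-valued stand-in RIRs for inspection
poses = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])
rirs = np.stack([np.full((2, 8), float(i)) for i in range(5)])
```

For a query equidistant from the four unit-square corners, the interpolation baseline returns their plain average, matching the intuition that nearby echoes serve as proxy predictions.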

Evaluation Metrics.

We consider four metrics: 1) STFT Error, which measures the average error between predicted and target RIRs at the spectrogram level; 2) RT60 Error (RTE) (singh2021image2reverb; ratnarajah2021ts; ratnarajah2022fast), which measures the error in the RT60 value of our predicted RIRs; 3) DRR Error (DRRE) (ratnarajah2021ts), another standard metric for room acoustics; and 4) Mean Opinion Score Error (MOSE) (chen2022visual), which uses a deep learning objective to measure the difference in perceptual quality between a prediction and the target when convolved with human speech. While STFT Error measures the fine-grained agreement of a prediction with the target, RTE and DRRE capture the extent of acoustic mismatch in a prediction, and MOSE evaluates the level of perceptual realism for human speech.
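For concreteness, RT60 (which underlies the RTE metric) can be estimated from an RIR via Schroeder backward integration. This direct-crossing variant is a simplification of practical implementations, which often extrapolate from a fit over the -5 to -25 dB range:

```python
import numpy as np

def rt60_from_rir(h: np.ndarray, fs: int) -> float:
    """Estimate RT60: the time for the Schroeder energy decay curve
    to fall 60 dB below its initial level."""
    edc = np.cumsum((h ** 2)[::-1])[::-1]              # backward-integrated energy
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    below = np.nonzero(edc_db <= -60.0)[0]
    return below[0] / fs if len(below) else len(h) / fs

# synthetic RIR whose amplitude decays exactly 60 dB over 0.3 s
fs, rt60 = 16_000, 0.3
n = np.arange(fs)                        # 1 s long
h = 10.0 ** (-3.0 * n / (fs * rt60))     # amplitude envelope: -60 dB at n = fs * rt60
```

On this synthetic exponential decay the estimator recovers the designed RT60 almost exactly, since the decay curve is a straight line in dB.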

5.1 RIR Prediction Results

Method                            Seen environments            Unseen environments
                                  STFT   RTE   DRRE   MOSE     STFT   RTE   DRRE   MOSE
Nearest Neighbor                  4.65   1.15  385    24.4     4.87   1.26  391    28.0
Linear Interpolation              4.44   1.22  393    24.3     4.67   1.32  403    27.2
AnalyticalRIR++                   2.94   0.98  463    28.1     3.02   1.19  467    29.4
Fast-RIR (ratnarajah2022fast)++   1.37   1.25  137    13.7     1.45   1.61  369    15.2
Few-ShotRIR (Ours)                1.10   0.43  106    8.66     1.22   0.65  164    10.5
Ours w/ N = 1                     1.69   0.74  362    17.2     1.70   0.90  372    17.4
Ours w/o echoes                   1.63   0.63  334    19.5     1.67   0.95  357    20.0
Ours w/o vision                   1.63   0.67  332    19.2     1.58   0.83  347    19.3
Ours w/o energy decay loss        1.39   1.60  347    14.0     1.44   2.11  363    14.5
Table 1: RIR prediction results. All methods here train a single model to handle all seen/unseen environment queries. See Table 2 for comparisons with NAF (luo2022learning), which trains one model per seen environment. Lower is better for all metrics. Results are statistically significant between our model and the nearest baselines.

Table 1 (top) reports our main results. The naive Nearest Neighbor and Linear Interpolation baselines incur very high STFT error, which shows that using echoes from poses that are spatially close to the query receiver as proxy predictions are insufficient, emphasizing the difficulty of the task. AnalyticalRIR++ fares better than the naive baselines on STFT and RTE but has a higher DRRE and MOSE, showing that reconstructing an RIR using simple waveform statistics is not enough. Fast-RIR++ shows the strongest performance among the baselines. Its improvement over AnalyticalRIR++, with which it shares the RT60 and DRR predictors, shows that high-quality RIR prediction benefits from learned methods that go beyond estimating simple acoustic parameters.

Our model outperforms all baselines by a statistically significant margin on both seen and unseen environments. This shows that our approach facilitates environment acoustics modeling in a way that generalizes to novel environments without any retraining. Furthermore, its performance improvement over Fast-RIR (ratnarajah2022fast)++ emphasizes the advantage of directly predicting RIRs on the basis of the implicit acoustic model inferred from egocentric observations, as opposed to indirectly synthesizing RIRs by first estimating high-level acoustic characteristics of the target. As expected, our model has more limited success for very far-field queries; widely separated sources and receivers make modeling late reverberation difficult (see Supp. for details).

        Seen environments (3)                                  Unseen environments (all)
        Seen zones                  Unseen zones
        STFT   RTE   DRRE   MOSE    STFT   RTE   DRRE   MOSE   STFT   RTE   DRRE   MOSE
NAF     2.31   10.2  523    36.1    2.22   2.65  322    34.7   3.62   5.11  463    56.6
Ours    0.90   1.97  95.7   7.08    1.58   1.25  155    13.6   1.22   0.65  164    10.5
Table 2: RIR prediction results for our model vs. NAF (luo2022learning). Lower is better for all metrics.
Method                            STFT   RTE   DRRE   MOSE
Nearest Neighbor                  6.88   68.8  386    22.7
Linear Interpolation              6.61   68.7  388    21.5
AnalyticalRIR++                   3.37   7.95  474    26.0
NAF (luo2022learning)             3.62   51.1  367    56.6
Fast-RIR (ratnarajah2022fast)++   1.58   1.23  436    17.1
Few-ShotRIR (Ours)                1.51   1.30  202    14.0
Figure 3: RIR prediction results with ambient environment sounds in unseen environments. Lower is better for all metrics.
Figure 4: Training time comparison vs. NAF (luo2022learning).
Few-ShotRIR (Ours) vs. NAF (luo2022learning).

Recall that, unlike our approach, NAF requires training one model per environment. Thus, for a fair comparison, we train one model per scene for both NAF and our method. Due to the high computational cost of training NAF (discussed below), we limit this training to three large seen environments. Further, we split each seen environment into seen zones, which we use for training and for testing the models' interpolation capabilities, and unseen zones, which test intra-scene generalization. To give NAF access to our model's observations, we finetune it on our model's echo inputs before testing. For unseen environments, our model adopts the setup from the previous section, whereas we train NAF from scratch on our model's observed echoes; note that NAF's scene-specificity does not allow finetuning a model trained on a seen environment.

Table 2 shows the results. Our model significantly outperforms NAF (luo2022learning) on seen environments, in both seen and unseen zones. The seen-zone results underscore our model's better interpolation capabilities compared to NAF when tested on held-out queries from the training zones. Our method's improvement over NAF on unseen zones shows that our model design and training objective lead to much better intra-scene generalization, even when NAF is separately finetuned on the observation sets from the unseen zones. While NAF improves over the naive baselines from Table 1 on STFT error, it does worse than the other methods on most metrics. This demonstrates that merely learning to predict the limited set of echoes contained in our model's observation set is insufficient to accurately model acoustics for unseen environments.

Figure 4 compares the training cost between NAF (luo2022learning) and our model using wall clock time, when both models are trained on 8 NVIDIA Quadro RTX 6000 GPUs. When we train one model per environment, NAF takes hours to converge on average, while our model takes hours. However, our model design allows us to train one model jointly on all Matterport3D training scenes in hours, which reduces the average training time by —down to hours per environment. Moreover, for unseen environments, training NAF on echoes requires 2.1 hours for each observation set. On the other hand, our model design enables training on a large number of scenes at a much lower average cost, while also allowing generalization to novel environments without further training.

5.2 Model Analysis


In Table 1 (bottom) we ablate the components of our model. Removing either modality from the input causes a drop in performance, which indicates that our model leverages the complementary information from vision and audio to learn a better implicit model of the scene. We also observe a performance drop across all metrics, especially on the RTE metric, upon removing our energy decay matching loss . This shows that having as part of the training objective allows our model to better capture desirable reverberation characteristics for the target query, like RT60, while also helping it score better on the other metrics. Furthermore, we see that reducing the observation set to just 1 sample, i.e., , impacts our model’s performance. However, even under this extreme condition, our model still generalizes better than several baselines. We further investigate the impact of context size on our model’s performance in Figure 7. Our model already reduces the error significantly with a context size of 5, with diminishing reductions as the context grows. This plot also highlights the low-shot success of our model vs. the strongest baseline, Fast-RIR++.
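For readers who want to probe reverberation characteristics like RT60 themselves, the energy decay of an RIR is commonly computed with Schroeder backward integration. A minimal NumPy sketch follows; this is a standard analysis tool, not the paper's exact energy decay matching loss, whose form is defined in the main text.

```python
import numpy as np

def energy_decay_curve(rir, eps=1e-12):
    """Schroeder backward integration: energy remaining in the RIR
    after each sample, normalized by the total energy and expressed
    in dB. RT60 can be read off this curve."""
    energy = np.asarray(rir, dtype=np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]  # integrate from the tail backwards
    edc = edc / (edc[0] + eps)
    return 10.0 * np.log10(edc + eps)

# Example: a synthetic exponentially decaying RIR at 16 kHz
t = np.arange(16000) / 16000.0
rir = np.exp(-6.9 * t)
edc_db = energy_decay_curve(rir)
```

The curve starts at 0 dB and decreases monotonically, since energy can only be lost as the integration window shrinks toward the tail.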

Ambient environment sounds.

We test our model’s ability to generalize in the presence of ambient and background sounds. To that end, we repeat the experiment from Sec. 5.1, this time inserting a random ambient or background sound (e.g., a running heater, dripping water). This background noise corrupts the echoes input to our model. Table 4 reports the results. Even in this more challenging setting, our method substantially improves over all baselines on almost all metrics.

Qualitative results.
Figure 5: RIR predictions for a high and a low reverberation case, where our model uses the same observations in both cases. For high reverb, our model relies more on vision than on echoes for inferring scene acoustics, since echoes could be misleading in this case due to their long reverberation tails. For low reverb, our model uses echoes more, likely because their long-range nature better informs it about the acoustics of the more open surroundings.

Figure 5 shows two RIR prediction scenarios for our model: high reverberation, where the query receiver is located close to the source in a very narrow and reverberant corridor surrounded by walls, and low reverberation, where the source and receiver are spread apart in a more open space. Our model shares the same observation samples across these settings. With high reverb, our model focuses on vision due to its ability to better reveal the compact geometry of the surroundings and its effects on scene acoustics, whereas the echoes are distorted by strong reverberation. For low reverb, echoes are probably more informative about the acoustics of the more open surroundings due to their long-range nature. However, in both cases, our model prioritizes samples that provide good coverage of the overall scene, rather than just scoping out the local area around the query. This allows our model to make predictions that closely match the targets.

5.3 Downstream Applications for RIR Inference: Source Localization and Depth Estimation

                                    Seen            Unseen
                                  SLE    DPE      SLE    DPE
True RIR (Upper bound)           14.9   0.97     17.0   1.25
Nearest Neighbor                  202   1.50      214   1.57
Linear Interpolation              202   1.39      213   1.49
AnalyticalRIR++                   254   1.64      270   1.69
NAF (luo2022learning)             329   1.68        –      –
Fast-RIR (ratnarajah2022fast)++   168   1.39      201   1.52
Few-ShotRIR (Ours)               50.3   1.35     64.6   1.45
Figure 6: Downstream task evaluation of RIR predictions. All metrics use a base of and lower is better.
Figure 7: STFT error vs. context size .

Next, we consider two downstream tasks inspired by AR/VR and robotics: sound source localization and depth estimation from echoes. For both tasks, spatial audio generated with more accurate RIRs should better represent the true acoustics, and hence yield better downstream results. We train a model for each task using ground truth RIRs and evaluate it using the predictions from our model and all baselines for the same set of queries from both seen and unseen environments. Table 7 reports the results. DPE is the average error between a normalized depth target and its prediction, and SLE is the average error in the prediction of the source location (in meters) relative to the receiver for a query. Our method outperforms all baselines by a statistically significant margin. In particular, for the more difficult unseen environments, our model reduces the error relative to the ground truth upper bound by 74% for SLE and 25% for DPE compared to Fast-RIR++, highlighting that our model’s predictions capture spatial and directional cues more precisely than all other baselines and existing methods.
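As a rough sketch of how such metrics can be computed: the exact error definitions are not spelled out here, so the L2 (for SLE) and L1 (for DPE) choices below are assumptions, and the function names are hypothetical.

```python
import numpy as np

def source_localization_error(pred_xyz, true_xyz):
    """SLE sketch: mean Euclidean distance (in meters) between
    predicted and true source locations, each relative to the
    receiver. The L2 choice is an assumption."""
    pred = np.asarray(pred_xyz, dtype=float)
    true = np.asarray(true_xyz, dtype=float)
    return float(np.mean(np.linalg.norm(pred - true, axis=-1)))

def depth_prediction_error(pred_depth, true_depth):
    """DPE sketch: mean absolute error between normalized depth maps.
    The paper only says 'average error'; L1 is an assumption."""
    pred = np.asarray(pred_depth, dtype=float)
    true = np.asarray(true_depth, dtype=float)
    return float(np.mean(np.abs(pred - true)))
```

For example, a single prediction offset by a 3-4-5 triangle from the true source yields an SLE of 5 meters.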

6 Conclusion

We introduced a model to infer arbitrary RIRs after observing only a small number of echoes and images in the space. Our approach helps tackle key challenges in modeling acoustics from limited observations, generalizing to unseen environments without retraining, and enforcing desired acoustic properties in the predicted RIRs. The results show its promise: substantial gains over existing models, faster training, and impact on downstream source localization and depth estimation. In future work, we plan to explore ways to optimize the placement of the observation set and to curate large-scale real-world data for sim2real transfer.

Acknowledgements: Thanks to Tushar Nagarajan and Kumar Ashutosh for feedback on paper drafts.


7 Supplementary Material

In this supplementary material we provide additional details about:

  • Video (with audio) for qualitative illustration of our task and qualitative evaluation of our model predictions (Sec. 7.1).

  • Potential societal impact of our work (Sec. 7.2).

  • Evaluation of the impact of the query source location on our model’s prediction quality for a fixed receiver (Sec. 7.3).

  • Audio dataset details (Sec. 7.4), as mentioned in Sec. 5 of the main paper.

  • Model architecture details for RIR prediction (Sec. 7.5.1) and downstream tasks (Sec. 7.5.2), as noted in Sec. 5 of the main paper.

  • Training hyperparameters (Sec. 7.6), as referenced in Sec. 5 of the main paper.

7.1 Supplementary Video

The supplementary video shows the perceptually realistic SoundSpaces (chen2020soundspaces) audio simulation platform that we use for our experiments, and provides a qualitative illustration of our task, Few-Shot Audio-Visual Learning of Environment Acoustics. Moreover, we qualitatively demonstrate our model’s prediction quality by comparing the predictions with the ground truths, both at the RIR level and in terms of perceptual similarity when the RIRs are convolved with real-world monaural sounds, like speech and music. We also analyze common failure cases for our model (Sec. 5.1 in main) and qualitatively show how our model predictions can be used to successfully localize an audio source in a 3D environment. Please use headphones to hear the spatial audio correctly. The video is available at http://vision.cs.utexas.edu/projects/fs_rir.

7.2 Potential Societal Impact

Our model enables modeling the acoustics of a 3D scene using only a few observations. This has multiple applications with a positive impact. For example, accurate modeling of the scene acoustics enables a robot to locate a sounding object more efficiently (like finding a crying baby or locating a broken vase). Additionally, it allows for a truly immersive user experience in augmented and virtual reality applications. However, RIR generative models allow a user to match the acoustic reverberation of their speech to an arbitrary scene type, and hence hide their true location from the receiver, which may have both positive and negative implications. Finally, our model uses visual samples from the environment for more accurate modeling of the acoustic properties of the scene. However, the dataset used in our experiments mainly contains indoor spaces of western design, with an object distribution common to such spaces. This may bias models trained on such data toward similar types of scenes and reduce generalization to scenes from other cultures. More innovations in model design to handle strong shifts in scene layout and object distributions, as well as more diverse datasets, are needed to mitigate the impact of such possible biases.

7.3 Impact of the Source Location on the Prediction Error

In Fig. 8, we show the RIR prediction error as a function of different source locations for a fixed receiver location. As we can see, the prediction error tends to be small when the source is relatively close to the receiver or there are no major obstacles along the path connecting them. This indicates that the model leverages the local geometry of the scene and the acoustic information captured from echoes for better predictions. However, the error increases when the distance between the source and receiver is large (Sec. 5.1 in main), and especially when there are major obstacles to audio propagation in between (e.g., walls, narrow corridors). Modeling how audio is transformed along such a long path becomes very challenging due to the limited observations available to the model and the larger scene area that contributes to transforming the audio.

Figure 8: RIR prediction STFT error as a function of varying source locations (filled circles) for a given receiver (a green square with an arrow). We show two scenes and two examples per scene. The color of the circle at the source location indicates the STFT error in the RIR prediction associated with that source and receiver pair. The error in each example is normalized between the min and max values shown underneath the map.

7.4 Audio Dataset

For computing the mean opinion score error (MOSE) (chen2022visual) (Sec. 5 in main), we sample 5-second-long speech clips from the LibriSpeech (7178964) dataset, which comprise both male and female speakers. For every test query, we randomly choose one of the sampled clips and convolve it with the true RIR or a model’s prediction for that query to estimate the corresponding mean opinion score (MOS) (mosnet) and, subsequently, the error in MOS for a model’s prediction relative to the true RIR. We use a 5-second-long temporal window for all model predictions and true RIRs when estimating their MOS.
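The rendering step, convolving a dry clip with an RIR, can be sketched as follows (single channel shown; a binaural RIR is applied once per ear to produce two-channel audio):

```python
import numpy as np

def render_at_query(dry_clip, rir_channel):
    """Convolve a dry (anechoic) clip with one RIR channel to
    simulate the reverberant sound heard at the receiver."""
    return np.convolve(dry_clip, rir_channel)
```

As a sanity check, convolving with a unit-impulse RIR (a perfectly anechoic response) returns the clip unchanged.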

For our experiment with ambient environment sounds (Sec. 5.2 in main), we use ambient sounds from the ESC-50 (piczak2015dataset) dataset (e.g., dog barking, running water). For every test query, we randomly sample a location in the 3D scene for an ambient sound and play a randomly chosen 1-second-long clip from the ESC-50 dataset at that location. To retrieve the observed binaural echo response (Sec. 3 and 4.1 in main) in this setting, we first convolve the clean echo RIR for each observation with the sinusoidal sweep sound, then mix it with the binaural ambient sound for its pose , and finally deconvolve using the inverse sweep (Sec. 4.1 in main).
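This measure-then-deconvolve pipeline can be illustrated with a toy NumPy example. The linear sine sweep and the regularized frequency-domain deconvolution below are simplifications chosen for illustration, not the simulator's exact setup.

```python
import numpy as np

def deconvolve(recording, sweep, eps=1e-8):
    """Recover an impulse response by frequency-domain division of the
    recorded sweep response by the emitted sweep (regularized to avoid
    dividing by near-zero spectral bins)."""
    n = len(recording) + len(sweep) - 1   # pad to avoid circular wrap
    R = np.fft.rfft(recording, n)
    S = np.fft.rfft(sweep, n)
    H = R * np.conj(S) / (np.abs(S) ** 2 + eps)
    return np.fft.irfft(H, n)

# Simulated measurement: a sine sweep played through an "RIR" that is
# a single echo delayed by 100 samples.
sr = 16000
t = np.arange(sr) / sr
sweep = np.sin(2 * np.pi * (100 + 3000 * t) * t)  # simple linear sweep
rir = np.zeros(400)
rir[100] = 1.0
recording = np.convolve(sweep, rir)  # no ambient noise in this toy case
est = deconvolve(recording, sweep)
```

The estimated response peaks at the true 100-sample delay, showing how the sweep is "undone" to expose the room's response; in the paper's setting the mixed-in ambient sound additionally corrupts this estimate.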

We will release our datasets.

7.5 Architecture and Training

Here, we provide our architecture and additional training details for reproducibility. We will release our code.

7.5.1 Model Architectures for RIR Prediction

Visual Encoder.

Our visual encoder is a ResNet-18 (he2016deep) model (Sec. 4.1 in main) that takes egocentric RGB and depth images from the observation set, which are concatenated channel-wise, as input and produces a 512-dimensional feature.

Acoustic Encoder.

Our acoustic encoder is another ResNet-18 (he2016deep) (Sec. 4.1 in main) that separately encodes the binaural log magnitude spectrogram for an echo RIR into a 512-dimensional feature.

Pose Encoder.

To embed an observation pose or a query source-receiver pair (Sec. 3 in main), we use sinusoidal positional encodings (vaswani2017attention) (Sec. 4.1 in main) with 8 frequencies, which generate a 16-dimensional feature vector (the positional encodings comprise both sine and cosine components with 8 features per component) for every attribute of an observation pose or a query (i.e., , , and ).
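A minimal sketch of such an encoding for one scalar attribute; the geometric frequency schedule below is an assumption, since the paper only specifies 8 frequencies and a 16-dimensional output.

```python
import numpy as np

def sinusoidal_encoding(value, num_freqs=8):
    """Encode one scalar pose attribute with num_freqs sine/cosine
    pairs, giving a 16-dimensional feature for num_freqs=8. The
    power-of-two frequency schedule is an illustrative assumption."""
    freqs = 2.0 ** np.arange(num_freqs)
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])
```

Each pose attribute is encoded independently this way, and the per-attribute vectors are concatenated into the full pose embedding.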

Modality Encoder.

For our modality embedding (Sec. 4.1 in main), we maintain a sparse lookup table of 8-dimensional learnable embeddings, which we index with to retrieve the visual modality embedding () and 1 to retrieve the acoustic modality embedding ().

Fusion Layer.

To generate the multimodal memory (Sec. 4.1 in main) for our context encoder (Sec. 4.1 in main), we separately concatenate each observation’s modality feature (produced by for vision and for echo responses) with the corresponding sinusoidal pose embedding and the modality embedding ( for visual features and for acoustic features), and project the result using a single linear layer into a 1024-dimensional embedding space. Similarly, to generate the query encoding for our conditional RIR predictor (Sec. 4.2 in main), we use another linear layer to project the query’s sinusoidal positional encodings into a 1024-dimensional feature vector. Furthermore, we do not use a bias term in any fusion layer.
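The fusion step can be sketched as follows; the pose-attribute count (3, giving a 48-dimensional pose encoding with 16 dimensions per attribute) is a hypothetical choice for illustration, and the weight initialization is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(modality_feat, pose_enc, modality_emb, weight):
    """Concatenate a modality feature, the sinusoidal pose encoding,
    and the learnable modality embedding, then project with a single
    bias-free linear layer into the 1024-d memory space."""
    return weight @ np.concatenate([modality_feat, pose_enc, modality_emb])

# Hypothetical sizes: 512-d modality feature, 48-d pose encoding,
# 8-d modality embedding.
d_in = 512 + 48 + 8
W = rng.standard_normal((1024, d_in)) * 0.02  # no bias, per the paper
memory_entry = fuse(rng.standard_normal(512), rng.standard_normal(48),
                    rng.standard_normal(8), W)
```

One such 1024-dimensional entry is produced per modality per observation, and the collection forms the memory the context encoder attends over.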

Context Encoder.

Our context encoder (Sec. 4.1 in main) is a transformer encoder (vaswani2017attention) with 6 layers, 8 attention heads, a hidden size of 2048, and ReLU (sun2015deeply; DBLP:conf/icml/NairH10) activations. Additionally, we use a dropout (JMLR:v15:srivastava14a) of 0.1 in our context encoder.
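To illustrate the core mechanism, a single-head scaled dot-product self-attention over the fused observation embeddings can be written in a few lines of NumPy; the actual encoder stacks 6 such layers with 8 heads each, plus residual connections, layer normalization, and feed-forward sublayers.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a set of
    fused observation embeddings X (n_obs x d): every observation
    attends to every other, building the shared acoustic context."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # softmax over observations
    return w @ V

rng = np.random.default_rng(1)
d = 16
X = rng.standard_normal((5, d))  # 5 observations, toy dimensionality
out = self_attention(X, *(rng.standard_normal((d, d)) for _ in range(3)))
```

Because attention is over the set of observations rather than a fixed grid, the same encoder handles an arbitrary number of echo-image samples.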

Conditional RIR Predictor.

Our conditional RIR predictor (Sec. 4.2 in main) has 2 components: 1) a transformer decoder (vaswani2017attention) to perform cross-attention on the implicit representation (Sec. 4.1 in main), which is produced by the previously described context encoder, using the query encoding (Sec. 4.2 in main), and 2) a multi-layer transpose convolution network (Sec. 4.2 in main) to upsample the decoder output and predict the magnitude spectrogram for the query in log space.

The transformer decoder (vaswani2017attention) has the same architecture as our context encoder.

The transpose convolution network comprises 7 layers in total. The first 6 layers are transpose convolutions with a kernel size of 4, stride of 2, input padding of 1, ReLU (sun2015deeply; DBLP:conf/icml/NairH10) activations, and BatchNorm (pmlr-v37-ioffe15). The transpose convolutions have 128, 512, 256, 128, 64, and 32 input channels, respectively. The last layer of is a convolution layer with a kernel size of 3, stride of 1, padding of 1 along the height dimension and 2 along the width dimension, and 16 input channels. Finally, we switch off bias in all layers of .
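Given these hyperparameters, each transpose convolution doubles its input's spatial size, per the standard output-size formula; the starting 4x4 map below is a hypothetical example, not a value from the paper.

```python
def transpose_conv_out(size, kernel=4, stride=2, padding=1):
    """Standard transpose-convolution output size:
    out = (in - 1) * stride - 2 * padding + kernel.
    With kernel 4, stride 2, padding 1, each layer doubles the size."""
    return (size - 1) * stride - 2 * padding + kernel

# Six such layers grow a hypothetical 4x4 decoder map to 256x256.
size = 4
for _ in range(6):
    size = transpose_conv_out(size)
```

This doubling-per-layer behavior is why the 4/2/1 kernel/stride/padding combination is a common choice for spectrogram upsampling decoders.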

7.5.2 Model Architectures for Downstream Tasks

Sound Source Localization.

We use a ResNet-18 (he2016deep) feature encoder that takes the log magnitude spectrogram of an RIR (predicted or ground truth) as input. We feed the encoded features to a single linear layer that predicts the location coordinates of a query’s source relative to the query’s receiver pose.

Depth Estimation.

Following VisualEchoes (gao2020visualechoes), we use a U-net (ronneberger2015u) that takes the log magnitude spectrogram of an echo as input and predicts the depth map (Sec. 5.3 in main) as seen from the echo’s pose. The encoder of our U-net has 6 layers. The first layer is a convolution with a kernel size of 3, stride of 2, padding of 1 along the height dimension, and 2 input channels. The remaining 5 layers are convolutions with a kernel size of 4, padding of 1 and stride of 2. These 5 layers have 64, 64, 128, 256 and 512 input channels, respectively. Each convolution is followed by a ReLU activation (sun2015deeply; DBLP:conf/icml/NairH10) and a BatchNorm (pmlr-v37-ioffe15).

The decoder of the U-net has 5 transpose convolution layers. Each transpose convolution has a kernel size of 4, stride of 2, and input padding of 1. Except for the last layer, which uses a sigmoid activation function to generate depth maps normalized such that all pixels are in the range of , each transpose convolution has a ReLU activation (sun2015deeply; DBLP:conf/icml/NairH10) and a BatchNorm (pmlr-v37-ioffe15). The decoder layers have 512, 1024, 512, 256 and 128 channels, respectively. We use skip connections between the encoder and the decoder starting with their second layer. We switch off bias in both the encoder and decoder.

7.6 Training Hyperparameters

In addition to the training details specified in main (Sec. 4.3 in main), we use a batch size of 24 during training. Furthermore, for every entry of the batch, we query our model with 60 arbitrary source-receiver pairs for the same observation set, which effectively increases the batch size further and improves training speed. Other training hyperparameters specific to our Adam (kingma2014adam) optimizer include , and .