
Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis

Measuring the acoustic characteristics of a space is often done by capturing its impulse response (IR), a representation of how a full-range stimulus sound excites it. Recording these IRs is time-intensive and expensive, and often infeasible for inaccessible locations. We present Image2Reverb, the first work to generate an IR from a single image. The generated IR can then be applied to other signals via convolution, simulating the reverberant characteristics of the space shown in the image. We use an end-to-end neural network architecture to generate plausible audio impulse responses from single images of acoustic environments. We evaluate our method both by comparison to ground-truth data and by human expert evaluation. We demonstrate our approach by generating plausible impulse responses for diverse settings and formats, including well-known places, music halls, rooms in paintings, images from animations and computer games, synthetic environments generated from text, panoramic images, and video-conference backgrounds.


1 Introduction

An effective and widely used method of simulating acoustic spaces relies on audio impulse responses (IRs) and convolution [41, 34]. Audio IRs are recorded measurements of how an environment responds to an acoustic stimulus. IRs can be measured by recording a space's response to a broadband excitation such as a burst of noise, a clap, a balloon pop, or a sinusoid swept across the range of human hearing [31]. Accurately capturing these room impulse responses requires time, specialized equipment, knowledge, and planning, and directly recording them may be entirely infeasible in continuously inhabited or otherwise inaccessible spaces of interest. End-to-end IR estimation has far-ranging applications in fields including music production, speech processing, and immersive extended-reality environments. Our Image2Reverb system synthesizes IRs directly from images of acoustic environments, removing the cost and time barriers to entry and opening the door to a broad range of applications.
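As a concrete illustration of this convolution step, the sketch below applies an IR to a dry recording with NumPy/SciPy; the file names are placeholders and a mono signal is assumed.

```python
# Minimal convolution-reverb sketch: apply an IR to a dry (anechoic) signal.
# File names are placeholders; a mono dry signal and mono IR are assumed.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, sr = sf.read("dry_speech.wav")       # anechoic input signal
ir, sr_ir = sf.read("room_ir.wav")        # measured (or generated) impulse response
assert sr == sr_ir, "resample so both signals share one sample rate"

wet = fftconvolve(dry, ir, mode="full")   # reverberant signal = dry * IR
wet /= np.max(np.abs(wet)) + 1e-9         # normalize to avoid clipping
sf.write("wet_speech.wav", wet, sr)
```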

Figure 2: Impulse response overview. (A) Sound waves propagate along multiple paths as they interact with and reflect off their environment. These paths include the direct path from source to listener, early reflections (first- and higher-order, after reflecting off one or more surfaces), and a more diffuse tail as the reflections trail off and become more densely packed in time. Together, these reflections make up the impulse response of the environment, illustrated (B) schematically and (C) as a waveform.

In this work we model IR generation as a cross-modal paired-example domain adaptation problem and apply a conditional GAN [10, 12, 25] to synthesize plausible audio impulse responses conditioned on images of spaces. Next we describe important areas of related work that inform our approach.

1.1 Related Work

Artificial reverberation.

Historically, recording studios built reverberant chambers with speakers and microphones to apply reverb to pre-recorded audio directly within a physical space [33]. Reverberation circuits, first proposed in the 1960s, use a network of filters and delay lines to mimic a reverberant space [37]; later, digital algorithmic approaches applied numerical methods to simulate similar effects. In contrast, convolution reverb relies on audio recordings of a space's response to a broadband stimulus, typically a noise burst or sine sweep. This yields a digital replica of a space's reverberant characteristics, which can then be applied to any audio signal [1].

Convolutional neural networks have been used to estimate several reverb statistics from images [17, 18], though not to predict the full audio impulse response. This work is based on the finding that experienced acoustic engineers readily estimate a space's IR or reverberant characteristics from an image [16]. Room geometry has also been estimated from 360-degree images of four specific rooms [32] and used to create virtual acoustic environments that are compared with ground-truth recordings, though again IRs are not directly synthesized from the images. A related line of work synthesizes spatial audio based on visual information [20, 8, 15]. Prior work exists on synthesis of IRs using RNNs [35], autoencoders [39], and GANs: IR-GAN [30] uses parameters from real-world IRs to generate new synthetic IRs, whereas our work is the first to synthesize an audio impulse response directly from an image.

Generative models for audio.

Recent work has shown that GANs are amenable to audio generation and can result in more globally coherent outputs [6]. GANSynth [7] generates an audio sequence in parallel via a progressive GAN architecture, allowing faster-than-real-time synthesis and higher efficiency than the autoregressive WaveNet [42] architecture. Unlike WaveNet, which uses a time-distributed latent coding, GANSynth synthesizes an entire audio segment from a single latent vector. Given our need for global coherence, we create a fixed-length representation of our input data and adapt our generator model from the GANSynth architecture.

Measured IRs have been approximated with shaped noise [19, 3]. While room IRs exhibit statistical regularities [40] that can be modeled stochastically, the domain of this modeling is time and frequency limited [2], and may not reflect all characteristics of real-world recorded IRs. Simulating reverb with ray tracing is possible but prohibitively expensive for typical applications [36]. By directly approximating measured audio IRs at the spectrogram level, our outputs are immediately applicable to tasks such as convolution reverb, which applies the reverberant characteristics of the IR to another audio signal.

Cross-modal translation.

Between visual and auditory domains, conditional GANs have been used for translating between images and audio samples of people playing instruments [4]. Our work builds on this by applying state-of-the-art architectural approaches for scene analysis and high quality audio synthesis, suitable for our purposes.

2 Methods

In this section we describe the dataset, model, and algorithm.

2.1 Dataset

Data aggregation.

We curated a dataset of 265 different spaces totaling 1169 images and 738 IRs. From these, we produced a total of 11234 paired examples with a train-validation-test split of 9743-154-1957. These are assembled from sources including the OpenAIR dataset [26], IR libraries available for training, and web scraping. Due to the limited availability of data, many examples amount to weak supervision: for example, we may have a "kitchen" impulse response without an image of the kitchen in which it was recorded. In such cases, we augmented the data with plausible matching scenes, judged by the researchers and gathered via web scraping and manual filtering. Although this dataset contains high variability in several reverberant parameters, e.g., early reflections and source-microphone distance, it allows us to learn characteristics of late-field reverberation.

Data preprocessing.

We manually filtered images to remove duplicates, mismatches (e.g., exterior pictures of an indoor space), examples with significant occlusive "clutter" or excessive foreground activity, and intrusive watermarks. We then normalized each image, center-cropped it at the maximum possible width or height, and downsampled it to 224x224 pixels. We converted the audio IR files to monaural signals: for Ambisonic B-format sources we extracted the W (omnidirectional) channel, and for stereo sources we computed the arithmetic mean of the channels. Where 360-degree images were available, we extracted rectilinear projections to bring them in line with the standard 2D images in our dataset.
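A rough sketch of these preprocessing steps is shown below; the [0, 1] scaling and the exact I/O details are assumptions rather than the paper's precise pipeline.

```python
# Preprocessing sketch: square center-crop + 224x224 resize for images,
# and mono conversion for IRs. Scaling images to [0, 1] is an assumption.
import numpy as np
from PIL import Image
import soundfile as sf

def preprocess_image(path):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = min(w, h)                                   # largest centered square
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side)).resize((224, 224))
    return np.asarray(img, dtype=np.float32) / 255.0   # (224, 224, 3) in [0, 1]

def load_mono_ir(path):
    audio, sr = sf.read(path)                          # (samples,) or (samples, channels)
    if audio.ndim == 2:
        if audio.shape[1] == 4:                        # Ambisonic B-format: keep W (omni)
            audio = audio[:, 0]
        else:                                          # stereo etc.: average the channels
            audio = audio.mean(axis=1)
    return audio, sr
```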

Audio representation.

Our audio representation is a log-magnitude spectrogram. We first resample the audio files to 22.05 kHz and truncate them to 5.94 s in duration, which is sufficient to capture general structure and estimate reverberant characteristics for most examples. We then apply a short-time Fourier transform with a window size of 1024 samples and a hop size of 256 samples, before trimming the Nyquist bin, resulting in square 512x512 spectrograms. Finally, we take the logarithm of the magnitude spectrogram; audio IRs typically contain uncorrelated, noise-like phase, which offers little structure and is not modeled from the magnitude spectrum.
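The sketch below reproduces this representation with librosa; the 1024-sample window and 256-sample hop follow from the stated 512x512 spectrogram shape, and the small epsilon inside the log is a numerical-stability assumption.

```python
# Log-magnitude spectrogram sketch: 22.05 kHz, 5.94 s, STFT, Nyquist bin trimmed.
import numpy as np
import librosa

def ir_to_logspec(path, sr=22050, duration=5.94, n_fft=1024, hop=256, eps=1e-8):
    audio, _ = librosa.load(path, sr=sr, mono=True)
    n = int(sr * duration)                                    # 130977 samples
    audio = np.pad(audio[:n], (0, max(0, n - len(audio))))    # truncate or zero-pad
    spec = librosa.stft(audio, n_fft=n_fft, hop_length=hop)   # (513, 512) complex STFT
    mag = np.abs(spec)[:-1, :]                                # drop Nyquist bin -> (512, 512)
    return np.log(mag + eps)                                  # log-magnitude spectrogram
```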

2.2 Model

Components.

Our model is a conditional GAN with an image encoder: it takes an image as input and produces a spectrogram. This overall design, with an encoder, generator, and conditional discriminator, is similar to the one Mentzer et al. [24] applied to obtain state-of-the-art results on image compression, and it has been applied in other domains. The generator and discriminator are deep convolutional networks based on the GANSynth [7] model for neural audio synthesis (non-progressive variant), with modifications to suit our dataset, dimensions, and training procedure.

The encoder module combines image feature extraction with depth estimation to produce latent vectors from two-dimensional images of scenes. For depth estimation, we use the pretrained Monodepth2 network [9], a monocular depth-estimation encoder-decoder network that produces a one-channel depth map corresponding to the input image. The main feature extractor is a ResNet50 [13] pretrained on Places365 [44], which takes a four-channel representation of the scene including the depth channel (4x224x224). We add randomly initialized weights to accommodate the additional input channel for the depth map. Since we fine-tune the entire network, albeit at a low learning rate, we expect it to learn the relevant features during optimization. Our architecture's components are shown in Figure 3.
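A sketch of the four-channel encoder input is shown below. A torchvision ImageNet ResNet50 stands in for the Places365-pretrained weights (which come from a separate checkpoint), so the head dimension differs from the 365-d features used in the paper.

```python
# Sketch: extend a pretrained ResNet50 to accept RGB + depth (4 channels),
# keeping the pretrained RGB filters and randomly initializing the depth filters.
import torch
import torch.nn as nn
from torchvision.models import resnet50

def make_rgbd_encoder():
    net = resnet50(weights="IMAGENET1K_V1")                 # placeholder for Places365 weights
    old_conv = net.conv1                                     # (64, 3, 7, 7)
    new_conv = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
    with torch.no_grad():
        new_conv.weight[:, :3] = old_conv.weight             # reuse pretrained RGB filters
        nn.init.kaiming_normal_(new_conv.weight[:, 3:])      # random init for the depth channel
    net.conv1 = new_conv
    return net

encoder = make_rgbd_encoder()
features = encoder(torch.randn(1, 4, 224, 224))              # (1, 1000) with the stock head;
                                                             # the Places365 head would give 365-d
```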

Figure 3: System architecture. Our system consists of encoder-decoder and GAN networks. Left: An input image is converted into 4 channels: red, green, blue, and depth. The depth map is estimated by Monodepth2, a pretrained encoder-decoder network. Right: Our model employs a conditional GAN. An image feature encoder is given the RGB and depth images and produces part of the Generator's latent vector, which is then concatenated with noise. The Discriminator applies the image latent vector as a conditioning label at an intermediate stage via concatenation to make a conditional real/fake prediction; the resulting losses are used to optimize the Encoder, Generator, and Discriminator.
Input: image x; Monodepth2 M; Encoder E; Generator G; Discriminator D; real spectrogram s;
Parameters: θ_E, θ_G, θ_D (weight variables);
Output: depth map M(x); feature vector E(x⁺); generated spectrogram G(z); discriminator prediction D(G(z) | f) or D(s | f);
for number of epochs do
       Sample training images x;
       Get depth d = M(x);
       Append depth features to RGB channels (x⁺ = x ⊕ d);
       Encode image to feature vector (f = E(x⁺));
       Append noise (z = f ⊕ z_n);
       Generate spectrogram (ŝ = G(z));
       Forward pass through the discriminator with either the fake or the real spectrogram (D(ŝ | f) or D(s | f));
       Backward pass: update parameters for the discriminator (θ_D), generator (θ_G), and encoder (θ_E);
end for
ALGORITHM 1 Forward and backward passes through the Image2Reverb model. Notation is explained in Table 1.
Notation Definition
x input image
d estimated depth map (M(x))
⊕ concatenation operator
x⁺ image with depth map (x ⊕ d)
s real spectrogram
E, G, D Encoder, Generator, Discriminator
M Monodepth2 encoder-decoder
θ weights for a model
z_n noise, z_n ~ N(0, 1)
z latent vector, encoder output and noise (f ⊕ z_n)
Table 1: Notation and definitions for variables used in different parts of this paper.
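A minimal PyTorch sketch of one iteration of Algorithm 1 follows. The modules monodepth, E, G, and D are placeholders for Monodepth2, the image encoder, and the GANSynth-style generator/discriminator; the noise dimension and the L1 weight lam are assumptions, and the T60 term of Eq. (2) is sketched separately later.

```python
# One Algorithm 1 iteration: depth -> 4-channel image -> encode -> generate ->
# discriminate, then update D, G, and E (LSGAN + L1 objectives only here).
import torch

def train_step(img, real_spec, monodepth, E, G, D, opt_E, opt_G, opt_D,
               noise_dim=128, lam=1.0):
    # img: (B, 3, 224, 224) RGB; real_spec: (B, 1, 512, 512) real log-spectrograms
    with torch.no_grad():
        depth = monodepth(img)                                # (B, 1, 224, 224) depth maps
    x = torch.cat([img, depth], dim=1)                        # append depth to RGB channels
    f = E(x)                                                  # image feature vector
    z = torch.cat([f, torch.randn(f.size(0), noise_dim)], 1)  # append noise -> latent vector
    fake_spec = G(z)                                          # generated spectrogram

    # Discriminator update (LSGAN): real -> 1, fake -> 0, conditioned on f.
    opt_D.zero_grad()
    d_loss = 0.5 * ((D(real_spec, f.detach()) - 1) ** 2).mean() \
             + 0.5 * (D(fake_spec.detach(), f.detach()) ** 2).mean()
    d_loss.backward()
    opt_D.step()

    # Generator + encoder update: adversarial term plus L1 spectrogram distance.
    opt_G.zero_grad()
    opt_E.zero_grad()
    g_loss = 0.5 * ((D(fake_spec, f) - 1) ** 2).mean() \
             + lam * (fake_spec - real_spec).abs().mean()
    g_loss.backward()
    opt_G.step()
    opt_E.step()
    return d_loss.item(), g_loss.item()
```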

Objectives.

We use the least-squares GAN formulation (LSGAN) [22]. For the discriminator:

$$\mathcal{L}_D = \tfrac{1}{2}\,\mathbb{E}_{s}\!\left[(D(s \mid f) - 1)^2\right] + \tfrac{1}{2}\,\mathbb{E}_{z}\!\left[D(G(z) \mid f)^2\right] \qquad (1)$$

For the generator, we introduce two additional terms to encourage realistic and high-quality output. First, we add the L1 difference between the real and generated spectrograms, scaled by a hyperparameter; this is a common approach in image and audio settings. Second, we introduce a domain-specific term that estimates the T60 values, the time it takes for the reverberation to decay by 60 dB, for the real and generated samples and returns the absolute percent error between the two, again scaled by a hyperparameter. We term this differentiable proxy measure T60P. To compute it for a log-spectrogram, we first recover the linear magnitude spectrogram and sum along the frequency axis to obtain a fullband amplitude envelope. We use Schroeder's backward integration method on the squared envelope to obtain a decay curve, and linearly extrapolate the decay to obtain a T60 estimate. In all:

$$\mathcal{L}_G = \tfrac{1}{2}\,\mathbb{E}_{z}\!\left[(D(G(z) \mid f) - 1)^2\right] + \lambda_1 \,\lVert s - G(z) \rVert_1 + \lambda_2 \,\mathrm{T60P}\!\left(s, G(z)\right) \qquad (2)$$
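The sketch below is one way to implement such a differentiable proxy in PyTorch: Schroeder backward integration of a fullband envelope followed by a least-squares fit of the decay curve. The least-squares fit over all frames is an assumption standing in for the paper's exact extrapolation point, as is the frame-to-seconds conversion using the STFT hop.

```python
# Differentiable T60 proxy sketch: exp() to linear magnitudes, sum over frequency,
# Schroeder backward integration, then a least-squares decay slope extrapolated to -60 dB.
import torch

def t60_proxy(log_spec, sr=22050, hop=256, eps=1e-8):
    # log_spec: (freq, time) log-magnitude spectrogram
    env = torch.exp(log_spec).sum(dim=0)                               # fullband amplitude envelope
    energy = env ** 2
    edc = torch.flip(torch.cumsum(torch.flip(energy, [0]), 0), [0])    # Schroeder energy decay curve
    edc_db = 10.0 * torch.log10(edc / (edc[0] + eps) + eps)            # normalized decay curve (dB)

    t = torch.arange(edc_db.numel(), dtype=edc_db.dtype)               # frame indices
    t = t - t.mean()
    slope = (t * (edc_db - edc_db.mean())).sum() / (t * t).sum()       # dB per frame (negative)
    t60_frames = -60.0 / (slope - eps)                                 # frames to decay by 60 dB
    return t60_frames * hop / sr                                       # seconds

def t60p_error(real_logspec, fake_logspec):
    t_real = t60_proxy(real_logspec)
    # absolute percent error, expressed as a fraction
    return (t60_proxy(fake_logspec) - t_real).abs() / (t_real.abs() + 1e-8)
```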

Training.

We train our model on 8 NVIDIA 1080 Ti GPUs, using three Adam optimizers, one each for the Generator, Discriminator, and Encoder, to optimize the networks' parameters. Hyperparameters are noted in Table 2. We make our models and code publicly available (https://github.com/nikhilsinghmus/image2reverb).

Parameter Value
Initial learning rate (Generator) 4e-4
Initial learning rate (Discriminator) 2e-4
Initial learning rate (Encoder) 1e-5
Adam betas (β1, β2) (0.0, 0.99)
Adam epsilon (ε) 1e-8
Table 2: Hyperparameters: the initial learning rates for the Generator, Discriminator, and Encoder, and the betas (β1, β2) and epsilon (ε) for the Adam optimizers we use (one each for the Generator, Discriminator, and Encoder).
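As a sketch, the Table 2 values translate into three Adam optimizers as below; the mapping of the three learning rates to the Generator, Discriminator, and Encoder follows the table's ordering and should be treated as an assumption.

```python
# Three Adam optimizers, one per network, configured from Table 2.
import torch

def make_optimizers(G, D, E):
    kw = dict(betas=(0.0, 0.99), eps=1e-8)
    opt_G = torch.optim.Adam(G.parameters(), lr=4e-4, **kw)
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, **kw)
    opt_E = torch.optim.Adam(E.parameters(), lr=1e-5, **kw)
    return opt_G, opt_D, opt_E
```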

3 Results

Using Image2Reverb we are able to generate perceptually plausible impulse responses for a diverse set of environments. In this section, we provide input-output examples to demonstrate the capabilities and applications of our model and also review results of a multi-stage evaluation integrating domain-specific quantitative metrics and expert ratings. Our goal is to examine output quality and conditional consistency, generally considered important for conditional GANs [5].

3.1 Examples

We present several collections of diverse examples in our supplementary material, with inputs curated to illustrate a range of settings of interest including famous spaces, musical environments, and entirely virtual spaces. All examples are made available as audiovisual collections (audiovisual samples: https://dev5839.d12qej5d2ijqlh.amplifyapp.com) and were generated with a model trained for 200 epochs in around 12 hours on a cloud instance. Figure 4 shows examples from our test set that were used in our expert evaluation (4 of 8, one from each category: Small, Medium, Large, and Outdoor). We convolve a spoken-word anechoic signal with the generated IRs for the reader to hear. Figure 5 takes images of diverse scenes (art, animation, historical and recognizable places) as inputs. Figure 6 demonstrates how sections of 360-degree equirectangular images are cropped, projected, and passed through our model to generate IRs of spaces relevant to immersive virtual-reality environments.

We strongly encourage the reader to explore these examples on the accompanying web page. We include examples of musical performance spaces, artistic depictions (drawings, paintings), 3D animation scenes, synthetic images from OpenAI's DALL·E, as well as real-world settings that present challenges (e.g., illusions painted on walls, reflections, etc.). These are largely real-world environments for which we may not have ground-truth IRs, demonstrating how both familiar and unusual scenes can be transformed in this way.

Figure 4: Ground-truth measured IRs vs. generated IRs. Columns show input images, depth maps, measured IRs with corresponding convolved speech, and generated IRs with corresponding convolved speech. Larger indoor spaces here tend to exhibit greater T60 times with longer measured impulse responses. The outdoor scene has a very short measured IR and a correspondingly short generated IR. Input images are all examples that were used in the expert survey and were drawn from the test set.
Figure 5: Generated IR examples. Columns show input images, depth maps, generated IRs, and a dry anechoic speech signal before and after the generated IR was applied via convolution. The input images come from a variety of spaces that illustrate possible applications of our model. Some images are synthetic in nature, including an oil painting, a 3D animation still, and a video game screenshot. Other images come from real-world scenes such as a church (where music is often heard), a famous yet inaccessible space (SpaceX), and an outdoor desert scene. Larger indoor spaces tend to exhibit greater T60 times with longer impulse responses, as seen here.
Figure 6: VR. Impulse responses generated from an equirectangular 360-degree image by sampling points on a sphere, cropping and applying a rectilinear projection to the resulting image, and feeding them into our model. This demonstrates how our model directly generates realistic impulse responses of panoramic virtual reality compatible images. Future work may allow generation of impulse responses using an entire 360-degree image, though at present there is a lack of paired impulse responses and 360 degree images available for training.

3.2 Ablation Study

To understand the contribution of key architectural components and decisions, we perform a study characterizing how removing each of them affects test-set T60 estimation after 50 training epochs. The three components are the depth maps, the T60P objective term, and the pretrained Places365 weights for the ResNet50 encoder. Figure 7 shows T60 error distributions over the test set for each of these model variants, and Table 3 reports descriptive statistics.

Our model shows better mean error (closer to 0%) and less dispersion (a lower standard deviation) than the other variants. The former is well within the just-noticeable-difference (JND) bounds for T60, often estimated at around 25-30% for a musical signal [23]. Additionally, this is an upper bound for authenticity, a more rigorous goal than perceptual plausibility [28]. The lower standard deviation indicates generally more consistent performance from this model across different examples, even in the presence of some that cause relatively large estimation errors due to incorrect interpretation of relevant qualities in the image, or inaccurate or noisy synthesis or estimation.

Figure 7: T60 estimation error distributions, in percentages, for each of our model variants. T60 estimates how long it takes the reverberation to decay by 60 dB. "Main" indicates our architecture as described earlier, "No Depth" indicates a version not using depth maps, "No T60P" omits the differentiable T60 objective term, and "No Places" does not use the pretrained Places365 weights for the ResNet50 encoder. "NN" applies a nearest-neighbor approach with Places365-ResNet50 embeddings for images (error values are clipped to 2000% for clarity). Corresponding descriptive statistics are given in Table 3.
Main -Depth -T60P -P365 NN
Mean Err (%) -6.03 -9.17 -7.1 43.15 149
Std. dev. (%) 78.8 83.1 85.97 144.3 491.02
Table 3: T60 estimation error statistics, in percentages, for each of our model variants. "Main" indicates our architecture as described earlier, "-Depth" indicates a version not using depth maps, "-T60P" omits the differentiable T60 objective term, and "-P365" does not use the pretrained Places365 weights for the ResNet50 encoder. "NN" indicates a nearest-neighbor approach with Places365-ResNet50 embeddings for images. For the mean error, values closer to 0 reflect better performance; for the standard deviation, lower values reflect better performance. Distributions are visualized in Figure 7.

3.3 Expert Evaluation

Following the finding that experienced acoustic engineers readily estimate a space's reverberant characteristics from an image [16], we designed an experiment to evaluate our results. This experiment is designed to estimate comparative perceptual plausibility rather than (physical) authenticity (e.g., by side-by-side comparison to assess whether any difference can be heard); these goals have been differentiated in prior work [28]. We selected two arbitrary examples from each of the four scene categories and recruited a panel of 31 experts, defined as those with significant audio experience, to participate in a within-subjects study. For each example, we convolved an arbitrary anechoic signal with the output IR as well as with the ground-truth IR. These 16 samples were presented in randomized order, and participants were instructed to rate each on a scale from 1 to 5 for 1) reverberation quality, and 2) realism, or "match", between the reverb they expected from the image and the presented signal with reverb applied. Participants answered one reverb-related screening question to demonstrate eligibility, and two attention-check questions at the end of the survey. The four scene categories, Large, Medium, Outdoor, and Small, demonstrate diversity in visual-reverb relationships. The dependent variables are quality and match ratings, and the independent variables are IR source (real vs. fake) and scene category. We first test our data for normality with D'Agostino and Pearson's omnibus test [27], which does not indicate a significant departure from normality.

A two-way repeated-measures ANOVA revealed a statistically significant interaction between IR source and scene category for both quality ratings and match ratings (reported p-values are adjusted with the Greenhouse-Geisser correction [11]). This indicates that statistically significant differences between ratings for real and fake IR reverbs depend on the scene category. Per-participant ratings and rating changes, overall and by scene, are shown in Figure 8.

Subsequent tests for simple main effects with paired two one-sided tests (TOST) indicate that real vs. fake ratings are statistically equivalent for large and small quality ratings, and for large, medium, and small match ratings. These tests use an equivalence bound of 1, testing whether the means of the two populations differ by less than one rating unit. Results are shown in Table 4. Notably, outdoor scenes appear to drive the rating differences between real and fake IRs. We conjecture this is because outdoor scenes are too different a regime from the vast majority of our data, which are indoor, to model effectively. Additionally, medium-sized scenes appear to contribute to differences in quality.
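A paired TOST with an equivalence bound of one rating unit can be sketched with two one-sided t-tests on the rating differences, as below; the arrays are illustrative, not the study data.

```python
# Paired two one-sided tests (TOST) for equivalence within +/-1 rating unit.
import numpy as np
from scipy.stats import ttest_1samp

def paired_tost(x, y, bound=1.0):
    d = np.asarray(x) - np.asarray(y)
    p_lower = ttest_1samp(d, -bound, alternative="greater").pvalue  # mean(d) > -bound?
    p_upper = ttest_1samp(d, bound, alternative="less").pvalue      # mean(d) < +bound?
    return max(p_lower, p_upper)   # equivalence p-value (before any Bonferroni adjustment)

rng = np.random.default_rng(0)
real = rng.normal(3.8, 0.7, size=31)             # illustrative per-participant ratings (real IRs)
fake = real + rng.normal(0.0, 0.4, size=31)      # illustrative ratings for generated IRs
print(f"TOST p = {paired_tost(real, fake):.4f}")  # small p -> means within +/-1 of each other
```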

Rating Scene DoF
Quality Large 56
Quality Medium 56
Quality Outdoor 56
Quality Small 56
Match Large 56
Match Medium 56
Match Outdoor 56
Match Small 56
Table 4: Simple main effect tests for equivalence between real and generated IRs across different categories of scenes. We use paired two one-sided tests with equivalence bounds of 1 and Bonferroni-adjusted p-values. These results indicate that real vs. fake ratings are statistically equivalent within one rating unit, the resolution of the rating scale, for large and small quality ratings, and for large, medium, and small match ratings. Notably, outdoor scenes appear to contribute to the overall main effect between real and fake IRs. Additionally, medium-sized scenes appear to contribute to differences in quality.
Figure 8: Expert evaluation results. Paired plots showing per-participant quality and match differences in rating for each scene category. Green lines indicate a higher rating for real IRs, red lines for generated IRs, and grey lines indicate equivalent ratings for both.

3.4 Model Behavior and Interpretation

Effect of varying depth.

We compare the full estimated depth map with constant depth maps filled with either 0 or 0.5 (chosen based on the approximate lower and upper bounds of our data). We survey the distributions of the generated IRs' T60 values over our test set; the results are shown in Figure 9. Table 5 reports descriptive statistics for these distributions, showing that, relative to the main model's output, the IRs' decay times are biased lower by the 0-depth input and higher by the 0.5-depth input. This may indicate some potential for steering the model in interactive settings. We do note, however, that behavior with constant depth values greater than 0.5 is less predictable. This may be due to the presence of outdoor scenes, for which the scene's depth may not be correlated with IR duration.
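In practice, this kind of steering amounts to swapping the Monodepth2 output for a constant map before concatenation, as in the hypothetical snippet below (following the earlier sketches).

```python
# Steering sketch: replace the estimated depth channel with a constant map.
import torch

rgb = torch.rand(1, 3, 224, 224)                  # placeholder input image
depth_const = torch.full((1, 1, 224, 224), 0.5)   # 0.0 suggests a small scene, 0.5 a larger one
x = torch.cat([rgb, depth_const], dim=1)          # 4-channel input for the encoder
```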

Figure 9: Effect of depth on T60. Distributions of estimated T60 values for the model with estimated depth maps, as well as constant depth maps set to either 0 or 0.5. The latter is near a typical maximum depth value in our scenes, with only a few examples reaching 0.7 and above. The relationship between depth and T60 is non-linear, and is likely confounded by outdoor scenes, which may pair significant depth with low T60 times. However, we observe that manipulating the depth value allows us to "suggest" a smaller or larger scene in general, or bias the output of the model in either direction (i.e., 0.5 produces longer IRs than both 0 and the actual estimated depth map). Table 5 shows corresponding descriptive statistics. These results indicate a level of "steerability" of the model's behavior in human-in-the-loop settings.
Main Depth 0 Depth 0.5
(s) 2.07 2.01 3.62
1.54 0.87 2.36
2.69 2.00 3.07
Table 5: Descriptive statistics of the generated IRs' T60 values (in seconds) for the model with estimated depth maps, as well as constant depth maps set to either 0 or 0.5. The full depth map's results lie between those of the 0 and 0.5 depth maps. Figure 9 visualizes the corresponding distributions.

Effect of transfer learning.

To understand which visual features are important to our encoder, we use Gradient-weighted Class Activation Mapping (Grad-CAM) [38], a widely used technique for visually interpreting convolutional neural networks by localizing the regions that contribute most to a given target feature (or class, in a classification setting). We produce such maps for our test images with both the ResNet50 pretrained on the Places365 dataset and the final fine-tuned encoder. All resulting pairs exhibit noticeable differences; we check this with the structural similarity index (SSIM) metric [43], which is below 0.98 for all examples.
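The SSIM check can be reproduced with scikit-image as sketched below; the two heatmaps here are illustrative arrays standing in for the Grad-CAM outputs of the pretrained and fine-tuned encoders.

```python
# Compare two Grad-CAM heatmaps with SSIM (values below ~0.98 were taken
# as a noticeable difference between the pretrained and fine-tuned encoders).
import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(0)
cam_pretrained = rng.random((224, 224)).astype(np.float32)   # placeholder heatmap in [0, 1]
cam_finetuned = np.clip(cam_pretrained
                        + 0.1 * rng.standard_normal((224, 224)), 0, 1).astype(np.float32)

score = ssim(cam_pretrained, cam_finetuned, data_range=1.0)
print(f"SSIM = {score:.3f}")
```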

Figure 10: Grad-CAMs for images passed through both the pre-trained Places365 ResNet50 encoder and our fine-tuned encoder, showing movement towards significant reflective areas. These are for (A) a small environment, and (B) a large environment. The fine-tuned model’s activations highlight larger reflective surfaces: the depth of the staircase for (A) in contrast to the railing that may be more optimal for scene identification, and the far wall-to-ceiling corner and surrounding areas for (B).
Figure 11: Grad-CAMs for images passed through both the pre-trained Places365 ResNet50 encoder and our fine-tuned encoder, showing movement towards more textured areas. These are for (A) an indoor environment, and (B) an outdoor environment, with the former apparently containing significant absorption and the latter containing few reflective surfaces. In both cases, textured areas are highlighted. These may be associated with absorption, diffusion, and more sparse reflections depending on the scene.

We qualitatively survey these maps and identify two broad regimes of change, illustrated with particular examples. First, we observe that the greatest-valued feature is often associated with activations over visual regions corresponding to large reflective surfaces; examples are shown in Figure 10. Often these are walls, ceilings, windows, and other surfaces in reflective environments. Second, we find that textured areas are highlighted in less reflective environments; examples are shown in Figure 11. These may correspond to sparser reflections and diffusion.

Limitations and future work.

Many images of spaces may offer inaccurate portrayals of the relevant properties (size, shape, materials, etc.), or may be misleading (see examples in the supplementary material), leading to erroneous estimations. Our dataset also contains substantial variation in other relevant acoustic parameters that we cannot semantically connect to the paired images, given the sources of our data. New audio IR datasets collected with carefully corresponding photos may allow these characteristics to be modeled more precisely.

4 Conclusion

We introduced Image2Reverb, a system that directly synthesizes audio impulse responses from single images. These IRs can be applied directly in downstream convolution-reverb settings to simulate the depicted environments, with applications to XR, music production, television and film post-production, video games, videoconferencing, and other media. Our quantitative and human-expert evaluations show significant strengths, and we discuss the method's limitations. We demonstrate that end-to-end, image-based synthesis of plausible audio impulse responses is feasible, opening up such diverse applications. We hope our results provide a helpful benchmark for the community and future work, and inspire creative applications.

References

  • [1] Rolf Anderegg, Norbert Felber, Wolfgang Fichtner, and Ulrich Franke. Implementation of high-order convolution algorithms with low latency on silicon chips. In Audio Engineering Society Convention 117. Audio Engineering Society, 2004.
  • [2] Roland Badeau. Common mathematical framework for stochastic reverberation models. The Journal of the Acoustical Society of America, 145(4):2733–2745, 2019.
  • [3] Nicholas J Bryan. Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2020.
  • [4] Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, and Chenliang Xu. Deep cross-modal audio-visual generation. CoRR, abs/1704.08292, 2017.
  • [5] Terrance DeVries, Adriana Romero, Luis Pineda, Graham W. Taylor, and Michal Drozdzal. On the evaluation of conditional {gan}s, 2020.
  • [6] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. In International Conference on Learning Representations, 2019.
  • [7] Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. GANSynth: Adversarial neural audio synthesis. In International Conference on Learning Representations, 2019.
  • [8] Ruohan Gao and Kristen Grauman. 2.5d visual sound. In CVPR, 2019.
  • [9] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into self-supervised monocular depth prediction. In The International Conference on Computer Vision (ICCV), October 2019.
  • [10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [11] Samuel W Greenhouse and Seymour Geisser. On methods in the analysis of profile data. Psychometrika, 24(2):95–112, 1959.
  • [12] Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, and Jieping Ye. A review on generative adversarial networks: Algorithms, theory, and applications, 2020.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [14] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
  • [15] H. Kim, L. Remaggi, P. J. B. Jackson, and A. Hilton. Immersive spatial audio reproduction for vr/ar using room acoustic modelling from 360° images. In 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pages 120–126, 2019.
  • [16] Homare Kon and Hideki Koike. Deep neural networks for cross-modal estimations of acoustic reverberation characteristics from two-dimensional images. In Audio Engineering Society, 05 2018.
  • [17] Homare Kon and Hideki Koike. Estimation of late reverberation characteristics from a single two-dimensional environmental image using convolutional neural networks. Journal of the Audio Engineering Society, 67:540–548, 08 2019.
  • [18] Homare Kon and Hideki Koike. An auditory scaling method for reverb synthesis from a single two-dimensional image. Acoustical Science and Technology, 41(4):675–685, 2020.
  • [19] Keun Sup Lee, Nicholas J Bryan, and Jonathan S Abel. Approximating measured reverberation using a hybrid fixed/switched convolution structure. In Proceedings of the 13th International Conference on Digital Audio Effects (DAFx’10), 2010.
  • [20] Dingzeyu Li, Timothy R. Langlois, and Changxi Zheng. Scene-aware audio for 360° videos. ACM Trans. Graph., 37(4), July 2018.
  • [21] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • [22] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2794–2802, 2017.
  • [23] Z. Meng, F. Zhao, and M. He. The just noticeable difference of noise length and reverberation perception. In 2006 International Symposium on Communications and Information Technologies, pages 418–421, 2006.
  • [24] Fabian Mentzer, George D Toderici, Michael Tschannen, and Eirikur Agustsson. High-fidelity generative image compression. Advances in Neural Information Processing Systems, 33, 2020.
  • [25] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [26] Damian T Murphy and Simon Shelley. Openair: An interactive auralization web resource and database. In Audio Engineering Society Convention 129. Audio Engineering Society, 2010.
  • [27] Egon S Pearson, Ralph B D'Agostino, and Kimiko O Bowman. Tests for departure from normality: Comparison of powers. Biometrika, 64(2):231–246, 1977.
  • [28] Renato S Pellegrini. Quality assessment of auditory virtual environments. In Proceedings of the 2001 International Conference on Auditory Display, 2001.
  • [29] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, and Scott Gray. Dall·e: Creating images from text. OpenAI Blog, 2021.
  • [30] Anton Ratnarajah, Zhenyu Tang, and Dinesh Manocha. Ir-gan: Room impulse response generator for speech augmentation, 2021.
  • [31] Andrew Reilly and David McGrath. Convolution processing for realistic reverberation. In Audio Engineering Society Convention 98. Audio Engineering Society, 1995.
  • [32] Luca Remaggi, Hansung Kim, Philip JB Jackson, and Adrian Hilton. Reproducing real world acoustics in virtual reality using spherical cameras. In Proceedings of the 2019 AES International Conference on Immersive and Interactive Audio. Audio Engineering Society, 2019.
  • [33] Michael Rettinger. Reverberation chambers for broadcasting and recording studios. Journal of the Audio Engineering Society, 5(1):18–22, 1957.
  • [34] H Robjohns. Sony dre s777 sampling digital reverb. Sound on Sound, 15, 1999.
  • [35] Kaushal Sali and Alexander Lerch. Generating impulse responses using recurrent neural networks, 2020.
  • [36] Carl Schissler and Dinesh Manocha. Interactive sound propagation and rendering for large multi-source scenes. ACM Trans. Graph., 36(4), Sept. 2016.
  • [37] M. R. Schroeder and B. F. Logan. “Colorless” Artificial Reverberation. IRE Transactions on Audio, 9(6):209–214, 1961.
  • [38] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • [39] Christian Steinmetz. Neuralreverberator, 2018.
  • [40] James Traer and Josh H McDermott. Statistics of natural reverberation enable perceptual separation of sound and space. Proceedings of the National Academy of Sciences, 113(48):E7856–E7865, 2016.
  • [41] Vesa Valimaki, Julian D Parker, Lauri Savioja, Julius O Smith, and Jonathan S Abel. Fifty years of artificial reverberation. IEEE Transactions on Audio, Speech, and Language Processing, 20(5):1421–1448, 2012.
  • [42] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In Arxiv, 2016.
  • [43] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [44] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. Advances in Neural Information Processing Systems, 1(January):487–495, 2014.

Appendix A: Supplementary Material

As supplementary material, we present and review a number of input/output examples across several categories with distinct properties (audiovisual examples page: https://dev5839.d12qej5d2ijqlh.amplifyapp.com/). A summary of these results is shown in Table 6. We additionally present a more detailed diagram of our architecture in Figure 22.

Finally, to gain a qualitative view of intra-scene and adjacent-scene consistency, we plot our test-set input images according to their corresponding output audio characteristics in the visualization shown in Figure 23. We produce multiband T60 estimates from all output IRs and then use t-SNE [21] to reduce the data to two dimensions. We then solve a linear assignment problem to transform the embedding into a grid representation. Several within-scene clusters are visible, as well as closeness between related scenes. This suggests that while our method does make errors (outliers are also visible), it learns to treat similar scenes similarly while capturing variation.
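A sketch of this visualization pipeline, with random placeholder features standing in for the multiband T60 estimates, might look like the following.

```python
# t-SNE embedding of per-example features, snapped to a square grid via
# a linear assignment between grid cells and embedded points.
import numpy as np
from sklearn.manifold import TSNE
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
features = rng.random((256, 8))                                  # placeholder multiband T60 estimates
xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
xy = (xy - xy.min(0)) / (np.ptp(xy, 0) + 1e-9)                   # normalize embedding to [0, 1]

side = int(np.ceil(np.sqrt(len(xy))))                            # grid side length (16 for 256 points)
grid = np.stack(np.meshgrid(np.linspace(0, 1, side),
                            np.linspace(0, 1, side)), -1).reshape(-1, 2)

cost = cdist(grid[: len(xy)], xy)                                # grid-cell-to-point distances
rows, cols = linear_sum_assignment(cost)                         # grid cell rows[i] shows example cols[i]
```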

Topic Figure # Images
Famous and iconic places 12 6
Musical environments 13 6
Artistic renderings 14 6
DALL•E-generated spaces 15 6
Limitations (i.e. challenging examples) 16 4
Animated scenes 17 6
Virtual backgrounds 18 6
Historical places 19 5
Video games 20 4
Common and identifiable scenes 21 6
Total 55
Table 6: Additional Results.
Figure 12: Famous and iconic spaces. Columns show input images, depth maps, generated IRs, and a dry anechoic speech signal before and after the generated IR was applied via convolution. The input images come from spaces that may be impractical or impossible to record in. The indoor spaces here show longer impulse responses than the outdoor scenes, as is typically observed and expected in real-world settings. Larger indoor spaces also tend to exhibit greater T60 times with longer impulse responses, as seen here, though the ISS image has a longer impulse response than we would expect.
Figure 13: Music. Columns show input images, depth maps, generated IRs, and an anechoic singing signal before and after the generated IR was applied via convolution. The input images come from spaces relevant to music, including a typical small room, an acoustically treated rehearsal space, an auditorium, a church, and two large concert halls. Generally, larger spaces tend to exhibit longer decay times in the output; however, some examples, such as the concert halls with visible acoustic treatment, appear to have a shorter decay than more reverberant spaces like the church or auditorium with more reflective surfaces. The final concert hall shows an atypical impulse response with a visible discontinuity in the IR tail. This is not commonly observed among our model outputs, but illustrates the kind of artifacts that can occasionally occur.
Figure 14: Art. Columns show input images, depth maps, generated IRs, and an anechoic operatic singing signal before and after the generated IR was applied to the signal via convolution. Images here are drawings, paintings and a vintage art photograph ca. 1850. Artistic depictions of spaces were not included in our training dataset. In many cases, plausible impulse responses are generated from such input images. In general, larger depicted spaces, like the church in the bottom row, exhibit longer decay times as is observed with standard 2D photographs.
Figure 15: DALL·E. Images generated from text by DALL·E [29] used here as input images. The same corresponding input text was synthesized via text-to-speech as our signal of interest and convolved with the generated IR. This reflects synthetic speech in a synthetic environment, indicating a path for synthesizing realistic IRs from text. It also shows how our model might work with other state-of-the-art generative media models to produce more consistent and realistic results in different domains.
Figure 16: Challenging images. Input images containing street murals, reflections, and shadows demonstrating cases where depth is inaccurately estimated. (A) A painted doorway giving the illusion of depth. (B) A wall with a mural of a street and tree where the depth of the wall is inaccurately estimated. (C) A low-angle photo of a reflective puddle. (D) An outdoor street image with strong shadows which results in a depth map and generated IR more similar to a room than an outdoor space. These more extreme scenarios are chosen to clearly illustrate the limitations of our approach.
Figure 17: Animated films. Scenes from Blender open animation films used as input images (speech convolved with generated IRs). Columns show the input image, the estimated depth map, the spectrogram of the generated IR, an anechoic passage-reading sample, and the same passage with the generated IR applied via convolution. In general, our model plausibly estimates the reverberant characteristics of these spaces: for example, the IR for the small wooden space is very brief. The barbershop IR appears longer due to some artifacts, but its broadband decay is relatively quick, as can be heard in the audio. Seemingly larger spaces again correspond to longer IRs. This is a case of Real2Sim transfer, where we can directly approximate IRs that sound like measured IRs for virtual environments in which such measurement is not possible.
Figure 18: Virtual backgrounds. Images which may serve as virtual backgrounds used as input images to our model. These reflect spaces that may be used for videoconferencing or other online meetings. Realistic IRs may be generated and used in these contexts to increase the sense of being in a shared space with others.
Figure 19: Historical and notable places. Additional examples of unusual and historical spaces which may be difficult or impossible to obtain IRs from.
Figure 20: Video games. Impulse responses generated from screenshots of four 3D video games and applied via convolution. Video games are one example of a virtual space that might benefit from easily generated impulse responses. While the medium-sized room from Counter-Strike and the large hallway from Halo 2 yield plausible IRs, the large hall in the Skyrim screenshot and the cavern in the Minecraft example do not have the correspondingly long reverberant tails that would be expected, showing possible cases where the scale of the space was not accurately estimated. 3D-rendered images were not included in our dataset but are a ripe area for future work, which might greatly improve the performance of our model on both real scenes and virtual scenes such as these video game examples.
Figure 21: Common and identifiable scenes. Input images and the resulting IRs are shown and convolved with an anechoic speech signal. Input images here reflect spaces that are regularly encountered in everyday life yet may not often be recorded in. These types of scenes are useful for audio post-production as they may be commonly found in movies and television shows. Small and outdoor scenes are observed to have very brief IRs while in comparison, the larger building interior has a much longer output IR as expected.
Figure 22: Detailed overview of the Image2Reverb model architecture. Left: the ResNet50 encoder pretrained on Places365 (figure at left adapted from [13]). Right: the generator and discriminator. The output of the encoder consists of 365 features, to which we concatenate noise to produce a 512-d latent vector. The generator and discriminator contain upsampling and downsampling convolutions, respectively. A leaky rectified linear unit (LReLU) is used after each convolutional layer in both the discriminator and the generator, with a separate output activation on the final layer of the generator. PN denotes pixelwise normalization, which we use in the generator. The composition of blocks is based on ProGAN [14]. The final step in the discriminator is a fully connected layer with a linear activation (scalar output).
Figure 23: Manifold-based visualization of our test set. We compute multiband T60 estimates for the output audio IRs of each image, and then perform nonlinear dimensionality reduction with t-SNE to obtain two-dimensional feature vectors for each example. We produce a grid by solving a linear assignment problem, as is commonly done to visualize large image datasets. Our visualization shows local clusters of the same and similar scenes in many cases, but also some variation within scenes. In some outdoor settings, this variation grows considerably, resulting in increased scattering. In other cases, we observe closeness between different views of the same scene and between similar scenes.