Echo-Reconstruction: Audio-Augmented 3D Scene Reconstruction

10/05/2021 ∙ by Justin Wilson, et al. ∙ University of Maryland ∙ University of North Carolina at Chapel Hill

Reflective and textureless surfaces such as windows, mirrors, and walls can be a challenge for object and scene reconstruction. These surfaces are often poorly reconstructed and filled with depth discontinuities and holes, making it difficult to cohesively reconstruct scenes that contain them. We propose Echoreconstruction, an audio-visual method that uses the reflections of sound to aid in geometry and audio reconstruction for virtual conferencing, teleimmersion, and other AR/VR experiences. The mobile phone prototype emits pulsed audio while recording video for RGB-based 3D reconstruction and audio-visual classification. Reflected sound and images from the video are input into our audio (EchoCNN-A) and audio-visual (EchoCNN-AV) convolutional neural networks for surface and sound source detection, depth estimation, and material classification. The inferences from these classifications enhance scene 3D reconstructions containing open spaces and reflective surfaces by depth filtering, inpainting, and placement of unmixed sound sources in the scene. Our prototype, VR demo, and experimental results from real-world and virtual scenes with challenging surfaces and sound indicate high success rates on classification of material, depth estimation, and closed/open surfaces, leading to considerable visual and audio improvement in 3D scenes (see Figure 1).




1. Introduction

Scenes containing open and reflective surfaces, such as windows and mirrors, can enhance AR/VR immersion in terms of both graphics and sound; for example, a window open in spring compared to closed in winter. However, they also present a unique set of challenges. First, they are difficult to detect and reconstruct due to their transparency and high reflectivity. Distinguishing between glass (e.g. window) and an opening in the space is an important part of the audio-visual experience for AR/VR engagement. Also, illumination, background objects, and min/max depth ranges can be confounding factors.

Research on scene reconstruction for teleimmersion has led to advances in detection (Lea et al., 2016), segmentation (Golodetz et al., 2015; Arnab et al., 2015), and semantic understanding (Song et al., 2017), and has generated large-scale, labeled datasets of object (Wu et al., 2015) and scene (Dai et al., 2017) geometric models that further aid training and sensing in a 3D environment. Advances have also been made to account for challenging surfaces (Sinha et al., 2012; Whelan et al., 2018; Chabra et al., 2019). Yet, scenes containing open and reflective surfaces, such as windows and mirrors, remain an open research area. Our work augments existing visual methods by adding audio context for surface detection, depth, and material estimation when recreating a virtual environment from a real one.

Previous work has used sound to better understand objects in scenes. For instance, impact sounds from interacting with objects in a scene have been used to perform segmentation (Arnab et al., 2015) and to emulate the sensory interactions of human information processing (Zhang et al., 2017). Audio has also been used to compute material (Ren et al., 2013), object (Zhang et al., 2017), scene (Schissler et al., 2018), and acoustical (Tang et al., 2020) properties. Moreover, using both audio and visual sensory inputs has proven more effective; for example, multi-modal learning for object classification (Sterling et al., 2018; Wilson et al., 2019) and object tracking (Wilson and Lin, 2020).

Fusing multiple modalities, such as vision and sound, provides a wider range of possibilities than either modality alone. In this work, we show that augmenting vision-based techniques with audio, in networks we refer to as “EchoCNN,” can detect open or reflective surfaces, their depth, and their material, thereby enhancing 3D object and scene reconstruction for AR/VR systems. We highlight key results below:

  • EchoReconstruction, a staged audio-visual 3D reconstruction pipeline that uses mobile devices to enhance scene geometry containing windows, mirrors, and open surfaces with depth filtering and inpainting based on EchoCNN inferences (section 3);

  • EchoCNN, a fused audio-visual CNN architecture for classifying open/closed surfaces, their depth, and material or sound source placement (section 4);

  • Automated data collection process and audio-visual ground truth data for real and synthetic scenes containing windows and mirrors (section 5).

Using EchoReconstruction, we have been able to achieve consistently higher accuracy (up to 100%) in classification of open/closed surfaces, depth estimation, and materials in both real-world scenes and controlled experiments, resulting in considerably improved 3D scene reconstruction with glass doors, windows and mirrors.

Figure 2. Top row: closed window in winter. Middle row: opened in spring. Bottom row: controlled experiment virtual scene. Column 1: mobile echoreconstruction prototype in real-world and virtual scenes with a video of emitted pulsed audio and images from the scene. Column 2: initial RGB-based 3D reconstruction using state-of-the-art visual methods (live (Tanskanen et al., 2013) or photogrammetric (Metashape, 2020)). Column 3: our audio-visual EchoCNN convolutional neural network classifies open or closed surface, depth, and material for inpainting to resolve planar discontinuities caused by reflective surfaces, such as windows and mirrors. Column 4: semantic rendering of the window given material estimation and point of view. Green arrows highlight areas enhanced by our method, such as detecting closed/open parts of a window and filling a reflective mirror.

2. Related Work

Previous research in 3D reconstruction, audio-based classification, and echolocation is discussed in this section, in addition to existing techniques for reconstructing open and reflective surfaces.

2.1. 3D reconstruction

Object and scene reconstruction methods generate 3D scans using RGB and RGB-D data. For example, Structure from Motion (SFM) (Westoby et al., 2012), Multi-View Stereo (MVS) (Seitz et al., 2006), and Shape from Shading (Zhang et al., 1999) are all techniques to scan a scene and its objects. Static (Newcombe et al., 2011; Golodetz et al., 2015) and dynamic (Newcombe et al., 2015; Dai et al., 2017) scenes can also be scanned in real time using commodity sensors such as the Microsoft Kinect and GPU hardware. 3D scene reconstructions have also been performed with sound based on time-of-flight sensing (Crocco et al., 2016). Not only has this previous research generated large amounts of 3D scene (Silberman et al., 2012; Song et al., 2017) and object (Singh et al., 2014; Lai et al., 2011; Wu et al., 2015) data, it also benefits from these datasets by using them to train vision-based neural networks for classification, segmentation, and other downstream tasks. Depth estimation algorithms (Eigen and Fergus, 2014; Alhashim and Wonka, 2018; Chabra et al., 2019) also create 3D reconstructions by fusing depth maps using ICP and volumetric fusion (Izadi et al., 2011).

2.1.1. Glass and mirror reconstruction

Reflective surfaces produce identifiable audio and visual artifacts that can be used to help their detection. For example, researchers have developed algorithms to detect reflections in images taken through glass using correlations of 8-by-8 pixel blocks (Shih et al., 2015), image gradients (Kopf et al., 2013), two-layer renderings (Sinha et al., 2012), polarization imaging reflectometry (Riviere et al., 2017), and diffraction effects (Toisoul and Ghosh, 2017). Adding hardware, (Sutherland, 1968) tracked continuous-wave ultrasound with ultrasonic sensor logic, and (Zhang et al., 2017) detected obstacles such as glass and mirrors using frequencies outside the human audible range. More recently, reflective surfaces have been detected by utilizing a mirrored variation of an AprilTag (Olson, 2011; Wang and Olson, 2016): (Whelan et al., 2018) use the reflective surface to their advantage by recognizing the AprilTag attached to their Kinect scanning device when it appears in the scene. Depth jumps and incomplete reconstructions have also been used (Lysenkov et al., 2012). However, vision-based approaches require the right illumination, non-blurred imagery, and limited clutter behind the surface, which may limit the reflection. We show that sound creates a distinct audio signal, providing reconstruction methods complementary data about the presence of windows and mirrors without additional sensors.

Example 3D Reconstruction Methods
Type            Methods
Active (RGB-D)  KinectFusion, DynamicFusion, BundleFusion
Passive (RGB)   SLAM, SFM, (Tanskanen et al., 2013), ScanNet, (Whelan et al., 2018)
Stereo          MVS, StereoDRNet
Lidar           (Kada and Mckinley, 2009)
Ultrasonic      (Zhang et al., 2017)
Time of flight  (Crocco et al., 2016)
Table 1. 3D reconstruction methods by type such as passive (RGB), active (RGB-D), or other sensor (e.g. ultrasound, lidar, etc.); single or multiple views; and static or dynamic scenes.

2.2. Acoustic imaging and audio-based classifiers

We begin with an introduction into sound propagation, room acoustics, and audio-visual classifiers.

Figure 3. Staged approach to enhance 3D virtual scene and object reconstruction using audio-visual data. Our echoreconstruction prototype consists of two smartphones - one recording (top) and one emitting/reconstructing (bottom). As the bottom smartphone moves to reconstruct the scene and emits 100 ms pulsed audio, the top smartphone records video of the direct and reflecting sound. The received audio is split into 1.0-second intervals to allow for reverberation. These audio intervals are converted into mel-scaled spectrograms and passed through a multimodal echoreconstruction convolutional neural network (which we refer to as EchoCNN) comprised of 2D convolutional, max pooling, fully connected, and softmax layers. EchoCNN classifications inform depth filtering and hole filling steps to resolve planar discontinuities in scans caused by reflective surfaces, such as windows and mirrors. Binary classification is used to predict if a window is open or closed. Multi-class classification is used for depth and material estimation.

Acoustics: various models have been developed to simulate sound propagation in a 3D environment, such as wave-based (Mehra et al., 2015), ray-tracing-based (Rungta et al., 2016), sound source clustering (Tsingos et al., 2004), multipole equivalent source methods (James et al., 2006), and a single point multipole expansion method (Zheng and James, 2011), representing outgoing pressure fields. (Godoy et al., 2018) use acoustics and a smartphone app to detect car location and distance from walking pedestrians using temporal dynamics. (Bianco et al., 2019) further discuss theory and applications of machine learning in acoustics. Computational imaging approaches have also used acoustics for non-line-of-sight imaging (Lindell et al., 2019), 3D room geometry reconstruction from audio-visual sensors (Kim et al., 2017), and acoustic imaging on a mobile device (Mao et al., 2018). To reconstruct windows and mirrors, our work uses room acoustics given the surface materials of the room (Schissler et al., 2018) and distance from the sound source. However, prior work and downstream processes often require a watertight reconstruction, which can be difficult to generate in the presence of glass. Our approach addresses these issues using an integrated audio-visual CNN that can detect discontinuity, depth, and materials.

Audio-based classification and reconstruction: using principles from sound synthesis, propagation, and room acoustics, audio classifiers have been developed for environmental sound (Gemmeke et al., 2017; Piczak, 2015; Salamon et al., 2014), material (Arnab et al., 2015), and object shape (Zhang et al., 2017) classification. For audio-based reconstruction, Bat-G net uses ultrasonic echoes to train an auditory encoder and 3D decoder for 3D image reconstruction (Hwang et al., 2019). Audio input can take the form of raw audio, spectral shape descriptors (Michael et al., 2005; Cowling and Sitte, 2003; Smith III, 2020), or frequency spectral coefficients that we also adopt. In our method, we use reflecting sound to perform surface detection, depth estimation, and material classification.

Audio-visual learning: similar to its applications in natural language processing (NLP) and visual question answering systems (Kim et al., 2016, 2020; Hannan et al., 2020), multi-modal learning using audio-visual sensory inputs has also been used for classification tasks (Sterling et al., 2018; Wilson et al., 2019), audio-visual zooming (Nair et al., 2019), and sound source separation (Ephrat et al., 2018; Lee and Seung, 2000), which isolates individual waves for downstream generation tasks. Although similar in spirit, our audio-visual method, “Echoreconstruction,” differs from the existing methods by learning absorption and reflectance properties to detect a reflective surface, its depth, and material.

3. Technical Approach

In this work, we adopt “echolocation” as an analog for our echoreconstruction method. According to (Egan, 1988), an echo is defined as a distinct reflection of the original sound with a sufficient sound level to be clearly heard above the general reverberation. Although perceptible echo is abated because of precedence (known as the Haas effect) (Long, 2014), returning sound waves are received after reflecting off of a solid surface. We use these distinct, reflecting sounds to design a staged approach of audio and audio-visual convolutional neural networks. EchoCNN-A and EchoCNN-AV can be used to estimate depth based on reverberation times (Figure 9), recognize material based on frequency and amplitude, and handle both static and dynamic scenes with moving objects based on Doppler shift. All of these enhance scene and object reconstruction by detecting planar discontinuities from open or closed surfaces and then estimating depth and material.

3.1. Echolocation

Echolocation is the use of reflected sound to locate and identify objects, particularly used by animals like dolphins and bats. According to (Szabo, 2014), bats emit ultrasound pulses, ranging between 20-150 kHz, to catch an insect prey with a resolution of 2-15 mm. This involves signal processing such as:

  1. Doppler shift (the relative speed of the target),

  2. time delay (distance to the target), and

  3. frequency and amplitude in relation to distance (target object size and type recognition);

where the Doppler shift (or effect) is the perceived change in frequency (Doppler frequency f_D minus transmitted frequency f_0) as a sound source with velocity v_s moves toward or away from the listener/observer with velocity v_l at angle θ.
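The quantities listed above follow the standard acoustic Doppler relation. A minimal sketch (the closed form below is standard acoustics, not restated in the paper; function name and defaults are ours):

```python
import math

def doppler_frequency(f0, v_source, v_listener=0.0, angle_deg=0.0, c=343.0):
    """Perceived frequency f_D for a source moving toward the listener.

    Standard relation f_D = f0 * (c + v_l*cos(theta)) / (c - v_s*cos(theta)),
    with velocities in m/s and c the speed of sound in air. Positive
    velocities mean source and listener approach each other.
    """
    cos_t = math.cos(math.radians(angle_deg))
    return f0 * (c + v_listener * cos_t) / (c - v_source * cos_t)

# A 1 kHz source approaching at 10 m/s is perceived slightly above 1 kHz.
shifted = doppler_frequency(1000.0, v_source=10.0)
```

The difference `shifted - 1000.0` is the Doppler shift used to infer the relative speed of a moving object in the scene.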

3.2. Staged classification and reconstruction pipeline

As depicted in Figure 3, we take a staged approach to enhance scene and object reconstruction using audio-visual data. Our echoreconstruction prototype consists of two smartphones - one recording (top) and one emitting/reconstructing (bottom). Each audio emission is 100 ms of sound followed by 900 ms of silence to allow the receiving microphone to capture reflections and reverberations (subsection 3.3). After the 3D scan is complete, an .obj file containing geometry and texture information is generated. One-second frames are extracted from the recorded video to generate audio and visual input into the EchoCNN neural networks (section 4). These networks are independently trained to detect whether a surface is open or closed, estimate depth to the surface from the sound source, and classify the material of the surface. As future work, we will explore augmenting depth estimation with mobile accelerometer data and a multi-scale neural network, such as (Eigen and Fergus, 2014), using audio as a coarse global output refined by finer-scale visual data.
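The 1-second framing of the recorded audio can be sketched as follows (function name and sample rate are our assumptions):

```python
import numpy as np

def split_into_frames(audio, sample_rate=44100, frame_seconds=1.0):
    """Split a mono recording into non-overlapping 1-second frames.

    Each frame should contain one 100 ms pulse plus its reflections and
    reverberation tail; trailing samples short of a full frame are dropped.
    """
    frame_len = int(sample_rate * frame_seconds)
    n_frames = len(audio) // frame_len
    return audio[: n_frames * frame_len].reshape(n_frames, frame_len)

frames = split_into_frames(np.zeros(44100 * 3 + 100))  # yields 3 full frames
```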

Figure 4. Mel-scaled spectrograms of recorded impulses of different sound sources used. From left to right: narrow to disperse spectra. Not shown are other pure tone frequencies, chirp, pink noise, and brownian noise. Horizontal axis is time and vertical axis is frequency.

3.3. Sound source

A smartphone emits recordings of a human experimenter's voice, whistle, and hand clap, pure tones (ranging from 63 Hz to 16 kHz), chirps, and noise (white, pink, and brownian). All of these can be generated as either pulsed (PW) or continuous waves (CW). PW is preferred for theoretical and empirical reasons. First, the transmission frequency may experience considerable downshift as a result of absorption and diffraction effects (Szabo, 2014); pulsed waves, independent for each emission, are therefore theoretically preferable to continuous waves. Furthermore, section 6 shows superior PW results over CW for the given classification tasks.

Pure tones were generated with a default amplitude of 0.8 (out of 1) using the Audacity computer program and center frequencies of 63 Hz, 125 Hz, 250 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz, 8 kHz, and 16 kHz. Human voice ranges from about 63 Hz to 1 kHz (Long, 2014) (125 Hz to 8 kHz per (Egan, 1988)), and an untrained whistler from 500 Hz to 5 kHz (Nilsson et al., 2008). Chirps were linearly interpolated from 440 Hz to 1320 Hz in 100 ms. A hand clap is an impulsive sound that yields a flat spectrum (Long, 2014). All sound sources were recorded and played back at max volume (Figure 4). While recorded sounds were used for consistency, we plan to add live audio for augmentation and future ease of use during reconstruction. Please see our supplementary materials for spectrograms across all sound sources.
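The emitted signals described above can be sketched as follows; the sample rate and function names are our assumptions, while the 0.8 amplitude, 100 ms pulse, and 440-1320 Hz chirp come from the text:

```python
import numpy as np

def pulsed_tone(freq_hz, sample_rate=44100, pulse_s=0.1, total_s=1.0, amp=0.8):
    """One 100 ms tone burst followed by silence, as in the PW setup."""
    t = np.arange(int(sample_rate * pulse_s)) / sample_rate
    out = np.zeros(int(sample_rate * total_s))
    out[: t.size] = amp * np.sin(2 * np.pi * freq_hz * t)
    return out

def pulsed_chirp(f0=440.0, f1=1320.0, sample_rate=44100, pulse_s=0.1,
                 total_s=1.0, amp=0.8):
    """Linear chirp over the 100 ms pulse; phase is the integral of f(t)."""
    t = np.arange(int(sample_rate * pulse_s)) / sample_rate
    phase = 2 * np.pi * (f0 * t + 0.5 * (f1 - f0) / pulse_s * t ** 2)
    out = np.zeros(int(sample_rate * total_s))
    out[: t.size] = amp * np.sin(phase)
    return out

tone = pulsed_tone(1000.0)  # 1 kHz burst, then 900 ms of silence
```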

Audio input: audio was generated as pulsed waves (PW). One smartphone emits the sound while performing an RGB-based reconstruction; a second smartphone captures video. As future work, a single mobile device or Microsoft Kinect paired with audio chirps could be used for audio-visual capture and reconstruction instead of two separate devices. Each pulsed wave emitted into the scene lasted a total of 1 second, consisting of a 100 ms impulse followed by silence. The 1-second audio frame is based on the Sabine formula for the reverberation time of a compact room of like dimensions, calculated as:

T_R = 0.049 V / a

where T_R is the reverberation time (the time required for sound to decay 60 dB after the source has stopped), V is the room volume (ft³), and a is the total room absorption at a given frequency (e.g. 250 Hz). For the bathroom scene, a = 69.23 sabins, the sum of the sound absorption of the materials in Table 2.

Total room absorption a at 250 Hz
Real bathroom scene   S (ft²) × α    a (sabins)
Painted walls         432 × 0.10  =  43.20
Tile floor            175 × 0.01  =   1.75
Glass                  60 × 0.25  =  15.00
Ceramic                39 × 0.02  =   0.78
Mirror                 34 × 0.25  =   8.50
Total                          a  =  69.23 sabins
Table 2. According to the Sabine formula (subsection 3.3), reverberation time can be calculated as room volume V divided by total room absorption a. For an indoor sound source in a reverberant field, a is the total room absorption at a given frequency (sabins), S is the surface area (ft²), and α is the sound absorption coefficient at a given frequency (decimal percent). At 250 Hz, the total room absorption a for our real-world bathroom scene is 69.23 sabins.
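The Table 2 arithmetic can be reproduced directly; the 0.049 constant is the standard imperial-unit Sabine coefficient (an assumption here, since the paper cites the formula without restating its constant), and the room volume is left as a parameter:

```python
# Surface areas (ft^2) and 250 Hz absorption coefficients from Table 2.
surfaces = {
    "painted walls": (432, 0.10),
    "tile floor":    (175, 0.01),
    "glass":         (60,  0.25),
    "ceramic":       (39,  0.02),
    "mirror":        (34,  0.25),
}

# Total room absorption a = sum of S * alpha, in sabins.
a = sum(S * alpha for S, alpha in surfaces.values())

def sabine_rt60(volume_ft3, absorption_sabins):
    """Imperial-unit Sabine formula: T_R = 0.049 V / a (seconds)."""
    return 0.049 * volume_ft3 / absorption_sabins
```

For the bathroom scene this reproduces a = 69.23 sabins; plugging in a compact-room volume gives a reverberation time on the order of the 1-second frame length used above.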

Visual input: images were captured from the same smartphone video as the audio recordings. Each corresponding image was cropped and converted to grayscale for illumination invariance and data augmentation. Image dimensions were 64 by 25 pixels. Visual data served as input for the visual-only baseline and the audio-visual model variation EchoCNN-AV.
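A minimal sketch of this preprocessing, assuming BT.601 luma weights and nearest-neighbour resizing (both implementation choices not specified in the paper, as is the 25-row by 64-column orientation):

```python
import numpy as np

def preprocess_frame(rgb, out_h=25, out_w=64):
    """Grayscale an H x W x 3 image and resize to the EchoCNN input size.

    Uses ITU-R BT.601 luma weights and nearest-neighbour index sampling;
    cropping is assumed to have been applied beforehand.
    """
    gray = rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114
    rows = np.arange(out_h) * gray.shape[0] // out_h
    cols = np.arange(out_w) * gray.shape[1] // out_w
    return gray[rows][:, cols]

small = preprocess_frame(np.zeros((480, 640, 3)))
```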

3.4. Initial 3D Reconstruction

We evaluated the following smartphone-based reconstruction applications to obtain an initial 3D geometry which our method then enhances. The Astrivis application, based on (Tanskanen et al., 2013), generates better live 3D geometries for closed objects than for scene reconstructions, since it limits feature points per scan. On the other hand, Agisoft Metashape produces scene reconstructions offline from smartphone video. Enabling the software's depth point and guided camera matching features further improved the reconstructed geometries.

4. Model Architecture

To augment vision-based approaches, we use a multimodal CNN with mel-scaled spectrogram and image inputs. First, we perform surface detection to determine if a space with depth jumps and holes is in error or in fact open (i.e. open/closed classification). In the event of error, we estimate distance from recorder to surface using audio-visual data for depth filtering and inpainting. Finally, we determine the material. All of these classifications are performed using our audio and audio-visual convolutional neural networks, referred to as EchoCNN-A and EchoCNN-AV (Fig. 3).

Figure 5. Sample visualizations of the filters for the two convolutional layers in the audio-based EchoCNN-A neural network. The model learns filters for octave bands, frequencies, reflections, reverberations, and damping.

Audio sub-network: our frame-based EchoCNN-A consists of a single convolutional layer followed by two dense layers with feature normalization. Sampled at 44.1 kHz to cover the full audible range, audio frames are 1-second mel-scaled spectrograms of STFT coefficients (section 4). Each audio example is classified independently in 1-second intervals to reflect an estimated reverberation time based on a compact room size (subsection 3.3). With a 2048-sample Hann window (N), 25% overlap, and a hop length (H) of 512 samples, this results in a frequency resolution of 21.5 Hz and a temporal resolution of 12 ms, or 12% of each 100 ms pulsed audio. Each spectrogram is individually normalized and downsampled to a size of 62 frequency bins by 25 time bins.
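These resolutions follow directly from the window and hop parameters; a quick check (assuming the standard 44.1 kHz audio sampling rate, which is consistent with the stated 21.5 Hz bin spacing):

```python
sr, N, H = 44100, 2048, 512  # sample rate, Hann window length, hop length

freq_resolution_hz = sr / N           # spacing between STFT frequency bins
hop_resolution_ms = 1000.0 * H / sr   # time between successive frames

# ~21.5 Hz bins and ~11.6 ms hops, matching the ~12 ms figure in the text.
```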

We define the frequency spectral coefficients (Müller, 2015) as:

X(m, k) := Σ_{n=0}^{N-1} x(n + mH) w(n) exp(-2πikn/N)

for time frame m ∈ Z and Fourier coefficient k ∈ [0 : N/2], with real-valued DT signal x : Z → R, sampled window function w of length N ∈ N, and hop size H ∈ N (Müller, 2015). R denotes continuous time and Z denotes discrete time. Equal to |X(m, k)|², spectrograms have been demonstrated to perform well as inputs into convolutional neural networks (CNNs) (Huzaifah, 2017). Their horizontal axis is time and vertical axis is frequency.
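As an illustration, a minimal power-spectrogram implementation of this STFT definition (NumPy; the Hann window choice follows the text, while the function name and frame layout are ours):

```python
import numpy as np

def stft_spectrogram(x, N=2048, H=512):
    """Power spectrogram |X(m, k)|^2 via a direct STFT.

    Frames of length N are taken every H samples, multiplied by a Hann
    window, and transformed; only non-negative frequencies are kept.
    """
    w = np.hanning(N)
    n_frames = 1 + (len(x) - N) // H
    X = np.empty((n_frames, N // 2 + 1), dtype=complex)
    for m in range(n_frames):
        X[m] = np.fft.rfft(x[m * H : m * H + N] * w)
    return np.abs(X) ** 2

sr = 44100
t = np.arange(sr) / sr
S = stft_spectrogram(np.sin(2 * np.pi * 1000 * t))  # 1 s of a 1 kHz tone
```

The energy of the 1 kHz test tone concentrates in the bin nearest 1000 / 21.5 ≈ bin 46, as expected from the frequency resolution above.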


A hop length of H = 512 achieves a reasonable temporal resolution and data volume of generated spectral coefficients (Müller, 2015). Temporal resolution is important in order to detect when a reflecting sound reaches the receiver. We therefore decided to use a shorter window length rather than a longer one, which resulted in a shorter hop length, accepting the trade-off of a higher temporal dimension for increased data volume.

Visual sub-network: while audio information is generally useful for all three classification tasks (Table 3), visual information is particularly useful for material classification. We use ImageNet (Krizhevsky et al., 2012) as a visual-based baseline to compare with our audio and audio-visual methods. It also serves as an input into our audio-visual merge layer. Future work will explore whether another image classification method is better suited as a baseline and for fusion with audio.

Figure 6. Listener at different distances from sound source (from 0.5 to 3 m) in a virtual environment (left: bathroom, middle: kitchen, right: bedroom) used to generate synthetic audio-visual data. This dataset is comprised of multiple 12-second video clips in front of reflective surfaces at increments from 0.5 m to 3 m for 15 different sound sources. Absorption and transmission coefficients were set on materials (e.g. mirror, thick glass, ordinary glass) inside and outside of rooms in the virtual scenes. In addition to open/closed, depth, and material, we make synthetic, unmixed reflection separation data (direct, early, or late) available for future research.
Figure 7. Spectrograms from a recorded hand clap in front of an interior glass shower door and exterior glass window. For the interior door, reflected sounds experience intensified damping as we go from opened (left) to closed (middle) and then from 3 feet to 1 foot depth (right). Damping increases with fewer late reverberations and intensity increases with more early reflections. For the exterior window, closing it decreases outside noise up to a distance.

Merge layer: we evaluated concatenation and multi-modal factorized bilinear (MFB) pooling (Yu et al., 2017) to fuse the audio and visual fully connected layers. Concatenation of the two vectors serves as a straightforward baseline. MFB allows for additional learning in the form of a weighted projection matrix factorized into two low-rank matrices:

z_i = 1ᵀ (U_iᵀ x ∘ V_iᵀ y)

where k is the factor or latent dimensionality with index i of the factorized matrices U_i and V_i, ∘ is the Hadamard product or element-wise multiplication, and 1 is an all-one vector.
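A small sketch of MFB pooling under assumed dimensions (the input sizes, output size, and factor count below are illustrative, not the paper's):

```python
import numpy as np

def mfb_pool(x, y, U, V, k):
    """Multi-modal factorized bilinear pooling: z_i = 1^T (U_i^T x ∘ V_i^T y).

    U has shape (dx, o*k) and V has shape (dy, o*k): each of the o output
    units owns k latent factors, and sum-pooling over each block of k
    realizes the low-rank bilinear product.
    """
    joint = (U.T @ x) * (V.T @ y)            # Hadamard product, length o*k
    return joint.reshape(-1, k).sum(axis=1)  # sum-pool each factor block

rng = np.random.default_rng(0)
x, y = rng.standard_normal(128), rng.standard_normal(64)  # audio, visual
z = mfb_pool(x, y,
             rng.standard_normal((128, 32 * 5)),
             rng.standard_normal((64, 32 * 5)), k=5)       # 32 outputs
```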

4.1. Loss Function

For open/closed predictions, categorical cross-entropy loss (subsection 4.1) is used instead of binary cross-entropy when estimating the extent of the surface opening (e.g. all the way open, halfway, or closed). A regression model is not used for depth estimation because ground truth data is collected in discrete 0.5 m or 1 ft increments within the free field for better noise reduction (Egan, 1988). The softmax function is used for output activations.

L = -Σ_{c=1}^{M} y_{o,c} log(p_{o,c})

where M is the number of classes, y_{o,c} is a binary indicator that c is the correct classification for observation o, and p_{o,c} is the predicted probability that observation o is of class c.
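The categorical cross-entropy loss for a single observation can be computed directly (a minimal sketch; frameworks such as Keras provide batched, numerically stabilized versions):

```python
import math

def categorical_cross_entropy(y_true, p_pred):
    """-sum_c y_{o,c} * log(p_{o,c}) for one observation o.

    y_true is a one-hot (or soft) label vector; p_pred is a probability
    distribution over the M classes, e.g. a softmax output.
    """
    return -sum(y * math.log(p) for y, p in zip(y_true, p_pred) if y > 0)

# Correct class predicted with probability 0.7 -> loss = -ln(0.7) ≈ 0.357
loss = categorical_cross_entropy([1, 0, 0], [0.7, 0.2, 0.1])
```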

Input: EchoCNN classifications (open/closed, depth, and material) and initial 3D reconstruction (.obj and .jpg).
Output: Enhanced 3D echoreconstruction (updated .obj').
Variables: .obj = initial 3D reconstruction; .obj' = enhanced 3D echoreconstruction based on EchoCNN inputs.

.obj' = .obj
for each audio-visual input f do
      if (openclosed(f) == closed) and (depth(f) within the estimated depth range) and (f overlaps the previous input) then
            select the planar vertices in .obj' surrounding the gap at depth(f)
            for each vertex v in the convex hull of these planar vertices do
                  depth(v) = depth(f)
            end for
            create a new mesh from the convex hull and assign material(f)
      end if
end for

Assumptions: scan of initial 3D reconstruction (.obj) with partials of the object perimeter; discontinuity detection using color and vertices on planes based on input. Future work: define a mapping from classifications to 3D geometry using RGB-D data, tracking, or Iterative Closest Point (ICP) (Izadi et al., 2011). Description: compare discontinuities in the reconstructed geometry with EchoCNN inferences. If EchoCNN classifies a hole as closed, select planar vertices surrounding the gap at the EchoCNN-estimated depth, create a new mesh based on their convex hull, and assign the material estimated by EchoCNN.
ALGORITHM 1. Echoreconstruction via EchoCNN inference
Accuracy of Reflecting Sounds used for Open/Closed, Depth Estimation, and Material Classification in Real World Scenes
Open/Closed Depth Estimation Sound Material
Method Input Shower Window Overall 3 ft 2 ft 1 ft Overall Glass Mirror
kNN (Cover and Hart, 1967) A 56.5% 100% 21.3% 16% 21% 25% 44.0% 47.5% 52.4%
Linear SVM (Bottou, 2010) A 61.5% 91.7% 37.6% 38% 32% 41% 51.9% 46.0% 57.1%
SoundNet5 (Aytar et al., 2016) A 45.2% 46.6% 39.7% 40% 71% 8% 71.0% 98.4% 1.6%
SoundNet8 (Aytar et al., 2016) A 50.7% 46.6% 42.5% 92% 0% 33% 44.4% 16.4% 85.7%
EchoCNN-A (Ours) A 71.2% 100% 71.8% 86% 54% 76% 77.4% 62.3% 92.0%
ImageNet (Krizhevsky et al., 2012) V 78.1% 96.1% 45.2% 52% 83% 0% 80.6% 60.7% 100%
Acoustic Classification (Schissler et al., 2018) AV N/A N/A N/A N/A N/A N/A ———- 48% * ———-
EchoCNN-AV Cat (Ours) AV 100% 100% 89.5% 95% 100% 73% 100% 100% 100%
EchoCNN-AV MFB (Ours) AV 100% 100% 84.9% 54% 100% 100% 80.6% 60.7% 100%
Table 3. Multiple models (ours is EchoCNN) and baselines were evaluated for audio and audio-visual based scene reconstruction analysis. Overall, 71.2% of held out reflecting sounds and 100% of audio-visual frames were correctly classified as an open or closed interior surface (i.e. glass shower door). Open/closed classification is even higher for external facing windows due to outside noise. According to (Long, 2014), 10 dB of exterior to interior noise reduction can be attributed to closed compared to open windows. Example shower and window views in Figure 10. 71.8% of 1-second audio frames were correctly classified as 1 ft, 2 ft, or 3 ft away from surface based on audio alone; 89.5% when concatenating with its corresponding image. Finally, 77.4% and 100% of audio and audio-visual inputs correctly labeled the surface material. * According to (Schissler et al., 2018), 48% of the triangles in the scene are correctly classified, where its classification is more granular and covers more material classes.

4.2. Depth filtering and planar inpainting

The outputs of our EchoCNN inform enhancements for 3D reconstruction (algorithm 1). If depth jumps in the reconstruction are first classified as an open surface, then no change is required other than filtering loose geometry and small components. Otherwise, there is a planar discontinuity (e.g. window or mirror) that needs to be filled. With depth estimated by EchoCNN, we filter the initial 3D mesh to within a threshold of that depth. This gives us the plane size needed to fill. Finally, EchoCNN classifies its surface material.
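The depth-filtering step can be sketched as follows; the tolerance value and the vertex representation are illustrative assumptions, not the paper's implementation:

```python
def filter_vertices_by_depth(vertices, est_depth, tol=0.15):
    """Keep reconstructed vertices within a tolerance of the EchoCNN depth
    estimate (all values in metres).

    vertices: iterable of (x, y, z) tuples with z = depth from the
    scanning device. The surviving points bound the planar region that
    the inpainting step then fills at the estimated depth.
    """
    return [v for v in vertices if abs(v[2] - est_depth) <= tol]

verts = [(0, 0, 0.9), (1, 0, 1.0), (0, 1, 1.1), (2, 2, 2.5)]
plane = filter_vertices_by_depth(verts, est_depth=1.0)  # drops the outlier
```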

4.3. Implementation details

We implemented all EchoCNN and baseline models with TensorFlow (Abadi et al., 2015) and Keras (Chollet and others, 2015). Training was performed on a TITAN X GPU running Ubuntu 16.04.5 LTS. We used categorical cross-entropy loss with stochastic gradient descent optimized by ADAM (Kingma and Ba, 2014). Using a batch size of 32, the remaining hyperparameters were tuned manually on a separate validation set. We make our real-world and synthetic datasets available to aid future research in this area.

5. Datasets and Applications

Our audio-based EchoCNN-A and audio-visual EchoCNN-AV convolutional neural networks are trained across nine octave bands with center frequencies 63 Hz, 125 Hz, 250 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz, 8 kHz, and 16 kHz. Training is done using these pulsed pure-tone impulses along with experimenter hand claps. The held-out test data is comprised of sound sources excluded from training - white noise, experimenter whistle, and voice - to evaluate generalization.

Figure 8. EchoCNN-A. (Left) Confusion matrix for classifying open/closed for an interior glass shower door. Open predictions (86%) were more accurate than closed (56%). (Right) Confusion matrix for classifying depth from the same interior glass door. Notice that our EchoCNN is learning to differentiate distance based on reflecting sounds from pulsed ambient waves of a smartphone.

5.1. Real and synthetic datasets

Real: training data is comprised of 1 second pulsed spectrograms (Figure 7) from recorded pure tones, experimenter hand claps, brownian noise, and pink noise (N=857). Training and test examples were collected via video recordings and labeled for material, open/closed, and in 1 ft depth increments based on the distance from the surface. Nine octaves of pure tones, hand claps, and white noise cover a disperse range of frequencies and were used to train our models.

Accuracy of Reflecting Sounds used for Classification in Controlled Experiment (Gl = Glass)
Open/Closed Depth Est. (+/- 0.5 m)
Method Input Thick Gl Thin Gl Thick Gl Thin Gl Material Est
kNN (Cover and Hart, 1967) A 53.8% 64.1% 11.5% 21.4% 66.5%
Linear SVM (Bottou, 2010) A 54.7% 63.2% 11.5% 20.5% 61.1%
SoundNet5 (Aytar et al., 2016) A 60.0% 40.1% 18.8% 19.1% 67.4%
SoundNet8 (Aytar et al., 2016) A 60.0% 42.6% 25.0% 19.1% 34.0%
EchoCNN-A (Ours) A 61.1% 65.1% 44.4% 44.6% 68.1%
ImageNet (Krizhevsky et al., 2012) V 95.8% 80.8% 83.3% 66.7% 87.5%
Acoustic Classification (Schissler et al., 2018) AV N/A N/A N/A N/A – 48% * –
EchoCNN-AV Cat (Ours) AV 98.9% 100% 99.4% 92.2% 76.6%
EchoCNN-AV MFB (Ours) AV 100% 100% 100% 99.0% 100%
Table 4. Multiple models (ours is EchoCNN) and baselines were evaluated for audio and audio-visual based scene reconstruction analysis. * According to (Schissler et al., 2018), 48% of the triangles in the scene are correctly classified, where its classification is more granular and covers more material classes. Compared to other existing methods, ours is able to correctly classify open/closed surfaces and depth estimation at nearly 100%, while achieving much higher accuracy in material estimation as well.

The held-out test dataset consists of 1-second pulsed spectrograms from recorded experimenter voice, whistle, chirp, and white noise (N=431). Voice and whistle recordings were chosen for the held-out test set to ease a future transition to live, hands-free emitted sounds during reconstruction. Held-out data is excluded from training and only evaluated during testing. While the same held-out sets were used for visual and audio-visual evaluation, unheard is not the same as unseen: unheard audio can share the same visual appearance between training and test. Building new training and test datasets for visual and audio-visual methods remains future work.

Synthetic: Automated synthetic data collection was performed in Unreal Engine 4.25, where SteamAudio employs a ray-based geometric sound propagation approach, with support for dynamic geometry using the Intel Embree CPU-based ray tracer. We refer readers to similar prior work (Schissler and Manocha, 2011) for more details on this approach. Given scene materials (e.g., carpet, glass, painted, tile), a sound source (e.g., voice), environmental geometry, and a listener position, we generate impulse responses for a given scene at varying sizes. From each listener, specular and diffuse rays are randomly generated and traced into the scene. The energy-time curve for the simulated impulse response is the sum over these ray paths:

$E_f(t) = \sum_j I_{j,f}\, \delta(t - t_j)$

where $I_{j,f}$ is the sound intensity for path $j$ and frequency band $f$, $t_j$ is the propagation delay time for path $j$, and $\delta$ is the Dirac delta (impulse) function. As these sound rays collide within the scene, their paths change based on the absorption and scattering coefficients of the colliding objects. Common acoustic material properties can be referenced in (Egan, 1988). We assume an absorption coefficient of 1 (full absorption) for open windows.
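The sum above can be discretized by accumulating each path's intensity into the sample bin of its delay. A minimal sketch for a single frequency band, assuming numpy (the sample rate and duration are illustrative):

```python
import numpy as np

def energy_time_curve(intensities, delays, sr=44100, duration=0.5):
    """Accumulate per-path intensities I_j at delays t_j into a sampled curve E(t)."""
    etc = np.zeros(int(sr * duration))
    for intensity, t in zip(intensities, delays):
        idx = int(round(t * sr))  # discretized Dirac delta: all energy in one sample bin
        if idx < len(etc):
            etc[idx] += intensity
    return etc
```

Summing path contributions this way reproduces the energy-time curve for one band; repeating it per octave band yields the multi-band impulse response.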

Along with the sound intensities $I_{j,f}$, a weight matrix $W$ is computed corresponding to the materials within the scene. Each entry $w_{m,f}$ is the average number of reflections from material $m$ over all paths that arrived at the listener, weighted by path intensity:

$w_{m,f} = \dfrac{\sum_j I_{j,f}\, c_{j,m}}{\sum_j I_{j,f}}$

where $c_{j,m}$ is the number of times rays on path $j$ collide with material $m$, weighted according to the sound intensity of path $j$. To mirror our real-world data, sound source directivity was disabled; future work is needed to compare ambient and directed sound sources. This data may also be used for material-based sound separation.
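Under that definition, the weight for each material is an intensity-weighted mean of per-path collision counts. A minimal sketch for a single frequency band, assuming numpy (variable names are ours):

```python
import numpy as np

def material_weights(intensities, collisions):
    """Intensity-weighted average number of collisions per material.

    intensities: shape (J,)   sound intensity of each of J paths
    collisions:  shape (J, M) times path j collided with material m
    returns:     shape (M,)   weight w_m for each material
    """
    I = np.asarray(intensities, dtype=float)
    C = np.asarray(collisions, dtype=float)
    return (I[:, None] * C).sum(axis=0) / I.sum()
```

Paths carrying more energy to the listener thus contribute proportionally more to each material's weight.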

Given a 720p 30 fps video walkthrough of the virtual environment (VE), with the camera moving along a keyframed spline, we reconstruct the virtual scene by extracting the individual frames of the video and using Agisoft Metashape (v1.7)'s reconstruction pipeline to solve for each image's camera transform. Metashape, previously known as PhotoScan, is considered state of the art in commercial photogrammetry software. The general process (Linder, 2009) is: (1) create a sparse point cloud containing only keypoints and solve for the transforms of the cameras that see them; (2) create a dense cloud by matching more features between the keypoint-seeing cameras and the remaining cameras; (3) project the dense-cloud depth data to each camera to build per-camera depth maps; (4) use the depth maps to build a mesh; and (5) create a texture map by projecting onto the mesh the image frames that best see each polygon (Catmull, 1974).

We disable the camera's motion blur in order to have more usable frames; real camera data generally requires blurry frames to be removed to avoid noisy reconstructions, especially at low framerates. For accurate visual feedback on specular surfaces, we also enable UE4's DirectX 12 ray tracing for reflective and translucent surfaces. We used a PC with the following specs for reconstruction: GTX 1080 GPU, i9-9900K CPU, 64 GB RAM, Windows 10 x64. With this setup, processing a 720p image sequence of 2,000 frames takes approximately 2 hours start to finish. We use three sections of the "HQ Residential House" environment from the Unreal Marketplace for synthetic data. The kitchen and bathroom sequences yield about 2,000 images each when extracted from the video at a step size of 3, and the master bedroom yields about 4,000.

Figure 9. From left to right: audio input (i.e. mel-scaled spectrogram) which would produce the highest activation for a given depth class from 1 ft, 2 ft, and 3 ft away from an object. Longer reverberation times tend to occur at lower frequencies (3 ft) than at high frequencies (1 and 2 ft) due to typical high frequency damping and absorption.
Figure 10. We evaluated our method using a controlled experiment of virtual scenes (row 1: bedroom, row 2: kitchen, and row 3: bathroom) and applied it to real-world scenes (row 4: physical bedroom). Column 1: input video recording of reflecting sounds emitted from the camera as the scene is scanned. Column 2: an initial reconstruction using the state-of-the-art commercial Metashape application. We tested glass, mirror, and other objects and surfaces within each scene at different depths, materials, and open/closed states. Column 3: audio-augmented echoreconstruction after depth inpainting and semantic rendering are applied during post-processing. Failure case: while the virtual bedroom window remained open (success, green), the TV was not enhanced (failure, red; see supplemental).
Figure 11. EchoCNN may also be used to reconstruct the audio of a virtual scene from a video of a physical room. Instead of depth estimation, our method can be trained to approximate sound source position, which is especially useful for objects that are outside of the camera field of view. Ground truth (green dots) and estimated (red dots) sound source placements are shown (top). Seen and heard sound source (TV) from the video capture is placed more accurately than unseen but heard sound sources (cradle and laptop). Please see our supplementary video for a VR demo.

5.2. Applications in VR Systems

When using a head-mounted display (HMD), users are alerted when approaching the boundaries of their physical space. However, if room setup does not accurately reflect these boundaries, or changes occur after setup, a user risks walking into unseen real-world objects such as glass and walls. Using our method, sound transmitted from the HMD could be used to locate physical objects and appropriately notify the user as an added safety measure. Audio directly from the real-world environment could also be used for depth estimation; these sounds are unmixed and placed in the virtual environment, reconstructing both the scene geometry and the sound sources (Figure 11). Finally, seasonal variations in the 3D sound and visual reconstruction of a window, open in the spring and closed in the winter, also enhance the AR/VR experience. See Figure 2 and the supplementary demo video.

6. Experiments and Results

Overall, 71.2% of held-out reflecting sounds and 100% of audio-visual frames were correctly classified as an open or closed boundary in the home (Table 3). 71.8% of 1-second audio frames were correctly classified as 1 ft, 2 ft, or 3 ft away from the surface based on audio alone; 89.5% when concatenated with the corresponding image. Finally, 77.4% of audio and 100% of audio-visual inputs were labeled with the correct surface material.

ImageNet, a visual-only baseline, achieves higher open/closed accuracy (78.1%) than audio-only EchoCNN-A. This is partly because the hold-out set was designed to test audio generalization (i.e., unheard sound sources), but unheard sound sources do not guarantee unseen visual data: images similar to those in training are present at test time. A hold-out set based on images (e.g., different depths) should be evaluated in future work.

6.1. Experimental setup

The listener (top smartphone, e.g., a Galaxy Note 4) and the sound source (bottom smartphone, e.g., an iPhone 6) are separated vertically by 7 cm. Pulsed sounds are emitted 3 ft, 2 ft, and 1 ft away from the reconstructing surface. Three feet was selected to remain in the free field; beyond that, there is less noise reduction due to reflecting sounds in the reverberant field (Egan, 1988). Staying within a few feet of the reconstructing surface also produces finer-detail reconstructions.
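These depth classes correspond to distinct round-trip echo delays: a pulse travels to the surface and back, so the first reflection arrives after 2d/c. A quick check of the expected delays, assuming a speed of sound of roughly 1125 ft/s in air at room temperature:

```python
SPEED_OF_SOUND_FT_S = 1125.0  # approximate speed of sound in air at ~20 C

def round_trip_delay_ms(depth_ft, c=SPEED_OF_SOUND_FT_S):
    # time for an emitted pulse to reach the surface and reflect back to the mic
    return 2.0 * depth_ft / c * 1000.0

for d in (1, 2, 3):
    print(f"{d} ft -> {round_trip_delay_ms(d):.2f} ms")  # prints 1.78, 3.56, and 5.33 ms
```

The roughly 1.8 ms separation between adjacent depth classes is the temporal cue the spectrogram-based classifier must implicitly resolve.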

We labeled our data based on scene, sound source, and surface properties: type of surface, material, and depth from the sound source. The training set included pulsed sounds of pure-tone frequencies, a single hand clap, brownian noise, and pink noise. The held-out test set consisted of voice, whistle, chirp, and white noise. To cover rooms with different sound-absorbing treatments, our real-world recordings include a bedroom (carpeted and painted) and a bathroom (tiled).

6.2. Activation Maximization

The objective of activation maximization is to generate an input that maximizes a layer's activations for a given class, providing insight into the patterns the neural network has learned. Figure 9 shows the inputs that would maximize EchoCNN activations for each depth class. Notice that longer reverberation times tend to occur at lower frequencies (3 ft) than at higher frequencies (1 and 2 ft) due to typical high-frequency damping and absorption.
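The mechanics of activation maximization can be illustrated on a toy linear layer: projected gradient ascent on a unit-norm input drives it toward the weight vector of the chosen class. The sketch below, assuming numpy, is a didactic stand-in, not EchoCNN itself:

```python
import numpy as np

def activation_maximization(W, k, steps=200, lr=0.1):
    """Gradient ascent on input x to maximize the linear activation a_k = W[k] @ x,
    with a unit-norm constraint (projected gradient ascent)."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal(W.shape[1])
    x /= np.linalg.norm(x)
    for _ in range(steps):
        x = x + lr * W[k]          # gradient of a_k with respect to x is W[k]
        x /= np.linalg.norm(x)     # project back onto the unit sphere
    return x
```

For a deep network the gradient comes from backpropagation rather than a closed form, but the loop is the same: ascend the class activation, then re-normalize (or otherwise regularize) the input.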

6.3. Analysis

Using audio, we noticed noise reduction between winter and spring due to more foliage on the trees. We also observed flutter echoes, which can be heard as a "rattle" or "clicking" after a hand clap and have been simulated in spatial audio (Halmrast, 2019); they became more pronounced the closer we moved to the wall surface in the bathroom scene. Background UV textures are placed at a fixed 1 ft (0.3 m) behind the estimated surface depth. Audio was unable to augment the shower failure cases from initial RGB-based reconstructions using either (Tanskanen et al., 2013) or (Metashape, 2020). We leave calculating the background depth as future work. We compare our 3D reconstructions to depth estimates from related work.
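A flutter echo between two parallel reflective walls can be sketched as a feedback comb filter whose delay equals the round-trip time between the walls. The sketch below assumes numpy; the 0.6 reflection gain and 1125 ft/s speed of sound are illustrative assumptions, not measured values:

```python
import numpy as np

def flutter_echo(x, sr, gap_ft, gain=0.6, c_ft_s=1125.0):
    """Feedback comb filter: each reflection re-arrives after one wall-to-wall round trip."""
    delay = int(round(2.0 * gap_ft / c_ft_s * sr))  # round-trip delay in samples
    y = np.asarray(x, dtype=float).copy()
    for n in range(delay, len(y)):
        y[n] += gain * y[n - delay]  # attenuated copy of the earlier reflection
    return y
```

Feeding a hand-clap-like impulse through this filter produces the evenly spaced, decaying clicks characteristic of flutter echo, and the clicks grow denser as the wall gap shrinks, matching the behavior we observed near the bathroom wall.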

6.4. Results by source frequency and object size

We evaluate a range of source frequencies to account for different sound wave behavior based on the size of the reconstructing objects. For example, if an object is much smaller than the wavelength, the sound flows around it rather than scattering (Long, 2014):

$\lambda = c / f$

where $\lambda$ is the wavelength (ft) of sound in air at a given frequency, $f$ is the frequency (Hz), and $c$ is the speed of sound in air (ft/s). Dynamically setting the source frequency based on object size is future work.
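This relation makes it easy to check which octave bands can resolve an object of a given size. A quick sketch, assuming a speed of sound of about 1125 ft/s in air:

```python
def wavelength_ft(freq_hz, c_ft_s=1125.0):
    # lambda = c / f; objects much smaller than lambda let sound flow around them
    return c_ft_s / freq_hz

# a 63 Hz tone spans roughly 17.9 ft while a 16 kHz tone spans roughly 0.07 ft,
# so only the higher bands scatter off small objects
for f in (63, 1000, 16000):
    print(f"{f} Hz -> {wavelength_ft(f):.3f} ft")
```

This is why a single source frequency is insufficient: the nine-band sweep spans wavelengths from room scale down to below an inch.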

7. Conclusion and Future Work

To the best of our knowledge, our work introduces the first audio and audio-visual techniques for enhancing scene reconstructions that contain windows and mirrors. Our smartphone prototype and staged EchoReconstruction pipeline emit and receive pulsed audio from a variety of sound sources for surface detection, depth estimation, and material classification. These classifications enhance scene and object 3D reconstruction by resolving planar discontinuities caused by open spaces and reflective surfaces using depth filtering and planar filling. Our system outperforms baseline methods in experiments on real-world and virtual scenes containing windows, mirrors, and open surfaces. We intend to publicly release our real and synthetic audio-visual ground-truth data, in addition to reflection separation data (direct, early, or late reverberations), for future research.

This work offers many exciting possibilities in teleimmersion, teleconferencing, and other AR/VR applications, where the improved quality and accessibility of scanning a room with a mobile phone can significantly enhance user presence. It can be integrated into VR headsets with cameras and microphones, enabling remote walk-throughs of spaces such as museums, architectural landmarks, future homes, or cultural heritage sites. The scanning process could give real-time feedback so that a user wearing the HMD could see the quality of the reconstruction and move toward problem areas, such as mirrors and open windows.

Future Work: To further extend this research, one alternative to explore is performing audio emission, reception, and 3D reconstruction simultaneously in real time instead of in a staged approach. This could enable mapping classifications to 3D geometry more densely than fusing RGB-D, tracking, or Iterative Closest Point (ICP) (Izadi et al., 2011). An integrated approach, such as a multi-scale neural network (Eigen and Fergus, 2014) using audio for coarse and visual for finer predictions, may be not only more efficient but also more effective by using audio feedback as part of the reconstruction pipeline. Another avenue of exploration is to investigate the impact of live audio for training and/or testing our neural network variations. With a defined set of output classes for EchoCNN, alternative baselines such as Non-negative Matrix Factorization (NMF), source separation techniques, and the pYIN algorithm (Mauch and Dixon, 2014), which extracts the fundamental frequency $f_0$ (the frequency of the lowest partial of the sound), are suggested future directions. Finally, our current implementation holds out voice and whistle data, which differ from the audio used during training. However, unheard sounds do not equate to unseen images; some insights could therefore be gained by experimenting with different training datasets for testing audio-only, visual-only, and audio-visual methods.
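As a much simpler stand-in for pYIN, a naive autocorrelation-based f0 estimator illustrates the kind of output such a baseline would provide. The sketch below assumes numpy; pYIN itself adds probabilistic thresholding and HMM-based pitch tracking that this omits:

```python
import numpy as np

def autocorr_f0(y, sr, fmin=80.0, fmax=1000.0):
    """Naive autocorrelation pitch estimate (a stand-in, not pYIN)."""
    y = y - y.mean()
    ac = np.correlate(y, y, mode="full")[len(y) - 1:]  # autocorrelation, lags >= 0
    lo, hi = int(sr / fmax), int(sr / fmin)            # lag range for fmin..fmax
    lag = lo + int(np.argmax(ac[lo:hi]))               # strongest periodicity
    return sr / lag
```

Applied to voice or whistle recordings, such an estimator would give a per-frame f0 track against which EchoCNN's class-based predictions could be compared.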


  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from External Links: Link Cited by: §4.3.
  • I. Alhashim and P. Wonka (2018) High quality monocular depth estimation via transfer learning. arXiv e-prints abs/1812.11941. External Links: Link, 1812.11941 Cited by: §2.1.
  • A. Arnab, M. Sapienza, S. Golodetz, J. Valentin, O. Miksik, S. Izadi, and P. Torr (2015) Joint object-material category segmentation from audio-visual cues. In Proceedings of the British Machine Vision Conference (BMVC), pp. . External Links: Document Cited by: §1, §1, §2.2.
  • Y. Aytar, C. Vondrick, and A. Torralba (2016) Soundnet: learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, Cited by: Table 3, Table 4.
  • M. Bianco, P. Gerstoft, J. Traer, E. Ozanich, M. Roch, S. Gannot, and C. Deledalle (2019) Machine learning in acoustics: theory and applications. The Journal of the Acoustical Society of America 146, pp. 3590–3628. External Links: Document Cited by: §2.2.
  • L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), Y. Lechevallier and G. Saporta (Eds.), Paris, France, pp. 177–187. External Links: Link, Document Cited by: Table 3, Table 4.
  • E. Catmull (1974) A subdivision algorithm for computer display of curved surfaces. Technical report UTAH UNIV SALT LAKE CITY SCHOOL OF COMPUTING. Cited by: §5.1.
  • R. Chabra, J. Straub, C. Sweeney, R. A. Newcombe, and H. Fuchs (2019) StereoDRNet: dilated residual stereo net. CoRR abs/1904.02251. External Links: Link, 1904.02251 Cited by: §1, §2.1.
  • F. Chollet et al. (2015) Keras. Cited by: §4.3.
  • T. Cover and P. Hart (1967) Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, pp. 21–27. External Links: Document Cited by: Table 3, Table 4.
  • M. Cowling and R. Sitte (2003) Comparison of techniques for environmental sound recognition. Pattern Recognition Letters 24 (15), pp. 2895 – 2907. External Links: ISSN 0167-8655, Document, Link Cited by: §2.2.
  • M. Crocco, A. Trucco, and A. Del Bue (2016) Uncalibrated 3d room reconstruction from sound. CoRR abs/1606.06258. External Links: Link, 1606.06258 Cited by: §2.1, Table 1.
  • A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. Cited by: §1.
  • A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt (2017) BundleFusion: real-time globally consistent 3d reconstruction using on-the-fly surface re-integration. ACM Transactions on Graphics 36, pp. 1. External Links: Document Cited by: §2.1.
  • M. D. Egan (1988) Architectural acoustics. McGraw-Hill Custom Publishing. Cited by: §3.3, §3, §4.1, §5.1, §6.1.
  • D. Eigen and R. Fergus (2014) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. CoRR abs/1411.4734. External Links: Link, 1411.4734 Cited by: §2.1, §3.2, §7.
  • A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein (2018) Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. CoRR abs/1804.03619. External Links: Link, 1804.03619 Cited by: §2.2.
  • J. Gemmeke, D. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, pp. 776–780. External Links: Document Cited by: §2.2.
  • D. Godoy, B. Islam, S. Xia, M. T. Islam, R. Chandrasekaran, Y. Chen, S. Nirjon, P. Kinget, and X. Jiang (2018) PAWS: a wearable acoustic system for pedestrian safety. pp. 237–248. External Links: Document Cited by: §2.2.
  • S. Golodetz*, M. Sapienza*, J. P. C. Valentin, V. Vineet, M. Cheng, A. Arnab, V. A. Prisacariu, O. Kähler, C. Y. Ren, D. W. Murray, S. Izadi, and P. H. S. Torr (2015) SemanticPaint: A Framework for the Interactive Segmentation of 3D Scenes. Technical report Technical Report TVG-2015-1, Department of Engineering Science, University of Oxford. Note: Released as arXiv e-print 1510.03727 Cited by: §1, §2.1.
  • T. Halmrast (2019) A very simple way to simulate the timbre of flutter echoes in spatial audio. Cited by: §6.3.
  • D. Hannan, A. Jain, and M. Bansal (2020) ManyModalQA: modality disambiguation and qa over diverse inputs. External Links: 2001.08034 Cited by: §2.2.
  • M. Huzaifah (2017) Comparison of time-frequency representations for environmental sound classification using convolutional neural networks. CoRR abs/1706.07156. External Links: Link, 1706.07156 Cited by: §4.
  • G. Hwang, S. Kim, and H. Bae (2019) Bat-g net: bat-inspired high-resolution 3d image reconstruction using ultrasonic echoes. In NeurIPS, Cited by: §2.2.
  • S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon (2011) KinectFusion: real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, UIST ’11, New York, NY, USA, pp. 559–568. External Links: ISBN 9781450307161, Link, Document Cited by: §2.1, §7, 1.
  • D. L. James, J. Barbič, and D. K. Pai (2006) Precomputed acoustic transfer: output-sensitive, accurate sound generation for geometrically complex vibration sources. In ACM SIGGRAPH 2006 Papers, SIGGRAPH ’06, New York, NY, USA, pp. 987–995. External Links: ISBN 1-59593-364-6, Link, Document Cited by: §2.2.
  • M. Kada and L. Mckinley (2009) 3D building reconstruction from lidar based on a cell decomposition approach. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 38, pp. . Cited by: Table 1.
  • H. Kim, L. Remaggi, P. J. B. Jackson, F. M. Fazi, and A. Hilton (2017) 3D room geometry reconstruction using audio-visual sensors. pp. 621–629. External Links: Document Cited by: §2.2.
  • H. Kim, H. Tan, and M. Bansal (2020) Modality-balanced models for visual dialogue. External Links: 2001.06354 Cited by: §2.2.
  • J. Kim, S. Lee, D. Kwak, M. Heo, J. Kim, J. Ha, and B. Zhang (2016) Multimodal residual learning for visual qa. External Links: 1606.01455 Cited by: §2.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. Note: cite arxiv:1412.6980Comment: Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015 External Links: Link Cited by: §4.3.
  • J. Kopf, F. Langguth, D. Scharstein, R. Szeliski, and M. Goesele (2013) Image-based rendering in the gradient domain. ACM Transactions on Graphics (TOG) 32, pp. 199:1–199:9. External Links: Document Cited by: §2.1.1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. External Links: Link Cited by: Table 3, §4, Table 4.
  • K. Lai, L. Bo, X. Ren, and D. Fox (2011) A large-scale hierarchical multi-view rgb-d object dataset. pp. 1817–1824. External Links: Document Cited by: §2.1.
  • C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager (2016) Temporal convolutional networks for action segmentation and detection. CoRR abs/1611.05267. External Links: Link, 1611.05267 Cited by: §1.
  • D. D. Lee and H. S. Seung (2000) Algorithms for non-negative matrix factorization. In Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS’00, Cambridge, MA, USA, pp. 535–541. External Links: Link Cited by: §2.2.
  • D. B. Lindell, G. Wetzstein, and V. Koltun (2019) Acoustic non-line-of-sight imaging. pp. 6773–6782. External Links: Document Cited by: §2.2.
  • W. Linder (2009) Digital photogrammetry. Vol. 1, Springer. Cited by: §5.1.
  • M. Long (2014) Architectural acoustics. 2nd edition, Academic Press. Cited by: §3.3, §3, Table 3, §6.4.
  • I. Lysenkov, V. Eruhimov, and G. R. Bradski (2012) Recognition and pose estimation of rigid transparent objects with a kinect sensor. In Robotics: Science and Systems. Cited by: §2.1.1.
  • W. Mao, M. Wang, and L. Qiu (2018) AIM: acoustic imaging on a mobile. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys ’18, New York, NY, USA, pp. 468–481. External Links: ISBN 9781450357203, Link, Document Cited by: §2.2.
  • M. Mauch and S. Dixon (2014) PYIN: a fundamental frequency estimator using probabilistic threshold distributions. pp. 659–663. External Links: ISBN 978-1-4799-2893-4, Document Cited by: §7.
  • R. Mehra, A. Rungta, A. Golas, M. Lin, and D. Manocha (2015) WAVE: interactive wave-based sound propagation for virtual environments. Visualization and Computer Graphics, IEEE Transactions on 21, pp. 434–442. External Links: Document Cited by: §2.2.
  • Metashape (2020) AgiSoft metashape standard. Vol. (Version 1.6.2) (Software). External Links: Link Cited by: Figure 2, §6.3.
  • B. Michael, A. Silvia, S. Launer, and N. Dillier (2005) Sound classification in hearing aids inspired by auditory scene analysis. EURASIP Journal on Advances in Signal Processing 18. External Links: Document Cited by: §2.2.
  • M. Müller (2015) Fundamentals of music processing: audio, analysis, algorithms, applications. 1st edition, Springer Publishing Company, Incorporated. External Links: ISBN 3319219448 Cited by: §4, §4, §4.
  • A. A. Nair, A. Reiter, C. Zheng, and S. Nayar (2019) Audiovisual zooming: what you see is what you hear. In Proceedings of the 27th ACM International Conference on Multimedia, MM ’19, New York, NY, USA, pp. 1107–1118. External Links: ISBN 978-1-4503-6889-6, Link, Document Cited by: §2.2.
  • R. Newcombe, A. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, S. Hodges, D. Kim, and A. Fitzgibbon (2011) KinectFusion: real-time dense surface mapping and tracking. pp. 127–136. External Links: Document Cited by: §2.1.
  • R. Newcombe, D. Fox, and S. Seitz (2015) DynamicFusion: reconstruction and tracking of non-rigid scenes in real-time. pp. 343–352. External Links: Document Cited by: §2.1.
  • M. Nilsson, J. Bartunek, J. Nordberg, and I. Claesson (2008) Human whistle detection and frequency estimation. Image and Signal Processing, Congress on 5, pp. 737–741. External Links: ISBN 978-0-7695-3119-9, Document Cited by: §3.3.
  • E. Olson (2011) AprilTag: a robust and flexible visual fiducial system. pp. 3400 – 3407. External Links: Document Cited by: §2.1.1.
  • K. J. Piczak (2015) ESC: dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, MM ’15, New York, NY, USA, pp. 1015–1018. External Links: ISBN 978-1-4503-3459-4, Link, Document Cited by: §2.2.
  • Z. Ren, H. Yeh, and M. C. Lin (2013) Example-guided physically based modal sound synthesis. ACM Trans. Graph. 32 (1), pp. 1:1–1:16. External Links: ISSN 0730-0301, Link, Document Cited by: §1.
  • J. Riviere, I. Reshetouski, L. Filipi, and A. Ghosh (2017) Polarization imaging reflectometry in the wild. ACM Transactions on Graphics (TOG) 36, pp. 1 – 14. Cited by: §2.1.1.
  • A. Rungta, C. Schissler, R. Mehra, C. Malloy, M. Lin, and D. Manocha (2016) SynCoPation: interactive synthesis-coupled sound propagation. IEEE Transactions on Visualization and Computer Graphics 22 (4), pp. 1346–1355. External Links: ISSN 1077-2626, Link, Document Cited by: §2.2.
  • J. Salamon, C. Jacoby, and J. P. Bello (2014) A dataset and taxonomy for urban sound research. In Proceedings of the 22Nd ACM International Conference on Multimedia, MM ’14, New York, NY, USA, pp. 1041–1044. External Links: ISBN 978-1-4503-3063-3, Link, Document Cited by: §2.2.
  • C. Schissler, C. Loftin, and D. Manocha (2018) Acoustic classification and optimization for multi-modal rendering of real-world scenes. IEEE Transactions on Visualization and Computer Graphics 24, pp. 1246–1259. Cited by: §1, §2.2, Table 3, Table 4.
  • C. Schissler and D. Manocha (2011) GSound: interactive sound propagation for games. Proceedings of the AES International Conference, pp. . Cited by: §5.1.
  • S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski (2006) A comparison and evaluation of multi-view stereo reconstruction algorithms.. Vol. 1, pp. 519–528. External Links: Document Cited by: §2.1.
  • Y. Shih, D. Krishnan, F. Durand, and W. Freeman (2015) Reflection removal using ghosting cues. pp. 3193–3201. External Links: Document Cited by: §2.1.1.
  • N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In ECCV, pp. 746–760. External Links: Document Cited by: §2.1.
  • A. Singh, J. Sha, K. Narayan, T. Achim, and P. Abbeel (2014) BigBIRD: a large-scale 3d database of object instances. pp. 509–516. External Links: Document Cited by: §2.1.
  • S. Sinha, J. Kopf, M. Goesele, D. Scharstein, and R. Szeliski (2012) Image-based rendering for scenes with reflections. ACM Transactions on Graphics - TOG 31, pp. 1–10. External Links: Document Cited by: §1, §2.1.1.
  • J. O. Smith III (2020) Physical audio signal processing. Note: Cited by: §2.2.
  • S. Song, F. Yu, A. Zeng, A. Chang, M. Savva, and T. Funkhouser (2017) Semantic scene completion from a single depth image. pp. 190–198. External Links: Document Cited by: §1, §2.1.
  • A. Sterling, J. Wilson, S. Lowe, and M. C. Lin (2018) ISNN: impact sound neural network for audio-visual object classification. In Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Cham, pp. 578–595. External Links: ISBN 978-3-030-01267-0 Cited by: §1, §2.2.
  • I. E. Sutherland (1968) A head-mounted three dimensional display. In AFIPS ’68 (Fall, part I), Cited by: §2.1.1.
  • T. L. Szabo (2014) Chapter 12 - nonlinear acoustics and imaging. In Diagnostic Ultrasound Imaging: Inside Out (Second Edition), T. L. Szabo (Ed.), pp. 501 – 563. External Links: ISBN 978-0-12-396487-8, Document, Link Cited by: §3.1, §3.3.
  • Z. Tang, N. J. Bryan, D. Li, T. R. Langlois, and D. Manocha (2020) Scene-aware audio rendering via deep acoustic analysis. IEEE Transactions on Visualization and Computer Graphics 26 (5), pp. 1991–2001. Cited by: §1.
  • P. Tanskanen, K. Kolev, L. Meier, F. Camposeco, O. Saurer, and M. Pollefeys (2013) Live metric 3d reconstruction on mobile phones. pp. 65–72. External Links: Document Cited by: Figure 2, Table 1, §3.4, §6.3.
  • A. Toisoul and A. Ghosh (2017) Practical acquisition and rendering of diffraction effects in surface reflectance. ACM Transactions on Graphics 36, pp. 1. External Links: Document Cited by: §2.1.1.
  • N. Tsingos, E. Gallo, and G. Drettakis (2004) Perceptual audio rendering of complex virtual environments. ACM Trans. Graph. 23 (3), pp. 249–258. External Links: ISSN 0730-0301, Link, Document Cited by: §2.2.
  • J. Wang and E. Olson (2016) AprilTag 2: efficient and robust fiducial detection. pp. 4193–4198. External Links: Document Cited by: §2.1.1.
  • M.J. Westoby, J. Brasington, N.F. Glasser, M.J. Hambrey, and J.M. Reynolds (2012) ‘Structure-from-motion’ photogrammetry: a low-cost, effective tool for geoscience applications. Geomorphology 179 (), pp. 300 – 314. Note: External Links: ISSN 0169-555X, Document, Link Cited by: §2.1.
  • T. Whelan, M. Goesele, S. J. Lovegrove, J. Straub, S. Green, R. Szeliski, S. Butterfield, S. Verma, and R. Newcombe (2018) Reconstructing scenes with mirror and glass surfaces. ACM Trans. Graph. 37 (4), pp. 102:1–102:11. External Links: ISSN 0730-0301, Link, Document Cited by: §1, §2.1.1, Table 1.
  • J. Wilson and M. C. Lin (2020) AVOT: audio-visual object tracking of multiple objects for robotics. In ICRA 2020, Cited by: §1.
  • J. Wilson, A. Sterling, and M. Lin (2019) Analyzing liquid pouring sequences via audio-visual neural networks. pp. 7702–7709. External Links: Document Cited by: §1, §2.2.
  • Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3D shapenets: a deep representation for volumetric shapes. pp. 1912–1920. External Links: Document Cited by: §1, §2.1.
  • Z. Yu, J. Yu, J. Fan, and D. Tao (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. CoRR abs/1708.01471. External Links: Link, 1708.01471 Cited by: §4.
  • R. Zhang, P. Tsai, J. E. Cryer, and M. Shah (1999) Shape-from-shading: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (8), pp. 690–706. External Links: Document, ISSN 0162-8828 Cited by: §2.1.
  • Y. Zhang, M. Ye, D. Manocha, and R. Yang (2017) 3D reconstruction in the presence of glass and mirrors by acoustic and visual fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence PP, pp. 1–1. External Links: Document Cited by: §2.1.1, Table 1.
  • Z. Zhang, J. Wu, Q. Li, Z. Huang, J. Traer, J. H. McDermott, J. B. Tenenbaum, and W. T. Freeman (2017) Generative modeling of audible shapes for object perception. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1260–1269. Cited by: §1, §2.2.
  • C. Zheng and D. L. James (2011) Toward high-quality modal contact sound. ACM Trans. Graph. 30 (4), pp. 38:1–38:12. External Links: ISSN 0730-0301, Link, Document Cited by: §2.2.