Scenes containing open and reflective surfaces, such as windows and mirrors, can enhance AR/VR immersion in terms of both graphics and sound; for example, a window open in spring compared to closed in winter. However, they also present a unique set of challenges. First, they are difficult to detect and reconstruct due to their transparency and high reflectivity. Distinguishing between glass (e.g. a window) and an opening in the space is an important part of the audio-visual experience for AR/VR engagement. In addition, illumination, background objects, and min/max depth ranges can be confounding factors.
Reconstruction of scenes for teleimmersion has led to advances in detection (Lea et al., 2016), segmentation (Golodetz* et al., 2015; Arnab et al., 2015), and semantic understanding (Song et al., 2017), and is used to generate large-scale, labeled datasets of object (Wu et al., 2015) and scene (Dai et al., 2017) geometric models to further aid training and sensing in a 3D environment. Advances have also been made to account for challenging surfaces (Sinha et al., 2012; Whelan et al., 2018; Chabra et al., 2019). Yet, scenes containing open and reflective surfaces, such as windows and mirrors, remain an open research area. Our work augments existing visual methods by adding audio context for surface detection, depth, and material estimation when recreating a virtual environment from a real one.
Previous work has used sound to better understand objects in scenes. For instance, impact sounds from interacting with objects in a scene have been used to perform segmentation (Arnab et al., 2015) and to emulate the sensory interactions of human information processing (Zhang et al., 2017). Audio has also been used to compute material (Ren et al., 2013), object (Zhang et al., 2017), scene (Schissler et al., 2018), and acoustical (Tang et al., 2020) properties. Moreover, using both audio and visual sensory inputs has proven more effective; for example, multi-modal learning for object classification (Sterling et al., 2018; Wilson et al., 2019) and object tracking (Wilson and Lin, 2020).
Fusing multiple modalities, such as vision and sound, provides a wider range of possibilities than either single modality alone. In this work, we show that augmenting vision-based techniques with audio, in a network we refer to as "EchoCNN," can detect open or reflective surfaces, their depth, and their material, thereby enhancing 3D object and scene reconstruction for AR/VR systems. We highlight key results below:
- EchoReconstruction, a staged audio-visual 3D reconstruction pipeline that uses mobile devices to enhance scene geometry containing windows, mirrors, and open surfaces with depth filtering and inpainting based on EchoCNN inferences (section 3);
- an automated data collection process and audio-visual ground truth data for real and synthetic scenes containing windows and mirrors (section 5);
- using EchoReconstruction, consistently higher accuracy (up to 100%) in classification of open/closed surfaces, depth estimation, and materials in both real-world scenes and controlled experiments, resulting in considerably improved 3D scene reconstruction with glass doors, windows, and mirrors.
2. Related Work
Previous research in 3D reconstruction, audio-based classification, and echolocation is discussed in this section, in addition to existing techniques for reconstructing open and reflective surfaces.
2.1. 3D reconstruction
Object and scene reconstruction methods generate 3D scans using RGB and RGB-D data. For example, Structure from Motion (SFM) (Westoby et al., 2012), Multi-View Stereo (MVS) (Seitz et al., 2006), and Shape from Shading (Zhang et al., 1999) are all techniques to scan a scene and its objects. Static (Newcombe et al., 2011; Golodetz* et al., 2015) and dynamic (Newcombe et al., 2015; Dai et al., 2017) scenes can also be scanned in real-time using commodity sensors such as the Microsoft Kinect and GPU hardware. 3D scene reconstructions have also been performed with sound based on time of flight sensing (Crocco et al., 2016). Not only has this previous research generated large amounts of 3D scene (Silberman et al., 2012; Song et al., 2017) and object (Singh et al., 2014; Lai et al., 2011; Wu et al., 2015) data, it also benefits from these datasets by using them to train vision-based neural networks for classification, segmentation, and other downstream tasks. Depth estimation algorithms (Eigen and Fergus, 2014; Alhashim and Wonka, 2018; Chabra et al., 2019) also create 3D reconstructions by fusing depth maps using ICP and volumetric fusion (Izadi et al., 2011).
2.1.1. Glass and mirror reconstruction
Reflective surfaces produce identifiable audio and visual artifacts that can be used to help their detection. For example, researchers have developed algorithms to detect reflections in images taken through glass using correlations of 8-by-8 pixel blocks (Shih et al., 2015), image gradients (Kopf et al., 2013), two layer renderings (Sinha et al., 2012), polarization imaging reflectometry (Riviere et al., 2017), and diffraction effects (Toisoul and Ghosh, 2017). Adding hardware, (Sutherland, 1968) used ultrasonic sensor logic to track continuous wave ultrasound, and (Zhang et al., 2017) detect obstacles such as glass and mirrors by using frequencies outside of the human audible range. More recently, reflective surfaces have been detected by utilizing a mirrored variation of an AprilTag (Olson, 2011; Wang and Olson, 2016). (Whelan et al., 2018) use the reflective surface to their advantage by recognizing the AprilTag attached to their Kinect scanning device when it appears in the scene. Depth jumps and incomplete reconstructions have also been used (Lysenkov et al., 2012). However, vision-based approaches require the right illumination, non-blurred imagery, and limited clutter behind the surface that may limit the reflection. We show that sound creates a distinct audio signal, providing reconstruction methods complementary data about the presence of windows and mirrors without additional sensors.
Example 3D Reconstruction Methods:

| Modality | Methods |
|---|---|
| Active (RGB-D) | KinectFusion, DynamicFusion, BundleFusion |
| Passive (RGB) | SLAM, SFM, (Tanskanen et al., 2013), ScanNet, (Whelan et al., 2018) |
| Lidar | (Kada and Mckinley, 2009) |
| Ultrasonic | (Zhang et al., 2017) |
| Time of flight | (Crocco et al., 2016) |
2.2. Acoustic imaging and audio-based classifiers
We begin with an introduction into sound propagation, room acoustics, and audio-visual classifiers.
Acoustics: various models have been developed to simulate sound propagation in a 3D environment, such as wave-based methods (Mehra et al., 2015), ray tracing (Rungta et al., 2016), sound source clustering (Tsingos et al., 2004), multipole equivalent source methods (James et al., 2006), and a single point multipole expansion method representing outgoing pressure fields (Zheng and James, 2011). (Godoy et al., 2018) use acoustics and a smartphone app to detect car location and distance from walking pedestrians using temporal dynamics, and (Bianco et al., 2019) further discuss theory and applications of machine learning in acoustics. Computational imaging approaches have also used acoustics for non-line-of-sight imaging (Lindell et al., 2019), 3D room geometry reconstruction from audio-visual sensors (Kim et al., 2017), and acoustic imaging on a mobile device (Mao et al., 2018). To reconstruct windows and mirrors, our work uses room acoustics given the surface materials of the room (Schissler et al., 2018) and the distance from the sound source. However, prior work and downstream processes often require a watertight reconstruction, which can be difficult to generate in the presence of glass. Our approach addresses these issues using an integrated audio-visual CNN that can detect discontinuity, depth, and materials.
Audio-based classification and reconstruction: using principles from sound synthesis, propagation, and room acoustics, audio classifiers have been developed for environmental sound (Gemmeke et al., 2017; Piczak, 2015; Salamon et al., 2014), material (Arnab et al., 2015), and object shape (Zhang et al., 2017) classification. For audio-based reconstruction, Bat-G net uses ultrasonic echoes to train an auditory encoder and 3D decoder for 3D image reconstruction (Hwang et al., 2019). Audio input can take the form of raw audio, spectral shape descriptors (Michael et al., 2005; Cowling and Sitte, 2003; Smith III, 2020), or frequency spectral coefficients that we also adopt. In our method, we use reflecting sound to perform surface detection, depth estimation, and material classification.
Multi-modal learning: similar to its applications in natural language processing (NLP) and visual question answering systems (Kim et al., 2016, 2020; Hannan et al., 2020), multi-modal learning using both audio and visual sensory inputs has also been used for classification tasks (Sterling et al., 2018; Wilson et al., 2019), audio-visual zooming (Nair et al., 2019), and sound source separation (Ephrat et al., 2018; Lee and Seung, 2000), which has also isolated waves for specific generation tasks. Although similar in spirit, our audio-visual method, "EchoReconstruction," differs from the existing methods by learning absorption and reflectance properties to detect a reflective surface, its depth, and its material.
3. Technical Approach
In this work, we adopt "echolocation" as an analog for our echoreconstruction method. According to (Egan, 1988), an echo is defined as a distinct reflection of the original sound with a sufficient sound level to be clearly heard above the general reverberation. Although perceptible echo is abated because of precedence (known as the Haas effect) (Long, 2014), returning sound waves are received after reflecting off of a solid surface. We use these distinct, reflecting sounds to design a staged approach of audio and audio-visual convolutional neural networks. EchoCNN-A and EchoCNN-AV can be used to estimate depth based on reverberation times (Figure 9), recognize material based on frequency and amplitude, and handle both static and dynamic scenes with moving objects based on Doppler shift, all of which enhance scene and object reconstruction by detecting planar discontinuities from open or closed surfaces and then estimating depth and material.
3.1. Echolocation
Echolocation is the use of reflected sound to locate and identify objects, used in particular by animals such as dolphins and bats. According to (Szabo, 2014), bats emit ultrasound pulses, ranging from 20 to 150 kHz, to catch insect prey with a resolution of 2-15 mm. This involves signal processing such as:
- Doppler shift (the relative speed of the target),
- time delay (distance to the target), and
- frequency and amplitude in relation to distance (target object size and type recognition);
where the Doppler shift (or effect) is the perceived change in frequency $\Delta f = f_D - f_T$ (Doppler frequency $f_D$ minus transmitted frequency $f_T$) as a sound source with velocity $v_s$ moves toward or away from the listener/observer with velocity $v_o$ and angle $\theta$.
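As an illustration of the Doppler shift described above, the following sketch uses the classical moving-source/moving-observer relation; the exact formula and constant are assumptions here, not taken from the text.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at ~20 degrees C (assumed constant)

def doppler_shift(f_t, v_source, v_observer, theta_rad=0.0, c=SPEED_OF_SOUND):
    """Perceived frequency f_D under the classical Doppler effect.

    Positive velocities indicate motion toward the other party; theta_rad is
    the angle between the motion and the source-observer line.
    """
    return f_t * (c + v_observer * math.cos(theta_rad)) / (c - v_source * math.cos(theta_rad))

# A 1 kHz source moving toward a stationary listener at 10 m/s is heard higher;
# the perceived change in frequency is f_D - f_T:
shift = doppler_shift(1000.0, v_source=10.0, v_observer=0.0) - 1000.0
```

A positive `shift` indicates approach, a negative one recession, which is the cue the text uses to handle moving objects in dynamic scenes.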
3.2. Staged classification and reconstruction pipeline
As depicted in Figure 3, we take a staged approach to enhance scene and object reconstruction using audio-visual data. Our echoreconstruction prototype consists of two smartphones: one recording (top) and one emitting/reconstructing (bottom). Each audio emission is 100 ms of sound followed by 900 ms of silence to allow the receiving microphone to capture reflections and reverberations (subsection 3.3). After the 3D scan is complete, an .obj file containing geometry and texture information is generated. One-second frames are extracted from the recorded video to generate audio and visual input for the EchoCNN neural networks (section 4). These networks are independently trained to detect whether a surface is open or closed, estimate the depth to the surface from the sound source, and classify the material of the surface. Using mobile accelerometer data and a multi-scale neural network, such as (Eigen and Fergus, 2014), but with audio as a coarse global output refined using finer-scale visual data to augment depth estimation, will be explored as future work.
3.3. Sound source
A smartphone emits recordings of a human experimenter's voice, whistle, hand clap, pure tones (ranging from 63 Hz to 16 kHz), chirps, and noise (white, pink, and Brownian). All of these can be generated as either pulsed (PW) or continuous waves (CW). PW is preferred for theoretical and empirical reasons. First, the transmission frequency may experience considerable downshift as a result of absorption and diffraction effects (Szabo, 2014); pulsed waves, independent for each emission, are therefore theoretically better suited than continuous waves. Furthermore, section 6 shows superior PW results over CW for the given classification tasks.
Pure tones were generated with the default 0.8 (out of 1) amplitude using the Audacity computer program and center frequencies of 63 Hz, 125 Hz, 250 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz, 8 kHz, and 16 kHz. Human voice ranges from about 63 Hz to 1 kHz (Long, 2014) (125 Hz to 8 kHz (Egan, 1988)) and an untrained whistler from about 500 Hz to 5 kHz (Nilsson et al., 2008). Chirps were linearly interpolated from 440 Hz to 1320 Hz over 100 ms. A hand clap is an impulsive sound that yields a flat spectrum (Long, 2014). All sound sources were recorded and played back at max volume (Figure 4). While recorded sounds were used for consistency, we plan to add live audio for augmentation and future ease of use during reconstruction. Please see our supplementary materials for spectrograms across all sound sources.
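The pulsed emissions described above (a 100 ms pulse followed by silence, and a linear 440 to 1320 Hz chirp) can be sketched in a few lines of numpy; the 44.1 kHz sample rate is an assumption, chosen to cover the full audible range.

```python
import numpy as np

FS = 44_100  # sample rate in Hz (an assumption, not stated in this paragraph)

def pulsed_tone(freq_hz, amp=0.8, pulse_s=0.1, total_s=1.0, fs=FS):
    """One emission: a 100 ms pure-tone pulse followed by silence (1 s total)."""
    t = np.arange(int(pulse_s * fs)) / fs
    out = np.zeros(int(total_s * fs))
    out[: t.size] = amp * np.sin(2 * np.pi * freq_hz * t)
    return out

def linear_chirp(f0=440.0, f1=1320.0, dur_s=0.1, amp=0.8, fs=FS):
    """Linearly interpolated chirp, as described for the 440-1320 Hz source."""
    t = np.arange(int(dur_s * fs)) / fs
    # instantaneous phase of a linear chirp: 2*pi*(f0*t + (f1 - f0)/(2*dur)*t^2)
    phase = 2 * np.pi * (f0 * t + (f1 - f0) / (2 * dur_s) * t ** 2)
    return amp * np.sin(phase)
```

Each call to `pulsed_tone` yields one 1-second frame matching the emission cadence used by the prototype.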
Audio input: audio was generated as pulsed waves (PW). One smartphone emits the sound while performing an RGB-based reconstruction, and a second smartphone captures video. As future work, a single mobile device or Microsoft Kinect paired with audio chirps could be used for audio-visual capture and reconstruction instead of two separate devices. Each pulsed wave emitted into the scene totaled 1 second, consisting of a 100 ms impulse followed by silence. The 1 second audio frame length is based on the Sabine formula for the reverberation time of a compact room of like dimensions, calculated as:
$$T_R = \frac{0.049\,V}{a}$$

where $T_R$ is the reverberation time (time required for sound to decay 60 dB after the source has stopped), $V$ is the room volume (ft³), and $a$ is the total room absorption in sabins at a given frequency (e.g. 250 Hz). For the bathroom scene, $a = 69.23$ sabins, which is the sum of the sound absorption from the materials in Table 2.
Total room absorption a using S × α at 250 Hz:

| Real bathroom scene | S (ft²) | α | a (sabins) |
|---|---|---|---|
| Painted walls | 432 | 0.10 | 43.20 |
| Tile floor | 175 | 0.01 | 1.75 |
| Glass | 60 | 0.25 | 15.00 |
| Ceramic | 39 | 0.02 | 0.78 |
| Mirror | 34 | 0.25 | 8.50 |
| Total | | | 69.23 |
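The total absorption and the Sabine reverberation time can be checked directly from the table entries. This is a minimal sketch in imperial units; the room volume is not given in the text, so it is left as a parameter.

```python
# material: (area S in ft^2, absorption coefficient at 250 Hz), per Table 2
surfaces = {
    "painted walls": (432, 0.10),
    "tile floor":    (175, 0.01),
    "glass":         (60,  0.25),
    "ceramic":       (39,  0.02),
    "mirror":        (34,  0.25),
}

# total room absorption a = sum of S * alpha over all surfaces (sabins)
a = sum(S * alpha for S, alpha in surfaces.values())

def sabine_rt60(volume_ft3, absorption_sabins):
    """Sabine reverberation time T_R = 0.049 * V / a (seconds):
    the time for sound to decay 60 dB after the source stops."""
    return 0.049 * volume_ft3 / absorption_sabins
```

For the bathroom scene this reproduces the 69.23 sabins total; a volume of roughly a / 0.049 ft³ would give the ~1 s reverberation time that motivates the 1 second audio frames.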
Visual input: images were captured from the same smartphone video as the audio recordings. Each corresponding image was cropped and grayscaled for illumination invariance and data augmentation. Image dimensions were 64 by 25 pixels. Visual data served as inputs for visual only and audio-visual model variation EchoCNN-AV.
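A hypothetical sketch of this visual preprocessing (grayscale for illumination invariance, then downsampling to the stated 64 by 25 pixel input) might look as follows; the naive channel-mean grayscale and nearest-neighbor downsampling are assumptions, not the paper's exact pipeline.

```python
import numpy as np

def preprocess_frame(rgb, out_h=25, out_w=64):
    """Grayscale a video frame and downsample it to the EchoCNN-AV
    visual input size (height 25, width 64)."""
    gray = np.asarray(rgb, dtype=float).mean(axis=2)   # naive grayscale
    h, w = gray.shape
    ys = (np.arange(out_h) * h) // out_h               # row indices to sample
    xs = (np.arange(out_w) * w) // out_w               # column indices to sample
    return gray[np.ix_(ys, xs)]                        # nearest-neighbor downsample

small = preprocess_frame(np.zeros((480, 640, 3)))      # a 480p frame, for example
```

Cropping (not shown) would be applied before this step, per the text.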
3.4. Initial 3D Reconstruction
We evaluated the following smartphone-based reconstruction applications to obtain an initial 3D geometry, which our method then enhances. The Astrivis application, based on (Tanskanen et al., 2013), generates better live 3D geometries for closed objects than for scene reconstructions, since it limits the feature points per scan. On the other hand, Agisoft Metashape produces scene reconstructions offline from smartphone video. Enabling the software's depth point and guided camera matching features further improved the reconstructed geometries.
4. Model Architecture
To augment visually based approaches, we use a multimodal CNN with mel-scaled spectrogram and image inputs. First, we perform surface detection to determine if a space with depth jumps and holes is in error or in fact open (i.e. open/closed classification). In the event of error, we estimate distance from recorder to surface using audio-visual data for depth filtering and inpainting. Finally, we determine the material. All of these classifications are performed using our audio and audio-visual convolutional neural networks, referred to as EchoCNN-A and EchoCNN-AV (Fig. 3).
Audio sub-network: our frame-based EchoCNN-A consists of a single convolutional layer followed by two dense layers with feature normalization. Sampled at 44.1 kHz to cover the full audible range, audio frames are 1 second mel-scaled spectrograms with STFT coefficients (section 4). Each audio example is classified independently in 1 second intervals to reflect an estimated reverberation time based on a compact room size (subsection 3.3). With a 2048 sample Hann window ($N$), 25% overlap, and hop length $H = 512$, this results in a frequency dimension of 21.5 Hz ($f_s/N$) and a temporal dimension of 12 ms ($H/f_s$), or 12% of each 100 ms pulsed audio. Each spectrogram is individually normalized and downsampled to a size of 62 frequency bins by 25 time bins.
We define the frequency spectral coefficients (Müller, 2015) as:

$$\mathcal{X}(m,k) := \sum_{n=0}^{N-1} x(n + mH)\, w(n)\, e^{-2\pi i k n / N}$$

for time frame $m \in \mathbb{Z}$ and Fourier coefficient $k \in [0 : N/2]$, with real-valued DT signal $x : \mathbb{Z} \to \mathbb{R}$, sampled window function $w(n)$ of length $N \in \mathbb{N}$, and hop size $H \in \mathbb{N}$ (Müller, 2015). $\mathbb{R}$ denotes continuous time and $\mathbb{Z}$ denotes discrete time. Equal to $|\mathcal{X}(m,k)|^2$, spectrograms have been demonstrated to perform well as inputs to convolutional neural networks (CNNs) (Huzaifah, 2017). Their horizontal axis is time and their vertical axis is frequency.
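The spectral-coefficient computation can be sketched directly in numpy. This minimal version assumes a 44.1 kHz sample rate, a 2048 sample Hann window, and a 512 sample hop, matching the resolutions reported for EchoCNN's input (about 21.5 Hz per frequency bin and about 12 ms per frame); mel scaling and normalization are omitted.

```python
import numpy as np

def stft_magnitude(x, N=2048, H=512):
    """Magnitude STFT |X(m, k)|: frame the signal with hop H, apply a Hann
    window of length N, and take the real FFT of each frame.
    Returns an array of shape (num_frames, N // 2 + 1)."""
    w = np.hanning(N)
    frames = [x[m * H: m * H + N] * w for m in range((len(x) - N) // H + 1)]
    return np.abs(np.fft.rfft(frames, axis=1))

fs, N, H = 44_100, 2048, 512
tone = np.sin(2 * np.pi * 1000.0 * np.arange(fs) / fs)  # 1 s, 1 kHz test tone
S = stft_magnitude(tone, N, H)
peak_hz = S.mean(axis=0).argmax() * fs / N              # bin width fs/N ~ 21.5 Hz
```

The recovered peak frequency lands within one bin of the 1 kHz test tone, illustrating the frequency resolution trade-off discussed next.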
A hop length of $H = N/4 = 512$ samples achieves a reasonable temporal resolution and data volume of generated spectral coefficients (Müller, 2015). Temporal resolution is important in order to detect when a reflecting sound reaches the receiver. Therefore, we decided to use the shorter 2048 sample window length rather than a longer one. This resulted in a shorter hop length, accepting the trade-off of a higher temporal dimension for increased data volume.
Visual sub-network: while audio information is generally useful for all three classification tasks (Table 3), visual information is particularly useful to aid material classification. We use ImageNet (Krizhevsky et al., 2012) as a visual-based baseline to compare to our audio and audio-visual methods. It also serves as an input into our audio-visual merge layer. Future work will explore whether another image classification method is better suited as a baseline and for fusing with audio.
Merge layer: we evaluated concatenation and multi-modal factorized bilinear (MFB) pooling (Yu et al., 2017) to fuse the audio and visual fully connected layers. Concatenation of the two vectors serves as a straightforward baseline. MFB allows for additional learning in the form of a weighted projection matrix factorized into two low-rank matrices:

$$z_i = \mathbf{1}^T \left( \tilde{U}_i^T x \circ \tilde{V}_i^T y \right)$$

where $k$ is the factor or latent dimensionality with index $i$ of the factorized matrices $\tilde{U}_i$ and $\tilde{V}_i$, $\circ$ is the Hadamard product or element-wise multiplication, and $\mathbf{1} \in \mathbb{R}^k$ is an all-one vector.
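A minimal numpy sketch of MFB pooling under these definitions follows; the dimensions are illustrative assumptions, and in the actual model the factor matrices would be learned rather than random.

```python
import numpy as np

def mfb_pool(x, y, U, V, k):
    """Multi-modal factorized bilinear pooling (Yu et al., 2017), sketched:
    project both modalities into a shared k*out_dim factor space, take the
    Hadamard product, then sum-pool each group of k factors (equivalent to
    multiplying by an all-one vector of length k)."""
    joint = (U.T @ x) * (V.T @ y)        # Hadamard product in factor space
    return joint.reshape(-1, k).sum(axis=1)

rng = np.random.default_rng(0)
dim_x, dim_y, k, out_dim = 128, 64, 5, 10     # illustrative sizes (assumptions)
x, y = rng.normal(size=dim_x), rng.normal(size=dim_y)
U = rng.normal(size=(dim_x, k * out_dim))     # low-rank audio factor matrix
V = rng.normal(size=(dim_y, k * out_dim))     # low-rank visual factor matrix
z = mfb_pool(x, y, U, V, k)                   # fused feature of length out_dim
```

Compared to plain concatenation, the fused vector `z` captures multiplicative interactions between the audio and visual features at a fraction of the parameters of a full bilinear map.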
4.1. Loss Function
For open/closed predictions, categorical cross entropy loss (subsection 4.1) is used instead of binary cross entropy when estimating the extent of the surface opening (e.g. all the way open, halfway, or closed). A regression model is not used for depth estimation because ground truth data is collected in discrete 0.5 m or 1 ft increments within the free field for better noise reduction (Egan, 1988). The Softmax function is used for the output activations.
$$\mathcal{L} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$$

where $M$ is the number of classes, $y_{o,c}$ is a binary indicator for whether class $c$ is the correct classification for observation $o$, and $p_{o,c}$ is the predicted probability that observation $o$ is of class $c$.
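The loss and output activation can be sketched together in numpy; the three-class example (e.g. fully open, halfway, closed) is illustrative.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    """-sum_c y_{o,c} * log(p_{o,c}), averaged over observations o."""
    return -np.mean(np.sum(y_true * np.log(p_pred + eps), axis=-1))

# One confident correct prediction and one wrong one; the wrong prediction
# dominates the averaged loss.
y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
p = softmax(np.array([[4.0, 0.0, 0.0], [0.0, 0.1, 3.0]]))
loss = categorical_cross_entropy(y, p)
```

A perfect one-hot prediction drives the loss toward zero, while placing probability mass on the wrong class increases it without bound.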
Accuracy of Reflecting Sounds used for Open/Closed, Depth Estimation, and Material Classification in Real World Scenes:

| Method | Input | O/C: Shower | O/C: Window | Depth: Overall | Depth: 3 ft | Depth: 2 ft | Depth: 1 ft | Material: Overall | Material: Glass | Material: Mirror |
|---|---|---|---|---|---|---|---|---|---|---|
| kNN (Cover and Hart, 1967) | A | 56.5% | 100% | 21.3% | 16% | 21% | 25% | 44.0% | 47.5% | 52.4% |
| Linear SVM (Bottou, 2010) | A | 61.5% | 91.7% | 37.6% | 38% | 32% | 41% | 51.9% | 46.0% | 57.1% |
| SoundNet5 (Aytar et al., 2016) | A | 45.2% | 46.6% | 39.7% | 40% | 71% | 8% | 71.0% | 98.4% | 1.6% |
| SoundNet8 (Aytar et al., 2016) | A | 50.7% | 46.6% | 42.5% | 92% | 0% | 33% | 44.4% | 16.4% | 85.7% |
| ImageNet (Krizhevsky et al., 2012) | V | 78.1% | 96.1% | 45.2% | 52% | 83% | 0% | 80.6% | 60.7% | 100% |
| Acoustic Classification (Schissler et al., 2018) | AV | N/A | N/A | N/A | N/A | N/A | N/A | 48%* | | |
| EchoCNN-AV Cat (Ours) | AV | 100% | 100% | 89.5% | 95% | 100% | 73% | 100% | 100% | 100% |
| EchoCNN-AV MFB (Ours) | AV | 100% | 100% | 84.9% | 54% | 100% | 100% | 80.6% | 60.7% | 100% |
4.2. Depth filtering and planar inpainting
The outputs of our EchoCNN inform enhancements for 3D reconstruction (algorithm 1). If depth jumps in the reconstruction are first classified as an open surface, then no change is required other than filtering loose geometry and small components. Otherwise, there is a planar discontinuity (e.g. window or mirror) that needs to be filled. With depth estimated by EchoCNN, we filter the initial 3D mesh to within a threshold of that depth. This gives us the plane size needed to fill. Finally, EchoCNN classifies its surface material.
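A hypothetical numpy sketch of the depth-filtering step follows; the function name, the depth-along-z convention, and the bounding-rectangle plane estimate are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def depth_filter_and_fill(vertices, surface_depth, threshold=0.1):
    """Keep reconstructed vertices within `threshold` of the EchoCNN-estimated
    surface depth (depth taken along the z axis, all units in meters), and
    return the bounding rectangle of the retained points as the planar
    region to inpaint with the classified material."""
    verts = np.asarray(vertices, dtype=float)
    near = verts[np.abs(verts[:, 2] - surface_depth) < threshold]
    if near.size == 0:
        return near, None          # open surface: nothing to fill
    (x0, y0), (x1, y1) = near[:, :2].min(axis=0), near[:, :2].max(axis=0)
    return near, (x0, y0, x1, y1)  # plane extent to fill as glass/mirror
```

In the pipeline, this plane would then be textured according to the material label EchoCNN predicts.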
4.3. Implementation details
We implemented all EchoCNN and baseline models with TensorFlow (Abadi et al., 2015) and Keras (Chollet et al., 2015). Training was performed using a TITAN X GPU running on Ubuntu 16.04.5 LTS. We used categorical cross entropy loss with Stochastic Gradient Descent optimized by ADAM (Kingma and Ba, 2014). Using a batch size of 32, the remaining hyperparameters were tuned manually based on a separate validation set. We make our real-world and synthetic datasets available to aid future research in this area.
5. Datasets and Applications
Our audio-based EchoCNN-A and audio-visual EchoCNN-AV convolutional neural networks are trained across nine octave bands with center frequencies of 63 Hz, 125 Hz, 250 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz, 8 kHz, and 16 kHz. Training is done using these pulsed pure tone impulses along with experimenter hand claps. The hold-out test data is comprised of sound sources excluded from training (white noise, experimenter whistle, and voice) in order to evaluate generalization.
5.1. Real and synthetic datasets
Real: training data is comprised of 1-second pulsed spectrograms (Figure 7) from recorded pure tones, experimenter hand claps, Brownian noise, and pink noise (N=857). Training and test examples were collected via video recordings and labeled for material, open/closed status, and depth in 1 ft increments based on the distance from the surface. Nine octaves of pure tones, hand claps, and Brownian and pink noise cover a broad range of frequencies and were used to train our models.
Accuracy of Reflecting Sounds used for Classification in Controlled Experiment (Gl = Glass):

| Method | Input | O/C: Thick Gl | O/C: Thin Gl | Depth Est. (±0.5 m): Thick Gl | Depth Est. (±0.5 m): Thin Gl | Material Est. |
|---|---|---|---|---|---|---|
| kNN (Cover and Hart, 1967) | A | 53.8% | 64.1% | 11.5% | 21.4% | 66.5% |
| Linear SVM (Bottou, 2010) | A | 54.7% | 63.2% | 11.5% | 20.5% | 61.1% |
| SoundNet5 (Aytar et al., 2016) | A | 60.0% | 40.1% | 18.8% | 19.1% | 67.4% |
| SoundNet8 (Aytar et al., 2016) | A | 60.0% | 42.6% | 25.0% | 19.1% | 34.0% |
| ImageNet (Krizhevsky et al., 2012) | V | 95.8% | 80.8% | 83.3% | 66.7% | 87.5% |
| Acoustic Classification (Schissler et al., 2018) | AV | N/A | N/A | N/A | N/A | 48%* |
| EchoCNN-AV Cat (Ours) | AV | 98.9% | 100% | 99.4% | 92.2% | 76.6% |
| EchoCNN-AV MFB (Ours) | AV | 100% | 100% | 100% | 99.0% | 100% |
The hold-out test dataset consists of 1-second pulsed spectrograms from recorded experimenter voice, whistle, chirp, and white noise (N=431). Voice and whistle recordings were chosen for the hold-out test set to ease a future transition to live, hands-free emitted sounds during reconstruction. Hold-out test data is excluded from training and only evaluated during testing. While the same hold-out sets were used for visual and audio-visual evaluation, unheard is not the same as unseen: unheard audio can have the same visual appearance between training and test. New training and test datasets for visual and audio-visual methods are left to future work.
Synthetic: automated synthetic data collection was performed in Unreal Engine 4.25, where SteamAudio employs a ray-based geometric sound propagation approach, with support for dynamic geometry using the Intel Embree CPU-based ray-tracer. We refer to similar prior work (Schissler and Manocha, 2011) for more details on this approach. Given scene materials (e.g. carpet, glass, painted, tile, etc.), a sound source (e.g. voice), environmental geometry, and listener position, we generate impulse responses for a given scene of varying sizes. From each listener, specular and diffuse rays are randomly generated and traced into the scene. The energy-time curve for the simulated impulse response is the sum of these rays:
$$E_f(t) = \sum_j I_{j,f}\, \delta(t - t_j)$$

where $I_{j,f}$ is the sound intensity for path $j$ and frequency band $f$, $t_j$ is the propagation delay time for path $j$, and $\delta$ is the Dirac delta function or impulse function. As these sound rays collide in the scene, their paths change based on the absorption and scattering coefficients of the colliding objects. Common acoustic material properties can be referenced in (Egan, 1988). We assume the sound absorption coefficient of an open window ($\alpha = 1.0$, a perfect absorber) for open windows.
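A discretized sketch of this energy-time curve follows: each path deposits its intensity at the sample nearest its delay, approximating the Dirac delta on a sampled time axis. The sample rate and the two-path example values are assumptions.

```python
import numpy as np

def energy_time_curve(intensities, delays, fs=44_100, dur_s=1.0):
    """Discretized E_f(t) = sum_j I_{j,f} * delta(t - t_j): each propagation
    path j deposits its intensity I_j at the sample nearest its delay t_j."""
    etc = np.zeros(int(dur_s * fs))
    for I_j, t_j in zip(intensities, delays):
        etc[int(round(t_j * fs))] += I_j
    return etc

# A direct path plus one weaker reflection arriving ~5.8 ms later
# (roughly an extra 2 m of travel at 343 m/s); values are illustrative.
etc = energy_time_curve(intensities=[1.0, 0.25], delays=[0.010, 0.0158])
```

Summing many such paths, with intensities attenuated per collision by the material coefficients, yields the simulated impulse response.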
Along with the sound intensity $I_{j,f}$, a weight matrix $w$ is computed corresponding to the materials within the scene. Each entry $w_{m,f}$ is the average number of reflections from material $m$ for all paths that arrived at the listener. It is defined as:

$$w_{m,f} = \frac{\sum_j I_{j,f}\, n_{j,m}}{\sum_j I_{j,f}}$$

where $n_{j,m}$ is the number of times rays on path $j$ collide with material $m$, weighted according to the sound intensity of path $j$. To mirror our real-world data, sound source directivity was disabled. Future work is needed to compare ambient and directed sound sources. This data may also be used for material sound separation.
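The intensity-weighted material averaging can be sketched in a few lines of numpy; the two-path, two-material example is illustrative only.

```python
import numpy as np

def material_weights(intensities, collisions):
    """Intensity-weighted average number of reflections per material:
    w_m = sum_j I_j * n_{j,m} / sum_j I_j, where n_{j,m} counts the
    collisions of path j with material m."""
    I = np.asarray(intensities, dtype=float)   # shape (num_paths,)
    n = np.asarray(collisions, dtype=float)    # shape (num_paths, num_materials)
    return (I[:, None] * n).sum(axis=0) / I.sum()

# Two paths and two materials (e.g. glass, painted wall); the stronger path
# hit material 0 once and material 1 twice, the weaker path hit material 0
# three times.
w = material_weights([0.8, 0.2], [[1, 2], [3, 0]])
```

Loud paths thus dominate the per-material weights, which is what makes them useful for attributing the received energy to scene materials.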
Given a 720p 30 fps video walkthrough of the virtual environment (VE) with the camera moving along a keyframed spline, we reconstruct the virtual scene by extracting the individual frames of the video and using Agisoft Metashape (v1.7)'s reconstruction pipeline to solve for each image's camera transform. Metashape, previously known as PhotoScan, is considered state-of-the-art in commercial photogrammetry software. The general process (Linder, 2009) is: (1) create a sparse point cloud containing only keypoints and solve for the transforms of cameras that can see the keypoints; (2) create a dense cloud by matching more features between keypoint-seeing cameras and the rest; (3) project the dense cloud depth data to each camera to build per-camera depth maps; (4) use the depth maps to build a mesh; and (5) create a texture map by projecting the image frames that best see each polygon onto the mesh (Catmull, 1974).
We disable motion blur of the camera in order to have more usable frames, but real camera data generally requires blurry frames to be removed to avoid noisy reconstructions, especially at low framerates. For accurate visual feedback of the specular surfaces, we also enable UE4's DirectX12 ray-tracing for reflective and translucent surfaces. We used a PC with the following specs for reconstruction: GTX 1080 GPU, i9-9900k CPU, 64 GB RAM, Windows 10 x64. It takes approximately 2 hours to process a 720p image sequence of 2000 frames from start to finish with this setup. We use three sections of the "HQ Residential House" environment on the Unreal Marketplace for synthetic data. The kitchen and bathroom sequences result in about 2,000 images when they are extracted from the video at a step size of 3, and the master bedroom has about 4,000.
5.2. Applications in VR Systems
When using a head mounted display (HMD), users are alerted when approaching the boundaries of the physical space. However, if room setup does not accurately reflect these boundaries or changes occur after setup, a user risks walking into unseen real-world objects such as glass and walls. Using our method, sound transmitted from the HMD could be used to locate physical objects and appropriately notify the user as an added safety measure. Audio directly from the real-world environment could also be used for depth estimation. These sounds could then be unmixed and placed in the virtual environment, reconstructing both the scene geometry and the sound sources (Figure 11). Finally, seasonal variations in the 3D sound and visual reconstruction of a window open in the spring and closed in the winter also enhance the AR/VR experience. See Figure 2 and the supplementary demo video.
6. Experiments and Results
Overall, 71.2% of hold out reflecting sounds and 100% of audio-visual frames were correctly classified as an open or closed boundary in the home (Table 3). 71.8% of 1 second audio frames were correctly classified as 1 ft, 2 ft, or 3 ft away from the surface based on audio alone; 89.5% when concatenating with its corresponding image. Finally, 77.4% of audio and 100% of audio-visual inputs correctly labeled the surface material.
ImageNet, a visual-only baseline, is higher at 78.1% than audio-only EchoCNN-A for open/closed classification. This is partly because the hold-out set was designed to test audio generalization (i.e. unheard sound sources). But unheard sound sources do not guarantee unseen visual data: images similar to those found in training are present in the test set. A hold-out set based on images (e.g. different depths) should be evaluated as future work.
6.1. Experimental setup
Listener (top smartphone, e.g. Galaxy Note 4) and sound source (bottom smartphone, e.g. iPhone 6) are separated vertically by 7 cm. Pulsed sounds are emitted 3 feet, 2 feet, and 1 foot away from the reconstructed surface. Three feet was selected to remain in the free field; beyond that, there is less noise reduction due to reflecting sounds in the reverberant field (Egan, 1988). Scanning within a few feet of the reconstructed surface also creates finer-detail reconstructions.
We labeled our data based on scene, sound source, and surface properties - type of surface, material, and depth from sound source. The training set included pulsed sounds of pure tone frequencies, a single hand clap, brownian noise, and pink noise. The hold out test set consisted of voice, whistle, chirp, and white noise. For rooms with different sound-absorbing treatments, our real-world recordings include a bedroom (e.g. carpet and painted) and bathroom (e.g. tiled).
6.2. Activation Maximization
The objective of activation maximization is to generate an input that maximizes layer activations for a given class. This provides insights into the types of patterns the neural network is learning. Figure 9 shows the different inputs that would maximize EchoCNN activations for depth estimation. Notice that lower frequencies tend to occur at 3 ft (longer reverberation times) than at 1 and 2 ft (higher frequencies), due to typical high-frequency damping and absorption.
Using audio, we noticed noise reduction between winter and spring due to more foliage on the trees. We also observed flutter echoes, which can be heard as a "rattle" or "clicking" from a hand clap and have been simulated in spatial audio (Halmrast, 2019). They became more pronounced the closer we moved to the wall surface in the bathroom scene. Background UV textures are placed at a fixed 1 ft (0.3 m) behind the estimated surface depth. Audio was unable to augment failure cases of the shower from initial RGB-based reconstructions using either (Tanskanen et al., 2013) or (Metashape, 2020). We leave calculating the background depth as future work. We compare our 3D reconstructions to depth estimates based on related work.
6.4. Results by source frequency and object size
We evaluate a range of source frequencies to account for different sound wave behavior based on the size of the reconstructing objects. For example, if an object is much smaller than the wavelength, the sound flows around it rather than scattering (Long, 2014):
$$\lambda = \frac{c}{f}$$

where $\lambda$ is the wavelength (ft) of sound in air at a specific frequency, $f$ is the frequency (Hz), and $c$ is the speed of sound in air (ft/s). Dynamically setting the source frequency based on object size is left as future work.
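The wavelength relation above can be checked for the nine octave-band center frequencies used in training; the 1,130 ft/s speed of sound is an assumption for room-temperature air.

```python
SPEED_OF_SOUND_FT_S = 1_130.0  # speed of sound in air (ft/s), ~room temperature

def wavelength_ft(freq_hz, c=SPEED_OF_SOUND_FT_S):
    """lambda = c / f: wavelength (ft) of sound in air at a given frequency."""
    return c / freq_hz

# A 63 Hz tone is roughly 18 ft long, so it flows around furniture-sized
# objects rather than scattering, while a 16 kHz tone (well under 0.1 ft)
# scatters off much smaller surfaces.
low, high = wavelength_ft(63.0), wavelength_ft(16_000.0)
```

This is why low-frequency sources are less informative about small reconstructed objects than high-frequency ones.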
7. Conclusion and Future Work
To the best of our knowledge, our work introduces the first audio and audio-visual techniques for enhancing scene reconstructions that contain windows and mirrors. Our smartphone prototype and staged EchoReconstruction pipeline emits and receives pulsed audio from a variety of sound sources for surface detection, depth estimation, and material classification. These classifications enhance scene and object 3D reconstruction by resolving planar discontinuities caused by open spaces and reflective surfaces using depth filtering and planar filling. Experimental results on real-world and virtual scenes containing windows, mirrors, and open surfaces show that our system performs well compared to baseline methods. We intend to publicly release our real and synthetic audio-visual ground truth data, in addition to reflection separation data (direct, early, or late reverberations), for future research.
This work offers many exciting possibilities in teleimmersion, teleconferencing, and many other AR/VR applications, where the improved quality and accessibility of scanning a room with mobile phones can significantly enhance user presence. It can be integrated into VR headsets with cameras and microphones, enabling remote walk-throughs of spaces such as museums, architectural landmarks, future homes, or cultural heritage sites. The scanning process could provide real-time feedback so that a user wearing the HMD could see the quality of the reconstruction and move to problem areas, such as mirrors and open windows.
Future Work: To further extend this research, one alternative to explore is performing audio emission, reception, and 3D reconstruction simultaneously in real time instead of in stages. This approach could enable mapping classifications to 3D geometry more densely than fusing RGB-D, tracking, or Iterative Closest Point (ICP) (Izadi et al., 2011). An integrated approach, such as a multi-scale neural network (Eigen and Fergus, 2014) using audio for coarse predictions and visuals for finer ones, may be not only more efficient but also more effective by using audio feedback as part of the reconstruction pipeline. Another possible avenue of exploration is to investigate the impact of live audio for training and/or testing our neural network variations. With a defined set of output classes for EchoCNN, alternative baselines such as Non-negative Matrix Factorization (NMF), source separation techniques, and the pYIN algorithm (Mauch and Dixon, 2014), which extracts the fundamental frequency f0 (the frequency of the lowest partial of the sound), are suggested as future directions. Finally, our current implementation holds out voice and whistle data, which differ from the audio used during training. However, unheard sounds do not equate to unseen images; some insights could therefore be gained by experimenting with a different training dataset for testing audio-only, visual-only, and audio-visual methods.
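As a concrete baseline of the kind suggested above, a fundamental-frequency estimate can be sketched with plain autocorrelation. This is a deliberately simplified stand-in for pYIN (Mauch and Dixon, 2014), which adds probabilistic thresholding and HMM-based pitch tracking that this sketch omits; the parameters and test tone are assumptions for illustration.

```python
import math

def estimate_f0(signal, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency (Hz) of a mono frame by picking the
    lag with maximum autocorrelation inside the candidate period range."""
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]              # remove DC offset
    lag_min = int(sample_rate / fmax)           # shortest candidate period (samples)
    lag_max = int(sample_rate / fmin)           # longest candidate period (samples)
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, lag_max):
        corr = sum(x[i] * x[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag

# About 0.25 s of a 440 Hz sine at an 8 kHz sampling rate.
sr = 8000
tone = [math.sin(2 * math.pi * 440.0 * i / sr) for i in range(2048)]
f0 = estimate_f0(tone, sr)   # expect roughly 440 Hz (quantized to integer lags)
```

The integer-lag quantization limits precision (here the nearest lag gives about 444 Hz for a 440 Hz tone); pYIN's parabolic interpolation and probabilistic tracking address exactly this, which is why it remains the stronger baseline.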
References

- TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. Cited by: §4.3.
- High quality monocular depth estimation via transfer learning. arXiv e-prints abs/1812.11941. Cited by: §2.1.
- Joint object-material category segmentation from audio-visual cues. In Proceedings of the British Machine Vision Conference (BMVC). Cited by: §1, §2.2.
- SoundNet: learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems. Cited by: Table 3, Table 4.
- Machine learning in acoustics: theory and applications. The Journal of the Acoustical Society of America 146, pp. 3590–3628. Cited by: §2.2.
- Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010), Y. Lechevallier and G. Saporta (Eds.), Paris, France, pp. 177–187. Cited by: Table 3, Table 4.
- A subdivision algorithm for computer display of curved surfaces. Technical report, University of Utah, School of Computing, Salt Lake City. Cited by: §5.1.
- StereoDRNet: dilated residual stereo net. CoRR abs/1904.02251. Cited by: §1, §2.1.
- Keras. https://keras.io. Cited by: §4.3.
- Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, pp. 21–27. Cited by: Table 3, Table 4.
- Comparison of techniques for environmental sound recognition. Pattern Recognition Letters 24 (15), pp. 2895–2907. Cited by: §2.2.
- Uncalibrated 3d room reconstruction from sound. CoRR abs/1606.06258. Cited by: §2.1, Table 1.
- ScanNet: richly-annotated 3d reconstructions of indoor scenes. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. Cited by: §1.
- BundleFusion: real-time globally consistent 3d reconstruction using on-the-fly surface re-integration. ACM Transactions on Graphics 36, pp. 1. Cited by: §2.1.
- Architectural acoustics. McGraw-Hill Custom Publishing. Cited by: §3.3, §3, §4.1, §5.1, §6.1.
- Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. CoRR abs/1411.4734. Cited by: §2.1, §3.2, §7.
- Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. CoRR abs/1804.03619. Cited by: §2.2.
- Audio Set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, pp. 776–780. Cited by: §2.2.
- PAWS: a wearable acoustic system for pedestrian safety. pp. 237–248. Cited by: §2.2.
- SemanticPaint: a framework for the interactive segmentation of 3D scenes. Technical Report TVG-2015-1, Department of Engineering Science, University of Oxford. Released as arXiv e-print 1510.03727. Cited by: §1, §2.1.
- A very simple way to simulate the timbre of flutter echoes in spatial audio. Cited by: §6.3.
- ManyModalQA: modality disambiguation and QA over diverse inputs. Cited by: §2.2.
- Comparison of time-frequency representations for environmental sound classification using convolutional neural networks. CoRR abs/1706.07156. Cited by: §4.
- Bat-G net: bat-inspired high-resolution 3d image reconstruction using ultrasonic echoes. In NeurIPS. Cited by: §2.2.
- KinectFusion: real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, UIST '11, New York, NY, USA, pp. 559–568. Cited by: §2.1, §7, 1.
- Precomputed acoustic transfer: output-sensitive, accurate sound generation for geometrically complex vibration sources. In ACM SIGGRAPH 2006 Papers, SIGGRAPH '06, New York, NY, USA, pp. 987–995. Cited by: §2.2.
- 3D building reconstruction from lidar based on a cell decomposition approach. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 38. Cited by: Table 1.
- 3D room geometry reconstruction using audio-visual sensors. pp. 621–629. Cited by: §2.2.
- Modality-balanced models for visual dialogue. Cited by: §2.2.
- Multimodal residual learning for visual QA. Cited by: §2.2.
- Adam: a method for stochastic optimization. arXiv:1412.6980. Published as a conference paper at the 3rd International Conference on Learning Representations, San Diego, 2015. Cited by: §4.3.
- Image-based rendering in the gradient domain. ACM Transactions on Graphics (TOG) 32, pp. 199:1–199:9. Cited by: §2.1.1.
- ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. Cited by: Table 3, §4, Table 4.
- A large-scale hierarchical multi-view rgb-d object dataset. pp. 1817–1824. Cited by: §2.1.
- Temporal convolutional networks for action segmentation and detection. CoRR abs/1611.05267. Cited by: §1.
- Algorithms for non-negative matrix factorization. In Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS'00, Cambridge, MA, USA, pp. 535–541. Cited by: §2.2.
- Acoustic non-line-of-sight imaging. pp. 6773–6782. Cited by: §2.2.
- Digital photogrammetry. Vol. 1, Springer. Cited by: §5.1.
- Architectural acoustics. 2nd edition, Academic Press. Cited by: §3.3, §3, Table 3, §6.4.
- Recognition and pose estimation of rigid transparent objects with a kinect sensor. In Robotics: Science and Systems. Cited by: §2.1.1.
- AIM: acoustic imaging on a mobile. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys '18, New York, NY, USA, pp. 468–481. Cited by: §2.2.
- pYIN: a fundamental frequency estimator using probabilistic threshold distributions. pp. 659–663. Cited by: §7.
- WAVE: interactive wave-based sound propagation for virtual environments. IEEE Transactions on Visualization and Computer Graphics 21, pp. 434–442. Cited by: §2.2.
- AgiSoft Metashape Standard (Version 1.6.2) (Software). Cited by: Figure 2, §6.3.
- Sound classification in hearing aids inspired by auditory scene analysis. EURASIP Journal on Advances in Signal Processing 18. Cited by: §2.2.
- Fundamentals of music processing: audio, analysis, algorithms, applications. 1st edition, Springer Publishing Company, Incorporated. Cited by: §4.
- Audiovisual zooming: what you see is what you hear. In Proceedings of the 27th ACM International Conference on Multimedia, MM '19, New York, NY, USA, pp. 1107–1118. Cited by: §2.2.
- KinectFusion: real-time dense surface mapping and tracking. pp. 127–136. Cited by: §2.1.
- DynamicFusion: reconstruction and tracking of non-rigid scenes in real-time. pp. 343–352. Cited by: §2.1.
- Human whistle detection and frequency estimation. Image and Signal Processing, Congress on 5, pp. 737–741. Cited by: §3.3.
- AprilTag: a robust and flexible visual fiducial system. pp. 3400–3407. Cited by: §2.1.1.
- ESC: dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, New York, NY, USA, pp. 1015–1018. Cited by: §2.2.
- Example-guided physically based modal sound synthesis. ACM Trans. Graph. 32 (1), pp. 1:1–1:16. Cited by: §1.
- Polarization imaging reflectometry in the wild. ACM Transactions on Graphics (TOG) 36, pp. 1–14. Cited by: §2.1.1.
- SynCoPation: interactive synthesis-coupled sound propagation. IEEE Transactions on Visualization and Computer Graphics 22 (4), pp. 1346–1355. Cited by: §2.2.
- A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, New York, NY, USA, pp. 1041–1044. Cited by: §2.2.
- Acoustic classification and optimization for multi-modal rendering of real-world scenes. IEEE Transactions on Visualization and Computer Graphics 24, pp. 1246–1259. Cited by: §1, §2.2, Table 3, Table 4.
- GSound: interactive sound propagation for games. Proceedings of the AES International Conference. Cited by: §5.1.
- A comparison and evaluation of multi-view stereo reconstruction algorithms. Vol. 1, pp. 519–528. Cited by: §2.1.
- Reflection removal using ghosting cues. pp. 3193–3201. Cited by: §2.1.1.
- Indoor segmentation and support inference from rgbd images. In ECCV, pp. 746–760. Cited by: §2.1.
- BigBIRD: a large-scale 3d database of object instances. pp. 509–516. Cited by: §2.1.
- Image-based rendering for scenes with reflections. ACM Transactions on Graphics (TOG) 31, pp. 1–10. Cited by: §1, §2.1.1.
- Physical audio signal processing. https://ccrma.stanford.edu/~jos/pasp/. Cited by: §2.2.
- Semantic scene completion from a single depth image. pp. 190–198. Cited by: §1, §2.1.
- ISNN: impact sound neural network for audio-visual object classification. In Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Cham, pp. 578–595. Cited by: §1, §2.2.
- A head-mounted three dimensional display. In AFIPS '68 (Fall, part I). Cited by: §2.1.1.
- Chapter 12 - nonlinear acoustics and imaging. In Diagnostic Ultrasound Imaging: Inside Out (Second Edition), T. L. Szabo (Ed.), pp. 501–563. Cited by: §3.1, §3.3.
- Scene-aware audio rendering via deep acoustic analysis. IEEE Transactions on Visualization and Computer Graphics 26 (5), pp. 1991–2001. Cited by: §1.
- Live metric 3d reconstruction on mobile phones. pp. 65–72. Cited by: Figure 2, Table 1, §3.4, §6.3.
- Practical acquisition and rendering of diffraction effects in surface reflectance. ACM Transactions on Graphics 36, pp. 1. Cited by: §2.1.1.
- Perceptual audio rendering of complex virtual environments. ACM Trans. Graph. 23 (3), pp. 249–258. Cited by: §2.2.
- AprilTag 2: efficient and robust fiducial detection. pp. 4193–4198. Cited by: §2.1.1.
- 'Structure-from-motion' photogrammetry: a low-cost, effective tool for geoscience applications. Geomorphology 179, pp. 300–314. Cited by: §2.1.
- Reconstructing scenes with mirror and glass surfaces. ACM Trans. Graph. 37 (4), pp. 102:1–102:11. Cited by: §1, §2.1.1, Table 1.
- AVOT: audio-visual object tracking of multiple objects for robotics. In ICRA 2020. Cited by: §1.
- Analyzing liquid pouring sequences via audio-visual neural networks. pp. 7702–7709. Cited by: §1, §2.2.
- 3D ShapeNets: a deep representation for volumetric shapes. pp. 1912–1920. Cited by: §1, §2.1.
- Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. CoRR abs/1708.01471. Cited by: §4.
- Shape-from-shading: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (8), pp. 690–706. Cited by: §2.1.
- 3D reconstruction in the presence of glass and mirrors by acoustic and visual fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence PP, pp. 1–1. Cited by: §2.1.1, Table 1.
- Generative modeling of audible shapes for object perception. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1260–1269. Cited by: §1, §2.2.
- Toward high-quality modal contact sound. ACM Trans. Graph. 30 (4), pp. 38:1–38:12. Cited by: §2.2.