3D-MOV: Audio-Visual LSTM Autoencoder for 3D Reconstruction of Multiple Objects from Video

10/05/2021 ∙ by Justin Wilson, et al. ∙ University of Maryland ∙ University of North Carolina at Chapel Hill

3D reconstruction of transparent and concave structured objects, with inferred material properties, remains an open research problem for robot navigation in unstructured environments. In this paper, we propose a multimodal single- and multi-frame neural network for 3D reconstruction using audio-visual inputs. Our trained reconstruction LSTM autoencoder 3D-MOV accepts multiple inputs to account for a variety of surface types and views. Our neural network produces high-quality 3D reconstructions using a voxel representation. Based on Intersection-over-Union (IoU), we evaluate against other baseline methods using synthetic audio-visual datasets ShapeNet and Sound20K with impact sounds and bounding box annotations. To the best of our knowledge, our single- and multi-frame model is the first audio-visual reconstruction neural network for 3D geometry and material representation.




I Introduction

Deep neural networks trained on single- or multi-view images have enabled 3D reconstruction of objects and scenes using RGB and RGBD approaches for robotics and other 3D vision-based applications. These models generate 3D geometry volumetrically [6, 11, 64] and in the form of point clouds [18, 44, 46]. With these reconstructions, additional networks have been developed to use the 3D geometry as inputs for object detection, classification, and segmentation in 3D environments [3, 47]. However, existing methods still encounter a few challenging scenarios for 3D shape reconstruction [6].

One such challenge is occlusion in cluttered environments with multiple agents/objects in a scene. Another is spatial resolution. Volumetric methods such as voxelized reconstructions [37] are primarily limited by resolution. Point cloud representations of shape avoid issues of grid resolution, but instead need to cope with issues of point set size and approximations. Existing methods also are challenged by transparent and highly reflective or textured surfaces. Self-occlusions and occlusions from other objects can also hinder image-based networks, necessitating the possible adoption of multimodal neural networks.

To address these limitations, we propose to use audio-visual input for 3D shape and material reconstruction. A single view of an object is insufficient for 3D reconstruction as only one projection of the object can be seen, while multi-view input does not intrinsically model the spatial relationships between views. By providing a temporal sequence of video frames, we strengthen the relationships between views, aiding reconstruction. We also include audio as an input, in particular, impact sounds resulting from interactions between the object to be reconstructed and the surrounding environment. Impact sounds provide information about the material and internal structure of an object, offering complementary cues to the object’s visual appearance. We choose to represent our final 3D shape using a voxel representation due to its state-of-the-art performance in classification tasks. To the best of our knowledge, our audio-visual network is the first to reconstruct multiple 3D objects from a single video.

Fig. 1: Our 3D-MOV neural network is a multimodal LSTM autoencoder optimized for 3D reconstructions of single ShapeNet objects and multiple objects from Sound20K video. During training, an LSTM autoencoder is trained to reconstruct 2D image and spectrogram inputs. 3D shape reconstructions are then generated by fine-tuning the fused encodings of each modality for 3D voxel output. The network has recurrent LSTM layers for temporal consistency. Adding audio enhances learning for object tracking, material classification, and reconstruction when multiple objects collide, self-occlude, or are transparent.

Main Results: In this paper, we introduce a new method to reconstruct high-quality 3D objects from video, as a sequence of images and sounds. The main contributions of this work can be summarized as follows.

  • A multimodal LSTM autoencoder neural network for both geometry and material reconstruction from audio and visual data is introduced;

  • The resulting implementation has been tested on voxel, audio, and image datasets of objects over a range of different geometries and materials;

  • Experimental results of our approach demonstrate the reconstruction of single sounding objects and multiple colliding objects in a virtual scene;

  • Audio-augmented datasets with ground-truth objects and their tracking bounding boxes are made available for research in audio-visual reconstruction from video.

II Related Work

Computer vision research continues to push state-of-the-art reconstruction and segmentation of objects in a scene [12]. However, there still remain research opportunities in 3D reconstruction. Wide baselines limit the accuracy of feature correspondences between views. Challenging objects for reconstruction include thin or small objects (e.g. table legs), and classes of objects that are transparent, occluded, or have much higher shape variation than other classes (e.g. lamps, benches, and tables compared to cabinets, cars, and speakers for example). In this section, we review previous work relating to 3D reconstruction, multimodal neural networks, and reconstruction network structures.

II-A 3D Reconstruction

Deep learning techniques have produced state-of-the-art 3D scene and object reconstructions. These models take an image or series of images and generate a reconstructed output shape. Some methods produce a transformed image of the input, intrinsically representing the 3D object structure [42, 56, 36, 38, 35]. 3D voxel grids provide a shape representation which is easy to visualize and works well with convolution operations [11, 16, 51, 45, 22, 61]. In more recent work, point clouds have also been found to be a viable shape representation for reconstructed objects [21, 15].

II-B Multimodal Neural Networks

Neural networks with multiple modalities of inputs help cover a broader range of experimental setups and environments. Common examples include visual question answering [7], vision and touch [30], and other multisensory interactions [27]. Multiple modes may also take the form of image-to-image translation, e.g. domain transfer [23]. Using local and global cropped parts of the images (i.e. bounding boxes) has also been shown to serve as a mode of context to supervise learning.

Audio-visual specific multimodal neural networks have also proven effective for speech separation [14] as well as sound localization [69, 43, 28, 2]. Audio synthesis conditioned on images is also enabled as a result of these combined audio-visual datasets [70]. Please see a survey and taxonomy on multimodal machine learning [5] and multimodal deep learning [40] for more information.

Fig. 2: We first separate audio-visual data using object tracking (Section III-A) and sound source separation (Section III-B). Features from audio and visual subnetworks for each object are aggregated by LSTM autoencoders and then fused using addition, concatenation, or a bilinear model [66]. Finally, 3D geometry is reconstructed by a 3D decoder and audio classified material applied to all voxels.

II-C Reconstruction Network Structures

While single view networks perform relatively well for most object classes, objects with concave structures or classes of objects with large variations in shape tend to require more views. 3D-R2N2 [11] allows for both single and multi-view implementations given a single network. Other recurrent models include learning in video sequences [10, 19], Point Set Generation [15], and Pixel Recurrent Neural Network (PixelRNN) [58]. Methods have also been developed to ensure temporal consistency [65] and use generative techniques [17]. T-L network [16] and 3D-R2N2 [11] are most similar to our 3D-MOV reconstruction neural network. Building on these related works, we fuse audio as an additional input and temporal consistency in the form of LSTM layers (Fig. 2).

III Technical Approach

In this work, we reconstruct the 3D shape and material of sounding objects given images and impact sounds. Using audio and visual information, we present a method for reconstruction of single instance ModelNet objects augmented with audio and multiple objects colliding in a Sound20K scene from video. In this section, we cover visual representations from object tracking (Section III-A) and audio obtained from sound source separation of impact sounds (Section III-B) that serve as inputs into our 3D-MOV reconstruction network (Section IV).

III-A Object Tracking and Visual Representation

Since an entire video frame may contain too much background, we use object tracking to track and segment different objects. This tracking is performed using the Audio-Visual Object Tracker (AVOT) [60]. Similar to the Single Shot MultiBox Detector (SSD) [33], AVOT is a feed-forward convolutional neural network that classifies and scales a fixed number of anchor bounding boxes to track objects in a video. While 3D-MOV aggregates audio-visual features before decoding, AVOT fuses audio-visual inputs before its base network. With additional information from audio, AVOT defines an object based on both its geometry and material.

We use AVOT over other algorithms, such as YOLO [48] or Faster R-CNN [50], because of the availability of audio and the need for higher object-tracking accuracy given occlusions caused by multiple objects colliding. Unlike CSR-DCF [34], AVOT automatically detects objects in the video without initial markup of bounding boxes. For future work, a scheduler network or a combination of object trackers is worth considering, as well as use of the Common Objects in Context (COCO) [32] and SUN RGB-D [54, 53, 26, 63] datasets for initialization and transfer learning.

The output from tracking is a series of segmented image frames for each object, consisting of the contents of its tracked bounding box throughout the video. These segmented frames are grayscaled and resized to a consistent input size of 88 by 88 pixels. While resizing, we maintain aspect ratio and pad to square the image. These dimensions were automatically chosen to account for the size of objects in our Sound20K dataset and to capture their semantic information. Scenes included one, two, and three colliding objects with materials such as granite, slate, oak, and marble. For our single-frame, single impact sound evaluations, we resized ShapeNet’s 224 x 224 image size. For comparison, other image sizes from related work include MNIST, 28 x 28; 3D-R2N2, 127 x 127; ImageNet, 256 x 256.
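As a rough illustration of the resize-and-pad step described above, the following sketch scales a grayscale crop so that its longer side is 88 pixels and then pads it to a square. The function name `resize_and_pad`, the nearest-neighbour sampling, the centred placement, and the zero fill value are all assumptions made for this example; the text does not state which interpolation or padding scheme was used.

```python
def resize_and_pad(image, size=88, fill=0):
    """Resize a grayscale crop to fit in size x size, keeping aspect ratio.

    `image` is a list of rows of pixel values. Nearest-neighbour sampling,
    centring, and zero padding are illustrative assumptions.
    """
    height, width = len(image), len(image[0])
    scale = size / max(height, width)
    new_h = max(1, round(height * scale))
    new_w = max(1, round(width * scale))

    # Nearest-neighbour sampling of the source pixels.
    resized = [
        [image[min(height - 1, int(r / scale))][min(width - 1, int(c / scale))]
         for c in range(new_w)]
        for r in range(new_h)
    ]

    # Place the resized crop at the centre of a square canvas.
    top = (size - new_h) // 2
    left = (size - new_w) // 2
    canvas = [[fill] * size for _ in range(size)]
    for r in range(new_h):
        canvas[top + r][left:left + new_w] = resized[r]
    return canvas


# A tall 44 x 22 crop becomes 88 x 44 and is padded on the left and right.
crop = [[255] * 22 for _ in range(44)]
padded = resize_and_pad(crop)
```

Padding after the resize, rather than stretching the crop, keeps object proportions intact, which matters when shape is the quantity being reconstructed.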

Fig. 3: For our single impact sound analysis using ShapeNet, we build multimodal datasets using modal sound synthesis to produce spectrograms for audio input and images of voxelized objects as an estimate of shape. Please note that audio used from ISNN [55] was generated for voxelized models as a result of the sound synthesis pipeline requiring watertight meshes. Unmixed Sound20K audio was available from the generated synthetic videos.

Fig. 4: Hidden layer representations for the visual and audio inputs are trained to spatially encode object geometry and impact sounds, where i indexes the video frames. These learned weights are subsequently used during test time to generate 3D shapes from audio-visual inputs. For sequence modeling, LSTM layers are reliable for temporal consistency and establishing dependencies over time. More specifically, we use convolutional LSTM layers rather than fully connected ones to also preserve spatial information.

III-B Sound Source Separation of Impact Sounds and Audio Representation

For single frame reconstruction, we synthesize impact sounds on ShapeNet [55], illustrated in Fig. 3. For multiple frames, we take as input a Sound20K video showing one or more objects moving around a scene. These objects strike one another or the environment, producing impact sounds, which can be heard in the audio track of the video. We refer to these objects, dynamically moving through the scene and generating sound due to impact and collision, as sounding objects. Sound20K provides mixed and unmixed audio which can be used directly or to train algorithms for sound source separation [59, 29, 52]. While prior work to localize objects using audio-visual data exists [2, 69], automatically associating separated sounds with corresponding visual object tracks in the context of the reconstruction task remains an area of future work.

Initially, Sound20K and ShapeNet audio are available as time series data, sampled at 44.1 kHz to cover the full audible range. The audio is converted to mel-scaled spectrograms for neural network inputs, which effectively represent the spectral distribution of energy over time. Each spectrogram is 3 seconds for a single frame (ShapeNet) and 0.03 seconds per multi-frame (Sound20K) with an overlap of 25%. Audio spectrograms are aligned temporally with their corresponding image frames from video, forming the audio-visual input for queries. They are generated with discrete short-time Fourier transforms (STFTs) using a Hann window function.


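$$X(m, k) = \sum_{n=0}^{N-1} x(n + mH)\, w(n)\, e^{-2\pi i k n / N}$$

for the $m$-th time frame and $k$-th Fourier coefficient, with real-valued DT signal $x$, sampled window function $w(n)$ for $n \in [0 : N-1]$ of length $N$, and hop size $H$ [39].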
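To make the spectrogram computation concrete, here is a minimal pure-Python sketch of a discrete STFT with a Hann window, followed by a power spectrogram. The function names, the periodic form of the Hann window, and the reading of "25% overlap" as a hop of 0.75 times the window length are assumptions for illustration. The mel filterbank that would map these linear-frequency bins to the mel scale is deliberately left out.

```python
import cmath
import math


def hann(length):
    """Periodic Hann window (one common variant; an assumption here)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]


def stft(signal, window_length, hop):
    """Discrete STFT: one list of complex coefficients per time frame."""
    window = hann(window_length)
    num_frames = 1 + (len(signal) - window_length) // hop
    frames = []
    for m in range(num_frames):
        segment = [signal[n + m * hop] * window[n] for n in range(window_length)]
        # Keep only the non-negative frequencies of the real-valued signal.
        frames.append([
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / window_length)
                for n in range(window_length))
            for k in range(window_length // 2 + 1)
        ])
    return frames


def power_spectrogram(signal, window_length, hop):
    """Squared magnitude of each STFT coefficient."""
    return [[abs(c) ** 2 for c in frame] for frame in stft(signal, window_length, hop)]


# A sinusoid centred on bin 4 of a 32-sample window; hop of 24 gives 25% overlap.
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
spec = power_spectrogram(tone, window_length=32, hop=24)
peak_bins = [max(range(len(frame)), key=lambda k: frame[k]) for frame in spec]
```

In practice a library routine would replace the explicit double loop; the direct form is shown only because it mirrors the summation term by term.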

III-B1 Single View, Single Impact Sound

Single-view inputs are based on ShapeNet, a repository of 3D CAD models based on WordNet categories. Evaluations were performed on voxelized versions of ShapeNet [8], the ModelNet10 and ModelNet40 models [62], and image views of these datasets from 3D-R2N2 [11]. To generate audio for these objects to be used for our multimodal 3D-MOV neural network, we use data from Impact Sound Neural Network [55]. This work synthesized impact sounds for voxelized ModelNet10 and ModelNet40 models [62] using modal analysis and sound synthesis. Modal analysis is precomputed to obtain modes of vibration for each object, and sound is synthesized with an amplitude determined at run-time given the hit point location on the object and the impulse force. The modes are represented as damped sinusoidal waves where each mode has the form


$$q_i(t) = a_i\, e^{-d_i t} \sin(2\pi f_i t + \theta_i)$$

where $f_i$ is the frequency of the mode, $d_i$ is the damping coefficient, $a_i$ is the excited amplitude, and $\theta_i$ is the initial phase.
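The damped-sinusoid model can be sketched in a few lines by summing one decaying sine per mode. The function name `synthesize_impact` and the example mode values below are hypothetical; in the pipeline described here, the frequencies and dampings would come from precomputed modal analysis and the amplitudes from the hit point and impulse force. The 44.1 kHz sampling rate matches the rate stated for the audio data.

```python
import math


def synthesize_impact(modes, duration=1.0, sample_rate=44100):
    """Sum damped sinusoids into a mono signal.

    Each mode is a tuple (frequency_hz, damping, amplitude, phase).
    """
    num_samples = int(duration * sample_rate)
    signal = []
    for n in range(num_samples):
        t = n / sample_rate
        signal.append(sum(
            amplitude * math.exp(-damping * t)
            * math.sin(2 * math.pi * frequency * t + phase)
            for frequency, damping, amplitude, phase in modes
        ))
    return signal


# Hypothetical single mode: 440 Hz, damping 8 per second, unit amplitude, zero phase.
sound = synthesize_impact([(440.0, 8.0, 1.0, 0.0)], duration=0.1)
first_half_peak = max(abs(s) for s in sound[:2205])
second_half_peak = max(abs(s) for s in sound[2205:])
```

Larger damping values shorten the ring-down, which is one reason the sound carries material information.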

III-B2 Multi-Frame, Multi-Impact

Multi-frame inputs to our system consist of Sound20K [68] videos that may contain multiple sounding objects, possibly of similar sizes, shapes, and/or materials. This synthetic video dataset contains audio and video data for multiple objects colliding in a scene. Sound20K consists of 20,378 videos generated by a rigid-body simulation and impact sound synthesis pipeline [25]. Visually, Sound20K [68] objects can be separated from one another through tracking of bounding boxes. However, audio source separation can be more challenging, particularly for unknown objects. While Sound20K provides separate audio files for each object that can be used directly, the audio data can also be used to train sound source separation techniques [59, 29, 52] that learn to unmix audio into individual objects by geometry and material. As future work, we will compare the impact on reconstruction quality and performance of using combined versus unmixed audio for each object. We will also compare the impact of using source-separated sounds versus ground truth unmixed audio.

Fig. 5: We separately train audio and visual autoencoders to learn encodings and fine-tune for our 3D reconstruction task. We replace the 2D decoder with a 3D decoder of five deconvolutional layers to generate a voxel grid. The separate audio-visual LSTM autoencoders are flattened and merged to form the dense layer. Here, the predicted 3D shape voxels are displayed based on a threshold of 0.3.

IV 3D-MOV Network Structure

Our 3D-MOV network is a multimodal LSTM autoencoder optimized for 3D reconstructions of multiple objects from video. Like 3D-R2N2 [11], it is recurrent and generates a 3D voxel representation. However, to the best of our knowledge, our 3D-MOV network is the first audio-visual reconstruction network for 3D object reconstruction. After object tracking and sound source separation, we separately train autoencoders to extract visual and audio features from each frame (Section IV-A). While the 2D encoder weights are reused, the 2D decoders are discarded (blue rectangles in Fig. 5) and replaced with 3D decoders for learning to reconstruct voxel outputs of the tracked objects based on given 2D images and spectrograms. Using a merge layer such as addition, concatenation, or a bilinear model [66], our method 3D-MOV fuses the results of the audio and visual subnetworks, each composed of LSTM autoencoders.

IV-A Single Frame Feature Extraction

The autoencoder consists of two convolutional layers for spatial encoding followed by an LSTM convolutional layer for temporal encoding. As a general rule of thumb, we use small filters (3x3 and at most 5x5), except for the very first convolutional layer connected to the input, and strides of four and two for the two convolutional layers [31]. The decoder mirrors the encoder to reconstruct the image (Fig. 4). After each convolutional layer, we employ layer normalization, a counterpart to batch normalization that is suited to recurrent networks [4]. It normalizes the inputs across features and is defined as:

$$\mu_i = \frac{1}{m}\sum_{j=1}^{m} x_{ij}, \qquad \sigma_i^2 = \frac{1}{m}\sum_{j=1}^{m} (x_{ij} - \mu_i)^2, \qquad \hat{x}_{ij} = \frac{x_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$$

where $x_{ij}$ is batch i, feature j of the input x across m features, and $\epsilon$ is a small constant for numerical stability.
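A small sketch of layer normalization, written out directly, may help clarify what "across features" means: each sample is normalized using its own mean and variance over its m features, independently of the rest of the batch. The function name `layer_norm` and the `eps` default are assumptions, and the learned gain and bias that usually follow the normalization are omitted.

```python
import math


def layer_norm(batch, eps=1e-5):
    """Normalize each sample over its own features (no learned gain or bias)."""
    normalized = []
    for features in batch:
        m = len(features)
        mean = sum(features) / m
        variance = sum((value - mean) ** 2 for value in features) / m
        normalized.append(
            [(value - mean) / math.sqrt(variance + eps) for value in features]
        )
    return normalized


batch = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]]
out = layer_norm(batch)
row_mean = sum(out[0]) / 4
row_var = sum((v - row_mean) ** 2 for v in out[0]) / 4
```

Because the statistics are computed per sample, the result does not depend on batch size or on the other sequences in the batch, which is what makes it convenient for recurrent layers.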

IV-B Frame Aggregation

In chronological order, the training video frames make a temporal sequence. LSTM convolutional layers are used to preserve content and spatial information. To generate more training sequences, we perform data augmentation by concatenating frames with strides 1, 2, and 3. For example, we use a skipping stride of 2 to generate a sequence of every other frame. We use a 10-frame sequence size as a sliding window technique for aggregation of the encodings. The encoder weights learned here are used to then learn 3D decoder weights to output a 3D voxel reconstruction based on audio-visual inputs from audio-augmented ModelNet with impact sound synthesis and Sound20K video.
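The strided sliding-window augmentation can be sketched as follows. The helper name `make_sequences` is hypothetical, and sliding the window forward by one frame at a time is an assumption; the text only specifies the 10-frame window and the skipping strides of 1, 2, and 3.

```python
def make_sequences(frames, window=10, strides=(1, 2, 3)):
    """Build fixed-length training sequences using several skipping strides."""
    sequences = []
    for stride in strides:
        # Number of source frames spanned by one strided window.
        span = (window - 1) * stride + 1
        for start in range(len(frames) - span + 1):
            sequences.append(frames[start:start + span:stride])
    return sequences


# Frame indices stand in for the per-frame encodings.
frames = list(range(30))
sequences = make_sequences(frames)
```

For a 30-frame clip this yields 21 sequences at stride 1, 12 at stride 2, and 3 at stride 3, so the larger strides add temporal variety at the cost of fewer windows per clip.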

IV-C Modality Fusion and 3D Decoder

After encoding our inputs with LSTM convolutional layers, we flatten to a fully connected layer for each audio and visual subnetwork. These dense layers are fused together prior to multiple Conv3D transpose layers for the 3D decoder. Prior work in multimodal deep learning, such as visual question answering systems, has merged modalities for classification tasks using addition and MFB [66]. A 3D decoder accepts the fusion of audio-visual LSTM encodings and maps it to a voxel grid with five deconvolutional layers, similar to T-L Network [16]. Unlike T-L, we use a larger voxel grid for greater resolution and apply a single, audio-based material classification to all voxels. Deconvolution, also known as fractionally-strided or transposed convolution, results in a 3D voxel shape by broadcasting the input through the kernel [67].
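The two simplest merge options and the "broadcasting through a kernel" view of transposed convolution can be illustrated with a one-dimensional toy example. All function names and values below are hypothetical, the bilinear (MFB) merge is not shown, and the 1D case is used only because it is easy to check by hand; the decoder described here operates on 3D volumes.

```python
def fuse_add(visual, audio):
    """Element-wise addition; both encodings must have the same length."""
    return [v + a for v, a in zip(visual, audio)]


def fuse_concat(visual, audio):
    """Concatenation; the fused vector is as long as both inputs combined."""
    return list(visual) + list(audio)


def conv_transpose_1d(inputs, kernel, stride=1):
    """Transposed convolution: each input value scatters a scaled kernel copy."""
    out_len = (len(inputs) - 1) * stride + len(kernel)
    output = [0.0] * out_len
    for i, value in enumerate(inputs):
        for j, weight in enumerate(kernel):
            output[i * stride + j] += value * weight
    return output


fused = fuse_add([0.5, 1.0], [0.5, 1.0])
upsampled = conv_transpose_1d(fused, kernel=[1.0, 1.0, 1.0], stride=2)
```

With stride 2 the two-element encoding grows to five outputs, and the overlap at the middle position is where contributions from neighbouring inputs sum. Stacking several such layers is how a short fused vector is expanded into a full grid.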


V Results

In this section, we present our implementation, training, and evaluation metrics along with 3D-MOV reconstructed objects (Fig. 6). Please see the accompanying supplementary materials for more comparative analysis of loss and accuracy against baseline methods by datasets and numbers of views. For each of ShapeNet and Sound20K, we evaluate the network architecture in Section IV against audio, visual, and audio-visual methods using binary cross entropy loss and intersection over union (IoU) reconstruction accuracy.

Fig. 6: Reconstructed objects using multiple frames and impact sounds. Please see our supplementary materials for a complete review of results for ShapeNet and Sound20K datasets using binary cross entropy loss and reconstruction accuracy comparing audio, visual, and audio-visual methods by number of views. Our method is able to obtain better reconstruction results for concave internal structures and scenes with multiple objects by fusing temporal audio-visual inputs.

V-A Implementation

Our framework was implemented using TensorFlow [1] and Keras [9]. Training was run on Ubuntu 16.04.6 LTS with a single Titan X GPU. Voxel representations were rendered based on Matlab visualization code from 3D-GAN [61]. From Sound20K videos, images were grayscale with dimensions 84 x 84 x 1 and audio spectrograms were 64 x 25 x 1, zero padded to equivalent dimensions. Visual data was augmented with resizing, cropping, and skipping strides.

Fig. 7: 3D-MOV-AV reconstructed image and audio inputs for single view voxelized ModelNet10 classes (top, bathtub; middle, chair; bottom, bed). These results are from fusing a single image and impact sound with an addition merge layer, training for 60 epochs on a single GPU, and using a voxel threshold of 0.4. 3D-MOV-AV performs best on ModelNet10 single views with audio augmenting visual data.

V-B Training

Since joint optimization can be difficult to perform, we train our reconstruction autoencoder and fused audio-visual networks separately and then jointly optimize to fine-tune the final network. Mean square error is used for the 2D reconstruction loss to train the encoder to reconstruct input images and audio spectrograms. Binary cross entropy loss is calculated between ground truth and reconstructed 3D voxel grids. During testing, we reconstruct from encoded vector representation of audio-visual inputs to a 3D voxel reconstruction output.
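The two losses named above can be written out as a short sketch over flattened arrays. The function names and the `eps` clipping constant are assumptions for this example; a framework's built-in losses would normally be used instead.

```python
import math


def mean_squared_error(target, prediction):
    """2D reconstruction loss over flattened pixels or spectrogram bins."""
    return sum((t - p) ** 2 for t, p in zip(target, prediction)) / len(target)


def binary_cross_entropy(occupancy, probability, eps=1e-7):
    """Voxel loss between 0/1 ground truth and predicted occupancy probability."""
    total = 0.0
    for y, p in zip(occupancy, probability):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(occupancy)


mse = mean_squared_error([0.0, 1.0], [0.5, 0.5])
bce_uncertain = binary_cross_entropy([1, 0], [0.5, 0.5])
bce_good = binary_cross_entropy([1, 0], [0.9, 0.1])
bce_bad = binary_cross_entropy([1, 0], [0.1, 0.9])
```

Cross entropy penalizes confident mistakes far more heavily than squared error does, which suits the per-voxel occupied-or-empty decision.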

Previous work has used symmetry induced volume refinement to constrain and finalize GAN volumetric outputs [41]. Other methods have used multiple views to continuously refine the output [11]. Furthermore, most adversarial generating methods create examples by perturbing existing data, limiting the solution space. Our approach constrains the space of possible 3D reconstructions for objects in the scene by temporal consistency, aggregation, and fusion of audio and visual inputs.

Dataset: ShapeNet [8]
Method               Input   1 view   5 views
3D-MOV-A (Ours)      A       21.2%    N/A
3D-MOV-V (Ours)      V       22.7%    22.5%
T-L Network [16]     AV      18.0%*   N/A
3D-MOV-AV (Ours)     AV      32.6%    31.0%

Dataset: Sound20K [68]
Method               Input   10 views
3D-MOV-A (Ours)      A       37.2%
3D-MOV-V (Ours)      V       65.7%
3D-MOV-AV (Ours)     AV      69.8%

TABLE I: 3D-MOV was evaluated for loss and reconstruction accuracy. A view consists of an image and audio frame. Very slight decreases in 3D-MOV accuracy for 1 or 5 ShapeNet views may suggest impact sounds of different hit points are needed rather than using the same sound across views. *We use the T-L Network [16] fused with audio as an overall baseline comparison. 3D-MOV-AV shows performance improvement over single-modality input on both datasets.

V-C Evaluation Metrics

Methods were evaluated against voxel Intersection-over-Union (IoU), also known as the Jaccard index [24], between the 3D reconstruction and ground truth voxels as well as cross-entropy loss. This can be represented as area of overlap divided by the area of union. More formally:

$$\mathrm{IoU} = \frac{\sum_{i,j,k} \left[ I(p_{(i,j,k)} > t)\, I(y_{(i,j,k)}) \right]}{\sum_{i,j,k} \left[ I\left( I(p_{(i,j,k)} > t) + I(y_{(i,j,k)}) \right) \right]}$$

where $y_{(i,j,k)} \in \{0, 1\}$ is the ground truth occupancy, $p_{(i,j,k)}$ the Bernoulli distribution output at each voxel, $I(\cdot)$ an indicator function, and $t$ the threshold. Higher IoU means better reconstruction quality.
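A minimal sketch of the voxel IoU computation on flattened grids is given below. The function name `voxel_iou` is hypothetical, the default threshold of 0.4 is borrowed from the Fig. 7 setting, and returning 1.0 when both grids are empty is a convention chosen for this example rather than something the text specifies.

```python
def voxel_iou(probabilities, occupancy, threshold=0.4):
    """Intersection-over-Union between thresholded predictions and ground truth."""
    intersection = 0
    union = 0
    for p, y in zip(probabilities, occupancy):
        predicted = p > threshold
        actual = bool(y)
        intersection += int(predicted and actual)
        union += int(predicted or actual)
    # Two empty grids are treated as a perfect match (an assumption).
    return intersection / union if union else 1.0


predicted_probabilities = [0.9, 0.8, 0.1, 0.6]
ground_truth = [1, 0, 1, 1]
score = voxel_iou(predicted_probabilities, ground_truth)
```

In this toy case two voxels are both predicted and occupied while all four are either predicted or occupied, so the score is 0.5. Raising the threshold trades false positives for false negatives, which is why the threshold is reported alongside each result.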

VI Conclusions

To the best of our knowledge, this work is the first method to use audio and visual inputs from ShapeNet objects and Sound20K video of multiple objects in a scene to generate 3D object reconstructions with material from video. While multi-view approaches can improve reconstruction accuracy, transparent objects, interior concave structures, self-occlusions, and multiple objects remain a challenge. As objects collide, audio provides a complementary sensory cue that can enhance the reconstruction model to improve results. In this paper, we demonstrate that augmenting image encodings with corresponding impact sounds refines the reconstructions produced by a multimodal LSTM autoencoder neural network.

Limitations: Our approach is currently implemented and evaluated with fixed-grid shapes. Further experimentation with residual architectures [20], adaptive grids, and multi-scale reasoning [13] is worth exploring, though each will introduce a different set of constraints and complexity. Material classification is predicted based on audio alone, given the textureless image renderings of the datasets used. Also, only a single material is inferred for the entire geometry rather than a per-voxel classification. Finally, the trade-off between additional views and additional auditory inputs could be further explored.

Future Work: Evaluation of other real-time object trackers, such as YOLO and Faster R-CNN, can be performed and trained on other existing datasets, such as COCO and SUN RGB-D. Further investigations can also examine how the error introduced by object tracking propagates to reconstruction error. The same applies to errors from sound source separation and to accurately associating unmixed sounds with their corresponding visual object tracks. Next, while audio helps classify the material of the reconstructed geometry, we assume a single material classification based on audio alone and apply that to all voxels. Research on classifying material per voxel using both audio and visual data could expand part segmentation research into reconstructing objects with different materials. Rather than being fully deterministic, fusing audio and visual information for generative models to reconstruct geometry and material may also be of interest to the research community, since there may be more than one possible 3D reconstruction for a given image or sound. Beyond reconstruction, audio may also enhance image and sound generation, as well as memory and attention models. For instance, image generation using an audio conditioned GAN and sound generation based on image conditioning could be explored, similar to WaveNet [57] local and global conditioning techniques. Finally, testing on real data in the wild and on larger datasets of annotated audio and visual data would enable future research.


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org External Links: Link Cited by: §V-A.
  • [2] R. Arandjelović and A. Zisserman (2017) Look, listen and learn. External Links: 1705.08168 Cited by: §II-B, §III-B.
  • [3] M. Atzmon, H. Maron, and Y. Lipman (2018) Point convolutional neural networks by extension operators. CoRR abs/1803.10091. External Links: Link, 1803.10091 Cited by: §I.
  • [4] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. External Links: 1607.06450 Cited by: §IV-A.
  • [5] T. Baltrusaitis, C. Ahuja, and L. Morency (2017) Multimodal machine learning: A survey and taxonomy. CoRR abs/1705.09406. External Links: Link, 1705.09406 Cited by: §II-B.
  • [6] D. Boscaini, J. Masci, E. Rodolà, and M. M. Bronstein (2016) Learning shape correspondence with anisotropic convolutional neural networks. CoRR abs/1605.06437. External Links: Link, 1605.06437 Cited by: §I.
  • [7] R. Cadene, H. Ben-younes, M. Cord, and N. Thome (2019) MUREL: multimodal relational reasoning for visual question answering. External Links: 1902.09487 Cited by: §II-B.
  • [8] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu (2015) ShapeNet: an information-rich 3d model repository. External Links: 1512.03012 Cited by: §III-B1, §V-B.
  • [9] F. Chollet et al. (2015) Keras. Note: https://keras.io Cited by: §V-A.
  • [10] Y. S. Chong and Y. H. Tay (2017) Abnormal event detection in videos using spatiotemporal autoencoder. External Links: 1701.01546 Cited by: §II-C.
  • [11] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese (2016) 3D-r2n2: a unified approach for single and multi-view 3d object reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §I, §II-A, §II-C, §III-B1, §IV, §V-B.
  • [12] A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm, and M. Nießner (2017) ScanComplete: large-scale scene completion and semantic segmentation for 3d scans. CoRR abs/1712.10215. External Links: Link, 1712.10215 Cited by: §II.
  • [13] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus (2015) Deep generative image models using a laplacian pyramid of adversarial networks. CoRR abs/1506.05751. External Links: Link, 1506.05751 Cited by: §VI.
  • [14] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein (2018) Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. CoRR abs/1804.03619. External Links: Link, 1804.03619 Cited by: §II-B.
  • [15] H. Fan, H. Su, and L. J. Guibas (2017) A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 605–613. Cited by: §II-A, §II-C.
  • [16] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta (2016) Learning a predictable and generative vector representation for objects. CoRR abs/1603.08637. External Links: Link, 1603.08637 Cited by: §II-A, §II-C, §IV-C, §V-B, TABLE I.
  • [17] J. Gwak, C. B. Choy, A. Garg, M. Chandraker, and S. Savarese (2017) Weakly supervised generative adversarial networks for 3d reconstruction. CoRR abs/1705.10904. External Links: Link, 1705.10904 Cited by: §II-C.
  • [18] Z. Han, X. Wang, Y. Liu, and M. Zwicker (2019) Multi-angle point cloud-vae: unsupervised feature learning for 3d point clouds from multiple angles by joint self-reconstruction and half-to-half prediction. External Links: 1907.12704 Cited by: §I.
  • [19] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis (2016) Learning temporal regularity in video sequences. External Links: 1604.04574 Cited by: §II-C.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Link, 1512.03385 Cited by: §VI.
  • [21] P. Hedman, S. Alsisan, R. Szeliski, and J. Kopf (2017) Casual 3d photography. SIGGRAPH ASIA. Cited by: §II-A.
  • [22] R. Hu, Z. Yan, J. Zhang, O. van Kaick, A. Shamir, and H. Huang (2018-07) Predictive and generative neural networks for object functionality. ACM Transactions on Graphics 37, pp. 1–13. External Links: Document Cited by: §II-A.
  • [23] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. ECCV. Cited by: §II-B.
  • [24] P. Jaccard (1901) Distribution de la flore alpine dans le bassin des drouces et dans quelques regions voisines. Cited by: §V-C.
  • [25] D. L. James, J. Barbic, and D. K. Pai (2006) Precomputed acoustic transfer: output-sensitive, accurate sound generation for geometrically complex vibration sources. In ACM Transactions on Graphics, Cited by: §III-B2.
  • [26] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell (2011) A category-level 3-d object dataset: putting the kinect to work. ICCV. Cited by: §III-A.
  • [27] J. Klemen and C. D. Chambers (2012) Current perspectives and methods in studying neural mechanisms of multisensory interactions. Neuroscience & Biobehavioral Reviews 36 (1), pp. 111 – 133. External Links: ISSN 0149-7634, Document, Link Cited by: §II-B.
  • [28] T. Konno, K. Nishida, K. Itoyama, and K. Nakadai (2020-01) Audio-visual 3d reconstruction framework for dynamic scenes. In Proceedings of the 2020 IEEE/SICE International Symposium on System Integration (SII 2020), Hawaii Convention Center, Honolulu, Hawaii, USA, pp. 802–807. External Links: Link Cited by: §II-B.
  • [29] A. Koretzky, K. R. Bokka, and N. S. Rajashekharappa (2017) Real-time adaptive audio source separation. External Links: Link Cited by: §III-B2, §III-B.
  • [30] M. A. Lee, Y. Zhu, P. Zachares, M. Tan, K. Srinivasan, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg (2019) Making sense of vision and touch: learning multimodal representations for contact-rich tasks. External Links: 1907.13098 Cited by: §II-B.
  • [31] F. Li, R. Krishna, and D. Xu (2020) Convolutional neural networks for visual recognition. External Links: Link Cited by: §IV-A.
  • [32] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. CoRR abs/1405.0312. External Links: Link, 1405.0312 Cited by: §III-A.
  • [33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2015) SSD: single shot multibox detector. CoRR abs/1512.02325. External Links: Link, 1512.02325 Cited by: §III-A.
  • [34] A. Lukezic, T. Vojír, L. Cehovin, J. Matas, and M. Kristan (2016) Discriminative correlation filter with channel and spatial reliability. CoRR abs/1611.08461. External Links: Link, 1611.08461 Cited by: §III-A.
  • [35] Z. Lun, M. Gadelha, E. Kalogerakis, S. Maji, and R. Wang (2017) 3D shape reconstruction from sketches via multi-view convolutional networks. In 3D Vision (3DV), 2017 International Conference on, pp. 67–77. Cited by: §II-A.
  • [36] X. Mao, Q. Li, and H. Xie (2017) AlignGAN: learning to align cross-domain images with conditional generative adversarial networks. CVPR. Cited by: §II-A.
  • [37] D. Maturana and S. Scherer (2015-09) VoxNet: a 3d convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 922–928. Cited by: §I.
  • [38] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. CoRR abs/1411.1784. External Links: Link, 1411.1784 Cited by: §II-A.
  • [39] M. Müller (2015) Fundamentals of music processing: audio, analysis, algorithms, applications. 1st edition, Springer Publishing Company, Incorporated. External Links: ISBN 3319219448 Cited by: §III-B.
  • [40] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng (2011) Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning, ICML’11, Madison, WI, USA, pp. 689–696. External Links: ISBN 9781450306195 Cited by: §II-B.
  • [41] C. Niu, J. Li, and K. Xu (2018) Im2Struct: recovering 3d shape structure from a single RGB image. CoRR abs/1804.05469. External Links: Link, 1804.05469 Cited by: §V-B.
  • [42] A. Odena, C. Olah, and J. Shlens (2017) Conditional image synthesis with auxiliary classifier gans. CVPR. Cited by: §II-A.
  • [43] A. Owens and A. A. Efros (2018) Audio-visual scene analysis with self-supervised multisensory features. CoRR abs/1804.03641. External Links: Link, 1804.03641 Cited by: §II-B.
  • [44] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2016) PointNet: deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593. Cited by: §I.
  • [45] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas (2016) Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5648–5656. Cited by: §II-A.
  • [46] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. External Links: 1706.02413 Cited by: §I.
  • [47] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2017) Frustum pointnets for 3d object detection from RGB-D data. CoRR abs/1711.08488. External Links: Link, 1711.08488 Cited by: §I.
  • [48] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi (2015) You only look once: unified, real-time object detection. CoRR abs/1506.02640. External Links: Link, 1506.02640 Cited by: §III-A.
  • [49] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee (2016) Learning what and where to draw. NIPS. Cited by: §II-B.
  • [50] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497. External Links: Link, 1506.01497 Cited by: §III-A.
  • [51] G. Riegler, A. O. Ulusoy, and A. Geiger (2017) OctNet: learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §II-A.
  • [52] L. Scallie, A. Koretzky, K. R. Bokka, N. S. Rajashekharappa, and L. D. Bernal (2017) Virtual music experiences. External Links: Link Cited by: §III-B2, §III-B.
  • [53] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from RGBD images. ECCV. Cited by: §III-A.
  • [54] S. Song, S. Lichtenberg, and J. Xiao (2015) SUN RGB-D: a RGB-D scene understanding benchmark suite. CVPR. Cited by: §III-A.
  • [55] A. Sterling, J. Wilson, S. Lowe, and M. C. Lin (2018) ISNN: impact sound neural network for audio-visual object classification. In Proceedings of the European Conference on Computer Vision (ECCV). Cited by: Fig. 3, §III-B1, §III-B.
  • [56] S. Tsai (2018) Customizing an adversarial example generator with class-conditional gans. CVPR. Cited by: §II-A.
  • [57] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu (2016) WaveNet: a generative model for raw audio. Cited by: §VI.
  • [58] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel recurrent neural networks. CoRR abs/1601.06759. External Links: Link, 1601.06759 Cited by: §II-C.
  • [59] Y. Wang, D. Wang, and K. Hu (2014) Real-time method for implementing deep neural network based speech separation. External Links: Link Cited by: §III-B2, §III-B.
  • [60] J. Wilson and M. C. Lin (2020) AVOT: audio-visual object tracking of multiple objects for robotics. In ICRA20. Cited by: §III-A.
  • [61] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum (2016) Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. CoRR abs/1610.07584. External Links: Link, 1610.07584 Cited by: §II-A, §V-A.
  • [62] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3D shapenets: a deep representation for volumetric shapes. In CVPR. Cited by: §III-B1.
  • [63] J. Xiao, A. Owens, and A. Torralba (2013) SUN3D: a database of big spaces reconstructed using sfm and object labels. ICCV. Cited by: §III-A.
  • [64] Y. Xie, E. Franz, M. Chu, and N. Thuerey (2018) TempoGAN: a temporally coherent, volumetric GAN for super-resolution fluid flow. CoRR abs/1801.09710. External Links: Link, 1801.09710 Cited by: §I.
  • [65] Y. Xie, E. Franz, M. Chu, and N. Thuerey (2018) TempoGAN: a temporally coherent, volumetric GAN for super-resolution fluid flow. ACM Transactions on Graphics (TOG) 37 (4), pp. 95. Cited by: §II-C.
  • [66] Z. Yu, J. Yu, J. Fan, and D. Tao (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In IEEE International Conference on Computer Vision (ICCV). Cited by: Fig. 2, §IV-C, §IV.
  • [67] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola (2020) Dive into deep learning. GitHub. Note: https://d2l.ai Cited by: §IV-C.
  • [68] Z. Zhang, J. Wu, Q. Li, Z. Huang, J. Traer, J. H. McDermott, J. B. Tenenbaum, and W. T. Freeman (2017) Generative modeling of audible shapes for object perception. In The IEEE International Conference on Computer Vision (ICCV). Cited by: §III-B2, TABLE I.
  • [69] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba (2018-09) The sound of pixels. In The European Conference on Computer Vision (ECCV). Cited by: §II-B, §III-B.
  • [70] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg (2018) Visual to sound: generating natural sound for videos in the wild. CVPR. Cited by: §II-B.