In this work we learn a transformation from a binaural sound signal to a visual scene in an end-to-end fashion using deep neural networks. Solving this challenge would benefit robot navigation and machine vision, and could even support human vision in low-light or no-light conditions. Contrary to comparable work, the trained system uses only two simple low-cost consumer-grade microphones to keep it small, mobile and easily reproducible. For 3D perception with this limited setup, we exploit the natural spectral filtering of the human pinnae by perceiving sound through an emulated human auditory system, seen in Fig. 1. After creating a dataset of synchronized binaural audio and stereo camera data, we train a generative adversarial neural network to reconstruct images from audio data alone. We find that depth maps can be reconstructed with a reasonable level of detail, showing correct layouts of office scenes. Reconstructed grayscale images show surprisingly plausible floor layouts, even though obstacles lack finer details.
While most animals and humans form images of the world primarily from visual information, some species are capable of forming images from acoustic information. Bats, for one, can sense the world using only their binaural acoustic system by continuously emitting short ultrasonic pulse trains and processing the returning echoes.
Previous work [1, 2] has shown that, using an artificial pair of bat pinnae acting as complex direction-dependent spectral filters together with head-related transfer functions (HRTFs), it is indeed possible to resolve the placement of highly reflective ultrasonic targets in space.
Likewise, humans who have suffered vision loss have been shown to develop echolocation capabilities [3, 4] using palatal clicks, much like dolphins, thus learning to distinguish reflective targets in space by listening to the returning echoes.
Inspired by echolocation found in nature, we emit frequency modulated (FM) chirps in the audible spectrum and record the returning signal with two microphones in artificial human ears. At the same time we record ground truth images of the scene ahead with a stereo camera. Given this audio-visual data, we learn generation of depth-map representations only from binaural audio. We also compare generation of grayscale images using monocular images as ground truth.
We do not expect our scene representation to provide a high level of detail but aim to generate a depth map which resolves features such as walls, furniture (rough contours), door openings and hallways correctly in azimuth, elevation and range. For a navigation system this can provide information, complementary to vision sensors, independent of light and at low additional cost. The approach is conceptually easy and can be used on embedded mobile platforms.
We also contribute a comparison of two input encodings, commonly used in the field, to map an audio signal into a rich latent feature space: raw waveforms and amplitude spectrograms. We then learn a generative model that transforms the audio features into the visual representation, to compare against the ground truth camera data. Finally, to generate more detailed and realistic-looking predictions, we expand on our method with an adversarial discriminator.
Furthermore, we are committed to giving public access to our dataset and source code concurrently with this publication. It contains about audio chirps and returning echoes, recorded in an indoor office space, synchronized and matched in time with RGB-D data.
II Related Work
Biosonar Imaging and Echolocation The works in [4, 2, 6, 7, 8] all investigate target echolocation in 2D or 3D space using ultrasonic FM chirps between . These approaches are inspired by echolocation abilities found in some animal species. Bats, for example, emit pulse trains of very short duration (typically ) and use the received echoes to perceive information about their surroundings. In [7, 4] the receiving microphones are placed in artificial bat pinnae. The natural form of the bat pinna has been shown to act as a frequency filter useful for separating spatial information in both azimuth and elevation [9, 2]. This motivates our use of short FM chirps and artificial human pinnae with integrated microphones.
In [6, 7], the objective is to autonomously drive a mobile robot while mapping and avoiding obstacles using azimuth and range information from ultrasonic sensors. The work uses echo information for binary obstacle classification, detecting whether an obstacle is a biological object (plant) or not. However, this is done only in 2D, and the retrieved information is limited to echo spectrograms or cochleograms. No further steps are taken to reconstruct the surroundings visually.
Ultrasonic echoes are recorded and dilated to be played back to a human subject in the audible spectrum. Experiments found that, after initial training, human subjects quickly picked up echolocation abilities to estimate azimuth, distance and, to some extent, elevation of targets. 3D target localization is also explored in [8, 2], where the former uses a microphone array rather than binaural audio.
End-to-end deep neural networks are trained to localize sound in images or videos. The task is to pinpoint the sources of sounds, e.g. from a piano, in the corresponding audio track. They achieve remarkable results using self-supervision and show the potential of deep learning on paired audio-visual data.
Sound source localization in  is based on the input of an acoustic camera described in : this device is a hybrid audio-visual sensor that provides RGB video, raw audio signals and a hybrid stream in which the visual and acoustic information is aligned in time and space. In all of the above-mentioned work, the sensing of sound is passive.
3D sound source localization using emulated binaural hearing is addressed in . Here a model of human ears and HRTFs are used to accurately locate sound sources. The experiments tested azimuth from with resolution and elevation from with resolution.
Finally, the work in  investigates the relationship between speech and gestures. They use a monologue of a single speaker, e.g. a newscaster, to generate motions for the speaker's arms and hands. Audio clips are translated into trajectories of joint coordinates. Our use of an audio encoder and a GAN is inspired by their work.
Acoustic Imaging Non-line-of-sight imaging is explored in . Here a microphone and speaker array is used to emit and record FM sound waves. The sound waves are within the audible spectrum, with a chirp from , and are emitted to propagate off a wall, to the hidden object, and back to the microphone array. The authors show that with a purely algorithmic approach, a hidden object can be successfully reconstructed at a resolution limited by the receiving microphone array. In contrast to this object reconstruction, we aim to capture the complete scene ahead with a system small enough to be mounted on mobile devices.
III Audio–RGB-D Dataset
In order to learn associations between active binaural audio and vision we introduce a novel dataset of synchronized echo returns, RGB images and depth maps. This section describes the dataset and the methodology to collect and prepare it.
III-A Audio-Visual Dataset
To generate our dataset we traverse the available area of the office space by fixing our robot on a trolley and pushing it around; this initially avoids data quality being degraded by motor noise. The dataset includes hallways, open areas, conference rooms and office spaces. In total our training and validation dataset contains about . One audio instance spans , contains one audio chirp and its echoes, and is paired with one time-synchronized image from the camera, see Fig. 1 and 3.
To avoid high correlation between training, validation and test data, we first split the training and validation data by location as shown in Fig. 2. This yields training and validation instances. Our test dataset totals instances and is collected by traversing another level of the same building. It presents similarly narrow hallways but a different environment in terms of layout, interior and furniture.
III-B Data Collection Details
We emit linear FM waveform chirps: the signal sweeps from within a duration of . The waveform characteristics are designed using the freely available software tool Audacity. For emitting the sound signal we use a consumer-grade JBL Flip4 Bluetooth speaker.
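For illustration, a linear FM sweep of this kind can be synthesized directly from its instantaneous phase; the sweep range, duration and sample rate below are placeholder assumptions, not the exact values used in our recordings:

```python
import numpy as np

def linear_chirp(f0, f1, duration, fs):
    """Generate a linear frequency-modulated sweep from f0 to f1 Hz."""
    t = np.arange(int(duration * fs)) / fs
    # Instantaneous phase of a linear sweep: 2*pi*(f0*t + (f1 - f0)/(2*T)*t^2)
    phase = 2 * np.pi * (f0 * t + (f1 - f0) / (2 * duration) * t ** 2)
    return np.sin(phase)

# Placeholder parameters: full audible sweep, 3 ms chirp, 44.1 kHz sampling.
chirp = linear_chirp(f0=20.0, f1=20_000.0, duration=0.003, fs=44_100)
```

The same waveform could equivalently be exported from Audacity; generating it programmatically simply makes the sweep parameters explicit.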
Acting as antennae (ears), we employ two low-cost consumer-grade omni-directional USB Lavalier MAONO AU-410 microphones. Each microphone is mounted in a Soundlink silicone ear to effectively emulate an artificial human auditory system. For our experiments we record using PyAudio for Python with a sampling frequency of and bits per sample.
To pair the audio data with the perceived visual scene, we use a ZED stereo camera to record RGB-D data. Camera, speaker and artificial human pinnae are mounted on a small mobile robot as shown in Fig. 1. The microphones in the pinnae are mounted approximately apart.
III-C Data Preparation
We choose the length of the audio samples to include echoes traveling up to . This balances a trade-off between receiving echoes from a relevant distance and excluding echoes that are reflected multiple times and thus have a longer travel time (multipath effects). To this end, we extract audio windows of samples (corresponding to ) from the data.
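The window length follows from the round-trip travel time of sound: an echo from distance d arrives 2d/c seconds after chirp onset. A minimal sketch, with the maximum range and sample rate as assumed example values:

```python
C = 343.0  # approximate speed of sound in air at room temperature, m/s

def window_samples(max_range_m, fs):
    """Number of samples needed to capture echoes up to max_range_m."""
    round_trip_s = 2.0 * max_range_m / C  # out to the target and back
    return int(round(round_trip_s * fs))

# Assumed example: echoes from up to 10 m, sampled at 44.1 kHz.
n = window_samples(max_range_m=10.0, fs=44_100)
```

Longer windows capture more distant echoes but also admit more multipath reflections, which is exactly the trade-off described above.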
The extracted audio windows are represented in two different forms: as raw waveforms and as amplitude spectrograms. The LibROSA library for Python is used to compute spectrograms with points for the FFT and a Hanning window of length . Fig. 3 shows a raw waveform and its corresponding amplitude spectrogram. The spectrogram shows the emitted chirp starting at and the returning echoes at .
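As a rough illustration of this representation, an amplitude spectrogram can be computed as the magnitude of a short-time Fourier transform over Hanning-windowed frames; the FFT size, hop length and stand-in signal below are illustrative assumptions, not the parameters used with LibROSA:

```python
import numpy as np

def amplitude_spectrogram(x, n_fft=512, hop=256):
    """Magnitude STFT: rows are frequency bins, columns are time frames."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    # rfft of each frame gives n_fft // 2 + 1 non-negative frequency bins
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

# Stand-in audio window (noise); a real window would contain chirp + echoes.
audio_window = np.random.default_rng(0).standard_normal(4096)
spec = amplitude_spectrogram(audio_window)
```

In the resulting array the emitted chirp appears as a diagonal ridge (frequency rising over time), with echoes as fainter delayed copies.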
Depth maps are computed using the API of the stereo camera and are further normalized to values between 0 and 1. Measurements above and including are clipped to 1. Pixels where the camera is unable to produce a valid range measurement are set to 0.
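This normalization can be sketched as follows; the 10 m clipping distance is an assumed example value used only for illustration:

```python
import numpy as np

def normalize_depth(depth_m, clip_m=10.0):
    """Scale valid depth to [0, 1]; invalid pixels -> 0, far pixels -> 1."""
    # Pixels without a valid stereo measurement (NaN/inf) are set to 0.
    out = np.nan_to_num(depth_m, nan=0.0, posinf=0.0, neginf=0.0)
    # Distances at or beyond clip_m saturate at 1.
    return np.clip(out / clip_m, 0.0, 1.0)

d = np.array([[np.nan, 2.5],
              [12.0, 10.0]])
nd = normalize_depth(d)  # invalid -> 0.0, 2.5 m -> 0.25, >= 10 m -> 1.0
```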
IV Audio to Visual Transformation
The input is first processed by an audio encoder after which a generator creates images from the latent audio feature representation. We further expand generation of images with an adversarial discriminator and contrast against results without it. The complete pipeline including the adversarial discriminator is shown in Fig. 4. The following subsections describe each component of our model in detail.
We note that in our experiments using spectrograms yields slightly better results than raw waveforms, but as we aim for a real-time capable system on embedded platforms, we focus on results achieved using the computationally cheaper raw audio waveforms. However, for comparison we report on both approaches throughout this work.
IV-A Audio Encoders
We present in the following two encoding alternatives: raw audio waveforms and spectrograms.
Waveforms Our waveform audio encoder is inspired by SoundNet . As shown in Fig. 5 we input the two audio waveforms (binaural signal) and concatenate them along the channel dimension (early fusion). A series of 8 temporal convolutions then downsamples the signal into a final 1024-dimensional feature vector. Details of our audio encoder are summarized in Table I.
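The effect of such strided temporal convolutions on the sequence length can be sketched with the standard output-size formula; the window length, filter width, stride and padding below are illustrative assumptions, not the configuration of Table I:

```python
def conv1d_out_len(n, kernel, stride, padding=0):
    """Output length of a 1-D convolution (floor division, PyTorch-style)."""
    return (n + 2 * padding - kernel) // stride + 1

# Assumed example: a 16384-sample window through eight stride-2 convolutions,
# each of which halves the temporal resolution.
n = 16_384
for layer in range(8):
    n = conv1d_out_len(n, kernel=8, stride=2, padding=3)
print(n)  # 64: the time axis shrinks by a factor of 2**8 = 256
```

The remaining short sequence is what the final layers collapse into the 1024-dimensional feature vector.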
Spectrograms Analogous to the raw waveform, we downsample the time-dimension of the spectrograms by a series of convolutions to the point where the time dimension is 1. The output dimension of the spectrogram encoder is where is the frequency axis (y-axis of the spectrogram). This is dependent on the downsampling factor, i.e. the parameters used for the convolutions.
IV-B Image Generation
The objective of the generator is to learn a mapping from latent audio features to a visual representation of the scene as perceived by the stereo camera. When processing raw waveforms, we found simple upscaling from the audio feature vector to yield the best results. When processing spectrograms, we found a UNet  encoder-decoder style network to yield the best results.
For comparison, we investigated several resolutions for reconstructed images from to .
UNet To transform the output of the audio encoder into a representation suited for a UNet style network, the 1024-dimensional feature vector is reshaped into a tensor. In the case of the spectrogram encoder, where the output is , we employ a series of two densely connected linear layers of size 1024 and then reshape into the above-mentioned tensor shape. The output of this generator depends on the target resolution, e.g. .
With this network, we downsample the input through a series of layers combining double convolutions with batch normalization and ReLU non-linearities following each convolution. The upsampling layers employ a similar series, with the first operation being a de-convolution (to up-sample) rather than a convolution.
Direct upsampling This casts the 1024-dimensional feature vector as a tensor and then employs a series of upsampling layers (see UNet above) to reach the target resolution. The layer configuration for a output is summarized in Table II.
Table II: Layer | # of Filters | Filter size | Stride | Padding | Res.
IV-C Adversarial Discriminator
To generate more detailed and realistic predictions from our generator, we add an adversarial discriminator , conditioned on the difference between the output of the generator and the ground truth collected from the stereo camera. As in , we implement the discriminator as a PatchGAN to penalize structure at the scale of patches and locally enforce reconstruction of high-frequency structure. Hence, our discriminator tries to classify whether each patch is a ground truth sample or generated. We model our discriminator as a series of convolutions with characteristics dependent on the generator's output resolution. We follow the convention in  of having each predicted patch correspond to a receptive field of approximately of the input size. Our discriminator for a configuration is summarized in Table III. Here the output is and each patch has a receptive field of .
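The receptive field of one output patch follows directly from the stacked convolution parameters. A small sketch, using the widely used 70×70 PatchGAN layer configuration as an assumed example (not our Table III configuration):

```python
def receptive_field(layers):
    """Receptive field of one output unit for stacked convolutions.

    layers: list of (kernel, stride) pairs, first layer first.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= stride             # stride compounds the input-pixel spacing
    return rf

# Three 4x4 stride-2 convolutions followed by two 4x4 stride-1 layers:
rf = receptive_field([(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)])
print(rf)  # 70, the classic 70x70 PatchGAN receptive field
```

Adjusting kernel sizes and strides in this way is how the patch size is matched to a chosen fraction of the generator's output resolution.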
Table III: Layer | # of Filters | Filter size | Stride | Padding
V-A Generator Only - Pre-Study
In a preliminary study, where we reconstruct small images of resolution, we found that early fusion (concatenation at the input) of the input signals outperforms late fusion (concatenation at Conv8, see Fig. 5) when using raw waveforms. In addition we found that using spectrograms as input yields a slightly better loss than using raw waveforms. However, as stated in Section IV, we aim for real-time capabilities on embedded platforms, and as such our main focus is on the least computationally expensive method.
The output of the network is compared with the visual ground truth via a regression loss:
where is the left and right audio waveforms or spectrograms, is the ground truth from the stereo camera, is the audio encoder and is the generator.
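Assuming an L1 penalty and writing a for the binaural input, y for the camera ground truth, f for the audio encoder and G for the generator (these symbol choices are our illustration, not necessarily the original notation), such a regression loss can take the form:

```latex
\mathcal{L}_{\mathrm{reg}}(f, G) =
  \mathbb{E}_{(a,\,y)}\left[\,\lVert y - G(f(a)) \rVert_{1}\,\right]
```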
For these experiments we use a batch size of , the Adam solver  with an initial learning rate set to and parameters and set to and respectively. All ReLUs are leaky with slope .
When using raw waveforms, direct upsampling with early fusion performs best (Table IV). For spectrograms, early fusion, downsampling to , and the UNet style generator perform best. We also compare the mean depth map of the training set with the test set reconstructions and with random inputs drawn from a uniform distribution on the interval . These two best configurations are retrained for output dimensions of , and , where we find a moderately higher loss for larger depth maps (Table V). Reconstruction quality for these resolutions differs and is compared in detail in Fig. 6. More samples for in Fig. 7 show the reconstruction of diverse scenes.
Table: Model | Waveform (D. Upsampling) | Spectrogram (UNet Style)
V-B Generative Adversarial Network
The use of an adversarial discriminator qualitatively improves the depth maps with more accurate details, even though the average loss increases slightly. Furthermore, based on the audio information, we reconstruct grayscale images which show an approximation of what the room layout could look like. Even though features of visual appearance are not present in the audio signal, we achieve plausible floor and wall layouts using the discriminator.
As proposed in , we use a least-squares loss rather than the traditional sigmoid cross-entropy loss function. This avoids vanishing gradients, which would consequently saturate learning. The GAN loss therefore becomes:
Our full objective therefore is:
where is a scaling factor. For these experiments we use , batch size , the Adam solver with learning rate set to and parameters and set to and respectively. All ReLUs are leaky with slope .
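For reference, the least-squares GAN objective of Mao et al. and a combined objective with a regression term can be written as follows; the symbols (a for the audio input, y for the ground truth, f for the encoder, G for the generator, D for the discriminator, λ for the scaling factor) are our illustrative notation:

```latex
\mathcal{L}_{\mathrm{GAN}}(G, D) =
  \tfrac{1}{2}\,\mathbb{E}_{y}\!\left[\left(D(y) - 1\right)^{2}\right]
  + \tfrac{1}{2}\,\mathbb{E}_{a}\!\left[D\!\left(G(f(a))\right)^{2}\right],
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{GAN}} + \lambda\,\mathcal{L}_{\mathrm{reg}}
```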
From an empirical study of the test results we find that our models reconstruct depth maps, using only two microphones, to a remarkable level of visual accuracy. We obtain disparity-like depth maps showing detailed room depth and obstacles such as walls and furniture. We even outperform our ground truth in some cases where the depth-from-stereo algorithm struggles to estimate disparity. This can be seen in Fig. 7, third row, where the ground truth grayscale shows the true room layout and the GAN captures the depth best. Corridors and open spaces can be distinguished and obstacles are visible, even though fine details are still difficult to capture.
Generating grayscale images is a more difficult task, and the amount of detail and information required is not expected to be present in the echo returns. Highly interesting, however, is the ability of the trained model to generate plausible “free” floor areas and place “walls” with seemingly good performance. Objects are not recognizable, but lacking further information, the network correctly places an approximation of where obstacles are. Note that this is achieved without training on any depth-related ground truth, i.e. with monocular grayscale images only.
VI-A Current Limits of the Approach
How sound resonates, propagates and reflects in a room highly impacts performance. Some materials have dampening properties, leading to weak (or completely absorbed) echoes. Facing corners, where hallways fork in different directions, pose a challenge because sound waves scatter off both sides of the corner. At short ranges (<), multipath echoes returning at the same time with similar amplitudes can create a superposition that is also difficult to resolve.
In areas with dense obstacles, such as conference rooms with office chairs, the model often fails to predict meaningful content. The examples in Fig. 8 may show the limits of reconstruction performance achievable with binaural microphones.
Finally, we fitted all sensors on a mobile robot to collect data from a perspective that will enable driving in the future, but we did not yet use the robot's own motors, in order to minimize audible noise.
-  F. Schillebeeckx, F. De Mey, D. Vanderelst, and H. Peremans, “Biomimetic sonar: Binaural 3d localization using artificial bat pinnae,” I. J. Robotic Res., vol. 30, pp. 975–987, 07 2011.
-  I. Matsuo, J. Tani, and M. Yano, “A model of echolocation of multiple targets in 3d space from a single emission,” The Journal of the Acoustical Society of America, vol. 110, no. 1, pp. 607–624, 2001. [Online]. Available: https://doi.org/10.1121/1.1377294
-  R. Kuc and V. Kuc, “Modeling human echolocation of near-range targets with an audible sonar,” The Journal of the Acoustical Society of America, vol. 139, pp. 581–587, 02 2016.
-  J. Sohl-Dickstein, S. Teng, B. Gaub, C. C. Rodgers, C. Li, M. R. DeWeese, and N. S. Harper, “A device for human ultrasonic echolocation,” IEEE transactions on bio-medical engineering, vol. 62, 01 2015.
-  I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14. Cambridge, MA, USA: MIT Press, 2014, pp. 2672–2680. [Online]. Available: http://dl.acm.org/citation.cfm?id=2969033.2969125
-  I. Eliakim, Z. Cohen, G. Kósa, and Y. Yovel, “A fully autonomous terrestrial bat-like acoustic robot,” PLOS Computational Biology, vol. 14, p. e1006406, 09 2018.
-  J. Steckel and H. Peremans, “Batslam: Simultaneous localization and mapping using biomimetic sonar,” PloS one, vol. 8, p. e54076, 01 2013.
-  B. Fontaine, H. Peremans, and J. Steckel, “3d sparse imaging in biosonar scene analysis,” 04 2009.
-  J. M. Wotton and J. A. Simmons, “Spectral cues and perception of the vertical position of targets by the big brown bat, eptesicus fuscus,” The Journal of the Acoustical Society of America, vol. 107, no. 2, pp. 1034–1041, 2000. [Online]. Available: https://doi.org/10.1121/1.428283
-  Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, “Audio-visual event localization in the wild,” Proc. CVPR Workshop: Sight and Sound, 06 2019.
-  A. Senocak, T. Oh, J. Kim, M. Yang, and I. S. Kweon, “Learning to localize sound source in visual scenes,” CoRR, vol. abs/1803.03849, 2018. [Online]. Available: http://arxiv.org/abs/1803.03849
-  A. F. Pérez, V. Sanguineti, P. Morerio, and V. Murino, “Audio-visual model distillation using acoustic images,” 04 2019.
-  A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” ACM Trans. Graph., vol. 37, no. 4, pp. 112:1–112:11, July 2018. [Online]. Available: http://doi.acm.org/10.1145/3197517.3201357
-  A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” arXiv preprint arXiv:1804.03641, 2018.
-  A. Zunino, M. Crocco, S. Martelli, A. Trucco, A. Del Bue, and V. Murino, “Seeing the sound: A new multimodal imaging device for computer vision,” 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), 12 2015.
-  F. Keyrouz and K. Diepold, “An enhanced binaural 3d sound localization algorithm,” in 2006 IEEE International Symposium on Signal Processing and Information Technology, Aug 2006, pp. 662–665.
-  S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik, “Learning individual styles of conversational gesture,” in Computer Vision and Pattern Recognition (CVPR). IEEE, June 2019.
-  D. B. Lindell, G. Wetzstein, and V. Koltun, “Acoustic non-line-of-sight imaging,” Proc. CVPR, 2019.
-  Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: Learning sound representations from unlabeled video,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 892–900. [Online]. Available: http://papers.nips.cc/paper/6146-soundnet-learning-sound-representations-from-unlabeled-video.pdf
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds. Cham: Springer International Publishing, 2015, pp. 234–241.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations, 12 2014.
-  X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, “Least squares generative adversarial networks,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 2813–2821.