BatVision: Learning to See 3D Spatial Layout with Two Ears

12/15/2019
by   Jesper Haahr Christensen, et al.

Virtual camera images showing the correct layout of a space ahead can be generated by purely listening to the reflections of chirping sounds. Many species evolved sophisticated non-visual perception while artificial systems fall behind. Radar and ultrasound are used where cameras fail, but provide very limited information or require large, complex and expensive sensors. Yet sound is used effortlessly by dolphins, bats, whales and humans as a sensor modality with many advantages over vision. However, it is challenging to harness useful and detailed information for machine perception. We train a network to generate representations of the world in 2D and 3D only from sounds, sent by one speaker and captured by two microphones. Inspired by examples from nature, we emit short frequency modulated sound chirps and record returning echoes through a pair of artificial human pinnae. We then learn to generate disparity-like depth maps and grayscale images from the echoes in an end-to-end fashion. With only low-cost equipment, our models show good reconstruction performance while being robust to errors and even overcoming limitations of our vision-based ground truth. Finally, we introduce a large dataset consisting of binaural sound signals synchronised in time with both RGB images and depth maps.

I Introduction

In this work we learn a transformation from a binaural sound signal to a visual scene in an end-to-end fashion using deep neural networks. Solving this challenge would benefit robot navigation and machine vision, or even support human vision in low-light or no-light conditions. Contrary to comparable work, the trained system uses only two simple low-cost consumer-grade microphones to keep it small, mobile and easily reproducible. For 3D perception with this limited system, we make use of the natural spectral filters originating from the human pinnae by perceiving sound through an emulated human auditory system, seen in Fig. 1. After creating a dataset of synchronized binaural audio and stereo camera data, we train a generative adversarial network to reconstruct images from audio data alone. We find that depth maps can be reconstructed with a reasonable level of detail, showing correct layouts of office scenes. Reconstructed grayscale images show surprisingly plausible floor layouts even though obstacles lack finer details.

While most animals and humans form images of the world based largely on visual information, some species are capable of forming images based on acoustic information. Bats, for one, have the ability to sense the world using only their binaural acoustic system by continuously emitting short ultrasonic pulse trains and processing the returning echoes.

Previous work [1, 2] has shown that, using an artificial pair of bat pinnae acting as complex direction-dependent spectral filters together with head-related transfer functions (HRTFs), it is indeed possible to resolve the positions of highly reflective ultrasonic targets in space.

Fig. 1: Overview of the proposed method. Emitted sound chirps, reflecting off the environment, are passed through a network to generate plausible depth maps or grayscale images. Our mobile robot platform carries all required hardware but is moved passively at this stage to avoid motor noise.

Likewise, humans who have suffered vision loss have been shown to develop echolocation capabilities [3, 4], using palatal clicks much like dolphins and learning to distinguish reflective targets in space by listening to the returning echoes.

Inspired by echolocation found in nature, we emit frequency modulated (FM) chirps in the audible spectrum and record the returning signal with two microphones placed in artificial human ears. At the same time we record ground truth images of the scene ahead with a stereo camera. Given this audio-visual data, we learn to generate depth-map representations from binaural audio alone. We also compare this with generating grayscale images, using monocular images as ground truth.

We do not expect our scene representation to provide a high level of detail, but aim to generate a depth map which resolves features such as walls, furniture (rough contours), door openings and hallways correctly in azimuth, elevation and range. For a navigation system, this can provide information complementary to vision sensors, independent of lighting and at low additional cost. The approach is conceptually simple and can be used on embedded mobile platforms.

We also contribute a comparison of two input encodings, commonly used in the field, to map an audio signal into a rich latent feature space: raw waveforms and amplitude spectrograms. We then learn a generative model that transforms the audio features into the visual representation, to compare against the ground truth camera data. Finally, to generate more detailed and realistic looking predictions, we expand on our method with an adversarial discriminator [5].

Furthermore, we are committed to giving public access to our dataset and source code concurrently with this publication. It contains about audio chirps and returning echoes, recorded in an indoor office space, synchronized and matched in time with RGB-D data.

II Related Work

Biosonar Imaging and Echolocation  The works in [4, 2, 6, 7, 8] all investigate target echolocation in 2D or 3D space using ultrasonic FM chirps between . These approaches are inspired by echolocation abilities found in some animal species. Bats, for example, emit pulse trains of very short durations (typically ) and use the received echoes to perceive information about their surroundings. In [7, 4], the receiving microphones are placed in artificial bat pinnae. The natural form of the bat pinnae has been shown to act as a frequency filter useful for separating spatial information in both azimuth and elevation [9, 2]. This motivates our use of short FM chirps and artificial human pinnae with integrated microphones.

In [6, 7], the objective is to autonomously drive a mobile robot while mapping and avoiding obstacles using azimuth and range information from ultrasonic sensors. The work uses echo information for binary obstacle classification, detecting whether an obstacle is a biological object (plant) or not. However, this is only done in 2D and the retrieved information is limited to echo spectrograms or cochleograms. No further steps are taken to reconstruct the surroundings visually.

In [4], ultrasonic echoes are recorded and dilated to be played back to a human subject in the audible spectrum. Experiments found that after initial training, human subjects quickly picked up echolocation abilities to estimate azimuth, distance and, to some extent, elevation of targets. 3D target localization is also explored in [8, 2], where the former uses a microphone array rather than binaural audio.

Sound Source Localization  In more recent work [10, 11, 12, 13, 14], end-to-end deep neural networks are trained to localize sound in images or videos. The task is to pinpoint in the image the sources of sounds, e.g. a piano, heard in the corresponding audio track. They achieve remarkable results using self-supervision and show the potential of deep learning on paired audio-visual data.

Sound source localization in [12] is based on the input of an acoustic camera described in [15] – this device is a hybrid audio-visual sensor that provides RGB video, raw audio signals and a hybrid representation in which the visual and acoustic information is aligned in time and space. In all of the above-mentioned work, sound sensing is passive.

3D sound source localization using emulated binaural hearing is addressed in [16]. Here a model of human ears and HRTFs is used to accurately locate sound sources. The experiments tested azimuth from with resolution and elevation from with resolution.

Finally, the work in [17] investigates the relationship between speech and gestures. They use a monologue of a single speaker, e.g. a newscaster, to generate motions for the speaker’s arms and hands. Audio clips are translated into trajectories of joint coordinates. Our combination of an audio encoder and a GAN is inspired by their work.

Acoustic Imaging  Non-line-of-sight imaging is explored in [18]. Here a microphone and speaker array is used to emit and record FM sound waves. The sound waves are within the audible spectrum, with a chirp from , and are emitted to propagate to a wall, to the hidden object and back to the microphone array. The authors show that, with a purely algorithmic approach, a hidden object can be successfully reconstructed at a resolution limited by the receiving microphone array. In contrast to this object reconstruction, we aim to capture the complete scene of the area ahead with a system small enough to be mounted on mobile devices.

III Audio–RGB-D Dataset

In order to learn associations between active binaural audio and vision we introduce a novel dataset of synchronized echo returns, RGB images and depth maps. This section describes the dataset and the methodology to collect and prepare it.

III-A Audio-Visual Dataset

To generate our dataset we traverse the available area of the office space by fixing our robot on a trolley and pushing it around; this initially avoids data quality being degraded by motor noise. The dataset includes hallways, open areas, conference rooms and office spaces. In total, our training and validation dataset contains about . One audio instance spans , contains one audio chirp and its echoes, and is paired with one time-synchronized image from the camera, see Fig. 1 and 3.

To avoid high correlation between training, validation and test data, we first split the training and validation data by location, as shown in Fig. 2. This yields training and validation instances. Our test dataset totals instances and is collected by traversing another level of the same building. It presents similar narrow hallways but a different environment in terms of layout, interior and furniture.

Fig. 2: Samples of Our Dataset. Training and validation data was collected in separate regions of the same floor. Test data was recorded on another floor and differs mostly in present obstacles.

III-B Data Collection Details

We emit linear FM waveform chirps: signal sweeps from within a duration of . The waveform characteristics are designed using the freely available software tool Audacity. For emitting the sound signal we use a consumer grade JBL Flip4 Bluetooth speaker.
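
For illustration, a minimal sketch of generating such a linear FM chirp in Python is shown below. The sweep range, duration and sample rate are placeholders (the exact values are not reproduced in this text), and writing a WAV file for playback through the speaker is an assumed workflow, not necessarily the authors' exact one.

```python
import numpy as np
from scipy.signal import chirp
from scipy.io import wavfile

# Placeholder parameters: the paper's exact sweep range, duration and
# sample rate are not shown above.
fs = 44100                 # assumed sample rate in Hz
duration = 0.003           # hypothetical chirp duration in seconds
f0, f1 = 20.0, 20000.0     # hypothetical start/end sweep frequencies in Hz

t = np.linspace(0.0, duration, int(fs * duration), endpoint=False)
signal = chirp(t, f0=f0, f1=f1, t1=duration, method="linear")  # linear FM sweep

# Save as 16-bit WAV so it can be played back through a consumer speaker.
wavfile.write("chirp.wav", fs, (signal * 32767).astype(np.int16))
```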

Acting as antennae (ears), we employ two low-cost consumer-grade omni-directional USB Lavalier MAONO AU-410 microphones. Each microphone is mounted in a Soundlink silicone ear to effectively emulate an artificial human auditory system. For our experiments we record using PyAudio for Python with a sampling frequency of and bits per sample.
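
A rough sketch of capturing one echo window from a single microphone with PyAudio is given below. The sampling rate, window length and implicit device selection are placeholders; in the actual setup one such stream per ear would be used and the two channels kept synchronized.

```python
import numpy as np
import pyaudio

RATE = 44100             # assumed sampling rate in Hz (exact value not shown above)
CHUNK = 1024             # frames per buffer
WINDOW_SECONDS = 0.05    # hypothetical recording window per chirp

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)

frames = []
for _ in range(int(RATE / CHUNK * WINDOW_SECONDS)):
    frames.append(stream.read(CHUNK))    # read raw 16-bit samples

stream.stop_stream()
stream.close()
pa.terminate()

# Convert the raw bytes to a numpy array of 16-bit samples.
waveform = np.frombuffer(b"".join(frames), dtype=np.int16)
```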

To pair the audio data with the perceived visual scene, we use a ZED stereo camera to record RGB-D data. Camera, speaker and artificial human pinnae are mounted on a small mobile robot as shown in Fig. 1. The microphones in the pinnae are mounted approximately apart.

III-C Data Preparation

We choose the length of the audio samples to include echoes traveling up to . This balances a trade-off between receiving echoes from a relevant distance and reducing echoes that are reflected multiple times and thus have a longer travel time (multipath effects). To this end, we extract audio windows of samples (corresponding to ) from the data.
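
The window length follows from the round-trip travel time of sound over the chosen maximum range. A small worked sketch is shown below; the maximum range and sampling rate are placeholders, since the exact values are not reproduced above.

```python
# Sketch: choosing the audio window length for a maximum echo range.
SPEED_OF_SOUND = 343.0   # m/s at room temperature
d_max = 5.0              # hypothetical maximum range of interest in metres
fs = 44100               # hypothetical sampling rate in Hz

round_trip_time = 2.0 * d_max / SPEED_OF_SOUND      # out-and-back travel time in seconds
window_samples = int(round(round_trip_time * fs))   # samples to keep after the chirp onset
print(window_samples)
```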

The extracted audio windows are represented in two different forms: as raw waveforms and as amplitude spectrograms. The LibROSA library for Python is used to compute spectrograms with points for the FFT and a Hanning window of length . Fig. 3 shows a raw waveform and its corresponding amplitude spectrogram. The spectrogram shows the emitted chirp starting at and the returning echoes at .
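
A minimal sketch of this spectrogram computation with LibROSA is shown below. The FFT size, window length and file name are placeholders, since the exact values are not given in this text.

```python
import numpy as np
import librosa

# Load one extracted echo window (hypothetical file name), keeping its native rate.
waveform, sr = librosa.load("echo_window.wav", sr=None, mono=True)

n_fft = 1024          # hypothetical number of FFT points
win_length = 256      # hypothetical Hann window length

stft = librosa.stft(waveform, n_fft=n_fft, win_length=win_length,
                    hop_length=win_length // 4, window="hann")
amplitude_spectrogram = np.abs(stft)   # magnitude only, as used for the encoder input
```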

Depth maps are computed using the API of the stereo camera and are further normalized to values between 0 and 1. Measurements above and including are clipped to 1. Pixels where the camera is unable to produce a valid range measurement are set to 0.
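
A small sketch of this preparation step is given below, assuming the clipping range is a placeholder value and that invalid pixels arrive as NaN/inf from the camera API (an assumption about the data format).

```python
import numpy as np

def prepare_depth(depth: np.ndarray, max_range: float = 10.0) -> np.ndarray:
    """Normalize a metric depth map to [0, 1]; max_range here is a placeholder."""
    out = depth / max_range              # scale so that max_range maps to 1
    out = np.clip(out, 0.0, 1.0)         # clip everything at/above max_range to 1
    out[~np.isfinite(depth)] = 0.0       # invalid measurements set to 0
    return out
```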

Fig. 3: Audio data sample from a single microphone. Left: Raw audio waveform. Chirp at , echoes appear after. Right: Amplitude spectrogram.

IV Audio to Visual Transformation

The input is first processed by an audio encoder after which a generator creates images from the latent audio feature representation. We further expand generation of images with an adversarial discriminator and contrast against results without it. The complete pipeline including the adversarial discriminator is shown in Fig. 4. The following subsections describe each component of our model in detail.

We note that in our experiments spectrograms yield slightly better results than raw waveforms, but as we aim for a real-time capable system on embedded platforms, we focus on results achieved with the less computationally expensive raw audio waveforms. However, for comparison we report on both approaches throughout this work.

Fig. 4: Binaural audio to depth map translation model. A temporal convolutional audio encoder downsamples the input to an audio feature vector. The generator then predicts a corresponding depth map for the current visual scene. The discriminator enforces local reconstruction of high-frequency structure at the scale of patches.

IV-A Audio Encoders

We present in the following two encoding alternatives: raw audio waveforms and spectrograms.

Waveforms  Our waveform audio encoder is inspired by SoundNet [19]. As shown in Fig. 5, we take the two audio waveforms (binaural signal) as input and concatenate them along the channel dimension (early fusion). A series of 8 temporal convolutions then downsamples the signal into a final 1024-dimensional feature vector. The details of our audio encoder are summarized in Table I.

Fig. 5: Waveform audio encoder. A number of convolutions on the left and right audio waveform reduce the temporal dimension and output a 1024-dimensional feature vector.

Spectrograms  Analogous to the raw waveforms, we downsample the time dimension of the spectrograms by a series of convolutions to the point where the time dimension is 1. The output dimension of the spectrogram encoder is , where is the frequency axis (y-axis of the spectrogram). This depends on the downsampling factor, i.e. the parameters used for the convolutions.

Layer # of Filters Filter size Stride Padding
Conv1 32 228 2 114
Conv2 64 128 3 64
Conv3 128 64 3 32
Conv4 256 32 3 16
Conv5 256 16 3 8
Conv6 512 8 3 4
Conv7 512 4 3 2
Conv8 1024 3 3 1
TABLE I: The layer configuration of the waveform audio encoder.
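
For concreteness, a minimal PyTorch sketch following the layer configuration in Table I is given below. Channel counts, kernel sizes, strides and paddings are taken from the table; the leaky-ReLU slope, the absence of normalization layers and the final temporal pooling are assumptions for illustration.

```python
import torch
import torch.nn as nn

# (out_channels, kernel, stride, padding) per row of Table I.
CFG = [
    (32, 228, 2, 114), (64, 128, 3, 64), (128, 64, 3, 32), (256, 32, 3, 16),
    (256, 16, 3, 8), (512, 8, 3, 4), (512, 4, 3, 2), (1024, 3, 3, 1),
]

class WaveformEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 2                    # left + right waveform, early fusion
        for out_ch, k, s, p in CFG:
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=k, stride=s, padding=p),
                       nn.LeakyReLU(0.2, inplace=True)]   # slope 0.2 is an assumption
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, x):                        # x: (batch, 2, num_samples)
        h = self.net(x)                          # (batch, 1024, T')
        return h.mean(dim=-1)                    # pool remaining time axis -> (batch, 1024)
```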

IV-B Image Generation

The objective of the generator is to learn a mapping from latent audio features to a visual representation of the scene as perceived by the stereo camera. When processing raw waveforms, we found simple upscaling from the audio feature vector to yield the best results. When processing spectrograms, we found a UNet [20] encoder-decoder style network to yield the best results. For comparison, we investigated several resolutions for the reconstructed images, from to .

UNet  To transform the output of the audio encoder into a representation suited for a UNet style network, the 1024-dimensional feature vector is reshaped into a tensor. In the case of the spectrogram encoder, where the output is , we employ a series of two densely connected linear layers of size 1024 and then reshape into the tensor shape mentioned above. The output of this generator depends on the target resolution, e.g. .

With this network, we downsample the input through a series of layers combining double convolutions with batch normalization and ReLU non-linearities following each convolution. The upsampling layers employ a similar series, with the first operation being a de-convolution (to upsample) rather than a convolution.
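
A minimal sketch of such blocks is shown below; the kernel sizes, strides and paddings are assumptions chosen to match the described pattern (double convolution with batch norm and ReLU, and an up-block that starts with a transposed convolution).

```python
import torch.nn as nn

def double_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    # Two conv -> batch norm -> ReLU stages, as described for the downsampling path.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

def up_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # Upsampling block: a transposed convolution first, then the same double conv.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        double_conv(out_ch, out_ch),
    )
```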


Direct upsampling  This generator casts the 1024-dimensional feature vector as a tensor and then employs a series of upsampling layers (see UNet above) to reach the target resolution. The layer configuration for a output is summarized in Table II.

Layer # of Filters Filter size Stride Padding Res.
Up1 512 4 1 0 4
Up2 512 4 2 1 8
Up3 256 4 2 1 16
Up4 128 4 2 1 32
Up5 128 4 2 1 64
Up6 64 4 2 1 128
Final 1 1 1 0 128
TABLE II: The layer configuration of the direct upsampling generator for output.
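
A PyTorch sketch following Table II is given below (128x128 output implied by the resolution column). The filter counts, kernel sizes, strides and paddings come from the table; the batch normalization, ReLU placement and final sigmoid are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DirectUpsamplingGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        cfg = [  # (out_ch, kernel, stride, padding) per Up-row of Table II
            (512, 4, 1, 0), (512, 4, 2, 1), (256, 4, 2, 1),
            (128, 4, 2, 1), (128, 4, 2, 1), (64, 4, 2, 1),
        ]
        layers, in_ch = [], 1024
        for out_ch, k, s, p in cfg:
            layers += [nn.ConvTranspose2d(in_ch, out_ch, k, s, p),
                       nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
            in_ch = out_ch
        # "Final" row: a single-channel 1x1 convolution; sigmoid output is an assumption.
        layers += [nn.Conv2d(in_ch, 1, kernel_size=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, audio_feat):               # audio_feat: (batch, 1024)
        x = audio_feat.view(-1, 1024, 1, 1)      # cast feature vector as a 1x1 "image"
        return self.net(x)                       # (batch, 1, 128, 128)
```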

IV-C Adversarial Discriminator

To generate more detailed and realistic predictions from our generator, we add an adversarial discriminator, conditioned on the difference between the output of the generator and the ground truth collected from the stereo camera. As in [21], we implement the discriminator as a PatchGAN to penalize structure at the scale of patches and locally enforce reconstruction of high-frequency structure. Hence, our discriminator tries to classify whether each patch is a ground truth sample or generated. We model our discriminator as a series of convolutions whose characteristics depend on the generator’s output resolution. We follow the convention in [21] of having each predicted patch correspond to a receptive field of approximately of the input size. Our discriminator for a configuration is summarized in Table III. Here the output is and has a receptive field for each patch of .

Layer # of Filters Filter size Stride Padding
Conv1 64 4 2 1
Conv2 128 4 2 1
Conv3 256 4 2 1
Conv4 1 4 2 1
TABLE III: PatchGAN discriminator configuration for input.
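
A sketch following Table III is shown below. The channel progression and kernel/stride/padding values come from the table; the input channel count, leaky-ReLU slope and normalization layers are assumptions for illustration.

```python
import torch.nn as nn

def patchgan_discriminator(in_channels: int = 2) -> nn.Sequential:
    """PatchGAN-style discriminator per Table III; in_channels=2 assumes a
    conditioning pair (e.g. generated and ground-truth depth) stacked together."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(64, 128, 4, stride=2, padding=1),
        nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(128, 256, 4, stride=2, padding=1),
        nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(256, 1, 4, stride=2, padding=1),   # one real/fake logit per patch
    )
```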
Fig. 6: “Generator only” (no GAN) test samples. The first and second columns show the grayscale image of the scene and the ground truth depth map. The remaining columns show results from using raw waveforms and spectrograms at resolutions , , and .

V Experiments

V-A Generator Only - Pre-Study

In a preliminary study where we reconstruct small images of resolution, we found that early fusion (concatenation at the input) of the input signals outperforms late fusion (concatenation at Conv8, see Fig. 5) when using raw waveforms. In addition, we found that using spectrograms as input yields a slightly lower loss than using raw waveforms. However, as stated in Section IV, we aim for real-time capabilities on embedded platforms, and as such our main focus is on the least computationally expensive method.

The output of the network is compared with the visual ground truth via a regression loss:

(1)

where is the left and right audio waveforms or spectrograms, is the ground truth from the stereo camera, is the audio encoder and is the generator.
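
Eq. (1) itself is not reproduced in this text. A generic L1-style regression loss of the kind described, written with illustrative symbols (x for the binaural input, y for the camera ground truth, E for the audio encoder, G for the generator; an assumption, not the paper's exact formulation), could take the form:

```latex
% Illustrative regression loss; symbols are assumptions for this sketch.
\mathcal{L}_{\mathrm{rec}}(E, G) =
  \mathbb{E}_{(x, y)}\big[\, \lVert\, y - G(E(x)) \,\rVert_{1} \,\big]
```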

For these experiments we use a batch size of , the Adam solver [22] with an initial learning rate set to and parameters and set to and respectively. All ReLUs are leaky with slope .

When using raw waveforms, direct upsampling and early fusion perform best (Table IV). For spectrograms, early fusion, downsampling to , and the UNet style generator perform best. We also compare the mean depth map of the training set with the test set reconstructions and with random inputs drawn from a uniform distribution on the interval . These two best configurations are retrained for output dimensions of , and , where we find moderately higher loss for larger depth maps (Table V). Reconstruction quality for these resolutions differs and is compared in detail in Fig. 6. More samples for in Fig. 7 show reconstruction of diverse scenes.

Audio Encoder Fusion Shape Generator Loss
Waveform Early 1024 UNet 0.0883
Direct 0.0838
Late 1024 UNet 0.0894
Direct 0.0845
Spectrogram Early 1024 UNet 0.0834
Direct 0.0790
UNet 0.0773
Direct 0.0778
Mean 0.1058
Random 0.3654
TABLE IV: “Generator Only” results for imgs. on the test set.
Model Waveform (D. Upsampling) Spectrogram (UNet Style)
32 64 128 32 64 128
Gen. Only
Depth map 0.0852 0.0862 0.0880 0.0722 0.0726 0.0742
GAN
Depth map 0.0867 0.0955 0.0930 0.0799 0.0808 0.0878
Grayscale 0.2238 0.1967 0.2018 0.1721 0.1845 0.1841
TABLE V: test loss for 32, 64, 128 waveform and spectrogram.

V-B Generative Adversarial Network

The use of an adversarial discriminator improves the depth maps qualitatively, with more accurate details, even though the average loss increases slightly. Furthermore, based on the audio information, we reconstruct grayscale images which show an approximation of what the room layout could look like. Even though features of visual appearance are not present in the audio signal, we achieve plausible floor and wall layouts using the discriminator.

As proposed in [23], we use a least squares loss rather than the traditional sigmoid cross-entropy loss function. This avoids vanishing gradients, which would otherwise saturate learning. The GAN loss therefore becomes:

(2)
(3)

Our full objective therefore is:

(4)

where is a scaling factor. For these experiments we use , batch size , the Adam solver with learning rate set to and parameters and set to and respectively. All ReLUs are leaky with slope .
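
Eqs. (2)-(4) are likewise not reproduced above. A standard least-squares GAN formulation in the spirit of [23], using the same illustrative symbols as in the sketch after Eq. (1) and a scaling factor lambda (again an assumption, not the paper's exact equations), would read:

```latex
% Illustrative least-squares GAN objectives and full objective; a sketch only.
\mathcal{L}_{\mathrm{GAN}}(D) =
  \tfrac{1}{2}\,\mathbb{E}_{y}\big[(D(y) - 1)^{2}\big]
  + \tfrac{1}{2}\,\mathbb{E}_{x}\big[D(G(E(x)))^{2}\big] \\
\mathcal{L}_{\mathrm{GAN}}(G) =
  \tfrac{1}{2}\,\mathbb{E}_{x}\big[(D(G(E(x))) - 1)^{2}\big] \\
\mathcal{L} = \mathcal{L}_{\mathrm{GAN}}(G) + \lambda\,\mathcal{L}_{\mathrm{rec}}
```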

As in the “Generator Only” case, we find moderately higher loss for larger depth maps (Table V). However, the samples in Fig. 7 show finer details and clearer borders. Grayscale reconstruction (rightmost columns) shows well-placed floors even though objects are roughly abstracted.

Fig. 7: Test samples shown for output resolution. The first and fourth columns show the ground truth depth map and grayscale image of the scene. The remaining columns show results from using raw waveforms as input. Generated depth maps show correct mapping of close and distant areas even in row three, where errors in the ground truth are present.

VI Discussion

From an empirical study of the test results we find that our models reconstruct depth maps using only two microphones to a remarkable level of visual accuracy. We obtain disparity-like depth maps showing detailed room depth and obstacles such as walls and furniture. We even outperform our ground truth in some cases where the depth-from-stereo algorithm struggles to estimate disparity. This can be seen in Fig. 7, third row, where the ground truth grayscale shows the true room layout and the GAN captures the depth best. Corridors and open spaces can be distinguished and obstacles are visible, even though fine details are yet difficult to capture.

Generating grayscale images is a more difficult task, and the amount of detail and information required is not expected to be present in the echo returns. However, the trained model’s ability to generate plausible “free” floor areas and to place “walls” with seemingly good performance is highly interesting. Objects are not recognizable, but in the absence of further information the network correctly places an approximation where obstacles are. Note this is achieved without any depth-related ground truth, i.e. trained with monocular grayscale images only.

Fig. 8: Test samples with poor results shown for output resolution. The first and fourth columns show the ground truth depth map and grayscale image. The remaining columns show results obtained using raw waveforms. Close and complex objects are not well represented.

VI-A Current Limits of the Approach

How sound resonates, propagates and reflects in a room highly impacts performance. Some materials have damping properties, leading to weak (or completely absorbed) echoes. Facing corners, where hallways fork in different directions, pose a challenge because sound waves scatter off both sides of the corner. At short ranges (<), multipath echoes returning at the same time with similar amplitudes can create a superposition of echoes that is also difficult to resolve.

In areas with dense obstacles, such as conference rooms with office chairs, the model often fails to predict meaningful content. The examples in Fig. 8 may show limits in the reconstruction performance achievable with binaural microphones.

Finally, we fitted all sensors on a mobile robot to collect data from a perspective that will enable driving in the future, but did not yet use the robot’s own motors, in order to minimize audible noise.

References

  • [1] F. Schillebeeckx, F. De Mey, D. Vanderelst, and H. Peremans, “Biomimetic sonar: Binaural 3d localization using artificial bat pinnae,” I. J. Robotic Res., vol. 30, pp. 975–987, 07 2011.
  • [2] I. Matsuo, J. Tani, and M. Yano, “A model of echolocation of multiple targets in 3d space from a single emission,” The Journal of the Acoustical Society of America, vol. 110, no. 1, pp. 607–624, 2001. [Online]. Available: https://doi.org/10.1121/1.1377294
  • [3] R. Kuc and V. Kuc, “Modeling human echolocation of near-range targets with an audible sonar,” The Journal of the Acoustical Society of America, vol. 139, pp. 581–587, 02 2016.
  • [4] J. Sohl-Dickstein, S. Teng, B. Gaub, C. C. Rodgers, C. Li, M. R. DeWeese, and N. S. Harper, “A device for human ultrasonic echolocation,” IEEE transactions on bio-medical engineering, vol. 62, 01 2015.
  • [5] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14.   Cambridge, MA, USA: MIT Press, 2014, pp. 2672–2680. [Online]. Available: http://dl.acm.org/citation.cfm?id=2969033.2969125
  • [6] I. Eliakim, Z. Cohen, G. Kósa, and Y. Yovel, “A fully autonomous terrestrial bat-like acoustic robot,” PLOS Computational Biology, vol. 14, p. e1006406, 09 2018.
  • [7] J. Steckel and H. Peremans, “Batslam: Simultaneous localization and mapping using biomimetic sonar,” PloS one, vol. 8, p. e54076, 01 2013.
  • [8] B. Fontaine, H. Peremans, and J. Steckel, “3d sparse imaging in biosonar scene analysis,” 04 2009.
  • [9] J. M. Wotton and J. A. Simmons, “Spectral cues and perception of the vertical position of targets by the big brown bat, eptesicus fuscus,” The Journal of the Acoustical Society of America, vol. 107, no. 2, pp. 1034–1041, 2000. [Online]. Available: https://doi.org/10.1121/1.428283
  • [10] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, “Audio-visual event localization in the wild,” Proc. CVPR Workshop: Sight and Sound, 06 2019.
  • [11] A. Senocak, T. Oh, J. Kim, M. Yang, and I. S. Kweon, “Learning to localize sound source in visual scenes,” CoRR, vol. abs/1803.03849, 2018. [Online]. Available: http://arxiv.org/abs/1803.03849
  • [12] A. F. Pérez, V. Sanguineti, P. Morerio, and V. Murino, “Audio-visual model distillation using acoustic images,” 04 2019.
  • [13] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” ACM Trans. Graph., vol. 37, no. 4, pp. 112:1–112:11, July 2018. [Online]. Available: http://doi.acm.org/10.1145/3197517.3201357
  • [14] A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” arXiv preprint arXiv:1804.03641, 2018.
  • [15] A. Zunino, M. Crocco, S. Martelli, A. Trucco, A. Del Bue, and V. Murino, “Seeing the sound: A new multimodal imaging device for computer vision,” 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), 12 2015.
  • [16] F. Keyrouz and K. Diepold, “An enhanced binaural 3d sound localization algorithm,” in 2006 IEEE International Symposium on Signal Processing and Information Technology, Aug 2006, pp. 662–665.
  • [17] S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik, “Learning individual styles of conversational gesture,” in Computer Vision and Pattern Recognition (CVPR).   IEEE, June 2019.
  • [18] D. B. Lindell, G. Wetzstein, and V. Koltun, “Acoustic non-line-of-sight imaging,” Proc. CVPR, 2019.
  • [19] Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: Learning sound representations from unlabeled video,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds.   Curran Associates, Inc., 2016, pp. 892–900. [Online]. Available: http://papers.nips.cc/paper/6146-soundnet-learning-sound-representations-from-unlabeled-video.pdf
  • [20] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds.   Cham: Springer International Publishing, 2015, pp. 234–241.
  • [21] P. Isola, J.-Y. Zhu, T. Zhou, and A. Efros, “Image-to-image translation with conditional adversarial networks,” 07 2017, pp. 5967–5976.

  • [22] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations, 12 2014.
  • [23] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, “Least squares generative adversarial networks,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 2813–2821.