Deep Unsupervised Common Representation Learning for LiDAR and Camera Data using Double Siamese Networks

by Andreas Bühler, et al.

Domain gaps of sensor modalities pose a challenge for the design of autonomous robots. Taking a step towards closing this gap, we propose two unsupervised training frameworks for finding a common representation of LiDAR and camera data. The first method utilizes a double Siamese training structure to ensure consistency in the results. The second method uses a Canny edge image guiding the networks towards a desired representation. All networks are trained in an unsupervised manner, leaving room for scalability. The results are evaluated using common computer vision applications, and the limitations of the proposed approaches are outlined.


1 Introduction

In recent years, a lot of research has been done on improving computer vision and LiDAR systems. Applications range from autonomous cars to unmanned drones. One of the main areas of development is mapping and localization. Two sensors are mainly used for this application: LiDARs, which provide highly accurate distance measurements but are still rather expensive, and cameras, which are cheaper but lack accuracy when used for mapping. In this work we propose a step towards a new concept of localization, namely finding a common representation of LiDAR and camera data to be used with standard computer vision algorithms. This common representation is useful in swarm robotics, for example for swarms of drones or fleets of autonomous cars. The idea is the following: A mapping device is equipped with one highly accurate LiDAR sensor and used to create a precise and consistent 3D model of the environment, the map. The swarm robots are equipped with an affordable, lightweight camera. The images taken by the camera can be transformed into the proposed common representation, together with nearby point cloud data of the map. Next, standard 2D-2D feature matching is performed using the common representations. The 3D positions of the matched features are retrieved from the stored point cloud based map data, which was originally used to generate one of the common representations. Using this data, the 6DoF pose of the robot can be estimated.

Our contribution to this concept consists of two methods for finding a common representation of image and point cloud data in the form of a 2D image. The process of finding this common representation is learning based and unsupervised. Figure 1 provides a high-level overview of the system. Figure 2 shows the used input data. A point cloud created by multiple LiDAR sensors is projected onto the corresponding image of a forward facing camera with a fisheye lens.
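The step from 2D-2D matches to 3D positions described in the localization idea above can be sketched as follows. This is only an illustration under assumptions: the lookup `pixel_to_point`, the function name, and the toy data are hypothetical and not part of the original work. The sketch assumes that, while projecting the map point cloud into its common representation, the 3D point behind each map pixel is recorded.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]                 # (u, v) pixel coordinates
Point3D = Tuple[float, float, float]    # (x, y, z) in the map frame


def lift_matches_to_3d(
    matches: List[Tuple[Pixel, Pixel]],
    pixel_to_point: Dict[Pixel, Point3D],
) -> List[Tuple[Pixel, Point3D]]:
    """Turn 2D-2D matches into 2D-3D correspondences.

    Each match pairs a pixel in the camera CR with a pixel in the map CR.
    `pixel_to_point` is a hypothetical lookup recorded while projecting the
    map point cloud; map pixels without a stored 3D point are skipped.
    """
    correspondences = []
    for camera_px, map_px in matches:
        point = pixel_to_point.get(map_px)
        if point is not None:
            correspondences.append((camera_px, point))
    return correspondences


# Toy data: only two of the three matched map pixels have a stored 3D point.
pixel_to_point = {(10, 20): (1.0, 0.5, 8.0), (40, 25): (-2.0, 0.4, 12.0)}
matches = [((12, 21), (10, 20)), ((44, 27), (40, 25)), ((90, 60), (91, 61))]
correspondences = lift_matches_to_3d(matches, pixel_to_point)
print(correspondences)
```

The resulting 2D-3D pairs are the kind of input a pose solver needs, for example a perspective-n-point method; the text does not name a specific solver.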

Figure 1: The common representation (CR) generators leverage the generalization power of neural networks to find a 2D image, the common representation, that can be created from either camera or LiDAR data.

Figure 2: (a) Overlay of point cloud (red) and fisheye distorted image. (b) Overlay of point cloud (red) and image after undistortion. (c) Undistorted, grayscale input image to the proposed networks. (d) Undistorted depth input image to the proposed network, generated by a 2D projection of the point cloud data.

2 Related Work

Finding an abstract representation of images is an extensively studied topic in the field of computer vision and deep learning. A very common approach to this problem is the utilization of convolutional layers combined in an encoder-decoder network [1]. The encoder aims to learn an encoding of the input in a high dimensional hyperspace, while the decoder ensures that this code contains all relevant information to generate the desired output image. The simplest examples are auto-encoders, which reconstruct the input image from the intermediate code. Those networks are often used as the backbone architecture of more complex structures [14]. By constraining the learned encoding, variational auto-encoders (VAE) are restricted to include certain information in the low level representation [8]. Another approach to enforce a specific representation of the output leverages the feedback loop in generative adversarial networks (GANs). A GAN is composed of two subnetworks: while the generator creates artificial data, the discriminator tries to differentiate them from another data set that represents the target domain. Recent research based on GANs has shown astonishing results when converting images from one space to another space in an unsupervised fashion [15]. In both [2] and [7] multiple encoding-decoding networks are coupled to transfer images between several domains where no trivial relations exist. In order to find a common structure between data taken from multiple modalities, it is important to guide the network during the training process by providing both similar and non-similar pairs of the respective domains. Siamese networks consist of two networks that share all weights and that are trained simultaneously. A final cost layer computes the similarity score between the outputs of the different input streams. In [3] Siamese networks are used to identify persons in images taken by different cameras and across time.

Another field of application is image matching when dealing with changing weather conditions or diverse viewpoints [10]. An example of combining the aforementioned work is presented in [9]. Using output data from a VAE as first input for a GAN's discriminator and having a traditional auto-encoder as the second stream, the encoding learned by the auto-encoder is constrained to contain the desired information. This overcomes the problem of handcrafted features based on a pixel level. However, this approach cannot be directly applied to data from multiple modalities. In SuperPoint [4] the idea of Siamese training is applied to find keypoint correspondences between images in a completely unsupervised fashion. Classical computer vision methods, e.g., SIFT features, fall short of the capabilities of SuperPoint. Even though the visual results are very promising, the method requires the computational power of recent GPU models, which conflicts with equipping low-cost swarm robots.

3 Method

In this section two approaches for finding a common representation (CR) between camera and LiDAR data are proposed. The first approach (section 3.1) is purely unsupervised and utilizes a double Siamese network architecture. The second approach (section 3.2) tries to find a CR that resembles an edge image, to facilitate the learning process. Lastly, in section 3.3 the network layers utilized in the proposed architectures are explained.

Figure 3: Double Siamese Networks based on two encoder-decoder subnetworks to generate the CR from two pairs consisting of an image and the corresponding depth map.

3.1 Architecture 1 - Double Siamese Networks

Figure 3 shows the proposed architecture to learn a CR of image and point cloud data in a purely unsupervised manner. The framework builds on ideas from Siamese learning and encoder-decoder architectures, which are commonly used in unsupervised approaches. The Siamese learning approach ensures that the results are consistent, i.e., changes of the input result in similar changes of the output.

Firstly, two images (image 1 and image 2) are sampled from the database and used to compute a difference score. This metric describes the similarity between both images and returns a high value for samples of the same scene of slightly different viewpoints. The Chebyshev distance [5] is utilized as it provides the desired behavior. During the sampling process, a tuning parameter controls the probability of sampling from the set of similar versus the set of non-similar images. Secondly, the system fetches the corresponding depth maps (depth map 1 and depth map 2) belonging to image 1 and image 2, respectively. The depth maps are computed by projecting the point cloud data onto the same image plane as given by the corresponding images. The pixels encode the depth measurement of the respective point. Finally, image 1 and image 2 are fed into the image CR generator, whereas depth map 1 and depth map 2 are fed into the depth map CR generator. Both generators are designed in an encoding-decoding fashion, which is explained in section 3.3. The four forward passes through the subnetworks generate four grayscale output images representing the respective CR of the input pair 1 and input pair 2.

The double Siamese networks are trained using a loss that combines five quantities: the CRs generated from the first and second image, the CRs generated from the first and second depth map, and the Chebyshev distance of images 1 and 2. The norm used to compare the CRs can theoretically be any matrix/image comparison norm; here, the Manhattan distance is utilized. The intuition behind this formulation of the loss function is the following: The loss should be small when the CR computed from the image and the depth map is similar and, additionally, one of the two following conditions holds: either the Siamese pairs are similar, i.e., a small difference score, and the CRs of the respective Siamese pairs are similar, or the Siamese pairs are not similar, i.e., a large difference score, and the CRs of the respective Siamese pairs differ a lot. Similarly, the loss should be large in the respective inverse cases.
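To make the intuition behind the loss more concrete, the following sketch implements the two distance measures and one possible loss with the described behavior. The loss form is an assumption chosen for illustration and is not the equation used in this work. It follows the convention of the intuition above, where a small difference score means similar images, and it divides the Manhattan distance by the pixel count (also an assumption) so that it lives on the same scale as the difference score. Images are plain nested lists with values in [0, 1] to keep the example free of dependencies.

```python
from typing import List

Image = List[List[float]]  # row-major grayscale image, values in [0, 1]


def chebyshev_distance(a: Image, b: Image) -> float:
    """Largest absolute pixel difference between two equally sized images."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def manhattan_distance(a: Image, b: Image) -> float:
    """Sum of absolute pixel differences, divided by the pixel count."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def double_siamese_loss(cr_i1: Image, cr_i2: Image,
                        cr_d1: Image, cr_d2: Image, d: float) -> float:
    """ASSUMED loss form, for illustration only.

    cross_modal: image CR and depth map CR of the same pair should agree.
    siamese: the distance between the CRs of pair 1 and pair 2 should
    track the difference score d of the two input images.
    """
    cross_modal = (manhattan_distance(cr_i1, cr_d1)
                   + manhattan_distance(cr_i2, cr_d2))
    siamese = (abs(manhattan_distance(cr_i1, cr_i2) - d)
               + abs(manhattan_distance(cr_d1, cr_d2) - d))
    return cross_modal + siamese


# Toy input images and their difference score.
img1 = [[0.0, 0.2], [0.4, 0.6]]
img2 = [[0.1, 0.2], [0.4, 0.9]]
d = chebyshev_distance(img1, img2)

# Toy CRs: `flat` and `shifted` differ by about d, `ones` is unrelated.
flat = [[0.0, 0.0], [0.0, 0.0]]
shifted = [[0.3, 0.3], [0.3, 0.3]]
ones = [[1.0, 1.0], [1.0, 1.0]]

loss_consistent = double_siamese_loss(flat, shifted, flat, shifted, d)
loss_inconsistent = double_siamese_loss(flat, shifted, ones, shifted, d)
print(loss_consistent, loss_inconsistent)
```

In this toy case the loss is close to zero when both modalities yield the same CR and the CR distance matches the difference score, and it grows when the depth map CR of the first pair disagrees with its image CR.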

3.2 Architecture 2 - Common Edges

Our second approach of finding a common representation utilizes a Canny edge image in the loss function to solve the consistency problem. Figure 4 shows the proposed architecture of this approach. The input, which is randomly fetched from a database, consists of a grayscale image and the corresponding depth map. The Canny edge algorithm is utilized to generate an edge image based on the grayscale input image. The inputs are fed through two separate encoder-decoder networks, which are explained in more detail in section 3.3. The loss function is defined over three images: the CRs generated from the image and the depth map, respectively, and the Canny edge image. The norm operator is implemented as the Manhattan distance, but other matrix/image comparison norms could work as well. This approach ensures that the generated CRs are consistent and resemble an edge image. Therefore, a change in the input image leads to a similar change in the output image.
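A loss of this kind could look like the sketch below. The exact form is an assumption made for illustration, not the equation of this work: both CRs are pulled towards the edge image with a Manhattan distance, here divided by the pixel count. The edge image is a hand-written toy array rather than the output of an actual Canny detector.

```python
from typing import List

Image = List[List[float]]  # row-major grayscale image, values in [0, 1]


def mean_abs_diff(a: Image, b: Image) -> float:
    """Manhattan distance between two images, divided by the pixel count."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def common_edges_loss(cr_image: Image, cr_depth: Image, edges: Image) -> float:
    """ASSUMED loss form: each CR should resemble the edge image."""
    return mean_abs_diff(cr_image, edges) + mean_abs_diff(cr_depth, edges)


# Toy edge image with one vertical edge, a matching CR, and a blurry CR.
edge = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
sharp_cr = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
blurry_cr = [[0.3, 0.5, 0.3], [0.3, 0.5, 0.3]]

loss_sharp = common_edges_loss(sharp_cr, sharp_cr, edge)
loss_blurry = common_edges_loss(sharp_cr, blurry_cr, edge)
print(loss_sharp, loss_blurry)
```

Under this assumed form, a blurry depth map CR is penalized even when the image CR matches the edge image perfectly.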

Figure 4: Common Edges architecture utilized to train a common representation that resembles a Canny edge image.

3.3 Encoder-Decoder Layers

Figure 5: Network layers used as CR generator in both approaches.

Figure 5 depicts the encoder-decoder network architecture that is utilized in the Double Siamese Networks, section 3.1, and in Common Edges, section 3.2. The input into this network could be any image that fits the input dimension. For our purposes, the input is either an image or a depth map, and the output is a 2D image, the CR. The layers are chosen such that the network learns a higher dimensional representation of the input data in some latent space by extracting the relevant features in an unsupervised fashion. Since the CR should preserve local information with respect to the input data, the network consists only of convolutional layers. After each convolutional layer, a max pooling operation reduces the first two dimensions by a factor of two.
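The effect of the max pooling operations on the spatial dimensions can be illustrated with a small sketch. The number of pooling stages used here (three) is a hypothetical value, since the text does not state the layer count, and the 2x2 window with stride 2 is an assumption consistent with halving both dimensions.

```python
from typing import List, Tuple


def max_pool_2x2(image: List[List[float]]) -> List[List[float]]:
    """2x2 max pooling with stride 2 (assumes even width and height)."""
    return [
        [max(image[r][c], image[r][c + 1],
             image[r + 1][c], image[r + 1][c + 1])
         for c in range(0, len(image[0]), 2)]
        for r in range(0, len(image), 2)
    ]


def pooled_size(width: int, height: int, num_pools: int) -> Tuple[int, int]:
    """Spatial size after `num_pools` pooling stages that each halve both axes."""
    for _ in range(num_pools):
        width, height = width // 2, height // 2
    return width, height


pooled = max_pool_2x2([[1, 2, 3, 4],
                       [5, 6, 7, 8],
                       [9, 10, 11, 12],
                       [13, 14, 15, 16]])
print(pooled)

# Hypothetical: three pooling stages applied to the two input sizes
# reported in the results section.
size_siamese = pooled_size(320, 160, 3)
size_edges = pooled_size(640, 320, 3)
print(size_siamese, size_edges)
```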

4 Results

Figure 6: Top left to bottom right of each group of four images: Input image, input depth image from point cloud, common representation generated from input image, common representation generated from depth image.

The proposed architectures are trained on the UP-Drive data set [13]. The point cloud data is projected onto a 2D plane and utilized as depth image. The RGB image coming from the front facing camera is converted to grayscale and scaled to a feasible size. The dimensions are 320x160 pixels for the Double Siamese Networks and 640x320 pixels for the Common Edges method, facilitating the training process.
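The projection of point cloud data into a depth image can be sketched as follows. This is a minimal illustration under assumptions that the text does not confirm: an ideal pinhole camera (the images are undistorted), points already expressed in the camera frame, and hypothetical intrinsics `fx`, `fy`, `cx`, `cy`. Pixels without a LiDAR return are set to zero, and the nearest point wins when several points fall on the same pixel.

```python
from typing import List, Sequence, Tuple


def project_to_depth_map(
    points: Sequence[Tuple[float, float, float]],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> List[List[float]]:
    """Project camera-frame points (x right, y down, z forward) to a depth image."""
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0.0:
            continue  # point lies behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # Keep the closest measurement per pixel.
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth


# Toy example on an 8x4 image with hypothetical intrinsics.
points = [
    (0.0, 0.0, 5.0),    # hits the image centre
    (0.0, 0.0, 3.0),    # same pixel, closer, so it replaces the first
    (1.0, 0.0, 2.0),    # lands to the right of the centre
    (0.0, 0.0, -1.0),   # behind the camera, ignored
    (10.0, 0.0, 1.0),   # outside the image, ignored
]
depth = project_to_depth_map(points, fx=4.0, fy=4.0, cx=4.0, cy=2.0,
                             width=8, height=4)
for row in depth:
    print(row)
```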

4.1 Architecture 1 - Double Siamese Networks

Figure 6 shows the results generated by the Double Siamese Networks. It can be seen that the CR is very similar when generated from either the grayscale image or the depth image, which was one of the goals of this work. On the other hand, it is no longer very descriptive or detailed. A lot of information is lost in the compression process of the encoder-decoder networks and due to the large domain gap between the two modalities. To demonstrate the usability of the CR, two standard feature matching methods are applied. The results are averaged over 100 images from the test data set. Figure 7 shows the result when using ORB and SIFT features for matching. The left image pairs are matched using ORB, the right image pairs using SIFT. For each pair, the data is taken from the same scene at different time instances, resulting in small variations in the point of view. The most significant result is the reprojection error given in table 1. It is significantly higher for the CR in both cases (ORB and SIFT), but still in an acceptable range. The loss of detail in the CR can be quantified by counting the number of matches that the algorithms find. Normal images achieve more than 10 times as many matches as the CR (148 vs. 13 for SIFT and 270 vs. 23 for ORB). On the other hand, the quality of the matches is surprisingly good. In fact, the average L2 distance of the matched descriptor vectors is even lower for the CRs, demonstrating the strength of the matches.

Figure 7: Feature matching using ORB (left) and SIFT (right) features. The matched images and common representations are taken at two different time instances. The left common representation is generated from a grayscale input image, the right one from a projected point cloud depth image.

| Matching inputs | SIFT avg. dist. | SIFT avg. matches | SIFT avg. reprojection error (pixels) |
| --- | --- | --- | --- |
| image vs. image | 1 | 148 | 0.65 |
| image CR vs. depth map CR | 0.67 | 13 | 1.46 |

| Matching inputs | ORB avg. dist. | ORB avg. matches | ORB avg. reprojection error (pixels) |
| --- | --- | --- | --- |
| image vs. image | 1 | 270 | 1.49 |
| image CR vs. depth map CR | 0.81 | 23 | 3.68 |
Table 1: Comparison of image to image matching with common representation matching from RGB and depth image. The distance values of SIFT and ORB feature matches are the L2 norm between matched feature vectors and normalized to 1, to facilitate the interpretation. The number of matches is significantly higher in the image to image matching case, but the resulting reprojection error achieved by the common representation matches is acceptably low.
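The relative differences between image matching and CR matching can be read directly from the values in Table 1. The short calculation below only restates those numbers as ratios; the dictionary layout and key names are chosen for this example.

```python
# Values copied from Table 1.
table_1 = {
    "SIFT": {
        "image": {"matches": 148, "reproj_px": 0.65},
        "cr": {"matches": 13, "reproj_px": 1.46},
    },
    "ORB": {
        "image": {"matches": 270, "reproj_px": 1.49},
        "cr": {"matches": 23, "reproj_px": 3.68},
    },
}

ratios = {}
for name, rows in table_1.items():
    # How many times more matches plain images yield than CRs.
    match_ratio = rows["image"]["matches"] / rows["cr"]["matches"]
    # How many times larger the reprojection error is for CRs.
    error_ratio = rows["cr"]["reproj_px"] / rows["image"]["reproj_px"]
    ratios[name] = (match_ratio, error_ratio)
    print(f"{name}: {match_ratio:.1f}x more matches on images, "
          f"{error_ratio:.1f}x higher reprojection error on CRs")
```

For both feature types, the images yield roughly 11 to 12 times as many matches, while the reprojection error of the CR matches is about 2 to 2.5 times higher.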

4.2 Architecture 2 - Common Edges

Figure 8: Top left to bottom right: Input image, input depth image from point cloud, common edges representation generated from input image, common edges representation generated from depth image.

The results of the Common Edges approach are depicted in Figure 8. The CR generated from the grayscale image looks similar to an edge image, which agrees with intuition, as the edge image has been generated from the input grayscale image in the first place. The CR generated from the depth image is blurry and does not resemble an edge image. More details on this behavior and potential solutions are presented in section 5. In general, the results of this architecture are too inconsistent to be used for any matching algorithms. A quantitative evaluation is therefore omitted.

5 Discussion

The results of the Double Siamese Networks show that there is a lack of detail in the common representation, limiting the usability for matching operations. Still, the evaluations using ORB and SIFT matches show that the quality of the matches is not too far off the quality of purely image based matches. The Common Edges architecture does not perform well, since the CR generated from the depth image is very blurry. There are two potential causes for this issue. Firstly, blurry results are a common problem of encoder-decoder architectures. This can potentially be resolved by using GANs, as proposed in [12]. Secondly, there is a conceptual flaw in this approach: many objects have edges in the sense of a Canny edge detector, e.g., red brick buildings, which a depth sensor cannot capture.

The reduction of information when generating a common representation seems to be unavoidable when utilizing an architecture as proposed in this work. In order to find a representation that retains more information, it is necessary to incorporate more information in either of the pipelines. For instance, one could perform monocular depth estimation from images to reduce the domain gap between camera and LiDAR data. A novel approach for this is presented in [11]. Another approach could include semantic information. Along those lines, it could be useful to apply object instance detection on both modalities to generate a higher-level representation of the scene. A common method for this, called Mask R-CNN, is proposed in [6]. The issue with approaches like this is that the generalizability of the whole system depends on the generalization capabilities of the individual parts. The training time and the data set size necessary to still generalize well might no longer be feasible.

6 Conclusion

In this work we have proposed two architectures to learn common representations of LiDAR and camera data, in the form of a 2D image. We find that for one of our approaches the results are not as detailed as initially hoped, but that they are still sufficient for basic feature matching algorithms. The Double Siamese Network architecture proved very effective in ensuring the consistency and distinctiveness of the common representation. We have further shown the limitations of using just geometric data, and propose the incorporation of additional semantic information for future work.


  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017-12) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495. ISSN 0162-8828.
  • [2] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2017) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. CoRR abs/1711.09020.
  • [3] D. Chung, K. Tahboub, and E. J. Delp (2017-10) A two stream siamese convolutional neural network for person re-identification. In The IEEE International Conference on Computer Vision (ICCV).
  • [4] D. DeTone, T. Malisiewicz, and A. Rabinovich (2017) SuperPoint: self-supervised interest point detection and description. CoRR abs/1712.07629.
  • [5] C. Guo, M. Rana, M. Cissé, and L. van der Maaten (2017) Countering adversarial images using input transformations. CoRR abs/1711.00117.
  • [6] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988.
  • [7] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim (2017) Learning to discover cross-domain relations with generative adversarial networks. CoRR abs/1703.05192.
  • [8] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum (2015) Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2539–2547.
  • [9] A. B. L. Larsen, S. K. Sønderby, and O. Winther (2015) Autoencoding beyond pixels using a learned similarity metric. CoRR abs/1512.09300.
  • [10] I. Melekhov, J. Kannala, and E. Rahtu (2016-12) Siamese network features for image matching. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 378–383.
  • [11] S. Pillai, R. Ambrus, and A. Gaidon (2018) SuperDepth: self-supervised, super-resolved monocular depth estimation. CoRR abs/1810.01849.
  • [12] T. Sainburg, M. Thielk, B. Theilman, B. Migliori, and T. Gentner (2018) Generative adversarial interpolative autoencoding: adversarial training on latent space interpolations encourage convex latent distributions. CoRR abs/1807.06650.
  • [13] R. Varga, A. Costea, H. Florea, I. Giosan, and S. Nedevschi (2017-10) Super-sensor for 360-degree environment perception: point cloud segmentation using image features. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pp. 1–8. ISSN 2153-0017.
  • [14] K. Zeng, J. Yu, R. Wang, C. Li, and D. Tao (2017-01) Coupled deep autoencoder for single image super-resolution. IEEE Transactions on Cybernetics 47 (1), pp. 27–37. ISSN 2168-2267.
  • [15] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017-10) Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV).