Camera and Light Detection and Ranging (LiDAR) sensors are essential components of autonomous vehicles with complementary properties. A camera provides high-resolution color information but is sensitive to illumination and lacks direct spatial measurement. A LiDAR provides accurate spatial information at long range and is robust to illumination changes, but its resolution is much lower than that of a camera, and it does not measure color. However, combining the information from the two sensors remains difficult due to differences in representation (pixel vs. point) and information content (visual vs. geometric). Common methods combine the two sensors' information by manually defining features shared across the two modalities, such as edges, gradients, and semantics. Such hand-crafted features are widely used in traditional calibration methods [Jiang2021_MFI]; calibration is typically the first step when using multiple sensors. Other scenarios include mapping, localization, and SLAM [Pham2019_LCD, Wang2022_p2net, Feng2019_2D3Dmatchnet]. Thus, defining common features across sensor modalities is important for all of these applications. With the development of deep learning, many methods have been explored in the literature to generate features for a single modality, such as images [Wang2021_DenseCL] or point clouds [Christopher2019_ICCV]. These features can be used to find inter-modal correspondences but cannot be applied directly to cross-modal situations. Finding unified cross-modal dense features is more challenging than finding unified inter-modal features. A few works have explored cross-modal descriptors; most are patch-based methods and require dense point clouds to learn [Pham2019_LCD, Wang2022_p2net, Feng2019_2D3Dmatchnet]. One reason is that there is no unified neural network architecture versatile enough to process all modalities. Images are typically processed by 2D convolutional networks, whereas point clouds are processed by a variety of architectures depending on their representation. This difference makes it difficult to train the two models simultaneously. Secondly, compared with other 3D scanners, LiDAR creates sparser point clouds, making the problem much more challenging. Moreover, [Wang2022_p2net] and our preliminary tests show that existing loss functions struggle with cross-feature learning between point clouds and images. To overcome these difficulties without losing generality, we propose a U-Net variant of the widely used PointNet++ [Qi2017_NIPS] architecture for the point cloud and a U-Net CNN architecture for images. To handle the sparsity of LiDAR, we add multi-scale grouping features to the feature propagation layer of PointNet/PointNet++. We propose a Tuple-Circle loss function by viewing the feature learning problem from a contrastive learning perspective and considering cross-modal properties to optimize the two networks. The contributions of this paper are summarized as follows:
We propose the Tuple-Circle loss function for cross-modal deep contrastive learning. By representing feature vectors as tuples and adding self-paced weights, the Tuple-Circle loss helps the models learn features across different modalities.
We propose U-Net architectures for 2D image and 3D point cloud dense feature contrastive learning with flexible receptive fields.
We conduct experiments on the real-world KITTI360 dataset [Liao2021_ARXIV] to show the effectiveness of our method and visualize the learned features.
II Related Work
II-A Contrastive Learning
Contrastive representation learning aims to learn a feature space where similar sample pairs are close to each other and dissimilar pairs are far apart. Several contrastive learning loss functions have been proposed [Le2020_ACCESS, Sun2020_Circleloss]. [Chopra2005_CVPR] is one of the earliest works on contrastive learning; it proposes a contrastive loss that considers only one positive or negative pair at a time. The triplet loss [Schroff2015_CVPR] tries to maximize the similarity between anchor and positive samples while simultaneously minimizing the similarity between anchor and negative samples. [Sohn2016_NIPS] generalizes the triplet loss to include comparison with multiple negative samples. [Gutmann2010_PMLR, Van2018_arXiv] propose the InfoNCE loss, which uses categorical cross-entropy to deal with multiple negative samples. Circle loss [Sun2020_Circleloss] provides a unified perspective for optimizing pair similarity. For multimodal contrastive learning, [Liu2021_ICCV] uses tuples to differentiate information from different modalities and proposes the TupleInfoNCE loss for multimodal fusion. Most of the above methods are used for global embedding learning, but some methods attempt to learn dense embeddings of the inputs. [Wang2021_DenseCL] implements a self-supervised learning method by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. [Christopher2019_ICCV] proposes hardest-contrastive and hardest-triplet losses that exploit hard negative samples to learn geometric features of the point cloud. Our Tuple-Circle loss considers the properties of cross-modal learning and provides a more flexible and adaptive objective for optimization.
II-B Image and Point Cloud Feature Learning
[Cattaneo2020_ICRA] uses the triplet loss combined with a teacher/student method to create a shared 2D-3D embedding space for image-based global localization in LiDAR maps. [Yin2021_i3dLoc] utilizes a Generative Adversarial Network (GAN) to extract cross-domain symmetric place descriptors for localizing a single camera with respect to a point cloud map in indoor and outdoor scenes. [Xing2018_3DNet] is designed to learn robust local feature representations, leveraging both textures from images and geometric information from the point cloud. [Li2015_JointEmbeddings] uses a CNN to map an image to a point in an embedding space created by a 3D shape similarity measure; the embedding allows cross-view image retrieval, image-based shape retrieval, and shape-based image retrieval. In [Feng2019_2D3Dmatchnet], an end-to-end deep network architecture jointly learns descriptors for 2D and 3D keypoints from images and point clouds, respectively. As a result, the approach can directly match and establish 2D-3D correspondences between the query image and a 3D point cloud reference map for visual pose estimation. [Pham2019_LCD] proposes LCD, which uses a dual auto-encoder neural network and triplet loss to learn a shared latent space representing the 2D and 3D data; however, the method requires a point cloud with RGB information. Our approach is closest to [Wang2022_p2net], which proposes a joint learning framework with an ultra-wide reception mechanism for simultaneous 2D and 3D local feature description and detection to achieve direct pixel and point matching. Based on Circle loss, [Wang2022_p2net] introduces a circle-guided descriptor loss to train P2-Net for joint description of local features for pixel and point matching. However, the circle-guided descriptor loss considers only cross-modal positive sample pairs and inter-modal negative sample pairs, which does not fully utilize the batch data. It converges slowly in the cross-modal setting and fails in the LiDAR-image situation. The model proposed in this paper, combined with the Tuple-Circle loss function, works on LiDAR and image pairs.
III Our Approach
III-A Tuple-Circle Loss
Contrastive learning, an approach in self-supervised learning, allows a model to learn rich representative features. Contrastive representation learning aims to learn a feature space where similar sample pairs stay close while dissimilar ones (called negative samples) are far apart; see Fig. 1. In a single-modality setting, each sample in a dataset is treated as a distinct instance (called the anchor sample). Similar counterparts (called positive samples) are created for training by random augmentations, e.g., rotating an image or applying Gaussian blur. In our setting, we consider an image-LiDAR pair acquired from the same scene, together with their augmented counterparts, as a dataset, and treat a pixel and a point corresponding to the same physical location as a positive pair. In a cross-modal learning setting, it is desirable to have a loss function that handles multiple positive and negative samples, because samples from different modalities can form many different pairs. Many loss functions have been proposed for contrastive learning. Sun et al. [Sun2020_Circleloss] proposed a unified loss function from a pair-similarity optimization viewpoint on deep feature learning. Based on this unified loss function, they derived Circle loss by adding self-paced weights and margins:

$\mathcal{L}_{circle} = \log\Big[1 + \sum_{j=1}^{L} \exp\big(\gamma\,\alpha_n^j (s_n^j - \Delta_n)\big) \sum_{i=1}^{K} \exp\big(-\gamma\,\alpha_p^i (s_p^i - \Delta_p)\big)\Big] \qquad (1)$
where $\alpha_n^j = [s_n^j - O_n]_+$ and $\alpha_p^i = [O_p - s_p^i]_+$ are self-paced weights; $L$ and $K$ are the total numbers of negative and positive samples; and $\Delta_n = m$ and $\Delta_p = 1 - m$ are the negative and positive margins (with the optima $O_n = -m$ and $O_p = 1 + m$). Circle loss has only two hyper-parameters: $\gamma$ is a scale factor, and $m$ controls the radius of the decision boundary. $s_p^i$ represents the similarity between the features of a positive pair, and $s_n^j$ denotes the similarity between the features of a negative pair.
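For concreteness, the per-anchor form of Circle loss in Eq. (1) can be sketched in plain Python; the hyper-parameter defaults follow the Circle loss paper, while the function name and example similarity values are illustrative:

```python
import math

def circle_loss(sp, sn, gamma=32.0, m=0.25):
    """Per-anchor Circle loss (Eq. (1)) -- a minimal sketch.

    sp: similarities to positive samples (s_p)
    sn: similarities to negative samples (s_n)
    gamma: scale factor; m: margin controlling the decision-boundary radius.
    """
    op, on = 1.0 + m, -m   # optimal similarity targets O_p, O_n
    dp, dn = 1.0 - m, m    # margins Delta_p, Delta_n
    # self-paced weights alpha = [.]_+ (clamped at zero)
    pos = sum(math.exp(-gamma * max(op - s, 0.0) * (s - dp)) for s in sp)
    neg = sum(math.exp(gamma * max(s - on, 0.0) * (s - dn)) for s in sn)
    return math.log(1.0 + neg * pos)
```

Note how a well-separated anchor (positives near 1, negatives near -1) receives small self-paced weights and hence a small loss, while a poorly separated anchor is penalized heavily.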
Our preliminary tests show that the self-paced weights play a vital role in the convergence of cross-modality learning. The weight of a negative pair decreases as the distance between the pair increases. This behavior guarantees that features from the two modalities are not pushed too far apart before the two models converge to the same feature space. The self-paced weights of positive samples have a similar effect: as two positive samples become close in the feature space, their weight decreases, allowing the optimization to attend to other pairs. However, like other approaches [Chopra2005_CVPR, Sohn2016_NIPS, Van2018_arXiv], circle loss does not consider the properties of cross-modality learning and uses a unified representation (same-sized vectors) for all features. We propose the Tuple-Circle loss to generalize circle loss to cross-modality feature learning. In a cross-modality setting, samples from different modalities can share some common information but may also contain modality-specific information [Liu2021_ICCV, Jiang2021_LiDARNet, Bousmalis2016_NIPS]. Treating the common and modality-specific information separately can make learning easier. To learn features across K modalities, we represent each feature as a tuple whose shared component is common across modalities and whose private component is reserved for modality-specific feature learning. For inter-modal similarity measurement we use the whole tuple; for cross-modal similarity we use only the shared component. Therefore, our final loss function can be represented as:
where the two terms are the sum of similarity measurements within each modality, as described in Eq. (3), and the sum of similarity measurements across modalities, as shown in Eq. (4). In an image-point cloud cross-modality setting, we represent the features as tuples $f = (f_c, f_p)$, where $f_c$ denotes the features shared between image and point cloud and $f_p$ represents the modality-specific private features of the point cloud or the image. During training, we use one set consisting of a positive pair and negative pairs from each modality. For inter-modal learning, we have:
For cross-modality learning, we construct cross-modal positive and negative pairs by combination:
In this paper, we use cosine similarity to measure the similarity between two features. For inter-modal similarity, we use the whole feature vector to compute cosine similarity; for cross-modal similarity, we use only the shared part of the feature vector.
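As an illustration of the tuple representation (the function names and dictionary layout are our own, not from the paper), intra-modality similarity uses the full feature tuple while cross-modality similarity uses only the shared part:

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def intra_similarity(fa, fb):
    # within one modality: compare the whole tuple (shared + private parts)
    return cosine(fa["shared"] + fa["private"], fb["shared"] + fb["private"])

def cross_similarity(fa, fb):
    # across modalities: compare only the shared parts
    return cosine(fa["shared"], fb["shared"])
```

A pixel and a point can thus agree perfectly on their shared components even when their private components differ.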
We develop a dual U-Net framework (see Fig. 2) for cross-modality dense feature learning and describe the network structure next. There are several approaches to learning from LiDAR data. One is to project the LiDAR data onto a spherical plane and process it with a 2D CNN; this representation would unify both models under a 2D CNN architecture [Goodfellow2016_DeepLearning]. However, the projection loses the 3D information of the point cloud. To keep our model general and applicable to point clouds from other 3D sensors, we chose PointNet/PointNet++ for point cloud feature learning. PointNet/PointNet++ was designed to process raw point clouds directly, and several architectures for 3D data [Zhang2019_Access] follow a similar structure and can be applied to our method. We first explored the vanilla multi-scale grouping (MSG) version of PointNet++, which was designed for the point cloud semantic segmentation task [Qi2017_NIPS]. The model has an encoder consisting of farthest-point sampling and set-abstraction layers, and a decoder comprised of feature propagation layers. However, the model has difficulty producing good point-level features. A potential reason is that the feature propagation layers in the decoder create point features by interpolating neighborhood features, with no learnable weights in the interpolation, which limits the descriptive ability of the features. To overcome this limitation, we make a simple modification: we add a set-abstraction layer before each feature propagation layer. As shown in Fig. 2(b), we create new features for up-sampled points by grouping their neighborhoods in the old point cloud and passing them through a multilayer perceptron (MLP). We then concatenate the new features with the interpolated features and feed them to an MLP, as in the standard feature propagation layers described in [Qi2017_NIPS]. This modification provides more neighborhood information and learnable weights for the interpolation. For image feature learning, we propose a 2D CNN with a U-Net structure. The encoder is a ResNet with four ResNet layers [He2016_CVPR], while the decoder is an inverse of the encoder that replaces the first convolution layer with a deconvolution layer to up-sample the features. Convolutional neural networks (CNNs) are inherently unable to handle non-trivial geometric transformations due to the fixed geometric structures in their building modules [Bronstein2017_ispm]. PointNet++, on the other hand, is more flexible in modeling geometric transformations and has a broader receptive field. To make the 2D model more flexible, we replace the second convolutional layer of each ResNet block in the encoder with a deformable convolutional layer [Dai2017_ICCV]; the same modification is made in the decoder.
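To make the limitation concrete, the standard feature propagation step in PointNet++ interpolates features from the k nearest points with fixed inverse-distance weights, with no learnable parameters in the interpolation itself. A minimal sketch (names and the k=3 default are assumptions based on the PointNet++ formulation):

```python
import math

def interpolate_features(query_pts, src_pts, src_feats, k=3, eps=1e-8):
    """Inverse-distance-weighted feature interpolation, as used by a
    standard PointNet++ feature propagation layer (a sketch).
    No weight here is learnable, which is the limitation the added
    set-abstraction layer is meant to address."""
    out = []
    for q in query_pts:
        # distances from the query point to every source point
        d = [math.dist(q, p) for p in src_pts]
        idx = sorted(range(len(src_pts)), key=lambda i: d[i])[:k]
        w = [1.0 / (d[i] + eps) for i in idx]
        norm = sum(w)
        feat = []
        for c in range(len(src_feats[0])):
            feat.append(sum(w[j] * src_feats[idx[j]][c]
                            for j in range(len(idx))) / norm)
        out.append(feat)
    return out
```

A query point that coincides with a source point recovers that point's feature; points between sources get a fixed blend, regardless of what the features mean.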
IV Experimental Evaluation
In order to verify our approach, the evaluation dataset must provide image and point cloud pairs with known pixel-point correspondence (i.e., known sensor calibration). Datasets meeting this requirement exist, such as [Liao2021_ARXIV, Jiang2021_RELLIS3D]. Other works [Wang2022_p2net, Pham2019_LCD] construct datasets specifically to obtain dense correspondence between points and pixels. However, to avoid constraining our model to dense point clouds and to show the generality of our method, we use camera image and LiDAR scan pair sequences from the KITTI360 dataset [Liao2021_ARXIV]. We trained our models on sequences 0, 2, 4, 5, 6, 7, and 9 of KITTI360 and performed validation on sequence 3. During training, the image sequences were randomly cropped to a fixed size, and the LiDAR point cloud was down-sampled to a fixed number of points.
A Lambda workstation with two NVIDIA Titan RTX GPUs was used for training, and the models were implemented with the PyTorch library [Paszke2019_NIPS]. The learning rate was initialized at 0.01 and decayed by a factor of 0.985 every epoch for 100 epochs. We used the AdamW optimizer [Ilya2019_ICLR], and the total size of the feature vector was 256 for all models.
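The exponential decay schedule described above can be sketched as follows (the function name is illustrative):

```python
def decayed_lr(epoch, base_lr=0.01, gamma=0.985):
    """Learning rate after `epoch` epochs: multiplied by 0.985 each epoch."""
    return base_lr * gamma ** epoch
```

Under this schedule the learning rate falls to roughly 0.0022 by epoch 100; in PyTorch the same schedule corresponds to `torch.optim.lr_scheduler.ExponentialLR` with `gamma=0.985`.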
Since our goal is to find unified cross-modal features between the two modalities, not to detect key points for matching, we evaluate our features by randomly sampling points from the two modalities and calculating the percentage of correct matches. For inter-modal matching, we use the full features to compute cosine similarity; for cross-modal matching, we use only the shared part of the features. In the remainder of the paper, we report the inter-modal matching accuracy of the image and of the point cloud, the cross-modal matching accuracy using full feature vectors, and the cross-modal matching accuracy using shared feature vectors.
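The matching-accuracy metric can be sketched as follows (feature values and helper names are illustrative): a sampled feature is matched correctly when its most cosine-similar candidate in the other set is its ground-truth counterpart.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def matching_accuracy(feats_a, feats_b):
    """feats_a[i] and feats_b[i] describe the same physical location.
    A match counts as correct when the most similar feature in feats_b
    is the ground-truth counterpart."""
    correct = 0
    for i, fa in enumerate(feats_a):
        best = max(range(len(feats_b)), key=lambda j: cosine(fa, feats_b[j]))
        correct += (best == i)
    return correct / len(feats_a)
```

For inter-modal accuracy the full vectors are passed in; for cross-modal accuracy only the shared parts are.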
IV-C1 Tuple-Circle Loss vs. Circle Loss
In our preliminary tests, other widely used contrastive loss functions [Chopra2005_CVPR, Gutmann2010_PMLR, Van2018_arXiv] readily allow the two models for the different modalities to learn good inter-modal features when trained simultaneously; however, the two models fail to converge, or converge only slowly, on cross-modal features. Thus, we compare the proposed Tuple-Circle loss function with the Circle loss function [Sun2020_Circleloss, Wang2022_p2net]. Fig. 3 shows how the accuracy changes during training. The three lines in Fig. 3 denote the shared-feature cross-modal accuracy of the Tuple-Circle loss (blue) and the Circle loss (green), and the full-feature cross-modal accuracy of the Tuple-Circle loss. The Tuple-Circle loss converges faster than the Circle loss and reaches better results. Another interesting observation for the Tuple-Circle loss is that the full features can also be used to match across modalities and show better performance in the early stage; after some epochs, however, the full features stop improving and are outperformed by the shared features. Table I compares different model settings. The rows comparing the two losses show that the Tuple-Circle loss yields better features for both inter- and cross-modality matching. However, cross-modal matching accuracy is much lower than inter-modal matching accuracy. Fig. 5 shows that more than 40% of the mismatched cross-modal pairs are less than 1.5 pixels apart after projecting the LiDAR point cloud onto the image.
IV-C2 Model Comparison
We also test the performance of the different model setups described in Section III-B on feature learning. Table I shows the comparison of different model settings; DCN denotes the use of deformable convolution layers in the 2D model, and ASFP denotes the use of the set-abstraction forward propagation. The results show that ASFP layers help learn better point cloud features and cross-modality features (highlighted in bold in Table I). According to [Wang2022_p2net], a broader receptive field helps to learn inter- and cross-modal features, and a deformable convolution layer can provide a larger and more flexible receptive field [Dai2017_ICCV]. However, in our experiments the DCN layers did not help learn cross-modal features.
IV-C3 Shared Feature Size Comparison
We also studied the effect of the size of the shared features on the performance of feature learning. The results are presented in Table II. Interestingly, the inter-modal matching results improve after introducing the Tuple-Circle loss. However, the cross-modal matching results did not change much with the size of the shared features. We hypothesize that the shared information learned across modalities does not have very high entropy.
What do the two models learn across the different modalities? To answer this question and visualize the features, we use the cosine k-means clustering method [Arindam2005_JMLR]. We first use the full image features as input to cosine k-means and classify them into 200 clusters; the results are shown in Fig. 4(b). We perform the same clustering on the full features of the image's corresponding point cloud and show the result in Fig. 4(e). As shown, we cannot find many similarities between Fig. 4(b) and (e). Fig. 4(e) looks more like noise due to the low resolution of the point cloud, whereas Fig. 4(b) shows a clear pattern. The pattern exhibits more texture properties than geometric properties, as has been shown in [Robert2018_CoRR, Leng2019_access]. For the shared features, we first concatenate the features from both modalities into one dataset and classify them into 200 clusters using cosine k-means, and then plot the clustering results of the pixels and points in Fig. 4(c) and (f), respectively. As we can see, the image and point cloud clusters overlap. Moreover, the clustering results show position-related properties. In Fig. 4(d), we show a clustering of image pixel positions: we use all pixel positions in an image as a dataset and perform standard k-means on it, yielding an image close to a Voronoi diagram. Comparing Fig. 4(d) and (c), we can observe some similarities between the two figures. Fig. 4(c) suggests that the shared features encode the 2D position information shown in Fig. 4(d) and the 3D depth information of the corresponding point cloud shown in Fig. 4(f). We also analyze the erroneous matches across modalities based on the clustering results. Fig. 6(a) and (b) show some of the mismatches. From the visualization, most mismatches happen within the same cluster or on the border between two clusters.
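A minimal sketch of cosine k-means as we understand it (assign each feature to the centroid with the highest cosine similarity over L2-normalized vectors, then re-normalize the centroid means); all names are illustrative:

```python
import math
import random

def normalize(v):
    """Project a vector onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine_kmeans(feats, k, iters=20, seed=0):
    """K-means with cosine similarity: on unit vectors, the cosine is
    just a dot product, and centroids are re-normalized means."""
    rng = random.Random(seed)
    data = [normalize(f) for f in feats]
    centers = [list(c) for c in rng.sample(data, k)]
    assign = [0] * len(data)
    for _ in range(iters):
        # assign each feature to the most similar centroid
        for i, x in enumerate(data):
            assign[i] = max(
                range(k),
                key=lambda c: sum(a * b for a, b in zip(x, centers[c])))
        # recompute centroids as normalized means of their members
        for c in range(k):
            members = [data[i] for i in range(len(data)) if assign[i] == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[c] = normalize(mean)
    return assign
```

The cluster labels returned this way are what get color-mapped onto the image pixels and projected points in the visualizations.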
This paper proposes the Tuple-Circle loss function for cross-modality feature learning, which learns common features across different sensor modalities. Our results show that the proposed Tuple-Circle loss allows faster and better convergence of the models. Furthermore, we develop a variant of the PointNet++ architecture for point clouds that achieves better inter-modal and cross-modal matching, and a variant U-Net CNN architecture for image feature learning, and we study the effectiveness of deformable convolutional layers in our evaluation setting. Using cosine k-means clustering, we visualize the learned features and show that our method allows the two models of different modalities to learn geometrically meaningful common features.