SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-Training for Spatial-Aware Visual Representations

Pre-training has become a standard paradigm in many computer vision tasks. However, most of the methods are generally designed on the RGB image domain. Due to the discrepancy between the two-dimensional image plane and the three-dimensional space, such pre-trained models fail to perceive spatial information and serve as sub-optimal solutions for 3D-related tasks. To bridge this gap, we aim to learn a spatial-aware visual representation that can describe the three-dimensional space and is more suitable and effective for these tasks. To leverage point clouds, which are far superior to images in providing spatial information, we propose a simple yet effective 2D Image and 3D Point cloud Unsupervised pre-training strategy, called SimIPU. Specifically, we develop a multi-modal contrastive learning framework that consists of an intra-modal spatial perception module to learn a spatial-aware representation from point clouds and an inter-modal feature interaction module to transfer the capability of perceiving spatial information from the point cloud encoder to the image encoder. Positive pairs for the contrastive losses are established by a matching algorithm and the projection matrix. The whole framework is trained in an unsupervised end-to-end fashion. To the best of our knowledge, this is the first study to explore contrastive learning pre-training strategies for outdoor multi-modal datasets containing paired camera images and LIDAR point clouds. Codes and models are available at https://github.com/zhyever/SimIPU.


1 Introduction

Large-scale models have achieved significant success in deep learning, where fine-tuning after pre-training has become a well-established and commonly used paradigm, such as ELMo Peters et al. (2018), GPT Brown et al. (2020), and BERT Devlin et al. (2018) in NLP. As for computer vision, benefiting from the massive amount of labeled data, supervised pre-trained models on ImageNet Deng et al. (2009) have long dominated. In recent years, unsupervised pre-training strategies have drawn much more attention. Various successful methods He et al. (2020); Chen et al. (2020); Grill et al. (2020) have achieved comparable or better results than supervised pre-trained ones on a few 2D tasks, including image classification, object detection, and image segmentation Jaiswal et al. (2021). However, all these methods are designed on the two-dimensional image plane, which leaves a large discrepancy with the three-dimensional space. As a result, such pre-trained models cannot perceive spatial information and demonstrate limited performance improvement on 3D-related downstream tasks. Therefore, learning a spatial-aware representation that can describe the three-dimensional space is essential.

Figure 1: Motivation. SimIPU is designed on both the 2D image plane and the 3D space. The intra-modal module learns a spatial-aware representation from point clouds. The inter-modal module transfers the capability of extracting spatial-aware representations to the image feature extractor. 'CL' is the abbreviation of contrastive learning.

Compared to images, point clouds are far superior in providing spatial information Qi et al. (2017), which may make them more suitable for learning such representations. PointContrast Xie et al. (2020) is the first study to explore pre-training strategies for point clouds. It utilizes different scene views to generate positive pairs and adopts a PointInfoNCE loss to learn useful dense/local representations. Motivated by the success of 2D image and 3D point cloud pre-training, Pri3D Hou et al. (2021b) proposes a geometric prior contrastive loss to imbue image representations with prior 3D information. However, there is no intra-modal constraint on point clouds, which can lead to trivial solutions. Furthermore, all of these methods focus on indoor RGB-D data, where point clouds are reconstructed from depth values. As for outdoor scenes, the point clouds provided by LIDAR contain more noise and massive background points. They also lack point-to-point correspondences, which makes the design of pre-training methods tougher.

In this paper, we develop a simple yet effective 2D image and 3D point cloud unsupervised pre-training framework for outdoor multi-modal data (i.e., paired images and LIDAR point clouds) to learn spatial-aware visual representations. To solve the aforementioned problems, our method explicitly imposes a contrastive loss on point-cloud features to ensure that the model learns spatial-aware representations. We harness more robust and informative global features and apply the Hungarian algorithm and the projection matrix to establish the matching correspondences. To the best of our knowledge, this is the first study to explore pre-training strategies for outdoor multi-modal data.

Specifically, the framework consists of an intra-modal spatial perception module and an inter-modal feature interaction module. In terms of the spatial perception module, we adopt an intra-modal contrastive learning method to learn spatial-aware representations with point clouds. We utilize a global transformation to yield two views of a point cloud and adopt an encoder to extract global features. Then, the Hungarian algorithm is applied to establish the matching correspondences between the downsampled points, and a contrastive loss pulls the features of matched points closer. The equivariance learned from the random geometric transformation leads to a spatial-aware representation. As for the feature interaction module, we adopt a similar contrastive learning strategy to transfer prior spatial knowledge from the point-cloud encoder to the image encoder. To this end, positive pairs between image features and point-cloud global features are established through the projection from LIDAR to camera. Since both features are global representations, alignment is achieved on the fly. Benefiting from the inter-modal interaction, the image encoder gradually acquires the capability of extracting spatial-aware representations. In the pre-training stage, the whole framework is trained in an unsupervised end-to-end fashion. Our contributions are summarized as follows:

  • We propose a simple yet effective pre-training method, termed SimIPU. It exploits the advantages of massive unlabeled multi-modal data to learn spatial-aware visual representations that further improve model performance on downstream tasks. To the best of our knowledge, this is the first study to explore contrastive learning pre-training strategies for outdoor multi-modal data.

  • We develop a Multi-Modal Contrastive Learning framework, which consists of an intra-modal spatial perception module and an inter-modal feature interaction module.

  • Our method significantly outperforms other pre-training counterparts when transferring the models to 3D-related downstream tasks, including 3D object detection (2.3% AP), monocular depth estimation (0.12m RMSE) and monocular 3D object detection (0.6% AP).

2 Related Work

(a) Spatial Perception Module
(b) Feature Interaction Module
(c) Contrastive Losses
Figure 2: Framework of SimIPU. Matched pairs are shown in the same color. The whole framework is trained in an end-to-end manner. (a) Intra-Modal Spatial Perception Module: We utilize set abstraction layers to extract global point cloud features and downsample points (shown in color) from different views. The Hungarian algorithm is applied to match the downsampled points according to their locations. (b) Inter-Modal Feature Interaction Module: We adopt a standard ResNet-50 to extract global image features. The projection matrix from the point cloud to the image plane establishes the association between positive pairs. (c) Contrastive Losses: Contrastive losses are applied to pull the features of matched pairs closer.

2.1 2D Self-supervised Representation Learning

Pretext tasks and contrastive learning are two key ingredients of 2D self-supervised representation learning. A wide range of tasks have been designed to learn useful visual representations, including colorization Zhang et al. (2016), inpainting Pathak et al. (2016), spatial jigsaw puzzles Noroozi and Favaro (2016), and orientation discrimination Gidaris et al. (2018). Although the improvement is limited, these methods show that performance gains can be achieved from pre-training strategies. SimCLR and SimCLR v2 Chen et al. (2020) make a breakthrough: they propose an instance discrimination pretext task, where a contrastive loss pushes away the feature distances of different instances. MoCo and its improved version MoCo v2 He et al. (2020) further utilize a memory bank to alleviate the constraint of a large batch size. Beyond contrastive learning, BYOL Grill et al. (2020) relies only on positive pairs but does not collapse when a momentum encoder is used. All these methods are designed on the image plane and can be sub-optimal solutions for 3D-related downstream tasks. To meet the need for spatial-aware representations, some methods propose to learn representations from videos by using ego-motion as a supervisory signal Jayaraman and Grauman (2015); Agrawal et al. (2015); Lee et al. (2019) or self-supervised depth estimation Jiang et al. (2018). In this paper, we aim to further explore contrastive pre-training strategies following the successful trend of contrastive learning.

2.2 3D Self-supervised Representation Learning

Inspired by the success of 2D self-supervised representation learning, PointContrast Xie et al. (2020) introduces a contrastive pretext task in the 3D domain. Driven by the properties of indoor point cloud data, the same points in different frames compose positive pairs for contrastive learning. To make full use of the point cloud data, Contrastive Scene Contexts Hou et al. (2021a) adopts a ShapeContext descriptor to divide the scene, which provides more negative pairs for contrastive learning and improves the effectiveness of pre-trained models. CoCoNets Lal et al. (2021) further explores self-supervised learning of amodal 3D feature representations agnostic to object and scene semantic content. The above methods focus on indoor RGB-D data. As for outdoor LIDAR point clouds, Pillar-Motion Luo et al. (2021) proposes a self-supervised pillar representation learning method that makes use of the optical flow extracted from camera images. Since LIDAR point clouds lack point-to-point correspondences, there are fewer contrastive learning methods for pre-training.

2.3 Multi-modal Representation Learning

Much effort has been devoted to multi-modal representation learning. Based on paired image and text, Yuan et al. (2021) propose a unified multi-modality contrastive learning framework to learn useful visual representations. Motivated by the success of 2D image and 3D point cloud pre-training, Pri3D Hou et al. (2021b) further explores multi-modal pre-training methods to enhance visual representation learning with indoor RGB-D data. Dense/local representations are learned, which boosts the performance of downstream tasks. However, there is no intra-modal constraint on point clouds, which can lead to trivial solutions and limit the performance improvement. Conversely, Liu et al. (2021) propose a method to imbue 3D representations with image priors. All these methods motivate us to further explore multi-modal pre-training strategies for more challenging outdoor multi-modal data.

2.4 3D Visual Tasks

We utilize three 3D visual tasks to evaluate the effectiveness of our method: fusion-based 3D object detection, monocular depth estimation, and monocular 3D object detection. Fusion-based 3D object detection methods Zhang et al. (2020); Sindagi et al. (2019); Liang et al. (2019); Wang et al. (2020) combine the image and point cloud modalities and learn the interaction between them. By exploiting multi-modal data, they achieve satisfactory results compared to methods that only utilize point cloud data. In this paper, we mainly focus on this task and design extensive experiments to show the effectiveness of our method. For monocular depth estimation and monocular 3D object detection, two challenging 3D-related visual tasks on a single image modality, many methods Eigen and Fergus (2015); Bhat et al. (2021); Lyu et al. (2020); Wang et al. (2021) have been proposed to improve model performance. We conduct experiments on these two tasks, which only utilize the single image modality, to further evaluate the generalization of SimIPU.

3 Method

In this section, we introduce our self-supervised pre-training pipeline in detail. First, to motivate the necessity of this multi-modal method, we conduct a pilot study to determine what kind of pre-training strategy is needed for the downstream task that we mainly focus on: fusion-based 3D object detection (Section 3.1). Then, we introduce our multi-modal self-supervised pre-training framework, including an Intra-Modal Spatial Perception module (Section 3.2) and an Inter-Modal Feature Interaction module (Section 3.3). The overview of the proposed framework is shown in Fig. 2.

3.1 Pilot Study: Is 2D pre-training Useful?

Previous fusion-based 3D object detection methods utilize different kinds of pre-trained 2D feature extractors to initialize the backbone, including models trained from scratch, ImageNet supervised classification pre-trained models in Chen et al. (2017); Liang et al. (2019); Wang et al. (2020), and 2D detection pre-trained models in Sindagi et al. (2019). However, there has been no discussion of which pre-training strategy can further improve model performance on 3D object detection. To fill this gap, we conduct a pilot study to assess the effect of different pre-trained models on fusion-based 3D object detection.

We adopt the state-of-the-art fusion-based 3D object detection method MVX-Net Sindagi et al. (2019) with Moca Zhang et al. (2020) as our baseline and train models on the KITTI Geiger et al. (2012) dataset. We only change the pre-trained 2D feature extractor weights used for initialization and keep all other training settings the same. The results are shown in Fig. 3. Critically, one can observe that the pre-trained models cannot improve the performance of the downstream task. These results suggest a discrepancy between the two-dimensional image plane and the three-dimensional space: all these pre-training methods are designed in the 2D domain, which leads to sub-optimal solutions for the fusion-based 3D task.

Is there any way to learn a spatial-aware visual representation that is beneficial to 3D tasks? Without any labels, it is extremely difficult to directly learn such representations from the single image modality. Driven by massive multi-modal data, we can achieve it indirectly via multi-modal contrastive learning. Specifically, we propose an intra-modal spatial perception module to learn a spatial-aware representation from point clouds and an inter-modal feature interaction module to transfer the capability of space perception to the image encoder.

Figure 3: Pilot Study Results. In KITTI 3D object detection experiments, we adopt different pre-trained models, such as A. Scratch, B. 2D detection on CityScapes, C. 2D detection on KITTI, D. Supervised pre-training on ImageNet, E. MoCo-v2 on ImageNet, F. DenseCL on ImageNet, and G. Our method on KITTI, to initialize the backbones.

3.2 Intra-Modal Spatial Perception Module

We design the intra-modal contrastive learning module to pre-train a spatial-aware global representation with point clouds. The framework is shown in Fig. 2(a). A key observation is that features at the same location in different views should be similar.

To yield two different views of a point cloud, we sample a random 3D geometric transformation $\mathcal{T}$ to transform a given point cloud $P^{v_1}$ into another view $P^{v_2}$:

$$P^{v_2} = \mathcal{T}(P^{v_1}), \qquad P^{v_1}, P^{v_2} \in \mathbb{R}^{N \times C}, \tag{1}$$

where $N$ is the point number in a scene and $C$ is the channel of the raw point features, which normally includes the 3D location and the reflection rate. The superscripts $v_1$ and $v_2$ indicate the two different views. In this work, we mainly consider the rigid transformation $\mathcal{T}$, including rotation, translation, and scaling.
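The following minimal sketch shows how such a second view can be generated. It is our illustration rather than the released code, and the transformation ranges (rotation angle, scaling factor, translation offset) are assumptions made for demonstration only.

```python
import numpy as np

def random_rigid_transform(points: np.ndarray) -> np.ndarray:
    """points: (N, C) array whose first three channels are XYZ; extra channels
    (e.g., reflection rate) are carried over unchanged."""
    xyz, extra = points[:, :3], points[:, 3:]

    # Random rotation around the vertical (z) axis.
    theta = np.random.uniform(-np.pi / 4, np.pi / 4)
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0,            0.0,           1.0]])

    # Random global scaling and translation.
    scale = np.random.uniform(0.95, 1.05)
    trans = np.random.uniform(-0.5, 0.5, size=(1, 3))

    xyz_t = scale * (xyz @ rot.T) + trans
    return np.concatenate([xyz_t, extra], axis=1)

# view_2 = random_rigid_transform(view_1)  # view_1: the raw (N, C) point cloud
```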

Having constituted the two different views, we extract global point cloud features with a PointNet++ Qi et al. (2017) encoder. Specifically, we apply several set abstraction layers for downsampling and extracting global context representations, which can be written as:

$$\{(p_i^{v}, f_i^{v})\}_{i=1}^{M} = E_P(P^{v}), \qquad v \in \{v_1, v_2\}, \tag{2}$$

where $p_i^{v}$ and $f_i^{v}$ are the location and feature of the $i$-th downsampled point, $M$ is the number of downsampled points, and $E_P$ is the point cloud feature extractor. Note that our method is different from pre-training methods for indoor point clouds, which mainly focus on dense/local representations Xie et al. (2020); Hou et al. (2021a, b). Since outdoor data contains much more noise and massive background points, we utilize global features to enhance the quality of the extracted representations. During feature extraction, meaningless information is gradually filtered out by the random sampling strategy and the set abstraction layers. As a result, the spatial-aware representations are well preserved, which yields spatial prior knowledge to transfer to the image encoder. Furthermore, random sampling makes the same point cloud generate different sampled points, which improves the utilization of the point cloud data and the effectiveness of representation learning.

However, extracting global features introduces randomness, which leads to inevitable mismatching of the downsampled points. To construct positive pairs for contrastive learning, we utilize the Hungarian algorithm to establish the positive correspondence matching $\mathcal{M}$. The cost matrix of this bipartite assignment algorithm is computed from the $\ell_2$-norm distance between the downsampled points in the two views:

$$C_{ij} = \left\| \mathcal{T}(p_i^{v_1}) - p_j^{v_2} \right\|_2. \tag{3}$$

Here, we apply the same transformation $\mathcal{T}$ as in Eq. 1 to align the coordinates for the distance computation. The Hungarian algorithm guarantees a globally optimal matching and solves the problem of the lack of correspondences in outdoor multi-modal data.
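A minimal sketch of this matching step is given below, assuming the SciPy implementation of the Hungarian algorithm; the helper name and the way the transformation is passed are ours, not the authors'.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_downsampled_points(p1, p2, transform):
    """p1, p2: (M, 3) downsampled point locations from the two views.
    transform: the rigid transformation T used to create view 2 (Eq. 1)."""
    p1_aligned = transform(p1)                       # map view-1 points into view-2 coordinates
    cost = np.linalg.norm(p1_aligned[:, None, :] - p2[None, :, :], axis=-1)  # (M, M) L2 costs
    row_idx, col_idx = linear_sum_assignment(cost)   # globally optimal assignment
    return list(zip(row_idx, col_idx))               # matched index pairs (i, j)
```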

After establishing the correspondences, we adopt a contrastive loss to pull together the features of matched points. For each matched pair $(i, j) \in \mathcal{M}$, the point feature $f_i^{v_1}$ serves as the query $q$ and $f_j^{v_2}$ serves as the positive key $k^{+}$. We treat the remaining point features $f_k^{v_2}$ with $k \neq j$ as negative keys. We calculate the intra-modal contrastive loss as:

$$\mathcal{L}_{intra} = -\frac{1}{|\mathcal{M}|} \sum_{(i,j)\in\mathcal{M}} \log \frac{\exp(f_i^{v_1} \cdot f_j^{v_2} / \tau)}{\sum_{k} \exp(f_i^{v_1} \cdot f_k^{v_2} / \tau)}, \tag{4}$$

where $\tau$ is the temperature factor.

The intra-modal contrastive loss pulls together the features at similar locations in different views. The equivariance learned from the random geometric transformation engenders a spatial-aware representation.
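The sketch below shows one way to implement this loss in PyTorch, assuming both views provide projected point features and that the index pairs come from the matching step above; the negatives are simply the non-matching keys in the batch. Names and shapes are ours.

```python
import torch
import torch.nn.functional as F

def intra_modal_loss(f_v1, f_v2, pairs, temperature=0.07):
    """f_v1, f_v2: (M, D) projected point features from the two views."""
    idx1 = torch.tensor([i for i, _ in pairs], dtype=torch.long)
    idx2 = torch.tensor([j for _, j in pairs], dtype=torch.long)
    q = F.normalize(f_v1[idx1], dim=-1)            # queries from view 1
    k = F.normalize(f_v2[idx2], dim=-1)            # keys from view 2 (positives on the diagonal)
    logits = q @ k.t() / temperature               # (M, M) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)        # all other keys act as negatives
```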

3.3 Inter-Modal Feature Interaction Module

To equip the image encoder with the capability of perceiving 3D space, we propose the Inter-Modal Feature Interaction module, in which the image feature extractor gradually learns spatial-aware representations through inter-modal interaction. The framework is shown in Fig. 2(b).

Following most contrastive learning methods, we adopt a standard ResNet-50 as the default image backbone to extract a global feature map from a given image:

$$F = E_I(I), \tag{5}$$

where $F$, $E_I$, and $I$ are the feature map, the image feature extractor, and the input image, respectively.

Pre-train Car (%) Pedestrian (%) Cyclist (%) Overall (%)
Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard
Scratch 86.18 76.57 74.08 67.95 62.18 57.24 83.37 66.99 63.11 79.17 68.58 64.81
Ours-K 87.87 77.36 74.30 71.25 66.18 60.24 84.83 69.11 64.04 81.32 70.88 66.19
Gain +1.69 +0.79 +0.22 +3.30 +4.00 +3.00 +1.46 +2.12 +0.93 +2.15 +2.30 +1.38
MoCo-v2-IN 87.98 77.40 74.08 69.33 62.03 57.14 82.66 67.38 62.42 79.99 68.94 64.55
MoCo-v2-K 87.66 77.10 74.33 69.24 62.42 57.86 80.55 65.33 60.97 79.15 68.28 64.39
DenseCL-IN 88.11 77.56 74.62 66.56 61.42 57.08 83.86 68.81 64.74 79.51 69.26 65.48
Table 1: Camera-lidar fusion based 3D object detection fine-tuned on the KITTI validation set. We show the bounding box AP of each class in detail. ‘K’ and ‘IN’ indicate that the pre-trained models are trained on the KITTI and ImageNet datasets, respectively. Best is in bold.

Given the camera parameters $K$, we can establish the positive correspondences between the downsampled points in Eq. 2 and the image feature map in Eq. 5 through the projection matrix. Specifically, we project the downsampled points onto the image plane and sample from the image feature map to obtain the corresponding image features $g_i$:

$$g_i = \mathcal{S}\big(F, \Pi(K, p_i^{v_1})\big), \tag{6}$$

where $\Pi(K, p_i^{v_1})$ are the resulting 2D coordinates of the projected point and $\mathcal{S}$ is the sampling operator. In our work, we utilize bi-linear interpolation to sample features.
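A sketch of this projection-and-sampling step is shown below. We assume a 3x4 projection matrix that maps lidar coordinates to pixel coordinates already expressed at the feature-map resolution (in practice, full-resolution coordinates would be rescaled by the backbone stride); the bilinear sampling uses torch.nn.functional.grid_sample.

```python
import torch
import torch.nn.functional as F

def sample_image_features(points, proj, feat_map):
    """points: (M, 3) lidar points; proj: (3, 4) projection matrix;
    feat_map: (1, C, H, W) image feature map. Returns (M, C) sampled features."""
    homo = torch.cat([points, torch.ones_like(points[:, :1])], dim=1)  # (M, 4) homogeneous coords
    uvw = homo @ proj.t()                                              # (M, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]                                      # perspective division -> pixels
    H, W = feat_map.shape[-2:]
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,                    # normalize to [-1, 1], (x, y) order
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1)
    sampled = F.grid_sample(feat_map, grid.view(1, 1, -1, 2),
                            mode='bilinear', align_corners=True)       # (1, C, 1, M)
    return sampled.squeeze(0).squeeze(1).t()                           # (M, C)
```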

The above operations yield positive matches $\{(f_i^{v_1}, g_i)\}$. Similar to Eq. 4, for each matched pair, we calculate the inter-modal contrastive loss as:

$$\mathcal{L}_{inter} = -\frac{1}{|\mathcal{M}|} \sum_{i} \log \frac{\exp(g_i \cdot f_i^{v_1} / \tau)}{\sum_{k} \exp(g_i \cdot f_k^{v_1} / \tau)}. \tag{7}$$

Here, we stop the gradient of $f_i^{v_1}$ to avoid influences on the Intra-Modal Spatial Perception Module. This contrastive loss transfers the prior spatial knowledge of the lidar feature extractor to the image encoder.
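A sketch of this loss (our reconstruction of Eq. 7) is shown below; the stop-gradient on the point features is implemented with .detach(), so only the image branch is updated by this term.

```python
import torch
import torch.nn.functional as F

def inter_modal_loss(img_feats, point_feats, temperature=0.07):
    """img_feats, point_feats: (M, D) projected features of matched pairs,
    ordered so that row i of both tensors forms a positive pair."""
    q = F.normalize(img_feats, dim=-1)
    k = F.normalize(point_feats.detach(), dim=-1)   # gradient stopped on the lidar branch
    logits = q @ k.t() / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```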

Finally, we train the whole framework in an end-to-end fashion with the total loss:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{intra} + \lambda_2 \mathcal{L}_{inter}, \tag{8}$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters that balance the two parts of the loss.

Compared with methods that focus on indoor RGB-D data Hou et al. (2021b); Liu et al. (2021), which utilize a U-Net Ronneberger et al. (2015) shaped backbone to align the dense/local features extracted by the point-cloud feature extractor, our method only adopts a standard ResNet-50 encoder to extract global features. Since both modules operate on global representations, alignment is achieved on the fly. This characteristic makes our method more general to downstream tasks because it places no additional restriction on the downstream network design.

  Pre-train Vehicle L1/L2 Pedestrian L1/L2 Cyclist L1/L2 Overall L1/L2
mAP(%) mAPH(%) mAP(%) mAPH(%) mAP(%) mAPH(%) mAP(%) mAPH(%)
Scratch 65.0/61.0 64.6/60.5 67.6/62.9 58.8/54.7 64.0/61.2 61.1/58.5 65.57/61.53 61.75/57.93
Ours-W 66.5/62.4 66.1/62.0 69.4/64.7 60.5/56.3 64.7/62.3 62.3/60.0 66.92/63.01 63.18/59.47
Gain +1.5/+1.4 +1.5/+1.5 +1.8/+1.8 +1.2/+1.6 +0.7/+1.1 +1.2/+1.5 +1.35/+1.48 +1.43/+1.54
Table 2: Camera-lidar fusion based 3D object detection performance comparison on the Waymo validation set. ‘W’ indicates that the pre-trained models are trained on the Waymo dataset.

4 Experiments

In this section, we introduce our experimental settings (Section 4.1) and downstream results (Section 4.2) in detail. In Section 4.3, ablation studies verify the effectiveness of the key components of the framework and explore the influence of the pre-training data scale.

4.1 Experimental Settings

We study pre-training strategies on multi-modal datasets, including the KITTI dataset and the Waymo Open Dataset. Both datasets contain paired image and point cloud data. All experiments are based on MMDetection3D Contributors (2020).

KITTI Dataset (K).

The KITTI dataset Geiger et al. (2012) contains 7481 training images and 7518 test images, both with their corresponding point clouds. To make full use of the dataset, we utilize both the training set and the testing set to pre-train the model. Note that we filter out the validation set to avoid information leakage.

Waymo Open Dataset (W).

The Waymo dataset Sun et al. (2020) contains 0.15 million training images with corresponding point clouds. The testing set is relatively smaller than the training set, so we only use the training set to pre-train models. Because we need to project lidar points onto the image plane during the pre-training stage, for simplicity, we filter out points beyond the image field of view (FOV).

Pre-training.

In terms of the backbone settings, we use three set abstraction layers Qi et al. (2017); Yang et al. (2020) to downsample the points and extract global point cloud features. We combine two downsampling strategies to make full use of the data: the 3D Euclidean distance sampling (D-FPS) Qi et al. (2017) and the feature distance sampling (F-FPS) Yang et al. (2020). Following most contrastive learning methods, a ResNet-50 He et al. (2016) is adopted as the image feature extractor, which can also be replaced by any other image backbone. Compared with Pri3D Hou et al. (2021b), our method imposes fewer constraints on the image backbone and is more general to downstream tasks.
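For reference, a plain (non-CUDA) sketch of the D-FPS downsampling used inside the set abstraction layers is given below; F-FPS works analogously but measures distances in feature space rather than 3D space. This is a generic reference implementation, not the kernel used in practice.

```python
import numpy as np

def farthest_point_sampling(xyz: np.ndarray, n_samples: int) -> np.ndarray:
    """xyz: (N, 3) point coordinates. Returns the indices of the sampled points."""
    n = xyz.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)                    # random seed point
    for i in range(1, n_samples):
        diff = xyz - xyz[selected[i - 1]]                 # distance to the newest selected point
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        selected[i] = int(np.argmax(dist))                # take the farthest remaining point
    return selected
```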

As for the contrastive learning settings, the temperature factor in Eq. 4 and Eq. 7 is set to 0.07. Following Chen et al. (2020), we use an MLP projection head to map the feature dimension to 128-d for both the intra-modal and the inter-modal contrastive learning. We use 4096 matched pairs for faster training Xie et al. (2020). In addition, we implement MoCo-v2 He et al. (2020) on both the KITTI and Waymo datasets to make a fair comparison with other strong unsupervised methods. The data augmentation pipeline of MoCo-v2 consists of random color jittering, random gray-scale conversion, Gaussian blurring, and random horizontal flipping.
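The projection head can be as simple as the two-layer MLP sketched below (following common SimCLR-style heads; the hidden width and the example input dimensions are assumptions, since the paper only fixes the 128-d output).

```python
import torch.nn as nn

def build_projection_head(in_dim: int, hidden_dim: int = 2048, out_dim: int = 128) -> nn.Module:
    """Maps encoder features to the 128-d space used by both contrastive losses."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

# A separate head is attached to each branch, e.g.:
# img_head = build_projection_head(2048)   # ResNet-50 output channels
# pts_head = build_projection_head(512)    # assumed PointNet++ feature width
```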

Pre-train REL Sq Rel RMS RMS log δ<1.25 δ<1.25² δ<1.25³
Scratch 0.096 0.493 3.575 0.146 0.896 0.977 0.994
Ours-W 0.073 0.285 2.840 0.113 0.935 0.990 0.998
Gain -0.023 -0.208 -0.735 -0.033 +0.039 +0.013 +0.004
Super-IN 0.068 0.247 2.712 0.104 0.946 0.993 0.998
Ours-IN/W 0.067 0.235 2.592 0.102 0.949 0.993 0.998
Gain -0.001 -0.012 -0.120 -0.002 +0.003 0 0

Table 3: Monocular depth estimation performance comparison on the KITTI dataset. ‘IN/W’ indicates the double fine-tuning strategy: pre-training on ImageNet followed by our pre-training on Waymo.

4.2 Experimental Results

3D Object Detection.

We evaluate the pre-trained models by fine-tuning on the target 3D-related tasks. For early-fusion based 3D object detection, we utilize two challenging and popular datasets, i.e., KITTI and Waymo. On KITTI, we fine-tune the pre-trained models with the state-of-the-art algorithm MVX-Net Sindagi et al. (2019) with Moca Zhang et al. (2020). For the Waymo dataset, we find that Moca cannot improve the MVX-Net performance. Therefore, we only apply MVX-Net as our default protocol, which is still a strong baseline among early-fusion based methods. When evaluating, we use the standard 2x schedule, which is more effective on the Waymo dataset.

The KITTI 3D object detection performance comparison is shown in Tab. 1. We utilize the model trained from scratch as our baseline. Compared with it, our method achieves a significant 2.3% gain on the moderate overall AP. Furthermore, the 1.38% gain on the hard overall AP shows that our pre-training method improves localization accuracy. To further examine the effectiveness of the pre-training methods, we report per-class comparison results. Our fine-tuned model localizes small objects more accurately, which is reflected in the significant 4% gain on the pedestrian moderate AP compared with the other classes. More results compared with other state-of-the-art pre-training methods can be found in Fig. 3 and the appendix.

In Tab. 2 and Fig. 4, we report the 3D object detection results on the Waymo dataset, which is much larger than KITTI. Although the performance difference between pre-trained and scratch models is less obvious on larger datasets Xie et al. (2020), our method still achieves an improvement even when using limited data. Compared with training from scratch, our method achieves a relatively significant 1.35% mAP improvement. Similar to the results on KITTI, our fine-tuned model localizes small objects more accurately. We provide more results in the appendix.

Pre-train AP (%) ATE ASE AOE AVE AAE NDS (%)
Scratch 17.90 0.92 0.30 0.84 1.33 0.19 26.27
Ours-W 26.18 0.84 0.27 0.67 1.31 0.17 33.50
Gain +8.28 - - - - - +7.23
Super-IN 27.71 0.83 0.26 0.59 1.34 0.16 35.23
Ours-IN/W 28.36 0.82 0.26 0.62 1.33 0.16 35.36
Gain +0.65 - - - - - +0.13

Table 4: Monocular 3D object detection performance comparison on the Nuscenes dataset.

KITTI Monocular Depth Estimation.

For monocular depth estimation, we design a simple yet strong baseline to evaluate model performance; more details of the baseline are provided in the appendix. We evaluate the effectiveness of our method by fine-tuning the pre-trained model on the KITTI Eigen split Eigen and Fergus (2015). All fine-tuning settings are kept the same for a fair comparison.

In Tab. 3, we report the KITTI monocular depth estimation performance. Compared with training from scratch, our method achieves great improvement on all metrics. However, we can only obtain 0.15M (million) samples to pre-train our model, far fewer than the 1M labeled images available to ImageNet supervised pre-training; therefore, there is a small performance gap between Ours-W and Super-IN. To fill this gap, we propose a simple double fine-tuning strategy: we load the pre-trained Super-IN model at the beginning of our multi-modal contrastive pre-training stage and then fine-tune the resulting model on the downstream task. As a result, our method learns useful spatial-aware visual representations while preserving part of the semantic representations, which leads to further performance gains, especially the 0.12m (4.4%) improvement on RMS.

Nuscenes Monocular 3D Object Detection.

In terms of monocular 3D object detection, FCOS3D Wang et al. (2021) is fine-tuned on the Nuscenes training set and evaluated on the Nuscenes validation set. For simplicity, we only replace the image encoder ResNet-101 in the default config of FCOS3D with ResNet-50. When evaluating, we use the standard 1x schedule. The batch size is set to 8 and synchronized batch normalization is used. All other settings are default.

We report the performance of Nuscenes monocular 3D object detection in Tab. 4. Our method achieves significant improvements on all metrics compared with training from scratch. Applying the same double fine-tuning strategy, our method further boosts the performance of the Super-IN model by 0.65% mAP.

4.3 Ablation Study

Intra-Modal Module.

Viewing the whole framework, our method can be treated as an indirect way for the image feature extractor to learn a spatial-aware visual representation. The knowledge transferred from the point cloud encoder is crucial and determines the effectiveness of our method. We design the first ablation study to verify whether the intra-modal contrastive learning module has indeed learned a useful representation. The results shown in the first block of Tab. 5 indicate that the intra-modal branch is essential in the multi-modal contrastive learning framework. With the intra-modal branch, the point cloud feature extractor learns a spatial-aware visual representation and transfers it to the image encoder via inter-modal contrastive learning, which further boosts the performance on 3D-related tasks.

Pre-train Overall Easy (%) Overall Mod. (%) Overall Hard (%)
Scratch (equivalent to λ1 = 0, λ2 = 0) 79.17 68.58 64.81
SimIPU w/o intra-module (λ1 = 0, λ2 = 1) 79.36 69.19 65.17
SimIPU w/o inter-module (λ1 = 1, λ2 = 0) 79.17 68.58 64.81
SimIPU (λ1 = 1, λ2 = 1) 81.32 70.88 66.19
Greedy Assignment 78.92 68.20 64.72
Hungarian Algorithm 81.32 70.88 66.19

Table 5: Ablation study on the intra-modal contrastive learning module and the matching strategies. We report 3D object detection results on the KITTI validation set.
Figure 4: Fusion-based 3D object detection performance comparison on Waymo dataset. SimIPU achieves comparable results with limited multi-modal pre-training data.

Matching Algorithm.

We use the Hungarian algorithm as our default assignment method, which achieves the globally optimal solution for positive pair association. In this ablation study, we compare it with another commonly used matching algorithm, greedy assignment, which repeatedly takes the locally nearest match. The experimental results are shown in the last block of Tab. 5. Greedy assignment hampers the downstream performance, likely because of poor-quality matches, which hurt the effectiveness of intra-modal contrastive learning, an essential ingredient for learning useful representations for downstream tasks.

Number of Pre-training Data.

In this ablation experiment, we use different amounts of training data to pre-train models on the Waymo dataset and fine-tune them on both the monocular depth estimation and the monocular 3D object detection tasks to explore the influence of the amount of pre-training data on downstream tasks. The results are shown in Fig. 5. One can easily observe that the performance on downstream tasks improves significantly as the amount of pre-training data increases, in line with our intuition that a larger pre-training dataset further boosts downstream performance. Note that we only use 0.15M unlabeled samples to pre-train our model. Compared with Super-IN, which utilizes 1M labeled images for pre-training, our method shows significant competitiveness. The trends of the curves suggest that our method would surpass Super-IN with more pre-training data. In addition, with the simple double fine-tuning strategy, our method further boosts the performance of the baseline models, which indicates the compatibility and generalization of our method.

Figure 5: Ablation study on different pre-training data scale. We report the RMSE values on KITTI monocular depth estimation (left) and the AP results on Nuscenes monocular 3D object detection (right).

5 Conclusion

In this paper, we propose SimIPU, a simple yet effective 2D image and 3D point cloud unsupervised pre-training method, and develop a multi-modal contrastive learning framework to learn spatial-aware visual representations for 3D-related tasks in outdoor environments. This method fills the blank of pre-training methods for outdoor multi-modal datasets and achieves significant performance gains on different 3D-related downstream tasks, including fusion-based 3D object detection, monocular depth estimation, and monocular 3D object detection, with a limited amount of multi-modal pre-training data. A limitation of this method is that it focuses only on spatial-aware visual representations while ignoring semantic information; notwithstanding this limitation, our approach still shows strong generalization and effectiveness on downstream tasks. In the long term, associating both spatial and semantic information would be a fruitful area for further work, and we will investigate more effective methods to achieve this purpose. We hope our work will encourage more research on visual representation learning and on the design of suitable cross-modal pre-training paradigms.

6 Acknowledgments

The research was supported by the National Natural Science Foundation of China (61971165, 61922027), in part by the Fundamental Research Funds for the Central Universities (FRFCU5710050119). The authors would like to thank Jinghuai Zhang from the Computer Science Department of Duke University for helpful discussions on topics related to this work.

References

  • P. Agrawal, J. Carreira, and J. Malik (2015) Learning to see by moving. In International Conference on Computer Vision (ICCV), pp. 37–45. Cited by: §2.1.
  • S. F. Bhat, I. Alhashim, and P. Wonka (2021) Adabins: depth estimation using adaptive bins. In Computer Vision and Pattern Recognition (CVPR), pp. 4009–4018. Cited by: Appendix C, §2.4.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems (NIPS), Vol. 33, pp. 1877–1901. Cited by: §1.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), pp. 1597–1607. Cited by: §1, §2.1, §4.1.
  • X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3D object detection network for autonomous driving. In Computer Vision and Pattern Recognition (CVPR), pp. 1907–1915. Cited by: §3.1.
  • M. Contributors (2020) MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. Note: https://github.com/open-mmlab/mmdetection3d Cited by: §4.1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 4171–4186. Cited by: §1.
  • D. Eigen and R. Fergus (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In International Conference on Computer Vision (ICCV), pp. 2650–2658. Cited by: Appendix C, §2.4, §4.2.
  • A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361. Cited by: §3.1, §4.1.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), Vol. 1. Cited by: §2.1.
  • J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §1, §2.1.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738. Cited by: §1, §2.1, §4.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §4.1.
  • J. Hou, B. Graham, M. Nießner, and S. Xie (2021a) Exploring data-efficient 3D scene understanding with contrastive scene contexts. In Computer Vision and Pattern Recognition (CVPR), pp. 15587–15597. Cited by: §2.2, §3.2.
  • J. Hou, S. Xie, B. Graham, A. Dai, and M. Nießner (2021b) Pri3D: can 3d priors help 2d representation learning?. In International Conference on Computer Vision (ICCV), Cited by: Appendix D, §1, §2.3, §3.2, §3.3, §4.1.
  • A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon (2021) A survey on contrastive self-supervised learning. Technologies 9 (1), pp. 2. Cited by: §1.
  • D. Jayaraman and K. Grauman (2015) Learning image representations tied to ego-motion. In International Conference on Computer Vision (ICCV), pp. 1413–1421. Cited by: §2.1.
  • H. Jiang, G. Larsson, M. M. G. Shakhnarovich, and E. Learned-Miller (2018) Self-supervised relative depth learning for urban scene understanding. In European Conference on Computer Vision (ECCV), pp. 19–35. Cited by: §2.1.
  • S. Lal, M. Prabhudesai, I. Mediratta, A. W. Harley, and K. Fragkiadaki (2021) CoCoNets: continuous contrastive 3D scene representations. In Computer Vision and Pattern Recognition (CVPR), pp. 12487–12496. Cited by: §2.2.
  • S. Lee, J. Kim, T. Oh, Y. Jeong, D. Yoo, S. Lin, and I. S. Kweon (2019) Visuomotor understanding for representation learning of driving scenes. In The British Machine Vision Conference (BMVC), Cited by: §2.1.
  • M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun (2019) Multi-task multi-sensor fusion for 3D object detection. In Computer Vision and Pattern Recognition (CVPR), pp. 7345–7353. Cited by: §2.4, §3.1.
  • Y. Liu, Y. Huang, H. Chiang, H. Su, Z. Liu, C. Chen, C. Tseng, and W. H. Hsu (2021) Learning from 2D: pixel-to-point knowledge transfer for 3D pretraining. arXiv preprint arXiv:2104.04687. Cited by: §2.3, §3.3.
  • C. Luo, X. Yang, and A. Yuille (2021) Self-supervised pillar motion learning for autonomous driving. In Computer Vision and Pattern Recognition (CVPR), pp. 3183–3192. Cited by: §2.2.
  • X. Lyu, L. Liu, M. Wang, X. Kong, L. Liu, Y. Liu, X. Chen, and Y. Yuan (2020) HR-Depth: high resolution self-supervised monocular depth estimation. In AAAI Conference on Artificial Intelligence (AAAI). Cited by: §2.4.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision (ECCV), pp. 69–84. Cited by: §2.1.
  • D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Computer Vision and Pattern Recognition (CVPR), pp. 2536–2544. Cited by: §2.1.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Vol. 1, pp. 2227–2237. Cited by: §1.
  • C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NIPS), Vol. 30. Cited by: §1, §3.2, §4.1.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241. Cited by: §3.3.
  • V. A. Sindagi, Y. Zhou, and O. Tuzel (2019) Mvx-net: multimodal voxelnet for 3D object detection. In International Conference on Robotics and Automation (ICRA), pp. 7276–7282. Cited by: §2.4, §3.1, §3.1, §4.2.
  • P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al. (2020) Scalability in perception for autonomous driving: Waymo open dataset. In Computer Vision and Pattern Recognition (CVPR), pp. 2446–2454. Cited by: §4.1.
  • G. Wang, B. Tian, Y. Zhang, L. Chen, D. Cao, and J. Wu (2020) Multi-view adaptive fusion network for 3D object detection. arXiv preprint arXiv:2011.00652. Cited by: §2.4, §3.1.
  • T. Wang, X. Zhu, J. Pang, and D. Lin (2021) FCOS3D: fully convolutional one-stage monocular 3D object detection. In International Conference on Computer Vision Workshops (ICCVW), Cited by: §2.4, §4.2.
  • S. Xie, J. Gu, D. Guo, C. R. Qi, L. Guibas, and O. Litany (2020) PointContrast: unsupervised pre-training for 3D point cloud understanding. In European Conference on Computer Vision (ECCV), pp. 574–591. Cited by: Appendix B, §1, §2.2, §3.2, §4.1, §4.2.
  • Z. Yang, Y. Sun, S. Liu, and J. Jia (2020) 3DSSD: point-based 3D single stage object detector. In Computer Vision and Pattern Recognition (CVPR), pp. 11040–11048. Cited by: §4.1.
  • X. Yuan, Z. Lin, J. Kuen, J. Zhang, Y. Wang, M. Maire, A. Kale, and B. Faieta (2021) Multimodal contrastive training for visual representation learning. In Computer Vision and Pattern Recognition (CVPR), pp. 6995–7004. Cited by: §2.3.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European Conference on Computer Vision (ECCV), pp. 649–666. Cited by: §2.1.
  • W. Zhang, Z. Wang, and C. Change Loy (2020) Multi-modality cut and paste for 3D object detection. arXiv preprint arXiv:2012.12741. Cited by: Appendix D, §2.4, §3.1, §4.2.

Appendix A KITTI 3D Object Detection Results

As shown in Tab. 6, we report more results on KITTI 3D object detection. In the pilot study, we utilize different pre-trained models, such as supervised 2D object detection pre-trained models, supervised classification pre-trained models, and other state-of-the-art unsupervised pre-trained models, to initialize the image backbone. Our method achieves a significant improvement, which indicates its effectiveness.

Pre-train Car (%) Pedestrian (%) Cyclist (%) Overall (%)
Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard
Scratch 86.18 76.57 74.08 67.95 62.18 57.24 83.37 66.99 63.11 79.17 68.58 64.81
MoCo-v2-IN 87.98 77.40 74.08 69.33 62.03 57.14 82.66 67.38 62.42 79.99 68.94 64.55
MoCo-v2-K 87.66 77.10 74.33 69.24 62.42 57.86 80.55 65.33 60.97 79.15 68.28 64.39
DenseCL-IN 88.11 77.56 74.62 66.56 61.42 57.08 83.86 68.81 64.74 79.51 69.26 65.48
DenseCL-Co 88.29 77.46 74.47 65.53 59.42 53.33 83.59 68.86 63.72 79.14 68.58 63.84
Det-K 88.23 77.54 74.82 65.94 60.04 55.46 80.82 65.74 61.01 78.33 67.77 63.76
Det-CS 87.68 77.29 74.83 67.44 61.04 56.14 83.26 67.39 62.70 79.46 68.73 64.52
Super-IN 88.38 77.49 74.94 67.20 60.06 55.77 85.07 67.47 62.60 80.21 68.34 64.44
Ours-K 87.87 77.36 74.30 71.25 66.18 60.24 84.83 69.11 64.04 81.32 70.88 66.19
Table 6: Camera-lidar fusion based 3D object detection fine-tuned on the KITTI validation set. ‘K’, ‘CS’, ‘Co’ and ‘IN’ indicate that the pre-trained models are trained on the KITTI, Cityscapes, COCO and ImageNet datasets, respectively. ‘Det’ represents the 2D detection task.

Pre-train Vehicle L1/L2 Pedestrian L1/L2 Cyclist L1/L2 Overall L1/L2
mAP(%) mAPH(%) mAP(%) mAPH(%) mAP(%) mAPH(%) mAP(%) mAPH(%)
Scratch 65.0/61.0 64.6/60.5 67.6/62.9 58.8/54.7 64.0/61.2 61.1/58.5 65.57/61.53 61.75/57.93
Super-IN 66.0/61.9 65.5/61.5 69.5/64.8 60.9/56.7 63.7/61.3 60.4/58.1 66.41/62.66 62.26/58.76
MoCo-v2-IN 66.1/62.0 65.6/61.6 69.0/64.3 60.0/55.8 63.8/61.0 60.9/58.3 66.30/62.43 62.16/58.56
MoCo-v2-W 65.5/61.5 65.1/60.7 68.6/63.9 59.1/54.9 64.0/61.2 61.1/58.5 66.03/62.16 61.76/58.03
Ours-W 66.5/62.4 66.1/62.0 69.4/64.7 60.5/56.3 64.7/62.3 62.3/60.0 66.92/63.01 63.18/59.47
Table 7: Camera-lidar fusion based 3D object detection performance comparison on the Waymo validation set. ‘W’ indicates that the pre-trained models are trained on the Waymo dataset.
Method Backbone REL RMS δ<1.25
Our Baseline ResNet-50 0.068 2.712 0.946
Adabins Efficient-B5 0.058 2.360 0.964
Table 8: Depth Estimation Baseline Comparison.

Appendix B Waymo 3D Object Detection Results

As shown in Tab. 7, we report more results on Waymo 3D object detection. The Waymo dataset is much larger than KITTI. Although the performance difference between pre-trained and scratch models is less obvious on larger datasets Xie et al. (2020), our method still achieves a promising improvement even with limited data.

Appendix C Depth Estimation Baseline Design

In this paper, we design a simple depth estimation baseline to evaluate the effectiveness of pre-trained models. The UNet-shaped framework contains two components: 1) a standard ResNet-50 encoder and 2) an upsampling decoder Bhat et al. (2021). Skip connections are applied to consolidate feature maps from higher resolutions, and the network outputs depth maps at half the spatial resolution. After upsampling the predicted depth maps to the same resolution as the input images, we utilize the scale-invariant (SI) loss introduced by Eigen et al. Eigen and Fergus (2015) to train the model:

$$\mathcal{L}_{SI} = \alpha \sqrt{\frac{1}{n}\sum_{i} g_i^2 - \frac{\lambda}{n^2}\Big(\sum_{i} g_i\Big)^2}, \qquad g_i = \log \tilde{d}_i - \log d_i, \tag{9}$$

where $\tilde{d}_i$ is the predicted depth, $d_i$ is the ground truth depth, and $n$ denotes the number of pixels having valid ground truth values. We use $\lambda = 0.85$ and $\alpha = 10$ for all our experiments, the same as AdaBins Bhat et al. (2021).
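A sketch of this loss in PyTorch, following our reconstruction above (the λ = 0.85 and α = 10 values follow AdaBins), is:

```python
import torch

def si_loss(pred, gt, mask, lam=0.85, alpha=10.0):
    """pred, gt: depth maps; mask: boolean tensor marking pixels with valid ground truth."""
    g = torch.log(pred[mask]) - torch.log(gt[mask])        # per-pixel log difference
    return alpha * torch.sqrt((g ** 2).mean() - lam * g.mean() ** 2)
```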

This simple baseline is strong enough to evaluate the effectiveness of our pre-training method. The comparison with the state-of-the-art depth estimation algorithm is shown in Tab. 8, where both encoders are pre-trained on ImageNet for a fair comparison.

Appendix D Pre-training Setting Details

During pre-training, we use a hybrid optimization strategy Zhang et al. (2020) to train the whole framework in an end-to-end manner. For the image branch, we use SGD as the optimizer with momentum 0.9, weight decay 0.0001, and learning rate 0.03. For the lidar branch, we use AdamW with betas (0.95, 0.99), weight decay 0.01, and learning rate 0.001. The loss weights in the total loss (Eq. 8) are experimentally set to 1. Since we find that the lidar branch converges faster than the image branch Hou et al. (2021b), the inter-modal contrastive learning can gradually transfer the spatial-aware representation to the image modality. We train our model on the KITTI dataset for 100 epochs and on the Waymo dataset for 10 epochs with a batch size of 4 on 8 NVIDIA TITAN V100 GPUs. It takes around 24 hours and 48 hours to pre-train models on KITTI and Waymo, respectively.
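The wiring of the hybrid optimization can be sketched as follows; the placeholder backbones stand in for the actual ones, and the hyper-parameters are those listed above.

```python
import torch
import torch.nn as nn
import torchvision

# Placeholders standing in for the actual image and point cloud backbones.
image_encoder = torchvision.models.resnet50()
point_encoder = nn.Sequential(nn.Linear(4, 256), nn.ReLU(), nn.Linear(256, 128))

img_optimizer = torch.optim.SGD(image_encoder.parameters(),
                                lr=0.03, momentum=0.9, weight_decay=1e-4)
pts_optimizer = torch.optim.AdamW(point_encoder.parameters(),
                                  lr=1e-3, betas=(0.95, 0.99), weight_decay=0.01)

# Each pre-training iteration steps both optimizers on the shared total loss:
#   loss = lambda_1 * loss_intra + lambda_2 * loss_inter
#   img_optimizer.zero_grad(); pts_optimizer.zero_grad()
#   loss.backward()
#   img_optimizer.step(); pts_optimizer.step()
```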

Appendix E Single-Modality Downstream Task Results

Limited by the pre-training data scale, our method cannot directly beat Super-IN when fine-tuning on the single-modality downstream tasks, i.e., monocular depth estimation and monocular 3D object detection. As shown in the ablation study, with more data, our method has a high probability of surpassing the ImageNet supervised models. To further exploit the advantages of multi-modal pre-training, we propose a simple double fine-tuning strategy. In this way, although the data scale is limited, we can fine-tune a strong baseline model (Super-IN) with our method and further improve its performance on these single-modality downstream tasks. To prove the effectiveness of our method, we need to compare it with other unsupervised pre-training methods under the same double fine-tuning strategy. However, we find that models double fine-tuned with other pre-training methods (MoCo-v2) cannot converge on these downstream tasks, which indicates that the representations learned from different pre-training strategies (supervised and unsupervised) designed on the 2D image plane may not be compatible. This phenomenon shows the effectiveness and generalization of our method.

Appendix F Generalization and Convergence

We provide loss curves in Fig. 6, which further demonstrate the effectiveness of SimIPU: our method achieves better convergence on various downstream tasks. Moreover, we fine-tune our pre-trained models on the indoor depth estimation dataset NYU-Depth-V2. The results in Tab. 9 further validate the generalization ability of our method: notably, models pre-trained by SimIPU on outdoor data can also work on indoor data.

Figure 6: Trends of training losses (drawn with TensorBoard) for Waymo 3D detection (WOD 3DDet), KITTI depth estimation (KITTI Depth), and Nuscenes monocular 3D detection (NUS Mono3D). These curves correspond one-to-one to the results in the article.
Method δ<1.25 δ<1.25² δ<1.25³ REL RMS
Scratch 0.631 0.889 0.968 0.226 0.711
Ours 0.792 0.941 0.971 0.159 0.512
Table 9: Generalization evaluation on NYU Depth (10K Subset).