Learning from 2D: Pixel-to-Point Knowledge Transfer for 3D Pretraining

04/10/2021 ∙ by Yueh-Cheng Liu, et al. ∙ 0

Most 3D networks are trained from scratch owing to the lack of large-scale labeled datasets. In this paper, we present a novel 3D pretraining method that leverages 2D networks learned from rich 2D datasets. We propose pixel-to-point knowledge transfer to effectively utilize the 2D information by mapping pixel-level and point-level features into the same embedding space. Due to the heterogeneous nature of 2D and 3D networks, we introduce a back-projection function to align the features between 2D and 3D to make the transfer possible. Additionally, we devise an upsampling feature projection layer to increase the spatial resolution of high-level 2D feature maps, which helps learning fine-grained 3D representations. With a pretrained 2D network, the proposed pretraining process requires no additional 2D or 3D labeled data, further alleviating the expensive 3D data annotation cost. To the best of our knowledge, we are the first to exploit existing 2D trained weights to pretrain 3D deep neural networks. Our extensive experiments show that 3D models pretrained with 2D knowledge boost performance across various real-world 3D downstream tasks.






1 Introduction

3D deep learning has gained a large amount of attention recently due to its wide applications, including robotics, autonomous driving, and AR/VR. Numerous state-of-the-art 3D network architectures have been proposed, showing remarkable performance improvements, including point-based methods

[50, 51, 61], efficient sparse 3D CNNs [21, 11], and hybrid point-voxel methods [38, 58]

. However, unlike 2D ImageNet supervised pretraining, most 3D neural networks are trained from scratch. Although many efforts have been made to collect 3D datasets [57, 12, 18, 6], the expensive labeling cost and diverse 3D sensing devices make it challenging to build a large-scale dataset comparable to ImageNet [14], making supervised pretraining in 3D difficult.

Figure 1: Learning from 2D as 3D pretraining. Due to the limited size of labeled data in 3D, pretraining with large unlabeled data is important. (a) Previous works [64] apply self-supervised learning on pure 3D data as pretraining. (b) We propose pretraining 3D networks by leveraging existing 2D network weights via pixel-to-point knowledge transfer. In this way, we provide the knowledge learned from rich 2D image datasets to the 3D network.

Recently, self-supervised pretraining has been proved successful in NLP [15, 52] and 2D vision [25, 4, 22, 28, 30, 40, 43, 62]. In 3D vision, PointContrast [64] first shows the opportunity of self-supervised pretraining by leveraging contrastive learning for real-world point cloud data. By learning point correspondence across two adjacent point clouds, it can generate good initial weights for other 3D downstream tasks.

In this work, we study pretraining 3D neural networks for point cloud data from a new perspective: learning from 2D pretrained networks (see Figure 1 for a conceptual comparison). PointContrast [64] successfully brings the 2D self-supervised learning paradigm into 3D but neglects 2D semantic information, for example, the semantic cues extracted by 2D CNNs. 2D labeled datasets are much larger and more diverse than 3D datasets, and 2D network architectures have been widely studied and well developed over the past few years. Therefore, we believe that the knowledge of well-trained 2D networks is valuable and informative, providing the 3D network signals to learn good initial weights without additional labeled data. Our idea is also related to previous 2D-3D fusion research [13, 10, 36], which suggests that 2D features extracted by 2D networks can complement 3D. This motivates our view that learning from 2D as pretraining brings additional information that cannot easily be learned directly from 3D data. Hence, we raise a core question:

how can we transfer a pretrained 2D network to 3D in a self-supervised fashion, given the differences in their data properties and their heterogeneous network structures?

In the cross-modal knowledge distillation framework [23], we can view the 2D neural network as the teacher and the 3D network as the student. However, unlike 2D images and depth maps [23], 2D images and 3D point clouds are not naturally aligned, and it is challenging to align intermediate network representations between 2D and 3D due to the heterogeneous network structures. Besides, we found that naively minimizing the distance between global representations across the two modalities does not work well empirically in our 2D-3D setting. Moreover, compared to previous cross-modal knowledge distillation works, we consider more challenging real-world 3D downstream tasks to demonstrate the practical advantage of pretraining.

We propose the novel pixel-to-point knowledge transfer (PPKT) for 3D pretraining. The key idea is to learn the point-level features of the 3D network from the corresponding pixel representations extracted from pretrained 2D networks. To enable the transfer, we construct pixel-point mappings and align pixel and point features with a differentiable back-projection function. To overcome the lack of pixel-level outputs in common 2D networks such as ResNet [27], we propose the learnable upsampling feature projection layer (UPL), which restores the feature map spatial resolution to the original size (see Figure 2). Our method can pretrain the 3D network with the knowledge of 2D networks without strict assumptions on the 2D or 3D network architecture, output channel size, or output spatial resolution.

Following the pretrain-finetune protocol in [64], we show that PPKT consistently boosts downstream performance across multiple real-world 3D tasks and datasets. Specifically, we adopt the state-of-the-art 3D network, SR-UNet [11], as the target network for pretraining and fine-tune it on object detection and semantic segmentation tasks on ScanNet [12], S3DIS [2], and SUN RGB-D [57, 56, 33, 63]. We achieve a +3.17 mAP@0.25 improvement on ScanNet object detection and +3.12 mIoU on S3DIS semantic segmentation compared to training from scratch. We also provide an extensive ablation study to further verify the effectiveness of our proposed method. Additionally, we study the generalization ability of PPKT by transferring self-supervised pretrained 2D network knowledge as 3D pretraining.

Our contributions can be summarized as follows:

  • We are the first to explore 3D pretraining by leveraging existing 2D pretrained knowledge for high-level, real-world 3D recognition tasks.

  • We propose the novel pixel-to-point knowledge transfer (PPKT) and the upsampling feature projection layer (UPL) which enable transferring the 2D knowledge into 3D as pretraining.

  • We show the effectiveness of our pretraining method by fine-tuning on various real-world 3D scene understanding tasks, successfully boosting the overall performance.

Figure 2: Pixel-to-point knowledge transfer (PPKT). PPKT transfers the 2D pretrained network knowledge into 3D from pixels to points. A back-projection is used to align corresponding pixel-level features and point-level features. To restore the granularity of low-resolution 2D feature maps, we propose the learnable upsampling feature projection layer (UPL). The details are described in Section 3.2.

2 Related Work

2.1 Cross-modal 2D-3D Learning

In the field of 3D computer vision, many works leverage 2D information fused with 3D data for 3D semantic segmentation

[13, 10], 3D object detection [9, 35, 48], and 3D instance segmentation [31]. The idea of fusing multi-modality data as input for neural networks comes from the observation that networks for different modalities capture complementary information [13, 10]. This inspires us that the knowledge of existing well-trained 2D networks is helpful for 3D networks. However, instead of fusing 2D and 3D deep features, we learn the 3D representation under the guidance of 2D knowledge as pretraining. Our method needs no labeled data to jointly train the 2D and 3D networks, only uses the 2D data during the pretraining stage, and introduces no training or inference overhead for downstream tasks.

2.2 Self-supervised Representation Learning

The recent success of BERT [15] and pretraining in NLP has sparked interest in self-supervised representation learning in 2D vision. Previous methods [67, 68, 46, 16, 42, 19] train 2D neural networks with handcrafted self-supervised pretext tasks to learn informative representations. Recently, many works [25, 4, 7, 22, 28, 30, 40, 43, 62] introduce contrastive learning and its variants, which largely advance the field of self-supervised representation learning; models learned by these methods serve as good pretraining weights. Some works also extend self-supervised representation learning to 3D [55, 53, 24], yet they take single, clean 3D objects as input, which is not applicable to complex real-world 3D data. PointContrast [64] introduces a contrastive loss suitable for point clouds for 3D pretraining, especially for 3D scene understanding tasks. Given an existing 2D pretrained network, our method can be viewed as a self-supervised pretext task since the knowledge transfer process requires no annotated data. In this work, we follow the pretrain-finetune protocol of [64]. However, our idea differs in that we leverage rich 2D pretrained network knowledge to guide 3D rather than self-supervised learning on pure 3D data.

2.3 Knowledge Distillation

Knowledge distillation (KD) [5, 37, 29] is proposed to compress a large model (teacher) into a smaller one (student) by learning the class-level soft outputs of the teacher. Many works further extend the KD idea by mimicking the teacher's intermediate representations [54, 66], using advanced objective functions [32, 1, 44, 47, 60], or adding learnable proxy layers [34]. Beyond model compression, some recent works also show that the student can surpass the teacher, achieving better performance through KD [65, 69, 17].

Knowledge distillation across different modalities takes advantage of a modality with rich labeled data to alleviate the data shortage in the target modality. Gupta et al. [23] first propose the idea of transferring the supervision of a CNN from RGB images to depth maps with unlabeled paired data. [3, 20] consider transferring from RGB to sound and video using CNNs pretrained with object and scene data. The distillation process for the target modality (student) involves no labeled data; therefore, unlike general knowledge distillation, the cross-entropy loss against true labels cannot be used.

The cross-modal setup is the most similar to ours if we consider the 2D network as the teacher and the 3D network as the student. However, the key difference is that [23] considers 2D CNNs for both RGB images and depth maps, which are well aligned, while 2D images and 3D point clouds are not, and the networks for 2D and 3D are heterogeneous. Moreover, we consider state-of-the-art 3D models [21, 11] and complex real-world 3D downstream tasks to show the actual effectiveness of pretraining for 3D.

3 Learning from 2D

3.1 2D Pretraining on 3D Tasks

First, we conduct a simple pilot study showing the opportunity of pretraining deep neural networks for 3D tasks. We train 2D semantic segmentation networks to perform 3D semantic segmentation on ScanNet [12] by aggregating multi-view predictions [10]. The 2D network with ImageNet pretraining achieves a +4.81% 3D mIoU improvement compared to training from scratch. The result indicates (1) a pretraining effect: with proper pretraining, 3D neural networks may perform better; and (2) useful 2D knowledge: 2D pretraining knowledge (ImageNet) may help 3D scene understanding tasks such as ScanNet semantic segmentation. Based on these observations, instead of developing a pure 3D self-supervised pretraining method, we aim to design a novel approach that transfers 2D pretrained network knowledge into 3D as 3D pretraining.

3.2 Pixel-to-point Knowledge Transfer (PPKT)

To exploit the 2D knowledge for 3D pretraining, we propose the novel pixel-to-point knowledge transfer (PPKT) from 2D to 3D. We divide our method into the following parts: (1) the pixel-to-point design, (2) the upsampling feature projection layer, and (3) point-pixel NCE loss. We illustrate the overview of our proposed pipeline in Figure 2.

# f_2D, f_3D: 2D and 3D backbones (f_2D is frozen)
# g_2D: upsampling feature projection layer
# g_3D: feature projection layer
# bp: back-projection function
# n_sub: sub-sample size, nce_t: temperature
for (x, d) in loader:  # RGB image x, depth map d
    (c, f) = bp(x, d)  # single-view point cloud
    z_2D = g_2D(f_2D(x).detach())
    # back-project pixel-level features into 3D
    _, z_2D = bp(z_2D, d)
    z_3D = g_3D(f_3D(c, f))
    # sub-sample matched point-pixel pairs
    idxs = random.choice(len(z_3D), n_sub)
    z_2D, z_3D = z_2D[idxs], z_3D[idxs]
    logits = mm(z_3D, z_2D.T)
    labels = arange(n_sub)
    loss = CrossEntropy(logits / nce_t, labels)
    loss.backward()
    update(f_3D.param, g_2D.param, g_3D.param)
Algorithm 1

Pseudocode of pixel-to-point knowledge transfer in PyTorch-like style

We leverage a large unlabeled RGB-D dataset during PPKT pretraining. RGB-D frames are easy to collect by scanning scenes with inexpensive RGB-D cameras such as RealSense and Kinect. Let the RGB-D dataset be {(x_i, d_i)}, where x_i and d_i are aligned RGB images and depth maps. Given the camera intrinsic parameters, we define the back-projection function bp(x, d) = (c, f), where c is the coordinates of the points and f is the point features. Here x is a 3-dimensional tensor of shape H x W x C, which can be an RGB image (C = 3) or a 2D feature map (C is the feature dimension).

bp generates a single-view point cloud from a pair of a depth map and an RGB image or 2D feature map, carrying the RGB values or feature vector at each pixel as the point feature.
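Concretely, the back-projection can be sketched with a standard pinhole camera model in a few lines of NumPy; the function name and the intrinsics arguments (fx, fy, cx, cy) are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def back_project(x, d, fx, fy, cx, cy):
    """Sketch of bp(x, d): lift per-pixel features x (H, W, C) and a depth
    map d (H, W) into a single-view point cloud.

    Returns (coords, feats): coords is (M, 3) with one 3D point per
    valid-depth pixel; feats is (M, C), the pixel's feature vector
    (RGB values or a slice of a 2D feature map) carried over as the
    point feature.
    """
    h, w = d.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    valid = d > 0                      # drop pixels with no depth reading
    z = d[valid]
    px = (u[valid] - cx) * z / fx      # pinhole model: x = (u - cx) * z / fx
    py = (v[valid] - cy) * z / fy
    coords = np.stack([px, py, z], axis=1)
    feats = x[valid]                   # the pixel feature travels with the point
    return coords, feats
```

Applying such a function to an RGB image yields the input point cloud (c, f), while applying it to a projected 2D feature map yields the lifted 2D features used later in the loss.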

For the networks, let f_2D be a 2D CNN which takes an image as input and outputs a feature map (we ignore the final global pooling and classification layer without loss of generality). We define a 3D neural network f_3D which outputs a feature for each point. This formulation covers most commonly used 3D backbones, e.g., SR-UNet [11] or PointNet++ [51]. Assuming the 2D CNN f_2D has been pretrained on a large-scale 2D image dataset (ImageNet classification in our default setting), we aim to learn initial weights for the 3D network f_3D from the knowledge of f_2D. Then, f_3D can be used as the backbone network and fine-tuned for downstream tasks such as 3D semantic segmentation or 3D object detection. On the other hand, f_2D is fixed during the PPKT pretraining stage.

Figure 3: t-SNE plot of global features versus pixel-level features. For a single ScanNet scene, the global features of images extracted from ImageNet pretrained ResNet are less discriminative in feature space than pixel-level features. Therefore, we propose transferring 2D knowledge from pixels to points rather than via global features, preserving fine-grained pixel-level information.

Transfer knowledge from pixels to points.

A naive knowledge distillation approach for the cross-modal setup is to minimize the distance between the global features [23] or classification logits [29] of the two modality models. In practice, we found that this does not work well between 2D images and 3D point clouds. PointContrast [64] argues that 3D datasets have a large number of points but a small number of instances, so learning global representations for 3D suffers from the limited number of instances. In addition, we suppose there are a few other reasons: (1) spatial information is lost in global pooling operations; (2) for common 3D encoder-decoder backbones, only the encoder would be pretrained, and the decoder is ignored; (3) unlike ordinary 2D datasets for self-supervised learning, RGB-D frames in indoor environments may share similar global contexts, which are not discriminative in feature space. For example, most frames contain the floor, walls, or tables, sharing similar global semantic meaning. Therefore, we propose transferring knowledge between the pixel level of images and the point level of point clouds. Figure 3 shows that the global features of indoor scene images extracted by ImageNet pretrained ResNet are less diverse than pixel-level features.

Given a pair of point cloud (c, f), RGB image x, and depth map d, we obtain the 3D feature representation z_3D and the pixel-to-point 2D representation (lifted 2D features) z_2D by

z_3D = g_3D(f_3D(c, f)),    (c, z_2D) = bp(g_2D(f_2D(x)), d),

where g_2D and g_3D are learnable feature projection layers which map the 2D and 3D features into the same embedding space with identical dimension and scale. In practice, g_2D could be a convolution layer, and g_3D is a shared linear perceptron. Both g_2D and g_3D are followed by L2 normalization. z_2D is the 2D representation back-projected into 3D space by bp. In this way, z_2D and z_3D are well aligned such that the 2D feature at the i-th pixel and the i-th point feature come from the same 3D coordinates.
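As a concrete sketch (in NumPy rather than PyTorch, with illustrative names and weight shapes), the projection layers are per-element linear maps into a shared d-dimensional embedding, each followed by L2 normalization:

```python
import numpy as np

def l2_normalize(z, eps=1e-8):
    # normalize each feature vector to unit length along the last axis
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)

def project_2d(feat_map, W):
    # g_2D: a 1x1 convolution is a per-pixel linear map (H, W, C) -> (H, W, d)
    h, w, c = feat_map.shape
    return l2_normalize((feat_map.reshape(-1, c) @ W).reshape(h, w, -1))

def project_3d(point_feats, W):
    # g_3D: a shared linear layer applied to every point, (N, C) -> (N, d)
    return l2_normalize(point_feats @ W)
```

With both outputs unit-normalized, the dot products used in the contrastive loss become cosine similarities on a common scale.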

Upsampling feature projection layer (UPL).

ImageNet supervised classification pretrained weights are commonly used in 2D computer vision since the dataset is diverse and large in scale, providing good generalizability and transferability. A classification network such as ResNet usually decreases the spatial resolution of feature maps and enlarges the feature dimension to better extract high-level semantics. The low spatial resolution of the final feature map (1/32 of the input height and width for ResNet) makes our pixel-to-point knowledge transfer difficult. Therefore, we propose the upsampling feature projection layer to tackle this issue. Specifically, given the feature map from the last layer of the 2D CNN, we apply a 1x1 convolution and bilinearly upsample to the original input image resolution. This also maps the 2D features into the same dimension as the 3D features. The method is effective and works well in practice. More importantly, it provides the flexibility to handle differences in spatial resolution and channel count across 2D network architectures.
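A minimal NumPy sketch of such an upsampling projection layer, assuming a per-pixel linear map standing in for the 1x1 convolution plus a hand-rolled bilinear upsample; function names and shapes are ours, not from the paper's code:

```python
import numpy as np

def bilinear_upsample(f, out_h, out_w):
    """Bilinearly resize a feature map f of shape (h, w, c) to (out_h, out_w, c)."""
    h, w, _ = f.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = f[y0][:, x0] * (1 - wx) + f[y0][:, x1] * wx
    bot = f[y1][:, x0] * (1 - wx) + f[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def upl(feat_map, W, out_h, out_w):
    """Upsampling feature projection layer: project channels with a per-pixel
    linear map (the 1x1 conv), then restore the input image resolution."""
    h, w, c = feat_map.shape
    projected = (feat_map.reshape(-1, c) @ W).reshape(h, w, -1)
    return bilinear_upsample(projected, out_h, out_w)
```

Because the projection happens before upsampling, the expensive channel mixing runs on the small feature map, and only the cheap interpolation runs at full resolution.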

Point-pixel NCE loss.

We minimize the relative distance between corresponding pixel and point representations through the point-pixel NCE loss (PPNCE), a modified version of the InfoNCE loss [43] for contrastive learning, defined as

L_PPNCE = -sum_{i=1}^{N} log( exp(z_3D^i . z_2D^i / tau) / sum_{j=1}^{N} exp(z_3D^i . z_2D^j / tau) ),

where tau is the temperature hyper-parameter and N is the total number of points. The physical meaning of the loss is to form a feature space by attracting the features of a 3D point and its corresponding 2D pixel while separating the 3D feature from other 2D pixel features. In other words, if a pixel and a point share the same coordinate in the 3D world, they are a positive pair; otherwise, they are a negative pair. For memory reasons, we sub-sample a fixed number of points and pixels before the loss calculation. Algorithm 1 provides simplified pseudo-code for our pixel-to-point knowledge transfer.
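The PPNCE loss can be sketched in NumPy as a cross entropy over the point-to-pixel similarity matrix, where the i-th row's positive is the i-th lifted pixel (the default temperature 0.04 follows the training details in Section 4.1; the function name is ours):

```python
import numpy as np

def ppnce_loss(z_3d, z_2d, tau=0.04):
    """Point-pixel NCE loss sketch.

    z_3d, z_2d: (n, d) L2-normalized embeddings; row i of each forms the
    positive pair, every other row of z_2d is a negative for point i.
    """
    logits = z_3d @ z_2d.T / tau                       # (n, n) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = z_3d.shape[0]
    # cross entropy with "labels = arange(n)": the diagonal holds positives
    return -log_prob[np.arange(n), np.arange(n)].mean()
```

A perfectly matched set of embeddings yields a near-zero loss, while mismatched pairs are penalized, which is exactly the attract-positive / repel-negative behavior described above.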

Note that most of the existing contrastive learning methods apply linear or non-linear feature projection layers on the global features by multi-layer perceptrons [25, 7, 60]. Our 3D feature projection layer is applied on 3D point-level feature maps, and the 2D upsampling projection layer is applied on 2D high-level but low-resolution feature maps. For simplicity, we do not include memory bank [7] due to the large number of pixels and points.


If we consider the back-projected pixel-level features as pseudo 3D points, our loss function is similar to PointInfoNCE [64]. Both apply a contrastive loss at the point (or pixel) level instead of the global instance level commonly seen in 2D contrastive learning. However, our PPNCE is applied across 2D and 3D features encoded by two different networks, whereas PointInfoNCE is applied across features of two different point cloud samples extracted by the same 3D network. Additionally, our motivation differs: we use the contrastive loss to minimize the relative distance between point and pixel representations to transfer the 2D knowledge.

He et al. [26] show that training from scratch can match or even surpass ImageNet supervised pretraining performance in 2D. However, [64] shows that the small size of labeled 3D datasets makes training from scratch fail to surpass pretraining even with longer training time. In this work, we transfer the knowledge of an ImageNet supervised pretrained 2D network into 3D. The 2D knowledge may provide information complementary to 3D supervised learning and bring performance improvements. Moreover, we show that our method is not restricted to 2D networks pretrained by supervised learning in Section 4.5.

4 Experiment

In this section, we study the effectiveness of our proposed 3D pretraining method. The ability to transfer the pretrained weight to various downstream tasks is essential. Therefore, we fine-tune the model trained by our pixel-to-point knowledge transfer on 3D semantic segmentation and 3D object detection tasks. We will describe the details and results in the following sections.

4.1 Experimental Setup

Network architectures.

We choose the widely used ResNet [27] as our 2D backbone, pretrained on ImageNet classification. For the 3D network, following [64], we adopt the Sparse Residual 3D U-Net 34 (SR-UNet34) [21, 11] as our backbone since it achieves state-of-the-art performance on multiple 3D tasks. It contains 34 3D sparse convolution layers in an encoder-decoder design with skip connections. It is naturally suitable for 3D semantic segmentation since it generates per-point outputs. Following [64], with little modification, it can also perform 3D object detection by attaching VoteNet [49] modules.


Transfer dataset.

We use the raw RGB-D images in the ScanNet dataset [12], collected by hand-held lightweight depth sensors, as our transfer dataset since it is currently the largest available real-world dataset of its kind. ScanNet comprises 1513 indoor scans of 707 distinct spaces. We sub-sample every 25 frames from the sequential RGB-D data, resulting in about 100k frames.


Baselines.

We compare our method with PointContrast [64] using the officially released pretrained weights. Additionally, to verify the effectiveness of our design, we build other naive 2D-3D knowledge transfer methods for comparison. Global knowledge distillation (Global KD) denotes the method we modified from [29], treating the SR-UNet encoder as the student and the 2D CNN as the teacher. Specifically, we extract the encoder from SR-UNet and attach a classifier whose number of output classes matches the ImageNet supervised pretrained network. We compute the KL-divergence between the 2D and 3D logits as the loss. Even though the transfer dataset is not from ImageNet, we let the 2D classifier output its pre-softmax "dark knowledge" class information. Since the transfer dataset has no labels, the classification loss between the student and the true labels is omitted. Global CRD [59] follows similar settings but with a contrastive-based loss function. We also compare with [23] with slight modifications to fit our 2D-3D setting; it can be seen as minimizing the L2 distance between 2D and 3D global features. Note that for the Global KD, CRD, and [23] baselines, only the 3D encoder is pretrained.

Training details.

We use the upsampling feature projection layer to project the 2D ResNet50 layer4 feature. As for 3D, a feature projection layer is used to project the last-layer feature of SR-UNet. The projected feature dimension is 128. The temperature of the NCE loss is 0.04. The voxel size is set to 2.5cm, and the input image size is 480x640. We use momentum SGD with learning rate 0.5 and weight decay 1e-4, with an exponential learning rate scheduler. We apply 2D augmentations, including horizontal flip and random resize, and 3D augmentations, including scaling, rotation, and elastic distortion. We pretrain the 3D network for 60k iterations on one V100 GPU with batch size 24. We implement our method in PyTorch [45] and use the 2D ResNet ImageNet pretrained weights from the torchvision library. We use the same PPKT pretrained 3D network weights as the initial weights for all downstream tasks.

4.2 S3DIS Semantic Segmentation

Method mean Acc mean IoU
From scratch 73.24 65.16
Global KD 72.65 66.56
Gupta [23] 72.11 64.39
CRD [60] 72.70 65.65
PointContrast [64] 73.97 66.86
Pixel-to-point (ours) 75.19 68.28
Table 1: S3DIS semantic segmentation result. Our pixel-to-point knowledge transfer largely boosts the performance by +3.12% mIoU compared to training from scratch. In contrast, the global pretraining methods (Global KD, Gupta et al., and CRD) do not bring performance improvement in the downstream task.


The Stanford Large-Scale 3D Indoor Spaces dataset (S3DIS) [2] contains 6 large buildings, nearly 250 rooms, and semantic labels of 13 categories. For evaluation, we use the commonly used Area 5 as the validation split.


The result of S3DIS semantic segmentation is shown in Table 1. Compared to training from scratch, our proposed pixel-to-point knowledge transfer pretraining brings significant improvements (+3.12% mIoU). In contrast, the performance of the global pretraining baselines (Global KD, CRD, Gupta et al.) is similar to training from scratch. This indicates that our pixel-to-point design is important for transferring 2D knowledge.

4.3 SUN RGB-D Object Detection

Method mAP@0.25 mAP@0.5
From scratch 55.21 32.81
Global KD 56.10 33.58
Gupta [23] 54.42 31.12
CRD [60] 55.86 34.30
PointContrast [64] 56.14 32.70
Pixel-to-point (ours) 57.26 33.92
Table 2: SUN RGB-D 3D object detection result. Compared to training from scratch, our pixel-to-point knowledge transfer pretraining significantly improves the performance by +2.05% mAP@0.25.
Semantic segmentation Object detection
Method mean Acc mean IoU mAP@0.25 mAP@0.5
From scratch 88.57 69.49 56.50 34.54
Global KD 88.63 70.35 57.50 34.81
Gupta [23] 88.40 68.71 56.74 36.27
CRD [60] 88.08 68.53 56.82 36.75
PointContrast [64] 88.63 69.22 58.30 36.26
Pixel-to-point (ours) 88.53 69.56 59.67 38.90
Table 3: ScanNet semantic segmentation and object detection results. For segmentation, pretraining does not bring significant improvement since the datasets used in pretraining and fine-tuning are the same. On the other hand, the model pretrained by our method shows a large improvement in object detection.


SUN RGB-D dataset [57] comprises about 10,000 RGB-D images and is annotated with 60k 3D oriented bounding boxes of 37 object classes. The RGB-D images are back-projected into 3D point clouds. We use the official train/val split and train with the ten most common object classes.


3D object detection requires high-level semantic understanding for object classification and localization. The results of our pretrained model against training from scratch are summarized in Table 2. Our proposed pixel-to-point knowledge transfer achieves a significant improvement over training from scratch (+2.05% mAP@0.25).

4.4 ScanNet Semantic Segmentation and Object Detection


The ScanNet [12] dataset contains 1201 scans in the training set and 312 in the validation set. For semantic segmentation, we use the v2 label set, which has 20 semantic classes, and evaluate performance on vertex points. We use a 2cm voxel size. During testing, each point is assigned the predicted class of the nearest voxel center.
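The test-time label transfer described above can be sketched as follows; the brute-force pairwise distance is for clarity only (a real implementation would use a spatial index), and all names are illustrative:

```python
import numpy as np

def assign_nearest_voxel(points, voxel_centers, voxel_preds):
    """Assign each evaluation point the predicted class of its nearest
    voxel center.

    points: (N, 3) vertex coordinates; voxel_centers: (M, 3) voxel centers;
    voxel_preds: (M,) predicted class id per voxel.
    """
    # pairwise squared distances, shape (N, M)
    d2 = ((points[:, None, :] - voxel_centers[None, :, :]) ** 2).sum(-1)
    return voxel_preds[d2.argmin(axis=1)]
```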

For object detection, ScanNet contains 18 object classes for instance segmentation labeled on mesh data. We follow the preprocessing of [49, 64], generating 3D axis-aligned bounding boxes. Compared to SUN RGB-D, ScanNet consists of reconstructed 3D scans, which are larger and more complete. We set the voxel size to 2.5cm.


The semantic segmentation and object detection results are shown in Table 3. For semantic segmentation, pretraining brings little improvement over training from scratch. Since the supervised fine-tuning dataset is the same as our pretraining dataset, it is possible that self-supervised pretraining only accelerates training but hardly brings performance improvements. For verification, we experiment with gradually decreasing the amount of labeled ScanNet data for fine-tuning. We randomly sample 50%, 30%, and 15% of the labeled scenes, and the result is presented in Figure 4. Although there is no pretraining gain with 100% labeled data, the performance gap between pretraining and training from scratch becomes larger as the data size shrinks (2.23% mIoU difference with 15% labeled data). Our limited-labeled-data fine-tuning result also matches the findings of previous works [41, 8], which suggest that self-supervised pretraining benefits more when the downstream supervised task has less labeled data.

For object detection, unlike the semantic segmentation result, our pixel-to-point knowledge transfer improves performance by a large margin (+3.17% mAP@0.25 and +4.36% mAP@0.5) even though the pretraining and fine-tuning datasets are the same. The improvements are consistent with the results on SUN RGB-D object detection. This may suggest that 3D object detection benefits more from the high-level semantics learned from the 2D pretrained network.

Figure 4: Limited labeled data fine-tuning on ScanNet semantic segmentation. We subsample the labeled scenes in ScanNet to 50%, 30%, and 15%. With less labeled data available, the gap between training from scratch and our pretraining becomes larger.

4.5 Ablation Study

Backbone network size.

Previous works [8, 41] have shown that larger or deeper 2D networks benefit more from self-supervised pretraining. In Table 4, we provide an ablation study on S3DIS with different 3D network sizes pretrained by our method. SR-UNet18 has about half the parameters of SR-UNet34. We have similar findings: our PPKT pretraining on larger models yields more performance improvement. It is also worth noting that SR-UNet18 pretrained with PPKT matches the performance of SR-UNet34 trained from scratch; in other words, good pretraining can make up for a smaller backbone.

Method mean Acc mean IoU
SR-UNet18 71.67 64.67
SR-UNet18 + PPKT 73.67 (+2.0) 66.40 (+1.73)
SR-UNet34 73.24 65.16
SR-UNet34 + PPKT 75.19 (+1.95) 68.28 (+3.12)
Table 4: 3D backbone ablation study on S3DIS. The result shows that larger 3D networks perform better with our pretraining. The models with larger capacity benefit more from pretraining with rich unlabeled data.

Point-pixel loss functions and datasets.

Loss Dataset Network mAP@0.25
From scratch - - 55.21
PPKD loss ADE20k ResNet-FCN 56.07
PPNCE loss ADE20k ResNet-FCN 58.03
PPNCE loss ImageNet ResNet 57.26
Table 5: Point-pixel loss ablation study on SUN RGB-D object detection. We replace our loss with the ordinary knowledge distillation loss [29] applied across points and pixels (PPKD loss). To do so, we first pretrain a 2D FCN on ADE20k semantic segmentation [70]. The result shows that the PPNCE loss transfers the 2D knowledge better.

We study the loss function in our pixel-to-point knowledge transfer by replacing the PPNCE loss with the ordinary knowledge distillation loss [29] applied on points and pixels (PPKD). However, ImageNet pretrained ResNet does not output pixel-wise class-level predictions. To obtain them, we train a 2D ResNet-FCN [39] on the ADE20k semantic segmentation dataset [70]. For PPNCE on ADE20k, the upsampling feature projection layer (UPL) is replaced with a normal feature projection layer (PL). We evaluate the fine-tuning performance on SUN RGB-D object detection in Table 5. The result shows both loss functions outperform training from scratch, and PPNCE is better than PPKD (+1.96% mAP@0.25 under the same ADE20k teacher). In contrast to the traditional KD loss, which treats each output independently, the contrastive loss transfers the knowledge more structurally due to the large number of negative examples [60].
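For reference, the PPKD baseline applies the ordinary temperature-scaled KD loss [29] independently at each matched point/pixel pair, i.e. the KL divergence between the teacher's and the student's per-point class distributions. This NumPy sketch uses an illustrative temperature; the scaling by T^2 follows the standard KD recipe:

```python
import numpy as np

def softmax(x, T=1.0):
    # temperature-scaled, numerically stable softmax over the last axis
    x = x / T
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def ppkd_loss(student_logits, teacher_logits, T=4.0):
    """Per-point KD: mean KL(teacher || student) over matched pairs.

    Both inputs are (n, num_classes), one logit vector per matched
    point/pixel; each pair is treated independently.
    """
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T) + 1e-12)
    log_p_t = np.log(p_t + 1e-12)
    return (p_t * (log_p_t - log_p_s)).sum(axis=-1).mean() * T * T
```

Unlike PPNCE, no term here couples different points, which illustrates why this loss lacks the structural signal that contrastive negatives provide.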

On the other hand, PPNCE with ADE20K performs better than PPNCE with ImageNet. Nevertheless, we argue that a decoder-free architecture combined with the UPL is the better choice, since an FCN-like structure requires a semantic segmentation dataset for pretraining. In contrast, our default setup is more flexible and makes weaker assumptions about the 2D network architecture.
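A minimal sketch of the difference between the two projection heads discussed above, assuming a ResNet-style 2D backbone whose final feature map is heavily downsampled; the channel sizes and upsampling factor here are assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class ProjectionLayer(nn.Module):
    """PL: 1x1 projection into the shared embedding space, no upsampling."""
    def __init__(self, in_channels=2048, embed_dim=128):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)

    def forward(self, feat):          # feat: (B, C, H, W)
        return self.proj(feat)        # (B, embed_dim, H, W)

class UpsamplingProjectionLayer(nn.Module):
    """UPL: restore spatial resolution before projecting, so fine-grained
    per-pixel features remain available for point-pixel matching."""
    def __init__(self, in_channels=2048, embed_dim=128, scale=4):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear",
                              align_corners=False)
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)

    def forward(self, feat):             # feat: (B, C, H, W)
        return self.proj(self.up(feat))  # (B, embed_dim, H*scale, W*scale)
```

Because the UPL recovers resolution inside the projection head, it works with any backbone, whereas an FCN bakes the upsampling into a decoder that must itself be pretrained on dense labels.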

Learning from 2D self-supervised pretrained network.

We further show that learning from 2D as pretraining is not restricted to 2D networks pretrained with supervised learning; it can also take advantage of 2D self-supervised pretraining. Here, we use a 2D ResNet-50 pretrained with MoCo [25] on ImageNet without labels and compare it to our default setting, which uses supervised ImageNet pretraining.

We show the result in Table 6. In our experiment, 3D models learned from the MoCo-pretrained 2D network achieve fine-tuning performance comparable to those learned from the 2D network trained with supervised ImageNet classification. This demonstrates that our method generalizes to self-supervised 2D knowledge. We believe the performance could be further improved if the 2D self-supervised pretraining used an unlabeled dataset larger than ImageNet, which is the strength of self-supervised learning.

Method mAP@0.25 (ScanNet) mAP@0.25 (SUN RGB-D)
From scratch 56.50 55.21
PPKT (ImageNet super.) 59.67 57.26
PPKT (ImageNet MoCo) 59.69 57.17

Table 6: PPKT with a self-supervised pretrained 2D network on object detection. In addition to a supervised ImageNet-pretrained 2D network, our method can also adopt a self-supervised pretrained one, MoCo [25]. It shows comparable performance with our default setting (supervised ImageNet pretraining).

5 Conclusion

In this work, we explore a new 3D pretraining approach that learns from 2D pretrained networks. We present pixel-to-point knowledge transfer to effectively utilize the 2D knowledge. Our comprehensive experiments demonstrate that the proposed pretraining method brings significant improvements to various real-world 3D downstream tasks. In principle, our method is complementary to PointContrast [64]. We expect that our idea of learning from 2D and our empirical findings will inspire future works to consider 2D network knowledge when developing self-supervised 3D algorithms.


  • [1] S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai (2019) Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9163–9171. Cited by: §2.3.
  • [2] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese (2016) 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543. Cited by: §1, §4.2.
  • [3] Y. Aytar, C. Vondrick, and A. Torralba (2016) Soundnet: learning sound representations from unlabeled video. arXiv preprint arXiv:1610.09001. Cited by: §2.3.
  • [4] P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15535–15545. Cited by: §1, §2.2.
  • [5] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil (2006) Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. Cited by: §2.3.
  • [6] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2020) Nuscenes: a multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11621–11631. Cited by: §1.
  • [7] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §2.2, §3.2.
  • [8] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton (2020) Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029. Cited by: §4.4, §4.5.
  • [9] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1907–1915. Cited by: §2.1.
  • [10] H. Chiang, Y. Lin, Y. Liu, and W. H. Hsu (2019) A unified point-based framework for 3d segmentation. In 2019 International Conference on 3D Vision (3DV), pp. 155–163. Cited by: §1, §2.1, §3.1.
  • [11] C. Choy, J. Gwak, and S. Savarese (2019) 4d spatio-temporal convnets: minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3075–3084. Cited by: §1, §1, §2.3, §3.2, §4.1.
  • [12] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839. Cited by: §1, §1, §3.1, §4.1, §4.4.
  • [13] A. Dai and M. Nießner (2018) 3dmv: joint 3d-multi-view prediction for 3d semantic scene segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 452–468. Cited by: §1, §2.1.
  • [14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1.
  • [15] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2.2.
  • [16] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pp. 1422–1430. Cited by: §2.2.
  • [17] T. Furlanello, Z. Lipton, M. Tschannen, L. Itti, and A. Anandkumar (2018) Born again neural networks. In International Conference on Machine Learning, pp. 1607–1616. Cited by: §2.3.
  • [18] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: §1.
  • [19] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Cited by: §2.2.
  • [20] R. Girdhar, D. Tran, L. Torresani, and D. Ramanan (2019) Distinit: learning video representations without a single labeled video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 852–861. Cited by: §2.3.
  • [21] B. Graham, M. Engelcke, and L. Van Der Maaten (2018) 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9224–9232. Cited by: §1, §2.3, §4.1.
  • [22] J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §1, §2.2.
  • [23] S. Gupta, J. Hoffman, and J. Malik (2016) Cross modal distillation for supervision transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2827–2836. Cited by: §1, §2.3, §2.3, §3.2, §4.1, Table 1, Table 2, Table 3.
  • [24] K. Hassani and M. Haley (2019) Unsupervised multi-task feature learning on point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8160–8171. Cited by: §2.2.
  • [25] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §1, §2.2, §3.2, §4.5, Table 6.
  • [26] K. He, R. Girshick, and P. Dollár (2019) Rethinking imagenet pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927. Cited by: §3.2.
  • [27] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.1.
  • [28] O. Henaff (2020) Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182–4192. Cited by: §1, §2.2.
  • [29] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.3, §3.2, §4.1, §4.5, Table 5.
  • [30] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §1, §2.2.
  • [31] J. Hou, A. Dai, and M. Nießner (2019) 3d-sis: 3d semantic instance segmentation of rgb-d scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4421–4430. Cited by: §2.1.
  • [32] Z. Huang and N. Wang (2017) Like what you like: knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219. Cited by: §2.3.
  • [33] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell (2013) A category-level 3d object dataset: putting the kinect to work. In Consumer depth cameras for computer vision, pp. 141–165. Cited by: §1.
  • [34] J. Kim, S. Park, and N. Kwak (2018) Paraphrasing complex network: network compression via factor transfer. arXiv preprint arXiv:1802.04977. Cited by: §2.3.
  • [35] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander (2018) Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8. Cited by: §2.1.
  • [36] A. Kundu, X. Yin, A. Fathi, D. Ross, B. Brewington, T. Funkhouser, and C. Pantofaru (2020) Virtual multi-view fusion for 3d semantic segmentation. In European Conference on Computer Vision, pp. 518–535. Cited by: §1.
  • [37] J. Li, R. Zhao, J. Huang, and Y. Gong (2014) Learning small-size dnn with output-distribution-based criteria. In Fifteenth annual conference of the international speech communication association, Cited by: §2.3.
  • [38] Z. Liu, H. Tang, Y. Lin, and S. Han (2019) Point-voxel cnn for efficient 3d deep learning. arXiv preprint arXiv:1907.03739. Cited by: §1.
  • [39] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §4.5.
  • [40] I. Misra and L. v. d. Maaten (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717. Cited by: §1, §2.2.
  • [41] A. Newell and J. Deng (2020) How useful is self-supervised pretraining for visual tasks?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7345–7354. Cited by: §4.4, §4.5.
  • [42] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Cited by: §2.2.
  • [43] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1, §2.2, §3.2.
  • [44] W. Park, D. Kim, Y. Lu, and M. Cho (2019) Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3967–3976. Cited by: §2.3.
  • [45] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.1.
  • [46] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536–2544. Cited by: §2.2.
  • [47] B. Peng, X. Jin, J. Liu, D. Li, Y. Wu, Y. Liu, S. Zhou, and Z. Zhang (2019) Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5007–5016. Cited by: §2.3.
  • [48] C. R. Qi, X. Chen, O. Litany, and L. J. Guibas (2020) Imvotenet: boosting 3d object detection in point clouds with image votes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4404–4413. Cited by: §2.1.
  • [49] C. R. Qi, O. Litany, K. He, and L. J. Guibas (2019) Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9277–9286. Cited by: §4.1, §4.4.
  • [50] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §1.
  • [51] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413. Cited by: §1, §3.2.
  • [52] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §1.
  • [53] Y. Rao, J. Lu, and J. Zhou (2020) Global-local bidirectional reasoning for unsupervised representation learning of 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5376–5385. Cited by: §2.2.
  • [54] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2014) Fitnets: hints for thin deep nets. arXiv preprint arXiv:1412.6550. Cited by: §2.3.
  • [55] J. Sauder and B. Sievers (2019) Self-supervised deep learning on point clouds by reconstructing space. In Advances in Neural Information Processing Systems, pp. 12962–12972. Cited by: §2.2.
  • [56] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pp. 746–760. Cited by: §1.
  • [57] S. Song, S. P. Lichtenberg, and J. Xiao (2015) Sun rgb-d: a rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 567–576. Cited by: §1, §1, §4.3.
  • [58] H. Tang, Z. Liu, S. Zhao, Y. Lin, J. Lin, H. Wang, and S. Han (2020) Searching efficient 3d architectures with sparse point-voxel convolution. In European Conference on Computer Vision, pp. 685–702. Cited by: §1.
  • [59] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §4.1.
  • [60] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive representation distillation. arXiv preprint arXiv:1910.10699. Cited by: §2.3, §3.2, §4.5, Table 1, Table 2, Table 3.
  • [61] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: §1.
  • [62] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §1, §2.2.
  • [63] J. Xiao, A. Owens, and A. Torralba (2013) Sun3d: a database of big spaces reconstructed using sfm and object labels. In Proceedings of the IEEE international conference on computer vision, pp. 1625–1632. Cited by: §1.
  • [64] S. Xie, J. Gu, D. Guo, C. R. Qi, L. Guibas, and O. Litany (2020) PointContrast: unsupervised pre-training for 3d point cloud understanding. In European Conference on Computer Vision, pp. 574–591. Cited by: Figure 1, §1, §1, §1, §2.2, §3.2, §3.2, §3.2, §4.1, §4.1, §4.4, Table 1, Table 2, Table 3, §5.
  • [65] J. Yim, D. Joo, J. Bae, and J. Kim (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141. Cited by: §2.3.
  • [66] S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: §2.3.
  • [67] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European conference on computer vision, pp. 649–666. Cited by: §2.2.
  • [68] R. Zhang, P. Isola, and A. A. Efros (2017) Split-brain autoencoders: unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1058–1067. Cited by: §2.2.
  • [69] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018) Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4320–4328. Cited by: §2.3.
  • [70] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641. Cited by: §4.5, Table 5.