Image2Point: 3D Point-Cloud Understanding with Pretrained 2D ConvNets

06/08/2021 ∙ by Chenfeng Xu, et al.

3D point-clouds and 2D images are different visual representations of the physical world. While human vision can understand both representations, computer vision models designed for 2D image and 3D point-cloud understanding are quite different. Our paper investigates the potential for transferability between these two representations by empirically examining whether this approach works, what factors affect the transfer performance, and how to make it work even better. We discovered that we can indeed use the same neural net model architectures to understand both images and point-clouds. Moreover, we can transfer pretrained weights from image models to point-cloud models with minimal effort. Specifically, based on a 2D ConvNet pretrained on an image dataset, we can transfer the image model to a point-cloud model by inflating 2D convolutional filters to 3D and then finetuning its input, output, and optionally normalization layers. The transferred model can achieve competitive performance on 3D point-cloud classification, indoor and driving scene segmentation, even beating a wide range of point-cloud models that adopt task-specific architectures and use a variety of tricks.

1 Introduction

A point-cloud is an important visual representation for 3D computer vision. It is widely used in applications such as autonomous driving Behley et al. (2019); Caesar et al. (2020); Yue et al. (2018), robotics Armeni et al. (2017); Pomerleau et al. (2015); Xu et al. (2021), augmented and virtual reality Sketchup (2021); Wu et al. (2015); Shi et al. (2015), etc. A point-cloud represents visual information in a highly different way from a 2D image. A point-cloud essentially consists of a set of unordered points, with each point encoding its spatial x, y, z coordinates and potentially other features. In contrast, a 2D image organizes visual features as a dense 2D pixel array. Due to the representation differences, 2D image and 3D point-cloud understanding are treated as separate problems. Image models and point-cloud models are designed to have different architectures and are trained on different types of data. Few research efforts have tried to directly transfer models from images to point-clouds or vice versa.

Intuitively, both 3D point-clouds and 2D images are visual representations of the physical world. Their low-level representations are drastically different, but they can represent the same underlying visual concept. Furthermore, human vision has no problem understanding both representations. Therefore, we explore whether computer vision models trained on one modality can be used to understand the other.

Somewhat surprisingly, the answer to the question immediately above is: yes, 2D image models trained on image datasets can be transferred to understand 3D point-clouds with minimal effort. As illustrated in Figure 1, we transfer a 2D ConvNet to a 3D ConvNet whose input is a 3D voxel representation converted from a point-cloud. Based on a pretrained 2D ConvNet, we inflate its 2D convolutional filters to 3D by copying the filter weights along a third dimension. We add input and output layers to the network, and on a target point-cloud dataset, we only finetune the input/output layers and optionally the normalization layers, while keeping the original model weights untouched. This transferred model can achieve competitive performance with 88.09% top-1 accuracy, 55.22% mIoU, and 58.76% mIoU for CAD model classification, indoor, and outdoor semantic segmentation, respectively, outperforming many previous point-cloud models that adopt task-specific model architectures and tricks. When trained on a small number of samples, the transferred model drastically outperforms the model trained from scratch, exhibiting superior data efficiency with an 11.26% accuracy improvement in the extremely low data regime on the ModelNet 3D Warehouse dataset.

Transferring models from images to point-clouds has several potential benefits: 1) recent research has shown that pretrained image models effectively improve the downstream task's performance Kolesnikov et al. (2019). 2) pretrained image models also drastically improve data efficiency on downstream tasks, even enabling promising few-shot or zero-shot transfer Kolesnikov et al. (2019); Chen et al. (2020a); He et al. (2020); Caron et al. (2021). This is particularly interesting for point-cloud understanding as obtaining and annotating point-clouds is difficult and expensive Wang et al. (2019a).

In order to better understand whether we can transfer image models to point-clouds, whether we can utilize the benefits of the recent progress in image representation learning, what factors impact the performance, and how we can make it work better, we formulate eight questions and provide experimental results to address them. Our experiments show that 1) transferring image models to point-cloud understanding is feasible; 2) we observe preliminary but promising results that pretrained image models can improve downstream task performance and data efficiency; 3) however, we also note some phenomena that contradict our expectations from image-to-image transfer. For example, bigger pretraining datasets do not necessarily improve, and can even hurt, downstream performance; stronger pretrained models do not necessarily yield better downstream performance. These findings indicate that future research is needed to unlock the potential of transferring image models to point-cloud understanding.

Figure 1: We investigate the feasibility of transferring pretrained 2D ConvNets to 3D sparse ConvNets. With filter inflation and finetuning only the input and output layers (the classifier for the classification task and the decoder for the semantic segmentation task), and optionally the normalization layers, 3D sparse ConvNets are capable of handling point-cloud classification, indoor, and driving scene segmentation.

2 Related Work

2.1 Point-cloud processing model

3D convolution-based methods are one of the mainstream approaches to point-cloud processing; they efficiently process point-clouds based on voxelization. In particular, voxelization rasterizes the point-cloud into regular grids (called voxels) so that conventional 3D convolutions can be applied. Sparse convolution was proposed to operate only on the non-empty voxels Liu et al. (2015); Choy et al. (2019); Tang et al. (2020); Zhou et al. (2020); Yan et al. (2018), largely improving the efficiency of 3D convolutions.

Projection-based methods attempt to project a 3D point-cloud onto a 2D plane and use 2D convolutions to extract features Wang et al. (2018); Wu et al. (2018, 2019); Xu et al. (2020); Su et al. (2015); Lawin et al. (2017); Boulch et al. (2017). In particular, the bird's-eye-view projection Yang et al. (2018); Lang et al. (2019) and the spherical projection Wu et al. (2018, 2019); Xu et al. (2020); Milioto et al. (2019) have made great progress on outdoor point-cloud tasks.

Point-based methods directly process the point-cloud data. The most classic methods, PointNet Qi et al. (2016) and PointNet++ Qi et al. (2017), directly consume point-clouds via customized feature aggregation. Many works further develop advanced local-feature aggregation operators that mimic convolution on unstructured data Xu et al. (2021); Li et al. (2018b); Hua et al. (2018); Liu et al. (2019b, 2020); Wang et al. (2017); Li et al. (2018a); Komarichev et al. (2019); Feng et al. (2021).

2.2 Pretraining in 2D and 3D vision

Pretraining in 2D vision has shown effectiveness under supervised Dosovitskiy et al. (2020); Girshick et al. (2014), self-supervised Jing and Tian (2020); Goyal et al. (2021), and unsupervised contrastive approaches He et al. (2020); Bachman et al. (2019); Chen et al. (2020a); Caron et al. (2020); Chen et al. (2020c); Hjelm et al. (2018). After pretraining on a large amount of image data, a 2D model requires far fewer computational resources and less data during finetuning to reach competitive performance on downstream tasks Kataoka et al. (2020); Caron et al. (2019); Chen et al. (2020b); Henaff (2020).

Pretraining in 3D vision has been studied in a similar way to pretraining in 2D vision: both self-supervised and contrastive pretraining Xie et al. (2020) show promising results. Due to the lack of large, annotated point-cloud datasets, pretraining in 3D vision is often motivated by data efficiency Xu and Lee (2020). Recent works Hou et al. (2020); Zhang et al. (2021) design pretraining methods with data efficiency in mind, for example Contrastive Scene Contexts, which makes use of both point-level correspondences and spatial contexts.

2.3 Cross-modal learning

Cross-modal learning attempts to take advantage of a data modality that is different from the data modality of the downstream task Dai and Nießner (2018); Liu et al. (2021b). For example, xMUDA Jaritz et al. (2020) utilizes aligned images and point-clouds to transfer 2D feature map information for 3D semantic segmentation through knowledge distillation Tian et al. (2019). For cross-modal transfer learning Hou et al. (2021), Liu et al. Liu et al. (2021a) proposed pixel-to-point knowledge transfer (PPKT) from 2D to 3D, which uses aligned RGB and RGB-D images during pretraining. Our work does not rely on joint image-point-cloud pretraining. Instead, we directly transfer an image-pretrained model to point-clouds with the simplest pretraining-finetuning scheme.

Some previous works on video and medical images Carreira and Zisserman (2017); Shan et al. (2018) have discussed 2D-3D transfer learning by simply extending a 2D filter along the time or depth direction to transfer pretrained 2D filters to 3D models. In this work, we study in depth, for point-cloud models, how a similar approach can reach performance comparable to training from scratch while only finetuning the input and output layers. Between the language and image modalities, transfer learning with minimal finetuning has also shown competitive performance Lu et al. (2021).

3 Converting a 2D ConvNet to a 3D ConvNet

In this paper, we mainly focus on the 3D sparse-convolution based method to process point-clouds. As discussed in Section 2.1, we consider a set of points where each point is parameterized by its 3D coordinates and optionally additional features such as intensity and RGB. We then voxelize/quantize these points into voxels according to their 3D coordinates, following Choy et al. (2019). A voxel's feature is inherited from the points that lie in the voxel: if there are multiple points in a voxel, we average all points' features and assign the mean to the voxel; if there is no point in the voxel, we simply set the voxel's feature to 0. When using sparse convolution, we skip the computation on empty voxels.
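
As a rough sketch, the voxelization and feature-averaging step can be written as follows (a simplified illustration with a hypothetical helper, not the authors' implementation; libraries such as torchsparse or MinkowskiEngine provide optimized versions):

import torch

def voxelize(points, features, voxel_size=0.05):
    # points:   (N, 3) float tensor of x, y, z coordinates
    # features: (N, C) float tensor of per-point features
    # Returns integer voxel coordinates and the mean feature of the points
    # falling into each non-empty voxel; empty voxels are simply skipped.
    coords = torch.floor(points / voxel_size).long()
    unique_coords, inverse = torch.unique(coords, dim=0, return_inverse=True)
    voxel_feats = torch.zeros(unique_coords.shape[0], features.shape[1])
    counts = torch.zeros(unique_coords.shape[0], 1)
    voxel_feats.index_add_(0, inverse, features)                    # sum features per voxel
    counts.index_add_(0, inverse, torch.ones(features.shape[0], 1))
    return unique_coords, voxel_feats / counts                      # average per voxel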

Given a pretrained 2D ConvNet, we convert it to a 3D ConvNet that takes 3D voxels as input. The key element of this procedure is to convert 2D convolutions to 3D. A 2D convolutional filter can be represented as a 4D tensor of shape $C_{out} \times C_{in} \times K \times K$, representing the output dimension, the input dimension, and two spatial kernel sizes, respectively. A 3D convolutional filter has an extra spatial dimension and its shape is $C_{out} \times C_{in} \times K \times K \times K$. To better illustrate this point, we ignore the output and input dimensions and only consider a spatial slice of the 2D filter with shape $K \times K$. The simplest way to convert this 2D filter to 3D is to copy the 2D filter and repeat it $K$ times along a third dimension, as illustrated in Fig. 2. This operation is the same as the inflation trick used by Carreira and Zisserman (2017) to initialize a video model with a pretrained 2D ConvNet. More generally, if we represent the original 2D filter slice as a vector with $K^2$ elements and the 3D filter slice as a vector with $K^3$ elements, this inflation can be represented by a linear transformation matrix $T$ with shape $K^2 \times K^3$. This transformation matrix can represent handcrafted operations such as inflation along different orientations or rotations, and it can also be learned. We will discuss how this transformation matrix impacts the transfer performance in Section 4.8.
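
As a minimal sketch (the function name is ours; the released code may differ), the inflation by copying can be written as:

import torch

def inflate_conv2d_weight(w2d):
    # w2d: pretrained 2D conv weight of shape (C_out, C_in, K, K).
    # Returns a 3D conv weight of shape (C_out, C_in, K, K, K) obtained by
    # copying the 2D filter K times along a third spatial dimension.
    K = w2d.shape[-1]
    w3d = w2d.unsqueeze(-1).repeat(1, 1, 1, 1, K)
    # Optionally divide by K (the I3D-style rescaling of Carreira and Zisserman (2017))
    # to keep activation magnitudes comparable; the plain copy matches the text above.
    return w3d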

Besides convolution, other operations such as downsampling, batch normalization (BN), and nonlinear activations can easily be migrated to 3D. Our 3D model inherits the architecture of the original 2D ConvNet, but we also add an input layer with 3 convolutional layers and an output layer that depends on the target task. For classification, we use global average pooling followed by two fully connected layers to get the final prediction. For semantic segmentation, the output layer is a U-Net style decoder Ronneberger et al. (2015). The architectures of the input/output layers are described in more detail in Section 6.4.
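
For example, since BN parameters and running statistics are per-channel, a pretrained 2D BN layer can be carried over to the 3D (sparse) model directly; below is a sketch, assuming the sparse library applies BatchNorm1d over (num_voxels, channels) features (not the authors' exact code):

import torch.nn as nn

def migrate_bn(bn2d):
    # Copy per-channel affine parameters and running statistics from a
    # pretrained BatchNorm2d into a BatchNorm1d used by the sparse 3D model.
    bn3d = nn.BatchNorm1d(bn2d.num_features, eps=bn2d.eps, momentum=bn2d.momentum)
    bn3d.weight.data.copy_(bn2d.weight.data)
    bn3d.bias.data.copy_(bn2d.bias.data)
    bn3d.running_mean.copy_(bn2d.running_mean)
    bn3d.running_var.copy_(bn2d.running_var)
    return bn3d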

Figure 2: The left figure shows the 2D convolution and the converted 3D convolution obtained by directly copying the 2D filter. The right figure shows a general filter transformation, in which 2D filters are linearly transformed into 3D filters.

4 Empirical Evaluation

To study image-to-point-cloud transfer, we formulate eight questions and provide experimental results to address them. These questions are organized in three parts: (i) Can we use transferred image models to understand point-clouds? (Section 4.1, Section 4.2, Section 4.3) (ii) How do different factors, such as point-cloud processing methods, pretrained image datasets, and types of tasks, impact the transfer performance? (Section 4.4, Section 4.5, Section 4.6, Section 4.7) (iii) How can we further improve the transfer performance? (Section 4.8)

Datasets.

We benchmark the transferred models on ModelNet 3D Warehouse classification Wu et al. (2015), S3DIS indoor segmentation Armeni et al. (2017), and SemanticKITTI outdoor segmentation Behley et al. (2019). ModelNet 3D Warehouse is a CAD model classification dataset consisting of point-clouds from 40 categories; the CAD models in this benchmark come from 3D Warehouse Sketchup (2021). On this benchmark, we only utilize the x, y, z coordinates as features. S3DIS is an indoor dataset collected from real-world indoor scenes and includes 3D scans captured by Matterport scanners in 6 areas. It provides point-wise annotations for indoor objects such as chairs, tables, and bookshelves. SemanticKITTI, built on the KITTI Vision Odometry benchmark Geiger et al. (2012), is a driving scene dataset. It provides dense point-wise annotations for the complete 360-degree field of view of the deployed automotive LiDAR, making it currently one of the most challenging datasets.

We mostly use pretrained ResNets He et al. (2016) as the 2D ConvNets throughout our experiments. Depending on the experiment, the ResNets are pretrained on Tiny-ImageNet, ImageNet-1K, ImageNet-21K Deng et al. (2009), Cityscapes Cordts et al. (2016), ADE20K Zhou et al. (2017), and the Fractal database (FractalDB) Kataoka et al. (2020). Our pretrained models are directly downloaded from various sources, with detailed links provided in Section 6.1.

4.1 Can we use transferred image models to understand point-clouds?

To evaluate the feasibility of transferring pretrained 2D image models to 3D point-cloud tasks, we use a ResNet-18 model pretrained on ImageNet-1K. We convert the 2D ConvNet into a 3D ConvNet using the procedure described in Section 3. We hypothesize that, if a pretrained 2D image model is capable of understanding point-clouds directly, we should see acceptable performance by only finetuning the input and output layers of the transferred model. Furthermore, we expect that, as we gradually relax the freezing constraints and also finetune the BN parameters, the transferred model can achieve better performance, even surpassing training from scratch. Thus, in this first experiment, we consider four settings, as shown in Table 1: 1) in the first setting, we only finetune the input (I) and output (O) layers; 2) in addition to I and O, we also allow the BN layers to update their mean (μ) and variance (σ) based on the input data; 3) in the third setting, we keep the relaxations of I, O, μ, σ and also finetune the BN weight (W) and bias (b); 4) we finetune the whole network.
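
A minimal sketch of how these four settings could be configured in PyTorch is shown below; the submodule names (input_layer, backbone, output_layer) and setting labels are placeholders, not taken from the released code:

import torch.nn as nn

def configure_finetuning(model, setting):
    # setting is one of: "io", "io_bn_stats", "io_bn_all", "full" (settings 1-4).
    for p in model.parameters():
        p.requires_grad = False
    for p in model.input_layer.parameters():
        p.requires_grad = True                 # I: always finetune the input layer
    for p in model.output_layer.parameters():
        p.requires_grad = True                 # O: always finetune the output layer
    for m in model.backbone.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm3d)):
            if setting == "io":
                m.eval()                       # keep the image running mean/variance
                                               # (re-apply after model.train())
            if setting in ("io_bn_all", "full"):
                m.weight.requires_grad = True  # also finetune BN weight and bias
                m.bias.requires_grad = True
    if setting == "full":
        for p in model.parameters():
            p.requires_grad = True             # setting 4: finetune the whole network
    return model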

Layers to finetune ModelNet(top 1 Acc%) S3DIS(mIoU%) SemanticKITTI(mIoU%)
I, O 78.69 53.54 55.78
I, O, μ, σ 83.45 54.86 56.29
I, O, μ, σ, W, b 88.09 55.22 58.76
whole network 89.14 56.62 65.59
from scratch 88.61 55.09 64.75
linear I, O 69.74 48.83 54.43
linear network(w/o backbone) 8.91 - -
from scratch (linear I, O) 88.53 54.79 64.01
3DShapeNets Wu et al. (2015) 77.0 - -
DeepPano Shi et al. (2015) 77.63 - -
Xu et al. Xu and Todorovic (2016) 81.26 - -
PointNet Qi et al. (2016) 89.2 42.97 17.4
PointNet++ Qi et al. (2017) 90.7 50.73 22.1
DGCNN Wang et al. (2019b) 90.7 47.94 -
RSNet Huang et al. (2018) - 51.93 -
SPGraph Landrieu and Simonovsky (2018) - 58.04 20.00 (test)
TangentConv Tatarchenko et al. (2018) - 52.6 35.9 (test)
SqueezeSegV3 Xu et al. (2020) - - 57.31
Table 1: Transfer experiments from a ResNet-18 pretrained on ImageNet-1K to ModelNet 3D Warehouse (ModelNet40), S3DIS (area 5), and SemanticKITTI (val set), respectively. We list some well-known point-cloud models in the third group for easy comparison to prior art. Some entries are our reproduced results based on the official code, since the original papers do not report results on these evaluations.

We discover that even if we only finetune the input/output layers while keeping the image representation untouched, the transferred model achieves highly competitive performance, outperforming a large number of task-specific models on all three benchmarks. Specifically, it outperforms 3D ShapeNets Wu et al. (2015) and DeepPano Shi et al. (2015), the state of the art in 2015, by 1.69% and 1.06% top-1 accuracy on the ModelNet 3D Warehouse dataset, respectively. For the more challenging segmentation tasks on the S3DIS and SemanticKITTI datasets, the minimally transferred model outperforms a wide range of prior methods Qi et al. (2016, 2017); Wang et al. (2019b); Landrieu and Simonovsky (2018); Tatarchenko et al. (2018). It even approaches the performance of SqueezeSegV3 Xu et al. (2020) (only 1.53 points mIoU lower), a task-specific network customized for driving scene segmentation.

When we additionally finetune the BN weight and bias while keeping the pretrained backbone weights fixed, the performance improves further: on ModelNet 3D Warehouse, it improves by 9.4% accuracy over finetuning only the input/output layers. Compared with training from scratch, the performance does not drop by a large margin: 0.52% accuracy on ModelNet and 5.99% mIoU on SemanticKITTI.

To quantify the representation ability of the transferred 3D ConvNet for point-clouds, we follow previous works that finetune only the simplest linear input layer and a linear classifier He et al. (2020); Chen et al. (2020a, c). We therefore design the simplest linear input and output layers for the classification and segmentation tasks, as shown in Section 6.4. We compare the results in Table 1 with two baseline settings: 1) stacking the input and output layers to form a network (linear network (w/o backbone)), where we set the output channels of the input layer equal to the input channels of the output layer; 2) training this new architecture from scratch (from scratch (linear I, O)).

We observe that, compared with the linear network formed by the linear input and output layers, the 3D ConvNet transferred from the pretrained 2D ConvNet provides effective representations, improving the performance from 8.91% to 69.74%. Even compared with previous works Qi et al. (2017); Wang et al. (2019b); Landrieu and Simonovsky (2018), this simplest transferred model still performs better on S3DIS and SemanticKITTI.

When finetuning the whole network, the transferred ConvNets steadily outperform training from scratch, by 0.53% accuracy, 1.53% mIoU, and 0.84% mIoU on the ModelNet, S3DIS, and SemanticKITTI datasets, respectively. However, the improvements are not large, which is a limitation we will explore in the future.

Therefore, we conclude that pretrained 2D image models can provide a good representation for 3D point-clouds and, more importantly, can indeed be transferred to 3D point-cloud tasks.

4.2 Can image pretraining help projection-based and point-based models?

The above experiments are conducted on the 3D sparse convolution-based model. We are driven to explore whether image pretraining also helps projection-based and point-based models, as they are widely used for point-cloud understanding.

We choose a typical point-based model, PointNet++ Qi et al. (2017), and pretrain it on the Tiny-ImageNet dataset; training details are in Section 6.1. As shown in Table 2, after pretraining on Tiny-ImageNet, we finetune it on ModelNet 3D Warehouse. We observe that, when only finetuning the input/output layers and adapting the BN mean and variance, the transferred model achieves 82.34% top-1 accuracy on ModelNet 3D Warehouse. When finetuning the whole network, the performance is slightly better than training from scratch.

For the projection-based method, we choose HRNet Sun et al. (2019) as the starting point. We use an HRNetV2-W48 pretrained on Cityscapes and then finetune it on SemanticKITTI. As shown in Table 2, finetuning the input/output layers and updating the BN mean and variance yields 36.04% mIoU. Further finetuning the entire network improves the mIoU to 49.37%, more than 5% higher than training from scratch.

Therefore, these experiments demonstrate that image pretraining is also effective for point-based and projection-based models.

Layers to finetune Tiny-ImageNet ModelNet Cityscapes SemanticKITTI
PointNet++ (Cls top 1 Acc%) HRNetV2-W48 (Seg mIoU%)
I, O, μ, σ 82.34 36.04
whole network 90.72 49.37
from scratch 90.67 44.12
Table 2: The performance of PointNet++ pretrained on Tiny-ImageNet and HRNetV2-W48 pretrained on Cityscapes, transferred to ModelNet 3D Warehouse (classification) and SemanticKITTI (segmentation).

4.3 Can transferred model improve data efficiency?

1% train data 5% train data 10% train data 100% train data
w/ pretraining 24.92 71.11 76.82 88.90
w/o pretraining 13.66 68.56 74.31 88.53
Table 3: The performance of ResNet-50 trained on ModelNet 3D Warehouse with 1%, 5%, 10%, and 100% of the training data (randomly selected).

Obtaining and annotating point-cloud data is much more difficult and expensive than for images. Thus, we want to explore whether transferring image models to 3D can help improve data efficiency.

To investigate this question, we transfer a ResNet-50 pretrained on ImageNet-1K to 3D and finetune the entire model on ModelNet 3D Warehouse using randomly sampled 1%, 5%, and 10% of the training data. We observe that image-pretrained models steadily outperform models trained from scratch. Especially for the most difficult setting with 1% training data, the accuracy is improved by 11.26%. For 5% and 10% training data, the transferred networks outperform training from scratch by 2.55% and 2.51% top-1 accuracy, respectively. This result shows that models pretrained on images can significantly improve data efficiency, especially in the extremely low data regime. However, we observe that the benefit becomes marginal as more training data is available.

4.4 How does the scale of image dataset affect performance?

We next explore how the transfer performance changes as the scale of the image dataset changes. Here, we vary the size of the pretraining image dataset: we use ResNet-50 pretrained on Tiny-ImageNet, ImageNet-1K, and ImageNet-21K. In particular, Tiny-ImageNet has 100,000 images across 200 classes, ImageNet-1K is a subset of 1.2M images selected from the full ImageNet dataset, and ImageNet-21K is the full ImageNet dataset with 14,197,122 images and 21,841 labels. We use ResNet-50 as our backbone since many pretrained models are available for it. More details are in Section 6.1.

As the scale of the pretraining dataset increases from 100,000 to 1.2M images, the performance improves significantly, as shown in Table 4. In particular, when only finetuning the input (I) and output (O) layers, pretraining on ImageNet-1K outperforms pretraining on Tiny-ImageNet by 15.07%. When the dataset size increases from 1.2M to 14M, we observe a slight drop when only training the input and output layers. If we also finetune the normalization layers, the performance increases compared to pretraining on the smaller datasets. Our hypothesis is that, as the scale of the data increases, the distribution variance increases, widening the gap between point-clouds and images, so the effectiveness of the representations learned from images declines. However, finetuning the normalization layers mitigates this distribution shift Wang et al. (2021).

With full network finetuning, the final accuracy is around 0.5% higher than training from scratch, and scaling up the pretraining dataset does not bring notable improvement. This result contradicts many observations in image-to-image transfer tasks. Therefore, we conclude that the performance of the transferred model does not fully scale with the size of the image dataset.

4.5 How does the type of image dataset affect performance?

We then conduct experiments on the FractalDB dataset Kataoka et al. (2020) to explore how the type of image dataset affects the transfer performance. In particular, FractalDB is a large-scale dataset of 1M computer-generated, non-natural images with 1K/10K categories (FractalDB1K/FractalDB10K).

We surprisingly discover that such non-natural image pretraining can significantly help point-cloud classification. When only finetuning the input and output layers, the performance is even better than that of the model pretrained on ImageNet; pretraining on FractalDB1K yields at least a 2.07% accuracy improvement over pretraining on ImageNet. We also discover that updating the BN mean and variance harms the transfer performance significantly, by up to 13% accuracy. This is surprising since in all other experiments, updating the BN mean and variance almost always improves the performance. When finetuning more parameters (I, O, μ, σ, W, b), the performance improves by 15.64% and 10.37% top-1 accuracy for the models pretrained on FractalDB1K and FractalDB10K, respectively.

Therefore, we conclude that the type of image data does make a difference to the transfer performance and, surprisingly, non-natural images like FractalDB are more effective than natural images when finetuning only the input and output layers.

Layers to finetune Tiny-ImageNet ImageNet-1K ImageNet-21K FractalDB1k FractalDB10k
ModelNet (top-1 Acc.%)
I, O 68.40 83.47 81.81 85.54 83.55
I, O, μ, σ 81.69 83.28 82.10 72.53 77.96
I, O, μ, σ, W, b 88.70 88.98 89.18 88.17 88.33
whole network 89.06 88.90 89.19 88.45 88.49
from scratch 88.53 88.53
Table 4: Classification performance of ResNet-50 transferred from Tiny-ImageNet, ImageNet-1K, ImageNet-21K, FractalDB1K, and FractalDB10K to ModelNet 3D Warehouse.

4.6 How does the image task affect performance?

The above pretraining is mainly on the classification task. To explore the effect of pretraining on different tasks, we pretrain ResNet-18 on the semantic segmentation task of the Cityscapes Cordts et al. (2016) and ADE20K Zhou et al. (2017) datasets. For a fair comparison with pretraining on ImageNet-1K, we only load the encoder part.

The experimental results in Table 5 show that pretraining on Cityscapes significantly improves the performance over pretraining on ImageNet-1K when transferring to SemanticKITTI. When only finetuning the input and output layers, there is a 2.70% mIoU improvement. This shows that the representations learned from Cityscapes are more suitable than those learned from ImageNet-1K for SemanticKITTI segmentation. For pretraining on ADE20K versus ImageNet-1K and transferring to S3DIS, the transferred models exhibit comparable results. One possible explanation is that the S3DIS dataset is closer to image datasets since it contains RGB features.

Therefore, we conclude that the image pretraining task affects the transfer performance: as the pretraining task becomes closer to the target task, the performance improves.

Layers to finetune ImageNet-1K Cityscapes ImageNet-1K ADE20K
SemanticKITTI (mIoU%) S3DIS (mIoU%)
I, O 55.78 58.48 53.54 55.28
I, O, μ, σ 56.29 56.71 54.86 54.95
I, O, μ, σ, W, b 58.76 59.54 55.22 55.28
whole network 65.59 65.55 56.62 56.11
from scratch 64.75 55.09
Table 5: The performance of ResNet-18 pretrained on image semantic segmentation transferred to point-cloud segmentation. We compare pretraining on Cityscapes and ADE20K with pretraining on ImageNet-1K for transfer to SemanticKITTI and S3DIS, respectively.

4.7 How do better image models affect performance?

We then explore whether better 2D models on image tasks improve the performance of transferring to point-cloud tasks. Specifically, we conduct experiments on ModelNet 3D Warehouse on top of ResNet-18, ResNet-50, and ResNet-152 pretrained on ImageNet-1K, as shown in Table 6.

Layers to finetune ResNet-18 ResNet-50 ResNet-152
I, O 78.69 83.47 69.33
I, O, μ, σ 83.45 83.28 81.68
I, O, μ, σ, W, b 88.09 88.98 88.66
whole network 89.14 88.90 88.45
from scratch 88.61 88.53 88.74
Table 6: The performance of models pretrained on ImageNet-1K transferred to ModelNet 3D Warehouse.

We observe that, when only finetuning the input and output layers, the performance on ModelNet 3D Warehouse improves by 4.78% as the backbone scales from ResNet-18 to ResNet-50. However, using ResNet-152 leads to a noticeable performance drop of 14.14%. Our hypothesis is that the error introduced by the data distribution gap between images and point-clouds gradually accumulates as the number of layers increases, so training only the input and output layers can hardly transfer a good representation from images to point-clouds. This issue can be alleviated by updating the BN mean and variance, which helps shift the distribution across domains Wang et al. (2021): the performance is largely improved, by 12.35%, when updating the mean and variance. Furthermore, training the weight and bias of the normalization layers mitigates the distribution shift even further.

Meanwhile, when finetuning the whole network, the performance does not improve as the model scale increases, which is also a limitation. We conclude that better image models do not necessarily improve transfer performance.

4.8 How to design the filter transformation?

Figure 3: Visual examples of handcrafted filter transformations.

In the above experiments, we use filter inflation, i.e., directly copying the 2D filter K times into 3D, which can be expressed as a filter transformation, as shown in the top left of Fig. 3. This motivates us to consider how to design a better filter transformation.

We first design three handcrafted filter transformations T1, T2, T3, which copy the 2D weights along different orientations, as shown in Fig. 3. We then try to learn the filter transformation T directly instead of designing it by hand. We consider two settings: 1) all 2D filters share the same learnable filter transformation; 2) each 2D filter has its own learnable filter transformation. To learn such a filter transformation, we fix the weights of the 2D filters, train only T and the input and output layers, and update the BN mean and variance. For the initialization of T, we consider initializing it with our default T and random initialization, respectively.

In Table 7, we observe that the performance is relatively stable across the different handcrafted filter transformations T. However, the performance improves significantly, by at least 4.99% top-1 accuracy, when using a learnable transformation in the non-shared scheme initialized with the default T. With random initialization, the performance also improves over the handcrafted designs, although it is slightly worse than initializing with the default T.

Therefore, we conclude that learnable filter transformations do perform better than handcrafted designs. For learnable transformations, however, initializing from the handcrafted transformation outperforms random initialization.

Layers to finetune T1 T2 T3 ST NST ST(RI) NST(RI)
I, O, μ, σ 82.21 81.81 82.25 84.36 87.24 84.40 86.35
Table 7: The performance of different filter transformations based on ResNet-18 on the ModelNet 3D Warehouse dataset. T1, T2, and T3 are handcrafted filter transformations. ST (Shared T) denotes learning one filter transformation shared by all convolutions, and NST (Non-shared T) denotes learning a separate filter transformation for each convolution. RI means random initialization.

5 Discussion, Limitation, and Conclusion

In this paper, we explore the feasibility of transferring pretrained 2D ConvNets to 3D point-cloud tasks. Our experimental results in Section 4.1 show that visual representations pretrained on image datasets can indeed transfer to point-clouds. This is a surprising result, given that 2D images and 3D point-clouds represent visual information in highly different manners. Our hypothesis is: when we transfer a 2D ConvNet pretrained on images to 3D, we essentially apply a linear projection to the 2D convolutional filters; when processing 3D voxel input, this is equivalent to first projecting the 3D input to 2D and then processing it with the trained 2D filters. This transferability reveals that 2D image features and 3D point-cloud features are closely correlated. In future work, we plan to explore the reasons why image models can be transferred to point-cloud understanding.

We also explore factors that impact the transfer performance. The transfer performance is surprisingly good if we only finetune minimal parts of the model, but if we finetune the entire model and compare with training the model from scratch, we do not observe significant benefits. We seek to improve this by examining the impact of the image dataset scale, the dataset type, the pretraining task, and the scale of the pretrained image model. However, after this exploration, we still only observe limited improvements. Given the fast progress of image representation learning, the abundance of image data, and the difficulty of obtaining point-cloud data, we believe there is huge potential in further exploring how to transfer image models to point-cloud understanding for better performance and data efficiency. This will also be a focus of our future work.

Although the improvements are limited, our results still point to a very promising direction: designing better filter transformations and using more suitable datasets and models. Compared with previous works that seek improvements by designing new architectures and pretraining only on the point-cloud modality, our approach is not limited by immature representation learning schemes, small-scale point-cloud datasets, or expensive pretraining costs. We believe that image pretraining is one solution to the bottleneck of point-cloud understanding, and we hope this direction can inspire the research community in the future.

References

  • [1] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese (2017) Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105. Cited by: §1, §4.
  • [2] P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910. Cited by: §2.2.
  • [3] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall (2019) SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proc. of the IEEE/CVF International Conf. on Computer Vision (ICCV), Cited by: §1, §4.
  • [4] A. Boulch, B. Le Saux, and N. Audebert (2017) Unstructured point cloud semantic labeling using deep segmentation networks.. 3DOR 2, pp. 7. Cited by: §2.1.
  • [5] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2020) Nuscenes: a multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11621–11631. Cited by: §1.
  • [6] M. Caron, P. Bojanowski, J. Mairal, and A. Joulin (2019) Unsupervised pre-training of image features on non-curated data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2959–2968. Cited by: §2.2.
  • [7] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882. Cited by: §2.2.
  • [8] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin (2021) Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294. Cited by: §1.
  • [9] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §2.3, §3.
  • [10] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §1, §2.2, §4.1.
  • [11] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton (2020) Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029. Cited by: §2.2.
  • [12] X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §2.2, §4.1.
  • [13] C. Choy, J. Gwak, and S. Savarese (2019) 4D spatio-temporal convnets: minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3075–3084. Cited by: §2.1, §3.
  • [14] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §4, §4.6.
  • [15] A. Dai and M. Nießner (2018) 3dmv: joint 3d-multi-view prediction for 3d semantic scene segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 452–468. Cited by: §2.3.
  • [16] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.
  • [17] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §2.2.
  • [18] D. Feng, Y. Zhou, C. Xu, M. Tomizuka, and W. Zhan (2021) A simple and efficient multi-task network for 3d object detection and road understanding. arXiv preprint arXiv:2103.04056. Cited by: §2.1.
  • [19] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361. Cited by: §4.
  • [20] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §2.2.
  • [21] P. Goyal, M. Caron, B. Lefaudeux, M. Xu, P. Wang, V. Pai, M. Singh, V. Liptchinsky, I. Misra, A. Joulin, et al. (2021) Self-supervised pretraining of visual features in the wild. arXiv preprint arXiv:2103.01988. Cited by: §2.2.
  • [22] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §1, §2.2, §4.1.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.
  • [24] O. Henaff (2020) Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182–4192. Cited by: §2.2.
  • [25] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §2.2.
  • [26] J. Hou, B. Graham, M. Nießner, and S. Xie (2020) Exploring data-efficient 3d scene understanding with contrastive scene contexts. arXiv preprint arXiv:2012.09165. Cited by: §2.2.
  • [27] J. Hou, S. Xie, B. Graham, A. Dai, and M. Nießner (2021) Pri3D: can 3d priors help 2d representation learning?. arXiv preprint arXiv:2104.11225. Cited by: §2.3.
  • [28] B. Hua, M. Tran, and S. Yeung (2018) Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 984–993. Cited by: §2.1.
  • [29] Q. Huang, W. Wang, and U. Neumann (2018) Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2626–2635. Cited by: Table 1.
  • [30] M. Jaritz, T. Vu, R. d. Charette, E. Wirbel, and P. Pérez (2020) Xmuda: cross-modal unsupervised domain adaptation for 3d semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12605–12614. Cited by: §2.3.
  • [31] L. Jing and Y. Tian (2020) Self-supervised visual feature learning with deep neural networks: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.2.
  • [32] H. Kataoka, K. Okayasu, A. Matsumoto, E. Yamagata, R. Yamada, N. Inoue, A. Nakamura, and Y. Satoh (2020) Pre-training without natural images. In Proceedings of the Asian Conference on Computer Vision, Cited by: §2.2, §4, §4.5, §6.1.
  • [33] A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby (2019) Large scale learning of general visual representations for transfer. arXiv preprint arXiv:1912.11370. Cited by: §1.
  • [34] A. Komarichev, Z. Zhong, and J. Hua (2019) A-cnn: annularly convolutional neural networks on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7421–7430. Cited by: §2.1.
  • [35] L. Landrieu and M. Simonovsky (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4558–4567. Cited by: §4.1, §4.1, Table 1.
  • [36] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) Pointpillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697–12705. Cited by: §2.1.
  • [37] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, and M. Felsberg (2017) Deep projective 3d semantic segmentation. In International Conference on Computer Analysis of Images and Patterns, pp. 95–107. Cited by: §2.1.
  • [38] J. Li, B. M. Chen, and G. H. Lee (2018) So-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406. Cited by: §2.1.
  • [39] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen (2018) PointCNN: convolution on X-transformed points. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 828–838. Cited by: §2.1.
  • [40] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky (2015) Sparse convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 806–814. Cited by: §2.1.
  • [41] Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang (2019) Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2604–2613. Cited by: §6.1.
  • [42] Y. Liu, B. Fan, G. Meng, J. Lu, S. Xiang, and C. Pan (2019) Densepoint: learning densely contextual representation for efficient point cloud processing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5239–5248. Cited by: §2.1.
  • [43] Y. Liu, Y. Huang, H. Chiang, H. Su, Z. Liu, C. Chen, C. Tseng, and W. H. Hsu (2021) Learning from 2d: pixel-to-point knowledge transfer for 3d pretraining. arXiv preprint arXiv:2104.04687. Cited by: §2.3.
  • [44] Z. Liu, H. Hu, Y. Cao, Z. Zhang, and X. Tong (2020) A closer look at local aggregation operators in point cloud analysis. In European Conference on Computer Vision, pp. 326–342. Cited by: §2.1.
  • [45] Z. Liu, X. Qi, and C. Fu (2021) 3D-to-2d distillation for indoor scene parsing. arXiv preprint arXiv:2104.02243. Cited by: §2.3.
  • [46] K. Lu, A. Grover, P. Abbeel, and I. Mordatch (2021) Pretrained transformers as universal computation engines. arXiv preprint arXiv:2103.05247. Cited by: §2.3.
  • [47] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss (2019) Rangenet++: fast and accurate lidar semantic segmentation. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4213–4220. Cited by: §2.1.
  • [48] F. Pomerleau, F. Colas, and R. Siegwart (2015) A review of point cloud registration algorithms for mobile robotics. Foundations and Trends in Robotics 4 (1), pp. 1–104. Cited by: §1.
  • [49] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413. Cited by: §2.1, §4.1, §4.1, §4.2, Table 1, §6.1.
  • [50] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2016) PointNet: deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593. Cited by: §2.1, §4.1, Table 1.
  • [51] T. Ridnik, E. Ben-Baruch, A. Noy, and L. Zelnik-Manor (2021) ImageNet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972. Cited by: §6.1.
  • [52] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §3.
  • [53] H. Shan, Y. Zhang, Q. Yang, U. Kruger, M. K. Kalra, L. Sun, W. Cong, and G. Wang (2018) 3-d convolutional encoder-decoder network for low-dose ct via transfer learning from a 2-d trained network. IEEE transactions on medical imaging 37 (6), pp. 1522–1534. Cited by: §2.3.
  • [54] B. Shi, S. Bai, Z. Zhou, and X. Bai (2015) DeepPano: deep panoramic representation for 3-d shape recognition. IEEE Signal Processing Letters 22 (12), pp. 2339–2343. Cited by: §1, §4.1, Table 1.
  • [55] Sketchup (2021) 3D Warehouse: 3D modeling online. https://3dwarehouse.sketchup.com. Cited by: §1, §4.
  • [56] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller (2015) Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pp. 945–953. Cited by: §2.1.
  • [57] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In CVPR. Cited by: §4.2, §6.1.
  • [58] H. Tang, Z. Liu, S. Zhao, Y. Lin, J. Lin, H. Wang, and S. Han (2020) Searching efficient 3d architectures with sparse point-voxel convolution. In European Conference on Computer Vision, Cited by: §2.1.
  • [59] M. Tatarchenko, J. Park, V. Koltun, and Q. Zhou (2018) Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3887–3896. Cited by: §4.1, Table 1.
  • [60] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive representation distillation. arXiv preprint arXiv:1910.10699. Cited by: §2.3.
  • [61] B. Wang, V. Wu, B. Wu, and K. Keutzer (2019) LATTE: accelerating lidar point cloud annotation via sensor fusion, one-click annotation, and tracking. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 265–272. Cited by: §1.
  • [62] D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell (2021) Tent: fully test-time adaptation by entropy minimization. In International Conference on Learning Representations. Cited by: §4.4, §4.7.
  • [63] P. Wang, Y. Liu, Y. Guo, C. Sun, and X. Tong (2017) O-cnn: octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–11. Cited by: §2.1.
  • [64] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: §4.1, §4.1, Table 1.
  • [65] Z. Wang, W. Zhan, and M. Tomizuka (2018) Fusing bird’s eye view lidar point cloud and front view camera image for 3d object detection. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1–6. Cited by: §2.1.
  • [66] B. Wu, A. Wan, X. Yue, and K. Keutzer (2018) Squeezeseg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In ICRA, Cited by: §2.1.
  • [67] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer (2019) SqueezeSegV2: improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In ICRA, Cited by: §2.1, Table 1.
  • [68] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015-06) 3D shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §4, §4.1, Table 1.
  • [69] S. Xie, J. Gu, D. Guo, C. R. Qi, L. Guibas, and O. Litany (2020) PointContrast: unsupervised pre-training for 3d point cloud understanding. In European Conference on Computer Vision, pp. 574–591. Cited by: §2.2.
  • [70] C. Xu, B. Wu, Z. Wang, W. Zhan, P. Vajda, K. Keutzer, and M. Tomizuka (2020) Squeezesegv3: spatially-adaptive convolution for efficient point-cloud segmentation. In European Conference on Computer Vision, pp. 1–19. Cited by: §2.1, §4.1, §6.1.
  • [71] C. Xu, B. Zhai, B. Wu, T. Li, W. Zhan, P. Vajda, K. Keutzer, and M. Tomizuka (2021) You only group once: efficient point-cloud processing with token representation and relation inference module. arXiv preprint arXiv:2103.09975. Cited by: §1, §2.1.
  • [72] X. Xu and S. Todorovic (2016) Beam search for learning a deep convolutional neural network of 3d shapes. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 3506–3511. Cited by: Table 1.
  • [73] X. Xu and G. H. Lee (2020) Weakly supervised semantic point cloud segmentation: towards 10x fewer labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13706–13715. Cited by: §2.2.
  • [74] Y. Yan, Y. Mao, and B. Li (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §2.1.
  • [75] B. Yang, W. Luo, and R. Urtasun (2018) Pixor: real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 7652–7660. Cited by: §2.1.
  • [76] X. Yue, B. Wu, S. A. Seshia, K. Keutzer, and A. L. Sangiovanni-Vincentelli (2018) A lidar point cloud generator: from a virtual world to autonomous driving. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 458–464. Cited by: §1.
  • [77] Z. Zhang, R. Girdhar, A. Joulin, and I. Misra (2021) Self-supervised pretraining of 3d features on any point-cloud. arXiv preprint arXiv:2101.02691. Cited by: §2.2.
  • [78] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641. Cited by: §4, §4.6, §6.1.
  • [79] H. Zhou, X. Zhu, X. Song, Y. Ma, Z. Wang, H. Li, and D. Lin (2020) Cylinder3d: an effective 3d framework for driving-scene lidar semantic segmentation. arXiv preprint arXiv:2008.01550. Cited by: §2.1.

6 Appendix

6.1 Implementation detail.

Our experiments are mainly conducted on the ModelNet 3D Warehouse, S3DIS, and SemanticKITTI datasets. For ModelNet 3D Warehouse, all models are trained on the train set and evaluated on the validation set. For S3DIS, all models are trained on areas 1, 2, 3, 4, and 6 and evaluated on area 5. For SemanticKITTI, all models are trained on sequences 00-10 except 08, which is used for evaluation. For each of these datasets, all ResNet-series models are trained with the same training scheme, and all experiments are implemented with PyTorch.

For the ModelNet 3D Warehouse dataset, during training, the coordinates of the point-clouds are randomly scaled, translated, and jittered. We use the SGD optimizer with momentum 0.9 and weight decay, and an initial learning rate of 0.1 with a cosine learning rate scheduler. The mini-batch size is 32, and models are trained for 300 epochs. For both training and inference, we only use the x, y, z coordinates without other features and set the voxel size to 0.05. The ModelNet 3D Warehouse experiments are all conducted on a Titan RTX GPU.

For the S3DIS dataset, during training, we concatenate all subparts of an indoor scene for training and validation. Scenes are randomly flipped horizontally along the x and y directions. RGB features are randomly jittered, translated, and auto-contrasted. Finally, we normalize and clip the point-clouds. We set the voxel size to 0.05. We use the SGD optimizer with momentum 0.9 and weight decay, and an initial learning rate of 0.1 with a polynomial learning rate scheduler. The mini-batch size is 3, and models are trained for 400 epochs. The S3DIS experiments are all conducted on 2 Titan RTX GPUs.

For the SemanticKITTI dataset, during training, the coordinates of each point-cloud are randomly scaled and rotated. We use the SGD optimizer with momentum 0.9 and weight decay, and an initial learning rate of 0.24 with a cosine warmup learning rate scheduler. The mini-batch size is 2, and models are trained for 15 epochs. For both training and inference, we use the x, y, z coordinates as well as the intensity feature and set the voxel size to 0.05. The SemanticKITTI experiments are all conducted on 4 Titan RTX GPUs.

Detail of Section 4.1. All the experiments in this section are based on a ResNet-18 pretrained on ImageNet-1K from PyTorch. The results in the first group of Table 1 use our default ResNet, as shown in Section 6.4, Listings 3 and 4. The results in the second group of Table 1 use ResNet with linear input and output layers, as shown in Section 6.4, Listings 5 and 6. In detail, the pseudo code of the linear network (w/o backbone) is shown below.

class linear_net(nn.Module):
    # Linear network without backbone: a single sparse convolutional input
    # layer followed directly by the classification output layer.
    def __init__(self):
        super().__init__()
        self.input_layer = nn.Sequential(
            sparse_conv3d(input_dim, layer4_Odim, k=3, s=1),
            sparse_bn(layer4_Odim))  # BN over the input layer's output channels
        self.output_layer = nn.Sequential(
            global_average_pooling,
            nn.Linear(layer4_Odim, class_num),
            nn.BatchNorm1d(class_num))

    def forward(self, x):
        x = self.input_layer(x)
        return self.output_layer(x)
Listing 1: Pseudo code of the linear network without backbone

Detail of Section 4.2. For the pretraining of PointNet++ on the Tiny-ImageNet dataset, we treat each pixel of an image as a point. We use the original pixel location as the x and y coordinates of the corresponding point and set z to 1 for all points. We use the PointNet++-SSG version; the points are downsampled in each set abstraction module Qi et al. (2017). We choose the query-ball grouping mechanism and set the radius to 3. We use the SGD optimizer with momentum 0.9 and weight decay, and an initial learning rate of 0.1 decayed by a factor of 0.1 every 30 epochs. We set the batch size to 64 and train PointNet++ on two Titan RTX GPUs for 90 epochs.
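
A sketch of this pixel-to-point conversion is shown below (the helper name is ours, and using the RGB values as the point features is our assumption):

import torch

def image_to_points(img):
    # img: (3, H, W) image tensor. Each pixel becomes one point whose x, y
    # coordinates are the pixel location and whose z coordinate is 1.
    _, H, W = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    coords = torch.stack(
        [xs.reshape(-1), ys.reshape(-1), torch.ones(H * W)], dim=1)  # (H*W, 3), z = 1
    feats = img.reshape(3, -1).t()                                   # (H*W, 3) RGB features
    return coords, feats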

For the training phase on the ModelNet 3D Warehouse dataset, we use the Adam optimizer with a learning rate of 0.001 for both the training-from-scratch and finetuning schemes. The batch size is set to 32, and we train the model for 200 epochs on one Titan RTX GPU.

We directly use the pretrained HRNet from the official code Sun et al. (2019). Specifically, we use the official code of SqueezeSegV3 Xu et al. (2020) as our code base, substitute the SqueezeSegV3 backbone with HRNetV2-W48, and keep all other training settings exactly the same.

Detail of Section 4.3. The ResNet50 pretrained on ImageNet1K also comes from PyTorch. Both the training-from-scratch and the finetuned models use the same selected training subsets and are evaluated on the whole validation set of ModelNet 3D Warehouse.

Detail of Sections 4.4, 4.5, 4.6, and 4.7. The ResNet50 pretrained on Tiny-ImageNet is trained by ourselves, since such a pretrained model is rarely available; the training scheme directly follows the official PyTorch example code (https://github.com/pytorch/examples/tree/cbb760d5e50a03df667cdc32a61f75ac28e11cbf/imagenet). The ResNet50 pretrained on ImageNet1K is from PyTorch, and the ResNet50 pretrained on ImageNet21K is from the recent work of Ridnik et al. (2021). The ResNet50 models pretrained on FractalDB1K and FractalDB10K are from the official code of Kataoka et al. (2020). The ResNet18 pretrained on Cityscapes is from Liu et al. (2019a), and the ResNet18 pretrained on ADE20K is from Zhou et al. (2017). The pretrained ResNet18, ResNet50, and ResNet152 are from the PyTorch pretrained models. All finetuning schemes in these experiments are identical to the ModelNet 3D Warehouse training scheme described above.
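For reference, the sketch below shows how a pretrained 2D weight from the PyTorch (torchvision) ResNet50 could be expanded to a 3D kernel by replicating it along a new depth axis, in the spirit of the default T discussed in Section 4.8; the exact inflation and weight-loading code of the released implementation may differ.

import torchvision

# Load the ImageNet1K-pretrained 2D ResNet50 shipped with torchvision
# (newer torchvision versions use the weights= argument instead of pretrained=True).
resnet50_2d = torchvision.models.resnet50(pretrained=True)

# Take one 3x3 convolution weight and copy it along a new depth dimension.
w2d = resnet50_2d.layer1[0].conv2.weight.data   # (Cout, Cin, K, K)
K = w2d.shape[-1]
w3d = w2d.unsqueeze(2).repeat(1, 1, K, 1, 1)    # (Cout, Cin, K, K, K)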

Detail of Section 4.8. The implementation pseudocode of a 3D sparse convolution with a learnable T is shown below.

import torch
import torch.nn as nn
import torchsparse.nn.functional as spf

class conv3dwithT(nn.Module):
    def __init__(self):
        super().__init__()
        # Learnable transformation from a flattened 2D kernel (K**2) to a flattened 3D kernel (K**3).
        self.filter_trans = nn.Parameter(torch.zeros(K ** 2, K ** 3))
        init_(self.filter_trans)

    def forward(self, x, pre_conv2d):
        # x is the input feature; pre_conv2d is the pretrained 2D convolution weight.
        Cout, Cin, K, K = pre_conv2d.shape
        conv3d_kernel = torch.matmul(pre_conv2d.view(Cout, Cin, -1), self.filter_trans)
        return spf.sparse_conv3D(x, conv3d_kernel)
Listing 2: Pseudocode of a 3D sparse convolution with learnable T

For shared filter transformation, all pretrained 2D convolution weights are multiplied by the same learnable filter transformation. For non-shared filter transformation, each pretrained 2D convolution weight is multiplied by its own filter transformation.
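A minimal sketch of the two settings is given below; the parameter names and the way the per-layer transformations are stored are illustrative choices, not the released implementation.

import torch
import torch.nn as nn

K = 3
# Shared setting: a single (K**2, K**3) transformation reused by every pretrained 2D conv.
shared_T = nn.Parameter(torch.zeros(K ** 2, K ** 3))

# Non-shared setting: one transformation per pretrained 2D convolution layer.
non_shared_T = nn.ParameterDict({
    name: nn.Parameter(torch.zeros(K ** 2, K ** 3))
    for name in ["layer1_conv1", "layer1_conv2"]  # hypothetical layer names
})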

6.2 Detailed illustration of our hypothesis in Section 5.

In our discussion section, we provide a hypothesis about why pretrained 2D ConvNets can be used on 3D point-clouds. A further illustration is shown in Figure 4.

Figure 4: Visualization of our hypothesis. We hypothesize that directly performing 3D convolution on local voxels is similar to first projecting local voxels to 2D, then applying 2D convolution on projected voxels.

Although the above visual example uses our default T, which directly copies the 2D filters, the idea is more general: T can be any transformation matrix.
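Written in the interface of Listing 2, the default T is simply the (K**2, K**3) matrix that copies each flattened 2D kernel entry into every depth slice; the sketch below is our reading of this construction and assumes depth is the leading axis of the flattened 3D kernel.

import torch

def default_T(K=3):
    # Tiling the K**2 identity matrix K times copies the 2D filter into each of
    # the K depth slices of the flattened 3D kernel.
    return torch.eye(K ** 2).repeat(1, K)   # shape (K**2, K**3)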

6.3 Visualization examples of the learned filter transformations.

In Section 4.8, we discuss that, besides handcrafted filter transformations, filter transformations can also be learned. When all convolution layers share one common T, performance increases compared to the handcrafted T, and non-shared filter transformations improve performance even further. Here, we provide two visualizations of the shared and non-shared T, one initialized with the default T and one initialized randomly.

Figure 5: Visual examples of filter transformations initialized with the default T. The first sub-figure is the initialization T for both the shared and non-shared filter transformations. The second sub-figure is the shared T. The remaining sub-figures correspond to individual convolution layers when filter transformations are non-shared.
Figure 6: Visual examples of filter transformations initialized with a random T. The first sub-figure is the initialization T for both the shared and non-shared filter transformations. The second sub-figure is the shared T. The remaining sub-figures correspond to individual convolution layers when filter transformations are non-shared.

6.4 Details of the architectures used.

class Default_3DRes_cls(nn.Module):
    def __init__(self, res_block):
        super().__init__()
        # res_block is the same residual block as in the conventional ResNet.
        self.input_layer = nn.Sequential(
            sparse_conv3d(input_dim, layer1_Idim, k=3, s=1),
            sparse_bn(layer1_Idim),
            sparse_ReLU(True),
            sparse_conv3d(layer1_Idim, layer1_Idim, k=3, s=1),
            sparse_bn(layer1_Idim),
            sparse_ReLU(True),
            sparse_conv3d(layer1_Idim, layer1_Idim, k=3, s=2),
            sparse_bn(layer1_Idim),
            sparse_ReLU(True))

        # Four residual stages inflated from the pretrained 2D ResNet.
        self.layer1 = inflated_resnet_layer1(res_block, layer1_Idim, layer1_Odim)
        self.layer2 = inflated_resnet_layer2(res_block, layer2_Idim, layer2_Odim)
        self.layer3 = inflated_resnet_layer3(res_block, layer3_Idim, layer3_Odim)
        self.layer4 = inflated_resnet_layer4(res_block, layer4_Idim, layer4_Odim)

        self.output_layer = nn.Sequential(
            global_average_pooling,
            nn.Linear(layer4_Odim, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(True),
            nn.Linear(1024, class_num))

    def forward(self, x):
        x = self.input_layer(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        return self.output_layer(x)
Listing 3: Pseudocode of the inflated ResNet for classification
class Default_3DRes_seg(nn.Module):
    def __init__(self, res_block):
        super().__init__()
        # res_block is the same residual block as in the conventional ResNet.
        self.input_layer = nn.Sequential(
            sparse_conv3d(input_dim, layer1_Idim, k=3, s=1),
            sparse_bn(layer1_Idim),
            sparse_ReLU(True),
            sparse_conv3d(layer1_Idim, layer1_Idim, k=3, s=1),
            sparse_bn(layer1_Idim),
            sparse_ReLU(True),
            sparse_conv3d(layer1_Idim, layer1_Idim, k=3, s=2),
            sparse_bn(layer1_Idim),
            sparse_ReLU(True))

        # Four residual stages inflated from the pretrained 2D ResNet.
        self.layer1 = inflated_resnet_layer1(res_block, layer1_Idim, layer1_Odim)
        self.layer2 = inflated_resnet_layer2(res_block, layer2_Idim, layer2_Odim)
        self.layer3 = inflated_resnet_layer3(res_block, layer3_Idim, layer3_Odim)
        self.layer4 = inflated_resnet_layer4(res_block, layer4_Idim, layer4_Odim)

        # U-Net-style decoder: each stage upsamples, concatenates the skip
        # connection, and refines with two residual blocks.
        self.up1 = sparse_deconv(layer4_Odim, layer4_Odim, k=2, s=2)
        self.decoder1 = nn.Sequential(
            res_block(layer4_Odim + layer3_Odim, layer3_Odim),
            res_block(layer3_Odim, layer3_Odim))

        self.up2 = sparse_deconv(layer3_Odim, layer3_Odim, k=2, s=2)
        self.decoder2 = nn.Sequential(
            res_block(layer3_Odim + layer2_Odim, layer2_Odim),
            res_block(layer2_Odim, layer2_Odim))

        self.up3 = sparse_deconv(layer2_Odim, layer2_Odim, k=2, s=2)
        self.decoder3 = nn.Sequential(
            res_block(layer2_Odim + layer1_Odim, layer1_Odim),
            res_block(layer1_Odim, layer1_Odim))

        self.up4 = sparse_deconv(layer1_Odim, layer1_Odim, k=2, s=2)
        self.decoder4 = nn.Sequential(
            res_block(layer1_Odim + layer1_Odim, layer1_Odim),
            res_block(layer1_Odim, layer1_Odim))

        self.output_layer = nn.Sequential(
            nn.Linear(layer1_Odim, class_num))

    def forward(self, x):
        x_i = self.input_layer(x)
        x1 = self.layer1(x_i)
        x2 = self.layer2(x1)
        x3 = self.layer3(x2)
        x4 = self.layer4(x3)

        x3_ = self.decoder1(cat(x3, self.up1(x4)))
        x2_ = self.decoder2(cat(x2, self.up2(x3_)))
        x1_ = self.decoder3(cat(x1, self.up3(x2_)))
        xi_ = self.decoder4(cat(x_i, self.up4(x1_)))
        return self.output_layer(xi_)
Listing 4: Pseudocode of the inflated ResNet for segmentation
class LinearIO_3DRes_cls(nn.Module):
    def __init__(self, res_block):
        super().__init__()
        # res_block is the same residual block as in the conventional ResNet.
        # Linear input layer: a single sparse conv + batch norm, no nonlinearity.
        self.input_layer = nn.Sequential(
            sparse_conv3d(input_dim, layer1_Idim, k=3, s=1),
            sparse_bn(layer1_Idim))

        self.layer1 = inflated_resnet_layer1(res_block, layer1_Idim, layer1_Odim)
        self.layer2 = inflated_resnet_layer2(res_block, layer2_Idim, layer2_Odim)
        self.layer3 = inflated_resnet_layer3(res_block, layer3_Idim, layer3_Odim)
        self.layer4 = inflated_resnet_layer4(res_block, layer4_Idim, layer4_Odim)

        # Linear output layer: pooling followed by a single linear classifier.
        self.output_layer = nn.Sequential(
            global_average_pooling,
            nn.Linear(layer4_Odim, class_num),
            nn.BatchNorm1d(class_num))

    def forward(self, x):
        x = self.input_layer(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        return self.output_layer(x)
Listing 5: Pseudocode of the inflated ResNet with linear input and output layers for classification
class LinearIO_3DRes_seg(nn.Module):
    def __init__(self, res_block):
        super().__init__()
        # res_block is the same residual block as in the conventional ResNet.
        # Linear input layer: a single sparse conv + batch norm, no nonlinearity.
        self.input_layer = nn.Sequential(
            sparse_conv3d(input_dim, layer1_Idim, k=3, s=1),
            sparse_bn(layer1_Idim))

        self.layer1 = inflated_resnet_layer1(res_block, layer1_Idim, layer1_Odim)
        self.layer2 = inflated_resnet_layer2(res_block, layer2_Idim, layer2_Odim)
        self.layer3 = inflated_resnet_layer3(res_block, layer3_Idim, layer3_Odim)
        self.layer4 = inflated_resnet_layer4(res_block, layer4_Idim, layer4_Odim)

        # Simplest decoder: each stage upsamples, concatenates the skip
        # connection, and applies a single sparse conv + batch norm.
        self.up1 = sparse_deconv(layer4_Odim, layer4_Odim, k=2, s=2)
        self.decoder1 = nn.Sequential(
            sparse_conv3d(layer4_Odim + layer3_Odim, layer3_Odim, k=3, s=1),
            sparse_bn(layer3_Odim))

        self.up2 = sparse_deconv(layer3_Odim, layer3_Odim, k=2, s=2)
        self.decoder2 = nn.Sequential(
            sparse_conv3d(layer3_Odim + layer2_Odim, layer2_Odim, k=3, s=1),
            sparse_bn(layer2_Odim))

        self.up3 = sparse_deconv(layer2_Odim, layer2_Odim, k=2, s=2)
        self.decoder3 = nn.Sequential(
            sparse_conv3d(layer2_Odim + layer1_Odim, layer1_Odim, k=3, s=1),
            sparse_bn(layer1_Odim))

        self.up4 = sparse_deconv(layer1_Odim, layer1_Odim, k=2, s=2)
        self.decoder4 = nn.Sequential(
            sparse_conv3d(layer1_Odim + layer1_Odim, layer1_Odim, k=3, s=1),
            sparse_bn(layer1_Odim))

        self.output_layer = nn.Sequential(
            nn.Linear(layer1_Odim, class_num))

    def forward(self, x):
        x_i = self.input_layer(x)
        x1 = self.layer1(x_i)
        x2 = self.layer2(x1)
        x3 = self.layer3(x2)
        x4 = self.layer4(x3)

        x3_ = self.decoder1(cat(x3, self.up1(x4)))
        x2_ = self.decoder2(cat(x2, self.up2(x3_)))
        x1_ = self.decoder3(cat(x1, self.up3(x2_)))
        xi_ = self.decoder4(cat(x_i, self.up4(x1_)))
        return self.output_layer(xi_)
Listing 6: Pseudocode of the inflated ResNet with a linear input layer and the simplest decoder for segmentation

6.5 Stability analysis

We list the results of the stability analysis in the tables below. We run three trials with different random seeds to account for the randomness of the experiments. The results reported in the main body are one moderate value among the trials, while the results reported here are given as mean ± standard deviation. Note that we run 20, 15, and 10 trials for the data-efficiency experiments with 1%, 5%, and 10% of the training data, respectively (Table 3 in the main paper).

Layers to finetune               ModelNet (top 1 Acc%)    S3DIS (mIoU%)    SemanticKITTI (mIoU%)
I, O                             79.32 ± 0.46             53.40 ± 0.25     55.74 ± 0.13
I, O, ,                          83.11 ± 0.23             54.82 ± 0.21     56.19 ± 0.11
I, O, , , W, b                   88.36 ± 0.17             55.26 ± 0.22     58.62 ± 0.14
whole network                    88.87 ± 0.35             56.53 ± 0.21     65.33 ± 0.23
from scratch                     88.65 ± 0.23             55.05 ± 0.17     64.66 ± 0.16
linear I, O                      69.74 ± 0.45             48.83 ± 0.23     54.31 ± 0.14
linear network (w/o backbone)    8.91 ± 3.46              -                -
from scratch (linear I, O)       88.53 ± 0.36             54.79 ± 0.21     63.75 ± 0.29
Table 8: Stability analysis for Table 1 in the main paper.
Layers to finetune    Tiny-ImageNet → ModelNet           Cityscapes → SemanticKITTI
                      PointNet++ (Cls top 1 Acc%)        HRNetV2-W48 (Seg mIoU%)
I, O, ,               82.14 ± 0.44                       36.02 ± 0.11
whole network         90.64 ± 0.35                       49.33 ± 0.11
from scratch          90.61 ± 0.33                       44.09 ± 0.12
Table 9: Stability analysis for Table 2 in the main paper.
Training data      1% data (20 trials)    5% data (15 trials)    10% data (10 trials)    100% data
w/ pretraining     26.77 ± 5.48           68.96 ± 1.91           76.44 ± 0.64            88.73 ± 0.13
w/o pretraining    20.61 ± 8.16           68.28 ± 1.64           75.74 ± 0.93            88.56 ± 0.18
Table 10: Stability analysis for Table 3 in the main paper.
Layers to finetune    Tiny-ImageNet    ImageNet-1K     ImageNet-21K    FractalDB1K     FractalDB10K
                      (all columns: ModelNet top 1 Acc.%)
I, O                  67.84 ± 0.33     83.99 ± 0.30    81.37 ± 0.33    85.55 ± 0.02    83.32 ± 0.16
I, O, ,               81.23 ± 0.37     82.53 ± 0.45    81.78 ± 0.24    71.85 ± 0.56    77.92 ± 0.06
I, O, , , W, b        88.86 ± 0.48     88.92 ± 0.22    89.46 ± 0.20    88.36 ± 0.30    88.57 ± 0.18
whole network         88.84 ± 0.14     88.73 ± 0.13    89.21 ± 0.04    88.48 ± 0.07    88.55 ± 0.14
from scratch          88.56 ± 0.18     88.56 ± 0.18
Table 11: Stability analysis for Table 4 in the main paper.
Layers to finetune    ImageNet-1K      Cityscapes       ImageNet-1K      ADE20K
                      SemanticKITTI (mIoU%)             S3DIS (mIoU%)
I, O                  55.74 ± 0.13     58.41 ± 0.11     53.40 ± 0.25     55.21 ± 0.21
I, O, ,               56.19 ± 0.11     56.69 ± 0.12     54.82 ± 0.21     54.85 ± 0.25
I, O, , , W, b        58.62 ± 0.14     59.47 ± 0.17     55.26 ± 0.22     55.28 ± 0.25
whole network         65.33 ± 0.23     65.51 ± 0.22     56.53 ± 0.21     56.22 ± 0.24
from scratch          64.66 ± 0.16                      55.05 ± 0.17
Table 12: Stability analysis for Table 5 in the main paper.
Layers to finetune    ResNet-18        ResNet-50        ResNet-152
I, O                  79.32 ± 0.46     83.99 ± 0.30     69.42 ± 0.08
I, O, ,               83.11 ± 0.23     82.53 ± 0.45     81.16 ± 0.38
I, O, , , W, b        88.36 ± 0.17     88.92 ± 0.22     88.74 ± 0.09
whole network         88.87 ± 0.35     88.73 ± 0.13     88.45 ± 0.23
from scratch          88.65 ± 0.23     88.56 ± 0.18     88.75 ± 0.02
Table 13: Stability analysis for Table 6 in the main paper.
To finetune    T1             T2             T3             ST             NST            ST(RI)         NST(RI)
I, O, ,        82.24 ± 0.73   81.82 ± 0.18   82.35 ± 0.29   84.20 ± 0.97   87.20 ± 0.06   84.54 ± 0.51   86.59 ± 0.29
Table 14: Stability analysis for Table 7 in the main paper.