P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting
Pre-training big models on large-scale datasets has become a crucial topic in deep learning. Pre-trained models with high representation ability and transferability have achieved great success and dominate many downstream tasks in natural language processing and 2D vision. However, it is non-trivial to extend such a pretraining-tuning paradigm to 3D vision, given the limited training data, which are relatively inconvenient to collect. In this paper, we provide a new perspective of leveraging pre-trained 2D knowledge in the 3D domain to tackle this problem, tuning pre-trained image models with the novel Point-to-Pixel prompting for point cloud analysis at a minor parameter cost. Following the principle of prompt engineering, we transform point clouds into colorful images with geometry-preserved projection and geometry-aware coloring to adapt to pre-trained image models, whose weights are kept frozen during the end-to-end optimization of point cloud analysis tasks. We conduct extensive experiments to demonstrate that, with our proposed Point-to-Pixel Prompting, a better pre-trained image model leads to consistently better performance in 3D vision. Benefiting from the rapid progress of the image pre-training field, our method attains 89.3% accuracy on the hardest setting of ScanObjectNN, surpassing conventional point cloud models with far fewer trainable parameters. Our framework also exhibits very competitive performance on ModelNet classification and ShapeNet Part Segmentation. Code is available at https://github.com/wangzy22/P2P
With the rapid development of deep learning and computing hardware, neural networks are experiencing explosive growth in model size and representation capacity. Pre-training big models has become an important research topic in both natural language processing devlin2018bert; radford2019language; brown2020gpt3; smith2022MTNLG and computer vision radford2021clip; ramesh2021zero; he2022masked; riquelme2021scaling, and has achieved great success when transferred to downstream tasks with fine-tuning he2020moco; chen2020mocov2; chen2020simclr; bao2021beit; he2022masked or prompt-tuning radford2021clip; petroni2019lama; wallace2019advtrigger; shin2020autoprompt; lester2021prompttuning; liu2021ptuning strategies. Fine-tuning is a traditional tuning strategy that requires a large number of trainable parameters, while prompt tuning is a recently emerged lightweight scheme that converts downstream tasks into a form similar to the pre-training task. However, the prevalence of the pretraining-tuning pipeline depends on abundant training data in the pre-training stage. The leading language pre-training work Megatron-Turing NLG smith2022MTNLG, with 530 billion parameters, is trained on 15 datasets containing over 338 billion tokens, while Vision MoE riquelme2021scaling, with 14.7 billion parameters, is trained on the JFT-300M dataset sun2017JFT, which includes 305 million training images.
Unfortunately, the aforementioned convention of pre-training big models on large-scale datasets and tuning on downstream tasks has encountered obstacles in 3D vision. 3D visual perception is gaining more and more attention given its importance in many emerging research fields including autonomous driving lang2019pointpillars; yin2021center, robotics vision chao2021dexycb; yang2020human2robot and virtual reality ma2021avatars; wei2019vr. However, obtaining abundant 3D data such as point clouds from LiDAR is neither convenient nor inexpensive. For example, the widely used object-level point cloud dataset ShapeNet chang2015shapenet contains only 50 thousand synthetic samples. Therefore, pre-training fundamental 3D models with limited data remains an open question. Some previous works attempt to develop specific pre-training strategies on point clouds with limited training data, such as Point Contrast xie2020pointcontrast, OcCo wang2021occo and Point-BERT yu2021pointbert. Although they prove that the pretraining-finetuning pipeline also works well in the 3D domain, the imbalance between numerous trainable parameters and limited training data may lead to insufficient optimization or overfitting problems.
Different from the previous methods that directly pre-train models on 3D data, we propose to transfer pre-trained knowledge from the 2D domain to the 3D domain with appropriate prompt engineering, since images and point clouds display the same visual world and share some common knowledge. In this way, we address the data-starvation problem in the 3D domain, given that pre-training is well studied in the 2D field with abundant training data and that prompt-tuning on 3D tasks does not require much 3D training data. To the best of our knowledge, this is the first work to transfer knowledge in pre-trained image models to 3D vision with a novel prompting approach. More specifically, we propose an innovative Point-to-Pixel Prompting mechanism that transforms point clouds into colorful images with geometry-preserved projection and geometry-aware coloring. Examples of produced colorful images are shown in Figure 1. The colorful images are then fed into the pre-trained image model with frozen weights to extract representative features, which are further passed to downstream task-specific heads. The conversion from point clouds to colorful images and the end-to-end optimization pipeline promote a bidirectional knowledge flow between points and pixels. The geometric information from point clouds is mostly retained in projected images via our geometry-preserved projection, while the color information of natural images from the pre-trained image model is transmitted back to colorless point clouds via the cooperation between the geometry-aware coloring module and the fixed pre-trained image model.
We conduct extensive experiments to demonstrate that with our Point-to-Pixel Prompting, enlarging the scale of the same image model results in higher point cloud classification performance, which is consistent with the observations in image classification. This suggests that we can take advantage of the successful research in pre-training big image models, opening up a new avenue for point cloud analysis. With far fewer trainable parameters, we achieve results comparable to the best object classification methods on both synthetic ModelNet40 wu2015modelnet and real-world ScanObjectNN uy2019scanobjectnn. We also demonstrate the potential of our method to perform dense predictions like part segmentation on ShapeNetPart yi2016shapenetpart. In conclusion, our Point-to-Pixel Prompting (P2P) framework explores the feasibility and advantages of transferring image pre-trained knowledge to the point cloud domain, promoting a new pre-training paradigm in 3D point cloud analysis.
Pre-training visual models has been studied thoroughly in the image domain, and there are mainly three kinds of pre-training strategies: supervised, weakly-supervised and unsupervised. Supervised pre-training dosovitskiy2020vit; DBLP:journals/corr/abs-2106-04560; carreira2017quo on classification tasks with large-scale datasets is a traditional practice and is stimulated by the rapid development of base vision models he2016resnet; huang2017densenet; dosovitskiy2020vit; liu2021swin. From convolution-based networks he2016resnet; huang2017densenet to attention-based models dosovitskiy2020vit; liu2021swin; liu2021swinv2, model size has grown rapidly and the supporting datasets have grown from ImageNet deng2009imagenet with 14 million samples to JFT-3B sun2017JFT containing 3 billion images. Weakly-supervised pre-training methods tarvainen2017mean; berthelot2019mixmatch; xie2020self; pham2021meta use fewer annotations, while unsupervised pre-training approaches he2020moco; chen2020mocov2; chen2020simclr; bao2021beit; he2022masked; grill2020bootstrap introduce no task-related bias and bring higher transferability to various downstream tasks. Contrastive-learning-based methods he2020moco; chen2020mocov2; chen2020simclr; grill2020bootstrap and pretext-task-based approaches bao2021beit; he2022masked; doersch2015unsupervised are representative unsupervised pre-training works.
In contrast to the prosperity of pre-training image models, pre-training 3D models is still under development. Many studies have developed self-supervised learning mechanisms with various pretext tasks such as solving jigsaw puzzles sauder2019jigsaw, orientation estimation poursaeed2020orientation, and deformation reconstruction achituve2021deformation. Inspired by pre-training strategies in the image domain, Point Contrast xie2020pointcontrast adopts the contrastive learning principle, while OcCo wang2021occo and Point-BERT yu2021pointbert introduce reconstruction pretext tasks for better representation learning. However, the data limitation in the 3D domain remains a large obstacle to developing better pre-training strategies.
Prompt tuning is an important mechanism whose principle is to adapt downstream tasks with limited annotated data to the original pre-training task at a minimum cost, thus exploiting the pre-trained knowledge to solve downstream problems. It was first proposed in the natural language processing community liu2021promptsurvey, and has been leveraged in many vision-language models. At first, hand-crafted prompting methods like LAMA petroni2019lama and GPT-3 brown2020gpt3 were proposed, manually designing the cloze format. Later on, works like AdvTrigger wallace2019advtrigger and AutoPrompt shin2020autoprompt developed automated search algorithms to select discrete prompt tokens within a large corpus. Recently, continuous prompting methods represented by Prefix-Tuning li2021prefixtuning, PromptTuning lester2021prompttuning and P-Tuning liu2021ptuning; liu2021ptuningv2 have become the mainstream given their flexibility and high performance.
In contrast, the development of prompting for visual pre-trained models lags behind. L2P wang2021l2p proposes a prompt pool for the continual learning problem, while VPT jia2022vpt takes the first step in bringing a continuous prompt tuning framework, inspired by P-Tuning liu2021ptuning; liu2021ptuningv2, to traditional vision perception problems such as classification and segmentation, optimizing learnable prompt tokens with a frozen pre-trained ViT dosovitskiy2020vit. In conclusion, prompt tuning in the vision research field is still emerging, and there is no previous work like this paper that discusses tuning pre-trained image models for point cloud analysis with an appropriate prompting mechanism.
Given the unordered data structure of point clouds, early literature has developed two kinds of methods to construct structural representations for point cloud object analysis: voxel-based and point-based. Led by VoxNet maturana2015voxnet, voxel-based methods maturana2015voxnet; klokov2017kdnet; riegler2017octnet partition the 3D space into ordered voxels and perform 3D convolutions for feature extraction. On the other hand, point-based methods directly process unordered points and introduce various approaches to aggregate local information, with PointNet qi2017pointnet and PointNet++ qi2017pointnet++ as pioneering works. Their successors are differentiated into multiple branches, including MLP-based ma2022pointmlp; cheng2021pranet, convolution-based li2018pointcnn; wu2019pointconv; thomas2019kpconv and graph-based ones wang2019dgcnn; li2018adagraph. Recently, the attention-based Transformer vaswani201transformer; guo2021pct; zhao2021pointtransformer architecture has prevailed over other frameworks in the vision community, and methods like PCT guo2021pct and Point Transformer zhao2021pointtransformer utilize the technique to achieve competitive performance in point cloud object analysis.
Besides the aforementioned methods that perform representation learning in the 3D space, there are projection-based methods su2015mvcnn; kanezaki2018rotationnet; wei2020viewgcn; hamdi2021mvtn; feng2018gvcnn that leverage multi-view images to represent 3D objects. Most methods are based on synthetic datasets with fixed multi-view images rendered from meshes in advance, developing various aggregation functions to merge features from multiple views. Recently, MVTN hamdi2021mvtn introduces the differentiable rendering technique to build an end-to-end learning pipeline, rendering images online and regressing the optimal projection view. Different from theirs, our work designs a novel prompt engineering scheme, leveraging differentiable projection as the bridge between points and pixels. Thus we are able to utilize 2D color knowledge from pre-trained image models that is absent in colorless point clouds. Moreover, our framework is implemented in a faster single-view pattern, as we only select one random projection view during training and do not develop any aggregation strategy to explicitly fuse multi-view knowledge.
The overall architecture of our P2P framework is illustrated in Figure 2. It consists of four components: 1) a geometry encoder to extract point-level geometric features from the input point clouds, 2) a Point-to-Pixel Prompting module to produce colorful images based on geometric features, 3) a pre-trained image model to leverage pre-trained knowledge from the image domain, and 4) a task-specific head to perform various kinds of point cloud tasks. We introduce the geometry encoder, the Point-to-Pixel Prompting module and task-specific heads in detail in the following sections. As for the choice of the pre-trained image model, we investigate both convolution-based and attention-based architectures in Section 4.2.1.
With the proposed architecture that can be optimized in an end-to-end manner, we are able to exploit 2D pre-trained knowledge for point cloud analysis from two perspectives. In the forward process, the point clouds are projected into images with preserved geometry information and the resulting images can be recognized and handled by the pre-trained image model. In the backward optimization, the frozen pre-trained weights of the image model act as an anchor and guide the learnable coloring module to learn extra color knowledge for colorless point clouds, without explicit manual interference and only under the indirect supervision from the overall target functions of downstream tasks. Therefore, the resulting colorful images are expected to mimic patterns in 2D images and to be distinguishable for the pre-trained image model in downstream tasks.
One of the most significant advantages of 3D point clouds over 2D images is that point clouds contain more spatial and geometric information that is compressed or even lost in flat images. Therefore, we first extract geometry features from point clouds for better spatial comprehension, implementing a lightweight DGCNN wang2019dgcnn to extract local features of each point.
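Given an input point cloud $P \in \mathbb{R}^{N \times 3}$ with $N$ points, we first locate the $k$-nearest neighbors $\mathcal{N}(i)$ of each point $x_i$. Then for each local region, we implement a small neural network $h_\Theta$ to encode the relative position relations between the central point $x_i$ and the local neighbor points $x_j$, $j \in \mathcal{N}(i)$. Then we can obtain geometric features $F \in \mathbb{R}^{N \times C}$ with dimension $C$:
$$f_i = \operatorname{MAX}_{j \in \mathcal{N}(i)} \, h_\Theta\big(\operatorname{CAT}(p_i,\; p_j - p_i)\big),$$
where $p_i, p_j$ are the coordinates of $x_i, x_j$ respectively, and $\operatorname{MAX}$ and $\operatorname{CAT}$ stand for max-pooling and concatenation within all points in the local neighbor region, respectively.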
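The neighborhood encoding described above can be sketched in a few lines of NumPy. This is only an illustrative, EdgeConv-style layer under stated assumptions: the function names `knn_indices` and `edge_features`, the single linear layer with ReLU standing in for the small neural network, the exclusion of each point from its own neighborhood, and the random weights are all hypothetical choices, not the paper's implementation.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of every point (self excluded in this sketch)."""
    diff = points[:, None, :] - points[None, :, :]      # (N, N, 3) pairwise offsets
    dist2 = np.sum(diff ** 2, axis=-1)                  # (N, N) squared distances
    np.fill_diagonal(dist2, np.inf)                     # a point is not its own neighbor
    return np.argsort(dist2, axis=1)[:, :k]             # (N, k)

def edge_features(points, k, weight, bias):
    """Shared linear map + ReLU on [center, neighbor - center], max-pooled over neighbors."""
    idx = knn_indices(points, k)                        # (N, k)
    center = np.repeat(points[:, None, :], k, axis=1)   # (N, k, 3)
    relative = points[idx] - center                     # (N, k, 3) relative positions
    edge = np.concatenate([center, relative], axis=-1)  # (N, k, 6)
    hidden = np.maximum(edge @ weight + bias, 0.0)      # (N, k, C)
    return hidden.max(axis=1)                           # (N, C) per-point features

# Hypothetical toy input: 128 random points, 16 output channels.
rng = np.random.default_rng(0)
pts = rng.standard_normal((128, 3))
W = rng.standard_normal((6, 16)) * 0.1
b = np.zeros(16)
feats = edge_features(pts, k=8, weight=W, bias=b)
```

The dense pairwise distance matrix keeps the sketch short; it scales quadratically with the number of points, so a practical encoder would typically use a batched or tree-based neighbor search.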
Following the principle of prompt tuning mechanism introduced in Section 2.2, we propose Point-to-Pixel Prompting to adapt point cloud analysis to image representation learning, on which the image model is initially pre-trained. As illustrated in Figure 2, we first introduce geometry-preserved projection to transform the 3D point cloud into 2D images, rearranging 3D geometric features according to the projection correspondences. Then we propose a geometry-aware coloring module to dye projected images, transferring 2D color knowledge in the pre-trained image model to the colorless point cloud and obtaining more distinguishable images that can be better recognized by the pre-trained image model.
Having obtained geometric features $F \in \mathbb{R}^{N \times C}$ of the input point cloud $P$, we further rearrange them into an image-style layout to prepare for producing the colorful image $I \in \mathbb{R}^{H \times W \times 3}$, where $H, W$ are the height and width of the target image. We carefully design a geometry-preserved projection to avoid information loss when casting 3D point clouds to 2D images.
The first step is to find the spatial correspondence between point coordinates $p_i$ and image pixel coordinates $c_i = (c_{i,1}, c_{i,2})$. Since one dimension is lost during the projection process, we randomly select a projection view during training to construct a stereoscopic space with flat image components. Equivalently, we rotate the input point cloud with a rotation matrix $R$ to get the 3D coordinates after rotation: $\tilde{p}_i = R\,p_i$. The rotation matrix is constructed in two steps: first rotating around one coordinate axis by angle $\theta$, then rotating around a second axis by angle $\phi$, where $\theta$ and $\phi$ are random rotation angles during training and fixed, pre-selected angles during inference. Then we simply omit the final dimension and evenly split the first two dimensions into 2D grids: $c_{i,j} = \lfloor (\tilde{p}_{i,j} - \min_{i'} \tilde{p}_{i',j}) / g_j \rfloor$, where $i$ denotes the point index, $j \in \{1, 2\}$ denotes the coordinate dimension, and $g_j$ denotes the grid size at dimension $j$.
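As a rough illustration of this first step, the sketch below rotates a cloud by two angles, drops the last coordinate, and bins the remaining two into a pixel grid. Several details are assumptions made only for the example: the rotation axes (z first, then x), the mapping of the first coordinate to image rows, the grid size being derived from the cloud's extent, and the helper names `rotation_matrix` and `project_to_pixels`.

```python
import numpy as np

def rotation_matrix(theta, phi):
    """Rotate about z by theta, then about x by phi (the axis choice is an assumption)."""
    cz, sz = np.cos(theta), np.sin(theta)
    cx, sx = np.cos(phi), np.sin(phi)
    rot_z = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    rot_x = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return rot_x @ rot_z

def project_to_pixels(points, theta, phi, height, width):
    """Rotate the cloud, discard depth, and bin the first two coordinates into a grid."""
    rotated = points @ rotation_matrix(theta, phi).T    # (N, 3), each row is R @ p
    plane = rotated[:, :2]                              # the final dimension is omitted
    lo = plane.min(axis=0)
    extent = np.maximum(plane.max(axis=0) - lo, 1e-8)   # guard against a degenerate cloud
    grid = extent / np.array([height, width])           # grid size per dimension
    cells = np.floor((plane - lo) / grid).astype(int)
    cells = np.minimum(cells, [height - 1, width - 1])  # upper-edge points go to the last cell
    return cells[:, 0], cells[:, 1]

# Hypothetical toy input and angles.
rng = np.random.default_rng(0)
pts = rng.standard_normal((256, 3))
R = rotation_matrix(0.3, -0.5)
rows, cols = project_to_pixels(pts, 0.3, -0.5, height=32, width=48)
```

Because the rotation is orthogonal, distances within the cloud are unchanged before the depth axis is dropped; only the viewing direction varies between training iterations.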
The second step is to rearrange the per-point geometric features $f_i$ into per-pixel features $\tilde{F} \in \mathbb{R}^{H \times W \times C}$ according to the correspondence between point coordinates $p_i$ and pixel coordinates $c_i$. If multiple points fall in the same pixel at $(h, w)$, which is a common situation, we add the features of these points together to produce the pixel-level feature: $\tilde{f}_{h,w} = \sum_{i:\, c_i = (h, w)} f_i$. The summation operation brings two advantages related to our geometry-preserved design. On the one hand, we take all points in one pixel into consideration instead of the common practice that only keeps the foremost point according to depth and occlusion relation. In this way, we are able to represent and optimize all points in one image and produce images containing semitransparent objects with richer geometric information, as shown in Figure 1. On the other hand, we conduct a summation operation instead of taking the average, resulting in larger feature values when there are more points in one pixel. Such a design maintains the spatial density information of point clouds during the projection process, which is lacking in image representations and is critical in preserving geometry knowledge.
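The summation of point features into pixels is a scatter-add. A minimal NumPy sketch follows; the function name `scatter_sum` and the toy values are illustrative only. It uses `np.add.at`, which accumulates correctly when several points share the same pixel index.

```python
import numpy as np

def scatter_sum(features, rows, cols, height, width):
    """Sum the features of all points that fall into the same pixel."""
    image = np.zeros((height, width, features.shape[1]))
    np.add.at(image, (rows, cols), features)  # unbuffered add handles repeated indices
    return image

# Three toy points with 2-D features; the first two land in the same pixel (0, 1).
feats = np.array([[1.0, 0.0], [2.0, 1.0], [0.5, 0.5]])
rows = np.array([0, 0, 1])
cols = np.array([1, 1, 0])
fmap = scatter_sum(feats, rows, cols, height=2, width=2)
```

In this toy case the shared pixel holds the sum of both point features rather than their mean, so its magnitude reflects that two points were projected there, which is the density cue the paragraph above describes. A plain `image[rows, cols] += features` would silently keep only one contribution per repeated index.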
In conclusion, our proposed geometry-preserved projection produces a geometry-aware image feature map that contains plentiful spatial knowledge of the object. Note that we only utilize one projection view during training and do not explicitly design any aggregation functions for multi-view feature fusion. Therefore, we follow a single-view projection pipeline, which is much more efficient than its multi-view counterpart.
Although 3D point clouds contain richer geometric knowledge than 2D images, colorful pictures carry more texture and color information than colorless point clouds, which is also decisive in visual comprehension. The frozen image model pre-trained on abundant images learns to perceive the visual world not only from object shapes and outlines, but also relies heavily on discriminative colors and textures. Therefore, an image feature map that contains only geometric knowledge and lacks color information is not the most suitable input for the pre-trained image model to understand and process. In order to better leverage the pre-trained 2D knowledge of the frozen image model, we propose to predict colors for each pixel, explicitly encouraging the network to migrate color knowledge in the pre-trained image model to the projected images via the end-to-end optimization. Since the input contains rich geometry information that heavily affects the coloring process, the resulting images are expected to display different colors on different geometric parts, which is verified in Figure 1. Therefore, we refer to our dyeing process as geometry-aware coloring.
More specifically, we design a lightweight 2D neural network $g_\Phi$ to predict RGB colors for each pixel from the projected per-pixel feature map $\tilde{F}$: $I = g_\Phi(\tilde{F})$. We implement several convolutions in $g_\Phi$ for image smoothing, as the initial projected image features are relatively discontinuous due to the sparsity of the original point cloud. Therefore, the smoothing operation is critical in producing more realistic images that the pre-trained image model can recognize. The resulting colorful images are then prepared for further image-level feature extraction through the pre-trained image model.
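To make the role of the convolutions concrete, here is a small NumPy sketch of a coloring network: two same-padded 3x3 convolutions with a ReLU in between and a sigmoid at the end. The depth, kernel size, sigmoid output range, random weights, and the names `conv3x3` and `color_image` are all assumptions for illustration; the paper only states that the module is lightweight and uses several convolutions.

```python
import numpy as np

def conv3x3(fmap, kernel):
    """Same-padded 3x3 convolution. fmap: (H, W, C_in), kernel: (3, 3, C_in, C_out)."""
    height, width, _ = fmap.shape
    padded = np.pad(fmap, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((height, width, kernel.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            window = padded[dy:dy + height, dx:dx + width, :]  # shifted view, (H, W, C_in)
            out += window @ kernel[dy, dx]                     # accumulate this tap
    return out

def color_image(fmap, kernel_hidden, kernel_rgb):
    """Map a sparse per-pixel feature map to an RGB image with values in (0, 1)."""
    hidden = np.maximum(conv3x3(fmap, kernel_hidden), 0.0)     # ReLU
    logits = conv3x3(hidden, kernel_rgb)
    return 1.0 / (1.0 + np.exp(-logits))                       # sigmoid (assumed range)

# Hypothetical toy feature map with 4 channels on an 8x8 grid.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 8, 4))
k1 = rng.standard_normal((3, 3, 4, 8)) * 0.1
k2 = rng.standard_normal((3, 3, 8, 3)) * 0.1
rgb = color_image(fmap, k1, k2)
```

Each 3x3 tap spreads an isolated non-zero pixel over its neighbors, which is the smoothing effect the paragraph attributes to the convolutions.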
Take ViT as the pre-trained image model for example. The outputs from the pre-trained image model are image token features $F_{\text{img}} \in \mathbb{R}^{N_t \times D}$ and one class token feature $f_{\text{cls}} \in \mathbb{R}^{D}$, where $N_t$ is the number of image patches and $D$ is the token feature dimension. For different downstream tasks, we design different task-specific heads and optimization strategies, which are introduced in the following part.
For object classification, we follow the common protocol in image Transformer models and utilize the class token $f_{\text{cls}}$ as the input to the classifier, implemented as only one linear layer: $\hat{y} = \operatorname{Linear}(f_{\text{cls}})$. We use the cross-entropy loss as the optimization target.
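The classification head is small enough to write out directly. The following NumPy sketch shows a single linear layer applied to a batch of class tokens and a numerically stable cross-entropy; the token dimension of 768, the 40 classes, and the zero-initialized weights are hypothetical values chosen only so that the loss has an easily checked value.

```python
import numpy as np

def linear_classifier(cls_token, weight, bias):
    """One linear layer on the class token. cls_token: (B, D) -> logits: (B, K)."""
    return cls_token @ weight + bias

def cross_entropy(logits, labels):
    """Mean cross-entropy using a stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Hypothetical batch: 4 class tokens of dimension 768, 40 object categories.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 768))
W = np.zeros((768, 40))          # zero weights give uniform predictions
b = np.zeros(40)
logits = linear_classifier(tokens, W, b)
loss = cross_entropy(logits, np.array([0, 5, 17, 39]))
```

With uniform predictions over 40 classes, the loss equals the natural logarithm of 40, a useful sanity check before training starts.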
For part segmentation, we rearrange the token features into image layouts and upsample them to . Then we design a lightweight 2D segmentation head based on SemanticFPN kirillov2019semanticfpn or UPerNet xiao2018uper to predict per-pixel segmentation logits: . Given that multiple points may correspond to one pixel and that we train the network in a single-view pattern, projecting per-pixel predictions back to 3D points would cause supervision conflicts. Instead, we project 3D labels into 2D image-style labels, exactly as the point cloud is projected. Then we implement a per-pixel multi-label CE loss, as there may be points from multiple classes projected to the same pixel: . The values of the multi-hot 2D label are assigned according to projection correspondences, satisfying . Supervision in the 2D domain speeds up the training procedure without much information loss, since we keep all features of the points in one pixel and the optimization target is accordingly based on their category distributions. During inference, we select multiple projection views and re-project 2D per-pixel segmentation results back to 3D points, fusing multi-view predictions. Therefore, the per-point segmentation is decided by the most evident predictions from the most distinguishable projection directions.
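The two directions of this procedure, projecting point labels to pixels for training and re-projecting pixel predictions to points at inference, can be sketched as follows. The details are assumptions for illustration: the label is a plain multi-hot map without any normalization, and multi-view fusion keeps the maximum score per class across views before taking the arg-max. The paper's exact label constraint and fusion rule are not recoverable from the text above, and the names `project_labels` and `fuse_views` are hypothetical.

```python
import numpy as np

def project_labels(labels, rows, cols, height, width, num_classes):
    """Multi-hot 2D labels: entry (h, w, c) is 1 if any point of class c lands in pixel (h, w)."""
    target = np.zeros((height, width, num_classes))
    target[rows, cols, labels] = 1.0
    return target

def fuse_views(per_view_scores, per_view_pixels):
    """Read each point's pixel scores in every view, keep the max per class, then arg-max."""
    gathered = [scores[rows, cols]                       # (N, num_classes) per view
                for scores, (rows, cols) in zip(per_view_scores, per_view_pixels)]
    fused = np.max(np.stack(gathered, axis=0), axis=0)   # (N, num_classes)
    return fused.argmax(axis=1)

# Training side: three points, the first two (classes 0 and 1) share pixel (0, 0).
labels = np.array([0, 1, 1])
target = project_labels(labels, np.array([0, 0, 1]), np.array([0, 0, 1]),
                        height=2, width=2, num_classes=2)

# Inference side: two hypothetical views of two points on a 2x2 image with 2 classes.
view_a = np.zeros((2, 2, 2))
view_a[0, 0] = [0.6, 0.4]
view_a[1, 1] = [0.5, 0.5]
view_b = np.zeros((2, 2, 2))
view_b[0, 1] = [0.2, 0.9]
view_b[1, 0] = [0.7, 0.1]
pixels_a = (np.array([0, 1]), np.array([0, 1]))
pixels_b = (np.array([0, 1]), np.array([1, 0]))
point_pred = fuse_views([view_a, view_b], [pixels_a, pixels_b])
```

In the toy inference example, each point takes its label from whichever view scores a class most confidently, which matches the informal description of letting the most distinguishable projection direction decide.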
We conduct object classification experiments on the ModelNet40 wu2015modelnet and ScanObjectNN uy2019scanobjectnn datasets, while ShapeNetPart yi2016shapenetpart is utilized for object-level part segmentation. ModelNet40 is the most popular 3D synthetic dataset for object classification, and contains 12,311 CAD models from 40 categories. We follow the common protocol to split 9,843 objects for training and reserve 2,468 objects for validation. ScanObjectNN is a more challenging point cloud object dataset sampled from real-world scans with background and occlusions. It contains 2,902 samples from 15 categories, and we conduct experiments on the perturbed (PB-T50-RS) variant. ShapeNetPart is sampled from the synthetic ShapeNet dataset and annotates each object with part-level labels. It consists of 16,881 objects from 16 shape categories, and their parts are split into 50 classes.
We implement our proposed P2P architecture with PyTorch paszke2019pytorch and utilize the AdamW loshchilov2017adamw optimizer with learning rate and weight decay , accompanied by the CosineAnnealing learning rate scheduler loshchilov2016sgdr. We optimize the parameters of the geometry encoder, Point-to-Pixel Prompting module, task-specific head and normalization layers in the pre-trained image model, freezing the remaining pre-trained weights. The model is trained for 300 epochs with a batch size of 64. During training, the rotation angle is randomly selected from and is randomly selected from to keep the objects standing upright in the images. During inference, we evenly choose values of and values of to produce views for majority voting. Please refer to the supplementary material for architectural details.
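The inference-time procedure, evenly spaced view angles followed by majority voting, can be sketched as below. The angle ranges and the numbers of values are placeholders, since the concrete settings are not given in the text above; the helper names `inference_views` and `majority_vote` are likewise hypothetical.

```python
import numpy as np
from collections import Counter

def inference_views(first_range, second_range, n_first, n_second):
    """Evenly spaced values of both rotation angles, combined into a grid of views."""
    first = np.linspace(first_range[0], first_range[1], n_first)
    second = np.linspace(second_range[0], second_range[1], n_second)
    return [(a, b) for a in first for b in second]

def majority_vote(predictions):
    """Return the class predicted by the largest number of views."""
    return Counter(predictions).most_common(1)[0][0]

# Placeholder ranges and counts, chosen only for illustration.
views = inference_views((-3.0, 3.0), (-1.0, -0.5), n_first=4, n_second=2)
final_class = majority_vote([3, 5, 3, 1])
```

Each view would be passed through the full pipeline once, and the per-view class predictions are then reduced by `majority_vote`.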
We implement our P2P framework with different image models of different scales, ranging from convolution-based ResNet he2016resnet and ConvNeXt liu2022convnext to attention-based Vision Transformer dosovitskiy2020vit and Swin Transformer liu2021swinv2. These image models are pre-trained on ImageNet-1k deng2009imagenet with supervised classification. We report the image classification performance of the original image model, the number of trainable parameters after Point-to-Pixel Prompting, and the classification accuracy on ModelNet40 and ScanObjectNN datasets, as shown in Table 1.
From the quantitative results and accuracy curve, we can conclude that enlarging the scale of the same image model will result in higher classification performance, which is consistent with the observations in image classification. Therefore, our proposed P2P prompting can benefit 3D domain tasks by leveraging the prosperous development of 2D visual domain, including abundant training data, various pre-training strategies and superior fundamental architectures.
Comparisons with previous methods on ModelNet40 and ScanObjectNN are shown in Table 2. For baseline comparisons, we select methods qi2017pointnet++; thomas2019kpconv; wang2019dgcnn; ma2022pointmlp; hamdi2021mvtn; ran2022repsurf; qian2022pointnext that focus on developing 3D architectures and do not involve any pre-training strategies. We also select traditional pre-training works yu2021pointbert; wang2021occo; pang2022pointmae in the 3D domain. For our P2P framework, we show results of two versions: (1) a baseline version with ResNet-101 as the image model, and (2) an advanced version with HorNet-L rao2022hornet pre-trained on the ImageNet-22k dataset deng2009imagenet as the image model, additionally replacing the linear head with a multi-layer perceptron (MLP) as the classifier.
From the results we can draw three conclusions. Firstly, our P2P outperforms traditional 3D pre-training methods. This suggests that the pre-trained knowledge from the 2D domain is useful for solving 3D recognition problems and is better than directly pre-training on 3D datasets with limited data. Secondly, we obtain competitive performance against architecture-focused methods, which require many more trainable parameters than ours, and achieve state-of-the-art performance on the ScanObjectNN dataset. Therefore, our P2P framework exploits the potential of pre-trained knowledge from the image domain and opens a new avenue for point cloud analysis. Finally, our P2P framework performs relatively better on the real-world ScanObjectNN dataset than on the synthetic ModelNet dataset. This may be because the data distribution of ScanObjectNN is more similar to that of the ImageNet pre-training data, as they both contain visualizations of objects from the natural world. This property reveals the potential of our model in real-world applications.
The visualizations of our projected colorful images are shown in Figure 1. The first row shows several point cloud samples from the synthetic ModelNet40 dataset and the real-world ScanObjectNN dataset, while the second and third rows illustrate the colorful images produced by the Point-to-Pixel Prompting module from different projection views. As the figure shows, our geometry-preserved projection design maintains most spatial information, resulting in images of semitransparent objects that avoid occlusion problems, such as the chair leg in the second row. Additionally, the colors of the objects are reasonable and even suggest part-level distinctions to some extent, making projected images more similar to natural images that can be better processed by the pre-trained image model.
To investigate the architecture design and training strategy of our proposed framework, we conduct extensive ablation studies on ModelNet40 classification. Unless otherwise noted, we use the base version of Vision Transformer (ViT-B-1k) pre-trained on the ImageNet-1k dataset as our image model. Illustrations of our ablation settings can be found in Figure 3.
We conduct extensive ablation studies to demonstrate the advantages of our proposed P2P Prompting over vanilla fine-tuning and other prompting methods, shown in Table 3(a). As a baseline (Model A), we directly append the classification head to the geometry encoder without the pre-trained image model. Then we insert pre-trained ViT blocks to process point tokens from the geometry encoder, and discuss different fine-tuning strategies including fixing all ViT weights (Model B), fine-tuning normalization parameters (Model B) and fine-tuning all ViT weights (Model B). We also apply Visual Prompt Tuning (VPT) jia2022vpt to Model B with shallow (Model C) and deep (Model C) variants.
From the comparisons between Model A and the other variants, we can inspect the contribution of pre-trained knowledge from 2D to 3D classification. However, neither vanilla fine-tuning nor the previous prompting mechanism VPT fully exploits the pre-trained image knowledge. In contrast, our proposed Point-to-Pixel prompting is the best choice for migrating 2D pre-trained knowledge to the 3D domain at a low trainable parameter cost.
After confirming that Point-to-Pixel Prompting is the most suitable tuning mechanism, we discuss the design choices of the P2P module in detail. In our Point-to-Pixel Prompting, we propose to produce colorful images to adapt to the pre-trained image ViT. In Section 3.3.2, we have discussed our motivation and advantages of obtaining colorful images. Here we further prove the statement via ablation studies in Table 3(b). Model D processes per-pixel feature from Section 3.3.1 to directly generate image tokens, which are the inputs to the ViT blocks. In this variant, we bypass the explicit image generation process and directly adopt patch embedding layers on feature map . Model E generates binary black-and-white images conditioned only on the geometric projection from the point cloud, without predicting pixel colors as in P2P.
According to the results, Model D introduces many more trainable parameters due to the trainable patch embedding projection convolution layer with kernel size 16, while producing classification results inferior to those of P2P. On the other hand, even though Model E requires fewer trainable parameters, its performance lags far behind. Therefore, producing colorful images as the prompting mechanism best communicates knowledge between the image domain and the point cloud domain, fully exploiting pre-trained image knowledge from the frozen ViT model.
After fixing the architecture of our P2P framework, we investigate the best tuning strategy, adjusting the tuning extent of the pre-trained image model: (1) Model F: training the image model from scratch without loading pre-trained weights. (2) Model G: tuning all ViT parameters. (3) P2P: tuning only normalization parameters. (4) Model H: tuning only bias parameters. (5) Model I: fixing all ViT parameters without any tuning.
According to the results in Table 3(c), tuning normalization parameters is the most suitable solution, avoiding the loss of 2D information caused by massive tuning (Model G). Tuning normalization parameters also adapts the model to the point cloud data distribution, which Models H and I fail to accomplish. Additionally, quantitative comparisons between Model F and the others demonstrate that the pre-trained knowledge from the 2D domain is crucial in our P2P framework, since the limited data in the 3D domain is insufficient for optimizing a large ViT model from scratch with numerous trainable parameters.
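The tuning strategies compared here differ only in which image-model parameters receive gradients, so they can be expressed as filters over parameter names. The sketch below is framework-agnostic and purely illustrative: the strategy labels, the helper `trainable_names`, and the example parameter names (written in a common ViT naming style where normalization layers contain the substring `norm`) are assumptions, not taken from the authors' code.

```python
def trainable_names(param_names, strategy):
    """Select which image-model parameters are optimized under each ablation strategy."""
    if strategy in ("scratch", "all"):
        # Both update every parameter; "scratch" additionally skips loading pre-trained weights.
        return list(param_names)
    if strategy == "norm":
        return [name for name in param_names if "norm" in name]
    if strategy == "bias":
        return [name for name in param_names if name.endswith(".bias")]
    if strategy == "none":
        return []
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical parameter names for a tiny ViT-like model.
names = [
    "patch_embed.proj.weight",
    "blocks.0.norm1.weight",
    "blocks.0.norm1.bias",
    "blocks.0.attn.qkv.weight",
    "blocks.0.attn.qkv.bias",
    "norm.weight",
]
norm_only = trainable_names(names, "norm")
bias_only = trainable_names(names, "bias")
```

In a PyTorch training script, the selected names would typically be used to set `requires_grad` on the matching parameters before building the optimizer, leaving everything else frozen.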
In Table 3(d), we show the effects of different strategies for pre-training image models. For supervised pre-training, we load pre-trained weights on ImageNet-1k and ImageNet-22k datasets. For unsupervised pre-training, we select four most representative methods: CLIP radford2021clip, DINO caron2021dino, MoCo chen2021mocov3 and MAE he2022masked. We report the linear probing and fine-tuning results on ImageNet-1k dataset of each pre-training strategy in IN Acc. column with and respectively. Note that we implement CoOp zhou2021coop to report the zero-shot classification accuracy (denoting with ) of the CLIP pre-trained model.
From the experimental results, we can conclude that supervised pre-trained image models obtain relatively better results than unsupervised pre-trained ones. This may be because the objective of 3D classification is consistent with that of 2D classification, so supervised pre-trained weights are more suitable for transfer to the point cloud classification task. However, unsupervised approaches with strong transferability, such as DINO, also achieve competitive performance. Secondly, among unsupervised pre-training methods, those that achieve higher linear probing performance on 2D classification produce better results in 3D classification. This suggests that the transferability of a pre-trained image model is consistent when migrating to 2D and 3D downstream tasks.
The quantitative part segmentation results on the ShapeNetPart dataset are shown in Table 4. We implement the base version of ConvNeXt liu2022convnext as the image model and SemanticFPN kirillov2019semanticfpn as the 2D segmentation head for baseline comparison. We further implement the large version of ConvNeXt as the image model and the more complex UPerNet xiao2018uper as the 2D segmentation head to obtain better results. Our P2P framework achieves better performance than classical point-based methods, which demonstrates its potential for 3D dense prediction tasks based on 2D pre-trained image models. We leave it for future work to develop advanced segmentation heads and supervision strategies that better leverage pre-trained 2D knowledge in object-level or even scene-level point cloud segmentation.
In this paper, we propose a point-to-pixel prompting method to tune pre-trained image models for point cloud analysis. The pre-trained knowledge in the image domain can be effectively adapted to 3D tasks at a low trainable parameter cost and achieves competitive performance compared with state-of-the-art point-based methods, mitigating the data-starvation problem in the point cloud field that has been an obstacle for large-scale 3D pre-training research. The proposed Point-to-Pixel Prompting builds a bridge between the 2D and 3D domains, preserving the geometry information of point clouds in projected images while transferring color information from the pre-trained image model back to the colorless point cloud. Experimental results on object classification and part segmentation demonstrate the superiority and potential of our proposed P2P framework.
We conduct more experiments on point cloud classification tasks with image models of different architectures and scales, ranging from the convolution-based ConvNeXt liu2022convnext to the attention-based Vision Transformer and Swin Transformer liu2021swinv2. Each image model is pre-trained on the ImageNet-22k dataset. We report the image classification performance of the original image model fine-tuned on the ImageNet-1k dataset, the number of trainable parameters after Point-to-Pixel Prompting, and the classification accuracy on the ModelNet40 and ScanObjectNN datasets.
From the quantitative results and accuracy curve in Table 5, we can conclude that enlarging the scale of the same image model results in higher classification performance, which is consistent with the observations in image classification.
During training, the two rotation angles are randomly selected from fixed ranges chosen to keep the objects standing upright in the images. During inference, we evenly divide the range of each angle into several segments and combine them into multiple views for majority voting. We conduct ablations on the number of views on the ModelNet40 wu2015modelnet dataset with ViT dosovitskiy2020vit pre-trained on the ImageNet-1k dataset as the image model. Based on the ablation results in Table 6, we choose the number of values for each angle and combine them to produce the views used for majority voting.
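The multi-view inference described above can be sketched as follows. The angle ranges, the number of segments, the choice of segment centers as view angles, and the `predict` callable are all placeholders chosen for illustration; the values actually used are those reported in Table 6.

```python
from collections import Counter
from itertools import product


def evenly_spaced(low, high, count):
    """Return `count` angles at the centers of equal segments of [low, high]."""
    step = (high - low) / count
    return [low + (i + 0.5) * step for i in range(count)]


def majority_vote(predict, points, angles_a, angles_b):
    """Predict a class for every (angle_a, angle_b) view and return the most common one.

    `predict(points, a, b)` is a hypothetical callable returning a class id
    for the point cloud projected under the given pair of rotation angles.
    """
    votes = [predict(points, a, b) for a, b in product(angles_a, angles_b)]
    return Counter(votes).most_common(1)[0][0]


# Toy stand-in for the model: views with a small first angle vote for class 1.
def toy_predict(points, a, b):
    return 1 if a < 2.0 else 0


angles_a = evenly_spaced(0.0, 6.0, 3)   # assumed range, 3 segments
angles_b = evenly_spaced(-1.0, 1.0, 2)  # assumed range, 2 segments
label = majority_vote(toy_predict, None, angles_a, angles_b)
```

With three values of the first angle and two of the second, six views are evaluated, and the class predicted by most of them is returned.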
Figure 4 shows feature distributions of ModelNet40 and ScanObjectNN datasets in t-SNE visualization. We can conclude that with our proposed Point-to-Pixel Prompting, the pre-trained image model can extract discriminative features from projected colorful images for point cloud analysis.
The geometry encoder is implemented as a one-layer DGCNN wang2019dgcnn edge convolution. The input point coordinates are first embedded into 8-dim features with a channel-wise convolution. Then we use the k-nearest-neighbor (kNN) algorithm to locate the neighbors of each point, and concatenate the central point feature with the relative features between the point and its neighbors. The concatenated features are then processed by a 2D convolution with kernel size 1, followed by max-pooling over the neighbors of each point, resulting in the geometry feature.
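The edge convolution described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes that the neighbor search runs on the input coordinates and excludes the query point, it models the convolutions as plain matrix products without bias or activation, and the function name `edge_conv`, the neighbor count, and the output width are hypothetical.

```python
import numpy as np


def edge_conv(points, k, w_embed, w_edge):
    """One edge-convolution layer in the style of DGCNN (illustrative sketch).

    points:  (N, 3) input coordinates
    w_embed: (3, C) weights of the point-wise embedding
    w_edge:  (2C, D) weights of the kernel-size-1 convolution on edge features
    Returns an (N, D) array of per-point geometry features.
    """
    feats = points @ w_embed                                   # (N, C)
    # Pairwise squared distances in coordinate space for the kNN search.
    diff = points[:, None, :] - points[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)                           # (N, N)
    # k nearest neighbors; column 0 is the point itself when points are distinct.
    idx = np.argsort(dist2, axis=1)[:, 1:k + 1]                # (N, k)
    neighbors = feats[idx]                                     # (N, k, C)
    center = np.repeat(feats[:, None, :], k, axis=1)           # (N, k, C)
    # Concatenate the central feature with the relative neighbor feature.
    edge = np.concatenate([center, neighbors - center], axis=-1)  # (N, k, 2C)
    # Kernel-size-1 convolution, then max-pooling over the k neighbors.
    return (edge @ w_edge).max(axis=1)                         # (N, D)


rng = np.random.default_rng(0)
pts = rng.normal(size=(16, 3))
geo = edge_conv(pts, k=4,
                w_embed=rng.normal(size=(3, 8)),    # 8-dim embedding as in the text
                w_edge=rng.normal(size=(16, 32)))   # output width 32 is arbitrary
```

The result holds one feature vector per input point, which is what the subsequent projection step distributes onto the image plane.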
In the geometry-preserved projection module, we first calculate the coordinate range of the input point cloud. Then we calculate the grid size so that the projected object fits within the image.
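The two steps above, computing the coordinate range and deriving a grid size from it, can be sketched in plain Python. The exact grid-size formula is not given here, so this sketch assumes that the depth axis is dropped, that one grid size is shared by both image axes to preserve the aspect ratio, and that the larger extent is divided by the image size; `project_to_pixels` is a hypothetical helper name.

```python
import math


def project_to_pixels(points, image_size):
    """Map 3D points to integer pixel coordinates on an image_size x image_size grid.

    The depth axis (z) is dropped and a single grid size is shared by both
    image axes so that the aspect ratio of the object is preserved.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    # Coordinate range of the point cloud along the two image axes.
    extent = max(max(xs) - x_min, max(ys) - y_min)
    # Grid size: the side length of one pixel in point-cloud units (assumed formula).
    grid = extent / image_size if extent > 0 else 1.0
    pixels = []
    for x, y in zip(xs, ys):
        # Clamp so that points on the upper boundary land in the last pixel.
        u = min(int(math.floor((x - x_min) / grid)), image_size - 1)
        v = min(int(math.floor((y - y_min) / grid)), image_size - 1)
        pixels.append((u, v))
    return pixels


cloud = [(0.0, 0.0, 0.3), (1.0, 0.5, -0.2), (2.0, 1.0, 0.9)]
pixels = project_to_pixels(cloud, image_size=4)
```

Because both axes share one grid size, the object keeps its proportions: the toy cloud spans the full width of the 4-pixel image but only part of its height.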
The coloring module consists of a basic block from the ResNet he2016resnet architecture design with 3×3 convolutions and a final 2D convolution with kernel size 1, smoothing the pixel-level feature distribution and predicting the RGB channels of the image.
The classification head is implemented as one linear layer that reduces the feature channels to the number of classes, predicting logits for each class.
The inputs to the segmentation head are intermediate outputs of several layers of the ViT model, which are rescaled to different multiples of the ViT feature resolution via transpose convolutions with kernel size 2 or max-pooling layers. Then we implement the feature pyramid network kirillov2019semanticfpn with convolutions and upsampling layers to fuse features from different scales. Finally, the per-pixel segmentation prediction is obtained via a convolution layer with kernel size 1.
The implementation details of architectural design and experimental settings are shown in Table 7, where denotes the embedding dimension of image features extracted by pre-trained image models. We use slightly different architectures for classification and part segmentation. We use 4096 points for ModelNet40 to produce relatively smoother projected images, since too few points may lead to sparse and discontinuous pixel distributions that make the projected images unlike real 2D images.