Point-McBert: A Multi-choice Self-supervised Framework for Point Cloud Pre-training

by Kexue Fu, et al.
Fudan University

Masked language modeling (MLM) has become one of the most successful self-supervised pre-training tasks. Inspired by its success, Point-Bert, a pioneering work on point clouds, proposed masked point modeling (MPM) to pre-train point transformers on large-scale unannotated datasets. Despite its great performance, we find that the inherent difference between language and point clouds tends to cause ambiguous tokenization for point clouds. Unlike language, there is no gold standard for point cloud tokenization. Although Point-Bert introduces a discrete Variational AutoEncoder (dVAE) as a tokenizer to allocate token ids to local patches, this imperfect tokenizer tends to generate ambiguous token ids: it might generate different token ids for semantically-similar patches and the same token ids for semantically-dissimilar patches. To tackle the above problem, we propose Point-McBert, a pre-training framework with eased and refined supervision signals. Specifically, we ease the previous single-choice constraint on patches and provide multi-choice token ids for each patch as supervision. Moreover, we utilize the high-level semantics learned by the transformer to further refine our supervision signals. Extensive experiments on point cloud classification, few-shot classification and part segmentation tasks demonstrate the superiority of our method; e.g., the pre-trained transformer achieves 94.1% accuracy on ModelNet40, 84.28% accuracy on the hardest setting of ScanObjectNN, and new state-of-the-art performance on few-shot learning. We also demonstrate that our method not only improves the performance of Point-Bert on all downstream tasks, but also incurs almost no extra computational overhead.





Self-supervised pre-training Grill et al. (2020); Chen and He (2021); He et al. (2021); Yang et al. (2022); Ohri and Kumar (2021); Zhao and Dong (2020); Larsson et al. (2017); He et al. (2020) is attracting growing attention as it can transfer knowledge learned from large-scale unlabeled datasets to boost performance on downstream tasks. Most self-supervised pre-training methods are based on specific proxy tasks such as permutation prediction Zhao and Dong (2020), image colorization Larsson et al. (2017), and instance-level discrimination Grill et al. (2020); Chen and He (2021); He et al. (2020). Among them, the masked language modeling (MLM) task proposed in BERT Devlin et al. (2018) is currently one of the most successful proxy tasks and has been migrated to many other domains Bao et al. (2021); Li et al. (2022); Yu et al. (2021b). Point-Bert Yu et al. (2021b), a pioneering work in point cloud learning, proposed a variant of MLM called masked point modeling (MPM) to pre-train point cloud transformers. Specifically, it first divides a point cloud into several local point patches and assigns a token id to each patch, converting the point cloud into multiple discrete tokens. It then masks a proportion of the tokens and pre-trains the model by recovering the masked tokens from the transformer's encoding of the corrupted point cloud. Since there is no well-defined vocabulary for generating token ids for local point cloud patches, Point-Bert utilizes a pre-trained discrete Variational AutoEncoder (dVAE) Rolfe (2016) as the tokenizer. However, we find that such discrete tokenization in MPM hinders the framework from achieving better performance due to the difference between point clouds and language.

Languages are naturally composed of discrete words, which are strictly bijective to token ids. However, there is no such gold standard for point cloud discretization as there is for language, making it inevitable to introduce noise into MPM. Specifically, dVAE-based tokenization tends to cause the following two kinds of ambiguity. (1) Semantically-dissimilar patches have the same token ids. Due to the lack of a well-defined vocabulary, MPM adopts a pre-trained dVAE as the tokenizer to generate token ids for local patches. However, the tokenizer focuses on the local patches' geometry and barely considers their semantics, resulting in some wrong token ids. For example, the two patches shown in red in Figure 1 have similar geometric structures but different semantics (landing gear and aero-engine). Nevertheless, they are allocated the same token id (#3776) because of their similar geometric structure. (2) Semantically-similar patches have different token ids. As shown in Figure 1, semantically-similar patches of the airplane's aero-engine are allocated many different token ids (#599, #1274). It would be reasonable for them to share the same token ids given their similar semantics and geometry. However, the tokenizer neglects these relations and allocates them different token ids due to the interference of imperfect discretization and acquisition noise.

Inspired by Mc-Beit Li et al. (2022), we propose an improved Bert-style point cloud pre-training framework called Point-McBert with eased and refined masked prediction targets to tackle the above problems. Specifically, since semantically-similar patches might have different token ids, we ease the previous strict single-choice constraint on patches. For each local patch, we use a probability distribution vector over token ids as the supervision signal rather than a single token id, which means each patch has multi-choice token ids. Moreover, we believe the high-level perceptions produced by the point cloud transformer can provide extra semantic supervision signals that benefit our pre-training. Therefore, we refine the above supervision signals, i.e., the probability distribution vectors, using inter-patch semantic similarities computed from our point cloud transformer. By considering high-level similarities, the ambiguities caused by local geometric similarity can be mitigated.

To verify the effectiveness of our framework, we pre-train a point cloud transformer on ShapeNet Chang et al. (2015) and conduct extensive fine-tuning experiments on downstream tasks including point cloud classification, few-shot classification and part segmentation. Our framework not only improves the performance of the previous Point-Bert on all downstream tasks, but also incurs almost no extra computational overhead during pre-training. Our Point-McBert achieves 94.1% accuracy on ModelNet40 Wu et al. (2015) and 84.28% accuracy on the most complicated setting of ScanObjectNN Uy et al. (2019), outperforming a series of state-of-the-art methods. Our method also achieves a new state-of-the-art on point cloud few-shot learning, indicating the powerful generalization ability of our Point-McBert.

Figure 1: Visualization of improper token ids. To better visualize the improper tokenization, we provide two different views of the same point cloud. Different colors represent different token ids, and the circles indicate the extent of the local patches. As shown in the two views, semantically-dissimilar patches (landing gear and aero-engine, colored in red) have the same token id (#3776), while semantically-similar adjacent patches (aero-engine, colored in purple and brown) have different token ids (#1274, #599).

Related Works

Point cloud learning

Point cloud is an important type of geometric data structure, widely used in applications such as remote sensing Liu et al. (2022); Han and Sánchez-Azofeifa (2022), autonomous driving Lu et al. (2019), robotics Yang et al. (2020) and medical image analysis Shen et al. (2021). Since point clouds are irregular, researchers in the past focused on converting them into regular data such as 3D voxels or projected images. Many voxel-based Maturana and Scherer (2015); Wu et al. (2015) and view-based Qi et al. (2016); Su et al. (2015) methods were proposed, but they did not perform well. A pioneering work, PointNet Qi et al. (2017a), utilized shared multi-layer perceptrons with pooling to achieve permutation-invariant learning on point clouds, strongly outperforming previous works. Motivated by PointNet, a series of PointNet-style works Qi et al. (2017b); Wang et al. (2019); Thomas et al. (2019) were proposed subsequently. Recently, inspired by the success of the vision transformer (ViT) Dosovitskiy et al. (2020), many works Zhao et al. (2021); Park et al. (2021); Guo et al. (2021) investigated the application of the transformer Vaswani et al. (2017) to point cloud learning. Exploiting the permutation invariance of self-attention, PointTransformer Zhao et al. (2021) designed a self-attention-based point transformer layer for 3D scene understanding. Yu et al. Yu et al. (2021b) argued that the above transformer-based point cloud models deviate from the mainstream of standard transformers. Therefore, they used a standard transformer with minimal inductive bias in their Point-Bert Yu et al. (2021b) and achieved new state-of-the-art results on many point cloud analysis tasks. In this work, we follow the setting of Point-Bert and also utilize a standard transformer as our backbone. The proposed method differs from Point-Bert in applying a new tokenization mechanism.

Self-supervised pre-training

Pre-training on large-scale data and then fine-tuning on target tasks has proved to be an effective paradigm for boosting model performance on downstream tasks He et al. (2019). However, although many efficient annotation tools Wu et al. (2021); Girardeau-Montaut (2016) have been proposed, labeling a large-scale dataset is still costly, especially for point cloud data. Therefore, self-supervised pre-training, which pre-trains models without annotations, has more potential. In the past few years, many self-supervised pre-training works have been proposed in both natural language processing Devlin et al. (2018) and computer vision Grill et al. (2020); Chen and He (2021); He et al. (2021); Yang et al. (2022); Ohri and Kumar (2021); Zhao and Dong (2020); Larsson et al. (2017); He et al. (2020), which motivated works on point clouds Yu et al. (2021b); Poursaeed et al. (2020); Huang et al. (2021); Xie et al. (2020); Wang et al. (2021). The core of self-supervised methods is designing a proxy task. For example, Poursaeed et al. Poursaeed et al. (2020) pre-trained networks by predicting a point cloud's rotation. Inspired by the success of a series of contrastive-learning-based self-supervised methods Grill et al. (2020); Chen and He (2021); He et al. (2020), PointContrast Xie et al. (2020) and STRL Huang et al. (2021) implemented a contrastive learning paradigm to learn deep representations from depth scans. OcCo Wang et al. (2021) pre-trained its encoder by reconstructing occluded point clouds. Recently, Point-Bert Yu et al. (2021b), which designed a proxy task called masked point modeling (MPM) to pre-train point cloud transformers, performed better than previous self-supervised pre-training methods for point clouds. However, we find that the improper tokenization in Point-Bert tends to result in ambiguous supervision, which hinders it from achieving better performance. In this work, we address the problem of the improper tokenizer in Point-Bert and devise a new framework to achieve better performance.


Similar to Point-Bert Yu et al. (2021b), we also adopt the masked point modeling (MPM) paradigm as the proxy task to pre-train our model. Specifically, given a point cloud, we first sample $n$ center points using farthest point sampling Lang et al. (2020) and then select their $k$-nearest neighbor points to build $n$ local patches $\{p_i\}_{i=1}^{n}$. The points in each local patch are normalized by subtracting the center point to further mitigate their coordinate bias. Then a mini-PointNet Qi et al. (2017a) is used to map the normalized patches to a sequence of point embeddings $\{f_i\}_{i=1}^{n}$. After that, a tokenizer takes the embeddings as inputs and generates token ids $\{z_i\}_{i=1}^{n}$ for the patches, where each token id indexes the vocabulary $\mathcal{V}$ of length $N$. As done in other Bert-style works, we randomly mask a proportion of the tokens and then feed the corrupted point embeddings into a backbone implemented as a transformer encoder. The backbone learns $\ell_2$-normalized representations $\{h_i\}_{i=1}^{n}$ for both masked and unmasked tokens. Finally, we pre-train the backbone by predicting the masked tokens based on these representations. Writing the token id of patch $p_i$ as a vector $z_i$ over the vocabulary, the objective of MPM can be formulated as follows:

$$\mathcal{L}_{\mathrm{MPM}} = -\sum_{i \in \mathcal{M}} z_i^{\top} \log g(h_i),$$

where $\mathcal{M}$ denotes the set of masked patches and $g$ denotes an MLP head that predicts the token id distribution of a masked patch based on its representation $h_i$. For Point-Bert, the token id $z_i$ is a one-hot vector, which is derived from the latent feature in the dVAE:

$$z_i = \operatorname{one\text{-}hot}\Big(\operatorname*{arg\,max}_{j} \tilde{f}_{ij}\Big),$$

where $\tilde{f}_i$ denotes the latent feature in the dVAE's encoder. However, as mentioned above, this imperfect tokenizer makes some semantically-similar patches have different tokens and some semantically-dissimilar patches have the same tokens. To tackle these problems, we ease the token id to a soft vector $z_i \in [0,1]^{N}$ satisfying $\sum_{j} z_{ij} = 1$, allowing each patch to correspond to multiple token ids. Moreover, we re-build $z_i$ based on high-level semantic relationships for more accurate supervision. The details are presented in the next section.
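As a concrete illustration of the patch construction described above, the following NumPy sketch samples center points by farthest point sampling, groups their k-nearest neighbors, and subtracts the centers. The function names, the greedy FPS variant, and the toy input are our own illustrative choices, not code from the paper.

```python
import numpy as np

def farthest_point_sample(points, n_centers, seed=0):
    """Greedy farthest point sampling: returns indices of n_centers points."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [rng.integers(n)]
    dist = np.full(n, np.inf)
    for _ in range(n_centers - 1):
        # distance of every point to the most recently chosen center
        d = np.linalg.norm(points - points[chosen[-1]], axis=1)
        dist = np.minimum(dist, d)
        chosen.append(int(dist.argmax()))
    return np.array(chosen)

def build_patches(points, n_centers=64, k=32):
    """Group k-nearest neighbors around each FPS center and normalize
    each patch by subtracting its center (removing coordinate bias)."""
    centers_idx = farthest_point_sample(points, n_centers)
    centers = points[centers_idx]                       # (n_centers, 3)
    # pairwise distances from every center to every point
    d = np.linalg.norm(centers[:, None, :] - points[None, :, :], axis=-1)
    knn_idx = np.argsort(d, axis=1)[:, :k]              # (n_centers, k)
    patches = points[knn_idx] - centers[:, None, :]     # centered local patches
    return centers, patches

pts = np.random.default_rng(0).standard_normal((1024, 3))
centers, patches = build_patches(pts)   # shapes: (64, 3) and (64, 32, 3)
```

The defaults (1024 points, 64 patches of 32 points) follow the pre-training setup reported later in the paper.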

Figure 2: Overview of our proposed method, Point-McBert.

We improve Point-Bert by incorporating eased and refined masked supervision signals. We utilize a softmax layer to ease the hard token ids into soft probability distribution vectors over token ids, and use the encoded features generated by the transformer encoder to refine these soft probability distribution vectors. During pre-training, a proportion of tokens are randomly masked and the corrupted sequence is fed into the transformer encoder. The transformer encoder is optimized to predict the eased and refined token ids of the masked patches via a soft-label cross-entropy loss.
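The soft-label cross-entropy loss mentioned above can be sketched minimally in NumPy, under the assumption that the prediction head outputs raw logits over the vocabulary (helper names are ours):

```python
import numpy as np

def log_softmax(x):
    # numerically stable log-softmax over the vocabulary axis
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def soft_label_cross_entropy(logits, soft_targets):
    """Cross-entropy between predicted token distributions and soft targets.
    logits: (M, N) head outputs for M masked patches over an N-id vocabulary.
    soft_targets: (M, N), each row sums to 1. With one-hot target rows this
    reduces to the ordinary cross-entropy used by Point-Bert."""
    return float(-(soft_targets * log_softmax(logits)).sum(axis=1).mean())
```

Because the targets are distributions rather than single ids, semantically plausible alternative token ids still receive credit during masked prediction.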


Our framework is shown in Figure 2, and we will introduce it in detail in this section. To verify the effectiveness of our proposed idea, we set Point-Bert as our baseline and modify it as little as possible.

Before pre-training, we train a dVAE Rolfe (2016) as our tokenizer in the same way as Point-Bert. During pre-training, the transformer encoder Vaswani et al. (2017) in our framework is trained on unlabeled data via MPM. For fine-tuning, the transformer encoder is first initialized with the pre-trained parameters, and all of the parameters are then fine-tuned using labeled data from the downstream tasks. We first briefly introduce the implementations of the tokenizer and the transformer encoder in sections Tokenizer and Point transformer, respectively. Then we provide details about our multi-choice strategy in section Multi-choice tokenization, which plays an important role in improving the tokenization.


Tokenizer

Our tokenizer is trained through dVAE-based point cloud reconstruction. As shown in Figure 2, the dVAE adopts an encoder-decoder architecture. The encoder is our tokenizer, which consists of a DGCNN Wang et al. (2019) followed by a Gumbel-softmax Jang et al. (2016). During dVAE training, the DGCNN takes a sequence of point embeddings as input and outputs latent features, which are then discretized into one-hot vectors through the Gumbel-softmax. The decoder consists of a DGCNN followed by a FoldingNet Yang et al. (2018). The decoder's DGCNN takes the one-hot vectors as input and exploits their neighborhood relationships in the feature space to enhance the representation of these discrete tokens. The following FoldingNet reconstructs the original point cloud from the DGCNN's outputs. As in other VAE-style works Kingma and Welling (2013), we pursue the reconstruction objective by optimizing the evidence lower bound (ELBO) of the log-likelihood:

$$\sum_{p \in \mathcal{D}} \Big( \mathbb{E}_{z \sim q_{\phi}(z \mid p)} \big[ \log p_{\psi}(\tilde{p} \mid z) \big] - D_{\mathrm{KL}}\big[ q_{\phi}(z \mid p) \,\|\, u(z) \big] \Big),$$

where $\tilde{p}$ denotes the reconstructed patches, $\mathcal{D}$ denotes the training corpus, $q_{\phi}$ and $p_{\psi}$ denote the encoder and decoder, and the prior $u(z)$ follows a uniform distribution. The former term of the ELBO represents the reconstruction loss and the latter represents the distribution loss. We follow Yu et al. (2021a) to calculate the reconstruction loss and follow Ramesh et al. (2021) to optimize the difference between the distribution of the one-hot vectors and the prior. Our tokenizer has the same architecture as in Point-Bert Yu et al. (2021b) but outputs the latent features rather than the one-hot vectors during pre-training. For more details about the tokenizer's architecture, please refer to Point-Bert.
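To make the discretization step concrete, here is a minimal NumPy sketch of Gumbel-softmax sampling (forward pass only; a real dVAE uses a differentiable straight-through version inside an autodiff framework, and the function name is ours):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, hard=True, seed=0):
    """Sample (approximately) one-hot vectors from categorical logits.
    Adding Gumbel noise and taking a temperature softmax relaxes the
    non-differentiable argmax used for discrete token selection."""
    rng = np.random.default_rng(seed)
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = logits + g
    y = np.exp((y - y.max(axis=-1, keepdims=True)) / tau)
    y = y / y.sum(axis=-1, keepdims=True)       # relaxed (soft) sample
    if hard:
        one_hot = np.zeros_like(y)
        one_hot[np.arange(y.shape[0]), y.argmax(axis=-1)] = 1.0
        return one_hot                          # discrete token selection
    return y

z = gumbel_softmax(np.zeros((4, 8192)))  # 4 patches, vocabulary size 8192
```

The vocabulary size of 8192 matches the dVAE configuration stated in the experimental setup.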

Point transformer

We utilize a standard transformer Vaswani et al. (2017) following Point-Bert's settings as our backbone. The transformer has 12 blocks, each of which consists of a multi-head self-attention layer and a feed-forward layer. As shown in Figure 2, the transformer takes the point embeddings and center points as inputs, with a learnable classification token appended to the sequence of point embeddings produced by the mini-PointNet Qi et al. (2017a). The center points are used to generate positional encodings through an MLP; the positional encoding for the classification token is a learnable parameter. During pre-training, we mask a proportion of local point patches and replace their corresponding point embeddings with the same learnable mask embedding while keeping their positional embeddings unchanged. The corrupted embedding sequence is fed into the transformer, which outputs representations for both masked and unmasked patches; the representation of the classification token serves as part of the input to the classification head in downstream tasks.
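The masking step can be sketched as below. For illustration only, we use uniform random masking and a zero vector in place of the learnable mask embedding; the paper's actual pre-training uses block masking and a learned embedding.

```python
import numpy as np

def corrupt_embeddings(embeddings, mask_ratio=0.4, seed=0):
    """Replace a random proportion of patch embeddings with a shared mask
    embedding (a fixed zero vector here, standing in for the learnable one).
    Positional encodings are left untouched by design, so the model knows
    *where* a masked patch is but not *what* it contains."""
    rng = np.random.default_rng(seed)
    n, d = embeddings.shape
    n_mask = int(round(n * mask_ratio))
    masked_idx = rng.choice(n, size=n_mask, replace=False)
    mask_token = np.zeros(d)            # stands in for the learnable embedding
    corrupted = embeddings.copy()
    corrupted[masked_idx] = mask_token
    return corrupted, masked_idx

emb = np.ones((64, 384))                # 64 patches, feature dimension 384
corrupted, idx = corrupt_embeddings(emb, mask_ratio=0.4)
```

The mask ratio of 0.4 is an illustrative value, not a figure taken from the paper.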

Multi-choice tokenization

As mentioned above, there is no gold standard for point cloud tokenization. Therefore, it is inevitable for the tokenizer to produce improper supervision, including generating the same token ids for semantically-dissimilar patches and different token ids for semantically-similar patches. We observe that, given a local patch, there may exist multiple suitable token ids as candidates, as illustrated in Figure 1. Inspired by Mc-Beit Li et al. (2022), we ease the strict single-choice constraint on patches. Given a local patch $p_i$, we no longer use a unique token id as supervision. Instead, we predict a probability distribution vector $P_i$ over the token ids. Specifically, the probability distribution vector is generated by a softmax operation:

$$P_i = \operatorname{softmax}\big(\tilde{f}_i / \tau\big),$$

where $\tilde{f}_i$ denotes the latent feature in the tokenizer and $\tau$ is a temperature coefficient that controls the smoothness of the probability distribution. When $\tau$ is small, $P_i$ tends towards a one-hot vector; when $\tau$ is large, the distribution incorporates more choices.
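A minimal sketch of this easing step, assuming the tokenizer's latent features are raw scores over the vocabulary (the function name is ours):

```python
import numpy as np

def eased_token_distribution(latent, tau=0.005):
    """Soften tokenizer latent features into a distribution over token ids.
    Small tau -> near one-hot (single-choice); larger tau -> more candidate
    token ids are kept as plausible supervision targets."""
    z = latent / tau
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

f = np.array([[1.0, 0.9, 0.1]])             # toy 3-id vocabulary
p_sharp = eased_token_distribution(f, tau=0.005)   # almost one-hot
p_soft = eased_token_distribution(f, tau=10.0)     # nearly uniform
```

The default tau of 0.005 matches the value reported in the hyperparameter setup.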

Moreover, to further tackle the ambiguity of token ids shown in Figure 1, more high-level semantics should be incorporated. As analyzed in the introduction, the tokenizer focuses on encoding the local geometry of patches while ignoring the associations between them, making semantically-dissimilar patches have the same token ids. To alleviate this problem, we use the representations learned by the transformer to refine the probability distribution vectors, as done in Li et al. (2022). Specifically, we use the cosine similarity between patch features learned by the transformer to re-weight the probability distribution matrix $P$. The similarity matrix $S$ is calculated as follows:

$$S_{ij} = \langle h_i, h_j \rangle,$$

where $h_i$ denotes the $\ell_2$-normalized representation of patch $p_i$ learned by the transformer and $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors. The re-weighted probability distribution matrix $\tilde{P} = \operatorname{softmax}(S)\,P$, where the softmax normalizes each row of $S$, considers the inter-patch associations, making semantically-similar patches have more similar probability distributions and semantically-dissimilar patches have more discriminable probability distributions.
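The refinement step can be sketched as follows; row-softmax normalization of the similarity matrix is our assumption for how the re-weighting is normalized.

```python
import numpy as np

def refine_targets(reps, P):
    """Re-weight patch token distributions with inter-patch cosine similarity.
    reps: (n, d) transformer patch representations; P: (n, N) eased targets
    whose rows sum to 1. Each refined row mixes the distributions of patches
    that the transformer considers semantically similar."""
    h = reps / np.linalg.norm(reps, axis=1, keepdims=True)   # l2-normalize
    S = h @ h.T                                              # cosine similarities
    W = np.exp(S - S.max(axis=1, keepdims=True))
    W = W / W.sum(axis=1, keepdims=True)                     # row-wise softmax
    return W @ P                                             # refined distributions
```

Because the weighting matrix is row-stochastic, the refined rows remain valid probability distributions, and patches with identical representations receive identical targets.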

The final targets for prediction are a weighted sum of the eased probability distribution matrix $P$ and the refined probability distribution matrix $\tilde{P}$:

$$\hat{P} = \lambda P + (1-\lambda)\,\tilde{P},$$

where $\lambda$ is a coefficient balancing the low-level semantics from the tokenizer and the high-level semantics from the inter-patch similarity. Slightly different from Mc-Beit Li et al. (2022), we adopt a gradually decreasing $\lambda$ rather than a constant one. Specifically, at the beginning of pre-training, we set $\lambda = 1$ due to the inadequate training of the transformer. After 30 epochs, the transformer can well encode both the local geometry of patches and the dependencies between patches, so the coefficient $\lambda$ begins to decrease following a cosine schedule to boost pre-training. We set the lower bound of $\lambda$ to 0.8. Our experiments also show that this gradually decreasing paradigm plays an essential role in pre-training: when we instead fix the coefficient to a constant below 1 from the beginning, the transformer easily collapses due to the noisy semantics from the initial transformer.
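The decreasing coefficient described above can be sketched as a plain-Python schedule; the exact cosine parametrization is our assumption beyond the stated endpoints (held at 1.0 for the first 30 epochs, then cosine decay to a lower bound of 0.8).

```python
import math

def lambda_schedule(epoch, total_epochs=300, warm_epochs=30,
                    lam_max=1.0, lam_min=0.8):
    """Balancing coefficient: rely entirely on the tokenizer early on,
    then gradually trust the transformer's inter-patch semantics more."""
    if epoch < warm_epochs:
        return lam_max
    # cosine decay from lam_max to lam_min over the remaining epochs
    t = (epoch - warm_epochs) / (total_epochs - warm_epochs)
    return lam_min + 0.5 * (lam_max - lam_min) * (1.0 + math.cos(math.pi * t))
```

The 300-epoch total matches the pre-training length reported in the hyperparameter setup.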


In this section, we first introduce the setup of pre-training (section Pre-training setup). Then we conduct experiments on downstream tasks including point cloud classification (section Point cloud classification), few-shot learning (section Few-shot learning) and part segmentation (section Point cloud part segmentation). We also present various ablations to analyze the effectiveness of our method (section Ablation study) and provide visualizations of our learned representations (section Visualization). Furthermore, we also compare the computational overhead of our Point-McBert with Point-Bert (section Computational overhead). Our code will be publicly available.

Pre-training setup

Dataset: For all experiments, we use ShapeNet Chang et al. (2015) as the pre-training dataset, which contains over 50,000 CAD models from 55 common object categories. We randomly sample 1024 points from each CAD model and divide them into 64 local patches, each containing 32 points. For the Bert-style pre-training, we randomly mask patches for MPM in a block masking manner Yu et al. (2021b).

Architecture: The mini-PointNet Qi et al. (2017a) in our framework is an MLP followed by a global pooling layer. Our backbone network is a standard transformer Vaswani et al. (2017) with 12 layers. We set the number of heads in each layer to 6 and the feature dimension to 384. Our tokenizer is the encoder of the dVAE, whose vocabulary size is 8192.

Hyperparameters: We set the temperature coefficient to 0.005 and initialize the balancing coefficient to 1.0. After pre-training for 30 epochs, we gradually decrease the balancing coefficient in a cosine manner down to a lower bound of 0.8. We follow Point-Bert Yu et al. (2021b) to train our dVAE. During pre-training we use an AdamW Loshchilov and Hutter (2017) optimizer with a learning rate of 0.0005 and a weight decay of 0.05. Our transformer is pre-trained for 300 epochs with a batch size of 128.

Method Acc (%)
PointNet Qi et al. (2017a) 89.2
PointNet++ Qi et al. (2017b) 90.5
PointCNN Li et al. (2018) 92.2
DGCNN Wang et al. (2019) 92.2
DensePoint Liu et al. (2019a) 92.8
KPConv Thomas et al. (2019) 92.9
PCT Guo et al. (2021) 93.2
PointTransformer Zhao et al. (2021) 93.7
GLR Rao et al. (2020) 92.9
STRL Huang et al. (2021) 93.1
Baseline 91.4
OcCo Wang et al. (2021) 92.2
Point-Bert Yu et al. (2021b) 93.8
Ours 94.1
Table 1: Comparisons of the classification on ModelNet40 Wu et al. (2015).
Figure 3: Convergence curve. We compare the performance of the transformer trained from scratch (green) and pre-trained with our method (red) in terms of validation accuracy (%) on ModelNet40. The red dotted line denotes the best performance of our method, while the green dotted line denotes the best performance of the baseline. Our method reaches the baseline's best performance in only 19 epochs.

Point cloud classification

We perform our classification experiments on both the synthetic dataset ModelNet40 Wu et al. (2015) and the real-world dataset ScanObjectNN Uy et al. (2019).

Experiment on synthetic dataset

Dataset: ModelNet40 Wu et al. (2015) is the most popular 3D dataset for point cloud classification, containing 12311 CAD models from 40 object categories. We randomly sample 8k points from each CAD model for training and testing. We follow the previous setting Qi et al. (2017a); Wu et al. (2021) to split the dataset into 9843 models for training and 2468 for testing.

Fine-tuning: We follow the setting in Yu et al. (2021b) and employ a two-layer MLP with a dropout of 0.5 as the classification head. Specifically, we fine-tune the pre-trained backbone and the classification head using an AdamW Loshchilov and Hutter (2017) optimizer with a weight decay of 0.05 and a learning rate of 0.0005 under a cosine schedule. We set the batch size to 32.

Competitors: We compare our method with supervised methods, i.e., PointNet Qi et al. (2017a), PointNet++ Qi et al. (2017b), PointCNN Li et al. (2018), DGCNN Wang et al. (2019), DensePoint Liu et al. (2019a), KPConv Thomas et al. (2019), PCT Guo et al. (2021) and PointTransformer Zhao et al. (2021), as well as recently published self-supervised pre-training methods Yu et al. (2021b); Huang et al. (2021); Wang et al. (2021); Rao et al. (2020). Moreover, to illustrate the effectiveness of pre-training, we also set up a baseline model, which uses the same backbone as ours but is trained from scratch.


Results: We adopt classification accuracy as the evaluation metric, and the experimental results are listed in Table 1. As observed from the results, our method obtains 94.1% accuracy on ModelNet40, outperforming the competing methods and achieving new state-of-the-art performance. We also provide a convergence curve in Figure 3. We can see that our method improves the baseline by 2.7% and converges to a high performance faster. Our method also outperforms Point-Bert by 0.3% with negligible extra computational overhead, which indicates the effectiveness of our multi-choice strategy.

Method OBJ-BG OBJ-ONLY PB-T50-RS
PointNet 73.3 79.2 68.0
SpiderCNN 77.1 79.5 73.7
PointNet++ 82.3 84.3 77.9
PointCNN 86.1 85.5 78.5
DGCNN 82.8 86.2 78.1
BGA-DGCNN - - 79.7
BGA-PointNet++ - - 80.2
Baseline 79.86 80.55 77.24
OcCo 84.85 85.54 78.79
Point-Bert 87.43 88.12 83.07
Ours 88.98 90.02 84.28
Table 2: Comparisons of the classification on ScanObjectNN Uy et al. (2019). We report the accuracy (%) of three different settings (OBJ-BG, OBJ-ONLY, PB-T50-RS).

Experiment on real-world dataset

Dataset: To test our method’s generalization to real scenes, we also conduct an experiment on a real-world dataset. ScanObjectNN Uy et al. (2019) is a dataset modified from scene mesh datasets SceneNN Hua et al. (2016) and ScanNet Dai et al. (2017). It contains 2902 point clouds from 15 categories. We follow previous works Yu et al. (2021b); Uy et al. (2019) to carry out experiments on three variants: OBJ-BG, OBJ-ONLY and PB-T50-RS, which denote the version with background, the version without background and the version with random perturbations, respectively. More details can be found in Uy et al. (2019).

Fine-tuning: We use the same settings as on the synthetic dataset.

Competitors: We compare our method with supervised methods, i.e., PointNet Qi et al. (2017a), SpiderCNN Xu et al. (2018), PointNet++ Qi et al. (2017b), PointCNN Li et al. (2018) and DGCNN Wang et al. (2019), models equipped with the background-aware (BGA) module Uy et al. (2019), and some state-of-the-art self-supervised pre-training methods Yu et al. (2021b); Wang et al. (2021). In this experiment, we also set up a baseline as we do on ModelNet40.

Results: The classification accuracy on ScanObjectNN is shown in Table 2. As observed from the results, all methods perform worse on the real-world dataset than on the synthetic dataset ModelNet40, which is caused by less data for fine-tuning and the interference of noise, background, occlusion, etc. However, our method still achieves the best performance on all three variants. It is worth noting that our method significantly improves the baseline by 9.12%, 9.47%, and 7.04% on the three variants, which strongly confirms the generalization of our method. These results also indicate that pre-training can transfer useful knowledge to downstream tasks and plays an important role, especially when the downstream task is challenging.

Method 5-way 10-shot 5-way 20-shot 10-way 10-shot 10-way 20-shot
3D GAN 55.8±3.4 65.8±3.1 40.3±2.1 48.4±1.8
FoldingNet 33.4±4.1 35.8±5.8 18.6±1.8 15.4±2.2
L-GAN 41.6±5.3 46.2±6.2 32.9±2.9 25.5±3.2
3D-Caps 42.3±5.5 53.0±5.9 38.0±4.5 27.2±4.7
PointNet 52.0±3.8 57.8±4.9 46.6±4.3 35.2±4.8
PointNet++ 38.5±4.4 42.4±4.5 23.1±2.2 18.8±1.7
PointCNN 65.4±2.8 68.6±2.2 46.6±1.5 50.0±2.3
RSCNN 65.4±8.9 68.6±7.0 46.6±4.8 50.0±7.2
DGCNN 31.6±2.8 40.8±4.6 19.9±2.1 16.9±1.5
Baseline 87.8±5.2 93.3±4.3 84.6±5.5 89.4±6.3
OcCo 94.0±3.6 95.9±2.3 89.4±5.1 92.4±4.6
Point-BERT 94.6±3.1 96.3±2.7 91.0±5.4 92.7±5.1
Ours 97.1±1.8 98.3±1.2 92.4±4.3 94.9±3.7
Table 3: Few-shot classification results on ModelNet40 Wu et al. (2015). We list the average accuracy (%) and the standard deviation over 10 independent experiments.

Few-shot learning

Few-shot learning aims to tackle new tasks containing limited labeled training examples using prior knowledge. Here we conduct experiments on ModelNet40 Wu et al. (2015) to evaluate our method. All our experiments follow the K-way, m-shot setting Sharma and Kaul (2020). Specifically, we randomly select K classes and sample m+20 samples for each class. The model is trained on K*m samples (the support set) and evaluated on the remaining K*20 samples (the query set). We employ a deeper classification head with three layers and adopt fine-tuning settings identical to the former classification experiments. We compare our method with other competitors under the "5-way 10-shot", "5-way 20-shot", "10-way 10-shot" and "10-way 20-shot" settings and report the mean and standard deviation over 10 runs. Our competitors can be roughly divided into three categories: (1) unsupervised methods; (2) supervised methods; (3) self-supervised pre-training methods. For the unsupervised methods Yang et al. (2018); Wu et al. (2016); Achlioptas et al. (2018); Zhao et al. (2019), we train a linear SVM on their extracted unsupervised representations. For the supervised methods Qi et al. (2017a, b); Wang et al. (2019); Li et al. (2018); Liu et al. (2019b), including our baseline, we train the models from scratch. For the self-supervised pre-training methods Yu et al. (2021b); Wang et al. (2021), including our Point-McBert, we fine-tune the models with the pre-trained weights as initialization.
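The K-way, m-shot episode construction above can be sketched as follows (pure Python; the dataset layout and function name are our own illustrative choices):

```python
import random

def sample_episode(dataset, k_way=5, m_shot=10, n_query=20, seed=0):
    """Build one K-way m-shot episode: pick K classes, then m support and
    20 query samples per class. `dataset` maps class label -> sample ids."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), k_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(dataset[c], m_shot + n_query)
        support += [(s, c) for s in picked[:m_shot]]   # K*m training samples
        query += [(s, c) for s in picked[m_shot:]]     # K*20 evaluation samples
    return support, query

# toy corpus: 40 classes with 40 samples each (stand-in for ModelNet40)
toy = {c: list(range(40)) for c in range(40)}
support, query = sample_episode(toy)
```

Repeating this sampling with different seeds yields the 10 independent runs over which the mean and standard deviation are reported.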

The results are shown in Table 3. We can see that when the labeled training data is insufficient, our method can still perform well. Our method has the highest average accuracy under four different settings. It’s noticeable that compared with other methods, our method commonly has a smaller standard deviation, which indicates our method is more stable. Our method also significantly improves the baseline by 9.3%, 5.2%, 7.8%, and 5.5%, which demonstrates the strong generalization ability of our method.

Category PointNet PointNet++ DGCNN Baseline OcCo Point-Bert Ours
aero 83.4 82.4 84 82.9 83.3 84.3 84.8
bag 78.7 79 83.4 85.4 85.2 84.8 85.1
cap 82.5 87.7 86.7 87.7 88.3 88 88.4
car 74.9 77.3 77.8 78.8 79.9 79.8 80.6
chair 89.6 90.8 90.6 90.5 90.7 91 91.5
earphone 73 71.8 74.7 80.8 74.1 81.7 80.5
guitar 91.5 91 91.2 91.1 91.9 91.6 91.8
knife 85.9 85.9 87.5 87.7 87.6 87.9 87.7
lamp 80.8 83.7 82.8 85.3 84.7 85.2 85.5
laptop 95.3 95.3 95.7 95.6 95.4 95.6 96.3
motor 65.2 71.6 66.3 73.9 75.5 75.6 76.3
mug 93 94.1 94.9 94.9 94.4 94.7 94.6
pistol 81.2 81.3 81.1 83.5 84.1 84.3 83.7
rocket 57.9 58.7 63.5 61.2 63.1 63.4 62.7
skateboard 72.8 76.4 74.5 74.9 75.7 76.3 78.4
table 80.6 82.6 82.6 80.6 80.8 81.5 82.2
80.4 81.9 82.3 83.4 83.4 84.1 84.4
83.7 85.1 85.2 85.1 85.1 85.6 86.1
Table 4: Part segmentation results on the ShapeNetPart Yi et al. (2016). We report the mean across all part categories (%) and the mean across all instances (%), as well as the (%) for each categories.
Figure 4: Qualitative results for part segmentation. We visualize the part segmentation results across all 16 object categories.

Point cloud part segmentation

Point cloud part segmentation is a challenging task aimed at predicting point-wise labels for a point cloud. Here, we evaluate our method on the widely used ShapeNetPart Yi et al. (2016) dataset.

Dataset: ShapeNetPart contains 16881 CAD models from 16 object categories, annotated with 50 parts. We follow the setting in Qi et al. (2017a) and randomly sample 2048 points from each CAD model for training and testing. We also double the number of local patches to 128 during pre-training for part segmentation.
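The fixed-size point sampling step can be sketched as follows (a minimal NumPy sketch; the function name is illustrative):

```python
import numpy as np

def sample_points(vertices, n=2048, rng=None):
    """Randomly sample a fixed number of points from a model's point set.
    If the model has fewer than n points, sample with replacement."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(vertices), size=n, replace=len(vertices) < n)
    return vertices[idx]
```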

Fine-tuning: We follow Point-Bert Yu et al. (2021b) in adopting an upsampling-based segmentation head for fine-tuning. We set the batch size to 16; other settings are the same as for the classification tasks.

Competitors: We compare our method with some widely used methods Qi et al. (2017a, b); Wang et al. (2019) and recently published self-supervised pre-training methods Yu et al. (2021b); Wang et al. (2021). We also adopt a standard transformer trained from scratch as our baseline.

Results: We evaluate all the methods on two types of mean IoU (mIoU): the mean across all part categories (cat. mIoU) and the mean across all instances (inst. mIoU). As shown in Table 4, our method achieves the best performance on both metrics. We also list the IoU for each category; our method outperforms the other methods on most categories. Our method also boosts the baseline's performance, while OcCo fails to do so. Moreover, we visualize the segmentation ground truths and our predictions in Figure 4. Although the cases are challenging, our predictions are quite close to the ground truth.
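The two metrics can be computed as follows (a minimal NumPy sketch; the helper names and the convention of scoring an absent part as IoU 1 are illustrative assumptions, not necessarily the authors' exact implementation):

```python
import numpy as np

def shape_iou(pred, gt, part_ids):
    """Mean IoU over the parts of one shape. part_ids lists the part
    labels belonging to the shape's object category."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        # A part absent from both prediction and ground truth counts as 1
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

def miou(shape_ious, shape_cats):
    """Return (category mIoU, instance mIoU) from per-shape IoUs.
    Instance mIoU averages over shapes; category mIoU first averages
    within each object category, then over categories."""
    inst = float(np.mean(shape_ious))
    cats = sorted(set(shape_cats))
    cat = float(np.mean([
        np.mean([iou for iou, c in zip(shape_ious, shape_cats) if c == k])
        for k in cats
    ]))
    return cat, inst
```

The averaging order explains why the two columns in Table 4 differ: small categories (e.g. rocket, earphone) weigh as much as large ones in cat. mIoU but not in inst. mIoU.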

Ablation study

In this section, we conduct extensive experiments on ModelNet40 Wu et al. (2015) to study the effect of hyperparameters.

The temperature coefficient

The temperature coefficient controls the smoothness of the supervision signals. When the temperature is small, we obtain a sharp probability distribution over token ids; conversely, when it is large, the distribution tends toward uniform. To study its effect, we conduct an ablation study, with results shown in Table 5. We also add a single-choice version, i.e., the same as in Point-Bert, as a competitor; this corresponds to the limit of zero temperature. From the results, our multi-choice strategy performs best when the temperature coefficient is set to 0.005, improving on the previous Bert-style pre-training.
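The multi-choice supervision amounts to softening the tokenizer's scores over the codebook with a temperature. A minimal sketch (assuming per-patch logits over the codebook are available; the function name is illustrative):

```python
import numpy as np

def token_distribution(logits, tau=0.005):
    """Soften tokenizer logits over the codebook into a probability
    distribution via a temperature-scaled softmax. Small tau gives a
    near one-hot (single-choice) target; large tau tends to uniform."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)
```

As tau approaches 0 the distribution collapses onto the argmax token id, recovering the single-choice supervision of Point-Bert.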

Temperature   single-choice  0.005  0.05  0.5   5
Acc (%)       93.8           94.1   93.6  93.5  93.6
Table 5: Ablation study on the temperature coefficient. "Single-choice" denotes Point-Bert-style supervision (the zero-temperature limit). The ablation is conducted on the point cloud classification downstream task on ModelNet40 Wu et al. (2015).

The weight coefficient

The weight coefficient balances the low-level semantics from the tokenizer against the high-level semantics from the transformer encoder. We conduct an ablation study on it, with results shown in Table 6. In this ablation, we compare our Point-McBert under different weight settings, as well as against a baseline, i.e., Point-Bert, and a version without the warm-up strategy. In the version without warm-up, the target weight is used from the very beginning of pre-training rather than being reached by progressively decreasing the weight after several epochs. As observed from the results, the version without warm-up collapses during pre-training, because the transformer is not yet well-trained enough to provide accurate semantics. Moreover, pre-training tends to perform better as the weight coefficient grows. But when the weight is set to 1, the supervision signals incorporate no high-level semantics, and pre-training matches the baseline. Both the warm-up strategy and the high-level semantics generated by the transformer therefore play important roles.
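The blending of the two signals and the warm-up on the weight can be sketched as follows (the linear decay schedule and its constants are purely illustrative assumptions; the paper does not specify the exact schedule):

```python
import numpy as np

def refined_targets(p_tokenizer, p_transformer, w):
    """Blend the low-level tokenizer distribution with the high-level
    transformer prediction; w = 1 keeps only the tokenizer signal,
    which corresponds to the baseline."""
    return w * p_tokenizer + (1.0 - w) * p_transformer

def weight_schedule(epoch, warmup_epochs=10, w_final=0.8):
    """Warm-up: keep w = 1 for the first epochs (the transformer is not
    yet reliable), then decrease it toward w_final. The decay rate and
    epoch counts here are illustrative."""
    if epoch < warmup_epochs:
        return 1.0
    return max(w_final, 1.0 - 0.05 * (epoch - warmup_epochs))
```

Since both inputs are probability distributions and the weights sum to 1, the blended target remains a valid distribution.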

Weight coefficient  0     0.2   0.4   0.6   0.8   1.0
Acc (%)             93.5  93.4  93.6  93.6  94.1  93.8

                    w/o warm up  Point-Bert
Acc (%)             collapse     93.8
Table 6: Ablation study on the weight coefficient. The ablation is conducted on the point cloud classification downstream task on ModelNet40 Wu et al. (2015).

Masking strategy and masking ratio

Bert-style pre-training follows a mask-and-then-predict paradigm, in which the masking strategy and masking ratio determine the difficulty of the prediction task. Here, we test two masking strategies: random masking and block-wise masking. The former randomly selects a proportion of patches to mask; the latter selects a proportion of spatially adjacent patches to mask. We test both strategies under different masking ratios; the results are shown in Table 7. Block-wise masking with a 25%-45% masking ratio performs best. Block-wise masking generally outperforms random masking, which is consistent with the findings of other similar works Yu et al. (2021b); Wang et al. (2022).
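The two masking strategies can be sketched as follows (a minimal NumPy sketch; in practice the ratio is drawn from the listed range for each input, which we omit here for brevity):

```python
import numpy as np

def random_mask(n_patches, ratio, rng=None):
    """Randomly mask a proportion of patches, independent of position."""
    rng = np.random.default_rng(rng)
    n_mask = int(n_patches * ratio)
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, n_mask, replace=False)] = True
    return mask

def block_mask(centers, ratio, rng=None):
    """Block-wise masking: pick a random seed patch and mask its nearest
    neighbours, so the masked region is spatially contiguous.
    centers: (N, 3) array of patch center coordinates."""
    rng = np.random.default_rng(rng)
    n = len(centers)
    n_mask = int(n * ratio)
    seed = rng.integers(n)
    d = np.linalg.norm(centers - centers[seed], axis=1)
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(d)[:n_mask]] = True
    return mask
```

Block-wise masking removes a whole local region, so the model cannot trivially interpolate a masked patch from its immediate neighbours, which plausibly explains its edge over random masking.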

Masking strategy  Masking ratio (%)  Acc (%)
Block             15-25              93.8
Block             25-45              94.1
Block             55-75              93.7
Random            15-25              93.6
Random            25-45              93.5
Random            55-75              93.9
Table 7: Ablation study on the masking strategy and masking ratio. The ablation is conducted on the point cloud classification downstream task on ModelNet40 Wu et al. (2015).


To further understand the effectiveness of our method, we visualize the learned features via t-SNE Van der Maaten and Hinton (2008). Figure 5(a) shows the learned features before fine-tuning. The features are well separated even though they were learned without annotations, making them well suited for model initialization. Figure 5(b) and (c) visualize features fine-tuned on ModelNet40 Wu et al. (2015) and ScanObjectNN Uy et al. (2019). The features form multiple clusters that are far apart from each other, indicating the effectiveness of our method.

Figure 5: Visualization of feature distributions. We utilize t-SNE Van der Maaten and Hinton (2008) to visualize the features learned by our Point-McBert. Features from different categories are visualized in different colors. (a) Features after pre-training; (b) Features fine-tuned on ModelNet40; (c) Features fine-tuned on ScanObjectNN.

Computational overhead

To demonstrate that our method improves performance while introducing almost no extra computational overhead, we compare its computational cost against that of Point-Bert Yu et al. (2021b) on the same device. Pre-training is run on an Intel Xeon Platinum 8260 CPU and two RTX 3090 GPUs. We use the wall-clock time of each pre-training epoch as the evaluation metric. As shown in Table 8, our method incurs only about 1% extra time per epoch, which is almost negligible.

Point-Bert Ours
Running time (s/epoch) 143.48 144.94
Table 8: Comparisons of the time overhead of pre-training.


In this paper, we propose Point-McBert, a Bert-style pre-training method for point clouds that tackles the problem of the imperfect tokenizer in previous work. We relax the previous strict single-choice constraint on patches and use the probability distribution over token ids as supervision signals, preventing semantically-similar patches from being assigned different token ids. In addition, we use the high-level semantics generated by the transformer to refine this distribution, further preventing semantically-dissimilar patches from being assigned the same token id. Extensive experiments on different datasets and downstream tasks evaluate the performance of our Point-McBert. The results show that Point-McBert not only improves the performance of previous work on all downstream tasks with almost no extra computational overhead, but also achieves new state-of-the-art results on point cloud classification and point cloud few-shot learning. The experiments also show that our pre-training method successfully transfers knowledge learned from unlabeled data to downstream tasks, which holds great potential for point cloud learning.