Point clouds are an intuitive, flexible, and memory-efficient 3D data representation and have become indispensable in 3D vision. Learning powerful point cloud representations is crucial for enabling machines to understand the 3D world, which in turn promotes many important real-world applications, such as autonomous driving ref28, augmented reality ref30, and robotics ref29. With the rapid development of deep learning in recent years ref43; ref44, supervised 3D point cloud analysis methods have made great progress ref19; ref20; ref21; a35. However, both the exponentially increasing demand for data and the expense of 3D data annotation hinder further performance improvement of supervised methods. In contrast, thanks to the widespread availability of 3D sensors (LiDAR, ToF cameras, RGB-D sensors, and stereo camera pairs), large amounts of unlabeled point cloud data are available for self-supervised point cloud representation learning.
Unsupervised or self-supervised learning methods have shown their effectiveness in different fields ref22; ref23; ref24; ref26; ref27. Recent work method3; ref23; method2; ref26; method11 has achieved good performance by combining point clouds with self-supervised learning techniques, such as generative adversarial networks (GANs) ref23, variational autoencoders (VAEs) ref22, and Gaussian mixture models (GMMs) ref24. These methods usually rely on tasks such as distribution estimation or reconstruction to provide supervisory signals and can learn good local detail features, but they struggle to capture higher-level semantic features. To learn such features, some methods construct a series of transformation prediction tasks, such as orientation estimation a10; a11; a12. Inspired by unsupervised learning on 2D images ref12; ref13; ref17, other methods learn point cloud representations by constructing a series of contrastive views a14; a15; a16; a17 and applying state-of-the-art contrastive learning techniques. However, these methods rely on network architectures with specific inductive biases, such as PointNet++ and DGCNN, to achieve good performance. In addition, previous methods have not studied the performance of standard Transformers in point cloud analysis tasks.
Recently, the Transformer has achieved impressive results in language and image tasks by learning from extensive unlabeled data and is becoming increasingly popular. Inspired by NLP, Point-BERT devised a mask patch modeling (MPM) task to pre-train point cloud Transformers. To generate meaningful representations of the masked patches to guide the learning of point cloud Transformers, Point-BERT additionally trains a discrete Variational AutoEncoder (dVAE) based on DGCNN as a tokenizer, as shown in Fig.1 (a). As a result, Point-BERT is a two-stage approach in which the weights of the tokenizer are frozen, and the tokenizer's feature extraction capability directly affects the learning of the point cloud Transformer. Unlike Point-BERT, we extract meaningful representations of masked patches by replacing the frozen tokenizer with a momentum encoder that is dynamically updated, as shown in Fig.1 (b). Our approach is therefore one-stage, and the representations of the masked patches improve as training progresses. In this article, we propose a one-stage BERT-style point cloud pre-training method named POS-BERT. Inspired by BERT and MoCo, we use the MPM task to pre-train on point clouds and choose a standard Transformer without specific inductive biases as the backbone. Specifically, we first divide the point cloud into a series of patches, then randomly mask out some patches and feed them into an Encoder based on the standard Transformer. We use a dynamically updated Momentum Encoder as the tokenizer. The Momentum Encoder has the same network structure as the Encoder, but no gradients flow through it; its weights are jointly optimized with the MPM task through momentum updates during the pre-training stage, which greatly simplifies pre-training. Next, the point cloud patches before masking are fed to the Momentum Encoder. The objective of MPM is to make the Encoder's outputs at the masked patch positions match the corresponding Momentum Encoder outputs as closely as possible. However, recovering the masked patch information alone limits the ability of the point cloud Transformer's class token to extract high-level semantic information. To address this problem, we perform contrastive learning to maximize class token consistency between differently augmented (for example, cropped) point cloud pairs. The main contributions are summarized as follows:
We propose a Point Cloud One-Stage BERT pre-training method named POS-BERT. We use a momentum encoder to provide continuous and dynamic supervision signals for the masked patches in the mask patch modeling pretext task. The Momentum Encoder is updated dynamically during the pre-training stage and requires no extra pre-training.
We introduce a contrastive learning strategy on the Transformer's class token between differently augmented point cloud pairs, which helps the class token obtain a better high-level semantic representation.
Experiments demonstrate that POS-BERT achieves state-of-the-art performance on the linear SVM classification task and on downstream tasks such as classification and segmentation.
2 Related work
Point Cloud Self-Supervised Learning The goal of self-supervised learning is to learn good feature representations from unlabeled raw data so that they adapt well to various downstream tasks a1. Self-supervised learning has been extensively studied for point cloud representation learning, with most work focusing on constructing a pretext task that helps the network learn better 3D point cloud representations. A commonly adopted pretext task is to reconstruct the input point cloud from the latent encoding space, which can be implemented through Variational AutoEncoders a2; a3; a4; a5; a6; a13, Generative Adversarial Networks (GANs) a7; a8, Gaussian Mixture Models ref24; a9, etc. However, these methods are computationally expensive and rely excessively on reconstructing local details, making it difficult to learn high-level semantic features. Hence, some researchers employ transformation prediction as a pretext task. Sauder et al. a10 proposed using jigsaw puzzles as a pretext task for 3D point cloud representation learning. Wang et al. a11 corrupted the point cloud and then pre-trained the network in a self-supervised manner via a point cloud completion task. Poursaeed et al. a12 used orientation estimation as a pretext task, randomly rotating the point cloud and asking the network to predict the rotation. As contrastive learning becomes increasingly popular, Jing et al. and Afham et al. a14; a15 proposed training networks to find cross-modality correspondences. Specifically, they obtain 2D views by rendering the 3D model, extract 2D view features and 3D point cloud features using 2D convolutional networks and graph convolutional networks, and finally estimate the instance correspondence between the two modalities from these features. Qi et al. a19 rigidly transform point clouds and compute a contrastive loss on matched point pairs using the per-point feature vectors of the two point clouds before and after the transformation. Wang et al. a16 designed a multi-resolution contrastive learning strategy that trains point-wise and shape-level feature vectors simultaneously. Inspired by BYOL a18, Huang et al. a17 constructed point cloud pairs undergoing spatio-temporal transformations and forced the network to learn the consistency between the different augmented views. However, all previous studies resort to point-cloud-specific network architectures to achieve promising performance, which greatly hinders the development of deep learning towards a generalized model. More importantly, these studies have not investigated self-supervised representation learning with a Transformer-based point cloud processing network. Recently, Point-BERT a20
proposed, for the first time, combining a standard Transformer network with mask language modeling to achieve self-supervised representation learning of point clouds; it is a direct extension of BERT a27 (popular in the field of NLP) to point clouds. However, the point cloud domain lacks a mature BPE a26 algorithm like that in NLP, and therefore lacks an effective vocabulary to guide the learning of mask language modeling. For this reason, Point-BERT a20 pre-trains a discrete Variational AutoEncoder (dVAE) a21 as a tokenizer, built on the additional point cloud network DGCNN, to construct a vocabulary for point cloud patches. This directly brings about two problems: first, the whole method becomes a complex two-stage solution; second, the weights of the pre-trained tokenizer are frozen and cannot adapt as network training progresses, so the performance of the fixed tokenizer directly caps the performance of the pre-trained model. Unlike Point-BERT, we use a dynamically updated momentum encoder instead of a frozen tokenizer to extract features from point cloud patches. Additionally, our solution is one-stage, and the Momentum Encoder is continuously updated as training progresses, providing the network with patch feature representations suited to the current training stage.
Transformer The Transformer has made great advances in machine translation and natural language processing thanks to the long-range modeling capability of its attention mechanism. Inspired by its success in NLP, the Transformer has also been introduced to images a29; a30; a34, leading to backbone networks such as ViT a29, Swin a30, and Container ref44, which surpass CNN-based ResNets and show excellent performance in downstream tasks such as classification a29, segmentation a32, and object detection a33. Although Transformers are trending towards a grand unification of NLP and vision, their development in the point cloud field has been relatively slow. PCT ref35 and Point Transformer a31 modify the Transformer layers of the standard Transformer and combine them with layer aggregation operations to achieve point cloud classification and segmentation. Unlike these approaches, Point-BERT a20 achieves comparable performance with a standard Transformer without introducing biased structures, but it requires the specific point cloud network DGCNN to provide supervision signals for pre-training. By comparison, our proposed method avoids introducing any other network and uses only the standard-Transformer-based network to learn point cloud representations.
Mask Language Modeling Paradigm Mask language modeling was proposed in BERT a22, which revolutionized the pre-training paradigm for natural language. Inspired by BERT, Bao et al. proposed BEiT a23 for pre-training a standard transformer applicable to images. It maps the input image patches into meaningful discrete tokens with a dVAE a21, randomly masks some of the image patches, and feeds the masked image patches together with the remaining ones into the standard transformer to reconstruct the tokens of the masked patches. Following BEiT, Zhou et al. a35 performed masked prediction with an online tokenizer. Unlike BEiT, He et al. a24 trained the network by directly reconstructing the original image patches. Inspired by BEiT, Yu et al. a20 proposed Point-BERT for point cloud pre-training and demonstrated that the masked modeling paradigm is feasible for point clouds. We inherit the idea of Yu et al. and also adopt this approach for point cloud pre-training.
Contrastive Learning Contrastive learning is a branch of self-supervised learning that learns knowledge from the data itself without requiring annotations. Its main idea is to maximize the consistency between positive sample pairs and the difference between negative sample pairs. Representative methods include the MoCo series ref11; ref12; ref13 and SimCLR ref14. Recently, BYOL ref17 and Barlow Twins ref18 showed that powerful features can still be obtained using positive samples alone. In this paper, we introduce the idea of contrastive learning to help the point cloud Transformer learn high-level semantic representations.
We propose a Point Cloud One-Stage BERT pre-training approach, POS-BERT, which is simple and efficient. Fig.2 illustrates the overall framework of POS-BERT. First, a global point cloud set and a local point cloud set are obtained by cropping the raw point clouds with different cropping ratios. Then, the PGE module divides both global and local point clouds into smaller patches with a fixed number of points and embeds the patches into high-dimensional representations (patch tokens) through a standard-Transformer-based encoder. Because local point clouds do not represent complete objects well, only global point clouds are input into the Momentum Encoder, which is dynamically updated to encode meaningful representations that provide learning targets for the Encoder. The Encoder is trained with the mask patch modeling task to match the Momentum Encoder's outputs. Some patches of the global point clouds are randomly masked out, position information is added to the corresponding masked patches, and they are then input into the Encoder together with the local point cloud set. Finally, we calculate the mask patch modeling loss between the Encoder outputs' patch tokens and the Momentum Encoder outputs' patch tokens, and the global feature contrastive loss
between the Encoder outputs' class token and the Momentum Encoder outputs' class token. Overall, our framework consists of four key components: the Encoder, the Momentum Encoder, Mask Patch Modeling, and the Loss Function; they are introduced in detail in the rest of this section. We start in Section 3.1 with how points are transformed into patch embeddings by the Encoder. Mask patch modeling is described in Section 3.2. We then introduce the dynamic tokenizer, implemented by the Momentum Encoder to provide supervision for the MPM task, in Section 3.3. Finally, we describe our loss function in Section 3.4.
3.1 Point2Patch Embedding and Encoder Architecture
The simplest way to extract point cloud features is to feed each point into the Transformer as one token. However, because the complexity of the Transformer is $O(n^2)$, where $n$ is the length of the input token sequence, extracting a feature for each point directly would cause a memory explosion. Fig.3 describes the overall pipeline of the Transformer-based feature extraction in this paper. Following Point-BERT, we divide a given global/local point cloud into local patches with a fixed number of points. To minimize overlap between patches, we first calculate the number of patches and then use the farthest point sampling (FPS) algorithm to sample the center point of each patch. The k-nearest neighbor algorithm is used to obtain the neighbors of each center point, and a center point together with its neighbors forms a local patch. Next, PointNet and max-pooling operations map the point coordinates of each patch to a high-dimensional embedding, the patch token. Finally, these patch tokens are fed into the standard Transformer together with a learnable class token.
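As a sketch of this point-to-patch step, the following NumPy code divides a cloud into patches via FPS centers and k-nearest neighbors. The group count (64) and patch size (32) follow the pre-training configuration described later; the function names and the per-patch center normalization are illustrative assumptions.

```python
import numpy as np

def farthest_point_sample(points, g, seed=0):
    """Pick g well-spread center indices from an (N, 3) cloud via FPS."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    centers = [int(rng.integers(n))]
    dist = np.full(n, np.inf)
    for _ in range(g - 1):
        d = np.linalg.norm(points - points[centers[-1]], axis=1)
        dist = np.minimum(dist, d)               # distance to nearest chosen center
        centers.append(int(dist.argmax()))       # next center: farthest point so far
    return np.array(centers)

def patchify(points, g=64, k=32):
    """Divide the cloud into g patches of k points (FPS centers + kNN)."""
    centers = farthest_point_sample(points, g)
    d = np.linalg.norm(points[centers][:, None, :] - points[None, :, :], axis=-1)  # (g, N)
    nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbors of each center
    patches = points[nn] - points[centers][:, None, :]  # center-normalized, (g, k, 3)
    return patches, centers

cloud = np.random.default_rng(0).standard_normal((2048, 3))
patches, centers = patchify(cloud)
print(patches.shape)  # (64, 32, 3)
```

Each patch is then mapped to one token by the PointNet-style embedding, so the Transformer sees 64 tokens instead of 2048.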
We use a standard Transformer as the Encoder backbone, which consists of a stack of multi-head self-attention layers and fully connected feed-forward networks. As mentioned earlier, the class token and the series of patch tokens are concatenated along the token dimension to form the Transformer's input. After the input passes through the h Transformer blocks, we obtain a feature for each patch with a global receptive field. Finally, we map the features of each patch to the loss space with a projector composed of a multi-layer perceptron (MLP). In the inference stage and in downstream tasks, the projector is not needed. Decoupling the feature representation from the loss function makes the learned patch features more general.
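A minimal single-head attention block illustrates how the class token is concatenated with the patch tokens before entering the Transformer. This toy NumPy version omits multi-head attention, LayerNorm, and the feed-forward sub-layer for brevity; all weights and dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention_block(tokens, Wq, Wk, Wv):
    """Single-head self-attention with a residual connection."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # every token attends to all others
    return tokens + attn @ v

rng = np.random.default_rng(0)
dim = 16
patch_tokens = rng.standard_normal((64, dim))   # embeddings of the 64 patches
cls_token = rng.standard_normal((1, dim))       # stands in for the learnable class token
x = np.concatenate([cls_token, patch_tokens], axis=0)   # (65, dim) input sequence
Wq, Wk, Wv = (0.1 * rng.standard_normal((dim, dim)) for _ in range(3))
out = attention_block(x, Wq, Wk, Wv)
print(out.shape)  # (65, 16)
```

After stacking h such blocks, `out[0]` (the class token) aggregates information from every patch, which is why it serves as the global shape descriptor.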
3.2 Mask Patch Modeling
Inspired by Point-BERT, we also use a mask patch modeling task to pre-train the point cloud Transformer. As described in Section 3.1, we have obtained the Transformer's input tokens. We randomly mask or replace [20%, 40%] of the patch tokens (the class token is never masked) with a shared learnable mask token. The position embedding of the corresponding patch center point, computed from the coordinates of that center point, is then added to each masked token. Finally, the resulting token sequence is fed into the Encoder, which must recover the lost information of the masked tokens.
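The masking step can be sketched as follows. The [20%, 40%] ratio range follows the text; the function signature and the use of a single shared mask vector are illustrative assumptions.

```python
import numpy as np

def mask_patch_tokens(tokens, mask_token, ratio_range=(0.2, 0.4), seed=0):
    """Replace a random 20-40% of the patch tokens with the shared mask token."""
    rng = np.random.default_rng(seed)
    g = tokens.shape[0]
    n_mask = int(round(g * rng.uniform(*ratio_range)))
    idx = rng.choice(g, size=n_mask, replace=False)
    mask = np.zeros(g, dtype=bool)
    mask[idx] = True
    masked = tokens.copy()
    masked[mask] = mask_token          # these positions must be recovered by the Encoder
    return masked, mask

tokens = np.zeros((64, 8))             # stand-in patch tokens (class token excluded)
mask_token = np.ones(8)                # stand-in learnable mask token
masked, mask = mask_patch_tokens(tokens, mask_token)
print(int(mask.sum()))
```

The boolean `mask` is kept so that, later, the loss can be evaluated only at the masked positions.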
3.3 Dynamic Tokenizer by Momentum Encoder
The momentum encoder is often used in contrastive learning to provide global semantic supervision for the target network. Inspired by MoCo, we propose a dynamically updated tokenizer, implemented as a Momentum Encoder. Grill et al.'s preliminary experiments show that even when the output of a randomly initialized network is used as supervision, the target network can learn a better output representation than the randomly initialized network itself ref17. This result strongly supports replacing the dVAE with a dynamically updated momentum encoder during early training. Therefore, we initialize the Momentum Encoder randomly. Although a randomly initialized network can help the Encoder obtain better representations in the early stages of training, if the quality of the tokenizer does not keep improving, the Encoder's ability will plateau when the tokenizer's does. Accordingly, we need a tokenizer that dynamically updates and improves itself while its output does not change abruptly between updates. The momentum encoder from contrastive learning addresses both concerns well, and its update formula is as follows:
$\theta_{m} \leftarrow \tau\,\theta_{m} + (1-\tau)\,\theta_{e}$, where $\theta_{m}$ represents the weights of the Momentum Encoder and $\theta_{e}$ represents the weights of the Encoder. $\tau$ is a momentum coefficient, which follows a cosine schedule from 0.996 to 1 during training.
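A minimal sketch of the momentum update and its cosine schedule, with the network parameters represented as a flat list of arrays for simplicity; the schedule endpoints (0.996 to 1) follow the text.

```python
import math
import numpy as np

def momentum_schedule(step, total_steps, tau_base=0.996, tau_final=1.0):
    """Cosine schedule of the momentum coefficient from tau_base up to tau_final."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return tau_final - (tau_final - tau_base) * cos

def ema_update(theta_m, theta_e, tau):
    """theta_m <- tau * theta_m + (1 - tau) * theta_e, applied parameter-wise."""
    return [tau * m + (1.0 - tau) * e for m, e in zip(theta_m, theta_e)]

tau = momentum_schedule(0, 200)            # start of training: tau = 0.996
updated = ema_update([np.ones(4)], [np.zeros(4)], tau)
print(tau, updated[0][0])
```

Because `tau` approaches 1 late in training, the tokenizer changes ever more slowly, so its targets stay stable while still absorbing what the Encoder has learned.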
The Momentum Encoder enhances itself by constantly absorbing new knowledge learned by the Encoder, so it also acquires the ability to recover lost information. Moreover, it dynamically integrates the Encoder weights from multiple training stages and therefore has better feature extraction ability than the Encoder. For this reason, our final pre-trained model weights come from the Momentum Encoder.
3.4 Loss Function
We want the pre-trained model both to recover lost information and to learn high-level semantic representations. Therefore, our loss function consists of two parts: the mask patch modeling loss $\mathcal{L}_{mpm}$ and the global feature contrastive loss $\mathcal{L}_{gfc}$.
For the mask patch modeling loss $\mathcal{L}_{mpm}$, we encourage the Encoder to recover the information lost by patch masking under the supervision of the meaningful representations generated by the Momentum Encoder. The mask patch modeling loss is formulated as follows:
$\mathcal{L}_{mpm} = -\frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}} \frac{z_i^{m}\cdot z_i^{e}}{\lVert z_i^{m}\rVert\,\lVert z_i^{e}\rVert}$, where $\mathcal{M}$ is the set of masked patch indices, $z_i^{m}$ represents the output of the Momentum Encoder corresponding to the $i$-th patch, and $z_i^{e}$ represents the output of the Encoder corresponding to the $i$-th patch.
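Under a negative-cosine reading of this objective (an assumption in the spirit of BYOL; the original paper may use a different distance), the loss over the masked positions can be sketched as:

```python
import numpy as np

def mpm_loss(z_momentum, z_encoder, mask):
    """Average negative cosine similarity over the masked patch positions only."""
    t = z_momentum[mask]
    s = z_encoder[mask]
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)   # L2-normalize teacher features
    s = s / np.linalg.norm(s, axis=-1, keepdims=True)   # L2-normalize student features
    return -float(np.mean(np.sum(t * s, axis=-1)))

rng = np.random.default_rng(0)
feats = rng.standard_normal((64, 16))
mask = np.zeros(64, dtype=bool)
mask[:20] = True
print(mpm_loss(feats, feats, mask))  # perfectly matched features give about -1
```

Note that only the masked positions contribute; the Encoder is free at unmasked positions, which is what forces it to infer the missing geometry from context.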
Although Point-BERT also used the idea of contrastive learning to obtain high-level semantic features, its results were not ideal, as can be observed from Tab.1. In addition, it needs to maintain a memory bank storing a large number of negative samples, which occupies considerable storage space. In contrast, we use different cropping ratios to obtain differently augmented point clouds: a global point cloud set and a local point cloud set, generated with the following formula:
$G = \{\mathrm{Crop}(P, \mathrm{Rand}(r^{g}_{min}, r^{g}_{max}))\}_{i=1}^{N_g}$, $L = \{\mathrm{Crop}(P, \mathrm{Rand}(r^{l}_{min}, r^{l}_{max}))\}_{i=1}^{N_l}$, where $\mathrm{Crop}(P, r)$ crops an area of the point cloud $P$ at the fixed ratio given by its second parameter, and $\mathrm{Rand}(a, b)$ generates a random value between the minimum $a$ and the maximum $b$. Here, $r^{g}_{min}$ and $r^{g}_{max}$ are the minimum and maximum cropping ratios for generating the global point cloud set, and $r^{l}_{min}$ and $r^{l}_{max}$ are the minimum and maximum cropping ratios for generating the local point cloud set. $N_g$ and $N_l$ are the numbers of point clouds in $G$ and $L$, respectively. During the training phase, the Encoder encodes the masked global point clouds and the local point clouds, while the Momentum Encoder only encodes the global point clouds.
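A sketch of this view-generation step: the crop ratios and view counts follow the pre-training configuration given later, while the concrete cropping strategy (keeping the points nearest a random anchor) is an illustrative assumption, since the text only specifies the ratio.

```python
import numpy as np

def crop(points, ratio, rng):
    """Keep the `ratio` fraction of points nearest to a random anchor point."""
    anchor = points[rng.integers(len(points))]
    d = np.linalg.norm(points - anchor, axis=1)
    keep = np.argsort(d)[: int(len(points) * ratio)]
    return points[keep]

def make_views(points, n_global=2, n_local=8,
               g_range=(0.7, 1.0), l_range=(0.2, 0.5), seed=0):
    """Build the global and local view sets used by the two encoders."""
    rng = np.random.default_rng(seed)
    g_views = [crop(points, rng.uniform(*g_range), rng) for _ in range(n_global)]
    l_views = [crop(points, rng.uniform(*l_range), rng) for _ in range(n_local)]
    return g_views, l_views

cloud = np.random.default_rng(1).standard_normal((2048, 3))
g_views, l_views = make_views(cloud)
print(len(g_views), len(l_views))  # 2 8
```

The asymmetry matters: the Momentum Encoder only ever sees the large (global) crops, so its class token provides a stable full-object target for the small local views.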
Finally, we combine the above loss functions into our final self-supervised objective:
$\mathcal{L} = \lambda_1 \mathcal{L}_{mpm} + \lambda_2 \mathcal{L}_{gfc}$, where the hyperparameters $\lambda_1$ and $\lambda_2$ control the balance between the loss functions; we use the same fixed values for all the experiments in this paper.
4 Implementation and Dataset
Pre-training We use the AdamW optimizer ref39
to train the network with an initial learning rate of 0.0001. The learning rate increases linearly for the first 10 epochs and then decays with a cosine schedule. We pre-train with a batch size of 64 for 200 epochs, and the whole pre-training runs on an NVIDIA A100. For the exponential moving average weight of the target network, the starting value is set to 0.996 and then gradually increases to 1. The dimension of the final features used to compute the loss is 512. When cropping the global point cloud, the minimum and maximum crop ratios are set to 0.7 and 1.0, respectively, and the number of crops is 2. When cropping local point clouds, the minimum and maximum crop ratios are set to 0.2 and 0.5, respectively, and the number of crops is 8. Additionally, we use FPS to sample half of the points of the original point cloud as different-resolution point clouds and add them to the local point cloud set; the number of different-resolution point clouds is 2.
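The warm-up-then-cosine learning-rate schedule described above can be sketched as follows (per-epoch granularity assumed; the real implementation may step per iteration):

```python
import math

def lr_schedule(epoch, warmup_epochs=10, total_epochs=200, base_lr=1e-4):
    """Linear warm-up for the first 10 epochs, then cosine decay towards zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs   # linear ramp up to base_lr
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))

print(lr_schedule(0), lr_schedule(9), lr_schedule(199))
```

The rate peaks at 1e-4 exactly when warm-up ends (epoch 10) and is nearly zero by the final epoch.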
Classification We use a fully connected MLP network combining ReLU, BN, and Dropout operations as the classification head. SGD with a cosine schedule is used as the optimizer to fine-tune the classification network. We set the batch size to 32.
Segmentation Different from the classification task, the segmentation task needs to predict per-point labels. We first select features from multiple stages of the network, including the initial input features of the standard Transformer and the output features of layers 3 and 7. We concatenate the features of these different layers and then use the point feature propagation from PointNet++ to propagate the features of the 256 downsampled points to the 2048 raw input points. Finally, an MLP maps the features to the segmentation label space. The batch size is 16, with a learning rate initialized to 0.0002 and decayed via the cosine schedule. We use the AdamW optimizer to train the segmentation network.
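The PointNet++-style feature propagation used here interpolates features from the 256 downsampled points back to all 2048 inputs by inverse-distance weighting. A minimal NumPy sketch follows; the three-neighbor choice matches PointNet++'s default, and the function name is illustrative.

```python
import numpy as np

def propagate_features(sub_xyz, sub_feats, full_xyz, k=3, eps=1e-8):
    """Inverse-distance-weighted interpolation of features from the
    downsampled points back to every raw input point (PointNet++ style)."""
    d = np.linalg.norm(full_xyz[:, None, :] - sub_xyz[None, :, :], axis=-1)  # (N, M)
    nn = np.argsort(d, axis=1)[:, :k]                    # k nearest subsampled points
    w = 1.0 / (np.take_along_axis(d, nn, axis=1) + eps)  # closer points weigh more
    w = w / w.sum(axis=1, keepdims=True)                 # weights sum to 1 per point
    return (sub_feats[nn] * w[..., None]).sum(axis=1)    # (N, C)

rng = np.random.default_rng(0)
sub_xyz = rng.standard_normal((256, 3))      # 256 downsampled points
full_xyz = rng.standard_normal((2048, 3))    # 2048 raw input points
sub_feats = np.ones((256, 8))                # stand-in per-point features
dense = propagate_features(sub_xyz, sub_feats, full_xyz)
print(dense.shape)  # (2048, 8)
```

Because the weights are normalized, a constant feature field propagates unchanged, which is a quick sanity check on the interpolation.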
In the experiments of this paper, four datasets (ShapeNet ref33, ModelNet40 ref31, ScanObjectNN a25, and ShapeNetPart ref34) are used.
ShapeNet contains 57,448 CAD models from 55 categories. To obtain point cloud data, we follow the processing method of Yang et al. and sample 2048 points from the surface of each CAD model. We use ShapeNet as the pre-training dataset. In the pre-training stage, we use the farthest point sampling algorithm to select 64 group center points and divide the 2048 points into 64 groups of 32 points each.
ModelNet40 contains 12,311 handmade CAD models from 40 categories and is widely used for point cloud classification tasks. We follow Yu et al. and sample 8192 points from the surface of each CAD model. According to the official split, 9,843 models are used for training and 2,468 for testing. Following the work of Yu et al. a20, we generated a Fewshot-ModelNet40 dataset based on ModelNet40. "M-way N-shot" denotes the setting where M-way is the number of categories selected for training, N-shot is the number of training samples per category, and 20 samples per category are used for testing. M is selected from {5, 10} and N from {10, 20}.
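One way to build such an "M-way N-shot" episode is sketched below; the sampling routine is an illustrative assumption consistent with the description (N training samples and 20 disjoint test samples per selected class).

```python
import numpy as np

def sample_episode(labels, m_way, n_shot, n_test=20, seed=0):
    """Sample one M-way N-shot episode with disjoint train/test indices."""
    rng = np.random.default_rng(seed)
    classes = rng.choice(np.unique(labels), size=m_way, replace=False)
    train_idx, test_idx = [], []
    for c in classes:
        idx = rng.permutation(np.where(labels == c)[0])
        train_idx.extend(idx[:n_shot])                  # N samples for training
        test_idx.extend(idx[n_shot:n_shot + n_test])    # 20 held-out test samples
    return np.array(train_idx), np.array(test_idx)

labels = np.repeat(np.arange(40), 50)   # toy stand-in for ModelNet40 labels
train_idx, test_idx = sample_episode(labels, m_way=5, n_shot=10)
print(train_idx.shape, test_idx.shape)  # (50,) (100,)
```

Repeating this with different seeds yields the independent runs over which mean and deviation are reported.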
ScanObjectNN is a 3D point cloud classification dataset derived from real-world scanned data. It contains 2,902 point clouds from 15 categories. Due to occlusion, rotation, and background noise, it is more difficult to classify. Following Yu et al. a20, we select three variants for our experiments: OBJ-BG, OBJ-ONLY, and PB-T50-RS.
ShapeNetPart contains 16,811 objects from 16 categories. Each object consists of 2 to 6 parts, with a total of 50 distinct parts across all categories. Following Yu et al. a20, we randomly select 2048 points as input.
5.1 Linear SVM Classification
The linear SVM classification task has become a classic way to evaluate self-supervised point cloud representation learning. This experiment directly verifies that POS-BERT has learned better representations. For a fair comparison with previous studies, we follow the common settings used in prior work a14; a15; a16; a19
: we pre-train the model on ShapeNet and test it on ModelNet40. We use the pre-trained model to extract the features of each point cloud, train a simple linear Support Vector Machine (SVM) on the ModelNet40 training set, and finally test the SVM on the ModelNet40 test set. We compare against a series of competitive methods, including handcrafted descriptor methods, generation-based methods, contrastive learning methods, and methods based on mask patch modeling. The results of all methods are summarized in Tab.1; for the compared methods we report the best results from the original papers. As shown in Tab.1, our method outperforms all others by a large margin, including the latest contrastive learning method CrossPoint and the generation-based method ParAE. More importantly, it surpasses Point-BERT, which is also based on the MPM paradigm, by 3.5%. This result shows that our Momentum Encoder provides more meaningful supervision representations for the masked patches. Finally, it is worth mentioning that our linear classification results exceed those of some supervised point cloud networks, such as PointNet (89.7%) and PointNet++ (91.9%). For a more intuitive view of our model's performance, we use t-SNE to map the self-supervised features to a 2D space, as shown in Fig.4; different categories are clearly separated from each other. These experimental results demonstrate that our method learns better representations.
5.2 Downstream Tasks
3D Object Classification on Synthetic Data To test whether POS-BERT can boost downstream tasks, we first perform fine-tuning experiments on the point cloud classification task using the pre-trained model. Here, "From scratch" stands for training the model on ModelNet40 from a randomly initialized network, and "Pretrain" stands for pre-training the model on ShapeNet and then fine-tuning it on ModelNet40. We fine-tuned the classification network from these different initializations on ModelNet40; the final classification results are summarized in Tab.2. Tab.2 shows that the original Transformer's accuracy on the point cloud classification task is just 91.4%. Initializing the network with our pre-trained weights greatly increases the Transformer's classification accuracy to 93.56%. For a fair comparison with Point-BERT, we also use the voting strategy during testing; voting results are annotated with *. Our method outperforms OcCo and Point-BERT without voting by 1.4% and 0.4%, respectively. With the voting strategy, even though the accuracy is already high, our method is still slightly better than Point-BERT.
Few-shot Classification To demonstrate that our pre-trained model can learn quickly from few samples, we conduct experiments on the Few-shot ModelNet40 dataset. We experiment with four settings: "5-way 10-shot", "5-way 20-shot", "10-way 10-shot", and "10-way 20-shot", where way is the number of categories and shot is the number of samples per category. During testing, 20 samples outside the training set are selected for evaluation. We run 10 independent experiments under each setting and report their mean and deviation. We compare with the current SOTA methods OcCo and Point-BERT; the results are summarized in Tab.3. Our approach produces the best results on the few-shot classification task. Compared with the baseline, the mean accuracy increases by 8.6%, 3.7%, 8%, and 5.5%, respectively, and the deviation is almost halved. Compared with Point-BERT, the mean increases by 1.8%, 0.8%, 1.6%, and 2.2%, respectively, with a smaller deviation. This demonstrates that POS-BERT has learned a universal representation suitable for quick knowledge transfer with limited data.
We report the average accuracy (%) as well as the standard deviation over 10 independent experiments.
3D Object Classification on Real-world Data In this experiment, we explore whether the knowledge POS-BERT learns from ShapeNet transfers to real-world data. We conduct experiments on the three variants of the ScanObjectNN a25 dataset: OBJ-BG, OBJ-ONLY, and PB-T50-RS. We compare against several methods, including supervised methods using specific point cloud networks (PointNet, BGA-PN++, SimpleView, etc.) as well as pre-training methods (OcCo, Point-BERT). The experimental results are summarized in Tab.4: our method obtains the best results. On OBJ-BG and OBJ-ONLY, we surpass Point-BERT by 3.45% and 2.76%, respectively, and we also outperform Point-BERT in the PB-T50-RS setting. These experiments suggest that the knowledge learned by POS-BERT transfers easily to real-world data.
Part Segmentation In this section, we explore how the pre-trained model performs on per-point classification. We experiment on ShapeNetPart, a benchmark dataset commonly used for point cloud segmentation tasks. Compared with classification, segmentation must densely predict the label of every point. We compare against commonly used point cloud analysis networks and the most advanced self-supervised methods. The mean Intersection over Union (mIoU) of the various methods is reported in Tab.5. Our method is significantly better than the most advanced method, Point-BERT, in terms of mIoU. From a per-category perspective, we exceed the other methods in most categories. These results show that our method also learns to distinguish fine details very well.
5.3 Ablation study
To demonstrate the effectiveness of our key modules, we conducted an ablation study on the ModelNet40 linear SVM classification task with four variants. The first variant, POS-BERT-Var1, uses a randomly initialized Transformer network to extract features directly without any pre-training and then classifies them with an SVM. The second variant, POS-BERT-Var2, uses only the mask patch modeling pretext task for pre-training. The third variant, POS-BERT-Var3, uses a fixed, randomly initialized momentum encoder as the tokenizer for pre-training. The fourth variant, POS-BERT-Var4, uses only the contrastive loss to train the point cloud Transformer. The results are summarized in Tab.6. From the table we can see that a fixed Momentum Encoder does not help the network train well, and pre-training with mask patch modeling alone struggles to obtain high-level semantic information. The best results are obtained when mask patch modeling and contrastive learning work together.
In this paper, we propose the one-stage point cloud pre-training method POS-BERT, which is simple, flexible, and efficient. It uses a momentum encoder as the tokenizer to provide supervision for the mask patch modeling pretext task, and the joint training of the momentum encoder with the MPM task greatly simplifies the training procedure and saves training cost. Experiments show that our method extracts high-level semantic information best in the linear SVM classification task, with a significant improvement over Point-BERT. At the same time, it achieves state-of-the-art performance on many downstream tasks, including 3D object classification, few-shot classification, and part segmentation.