You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module

by Chenfeng Xu, et al.
berkeley college

3D point-cloud-based perception is a challenging but crucial computer vision task. A point-cloud consists of a sparse, unstructured, and unordered set of points. To understand a point-cloud, previous point-based methods, such as PointNet++, extract visual features through hierarchical aggregation of local features. However, such methods have several critical limitations: 1) They require several sampling and grouping operations, which slow down inference. 2) They spend an equal amount of computation on each point in a point-cloud, though many points are redundant. 3) They aggregate local features through downsampling, which leads to information loss and hurts perception performance. To overcome these challenges, we propose a novel, simple, and elegant deep learning model called YOGO (You Only Group Once). Compared with previous methods, YOGO only needs to sample and group a point-cloud once, so it is very efficient. Instead of operating on points, YOGO operates on a small number of tokens, each of which summarizes the point features in a sub-region. This allows us to avoid computing on the redundant points and thus boosts efficiency. Moreover, YOGO preserves point-wise features by projecting token features to point features, although the computation is performed on tokens. This avoids information loss and can improve point-wise perception performance. We conduct thorough experiments to demonstrate that YOGO achieves at least 3.0x speedup over point-based baselines while delivering competitive classification and segmentation performance on the ModelNet, ShapeNetParts and S3DIS datasets.


I Introduction

Fig. 1: (a) PointNet [24] directly pools a point-cloud into a single global feature. (b) PointNet++ [25] hierarchically samples a number of points, groups their nearby points, and aggregates the nearby points into one point via PointNet. (c) YOGO divides the whole point-cloud into several parts, squeezes each part into a token, and models the token-token relations to capture the part structure of the car and the token-point relations to project the tokens back onto the original points.

A growing number of applications, such as autonomous driving, robotics, and augmented and virtual reality, require efficient and accurate 3D perception. As LiDAR and other depth sensors become popular, 3D visual information is commonly captured, stored, and represented as point-clouds. A point-cloud is essentially a set of unstructured points. Each point in a point-cloud consists of its 3D coordinates and, optionally, features such as normal, color, and intensity.

Different from images, where pixels are arranged as 2D grids, points in a point-cloud are unstructured. This makes it infeasible to aggregate features by using convolutions, the de facto operator for image-based vision models. In order to handle the unstructured point-cloud, popular methods such as PointNet [24] process each point through multi-layer perceptrons and use a global pooling to aggregate point-wise features into a global feature. Then, the global feature is shared with all the point-wise features. This mechanism allows points within a point-cloud to communicate local features with each other. However, this pooling mechanism is too coarse, as it fails to aggregate and process local information in a hierarchical way.

Another mainstream approach, PointNet++ [25], extends PointNet [24] by hierarchically organizing the point-cloud. Precisely, it samples center points, groups nearby neighbours, and applies PointNet to aggregate local features. After PointNet++, hierarchical feature extraction [17, 19, 33] has been further developed, but several limitations are not adequately addressed: 1) Sampling and grouping for point-clouds are handcrafted operations, and applying them hierarchically is computationally heavy. 2) This feature extraction requires computation on all points, many of which are redundant because adjacent points don’t provide extra information. 3) As local features are aggregated, point-wise features are discarded, which leads to information loss.

Based on the aforementioned motivation, instead of directly max-pooling the whole point-cloud into one single feature (PointNet [24]) or traversing many points to aggregate their neighbors (PointNet++ [25]), we propose a better way to aggregate point-cloud features, as shown in Fig. 1. We only need to group points into a few sub-regions once, which has less computation cost than previous point-based methods. Then each sub-region is squeezed into a token representation [39], which concisely summarizes the point-wise features within a region and eliminates redundancy. Next, we apply self-attention [34] to capture the relations between the tokens (regions). While the representations are computed at the token level, we also use a cross-attention module to project the computed token features to point-wise features, which preserves full details with minimum computational cost. The series of operations above is what we call a relation inference module (RIM), and we propose YOGO, which is composed of a stack of relation inference modules.

Our proposed YOGO can efficiently and effectively handle 3D point-cloud object classification, part segmentation, and scene segmentation. For the object classification task, YOGO receives the raw point sets with x, y, z coordinates and outputs the class based on the tokens in the final RIM layer. For the segmentation task, YOGO also takes the original points and outputs the per-point labels based on the final point features. We conduct experiments on the ShapeNetParts and S3DIS datasets to demonstrate the efficiency and efficacy of the proposed method. Specifically, YOGO slightly outperforms the classic PointNet++ [25] while achieving at least a 3.0x speedup over it. We also provide an extensive analysis of how the model obtains such improvements in the experiment section. Note that we do not expect to beat all the state-of-the-art methods via superior local-feature aggregation operators, since they have evolved over several years. Instead, we provide a new, simple, and elegant baseline from the perspective of relational inference. We hope it can inspire the research community to explore this approach further.

Overall, the key contributions of our work are as follows:

  • We design a new, simple and elegant baseline termed YOGO which efficiently and effectively processes the unordered point-cloud.

  • We provide thorough empirical analysis on the efficacy and efficiency of our method.

  • We illustrate the relations extracted via the proposed YOGO and develop intuitive explanations for its performance.

II Related work

II-A Point-Cloud Processing

Recent point-cloud processing methods mainly include volumetric-based methods, projection-based methods, and point-based methods. We briefly introduce them as follows.

Volumetric-based methods rasterize the point-cloud into 3D grids (voxels) [22, 2, 26, 21] so that the convolution operator can be easily applied. However, the number of voxels is usually limited by the constraint of GPU memory [21]. As a result, many points are grouped into the same voxel, which introduces information loss. Besides, to form a regular format, many empty voxels are added for padding, which causes redundant computation. Using a permutohedral lattice reduces the kernel to 15 lattices [28], yet the numbers are still limited. Recently, sparse convolutions [18] have been proposed to operate only on the non-empty voxels [6, 31, 47, 44], largely improving the efficiency of 3D convolutions and increasing the feasible number of voxels and the scale of the models.

Projection-based methods project the 3D point-cloud onto a 2D plane and exploit 2D convolution to extract features [37, 38, 40, 41, 29, 15, 3]. Specifically, the bird-eye-view projection [45, 14] and the spherical projection [38, 40, 41, 23] have made great progress in outdoor point-cloud segmentation (e.g. autonomous driving). However, it is hard for them to capture the geometric structure of the point-cloud, especially when the point-cloud becomes complex, as in indoor scenes [1] and object parsing [5], since the points are collected in an irregular manner.

Point-based methods directly process the point-cloud. The most classic method, PointNet [24], consumes the points with an MLP network and squeezes the final high-dimensional features into a single global feature via global pooling, which significantly improves the performance of point-cloud recognition and semantic segmentation. Furthermore, PointNet++ [25] extends it into a hierarchical form, in which local features are aggregated in each layer. Many works further develop advanced local-feature aggregation operators that mimic the convolution with customised methods to structure the data [17, 10, 19, 20, 35, 16, 13]. However, the unstructured format of the point-cloud dramatically hinders their efficiency, since many point-based methods rely on hierarchical handcrafted operations such as farthest point sampling, K nearest neighbors, and ball query. Although they gradually downsample the points in each layer, the sample-group operations must be applied in every stage, which makes them inefficient.

Our proposed YOGO is a new pipeline among the point-based methods, which only needs to sample-group the points once and is robust to various sample-group operations.

Fig. 2: The framework of YOGO. It consists of several stacked relation inference modules (RIM). “Locally Squeeze” indicates that we locally squeeze the features into several tokens based on the sub-regions, “SA” means self-attention module, and “CA” means cross-attention module.

II-B Self-attention for perception

Self-attention was first introduced by Vaswani et al. [34] for natural language processing and has recently become especially popular in the perception field [4, 8, 39, 30, 48]. It has great advantages in modeling the relations among elements such as tokens (for sequence data [7, 34]) and pixels (for image data [4]) without complex data rearrangement. This is very useful for point-cloud data because it avoids structuring the data as previous point-based methods did. Yet few works explore self-attention on point-clouds. A main challenge is that the self-attention operation is computationally intensive, with a time complexity that is quadratic in the number of input elements, which makes the model hard to scale. For commonly used point-cloud datasets [27, 5, 1], at least 1024 points are required to recognize the point-cloud, which is a relatively large number for self-attention. This motivates us to explore how to efficiently take advantage of the self-attention mechanism.

In this paper, different from the previous point-based methods that aim to design a more advanced local-feature aggregation operator, we propose a new baseline model termed YOGO that takes advantage of the pooling operation and self-attention. In particular, instead of directly applying self-attention to the point-cloud, we use a pooling operation to aggregate the features in each sub-region, so that we only preserve the important features and can apply self-attention efficiently to model the relations between them. The details are introduced in the method section.

III Method

III-A Overview

We propose YOGO, a deep learning model for point-cloud classification and segmentation. YOGO accepts a set of points as input; each point is parameterized by its 3D coordinates and, optionally, additional features such as normal and color. At the output, YOGO generates a global prediction for classification and point-wise labels for semantic segmentation.

YOGO consists of a stack of relation inference modules (RIM), an efficient module for processing a point-cloud. Given an input point-cloud, we first divide it into sub-regions by sampling centers and grouping their neighboring points. For the points within each sub-region, we use a pooling operation to extract a token representing this region. The tokens are then processed by self-attention to exchange information among regions and aggregate long-range features. Next, a cross-attention module projects the token features back to point-wise features, to obtain a stronger point-wise representation.

Fig. 3: Relation Inference Module (RIM). The input point features are first locally squeezed into several tokens, then we can get a stronger token representation via self-attention module. Next, the tokens are projected into the original point-wise features in the cross-attention module. The coefficient matrix shown in the figure is a sample from a token.

III-B Sub-region Division

We divide the point-cloud into a small number of sub-regions via the commonly used farthest point sampling (FPS) followed by K-nearest-neighbor search (KNN search) or ball query [25]. Different from the previous point-based methods [25, 17] that re-sample and re-group the points in each layer, we only need to sample-group once, because the grouped indices can be re-used for gathering nearby points in each layer. Also note that the point sampling is fast, since only a few centers are sampled and the sampling is the same in each layer. After this step, the point-cloud is divided into sub-regions, each containing the points grouped around one sampled center.
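The sample-group-once idea can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy point count, the number of centers (16), and the group size (32) are all assumptions chosen for the example. It shows greedy farthest point sampling, KNN grouping, and how the same index array can gather features of different channel widths in later layers.

```python
import numpy as np

def farthest_point_sampling(points, num_centers, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    n = points.shape[0]
    chosen = np.empty(num_centers, dtype=np.int64)
    chosen[0] = start
    # Squared distance from every point to its nearest chosen center so far.
    min_dist = np.full(n, np.inf)
    for i in range(1, num_centers):
        diff = points - points[chosen[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(np.argmax(min_dist))
    return chosen

def knn_group(points, center_idx, k):
    """Indices of the k nearest points to each center, shape (L, k)."""
    centers = points[center_idx]                                      # (L, 3)
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)   # (L, N)
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(0)
points = rng.random((1024, 3))                  # toy cloud (assumed size)
center_idx = farthest_point_sampling(points, num_centers=16)
group_idx = knn_group(points, center_idx, k=32)  # computed once

# The same indices gather features in every layer, whatever the channel width.
feats_layer1 = rng.random((1024, 32))
feats_layer2 = rng.random((1024, 64))
grouped1 = feats_layer1[group_idx]              # (16, 32, 32)
grouped2 = feats_layer2[group_idx]              # (16, 32, 64)
```

Because `group_idx` depends only on the coordinates, it does not need to be recomputed when the feature dimension changes between layers.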

III-C Relation Inference Module (RIM)

The relation inference module (RIM) does not require handcrafted grouping and sampling operations to aggregate point-wise features. Instead, it adopts a simple pooling operation and self-attention. The pooling operation is used to aggregate features within a sub-region [24, 20], while self-attention is used to capture relations between sub-regions. Self-attention has been demonstrated to have a superior ability to capture long-range relations in many vision and language tasks. However, as the computational cost of self-attention is quadratic in the number of input elements, it is infeasible to apply it directly to a point-cloud. Our design lets self-attention operate on regions (tokens) instead of points, which greatly reduces the computational cost while retaining the advantages of self-attention. The structure of RIM is presented in Fig. 3.
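To make the cost argument concrete, the short sketch below counts attention-coefficient entries as a rough proxy for cost. The function name and the choice of 1024 points and 32 tokens are illustrative assumptions, and the count ignores the channel dimension and the pooling cost.

```python
def attention_pairs(num_points, num_tokens):
    """Count coefficient-matrix entries as a rough proxy for attention cost."""
    # Self-attention directly on points: one coefficient per point pair.
    point_level = num_points * num_points
    # Token-level design: token-token coefficients plus token-point coefficients.
    token_level = num_tokens * num_tokens + num_tokens * num_points
    return point_level, token_level

point_level, token_level = attention_pairs(num_points=1024, num_tokens=32)
print(point_level, token_level, round(point_level / token_level, 1))
```

Under these assumptions, the token-level design needs roughly 31 times fewer coefficients than point-level self-attention.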

Specifically, given the point features of a point-cloud, we divide the points into sub-regions, each with its corresponding point features. RIM computes one token to represent the points within each region: a max-pooling operation squeezes the point features of the region into a single feature, and a linear function maps the pooled feature to the output token. This is similar to PointNet [24], yet PointNet applies pooling to all the points, while we pool within each sub-region separately.
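The local squeeze step described here can be sketched as follows. This is an illustrative NumPy version under assumed shapes: `group_idx` stands in for the grouping indices produced by the sub-region division (random here), and `weight` and `bias` are hypothetical parameters of the linear map.

```python
import numpy as np

def local_squeeze(point_feats, group_idx, weight, bias):
    """Max-pool the features of each sub-region, then apply a linear map."""
    grouped = point_feats[group_idx]   # (L, K, C): features per sub-region
    pooled = grouped.max(axis=1)       # (L, C): one pooled feature per region
    return pooled @ weight + bias      # (L, C_T): one token per region

rng = np.random.default_rng(0)
num_points, num_regions, group_size = 1024, 16, 32   # assumed sizes
in_dim, token_dim = 32, 64                            # assumed channel widths

point_feats = rng.standard_normal((num_points, in_dim))
# Stand-in for the grouping indices computed once by sampling and grouping.
group_idx = rng.integers(0, num_points, size=(num_regions, group_size))
weight = rng.standard_normal((in_dim, token_dim)) / np.sqrt(in_dim)
bias = np.zeros(token_dim)

tokens = local_squeeze(point_feats, group_idx, weight, bias)
```

Since max-pooling ignores the order of the points inside a region, the resulting tokens do not depend on how the points of a sub-region are ordered.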

Next, we model the relations between different regions using self-attention as in [34]. We first stack all the tokens to form a single tensor. Since self-attention does not depend on the order of its input elements, we can stack the tokens in any order. On the stacked tokens, the self-attention module computes a coefficient matrix that describes the token-token relations and outputs the updated tokens. The module is parameterized by four matrices, which represent the weights for projection, value, key, and query [34], respectively.
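A minimal NumPy sketch of self-attention over tokens is given below. It follows the standard formulation of [34]; the softmax normalization, the square-root scaling, and all names and shapes are assumptions, not details confirmed by the text above.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_self_attention(tokens, w_q, w_k, w_v, w_p):
    """Return updated tokens (L, C_T) and the token-token coefficients (L, L)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    coeff = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # rows sum to one
    return coeff @ v @ w_p, coeff

rng = np.random.default_rng(0)
num_tokens, dim = 16, 64                               # assumed sizes
tokens = rng.standard_normal((num_tokens, dim))
w_q, w_k, w_v, w_p = (
    rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(4)
)

out, coeff = token_self_attention(tokens, w_q, w_k, w_v, w_p)
```

Reordering the input tokens only reorders the output rows in the same way, which is why the stacking order of the tokens does not matter.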

Next, we project the output tokens back to point-wise features. To do this, we feed the output tokens and the original point-wise features to a cross-attention module, which outputs the updated point features. The module is parameterized by four matrices and an MLP, and it computes a coefficient matrix between the tokens and each point feature.
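One plausible way to realize the token-to-point projection is sketched below in NumPy. This is an assumption-laden illustration: queries are taken from the point features and keys and values from the tokens, the coefficients are softmax-normalized over tokens, and the MLP mentioned in the text is omitted. All names and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_to_point_cross_attention(point_feats, tokens, w_q, w_k, w_v, w_p):
    """Project token features onto points; returns (N, C_out) and (N, L)."""
    q = point_feats @ w_q                              # (N, D) point queries
    k = tokens @ w_k                                   # (L, D) token keys
    v = tokens @ w_v                                   # (L, D) token values
    coeff = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, L) token-point relation
    return coeff @ v @ w_p, coeff

rng = np.random.default_rng(0)
num_points, num_tokens = 1024, 16                      # assumed sizes
point_dim, token_dim, attn_dim, out_dim = 32, 64, 48, 32

point_feats = rng.standard_normal((num_points, point_dim))
tokens = rng.standard_normal((num_tokens, token_dim))
w_q = rng.standard_normal((point_dim, attn_dim)) / np.sqrt(point_dim)
w_k = rng.standard_normal((token_dim, attn_dim)) / np.sqrt(token_dim)
w_v = rng.standard_normal((token_dim, attn_dim)) / np.sqrt(token_dim)
w_p = rng.standard_normal((attn_dim, out_dim)) / np.sqrt(attn_dim)

new_point_feats, coeff = token_to_point_cross_attention(
    point_feats, tokens, w_q, w_k, w_v, w_p
)
```

Each row of `coeff` tells how strongly one point draws on each token, so the cost grows with the number of points times the number of tokens rather than with the square of the number of points.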

Analytically, each token encodes the most important and representative components of its corresponding sub-region. The self-attention over the tokens can help capture the structure of the point-cloud; e.g., as shown in Fig. 3, the head, the wings, the airplane body, and the tail have strong structural relations with each other. This is useful for learning semantics, since we can recognize the airplane through these representative elements. After this, the coefficient matrix of the cross-attention module is calculated to indicate the relations between the tokens and each point, which helps to project the tokens with rich semantic information into the original point-cloud features.

III-D Training objective

Our proposed network YOGO can handle point-cloud classification and segmentation tasks and can be simply trained end-to-end. The tokens and the point features in the final RIM layer are used for the classification and segmentation tasks, respectively. We simply apply the commonly used cross-entropy loss to train the model for both tasks. For classification, the loss is computed between the ground-truth category of a point-cloud and the softmax probability predicted for each class. For segmentation, the loss is computed between the ground-truth label of each point and its softmax probability prediction, over all the points of the point-cloud.
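In standard notation, the two cross-entropy objectives can be written as below. The symbols are assumptions introduced for this sketch: $C$ is the number of point-cloud categories, $C_p$ the number of per-point categories, $N$ the number of points, $y$ the one-hot ground truth, and $\hat{y}$ the softmax prediction. Averaging the segmentation loss over the $N$ points is also an assumption; a plain sum differs only by a constant factor.

```latex
\mathcal{L}_{\mathrm{cls}} = -\sum_{c=1}^{C} y_{c} \log \hat{y}_{c},
\qquad
\mathcal{L}_{\mathrm{seg}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C_p} y_{i,c} \log \hat{y}_{i,c}
```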

IV Experiment

The experiments are conducted on ShapeNetParts [5] and S3DIS [1] for the segmentation task and on ModelNet40 [27] for the classification task. We conduct both comparison experiments, to evaluate effectiveness and efficiency, and an ablation study, to analyze in depth why YOGO works. The details are as follows.

Fig. 4: Visualization of the coefficient matrix in the cross-attention module on the ShapeNetParts dataset. Blue means the response is small and red means the response is large. We take the rocket point-cloud and the chair point-cloud as examples. For the rocket point-cloud, we choose token 3, token 7, and token 11, which attend the head, the tail, and the flank, respectively. For the chair point-cloud, we choose token 2, token 11, and token 14, which attend the sitting board and the legs, the back board and the legs, and the back board and the sitting board, respectively.

IV-A Implementation detail

We stack eight relation inference modules on all experiments and set equal to 256 and equal to 32, 32, 64, 64, 128, 128, 256, 256 respectively for the eight relation inference modules.

We first conduct the experiment on the ShapeNetParts [5] dataset, which involves 16,881 different objects from 16 categories, with 2-6 part labels each. Using mean intersection-over-union (mIoU) as the evaluation metric, we first calculate the part-averaged IoU for each of the 2874 test models and then average these values as the final metric. We set the number of sub-regions to 32 and the number of grouped points per sub-region to 96. All point coordinates are normalized into [0, 1]. For ball query grouping, we set the radius to 0.2. During training, we input 2048 points per point-cloud with a batch size of 128, and use a cosine annealing learning rate schedule with an initial learning rate of 1e-3 and the Adam optimizer. During inference, we also input 2048 points and vote 10 times for each point-cloud.
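The part-averaged mIoU described above can be sketched as follows. This is an illustrative NumPy version, not the authors' evaluation script: the function names are hypothetical, and counting a part that is absent from both prediction and ground truth as IoU 1.0 is an assumed convention (common in ShapeNetParts evaluation code, but not stated in the text).

```python
import numpy as np

def shape_part_iou(pred, target, part_ids):
    """Average IoU over the parts that belong to one shape's category."""
    ious = []
    for part in part_ids:
        pred_mask = pred == part
        target_mask = target == part
        union = np.logical_or(pred_mask, target_mask).sum()
        if union == 0:
            # Assumed convention: a part absent from both counts as perfect.
            ious.append(1.0)
        else:
            inter = np.logical_and(pred_mask, target_mask).sum()
            ious.append(inter / union)
    return float(np.mean(ious))

def mean_iou(preds, targets, part_ids_per_shape):
    """Average the part-averaged IoU over all test shapes."""
    per_shape = [
        shape_part_iou(p, t, ids)
        for p, t, ids in zip(preds, targets, part_ids_per_shape)
    ]
    return float(np.mean(per_shape))

# Toy example: two shapes with two parts each (labels 0 and 1).
preds = [np.array([0, 0, 1, 1]), np.array([0, 1, 1, 0])]
targets = [np.array([0, 1, 1, 1]), np.array([0, 1, 1, 0])]
score = mean_iou(preds, targets, [[0, 1], [0, 1]])
```

In the toy example, the first shape has part IoUs of 1/2 and 2/3, and the second shape is predicted perfectly.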

We then conduct the experiment on the S3DIS [1] dataset, which is collected from real-world indoor scenes and includes 3D scans captured with Matterport scanners in 6 areas. We also use mIoU as the evaluation metric. We use Areas 1, 2, 3, 4, and 6 for training and Area 5 for testing, since it is the only area that does not overlap with any other area. We set the number of sub-regions to 32 and the number of grouped points per sub-region to 128. For ball query grouping, the radius is set to 0.3. We train the networks with 4096 input points and a batch size of 64, and use the same training strategy as for the ShapeNetParts dataset. During inference, we also input 4096 points and vote once for each point-cloud.

Finally, we apply our method to the classification task on the ModelNet40 dataset [27]. This is a CAD dataset with 12311 meshed models in 40 categories. We set the number of sub-regions to 16 and the number of grouped points per sub-region to 128. For ball query grouping, the radius is set to 0.15. We train the network with 1024 input points and a batch size of 64, and use the same training strategy as [17]. Note that all the experiments are conducted on one Titan RTX GPU.

IV-B Comparison experiment

We first evaluate YOGO on the ShapeNetParts and S3DIS datasets for semantic segmentation, as shown in Table I and Table II. We can observe that YOGO has very competitive performance and is at least 3x faster than point-based baselines on the ShapeNetParts dataset. Precisely, YOGO slightly outperforms the classic baseline PointNet++ [25] and runs at least 3x faster. Although the unofficial PointNet++ [42] slightly outperforms YOGO, YOGO is more than 9.2x faster. On the S3DIS dataset, YOGO also achieves at least a 4.0x speedup and delivers competitive performance. PointCNN [17] outperforms YOGO by 0.9 mIoU (resp. 3.26 mIoU) on the ShapeNetParts (resp. S3DIS) dataset but is not efficient. For the classification task on the ModelNet40 dataset, presented in Table III, YOGO still has competitive performance compared with the popular baselines.

We can also observe that YOGO performs similarly with the two grouping methods, KNN and ball query, which shows that our proposed method is relatively stable with regard to the grouping strategy. Note that there remain many excellent methods that perform better than YOGO; we are not attempting to beat all of them. Instead, we propose a novel baseline from the perspective of token-token relation and token-point relation.
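The speedup figures quoted in this subsection can be checked directly from the latencies reported in Table I and Table II. The short script below is only a worked arithmetic example; the dictionary layout and function name are assumptions, while the numbers are copied from the tables.

```python
# Latencies in milliseconds, copied from Table I and Table II.
latency_ms = {
    "ShapeNetParts": {
        "PointNet++": 77.7,
        "PointNet++*": 236.7,
        "PointCNN": 134.2,
        "YOGO (KNN)": 25.6,
    },
    "S3DIS": {
        "RSNet": 111.5,
        "PointNet++*": 501.5,
        "PointCNN": 282.43,
        "YOGO (KNN)": 27.7,
    },
}

def speedups(table, ours="YOGO (KNN)"):
    """Latency of each baseline divided by the latency of YOGO."""
    return {name: ms / table[ours] for name, ms in table.items() if name != ours}

for dataset, table in latency_ms.items():
    for name, ratio in speedups(table).items():
        print(f"{dataset}: YOGO (KNN) is {ratio:.2f}x faster than {name}")
```

The ratios against PointNet++ and PointNet++* on ShapeNetParts come out slightly above 3x and 9.2x, and the ratio against RSNet on S3DIS slightly above 4x, matching the statements in the text.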

Method               Mean IoU   Latency    GPU Memory
PointNet [24]        83.7       21.4 ms    1.5 GB
RSNet [11]           84.9       73.8 ms    0.8 GB
SynSpecCNN [46]      84.7       -          -
PointNet++ [25]      85.1       77.7 ms    2.0 GB
PointNet++* [25]     85.4       236.7 ms   0.9 GB
DGCNN [36]           85.1       86.7 ms    2.4 GB
SpiderCNN [43]       85.3       170.1 ms   6.5 GB
SPLATNet [28]        85.4       -          -
SO-Net [16]          84.9       -          -
PointCNN [17]        86.1       134.2 ms   2.5 GB
YOGO (KNN)           85.2       25.6 ms    0.9 GB
YOGO (Ball query)    85.1       21.3 ms    1.0 GB

TABLE I: Quantitative results of semantic segmentation on the ShapeNetParts dataset. The latency and GPU memory are measured with a batch size of 8 and 2048 points. PointNet++* is a reproduced version from the popular repository [42].

Method               Mean IoU   Latency     GPU Memory
PointNet [24]        42.97      24.8 ms     1.0 GB
DGCNN [36]           47.94      174.3 ms    2.4 GB
RSNet [11]           51.93      111.5 ms    1.1 GB
PointNet++* [25]     50.7       501.5 ms    1.6 GB
TangentConv [32]     52.6       -           -
PointCNN [17]        57.26      282.43 ms   4.6 GB
YOGO (KNN)           54.0       27.7 ms     2.0 GB
YOGO (Ball query)    53.8       24.0 ms     2.0 GB

TABLE II: Quantitative results of semantic segmentation on the S3DIS dataset. The latency and GPU memory are measured with a batch size of 8 and 4096 points. PointNet++* is a reproduced version from the popular repository [42].

Method               Overall Accuracy
PointNet [24]        89.2
PointNet++ [25]      90.7
SO-Net [16]          90.9
SpiderCNN [43]       90.5
MCConv [9]           90.9
PointCNN [17]        92.2
DGCNN [36]           92.2
ViT-B-2* [8]         78.9
Transformer* [34]    82.1
Perceiver [12]       85.7
YOGO (KNN)           91.4
YOGO (Ball query)    91.3

TABLE III: Quantitative results on the ModelNet40 dataset. The results of ViT-B-2* and Transformer* are directly taken from Perceiver [12].

IV-C Ablation study

In this subsection, we conduct an in-depth study of our proposed YOGO, including the effectiveness of the Relation Inference Module (RIM) and related techniques, on the ShapeNetParts dataset. More importantly, we analyze what RIM learns and thus explain why YOGO works.

The effectiveness of RIM. The self-attention module and the cross-attention module are the two key components in RIM. We study them by removing each of them in turn. In particular, when we remove the self-attention module (SA), we insert several MLP layers that have FLOPs similar to those of the self-attention modules. When we remove the cross-attention module (CA), we use different related techniques instead: 1) directly squeeze the tokens into one global feature and concatenate it with the point features, which is similar to PointNet [24]; 2) add or concatenate each token to the point features of its corresponding sub-region. The results are shown in Table IV.

We can observe that if we coarsely use the tokens by squeezing them into one global feature, the performance is worse than the original YOGO by 1.1 mIoU. A more reasonable method is to leverage the tokens on the corresponding sub-regions by adding or concatenating each token to the point features of its sub-region. The original YOGO still outperforms these variants by at least 0.5 mIoU, which shows that the cross-attention method is more effective. On the other hand, when we substitute the self-attention module with several MLPs, the performance also drops by 0.5 mIoU, which shows that it is important to learn the relations between different sub-regions.

Visualization analysis of RIM. To explore more deeply what RIM learns, we visualize some samples of the coefficient matrix in the cross-attention module, as shown in Fig. 4. Note that the coefficient matrix of the cross-attention module indicates the relation between the tokens and the original point features, yet there is no explicit supervision to guide the learning of the coefficient matrix. We can observe that different tokens automatically attend to different parts even without supervision. For example, the tokens for the rocket point-cloud have stronger responses to the head, the tail, and the flank, respectively. Similarly, the tokens for the chair point-cloud have stronger responses to the back, the sitting board, and the legs, respectively. This demonstrates that the tokens indeed carry strong semantic information, and that the cross-attention module indeed helps project the tokens with rich semantic information into point-wise features.

We also visualize the coefficient matrix in the self-attention module. For better visualization, we choose the model with the number of sub-regions equal to 16. As shown in Fig. 5, figure (a) on the right shows that the token attending the body has stronger relations to the token attending the flank and the wings, which together form a whole airplane. Figures (c) and (d) on the right are similar: the tokens attend different sub-regions that are connected together and build the structure of the airplane point-cloud. We find that for tokens attending nearly the same parts, the coefficient values are low, which indicates that the self-attention module does not simply learn the similarity of different tokens but instead captures the relations between different sub-regions that build the structure of a point-cloud. This is also indicated by the small coefficient values between identical tokens.

Method                                 Mean IoU
YOGO w/o CA (concat global feature)    84.1
YOGO w/o CA (adding tokens)            84.7
YOGO w/o CA (concat tokens)            84.6
YOGO w/o SA                            84.7
YOGO                                   85.2

TABLE IV: Ablation study for the self-attention module and the cross-attention module in RIM. YOGO means that the full modules are used, YOGO w/o SA means no self-attention module, and YOGO w/o CA means no cross-attention module.

Fig. 5: Visualization of the coefficient matrix (left) in the self-attention module and the corresponding attentions to different parts (right). Blue (resp. red) means the response is small (resp. large). Figures (a), (b), (c), (d) show token 3 (resp. token 10) attending the body (resp. the wings and tail), token 7 (resp. token 14) attending the wings and head (resp. the wings and head), token 5 (resp. token 8) attending the wings and tail (resp. the wings and head), and token 15 (resp. token 1) attending the wings (resp. head to the wings and tail), respectively. We can observe that token 3 and token 10 in figure (a), and token 5 and token 8 in figure (c), correspond to large coefficient values, since the parts they attend have a strong structural relation and can form an airplane. Yet token 7 and token 14 in figure (b) correspond to a small coefficient value, since they attend very similar parts and thus have less structural relation.

Other techniques. We study other techniques related to our proposed YOGO, including the choices of the number of sub-regions and the number of grouping points, the sampling and grouping methods, and the pooling strategies.

Regarding the choices of the number of sub-regions and the number of grouping points, we conduct experiments by fixing the number of grouping points to 64 and increasing the number of sub-regions under both farthest point sampling (FPS) and random sampling, as well as by fixing the number of sub-regions to 32 and increasing the number of grouping points under both KNN and ball query [25], as presented in Fig. 6. We find that, for both sampling methods, the performance improves as the number of sub-regions increases. Besides, even random sampling has only a relatively small margin to farthest point sampling, which may indicate that our proposed YOGO is robust to different sampling methods. For both grouping methods, the performance also consistently improves as the number of grouping points becomes larger. Besides, there is little performance gap between ball query and KNN, which indicates that our method is robust to different grouping methods.

Regarding the choice of pooling operation, we study the difference between average pooling and max pooling, as shown in Table V. The experiment shows that max pooling achieves higher performance than average pooling. A likely reason is that max pooling better aggregates the most important information in a sub-region, which matters for our proposed YOGO since it depends heavily on the relations between these aggregated features.

Fig. 6: The left figure shows the effect of the number of sub-regions under farthest point sampling and random sampling, and the right figure shows the effect of the number of grouping points under KNN grouping and ball query grouping.

Method                Mean IoU
YOGO with max-pool    85.2
YOGO with avg-pool    85.0

TABLE V: Ablation study for different pooling strategies.

V Discussion and conclusion

In this paper, we propose a novel, simple, and elegant framework, YOGO (You Only Group Once), for efficient point-cloud processing. Different from previous point-based methods, YOGO divides the point-cloud into a few sub-regions by applying the sampling and grouping operations only once, leverages an efficient token representation, models token relations, and projects the tokens back to obtain strong point-wise features. YOGO achieves at least 3x speedup over popular point-based baselines. Meanwhile, YOGO delivers competitive performance on semantic segmentation on the ShapeNetParts and S3DIS datasets, and on classification on the ModelNet40 dataset. Moreover, YOGO is robust to different sampling and grouping strategies, even to random sampling. It is worth noting that YOGO is a new framework with great potential for point-cloud processing. In particular, it can be improved in terms of how to better sample and group sub-regions, how to obtain better tokens, and how to better combine tokens and point-wise features. We hope that it can inspire the research community to further develop this framework.


  • [1] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese (2017) Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105. Cited by: §II-A, §II-B, §IV-A, §IV.
  • [2] Y. Ben-Shabat, M. Lindenbaum, and A. Fischer (2018) 3dmfv: three-dimensional point cloud classification in real-time using convolutional neural networks. IEEE Robotics and Automation Letters 3 (4), pp. 3145–3152. Cited by: §II-A.
  • [3] A. Boulch, B. Le Saux, and N. Audebert (2017) Unstructured point cloud semantic labeling using deep segmentation networks. 3DOR 2, pp. 7. Cited by: §II-A.
  • [4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: §II-B.
  • [5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §II-A, §II-B, §IV-A, §IV.
  • [6] C. Choy, J. Gwak, and S. Savarese (2019) 4d spatio-temporal convnets: minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3075–3084. Cited by: §II-A.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §II-B.
  • [8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §II-B, TABLE III.
  • [9] P. Hermosilla, T. Ritschel, P. Vázquez, À. Vinacua, and T. Ropinski (2018) Monte carlo convolution for learning on non-uniformly sampled point clouds. ACM Transactions on Graphics (TOG) 37 (6), pp. 1–12. Cited by: TABLE III.
  • [10] B. Hua, M. Tran, and S. Yeung (2018) Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 984–993. Cited by: §II-A.
  • [11] Q. Huang, W. Wang, and U. Neumann (2018) Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2626–2635. Cited by: TABLE I, TABLE II.
  • [12] A. Jaegle, F. Gimeno, A. Brock, A. Zisserman, O. Vinyals, and J. Carreira (2021) Perceiver: general perception with iterative attention. arXiv preprint arXiv:2103.03206. Cited by: TABLE III.
  • [13] A. Komarichev, Z. Zhong, and J. Hua (2019) A-cnn: annularly convolutional neural networks on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7421–7430. Cited by: §II-A.
  • [14] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) Pointpillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697–12705. Cited by: §II-A.
  • [15] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, and M. Felsberg (2017) Deep projective 3d semantic segmentation. In International Conference on Computer Analysis of Images and Patterns, pp. 95–107. Cited by: §II-A.
  • [16] J. Li, B. M. Chen, and G. H. Lee (2018) So-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406. Cited by: §II-A, TABLE I, TABLE III.
  • [17] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen (2018) PointCNN: convolution on X-transformed points. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 828–838. Cited by: §I, §II-A, §III-B, §IV-A, §IV-B, TABLE I, TABLE II, TABLE III.
  • [18] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky (2015) Sparse convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 806–814. Cited by: §II-A.
  • [19] Y. Liu, B. Fan, G. Meng, J. Lu, S. Xiang, and C. Pan (2019) Densepoint: learning densely contextual representation for efficient point cloud processing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5239–5248. Cited by: §I, §II-A.
  • [20] Z. Liu, H. Hu, Y. Cao, Z. Zhang, and X. Tong (2020) A closer look at local aggregation operators in point cloud analysis. In European Conference on Computer Vision, pp. 326–342. Cited by: §II-A, §III-C.
  • [21] Z. Liu, H. Tang, Y. Lin, and S. Han (2019) Point-voxel cnn for efficient 3d deep learning. arXiv preprint arXiv:1907.03739. Cited by: §II-A.
  • [22] D. Maturana and S. Scherer (2015) Voxnet: a 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. Cited by: §II-A.
  • [23] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss (2019) Rangenet++: fast and accurate lidar semantic segmentation. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4213–4220. Cited by: §II-A.
  • [24] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3d classification and segmentation. In CVPR, Cited by: Fig. 1, §I, §I, §I, §II-A, §III-C, §III-C, §IV-C, TABLE I, TABLE II, TABLE III.
  • [25] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NIPS, Cited by: Fig. 1, §I, §I, §I, §II-A, §III-B, §IV-B, §IV-C, TABLE I, TABLE II, TABLE III.
  • [26] X. Roynard, J. Deschaud, and F. Goulette (2018) Classification of point cloud scenes with multiscale voxel deep network. arXiv preprint arXiv:1804.03583. Cited by: §II-A.
  • [27] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser (2004) The princeton shape benchmark. In Proceedings Shape Modeling Applications, 2004., pp. 167–178. Cited by: §II-B, §IV-A.
  • [28] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M. Yang, and J. Kautz (2018) Splatnet: sparse lattice networks for point cloud processing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2530–2539. Cited by: §II-A, TABLE I.
  • [29] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller (2015) Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pp. 945–953. Cited by: §II-A.
  • [30] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang, and P. Luo (2020) SparseR-CNN: end-to-end object detection with learnable proposals. arXiv preprint arXiv:2011.12450. Cited by: §II-B.
  • [31] H. Tang, Z. Liu, S. Zhao, Y. Lin, J. Lin, H. Wang, and S. Han (2020) Searching efficient 3d architectures with sparse point-voxel convolution. In European Conference on Computer Vision, pp. 685–702. Cited by: §II-A.
  • [32] M. Tatarchenko, J. Park, V. Koltun, and Q. Zhou (2018) Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3887–3896. Cited by: TABLE II.
  • [33] H. Thomas, C. R. Qi, J. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas (2019) KPConv: flexible and deformable convolution for point clouds. Proceedings of the IEEE International Conference on Computer Vision. Cited by: §I.
  • [34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §I, §II-B, §III-C, TABLE III.
  • [35] P. Wang, Y. Liu, Y. Guo, C. Sun, and X. Tong (2017) O-cnn: octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–11. Cited by: §II-A.
  • [36] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: TABLE I, TABLE II, TABLE III.
  • [37] Z. Wang, W. Zhan, and M. Tomizuka (2018) Fusing bird’s eye view lidar point cloud and front view camera image for 3d object detection. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1–6. Cited by: §II-A.
  • [38] B. Wu, A. Wan, X. Yue, and K. Keutzer (2018) Squeezeseg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In ICRA, Cited by: §II-A.
  • [39] B. Wu, C. Xu, X. Dai, A. Wan, P. Zhang, M. Tomizuka, K. Keutzer, and P. Vajda (2020) Visual transformers: token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677. Cited by: §I, §II-B.
  • [40] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer (2019) SqueezeSegV2: improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In ICRA, Cited by: §II-A.
  • [41] C. Xu, B. Wu, Z. Wang, W. Zhan, P. Vajda, K. Keutzer, and M. Tomizuka (2020) Squeezesegv3: spatially-adaptive convolution for efficient point-cloud segmentation. In European Conference on Computer Vision, pp. 1–19. Cited by: §II-A.
  • [42] Y. Xu (2019) Pointnet_Pointnet2_pytorch. GitHub. Cited by: §IV-B, TABLE I, TABLE II.
  • [43] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao (2018) Spidercnn: deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 87–102. Cited by: TABLE I, TABLE III.
  • [44] Y. Yan, Y. Mao, and B. Li (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §II-A.
  • [45] B. Yang, W. Luo, and R. Urtasun (2018) Pixor: real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 7652–7660. Cited by: §II-A.
  • [46] L. Yi, H. Su, X. Guo, and L. J. Guibas (2017) Syncspeccnn: synchronized spectral cnn for 3d shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2282–2290. Cited by: TABLE I.
  • [47] H. Zhou, X. Zhu, X. Song, Y. Ma, Z. Wang, H. Li, and D. Lin (2020) Cylinder3d: an effective 3d framework for driving-scene lidar semantic segmentation. arXiv preprint arXiv:2008.01550. Cited by: §II-A.
  • [48] Z. Zhu, M. Xu, S. Bai, T. Huang, and X. Bai (2019) Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 593–602. Cited by: §II-B.