
COLA: COarse LAbel pre-training for 3D semantic segmentation of sparse LiDAR datasets

Transfer learning is a proven technique in 2D computer vision to leverage the large amount of data available and achieve high performance with datasets limited in size due to the cost of acquisition or annotation. In 3D, annotation is known to be a costly task; nevertheless, transfer learning methods have only recently been investigated. Unsupervised pre-training has been heavily favored, as no very large annotated dataset is available. In this work, we tackle the case of real-time 3D semantic segmentation of sparse outdoor LiDAR scans. Such datasets have been on the rise, but with different label sets even for the same task. We propose an intermediate-level label set called the coarse labels, which allows all the available data to be leveraged without any manual labeling. This way, we have access to a larger dataset, alongside a simpler semantic segmentation task. With it, we introduce a new pre-training task: coarse label pre-training, also called COLA. We thoroughly analyze the impact of COLA on various datasets and architectures and show that it yields a noticeable performance improvement, especially when only a small dataset is available for the finetuning task.





I Introduction

LiDAR-based deep learning has been gaining a lot of traction in recent years, especially for autonomous driving. Thus, an increasing amount of data has been released for scene understanding, such as KITTI [1] and nuScenes [2] for detection or SemanticKITTI [3] for semantic segmentation.

So far, 3D semantic segmentation methods have focused on architecture design and tuning rather than on data-driven approaches to improve performance. Some works have started to emerge on domain adaptation and transfer learning. Put simply, the first kind of method attempts to adapt a model trained on a specific dataset so that it works well on another dataset. The second kind focuses on leveraging a large amount of available data to improve performance on a specific dataset. Both are key in the domain of autonomous driving, as there is a large variety of setups due to changes in country, urban density or season, and the available data cannot cover every case.

Fig. 1: COLA pipeline. (a) SemanticKITTI coarse labels. (b) SemanticPOSS fine labels.

In this work, we aim to partially bridge the gap between these two concepts: we propose a transfer learning method trained on a large variety of data topologies and scene types. This way, our method is expected to work on any finetuning data.

Thus far, most of the work done in 3D transfer learning has focused on dense data, thanks to contrastive learning. Contrastive learning was introduced for unsupervised pre-training for image description in [4] and was expanded to work on 3D data, such as in PointContrast [5] and Info3D [6]. The idea behind contrastive learning is the ability to learn to describe instances, such as being able to tell whether two 3D points are the same even if they belong to two different scans. The most promising work is Contrastive Scene Context (CSC) [7], which incorporates a spatial partition of the scenes before applying a contrastive learning pipeline. However, little research has been conducted on leveraging existing and available autonomous driving data to improve performance when the amount of annotation is limited, especially as these data have a very different topology from dense point clouds.

In this work, we propose a novel approach to leverage existing datasets by circumventing the main obstacle to using them together: their different label sets. We introduce an intermediate-level label set called coarse labels that can be applied to any autonomous driving semantic segmentation dataset. The mapping from existing available datasets to the proposed label set can be found in our public GitHub repository. We demonstrate that using this intermediate label set enables us to use all datasets together to pre-train neural architectures and improve performance after finetuning. Furthermore, we apply contrastive learning inspired by PointContrast [5] to naturally sparse autonomous driving data, a method that we call SPC, for Sparse PointContrast. This pre-training, alongside CSC, will be used to benchmark our approach.

In practice, our contributions are as follows:

  • Introducing a label set specific to semantic segmentation for autonomous driving that is generic enough to leverage every dataset at once.

  • Thoroughly experimenting with a pre-training method that leverages this new label set, called COLA (COarse LAbel pre-training) and shown in Figure 1, and comparing it to contrastive pre-training on dense and sparse data.

II Related Works

II-A 3D Deep Learning

With the increase in available data alongside the superhuman performance of 2D deep learning [8][9], 3D deep learning has been receiving growing attention in the last decade. However, contrary to typical images, 3D data are fundamentally unordered, and thus, typical convolutions cannot be applied directly.

Several types of approaches have been developed, either using only permutation-invariant operations [10][11][12], redefining the convolution [13][14][15][16], seeing point clouds as a graph [17][18][19] or voxelizing the point cloud and applying typical convolutions [20][21].

While these approaches performed very well on shape classification and more refined tasks for dense point clouds, new methods had to be designed to tackle autonomous driving requirements. Indeed, alongside SemanticKITTI [3], new challenges have arisen, notably the need for real-time computation and the sparser nature of the data. Most of the older methods did not fit these constraints, and two new kinds of methods dominate the field. The first kind is projection-based: it projects the LiDAR data into a 2D image, which allows for very fast computation, using a range-based image [22][23], a bird's-eye-view image [24] or a combination of the two [25]. The second kind relies on sparse convolution-based architectures, as introduced by MinkowskiNet [26] under the name Sparse Residual U-Net (SRU-Net). They have been used as the backbone for more refined methods [26], [27], [28], [29]. These last methods are the best performing on the different benchmarks.

Through this quick overview, we can see that SRU-Net [26] has been introduced as a reference in the field of 3D autonomous driving semantic segmentation and thus will be our reference architecture.

II-B 3D Transfer Learning

Most older works on input representation, which could make for good pre-training tasks, focused on part-based and shape classification [30][31][32][6][33] or on registration [34], and are now too unrelated to current autonomous driving scene understanding to serve as a decent baseline.

Recently, a proper transfer learning approach for LiDAR-based scene understanding was proposed with PointContrast [5]. It follows a fairly standard contrastive learning approach, in which the instance level used in the loss computation is the voxel. The matching between voxels of different scans is known because the approach uses different views of the same scene in ScanNet, for which the registration between scans is available. For all of its experiments, it uses an SRU-Net architecture and highlights the performance gained by the proposed approach.
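To make the contrastive objective concrete, the following is a minimal sketch of an InfoNCE-style loss over matched feature pairs, written in plain Python. It is not the PointContrast implementation: the function name `info_nce`, the temperature value and the use of every other matched pair as a negative are assumptions made for illustration.

```python
import math

def info_nce(view_a, view_b, temperature=1.0):
    """InfoNCE-style loss where view_a[i] and view_b[i] describe the same voxel.

    Every other feature in view_b acts as a negative for anchor i.
    The temperature value is an illustrative assumption.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        # Dot-product similarity between anchor i and each candidate in view_b.
        logits = [
            sum(x * y for x, y in zip(view_a[i], view_b[k])) / temperature
            for k in range(n)
        ]
        # Subtract the maximum before exponentiating for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n

# Two toy voxels with 2D features: correct matches versus swapped matches.
matched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is lower when the features of corresponding voxels agree across the two views, which is the behavior a contrastive pre-training task rewards.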

CSC[7] extends PointContrast and refines the unsupervised pre-training thanks to a geometric partitioning of the scans. It also completes the study of its predecessor by looking into performance gain when the data or annotations are limited. DepthContrast[35] takes a different approach to the contrastive learning pre-training by leveraging depth images; this method does not require any registration between two scenes and is fully unsupervised.

These works relied on the availability of a large quantity of dense point clouds and RGB-D scans. In contrast, ONCE [36] released a very large sparse outdoor dataset to further investigate pre-training tasks for 3D deep learning. However, this approach was used only for detection.

The aforementioned transfer learning initiatives focused on learning a low-level task to transfer to a high-level task. Here we propose to go from a high-level task to a high-level task in order to fully exploit available datasets.

III Coarse Labels

Fig. 2: The proposed coarse labels and their mapping to SemanticKITTI fine labels.

Since the release of SemanticKITTI in 2019, other autonomous driving semantic segmentation LiDAR datasets have become available. We focus on five different datasets, which are the most important at the time of writing due to their variety in size, scene types and acquisition hardware. They are presented in Table I. As seen in Figure 3, restricting training to a single dataset would leave a large amount of data unused. However, thus far, it has been impossible to use them all at once, as there is no unique fine label set.

| Dataset | Size | Scene type | # Fine labels |
|---|---|---|---|
| KITTI-360 [37] | 80K | Suburbs | 27 |
| nuScenes [2] | 35K | Urban | 16 |
| SemanticKITTI [3] | 23K | Suburbs | 19 |
| PandaSet [38] | 6K | Urban & Road | 37 |
| SemanticPOSS [39] | 3K | Campus | 13 |

TABLE I: Summary of the datasets used in the different experiments. The size corresponds to the number of labeled scans. Fine labels correspond to the labels used at inference time.
Fig. 3: Visualization of the proportion of each data with regard to the total available data. KITTI-360 (blue) 42%, nuScenes (red) 30%, SemanticKITTI (green) 20%, PandaSet (purple) 5% and SemanticPOSS (orange) 3%.

Despite this obvious obstacle, a study of the datasets shows that labels are organized in a tree-like structure with more or less refined labels. Moreover, higher levels of trees are more uniform across datasets.

Based on this observation, we propose a set of coarse labels such that we can easily map fine labels from any dataset to this one. These coarse labels define categories that retain information relevant to the end task, are consensual across all datasets and display little ambiguity. The coarse labels we propose are as follows:

  • Driveable ground covers areas where cars are expected to be driving.

  • Other ground corresponds to other large planar surfaces, where pedestrians can be found.

  • Structure corresponds to fixed human-made structures, that can be hard to subdivide into smaller groups.

  • Vehicles are the typical road users.

  • Nature corresponds to the vegetation-like components of the scene.

  • Living being.

  • Dynamic objects are objects that can typically move, such as strollers.

  • Static objects are objects that are fixed, such as poles.

An illustration of the labels and the mapping of the fine labels from SemanticKITTI to the proposed label set is given in Figure 2. All mappings are available in our public GitHub repository.

Mapping a fine label set to the coarse label set is a simple task. Nonetheless, when mapping several datasets together to the coarse labels, some issues arise. As different datasets have different levels of label refinement, sometimes a non-intuitive mapping must be made. For instance, in some datasets, wheelchair is a label of its own, and in others it is not. We can safely assume that in the label sets where wheelchair is not a label, it is labeled as a person. Thus, even if a wheelchair can be seen as a dynamic object, it makes sense to map it to living being to avoid a discrepancy in the labeling of wheelchairs from one scene to another.
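In practice, such a mapping amounts to a lookup table from fine label names to coarse label names. The sketch below assumes a hypothetical excerpt of one fine label set; the fine label names and the helper `to_coarse_ids` are illustrative and do not reproduce the published mappings.

```python
# Coarse label set proposed in this section.
COARSE_LABELS = [
    "driveable ground", "other ground", "structure", "vehicle",
    "nature", "living being", "dynamic object", "static object",
]

# Hypothetical excerpt of a fine-to-coarse mapping for one dataset.
# The fine label names are illustrative assumptions, not the published mapping.
FINE_TO_COARSE = {
    "road": "driveable ground",
    "sidewalk": "other ground",
    "building": "structure",
    "car": "vehicle",
    "vegetation": "nature",
    "person": "living being",
    "wheelchair": "living being",  # consistent with datasets lacking this label
    "stroller": "dynamic object",
    "pole": "static object",
}

def to_coarse_ids(fine_names):
    """Map a sequence of per-point fine label names to coarse label indices."""
    coarse_index = {name: i for i, name in enumerate(COARSE_LABELS)}
    return [coarse_index[FINE_TO_COARSE[name]] for name in fine_names]

points = ["road", "car", "wheelchair", "pole"]
coarse_ids = to_coarse_ids(points)
```

Keeping one such table per dataset means the same eight coarse indices can be used as training targets regardless of which dataset a scan comes from.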

Those labels can be used to directly perform coarse segmentation on any dataset or on all the data at once. In addition to providing a common label set, the coarse labels are much less imbalanced than the fine labels, as seen in Figure 4, and thus make for a simpler task. As an example, in SemanticKITTI, the ratio between the most and least frequent labels is around 7000 for fine labels and around 600 for coarse labels.
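The imbalance figure quoted here is the ratio between the point counts of the most and least frequent labels. A minimal sketch of that computation follows; the label counts are toy values chosen for illustration and are not SemanticKITTI statistics.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio between the counts of the most and least frequent labels."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy per-point labels (made up for illustration).
fine = ["road"] * 60 + ["car"] * 30 + ["truck"] * 6 + ["bicycle"] * 4

# Merging the vehicle-like fine labels into one coarse label reduces the imbalance.
merge = {
    "road": "driveable ground",
    "car": "vehicle",
    "truck": "vehicle",
    "bicycle": "vehicle",
}
coarse = [merge[name] for name in fine]

fine_ratio = imbalance_ratio(fine)      # 60 / 4
coarse_ratio = imbalance_ratio(coarse)  # 60 / 40
```

Rare fine labels are absorbed into larger coarse categories, which is why the ratio shrinks after the mapping.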

Fig. 4: Label distribution of the SemanticKITTI coarse labels and the SemanticKITTI fine labels.

This coarse segmentation task, performed on all the data at once, is the proposed COarse LAbel (COLA) pre-training presented in the introduction. To reuse such a pre-trained model, only the last layer of the model needs to be replaced for finetuning, to account for the new number of labels (see Figure 1). Due to the size of the architecture, no layers need to be frozen at finetuning time. We expect features learned during the training on coarse segmentation to be highly valuable for finetuning, allowing for faster training and better results especially when the finetuning dataset is small.
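The head replacement can be pictured as filtering a checkpoint: every pre-trained parameter is kept except the final classification layer, which is re-created with the number of fine labels. The sketch below uses plain Python dictionaries in place of a real framework checkpoint; the key names `backbone.` and `head.` and the helpers `make_head` and `swap_head` are hypothetical.

```python
import random

def make_head(in_features, num_classes, seed=0):
    """Freshly initialized linear head stored as plain nested lists."""
    rng = random.Random(seed)
    weight = [
        [rng.uniform(-0.1, 0.1) for _ in range(in_features)]
        for _ in range(num_classes)
    ]
    bias = [0.0] * num_classes
    return {"head.weight": weight, "head.bias": bias}

def swap_head(pretrained, in_features, num_fine_classes):
    """Keep every pre-trained parameter except the head, then attach a new head."""
    finetune = {k: v for k, v in pretrained.items() if not k.startswith("head.")}
    finetune.update(make_head(in_features, num_fine_classes))
    return finetune

# Toy "checkpoint": one backbone tensor plus an 8-class coarse head.
pretrained = {"backbone.conv1.weight": [[0.5, -0.2], [0.1, 0.3]]}
pretrained.update(make_head(in_features=2, num_classes=8))

# 19 fine labels, as listed for SemanticKITTI in Table I.
finetune = swap_head(pretrained, in_features=2, num_fine_classes=19)
```

All parameters, including the reused backbone, remain trainable afterwards, in line with the choice not to freeze any layer.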

IV Experiments

In this section, we describe the experimental setup used for evaluating the performance of the proposed coarse label pre-training (COLA). For all our experiments, we split the data into two categories: the target data, which is the data we use for finetuning and on which we compute performance metrics, and the pre-training data. The target data are split into training, validation and test sets, whereas the pre-training data are only split into training and validation sets. We identify four different setups, shown in Table II. The size ratios in Table II differ from those expected from Table I because a test set is carved out of the available labeled scans for the experiments. KITTI-360 is not used for pre-training in the case of the SemanticKITTI target, as there is a significant geographic overlap between their scenes.

The four cases we chose correspond to experimenting on the two reference datasets in the community, nuScenes and SemanticKITTI, and on two limited data availability setups. For the last two cases, we introduce a homemade set of datasets called partial nuScenes, which corresponds to subsets of nuScenes, in order to have access to a wider range of dataset sizes; otherwise, we would be limited to only SemanticPOSS.

In each case, we compare four pre-training procedures: no pre-train, dense contrastive pre-trained with CSC[7], sparse contrastive pre-trained with SPC and COLA pre-trained. The dense contrastive experiment is not expected to perform well as the data used in the pre-training are dense, contrary to autonomous driving point clouds, which are sparse.

As highlighted in Table I, target datasets present different scene types, and furthermore nuScenes and SemanticKITTI are acquired with sensors of very different resolutions. This way, we can evaluate the methods in many settings.

For a fair comparison, we align with the neural architecture used in [7], as we agree that sparse convolutions have proven to be the leading architecture design in recent years for autonomous driving semantic segmentation.

Furthermore, we also experiment on another architecture, SPVCNN[27], as we focus on semantic segmentation, and this architecture reaches state-of-the-art performance on the SemanticKITTI benchmark. Following unpromising results with SRU-Net (as seen below in Section V) from the contrastive pre-trainings alongside their heavy computation costs, we do not perform the dense and sparse contrastive pre-train experiments on the SPVCNN architecture.

To compare methods, we compute the mean Intersection over Union (mIoU) on the validation or test set. All experiments are performed on NVIDIA GeForce RTX 3090 GPUs.
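For reference, the mIoU averages, over classes, the ratio between the number of points correctly assigned to a class and the number of points either predicted or labeled as that class. The following is a minimal sketch on flat label lists; skipping classes that are absent from both prediction and ground truth is an assumption, and benchmark implementations may handle ignored labels differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(intersection / union)
    return sum(ious) / len(ious)

# Toy example with three classes and six points.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)  # (1/3 + 2/3 + 1/2) / 3
```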

| Datasets for pre-training | # Scans for pre-training | Target | Size ratio |
|---|---|---|---|
| SemanticKITTI - KITTI-360 - SemanticPOSS - PandaSet | 110K | nuScenes | 25% |
| SemanticKITTI - KITTI-360 - PandaSet - nuScenes | 140K | SemanticPOSS | 2% |
| SemanticPOSS - nuScenes | 45K | SemanticKITTI | 33% |
| SemanticKITTI - KITTI-360 - SemanticPOSS - PandaSet | | | |

TABLE II: Datasets used for pre-training depending on the target and the size ratio between the target set and the pre-training set.

IV-A Sparse Contrastive Pre-training

As mentioned, we want to benchmark our method with regard to the standard approaches developed in 3D transfer learning. Thus, we compare our COLA pre-training with CSC [7]. But CSC was trained on dense data (ScanNetV2[40]), and we try to use the learned weights on a very different type of data: sparse point clouds from LiDAR scans. Here we use KITTI-360, SemanticKITTI and nuScenes.

That is why we train our own contrastive pre-training by using the PointContrast[5] methodology and code but apply it to the sparse data. This way we can compare contrastive pre-training and COLA pre-training assuming the same amount of data and same type of input. In practice, we use a total of around 100K scans for the sparse pre-training. In comparison PointContrast leveraged ScanNetV2, with 870K point cloud pairs, which represent nine times more pre-training data.

The architecture is SRU-Net, and all the parameters are the same as those presented in PointContrast, as they were fit for sparse data as well. As stated in the introduction, the proposed pre-training is called SPC.

In total, the SPC pre-training took the equivalent of 50 GPU days of training, which is five times slower than the COLA pre-training (see below).

IV-B Networks and Parameters

IV-B1 SRU-Net

SRU-Net was introduced in [26] and leverages a U-Net architecture along with the sparse convolutions. This method has been established as a staple and is used as a backbone on many state-of-the-art architectures as highlighted in Section II. We follow the exact same architecture as the one used in [7] to avoid having to compute the pre-training ourselves.

Regarding the implementation for the COLA pre-training, we used the SGD optimizer with an initial learning rate of 0.4, a momentum of 0.9 and a cosine annealing scheduler. We used a batch size of 48 and trained for 10 epochs.

Regarding the implementation for finetuning, we use the same parameters whether the model has been pre-trained or not. We used the SGD optimizer with an initial learning rate of 0.8, a momentum of 0.9 and a cosine annealing scheduler. We used a batch size of 36 and trained for 30 epochs. There are more epochs for finetuning as the dataset is much smaller.
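As an illustration of the scheduler, cosine annealing decays the learning rate from its initial value along half a cosine period. The sketch below assumes a minimum learning rate of zero, no restarts and one update per epoch; none of these details are stated in the text, and the function name `cosine_annealing_lr` is illustrative.

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate at `step` for cosine annealing without restarts."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Finetuning setting quoted above: initial learning rate 0.8 over 30 epochs.
schedule = [cosine_annealing_lr(epoch, 30, base_lr=0.8) for epoch in range(31)]
```

Under these assumptions the rate starts at 0.8, passes through half of that value at mid-training and approaches zero at the last epoch.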

In both cases, we used a mixed Lovasz loss function, a combination of the Lovasz loss [41] and cross-entropy. We follow the same data augmentation as [26]. We apply model selection based on the best validation mIoU.

The batch size chosen every time is the maximum possible for the available hardware. The discrepancy between finetuning and pre-training comes from the difference in scan size between the various datasets.

The other parameters are close to the default parameters used for the finetuning of SRU-Net in the PointContrast paper. Selecting a parameter set not specifically tailored to the dataset is a choice, in order to highlight that the proposed method works without needing a tedious optimization of the parameters.

IV-B2 SPVCNN

SPVCNN was introduced in [27]. It is an SRU-Net architecture with a PointNet-like architecture running in parallel. The point-based branch is assumed to be able to gather information at a finer resolution than the voxelized branch. We chose to follow the architecture the original authors provide in their GitHub repository for SemanticKITTI.

Regarding the implementation for the COLA pre-training and finetuning, we used the SGD optimizer with an initial learning rate of 0.24, a momentum of 0.9 and a cosine warmup scheduler. We used a batch size of 16 and trained for 15 epochs. The mixed Lovasz loss function was used as previously. We follow the same data augmentation as [26]. We applied model selection based on the best validation mIoU.

These selected parameters correspond to the default proposed in the SPVCNN repository to train with SemanticKITTI. It means that contrary to our parameter choice for SRU-Net, here it is expected to perform very well on SemanticKITTI. We use this specific set of parameters to finetune on nuScenes and SemanticPOSS as well.

IV-C COLA (Coarse Label) Pre-training

As described in Section III, we perform a novel pre-training based on a simplified supervised semantic segmentation task for autonomous driving. This way, the model learns representations tailored to the specific task and the specific data type. Due to the size of the training set, the training runs are long, and thus we did not spend much time tuning the hyperparameters. We demonstrate only the ability of the architecture to learn on the massive dataset, without digging too deep into the exact performance, as it is not the target. The validation mIoU reaches between

and %, depending on the target dataset, which we deem reasonable enough.

At training time, batch elements are selected randomly across all sub-datasets that are part of the coarse label dataset.
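One way to picture this sampling is to pool the scans of all sub-datasets, shuffle them, and cut the shuffled pool into batches. The sketch below assumes uniform shuffling over the pooled scans, which the text does not specify; the function `pooled_batches` and the toy dataset sizes are illustrative.

```python
import random

def pooled_batches(dataset_sizes, batch_size, seed=0):
    """Yield batches of (dataset_name, scan_index) drawn from the pooled scans."""
    pool = [(name, i) for name, size in dataset_sizes.items() for i in range(size)]
    rng = random.Random(seed)
    rng.shuffle(pool)  # one epoch: every scan is visited once, in random order
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]

# Toy sizes (made up); the real sub-datasets contain thousands of scans.
sizes = {"KITTI-360": 8, "SemanticKITTI": 4, "PandaSet": 2}
batches = list(pooled_batches(sizes, batch_size=4))
```

With this scheme a single batch can mix scans from sensors with different resolutions, and larger sub-datasets contribute proportionally more scans.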

In practice, it took the equivalent of ten GPU days of training time, resulting in a much shorter pre-training procedure than SPC.

IV-D Partial nuScenes

We introduced partial nuScenes previously. Partial nuScenes is a subset of nuScenes, based on the number of scenes. We propose three levels: 10%, 25% and 50%, which represent 95, 235 and 470 scenes, respectively, for training and validation, which are split in the same proportion as the full nuScenes 70/30. The test set is the same one as the full nuScenes test set and represents 230 scenes.
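A scene-level subset of this kind can be sketched as follows: a percentage of the scenes is drawn, then split 70/30 into training and validation. Random selection of the scenes, the function `partial_split` and the scene identifiers are assumptions made for illustration; the text only states that the subsets are defined by number of scenes.

```python
import random

def partial_split(scene_ids, percent, train_percent=70, seed=0):
    """Sample a percentage of scenes, then split them into training and validation."""
    rng = random.Random(seed)
    n_keep = len(scene_ids) * percent // 100
    kept = rng.sample(scene_ids, n_keep)
    n_train = n_keep * train_percent // 100
    return kept[:n_train], kept[n_train:]

# Hypothetical pool of 100 scene identifiers.
scenes = [f"scene-{i:04d}" for i in range(100)]
train, val = partial_split(scenes, percent=50)
```

Sampling whole scenes rather than individual scans keeps consecutive, highly correlated scans on the same side of the split.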

V Results

In this section, we show the results from the thorough experiments described in the previous sections. They are divided into two parts. First, we look at results on full datasets (SemanticKITTI, SemanticPOSS and nuScenes) for each network architecture, and then we investigate more precisely the case of limited data with partial nuScenes. In each case, we show the mIoU over a test set, which we extract from the training set and which represents between 10 and 20% of the available data. We compare all COLA results to the no pre-training case.

V-A Results on Full Datasets

| Target dataset | No pre-training | CSC | SPC | COLA (Ours) |
|---|---|---|---|---|
| nuScenes | 67.2 | 68.32 | 66.3 | 69.3 (+2.1) |
| SemanticKITTI | 50.5 | 50.7 | 48.9 | 52.4 (+1.9) |
| SemanticPOSS | 55 | 55.5 | 54.4 | 55.8 (+0.8) |

TABLE III: Test set mIoU for each target dataset. No pre-training, dense contrastive pre-training (CSC), sparse contrastive pre-training (SPC) and COLA pre-training with the SRU-Net architecture.
| Target dataset | No pre-training | COLA (Ours) |
|---|---|---|
| nuScenes | 65.9 | 66.2 (+0.3) |
| SemanticKITTI | 57.7 | 58.4 (+0.7) |
| SemanticPOSS | 50.1 | 54.5 (+4.4) |

TABLE IV: Test set mIoU for each target dataset. No pre-training and COLA pre-training with the SPVCNN architecture.
| % of total scenes | # of scans for training | SPVCNN: No pre-training | SPVCNN: COLA | SRU-Net: No pre-training | SRU-Net: CSC | SRU-Net: SPC | SRU-Net: COLA |
|---|---|---|---|---|---|---|---|
| 10% | 1900 | 44.9 | 51.3 (+6.4) | 46.2 | 42.4 | 46.9 | 48.4 (+2.2) |
| 25% | 4700 | 52.5 | 55.6 (+3.1) | 58.0 | 57.5 | 55.3 | 60.0 (+2.0) |
| 50% | 9400 | 56.8 | 57.2 (+0.4) | 62.4 | 62.3 | 62.3 | 65.0 (+2.6) |
| 100% | 16000 | 65.9 | 66.2 (+0.3) | 67.2 | 68.3 | 66.3 | 69.3 (+2.1) |

TABLE VI: Test set mIoU for each partial nuScenes. No pre-training, dense contrastive pre-training (CSC), sparse contrastive pre-training (SPC) or COLA pre-training with the SPVCNN and the SRU-Net architectures.
Fig. 5: Example of results on SemanticKITTI after finetuning with SRU-Net. From left to right: (a) no pre-training, (b) dense contrastive pre-training, (c) sparse contrastive pre-training, (d) COLA. In blue, points with correct semantic segmentation. In red, errors.

We separated the results based on the architectures as they displayed slightly different final results.

V-A1 Results for SRU-Net

Overall, the models pre-trained with the proposed COLA pre-training show a significant improvement compared to the other methods, with a gain of up to +2.1 mIoU over no pre-training, as seen in Table III.

Furthermore, the CSC and SPC pre-trainings improve the test results only marginally, if at all. We conclude that CSC can only be applied to dense point clouds and is less pertinent in the case of sparse point clouds, which are typical of the autonomous driving use case. For SPC there are two possible explanations: either not enough data were used to perform the pre-training, which would confirm the observation by [36], or contrastive learning is not suited as a pre-training task for sparse semantic segmentation.

Although the results from SRU-Net are very promising, we further investigate the COLA method on a more refined architecture.

V-A2 Results for SPVCNN

SPVCNN is an architecture that reached the state of the art on SemanticKITTI. As described in Section IV, we did not investigate the effect of contrastive pre-trainings due to their very expensive computation costs and their unsatisfactory results on SRU-Net.

The results, found in Table IV, are different from those for SRU-Net. There is a minor performance improvement for SemanticKITTI and nuScenes but a massive improvement in the case of SemanticPOSS.

SPVCNN leverages a large dataset more efficiently but a small dataset less effectively and thus, the improved initialization is very helpful here.

V-A3 Analysis of the first epochs

| Target dataset | Iteration # | No pre-training | CSC | SPC | COLA |
|---|---|---|---|---|---|
| nuScenes | 1 | 33.7 | 29.1 | 36.6 | 39.6 |
| nuScenes | 6 | 50.0 | 50.9 | 51.6 | 55.0 |
| nuScenes | 11 | 52.5 | 57.9 | 52.4 | 60.0 |
| SemanticKITTI | 1 | 27.8 | 25.5 | 32.5 | 25.8 |
| SemanticKITTI | 6 | 43.8 | 45.9 | 43.8 | 43.2 |
| SemanticKITTI | 11 | 51.2 | 49.6 | 48.4 | 51.9 |
| SemanticPOSS | 1 | 7.2 | 12.3 | 10.8 | 31.5 |
| SemanticPOSS | 6 | 29.9 | 29.9 | 48.9 | 61.0 |
| SemanticPOSS | 11 | 56.0 | 57.3 | 58.4 | 65.2 |

TABLE V: Validation mIoU at the finetuning iterations # 1, 6 and 11 for each target dataset depending on the pre-training. No pre-training, dense contrastive pre-training (CSC), sparse contrastive pre-training (SPC) and COLA pre-training with the SRU-Net architecture.

Although the final test mIoUs differ depending on the architecture, the training curves can be analyzed together. The validation mIoUs over a few of the first epochs for SRU-Net can be found in Table V. We observe two trends:

For nuScenes and SemanticPOSS, for a low number of iterations, the models pre-trained thanks to the COLA method show improved performance.

For SemanticKITTI, the model pre-trained with SPC attains a higher validation mIoU at very low iteration counts. This probably stems from the specific dataset used for SPC, which is mostly composed of SemanticKITTI and KITTI-360. They share the same topology as the finetuning data. It is hard to conclude that there is a better method in the first few iterations in this case.

We conclude that the COLA pre-training method taught geometric and semantic primitives to the models, which explains the better early performances during the training. The results for SPVCNN, and the early training lead us to investigate more precisely the behavior of the models in the case of limited data availability.

V-B Results with Limited Data Availability

We compiled results for partial nuScenes for both architectures in Table VI, as they shared very similar results. They confirm the previous conclusion that the proposed COLA pre-training improves the performance significantly, whatever the architecture, when only a small amount of data is available. The improvement is even more significant for SPVCNN.

Moreover, using non-pertinent pre-trainings, such as CSC, can even lead to a decrease in the final performances.

V-C Qualitative results

We show some qualitative results for SemanticKITTI with SRU-Net in Figure 5. Segmentation errors are shown in red, and we can see a significant decrease in mistakes for COLA, especially for buildings and vegetation. Zoom in for a better visualization.

VI Conclusion

In this paper, we proposed a model-agnostic pre-training paradigm in order to leverage existing annotations and extract meaningful semantic and geometric information from the available data.

We demonstrated that the proposed COLA method shows great promise. Specifically, it works very well when the amount of data is limited, but the proposed method is not restricted to this use case. Furthermore, we have shown the proposed method works on various data topologies, and manages improved performance even when the target topology is not seen during the pre-training. COLA needs much less raw data than typical contrastive approaches followed thus far, in addition to needing significantly less computational power and time.


  • [1] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361, 2012.
  • [2] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11618–11628, 2020.
  • [3] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proc. of the IEEE/CVF International Conf. on Computer Vision (ICCV), 2019.
  • [4] Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
  • [5] Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas Guibas, and Or Litany. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In European Conference on Computer Vision, pages 574–591. Springer, 2020.
  • [6] Aditya Sanghi. Info3d: Representation learning on 3d objects using mutual information maximization and contrastive learning. ArXiv, abs/2006.02598, 2020.
  • [7] Ji Hou, Benjamin Graham, Matthias Nießner, and Saining Xie. Exploring data-efficient 3d scene understanding with contrastive scene contexts. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15582–15592, 2021.
  • [8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
  • [9] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [10] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. 12 2016.
  • [11] Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 5105–5114, Red Hook, NY, USA, 2017. Curran Associates Inc.
  • [12] Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Agathoniki Trigoni, and Andrew Markham. Randla-net: Efficient semantic segmentation of large-scale point clouds. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11105–11114, 2020.
  • [13] M. Tatarchenko, J. Park, V. Koltun, and Q. Zhou. Tangent convolutions for dense prediction in 3d. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3887–3896, Los Alamitos, CA, USA, jun 2018. IEEE Computer Society.
  • [14] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, pages 90–105, Cham, 2018. Springer International Publishing.
  • [15] Zhiyuan Zhang, Binh-Son Hua, and Sai-Kit Yeung. Shellnet: Efficient point cloud convolutional neural networks using concentric shells statistics. In International Conference on Computer Vision (ICCV), 2019.
  • [16] Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J. Guibas. Kpconv: Flexible and deformable convolution for point clouds. Proceedings of the IEEE International Conference on Computer Vision, 2019.
  • [17] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4558–4567, 2018.
  • [18] Chu Wang, Babak Samari, and Kaleem Siddiqi. Local spectral graph convolution for point set feature learning. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, pages 56–71, Cham, 2018. Springer International Publishing.
  • [19] Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. Graph attention convolution for point cloud semantic segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10288–10297, 2019.
  • [20] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928, 2015.
  • [21] Huan Lei, Naveed Akhtar, and Ajmal Mian. Octree guided cnn with spherical kernels for 3d point clouds. IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [22] Andres Milioto, Ignacio Vizzo, Jens Behley, and Cyrill Stachniss. Rangenet++: Fast and accurate lidar semantic segmentation. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4213–4220, 2019.
  • [23] Chenfeng Xu, Bichen Wu, Zining Wang, Wei Zhan, Peter Vajda, Kurt Keutzer, and Masayoshi Tomizuka. SqueezeSegV3: Spatially-Adaptive Convolution for Efficient Point-Cloud Segmentation, pages 1–19. November 2020.
  • [24] Yang Zhang, Zixiang Zhou, Philip David, Xiangyu Yue, Zerong Xi, Boqing Gong, and Hassan Foroosh. Polarnet: An improved grid representation for online lidar point clouds semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [25] Martin Gerdzhev, Ryan Razani, Ehsan Moeen Taghavi, and Bingbing Liu. Tornado-net: multiview total variation semantic segmentation with diamond inception module. 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 9543–9549, 2021.
  • [26] Christopher Bongsoo Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3070–3079, 2019.
  • [27] Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, and Song Han. Searching efficient 3d architectures with sparse point-voxel convolution. In European Conference on Computer Vision, 2020.
  • [28] Hui Zhou, Xinge Zhu, Xiao Song, Yuexin Ma, Zhe Wang, Hongsheng Li, and Dahua Lin. Cylinder3d: An effective 3d framework for driving-scene lidar semantic segmentation. ArXiv, abs/2008.01550, 2020.
  • [29] Jianyun Xu, Ruixiang Zhang, Jian Dou, Yushi Zhu, Jie Sun, and Shiliang Pu. Rpvnet: A deep and efficient range-point-voxel fusion network for lidar point cloud segmentation. ArXiv, abs/2103.12978, 2021.
  • [30] Jonathan Sauder and Bjarne Sievers. Self-supervised deep learning on point clouds by reconstructing space. In NeurIPS, 2019.
  • [31] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas J Guibas. Learning representations and generative models for 3d point clouds. arXiv preprint arXiv:1707.02392, 2017.
  • [32] Idan Achituve, Haggai Maron, and Gal Chechik. Self-supervised learning for domain adaptation on point clouds. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 123–133, 2021.
  • [33] Kaveh Hassani and Mike Haley. Unsupervised multi-task feature learning on point clouds. Pages 8159–8170, October 2019.
  • [34] Yue Wang and Justin Solomon. Deep closest point: Learning representations for point cloud registration. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3522–3531, 2019.
  • [35] Zaiwei Zhang, Rohit Girdhar, Armand Joulin, and Ishan Misra. Self-supervised pretraining of 3d features on any point-cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10252–10263, October 2021.
  • [36] Jiageng Mao, Minzhe Niu, Chenhan Jiang, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, Jie Yu, Chunjing Xu, et al. One million scenes for autonomous driving: Once dataset. 2021.
  • [37] Jun Xie, Martin Kiefel, Ming-Ting Sun, and Andreas Geiger. Semantic instance annotation of street scenes by 3d to 2d label transfer. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3688–3697, 2016.
  • [38] Scale AI & Hesai. PandaSet open datasets, 2020.
  • [39] Yancheng Pan, Biao Gao, Jilin Mei, Sibo Geng, Chengkun Li, and Huijing Zhao. Semanticposs: A point cloud dataset with large quantity of dynamic instances. 2020 IEEE Intelligent Vehicles Symposium (IV), pages 687–693, 2020.
  • [40] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
  • [41] Maxim Berman, Amal Rannen Triki, and Matthew B Blaschko. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4413–4421, 2018.