
Unsupervised Representation Learning for Point Clouds: A Survey

02/28/2022
by   Aoran Xiao, et al.

Point cloud data have been widely explored due to their superior accuracy and robustness under various adverse situations. Meanwhile, deep neural networks (DNNs) have achieved very impressive success in various applications such as surveillance and autonomous driving. The convergence of point clouds and DNNs has led to many deep point cloud models, largely trained under the supervision of large-scale and densely-labelled point cloud data. Unsupervised point cloud representation learning, which aims to learn general and useful point cloud representations from unlabelled point cloud data, has recently attracted increasing attention due to the constraint in large-scale point cloud labelling. This paper provides a comprehensive review of unsupervised point cloud representation learning using DNNs. It first describes the motivation, general pipelines as well as terminologies of the recent studies. Relevant background including widely adopted point cloud datasets and DNN architectures is then briefly presented. This is followed by an extensive discussion of existing unsupervised point cloud representation learning methods according to their technical approaches. We also quantitatively benchmark and discuss the reviewed methods over multiple widely adopted point cloud datasets. Finally, we share our humble opinion about several challenges and problems that could be pursued in future research on unsupervised point cloud representation learning. A project associated with this survey has been built at https://github.com/xiaoaoran/3d_url_survey.


1 Introduction

3D acquisition technologies have experienced fast development in recent years. This can be witnessed by different 3D sensors that have become increasingly popular in both industry and daily life, such as LiDAR sensors in autonomous vehicles, RGB-D cameras in Kinect and Apple devices, 3D scanners in various reconstruction tasks, etc. Meanwhile, 3D data of different modalities such as meshes, point clouds, depth images and volumetric grids, which capture accurate geometric information of both objects and scenes, have been collected and widely applied in different areas such as autonomous driving, robotics, medical treatment, remote sensing, etc.

Point clouds, as one source of ubiquitous and widely used 3D data, can be directly captured with entry-level depth sensors before being triangulated into meshes or converted to voxels. This makes them readily applicable to various 3D scene understanding tasks [52] such as 3D object detection, shape analysis, 3D scene understanding, etc. With the advance of deep neural networks, point cloud understanding has attracted increasing attention, as observed from the large number of deep architectures and deep models developed in recent years [38]. On the other hand, effective training of deep networks requires large-scale human-annotated training data such as 3D bounding boxes for object detection and point-wise annotations for semantic segmentation, which are usually laborious and time-consuming to collect due to 3D view changes and the visual inconsistency between human perception and point cloud display. Efficient collection of large-scale annotated point clouds has become a bottleneck for the effective design and deployment of deep networks in various real-world tasks [47].

Fig. 1: The general pipeline of unsupervised representation learning on point clouds: Deep neural networks are first pre-trained with unannotated point clouds via unsupervised learning over certain pre-text tasks. The learned unsupervised point cloud representations are then transferred to various downstream tasks to provide network initialization, with which the pre-trained network models can be fine-tuned effectively with a small amount of annotated task-specific point cloud data.
Fig. 2: Taxonomy of existing unsupervised methods for point cloud representation learning.

Unsupervised representation learning (URL), which aims to learn robust and general feature representations from unlabelled data, has recently been studied intensively for mitigating the laborious and time-consuming data annotation challenge. As Fig. 1 shows, URL works in a similar way to pre-training, which learns useful knowledge from unlabelled data and transfers the learnt knowledge to various downstream tasks [58]. More specifically, URL can provide helpful network initialization with which well-performing network models can be trained with a small amount of labelled training data without suffering from much over-fitting as compared with training from random initialization. URL can thus help reduce training data and annotations, and has demonstrated great effectiveness in natural language processing (NLP) [83, 60], 2D computer vision [44, 36, 14, 43], etc.

Similar to other types of data such as texts and 2D images, URL of point clouds has recently attracted increasing attention in the computer vision research community. A number of URL techniques have been reported which learn different point cloud representations by designing different pre-text tasks such as 3D object reconstruction [107], partial object completion [108], 3D jigsaw solving [93], etc. However, URL of point clouds still lags far behind its counterparts in NLP and 2D computer vision. For the time being, training from scratch on new target data is still the prevalent approach in most existing 3D scene understanding development. At the same time, URL from point cloud data faces increasing problems and challenges, largely due to the lack of large-scale and high-quality point cloud data, unified deep backbone architectures, generalizable technical approaches, as well as standardized evaluation benchmarks.

In addition, URL for point clouds still lacks a systematic survey that offers a clear big picture of this new yet challenging task. To fill this gap, this paper presents a comprehensive survey on the recent progress in unsupervised point cloud representation learning from the perspectives of datasets, network architectures, technical approaches, performance benchmarking, and future research directions. As shown in Fig. 2, we broadly group existing methods into four categories based on their technical approaches, including URL methods using data generation, global and local contexts, multimodal data, and local descriptors, with more details discussed in the ensuing sections.

The major contributions of this work can be summarized in three points:

  1. It presents a comprehensive review of the recent development in unsupervised point cloud representation learning. To the best of our knowledge, it is the first survey that provides an overview and big picture for this exciting research topic.

  2. It discusses the most recent progress of unsupervised point cloud representation learning, including a comprehensive benchmarking and discussion of existing methods over multiple public datasets.

  3. It shares several research challenges and potential research directions that could be pursued in unsupervised point cloud representation learning.

The rest of this survey is organized as follows: In Section 2, we introduce background knowledge of unsupervised point cloud learning, including term definitions, common point cloud understanding tasks and surveys relevant to this work. Section 3 introduces widely-used datasets and their characteristics. Section 4 introduces commonly used deep point cloud architectures, with typical models that are frequently used for point cloud URL. In Section 5 we systematically review methods for point cloud URL. Section 6 summarizes and compares the performance of existing methods on multiple benchmark datasets. Finally, we list several promising future directions for unsupervised point cloud representation learning in Section 7.

2 Background

2.1 Basic concepts

For clarity, we first define the terms used in the remaining sections.

Point cloud data: A point cloud $P$ is a set of vectors $\{p_i\}_{i=1}^{N}$, where each vector $p_i = (x_i, f_i)$ represents a point. Here, $x_i \in \mathbb{R}^3$ is the 3D coordinate of the point; $f_i$ is the feature attribute of the point, which is optional and varies with 3D sensors and applications, e.g., RGB value, LiDAR intensity value, normal value, etc.

Supervised learning: Under the paradigm of deep learning, supervised learning aims to train deep network models with human-labeled training data.

Unsupervised learning: Unsupervised learning means training networks without human-annotated labels [57].

Self-supervised learning: Self-supervised learning refers to learning with labels that are generated from raw data itself (without human annotation). It is a subset of unsupervised learning [69].

Semi-supervised learning: In semi-supervised learning, networks are trained on a small amount of labeled data and a large amount of unlabeled data.

Pre-training: The network models are first pre-trained for pre-text tasks on other data. The learned weights are then used as model initialization for downstream tasks.

Transfer learning: Transfer learning aims to transfer knowledge across tasks, modalities or datasets. A typical scenario in this survey is using unsupervised learning methods as pre-training approaches to transfer knowledge of unlabeled data into downstream models.

2.2 Common 3D understanding tasks

This subsection introduces common 3D understanding tasks, including object-level tasks (object classification and object part segmentation) and scene-level tasks (3D object detection, semantic segmentation and instance segmentation). These tasks have been widely used to evaluate the quality of point cloud representations learned by unsupervised learning methods. Specifically, the network parameters learned by unsupervised learning methods are employed to initialize models for these tasks, and the resulting performance demonstrates the generalization ability of the learned features.

2.2.1 Object classification

Object classification aims to classify point cloud objects into their pre-defined categories. Two criteria are most frequently used: the overall accuracy (OA) represents the average accuracy over all instances in the test set; the mean class accuracy (mAcc) represents the mean of the per-class accuracies over all shape classes in the test set.

When using object classification as the downstream task to evaluate the quality of features learned by unsupervised learning methods, two different protocols are often adopted as performance criteria. One is the linear classification protocol: a linear SVM classifier is trained on the representation features learned by unsupervised learning methods, and the classification accuracy is reported for evaluation and comparison. This requires the unsupervised learning methods to learn hierarchical and locally smooth feature representations from point clouds, i.e., objects from the same category should be close in the feature space and vice versa. The other is the fine-tuning protocol: the models pre-trained by unsupervised learning methods serve as the point cloud encoder's initial weights, and the networks are fine-tuned (re-trained) given the labels of the classification dataset.
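
A minimal sketch of the linear classification protocol is given below; it assumes frozen embeddings already extracted from a pre-trained encoder and uses random stand-in features so that the snippet is self-contained (in practice, the arrays would come from the encoder's global features):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in for frozen embeddings from an unsupervised pre-trained point cloud encoder:
# in practice X_* are the encoder's global features and y_* the object category labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2000, 1024)), rng.integers(0, 40, 2000)
X_test, y_test = rng.normal(size=(500, 1024)), rng.integers(0, 40, 500)

# Linear probe: only the SVM is trained; the encoder itself stays frozen.
clf = LinearSVC(C=0.01).fit(X_train, y_train)
print(f"Linear probe overall accuracy (OA): {clf.score(X_test, y_test):.3f}")
```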

2.2.2 Object part segmentation

Fig. 3: An illustration of object part segmentation. Left column: Object examples including an airplane and a table from ShapeNetPart dataset [8]; Right column: Ground truth with different colors representing separate parts.

Object part segmentation is an important task for point cloud representation learning. As shown in Fig. 3, models are trained to predict a part category label (e.g. wing, table leg) for each point. The mean Intersection over Union (IoU) [78] is the most frequently used criterion for performance evaluation: for each instance, the IoU is computed for each part belonging to that object category, and the mean of the part IoUs gives the IoU for that object instance. The overall IoU is computed as the average of IoUs over all instances, while the category-wise IoU (or class IoU) is calculated as the mean over the instances of that category.
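
A short NumPy sketch of the per-instance part IoU described above; the convention that a part absent from both prediction and ground truth counts as IoU 1 follows common practice and may differ between implementations:

```python
import numpy as np

def instance_part_iou(pred, gt, part_ids):
    """Mean part IoU for one object instance.
    pred, gt: (N,) per-point part labels; part_ids: the parts of this object category."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        # A part missing from both prediction and ground truth is counted as IoU 1.
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

# Toy example: an object category with parts {0, 1, 2}.
gt = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 2])
print(instance_part_iou(pred, gt, part_ids=[0, 1, 2]))   # IoU of this instance
```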

When choosing object part segmentation as the downstream task, similar to object classification, either the linear classification protocol or the fine-tuning protocol is often adopted by existing methods to evaluate the quality of the unsupervised point cloud features.

2.2.3 3D object detection

(a) ScanNet-V2 [18] dataset
(b) KITTI [31] dataset
Fig. 4: Illustration of 3D bounding boxes in point cloud object detection: The two graphs are cropped from [76] and [77], respectively.

3D object detection on point clouds is crucial and indispensable for many real-world applications, such as autonomous driving and domestic robots. The task is to localize 3D objects in space with 6 degrees of freedom (DoF), i.e., to predict 3D bounding boxes as demonstrated in Fig. 4. The average precision (AP) metric is often used for evaluation [76, 95]. When evaluating unsupervised learning methods, models pre-trained on unlabeled point clouds are fine-tuned on the 3D object detection task to test the generalization ability of the learned representations.

2.2.4 3D semantic segmentation

Fig. 5: Illustration of semantic point cloud segmentation: For the point cloud sample from S3DIS [3] shown on the left, the graph on the right shows the corresponding ground truth with different categories highlighted by different colors.

3D semantic segmentation on point clouds, as shown in Fig. 5, is another critical task for 3D understanding. Similar to the object part segmentation task, it aims to assign a category label to each point, while the inputs of the networks are scene point clouds instead of point cloud objects. The OA, the mIoU over semantic classes, and the mAcc are often used as evaluation metrics.

When demonstrating the generalization ability of unsupervised learned features, networks trained with the pre-text task on large unlabeled data serve as pre-trained models and are fine-tuned on 3D semantic segmentation datasets; the performance on the semantic segmentation task is then reported for evaluation.

2.2.5 3D instance segmentation

Fig. 6: Illustration of instance segmentation on point clouds: For the point cloud sample from ScanNet-V2 [18] on the left, the graph on the right shows the corresponding ground truth with different instances highlighted by different colors.

As shown in Fig. 6, 3D instance segmentation aims to detect and delineate each distinct object of interest appearing in point cloud scenes. Unlike semantic segmentation, different objects of the same class will have different labels for instance segmentation. Mean Average Precision (mAP) is used for quantitative evaluation in the task. Similar to prior tasks, networks initialized with unsupervised pre-trained models are fine-tuned on instance segmentation datasets, and the performances are used to test the generalization ability of unsupervised point cloud features.

2.3 Relevant surveys

To the best of our knowledge, this paper is the first comprehensive survey on unsupervised point cloud learning. There are several excellent surveys that are related to this topic.

Several surveys of deep learning for point clouds are available: [54] reviewed deep learning advances on 3D data; [120] provided a literature review of the point cloud segmentation task; [38] provided a comprehensive and detailed survey on deep learning for point clouds, including classification, detection and tracking, as well as segmentation tasks. These works focus on supervised learning while this paper is about unsupervised learning. There are also several works that introduce self-supervised learning on other modalities: [57] introduced advances in self-supervised learning in 2D computer vision; [69] looked into the latest progress of self-supervised learning methods in 2D computer vision, NLP, and graph learning; [80] introduced recent progress on small-data learning including unsupervised and semi-supervised methods. Readers are recommended to read this paper together with these related surveys to form a more comprehensive understanding of the research area.

3 Point cloud datasets

Dataset Year #Samples #Classes Type Representation Label
KITTI [31] 2013 15K frames 8 Outdoor driving RGB & LiDAR Bounding box
ModelNet10 [114] 2015 4,899 objects 10 Synthetic object Mesh Object category label
ModelNet40 [114] 2015 12,311 objects 40 Synthetic object Mesh Object category label
ShapeNet [8] 2015 51,190 objects 55 Synthetic object Mesh Object/part category label
SUN RGB-D [99] 2015 5K frames 37 Indoor scene RGB-D Bounding box
S3DIS [3] 2016 272 scans 13 Indoor scene RGB-D Point category label
ScanNet [18] 2017 1,513 scans 20 Indoor scene RGB-D & mesh Point category label & Bounding box
ScanObjectNN [106] 2019 2,902 objects 15 Real-world object Points Object category label
ONCE [72] 2021 1M scenes 5 Outdoor driving RGB & LiDAR Bounding box
TABLE I: Summary of commonly used datasets for training and evaluation by unsupervised point cloud representation learning methods.

In this section, we summarize the commonly used datasets for training and evaluating methods of unsupervised point cloud representation learning. Existing works learn unsupervised point cloud representations mainly from 1) synthetic object datasets including ModelNet [114] and ShapeNet [8], or 2) real scene datasets including ScanNet [18] and KITTI [31]. The learned models are then used as initialization weights and fine-tuned on downstream tasks, e.g. classification on ScanObjectNN [106], ModelNet40 [114] or ShapeNet [8], part segmentation on the ShapeNetPart dataset [8], semantic segmentation on S3DIS [3], ScanNet [18], or Synthia4D [89], and object detection on indoor datasets (SUN RGB-D [99] and ScanNet [18]) or the outdoor ONCE dataset [72].

  • ModelNet10/ModelNet40 [114]: ModelNet is a synthetic object-level dataset for 3D classification. The original ModelNet provides CAD models represented by vertices and faces. Point clouds are generated by sampling the models uniformly. ModelNet40 contains 12,311 objects from 40 categories, among which 9,843 objects belong to the training set and the remaining 2,468 samples are for testing. Similarly, ModelNet10, with samples from 10 categories, is split into 3,991 training samples and 908 testing samples.

  • ShapeNet [8]: ShapeNet is a dataset of synthetic 3D objects from 55 common categories. It was curated by collecting CAD models from open-sourced online 3D repositories. Similar to ModelNet, objects in the synthetic ShapeNet dataset are complete, without any occlusion or background. Its extension, the ShapeNetPart dataset, contains 16,881 objects from 16 categories, represented as point clouds. Each object consists of 2 to 6 parts, and in total there are 50 parts in the dataset.

  • ScanObjectNN [106]: ScanObjectNN is a real-world object-level dataset, where 2,902 3D point cloud objects from 15 categories are extracted from scans captured in real indoor scenes. Different from synthetic object datasets, point cloud objects in ScanObjectNN are noisy (background points, occlusions, holes in objects) and are not axis-aligned.

  • S3DIS [3]: The Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset comprises 3D scans of 6 large-scale indoor areas collected from 3 office buildings, covering 6,000 square meters and containing over 215 million points. The scans are represented as point clouds, and point-wise semantic labels of 13 object categories are annotated.

  • ScanNet-V2 [18]: ScanNet-V2 is an RGB-D video dataset containing 2.5 million views in more than 1500 scans, captured in indoor scenes such as offices and living rooms and annotated with 3D camera poses, surface reconstructions, as well as semantic and instance labels for segmentation.

  • SUN RGB-D [99]: SUN RGB-D dataset is a collection of single view RGB-D images from indoor environments. There are in total 10,335 RGB-D images annotated with amodal, 3D oriented bounding boxes for objects from 37 categories.

  • KITTI [31]: The KITTI dataset is a pioneer outdoor dataset providing dense point clouds from a lidar sensor together with other modalities including front-facing stereo images and GPS/IMU data. It provides 200k 3D boxes over 22 scenes for 3D object detection.

  • ONCE [72]: The ONCE dataset consists of 1 million LiDAR scenes and 7 million corresponding camera images. There are 581 sequences in total, where 560 sequences are unlabeled for unsupervised learning and 10 annotated sequences serve as the testing set. It provides a self-supervised learning benchmark for object detection in outdoor environments.

The publicly available datasets for URL of point clouds are very limited in both data size and scene variety as compared with those of images for 2D computer vision and texts for NLP. For example, [22] used more than 3 billion words for NLP pre-training and ImageNet [21] has more than 10 million images for unsupervised visual representation learning. Large-scale and high-quality point cloud data are needed as a foundation for future research.

4 Common deep architectures for point cloud learning

Deep learning for 3D point clouds, as compared with that for NLP and 2D computer vision, still remains a relatively nascent field, and unsupervised representation learning is even more so. One reason is the lack of a highly regular representation for point cloud data: there are word embeddings for NLP and images for 2D computer vision, whereas point clouds, represented as unordered point sets, enjoy no such universal and structured data format. Traditional 3D vision algorithms transform such data into structures like octrees [46] or hashed voxel lists [73], while deep learning approaches favor structures more amenable to differentiability and/or efficient neural processing, and have reached impressive performance improvements over various 3D tasks.

This section introduces deep point cloud architectures by category. Considering that it is still an active area of research to find a proper universal “3D backbone” that can become as ubiquitous as VGG [97] or ResNet [45] for 2D vision, and given the abundance of existing 3D deep architectures and models, we only focus on those frequently used in URL of point clouds, including point-based, graph-based, sparse voxel-based and spatial CNN-based architectures. Other types of networks have been proposed, such as projection-based networks [101, 116], recurrent neural networks [51], capsule networks [136], etc., but these models are not frequently used in existing unsupervised learning approaches. A more comprehensive introduction to deep point cloud architectures can be found in [38].

4.1 Point-based deep architectures

Fig. 7: A simplified architecture of PointNet [78] for object classification; the annotations denote the number of input points and the feature dimension.

Point-based networks [78, 79, 135, 50] are designed to process raw point cloud data directly, without other data pre-transformations. The networks are stacked with multi-layer perceptrons (MLPs) for independent point feature extraction, and the point features are aggregated into global features by symmetric aggregation functions.

PointNet [78] is a pioneering point-based network. As shown in Fig. 7, it stacks several MLP layers to learn point-wise features independently, which are then forwarded into a max-pooling layer to extract global features and achieve permutation invariance. However, PointNet fails to capture local information. Qi et al. further proposed PointNet++ [79] to capture geometric details from the neighborhood of each point. Its set abstraction level consists of a sampling layer, a grouping layer, and a PointNet-based learning layer, learning hierarchical features layer by layer. Since PointNet++ has shown success in object classification and semantic segmentation, VoteNet [76] adopts it as the backbone and formulates the first point-based network for 3D object detection. By taking a Hough voting strategy, it generates new points that lie close to object centers, which can be grouped and aggregated to output 3D box proposals.
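
As a rough illustration of the point-based design (shared per-point MLPs followed by symmetric max pooling), the following PyTorch sketch builds a minimal PointNet-style classifier; the layer sizes are illustrative and the input/feature transform networks of the original PointNet are omitted:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style classifier: shared per-point MLP + max pooling."""
    def __init__(self, num_classes=40):
        super().__init__()
        # Shared MLP applied independently to every point (implemented as 1x1 convolutions).
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, xyz):                            # xyz: (B, N, 3)
        feat = self.point_mlp(xyz.transpose(1, 2))     # per-point features, (B, 1024, N)
        global_feat = feat.max(dim=2).values           # symmetric aggregation, (B, 1024)
        return self.classifier(global_feat)            # class logits

logits = TinyPointNet()(torch.randn(8, 1024, 3))       # 8 objects, 1024 points each
```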

4.2 Graph-based deep architectures

Fig. 8: Schematic depiction of a graph convolutional network (GCN); the annotations denote the number of input channels, the output feature dimension, and the labels. Graphs are shared across network layers, each of which consists of multiple vertices (circular dots representing points) and edges connecting the vertices.

Graph-based networks build graphs in the Euclidean space of the point clouds, where vertices are points and edges encode neighborhood relations. As shown in Fig. 8, the graphs are fed into graph convolution operations to extract spatial information. The advantages of this approach include reducing the degrees of freedom of the learned models by enforcing some form of weight sharing, and extracting localized features that successfully capture dependencies among neighboring points.

The Dynamic Graph Convolutional Neural Network (DGCNN) [111] is a typical graph-based model that has been frequently used for unsupervised learning. A graph convolution named EdgeConv is executed on the edges of the k-nearest-neighbor graph of the point cloud, which is dynamically recomputed in feature space after each layer. DGCNN integrates EdgeConv into the basic PointNet structure and reaches impressive performance.
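
The sketch below illustrates the core EdgeConv idea (a k-nearest-neighbor graph plus an edge MLP over concatenated centre and offset features, followed by max aggregation over neighbors); it is a single simplified layer, not the full DGCNN:

```python
import torch
import torch.nn as nn

def knn(x, k):
    """x: (B, N, C). Returns (B, N, k) indices of the k nearest neighbours (self excluded)."""
    dist = torch.cdist(x, x)                                     # pairwise distances
    return dist.topk(k + 1, largest=False).indices[:, :, 1:]

class EdgeConv(nn.Module):
    """One simplified EdgeConv layer: MLP over [x_i, x_j - x_i], max over neighbours."""
    def __init__(self, in_dim, out_dim, k=20):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, x):                                        # x: (B, N, C)
        idx = knn(x, self.k)                                     # (B, N, k)
        neighbours = torch.gather(
            x.unsqueeze(1).expand(-1, x.size(1), -1, -1), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, x.size(2)))     # (B, N, k, C)
        centre = x.unsqueeze(2).expand_as(neighbours)
        edge_feat = torch.cat([centre, neighbours - centre], dim=-1)
        return self.mlp(edge_feat).max(dim=2).values             # (B, N, out_dim)

out = EdgeConv(3, 64)(torch.randn(2, 1024, 3))                   # per-point 64-d features
```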

4.3 Sparse voxel-based deep architectures

Fig. 9: An illustration of SR-UNet [119], which adopts a unified U-Net [88] architecture for sparse convolution. The graph is reproduced based on [119].

This type of architecture adopts the sparse tensor as the representation of point cloud data and extends it into sparse convolutional networks [35]. Recently, Choy et al. proposed the Minkowski Engine [17], a generalized sparse convolution and auto-differentiation library for sparse tensors. Based on it, Xie et al. [119] adopted a unified U-Net [88] architecture and built a backbone network (SR-UNet) for unsupervised pre-training, as shown in Fig. 9. The learned encoder is then transferred to a series of downstream tasks including classification, object detection and semantic segmentation.
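
As a rough sketch of the sparse-tensor representation such architectures consume, the following NumPy snippet quantizes a point cloud into integer voxel coordinates and averages the features falling into each occupied voxel; it is a simplified stand-in for the quantization utilities of sparse-convolution libraries, not their actual API:

```python
import numpy as np

def voxelize(points, features, voxel_size=0.05):
    """points: (N, 3) float coordinates, features: (N, C).
    Returns unique integer voxel coordinates and per-voxel mean features."""
    coords = np.floor(points / voxel_size).astype(np.int64)         # quantize coordinates
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)  # occupied voxels
    pooled = np.zeros((len(uniq), features.shape[1]))
    counts = np.bincount(inverse, minlength=len(uniq))[:, None]
    np.add.at(pooled, inverse, features)                            # scatter-add per voxel
    return uniq, pooled / counts                                    # mean feature per voxel

pts = np.random.rand(10000, 3)              # toy scene inside a 1 m cube
rgb = np.random.rand(10000, 3)
voxel_coords, voxel_feats = voxelize(pts, rgb)
# (voxel_coords, voxel_feats) together form the sparse tensor fed to a sparse conv network.
```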

4.4 Spatial CNN-based deep architectures

Spatial CNN-based networks [71, 105] extend regular grid convolutional neural networks (CNNs) to irregular configurations for point cloud analysis. According to the type of convolutional kernel, this type of deep architecture can be divided into continuous and discrete convolution models [38]. As Fig. 10 indicates, the former defines the convolutional kernels on a continuous space while the latter operates on a discrete space (e.g. regular grids).

Fig. 10: An illustration of 3D spatial convolution, including continuous and discrete convolutions; the highlighted parameters denote the center point and its neighbor points, respectively. The graph is reproduced based on [38].

As a typical model example, RS-CNN [71] learns the geometric topology relations between points and their surrounding neighbor points to derive convolutional weights, and builds a hierarchical deep network for point cloud processing.

5 Unsupervised point cloud representation learning

As shown in Fig. 2, we divide existing unsupervised point cloud representation learning methods into four main categories, including generation-based methods, context-based methods, multi-modal-based methods, and local descriptor-based methods. According to this taxonomy, we sort out existing methods and will review them in detail as follows.

5.1 Generation-based methods

Method Publication Year Category Contribution
VConv-DAE [94] ECCV 2016 Completion Learning by predicting missing parts in 3D grids
TL-Net [33] ECCV 2016 Reconstruction Learning by 3D generation and 2D prediction
3D-GAN [113] NeurIPS 2016 GAN Pioneer GAN for 3D voxels
3D-DescriptorNet [118] CVPR 2018 Completion Learning with energy-based models
FoldingNet [125] CVPR 2018 Reconstruction Learning by folding 3D object surfaces
SO-Net [65] CVPR 2018 Reconstruction Performing hierarchical feature extraction on individual points and SOM nodes
Latent-GAN [1] ICML 2018 GAN Pioneer GAN for raw point clouds and latent embeddings
MRT [28] ECCV 2018 Reconstruction A new autoencoder with multi-grid architecture
VIP-GAN [39] AAAI 2019 GAN Learning by solving multiple view inter-prediction tasks for objects with an RNN-based network
G-GAN [107] ICLR 2019 GAN Pioneer GAN with graph convolution
3DCapsuleNet [136] CVPR 2019 Reconstruction Learning with 3D point-capsule network
L2G-AE [70] ACM MM 2019 Reconstruction Learning by global and local reconstruction
MAP-VAE [40] ICCV 2019 Reconstruction Learning by reconstruction and half-to-half prediction
PointFlow [123] ICCV 2019 Reconstruction Learning by modeling point clouds as a distribution of distributions
PDL [96] CVPR 2020 Reconstruction A probabilistic framework for point distribution learning
GraphTER [30] CVPR 2020 Reconstruction Proposed a graph-based autoencoder
SA-Net [112] CVPR 2020 Completion Learning by completing point cloud objects with a skip-attention mechanism
PointGrow [103] WACV 2020 Reconstruction Presented an autoregressive model that recurrently generates point cloud samples
PSG-Net [124] ICCV 2021 Reconstruction Learning by reconstructing point cloud objects with seed generation
OcCo [108] ICCV 2021 Completion Learning by completing occluded point cloud objects
TABLE II: Summary of generation-based methods of unsupervised representation learning for point clouds.

Generation-based unsupervised methods for learning point cloud representations involve the process of generating point cloud objects and, according to the pre-text tasks, can be further summarized into four subcategories: point cloud self-reconstruction (generating point cloud objects that are the same as the input), point cloud GAN (generating fake point cloud objects), point cloud up-sampling (generating shape-similar but denser point clouds than the input) and point cloud completion (predicting missing parts of partial point cloud objects). The ground truth for training these methods is the point clouds themselves, which requires no human annotation, so these methods can be regarded as unsupervised learning methods. A list of generation-based methods can be found in Table II.

5.1.1 Learning through point cloud self-reconstruction

Fig. 11: An illustration of AutoEncoder in unsupervised point cloud representation learning: The Encoder learns to represent the point cloud object by a Codeword vector while the Decoder reconstructs the Output Object from the Codeword.

One of the most common unsupervised approaches for learning point cloud representations is self-reconstruction of 3D objects, which encodes point cloud samples into representation vectors and decodes them back into the original input data. Shape information and semantic structures are extracted and encoded into the representations during this process. Since no human annotations are involved, it belongs to unsupervised learning. The typical and most commonly used model is the autoencoder [61]. As Fig. 11 describes, it consists of an encoder network and a decoder network. The encoder compresses and encodes a point cloud object into a low-dimensional embedding vector named the codeword [125], which is then decoded back into 3D space with the requirement that the output be the same as the input. The encoding is validated and refined by attempting to regenerate the input from it, and the autoencoder learns representations for dimensionality reduction by training the network to ignore insignificant data ("noise") [4]. Permutation-invariant losses [25] such as the Chamfer distance (defined in Eq. 1) or the Earth Mover's Distance (EMD, defined in Eq. 2) are often used as the training objective to measure the similarity between input and output point clouds:

$$d_{CD}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \left\| x - y \right\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \left\| x - y \right\|_2^2 \qquad (1)$$

where $S_1$ and $S_2$ represent the corresponding input and output point clouds, and $x$, $y$ are points.

$$d_{EMD}(S_1, S_2) = \min_{\phi: S_1 \to S_2} \sum_{x \in S_1} \left\| x - \phi(x) \right\|_2 \qquad (2)$$

where $\phi$ is a bijection and $S_1$, $S_2$ are of equal size.
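
A minimal NumPy sketch of the (squared) Chamfer distance in Eq. 1, assuming two small point sets as input; production implementations typically use batched GPU kernels instead:

```python
import numpy as np

def chamfer_distance(s1, s2):
    """Squared Chamfer distance between point sets s1 (N, 3) and s2 (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d = np.sum((s1[:, None, :] - s2[None, :, :]) ** 2, axis=-1)
    return d.min(axis=1).sum() + d.min(axis=0).sum()

original = np.random.rand(1024, 3)
reconstructed = original + 0.01 * np.random.randn(1024, 3)
print(chamfer_distance(original, reconstructed))   # small value for a good reconstruction
```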

A series of unsupervised methods based on self-reconstruction have been proposed. In an early work, TL-Net [33], Girdhar et al. argued that point cloud representations should be generative in 3D space and predictable from 2D space, so they designed a 3D autoencoder to reconstruct 3D volumetric grids and a 2D convolutional network to learn 2D features from projected images, requiring the 2D embeddings to be close to the 3D embeddings so as to enrich the feature distribution across modalities. Yang et al. [125] introduced a novel folding-based decoder in FoldingNet, which deforms a canonical 2D grid, conditioned on the encoded codeword, onto the underlying 3D object surface of a point cloud object. SO-Net [65] is built as an autoencoder with a designed Self-Organizing Map (SOM) to learn hierarchical features of point clouds. Gadelha et al. [28] designed an autoencoder with multi-resolution tree structures, which learns point cloud representations through coarse-to-fine analysis. Zhao et al. [136] extended capsule networks [90] to 3D point clouds and presented an unsupervised 3D point-capsule network, which also adopts the autoencoder structure for generic representation learning on unstructured 3D data. Gao et al. [30] proposed a graph-based autoencoder to learn intrinsic patterns of point cloud structures under both global and local transformations.

To further learn local geometries, Liu et al. introduced, in their work L2G-AE [70], a hierarchical self-attention mechanism in the encoder to aggregate different levels of information, and employed a recurrent neural network (RNN) as the decoder to reconstruct a sequence of scales in local regions. Han et al. proposed MAP-VAE [40], which, on top of global reconstruction, introduces multi-angle analysis with half-to-half prediction using an RNN: it splits a point cloud into a front half and a back half and learns to predict the back half sequence from the corresponding front half sequence. Besides, Yang et al. [123] proposed a principled probabilistic framework named PointFlow to generate 3D point clouds by modeling the distribution of shapes as well as the distribution of points given a shape. Shi et al. [96] introduced a probabilistic framework to extract unsupervised deep shape descriptors with point distribution learning, which associates each point with a Gaussian and models point clouds as distributions of points; they train DNNs with an unsupervised self-correspondence L2 distance loss to solve the maximum likelihood estimation process. Sun et al. [103] presented an autoregressive model named PointGrow that generates diverse and realistic point cloud samples from scratch or conditioned on semantic contexts; they also designed self-attention modules to capture long-range dependencies within 3D objects. Chen et al. [12] designed a deep autoencoder with graph topology inference and filtering, which extracts compact representations from 3D point clouds. Yang et al. [124] proposed an autoencoder structure with a seed generation module to extract input-dependent point-wise features in multiple stages with gradually increasing resolution for point cloud reconstruction. Chen et al. [11] proposed to learn sampling-invariant features by reconstructing point cloud objects at different resolutions and minimizing the Chamfer distances between them.

5.1.2 Learning through point cloud GAN

Fig. 12: An illustration of GAN which typically consists of a generator and a discriminator that contest with each other during the training process (in the form of a zero-sum game, where one agent’s gain is another agent’s loss).

The Generative Adversarial Network (GAN) [34] is a typical deep generative model. As demonstrated in Fig. 12, it consists of a generator and a discriminator. The generator aims to produce realistic data samples while the discriminator tries to distinguish real samples from samples synthesized by the generator. The technique learns to generate new data with the same statistics as the training set, and the modeling can be formulated as a minimax problem:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \qquad (3)$$

where $G$ is the generator and $D$ represents the discriminator. $x$ and $z$ represent a real sample and a noise vector randomly sampled from a distribution $p_z(z)$, respectively.

The pipeline for point cloud unsupervised learning with GANs usually takes two steps. First, as illustrated in Fig. 12, the generator maps a sampled noise vector or latent embedding to generated point cloud instances, while the discriminator tries to distinguish whether a point cloud comes from the real data distribution or from the generated one; during this training, the discriminator learns to capture the semantic structure of point cloud objects. Then, the learnt parameters of the discriminator serve as initialization weights of networks for downstream tasks. This procedure involves no human annotations and thus belongs to unsupervised learning.
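
To make the two-step pipeline concrete, the following PyTorch sketch runs one adversarial training step of a toy point cloud GAN and notes where the discriminator weights would be reused; the network sizes, optimizers and random data are placeholders rather than any specific published model:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=128, n_points=2048):
        super().__init__()
        self.n_points = n_points
        self.net = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_points * 3))
    def forward(self, z):
        return self.net(z).view(-1, self.n_points, 3)            # (B, N, 3) point sets

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # point_mlp is the part whose weights could later initialize a downstream encoder.
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 256))
        self.head = nn.Linear(256, 1)
    def forward(self, pts):                                       # pts: (B, N, 3)
        feat = self.point_mlp(pts).max(dim=1).values              # symmetric pooling
        return self.head(feat)                                    # real/fake logit

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 2048, 3)                # stand-in for a real point cloud batch
fake = G(torch.randn(8, 128))

# Discriminator step: real samples -> 1, generated samples -> 0.
d_real, d_fake = D(real), D(fake.detach())
loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool the discriminator.
d_fake = D(fake)
loss_g = bce(d_fake, torch.ones_like(d_fake))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
# After training, D.point_mlp can serve as initialization for a downstream point encoder.
```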

GANs for point clouds are a hot research topic and have been implemented in many different 3D areas [66, 122, 98, 110, 131]. Some works specifically employ point cloud generation with GANs as the pretext task for unsupervised representation learning [113, 1, 107, 64] and evaluate the features learned by the discriminator on other downstream high-level tasks with transfer learning. As a pioneering work, 3D-GAN [113], the first GAN model for 3D voxels, generates 3D objects by sampling a latent vector and mapping it to the object space. The adversarial discriminator, learned without supervision, provides a powerful 3D shape descriptor for 3D applications such as object recognition. However, the voxelization process for point clouds either sacrifices representation accuracy or incurs huge redundancy. Later, Achlioptas et al. proposed Latent-GAN [1] as the first generative model for raw point clouds: an autoencoder is first trained to learn features of point clouds in a latent space; a generative model is then trained on this fixed latent space and achieves superior reconstruction and coverage of the data distribution. Li et al. [64] proposed a point cloud GAN model with a hierarchical sampling and inference network, which learns a stochastic procedure to generate new point clouds. Valsesia et al. [107] designed a graph-based GAN model to extract localized features from point clouds.

5.1.3 Learning through point cloud up-sampling

Fig. 13: An illustration of point cloud up-sampling: The network DNN learns point cloud representations by solving a pre-text task that reproduces an object with the same geometry but denser point distribution.

As Fig. 13 shows, given a set of points, the point cloud up-sampling task aims to generate a denser set of points, which requires deep point cloud networks to learn the underlying geometry of 3D shapes. No human annotation is involved, so this task also falls into unsupervised learning.

Li et al. [66] introduced GANs into point cloud up-sampling and presented PU-GAN to learn a rich variety of point distributions from the latent space and up-sample points over patches on object surfaces. The generator aims to produce up-sampled point clouds while the discriminator tries to distinguish whether its input point cloud is produced by the generator or is real. Similar to the GANs introduced in Section 5.1.2, the parameters of the discriminator can be transferred to other downstream tasks. Besides, Remelli et al. [86] designed an autoencoder that is able to up-sample sparse point clouds into dense representations; the learned weights of the encoder can likewise be used as initialization weights for downstream tasks, as introduced in Section 5.1.1. On the other hand, although point cloud up-sampling is a relatively popular research topic [127, 126, 66, 82, 81, 67], these networks are mainly evaluated on the quality of the generated point clouds, while their performance in transfer learning has not been tested yet.

5.1.4 Learning through point cloud completion

Method Publication Year Category Contribution
MultiTask [42] ICCV 2019 Hybrid Learning by clustering, reconstruction, and self-supervised classification
Jigsaw3D [93] NeurIPS 2019 Spatial-context Learning by solving 3D jigsaws
Contrast&Cluster [132] 3DV 2019 Hybrid Learning by contrasting and clustering with GNN
GLR [85] CVPR 2020 Hybrid Learning by global-local reasoning
Info3D [91] ECCV 2020 Context-similarity Learning by contrasting global and local parts of objects
PointContrast [119] ECCV 2020 Context-similarity Learning by contrasting different views of scene point clouds
ACD [27] ECCV 2020 Context-similarity Learning by contrasting decomposed convex components
Rotation3D [75] 3DV 2020 Spatial-context Learning by predicting rotation angle
HNS [23] ACM MM 2021 Context-similarity Learning by contrasting local patches of point cloud objects with hard negative sampling
CSC [47] CVPR 2021 Context-similarity Techniques to improve contrasting scene point cloud views
STRL [52] ICCV 2021 Temporal-context Learning spatio-temporal data invariance from point cloud sequences
RandomRooms [84] ICCV 2021 Context-similarity Constructing pseudo scenes with synthetic objects for contrastive learning
DepthContrast [134] ICCV 2021 Context-similarity Joint contrastive learning with points and voxels
SelfCorrection [15] ICCV 2021 Hybrid Learning by distinguishing and restoring destroyed objects
TABLE III: Summary of context-based methods of unsupervised representation learning for point clouds.
Fig. 14: The pipeline of OcCo [108]. Taking point cloud objects occluded by a camera view as input, an encoder-decoder model is trained to complete the occluded point clouds, where the encoder learns representations of the point clouds and the decoder generates the complete objects. The learned encoder weights can be used as initialization for downstream tasks. The figure is from [108] with the authors' permission.

Point cloud completion is the task of predicting arbitrary missing parts based on the rest of a 3D point cloud object. The network needs to learn the inner geometric structures and semantic knowledge of the objects so as to correctly predict the missing parts, and this knowledge can then be transferred to downstream tasks. Since no human annotations are needed for the point cloud completion task, these methods belong to unsupervised learning.

Many approaches have been proposed for the point cloud completion task. A pioneering work, VConv-DAE [94], voxelizes point cloud objects into volumetric grids and learns the shape distributions of various classes with an autoencoder that predicts the missing voxels from the rest. Xie et al. [118] presented 3D-DescriptorNet, a framework for probabilistic modeling of volumetric shape patterns that joins the advantages of energy-based models and volumetric CNNs; it learns object representations by recovering randomly corrupted voxels of 3D objects. Achlioptas et al. [1] introduced the first DNN for raw point cloud completion by utilizing an encoder-decoder framework. Yuan et al. [128] proposed PCN, which combines the advantages of Latent-GAN [1] and FoldingNet [125] and specializes in repairing incomplete point clouds. Wen et al. [112] proposed a skip-attention mechanism that selectively conveys geometric information from local regions to the decoder so as to complete point cloud objects. While most works output whole objects whose points may not be consistent with the original input, Huang et al. [53] proposed to repair incomplete point clouds by outputting only the missing part while keeping the existing points; such prediction retains the geometric features of the original point cloud and helps the network focus on perceiving the location and structure of the missing parts. Wang et al. [108], as shown in Fig. 14, proposed to learn an encoder-decoder model that reconstructs points occluded under different camera views; the encoder parameters are then used as initialization for downstream point cloud tasks including classification, part segmentation and semantic segmentation. Although a series of methods have been designed for the task [37, 128, 53, 68, 112, 133, 117], only a few of them [94, 118, 112, 108] evaluated the quality of the learned representation features on unsupervised learning benchmarks.

Recently, recovering missing parts from incomplete input as the pre-text task has proved remarkably successful in NLP [83, 60] and 2D vision [43], while few works have explored it for unsupervised point cloud learning. We believe this is a promising direction for future research.

5.1.5 Discussion

Unsupervised learning on point clouds based on generation tasks has been a major research direction with a relatively long history. Existing methods mainly focus on learning from object-level point clouds, while few works explore scene-level data, which limits the application of unsupervised learning. In contrast, generation-based methods for unsupervised learning have shown great success in both NLP [22, 60] and 2D vision [43]. To this end, we believe there is large potential in this direction.

5.2 Context-based methods

Another category of unsupervised point cloud learning methods is context-based methods. Unlike generation-based methods that learn by generating point clouds, these methods employ discriminative pre-text tasks to learn different contexts of point clouds, including context similarity, spatial context structures and temporal context structures. A list of methods is summarized in Table III.

5.2.1 Learning with context similarity

Fig. 15: An illustration of instance contrastive learning that learns locally smooth representations by self-discrimination, which pulls Query (from the Anchor sample) close to Positive Key (from Positive Sample) and pushes it away from Negative Keys (from Negative Samples).

This type of method formulates unsupervised learning by exploring underlying context similarities between samples. A typical approach is contrastive learning, which has demonstrated superior performance in both 2D [44, 36, 13] and 3D [119, 47, 134] URL in recent years. Fig. 15 shows an example of instance-wise contrastive learning: the input point cloud object serves as the anchor; positive samples are defined as augmented views of the anchor while negative samples are different object instances. The DNN learns representations of point clouds by optimizing the pre-text task of self-discrimination, i.e., the query (feature of the anchor) should be close to the positive keys (features of positive samples) and separated from the negative keys (features of negative samples). Such a learning strategy groups similar samples together and helps networks learn semantic structures from the unlabeled data distribution. The InfoNCE loss [74] and its extensions are often employed as the training objective, defined as:

$$\mathcal{L}_{q} = -\log \frac{\exp(q \cdot k_{+} / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_{i} / \tau)} \qquad (4)$$

where $q$ is the encoded query; $\{k_0, k_1, \dots, k_K\}$ are the keys, among which $k_{+}$ is the positive key; $\tau$ is a temperature hyper-parameter that controls how the distribution concentrates.
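
A compact PyTorch sketch of the InfoNCE objective in Eq. 4, assuming pre-computed query, positive-key and negative-key embeddings (e.g. features of anchors, their augmented views, and other instances):

```python
import torch
import torch.nn.functional as F

def info_nce(query, pos_key, neg_keys, tau=0.07):
    """InfoNCE loss of Eq. 4 for one batch.
    query, pos_key: (B, D); neg_keys: (K, D). Features are L2-normalized first."""
    q = F.normalize(query, dim=1)
    k_pos = F.normalize(pos_key, dim=1)
    k_neg = F.normalize(neg_keys, dim=1)
    pos_logits = (q * k_pos).sum(dim=1, keepdim=True)    # (B, 1) positive similarities
    neg_logits = q @ k_neg.t()                           # (B, K) negative similarities
    logits = torch.cat([pos_logits, neg_logits], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)    # the positive key sits at index 0
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128), torch.randn(256, 128))
```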

A series of methods [91, 109, 27, 23, 56] have been presented to learn object representations. Sanghi et al. [91] proposed to learn point cloud representations by maximizing the mutual information between objects and their local parts. Similarly, Wang et al. [109] implemented contrast between instances and between points and instances; they further created multi-resolution instances from point cloud objects to capture hierarchical features. Gadelha et al. [27] adopted a classical shape decomposition method, Approximate Convex Decomposition, to decompose objects into convex components, which are used for contrastive learning by forming positive pairs among the same components and negative pairs among different components. Du et al. [23] conducted instance-level and local part-level contrastive learning, and also extracted hard negative samples for discriminative feature learning. Rao et al. [85] combined contrastive learning and self-reconstruction to formulate multi-task unsupervised representation learning.

Fig. 16: The pipeline of PointContrast [119]: Two scans of the same scene captured from two different viewpoints are separately transformed for data augmentation. The correspondence mapping between the two views is computed to minimize the distance between matched point features and maximize the distance between unmatched point features for contrastive learning. The graph is from [119] with the authors' permission.

Recently, Xie et al. proposed PointContrast [119], which learns representations of scene point clouds. Figure 16 demonstrates the pipeline of the designed pre-text task: dense correspondences between two aligned views of indoor scenes are extracted and used for point-level contrastive learning. They leveraged a unified backbone (SR-UNet) for multiple 3D tasks including classification, semantic segmentation, and object detection, and learned representations on a relatively large and diverse scene-level point cloud dataset, ScanNet [18]. They demonstrated, for the first time, that network weights pre-trained on 3D partial frames can lead to a performance boost when fine-tuned on 3D semantic segmentation and object detection tasks.

Since PointContrast brought new insights to the community, a series of unsupervised pre-training works have been proposed for scene-level 3D tasks: Hou et al. [47] integrated spatial contexts into the pre-training objective by partitioning the space into spatially inhomogeneous cells for correspondence, which mitigates the constraint of PointContrast that spatial configurations and contexts in scenes are disregarded; they further proved the effect of unsupervised pre-training on the instance segmentation task. Hou et al. [48] built different correspondences from RGB-D images for contrastive learning, including 2D correspondences from multi-view images as well as 2D-3D correspondences between images and point clouds. Zhang et al. [134] instead designed a new contrastive learning approach, DepthContrast: they combined a point-based network and a sparse voxel-based network to capture different representations from the same point clouds and conducted contrastive learning between the two networks, demonstrating that joint training over the two data formats reaches better pre-training effects in both detection and segmentation tasks. Rao et al. [84] first generated two random scenes with the same set of synthetic objects from ShapeNet [8] and built sample pairs accordingly; they then conducted instance contrastive learning on the two generated scenes to pre-train networks for 3D detection.

Another approach to learning context similarity is clustering. Samples are first grouped into clusters by clustering algorithms such as K-Means [41]. By assigning cluster ids as pseudo-labels to samples, networks can be trained with supervised classification and learn the semantic structure of the data distribution. A typical example is DeepCluster [6], the first unsupervised clustering method for 2D visual representation learning. However, no prior studies adopted a purely clustering-based strategy for URL of point clouds. Instead, hybrid approaches have been proposed that integrate clustering with other unsupervised learning approaches such as reconstruction [42] or contrastive learning [132].

5.2.2 Learning with spatial context structure

Fig. 17: The pipeline of 3DJigsaw [93]: An object is split into voxels where each point is assigned with a voxel label. The split voxels are randomly rearranged via pre-processing, and a deep neural network is trained to predict the voxel label for each point. The graph is reproduced based on [93].

Point cloud data with spatial coordinates provide accurate geometric representations of 3D shapes and environments. The rich spatial contexts they contain can be exploited by pre-text tasks for point cloud representation learning. For example, networks can be trained to sort out the relations among different parts of objects, extracting spatial context information during the process. Likewise, the learned weights can then be transferred to downstream tasks. This type of method requires no human annotations for training and thus belongs to unsupervised learning.

The key to this type of method is the pre-text task designed to exploit spatial context from point clouds. A pioneering work [93] proposed to learn by solving 3D jigsaws. As illustrated in Fig. 17, objects are first split into voxels, and each point is assigned a voxel label. The network is then fed with randomly rearranged point clouds and optimized to predict the correct voxel label for each point. During training, the network extracts spatial relations and geometric information from point clouds, which is exploited as pre-trained knowledge for downstream tasks including object classification and part segmentation. The authors further designed another pre-text task [92] that predicts one of ten spatial relationships between two parts of the same object, e.g., "part A is above/below part B" or "part A is behind part B". Inspired by 2D rotation angle prediction [32], Poursaeed et al. [75] leveraged unsupervised learning by predicting the rotation angles of 3D objects. Thabet et al. [104] predicted the next point in a point sequence defined by the Morton-order space filling curve. Chen et al. [15] proposed to learn the spatial context of objects by distinguishing distorted parts of a shape from correct ones. Sun et al. [102] designed a pre-text task that uses spatial context cues for point cloud unsupervised learning: they first mix point cloud objects and then train an encoder-decoder network to disentangle the mixture back into two objects.
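
A rough NumPy sketch of the voxel-labelling and voxel-shuffling step behind such a 3D jigsaw pre-text task; the grid size and rearrangement scheme here are illustrative and not the exact procedure of [93]:

```python
import numpy as np

def voxel_labels(points, k=3):
    """Label each point with its voxel index in a k x k x k grid over the bounding box."""
    mins, maxs = points.min(0), points.max(0)
    idx = np.clip(((points - mins) / (maxs - mins + 1e-9) * k).astype(int), 0, k - 1)
    return idx[:, 0] * k * k + idx[:, 1] * k + idx[:, 2]          # flat voxel id in [0, k^3)

def shuffle_voxels(points, labels, k=3):
    """Randomly rearrange voxels: move every point into a permuted voxel, keeping its offset."""
    perm = np.random.permutation(k ** 3)
    centres = (np.stack(np.meshgrid(*[np.arange(k)] * 3, indexing="ij"), -1)
               .reshape(-1, 3) + 0.5) / k                         # normalised voxel centres
    mins, maxs = points.min(0), points.max(0)
    local = (points - mins) / (maxs - mins + 1e-9)                # positions inside [0,1]^3
    offset = local - centres[labels]                              # offset within own voxel
    return centres[perm[labels]] + offset                         # same offset, new voxel

pts = np.random.rand(1024, 3)
lbl = voxel_labels(pts)               # prediction target for the jigsaw pre-text task
scrambled = shuffle_voxels(pts, lbl)  # network input: rearranged point cloud
```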

5.2.3 Learning with temporal context structure

Point cloud sequences are a common type of point cloud data. Similar to video data, they consist of point cloud frames with rich temporal and spatial information. For example, a single RGB-D frame in ScanNet [18] can be converted into a point cloud, so a whole RGB-D video can produce a point cloud sequence; LiDAR sequential datasets [31, 5, 115] consist of point cloud scans, where each scan is collected within one sweep of the LiDAR sensor. Pre-text tasks can be designed to mine temporal contexts as supervision signals to learn useful representations from such data.

Fig. 18: The pipeline of STRL [52]: An Online Network learns spatial and temporal structures from two neighbouring point cloud frames. The figure is adopted from [52] with the authors' permission.

Recently, Huang et al. [52] proposed a simple yet effective framework named Spatio-Temporal Representation Learning (STRL). As shown in Fig. 18, they extended the BYOL [36] framework to learning 3D point cloud representations and regarded two nearby frames sampled from a point cloud sequence as a positive pair. By minimizing the mean squared error between the learned feature representations of sample pairs, the networks are able to learn temporally invariant information as well as semantic structures from unlabeled data. Chen et al. [16] first leveraged synthetic 3D shapes moving in static 3D environments to create dynamic scenarios and sampled pairs in temporal order; they then conducted contrastive learning to learn 3D representations with dynamic understanding, which are transferred to improve performance on downstream 3D scene understanding tasks.
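
A simplified PyTorch sketch of the BYOL-style objective over temporal pairs, assuming a generic point cloud encoder (here a stand-in MLP with max pooling); the momentum value, predictor shape and loss follow common BYOL practice rather than the exact STRL configuration:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in encoder: per-point MLP + max pooling -> 256-d global embedding.
encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 256))
predictor = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
target_encoder = copy.deepcopy(encoder)            # momentum (target) branch
for p in target_encoder.parameters():
    p.requires_grad = False

def embed(net, pts):                                # pts: (B, N, 3)
    return net(pts).max(dim=1).values               # (B, 256)

frame_t  = torch.randn(4, 2048, 3)                  # two neighbouring frames
frame_t1 = torch.randn(4, 2048, 3)                  # of a point cloud sequence

online = F.normalize(predictor(embed(encoder, frame_t)), dim=1)
with torch.no_grad():
    target = F.normalize(embed(target_encoder, frame_t1), dim=1)

loss = (2 - 2 * (online * target).sum(dim=1)).mean()   # normalized MSE (BYOL loss)
loss.backward()

# Momentum update of the target branch.
with torch.no_grad():
    for p_t, p_o in zip(target_encoder.parameters(), encoder.parameters()):
        p_t.mul_(0.99).add_(0.01 * p_o)
```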

Unsupervised learning with temporal context structures has proved effective in 2D vision, with a series of methods having been proposed [26, 100, 49, 62], while few works [52, 16] have explored it for point cloud data. This direction is therefore promising, and more related methods and techniques are needed.

5.2.4 Discussion

Learning on contexts is a newly rising direction for unsupervised point cloud representation learning and has attracted increasing attention in recent years. While the majority of URL methods for point clouds are designed for object-level representation learning, several recent context-based works [119, 52, 134, 17] have shown that the learned representations can generalize across domains by boosting the performance of different scene-level tasks. These findings inspire research on URL for point clouds and encourage more work on unsupervised pre-text task design for 3D deep representation learning.

5.3 Multiple modal-based methods

Fig. 19: The pipeline of CMCV [58]: CMCV employs a 2D CNN to extract 2D image features from rendered views of 3D objects and a 3D GCN to extract 3D features directly from point clouds. The two types of features are then concatenated and processed by a two-layer fully connected network (FCN) to predict cross-modality correspondences. The figure is reproduced based on [58].

Multiple modalities (e.g. images [31] and natural language descriptions [9]) can provide additional information for point cloud data, which helps to train better representation learning models for point clouds. The correspondence between different modalities can also serve as supervision for unsupervised learning. However, few works have been proposed in this direction.

Recently, Jing et al. [58] proposed to jointly learn 2D image features and 3D point cloud features by exploiting cross-modality and cross-view correspondences. As Fig. 19 shows, their model consists of a convolutional neural network that extracts 2D features from images rendered from the point clouds, and a graph convolutional neural network that extracts 3D features from the point clouds themselves. The whole network is optimized by estimating the cross-modality correspondence between features of the two modalities. After unsupervised pre-training, the learned convolutional neural network and graph convolutional neural network can be fine-tuned on 2D and 3D downstream tasks, respectively. Since the two data formats are exploited together, the model learns multi-modal information jointly while solving the pre-text task, and the performance improvements on downstream applications are evident.
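A rough sketch of such a cross-modality correspondence pre-text task is given below; the 2D and 3D backbone modules, feature dimensions, and the way negative pairs are drawn (rolling the point clouds within a batch) are illustrative assumptions, not the exact CMCV [58] design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalCorrespondence(nn.Module):
    """Binary classifier over concatenated 2D-image and 3D-point features."""
    def __init__(self, net2d, net3d, dim2d=512, dim3d=512):
        super().__init__()
        self.net2d, self.net3d = net2d, net3d           # e.g. a 2D CNN and a 3D GCN
        self.head = nn.Sequential(                      # small FCN over the concatenation
            nn.Linear(dim2d + dim3d, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, images, points):
        f2d = self.net2d(images)                        # (B, dim2d) global image features
        f3d = self.net3d(points)                        # (B, dim3d) global point features
        return self.head(torch.cat([f2d, f3d], dim=1))  # (B, 1) correspondence logits

def correspondence_loss(model, images, points):
    """Positives: (image, point cloud) of the same object.
    Negatives: images paired with point clouds rolled within the batch."""
    pos = model(images, points)
    neg = model(images, points.roll(shifts=1, dims=0))
    logits = torch.cat([pos, neg]).squeeze(1)
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)]).squeeze(1)
    return F.binary_cross_entropy_with_logits(logits, labels)
```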

5.4 Local descriptor-based methods

The above-introduced methods aim to learn semantic structures of point clouds for high-level understanding, while local descriptor-based methods instead focus on learning low-level information from point clouds. PPF-FoldNet [19] extracts rotation-invariant 3D local descriptors for 3D matching [129]; it inherits ideas from PointNet [78], FoldingNet [125] and PPFNet [20] and formulates a new approach for unsupervised 3D local feature extraction. Several works [130, 63] designed the pre-text task of non-rigid shape correspondence for URL of point clouds, which aims to find the point-to-point correspondence between two deformable 3D shapes. Jiang et al. [55] proposed to learn feature representations through unsupervised 3D registration. They designed a sampling-network-guided cross-entropy method to find the optimal rigid transformation, consisting of a rotation matrix and a translation vector, that precisely aligns the source point cloud to the target.
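For context, a rigid transformation in point cloud registration is fully specified by a rotation matrix R and a translation vector t. The short NumPy sketch below is only an illustration, not the sampling-network method of [55]: it applies a candidate (R, t) to a source cloud and scores the alignment by the mean nearest-neighbour distance to the target.

```python
import numpy as np
from scipy.spatial import cKDTree

def apply_rigid(points, R, t):
    """Apply a rigid transform: each source point x becomes R @ x + t."""
    return points @ R.T + t

def alignment_error(source, target, R, t):
    """Mean nearest-neighbour distance of the transformed source to the target,
    a simple score for how well a candidate (R, t) aligns the two clouds."""
    moved = apply_rigid(source, R, t)
    dists, _ = cKDTree(target).query(moved, k=1)
    return dists.mean()

# Toy example: a 30-degree rotation about the z-axis plus a translation.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.1, -0.2, 0.05])
source = np.random.rand(1024, 3)
target = apply_rigid(source, R, t)            # perfectly aligned target for illustration
print(alignment_error(source, target, R, t))  # ~0 when (R, t) is correct
```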

The performance of local descriptor-based methods is mainly evaluated on low-level tasks, while their generalizability to other tasks, including high-level 3D understanding, is rarely discussed; this remains an open application domain for the community.

6 Benchmark performances

This section compares the performance of existing unsupervised learning methods on public point cloud datasets. As introduced in Section 2.2, the quality of representations learned by unsupervised learning methods is often evaluated on downstream tasks, including object-level tasks (object classification and object part segmentation) and scene-level tasks (semantic segmentation, instance segmentation and object detection). We compare them by task in the rest of this section.

6.1 Object-level tasks

6.1.1 Object classification

Method Year Pre-text task Backbone Pre-train dataset ModelNet10 ModelNet40
Supervised learning 2017 - PointNet [78] - - 89.2
2017 - PointNet++ [79] - - 90.7
2019 - DGCNN [111] - - 93.5
2019 - RSCNN [71] - - 93.6
SPH [59] 2003 Generation - ShapeNet 79.8 68.2
LFD [10] 2003 Generation - ShapeNet 79.9 75.5
TL-Net [33] 2016 Generation - ShapeNet - 74.4
VConv-DAE [94] 2016 Generation - ShapeNet 80.5 75.5
3D-GAN [113] 2016 Generation - ShapeNet 91.0 83.3
3D DescriptorNet [118] 2018 Generation - ShapeNet - 92.4
FoldingNet [125] 2018 Generation - ModelNet40 91.9 84.4
FoldingNet [125] 2018 Generation - ShapeNet 94.4 88.4
Latent-GAN [1] 2018 Generation - ModelNet40 92.2 87.3
Latent-GAN [1] 2018 Generation - ShapeNet 95.3 85.7
MRTNet [28] 2018 Generation - ShapeNet 86.4 -
VIP-GAN [39] 2019 Generation - ShapeNet 94.1 92.0
3DCapsuleNet [136] 2019 Generation - ShapeNet - 88.9
PC-GAN [64] 2019 Generation - ModelNet40 - 87.8
L2G-AE [70] 2019 Generation - ShapeNet 95.4 90.6
MAP-VAE [40] 2019 Generation - ShapeNet 94.8 90.2
PointFlow [123] 2019 Generation - ShapeNet 93.7 86.8
MultiTask [42] 2019 Hybrid - ShapeNet - 89.1
Jigsaw3D [93] 2019 Context PointNet ShapeNet 91.6 87.3
Jigsaw3D [93] 2019 Context DGCNN ShapeNet 94.5 90.6
ClusterNet [132] 2019 Context DGCNN ShapeNet 93.8 86.8
CloudContext [92] 2019 Context DGCNN ShapeNet 94.5 89.3
NeuralSampler [86] 2019 Generation - ShapeNet 95.3 88.7
PointGrow [103] 2020 Generation - ShapeNet 85.8 -
Info3D [91] 2020 Context PointNet ShapeNet - 89.8
Info3D [91] 2020 Context DGCNN ShapeNet - 91.6
ACD [27] 2020 Context PointNet++ ShapeNet - 89.8
PDL [96] 2020 Generation - ShapeNet - 84.7
GLR [85] 2020 Hybrid PointNet++ ShapeNet 94.8 92.2
GLR [85] 2020 Hybrid RSCNN ShapeNet 94.6 92.2
SA-Net-cls [112] 2020 Generation - ShapeNet - 90.6
GraphTER [30] 2020 Generation - ModelNet40 - 89.1
Rotation3D [75] 2020 Context PointNet ShapeNet - 88.6
Rotation3D [75] 2020 Context DGCNN ShapeNet - 90.8
MID [109] 2020 Context HRNet ShapeNet - 90.3
GTIF [12] 2020 Generation HRNet ShapeNet 95.9 89.6
HNS [23] 2021 Context DGCNN ShapeNet - 89.6
ParAE [24] 2021 Generation PointNet ShapeNet - 90.3
ParAE [24] 2021 Generation DGCNN ShapeNet - 91.6
CMCV [58] 2021 Context DGCNN ShapeNet - 89.8
GSIR [11] 2021 Context DGCNN ModelNet40 - 90.4
STRL [52] 2021 Context PointNet ShapeNet - 88.3
STRL [52] 2021 Context DGCNN ShapeNet - 90.9
PSG-Net [124] 2021 Generation PointNet++ ShapeNet - 90.9
SelfCorrection [15] 2021 Hybrid PointNet ShapeNet 93.3 89.9
SelfCorrection [15] 2021 Hybrid RSCNN ShapeNet 95.0 92.4
TABLE IV: Comparisons of linear shape classification performance on the ModelNet10 and ModelNet40 datasets [114]: a linear SVM classifier is trained on the representations learned by each unsupervised method.
Method Backbone Accuracy (%)
Jigsaw3D [93] PointNet 89.2/89.6(+0.4)
Info3D [91] PointNet 89.2/90.2(+1.0)
SelfCorrection [15] PointNet 89.1/90.0(+0.9)
OcCo [108] PointNet 89.2/90.1(+0.9)
ParAE [24] PointNet 89.2/90.5(+1.3)
Jigsaw3D [93] PCN 89.3/89.6(+0.3)
OcCo [108] PCN 89.3/90.3(+1.0)
GLR [85] RSCNN 91.8/92.2(+0.5)
SelfCorrection [15] RSCNN 91.7/93.0(+1.3)
Jigsaw3D [93] DGCNN 92.2/92.4(+0.2)
Info3D [91] DGCNN 93.5/93.0(-0.5)
OcCo [108] DGCNN 92.5/93.0(+0.5)
ParAE [24] DGCNN 92.2/92.9(+0.7)
STRL [52] DGCNN 92.2/93.1(+0.9)
TABLE V: Comparisons of pre-training effects of unsupervised learning methods for object classification on the ModelNet40 dataset. Metric "A/B": "A" is the accuracy when training from scratch, while "B" is the accuracy with unsupervised pre-training.

Object classification is the most commonly used downstream task for evaluation since the majority of existing methods learn point cloud representations on object-level datasets. As introduced in Section 2.2.1, there are two classification protocols, i.e. the linear classification protocol and the fine-tuning protocol.

Linear classification on object datasets has been widely tested in the literature. Networks are first trained on the ShapeNet [8] or ModelNet40 [114] dataset with an unsupervised learning method. A linear SVM classifier is then trained on the learned features of the ModelNet10 or ModelNet40 [114] training set, and the classification results on the testing set are reported to evaluate the quality of the learned point cloud object representations. Table IV lists the performance of existing methods in comparison with supervised learning. The performance of unsupervised learning methods keeps improving and some have even surpassed supervised learning methods, indicating the effectiveness and potential of unsupervised point cloud representation learning.
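A minimal sketch of this linear evaluation protocol is given below, assuming a frozen pre-trained `encoder` that maps a batch of point clouds to global feature vectors and standard PyTorch data loaders; these names and the SVM hyper-parameter are placeholders rather than a fixed recipe.

```python
import numpy as np
import torch
from sklearn.svm import LinearSVC

@torch.no_grad()
def extract_features(encoder, loader, device="cuda"):
    """Run the frozen pre-trained encoder over a dataset and collect
    (global feature, label) pairs as NumPy arrays."""
    encoder.eval()
    feats, labels = [], []
    for points, label in loader:                     # points: (B, N, 3)
        feats.append(encoder(points.to(device)).cpu().numpy())
        labels.append(label.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def linear_svm_accuracy(encoder, train_loader, test_loader):
    """Linear evaluation: fit a linear SVM on frozen features of the
    training split and report accuracy on the test split."""
    x_tr, y_tr = extract_features(encoder, train_loader)
    x_te, y_te = extract_features(encoder, test_loader)
    clf = LinearSVC(C=1.0).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)
```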

Method Type Backbone class mIoU instance mIoU
PointNet Sup. PointNet 80.4 83.7
PointNet++ Sup. PointNet++ 81.9 85.1
DGCNN Sup. DGCNN 82.3 85.1
RSCNN Sup. RSCNN 84.0 86.2
Latent-GAN [1] Unsup. - 57.0 -
MAP-VAE [40] Unsup. - 68.0 -
CloudContext [92] Unsup. DGCNN - 81.5
GraphTER [30] Unsup. - 78.1 81.9
MID [109] Unsup. HRNet 83.4 84.6
HNS [23] Unsup. DGCNN 79.9 82.3
CMCV [58] Unsup. DGCNN 74.7 80.8
SO-Net [65] Trans. SO-Net -/- 84.6/84.9(+0.3)
Jigsaw3D [93] Trans. DGCNN 82.3/83.1(+0.8) 85.1/85.3(+0.2)
MID [109] Trans. HRNet 84.6/85.2(+0.6) 85.5/85.8(+0.3)
CMCV [58] Trans. DGCNN 77.6/79.1(+1.5) 83.0/83.7(+0.7)
OcCo [108] Trans. PointNet 82.2/83.4(+1.2) -/-
OcCo [108] Trans. DGCNN 84.4/85.0(+0.6) -/-
TABLE VI: Comparisons of shape part segmentation on the ShapeNetPart dataset [8]. "Sup." denotes training segmentation models from scratch in a supervised manner; "Unsup." denotes training a linear classifier over the unsupervised point features; "Trans." denotes fine-tuning segmentation models initialized with the unsupervised pre-trained weights in a supervised manner.
Method OA on area 5 with different train area mIoU on area 5 with different train area
Area1 Area2 Area3 Area4 Area6 Area1 Area2 Area3 Area4 Area6
from scratch 82.9 81.2 82.8 82.8 83.1 43.6 34.6 39.9 39.4 43.9
Jigsaw3D [93] 83.5(+0.6) 81.2(+0.0) 84.0(+1.2) 82.9(+0.1) 83.3(+0.2) 44.7(+1.1) 34.9(+0.3) 42.4(+2.5) 39.9(+0.5) 43.9(+0.0)
ParAE [24] 91.8(+8.9) 82.3(+1.1) 89.5(+6.7) 88.2(+5.4) 86.4(+3.3) 53.5(+9.9) 38.5(+3.9) 48.4(+8.5) 45.0(+5.6) 49.2(+5.3)
Method OA on area 6 with different train area mIoU on area 6 with different train area
Area1 Area2 Area3 Area4 Area5 Area1 Area2 Area3 Area4 Area5
from scratch 84.6 70.6 77.7 73.6 76.9 57.9 38.9 49.5 38.5 48.6
STRL [52] 85.3(+0.7) 72.4(+1.8) 79.1(+1.4) 73.8(+0.2) 77.3(+0.4) 59.2(+1.3) 39.2(+0.8) 51.9(+2.4) 39.3(+0.8) 49.5(+0.9)
TABLE VII: Semantic segmentation on the S3DIS dataset [3], comparing supervised training performance under random weight initialization vs. weights learned from unsupervised pre-training tasks. DGCNN is used as the segmentation model; it is trained on a single Area and tested on Area 5 (upper part) or Area 6 (lower part).
Method Backbone mACC mIoU
from scratch SR-UNet 75.5 68.2
PointContrast [119] 77.0 70.9
DepthContrast [134] - 70.6
Method Backbone OA mIoU
from scratch PointNet 78.2 47.0
Jigsaw3D [93] 80.1 52.6
OcCo [108] 82.0 54.9
from scratch PCN 82.9 51.1
Jigsaw3D [93] 83.7 52.2
OcCo [108] 85.1 53.4
from scratch DGCNN 83.7 54.9
Jigsaw3D [93] 84.1 55.6
OcCo [108] 84.6 58.0
TABLE VIII: Performance of semantic segmentation on S3DIS [3]. Upper part: models are trained on the rest of the data and tested on Area 5 (Fold #1). Lower part: six-fold cross-validation averaged over three runs.

Another evaluation protocol is fine-tuning the unsupervised learned models on the object classification task, and Table V lists the performance on the ModelNet40 dataset. On one hand, classification models initialized with unsupervised pre-trained weights almost always reach better classification performance than models trained from scratch with random initialization, regardless of the backbone architecture. On the other hand, the performance gains are still limited (less than 1.5%). One possible explanation is the limited size and diversity of the pre-training datasets, including ShapeNet and ModelNet40. As a comparison, state-of-the-art methods [44, 36] for 2D image unsupervised pre-training reach much larger performance gains on classification, and the 2D dataset ImageNet [21] is far larger and more diverse.
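In code, the fine-tuning protocol of Table V amounts to initializing the backbone from an unsupervised checkpoint before supervised training, rather than from random weights. The sketch below is only illustrative; the checkpoint path, feature dimension, and classifier head are assumptions, not a prescribed setup.

```python
import torch
import torch.nn as nn

def build_classifier(backbone, num_classes=40, feat_dim=1024,
                     pretrained_ckpt=None):
    """Object classification model: pre-trained (or randomly initialized)
    backbone plus a small classification head, fine-tuned end to end."""
    if pretrained_ckpt is not None:
        state = torch.load(pretrained_ckpt, map_location="cpu")
        # strict=False: keep only weights whose names/shapes match the backbone.
        backbone.load_state_dict(state, strict=False)
    head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                         nn.Dropout(0.5), nn.Linear(256, num_classes))
    return nn.Sequential(backbone, head)

# "A" in Table V: build_classifier(backbone)                      -> train from scratch
# "B" in Table V: build_classifier(backbone,
#                                  pretrained_ckpt="url_weights.pth")
#                 -> fine-tune from unsupervised pre-training (hypothetical path)
```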

6.1.2 Object part segmentation

Method Backbone Input SUN RGB-D ScanNet-V2
mAP@0.5 mAP@0.25 mAP@0.5 mAP@0.25
from scratch SR-UNet Geo 31.7 55.6 35.4 56.7
PointContrast [119] Geo 34.8 57.5 38.0 58.5
from scratch VoteNet Geo 32.9 57.7 33.5 58.6
STRL [52] Geo - 58.2 - -
RandRooms [84] Geo 35.4 59.2 36.2 61.3
DepthContrast [134] Geo - - - 62.2
CSC [47] Geo 33.6 - - -
PointContrast [119] Geo 34.0 - 38.0 -
4DContrast [16] Geo 34.4 - 39.3 -
from scratch PointNet++ Geo - 57.5 - 58.6
PointContrast [119] Geo - 57.9 - 58.5
RandRooms [84] Geo - 59.2 - 61.3
DepthContrast [134] Geo - 60.7 - -
from scratch H3DNet Geo 39.0 60.1 48.1 67.3
RandRooms [84] Geo 43.1 61.6 51.5 68.6
TABLE IX: Comparison of pre-training effects by unsupervised learning methods for 3D object detection on SUN RGB-D [99] and ScanNet-V2 [18] datasets.
Method Vehicle Pedestrian Cyclist mAP
Baseline [121] 69.7 26.1 59.9 51.9
BYOL [36] 67.6 17.2 53.4 46.1 (-5.8)
PointContrast [119] 71.5 22.7 58.0 50.8 (-0.1)
SwAV [7] 72.3 25.1 60.7 52.7 (+0.8)
DeepCluster [6] 72.1 27.6 50.3 53.3 (+1.4)
BYOL [36] 69.7 27.3 57.2 51.4 (-0.5)
PointContrast [119] 70.2 29.2 58.9 52.8 (+0.9)
SwAV [7] 72.1 28.0 60.2 53.4 (+1.5)
DeepCluster [6] 72.1 30.1 60.5 54.2 (+2.3)
BYOL [36] 72.2 23.6 60.5 52.1 (+0.2)
PointContrast [119] 73.2 27.5 58.3 53.0 (+1.1)
SwAV [7] 72.0 30.6 60.3 54.3 (+2.4)
DeepCluster [6] 71.9 30.5 60.4 54.3 (+2.4)
TABLE X: Results of object detection on the ONCE dataset [72]. The baseline is trained from scratch; the unsupervised learning methods are used to pre-train the models. The successive groups of rows correspond to small, medium, and large amounts of unlabeled data used for unsupervised learning, respectively.
Method Backbone S3DIS ScanNet
from scratch SR-UNet 59.3 53.4
PointContrast [119] 60.5 55.8
CSC [47] 63.4 56.5
4DContrast [16] - 57.6
TABLE XI: Performance of instance segmentation on the S3DIS [3] and ScanNet-V2 [18] datasets. The mean average precision (mAP) across all semantic classes with a 3D IoU threshold of 0.25 is reported.

Table VI summarizes the performance of object part segmentation on the ShapeNetPart dataset [8]. The mIoU over instances and the mIoU over categories on the test set are reported. Similar to the classification task, two protocols are commonly used for evaluation: one trains a per-point linear classifier on top of the frozen unsupervised network ("Unsup." in the table); the other fine-tunes the unsupervised pre-trained model on the object part segmentation task ("Trans." in the table). As the table shows, the performance gap between unsupervised and supervised learning ("Sup." in the table) is narrowing. Moreover, the knowledge from unsupervised pre-training consistently helps to train better object part segmentation models, although the performance gains are limited.
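For reference, the two mIoU variants reported in Table VI are commonly computed as follows: the instance mIoU averages the per-shape IoU (itself averaged over that shape's parts, with an absent part conventionally counted as IoU 1) over all test shapes, while the class mIoU first averages shape IoUs within each object category and then over the categories. A minimal NumPy sketch under these assumptions:

```python
import numpy as np

def shape_iou(pred, gt, part_ids):
    """Mean IoU over the parts of a single shape (per-point label arrays)."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        ious.append(1.0 if union == 0 else inter / union)  # absent part counts as 1
    return float(np.mean(ious))

def miou(shapes):
    """shapes: list of (pred, gt, category, part_ids) for every test shape.
    Returns (instance mIoU, class mIoU)."""
    per_shape, per_class = [], {}
    for pred, gt, cat, part_ids in shapes:
        iou = shape_iou(pred, gt, part_ids)
        per_shape.append(iou)
        per_class.setdefault(cat, []).append(iou)
    instance_miou = float(np.mean(per_shape))
    class_miou = float(np.mean([np.mean(v) for v in per_class.values()]))
    return instance_miou, class_miou
```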

6.2 Scene-level tasks

As introduced in Section 5.2, unsupervised pre-training for scene-level 3D tasks is a relatively new direction in the research of URL on point clouds. Compared with object-level tasks, only a few works [119, 52, 134, 16] have evaluated the transferability of the learned features to high-level 3D tasks, including semantic segmentation, object detection and instance segmentation. The same evaluation protocol is used for all tasks, i.e. comparing the performance of downstream models trained from scratch against models initialized by unsupervised pre-training.

Table VII and Table VIII list the semantic segmentation performance on the S3DIS [3] dataset. We summarize them separately since different setups have been used in prior works: Table VII follows the semi-supervised setup where the segmentation model DGCNN is trained on a single area and tested on either Area 5 or Area 6. In contrast, Table VIII reports performance when training on the whole dataset, following the one-fold (upper part of the table) and six-fold cross-validation (lower part of the table) setups, respectively. Table IX compares object detection performance on the indoor datasets SUN RGB-D [99] and ScanNet-V2 [18]; Table X provides benchmark performance of object detection on the outdoor LiDAR dataset ONCE [72]. In addition, Table XI summarizes the instance segmentation performance on S3DIS [3] and ScanNet-V2 [18].

It is inspiring to see that unsupervised representation learning can boost performance over multiple high-level 3D tasks compared with training from scratch, indicating cross-domain generalization ability and the great potential of unsupervised learning on point clouds. However, the improvements are still limited compared with its counterparts in 2D computer vision and NLP, and more research is encouraged in this area.

7 Future directions

Unsupervised representation learning for point clouds has achieved significant progress during the last decade. In this section, we discuss some future directions of this research field.

Unified 3D backbones are needed: One reason for the success of deep learning in 2D computer vision is the standardization of CNN architectures, with canonical examples including VGG [97] and ResNet [45]. Unified backbone structures facilitate knowledge transfer among different datasets and tasks. In contrast, although a series of 3D architectures have been proposed recently, how to design a unified 3D backbone remains under-explored. As a result, existing URL methods adopt different backbone models, as can be observed from the tables in Section 6, which complicates technical design and fair performance comparison and limits applications. Finding a proper universal backbone that can become as ubiquitous as ResNet in 2D vision is pivotal for the 3D deep learning community, including unsupervised representation learning.

Larger datasets are needed: As mentioned in Section 3, existing datasets used for unsupervised learning were mainly collected for supervised learning tasks. Due to the difficulty of human annotation, these datasets are small in scale and limited in diversity, which caps the upper bound of unsupervised learning methods; this can be seen from the limited performance improvements brought by unsupervised pre-training in the tables of Section 6. Larger indoor and outdoor point cloud benchmarks are needed as foundations for future research.

Learning features from multi-modal data: 3D sensors are often deployed together with other sensors that can provide information complementary to point clouds. For example, a LiDAR sensor, an RGB camera, GPS, and an IMU are often mounted together on autonomous vehicles and many robots. Learning correspondences among the multi-modal data captured by different sensors can serve as a pre-text task for unsupervised learning, yet few works have explored this direction.

Learning spatio-temporal features: 3D sensors that provide sequential data are becoming increasingly popular in real scenarios. Rich temporal information can be extracted as supervision signals for unsupervised learning, yet most existing works focus on static point clouds. More effective pre-text tasks specifically designed to learn spatio-temporal features are needed.

Learning features from synthetic data: Leveraging synthetic data rendered from virtual game engines to train networks is a common approach to mitigate the data collection and annotation burden in both 2D [89, 2, 29] and 3D [115] vision. Additional information, such as per-point normal vectors, can be obtained easily from game engines but is extremely difficult to collect in real scenarios. Pre-text tasks can be designed to learn the relations between point clouds and such additional information, and the networks pre-trained on synthetic data can then be transferred to the real domain. This strategy has been explored in 2D computer vision [87] but remains under-explored in point cloud representation learning. However, synthetic point clouds rendered from virtual environments usually exhibit a large "domain gap" with real point cloud data [115]; techniques to address this gap therefore also need to be investigated.

8 Conclusion

Unsupervised representation learning aims to learn effective representations from unannotated data and has demonstrated impressive progress and promising applications in the point cloud area. This paper gave a comprehensive survey of recent deep neural network-based methods for unsupervised point cloud representation learning. We first introduced commonly used datasets and deep network architectures, then presented a taxonomy and a detailed review of methods, and summarized and compared their performance over multiple 3D tasks. Finally, we discussed a list of future directions for unsupervised point cloud learning. We hope this work lays a strong and sound foundation for future research.

Acknowledgments

This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from Singapore Telecommunications Limited (Singtel), through Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU).

References

  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas (2018) Learning representations and generative models for 3d point clouds. In International conference on machine learning, pp. 40–49. Cited by: §5.1.2, §5.1.4, TABLE II, TABLE IV, TABLE VI.
  • [2] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese (2016) 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543. Cited by: §7.
  • [3] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese (2016) 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543. Cited by: Fig. 5, 4th item, TABLE I, §3, §6.2, TABLE XI, TABLE VII, TABLE VIII.
  • [4] Autoencoder (2022) Autoencoder — Wikipedia, the free encyclopedia. Note: [Online; accessed 16-Feb-2022] External Links: Link Cited by: §5.1.1.
  • [5] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall (2019) Semantickitti: a dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9297–9307. Cited by: §5.2.3.
  • [6] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §5.2.1, TABLE X.
  • [7] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems 33, pp. 9912–9924. Cited by: TABLE X.
  • [8] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: Fig. 3, 2nd item, TABLE I, §3, §5.2.1, §6.1.1, §6.1.2, TABLE VI.
  • [9] D. Z. Chen, A. X. Chang, and M. Nießner (2020) Scanrefer: 3d object localization in rgb-d scans using natural language. In European Conference on Computer Vision, pp. 202–221. Cited by: §5.3.
  • [10] D. Chen, X. Tian, Y. Shen, and M. Ouhyoung (2003) On visual similarity based 3d model retrieval. In Computer graphics forum, Vol. 22, pp. 223–232. Cited by: TABLE IV.
  • [11] H. Chen, S. Luo, X. Gao, and W. Hu (2021) Unsupervised learning of geometric sampling invariant representations for 3d point clouds. In 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 893–903. Cited by: §5.1.1, TABLE IV.
  • [12] S. Chen, C. Duan, Y. Yang, D. Li, C. Feng, and D. Tian (2020) Deep Unsupervised Learning of 3D Point Clouds via Graph Topology Inference and Filtering. IEEE Transactions on Image Processing 29, pp. 3183–3198. External Links: Document Cited by: §5.1.1, TABLE IV.
  • [13] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §5.2.1.
  • [14] X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §1.
  • [15] Y. Chen, J. Liu, B. Ni, H. Wang, J. Yang, N. Liu, T. Li, and Q. Tian (2021) Shape self-correction for unsupervised point cloud understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8382–8391. Cited by: §5.2.2, TABLE III, TABLE IV, TABLE V.
  • [16] Y. Chen, M. Nießner, and A. Dai (2021) 4DContrast: contrastive learning with dynamic correspondences for 3d scene understanding. arXiv preprint arXiv:2112.02990. Cited by: §5.2.3, §5.2.3, §6.2, TABLE XI, TABLE IX.
  • [17] C. Choy, J. Gwak, and S. Savarese (2019) 4d spatio-temporal convnets: minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3075–3084. Cited by: §4.3, §5.2.4.
  • [18] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5828–5839. Cited by: 3(a), Fig. 6, 5th item, TABLE I, §3, §5.2.1, §5.2.3, §6.2, TABLE XI, TABLE IX.
  • [19] H. Deng, T. Birdal, and S. Ilic (2018) Ppf-foldnet: unsupervised learning of rotation invariant 3d local descriptors. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 602–618. Cited by: §5.4.
  • [20] H. Deng, T. Birdal, and S. Ilic (2018) Ppfnet: global context aware local features for robust 3d point matching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 195–205. Cited by: §5.4.
  • [21] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §3, §6.1.1.
  • [22] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3, §5.1.5.
  • [23] B. Du, X. Gao, W. Hu, and X. Li (2021) Self-contrastive learning with hard negative sampling for self-supervised point cloud learning. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 3133–3142. Cited by: §5.2.1, TABLE III, TABLE IV, TABLE VI.
  • [24] B. Eckart, W. Yuan, C. Liu, and J. Kautz (2021) Self-supervised learning on 3d point clouds by learning discrete generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8248–8257. Cited by: TABLE IV, TABLE V, TABLE VII.
  • [25] H. Fan, H. Su, and L. J. Guibas (2017) A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 605–613. Cited by: §5.1.1.
  • [26] C. Feichtenhofer, H. Fan, B. Xiong, R. Girshick, and K. He (2021) A large-scale study on unsupervised spatiotemporal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3299–3309. Cited by: §5.2.3.
  • [27] M. Gadelha, A. RoyChowdhury, G. Sharma, E. Kalogerakis, L. Cao, E. Learned-Miller, R. Wang, and S. Maji (2020) Label-efficient learning on point clouds using approximate convex decompositions. In European Conference on Computer Vision, pp. 473–491. Cited by: §5.2.1, TABLE III, TABLE IV.
  • [28] M. Gadelha, R. Wang, and S. Maji (2018) Multiresolution tree networks for 3d point cloud processing. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 103–118. Cited by: §5.1.1, TABLE II, TABLE IV.
  • [29] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig (2016) Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4340–4349. Cited by: §7.
  • [30] X. Gao, W. Hu, and G. Qi (2020) GraphTER: unsupervised learning of graph transformation equivariant representations via auto-encoding node-wise transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7163–7172. Cited by: §5.1.1, TABLE II, TABLE IV, TABLE VI.
  • [31] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: 3(b), 7th item, TABLE I, §3, §5.2.3, §5.3.
  • [32] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, Cited by: §5.2.2.
  • [33] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta (2016) Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision, pp. 484–499. Cited by: §5.1.1, TABLE II, TABLE IV.
  • [34] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in neural information processing systems 27. Cited by: §5.1.2.
  • [35] B. Graham, M. Engelcke, and L. Van Der Maaten (2018) 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9224–9232. Cited by: §4.3.
  • [36] J. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Pires, Z. Guo, M. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. In Neural Information Processing Systems, Cited by: §1, §5.2.1, §5.2.3, §6.1.1, TABLE X.
  • [37] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry (2018) A papier-mâché approach to learning 3d surface generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 216–224. Cited by: §5.1.4.
  • [38] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun (2020) Deep learning for 3d point clouds: a survey. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §2.3, Fig. 10, §4.4, §4.
  • [39] Z. Han, M. Shang, Y. Liu, and M. Zwicker (2019) View inter-prediction gan: unsupervised representation learning for 3d shapes by learning global shape memories to support local view predictions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8376–8384. Cited by: TABLE II, TABLE IV.
  • [40] Z. Han, X. Wang, Y. Liu, and M. Zwicker (2019) Multi-angle point cloud-vae: unsupervised feature learning for 3d point clouds from multiple angles by joint self-reconstruction and half-to-half prediction. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10441–10450. Cited by: §5.1.1, TABLE II, TABLE IV, TABLE VI.
  • [41] J. A. Hartigan and M. A. Wong (1979) Algorithm as 136: a k-means clustering algorithm. Journal of the royal statistical society. series c (applied statistics) 28 (1), pp. 100–108. Cited by: §5.2.1.
  • [42] K. Hassani and M. Haley (2019) Unsupervised multi-task feature learning on point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8160–8171. Cited by: §5.2.1, TABLE III, TABLE IV.
  • [43] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2021) Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377. Cited by: §1, §5.1.4, §5.1.5.
  • [44] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §1, §5.2.1, §6.1.1.
  • [45] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4, §7.
  • [46] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard (2013) OctoMap: an efficient probabilistic 3d mapping framework based on octrees. Autonomous robots 34 (3), pp. 189–206. Cited by: §4.
  • [47] J. Hou, B. Graham, M. Nießner, and S. Xie (2021) Exploring data-efficient 3d scene understanding with contrastive scene contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15587–15597. Cited by: §1, §5.2.1, §5.2.1, TABLE III, TABLE XI, TABLE IX.
  • [48] J. Hou, S. Xie, B. Graham, A. Dai, and M. Nießner (2021-10) Pri3D: can 3d priors help 2d representation learning?. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5693–5702. Cited by: §5.2.1.
  • [49] K. Hu, J. Shao, Y. Liu, B. Raj, M. Savvides, and Z. Shen (2021) Contrast and order representations for video self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7939–7949. Cited by: §5.2.3.
  • [50] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and A. Markham (2020) Randla-net: efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11108–11117. Cited by: §4.1.
  • [51] Q. Huang, W. Wang, and U. Neumann (2018) Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2626–2635. Cited by: §4.
  • [52] S. Huang, Y. Xie, S. Zhu, and Y. Zhu (2021) Spatio-temporal self-supervised representation learning for 3d point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6535–6545. Cited by: §1, Fig. 18, §5.2.3, §5.2.3, §5.2.4, TABLE III, §6.2, TABLE IV, TABLE V, TABLE VII, TABLE IX.
  • [53] Z. Huang, Y. Yu, J. Xu, F. Ni, and X. Le (2020) PF-net: point fractal network for 3d point cloud completion. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 7659–7667. External Links: Document Cited by: §5.1.4.
  • [54] A. Ioannidou, E. Chatzilari, S. Nikolopoulos, and I. Kompatsiaris (2017) Deep learning advances in computer vision with 3d data: a survey. ACM Computing Surveys (CSUR) 50 (2), pp. 1–38. Cited by: §2.3.
  • [55] H. Jiang, Y. Shen, J. Xie, J. Li, J. Qian, and J. Yang (2021) Sampling network guided cross-entropy method for unsupervised point cloud registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6128–6137. Cited by: §5.4.
  • [56] J. Jiang, X. Lu, W. Ouyang, and M. Wang (2021) Unsupervised representation learning for 3d point cloud data. arXiv preprint arXiv:2110.06632. Cited by: §5.2.1.
  • [57] L. Jing and Y. Tian (2020) Self-supervised visual feature learning with deep neural networks: a survey. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.1, §2.3.
  • [58] L. Jing, L. Zhang, and Y. Tian (2021) Self-supervised feature learning by cross-modality and cross-view correspondences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1581–1591. Cited by: §1, Fig. 19, §5.3, TABLE IV, TABLE VI.
  • [59] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz (2003) Rotation invariant spherical harmonic representation of 3 d shape descriptors. In Symposium on geometry processing, Vol. 6, pp. 156–164. Cited by: TABLE IV.
  • [60] J. D. M. C. Kenton and L. K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171–4186. Cited by: §1, §5.1.4, §5.1.5.
  • [61] M. A. Kramer (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE journal 37 (2), pp. 233–243. Cited by: §5.1.1.
  • [62] H. Kuang, Y. Zhu, Z. Zhang, X. Li, J. Tighe, S. Schwertfeger, C. Stachniss, and M. Li (2021) Video contrastive learning with global context. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3195–3204. Cited by: §5.2.3.
  • [63] I. Lang, D. Ginzburg, S. Avidan, and D. Raviv (2021) DPC: unsupervised deep point correspondence via cross and self construction. In 2021 International Conference on 3D Vision (3DV), pp. 1442–1451. Cited by: §5.4.
  • [64] C. Li, M. Zaheer, Y. Zhang, B. Poczos, and R. Salakhutdinov (2018) Point cloud gan. arXiv preprint arXiv:1810.05795. Cited by: §5.1.2, TABLE IV.
  • [65] J. Li, B. M. Chen, and G. H. Lee (2018) So-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406. Cited by: §5.1.1, TABLE II, TABLE VI.
  • [66] R. Li, X. Li, C. Fu, D. Cohen-Or, and P. Heng (2019) Pu-gan: a point cloud upsampling adversarial network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7203–7212. Cited by: §5.1.2, §5.1.3.
  • [67] R. Li, X. Li, P. Heng, and C. Fu (2021) Point cloud upsampling via disentangled refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 344–353. Cited by: §5.1.3.
  • [68] M. Liu, L. Sheng, S. Yang, J. Shao, and S. Hu (2020) Morphing and sampling network for dense point cloud completion. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34, pp. 11596–11603. Cited by: §5.1.4.
  • [69] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang (2021) Self-supervised learning: generative or contrastive. IEEE Transactions on Knowledge and Data Engineering. Cited by: §2.1, §2.3.
  • [70] X. Liu, Z. Han, X. Wen, Y. Liu, and M. Zwicker (2019) L2g auto-encoder: understanding point clouds by local-to-global reconstruction with hierarchical self-attention. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 989–997. Cited by: §5.1.1, TABLE II, TABLE IV.
  • [71] Y. Liu, B. Fan, S. Xiang, and C. Pan (2019) Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8895–8904. Cited by: §4.4, §4.4, TABLE IV.
  • [72] J. Mao, M. Niu, C. Jiang, H. Liang, J. Chen, X. Liang, Y. Li, C. Ye, W. Zhang, Z. Li, et al. (2021) One million scenes for autonomous driving: once dataset. arXiv preprint arXiv:2106.11037. Cited by: 8th item, TABLE I, §3, §6.2, TABLE X.
  • [73] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger (2013) Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (ToG) 32 (6), pp. 1–11. Cited by: §4.
  • [74] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §5.2.1.
  • [75] O. Poursaeed, T. Jiang, H. Qiao, N. Xu, and V. G. Kim (2020) Self-supervised learning of point clouds via orientation estimation. In 2020 International Conference on 3D Vision (3DV), pp. 1018–1028. Cited by: §5.2.2, TABLE III, TABLE IV.
  • [76] C. R. Qi, O. Litany, K. He, and L. J. Guibas (2019) Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9277–9286. Cited by: Fig. 4, §2.2.3, §4.1.
  • [77] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 918–927. Cited by: Fig. 4.
  • [78] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §2.2.2, Fig. 7, §4.1, §4.1, §5.4, TABLE IV.
  • [79] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413. Cited by: §4.1, §4.1, TABLE IV.
  • [80] G. Qi and J. Luo (2020) Small data challenges in big data era: a survey of recent progress on unsupervised and semi-supervised methods. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.3.
  • [81] G. Qian, A. Abualshour, G. Li, A. Thabet, and B. Ghanem (2021) Pu-gcn: point cloud upsampling using graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11683–11692. Cited by: §5.1.3.
  • [82] Y. Qian, J. Hou, S. Kwong, and Y. He (2020) Pugeo-net: a geometry-centric network for 3d point cloud upsampling. In European Conference on Computer Vision, pp. 752–769. Cited by: §5.1.3.
  • [83] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §1, §5.1.4.
  • [84] Y. Rao, B. Liu, Y. Wei, J. Lu, C. Hsieh, and J. Zhou (2021) RandomRooms: unsupervised pre-training from synthetic shapes and randomized layouts for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3283–3292. Cited by: §5.2.1, TABLE III, TABLE IX.
  • [85] Y. Rao, J. Lu, and J. Zhou (2020) Global-local bidirectional reasoning for unsupervised representation learning of 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5376–5385. Cited by: §5.2.1, TABLE III, TABLE IV, TABLE V.
  • [86] E. Remelli, P. Baque, and P. Fua (2019) NeuralSampler: euclidean point cloud auto-encoder and sampler. arXiv preprint arXiv:1901.09394. Cited by: §5.1.3, TABLE IV.
  • [87] Z. Ren and Y. J. Lee (2018) Cross-domain self-supervised multi-task feature learning using synthetic imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 762–771. Cited by: §7.
  • [88] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: Fig. 9, §4.3.
  • [89] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez (2016) The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3234–3243. Cited by: §3, §7.
  • [90] S. Sabour, N. Frosst, and G. E. Hinton (2017) Dynamic routing between capsules. arXiv preprint arXiv:1710.09829. Cited by: §5.1.1.
  • [91] A. Sanghi (2020) Info3d: representation learning on 3d objects using mutual information maximization and contrastive learning. In European Conference on Computer Vision, pp. 626–642. Cited by: §5.2.1, TABLE III, TABLE IV, TABLE V.
  • [92] J. Sauder and B. Sievers (2019) Context prediction for unsupervised deep learning on point clouds. arXiv preprint arXiv:1901.08396 2 (4), pp. 5. Cited by: §5.2.2, TABLE IV, TABLE VI.
  • [93] J. Sauder and B. Sievers (2019) Self-supervised deep learning on point clouds by reconstructing space. Advances in Neural Information Processing Systems 32, pp. 12962–12972. Cited by: §1, Fig. 17, §5.2.2, TABLE III, TABLE IV, TABLE V, TABLE VI, TABLE VII, TABLE VIII.
  • [94] A. Sharma, O. Grau, and M. Fritz (2016) Vconv-dae: deep volumetric shape learning without object labels. In European Conference on Computer Vision, pp. 236–250. Cited by: §5.1.4, TABLE II, TABLE IV.
  • [95] S. Shi, X. Wang, and H. Li (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 770–779. Cited by: §2.2.3.
  • [96] Y. Shi, M. Xu, S. Yuan, and Y. Fang (2020) Unsupervised deep shape descriptor with point distribution learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9353–9362. Cited by: §5.1.1, TABLE II, TABLE IV.
  • [97] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4, §7.
  • [98] E. J. Smith and D. Meger (2017) Improved adversarial systems for 3d object generation and reconstruction. In Conference on Robot Learning, pp. 87–96. Cited by: §5.1.2.
  • [99] S. Song, S. P. Lichtenberg, and J. Xiao (2015) Sun rgb-d: a rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 567–576. Cited by: 6th item, TABLE I, §3, §6.2, TABLE IX.
  • [100] X. Song, S. Zhao, J. Yang, H. Yue, P. Xu, R. Hu, and H. Chai (2021) Spatio-temporal contrastive domain adaptation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9787–9795. Cited by: §5.2.3.
  • [101] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller (2015) Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pp. 945–953. Cited by: §4.
  • [102] C. Sun, Z. Zheng, X. Wang, M. Xu, and Y. Yang (2021) Point cloud pre-training by mixing and disentangling. arXiv preprint arXiv:2109.00452. Cited by: §5.2.2.
  • [103] Y. Sun, Y. Wang, Z. Liu, J. Siegel, and S. Sarma (2020) Pointgrow: autoregressively learned point cloud generation with self-attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 61–70. Cited by: §5.1.1, TABLE II, TABLE IV.
  • [104] A. Thabet, H. Alwassel, and B. Ghanem (2020) Self-supervised learning of local features in 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 938–939. Cited by: §5.2.2.
  • [105] H. Thomas, C. R. Qi, J. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas (2019) Kpconv: flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6411–6420. Cited by: §4.4.
  • [106] M. A. Uy, Q. Pham, B. Hua, T. Nguyen, and S. Yeung (2019) Revisiting point cloud classification: a new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1588–1597. Cited by: 3rd item, TABLE I, §3.
  • [107] D. Valsesia, G. Fracastoro, and E. Magli (2018) Learning localized generative models for 3d point clouds via graph convolution. In International conference on learning representations, Cited by: §1, §5.1.2, TABLE II.
  • [108] H. Wang, Q. Liu, X. Yue, J. Lasenby, and M. J. Kusner (2021-10) Unsupervised point cloud pre-training via occlusion completion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9782–9792. Cited by: §1, Fig. 14, §5.1.4, TABLE II, TABLE V, TABLE VI, TABLE VIII.
  • [109] P. Wang, Y. Yang, Q. Zou, Z. Wu, Y. Liu, and X. Tong (2020) Unsupervised 3d learning for shape analysis via multiresolution instance discrimination. ACM Trans. Graphic. Cited by: §5.2.1, TABLE IV, TABLE VI.
  • [110] W. Wang, Q. Huang, S. You, C. Yang, and U. Neumann (2017) Shape inpainting using 3d generative adversarial network and recurrent convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 2298–2306. Cited by: §5.1.2.
  • [111] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: §4.2, TABLE IV.
  • [112] X. Wen, T. Li, Z. Han, and Y. Liu (2020) Point cloud completion by skip-attention network with hierarchical folding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1939–1948. Cited by: §5.1.4, TABLE II, TABLE IV.
  • [113] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum (2016) Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 82–90. Cited by: §5.1.2, TABLE II, TABLE IV.
  • [114] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: 1st item, TABLE I, §3, §6.1.1, TABLE IV.
  • [115] A. Xiao, J. Huang, D. Guan, F. Zhan, and S. Lu (2021) SynLiDAR: learning from synthetic lidar sequential point cloud for semantic segmentation. arXiv preprint arXiv:2107.05399. Cited by: §5.2.3, §7.
  • [116] A. Xiao, X. Yang, S. Lu, D. Guan, and J. Huang (2021) FPS-net: a convolutional fusion network for large-scale lidar point cloud segmentation. ISPRS Journal of Photogrammetry and Remote Sensing 176, pp. 237–249. Cited by: §4.
  • [117] C. Xie, C. Wang, B. Zhang, H. Yang, D. Chen, and F. Wen (2021) Style-based point generator with adversarial rendering for point cloud completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4619–4628. Cited by: §5.1.4.
  • [118] J. Xie, Z. Zheng, R. Gao, W. Wang, S. Zhu, and Y. N. Wu (2018) Learning descriptor networks for 3d shape synthesis and analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8629–8638. Cited by: §5.1.4, TABLE II, TABLE IV.
  • [119] S. Xie, J. Gu, D. Guo, C. R. Qi, L. Guibas, and O. Litany (2020) Pointcontrast: unsupervised pre-training for 3d point cloud understanding. In European Conference on Computer Vision, pp. 574–591. Cited by: Fig. 9, §4.3, Fig. 16, §5.2.1, §5.2.1, §5.2.4, TABLE III, §6.2, TABLE X, TABLE XI, TABLE VIII, TABLE IX.
  • [120] Y. Xie, J. Tian, and X. X. Zhu (2020) Linking points with labels in 3d: a review of point cloud semantic segmentation. IEEE Geoscience and Remote Sensing Magazine 8 (4), pp. 38–59. Cited by: §2.3.
  • [121] Y. Yan, Y. Mao, and B. Li (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: TABLE X.
  • [122] B. Yang, H. Wen, S. Wang, R. Clark, A. Markham, and N. Trigoni (2017) 3d object reconstruction from a single depth view with adversarial learning. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 679–688. Cited by: §5.1.2.
  • [123] G. Yang, X. Huang, Z. Hao, M. Liu, S. Belongie, and B. Hariharan (2019) Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4541–4550. Cited by: §5.1.1, TABLE II, TABLE IV.
  • [124] J. Yang, P. Ahn, D. Kim, H. Lee, and J. Kim (2021) Progressive seed generation auto-encoder for unsupervised point cloud learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6413–6422. Cited by: §5.1.1, TABLE II, TABLE IV.
  • [125] Y. Yang, C. Feng, Y. Shen, and D. Tian (2018) Foldingnet: point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 206–215. Cited by: §5.1.1, §5.1.1, §5.1.4, §5.4, TABLE II, TABLE IV.
  • [126] W. Yifan, S. Wu, H. Huang, D. Cohen-Or, and O. Sorkine-Hornung (2019) Patch-based progressive 3d point set upsampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5958–5967. Cited by: §5.1.3.
  • [127] L. Yu, X. Li, C. Fu, D. Cohen-Or, and P. Heng (2018) Pu-net: point cloud upsampling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2790–2799. Cited by: §5.1.3.
  • [128] W. Yuan, T. Khot, D. Held, C. Mertz, and M. Hebert (2018) Pcn: point completion network. In 2018 International Conference on 3D Vision (3DV), pp. 728–737. Cited by: §5.1.4.
  • [129] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser (2017) 3dmatch: learning local geometric descriptors from rgb-d reconstructions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1802–1811. Cited by: §5.4.
  • [130] Y. Zeng, Y. Qian, Z. Zhu, J. Hou, H. Yuan, and Y. He (2021) CorrNet3D: unsupervised end-to-end learning of dense correspondence for 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6052–6061. Cited by: §5.4.
  • [131] J. Zhang, X. Chen, Z. Cai, L. Pan, H. Zhao, S. Yi, C. K. Yeo, B. Dai, and C. C. Loy (2021) Unsupervised 3d shape completion through gan inversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1768–1777. Cited by: §5.1.2.
  • [132] L. Zhang and Z. Zhu (2019) Unsupervised feature learning for point cloud understanding by contrasting and clustering using graph convolutional neural networks. In 2019 International Conference on 3D Vision (3DV), pp. 395–404. Cited by: §5.2.1, TABLE III, TABLE IV.
  • [133] W. Zhang, Q. Yan, and C. Xiao (2020) Detail preserved point cloud completion via separated feature aggregation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 512–528. Cited by: §5.1.4.
  • [134] Z. Zhang, R. Girdhar, A. Joulin, and I. Misra (2021-10) Self-supervised pretraining of 3d features on any point-cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10252–10263. Cited by: §5.2.1, §5.2.1, §5.2.4, TABLE III, §6.2, TABLE VIII, TABLE IX.
  • [135] H. Zhao, L. Jiang, C. Fu, and J. Jia (2019) Pointweb: enhancing local neighborhood features for point cloud processing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5565–5573. Cited by: §4.1.
  • [136] Y. Zhao, T. Birdal, H. Deng, and F. Tombari (2019) 3D point capsule networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1009–1018. Cited by: §4, §5.1.1, TABLE II, TABLE IV.