Deep Learning for 3D Point Clouds
Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominant technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges faced by the processing of point clouds with deep neural networks. Recently, deep learning on point clouds has been thriving, with numerous methods being proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks, including 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.
With the rapid development of 3D acquisition technologies, 3D sensors are becoming increasingly available and affordable, including various types of 3D scanners, LiDARs, and RGB-D cameras (such as Kinect, RealSense and Apple depth cameras) [91]. 3D data acquired by these sensors can provide rich geometric, shape and scale information [51, 50]. Complemented with 2D images, 3D data provides an opportunity for a better understanding of the surrounding environment for machines. 3D data has numerous applications in different areas, including autonomous driving, robotics, remote sensing, medical treatment, and the design industry [19].
3D data can usually be represented with different formats, including depth images, point clouds, meshes, and volumetric grids. As a commonly used format, the point cloud representation preserves the original geometric information in 3D space without any discretization. Therefore, it is the preferred representation for many scene understanding related applications such as autonomous driving and robotics. Recently, deep learning techniques have dominated many research areas, such as computer vision, speech recognition, Natural Language Processing (NLP), and bioinformatics. However, deep learning on 3D point clouds still faces several significant challenges [129], such as the small scale of datasets, the high dimensionality and the unstructured nature of 3D point clouds. On this basis, this paper focuses on the analysis of deep learning methods which have been used to process 3D point clouds.

Deep learning on point clouds has been attracting more and more attention, especially in the last five years. Several publicly available datasets have also been released, such as ModelNet [176], ShapeNet [139], ScanNet [26], Semantic3D [52], and the KITTI Vision Benchmark Suite [44]. These datasets have further boosted the research of deep learning on 3D point clouds, with an increasing number of methods being proposed to address various problems related to point cloud processing, including 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. A few surveys of deep learning on 3D data are also available, such as [63, 2, 178, 133]. However, our paper is the first to specifically focus on deep learning methods for point clouds. In addition, our paper comprehensively covers different applications including classification, detection, tracking, and segmentation. A taxonomy of existing deep learning methods for 3D point clouds is shown in Fig. 1.
Compared to the existing literature, the major contributions of this work can be summarized as follows:
To the best of our knowledge, this is the first survey paper to comprehensively cover deep learning methods for several important point cloud related tasks, including 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation.
This paper covers the most recent and advanced progress of deep learning on point clouds. Therefore, it provides the reader with the state-of-the-art methods.
The structure of this paper is as follows. Section 2 reviews the methods for 3D shape classification. Section 3 provides a survey of existing methods for 3D object detection and tracking. Section 4 presents a review of methods for point cloud segmentation, including semantic segmentation, instance segmentation, and part segmentation. Finally, Section 5 concludes the paper. We also provide a regularly updated project page at https://github.com/QingyongHu/SoTAPointCloud.
These methods usually learn the embedding of each point first and then extract a global shape embedding from the whole point cloud using an aggregation method. Classification is finally achieved by several fully connected layers. Based on the way that feature learning is performed on each point, existing 3D shape classification methods can be divided into projection-based networks and point-based networks. Several milestone methods are illustrated in Fig. 2.
Projection-based methods first project an unstructured point cloud into an intermediate regular representation, and then leverage well-established 2D or 3D convolutions to achieve shape classification. In contrast, point-based methods directly work on raw point clouds without any voxelization or projection. Point-based methods do not introduce explicit information loss and are becoming increasingly popular. In this paper, we mainly focus on point-based networks, but also include a few projection-based networks for completeness.
These methods project 3D point clouds into different representation modalities (e.g., multi-view, volumetric representations) for feature learning and shape classification.
These methods first project a 3D object into multiple views and extract the corresponding view-wise features, and then fuse these features for accurate object recognition. How to aggregate multiple view-wise features into a discriminative global representation is a key challenge. MVCNN [151] is a pioneering work, which simply max-pools multi-view features into a global descriptor. However, max-pooling only retains the maximum elements from a specific view, resulting in information loss. MHBN [196] integrates local convolutional features by harmonized bilinear pooling to produce a compact global descriptor. Yang et al. [188] first leveraged a relation network to exploit the inter-relationships (e.g., region-region and view-view relationships) over a group of views, and then aggregated these views to obtain a discriminative 3D object representation. In addition, several other methods [130, 43, 160, 109] have also been proposed to improve the recognition accuracy.

Early methods usually apply 3D Convolutional Neural Networks (CNNs) built upon the volumetric representation of 3D point clouds. Maturana et al. [112] introduced a volumetric occupancy network called VoxNet to achieve robust 3D object recognition. Wu et al. [176] proposed a convolutional deep belief-based 3D ShapeNets to learn the distribution of points from various 3D shapes. A 3D shape is usually represented by a probability distribution of binary variables on voxel grids. Although encouraging performance has been achieved, these methods are unable to scale well to dense 3D data since the computation and memory footprint grows cubically with the resolution. To this end, a hierarchical and compact graph structure (such as octree) is introduced to reduce the computational and memory costs of these methods. OctNet
[136] first hierarchically partitions a point cloud using a hybrid grid-octree structure, which represents the scene with several shallow octrees along a regular grid. The octree structure is encoded efficiently using a bit string representation, and the feature vector of each voxel is indexed by simple arithmetic. Wang et al. [163] proposed an Octree-based CNN for 3D shape classification. The average normal vectors of a 3D model sampled in the finest leaf octants are fed into the network, and 3D CNN is applied on the octants occupied by the 3D shape surface. Compared to a baseline network based on dense input grids, OctNet requires much less memory and runtime for high-resolution point clouds. Le et al. [81] proposed a hybrid network called PointGrid, which integrates the point and grid representations for efficient point cloud processing. A constant number of points is sampled within each embedding volumetric grid cell, which allows the network to extract geometric details by using 3D convolutions.

According to the network architecture used for the feature learning of each point, methods in this category can be divided into pointwise MLP, convolution-based, graph-based, and data indexing-based networks, as well as other typical networks.
These methods model each point independently using several Multi-Layer Perceptrons (MLPs) and then aggregate a global feature using a symmetric function, as shown in Fig. 3. These networks can achieve permutation invariance for unordered 3D point clouds. However, the geometric relationships among 3D points are not fully considered.

As a pioneering work, PointNet [129] learns point-wise features with several MLP layers and extracts global shape features with a max-pooling layer. The classification score is obtained using several MLP layers. Zaheer et al. [197] also theoretically demonstrated that the key to achieving permutation invariance is to sum up all representations and apply nonlinear transformations. They also designed a fundamental architecture, DeepSets, for various applications including shape classification [197].
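The shared-MLP-plus-symmetric-aggregation recipe described above can be illustrated with a minimal NumPy sketch. This is not any published implementation: the single random linear layer stands in for learned MLP weights, and names such as `shared_mlp` and `global_feature` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single random linear layer + ReLU stands in for the learned
# point-wise MLP; it is applied to every point independently.
W = rng.standard_normal((3, 16))
b = rng.standard_normal(16)

def shared_mlp(points):
    # points: (N, 3) -> per-point features of shape (N, 16)
    return np.maximum(points @ W + b, 0.0)

def global_feature(points):
    # Max-pooling is a symmetric function, so the global descriptor
    # does not depend on the ordering of the input points.
    return shared_mlp(points).max(axis=0)

cloud = rng.standard_normal((128, 3))           # an unordered point set
shuffled = cloud[rng.permutation(len(cloud))]   # same set, new order
```

Shuffling the rows of `cloud` leaves `global_feature` unchanged, which is exactly the permutation invariance these networks rely on.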
Since features are learned independently for each point in PointNet [129], the local structural information between points cannot be captured. Therefore, Qi et al. [131] proposed a hierarchical network PointNet++ to capture fine geometric structures from the neighborhood of each point. As the core of PointNet++ hierarchy, its set abstraction level is composed of three layers: the sampling layer, the grouping layer and the PointNet layer. By stacking several set abstraction levels, PointNet++ can learn features from a local geometric structure and abstract the local features layer by layer.
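The sampling and grouping layers of a set abstraction level can be sketched as follows. This is a naive illustration under stated assumptions, not the reference implementation: `farthest_point_sampling`, `ball_group`, and the radius value 0.4 are all illustrative choices.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Iteratively pick the point farthest from those already chosen."""
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(m - 1):
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def ball_group(points, centers, radius):
    """For each sampled centroid, collect indices of neighbors within radius."""
    groups = []
    for c in centers:
        d = np.linalg.norm(points - points[c], axis=1)
        groups.append(np.flatnonzero(d < radius))
    return groups

rng = np.random.default_rng(1)
cloud = rng.uniform(-1, 1, size=(256, 3))
centroids = farthest_point_sampling(cloud, 32)        # sampling layer
neighborhoods = ball_group(cloud, centroids, 0.4)     # grouping layer
# each neighborhood would then be fed to a small PointNet (the PointNet layer)
```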
Because of its simplicity and strong representation ability, many networks have been developed based on PointNet [129]. Achlioptas et al. [1]
introduced a deep auto-encoder network to learn point cloud representations. Its encoder follows the design of PointNet and learns point features independently using five 1D convolutional layers, a ReLU nonlinear activation, batch normalization and max-pooling operations. In Point Attention Transformers (PATs)
[186], each point is represented by its own absolute position and relative positions with respect to its neighbors. Then, Group Shuffle Attention (GSA) is used to capture relations between points, and a permutation-invariant, differentiable and trainable end-to-end Gumbel Subset Sampling (GSS) layer is developed to learn hierarchical features. The architecture of MoNet [69] is similar to PointNet [129], but it takes a finite set of moments as the input of its network.
PointWeb [208] is also built upon PointNet++ and uses the context of the local neighborhood to improve point features with Adaptive Feature Adjustment (AFA). Duan et al. [33] proposed a Structural Relational Network (SRN) to learn structural relational features between different local structures using MLP. Lin et al. [94] accelerated the inference process by constructing a lookup table for both the input and function spaces learned by PointNet. The inference time on the ModelNet and ShapeNet datasets is sped up by 1.5 ms and 32 times over PointNet on a moderate machine. SRINet [152] first projects a point cloud to obtain rotation-invariant representations, and then utilizes a PointNet-based backbone to extract a global feature and graph-based aggregation to extract local features.

Compared to kernels defined on 2D grid structures (e.g., images), convolutional kernels for 3D point clouds are hard to design due to the irregularity of point clouds. According to the type of convolutional kernels, current 3D convolution networks can be divided into continuous convolution networks and discrete convolution networks, as shown in Fig. 4.
3D Continuous Convolution Networks. These methods define convolutional kernels on a continuous space, where the weights for neighboring points are related to the spatial distribution with respect to the center point.
3D convolution can be interpreted as a weighted sum over a given subset. An MLP is a simple way to learn such weights. As the core layer of RS-CNN [104], RS-Conv takes a local subset of points around a certain point as its input, and the convolution is implemented using an MLP by learning the mapping from low-level relations (such as Euclidean distance and relative position) to high-level relations between points in the local subset. In [14], kernel elements are selected randomly in a unit sphere. An MLP-based continuous function is then used to establish the relation between the locations of the kernel elements and the point cloud. In DensePoint [103], convolution is defined as a Single-Layer Perceptron (SLP) with a nonlinear activator. Features are learned by concatenating features from all previous layers to sufficiently exploit the contextual information.
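The idea shared by these continuous convolutions can be sketched in a few lines: a small network predicts, for each neighbor, a weight matrix from its spatial relation to the center point. This is a toy sketch, not any of the cited operators; `weight_net`, `W1`, `W2` and the channel sizes are illustrative stand-ins for learned components.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy weight-generating "MLP": maps a 3-D relative position to a
# (C_in x C_out) weight matrix. W1/W2 stand in for learned parameters.
C_in, C_out, H = 4, 8, 16
W1 = rng.standard_normal((3, H))
W2 = rng.standard_normal((H, C_in * C_out))

def weight_net(rel_pos):
    h = np.tanh(rel_pos @ W1)
    return (h @ W2).reshape(C_in, C_out)

def continuous_conv(center_xyz, neighbor_xyz, neighbor_feat):
    # Weighted sum over the local subset: each neighbor's contribution
    # is determined by its spatial relation to the center point.
    out = np.zeros(C_out)
    for xyz, f in zip(neighbor_xyz, neighbor_feat):
        out += f @ weight_net(xyz - center_xyz)
    return out / len(neighbor_xyz)

neighbors = rng.standard_normal((9, 3))
feats = rng.standard_normal((9, C_in))
out = continuous_conv(np.zeros(3), neighbors, feats)
```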
Some methods also use existing algorithms to perform convolution. In PointConv [175]
, convolution is defined as a Monte Carlo estimate of the continuous 3D convolution with respect to an importance sampling. The convolutional kernels consist of a weighting function (which is learned with MLP layers) and a density function (which is learned by a kernelized density estimation and an MLP layer). To improve memory and computational efficiency, the 3D convolution is further reduced into two operations: matrix multiplication and 2D convolution. With the same parameter setting, its memory consumption can be reduced by about 64 times. In MCCNN
[54], convolution is considered as a Monte Carlo estimation process relying on a sample’s density function (which is implemented with an MLP). Poisson disk sampling is then used to construct a point cloud hierarchy. This convolution operator can be used to perform convolution between two or multiple sampling methods and can handle varying sampling densities. In SpiderCNN [180], SpiderConv is proposed to define convolution as the product of a step function and a Taylor expansion defined on the k-nearest neighbors. The step function captures the coarse geometry by encoding the local geodesic distance, and the Taylor expansion captures the intrinsic local geometric variations by interpolating arbitrary values at the vertices of a cube. Besides, a convolution network PCNN [111] is also proposed for 3D point clouds based on the radial basis function. Thomas et al. [157] proposed both rigid and deformable kernel point convolution (KPConv) operators for 3D point clouds using a set of learnable kernel points.

Several methods have been proposed to address the rotation equivariance problem faced by 3D convolution networks. Esteves et al. [40]
proposed 3D spherical convolutional neural networks (Spherical CNN) to learn rotation-equivariant representations for 3D shapes, which take multi-valued spherical functions as input. Localized convolutional filters are obtained by parameterizing the spectrum with anchor points in the spherical harmonic domain. Tensor field networks
[158] are proposed to define the point convolution operation as the product of a learnable radial function and spherical harmonics, which are locally equivariant to 3D rotations, translations, and permutations of points. The convolution in [24] is defined based on the spherical cross-correlation and implemented using a generalized Fast Fourier Transform (FFT) algorithm. Based on PCNN, SPHNet
[125] achieves rotation invariance by incorporating spherical harmonic kernels during convolution on volumetric functions. ConvPoint [13] separates the convolution kernel into spatial and feature parts. The locations of the spatial part are randomly selected from a unit sphere and the weighting function is learned through a simple MLP.

To accelerate computation, FlexConvolution [48] defines the weights of the convolution kernel as a standard scalar product over the nearest neighbors, which can be accelerated using CUDA. Experimental results have demonstrated its competitive performance on a small dataset with fewer parameters and lower memory consumption.
3D Discrete Convolution Networks. These methods define convolutional kernels on regular grids, where the weights for neighboring points are related to the offsets with respect to the center point.
Hua et al. [59] transformed non-uniform 3D point clouds into uniform grids, and defined convolutional kernels on each grid. Unlike 2D convolutions (which assign a weight to each pixel), the proposed 3D kernel assigns the same weights to all points falling into the same grid. For a given point, the mean features of all the neighboring points that are located on the same grid are computed from the previous layer. Then, the mean features of all grids are weighted and summed to produce the output of the current layer. Lei et al. [83] defined a spherical convolutional kernel by partitioning a 3D spherical neighboring region into multiple volumetric bins and associating each bin with a learnable weighting matrix. The output of the spherical convolutional kernel for a point is determined by the nonlinear activation of the mean of the weighted activation values of its neighboring points. In GeoConv [76], the geometric relationship between a point and its neighboring points is explicitly modeled based on six bases. Edge features along each direction of the basis are weighted independently by a learnable matrix according to the basis of the neighboring point. These direction-associated features are then aggregated according to the angle formed by the given point and its neighboring points. For a given point, its feature at the current layer is defined as the sum of the features of the given point and its neighboring edge features at the previous layer. PointCNN [88] achieves permutation invariance through a χ-conv transformation (which is implemented through MLP). By interpolating point features to neighboring discrete convolutional kernel-weight coordinates, Mao et al. [110] proposed an interpolated convolution operator, InterpConv, to measure the geometric relations between input point clouds and kernel-weight coordinates. Zhang et al.
[205] proposed a RIConv operator to achieve rotation invariance, which takes lowlevel rotation invariant geometric features as input and then turns the convolution into 1D by a simple binning approach.
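The grid-style discrete convolution described above (bin the neighbors of a query point into uniform cells, share one weight per cell, and weight the mean feature of each occupied cell) can be sketched as follows. This is an illustrative toy, not the implementation of any cited paper; the kernel size, cell size, and random weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

K = 3           # 3x3x3 kernel of grid cells around the query point
cell = 0.2      # grid cell size
C = 4
# One weight vector per cell, shared by all points falling into that cell.
kernel = rng.standard_normal((K, K, K, C))

def grid_conv(query_xyz, neighbor_xyz, neighbor_feat):
    acc = np.zeros((K, K, K, C))
    cnt = np.zeros((K, K, K))
    out = 0.0
    # Assign every neighbor to the cell its offset falls into.
    idx = np.floor((neighbor_xyz - query_xyz) / cell).astype(int) + K // 2
    for (i, j, k), f in zip(idx, neighbor_feat):
        if 0 <= i < K and 0 <= j < K and 0 <= k < K:
            acc[i, j, k] += f
            cnt[i, j, k] += 1
    # Mean feature per occupied cell, then weight and sum over cells.
    for i, j, k in np.argwhere(cnt > 0):
        out += kernel[i, j, k] @ (acc[i, j, k] / cnt[i, j, k])
    return out

neigh = rng.uniform(-0.3, 0.3, size=(32, 3))
feats = rng.standard_normal((32, C))
val = grid_conv(np.zeros(3), neigh, feats)
```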
A-CNN [72] defines an annular convolution by looping over the array of neighbors with respect to the kernel size on each ring of the query point. A-CNN learns the relationship between neighboring points in a local subset. To reduce the computational and memory cost of 3D CNNs, Kumawat et al. [74] proposed a Rectified Local Phase Volume (ReLPV) block to extract phase in a 3D local neighborhood based on the 3D Short-Term Fourier Transform (STFT), which significantly reduces the number of parameters. In SFCNN [134], a point cloud is projected onto regular icosahedral lattices with aligned spherical coordinates. Convolutions are then conducted upon the features concatenated from the vertices of the spherical lattices and their neighbors through convolution-maxpooling-convolution structures. SFCNN is resistant to rotations and perturbations.
Graphbased networks consider each point in a point cloud as a vertex of a graph, and generate directed edges for the graph based on the neighbors of each point. Feature learning is then performed in spatial or spectral domains [146]. A typical graphbased network is shown in Fig. 5.
Graph-based Methods in Spatial Domain. These methods define operations (e.g., convolution and pooling) in the spatial domain. Specifically, convolution is usually implemented through an MLP over spatial neighbors, while pooling produces a new coarsened graph by aggregating information from each point’s neighbors. Features at each vertex are usually assigned coordinates, laser intensities or colors, while features at each edge are usually assigned geometric attributes between the two connected points.
As a pioneering work, Simonovsky et al. [146] considered each point as a vertex of the graph, and connected each vertex to all its neighbors by a directed edge. Then, Edge-Conditioned Convolution (ECC) is proposed using a filter-generating network (e.g., MLP). Max pooling is adopted to aggregate neighborhood information and graph coarsening is implemented based on the VoxelGrid [138] algorithm. For shape classification, convolutions and pooling are first interlaced. Then, global average pooling and fully connected layers follow to produce classification scores. In DGCNN [168], a graph is constructed in the feature space and dynamically updated after each layer of the network. As the core layer of EdgeConv, an MLP is used as the feature learning function for each edge, and channel-wise symmetric aggregation is applied to the edge features associated with the neighbors of each point. Further, LDGCNN [203]
removed the transformation network and linked the hierarchical features from different layers in DGCNN
[168] to improve its performance and reduce the model size. An end-to-end unsupervised deep auto-encoder network (namely FoldingNet [187]) is also proposed, which uses the concatenation of a vectorized local covariance matrix and point coordinates as its input.

Inspired by Inception [153] and DGCNN [168], Hassani et al. [53] proposed an unsupervised multi-task auto-encoder to learn point and shape features. The encoder is constructed based on multi-scale graphs. The decoder is constructed using three unsupervised tasks, including clustering, self-supervised classification and reconstruction, which are trained jointly with a multi-task loss. Liu et al. [98] proposed a Dynamic Points Agglomeration Module (DPAM) based on graph convolution to simplify the process of points agglomeration (sampling, grouping and pooling) into a single step, which is implemented through multiplication of the agglomeration matrix and the point feature matrix. Based on the PointNet architecture, a hierarchical learning architecture is constructed by stacking multiple DPAMs. Compared to the hierarchy strategy of PointNet++, DPAM dynamically exploits the relation of points and agglomerates points in a semantic space.
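An EdgeConv-style layer of the kind used in DGCNN can be sketched as follows: build a k-NN graph in feature space, apply a shared function to each edge (x_i, x_j − x_i), and max-aggregate over the neighborhood. This is a minimal sketch assuming a single random linear layer in place of the learned edge MLP; `edge_conv` and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

C_in, C_out, k = 3, 8, 6
W = rng.standard_normal((2 * C_in, C_out))  # stand-in for the edge MLP

def edge_conv(feats):
    """For every point, build edges to its k nearest neighbors in feature
    space, apply a shared function to (x_i, x_j - x_i), and take a
    channel-wise max over the neighborhood."""
    n = len(feats)
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    knn = d.argsort(axis=1)[:, 1:k + 1]      # skip self (column 0)
    out = np.empty((n, C_out))
    for i in range(n):
        edges = np.hstack([np.tile(feats[i], (k, 1)), feats[knn[i]] - feats[i]])
        out[i] = np.maximum(edges @ W, 0).max(axis=0)
    return out

cloud = rng.standard_normal((64, C_in))
h1 = edge_conv(cloud)   # in DGCNN the graph would be rebuilt from h1
```

Rebuilding the graph from the output features at the next layer is what makes the graph "dynamic" in DGCNN.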
To exploit the local geometric structures, KCNet [140] is proposed to learn features based on kernel correlation. Specifically, a set of learnable points characterizing the geometric types of local structures is defined as a kernel. Then, the affinity between the kernel and the neighborhood of a given point is calculated. In G3D [32], convolution is defined as a variant of a polynomial of the adjacency matrix, and pooling is defined as multiplying the Laplacian matrix and the vertex matrix by a coarsening matrix. ClusterNet [16]
utilizes a Rigorously Rotation-Invariant (RRI) module to extract rotation-invariant features for each point, and constructs hierarchical structures of a point cloud based on the unsupervised agglomerative hierarchical clustering method with a ward-linkage criterion [120]. The features in each sub-cluster are first learned through an EdgeConv block and then aggregated through max pooling.

Graph-based Methods in Spectral Domain.
These methods define convolutions as spectral filtering, which is implemented as the multiplication of signals on the graph with the eigenvectors of the graph Laplacian matrix
[15].

To handle the challenges of high computation and non-localization, Defferrard et al. [30] proposed using truncated Chebyshev polynomials to approximate the spectral filtering. The learned feature maps are located within the K-hop neighborhood of each point. Note that the eigenvectors are calculated from a fixed graph Laplacian matrix in [15, 30]. In contrast, RGCNN [156]
constructs a graph by connecting each point with all other points in the point cloud and updates the graph Laplacian matrix in each layer. To make the features of adjacent vertices more similar, a graph-signal smoothness prior is added into the loss function. To address the challenges caused by the diverse graph topology of data, the SGC-LL layer in AGCN
[87] utilizes a learnable distance metric to parameterize the similarity between two vertices on the graph. The adjacency matrix obtained from the graph is normalized using Gaussian kernels and learned distances. Feng et al. [42] proposed a Hypergraph Neural Network (HGNN) and built a hyperedge convolutional layer by applying spectral convolution on the hypergraph.

The aforementioned methods operate on full graphs. To exploit the local structural information, Wang et al. [161] proposed an end-to-end spectral convolution network, LocalSpecGCN, that works on a local graph (which is constructed from the nearest neighbors). This method does not require any offline computation of the graph Laplacian matrix or of the graph coarsening hierarchy. In PointGCN [204], a graph is constructed based on the nearest neighbors from a point cloud and each edge is weighted using a Gaussian kernel. Convolutional filters are defined as Chebyshev polynomials in the graph spectral domain. Global pooling and multi-resolution pooling are used to capture global and local features of the point cloud. Pan et al. [122] proposed 3DTI-Net by applying convolution on the nearest-neighbor graph in the spectral domain. Invariance to geometric transformations is achieved by learning from relative Euclidean and direction distances.
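Spectral filtering on a point-cloud graph can be sketched directly from its definition: transform a graph signal into the eigenbasis of the graph Laplacian, scale each frequency, and transform back. This is a minimal illustration (a 5-NN graph, the combinatorial Laplacian, and an exponential low-pass filter are all example choices, not taken from any cited method).

```python
import numpy as np

rng = np.random.default_rng(5)

# Small k-NN graph over a random point cloud.
pts = rng.standard_normal((40, 3))
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
A = np.zeros((40, 40))
for i, nbrs in enumerate(d.argsort(axis=1)[:, 1:6]):
    A[i, nbrs] = A[nbrs, i] = 1.0          # symmetric 5-NN adjacency
L = np.diag(A.sum(1)) - A                  # combinatorial graph Laplacian

# Spectral filtering: project the signal onto the Laplacian eigenbasis,
# scale each frequency component, and project back.
lam, U = np.linalg.eigh(L)
signal = rng.standard_normal(40)           # one scalar feature per point
g = np.exp(-0.5 * lam)                     # an example low-pass filter
filtered = U @ (g * (U.T @ signal))
```

Because the low-pass filter attenuates high graph frequencies, the filtered signal is smoother on the graph (its Laplacian quadratic form does not exceed that of the input); Chebyshev approximations such as [30] avoid the explicit eigendecomposition used here.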
These networks are constructed based on different data indexing structures (e.g., octree and k-d tree). In these methods, point features are learned hierarchically from leaf nodes to the root node along a tree. Lei et al. [83] proposed an octree-guided CNN using spherical convolutional kernels (as described in Section 4
). Each layer of the network corresponds to one layer of the octree, and a spherical convolutional kernel is applied at each layer. The values of the neurons in the current layer are determined as the mean values of all relevant child nodes in the previous layer. Unlike OctNet
[136] (which is based on octrees), Kd-Net [71] is built using multiple K-d trees with different splitting directions at each iteration. Following a bottom-up approach, the representation of a non-leaf node is computed from the representations of its children using MLP. The feature of the root node (which describes the whole point cloud) is finally fed to fully connected layers to predict classification scores. Note that Kd-Net shares parameters at each level according to the splitting type of the nodes. 3DContextNet [199] uses a standard balanced K-d tree to achieve feature learning and aggregation. At each level, point features are first learned through MLP based on local cues (which model interdependencies between points in a local region) and global contextual cues (which model the relationship of one position with respect to all other positions). Then, the feature of a non-leaf node is computed from its child nodes using MLP and aggregated by max pooling. For classification, the above process is repeated until the root node is reached.

The hierarchy of the SO-Net network is constructed by performing point-to-node nearest-neighbor search [86]
. Specifically, a modified permutation-invariant Self-Organizing Map (SOM) is used to model the spatial distribution of a point cloud. Individual point features are learned from normalized point-to-node coordinates through a series of fully connected layers. The feature of each node in the SOM is extracted from the point features associated with this node using channel-wise max pooling. The final feature is then learned from node features using an approach similar to PointNet
[129]. Compared to PointNet++ [131], the hierarchy of SOM is more efficient and the spatial distribution of the point cloud is fully explored.

Table: Comparative 3D shape classification results on the ModelNet40 and ModelNet10 benchmarks (OA: overall accuracy, mAcc: mean class accuracy; "-": result unavailable).

| Methods | Input | #params (M) | ModelNet40 OA | ModelNet40 mAcc | ModelNet10 OA | ModelNet10 mAcc |
|---|---|---|---|---|---|---|
| *Pointwise MLP networks* | | | | | | |
| PointNet [129] | Coordinates | 3.48 | 89.2% | 86.2% | - | - |
| PointNet++ [131] | Coordinates | 1.48 | 90.7% | - | - | - |
| MONet [69] | Coordinates | 3.1 | 89.3% | 86.1% | - | - |
| Deep Sets [197] | Coordinates | - | 87.1% | - | - | - |
| PAT [186] | Coordinates | - | 91.7% | - | - | - |
| PointWeb [208] | Coordinates | - | 92.3% | 89.4% | - | - |
| SRN-PointNet++ [33] | Coordinates | - | 91.5% | - | - | - |
| JUSTLOOKUP [94] | Coordinates | - | 89.5% | 86.4% | 92.9% | 92.1% |
| *Convolution-based networks* | | | | | | |
| Pointwise-CNN [59] | Coordinates | - | 86.1% | 81.4% | - | - |
| PointConv [175] | Coordinates+Normals | - | 92.5% | - | - | - |
| MC Convolution [54] | Coordinates | - | 90.9% | - | - | - |
| SpiderCNN [180] | Coordinates+Normals | - | 92.4% | - | - | - |
| PointCNN [88] | Coordinates | 0.45 | 92.2% | 88.1% | - | - |
| FlexConvolution [48] | Coordinates | - | 90.2% | - | - | - |
| PCNN [111] | Coordinates | 1.4 | 92.3% | - | 94.9% | - |
| Boulch [14] | Coordinates | - | 91.6% | 88.1% | - | - |
| RS-CNN [104] | Coordinates | - | 92.6% | - | - | - |
| Spherical CNNs [40] | Coordinates | 0.5 | 88.9% | - | - | - |
| GeoCNN [76] | Coordinates | - | 93.4% | 91.1% | - | - |
| Ψ-CNN [83] | Coordinates | - | 92.0% | 88.7% | 94.6% | 94.4% |
| A-CNN [72] | Coordinates | - | 92.6% | 90.3% | 95.5% | 95.3% |
| SFCNN [134] | Coordinates | - | 91.4% | - | - | - |
| SFCNN [134] | Coordinates+Normals | - | 92.3% | - | - | - |
| DensePoint [103] | Coordinates | 0.53 | 93.2% | - | 96.6% | - |
| KPConv rigid [157] | Coordinates | - | 92.9% | - | - | - |
| KPConv deform [157] | Coordinates | - | 92.7% | - | - | - |
| InterpCNN [110] | Coordinates | 12.8 | 93.0% | - | - | - |
| ConvPoint [13] | Coordinates | - | 91.8% | 88.5% | - | - |
| *Graph-based networks* | | | | | | |
| ECC [146] | Coordinates | - | 87.4% | 83.2% | 90.8% | 90.0% |
| KCNet [140] | Coordinates | 0.9 | 91.0% | - | 94.4% | - |
| DGCNN [168] | Coordinates | 1.84 | 92.2% | 90.2% | - | - |
| LocalSpecGCN [161] | Coordinates+Normals | - | 92.1% | - | - | - |
| RGCNN [156] | Coordinates+Normals | 2.24 | 90.5% | 87.3% | - | - |
| LDGCNN [203] | Coordinates | - | 92.9% | 90.3% | - | - |
| 3DTI-Net [122] | Coordinates | 2.6 | 91.7% | - | - | - |
| PointGCN [204] | Coordinates | - | 89.5% | 86.1% | 91.9% | 91.6% |
| ClusterNet [16] | Coordinates | - | 87.1% | - | - | - |
| Hassani et al. [53] | Coordinates | - | 89.1% | - | - | - |
| DPAM [98] | Coordinates | - | 91.9% | 89.9% | 94.6% | 94.3% |
| *Data indexing-based networks* | | | | | | |
| Kd-Net [71] | Coordinates | 2.0 | 91.8% | 88.5% | 94.0% | 93.5% |
| SO-Net [86] | Coordinates | - | 90.9% | 87.3% | 94.1% | 93.9% |
| SCN [177] | Coordinates | - | 90.0% | 87.6% | - | - |
| A-SCN [177] | Coordinates | - | 89.8% | 87.4% | - | - |
| 3DContextNet [199] | Coordinates | - | 90.2% | - | - | - |
| 3DContextNet [199] | Coordinates+Normals | - | 91.1% | - | - | - |
| *Other networks* | | | | | | |
| 3DmFV-Net [9] | Coordinates | 4.6 | 91.6% | - | 95.2% | - |
| PVNet [194] | Coordinates+Views | - | 93.2% | - | - | - |
| PVRNet [195] | Coordinates+Views | - | 93.6% | - | - | - |
| 3DPointCapsNet [210] | Coordinates | - | 89.3% | - | - | - |
| DeepRBFNet [18] | Coordinates | 3.2 | 90.2% | 87.8% | - | - |
| DeepRBFNet [18] | Coordinates+Normals | 3.2 | 92.1% | 88.8% | - | - |
| Point2Sequences [102] | Coordinates | - | 92.6% | 90.4% | 95.3% | 95.1% |
| RCNet [174] | Coordinates | - | 91.6% | - | 94.7% | - |
| RCNet-E [174] | Coordinates | - | 92.3% | - | 95.6% | - |
In addition to the above methods, many other schemes have also been proposed. In 3DmFV [9]
, a point cloud is voxelized into uniform 3D grids and Fisher vectors are extracted based on the likelihood of a set of Gaussian mixture models defined on these grids. Since the components of the Fisher vector are summed over all points, the resulting representation is invariant to the order, structure and size of point clouds. RBFNet
[18] explicitly models the spatial distribution of points by aggregating features from sparsely distributed Radial Basis Function (RBF) kernels. The RBF feature extraction layer computes responses from all kernels for each point, and the kernel positions and kernel sizes are optimized to capture the spatial distribution of points during training. Compared to fully connected layers, the RBF feature extraction layer produces more discriminative features while reducing the number of parameters by orders of magnitude. Zhao et al.
[210] proposed an unsupervised autoencoder 3DPointCapsNet for generic representation learning of 3D point clouds. In the encoder stage, a pointwise MLP is first applied to the point cloud to extract point independent features, which are further fed into multiple independent convolutional layers. Then, a global latent representation is extracted by concatenating multiple maxpooled learned feature maps. Based on unsupervised dynamic routing, powerful representative latent capsules are learned. Inspired from the construction of shape context descriptor [7], Xie et al. [177] proposed a novel ShapeContextNet architecture by combining affinity point selection and compact feature aggregation into a soft alignment operation using dotproduct selfattention [159]. To address noise and occlusion in 3D point clouds, Bobkov et al. [11] fed handcrafted point pair function based 4D rotation invariant descriptors into a 4D convolutional neural network. Prokudin et al. [126]first randomly sampled a basis point set with a uniform distribution from a unit ball, and then encoded point cloud as minimal distances to the basis point set, which converts the point cloud to a vector with a relatively small fixed length. The encoded representation can then be processed with existing machine learning methods. RCNet
[174] utilizes standard RNN and 2D CNN to construct a permutationinvariant network for 3D point cloud processing. The point cloud is first partitioned into parallel beams and sorted along a specific dimension, and each beam is then fed into a shared RNN. The learned features are further fed into an efficient 2D CNN for hierarchical feature aggregation. To enhance its description ability, RCNetE is proposed to ensemble multiple RCNets along different partition and sorting directions. Point2Sequences [102] is another RNNbased model that captures correlations between different areas in local regions of point clouds. It considers features learned from a local region at multiple scales as sequences and feeds these sequences from all local regions into an RNNbased encoderdecoder structure to aggregate local region features. Qin et al. [132] proposed an endtoend unsupervised domain adaptationbased network PointDAN for 3D point cloud representation. To capture the semantic properties of a point cloud, a selfsupervised method is proposed to reconstruct the point cloud, whose parts have been randomly rearranged [144].Several methods have also been proposed to learn from both 3D point clouds and 2D images. In PVNet [194]
, highlevel global features extracted from multiview images are projected into the subspace of point clouds through an embedding network, and fused with point cloud features through a soft attention mask. Finally, a residual connection is employed for fused features and multiview features to perform shape recognition. Later, PVRNet
[195] is further proposed to exploit the relation between a 3D point cloud and its multiple views, which are learned by a relation score module. Based on the relation scores, the original 2D global view features are enhanced for pointsingleview fusion and pointmultiview fusion.The ModelNet10/40 datasets are the most frequently used datasets for shape classification. Table I shows the results achieved by different pointbased networks. Several observations can be drawn:
Point-wise MLP networks usually serve as basic building blocks that other types of networks use to learn point-wise features.
As a standard deep learning architecture, convolution-based networks can achieve superior performance on irregular 3D point clouds. More attention should be paid to both discrete and continuous convolution networks for irregular data.
Due to their inherent strong capability to handle irregular data, graph-based networks have attracted increasing attention in recent years. However, it is still challenging to extend spectral-domain graph networks to the variety of graph structures encountered in practice.
Most networks need to downsample a point cloud to a fixed, small size, and this sampling process discards shape details. The development of networks that can handle large-scale point clouds is still in its infancy [57].
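Several of the encodings discussed above reduce an unordered cloud to a fixed-length vector. The basis point set idea of Prokudin et al. [126] is the simplest to illustrate; the sketch below is an illustrative re-implementation in which the basis size, random seed and rejection sampling are assumptions, not the paper's exact setup:

```python
import numpy as np

def basis_point_set_encoding(points, num_basis=64, seed=0):
    """Hypothetical sketch of a basis-point-set encoder in the spirit
    of [126]: sample a fixed basis uniformly from the unit ball, then
    describe a cloud by its minimal distance to each basis point,
    yielding a fixed-length vector."""
    rng = np.random.default_rng(seed)
    # Rejection-sample `num_basis` points uniformly from the unit ball.
    basis = []
    while len(basis) < num_basis:
        p = rng.uniform(-1.0, 1.0, size=3)
        if np.linalg.norm(p) <= 1.0:
            basis.append(p)
    basis = np.stack(basis)                       # (num_basis, 3)
    # Pairwise distances (N, num_basis); keep the minimum over points.
    d = np.linalg.norm(points[:, None, :] - basis[None, :, :], axis=-1)
    return d.min(axis=0)                          # (num_basis,)

cloud = np.random.default_rng(1).normal(size=(1024, 3))
code = basis_point_set_encoding(cloud)
```

Because the minimum is taken over all points, shuffling the input leaves the encoding unchanged, which is exactly the permutation invariance these fixed-length encodings rely on.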
In this section, we will review existing methods for 3D object detection, 3D object tracking and 3D scene flow estimation.
The task of 3D object detection is to accurately locate all objects of interest in a given scene. Similar to object detection in images [99], 3D object detection methods can be divided into two categories: region proposalbased methods and single shot methods. Several milestone methods are presented in Fig. 6.
These methods first propose several possible regions (also called proposals) containing objects, and then extract regionwise features to determine the category label of each proposal. According to their object proposal generation approach, these methods can further be divided into three categories: multiview based, segmentationbased and frustumbased methods.
Multiview Methods. These methods fuse proposalwise features from different view maps (e.g., LiDAR front view, bird’s eye view (BEV) and image) to obtain 3D rotated boxes, as shown in Fig. 7(a). The computational cost of these methods is usually high.
Chen et al. [19] generated a group of highly accurate 3D candidate boxes from the BEV map and projected them to the feature maps of multiple views (e.g., LiDAR front view image, RGB image). They then combined these regionwise features obtained from different views to predict oriented 3D bounding boxes, as shown in Fig. 7(a). Although this method achieves a recall of 99.1% at an IntersectionoverUnion (IoU) of 0.25 with only 300 proposals, its speed is too slow for practical applications. Subsequently, several approaches have been developed to improve multiview 3D object detection methods from two aspects.
First, several methods have been proposed to efficiently fuse the information of different modalities. To generate 3D proposals with a high recall for small objects, Ku et al. [73] proposed a multimodal fusionbased region proposal network. They first extracted equalsized features from both BEV and image views using cropping and resizing operations, and then fused these features using elementwise mean pooling. Liang et al. [90] exploited continuous convolutions to enable effective fusion of image and 3D LiDAR feature maps at different resolutions. Specifically, they extracted nearest corresponding image features for each point in the BEV space and then used bilinear interpolation to obtain a dense BEV feature map by projecting image features into the BEV plane. Experimental results show that dense BEV feature maps are more suitable for 3D object detection than discrete image feature maps and sparse LiDAR feature maps. Liang et al. [89] presented a multitask multisensor 3D object detection network for endtoend training. Specifically, multiple tasks (e.g., 2D object detection, ground estimation and depth completion) are exploited to help the network learn better feature representations. The learned crossmodality representation is further exploited to produce highly accurate object detection results. Experimental results show that this method achieves a significant improvement on 2D, 3D and BEV detection tasks, and outperforms previous stateoftheart methods on the TOR4D benchmark [184, 108].
Second, different methods have been investigated to extract robust representations of the input data. Lu et al. [107] explored multi-scale contextual information by introducing a Spatial Channel Attention (SCA) module, which captures the global and multi-scale context of a scene and highlights useful features. They also proposed an Extension Spatial Upsample (ESU) module that combines multi-scale low-level features to obtain high-level features with rich spatial information, thus generating reliable 3D object proposals. Although they achieve better detection performance, the aforementioned multi-view methods require long runtimes since they perform feature pooling for each proposal. Subsequently, Zeng et al. [200] used a pre-RoI pooling convolution to improve the efficiency of [19]. Specifically, they moved the majority of convolution operations ahead of the RoI pooling module, so that the RoI convolutions are performed only once for all object proposals. Experimental results show that this method runs at 11.1 fps, 5 times faster than MV3D [19].
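Producing a dense BEV feature map from sparse projected image features [90] rests on a standard bilinear sampling primitive, sketched here in isolation (the projection itself and the continuous-convolution weighting are omitted):

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample an (H, W, C) feature map at fractional
    coordinates (x, y). This is the basic primitive behind densifying
    a BEV feature map from projected image features, as in the
    continuous-fusion idea of [90]; illustrative sketch only."""
    h, w = feat.shape[:2]
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bot = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bot

fmap = np.arange(16.0).reshape(4, 4, 1)   # toy 4x4 single-channel map
v = bilinear_sample(fmap, 1.5, 2.0)       # halfway between columns 1 and 2, row 2
```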
Segmentationbased Methods. These methods first leverage existing semantic segmentation techniques to remove most background points, and then generate a large amount of highquality proposals on foreground points to save computation, as shown in Fig. 7(b). Compared to multiview methods [19, 73, 200], these methods achieve higher object recall rates and are more suitable for complicated scenes with highly occluded and crowded objects.
Yang et al. [189] used a 2D segmentation network to predict foreground pixels and projected them into the point cloud to remove most background points. They then generated proposals on the predicted foreground points, and designed a new criterion named PointsIoU to reduce the redundancy and ambiguity of proposals. Following [189], Shi et al. [141] proposed the PointRCNN framework. Specifically, they directly segmented 3D point clouds to obtain foreground points, and then fused semantic features and local spatial features to produce high-quality 3D boxes. Following the RPN stage of [141], Jesus et al. [65] proposed a pioneering work that leverages a Graph Convolution Network (GCN) for 3D object detection. Specifically, two modules are introduced to refine the object proposals using graph convolutions. The first module, R-GCN, utilizes all points contained in a proposal to achieve per-proposal feature aggregation. The second module, C-GCN, fuses per-frame information from all proposals to regress accurate object boxes by exploiting contexts. Sourabh et al. [149] projected a point cloud into the output of an image-based segmentation network and appended the semantic prediction scores to the points. The painted points are fed into existing detectors [141, 213, 79] to achieve significant performance improvements. Yang et al. [190] associated each point with a spherical anchor; the semantic score of each point is then used to remove redundant anchors. Consequently, this method achieves a higher recall with lower computational cost than previous methods [189, 141]. In addition, a PointsPool layer is proposed to learn compact features for the interior points of proposals, and a parallel IoU branch is introduced to improve localization accuracy and detection performance. Experimental results show that this method significantly outperforms other methods [141, 169, 89] on the hard set (Car class) of the KITTI dataset [44] while running at 12.5 fps.
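The PointsIoU criterion of [189] is not specified here beyond its name; one plausible reading, used purely for illustration, scores two boxes by the points they enclose rather than by volume overlap (axis-aligned boxes and the exact formula below are assumptions):

```python
import numpy as np

def points_iou(points, box_a, box_b):
    """Hypothetical sketch of a PointsIoU-style criterion [189]:
    score two axis-aligned 3D boxes by the cloud points they enclose,
    |inside both| / |inside either|. Boxes are (min_xyz, max_xyz)
    pairs; every detail here is an assumption, not the paper's."""
    def inside(box):
        lo, hi = box
        return np.all((points >= lo) & (points <= hi), axis=1)
    in_a, in_b = inside(box_a), inside(box_b)
    union = np.count_nonzero(in_a | in_b)
    return np.count_nonzero(in_a & in_b) / max(union, 1)

pts = np.array([[0.5, 0.5, 0.5], [1.5, 0.5, 0.5], [3.0, 3.0, 3.0]])
iou = points_iou(pts,
                 (np.zeros(3), np.array([2.0, 1.0, 1.0])),
                 (np.array([1.0, 0.0, 0.0]), np.array([2.0, 1.0, 1.0])))
```

Unlike a volume IoU, such a criterion is insensitive to box extent in empty space, which matches the motivation of reducing proposal redundancy on foreground points.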
Frustum-based Methods. These methods first leverage existing 2D object detectors to generate 2D candidate regions of objects, and then extract a 3D frustum proposal for each 2D candidate region, as shown in Fig. 7(c). Although they can efficiently propose possible locations of 3D objects, their step-by-step pipeline caps their performance at that of the underlying 2D image detectors.
FPointNets [128] is a pioneering work in this direction. It generates a frustum proposal for each 2D region and applies PointNet [129] (or PointNet++ [131]) to learn point cloud features of each 3D frustum for amodal 3D box estimation. In a follow-up work, Zhao et al. [209] proposed a PointSENet module to predict a set of scaling factors, which are further used to adaptively highlight useful features and suppress less informative ones. They also integrated the PointSIFT [67] module into the network to capture orientation information of point clouds, which achieves strong robustness to shape scaling. This method achieves significant improvements over FPointNets [128] on both indoor and outdoor datasets [44, 148].
Xu et al. [179] leveraged both the 2D image region and its corresponding frustum points to accurately regress 3D boxes. To fuse image features and global features of point clouds, they presented a global fusion network for direct regression of box corner locations. They also proposed a dense fusion network for predicting point-wise offsets to each corner. Shin et al. [143] first estimated the 2D bounding boxes and 3D poses of objects from a 2D image, and then extracted multiple geometrically feasible object candidates. These 3D candidates are fed into a box regression network to predict accurate 3D object boxes. Wang et al. [169] generated a sequence of frustums along the frustum axis for each 2D region and applied PointNet [129] to extract features for each frustum. The frustum-level features are reformed to generate a 2D feature map, which is then fed into a fully convolutional network for 3D box estimation. This method achieves state-of-the-art performance among 2D image-based methods and ranked at the top of the official KITTI leaderboard. Lehner et al. [68] first obtained preliminary detection results on the BEV map, and then extracted small point subsets (also called patches) based on the BEV predictions. A local refinement network is applied to learn local features of the patches and predict highly accurate 3D bounding boxes.
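The frustum extraction step these methods share can be sketched with basic camera geometry; the intrinsics, the camera-frame point coordinates, and the omission of LiDAR-to-camera extrinsics below are simplifying assumptions:

```python
import numpy as np

def frustum_points(points, box2d, K):
    """Select the points whose camera projections fall inside a 2D
    detection box, i.e. the frustum-proposal step shared by methods
    like FPointNets [128]. `points` are (N, 3) in the camera frame
    (z forward); `K` is the 3x3 intrinsic matrix. Sketch only: real
    pipelines also apply the LiDAR-to-camera extrinsics."""
    z = points[:, 2]
    valid = z > 0                        # keep points in front of the camera
    uv = (K @ points.T).T                # project: (N, 3)
    uv = uv[:, :2] / uv[:, 2:3]          # perspective divide -> pixels
    u_min, v_min, u_max, v_max = box2d
    inside = (uv[:, 0] >= u_min) & (uv[:, 0] <= u_max) \
           & (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max)
    return points[valid & inside]

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pts = np.array([[0.0, 0.0, 10.0],    # projects to image center (320, 240)
                [5.0, 0.0, 10.0]])   # projects to (570, 240), outside the box
sel = frustum_points(pts, (300, 220, 340, 260), K)
```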
Other Methods. Motivated by the success of axisaligned IoU in object detection in images, Zhou et al. [31] integrated the IoU of two 3D rotated bounding boxes into several stateoftheart detectors [182, 79, 141] to achieve consistent performance improvement. Chen et al. [20] proposed a twostage network architecture to use both point cloud and voxel representations. First, point clouds are voxelized and fed to a 3D backbone network to produce initial detection results. Second, the interior point features of initial predictions are further exploited for box refinements. Although this design is conceptually simple, it achieves comparable performance to PointRCNN [141] while maintaining a speed of 16.7 fps.
Inspired by Hough voting-based 2D object detectors, Qi et al. [127] proposed VoteNet to directly vote for the virtual center points of objects from point clouds, and to generate a group of high-quality 3D object proposals by aggregating the vote features. VoteNet significantly outperforms previous approaches using only geometric information, and achieves state-of-the-art performance on two large indoor benchmarks (i.e., ScanNet [26] and SUN RGB-D [148]). However, the prediction of virtual center points is unstable for partially occluded objects. Further, Feng et al. [117] added an auxiliary branch of direction vectors to improve the prediction accuracy of virtual center points and 3D candidate boxes. In addition, a 3D object-object relationship graph between proposals is built to emphasize useful features for accurate object detection. Inspired by the observation that the ground-truth boxes of 3D objects provide accurate locations of intra-object parts, Shi et al. [142] proposed the Part-A^2 net, which is composed of a part-aware stage and a part-aggregation stage. The part-aware stage applies a UNet-like network with sparse convolution and sparse deconvolution to learn point-wise features for the prediction and coarse generation of intra-object part locations. The part-aggregation stage adopts RoI-aware pooling to aggregate the predicted part locations for box scoring and location refinement.
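The center-voting idea behind VoteNet [127] can be caricatured in a few lines; the offsets below stand in for network predictions, and the naive greedy grouping replaces the paper's learned sampling-and-grouping module:

```python
import numpy as np

def vote_for_centers(points, offsets, radius=0.3):
    """Minimal sketch of Hough-style center voting as in VoteNet
    [127]: each seed point casts a vote (its position plus a predicted
    offset toward the object center), and votes are grouped by
    proximity to form object candidates. The offsets are stand-ins for
    network outputs; the clustering is an illustrative simplification."""
    votes = points + offsets
    centers = []
    remaining = votes.copy()
    while len(remaining) > 0:
        seed = remaining[0]
        near = np.linalg.norm(remaining - seed, axis=1) < radius
        centers.append(remaining[near].mean(axis=0))   # aggregate one cluster
        remaining = remaining[~near]
    return np.array(centers)

pts = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
true_center = np.array([1.0, 1.0, 0.0])
centers = vote_for_centers(pts, true_center - pts)   # perfect offsets
```

With perfect offsets all votes coincide and a single candidate center is recovered; in practice the robustness comes from aggregating many noisy votes.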
These methods directly predict class probabilities and regress 3D bounding boxes of objects using a singlestage network. These methods do not need region proposal generation and postprocessing. As a result, they can run at a high speed and are highly suitable for realtime applications. According to the type of input data, single shot methods can be divided into two categories: BEVbased and point cloudbased methods.
BEV-based Methods. These methods mainly take the BEV representation as input. Yang et al. [184] discretized the point cloud of a scene with equally spaced cells and encoded the reflectance in a similar way, resulting in a regular representation. A Fully Convolutional Network (FCN) was then applied to estimate the locations and heading angles of objects. This method outperforms most single shot methods (including VeloFCN [84], 3D FCN [85] and Vote3Deep [35]) while running at 28.6 fps. Later, Yang et al. [183] exploited the geometric and semantic prior information provided by High-Definition (HD) maps to improve the robustness and detection performance of [184]. Specifically, they obtained the coordinates of ground points from the HD map and then replaced the absolute distance in the BEV representation with the distance relative to the ground, to remedy the translation variance caused by the slope of the road. In addition, they concatenated a binary road mask with the BEV representation along the channel dimension to focus on moving objects. Since HD maps are not available everywhere, they also proposed an online map prediction module to estimate the map priors from a single LiDAR point cloud. This map-aware method significantly outperforms its baseline on the TOR4D [184, 108] and KITTI [44] datasets. However, its generalization to point clouds with different densities is poor. To solve this problem, Beltrán et al. [8] proposed a normalization map to account for the differences among LiDAR sensors. The normalization map is a 2D grid with the same resolution as the BEV map, encoding the maximum number of points that each cell can contain. This normalization map is shown to significantly improve the generalization ability of BEV-based detectors.

Table II. Comparative results on the KITTI test 3D detection benchmark (AP in %; speed in fps). 'E', 'M' and 'H' denote the easy, moderate and hard difficulty levels; 'L' and 'I' denote the LiDAR and image modalities; '–' marks entries not reported in the source.

| Method | Modality | Speed (fps) | Cars (E) | Cars (M) | Cars (H) | Pedestrians (E) | Pedestrians (M) | Pedestrians (H) | Cyclists (E) | Cyclists (M) | Cyclists (H) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Region proposal: multi-view** | | | | | | | | | | | |
| MV3D [19] | L & I | 2.8 | 74.97 | 63.63 | 54.00 | – | – | – | – | – | – |
| AVOD [73] | L & I | 12.5 | 76.39 | 66.47 | 60.23 | 36.10 | 27.86 | 25.76 | 57.19 | 42.08 | 38.29 |
| ContFuse [90] | L & I | 16.7 | 83.68 | 68.78 | 61.67 | – | – | – | – | – | – |
| MMF [89] | L & I | 12.5 | 88.40 | 77.43 | 70.22 | – | – | – | – | – | – |
| SCANet [107] | L & I | 11.1 | 79.22 | 67.13 | 60.65 | – | – | – | – | – | – |
| RT3D [200] | L & I | 11.1 | 23.74 | 19.14 | 18.86 | – | – | – | – | – | – |
| **Region proposal: segmentation-based** | | | | | | | | | | | |
| IPOD [189] | L & I | 5.0 | 80.30 | 73.04 | 68.73 | 55.07 | 44.37 | 40.05 | 71.99 | 52.23 | 46.50 |
| PointRCNN [141] | L | 10.0 | 86.96 | 75.64 | 70.70 | 47.98 | 39.37 | 36.01 | 74.96 | 58.82 | 52.53 |
| PointRGCN [65] | L | 3.8 | 85.97 | 75.73 | 70.60 | – | – | – | – | – | – |
| PointPainting [149] | L & I | 2.5 | 82.11 | 71.70 | 67.08 | 50.32 | 40.97 | 37.87 | 77.63 | 63.78 | 55.89 |
| STD [190] | L | 12.5 | 87.95 | 79.71 | 75.09 | 53.29 | 42.47 | 38.35 | 78.69 | 61.59 | 55.30 |
| **Region proposal: frustum-based** | | | | | | | | | | | |
| FPointNets [128] | L & I | 5.9 | 82.19 | 69.79 | 60.59 | 50.53 | 42.15 | 38.08 | 72.27 | 56.12 | 49.01 |
| SIFRNet [209] | L & I | – | – | – | – | – | – | – | – | – | – |
| PointFusion [179] | L & I | – | 77.92 | 63.00 | 53.27 | 33.36 | 28.04 | 23.38 | 49.34 | 29.42 | 26.98 |
| RoarNet [143] | L & I | 10.0 | 83.71 | 73.04 | 59.16 | – | – | – | – | – | – |
| FConvNet [169] | L & I | 2.1 | 87.36 | 76.39 | 66.69 | 52.16 | 43.38 | 38.80 | 81.98 | 65.07 | 56.54 |
| – | L | 6.7 | 88.67 | 77.20 | 71.82 | – | – | – | – | – | – |
| **Region proposal: other** | | | | | | | | | | | |
| 3D IoU loss [31] | L | 12.5 | 86.16 | 76.50 | 71.39 | – | – | – | – | – | – |
| – | L | 16.7 | 84.80 | 74.59 | 67.27 | – | – | – | – | – | – |
| VoteNet [127] | L | – | – | – | – | – | – | – | – | – | – |
| Feng et al. [117] | L | – | – | – | – | – | – | – | – | – | – |
| Part-A^2 [142] | L | 12.5 | 87.81 | 78.49 | 73.51 | – | – | – | – | – | – |
| **Single shot: BEV-based** | | | | | | | | | | | |
| PIXOR [184] | L | 28.6 | – | – | – | – | – | – | – | – | – |
| HDNET [183] | L | 20.0 | – | – | – | – | – | – | – | – | – |
| BirdNet [8] | L | 9.1 | 13.53 | 9.47 | 8.49 | 12.25 | 8.99 | 8.06 | 16.63 | 10.46 | 9.53 |
| **Single shot: point cloud-based** | | | | | | | | | | | |
| VeloFCN [84] | L | 1.0 | – | – | – | – | – | – | – | – | – |
| 3D FCN [85] | L | <0.2 | – | – | – | – | – | – | – | – | – |
| Vote3Deep [35] | L | – | – | – | – | – | – | – | – | – | – |
| 3DBN [181] | L | 7.7 | 83.77 | 73.53 | 66.23 | – | – | – | – | – | – |
| VoxelNet [213] | L | 2.0 | 77.47 | 65.11 | 57.73 | 39.48 | 33.69 | 31.51 | 61.22 | 48.36 | 44.37 |
| SECOND [182] | L | 26.3 | 83.34 | 72.55 | 65.82 | 48.96 | 38.78 | 34.91 | 71.33 | 52.08 | 45.83 |
| MVXNet [147] | L & I | 16.7 | 84.99 | 71.95 | 64.88 | – | – | – | – | – | – |
| PointPillars [79] | L | 62.0 | 82.58 | 74.31 | 68.99 | 51.45 | 41.92 | 38.89 | 77.10 | 58.65 | 51.92 |
| **Single shot: other** | | | | | | | | | | | |
| LaserNet [115] | L | 83.3 | – | – | – | – | – | – | – | – | – |
| LaserNet++ [114] | L & I | 26.3 | – | – | – | – | – | – | – | – | – |
Table III. Comparative results on the KITTI test BEV detection benchmark (AP in %; speed in fps). 'E', 'M' and 'H' denote the easy, moderate and hard difficulty levels; 'L' and 'I' denote the LiDAR and image modalities; '–' marks entries not reported in the source.

| Method | Modality | Speed (fps) | Cars (E) | Cars (M) | Cars (H) | Pedestrians (E) | Pedestrians (M) | Pedestrians (H) | Cyclists (E) | Cyclists (M) | Cyclists (H) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Region proposal: multi-view** | | | | | | | | | | | |
| MV3D [19] | L & I | 2.8 | 86.62 | 78.93 | 69.80 | – | – | – | – | – | – |
| AVOD [73] | L & I | 12.5 | 89.75 | 84.95 | 78.32 | 42.58 | 33.57 | 30.14 | 64.11 | 48.15 | 42.37 |
| ContFuse [90] | L & I | 16.7 | 94.07 | 85.35 | 75.88 | – | – | – | – | – | – |
| MMF [89] | L & I | 12.5 | 93.67 | 88.21 | 81.99 | – | – | – | – | – | – |
| SCANet [107] | L & I | 11.1 | 90.33 | 82.85 | 76.06 | – | – | – | – | – | – |
| RT3D [200] | L & I | 11.1 | 56.44 | 44.00 | 42.34 | – | – | – | – | – | – |
| **Region proposal: segmentation-based** | | | | | | | | | | | |
| IPOD [189] | L & I | 5.0 | 89.64 | 84.62 | 79.96 | 60.88 | 49.79 | 45.43 | 78.19 | 59.40 | 51.38 |
| PointRCNN [141] | L | 10.0 | 92.13 | 87.39 | 82.72 | 54.77 | 46.13 | 42.84 | 82.56 | 67.24 | 60.28 |
| PointRGCN [65] | L | 3.8 | 91.63 | 87.49 | 80.73 | – | – | – | – | – | – |
| PointPainting [149] | L & I | 2.5 | 92.45 | 88.11 | 83.36 | 58.70 | 49.93 | 46.29 | 83.91 | 71.54 | 62.97 |
| STD [190] | L | 12.5 | 94.74 | 89.19 | 86.42 | 60.02 | 48.72 | 44.55 | 81.36 | 67.23 | 59.35 |
| **Region proposal: frustum-based** | | | | | | | | | | | |
| FPointNets [128] | L & I | 5.9 | 91.17 | 84.67 | 74.77 | 57.13 | 49.57 | 45.48 | 77.26 | 61.37 | 53.78 |
| SIFRNet [209] | L & I | – | – | – | – | – | – | – | – | – | – |
| PointFusion [179] | L & I | – | – | – | – | – | – | – | – | – | – |
| RoarNet [143] | L & I | 10.0 | 88.20 | 79.41 | 70.02 | – | – | – | – | – | – |
| FConvNet [169] | L & I | 2.1 | 91.51 | 85.84 | 76.11 | 57.04 | 48.96 | 44.33 | 84.16 | 68.88 | 60.05 |
| – | L | 6.7 | 92.72 | 88.39 | 83.19 | – | – | – | – | – | – |
| **Region proposal: other** | | | | | | | | | | | |
| 3D IoU loss [31] | L | 12.5 | 91.36 | 86.22 | 81.20 | – | – | – | – | – | – |
| – | L | 16.7 | 90.76 | 85.61 | 79.99 | – | – | – | – | – | – |
| VoteNet [127] | L | – | – | – | – | – | – | – | – | – | – |
| Feng et al. [117] | L | – | – | – | – | – | – | – | – | – | – |
| Part-A^2 [142] | L | 12.5 | 91.70 | 87.79 | 84.61 | – | – | – | 81.91 | 68.12 | 61.92 |
| **Single shot: BEV-based** | | | | | | | | | | | |
| PIXOR [184] | L | 28.6 | 83.97 | 80.01 | 74.31 | – | – | – | – | – | – |
| HDNET [183] | L | 20.0 | 89.14 | 86.57 | 78.32 | – | – | – | – | – | – |
| BirdNet [8] | L | 9.1 | 76.88 | 51.51 | 50.27 | 20.73 | 15.80 | 14.59 | 36.01 | 23.78 | 21.09 |
| **Single shot: point cloud-based** | | | | | | | | | | | |
| VeloFCN [84] | L | 1.0 | 0.02 | 0.14 | 0.21 | – | – | – | – | – | – |
| 3D FCN [85] | L | <0.2 | 70.62 | 61.67 | 55.61 | – | – | – | – | – | – |
| Vote3Deep [35] | L | – | – | – | – | – | – | – | – | – | – |
| 3DBN [181] | L | 7.7 | 89.66 | 83.94 | 76.50 | – | – | – | – | – | – |
| VoxelNet [213] | L | 2.0 | 89.35 | 79.26 | 77.39 | 46.13 | 40.74 | 38.11 | 66.70 | 54.76 | 50.55 |
| SECOND [182] | L | 26.3 | 89.39 | 83.77 | 78.59 | 55.99 | 45.02 | 40.93 | 76.50 | 56.05 | 49.45 |
| MVXNet [147] | L & I | 16.7 | 92.13 | 86.05 | 78.68 | – | – | – | – | – | – |
| PointPillars [79] | L | 62.0 | 90.07 | 86.56 | 82.81 | 57.60 | 48.64 | 45.78 | 79.90 | 62.73 | 55.58 |
| **Single shot: other** | | | | | | | | | | | |
| LaserNet [115] | L | 83.3 | 79.19 | 74.52 | 68.45 | – | – | – | – | – | – |
| LaserNet++ [114] | L & I | 26.3 | – | – | – | – | – | – | – | – | – |
Point Cloud-based Methods. These methods convert a point cloud into a regular representation (e.g., a 2D map), and then apply a CNN to predict both the categories and 3D boxes of objects.
Li et al. [84] proposed the first method to use an FCN for 3D object detection. They converted a point cloud into a 2D point map and used a 2D FCN to predict the bounding boxes and confidences of objects. Later, they [85] discretized the point cloud into a 4D tensor with dimensions of length, width, height and channels, and extended the 2D FCN-based detection techniques to the 3D domain. Compared to [84], the 3D FCN-based method [85] obtains a gain of over 20% in accuracy, but inevitably costs more computing resources due to the 3D convolutions and the sparsity of the data. To address the sparsity problem of voxels, Engelcke et al. [35] leveraged a feature-centric voting scheme to generate a set of votes for each non-empty voxel and obtained the convolutional results by accumulating the votes. The computational complexity of this method is proportional to the number of occupied voxels. Li et al. [181] constructed a 3D backbone network by stacking multiple sparse 3D CNNs. This design saves memory and accelerates computation by fully exploiting the sparsity of voxels, and the backbone extracts rich 3D features for object detection without introducing a heavy computational burden.
Zhou et al. [213] presented VoxelNet, a voxel-based end-to-end trainable framework. They partitioned a point cloud into equally spaced voxels and encoded the features within each voxel into a 4D tensor. A region proposal network is then attached to produce detection results. Although its performance is strong, this method is very slow due to the sparsity of voxels and the 3D convolutions. Later, Yan et al. [182] used a sparse convolutional network [47] to improve the inference efficiency of [213]. They also proposed a sine-error angle loss to solve the ambiguity between orientations of 0 and π. Sindagi et al. [147] extended VoxelNet by fusing image and point cloud features at early stages. Specifically, they projected the non-empty voxels generated by [213] into the image and used a pre-trained network to extract image features for each projected voxel. These image features were then concatenated with the voxel features to produce accurate 3D boxes. Compared to [213, 182], this method can effectively exploit multi-modal information to reduce false positives and negatives. Lang et al. [79] proposed a 3D object detector named PointPillars. This method leverages PointNet [129] to learn features of point clouds organized in vertical columns (pillars) and encodes the learned features as a pseudo-image. A 2D object detection pipeline is then applied to predict 3D bounding boxes. PointPillars outperforms most fusion approaches (including MV3D [19], RoarNet [143] and AVOD [73]) in terms of Average Precision (AP). Moreover, PointPillars can run at 62 fps on both the 3D and BEV KITTI [44] benchmarks, making it highly suitable for practical applications.
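The pillar encoding that lets PointPillars [79] hand a point cloud to a 2D backbone can be sketched as a scatter-with-max-pool; the per-point features below stand in for the small PointNet the method actually learns, and the grid parameters are illustrative:

```python
import numpy as np

def pillar_pseudo_image(points, feats, grid=(8, 8), cell=1.0):
    """Sketch of the pillar encoding behind PointPillars [79]: group
    points into vertical columns on the x-y plane, max-pool a per-point
    feature within each pillar, and scatter the results into a dense
    pseudo-image that a 2D detection backbone can consume. `feats`
    stands in for learned per-point features."""
    h, w = grid
    c = feats.shape[1]
    image = np.zeros((h, w, c))
    ix = np.clip((points[:, 0] / cell).astype(int), 0, w - 1)
    iy = np.clip((points[:, 1] / cell).astype(int), 0, h - 1)
    for x, y, f in zip(ix, iy, feats):
        image[y, x] = np.maximum(image[y, x], f)   # max-pool per pillar
    return image

pts = np.array([[0.2, 0.3, 1.0], [0.7, 0.1, 2.0], [3.5, 3.5, 0.5]])
fts = np.array([[1.0], [4.0], [2.0]])
img = pillar_pseudo_image(pts, fts)
```

The appeal of the design is visible even in this toy version: after the scatter, all remaining computation is dense 2D convolution, which is what makes the 62 fps runtime possible.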
Other Methods. Meyer et al. [115] proposed LaserNet, an efficient 3D object detector. This method predicts a probability distribution over bounding boxes for each point and then combines these per-point distributions to generate the final 3D object boxes. Furthermore, the dense range view (RV) representation of the point cloud is used as input, and a fast mean-shift algorithm is proposed to reduce the noise produced by the per-point predictions. LaserNet achieves state-of-the-art performance in the range of 0 to 50 meters, and its runtime is significantly lower than that of existing methods. Meyer et al. [114] then extended LaserNet to exploit the dense texture provided by RGB images. Specifically, they associated LiDAR points with image pixels by projecting the 3D point cloud onto the 2D image, and exploited this association to fuse RGB information into the 3D points. They also treated 3D semantic segmentation as an auxiliary task to learn better representations. This method achieves a significant improvement in long-range (e.g., 50 to 70 meters) object detection and semantic segmentation while maintaining the high efficiency of LaserNet [115].
Given the location of an object in the first frame, the task of object tracking is to estimate its state in subsequent frames [56, 97]. Since 3D object tracking can exploit the rich geometric information in point clouds, it is expected to overcome several drawbacks of 2D image-based tracking, including occlusion, illumination and scale variation.
Inspired by the success of the Siamese network [10] in image-based object tracking, Giancola et al. [45] proposed a 3D Siamese network with shape completion regularization. Specifically, they first generated candidates using a Kalman filter, and encoded the model and candidates into a compact representation using shape regularization. Cosine similarity is then used to search for the location of the tracked object in the next frame. This method can be used as an alternative for object tracking, and significantly outperforms most 2D object tracking methods, including [119] and SiamFC [10]. To search for the target object efficiently, Zarzar et al. [198] leveraged a 2D Siamese network to generate a large number of coarse object candidates on the BEV representation. They then refined the candidates by exploiting the cosine similarity in the 3D Siamese network. This method significantly outperforms [45] in terms of both precision (by 18%) and success rate (by 12%). Simon et al. [145] proposed a 3D object detection and tracking architecture for semantic point clouds. They first generated voxelized semantic point clouds by fusing 2D visual semantic information, and then utilized temporal information to improve the accuracy and robustness of multi-target tracking. In addition, they introduced a powerful and simplified evaluation metric (the Scale-Rotation-Translation score, SRTs) to speed up training and inference. Their Complexer-YOLO achieves promising tracking performance and still runs in real time.
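The matching step shared by these Siamese trackers reduces to a cosine-similarity search over candidate encodings; the feature vectors below are placeholders for a learned shape encoding, so this is an illustrative sketch rather than either paper's pipeline:

```python
import numpy as np

def cosine_match(template_feat, candidate_feats):
    """Sketch of the matching step in Siamese trackers such as [45]:
    the tracked object's encoding is compared with each candidate's
    encoding by cosine similarity, and the best-scoring candidate is
    taken as the object's new location."""
    t = template_feat / np.linalg.norm(template_feat)
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    scores = c @ t                      # cosine similarity per candidate
    return int(np.argmax(scores)), scores

template = np.array([1.0, 0.0, 1.0])
cands = np.array([[0.9, 0.1, 1.1],     # near-duplicate of the template
                  [-1.0, 0.5, 0.0],
                  [0.0, 1.0, 0.0]])
best, scores = cosine_match(template, cands)
```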
Analogous to optical flow estimation in 2D vision, several methods have started to learn useful information (e.g., 3D scene flow, spatio-temporal information) from sequences of point clouds.
Liu et al. [100] proposed FlowNet3D to directly learn scene flow from a pair of consecutive point clouds. FlowNet3D learns both point-level features and motion features through a flow embedding layer. However, there are two problems with FlowNet3D. First, some predicted motion vectors differ significantly from the ground truth in their directions. Second, it is difficult to apply FlowNet3D to non-static scenes, especially scenes dominated by deformable objects. To solve these problems, Wang et al. [170] introduced a cosine distance loss to minimize the angle between the predictions and the ground truth. They also proposed a point-to-plane distance loss to improve the accuracy for both rigid and dynamic scenes. Experimental results show that these two loss terms improve the accuracy of FlowNet3D from 57.85% to 63.43%, and speed up and stabilize the training process. Gu et al. [49] proposed a Hierarchical Permutohedral Lattice FlowNet (HPLFlowNet) to directly estimate scene flow from large-scale point clouds. Several bilateral convolution layers are proposed to restore structural information from the raw point clouds while reducing the computational cost.
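The cosine distance loss of Wang et al. [170] can be written down directly; the exact normalization and weighting in the paper may differ from this sketch:

```python
import numpy as np

def cosine_distance_loss(pred_flow, gt_flow, eps=1e-8):
    """Sketch of the angle-penalizing loss of Wang et al. [170]: the
    cosine distance between predicted and ground-truth flow vectors
    penalizes direction errors that a plain L2 loss under-weights.
    `eps` guards against zero-length vectors."""
    dot = np.sum(pred_flow * gt_flow, axis=1)
    norms = np.linalg.norm(pred_flow, axis=1) * np.linalg.norm(gt_flow, axis=1)
    return float(np.mean(1.0 - dot / (norms + eps)))

gt = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
aligned = cosine_distance_loss(gt.copy(), gt)   # identical directions -> ~0
opposed = cosine_distance_loss(-gt, gt)         # opposite directions -> ~2
```

Because the loss depends only on direction, it complements (rather than replaces) a magnitude-sensitive distance term such as the point-to-plane loss.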
To effectively process sequential point clouds, Fan and Yang [41] proposed the PointRNN, PointGRU and PointLSTM networks together with a sequence-to-sequence model to track moving points. PointRNN, PointGRU and PointLSTM are able to capture spatio-temporal information and model dynamic point clouds. Similarly, Liu et al. [101] proposed MeteorNet to directly learn a representation from dynamic point clouds. This method learns to aggregate information from spatio-temporal neighboring points; direct grouping and chained-flow grouping are further introduced to determine the temporal neighbors. However, the performance of the aforementioned methods is limited by the scale of available datasets. Mittal et al. [118] proposed two self-supervised losses to train their network on large unlabeled datasets. Their main idea is that a robust scene flow estimation method should be effective in both forward and backward prediction. Since scene flow annotations are unavailable, the nearest neighbor of a predicted transformed point is taken as pseudo ground truth; however, the true correspondence may not be the nearest point. To mitigate this, they computed the scene flow in the reverse direction and proposed a cycle consistency loss that translates the point back to its original position. Experimental results show that this self-supervised method exceeds the state-of-the-art performance of supervised learning-based methods.
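The two self-supervised signals of Mittal et al. [118] (nearest-neighbor pseudo ground truth and cycle consistency) can be sketched as follows, with the flow fields standing in for network predictions:

```python
import numpy as np

def nearest_neighbor_loss(warped, target):
    """Mean distance from each warped point to its nearest neighbor in
    the target frame: the pseudo ground truth used when real flow
    annotations are unavailable [118]."""
    d = np.linalg.norm(warped[:, None, :] - target[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())

def cycle_consistency_loss(points, forward_flow, backward_flow):
    """Sketch of the cycle loss of [118]: warping a point forward and
    then backward should return it to its origin."""
    restored = points + forward_flow + backward_flow
    return float(np.linalg.norm(restored - points, axis=1).mean())

pts = np.zeros((4, 3))
flow = np.tile([1.0, 0.0, 0.0], (4, 1))
cyc = cycle_consistency_loss(pts, flow, -flow)   # perfect inverse flow
```

The cycle term is what protects training from the bias of the nearest-neighbor pseudo labels: a flow that merely snaps points onto the target cloud but cannot be inverted is penalized.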
The KITTI [44] benchmark is one of the most influential datasets in autonomous driving and has been commonly used in both academia and industry. Tables II and III present the results achieved by different detectors on the KITTI test 3D and BEV benchmarks, respectively. The following observations can be made:
Region proposalbased methods are the most frequently investigated methods among these two categories, and outperform single shot methods by a large margin on both KITTI test 3D and BEV benchmarks.
There are two limitations for existing 3D object detectors. First, the longrange detection capability of existing methods is relatively poor. Second, how to fully exploit the texture information in images is still an open problem.
Multitask learning is a future direction in 3D object detection. For example, MMF [89] learns a crossmodality representation to achieve stateoftheart detection performance by incorporating multiple tasks.
3D object tracking and scene flow estimation are emerging research topics, and have gradually attracted increasing attention since 2019.
3D point cloud segmentation requires an understanding of both the global geometric structure and the fine-grained details of each point. According to the segmentation granularity, 3D point cloud segmentation methods can be classified into three categories:
semantic segmentation (scene level), instance segmentation (object level) and part segmentation (part level).

Given a point cloud, the goal of semantic segmentation is to separate it into several subsets according to the semantic meanings of points. Similar to the taxonomy for 3D shape classification (see Section 2), there are two paradigms for semantic segmentation, i.e., projection-based and point-based methods. We show several representative methods in Fig. 8.
Intermediate regular representations can be organized into multi-view representation [80, 12], spherical representation [172, 173, 116], volumetric representation [113, 135, 46], permutohedral lattice representation [150, 137], and hybrid representation [27, 64], as shown in Fig. 9.
Multi-view Representation. Felix et al. [80] first projected a 3D point cloud onto 2D planes from multiple virtual camera views. A multi-stream FCN is then used to predict pixel-wise scores on the synthetic images. The final semantic label of each point is obtained by fusing the re-projected scores over different views. Similarly, Boulch et al. [12] first generated several RGB and depth snapshots of a point cloud using multiple camera positions. They then performed pixel-wise labeling on these snapshots using 2D segmentation networks. The scores predicted from RGB and depth images are further fused using residual correction [5]. Based on the assumption that point clouds are sampled from locally Euclidean surfaces, Tatarchenko et al. [154] introduced tangent convolutions for dense point cloud segmentation. This method first projects the local surface geometry around each point onto a virtual tangent plane. Tangent convolutions then operate directly on the surface geometry. This method shows great scalability and is able to process large-scale point clouds with millions of points. Overall, the performance of multi-view segmentation methods is sensitive to viewpoint selection and occlusions. Besides, these methods do not fully exploit the underlying geometric and structural information, as the projection step inevitably introduces information loss.
Spherical Representation. To achieve fast and accurate segmentation of 3D point clouds, Wu et al. [172] proposed an end-to-end network based on SqueezeNet [62] and Conditional Random Fields (CRF). To further improve segmentation accuracy, SqueezeSegV2 [173] addresses domain shift with an unsupervised domain adaptation pipeline. Milioto et al. [116] proposed RangeNet++ for real-time semantic segmentation of LiDAR point clouds. The semantic labels of 2D range images are first transferred to 3D point clouds; an efficient GPU-enabled KNN-based post-processing step is then used to alleviate discretization errors and blurry inference outputs. Compared to single-view projection, spherical projection retains more information and is suitable for the labeling of LiDAR point clouds. However, this intermediate representation inevitably brings problems such as discretization errors and occlusions.
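The spherical (range-image) projection underlying these methods maps each LiDAR point to a pixel via its yaw and pitch angles. A minimal numpy sketch follows; the function name is ours, and the vertical field-of-view values are assumptions matching a typical 64-beam sensor, not constants from any of the cited papers.

```python
import numpy as np

def spherical_projection(points, H=64, W=1024, fov_up=3.0, fov_down=-25.0):
    """Project LiDAR points (N, 3) onto an H x W range image:
    yaw selects the column, pitch selects the row. Unfilled pixels
    are set to -1; later points simply overwrite earlier ones."""
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    fov = fov_up - fov_down
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                      # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / r, -1, 1))    # elevation angle
    u = ((0.5 * (1.0 - yaw / np.pi)) * W).astype(int) % W
    v = ((1.0 - (pitch - fov_down) / fov) * H).astype(int).clip(0, H - 1)
    image = np.full((H, W), -1.0)
    image[v, u] = r
    return image
```

A real pipeline would also store remission and coordinates per pixel and keep the point-to-pixel mapping for the back-projection and KNN clean-up step described above.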
Volumetric Representation. Huang et al. [60] first divided a point cloud into a set of occupancy voxels. They then fed these intermediate data to a fully-3D convolutional neural network for voxel-wise segmentation. Finally, all points within a voxel are assigned the same semantic label as the voxel. The performance of this method is severely limited by the granularity of the voxels and the boundary artifacts caused by the point cloud partition. Further, Tchapmi et al. [155] proposed SEGCloud to achieve fine-grained and globally consistent semantic segmentation. This method introduces a deterministic trilinear interpolation to map the coarse voxel predictions generated by 3D-FCNN [106] back to the point cloud, and then uses a Fully Connected CRF (FC-CRF) to enforce spatial consistency of the inferred per-point labels. Meng et al. [113] introduced a kernel-based interpolated variational autoencoder architecture to encode the local geometric structures within each voxel. Instead of a binary occupancy representation, RBFs are employed for each voxel to obtain a continuous representation and capture the distribution of points in each voxel. A VAE is further used to map the point distribution within each voxel to a compact latent space. Then, both symmetry groups and an equivariant CNN are used to achieve robust feature learning.
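The voxelization step shared by these methods, and the broadcasting of a per-voxel label back to its member points, can be sketched as follows (an illustrative helper of our own, not code from the cited works):

```python
import numpy as np

def voxelize(points, voxel_size):
    """Map each point (N, 3) to its voxel. Returns the occupied voxel
    coordinates and, for each point, the index of its voxel, so a
    per-voxel prediction can be broadcast back to all member points."""
    idx = np.floor(points / voxel_size).astype(int)
    voxels, inverse = np.unique(idx, axis=0, return_inverse=True)
    return voxels, inverse.reshape(-1)

# given per-voxel labels, each point inherits its voxel's label:
# point_labels = voxel_labels[inverse]
```

This coarse label transfer is exactly what SEGCloud refines with trilinear interpolation and an FC-CRF instead of the hard voxel-to-point assignment.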
Good scalability is one of the remarkable advantages of the volumetric representation. Specifically, volumetric networks are free to be trained and tested on point clouds with different spatial sizes. In the Fully-Convolutional Point Network (FCPN) [135], different levels of geometric relations are first hierarchically abstracted from point clouds; 3D convolutions and weighted average pooling are then used to extract features and incorporate long-range dependencies. This method can process large-scale point clouds and has good scalability during inference. Angela et al. [28] proposed ScanComplete to achieve 3D scan completion and per-voxel semantic labeling. This method leverages the scalability of fully-convolutional neural networks and can adapt to different input data sizes during training and testing. A coarse-to-fine strategy is used to hierarchically improve the resolution of the predicted results.
The volumetric representation is naturally sparse, as the number of non-zero values only accounts for a small percentage. Therefore, it is inefficient to apply dense convolutional neural networks to such spatially-sparse data. To this end, Graham et al. [46] proposed submanifold sparse convolutional networks. This method significantly reduces memory and computational costs by restricting the output of a convolution to be related only to occupied voxels. Meanwhile, its sparse convolution can also control the sparsity of the extracted features. This submanifold sparse convolution is suitable for the efficient processing of high-dimensional and spatially-sparse data. Further, Choy et al. [23] proposed a 4D spatio-temporal convolutional neural network called MinkowskiNet for 3D video perception. A generalized sparse convolution is proposed to effectively process high-dimensional data, and a trilateral-stationary conditional random field is further applied to enforce consistency.
Overall, the volumetric representation naturally preserves the neighborhood structure of 3D point clouds. Its regular data format also allows the direct application of standard 3D convolutions. These factors have led to a steady performance improvement in this area. However, the voxelization step inherently introduces discretization artifacts and information loss. Usually, a high resolution leads to high memory and computational costs, while a low resolution introduces loss of details. It is non-trivial to select an appropriate grid resolution in practice.
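The key idea of the submanifold restriction, computing outputs only at already-occupied sites so the sparsity pattern never dilates, can be shown with a toy scalar-feature convolution over a hash map of occupied voxels. This is a didactic sketch, not the actual SparseConvNet or MinkowskiEngine implementation:

```python
def submanifold_conv3d(voxels, weights, bias=0.0):
    """voxels: dict {(i, j, k): feature}; weights: dict {(di, dj, dk): w}.
    The output is computed only at occupied sites, and only occupied
    neighbors contribute, so the sparsity pattern is preserved
    (the 'submanifold' property)."""
    out = {}
    for site in voxels:
        acc = bias
        for off, w in weights.items():
            nb = (site[0] + off[0], site[1] + off[1], site[2] + off[2])
            if nb in voxels:          # skip empty space entirely
                acc += w * voxels[nb]
        out[site] = acc
    return out
```

A dense 3D convolution would instead produce non-zero outputs in a halo around every occupied voxel, progressively destroying the sparsity that makes large scenes tractable.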
Permutohedral Lattice Representation. Su et al. [150] proposed the Sparse Lattice Network (SPLATNet) based on Bilateral Convolution Layers (BCLs). This method first interpolates a raw point cloud onto a permutohedral sparse lattice; BCLs are then applied to convolve on occupied parts of the sparsely populated lattice. The filtered output is finally interpolated back to the raw point cloud. In addition, this method allows flexible joint processing of multi-view images and point clouds. Further, Rosu et al. [137] proposed LatticeNet to achieve efficient processing of large point clouds. A data-dependent interpolation module called DeformSlice is also introduced to back-project lattice features to point clouds.
Hybrid Representation. To further leverage all available information, several methods have been proposed to learn multi-modal features from 3D scans. Angela and Matthias [27] presented a joint 3D-multi-view network to combine RGB features and geometric features. A 3D CNN stream and several 2D streams are used to extract features, and a differentiable back-projection layer is proposed to jointly fuse the learned 2D embeddings and 3D geometric features. Further, Hung et al. [22] proposed a unified point-based framework to learn 2D textural appearance, 3D structures and global context features from point clouds. This method directly applies point-based networks to extract local geometric features and global context from sparsely sampled point sets, without any voxelization. Jaritz et al. [64] proposed Multi-View PointNet (MVPNet) to aggregate appearance features from 2D multi-view images and spatial geometric features in the canonical point cloud space.
Point-based networks directly work on irregular point clouds. However, point clouds are orderless and unstructured, making it infeasible to directly apply standard CNNs. To this end, the pioneering work PointNet [129] was proposed to learn per-point features using shared MLPs and global features using symmetric pooling functions. Based on PointNet, a series of point-based networks have been proposed recently. Overall, these methods can be roughly divided into point-wise MLP methods, point convolution methods, RNN-based methods, and graph-based methods.
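The PointNet recipe, a shared MLP applied to every point independently followed by a symmetric max-pool, can be condensed into a few lines of numpy. This is a minimal sketch of the idea (two shared layers, no T-Nets or batch normalization), not the full architecture:

```python
import numpy as np

def pointnet_global_feature(points, W1, b1, W2, b2):
    """PointNet in a nutshell: a shared MLP lifts each point (N, 3)
    independently, then a symmetric max-pool over the point axis
    yields an order-invariant global descriptor."""
    h = np.maximum(points @ W1 + b1, 0.0)   # shared layer 1 (ReLU)
    h = np.maximum(h @ W2 + b2, 0.0)        # shared layer 2 (ReLU)
    return h.max(axis=0)                    # symmetric function
```

Because max-pooling ignores point order, permuting the input leaves the descriptor unchanged, which is precisely why this design sidesteps the orderless-input problem.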
Point-wise MLP Methods. These methods usually use shared MLPs as the basic unit in their networks for high efficiency. However, point-wise features extracted by shared MLPs cannot capture the local geometry in point clouds or the mutual interactions between points [129]. To capture a wider context for each point and learn richer local structures, several dedicated networks have been introduced, including methods based on neighboring feature pooling, attention-based aggregation, and local-global feature concatenation.
Neighboring feature pooling: To capture local geometric patterns, these methods learn a feature for each point by aggregating information from local neighboring points. In particular, PointNet++ [131] groups points hierarchically and progressively learns from larger local regions, as illustrated in Fig. 10. Multi-scale grouping and multi-resolution grouping are also proposed to overcome the problems caused by the non-uniformity and varying density of point clouds. Later, Jiang et al. [67] proposed a PointSIFT module to achieve orientation encoding and scale awareness. This module stacks and encodes the information from eight spatial orientations through a three-stage ordered convolution operation. Multi-scale features are extracted and concatenated to achieve adaptivity to different scales. Different from the grouping technique used in PointNet++ (i.e., ball query), Francis et al. [38] utilized K-means clustering and KNN to separately define two neighborhoods in the world space and the learned feature space. Based on the assumption that points from the same class are expected to be closer in feature space, a pairwise distance loss and a centroid loss are introduced to further regularize feature learning. To model the mutual interactions between different points, Zhao et al. [208] proposed PointWeb to explore the relations between all pairs of points in a local region by densely constructing a locally fully-linked web. An Adaptive Feature Adjustment (AFA) module is proposed to achieve information interchange and feature refinement. This aggregation operation helps the network to learn a discriminative feature representation. Zhang et al. [206] proposed a permutation-invariant convolution called ShellConv based on statistics from concentric spherical shells.
This method first queries a set of multi-scale concentric spheres; max-pooling is then used within different shells to summarize the statistics, and MLPs and 1D convolutions are used to obtain the final convolution output. Hu et al. [57] proposed an efficient and lightweight network called RandLA-Net for large-scale point cloud processing. This network utilizes random point sampling to achieve remarkable efficiency in terms of memory and computation. A local feature aggregation module is further proposed to capture and preserve geometric features.
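The ball-query grouping that PointNet++ builds on can be sketched directly; this illustrative helper (our naming, brute-force distances) gathers up to a fixed number of point indices within a radius of each centroid, padding short groups as the original implementation does:

```python
import numpy as np

def ball_query(points, centroids, radius, max_samples):
    """For each centroid, gather up to max_samples indices of points
    (N, 3) within the given radius; short groups are padded by
    repeating the first found index. Returns (M, max_samples) ints."""
    groups = []
    for c in centroids:
        d = np.linalg.norm(points - c, axis=1)
        idx = np.flatnonzero(d < radius)[:max_samples]
        if idx.size == 0:
            idx = np.array([np.argmin(d)])  # fall back to nearest point
        pad = np.full(max_samples, idx[0])
        pad[:idx.size] = idx
        groups.append(pad)
    return np.stack(groups)
```

Unlike plain KNN, the fixed radius keeps the spatial extent of every group constant, which is what gives the hierarchical abstraction a consistent receptive field under varying point density.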
Attention-based aggregation: To further improve segmentation accuracy, the attention mechanism [159] has been introduced to point cloud segmentation. Yang et al. [186] proposed a group shuffle attention to model the relations between points, and presented a permutation-invariant, task-agnostic and differentiable Gumbel Subset Sampling (GSS) to replace the widely used Farthest Point Sampling (FPS) approach. This module is less sensitive to outliers and can select a representative subset of points. To better capture the spatial distribution of a point cloud, Chen et al. [17] proposed a Local Spatial Aware (LSA) layer to learn spatial awareness weights based on the spatial layouts and local structures of point clouds. Similar to CRF, Zhao et al. [207] proposed an Attention-based Score Refinement (ASR) module to post-process the segmentation results produced by the network. The initial segmentation result is refined by pooling the scores of neighboring points with learned attention weights. This module can be easily integrated into existing deep networks to improve the final segmentation performance.

Local-global concatenation: Zhao et al. [210] proposed a permutation-invariant PSNet to incorporate local structures and global context from point clouds. EdgeConv [168] and NetVLAD [3] are repeatedly stacked to capture local information and scene-level global features.
Point Convolution Methods. These methods tend to propose effective convolution operations for point clouds. Hua et al. [59] proposed a point-wise convolution operator, where neighboring points are binned into kernel cells and then convolved with kernel weights. Wang et al. [165] proposed a network called PCCN based on parametric continuous convolution layers. The kernel function of this layer is parameterized by MLPs and spans the continuous vector space. Hugues et al. [157] proposed a Kernel Point Fully Convolutional Network (KP-FCNN) based on Kernel Point Convolution (KPConv). Specifically, the convolution weights of KPConv are determined by the Euclidean distances to kernel points, and the number of kernel points is not fixed. The positions of the kernel points are formulated as an optimization problem of best coverage in a sphere space. Note that a radius neighborhood is used to keep a consistent receptive field, while grid subsampling is used in each layer to achieve high robustness under varying densities of point clouds. In [37], Francis et al. provided rich ablation experiments and visualization results to show the impact of the receptive field on the performance of aggregation-based methods. They also proposed a Dilated Point Convolution (DPC) operation to aggregate dilated neighboring features, instead of the K nearest neighbors. This operation is demonstrated to be very effective in increasing the receptive field and can be easily integrated into existing aggregation-based networks.
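The distance-based weighting at the heart of KPConv can be sketched for a single center point. Here the linear correlation max(0, 1 - d/sigma) follows the rigid-kernel formulation of [157], but the function itself is a simplified stand-in (single point, no normalization or deformable offsets):

```python
import numpy as np

def kp_conv(neighbors, features, kernel_pts, kernel_W, sigma):
    """KPConv-style aggregation for one center point.
    neighbors: (N, 3) offsets from the center; features: (N, Cin);
    kernel_pts: (K, 3); kernel_W: (K, Cin, Cout); sigma: influence radius.
    Each neighbor's feature is weighted by its linear correlation with
    every kernel point, then passed through that kernel point's weights."""
    d = np.linalg.norm(neighbors[:, None, :] - kernel_pts[None, :, :], axis=-1)
    corr = np.maximum(0.0, 1.0 - d / sigma)   # (N, K) influence weights
    # sum over neighbors n and kernel points k of corr[n,k] * f[n] @ W[k]
    return np.einsum('nk,nc,kcd->d', corr, features, kernel_W)
```

A neighbor sitting exactly on a kernel point receives that kernel's full weight matrix; neighbors farther than sigma from a kernel point contribute nothing to it, which is how the kernel acquires a spatial structure analogous to an image filter.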
(OA and mIoU are in %; "-" indicates the result is not reported. S3DIS results cover Area 5 and 6-fold cross validation; Semantic3D results cover the semantic-8 and reduced-8 benchmarks; ScanNet(v2) and SemanticKITTI columns follow.)

| Category | Method | S3DIS (Area5) OA | mIoU | S3DIS (6-fold) OA | mIoU | Semantic3D (semantic-8) OA | mIoU | Semantic3D (reduced-8) OA | mIoU | ScanNet OA | mIoU | SemanticKITTI mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Multi-view | DeePr3SS [80] | - | - | - | - | 88.9 | 58.5 | - | - | - | - | - |
| Multi-view | SnapNet [12] | - | - | - | - | 91.0 | 67.4 | 88.6 | 59.1 | - | - | - |
| Multi-view | TangentConv [154] | 82.5 | 52.8 | - | - | - | - | - | - | 80.1 | 40.9 | 40.9 |
| Spherical | SqueezeSeg [172] | - | - | - | - | - | - | - | - | - | - | 29.5 |
| Spherical | SqueezeSegV2 [173] | - | - | - | - | - | - | - | - | - | - | 39.7 |
| Spherical | RangeNet++ [116] | - | - | - | - | - | - | - | - | - | - | 52.2 |
| Volumetric | SegCloud [155] | - | 48.9 | - | - | - | - | 88.1 | 61.3 | - | - | - |
| Volumetric | SparseConvNet [46] | - | - | - | - | - | - | - | - | - | 72.5 | - |
| Volumetric | MinkowskiNet [23] | - | - | - | - | - | - | - | - | - | 73.6 | - |
| Volumetric | VV-Net [113] | - | - | 87.8 | 78.2 | - | - | - | - | - | - | - |
| Lattice | SPLATNet [150] | - | - | - | - | - | - | - | - | - | 39.3 | 18.4 |
| Lattice | LatticeNet [137] | - | - | - | - | - | - | - | - | - | 64.0 | 52.2 |
| Hybrid | 3DMV [27] | - | - | - | - | - | - | - | - | - | 48.4 | - |
| Hybrid | UPB [22] | - | - | - | - | - | - | - | - | - | 63.4 | - |
| Hybrid | MVPNet [64] | - | - | - | - | - | - | - | - | - | 64.1 | - |
| Point-wise MLP | PointNet [129] | - | 41.1 | 78.6 | 47.6 | - | - | - | - | - | - | 14.6 |
| Point-wise MLP | PointNet++ [131] | - | - | 81.0 | 54.5 | 85.7 | 63.1 | - | - | 84.5 | 33.9 | 20.1 |
| Point-wise MLP | PointSIFT [67] | - | - | 88.7 | 70.2 | - | - | - | - | 86.2 | 41.5 | - |
| Point-wise MLP | Engelmann [39] | 84.2 | 52.2 | 84.0 | 58.3 | - | - | - | - | - | - | - |
| Point-wise MLP | 3DContextNet [199] | - | - | 84.9 | 55.6 | - | - | - | - | - | - | - |
| Point-wise MLP | A-SCN [177] | - | - | 81.6 | 52.7 | - | - | - | - | - | - | - |
| Point-wise MLP | PointWeb [208] | 87.0 | 60.3 | 87.3 | 66.7 | - | - | - | - | 85.9 | - | - |
| Point-wise MLP | PAT [186] | - | 60.1 | - | 64.3 | - | - | - | - | - | - | - |
| Point-wise MLP | LSANet [17] | - | - | 86.8 | 62.2 | - | - | - | - | 85.1 | - | - |
| Point-wise MLP | ShellNet [206] | - | - | 87.1 | 66.8 | - | - | 93.2 | 69.3 | 85.2 | - | - |
| Point-wise MLP | RandLA-Net [57] | - | - | 87.2 | 68.5 | - | - | 94.4 | 76.0 | - | - | 50.3 |
| Point convolution | PointCNN [88] | 85.9 | 57.3 | 88.1 | 65.4 | - | - | - | - | 85.1 | 45.8 | - |
| Point convolution | PCCN [165] | - | 58.3 | - | - | - | - | - | - | - | - | - |
| Point convolution | A-CNN [72] | - | - | 87.3 | - | - | - | - | - | 85.4 | - | - |
| Point convolution | ConvPoint [13] | - | - | 88.8 | 68.2 | 93.4 | 76.5 | - | - | - | - | - |
| Point convolution | KPConv [157] | - | 67.1 | - | 70.6 | - | - | 92.9 | 74.6 | - | 68.4 | - |
| Point convolution | DPC [37] | 86.8 | 61.3 | - | - | - | - | - | - | - | 59.2 | - |
| Point convolution | InterpCNN [110] | - | - | 88.7 | 66.7 | - | - | - | - | - | - | - |
| RNN-based | RSNet [61] | - | 51.9 | - | 56.5 | - | - | - | - | 84.9 | 39.4 | - |
| RNN-based | G+RCU [36] | - | 45.1 | 81.1 | 49.7 | - | - | - | - | - | - | - |
| RNN-based | 3P-RNN [191] | 85.7 | 53.4 | 86.9 | 56.3 | - | - | - | - | - | - | - |
| Graph-based | DGCNN [168] | - | - | 84.1 | 56.1 | - | - | - | - | - | - | - |
| Graph-based | SPG [78] | 86.4 | 58.0 | 85.5 | 62.1 | 92.9 | 76.2 | 94.0 | 73.2 | - | - | 17.4 |
| Graph-based | SSP+SPG [77] | 87.9 | 61.7 | 87.9 | 68.4 | - | - | - | - | - | - | - |
| Graph-based | GACNet [162] | 87.8 | 62.9 | - | - | 91.9 | 70.8 | - | - | - | - | - |
| Graph-based | PAG [123] | 86.8 | 59.3 | 88.1 | 65.9 | - | - | - | - | - | - | - |
| Graph-based | HDGCN [92] | - | 59.3 | - | 66.9 | - | - | - | - | - | - | - |
| Graph-based | HPEIN [66] | 87.2 | 61.9 | 88.2 | 67.8 | - | - | - | - | - | 61.8 | - |
| Graph-based | SPH3D-GCN [82] | 87.7 | 59.5 | 88.6 | 68.9 | - | - | - | - | - | 61.0 | - |
| Graph-based | DPAM [98] | 86.1 | 60.0 | 87.6 | 64.5 | - | - | - | - | - | - | - |
RNN-based Methods. To capture inherent context features from point clouds, Recurrent Neural Networks (RNNs) have also been used for the semantic segmentation of point clouds. Based on PointNet [129], Francis et al. [36] first transformed a block of points into multi-scale blocks and grid blocks to obtain input-level context. The block-wise features extracted by PointNet are then sequentially fed into Consolidation Units (CU) or Recurrent Consolidation Units (RCU) to obtain output-level context. Experimental results show that incorporating spatial context is important for improving segmentation performance. Huang et al. [61] proposed a lightweight local dependency modeling module, and utilized a slice pooling layer to convert unordered point feature sets into an ordered sequence of feature vectors. Ye et al. [191] first proposed a Pointwise Pyramid Pooling (3P) module to capture the coarse-to-fine local structure, and then utilized two-direction hierarchical RNNs to further obtain long-range spatial dependencies. An RNN is then applied to achieve end-to-end learning. However, these methods lose rich geometric features and density distributions of point clouds when aggregating local neighborhood features with global structure features [211]. To alleviate the problems caused by rigid and static pooling operations, Zhao et al. [211] proposed a Dynamic Aggregation Network (DARNet) to consider both the global scene complexity and local geometric features. The intermediate features are dynamically aggregated using a self-adapted receptive field and node weights. Liu et al. [96] proposed 3DCNN-DQN-RNN for efficient semantic parsing of large-scale point clouds. This network first learns the spatial distribution and color features using a 3D CNN; a DQN is further used to localize the objects of each class. The final concatenated feature vector is fed into a residual RNN to obtain the final segmentation results.

Graph-based Methods. To capture the underlying shapes and geometric structures of 3D point clouds, several methods resort to graph networks. Loic et al. [78] represented a point cloud as a set of interconnected simple shapes and superpoints, and used an attributed directed graph (i.e., a superpoint graph) to capture the structure and context information. The large-scale point cloud segmentation problem is then split into three sub-problems, i.e., geometrically homogeneous partition, superpoint embedding, and contextual segmentation. To further improve the partition step, Loic and Mohamed [77] proposed a supervised framework to over-segment a point cloud into pure superpoints. This problem is formulated as a deep metric learning problem structured by an adjacency graph. A graph-structured contrastive loss is also proposed to help the recognition of borders between objects.
To better capture the local geometric relationships in high-dimensional space, Kang et al. [212] proposed PyramNet, which is based on a Graph Embedding Module (GEM) and a Pyramid Attention Network (PAN). The GEM module formulates a point cloud as a directed acyclic graph and utilizes a covariance matrix, instead of the Euclidean distance, to construct the adjacency similarity matrix. Convolution kernels of four different sizes are used in the PAN module to extract features with different semantic intensities. In [162], Graph Attention Convolution (GAC) is proposed to selectively learn relevant features from a local neighboring set. This is achieved by dynamically assigning attention weights to different neighboring points and feature channels based on their spatial positions and feature differences. GAC can learn to capture discriminative features for segmentation, and has characteristics similar to the commonly used CRF model.
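A GAC-flavored aggregation step can be sketched as follows. This is an illustrative single-head, single-center version of the idea (attention logits from spatial offsets and feature differences, softmaxed over the neighborhood); the parameter names are ours, and the actual GAC formulation in [162] uses an MLP and per-channel weights:

```python
import numpy as np

def graph_attention_conv(center_xyz, center_f, nbr_xyz, nbr_f, W_att, b_att):
    """Aggregate neighbor features with attention weights computed from
    each neighbor's spatial offset and feature difference to the center.
    center_xyz: (3,), center_f: (C,), nbr_xyz: (N, 3), nbr_f: (N, C),
    W_att: (3 + C,), b_att: scalar. Returns the aggregated (C,) feature."""
    offsets = nbr_xyz - center_xyz             # spatial relation (N, 3)
    diffs = nbr_f - center_f                   # feature relation (N, C)
    logits = np.concatenate([offsets, diffs], axis=1) @ W_att + b_att
    a = np.exp(logits - logits.max())          # stable softmax over neighbors
    a = a / a.sum()
    return a @ nbr_f
```

With zero attention parameters this degenerates to uniform neighborhood averaging; learning makes the weights suppress neighbors that likely belong to a different class, which is the CRF-like behavior noted above.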
Compared to semantic segmentation, instance segmentation is more challenging as it requires more accurate and fine-grained reasoning about points. In particular, it needs not only to distinguish points with different semantic meanings, but also to separate instances with the same semantic meaning. Overall, existing methods can be divided into two groups: proposal-based methods and proposal-free methods. Several milestone methods are illustrated in Fig. 11.
These methods convert the instance segmentation problem into two sub-tasks: 3D object detection and instance mask prediction. Hou et al. [55] proposed a 3D fully-convolutional Semantic Instance Segmentation (3D-SIS) network to achieve semantic instance segmentation on RGB-D scans. This network learns from both color and geometry features. Similar to 3D object detection, a 3D Region Proposal Network (3D-RPN) and a 3D Region of Interest (3D-RoI) layer are used to predict bounding box locations, object class labels and instance masks. Following the analysis-by-synthesis strategy, Yi et al. [193] proposed a Generative Shape Proposal Network (GSPN) to generate high-objectness 3D proposals. These proposals are further refined by a Region-based PointNet (R-PointNet). The final label is obtained by predicting a per-point binary mask for each class label. Different from direct regression of 3D bounding boxes from point clouds, this method removes a large number of meaningless proposals by enforcing geometric understanding. By extending 2D panoptic segmentation to 3D mapping, Gaku et al. [121] proposed an online volumetric 3D mapping system to jointly achieve large-scale 3D reconstruction, semantic labeling, and instance segmentation. They first utilized 2D semantic and instance segmentation networks to obtain pixel-wise panoptic labels and then integrated these labels into the volumetric map. A fully-connected CRF is further used to achieve accurate segmentation. This semantic mapping system can achieve high-quality semantic mapping and discriminative object recognition. Yang et al. [185] proposed a single-stage, anchor-free and end-to-end trainable network called 3D-BoNet to achieve instance segmentation on point clouds. This method directly regresses rough 3D bounding boxes for all potential instances, and then utilizes a point-level binary classifier to obtain instance labels. In particular, the bounding box generation task is formulated as an optimal assignment problem. In addition, a multi-criteria loss function is proposed to regularize the generated bounding boxes. This method does not need any post-processing and is computationally efficient. Zhang et al. [202] proposed a network for the instance segmentation of large-scale outdoor LiDAR point clouds. This method learns a feature representation on the bird's-eye view of point clouds using self-attention blocks. The final instance labels are obtained based on the predicted horizontal centers and height limits.
Overall, proposal-based methods are intuitive and straightforward, and the instance segmentation results usually have good objectness. However, these methods require multi-stage training and the pruning of redundant proposals. Therefore, they are usually time-consuming and computationally expensive.
Proposal-free methods [166, 167, 124, 34, 95, 93] do not have an object detection module. Instead, they usually consider instance segmentation as a clustering step subsequent to semantic segmentation. In particular, most existing methods are based on the assumption that points belonging to the same instance should have very similar features. Therefore, these methods mainly focus on discriminative feature learning and point grouping.
In a pioneering work, Wang et al. [166] first introduced a Similarity Group Proposal Network (SGPN). This method first learns a feature and semantic map for each point, and then introduces a similarity matrix to represent the similarity between each pair of features. To learn more discriminative features, a double-hinge loss is used to mutually adjust the similarity matrix and the semantic segmentation results. Finally, a heuristic non-maximal suppression method is adopted to merge similar points into instances. Since the construction of the similarity matrix requires large memory consumption, the scalability of this method is limited. Similarly, Liu et al. [95] first leveraged submanifold sparse convolution [46] to predict the semantic scores of each voxel and the affinity between neighboring voxels. They then introduced a clustering algorithm to group points into instances based on the predicted affinity and the mesh topology. Further, Liang et al. [93] proposed a structure-aware loss for learning discriminative embeddings. This loss considers both the similarity of features and the geometric relations among points. An attention-based graph CNN is further used to adaptively refine the learned features by aggregating different information from neighbors.

Since the semantic category and instance label of a point are usually dependent on each other, several methods have been proposed to couple these two tasks. Wang et al. [167] integrated these two tasks by introducing an end-to-end and learnable Associatively Segmenting Instances and Semantics (ASIS) module. Experiments show that, through this ASIS module, semantic features and instance features can mutually support each other to achieve improved performance. Similarly, Pham et al. [124] first introduced a Multi-Task Point-wise Network (MT-PNet) to assign a label to each point, and regularized the embeddings in the feature space with a discriminative loss [29]. They then fused the predicted semantic labels and embeddings into a Multi-Value Conditional Random Field (MV-CRF) model for joint optimization. Finally, mean-field variational inference is used to produce semantic and instance labels. Hu et al. [58] first proposed a Dynamic Region Growing (DRG) method to dynamically separate a point cloud into a set of disjoint patches, and then used an unsupervised K-means++ algorithm to group all these patches. Multi-scale patch segmentation is then performed with the guidance of contextual information between patches. Finally, these labeled patches are merged at the object level to obtain the final semantic and instance labels.
To achieve instance segmentation on full 3D scenes, Cathrin et al. [34] presented a hybrid 2D-3D network to jointly learn globally consistent instance features from a BEV representation and local geometric features of point clouds. The learned features are then combined to achieve semantic and instance segmentation. Note that, rather than the heuristic GroupMerging algorithm [166], a more flexible Mean-shift [25] algorithm is used to group these points into instances. Alternatively, multi-task learning has also been introduced for instance segmentation. Jean et al. [75] learned both the unique feature embedding of each instance and the directional information towards the object's center. A feature embedding loss and a directional loss are proposed to adjust the learned feature embeddings in the latent feature space. Mean-shift clustering and non-maximum suppression are adopted to group voxels into instances. This method achieves state-of-the-art performance on the ScanNet [26] benchmark. Besides, the predicted directional information is particularly useful for determining the boundaries of instances. Zhang et al. [201] introduced probabilistic embeddings for the instance segmentation of point clouds. This method also incorporates uncertainty estimation and proposes a new loss function for the clustering step.
In summary, proposal-free methods do not require a computationally expensive region-proposal component. However, the objectness of the instance segments grouped by these methods is usually low, since these methods do not explicitly detect object boundaries.
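The grouping step shared by these proposal-free pipelines, merging points whose learned embeddings are close, can be sketched with a union-find over pairwise distances. This is a didactic stand-in (our naming, brute-force O(N^2)) for the merging heuristics of SGPN and for mean-shift-style clustering, not any paper's actual algorithm:

```python
import numpy as np

def group_by_embedding(embeddings, threshold):
    """Assign an instance label to each point: points whose embedding
    distance is below the threshold end up in the same instance,
    transitively, via a union-find. embeddings: (N, D) array."""
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(embeddings[i] - embeddings[j]) < threshold:
                parent[find(i)] = find(j)   # merge the two groups

    roots = [find(i) for i in range(n)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels
```

The transitive merging is exactly why the discriminative losses discussed above matter: if embeddings of two different objects drift within the threshold anywhere, the instances are fused, which is one source of the low objectness noted in the summary.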
The difficulty of part segmentation of 3D shapes is twofold. First, shape parts with the same semantic label have large geometric variation and ambiguity. Second, the method should be robust to noise and sampling.
VoxSegNet [171] was proposed to achieve fine-grained part segmentation on 3D voxelized data at a limited resolution. A Spatial Dense Extraction (SDE) module (which consists of stacked atrous residual blocks) is proposed to extract multi-scale discriminative features from sparse volumetric data. The learned features are further re-weighted and fused by progressively applying an Attention Feature Aggregation (AFA) module. Evangelos et al. [70] combined FCNs and surface-based CRFs to achieve end-to-end 3D part segmentation. They first generated images from multiple views to achieve optimal surface coverage and fed these images into a 2D network to produce confidence maps. These confidence maps are then aggregated by a surface-based CRF, which is responsible for a consistent labeling of the entire scene. Yi et al. [192] introduced a Synchronized Spectral CNN (SyncSpecCNN) to perform convolution on irregular and non-isomorphic shape graphs. A spectral parameterization of dilated convolutional kernels and a spectral transformer network are introduced to solve the problems of multi-scale analysis within parts and information sharing across shapes.
Wang et al. [164] first performed shape segmentation on 3D meshes by introducing Shape Fully Convolutional Networks (SFCN), taking three low-level geometric features as input. They then utilized voting-based multi-label graph cuts to further refine the segmentation results. Zhu et al. [214] proposed a weakly-supervised CoSegNet for 3D shape co-segmentation. This network takes a collection of unsegmented 3D point cloud shapes as input, and produces shape part labels by iteratively minimizing a group consistency loss. Similar to a CRF, a pre-trained part-refinement network is proposed to further refine and denoise the part proposals. Chen et al. [21] proposed a Branched Auto-Encoder network (BAE-NET) for unsupervised, one-shot and weakly-supervised 3D shape co-segmentation. This method formulates the shape co-segmentation task as a representation learning problem and aims at finding the simplest part representations by minimizing the shape reconstruction loss. Based on an encoder-decoder architecture, each branch of this network can learn a compact representation for a specific part shape. The features learned from each branch and the point coordinates are then fed to the decoder to produce a binary value (which indicates whether the point belongs to this part). This method has good generalization ability and can process large 3D shape collections (up to 5000+ shapes). However, it is sensitive to initial parameters and does not incorporate shape semantics into the network, which prevents it from obtaining a robust and stable estimation in each iteration.
Table IV shows the results achieved by existing methods on public benchmarks, including S3DIS [4], Semantic3D [52], ScanNet [107], and SemanticKITTI [6]. The following issues need to be further investigated:
Point-based networks are the most frequently investigated methods. However, since point representations naturally lack explicit neighbouring information, most existing point-based methods have to resort to expensive neighbor searching mechanisms (e.g., KNN [88] or ball query [131]). This inherently limits the efficiency of these methods, as the neighbor searching mechanism incurs both high computational cost and irregular memory access [105].
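The cost of the two standard neighbor-search mechanisms can be seen in a brute-force sketch (an illustration, not an optimized implementation; real pipelines use spatial data structures or GPU kernels):

```python
import numpy as np

def knn(points, query, k=16):
    """Indices of the k nearest neighbors of `query` in `points`."""
    d2 = np.sum((points - query) ** 2, axis=1)
    return np.argsort(d2)[:k]

def ball_query(points, query, radius=0.2, max_k=16):
    """Up to max_k indices of points within `radius` of `query`."""
    d2 = np.sum((points - query) ** 2, axis=1)
    idx = np.flatnonzero(d2 <= radius ** 2)
    return idx[:max_k]

pts = np.random.default_rng(2).uniform(size=(4096, 3))
q = pts[0]
nn_idx = knn(pts, q)
ball_idx = ball_query(pts, q)
```

Each query touches all N points, so running one query per point yields O(N²) distance computations with scattered memory access, which is exactly the bottleneck noted above.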
Learning from imbalanced data is still a challenging problem in point cloud segmentation. Although several approaches [206, 157, 78] have achieved remarkable overall performance, their performance on minority classes is still limited. For example, RandLA-Net [57] achieves an overall IoU of 76.0% on the reduced-8 subset of Semantic3D, but a very low IoU of 41.1% on the hardscape class.
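The gap between overall and per-class scores follows directly from how IoU is computed, and a common partial remedy is to weight the loss by inverse class frequency. A minimal sketch with illustrative toy labels (not Semantic3D data):

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Intersection-over-union for each class."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        ious.append(inter / union if union > 0 else float('nan'))
    return np.array(ious)

def inverse_frequency_weights(labels, num_classes):
    """Loss weights proportional to inverse class frequency."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    weights = 1.0 / np.maximum(counts, 1.0)
    return weights / weights.sum()

# Imbalanced toy scene: class 1 is a rare, hardscape-like class.
gt = np.array([0] * 95 + [1] * 5)
pred = np.zeros(100, dtype=int)        # a predictor that ignores the rare class
ious = per_class_iou(pred, gt, 2)      # high IoU for class 0, zero for class 1
w = inverse_frequency_weights(gt, 2)   # rare class receives a larger weight
```

Predicting only the majority class still scores well on average, which is why overall IoU can mask near-total failure on minority classes.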
The majority of existing approaches [129, 131, 88, 17, 206] work on small point clouds (e.g., 1m×1m blocks with 4096 points). In practice, the point clouds acquired by depth sensors are usually immense and large-scale. Therefore, it is desirable to further investigate the problem of efficient segmentation of large-scale point clouds.
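The small-block pipeline mentioned above can be sketched as follows (the block size and sample count follow the common 1m×1m / 4096-point convention; function names are illustrative):

```python
import numpy as np

def split_into_blocks(points, block_size=1.0, num_points=4096, seed=0):
    """Partition a large cloud into block_size x block_size columns in XY,
    then sample a fixed number of points per non-empty block."""
    rng = np.random.default_rng(seed)
    origin = points[:, :2].min(axis=0)
    cells = np.floor((points[:, :2] - origin) / block_size).astype(int)
    blocks = []
    for cell in np.unique(cells, axis=0):
        idx = np.flatnonzero((cells == cell).all(axis=1))
        # Sample with replacement if the block has fewer points than needed.
        chosen = rng.choice(idx, size=num_points, replace=len(idx) < num_points)
        blocks.append(points[chosen])
    return blocks

# Toy scene: 20000 points spread over a 4m x 4m area.
pts = np.random.default_rng(3).uniform([0, 0, 0], [4, 4, 2], size=(20000, 3))
blocks = split_into_blocks(pts)
```

This preprocessing lets fixed-size networks consume arbitrarily large scenes, but it discards cross-block context and adds sampling overhead, which is one motivation for methods that operate on large point clouds directly.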
This paper has presented a contemporary survey of the state-of-the-art methods for 3D understanding, including 3D shape classification, 3D object detection and tracking, and 3D scene and object segmentation. A comprehensive taxonomy and performance comparison of these methods have been presented. The merits and demerits of various methods are also covered, along with potential research directions.
This work was partially supported by the National Natural Science Foundation of China (No. 61972435, 61602499), the Natural Science Foundation of Guangdong Province (2019A1515011271), the Shenzhen Technology and Innovation Committee, the Australian Research Council (Grants DP150100294 and DP150104251), and a China Scholarship Council (CSC) scholarship.