We live in a three-dimensional world, but ever since the camera became widespread in the late nineteenth century, visual information about the 3D world has mostly been projected onto 2D images. 2D images, however, lose the depth information and the relative positions between objects in the real world, which makes them less suitable for applications that require depth and positioning information, such as robotics, autonomous driving, and virtual and augmented reality, among others. To capture the 3D world with depth information, the early convention was to use stereo vision, where two or more calibrated digital cameras are used to extract 3D information. The point cloud is a data structure often used to represent 3D geometry, making it the immediate representation both of the 3D information extracted by stereo vision and of the depth maps produced by RGB-D cameras. Recently, 3D point clouds have been booming as a result of the increasing availability of sensing devices such as LiDAR and, more recently, mobile phones with time-of-flight (ToF) depth cameras, which allow easy acquisition of the 3D world as point clouds.
A point cloud is simply a set of data points in space. The point cloud of a scene is the set of 3D points sampled on the surfaces of the objects in the scene. In its simplest form, a 3D point cloud is represented by the XYZ coordinates of the points; however, additional features such as surface normals and RGB values can also be used. The point cloud is a very convenient format for representing the 3D world, and it has a range of applications in areas such as robotics, autonomous vehicles, augmented and virtual reality, and other industrial purposes such as manufacturing and building rendering.
In the past, processing of point clouds for visual intelligence was based on handcrafted features spinimages1999; 3DfreeformB2004; shapeD2009; featH2008; FPFH2009; uniqueshape2010. A review of handcrafted feature techniques is given in comparison2014. Handcrafted features do not require large training data, and deep learning was seldom used because there was not enough point cloud data and deep learning was not yet popular. With the increasing availability of acquisition devices, however, point cloud data is now readily available, making deep learning feasible for its processing. Still, the application of deep learning to point clouds is not easy due to the nature of the data. In this paper, we review the challenges that point clouds pose for deep learning; the early approaches devised to overcome these challenges; and the recent state-of-the-art approaches that operate directly on point clouds, focusing more on the latter. This paper is intended to serve as a guide for new researchers in the field, as it presents the recent state-of-the-art approaches to deep learning on point clouds.
The rest of the paper is organized as follows: Section 2 discusses the challenges of point clouds that make the application of deep learning difficult. Section 3 reviews the methods that overcome these challenges by converting the point cloud into a structured grid. Section 4 contains an in-depth review of the various deep learning methods that process point clouds directly. In Section 5, we present 3D point cloud benchmark datasets. We discuss the application of the various approaches to 3D vision tasks in Section 6. We summarize and conclude the paper in Section 7.
2 Challenges of deep learning on point clouds
Applying deep learning to 3D point cloud data comes with many challenges. Some of these include occlusion, which is caused by cluttered scenes or blind spots; noise/outliers, which are unintended points; point misalignment; etc. However, the more pronounced challenges for the application of deep learning to point clouds can be categorized as follows:
Irregularity: Point cloud data is irregular, meaning the points are not evenly sampled across the different regions of an object/scene, so some regions have dense points while others have sparse points. This can be seen in Figure 1(a).
Unstructured: Point cloud data does not lie on a regular grid. Each point is scanned independently, and its distance to neighboring points is not fixed. In contrast, pixels in images are arranged on a 2-dimensional grid, and the spacing between two adjacent pixels is always fixed.
Unorderedness: The point cloud of a scene is the set of points (usually represented by XYZ coordinates) obtained around the objects in the scene, and they are usually stored as a list in a file. As a set, the order in which the points are stored does not change the scene represented. For illustration purposes, we show the unordered nature of point sets in Figure 1(c).
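To make the unorderedness concrete, the following toy snippet (our own illustration, not taken from any cited work) shows that reordering a point list leaves any symmetric aggregation, such as a coordinate-wise max, unchanged, while order-sensitive views of the data differ:

```python
import numpy as np

# A toy point cloud: five points with XYZ coordinates.
points = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.2, 0.1, 0.0],
    [0.9, 0.3, 0.5],
])

# Reversing the rows yields a different list but the same set of points,
# hence the same underlying scene.
reordered = points[::-1]

# A symmetric (order-invariant) aggregation gives identical output.
assert np.allclose(points.max(axis=0), reordered.max(axis=0))

# An order-sensitive view of the data, such as the flattened array, differs.
assert not np.array_equal(points.ravel(), reordered.ravel())
```

Any network consuming raw point lists must therefore be built from such symmetric operations, which is the motivation behind the methods of Section 4.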
These properties of point clouds are very challenging for deep learning, especially convolutional neural networks (CNNs). This is because convolutional neural networks are based on the convolution operation, which is performed on data that is ordered, regular, and on a structured grid. Early approaches overcome these challenges by converting the point cloud into a structured grid format (Section 3). Recently, however, researchers have been developing approaches that directly use the power of deep learning on raw point clouds (see Section 4), doing away with the need for conversion to a structured grid.
3 Structured grid based learning
Deep learning, specifically the convolutional neural network, is successful because of the convolution operation, which is used for feature learning and does away with handcrafted features. Figure 2 shows a typical convolution operation on a 2D grid. The convolution operation requires a structured grid. Point cloud data, on the other hand, is unstructured; this is a challenge for deep learning, and to overcome it many approaches convert the point cloud data into a structured form. These approaches can be broadly divided into two categories: voxel based and multiview based. In this section, we review some of the state-of-the-art methods in both categories, along with their advantages and drawbacks.
3.1 Voxel based
Convolution on 2D images uses a 2D filter of size f × f to convolve a 2D input represented as a matrix of size H × W, with f ≤ H and f ≤ W. Voxel based methods 3Dconvforlandingzone; voxnet; multiviewandvolumetric; normalnet; multiresolution3Dcnn use a similar approach by converting the point cloud into a 3D voxel structure of size V × V × V and convolving it with 3D kernels of size d × d × d, with d ≤ V. Basically, two important operations take place in these methods: the offline (preprocessing) stage and the online (learning) stage. The offline stage converts the point cloud into fixed-size voxels, as shown in Figure 3. Binary voxels modelnet are often used to represent the voxels. In normalnet, a normal vector is added to each voxel to improve discrimination capability.
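A minimal sketch of this offline voxelization (our own illustration; the function name and the normalization scheme are chosen for clarity, not taken from any particular paper) bins points into a binary occupancy grid:

```python
import numpy as np

def voxelize(points, resolution=32):
    """Convert an (N, 3) point cloud into a binary occupancy grid.

    Points are normalized into a cube of side `resolution`, and a voxel is
    marked occupied if at least one point falls inside it.
    """
    mins = points.min(axis=0)
    extent = np.ptp(points, axis=0).max() + 1e-9  # uniform scale preserves aspect ratio
    idx = ((points - mins) / extent * (resolution - 1)).astype(int)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# 1,000 points sampled on a sphere surface: the occupancy grid is very
# sparse, which is the source of the memory/computation drawback noted
# later in this section.
rng = np.random.default_rng(42)
pts = rng.normal(size=(1000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
grid = voxelize(pts)
assert grid.shape == (32, 32, 32)
assert grid.mean() < 0.05  # only a small fraction of voxels is occupied
```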
The online operation is the learning stage. In this stage, a deep convolutional neural network is designed, usually using a number of 3D convolutional, pooling, and fully connected layers.
modelnet represented 3D shapes as a probability distribution of binary variables on a 3D voxel grid and was the first work to use 3D deep convolutional neural networks. The input to the network (a point cloud, CAD model, or RGB-D image) is converted into a 3D binary voxel grid and processed using a convolutional deep belief network deepbeliefnet. 3Dconvforlandingzone uses a 3D CNN for landing zone detection for an unmanned rotorcraft. LiDAR on the rotorcraft is used to obtain a point cloud of the landing site, which is then voxelized into 3D volumes, and a 3D CNN binary classifier is applied to classify the landing site as safe or otherwise. In voxnet, a 3D convolutional neural network is proposed for object recognition; as in modelnet, the input to the network is converted into a 3D binary occupancy grid before 3D convolution operations are applied to generate a feature vector, which is passed through fully connected layers to obtain class scores. Two voxel based models were proposed in multiviewandvolumetric: the first addresses overfitting by using auxiliary training tasks to predict objects from partial subvolumes, and the second mimics multiview CNNs by convolving the 3D shapes with anisotropic probing kernels.
Voxel based methods, although they have shown good performance, suffer from high memory consumption due to the sparsity of the voxels (Figure 3), which results in wasted computation when convolving over unoccupied regions. The memory consumption also limits the voxel resolution, usually to between 32³ and 64³. These drawbacks come in addition to the artifacts introduced by the voxelization operation.
To overcome the challenges of voxelization, OctNet; octree proposed adaptive octree-based representations. These representations are more complex than regular 3D voxels; however, they are still limited to about 256³ voxels.
3.2 Multiview based
These methods multiviewCNN; SLCAE; GIFT; multiviewandvolumetric; 3Dshapeseg; classificationSphericalProjections; multiviewrecog take advantage of already mature 2D CNNs for 3D. Because images are representations of the 3D world squashed onto a 2D grid by a camera, methods in this category follow the same technique by converting point cloud data into a collection of 2D images and applying existing 2D CNN techniques to them (see Figure 4). Compared to their volumetric counterparts, multiview based methods have better performance, as the multiview images contain richer information than 3D voxels even though the latter contain depth information.
multiviewCNN is the first work in this direction, with the aim of bypassing the need for 3D descriptors for recognition, and achieved state-of-the-art accuracy. SLCAE proposed a stacked local convolutional autoencoder (SLCAE) for 3D object retrieval. multiviewandvolumetric introduced multi-resolution filtering, which captures information at multiple scales, and in addition used data augmentation to improve on multiviewCNN.
Multiview based networks perform better than voxel based methods for two reasons: 1) they use already well-researched 2D techniques, and 2) they contain richer information, as they do not suffer from the quantization artifacts of voxelization.
3.3 Higher dimensional lattices
There are other deep learning methods for point cloud processing that convert the point cloud into a higher-dimensional regular lattice. SplatNet splatnet processes point clouds directly; however, its primary feature learning operation occurs in the bilateral convolutional layer (BCL). The BCL converts the features of unordered points into a six-dimensional (6D) permutohedral lattice and convolves them with a kernel on the same lattice. SFCNN sfcnn uses a fractalized regular icosahedral lattice to map points onto a discretized sphere and defines a multi-scale convolution operation on the regular spherical lattice.
4 Deep learning directly on raw point cloud
Deep learning on raw point clouds has been receiving a lot of attention since PointNet pointnet was released in 2017, and many state-of-the-art methods have been developed since then. These techniques process point clouds directly despite the challenges of Section 2. In this section, we review the state-of-the-art techniques that work in this direction. We begin with PointNet, which is the bedrock for most of these techniques; other techniques improve on PointNet by modeling local region structure.
4.1 PointNet
Convolutional neural networks are largely successful because of the convolution operation, which enables learning on local regions in a hierarchical manner as the network gets deeper. Convolution, however, requires a structured grid, which point cloud data lacks. PointNet pointnet is the first method to apply deep learning to unstructured point clouds, and it is the basis on which most other techniques build. In this subsection we give a review of PointNet.
The architecture of PointNet is shown in Figure 5. The input to PointNet is a raw point cloud of n points, each of dimension d, where d usually corresponds to the XYZ values of each point; however, additional features can be used. Because points are unordered, PointNet is built from symmetric functions, i.e., functions whose output is the same irrespective of the input order. PointNet is built on two basic symmetric functions: multilayer perceptrons (MLPs) with learnable parameters, and a maxpooling function. The MLPs are feature transformations that map the points into a higher-dimensional feature space, and their parameters are shared by all the points in each layer. To aggregate a global feature, the symmetric maxpooling function is employed to produce one global 1024-dimensional feature vector. This feature vector is the feature descriptor of the input, which can be used for recognition and segmentation tasks.
PointNet achieves state-of-the-art performance on several benchmark datasets. The design of PointNet, however, does not consider local dependency among points, and thus it does not capture local structure. The global maxpooling selects the feature vector in a "winner-takes-all" (WTA) manner, making it very susceptible to targeted adversarial attacks, as demonstrated in pointattack. After PointNet, many approaches were proposed to capture local structure.
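The essence of this design can be condensed into a short sketch (random untrained weights, with layer widths chosen loosely in the spirit of PointNet rather than matching the published architecture): a shared per-point MLP followed by a global max pool, whose output is unchanged under any reordering of the input points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared MLP weights: every point is transformed by the same matrices.
# Widths (3 -> 64 -> 1024) are illustrative, not the exact published model.
W1, W2 = rng.normal(size=(3, 64)), rng.normal(size=(64, 1024))

def pointnet_global_feature(points):
    """points: (N, 3) array -> (1024,) global feature vector."""
    h = np.maximum(points @ W1, 0)   # shared MLP layer 1 + ReLU
    h = np.maximum(h @ W2, 0)        # shared MLP layer 2 + ReLU
    return h.max(axis=0)             # symmetric max-pool over all points

pts = rng.normal(size=(128, 3))
feat = pointnet_global_feature(pts)
feat_perm = pointnet_global_feature(pts[::-1])  # reversed point order
assert np.allclose(feat, feat_perm)             # permutation invariant
```

Note that the max pool keeps only the strongest activation per channel, which is exactly the winner-takes-all behavior exploited by the adversarial attacks mentioned above.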
4.2 Approaches with local structure computation
Many state-of-the-art approaches that capture local structure were developed after PointNet. These techniques capture local structure hierarchically, in a similar fashion to grid convolution, with each hierarchy encoding a richer representation.
Basically, due to the inherent unorderedness of point clouds, local structure modeling rests on three basic operations: sampling; grouping; and a mapping function, usually approximated by a multilayer perceptron (MLP), which maps the features of the nearest neighbor points into a feature representation that encodes higher-level information (see Figure 6). We briefly explain these operations before reviewing the various approaches.
Sampling Sampling is employed to reduce the resolution of points across layers, analogous to how the convolution operation reduces the resolution of feature maps via convolutional and pooling layers. Given a point cloud of N points, sampling reduces it to M points, where M < N. The subsampled points, also referred to as representative points or centroids, are used to represent the local regions from which they were sampled. Two approaches are popular for subsampling: 1) random point sampling, where each of the N points is equally likely to be sampled, and 2) farthest point sampling (FPS), where the points are sampled such that each sampled point is the most distant from the already-sampled points. Other sampling methods include uniform sampling and Gumbel Subset Sampling selfatGSS.
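Farthest point sampling admits a simple greedy implementation (an O(N·M) sketch of our own; production code typically uses optimized GPU kernels): start from an arbitrary point, then repeatedly pick the point farthest from the already-selected set.

```python
import numpy as np

def farthest_point_sampling(points, m, start=0):
    """Greedily select m centroid indices from an (N, 3) point cloud."""
    selected = [start]
    # Distance from every point to its nearest selected centroid so far.
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(m - 1):
        nxt = int(dist.argmax())  # farthest from the current selected set
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(selected)

rng = np.random.default_rng(1)
pts = rng.uniform(size=(1000, 3))
centroids = farthest_point_sampling(pts, 16)
assert len(set(centroids.tolist())) == 16  # 16 distinct representatives
```

Compared with random sampling, this spreads the centroids over the whole shape, which is why FPS is the common choice in hierarchical architectures.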
Grouping With the representative points sampled, the k-nearest neighbor algorithm is used to select the points nearest to each representative point and group them into a local patch (Figure 7). The points in a local patch are used to compute the local feature representation of the neighborhood, analogous to the receptive field in grid convolution, which consists of the pixels on the feature map under a kernel. Either kNN is used directly, where the k nearest points to a centroid are selected, or a ball query is used, where points are selected only when they are within a radius distance of the centroid.
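Both grouping strategies can be sketched in a few lines (brute-force illustrations of our own; real pipelines use spatial indexing structures such as k-d trees for speed):

```python
import numpy as np

def knn_group(points, centroid, k):
    """Indices of the k points nearest to a centroid."""
    d = np.linalg.norm(points - centroid, axis=1)
    return np.argsort(d)[:k]

def ball_query(points, centroid, radius, k):
    """Up to k point indices within `radius` of a centroid."""
    d = np.linalg.norm(points - centroid, axis=1)
    return np.flatnonzero(d < radius)[:k]

rng = np.random.default_rng(2)
pts = rng.uniform(size=(500, 3))
c = pts[0]
nbrs = knn_group(pts, c, k=16)
ball = ball_query(pts, c, radius=0.2, k=16)
assert len(nbrs) == 16
assert all(np.linalg.norm(pts[i] - c) < 0.2 for i in ball)
```

Note the trade-off: kNN guarantees a fixed patch size but a variable spatial extent, while ball query guarantees a fixed extent but a variable (capped) number of points.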
Non-linear mapping function
Once the nearest points to each representative point are obtained, the next step is to map them into a feature vector that represents the local structure. In grid convolution, the receptive field is mapped to a feature neuron using a simple matrix multiplication and summation with the convolutional kernels. This is not easy for point clouds, because the points are not structured; therefore, most approaches approximate the function using PointNet pointnet based methods, composed of symmetric functions consisting of a multilayer perceptron and a maxpooling function, as shown in equation 1.
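This mapping can be sketched as follows (random untrained weights; the relative-coordinate step is a common design choice in the literature rather than part of every method): a shared MLP applied to each neighbor, followed by a symmetric max pool over the patch.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(3, 64))  # shared MLP weights (illustrative, untrained)

def local_feature(neighbors, centroid):
    """Map a (k, 3) neighborhood to one 64-d local feature vector.

    Neighbor coordinates are expressed relative to the centroid, so the
    feature describes local geometry rather than absolute position.
    """
    rel = neighbors - centroid
    h = np.maximum(rel @ W, 0)  # shared MLP + ReLU on each neighbor
    return h.max(axis=0)        # symmetric aggregation over the patch

neighborhood = rng.normal(size=(16, 3))
f = local_feature(neighborhood, neighborhood.mean(axis=0))
assert f.shape == (64,)
```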
4.2.1 Approaches that do not explore local correlation
Several approaches follow a PointNet-like MLP in which correlations between points within a local region are not considered; instead, individual point features are learned via a shared MLP, and the local region feature is aggregated using a maxpooling function in a winner-takes-all manner.
PointNet++ pointnetpp extended PointNet to local region computation by applying PointNet hierarchically in local regions. Given a point set, the farthest point sampling algorithm is used to select centroids, and a ball query is used to select the nearest neighbor points for each centroid. PointNet is then applied to the local regions to generate a feature vector for each region. This process is repeated hierarchically, reducing the point resolution as the network goes deeper. In the last layer of the hierarchy, all point features are passed through a PointNet to produce one global feature vector. PointNet++ achieves state-of-the-art accuracy on many public datasets, including ModelNet40 modelnet and ScanNet dai2017scannet.
VoxelNet voxelnet proposed Voxel Feature Encoding (VFE). Given a point cloud, it is first cast into 3D voxels of a fixed resolution, and points are grouped according to the voxel they fall into. Because of the irregularity of point clouds, T points are sampled in each voxel in order to have a uniform number of points per voxel. In a VFE layer, the centroid of each voxel is computed as the local mean of the T points within the voxel; the T points are then processed using a fully connected network (FCN) to aggregate information from all the points, similar to PointNet. The VFE layers are stacked, and a maxpooling layer is applied to get a global feature vector for each voxel, making the feature of the input point cloud a sparse 4D tensor. To fit VoxelNet into Figure 6, the centroids of the voxels are the centroids/representative points, the T points in each voxel are the nearest neighbor points, and the FCN is the non-linear mapping function.
The self-organizing map (SOM), originally proposed in som, is used to create a self-organizing network for point clouds in SO-Net sonet. While random point sampling, farthest point sampling, or uniform sampling is used to select centroids in most of the methods discussed, in SO-Net a SOM is constructed with a fixed number of nodes dispersed uniformly in a unit ball. The SOM nodes are permutation invariant and play the role of local region centroids. For each SOM node, a k-NN search is used to find its nearest neighbor points, which are passed through a series of fully connected layers to extract point features that are maxpooled to generate M node features. To obtain the global feature of the input point cloud, the M node features are aggregated using maxpooling.
Pointwise convolution is proposed in pointConv. In this technique, there are no subsampled/representative points, because the convolution operation is performed on all the input points. For each point, nearest neighbor points are sampled based on the size or radius of a kernel centered on that point; the radius can be adjusted for different numbers of neighbor points in any layer. Each pointwise convolution is applied independently to the input and transforms the input points from 3 dimensions to 9 dimensions. The final feature is obtained by concatenating the outputs of all the pointwise convolutions for each point, and it has a resolution equivalent to the input. This final feature is then used for segmentation via convolutional layers, or for classification via fully connected layers.
4.2.2 Approaches that explore local correlation
Several approaches explore the local correlations between points in a local region to improve discriminative capability. This is intuitive because points do not exist in isolation, rather, multiple points together are needed to form a meaningful shape.
PointCNN pointcnn improved on PointNet++ by proposing an X-transformation on the k-nearest neighbor points of each centroid before applying a PointNet-like MLP. The centroids/representative points are randomly sampled, and k-NN is used to select the neighborhood points, which are passed through an X-transformation block before the non-linear mapping function is applied. The purpose of the X-transform is to permute the input into a more canonical form, which in essence also takes into consideration the relationships between points within a local region. In pointweb pointweb, a "local web of points" is designed by densely connecting points within a local region and learning the impact of each point on the other points using an Adaptive Feature Adjustment (AFA) module. In pointconvcvpr, the authors proposed a "PointConv" operation that similarly explores the intrinsic structure of points within a local region by computing the inverse density scale of each point using kernel density estimation (KDE). The kernel density estimate is computed offline for each point and fed into an MLP to produce the inverse density scale.
In relationshape, the centroids are selected using a uniform sampling strategy, and the nearest neighbor points of each centroid are selected using a spherical neighborhood. The non-linear function is also approximated using a multilayer perceptron (MLP), but with additional discriminative capability obtained by considering the relation between each centroid and its nearest neighbor points, based on the spatial layout of the points. Similarly, GeoCNN geocnn explores the geometric structure within a local region by weighting the features of neighboring points based on their distance to the respective centroid; however, the authors perform pointwise convolution without reducing point resolution across layers.
Acnn argues that the overlapping receptive fields caused by the multi-scale architecture of most PointNet based approaches can result in computational redundancy, because the same neighboring points can be included in differently scaled regions. To address this redundancy, the authors proposed annular convolution, a ring based approach that avoids overlaps between hierarchies of receptive fields and also captures the relationships between points within the receptive field.
A PointNet-like MLP is the popular mapping function for approximating the points in a local patch by a feature vector; however, spiderCNN argues that MLPs do not account for the geometric prior of point clouds and also require sufficiently many parameters. To address these issues, the authors proposed a family of filters composed of two functions: a step function that encodes local geodesic information, followed by a third-order Taylor expansion. The approach learns hierarchical representations and achieves state-of-the-art performance in classification and segmentation tasks.
Point Attention Transformers (PAT) are proposed in selfatGSS. The authors proposed a new subsampling method termed Gumbel Subset Sampling (GSS), which, unlike farthest point sampling (FPS), is permutation invariant and robust to outliers. The authors used absolute and relative position embeddings, where each point is represented by the set of its absolute position and its relative positions to other points in a local patch; PointNet is then applied to the set. To further capture relationships between points, a modified Multi-Head Attention (MHA) mechanism is used. New sampling and grouping techniques with learnable parameters were proposed in dpam, in a module termed the dynamic points agglomeration module (DPAM), which learns an agglomeration matrix that, when multiplied with the incoming point features, reduces the resolution (similar to sampling) and produces an aggregated feature (similar to grouping and pooling).
|Method||Sampling||Grouping||Mapping function|
|PointCNN pointcnn||Uniform/random sampling||k-NN||MLP|
|Pointwise Conv pointConv||-||Radius search||MLP|
|Kd-Network DeepKdNet||-||Tree based nodes||-|
|LocalSpec localspecGCNN||Farthest point sampling||k-NN||Spectral convolution + cluster pooling|
|SpiderCNN spiderCNN||Uniform sampling||k-NN||Taylor expansion|
|R-S CNN relationshape||Uniform sampling||Radius-NN||MLP|
|PointConv pointconvcvpr||Uniform sampling||Radius-NN||MLP|
|PAT selfatGSS||Gumbel subset sampling||k-NN||MLP|
|A-CNN Acnn||Uniform subsampling||k-NN||MLP + density functions|
|ShellNet shellnet||Random sampling||Spherical shells||1D convolution|
4.2.3 Graph based
Graph based approaches were proposed in DeepKdNet; DGCNN; localspecGCNN; Point2Node. These approaches represent the point cloud with a graph structure by treating each point as a node. The graph structure is good for modeling correlations between points, which are explicitly represented by the graph edges. DeepKdNet uses a kd-tree, which is a special kind of graph. The kd-tree is built in a top-down manner on the point cloud to create a feed-forward Kd-network with learnable parameters in each layer, while the computation performed by the Kd-network proceeds bottom-up. The leaves represent the input points; the two neighboring (left and right) child nodes are used to compute their parent node using shared parameters (a weight matrix and a bias). The Kd-network captures hierarchical representations along the depth of the kd-tree; however, because of the tree design, nodes at the same depth level do not capture overlapping receptive fields.
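A toy version of this bottom-up computation over a balanced binary tree can be sketched as follows (simplified: one shared weight matrix per level, whereas the actual Kd-network selects weights according to the splitting dimension of each node):

```python
import numpy as np

rng = np.random.default_rng(4)
depth, dim = 3, 8  # 2**3 = 8 leaf points, 8-dimensional features
# One shared (weight, bias) pair per tree level.
params = [(rng.normal(size=(2 * dim, dim)), rng.normal(size=dim))
          for _ in range(depth)]

# Leaf features: the input points lifted to `dim` dimensions.
leaves = rng.normal(size=(2 ** depth, dim))

feats = leaves
for W, b in params:
    # Each parent combines its left and right children with shared weights.
    pairs = feats.reshape(-1, 2 * dim)
    feats = np.maximum(pairs @ W + b, 0)

assert feats.shape == (1, dim)  # a single root feature vector remains
```

The root feature plays the same role as PointNet's global feature, but here the aggregation follows the tree topology rather than a single max pool.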
DGCNN; localspecGCNN; Point2Node are based on a typical graph network whose vertices represent the points and whose edges are represented as a matrix. In DGCNN, edge convolution is proposed. The graph is represented as a k-nearest neighbor graph over the inputs. In each edge convolution layer, the features of each point/vertex are computed by applying a non-linear function, a multilayer perceptron (MLP), to its nearest neighbor vertices as captured by the edge matrix. After the last EdgeConv layer, global maxpooling is employed to obtain a global feature vector, similar to pointnet. One distinct difference between DGCNN and a normal graph network is that the edges are updated after each EdgeConv layer based on the features computed in the previous layer, hence the name Dynamic Graph CNN (DGCNN). While there is no resolution reduction as the network goes deeper in DGCNN, which leads to increased computation cost, localspecGCNN defined a spectral graph convolution in which the resolution of the points is reduced as the network gets deeper. In each layer, k-nearest neighbor points are sampled, but instead of applying an MLP-like operation on the k local point sets as in pointnetpp, a graph is defined on the sets: the vertices of the graph are the points, and the edges are weighted based on the pairwise distance between the XYZ spatial coordinates of the points. The graph Fourier transform of the points is then computed and filtered using spectral filtering. Because the resolution of the points is unchanged after filtering, a recursive cluster pooling technique is proposed to aggregate the information in each graph into one vertex.
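A single EdgeConv-style layer can be sketched as follows (simplified: the edge MLP is one random linear layer plus ReLU, and the k-NN graph is built by brute force):

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.normal(size=(6, 64))  # MLP on concat(x_i, x_j - x_i); illustrative

def edge_conv(points, k=8):
    """One simplified EdgeConv layer: (N, 3) points -> (N, 64) features."""
    n = len(points)
    # Pairwise distances to build the k-NN graph (excluding the point itself).
    d = np.linalg.norm(points[:, None] - points[None], axis=-1)
    nbrs = np.argsort(d, axis=1)[:, 1:k + 1]
    out = np.empty((n, 64))
    for i in range(n):
        # Edge features: the center point concatenated with offsets to
        # each of its k neighbors, giving a (k, 6) edge-feature matrix.
        edges = np.concatenate(
            [np.repeat(points[i][None], k, axis=0),
             points[nbrs[i]] - points[i]], axis=1)
        out[i] = np.maximum(edges @ W, 0).max(axis=0)  # MLP + max over edges
    return out

pts = rng.normal(size=(64, 3))
feats = edge_conv(pts)
assert feats.shape == (64, 64)
```

In the dynamic variant, the k-NN graph would be rebuilt on `feats` (rather than on the XYZ coordinates) before the next layer.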
In Point2Node, the authors proposed a graph network that fully explores not only local correlation but also non-local correlation. Correlation is explored in three ways: self correlation, which explores the channel-wise correlation of a node's features; local correlation, which explores local dependencies among nodes in a local region; and non-local correlation, which captures a better global feature by considering long-range local features.
Table 1 summarizes the approaches, showing their sampling, grouping, and mapping function methods.
5 Benchmark Datasets
A considerable number of point cloud datasets have been published in recent years. Most of the existing datasets are provided by universities and industry, and they enable fair comparison when testing diverse approaches. These public benchmark datasets consist of virtual or real scenes, and focus particularly on point cloud classification, segmentation, registration, and object detection. They are notably useful in deep learning, since they provide huge amounts of ground truth labels for training networks. The point clouds are obtained by different platforms/methods, such as Structure from Motion (SfM), Red Green Blue-Depth (RGB-D) cameras, and Light Detection And Ranging (LiDAR) systems. The availability of benchmark datasets usually decreases as size and complexity increase. In this section, we introduce some popular datasets for 3D research.
5.1 3D Model Datasets
The ModelNet dataset was developed by the Princeton Vision & Robotics Labs. ModelNet40 has 40 man-made object categories (such as airplane, bookshelf, and chair) for shape classification and recognition. It consists of 12,311 CAD models, split into 9,843 training and 2,468 testing shapes. The ModelNet10 dataset is a subset of ModelNet40 that contains only 10 categories; it is divided into 3,991 training and 908 testing shapes.
ShapeNet is a large-scale dataset developed by Stanford University et al. It provides semantic category labels per model, as well as rigid alignments, parts and bilateral symmetry planes, physical sizes, keywords, and other planned annotations. ShapeNet had indexed almost 3,000,000 models when the dataset was published, with 220,000 models classified into 3,135 categories. ShapeNetCore is a subset of ShapeNet that consists of nearly 51,300 unique 3D models; it provides 55 common object categories and annotations. ShapeNetSem is also a subset of ShapeNet, containing 12,000 models. It is smaller but covers a more extensive set of 270 categories.
yi2016scalable created detailed part labels for 31,963 models from the ShapeNetCore dataset, providing 16 shape categories for part segmentation. dai2017complete provided 1,200 virtual partial models from the ShapeNet dataset. photoshape2018 proposed an approach for automatically generating photorealistic materials for 3D shapes; it is built on the ShapeNetCore dataset. Mo_2019_CVPR is a large-scale dataset with fine-grained and hierarchical part annotations. It consists of 24 object categories and 26,671 3D models, providing 573,585 part instance labels. xiang2016objectnet3d contributed a large-scale dataset for 3D object recognition with 100 categories, consisting of 90,127 images with 201,888 objects (from ImageNet deng2009imagenet) and 44,147 3D shapes (from ShapeNet).
Shape2Motion was developed by Beihang University and the National University of Defense Technology. It created a new benchmark dataset for 3D shape mobility analysis. The benchmark consists of 45 shape categories with 2,440 models, where the shapes are obtained from ShapeNet and 3D Warehouse 3DWarehouse. The proposed approach takes a single 3D shape as input, then jointly predicts motion part segmentation results and the corresponding motion attributes.
ScanObjectNN was developed by the Hong Kong University of Science and Technology et al. It is the first real-world dataset for point cloud classification. About 15,000 objects are selected from indoor datasets (SceneNN scenenn-3dv16 and ScanNet dai2017scannet), and the objects are split into 15 categories with 2,902 unique object instances.
5.2 3D Indoor Datasets
The New York University Depth Dataset v2 (NYUDv2) was developed by New York University et al. The dataset provides 1,449 RGB-D images (obtained with a Kinect v1) captured from 464 different indoor scenes. All of the images come with segmentation labels. This dataset mainly serves for understanding how 3D cues can lead to better segmentation of indoor objects.
The SUN3D dataset was developed by Princeton University. It is an RGB-D video dataset in which the videos are captured from 254 different spaces in 41 buildings. SUN3D provides 415 sequences with camera poses and object labels. The point clouds are generated by structure from motion (SfM).
Stanford 3D Large-Scale Indoor Spaces (S3DIS) was developed by Stanford University et al. S3DIS was collected from 3 different buildings with 271 rooms covering over 6,000 m². It contains over 215 million points, each annotated with instance-level semantic segmentation labels (13 categories).
SceneNN was developed by the Singapore University of Technology and Design et al. It is an RGB-D scene dataset (captured with a Kinect v2) collected from 101 indoor scenes.
It provides 40 semantic classes for the indoor scenes, with labels identical to those of the NYUDv2 dataset.
ScanNet is a large-scale indoor dataset developed by Stanford University et al. It contains 1,513 scanned scenes, including nearly 2.5M RGB-D images (obtained with an Occipital Structure Sensor) from 707 different indoor environments. The dataset provides ground truth labels for 3D object classification with 17 categories and semantic segmentation with 20 categories.
For object classification, ScanNet divides the instances into 9,677 for training and 2,606 for testing; for semantic segmentation, it splits the scans into 1,201 training scenes and 312 testing scenes.
Matterport3D is the largest indoor dataset, developed by Princeton University et al. It covers 219,399 m² across 2,056 rooms, with 46,561 m² of floor space. It consists of 10,800 panoramic views built from 194,400 RGB-D images of 90 large-scale buildings. The labels include surface reconstructions, camera poses, and semantic segmentations. The dataset investigates 5 scene-understanding tasks: keypoint matching, view overlap prediction, surface normal estimation, region-type classification, and semantic segmentation.
The 3DMatch benchmark was developed by Princeton University et al. It is a large collection of existing datasets such as Analysis-by-Synthesis valentin2016learning, 7-Scenes shotton2013scene, SUN3D xiao2013sun3d, RGB-D Scenes v.2 de2013unsupervised, and Halber et al. Halber2016StructuredGR. The benchmark consists of 62 scenes, with 54 for training and 8 for testing. It leverages correspondence labels from RGB-D scene reconstruction datasets to provide ground truth for point cloud registration.
Multisensor Indoor Mapping and Positioning Dataset wang2018semantic
This indoor dataset (rooms, corridors, and indoor parking lots) was developed by Xiamen University et al. The data were acquired by multiple sensors, including a laser scanner, cameras, WiFi, Bluetooth, and an IMU. The dataset provides dense laser-scanning point clouds for indoor mapping and positioning; it also provides colored laser scans based on multi-sensor calibration and a SLAM mapping process.
5.3 3D Outdoor Datasets
Kitti Geiger2012CVPR Geiger2013IJRR
The KITTI dataset, developed by the Karlsruhe Institute of Technology et al., is one of the best known in the field of autonomous driving. It supports research on stereo imaging, optical flow estimation, 3D detection, 3D tracking, visual odometry, and more. The data acquisition platform is equipped with two color cameras, two grayscale cameras, a Velodyne HDL-64E 3D laser scanner, and a high-precision GPS/IMU system. KITTI provides raw data in five categories: Road, City, Residential, Campus, and Person. The depth completion and prediction benchmark contains more than 93 thousand depth maps. The 3D object detection benchmark contains 7,481 training point clouds and 7,518 testing point clouds. The visual odometry benchmark comprises 22 sequences, with LiDAR data from 11 sequences (00-10) for training and 11 sequences (11-21) for testing. A semantic labeling behley2019iccv of the KITTI odometry dataset was also published recently: SemanticKITTI contains 28 classes covering ground, structure, vehicle, nature, human, object, and others.
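For newcomers working with KITTI point clouds: each Velodyne scan is stored as a flat binary file of float32 values in quadruples (x, y, z, reflectance). A minimal numpy sketch for reading one; the demo below writes a synthetic two-point "scan" rather than assuming a real KITTI file is present:

```python
import os
import tempfile
import numpy as np

def load_kitti_scan(path):
    """Return an (N, 4) array of x, y, z, reflectance for one Velodyne scan."""
    return np.fromfile(path, dtype=np.float32).reshape(-1, 4)

# Demo with a synthetic 2-point "scan" (not a real KITTI file).
demo = np.array([[1.0, 2.0, 3.0, 0.5],
                 [4.0, 5.0, 6.0, 0.1]], dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "000000.bin")
demo.tofile(path)
scan = load_kitti_scan(path)
assert scan.shape == (2, 4)
```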
ASL Dataset pomerleau2012challenging
This group of datasets was developed by ETH Zurich and collected between August 2011 and January 2012. It provides 8 point cloud sequences acquired with a Hokuyo UTM-30LX. Each sequence has around 35 scans, and the ground-truth poses are provided by a GPS/INS system. The dataset covers both structured and unstructured environments.
iQmulus, a large-scale urban scene dataset, was developed by Mines ParisTech et al. in January 2013. The entire 3D point cloud is classified and segmented into 50 classes. The data were collected by the Stereopolis II MLS, a system developed by the French National Mapping Agency (IGN), using a Riegl LMS-Q120i sensor to acquire 300 million points.
Oxford Robotcar RobotCarDatasetIJRR
This dataset was developed by the University of Oxford. It consists of around 100 traversals (over 1,000 km in total) of a route through central Oxford, collected between May 2014 and December 2015. This long-term dataset captures many challenging environmental changes, including season, weather, and traffic. It provides images, LiDAR point clouds, GPS, and INS ground truth for autonomous vehicles. The LiDAR data were collected by two SICK LMS-151 2D LiDAR scanners and one SICK LD-MRS 3D LiDAR scanner.
NCLT was developed by the University of Michigan. It contains 27 trajectories through the University of Michigan's North Campus collected between January 2012 and April 2013. The dataset provides images, LiDAR, GPS, and INS ground truth for long-term autonomy research. The LiDAR point clouds were collected by a Velodyne HDL-32E scanner.
Semantic3D, a high-quality and high-density dataset, was developed by ETH Zurich. It contains more than four billion points acquired by static terrestrial laser scanners. Eight semantic classes are provided: man-made terrain, natural terrain, high vegetation, low vegetation, buildings, hardscape, scanning artefacts, and cars. The dataset is split into 15 training scenes and 15 testing scenes.
DBNet, a real-world LiDAR-video dataset, was developed by Xiamen University et al. Unlike previous outdoor datasets, it targets learning driving policy: it provides LiDAR point clouds, video recordings, GPS, and driver behaviors for driving-behavior study. It contains 1,000 km of driving data captured with a Velodyne laser scanner.
The Nuage de Points et Modélisation 3D (NPM3D) dataset was developed by PSL Research University. It is a benchmark for point cloud classification and segmentation in which all points are labeled into 50 different classes. It contains 1,431M points collected in Paris and Lille. The data were acquired by a mobile laser system comprising a Velodyne HDL-32E LiDAR and GPS/INS systems.
Apollo song2019apollocar3d lu2019l3
Apollo was developed by Baidu Research et al. It is a large-scale autonomous driving dataset providing labeled data for 3D car instance understanding, LiDAR point cloud object detection and tracking, and LiDAR-based localization. For the 3D car instance understanding task there are 5,277 images with more than 60K car instances, each paired with an industry-grade CAD model. The 3D object detection and tracking benchmark contains 53 minutes of sequences for training and 50 minutes for testing, acquired at 10 fps and labeled at 2 fps. The Apollo-SouthBay dataset provides LiDAR frames for localization; it was collected in the southern San Francisco Bay Area with a high-end autonomous driving sensor suite (Velodyne HDL-64E, NovAtel ProPak6, and IMU-ISA-100C) mounted on a standard Lincoln MKZ sedan.
The nuTonomy scenes (nuScenes) dataset was developed by nuTonomy (an APTIV company) and proposes a novel metric for 3D object detection that aggregates multiple aspects: classification, velocity, size, localization, orientation, and attribute estimation of the object. The dataset was acquired with a full autonomous vehicle sensor suite (6 cameras, 5 radars, and 1 LiDAR) offering a 360-degree field of view. It contains 1,000 driving scenes collected in Boston and Singapore, two traffic-dense cities. The objects span 23 classes and 8 attributes, all labeled with 3D bounding boxes.
BLVD was developed by Xi'an Jiaotong University and collected in Changshu, China. It introduces a new benchmark focusing on dynamic 4D object tracking, 5D interactive event recognition, and 5D intention prediction. The dataset consists of 654 video clips totaling 120k frames at 10 fps. All frames are annotated, yielding 249,129 3D annotations: 4,902 unique objects for tracking, 6,004 fragments for interactive event recognition, and 4,900 objects for intention prediction.
| Data type | Sub-type | Datasets (year, task) |
|---|---|---|
| CAD | synthetic models | ModelNet (2015, cls), ShapeNet (2015, seg), Shape2Motion (2019, seg, mot) |
| RGB-D | objects | ScanObjectNN (2019, cls) |
| RGB-D | indoor scenes | NYUDv2 (2012, seg), SUN3D (2013, seg), S3DIS (2016, seg), SceneNN (2016, seg), ScanNet (2017, seg), Matterport3D (2017, seg), 3DMatch (2017, reg) |
| LiDAR | terrestrial laser scanning | Semantic3D (2017, seg) |
| LiDAR | mobile laser scanning (indoor) | Multisensor Indoor Mapping and Positioning Dataset (2018, loc) |
| LiDAR | mobile laser scanning (outdoor) | KITTI (2012, det, odo), SemanticKITTI (2019, seg), ASL Dataset (2012, reg), iQmulus (2014, seg), Oxford Robotcar (2017, aut), NCLT (2016, aut), DBNet (2018, dri), NPM3D (2017, seg), Apollo (2018, det, loc), nuScenes (2019, det, aut), BLVD (2019, det) |
6 Application of deep learning in 3D vision tasks
In this section we discuss the application of the methods of section 4 to three popular 3D vision tasks: classification, segmentation, and object detection (see figure 7(a)). We review the performance of the methods on popular benchmark datasets: ModelNet40 modelnet for classification, and ShapeNet shapenet and the Stanford 3D Indoor Semantics Dataset (S3DIS) s3dis for part and semantic segmentation respectively.
Object classification has been one of the primary tasks addressed with deep learning. In object classification, the objective is: given a point cloud, the network should classify it into a certain category. Classification is the pioneering task in deep learning because the early breakthrough models such as AlexNet alexnet, VGGNet VGG, and ResNet resnet are classification models. For point clouds, most early deep learning techniques for classification relied on a structured grid (section 3); here, however, we limit ourselves to approaches that process point clouds directly.
The learned global features can easily be used for classification by passing them through a fully connected network whose last layer represents the classes. Other machine learning classifiers such as SVM can also be used, as in voxnet; foldingnet. Figure 9 shows a timeline of the performance of point-based deep learning approaches on ModelNet40.
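To make the classification pipeline concrete, here is a toy numpy forward pass in the PointNet style: a shared per-point MLP, a symmetric max pool that yields a global feature, and a fully connected layer whose outputs are the class scores. The weights are random and the layer sizes are merely illustrative; this is a shape-level sketch, not a trained model:

```python
import numpy as np

def pointnet_classify(points, n_classes=40, seed=0):
    """Toy PointNet-style forward pass: shared per-point MLP -> max pool
    -> fully connected head. Weights are random; the shapes are the point."""
    rng = np.random.default_rng(seed)
    n, d = points.shape                       # (N, 3) xyz input
    w1 = rng.standard_normal((d, 64))         # shared MLP, applied per point
    w2 = rng.standard_normal((64, 1024))
    per_point = np.maximum(points @ w1, 0)    # ReLU
    per_point = np.maximum(per_point @ w2, 0) # per-point features (N, 1024)
    global_feat = per_point.max(axis=0)       # symmetric max pool -> (1024,)
    w3 = rng.standard_normal((1024, n_classes))
    return global_feat @ w3                   # class scores (n_classes,)

cloud = np.random.default_rng(1).standard_normal((128, 3))
scores = pointnet_classify(cloud)
assert scores.shape == (40,)
```

The max pool is what makes the network invariant to the ordering of the input points, which is the core requirement for operating on raw point clouds.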
Segmentation of a point cloud is the grouping of its points into homogeneous regions. Traditionally, segmentation was done using edges edgeseg2006 or surface properties such as normals, curvature, and orientation edgeseg2006; regiongrowing. Recently, feature-based deep learning approaches have been used for point cloud segmentation, grouping the points either into different parts of an object (part segmentation) or into different class categories (semantic segmentation).
In part segmentation, the input point cloud represents a single object and the goal is to assign each point to a particular part, as shown in figure 3, hence the name "part" segmentation. In pointnet; sonet; DGCNN, the learned global descriptor is concatenated with the per-point features and passed through an MLP to classify each point into a part category. pointnetpp and pointcnn
propagate the global descriptor to high-resolution predictions using interpolation and deconvolution methods respectively. In pointConv, the learned per-point features are passed through dense convolutional layers to produce the segmentation. An encoder-decoder architecture is used in DeepKdNet for both part and semantic segmentation. Table 3 shows the results of various techniques on the ShapeNet parts dataset.
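The concatenation scheme used by these part-segmentation heads can be sketched in numpy as follows (random weights and illustrative layer sizes; the point is the shape bookkeeping, not the learned values):

```python
import numpy as np

def part_segment(points, n_parts=4, seed=0):
    """Toy part-segmentation head: concatenate the pooled global descriptor
    to every per-point feature, then classify each point with a shared MLP."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    w1 = rng.standard_normal((d, 64))
    feats = np.maximum(points @ w1, 0)            # per-point features (N, 64)
    global_feat = feats.max(axis=0)               # global descriptor (64,)
    combined = np.hstack([feats, np.tile(global_feat, (n, 1))])  # (N, 128)
    w2 = rng.standard_normal((128, n_parts))
    logits = combined @ w2                        # (N, n_parts)
    return logits.argmax(axis=1)                  # one part label per point

labels = part_segment(np.random.default_rng(2).standard_normal((50, 3)))
assert labels.shape == (50,)
```

Tiling the global descriptor onto every point gives each point both local evidence and object-level context before the per-point classifier runs.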
In semantic segmentation, the goal is to assign each point to a particular class. For example, in figure 7(d), points belonging to chairs are shown in red, while the ceiling and floor are shown in green and blue respectively. Popular public datasets for evaluating semantic segmentation are S3DIS s3dis and ScanNet dai2017scannet. Table 4 shows the performance of some state-of-the-art methods on the S3DIS and ScanNet datasets.
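Segmentation results on these benchmarks are typically reported as mean intersection-over-union (mIoU). A small sketch of the metric, averaging per-class IoU over the classes that actually appear:

```python
import numpy as np

def mean_iou(pred, gt, n_classes):
    """Per-class intersection-over-union, averaged over the classes that
    appear in either prediction or ground truth."""
    ious = []
    for c in range(n_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1])
gt   = np.array([0, 1, 1, 1])
# class 0: 1/2, class 1: 2/3 -> mean ~0.583
score = mean_iou(pred, gt, 2)
```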
Instance segmentation of point clouds has received less attention than part and semantic segmentation. In instance segmentation, the grouping is instance-based: multiple objects of the same class are uniquely identified. Some state-of-the-art works on point cloud instance segmentation are spgn; seginstansemantics; bonet; JSIS3D; rpointnet, which are built on PointNet/PointNet++ feature-learning backbones.
| Method | Score |
|---|---|
| R-S CNN relationshape | 86.1% |
| Pointwise Conv pointConv | 56.1% |
6.3 Object detection
Object detection is an extension of classification in which multiple objects can be recognized and each object is localized with a bounding box, as shown in figure 7(c). RCNN rcnn was the first to propose 2D object detection via selective search, where different regions are selected and passed to the network one at a time. Several variants were later proposed fastrcnn; fasterrcnn; maskrcnn. Another family of state-of-the-art 2D object detectors is YOLO yolo and its variants yolov2; yolov3. In summary, 2D object detection rests on two major stages: region proposal and classification.
As in 2D images, detection in 3D point clouds also builds on the two stages of proposal and classification. The proposal stage, however, is more challenging in 3D because the search space is three-dimensional and the sliding window or proposed region is also three-dimensional. Vote3D vote3d and Vote3Deep vote3deep convert the input point cloud into a structured grid and perform an extensive sliding-window operation for detection, which is computationally expensive. To detect objects directly in point clouds, several techniques use the feature learning methods discussed in section 4.
In VoxelNet voxelnet, the sparse 4D feature vector is passed through a region proposal network to generate 3D detections. Frustum PointNets frustumnet propose regions in 2D, extract the 3D frustum of each region from the point cloud, and pass it through a PointNet to predict the 3D bounding box. spgn first uses PointNet/PointNet++ to obtain a feature vector for each point and, based on the hypothesis that points belonging to the same object are close in feature space, proposes a similarity matrix that predicts whether a given pair of points belongs to the same object. In gspn, PointNet and PointNet++ are used to design a generative shape proposal network whose proposals are further processed by a PointNet for classification and segmentation. PointNet++ is used in pointrcnn to learn point-wise features that segment foreground points from background points, and a bottom-up 3D proposal scheme generates 3D box proposals from the foreground points; the proposals are then refined with another PointNet++-like structure. votenet uses PointNet++ to learn point-wise features that serve as seeds; each seed independently casts a vote through an MLP-based Hough voting module. Votes from the same object land close together in space, allowing easy clustering, and the clusters are further processed by a shared PointNet-like module for vote aggregation and proposal. PointNet is also utilized in pointpillers together with a Single Shot Detector (SSD) ssd for object detection.
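The similarity-matrix idea in spgn can be illustrated with plain pairwise feature distances: points whose learned features are close are hypothesized to belong to the same instance. A numpy sketch (the feature vectors and the threshold below are illustrative, not the learned quantities from the paper):

```python
import numpy as np

def similarity_matrix(feats):
    """Pairwise squared L2 distances between per-point feature vectors.
    Under the SGPN hypothesis, a small distance suggests that two points
    belong to the same object instance."""
    sq = np.sum(feats ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    return np.maximum(d2, 0.0)   # clamp tiny negatives from rounding

# Two nearby points and one far-away point in a toy 2-D feature space.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
d = similarity_matrix(feats)
same_object = d < 1.0            # threshold is illustrative
```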
One of the most popular object detection datasets is KITTI Geiger2012CVPR; Geiger2013IJRR. Evaluation on KITTI is divided into easy, moderate, and hard settings depending on the occlusion level, the minimum height of the bounding box, and the maximum truncation. We report the performance of various object detection methods on the KITTI dataset in the two tables below.
Bird's-eye-view detection on KITTI (AP in %):

| Method | Input | Speed (fps) | mAP | Car (Easy) | Car (Mod.) | Car (Hard) | Ped. (Easy) | Ped. (Mod.) | Ped. (Hard) | Cyc. (Easy) | Cyc. (Mod.) | Cyc. (Hard) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MV3D ChenMWLX17 | LiDAR & Image | 2.8 | N/A | 86.02 | 76.9 | 68.49 | N/A | N/A | N/A | N/A | N/A | N/A |
| Cont-Fuse LiangYWU18 | LiDAR & Image | 16.7 | N/A | 88.81 | 85.83 | 77.33 | N/A | N/A | N/A | N/A | N/A | N/A |
| Roarnet ShinKT19 | LiDAR & Image | 10 | N/A | 88.2 | 79.41 | 70.02 | N/A | N/A | N/A | N/A | N/A | N/A |
| AVOD-FPN KuMLHW18 | LiDAR & Image | 10 | 64.11 | 88.53 | 83.79 | 77.9 | 58.75 | 51.05 | 47.54 | 68.09 | 57.48 | 50.77 |
| F-PointNet frustumnet | LiDAR & Image | 5.9 | 65.39 | 88.7 | 84 | 75.33 | 58.09 | 50.22 | 47.2 | 75.38 | 61.96 | 54.68 |
| HDNET YangLU18 | LiDAR & Map | 20 | N/A | 89.14 | 86.57 | 78.32 | N/A | N/A | N/A | N/A | N/A | N/A |
3D detection on KITTI (AP in %):

| Method | Input | Speed (fps) | mAP | Car (Easy) | Car (Mod.) | Car (Hard) | Ped. (Easy) | Ped. (Mod.) | Ped. (Hard) | Cyc. (Easy) | Cyc. (Mod.) | Cyc. (Hard) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MV3D ChenMWLX17 | LiDAR & Image | 2.8 | N/A | 71.09 | 62.35 | 55.12 | N/A | N/A | N/A | N/A | N/A | N/A |
| Cont-Fuse LiangYWU18 | LiDAR & Image | 16.7 | N/A | 82.54 | 66.22 | 64.04 | N/A | N/A | N/A | N/A | N/A | N/A |
| Roarnet ShinKT19 | LiDAR & Image | 10 | N/A | 83.71 | 73.04 | 59.16 | N/A | N/A | N/A | N/A | N/A | N/A |
| AVOD-FPN KuMLHW18 | LiDAR & Image | 10 | 55.62 | 81.94 | 71.88 | 66.38 | 50.8 | 42.81 | 40.88 | 64 | 52.18 | 46.64 |
| F-PointNet frustumnet | LiDAR & Image | 5.9 | 57.35 | 81.2 | 70.39 | 62.19 | 51.21 | 44.89 | 40.23 | 71.96 | 56.77 | 50.39 |
7 Summary and Conclusion
The increasing availability of point clouds from evolving scanning devices, coupled with growing applications in autonomous vehicles, robotics, AR, and VR, demands fast and efficient algorithms for point cloud processing in order to achieve improved visual perception such as recognition, segmentation, and detection. Due to scarce data availability and the then-limited popularity of deep learning, early methods for point cloud processing relied on handcrafted features. However, with the revolution brought about by deep learning in 2D vision tasks, and with the evolution of acquisition devices making point cloud data widely available, the computer vision community is now focusing on how to bring the power of deep learning to point cloud data. Point clouds provide accurate 3D information that is vital to applications requiring it.

Due to the nature of point clouds, applying deep learning to them is challenging. Many approaches resort to converting the point cloud into a structured grid for easy processing by deep neural networks. These approaches, however, either lose depth information or introduce conversion artifacts, and they require higher computational cost. Recently, deep learning directly on point clouds has been receiving a lot of attention: learning on points directly avoids conversion artifacts and mitigates the need for higher computational cost. PointNet is the foundational deep learning method that processes point clouds directly; PointNet, however, does not capture local structures. Many approaches were developed to improve on PointNet by capturing local structure, and most follow three basic steps: sampling, to reduce the resolution of the points and obtain centroids that represent local neighborhoods; grouping, based on k-NN, to select the points neighboring each centroid; and a mapping function, usually approximated by an MLP, that learns a representation of the neighboring points.
Several methods approximate the mapping function with a PointNet-like network. However, because PointNet does not explore the relationships between points, several approaches model inter-point relationships within a local patch before applying PointNet-like MLPs. Taking the point-to-point relationships into account has proven to increase the discriminative capability of the networks.
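The sampling and grouping steps summarized above can be sketched in numpy as farthest point sampling for the centroids followed by k-NN grouping (the learned mapping MLP is omitted; sizes are illustrative):

```python
import numpy as np

def farthest_point_sample(points, k, start=0):
    """Greedy farthest point sampling: pick k well-spread centroid indices."""
    chosen = [start]
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())               # point farthest from chosen set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def knn_group(points, centroids, n_neighbors):
    """For each centroid, the indices of its n_neighbors nearest points."""
    d = np.linalg.norm(points[None, :, :] - points[centroids][:, None, :],
                       axis=2)                 # (n_centroids, N) distances
    return np.argsort(d, axis=1)[:, :n_neighbors]

pts = np.random.default_rng(0).standard_normal((256, 3))
cents = farthest_point_sample(pts, 16)
groups = knn_group(pts, cents, 8)   # (16, 8): one local patch per centroid
```

Each row of `groups` is a local neighborhood; a shared MLP applied to these patches is what learns the local-structure representation the surveyed methods rely on.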
While deep learning on 3D point clouds has shown good performance on several tasks, including classification and part and semantic segmentation, other areas are receiving less attention. Instance segmentation of 3D point clouds, where the individual objects in a scene are segmented, remains a largely uncharted direction. Most current object detection relies on 2D detection for region proposals; few works detect objects directly in point clouds. Scaling to larger scenes also remains largely unexplored, as most current works rely on cutting large scenes into smaller pieces. At the time of this review, only a few works pointnetvlad; lpdnet had explored deep learning on large-scale 3D scenes.