1 Introduction
Recent years have witnessed the emergence and increasing popularity of 3D computer vision for understanding the 3D world, driven by the development of 3D sensors and technology. Point cloud data obtained from 3D sensors such as LiDAR and stereo cameras is an efficient representation of 3D data. Analysis of 3D point clouds is fundamental to a variety of applications in 3D computer vision, including virtual/augmented reality [17, 22], 3D scene understanding [29, 30, 5], and autonomous driving [4, 25, 33].
Standard CNNs cannot be directly applied to 3D point cloud data because the neighbors of irregularly placed points are not as easy to define as in images. One straightforward way is to convert the irregular 3D points to a regular format, such as a voxel representation [34, 24, 31] or view projections [27, 20, 12, 35], and then analyze them with CNNs. Recently, network architectures [19, 21, 23, 8] that work directly on 3D point clouds have been developed. Similar to CNNs, given a set of points, a point “convolution layer” first needs to find the “local patch” of each input point from the point affinity matrix (defined as the adjacency matrix of the dense graph constructed from the point cloud). It then learns local features from the patch and finally forms the feature map. By stacking such basic point convolution layers, the network can extract information from the point cloud at different levels.
Nonetheless, unlike 2D images, where local patches are naturally determined, defining local patches for 3D point clouds is not easy. The local patches should cope with complicated geometric transformations in 3D shapes. Most methods [19, 21, 32, 16] define local patches using a nearest neighbor graph in Euclidean space. This may not be optimal, as (1) Euclidean distance may not capture the geometric transformations of different 3D point shapes; and (2) different layers usually target different levels of information, and a fixed nearest neighbor graph constrains how the patches can change across those levels. We propose to dynamically learn the point affinity matrix for finding local patches. The affinity matrix should be learned from both the point locations and the current feature map.
In order to transform the point cloud for defining local neighborhoods, a naive approach is to learn a function that generates new point set coordinates from the original point set locations and the current feature map. However, if this function is unconstrained, the learning can be nontrivial. An alternative is to put a smoothness constraint on the function. Observe that any isometric transformation (e.g. rigid) cannot change the topology. Hence, in this work, we use nonrigid transformations, both linear and nonlinear, to model the function. In other words, we learn spatial transformers from the point cloud coordinates and feature map to generate a new point cloud for the affinity matrix. Then, we find “local patches” based on the affinity matrix (usually an NN graph measured by distance). Our proposed scheme learns an affine spatial transformation parametrized by an affine matrix. We also propose a projective spatial transformer and a deformable spatial transformer as nonlinear transformations, each parametrized by its own transformation matrix.
In summary, our main contributions are: (1) proposing linear (affine) and nonlinear (projective, deformable) spatial transformers on point clouds for learning the affinity matrix and local patches; (2) showing that the proposed module can be easily added to existing point cloud networks for different tasks (segmentation, detection); (3) applying the spatial transformer to both point-based and sampling-based point cloud processing networks, and observing improved performance compared to their fix graph counterparts.
2 Related Works
View-based and voxel-based networks. View-based methods project 3D shapes onto 2D planes and use a group of images from different views as the representation. Taking advantage of the power of CNNs in 2D image processing [7, 10, 20, 27], view-based methods have achieved reasonable 3D processing performance. Yet geometric shape information is lost when projecting from 3D to 2D.
Representing 3D shapes as volumetric data on a regular 3D grid and processing them with 3D convolution has also been adopted by many works [34, 18]. However, quantization artifacts, inefficient use of the 3D voxels, and low resolution due to limited computation capacity severely limit volumetric methods. Furthermore, 3D convolution usually operates away from the surface and cannot capture much of the 3D shape information. Recent works that apply different partition strategies [13, 24, 28, 31] relieve these issues to some extent but still depend on bounding volume subdivision rather than fine-grained local geometric shape. Our work directly takes the 3D point cloud as input to minimize geometric information loss and to maximize processing efficiency.
Deep learning on point cloud. Some deep neural networks directly take point clouds as input and learn semantic/abstract information through point processing operations. As a pioneering work, PointNet [19] directly learns an embedding of every isolated 3D point and gathers that information by pooling later on. Although successful, PointNet does not learn any local geometric information of the 3D shape. PointNet++ [21] relieves this by applying isolated 3D point feature learning hierarchically to multiple subsets of the point cloud data. Many other works also explore different strategies for leveraging local structure learning of point cloud data [32, 16]. Instead of finding neighbors of each point, SplatNet [26] encodes local structure from a sampling perspective: it groups points based on permutohedral lattices [1], and then applies bilateral convolution [11] for feature learning. Superpoint graphs [14] partition the point cloud into superpoints and learn the 3D point geometric organization.
A point cloud is defined on an irregular grid, so the regular convolution operation cannot be applied directly. Many works [32, 16, 15] aim at designing point convolution operations for point cloud data. However, most of them directly use the original input point cloud to find the local patches for the convolution operation. 3D shapes undergo diverse geometric transformations, and efficient learning requires the point convolution operation to be invariant to such transformations. Using a fixed 3D shape as input at all layers to find local patches greatly limits the network’s flexibility in handling this issue. In contrast, our work proposes spatial transformers on the point cloud to capture geometric transformations more efficiently.
3 Methods
In this section, we briefly review different transformation methods and their influence on the affinity matrix, followed by the design of our three spatial transformers (affine, projective and deformable). Finally, we introduce how the transformers can be added to existing point cloud networks and their relevance to other works.
3.1 Transformation and Affinity Matrix
We propose to learn transformations on the original point cloud to “deform” the original geometric shape, and build new affinity matrices based on nearest neighbor (NN) graphs. The new affinity matrix will dynamically alter the “local patch” for “convolution” for point feature learning.
As in Figure 2, transformations can be categorized into rigid and nonrigid transformations, and nonrigid transformations can be further partitioned into linear and nonlinear transformations. We briefly review different transformation methods below.
Rigid transformations: The group of rigid transformations consists of translations and rotations. However, rigid transformations are isometric and hence preserve the affinity matrix. Thus, local patches are “invariant” to rigid transformations in terms of the NN graph. We do not consider these transformations in this paper.
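The invariance claim above can be checked numerically. The following is a minimal sketch, assuming a brute-force nearest-neighbor search over a random toy point cloud; the helper names `knn_indices` and `rotation_z`, the neighborhood size of 4, and the stretch factors are illustrative choices and not part of the method.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of each row of an (N, 3) array."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # (N, N) pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)             # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def rotation_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
P = rng.normal(size=(64, 3))                   # toy point cloud, one point per row

# Rigid map: rotation followed by translation (isometric).
rigid = P @ rotation_z(0.7).T + np.array([1.0, -2.0, 0.5])
# Nonrigid affine map: strongly anisotropic scaling (not isometric).
stretched = P @ np.diag([10.0, 1.0, 0.1])

same_after_rigid = np.array_equal(knn_indices(P, 4), knn_indices(rigid, 4))
same_after_stretch = np.array_equal(knn_indices(P, 4), knn_indices(stretched, 4))
print(same_after_rigid, same_after_stretch)
```

In this toy setting the rigid map leaves every neighbor list unchanged, while the anisotropic scaling reorders neighbors, which is the behavior the nonrigid transformers below rely on.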
Affine transformations: An affine transformation is a nonrigid linear transformation. Consider a 3D point cloud $P = \{p_j\}_{j=1}^{N}$ which consists of $N$ three-dimensional points $p_j \in \mathbb{R}^3$. An affine transformation is parametrized by an invertible matrix $A \in \mathbb{R}^{3 \times 3}$ and a translation vector $b \in \mathbb{R}^3$. Given $p_j$, we get the affine transformed point as $q_j = A p_j + b$. Note that translation will not change the NN graph. Affine transformations preserve collinearity, parallelism and convexity.
Projective transformations: A projective transformation (or homography) is a nonrigid nonlinear transformation. We first map the 3D point set $P$ to the homogeneous space and get $\tilde{P}$ by concatenating ones as the last dimension, so that $\tilde{p}_j \in \mathbb{R}^4$. The projective transformation is parametrized by $B \in \mathbb{R}^{4 \times 4}$ and we get the transformed point as $\tilde{q}_j = B \tilde{p}_j$. Compared to affine transformations, projective transformations have more freedom but cannot preserve parallelism.
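The two transformations above can be sketched in a few lines. This is an illustrative example only: points are stored as columns of a 3 x N array, the names `A`, `b` and `B` stand for the affine matrix, translation vector and 4 x 4 homography, and the final division by the last homogeneous coordinate is an assumption about how the result is mapped back to 3D.

```python
import numpy as np

def affine_transform(P, A, b):
    """Apply q = A p + b to every column of the (3, N) array P."""
    return A @ P + b[:, None]

def projective_transform(P, B):
    """Apply a 4x4 homography B to (3, N) points via homogeneous coordinates."""
    P_h = np.vstack([P, np.ones((1, P.shape[1]))])   # append ones: (4, N)
    Q_h = B @ P_h
    return Q_h[:3] / Q_h[3:4]                        # assumed step: divide by last coordinate

A = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.0], [0.0, 0.3, 1.5]])  # invertible (det = 3)
b = np.array([1.0, 0.0, -1.0])
P = np.array([[0.0, 1.0, 2.0],
              [0.0, 1.0, 2.0],
              [0.0, 0.5, 1.0]])      # three collinear points, one per column

# A homography whose last row is [0, 0, 0, 1] is exactly an affine map.
B_affine = np.eye(4)
B_affine[:3, :3], B_affine[:3, 3] = A, b
affine_matches = np.allclose(projective_transform(P, B_affine), affine_transform(P, A, b))

# A non-trivial last row gives a genuinely projective map; lines still map to lines.
B = np.eye(4)
B[3, :3] = [0.1, 0.2, 0.0]
Q = projective_transform(P, B)
still_collinear = np.allclose(np.cross(Q[:, 1] - Q[:, 0], Q[:, 2] - Q[:, 0]), 0.0)
```

The second check illustrates the restriction that a projective map sends straight lines to straight lines, even though it no longer preserves parallelism.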
Deformable transformations: When all the points have the freedom to move without much constraint, the 3D shape can deform freely. We refer to this as a deformable transformation. This transformation has more degrees of freedom and does not preserve the topology. We learn the deformable transformation from both point location and feature, as described in the following subsection.
3.2 Spatial Transformer
We propose to transform the original point cloud to obtain dynamic local patches. Our spatial transformers can thus be applied to other existing point cloud processing networks. We briefly introduce our affine, projective and deformable spatial transformers as follows.
Affine: The spatial transformer can be applied to different layers of a point cloud processing network. Suppose at layer $t$, the block contains $k_t$ spatial transformers and corresponding feature learning modules. Each spatial transformer first learns a transformation subgraph and calculates the corresponding subfeature. Finally, we concatenate the subfeatures of all transformers to form the final feature output of the learning block.
Suppose the $i$-th spatial transformer at layer $t$ takes as input the original point cloud $P \in \mathbb{R}^{3 \times N}$ and the previous feature map $F^{(t-1)} \in \mathbb{R}^{f_{t-1} \times N}$. We first form the new transformed points from $P$ as:
$$G_i^{(t)} = A_i^{(t)} P + b_i^{(t)}, \quad i = 1, 2, \dots, k_t \tag{1}$$
where the translation $b_i^{(t)} \in \mathbb{R}^3$ is added to every point. As the affinity matrix is invariant to the translation $b_i^{(t)}$, we set $b_i^{(t)} = 0$ for simplicity. Equation 1 can hence be simplified to:
$$G_i^{(t)} = A_i^{(t)} P, \quad i = 1, 2, \dots, k_t \tag{2}$$
where $A_i^{(t)} \in \mathbb{R}^{3 \times 3}$. We then apply NN search to the transformed points $G_i^{(t)}$ to obtain the affinity matrix $S_i^{(t)}$. For every affinity matrix $S_i^{(t)}$, we can define local patches for the point cloud and thus apply the point convolution operation to the previous point cloud feature map $F^{(t-1)}$ to get the point cloud feature $F_i^{(t)}$ of the subgraph:
$$F_i^{(t)} = \mathrm{CONV}\left(F^{(t-1)}, S_i^{(t)}, K\right), \quad i = 1, 2, \dots, k_t \tag{3}$$
where CONV is the point convolution operation: it takes (a) the previous point cloud feature, (b) the affinity matrix (for defining the local patch of each point) and (c) the number of neighbors $K$ (for defining the size of local patches) as input. In some point convolution operations (such as [32]), the affinity matrix alters the input feature in a non-differentiable way. We therefore concatenate the transformed point cloud $G_i^{(t)}$ to the input feature so that gradients can be backpropagated to the transformation matrix $A_i^{(t)}$. In other sampling-based convolution operations (such as bilateral convolution [26]), the affinity matrix changes the input feature in a differentiable way, so no extra operation is needed.
For all the $k_t$ subgraphs in layer/block $t$, we can learn $k_t$ point cloud features $F_i^{(t)}$. The output of this learning module is the concatenation of all the subgraph point cloud features:
$$F^{(t)} = \mathrm{CONCAT}\left(F_1^{(t)}, F_2^{(t)}, \dots, F_{k_t}^{(t)}\right) \tag{4}$$
where $F_i^{(t)} \in \mathbb{R}^{f_i^{(t)} \times N}$ and $F^{(t)} \in \mathbb{R}^{f_t \times N}$, $f_t = \sum_{i=1}^{k_t} f_i^{(t)}$. When implementing, we randomly initialize $A_i^{(t)}$ from the standard normal distribution, $\mathcal{N}(0, 1)$. When computing the transformed point cloud location in Equation 2, we normalize the transformation matrix by its norm, as the affinity matrix is invariant to this norm.
Projective: Analogous to the affine transformation learning module, for the $i$-th graph at layer $t$, we first apply a projective transformation to the point cloud in homogeneous coordinates, $\tilde{P} \in \mathbb{R}^{4 \times N}$ (the points $P$ with ones appended as the last dimension), and get the transformed point cloud as:
$$\tilde{G}_i^{(t)} = B_i^{(t)} \tilde{P}, \quad i = 1, 2, \dots, k_t \tag{5}$$
where $B_i^{(t)} \in \mathbb{R}^{4 \times 4}$ is the transformation matrix in homogeneous coordinates.
Then we follow the same point cloud feature learning as defined in Equation 3, and concatenate the subfeatures as in Equation 4 to get the output feature of the layer.
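The affine block described above (transform, build an NN graph, convolve, concatenate) can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_point_conv` is a hypothetical stand-in for the point convolution (max-pooling over neighbors followed by a linear map), not edge convolution or bilateral convolution; the matrices are random rather than learned; the Frobenius norm is assumed for the normalization; and the concatenation of transformed coordinates used for backpropagation is omitted.

```python
import numpy as np

def knn_graph(G, K):
    """G: (3, N) transformed points. Returns (N, K) neighbor indices."""
    sq = (G ** 2).sum(axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (G.T @ G)   # squared pairwise distances
    np.fill_diagonal(dist2, np.inf)                       # exclude self
    return np.argsort(dist2, axis=1)[:, :K]

def toy_point_conv(F_prev, neighbors, W):
    """Stand-in for CONV: max-pool neighbor features, then linear map + ReLU.

    F_prev: (f_prev, N), neighbors: (N, K), W: (f_out, f_prev).
    """
    pooled = F_prev[:, neighbors].max(axis=2)   # (f_prev, N)
    return np.maximum(W @ pooled, 0.0)          # (f_out, N)

def affine_transformer_block(P, F_prev, A_list, W_list, K):
    """One block: k_t affine transformers, each with its own graph and subfeature."""
    outputs = []
    for A, W in zip(A_list, W_list):
        A = A / np.linalg.norm(A)               # rescaling A does not change the NN graph
        G = A @ P                               # transformed points
        neighbors = knn_graph(G, K)             # local patches from the new geometry
        outputs.append(toy_point_conv(F_prev, neighbors, W))
    return np.concatenate(outputs, axis=0)      # stack subfeatures along the channel axis

rng = np.random.default_rng(0)
N, f_prev, f_sub, k_t, K = 100, 16, 32, 4, 20
P = rng.normal(size=(3, N))
F_prev = rng.normal(size=(f_prev, N))
A_list = [rng.normal(size=(3, 3)) for _ in range(k_t)]       # standard normal init
W_list = [rng.normal(size=(f_sub, f_prev)) for _ in range(k_t)]
F_out = affine_transformer_block(P, F_prev, A_list, W_list, K)
print(F_out.shape)
```

The output has `k_t * f_sub` channels per point, matching the concatenation step; a projective variant would replace `A @ P` with a 4 x 4 matrix applied to homogeneous coordinates.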
Deformable: Affine and projective transformations are useful in transforming the original point cloud data, altering the affinity matrix, and providing learnable “local patches” for the point convolution operation at different layers. Nonetheless, affine transformations are linear, so their ability to alter the affinity matrix and local patches is limited. Although the projective transformation has more flexibility than the affine one in the sense that parallelism is not preserved, the restriction that it maps a straight line to a straight line makes it not general enough to capture all possible deformations. To alleviate this problem and capture more geometric transformations of the point cloud, we propose a nonlinear transformation module, the deformable transformation module.
The deformable transformation at layer $t$ and subgraph $i$ can be written as:
$$G_i^{(t)} = A_i^{(t)} P + D_i^{(t)} \tag{6}$$
where $A_i^{(t)} P$ is the affine transformation of the point cloud $P \in \mathbb{R}^{3 \times N}$, and the deformation matrix $D_i^{(t)} \in \mathbb{R}^{3 \times N}$ gives every point the freedom to move, so the geometric shape of the whole point cloud has the flexibility to deform. Note that the deformation matrix $D_i^{(t)}$ is different from the translation vector in Equation 1: it moves each point independently and therefore significantly changes the local patch.
As a self-supervised learning procedure, the spatial transformer parameters are learned from both point cloud location and feature. Since the affine transformation $A_i^{(t)} P$ already captures the location information, we use the deformation matrix to capture feature map changes, given by $D_i^{(t)} = C_i^{(t)} F^{(t-1)}$, where $F^{(t-1)} \in \mathbb{R}^{f_{t-1} \times N}$ is the feature map of the previous layer and $C_i^{(t)} \in \mathbb{R}^{3 \times f_{t-1}}$ transforms the feature from $\mathbb{R}^{f_{t-1}}$ to $\mathbb{R}^3$. Hence, the deformable transformation in Equation 6 can be simplified as:
$$G_i^{(t)} = A_i^{(t)} P + C_i^{(t)} F^{(t-1)} = \begin{bmatrix} A_i^{(t)} & C_i^{(t)} \end{bmatrix} \begin{bmatrix} P \\ F^{(t-1)} \end{bmatrix} \tag{7}$$
where $\begin{bmatrix} A_i^{(t)} & C_i^{(t)} \end{bmatrix} \in \mathbb{R}^{3 \times (3 + f_{t-1})}$ is the concatenation of the affine and deformable transformation matrices, which captures both point cloud location and feature map projection.
After we compute the transformed point location $G_i^{(t)}$, we follow Equations 3 and 4 to learn the feature of each subtransformation graph, and concatenate them as the final output feature of layer $t$.
For the deformable spatial transformer, we decompose the entire transformation into two parts: $A_i^{(t)} P$ and $C_i^{(t)} F^{(t-1)}$. The former is the affine transformation of the 3D point coordinates, while the latter is a transformation of the point feature. The transformation of point spatial location captures the linear transformation information of the point cloud, and the feature transformation captures relatively high-level semantic information. The deformable transformation sums the two sources of information. Section 4.5 provides more empirical analysis of these two components.
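The decomposition above can be made concrete with a short sketch. The names `A` (3 x 3 affine matrix), `C` (3 x f feature projection) and `F_prev` (previous feature map) are notation assumed for this example, and all matrices are random placeholders rather than learned parameters.

```python
import numpy as np

def deformable_transform(P, F_prev, A, C):
    """Compute G = A P + C F_prev as one product [A, C] @ [P; F_prev].

    P: (3, N) point locations, F_prev: (f_prev, N) previous feature map,
    A: (3, 3) affine matrix, C: (3, f_prev) feature projection.
    """
    stacked_matrix = np.hstack([A, C])       # (3, 3 + f_prev)
    stacked_input = np.vstack([P, F_prev])   # (3 + f_prev, N)
    return stacked_matrix @ stacked_input    # (3, N) transformed locations

rng = np.random.default_rng(0)
N, f_prev = 50, 8
P = rng.normal(size=(3, N))
F_prev = rng.normal(size=(f_prev, N))
A = rng.normal(size=(3, 3))
C = rng.normal(size=(3, f_prev))

G = deformable_transform(P, F_prev, A, C)
offsets = G - A @ P    # per-point displacement contributed by the feature term
```

Because each column of `offsets` depends on that point's own feature vector, every point receives a different displacement, which is what distinguishes this term from a single shared translation.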
3.3 Spatial Transformer Networks
The spatial transformer blocks discussed above aim to dynamically transform point cloud and change the local patches for point convolution operation. The transformer can be easily added to existing point cloud networks. We introduce how a general point cloud processing network with transformer module works, and then provide three applications in different networks and tasks as examples.
Point cloud classification/segmentation/detection: Figure 3 depicts a general network architecture for point cloud segmentation. Suppose it is a $c$-class segmentation task with an input point cloud $P$ consisting of $N$ points in $\mathbb{R}^3$. Our network consists of several spatial transformers at different layers. For layer $t$, we learn $k_t$ transformation matrices to apply to the original point cloud location $P$, and compute the corresponding affinity matrices (e.g. based on NN graphs in the edge convolution [32] for point clouds). For each subtransformation, we learn a feature of dimension $f_i^{(t)} \times N$; then we concatenate all $k_t$ features in this layer to form an output feature of dimension $f_t \times N$, where $f_t = \sum_{i=1}^{k_t} f_i^{(t)}$. The output feature serves as the input of the next layer for further feature learning. Note that since different layers can have multiple graphs, the affine/projective transformation matrices are only applied to the original point cloud location $P$. For the deformable transformation specifically, the deformation matrix is applied to the previous feature map, so the feature transformation component is learned progressively. By stacking several such transformation learning blocks and finally a fully connected layer of dimension $c$, we can map the input point cloud to a segmentation map of dimension $c \times N$, or downsample to a vector of dimension $c$ for the classification task. We can train the network end-to-end with modern optimization methods. For the spatial transformer block in the point cloud detection network (Figure 4), $f_t$ is the dimension of the output feature.
Application in Point-based Segmentation Networks. Point-based segmentation networks [21, 19, 16, 32] take a point cloud as input and derive the affinity matrix and local patches from the point locations. For selected points, a “convolution” operator is applied to each point and its local patch to learn the feature map. We choose edge convolution from [32] as our baseline since it takes relative point location as input and achieves state-of-the-art performance. Specifically, we retain all the settings of the method and only insert the spatial transformers to provide new local patches for the edge convolution operation.
Application in Sampling-based Segmentation Networks. SplatNet [26] is a representative sampling-based segmentation network. It also takes a point cloud as input, but uses a permutohedral lattice [1] to group points into lattices and performs learned bilateral filters [11] on the grouped points to get features. The permutohedral lattice defines the local patch of each point to make the bilateral convolution possible. We use spatial transformers to deform the point cloud and form new permutohedral lattices on the transformed point sets. The local patches can therefore dynamically cope with the geometric shape of the point cloud. All the other components in SplatNet remain the same.
Application in Detection Networks. Object detection in 3D point clouds is an important problem in many applications, such as autonomous navigation, housekeeping robots, and AR/VR. LiDAR point clouds, one of the most popular types of sensory data for 3D detection, are usually highly sparse and imbalanced. The proposed spatial transformers specialize in transforming point clouds for dynamic local patches, and have the potential to process LiDAR data efficiently. We use VoxelNet [36], which achieves state-of-the-art performance in 3D object detection on autonomous driving data, as our baseline model. As in Figure 4, we follow all the settings in VoxelNet, but add spatial transformers on the raw point cloud data, before point grouping. The spatial transformer feature learning blocks only change the point feature but not the point location used for grouping. They can be considered as improving the point cloud learning process.
3.4 Relevance to Other Works
The idea of a transformation learning module is closely related to some previous work. We briefly review its relevance to deformable CNN [6] and DGCNN [32].
Relevance to Deformable CNN. Deformable convolutional networks [6] propose to learn dynamic local patches for 2D images. Specifically, for each location $u_0$ on the output feature map $Y$, deformable convolution augments the regular grid $\mathcal{R}$ with offsets $\{\Delta u_n \mid n = 1, \dots, |\mathcal{R}|\}$. Then the convolutional output on input $X$ parameterized by weight $w$ becomes:
$$Y(u_0) = \sum_{u_n \in \mathcal{R}} w(u_n) \, X(u_0 + u_n + \Delta u_n) \tag{8}$$
The offset augmentation to the regular grid is very similar to the deformable transformation (Equation 6): we also want each point to have the freedom to move. For 2D images (matrices) defined on a regular grid, the dynamic grid is necessary to model geometric transformations [6]; for 3D point clouds defined on an irregular grid, the dynamic grid is also necessary to model even more complicated 3D geometric transformations.
Relevance to Dynamic Graph CNN. The idea of having dynamic local patches in point cloud processing has also been explored in DGCNN [32]. For the point convolution operation, they directly use the high-dimensional feature map from the previous layer to construct a dynamic graph for the affinity matrix and local patches. Additionally, they have only one graph at each layer. For local patch learning, we instead transform both the point cloud location and the feature (Equation 7) to compute the affinity matrix and local patches. We also have multiple graphs at each layer to deform the point cloud differently, in order to capture different geometric transformations. With less computational burden and more flexibility in geometric transformations, we demonstrate better performance in two segmentation experiments (Sections 4.2 and 4.3).
4 Experiments
In this section, we conduct comprehensive experiments to verify the effectiveness of the proposed spatial transformer. First, we evaluate the transformer on two networks (point-based and sampling-based) for four point cloud processing tasks (classification results are shown in Section 4.1). We then conduct ablation studies on the deformable spatial transformer. We visualize and analyze the learned shapes. Finally, we conclude this section with classification results on the ModelNet40 dataset.
4.1 Classification
We conduct experiments on the ModelNet40 3D shape classification dataset [34] to show the effectiveness of the proposed spatial transformer. We evaluate on two baseline methods [32, 26] and adopt the same network architectures, settings and evaluation protocols. We show that adding the spatial transformer to the point-based and sampling-based methods gives 1% and 2% gains on ModelNet40, respectively (Table 1). If both the training and test data are augmented with random rotation, the spatial transformer gives a 3% gain, since the learned transformations better align the global 3D shape for shape recognition (Fig. 8).
4.2 Part Segmentation
3D point cloud part segmentation is an important yet challenging fine-grained 3D analysis task: given a 3D point cloud, the task aims at accurately assigning a part category label (e.g. chair leg, cup handle) to each point. We evaluate the proposed modules on the ShapeNet part segmentation dataset [3]. The dataset contains shapes from 16 categories, annotated with 50 parts in total, with 2 to 6 parts per category. Ground truth is annotated on every sample.
4.2.1 Point-based method
Network architecture. Point-based segmentation networks take a point cloud as input and derive the affinity matrix and local patches from the point location, for the “convolution” operation on points. We use a state-of-the-art point-based segmentation network, Dynamic Graph CNN (DGCNN) [32], and follow the same network architecture and evaluation protocol. Specifically, this work uses “edge convolution” as the point convolution operation. The network has 3 convolutional layers, with an output feature dimension of 64. Additionally, in order to capture different levels of information from the input point clouds, they concatenate all the convolutional features and use several fully connected layers to map the feature to the final segmentation output.
We insert our spatial transformer to alter the local patch definition for the edge convolution operation. We first use the original point cloud location and name it the fix graph baseline. With the affine, projective and deformable spatial transformers defined in Section 3.2, we also have point-based affine, projective and deformable networks. DGCNN itself directly uses the learned feature to build the affinity matrix (based on an NN graph) to obtain local patches, and we consider this the point-based dynamic graph network.
Under the framework of 3 edge convolution layers, we keep the number of graphs in each layer and the subgraph feature dimension the same across layers, and search for the best architecture. Due to memory limitations, we report the affine, projective and deformable networks at their best-performing configuration. To make a fair comparison, we also increase the number of channels of the fix graph baseline and dynamic graph networks.
Result and analysis. In Table 2, we report the instance average mIOU (mean intersection over union), as well as the mIOU of some representative categories in ShapeNet. Compared with the fix graph baseline, the affine, projective and deformable networks achieve 0.5%, 0.2% and 1.1% improvement, respectively, and beat the fix graph baseline in most categories. Specifically, we observe 8.0%, 8.3% and 4.7% performance boosts for the earphone, rocket and cap categories, respectively, with the deformable network compared with the fix graph baseline. Compared with the dynamic graph network, the deformable network improves by 4.0%. We also beat other state-of-the-art methods [19, 21, 23]. Figure 5 qualitatively visualizes some part segmentation results of the fix graph baseline and our deformable spatial transformer. The deformable spatial transformer makes the predictions smoother and achieves better performance than the fix graph baseline.
From affine to deformable networks, the performance generally increases as the degree of freedom goes up. The projective spatial transformer, however, has similar or worse performance than the affine one, and we believe the mapping to homogeneous coordinates may inhibit the ability to capture geometric transformations. When the freedom increases further and the learned feature is used directly as input to define the affinity matrix and find local patches (dynamic graph), the performance drops. We believe that both point location and feature are needed to learn the affinity matrix, rather than reusing only the high-dimensional point cloud feature.
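For readers unfamiliar with the metric, the instance average mIOU can be sketched as below. This is an illustrative implementation under assumptions not stated in the text: per-shape IoU is averaged over the parts of that shape's category, and a part absent from both prediction and ground truth counts as IoU 1, which is a common convention for ShapeNet part evaluation. The function names are hypothetical.

```python
import numpy as np

def shape_miou(pred, gt, part_ids):
    """Mean IoU over the candidate parts of one shape.

    pred, gt: (N,) integer part labels; part_ids: labels valid for the category.
    """
    ious = []
    for part in part_ids:
        p, g = pred == part, gt == part
        union = np.logical_or(p, g).sum()
        if union == 0:
            ious.append(1.0)    # assumed convention: part absent in both counts as perfect
        else:
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

def instance_average_miou(preds, gts, part_ids_per_shape):
    """Average the per-shape mIoU over all shapes (instances)."""
    scores = [shape_miou(p, g, ids) for p, g, ids in zip(preds, gts, part_ids_per_shape)]
    return float(np.mean(scores))

gt = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])          # one point of part 0 mislabeled as part 1
score = shape_miou(pred, gt, [0, 1])   # (1/2 + 2/3) / 2
```

Averaging over instances rather than categories means that categories with many shapes dominate the reported average.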
| Method | Avg. | aero | bag | cap | earphone | lamp | rocket |
|---|---|---|---|---|---|---|---|
| # shapes | | 2690 | 76 | 55 | 69 | 1547 | 66 |
| 3DCNN [19] | 79.4 | 75.1 | 72.8 | 73.3 | 63.5 | 74.4 | 51.2 |
| PointNet [19] | 83.7 | 84.3 | 78.7 | 82.5 | 73.0 | 80.8 | 57.9 |
| PointNet++ [21] | 85.0 | 82.4 | 79.0 | 87.7 | 71.8 | 83.7 | 58.7 |
| FCPN [23] | 81.3 | 84.0 | 82.8 | 86.4 | 73.6 | 77.4 | 68.4 |
| DGCNN [32] | 81.3 | 84.0 | 82.8 | 86.4 | 73.6 | 77.4 | 68.4 |
| Point-based [32] fix graph | 84.2 | 83.7 | 82.3 | 84.0 | 69.9 | 82.5 | 56.0 |
| Point-based affine | 84.7 | 84.1 | 83.5 | 86.9 | 72.5 | 83.3 | 60.9 |
| Point-based projective | 84.4 | 84.3 | 84.2 | 88.5 | 72.8 | 81.7 | 61.6 |
| Point-based deformable | 85.3 | 84.6 | 83.3 | 88.7 | 77.9 | 83.5 | 64.3 |
| PointCNN [16] | 84.9 | 82.7 | 82.8 | 82.5 | 75.8 | 82.6 | 61.5 |
| PointCNN deformable | 85.8 | 83.4 | 86.6 | 85.5 | 78.5 | 84.2 | 65.0 |
| Sampling-based baseline [26] | 84.6 | 81.9 | 83.9 | 88.6 | 73.5 | 84.5 | 59.2 |
| Sampling-based deformable | 85.2 | 82.9 | 83.8 | 87.6 | 73.0 | 85.7 | 65.1 |
4.2.2 Sampling-based method
Sampling-based point cloud processing methods group 3D points first, and then conduct convolution on the grouped points. SplatNet [26], as a representative method, applies a permutohedral lattice [1] to group points into lattices and performs learned bilateral filters [11] on the grouped points to extract features. Because the bilateral convolution operates on the grouped points, it enjoys the advantage of naturally defined local neighbors in different directions.
Network architecture. We follow the same architecture as SplatNet [26]: the network starts with a single regular convolutional layer, followed by 5 bilateral convolution layers (BCLs). The outputs of all BCLs are concatenated and fed to a final regular convolutional layer to get the segmentation output. Since each BCL directly takes the raw point cloud location as input, we consider it the fix graph baseline. We add deformable spatial transformers to the network and feed the transformed point graphs to the BCLs to construct the permutohedral lattice. Because gradients flow through the permutohedral lattice grid, the transformation matrix can be learned end-to-end. Note that we increase the number of channels of the convolution layers for a fair comparison.
Result and analysis. We report the performance of the deformable spatial transformer (applied at all BCLs) in Table 2. Compared with the sampling-based fix graph baseline [26], the deformable module achieves a 0.6% improvement and a performance boost in most categories (5.9% for rocket). The deformable spatial transformer also beats other state-of-the-art baselines.
4.3 Semantic Segmentation
Semantic segmentation of point cloud data is challenging but has high practical significance, for example in robotic vision. The task is similar to part segmentation, except that point labels are semantic object classes instead of part labels.
We conduct experiments on the Stanford 3D semantic parsing dataset [2]. The dataset contains 3D scans from Matterport scanners in 6 areas including 271 rooms. Each point in the scan is annotated with one of the semantic labels from 13 categories (chair, table, floor etc. plus clutter).
We follow the data processing procedure of [19] for the Stanford 3D Indoor Scenes Dataset [2]. Specifically, we first split points by room, and then sample rooms into several 1 m × 1 m blocks. When training, 4096 points are sampled from each block on the fly. We train our network to predict the per-point class in each block, where each point is represented by a 9-dimensional vector of XYZ, RGB and normalized location with respect to the room (in the range (0, 1)).
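The block preparation described above can be sketched as follows. This is a simplified illustration, not the reference pipeline of [19]: the function name `room_to_blocks` is hypothetical, blocks are non-overlapping cells anchored at the room's minimum corner, and blocks with fewer points than requested are filled by sampling with replacement, which is an assumption.

```python
import numpy as np

def room_to_blocks(xyz, rgb, block_size=1.0, num_points=4096, seed=0):
    """Split one room into square blocks on the XY plane and sample points per block.

    xyz: (N, 3) coordinates in metres, rgb: (N, 3) colors in [0, 1].
    Returns a list of (num_points, 9) arrays: XYZ, RGB, room-normalized XYZ.
    """
    rng = np.random.default_rng(seed)
    room_min = xyz.min(axis=0)
    room_extent = np.maximum(xyz.max(axis=0) - room_min, 1e-9)   # avoid division by zero
    cell = np.floor((xyz[:, :2] - room_min[:2]) / block_size).astype(int)
    blocks = []
    for key in np.unique(cell, axis=0):
        idx = np.flatnonzero((cell == key).all(axis=1))
        # Assumed policy: sample with replacement only when the block is too small.
        chosen = rng.choice(idx, size=num_points, replace=idx.size < num_points)
        normalized = (xyz[chosen] - room_min) / room_extent
        blocks.append(np.hstack([xyz[chosen], rgb[chosen], normalized]))
    return blocks

rng = np.random.default_rng(1)
xyz = rng.uniform([0.0, 0.0, 0.0], [2.0, 2.0, 3.0], size=(1000, 3))   # toy 2 m x 2 m room
rgb = rng.uniform(size=(1000, 3))
blocks = room_to_blocks(xyz, rgb, num_points=64)
```

Each returned array has nine columns per point, matching the XYZ, RGB and normalized-location representation in the text.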
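| | PointNet [19] | DGCNN [32] | [32] (FIX) | [32]+AFF | [32]+DEF | [26] (FIX) | [26]+DEF |
|---|---|---|---|---|---|---|---|
| Avg. mIOU | 47.7 | 56.1 | 56.0 | 56.9 | 57.2 | 54.1 | 55.5 |

| Method | ceiling | floor | wall | beam | column | window | clutter |
|---|---|---|---|---|---|---|---|
| [32] (FIX) | 92.5 | 93.1 | 76.1 | 51.0 | 41.7 | 49.6 | 46.8 |
| [32]+AFF | 92.7 | 93.6 | 76.7 | 52.6 | 41.2 | 48.7 | 47.8 |
| [32]+DEF | 92.8 | 93.6 | 76.8 | 52.9 | 41.1 | 49.0 | 48.0 |

| Method | door | table | chair | sofa | bookcase | board |
|---|---|---|---|---|---|---|
| [32] (FIX) | 63.4 | 61.8 | 43.1 | 23.3 | 42.0 | 43.5 |
| [32]+AFF | 63.7 | 63.4 | 45.1 | 27.0 | 41.3 | 44.8 |
| [32]+DEF | 63.5 | 64.2 | 45.2 | 28.1 | 41.7 | 46.1 |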
Network architecture. The network architecture is based on DGCNN [32] and is the same as in Section 4.2, except that the dimension of the final segmentation label is changed to 13.
Result and analysis. In Table 3, we report the performance of the affine and deformable spatial transformer networks, and compare them with our fix graph baseline and several other state-of-the-art methods. Compared with our fix graph baseline, the affine spatial transformer achieves a 0.9% average mIOU improvement, while the deformable one achieves a 1.2% average mIOU improvement. Compared with the dynamic graph network [32], the deformable spatial transformer is also 1.1% higher. Our deformable spatial transformer beats all other state-of-the-art methods.
From the results, we draw a similar conclusion as in the part segmentation experiments: when the point cloud is given more freedom to deform (from the affine to the deformable spatial transformer) based on the transformation of the original location and the feature projection, the segmentation performance improves. However, when the high-dimensional point feature is used directly to find the affinity matrix, the performance drops due to lack of regularization.
Figure 6 depicts qualitative results for semantic segmentation of our deformable transformation learning module. Our network is able to output smooth predictions and is robust to missing points and occlusions.
4.4 Detection
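| Method | Bird’s eye view: Easy | Bird’s eye view: Medium | Bird’s eye view: Hard | 3D: Easy | 3D: Medium | 3D: Hard |
|---|---|---|---|---|---|---|
| VoxelNet [36] | 77.3 | 59.6 | 51.6 | 43.8 | 32.6 | 27.9 |
| VoxelNet + fix graph | 84.3 | 67.2 | 59.0 | 45.7 | 34.5 | 32.4 |
| VoxelNet + deformable | 85.3 | 69.1 | 60.9 | 46.1 | 35.9 | 34.0 |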
We also explore how the proposed methods perform in point cloud detection. We evaluate on the KITTI 3D object detection benchmark [9], which contains 7,481 training images/point clouds and 7,518 test images/point clouds, covering three categories: Car, Pedestrian, and Cyclist. For each class, detection outcomes are evaluated at three difficulty levels: easy, moderate, and hard, which are determined according to the object size, occlusion state and truncation level. We follow the evaluation protocol in VoxelNet [36] and report the car detection results on the validation set.
Network architecture. As shown in Figure 4, the network takes the raw point cloud as input and partitions the points into voxels. We add deformable spatial transformers to the point cloud location, so the grouped points in each voxel are represented as point features. There are two deformable feature learning layers, each consisting of several subgraphs. Note that the voxel partition is based on the original point cloud location. Then, as in VoxelNet, the point features in each voxel are fed to voxel feature encoding layers to get sparse 4D tensors representing the space. The convolutional middle layers process the 4D tensors to further aggregate spatial context. Finally, an RPN generates the 3D detection.
We report the performance of three networks: (1) the VoxelNet baseline [36]; (2) our fix graph baseline, where we use the original point cloud location to learn the point feature in place of the spatial transformer blocks; (3) the deformable spatial transformer network discussed above.
Result and analysis. Table 3 reports car detection results on the KITTI validation set. (Footnote 1: As the original authors did not provide code, we use the third-party implementation at https://github.com/qianguih/voxelnet and obtain lower results than those reported in the original paper.) Compared with the baseline, having a point feature learning module improves the performance by 7.3% and 2.8% on average for birds’ eye view and 3D detection, respectively. With the deformable module, the average improvement over the VoxelNet baseline rises to 8.9% and 3.9% for birds’ eye view and 3D detection, respectively. We observe a performance boost with our deformable spatial transformer.
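The average improvements quoted above can be reproduced from the table entries. The short script below is only a check of the arithmetic, assuming the gain is the difference between scores averaged over the three difficulty levels; the dictionary layout and function names are chosen for this example.

```python
# Scores per difficulty level (easy, medium, hard), copied from the detection table.
results = {
    "VoxelNet": {"bev": (77.3, 59.6, 51.6), "3d": (43.8, 32.6, 27.9)},
    "VoxelNet + fix graph": {"bev": (84.3, 67.2, 59.0), "3d": (45.7, 34.5, 32.4)},
    "VoxelNet + deformable": {"bev": (85.3, 69.1, 60.9), "3d": (46.1, 35.9, 34.0)},
}


def mean(values):
    return sum(values) / len(values)


def average_gain(method, view, baseline="VoxelNet"):
    """Difference of difficulty-averaged scores between a method and the baseline."""
    return mean(results[method][view]) - mean(results[baseline][view])


gains = {
    (method, view): round(average_gain(method, view), 1)
    for method in ("VoxelNet + fix graph", "VoxelNet + deformable")
    for view in ("bev", "3d")
}
for (method, view), gain in gains.items():
    print(f"{method:22s} {view:3s} +{gain}")
```

Under this assumption the script gives 7.3 and 2.8 for the fix graph model and 8.9 and 3.9 for the deformable model, matching the figures in the text.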
4.5 Ablation Studies
Influence of different components in deformable transformation. As in Equation 7, the deformable spatial transformer consists of two components: an affine transformation on the point location and a three-dimensional projection of the high-dimensional feature. Figure 7 depicts the performance of the different components of the deformable transformation learning module. We observe an average mIOU improvement for both the affine-only and the feature-only spatial transformer, while the deformable spatial transformer (the combination of both) gives the highest performance boost.
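The three variants compared in this ablation can be illustrated with a small sketch. It assumes the two components are combined additively, that is, each point p with feature f is mapped to A p + C f; the matrix names `A` (3x3) and `C` (3xd), the function names and the toy values are assumptions for this example and are not taken from the released code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def deformable_transform(points, features, A, C):
    """Return A @ p + C @ f for every point p with feature f."""
    transformed = []
    for p, f in zip(points, features):
        affine_part = matvec(A, p)    # affine transformation of the location
        feature_part = matvec(C, f)   # 3D projection of the high-dimensional feature
        transformed.append([a + b for a, b in zip(affine_part, feature_part)])
    return transformed


points = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
features = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # toy 4-dimensional features
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
C = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, -1.0]]
zero_A = [[0.0] * 3 for _ in range(3)]
zero_C = [[0.0] * 4 for _ in range(3)]

deformed = deformable_transform(points, features, A, C)           # both components
affine_only = deformable_transform(points, features, A, zero_C)   # location term only
feature_only = deformable_transform(points, features, zero_A, C)  # feature term only
print(deformed, affine_only, feature_only)
```

Setting one of the two matrices to zero recovers the affine-only and feature-only settings, which is how the ablation isolates each component.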
fix graph  1 graph  2 graphs  4 graphs  

= 32  84.2  84.9  85.2  85.3 
= 64  84.2  85.3  85.2  83.5 
Influence of the number of transformation modules. Table 5 shows the performance for different numbers of deformable transformation modules. When the subfeature dimension is fixed, the more graphs in each layer, the higher the performance. With limited resources (the product of the number of subgraphs and the subfeature dimension to be ), the best performance is achieved at .
Model size and timing. Table 6 shows that, with the same model size and almost the same test time, a significant performance gain can be achieved. We increase the number of channels in the fix graph baseline model for all experiments for a fair comparison. Note that even without increasing the number of parameters of the baseline, adding spatial transformers only increases the number of parameters by 0.1%.
4.6 Visualization and Analysis
Global view of the deformable transformation. Figure 8 depicts some examples of the learned deformable transformation in ShapeNet part segmentation. We observe that each graph at a given layer aligns the input 3D shape with a similar semantic geometric transformation.
Local view of the deformable transformation. Point cloud data is usually not sampled in a balanced way, which makes point cloud convolution challenging, as the NN graph does not accurately represent the exact neighborhood and 3D structure information. Our deformable spatial transformer gives every point flexibility and in turn can capture a better affinity matrix and find better local patches, but can it implicitly make the point cloud closer to balanced sampling?
Figure 9 shows the local view of a skateboard sample: after the deformable transformation, the points are deformed to be more uniformly distributed. We also analyze the standard deviation of the original data and of the transformed point cloud in the ShapeNet dataset. After transformation, the variance of the data decreases, which is consistent with a more balanced sampling distribution of the transformed points.
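One possible way to quantify how balanced a sampling is can be sketched as follows. The text does not specify the exact statistic, so this example uses a proxy that is purely an assumption: the standard deviation of nearest-neighbor distances, which is zero for evenly spaced points and grows as the spacing becomes uneven. The function names and the toy point sets are hypothetical.

```python
import math
import statistics


def nearest_neighbor_distances(points):
    """Distance from each point to its closest other point (brute force)."""
    distances = []
    for i, p in enumerate(points):
        distances.append(
            min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        )
    return distances


def sampling_spread(points):
    """Proxy for sampling imbalance: std of nearest-neighbor distances."""
    return statistics.pstdev(nearest_neighbor_distances(points))


# Toy point sets along one axis: clustered spacing versus even spacing.
uneven = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
even = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 0.0, 0.0)]

print(sampling_spread(uneven), sampling_spread(even))
```

Under this proxy, a transformed point cloud with a lower spread than the input would indicate a more balanced sampling.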
Dynamic neighborhood visualization. To illustrate how our spatial transformers learn diverse neighborhoods for 3D shapes, we show the nearest neighbors of two query points and use corresponding colors to indicate corresponding neighborhoods. (1) As shown in Fig. 10, neighborhoods retrieved from the deformed shape encode additional semantic information, compared to neighborhoods from 3D coordinates. (2) As shown in additional graph visualizations (Fig. 11) of table and earphone, different graphs enable the network to learn from diverse neighborhoods without incurring additional computational burden.
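The effect of the transformation on neighborhood retrieval can be illustrated with a brute-force nearest-neighbor query. This is a minimal sketch under stated assumptions: the function name `knn_indices`, the toy coordinates and the hand-picked deformed coordinates are hypothetical, and a learned transformer would produce the deformed coordinates instead.

```python
import math


def knn_indices(coords, query_index, k):
    """Indices of the k nearest neighbors of one query point, excluding itself."""
    query = coords[query_index]
    others = [i for i in range(len(coords)) if i != query_index]
    others.sort(key=lambda i: math.dist(coords[i], query))
    return others[:k]


# Original 3D coordinates of four points; point 0 is the query.
original = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.6, 0.0)]
# Hypothetical coordinates after a deformable transformation:
# point 1 is pushed away from the query and point 3 is pulled closer.
deformed = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.4, 0.0)]

neighbors_original = knn_indices(original, query_index=0, k=2)
neighbors_deformed = knn_indices(deformed, query_index=0, k=2)
print(neighbors_original, neighbors_deformed)
```

The same query point retrieves a different local patch in each coordinate system. With several graphs per layer, each graph would run this query on its own transformed coordinates.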
5 Conclusion
In this work, we propose novel spatial transformers on 3D point clouds for altering the “local patch” in different point cloud processing tasks. We analyze different transformations and their influence on the affinity matrix and point “local patches”. We further propose one linear spatial transformer (affine) and two nonlinear spatial transformers (projective and deformable). We also show that the spatial transformers can be easily added to existing point cloud processing networks. We evaluate the performance of the proposed spatial transformers on two point cloud networks (point-based [32] and sampling-based [26]) on three large-scale 3D point cloud processing tasks (part segmentation, semantic segmentation and detection). In addition to beating all other state-of-the-art methods, our spatial transformers also achieve higher performance than their fix graph counterparts. Future work could design better nonlinear spatial transformers for point clouds, and explore other methods for dynamic local patches in point cloud processing.
References
 [1] (2010) Fast high-dimensional filtering using the permutohedral lattice. In Computer Graphics Forum, Vol. 29, pp. 753–762. Cited by: §2, §3.3, §4.2.2.
 [2] (2016) 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543. Cited by: §4.3, §4.3.
 [3] (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §4.2.
 [4] (2018) LiDAR-video driving dataset: learning driving policies effectively. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5870–5878. Cited by: §1.
 [5] (2018) ScanComplete: large-scale scene completion and semantic segmentation for 3d scans. In CVPR, Vol. 1, pp. 2. Cited by: §1.
 [6] (2017) Deformable convolutional networks. CoRR, abs/1703.06211 1 (2), pp. 3. Cited by: §3.4, §3.4, §3.4.
 [7] (2018) GVCNN: group-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 264–272. Cited by: §2.
 [8] (2018) Multiresolution tree networks for 3d point cloud processing. arXiv preprint arXiv:1807.03520. Cited by: §1.
 [9] (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: §4.4.
 [10] (2019) SeqViews2SeqLabels: learning 3d global features via aggregating sequential views by rnn with attention. IEEE Transactions on Image Processing 28 (2), pp. 658–672. Cited by: §2.
 [11] (2016) Learning sparse high dimensional filters: image filtering, dense crfs and bilateral neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4452–4461. Cited by: §2, §3.3, §4.2.2.
 [12] (2017) 3D shape segmentation with projective convolutional networks. In Proc. CVPR, Vol. 1, pp. 8. Cited by: §1.
 [13] (2017) Escape from cells: deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 863–872. Cited by: §2.
 [14] (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4558–4567. Cited by: §2.
 [15] (2018) So-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406. Cited by: §2.
 [16] (2018) PointCNN: convolution on x-transformed points. In Advances in Neural Information Processing Systems, pp. 820–830. Cited by: §1, §2, §2, §3.3, Table 2.
 [17] (2018) A novel campus navigation app with augmented reality and deep learning. In 2018 IEEE International Conference on Applied System Invention (ICASI), pp. 1075–1077. Cited by: §1.
 [18] (2015) Voxnet: a 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. Cited by: §2.
 [19] (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1 (2), pp. 4. Cited by: §1, §1, §2, §3.3, §4.2.1, §4.3, Table 1, Table 2, Table 3.
 [20] (2016) Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5648–5656. Cited by: §1, §2.
 [21] (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108. Cited by: §1, §1, §3.3, §4.2.1, Table 2.
 [22] (2016) Learning to fuse: a deep learning approach to visual-inertial camera pose estimation. In Mixed and Augmented Reality (ISMAR), 2016 IEEE International Symposium on, pp. 71–76. Cited by: §1.
 [23] (2018) Fully-convolutional point networks for large-scale point clouds. arXiv preprint arXiv:1808.06840. Cited by: §1, §4.2.1, Table 2.
 [24] (2017) Octnet: learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 3. Cited by: §1, §2.
 [25] (2018) Stereo vision-based semantic 3d object and ego-motion tracking for autonomous driving. arXiv preprint arXiv:1807.02062. Cited by: §1.
 [26] (2018) Splatnet: sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2530–2539. Cited by: §2, §3.2, §3.3, §4.1, §4.2.2, §4.2.2, §4.2.2, Table 1, Table 2, Table 3, Table 6, §5.
 [27] (2015) Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pp. 945–953. Cited by: §1, §2.
 [28] (2017) Octree generating networks: efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2088–2096. Cited by: §2.
 [29] (2018) Factoring shape, pose, and layout from the 2d image of a 3d scene. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 302–310. Cited by: §1.
 [30] (2018) Occlusion-aware rolling shutter rectification of 3d scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 636–645. Cited by: §1.
 [31] (2017) O-cnn: octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG) 36 (4), pp. 72. Cited by: §1, §2.
 [32] (2018) Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829. Cited by: §1, §2, §2, §3.2, §3.3, §3.3, §3.4, §3.4, §4.1, §4.2.1, §4.3, §4.3, Table 1, Table 2, Table 3, §5.
 [33] (2018) Squeezeseg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1887–1893. Cited by: §1.
 [34] (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §1, §2, §4.1.
 [35] (2018) Learning and matching multi-view descriptors for registration of point clouds. arXiv preprint arXiv:1807.05653. Cited by: §1.
 [36] (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. Cited by: §3.3, item 1, §4.4, Table 4.