Spatial Transformer for 3D Points

06/26/2019 ∙ by Jiayun Wang, et al.

Point clouds are an efficient representation of 3D visual data and enable deep neural networks to effectively understand and model the 3D visual world. Previous methods all use the same original point cloud locations at different layers of the network to define "local patches"; from the neighborhood of each local patch they learn local features and finally form a feature map. Though easy to implement, this is not necessarily optimal: the appropriate "local patch" structure changes from layer to layer. One should therefore learn a transformation of the original point cloud at each layer and learn the feature maps from "local patches" on the transformed coordinates. In this work, we propose a novel approach that learns non-rigid transformations of the input point cloud at each layer. We propose both linear (affine) and non-linear (projective, deformable) spatial transformers for 3D point clouds, and our proposed method outperforms its fixed-point counterparts on benchmark point cloud segmentation datasets.







1 Introduction

Figure 1:

From 3D point cloud to semantics. We propose spatial transformers on point clouds that can be easily added to existing point cloud processing networks. The transformer learns class-specific transformations of the point cloud, builds the affinity matrix (usually based on a k-NN graph), derives local patches, and then applies point convolution. Corresponding transformers capture similar geometric transformations for different samples in a category.

Recent years have witnessed the emergence and increasing popularity of 3D computer vision for understanding the 3D world, driven by the development of 3D sensors and technology. Point cloud data obtained from 3D sensors such as LiDAR and stereo cameras is an efficient representation of 3D data. Analysis of 3D point clouds is fundamental to a variety of applications in 3D computer vision, including many significant tasks such as virtual/augmented reality [17, 22], 3D scene understanding [29, 30, 5], and autonomous driving [4, 25, 33].

Standard CNNs cannot be directly applied to 3D point cloud data because defining neighbors of irregular points is not as easy as in images. One straightforward workaround is to convert the irregular 3D points to a regular format, such as a voxel representation [34, 24, 31] or view projections [27, 20, 12, 35], and then use CNNs for analysis. Recently, network architectures [19, 21, 23, 8] that work directly on 3D point clouds have been developed. Similar to CNNs, given a set of points, the point "convolution layer" first needs to find the "local patch" of each input point from the point affinity matrix (defined as the adjacency matrix of the dense graph constructed from the point cloud). It then learns local features from the patch, and finally forms the feature map. By stacking such basic point convolution layers, the network can extract information from the point cloud at different levels.

Nonetheless, unlike 2D images where local patches are naturally determined, defining local patches for 3D point clouds is not easy. The local patches should cope with the complicated geometric transformations of 3D shapes. Using a nearest neighbor graph in Euclidean space to define local patches is adopted by most methods [19, 21, 32, 16]. This may not be optimal, as (1) Euclidean distance may not capture the geometric transformations of different 3D shapes; and (2) different layers usually target different levels of information, and a fixed nearest neighbor graph constrains the changes across these levels. We propose to dynamically learn the point affinity matrix for finding local patches. The affinity matrix should be learned from both the point location and the current feature map.

In order to transform the point cloud for defining local neighborhoods, a naive approach is to learn a function, f, that generates new point set coordinates from the original point set location and the current feature map. However, if f is unconstrained the learning can be nontrivial. An alternative is to put a smoothness constraint on the function f. Observe that an isometric transformation (e.g. rigid) cannot change the topology. Hence, in this work, we use non-rigid transformations, including linear and non-linear transformations, to model f. In other words, we learn spatial transformers that use the point cloud coordinates and feature map to generate a new point cloud for the affinity matrix. Then, we find "local patches" based on the affinity matrix (usually a k-NN graph measured by Euclidean distance). Our proposed scheme learns an affine spatial transformation p' = Ap, where A is an affine matrix. We also propose the projective spatial transformer and the deformable spatial transformer as non-linear transformations, with their respective transformation matrices.

In summary, our main contributions are: (1) proposing linear (affine) and non-linear (projective, deformable) spatial transformers on point clouds for learning the affinity matrix and local patches; (2) showing the proposed module can be easily added to existing point cloud networks for different tasks (segmentation, detection); (3) applying the spatial transformer to both point-based and sampling-based point cloud processing networks, and observing improved performance compared to their fixed-graph counterparts.

2 Related Works

View-based and voxel-based networks. View-based methods project 3D shapes to the 2D plane and use a group of images from different views as the representation. Taking advantage of the power of CNNs in 2D image processing [7, 10, 20, 27], view-based methods have achieved reasonable 3D processing performance. Yet the geometric shape information gets lost when projecting from 3D to 2D.

Representing 3D shapes as volumetric data on a regular 3D grid and processing them with 3D convolutions has also been adopted by many works [34, 18]. However, quantization artifacts, inefficient use of the 3D voxels and the low resolution imposed by computational capacity severely limit volumetric methods. Furthermore, 3D convolution usually operates away from the surface and cannot capture much of the 3D shape information. Recent works that apply different partition strategies [13, 24, 28, 31] partly relieve these issues but still depend on bounding volume subdivision rather than fine-grained local geometric shape. Our work directly takes the 3D point cloud as input to minimize geometric information loss and to maximize processing efficiency.

Deep learning on point cloud. Some deep neural networks directly take a point cloud as input and learn semantic/abstract information through point processing operations. As a pioneering work, PointNet [19] directly learned an embedding of every isolated 3D point and gathered that information by pooling later on. Although successful, PointNet did not learn any local geometric information of the 3D shape. PointNet++ [21] relieved this by hierarchically applying isolated 3D point feature learning to multiple subsets of the point cloud data. Many other works also explored different strategies for leveraging local structure learning of point cloud data [32, 16]. Instead of finding neighbors of each point, SplatNet [26] encodes local structure from a sampling perspective: it groups points based on permutohedral lattices [1], and then applies bilateral convolution [11] for feature learning. Super-point graphs [14] partition the point cloud into super-points and learn the 3D point geometric organization.

Point clouds are defined on an irregular grid, and regular convolution operations cannot be directly applied. Many works [32, 16, 15] aim at designing point convolution operations for point cloud data. However, most directly use the original input point cloud to find the local patches for the convolution operation. 3D shapes have diverse geometric transformations, and efficient learning requires the point convolution operation to be invariant to such transformations. Using a fixed 3D shape at all layers to find local patches greatly limits the network's flexibility in handling this issue. In contrast, our work proposes spatial transformers on the point cloud to capture geometric transformations more efficiently.

3 Methods

In this section, we briefly review different transformation methods and their influence on the affinity matrix, followed by the design of our three spatial transformers (affine, projective and deformable). Finally, we introduce how the transformers can be added to existing point cloud networks and discuss the relevance to other works.

3.1 Transformation and Affinity Matrix

We propose to learn transformations of the original point cloud to "deform" the original geometric shape, and build new affinity matrices based on k-nearest-neighbor (k-NN) graphs. The new affinity matrix dynamically alters the "local patch" used by the "convolution" for point feature learning.
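As a concrete illustration, a k-NN affinity matrix can be built from pairwise Euclidean distances. The sketch below is our own minimal NumPy version (function name and boolean-matrix representation are ours, not the paper's implementation):

```python
import numpy as np

def knn_affinity(points, k):
    """Boolean k-NN affinity (adjacency) matrix for an (N, 3) point cloud."""
    n = len(points)
    # Pairwise squared Euclidean distances, shape (N, N).
    diff = points[:, None, :] - points[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)           # exclude self-matches
    idx = np.argsort(dist, axis=1)[:, :k]    # k nearest neighbors per point
    affinity = np.zeros((n, n), dtype=bool)
    affinity[np.repeat(np.arange(n), k), idx.ravel()] = True
    return affinity
```

In a network this graph would be rebuilt at every layer on the transformed points, which is exactly what changes the "local patch" from layer to layer.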

Figure 2: Different transformation methods.

As in Figure 2, transformations can be categorized into rigid and non-rigid transformations, and non-rigid transformations can be further partitioned into linear and non-linear transformations. We briefly review different transformation methods below.

Rigid transformations: The group of rigid transformations consists of translations and rotations. However, rigid transformations are isometric and hence preserve the affinity matrix. Thus, local patches are "invariant" to rigid transformations in terms of the k-NN graph, and we do not consider these transformations in this paper.

Affine transformations: An affine transformation is a non-rigid linear transformation. Consider a 3D point cloud P which consists of N three-dimensional points p ∈ R^3. An affine transformation is parametrized by an invertible matrix A ∈ R^{3×3} and a translation vector b ∈ R^3. Given p, the affine-transformed point is p' = Ap + b. Note that the translation does not change the k-NN graph. Affine transformations preserve collinearity, parallelism and convexity.
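For instance (a toy NumPy sketch with random stand-in values, not the learned transformer), applying an affine map and checking that a pure translation leaves all pairwise distances, and hence the k-NN graph, unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(8, 3))          # toy point cloud, shape (N, 3)

A = rng.normal(size=(3, 3))               # affine matrix (random stand-in)
b = rng.normal(size=3)                    # translation vector

affine_pts = points @ A.T + b             # p' = A p + b for every point

def pdist(x):
    """Pairwise Euclidean distance matrix."""
    return np.linalg.norm(x[:, None] - x[None, :], axis=-1)

# A pure translation preserves every pairwise distance, so the k-NN
# graph (and thus the affinity matrix) is unchanged by b.
assert np.allclose(pdist(points + b), pdist(points))
```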

Projective transformations: A projective transformation (or homography) is a non-rigid, non-linear transformation. We first map the 3D points to homogeneous space, appending a one as the last dimension. The projective transformation is parametrized by a matrix in homogeneous coordinates, and the transformed point is obtained by dividing by the last (homogeneous) coordinate. Compared to affine transformations, projective transformations have more freedom but do not preserve parallelism.
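A minimal NumPy sketch of this mapping (our own illustrative helper; the function name and the 4×4 matrix shape are assumptions consistent with homogeneous coordinates for 3D points):

```python
import numpy as np

def projective_transform(points, H):
    """Apply a 4x4 homography H to an (N, 3) cloud via homogeneous coords."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    out = homo @ H.T
    return out[:, :3] / out[:, 3:4]       # divide by homogeneous coordinate
```

When the last row of H is (0, 0, 0, 1) this reduces exactly to an affine transformation; any other last row introduces the non-linear division step that breaks parallelism.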

Deformable transformations:

When all the points have the freedom to move without much constraint, the 3D shape can deform freely. We refer to this as a deformable transformation. It has more degrees of freedom and does not preserve the topology. We learn the deformable transformation from both the point location and the feature, as described in the following subsection.

3.2 Spatial Transformer

We propose to transform the original point cloud to obtain dynamic local patches. Our spatial transformers can thus be applied to other existing point cloud processing networks. We briefly introduce our affine, projective and deformable spatial transformers as follows.

Affine: The spatial transformer can be applied to different layers of a point cloud processing network. Suppose that at layer l, the block contains G spatial transformers and corresponding feature learning modules. Each spatial transformer first learns a transformation sub-graph and calculates the corresponding sub-feature. Finally, we concatenate the sub-features of all transformers to form the final feature output of the learning block.

Suppose the spatial transformer for sub-graph g at layer l takes as input the original point cloud P ∈ R^{N×3} and the previous feature map F^{l-1}. We first form the transformed points P^{l,g} from P as:

P^{l,g} = P A^{l,g} + b^{l,g}.   (1)

As the affinity matrix is invariant to the translation b^{l,g} and to a uniform scale of A^{l,g}, we set b^{l,g} = 0 for simplicity. Equation 1 can hence be simplified to:

P^{l,g} = P A^{l,g},   (2)

where A^{l,g} ∈ R^{3×3}. We then apply k-NN to each set of transformed points to obtain the affinity matrix M^{l,g}. For every affinity matrix M^{l,g}, we can define local patches for the point cloud and thus apply the point convolution operation on the previous point cloud feature map to get the point cloud feature of the sub-graph:

F^{l,g} = CONV(F^{l-1}, M^{l,g}, k),   (3)

where CONV is the point convolution operation: it takes (a) the previous point cloud feature, (b) the affinity matrix (for defining the local patch of each point) and (c) the number of neighbors k (for defining the size of local patches) as input. In some point convolution operations (such as [32]), the affinity matrix alters the input feature in a non-differentiable way. We therefore concatenate the transformed point cloud to the input feature so that gradients can back-propagate to the transformation matrix A^{l,g}. In other, sampling-based convolution operations (such as bilateral convolution [26]), the affinity matrix changes the input feature in a differentiable way, so no extra operation is needed.
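In edge-convolution style, a simplified NumPy sketch of such a point "convolution" step (in the spirit of [32]; the single shared linear layer, ReLU and max-pooling choices here are our own simplifications, not the exact published operator):

```python
import numpy as np

def point_conv(features, neighbor_idx, weight):
    """Toy point 'convolution'.

    features: (N, C_in), neighbor_idx: (N, k) indices of each point's
    local patch, weight: (2*C_in, C_out). Builds edge features
    [x_i, x_j - x_i], applies a shared linear layer + ReLU, then
    max-pools over the k neighbors of each point.
    """
    k = neighbor_idx.shape[1]
    x_i = np.repeat(features[:, None, :], k, axis=1)   # (N, k, C_in)
    x_j = features[neighbor_idx]                       # (N, k, C_in)
    edge = np.concatenate([x_i, x_j - x_i], axis=-1)   # (N, k, 2*C_in)
    out = np.maximum(edge @ weight, 0.0)               # shared MLP + ReLU
    return out.max(axis=1)                             # (N, C_out)
```

Note that only `neighbor_idx` depends on the transformed points; the feature tensor itself still comes from the previous layer, which is exactly the separation the spatial transformer exploits.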

For all G sub-graphs in layer/block l, we learn point cloud features F^{l,1}, …, F^{l,G}. The output of this learning module is the concatenation of all the sub-graph point cloud features:

F^l = [F^{l,1}, …, F^{l,G}],   (4)

where F^{l,g} ∈ R^{N×d} and F^l ∈ R^{N×Gd}. When implementing, we randomly initialize A^{l,g} from a standard normal distribution. When computing the transformed point cloud location in Equation 2, we normalize the transformation matrix by its norm, since the affinity matrix is invariant to a uniform scale of the transformation.
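The normalization is harmless because a uniform scale rescales all pairwise distances by the same factor, leaving the nearest-neighbor ordering intact. A quick NumPy check (our own, with arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.normal(size=(10, 3))
s = 5.0                                   # any positive global scale

def pairwise(x):
    """Pairwise Euclidean distance matrix."""
    return np.linalg.norm(x[:, None] - x[None, :], axis=-1)

# Nearest-neighbor ordering is identical under uniform scaling, so the
# k-NN affinity matrix is unchanged when the matrix is divided by its norm.
assert (np.argsort(pairwise(pts), axis=1)
        == np.argsort(pairwise(s * pts), axis=1)).all()
```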

Projective: Analogous to the affine transformation learning module, for sub-graph g at layer l, we first apply a projective transformation to the point cloud in homogeneous coordinates and get the transformed point cloud as:

P̃^{l,g} = P̃ H^{l,g},   (5)

where H^{l,g} ∈ R^{4×4} is the transformation matrix in homogeneous coordinates and P̃ is P in homogeneous coordinates. We then divide by the last (homogeneous) coordinate to return to R^3, follow the same point cloud feature learning as defined in Equation 3, and concatenate as in Equation 4 to get the output feature of the layer.

Deformable: Affine and projective transformations are useful for transforming the original point cloud, altering the affinity matrix, and providing learnable "local patches" for the point convolution operation at different layers. Nonetheless, affine transformations are linear, so their ability to alter the affinity matrix and local patches is limited. Although the projective transformation has more flexibility than the affine one in the sense that parallelism is not preserved, the restriction that it maps straight lines to straight lines makes it not general enough to capture all possible deformations. To alleviate this problem and capture more geometric transformations of the point cloud, we propose a non-linear transformation module: the deformable transformation module.

The deformable transformation at layer l and sub-graph g can be written as:

P^{l,g} = P A^{l,g} + D^{l,g},   (6)

where A^{l,g} is the affine transformation, and the deformation matrix D^{l,g} ∈ R^{N×3} gives every point the freedom to move, so the geometric shape of the whole point cloud has the flexibility to deform. Note that the deformation matrix differs from the translation vector b^{l,g} in Equation 1 and significantly changes the local patch.

As a self-supervised learning procedure, the spatial transformer parameters are learned from both the point cloud location and the feature. Since the affine transformation A^{l,g} already captures the location information, we use the deformation matrix to capture feature map changes: D^{l,g} = F^{l-1} W^{l,g}, where F^{l-1} ∈ R^{N×d} is the feature map of the previous layer and W^{l,g} ∈ R^{d×3} projects the feature from R^d to R^3. Hence, the deformable transformation in Equation 6 can be simplified to:

P^{l,g} = [P, F^{l-1}] [A^{l,g}; W^{l,g}],   (7)

where [A^{l,g}; W^{l,g}] is the concatenation of the affine and deformable transformation matrices, capturing both the point cloud location and the feature map projection.
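A minimal NumPy sketch of this simplified deformable transformation, Equation 7 (shapes and the name W for the feature-to-offset projection are our own illustrative choices):

```python
import numpy as np

def deformable_transform(points, features, A, W):
    """points: (N, 3), features: (N, C), A: (3, 3), W: (C, 3).

    Row-vector convention: new location = p A + f W, i.e. an affine
    part on the coordinates plus a per-point offset projected from
    the previous layer's feature map.
    """
    return points @ A + features @ W
```

With W = 0 this degenerates to the pure affine transformer, and with A = I it moves points only according to their features; the sum lets both sources of information shape the new k-NN graph.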

After we compute the transformed point locations P^{l,g}, we follow Equations 3 and 4 to learn the feature of each sub-transformation graph, and concatenate them as the final output feature of layer l.

For the deformable spatial transformer, we decompose the entire transformation into two parts: P A^{l,g} and F^{l-1} W^{l,g}. The former is an affine transformation of the point 3D coordinates, while the latter is a transformation of the point feature. The transformation of the point spatial location captures the linear transformation information of the point cloud, and the feature transformation captures relatively high-level semantic information. The deformable transformation sums the two sources of information. Section 4.5 provides more empirical analysis of these two components.

3.3 Spatial Transformer Networks

The spatial transformer blocks discussed above dynamically transform the point cloud and change the local patches for the point convolution operation. The transformer can be easily added to existing point cloud networks. We first introduce how a general point cloud processing network with the transformer module works, and then provide three applications in different networks and tasks as examples.

Point cloud classification/segmentation/detection: Figure 3 depicts a general network architecture for point cloud segmentation. Suppose it is a C-class segmentation task with an input point cloud in R^{N×3} consisting of N points. Our network consists of several spatial transformers at different layers. For layer l, we learn G transformation matrices to apply to the original point cloud location P, and compute the corresponding affinity matrices (e.g. based on k-NN graphs as in the edge convolution [32] for point clouds). For each sub-transformation, we learn a feature of dimension d; we then concatenate all features in this layer to form an output feature of dimension D = Gd. The output feature serves as the input of the next layer for further feature learning. Note that since different layers can have multiple graphs, the affine/projective transformation matrices are applied only to the original point cloud location P. For the deformable transformation specifically, the deformation matrix applies to the previous feature map, so the feature transformation component is learned progressively. By stacking several such transformation learning blocks and finally a fully connected layer, we can map the input point cloud to a segmentation map of dimension N × C, or downsample to a vector of dimension C for a classification task. We train the network end-to-end with standard optimization methods. For the spatial transformer block in the point cloud detection network (Figure 4), D is the dimension of the output feature.

Figure 3: Point cloud segmentation network with spatial transformers.

Application in Point-based Segmentation Networks. Point-based segmentation networks [21, 19, 16, 32] take a point cloud as input and derive the affinity matrix and local patches from the point locations. For selected points, certain "convolution" operators are applied to the points and their local patches to learn the feature map. We choose the edge convolution from [32] as our baseline since it takes relative point locations as input and achieves state-of-the-art performance. Specifically, we retain all the settings of the method and only insert the spatial transformers to provide new local patches for the edge convolution operation.

Application in Sampling-based Segmentation Networks. SplatNet [26] is a representative sampling-based segmentation network. It also takes a point cloud as input, but uses a permutohedral lattice [1] to group points into lattices and performs learned bilateral filters [11] on the grouped points to obtain features. The permutohedral lattice defines the local patches of each point to make the bilateral convolution possible. We use spatial transformers to deform the point cloud and form new permutohedral lattices on the transformed point sets. The local patches can therefore dynamically cope with the geometric shape of the point cloud. All the other components of SplatNet remain the same.

Figure 4: Detection Network.

Application in Detection Networks. Object detection in 3D point clouds is an important problem in many applications, such as autonomous navigation, housekeeping robots, and AR/VR. LiDAR point clouds, among the most popular sensory data for 3D detection, are usually highly sparse and imbalanced. The proposed spatial transformers specialize in transforming the point cloud for dynamic local patches and have the potential to process LiDAR data efficiently. We use VoxelNet [36], which achieves state-of-the-art performance in 3D object detection on autonomous driving data, as our baseline model. As in Figure 4, we follow all the settings of VoxelNet, but add spatial transformers to the raw point cloud data, before point grouping. The spatial transformer feature learning blocks only change the point features, not the point locations used for grouping; this can be seen as improving the point cloud feature learning process.

3.4 Relevance to Other Works

The transformation learning module is closely related to some previous work. We briefly review its relevance to deformable CNN [6] and DGCNN [32].

Relevance to Deformable CNN. Deformable convolutional networks [6] learn dynamic local patches for 2D images. Specifically, for each location p_0 on the output feature map y, deformable convolution augments the regular sampling grid R with offsets {Δp_n | n = 1, …, |R|}. The convolutional output on input x parameterized by weight w becomes:

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n).

The offset augmentation to the regular grid is very similar to our deformable transformation (Equation 6): we also want to give each point the freedom to move. For 2D images (matrices) defined on a regular grid, the dynamic grid is necessary to model geometric transformations [6]; for 3D point clouds defined on an irregular grid, the dynamic grid is necessary to model even more complicated 3D geometric transformations.

Relevance to Dynamic Graph CNN. The idea of dynamic local patches in point cloud processing has also been explored in DGCNN [32]. For the point convolution operation, they directly use the high-dimensional feature map from the previous layer to construct a dynamic graph for the affinity matrix and local patches, and at each layer they have only one graph. For local patch learning, we instead transform both the point cloud location and the feature (Equation 7) to compute the affinity matrix and local patches. We also have multiple graphs at each layer to deform the point cloud differently, in order to capture different geometric transformations. With a lighter computational burden and more flexibility in geometric transformations, we demonstrate better performance in two semantic segmentation experiments (Sections 4.2 and 4.3).

4 Experiments

In this section, we conduct comprehensive experiments to verify the effectiveness of the proposed spatial transformer. First, we evaluate the transformer on two networks (point-based and sampling-based) for four point cloud processing tasks, beginning with classification on ModelNet40 (Section 4.1). We then conduct ablation studies on the deformable spatial transformer, and visualize and analyze the learned shapes.

4.1 Classification

We conduct experiments on the ModelNet40 3D shape classification dataset [34] to show the effectiveness of the proposed spatial transformer. We evaluate on two baseline methods [32, 26] and adopt the same network architectures, settings and evaluation protocols. Adding the spatial transformer to the point-based and sampling-based methods gives 1% and 2% gains on ModelNet40, respectively (Table 1). If both the training and test data are augmented with random rotations, the spatial transformer gives a 3% gain, since the learned transformations better align the global 3D shape for shape recognition (Fig. 8).

Point-based:     [19] 86.2 | [32] 89.2 | [32] (FIX) 88.8 | AFF 89.3 | DEF 89.9
Point-based (R): [32] (FIX) 85.7 | DEF 88.3
Sampling-based:  [26] (FIX) 86.3 | AFF 87.4 | DEF 88.6
Table 1: Classification accuracy on ModelNet40 dataset.

4.2 Part Segmentation

3D point cloud part segmentation is an important yet challenging fine-grained 3D analysis task: given a 3D point cloud, the task aims at accurately assigning a part category label (e.g. chair leg, cup handle) to each point. We evaluate the proposed modules on the ShapeNet part segmentation dataset [3]. The dataset contains 16,881 shapes from 16 categories, annotated with 50 parts in total, with the number of parts per category ranging from 2 to 6. Ground truth is annotated on every sample.

Figure 5: Qualitative results for part segmentation of deformable spatial transformer.

4.2.1 Point-based method

Network architecture. Point-based segmentation networks take a point cloud as input and derive the affinity matrix and local patches from the point locations, for the "convolution" operation on points. We use a state-of-the-art point-based segmentation network, Dynamic Graph CNN (DGCNN) [32], and follow the same network architecture and evaluation protocol. Specifically, this work uses "edge convolution" as the point convolution operation. The network has 3 convolutional layers, each with an output feature dimension of 64. Additionally, in order to capture different levels of information from the input point clouds, they concatenate all the convolutional features and use several fully connected layers to map the feature to the final segmentation output.

We insert our spatial transformer to alter the local patch definition for the edge convolution operation. We first use the original point cloud location and name this the fix graph baseline. With the affine, projective and deformable spatial transformers defined in Section 3.2, we also obtain point-based affine, projective and deformable networks. DGCNN itself directly uses the learned feature to build the affinity matrix (based on a k-NN graph) to obtain local patches, and we consider this the point-based dynamic graph network.

Under the framework of 3 edge convolution layers, we keep the number of graphs in each layer and the sub-graph feature dimension the same, and search for the best architecture. Due to memory limitations, we report the affine, projective and deformable networks at their best-performing settings. To make a fair comparison, we also increase the number of channels of the fix graph baseline and dynamic networks.

Result and analysis. In Table 2, we report the instance-average mIoU (mean intersection over union), as well as the mIoU of some representative categories in ShapeNet. Compared with the fix graph baseline, the affine, projective and deformable networks respectively achieve 0.5%, 0.2% and 1.1% improvement and beat the fix graph baseline in most categories. In particular, the deformable network gives 8.0%, 8.3% and 4.7% performance boosts over the fix graph baseline on the earphone, rocket and cap categories. Compared with the dynamic graph network, the deformable network improves by 4.0%. We also beat other state-of-the-art methods [19, 21, 23]. Figure 5 qualitatively visualizes part segmentation results of the fix graph baseline and our deformable spatial transformer. The deformable spatial transformer produces smoother predictions and achieves better performance than the fix graph baseline.

From affine to projective to deformable networks, the performance increases as the degree of freedom goes up. The projective spatial transformer, however, has similar or worse performance than the affine one; we believe the mapping to homogeneous coordinates may inhibit the ability to capture geometric transformations. When the freedom increases further and the learned feature is directly used to define the affinity matrix and find local patches (dynamic graph), the performance drops. We believe this shows the need for both point location and feature when learning the affinity matrix, rather than reusing the high-dimensional point cloud feature alone.

Avg. aero bag cap earphone lamp rocket
# shapes 2690 76 55 69 1547 66
3DCNN [19] 79.4 75.1 72.8 73.3 63.5 74.4 51.2
PointNet[19] 83.7 84.3 78.7 82.5 73.0 80.8 57.9
PointNet++ [21] 85.0 82.4 79.0 87.7 71.8 83.7 58.7
FCPN [23] 81.3 84.0 82.8 86.4 73.6 77.4 68.4
DGCNN [32] 81.3 84.0 82.8 86.4 73.6 77.4 68.4
Point-based [32] fixgraph 84.2 83.7 82.3 84.0 69.9 82.5 56.0
Point-based affine 84.7 84.1 83.5 86.9 72.5 83.3 60.9
Point-based projective 84.4 84.3 84.2 88.5 72.8 81.7 61.6
Point-based deformable 85.3 84.6 83.3 88.7 77.9 83.5 64.3
PointCNN [16] 84.9 82.7 82.8 82.5 75.8 82.6 61.5
PointCNN deformable 85.8 83.4 86.6 85.5 78.5 84.2 65.0
Sampling-based baseline [26] 84.6 81.9 83.9 88.6 73.5 84.5 59.2
Sampling-based deformable 85.2 82.9 83.8 87.6 73.0 85.7 65.1
Table 2: Part segmentation results on the ShapeNet PartSeg dataset. Metric is mIoU (%) on points. Compared with several other methods, our method achieves the SOTA in average mIoU.

4.2.2 Sampling-based method

Sampling-based point cloud processing methods group 3D points first, and then conduct convolution on the grouped points. SplatNet [26], a representative method, applies a permutohedral lattice [1] to group points into lattices and performs learned bilateral filters [11] on the grouped points to extract features. The bilateral convolution operates on the grouped points, and enjoys the advantage of naturally defined local neighbors in different directions.

Network architecture. We follow the same architecture as SplatNet [26]: the network starts with a single regular convolutional layer, followed by 5 bilateral convolution layers (BCL). The outputs of all BCLs are concatenated and fed to a final regular convolutional layer to get the segmentation output. Since each BCL directly takes the raw point cloud location as input, we consider this the fix graph baseline. We add a deformable spatial transformer to the network and feed the transformed point graphs to the BCLs to construct the permutohedral lattice. Because gradients flow through the permutohedral lattice grid, the transformation matrix can be learned end-to-end. Note that we increase the channels of the convolution layers for a fair comparison.

Result and analysis. We report the performance of the deformable spatial transformer (applied at all BCLs) in Table 2. Compared with the sampling-based fix graph baseline [26], the deformable module achieves a 0.6% improvement and a performance boost in most categories (a 5.9% improvement for rocket). The deformable spatial transformer also beats other state-of-the-art baselines.

4.3 Semantic Segmentation

Figure 6: Qualitative results for semantic segmentation of our deformable spatial transformer.

Semantic segmentation of point cloud data is challenging but has high practical significance, e.g. for robotic vision. The task is similar to part segmentation, except that the point labels are semantic object classes instead of part labels.

We conduct experiments on the Stanford 3D semantic parsing dataset [2]. The dataset contains 3D scans from Matterport scanners in 6 areas comprising 271 rooms. Each point in the scan is annotated with one of 13 semantic categories (chair, table, floor, etc., plus clutter).

We follow the data processing procedure of [19] for the Stanford 3D Indoor Scenes Dataset [2]. Specifically, we first split points by room, and then sample rooms into several 1m × 1m blocks. During training, 4096 points are sampled from each block on the fly. We train our network to predict the per-point class in each block, where each point is represented by a 9-dimensional vector of XYZ, RGB and normalized location (in the range (0, 1)) relative to the room.

Avg. mIoU: PointNet [19] 47.7 | DGCNN [32] 56.1 | [32] (FIX) 56.0 | [32]+AFF 56.9 | [32]+DEF 57.2 | [26] (FIX) 54.1 | [26]+DEF 55.5
ceiling floor wall beam column window clutter
[32](FIX) 92.5 93.1 76.1 51.0 41.7 49.6 46.8
[32]+AFF 92.7 93.6 76.7 52.6 41.2 48.7 47.8
[32]+DEF 92.8 93.6 76.8 52.9 41.1 49.0 48.0
door table chair sofa bookcase board
[32](FIX) 63.4 61.8 43.1 23.3 42.0 43.5
[32]+AFF 63.7 63.4 45.1 27.0 41.3 44.8
[32]+DEF 63.5 64.2 45.2 28.1 41.7 46.1
Table 3: Semantic segmentation results on S3DIS semantic segmentation dataset. Metric is mIoU(%) on points. Compared with other methods, our method achieved the SOTA in avg. mIoU.

Network architecture. The network architecture is based on DGCNN [32] and is the same as in Section 4.2, with the dimension of the final segmentation label changed to 13.

Result and analysis. In Table 3, we report the performance of the affine and deformable spatial transformer networks, and compare with our fixed-graph baseline and several other state-of-the-art methods. Compared with our fixed-graph baseline, the affine spatial transformer achieves a 0.9% average mIoU improvement, while the deformable one achieves 1.2%. Compared with the dynamic graph of [32], the deformable spatial transformer is also 1.1% higher. Our deformable spatial transformer beats all other state-of-the-art methods.

From these results, we draw a similar conclusion as in the part segmentation experiments: when the point cloud is given more freedom to deform (moving from the affine to the deformable spatial transformer), based on a transformation of the original location plus a feature projection, the segmentation performance improves. However, when the affinity matrix is built directly from high-dimensional point features, the performance drops due to the lack of regularization.
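The contrast drawn in the last sentence can be sketched as follows. Both helpers are our own illustration, not the paper's implementation, and `W` stands for a hypothetical learned C×3 projection:

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest neighbors of each row of x (excluding itself)."""
    d = np.linalg.norm(x[:, None] - x[None], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

# (a) Under-regularized variant: build the affinity graph directly
#     in the high-dimensional feature space.
def patches_feature_space(feat, k=16):
    return knn_indices(feat, k)

# (b) Regularized variant: deform the 3D coordinates with a learned
#     low-dimensional projection W, then build the k-NN graph in 3D.
def patches_projected(xyz, feat, W, k=16):
    return knn_indices(xyz + feat @ W, k)
```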

Figure 6 depicts qualitative results for semantic segmentation of our deformable transformation learning module. Our network is able to output smooth predictions and is robust to missing points and occlusions.

4.4 Detection

                        Birds' eye view          3D
                        Easy   Medium  Hard      Easy   Medium  Hard
VoxelNet [36]           77.3   59.6    51.6      43.8   32.6    27.9
VoxelNet + fix graph    84.3   67.2    59.0      45.7   34.5    32.4
VoxelNet + deformable   85.3   69.1    60.9      46.1   35.9    34.0

Table 4: Detection results on the KITTI validation set (car class). Metric is AP (%).

We also explore how the proposed method performs in point cloud detection. We evaluate on the KITTI 3D object detection benchmark [9], which contains 7,481 training images/point clouds and 7,518 test images/point clouds, covering three categories: Car, Pedestrian, and Cyclist. For each class, detection outcomes are evaluated at three difficulty levels: easy, moderate, and hard, determined by object size, occlusion state, and truncation level. We follow the evaluation protocol of VoxelNet [36] and report car detection results on the validation set.

Network architecture. As shown in Figure 4, the network takes a raw point cloud as input and partitions the points into voxels. We add a deformable spatial transformer to the point cloud locations, so the grouped points in each voxel are represented as point features. There are two deformable feature learning layers, each with sub-graphs producing low-dimensional outputs. Note that the voxel partition is based on the original point cloud locations. Then, as in VoxelNet, the point features in each voxel are fed to voxel feature encoding layers to get sparse 4D tensors representing the space. The convolutional middle layers process the 4D tensors to further aggregate spatial context. Finally, a RPN generates the 3D detections.
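The voxel partition on the original coordinates can be sketched as follows (a minimal sketch; VoxelNet additionally caps the number of points per voxel, which we omit):

```python
import numpy as np

def voxelize(points, voxel_size):
    """Partition points into voxels by their ORIGINAL coordinates, as the
    text notes; deformed coordinates are only used for feature learning.
    Returns each occupied voxel's grid index and its member point indices."""
    coords = np.floor(points / voxel_size).astype(np.int64)   # voxel index per point
    keys, inverse = np.unique(coords, axis=0, return_inverse=True)
    groups = [np.flatnonzero(inverse == i) for i in range(len(keys))]
    return keys, groups
```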

We report the performance of three networks: (1) the VoxelNet baseline [36]; (2) our fixed-graph baseline, where we use the original point cloud locations to learn point features in place of the spatial transformer blocks; and (3) the deformable spatial transformer network discussed above.

Result and analysis. Table 4 reports car detection results on the KITTI validation set.¹ Compared with the VoxelNet baseline, adding a point feature learning module improves the birds' eye view and 3D detection performance by 7.3% and 2.8% on average, respectively. The deformable module further improves these gains to 8.9% and 3.9%, respectively. We observe a consistent performance boost with our deformable spatial transformer.

¹As the original authors did not provide code, we use a third-party implementation and obtain lower results than those reported in the original paper.

4.5 Ablation Studies

Figure 7: Performance improvement (compared to the fixed-graph baseline) of different components of the deformable spatial transformer.

Influence of different components in the deformable transformation. As in Equation 7, the deformable spatial transformer consists of two components: an affine transformation of the point location, and a three-dimensional projection of the high-dimensional feature. Figure 7 depicts the performance of each component of the deformable transformation learning module. We observe an average mIoU improvement for both the affine-only and feature-only spatial transformers, while the deformable spatial transformer (the combination of both) gives the highest performance boost.
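The two-component structure can be sketched as follows (our own notation, not Equation 7's exact symbols):

```python
import numpy as np

def deformable_transform(p, f, A, b, W):
    """Sketch of the deformable transformation: an affine transform of the
    3D location p (N, 3), plus a 3D projection of the high-dimensional
    point feature f (N, C)."""
    affine_part = p @ A.T + b      # affine transformation of point location
    feature_part = f @ W           # W is a learned (C, 3) projection
    return affine_part + feature_part   # deformed 3D coordinates, (N, 3)
```

Setting W = 0 recovers the affine-only variant, while A = I, b = 0 gives the feature-only variant, matching the two ablated components in Figure 7.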

         fix graph | 1 graph | 2 graphs | 4 graphs
dim = 32 | 84.2    | 84.9    | 85.2     | 85.3
    = 64 | 84.2    | 85.3    | 85.2     | 83.5

Table 5: Performance with different numbers of deformable transformation modules. Metric is average mIoU (%). In the first row, the output feature of each sub-graph has dimension 32 while the number of sub-graphs varies; in the second row, the product of the number of sub-graphs and the sub-feature dimension is fixed at 64.

Influence of the number of transformation modules. Table 5 shows the performance with different numbers of deformable transformation modules. When the sub-feature dimension is fixed, the more graphs in each layer, the higher the performance. With limited resources (the product of the number of sub-graphs and the sub-feature dimension fixed at 64), the best performance is achieved with a single graph of 64-dimensional features.

Model size and timing. Table 6 shows that a significant performance gain can be achieved with the same model size and almost the same test time. For a fair comparison, we increase the number of channels in the fixed-graph baseline model in all experiments. Note that even without increasing the number of parameters of the baseline, adding the spatial transformer increases the number of parameters by only 0.1%.

                      [26]     [26] + Spatial Transformer
# Params.             2,738K   2,738K
Test time (s/shape)   0.352    0.379

Table 6: Model size and test time on ShapeNet part segmentation.

4.6 Visualization and Analysis

Figure 8: Examples of learned deformable transformations in ShapeNet part segmentation. The top (bottom) row shows two examples of rocket (table). We observe that each graph at a certain layer aligns the input 3D shape with a similar semantic geometric transformation, e.g., graph 2 (1) at layer 2 (2) in the rocket (table) example captures the wing (table base) information.

Global view of the deformable transformation. Figure 8 depicts examples of learned deformable transformations in ShapeNet part segmentation. We observe that each graph at a certain layer aligns the input 3D shape with a similar semantic geometric transformation.

Figure 9: The deformable spatial transformer makes the point cloud sampling more balanced, which makes defining local patches and learning point cloud features more efficient.

Local view of the deformable transformation. Point cloud data is usually not uniformly sampled, which makes point cloud convolution challenging, as the k-NN graph then does not accurately represent the exact neighborhood and 3D structure. Our deformable spatial transformer gives every point flexibility and in turn captures a better affinity matrix and finds better local patches; but can it implicitly make the point cloud closer to a balanced sampling?

Figure 9 shows a local view of a skateboard sample: after the deformable transformation, the points are more uniformly distributed. We also analyze the standard deviation of the original and transformed point clouds on the ShapeNet dataset. After the transformation, the variance of the data decreases, which accounts for the more balanced sampling distribution of the transformed points.
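One hedged way to quantify "balanced sampling" (a proxy of our own, not the paper's exact metric) is the spread of per-point nearest-neighbor distances, which shrinks as the points approach a uniform spread:

```python
import numpy as np

def nn_spacing_std(points):
    """Standard deviation of each point's nearest-neighbor distance.
    A perfectly uniform grid gives 0; clustered sampling gives a larger value."""
    d = np.linalg.norm(points[:, None] - points[None], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).std()   # spread of per-point nearest-neighbor gaps
```

Comparing `nn_spacing_std(original)` with `nn_spacing_std(transformed)` would then show the decrease described above.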

Dynamic neighborhood visualization. To illustrate how our spatial transformers learn diverse neighborhoods for 3D shapes, we show the nearest neighbors of two query points and use corresponding colors to indicate corresponding neighborhoods. (1) As shown in Fig. 10, neighborhoods retrieved from the deformed shape encode additional semantic information compared to neighborhoods from the raw 3D coordinates. (2) As shown in the additional graph visualizations of a table and an earphone (Fig. 11), different graphs enable the network to learn from diverse neighborhoods without incurring additional computational burden.

Figure 10: Nearest neighbor retrieval of two query points (red and yellow) using (transformed) 3D coordinates (rotating view). The neighborhood of the transformed coordinates encodes additional semantic information: the neighborhood inside the dashed circle changes to adapt to the table base part.
Figure 11: Nearest neighbor retrieval of several query points using (transformed) 3D coordinates (rotating table and earphone).

5 Conclusion

In this work, we propose novel spatial transformers on 3D point clouds for altering the “local patches” in different point cloud processing tasks. We analyze different transformations and their influence on the affinity matrix and the point “local patches”. We propose one linear spatial transformer (affine) and two non-linear spatial transformers (projective and deformable). We also show that the spatial transformers can be easily added to existing point cloud processing networks. We evaluate the proposed spatial transformers on two point cloud networks (point-based [32] and sampling-based [26]) across three large-scale 3D point cloud processing tasks (part segmentation, semantic segmentation, and detection). In addition to beating all other state-of-the-art methods, our spatial transformers also achieve higher performance than their fixed-graph counterparts. Future work could design better non-linear spatial transformers for point clouds, and explore other approaches to dynamic local patches for point cloud processing.


  • [1] A. Adams, J. Baek, and M. A. Davis (2010) Fast high-dimensional filtering using the permutohedral lattice. In Computer Graphics Forum, Vol. 29, pp. 753–762. Cited by: §2, §3.3, §4.2.2.
  • [2] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese (2016) 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543. Cited by: §4.3, §4.3.
  • [3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §4.2.
  • [4] Y. Chen, J. Wang, J. Li, C. Lu, Z. Luo, H. Xue, and C. Wang (2018) LiDAR-video driving dataset: learning driving policies effectively. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5870–5878. Cited by: §1.
  • [5] A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm, and M. Nießner (2018) ScanComplete: large-scale scene completion and semantic segmentation for 3d scans. In CVPR, Vol. 1, pp. 2. Cited by: §1.
  • [6] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. CoRR, abs/1703.06211 1 (2), pp. 3. Cited by: §3.4, §3.4, §3.4.
  • [7] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao (2018) GVCNN: group-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 264–272. Cited by: §2.
  • [8] M. Gadelha, R. Wang, and S. Maji (2018) Multiresolution tree networks for 3d point cloud processing. arXiv preprint arXiv:1807.03520. Cited by: §1.
  • [9] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: §4.4.
  • [10] Z. Han, M. Shang, Z. Liu, C. Vong, Y. Liu, M. Zwicker, J. Han, and C. P. Chen (2019) SeqViews2SeqLabels: learning 3d global features via aggregating sequential views by rnn with attention. IEEE Transactions on Image Processing 28 (2), pp. 658–672. Cited by: §2.
  • [11] V. Jampani, M. Kiefel, and P. V. Gehler (2016) Learning sparse high dimensional filters: image filtering, dense crfs and bilateral neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4452–4461. Cited by: §2, §3.3, §4.2.2.
  • [12] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri (2017) 3D shape segmentation with projective convolutional networks. In Proc. CVPR, Vol. 1, pp. 8. Cited by: §1.
  • [13] R. Klokov and V. Lempitsky (2017) Escape from cells: deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 863–872. Cited by: §2.
  • [14] L. Landrieu and M. Simonovsky (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4558–4567. Cited by: §2.
  • [15] J. Li, B. M. Chen, and G. Hee Lee (2018) So-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406. Cited by: §2.
  • [16] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen (2018) PointCNN: convolution on x-transformed points. In Advances in Neural Information Processing Systems, pp. 820–830. Cited by: §1, §2, §2, §3.3, Table 2.
  • [17] C. Lin, Y. Chung, B. Chou, H. Chen, and C. Tsai (2018) A novel campus navigation app with augmented reality and deep learning. In 2018 IEEE International Conference on Applied System Invention (ICASI), pp. 1075–1077. Cited by: §1.
  • [18] D. Maturana and S. Scherer (2015) Voxnet: a 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. Cited by: §2.
  • [19] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1 (2), pp. 4. Cited by: §1, §1, §2, §3.3, §4.2.1, §4.3, Table 1, Table 2, Table 3.
  • [20] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas (2016) Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5648–5656. Cited by: §1, §2.
  • [21] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108. Cited by: §1, §1, §3.3, §4.2.1, Table 2.
  • [22] J. R. Rambach, A. Tewari, A. Pagani, and D. Stricker (2016) Learning to fuse: a deep learning approach to visual-inertial camera pose estimation. In Mixed and Augmented Reality (ISMAR), 2016 IEEE International Symposium on, pp. 71–76. Cited by: §1.
  • [23] D. Rethage, J. Wald, J. Sturm, N. Navab, and F. Tombari (2018) Fully-convolutional point networks for large-scale point clouds. arXiv preprint arXiv:1808.06840. Cited by: §1, §4.2.1, Table 2.
  • [24] G. Riegler, A. O. Ulusoy, and A. Geiger (2017) Octnet: learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 3. Cited by: §1, §2.
  • [25] S. Shen (2018) Stereo vision-based semantic 3d object and ego-motion tracking for autonomous driving. arXiv preprint arXiv:1807.02062. Cited by: §1.
  • [26] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M. Yang, and J. Kautz (2018) Splatnet: sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2530–2539. Cited by: §2, §3.2, §3.3, §4.1, §4.2.2, §4.2.2, §4.2.2, Table 1, Table 2, Table 3, Table 6, §5.
  • [27] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller (2015) Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pp. 945–953. Cited by: §1, §2.
  • [28] M. Tatarchenko, A. Dosovitskiy, and T. Brox (2017) Octree generating networks: efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2088–2096. Cited by: §2.
  • [29] S. Tulsiani, S. Gupta, D. Fouhey, A. A. Efros, and J. Malik (2018) Factoring shape, pose, and layout from the 2d image of a 3d scene. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 302–310. Cited by: §1.
  • [30] S. Vasu, M. M. MR, and A. Rajagopalan (2018) Occlusion-aware rolling shutter rectification of 3d scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 636–645. Cited by: §1.
  • [31] P. Wang, Y. Liu, Y. Guo, C. Sun, and X. Tong (2017) O-cnn: octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG) 36 (4), pp. 72. Cited by: §1, §2.
  • [32] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2018) Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829. Cited by: §1, §2, §2, §3.2, §3.3, §3.3, §3.4, §3.4, §4.1, §4.2.1, §4.3, §4.3, Table 1, Table 2, Table 3, §5.
  • [33] B. Wu, A. Wan, X. Yue, and K. Keutzer (2018) Squeezeseg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1887–1893. Cited by: §1.
  • [34] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §1, §2, §4.1.
  • [35] L. Zhou, S. Zhu, Z. Luo, T. Shen, R. Zhang, M. Zhen, T. Fang, and L. Quan (2018) Learning and matching multi-view descriptors for registration of point clouds. arXiv preprint arXiv:1807.05653. Cited by: §1.
  • [36] Y. Zhou and O. Tuzel (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. Cited by: §3.3, item 1, §4.4, Table 4.