
PointPWC-Net: A Coarse-to-Fine Network for Supervised and Self-Supervised Scene Flow Estimation on 3D Point Clouds

11/27/2019
by   Wenxuan Wu, et al.

We propose a novel end-to-end deep scene flow model, called PointPWC-Net, on 3D point clouds in a coarse-to-fine fashion. Flow computed at the coarse level is upsampled and warped to a finer level, enabling the algorithm to accommodate for large motion without a prohibitive search space. We introduce novel cost volume, upsampling, and warping layers to efficiently handle 3D point cloud data. Unlike traditional cost volumes that require exhaustively computing all the cost values on a high-dimensional grid, our point-based formulation discretizes the cost volume onto input 3D points, and a PointConv operation efficiently computes convolutions on the cost volume. Experiment results on FlyingThings3D outperform the state-of-the-art by a large margin. We further explore novel self-supervised losses to train our model and achieve comparable results to state-of-the-art trained with supervised loss. Without any fine-tuning, our method also shows great generalization ability on KITTI Scene Flow 2015 dataset, outperforming all previous methods.


1 Introduction

Scene flow is the 3D displacement vector between each surface point in two consecutive frames. As a fundamental low-level understanding of the world, scene flow can be used in various applications, such as motion segmentation, action recognition, and autonomous driving. Traditionally, scene flow was estimated from RGB data [24, 23, 41, 43]. Recently, due to the increasing use of 3D sensors such as LiDAR, there is interest in directly estimating scene flow from 3D point clouds.

Fueled by recent advances in 3D deep networks that learn effective feature representations directly from point cloud data, recent works adopt ideas from 2D deep optical flow networks to estimate scene flow from point clouds. FlowNet3D [22] operates directly on points with PointNet++ [28], and proposes a flow embedding that is computed at a single layer to capture the correlation between two point clouds and then propagated through finer layers to estimate the scene flow. HPLFlowNet [11] computes the correlation jointly from multiple scales, utilizing the upsampling operation in bilateral convolutional layers.

In this work, we explore another classic optical flow idea, coarse-to-fine estimation [2, 5, 35, 37], on 3D point clouds. Coarse-to-fine flow estimation accommodates large motion at a coarse level without a prohibitive search space; the coarse flow is then upsampled and warped to the next finer level, where only the residual flow is estimated. The process continues until the finest level. Because only a small neighborhood needs to be searched at each level, the approach is also computationally efficient. We propose efficient upsampling and warping layers to implement this process on point clouds.

An important piece of state-of-the-art deep optical flow estimation networks is the cost volume [18, 49, 37], a 3D tensor that contains matching information between neighboring pixel pairs from consecutive frames. In this paper, we propose a novel point-based cost volume where the costs are discretized onto nearby input point pairs, avoiding the dense 4D tensor that a naive extension from images to point clouds would create. We then apply the efficient PointConv layer [48] on this irregularly discretized cost volume. We experimentally show that it outperforms previous approaches for associating point cloud correspondences.

As in optical flow, it is difficult and expensive to acquire accurate scene flow labels. Hence, beyond supervised scene flow estimation, we also explore self-supervised scene flow estimation, which does not require human annotations. To our knowledge, our work is the first to explore self-supervised scene flow estimation from point cloud data. We propose new self-supervised loss terms: Chamfer distance [9], a smoothness constraint, and Laplacian regularization. The Chamfer distance encourages the warped point cloud to be close to the second point cloud. The smoothness constraint encourages the scene flow within a local region to be similar. The Laplacian regularization encourages the warped point cloud to have a shape similar to the second point cloud.

We conduct extensive experiments on the FlyingThings3D [23] and KITTI Scene Flow 2015 [26, 25] datasets with both the supervised loss and the proposed self-supervised losses. Experiments show that the proposed PointPWC-Net outperforms all previous methods by a large margin. Even the self-supervised version is comparable with some previous supervised methods, such as SPLATFlowNet [34]. We also ablate each critical component of PointPWC-Net to understand its contribution.

The key contributions of our work are:

  • We present a novel model, called PointPWC-Net, that estimates scene flow from two consecutive point clouds in a coarse-to-fine fashion.

  • We propose a novel PointConv based cost volume layer that performs convolution on the cost volume without creating a dense 4-dimensional tensor.

  • We introduce self-supervised losses that can train the PointPWC-Net without any ground truth label.

  • We achieve state-of-the-art performance on FlyingThings3D and KITTI Scene Flow 2015, surpassing previous methods by a large margin.

2 Related Work

Deep Learning on Point Clouds. Deep learning methods on 3D point clouds have gained increasing attention over the past several years. Some of the latest works [30, 27, 28, 34, 38, 14, 10, 42, 20] directly take raw point clouds as input. [30, 27, 28] use a shared multi-layer perceptron (MLP) and max pooling layer to obtain features of point clouds. SPLATNet [34] projects the input features of the point clouds onto a high-dimensional lattice, and then applies bilateral convolution on the high-dimensional lattice to aggregate features. Other works [32, 17, 47, 12, 46, 48] propose to learn continuous convolutional filter weights as a nonlinear function of the 3D point coordinates, approximated with an MLP. [12, 48] use density estimation to compensate for non-uniform sampling, and [48] significantly improves memory efficiency with a change-of-summation trick, allowing these networks to scale up.

Optical Flow Estimation.

Optical flow estimation is a core computer vision problem with many applications. Traditionally, top-performing methods adopt the energy minimization approach [13] and coarse-to-fine, warping-based methods [2, 5, 4]. Since FlowNet [8], there have been many breakthroughs using deep networks to learn optical flow. [16] stacks several FlowNets into a larger one. [29] develops a compact spatial pyramid network. [37] integrates the widely used traditional pyramid, warping, and cost volume techniques into CNNs for optical flow, and outperforms all previous methods with high efficiency. We utilize a basic structure similar to theirs in PointPWC-Net, but propose novel layers appropriate for point clouds.

Scene Flow Estimation. 3D scene flow was first introduced by [41]. Many works [15, 24, 44] estimate scene flow from RGB data. [15] introduces a variational method to estimate scene flow from stereo sequences. [24] proposes an object-level scene flow estimation method and introduces a dataset for 3D scene flow estimation. [44] presents a piecewise rigid scene model for 3D scene flow estimation.

In recent years, several works [7, 40, 39] have estimated scene flow directly from point clouds using classical techniques. [7] formulates scene flow estimation as an energy minimization problem with assumptions of local geometric constancy and regularization for motion smoothness. [40] proposes a real-time four-step method that constructs occupancy grids, filters the background, solves an energy minimization problem, and refines the result with a filtering framework. [39] further improves the method in [40] by using an encoding network to learn features from an occupancy grid.

In the most recent work [46, 22, 11], researchers estimate scene flow from point clouds using deep learning in an end-to-end fashion. [46] uses a PCNN to operate on LiDAR data to estimate LiDAR motion. [22] introduces FlowNet3D, based on PointNet++ [28]. FlowNet3D uses a flow embedding layer to encode the motion of point clouds. However, it only has one flow embedding layer, so it requires encoding a large neighborhood in order to capture large motions. [11] presents HPLFlowNet, which estimates scene flow using Bilateral Convolutional Layers (BCL) that project the point cloud onto a permutohedral lattice. [1] estimates scene flow with a network that jointly predicts 3D bounding boxes and rigid motions of objects or background in the scene. Different from [1], we do not require the rigid motion assumption or segmentation-level supervision to estimate scene flow.

Self-supervised Learning. Several recent works [21, 50, 51, 19] jointly estimate multiple tasks, i.e., depth, optical flow, ego-motion, and camera pose, without supervision. They take 2D images as input, which introduces ambiguity when used for scene flow estimation. In this paper, we investigate self-supervised learning of scene flow from 3D point clouds with our PointPWC-Net. To the best of our knowledge, we are the first to study self-supervised learning of scene flow from 3D point clouds.

Figure 1: PointPWC-Net. (a) shows the structure of PointPWC-Net. (b) illustrates how pyramid features are used at one level. At each level, PointPWC-Net first warps the features from the first point cloud using the upsampled scene flow. Then, the cost volume is computed using the feature from the warped first point cloud and the feature from the second point cloud. Finally, the scene flow predictor takes the features from the first point cloud, the cost volume, and the upsampled flow as input, and predicts the finer flow at the current level.

3 PointPWC-Net

As shown in Fig. 1, PointPWC-Net predicts dense scene flow in a coarse-to-fine fashion. The input to PointPWC-Net is two consecutive point clouds: P = {p_i} with n1 points and Q = {q_j} with n2 points. We first construct a feature pyramid for each point cloud. Afterwards, we build a cost volume at each layer using features from both point clouds. Then, we use the features from P, the cost volume, and the upsampled flow to estimate the finer scene flow. We take the predicted scene flow as the coarse flow, upsample it to a finer level, and warp points from P onto Q. Note that both the upsampling and the warping layers are efficient, with no learnable parameters.
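To make the coarse-to-fine loop concrete, the sketch below is a simplified NumPy mock-up of the control flow only, not the actual network: estimate_residual_flow is a hypothetical stand-in for the cost volume plus scene flow predictor, and upsample_flow here is a nearest-neighbor placeholder (a concrete inverse-distance interpolation is sketched later in the Upsampling Layer subsection).

```python
import numpy as np

def upsample_flow(coarse_pts, coarse_flow, fine_pts):
    """Placeholder: copy each fine point's flow from its nearest coarse point."""
    nearest = np.argmin(np.linalg.norm(
        fine_pts[:, None, :] - coarse_pts[None, :, :], axis=-1), axis=1)
    return coarse_flow[nearest]

def estimate_residual_flow(warped_pts, target_pts):
    """Hypothetical stand-in for cost volume + scene flow predictor at one level."""
    return np.zeros_like(warped_pts)

def coarse_to_fine_flow(pyramid_p, pyramid_q):
    """pyramid_p / pyramid_q: lists of (n_l, 3) point arrays, coarsest level last."""
    flow = np.zeros_like(pyramid_p[-1])                       # zero flow at the coarsest level
    for l in range(len(pyramid_p) - 1, -1, -1):
        if l < len(pyramid_p) - 1:                            # upsample flow from level l + 1
            flow = upsample_flow(pyramid_p[l + 1], flow, pyramid_p[l])
        p_warped = pyramid_p[l] + flow                        # warping layer: element-wise add
        flow = flow + estimate_residual_flow(p_warped, pyramid_q[l])  # add the residual flow
    return flow                                               # dense flow at the finest level

# Example with a toy 3-level pyramid (coarsest last), 4x fewer points per level.
pyr_p = [np.random.rand(n, 3) for n in (1024, 256, 64)]
pyr_q = [np.random.rand(n, 3) for n in (1024, 256, 64)]
print(coarse_to_fine_flow(pyr_p, pyr_q).shape)                # (1024, 3)
```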

Feature Pyramid from Point Cloud. To estimate scene flow with high accuracy, we need to extract strong features from the input point clouds. We generate an L-level pyramid of feature representations, with the top level being the input point clouds, i.e., P^0 = P and Q^0 = Q. For each level l, we use furthest point sampling [28] to downsample the points by a factor of 4 from level l−1, and use PointConv [48] to perform convolution on the features from level l−1. As a result, we generate a feature pyramid with L levels for each input point cloud. After this, we enlarge the receptive field at level l of the pyramid by upsampling the features at level l+1 and concatenating them to the features at level l.
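A minimal sketch of the furthest point sampling step used to build the pyramid is shown below (plain NumPy; the implementation in [28] is a GPU kernel, and the PointConv feature extraction is omitted here; the point count is an arbitrary example value).

```python
import numpy as np

def furthest_point_sampling(points, n_samples):
    """Greedy furthest point sampling: return indices of n_samples points
    that are approximately maximally spread over the input cloud."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    dist_to_set = np.full(n, np.inf)
    selected[0] = 0                                   # start from an arbitrary point
    for i in range(1, n_samples):
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        dist_to_set = np.minimum(dist_to_set, d)      # distance to the already-chosen set
        selected[i] = np.argmax(dist_to_set)          # pick the furthest remaining point
    return selected

# Example: a 4-level point pyramid, downsampling by a factor of 4 at each level.
cloud = np.random.rand(4096, 3).astype(np.float32)   # arbitrary example size
pyramid = [cloud]
for _ in range(3):
    idx = furthest_point_sampling(pyramid[-1], pyramid[-1].shape[0] // 4)
    pyramid.append(pyramid[-1][idx])
print([p.shape[0] for p in pyramid])                  # [4096, 1024, 256, 64]
```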

Cost Volume Layer. The cost volume is one of the key components for optical flow estimation. Most state-of-the-art algorithms, both traditional [36, 31] and modern deep-learning-based ones [37, 49, 6], use a cost volume to estimate optical flow. However, computing cost volumes on point clouds is still an open problem. [22] proposes a flow embedding layer that aggregates feature similarities and spatial relationships to encode point motions. However, motion information between points can be lost due to the max pooling operation in the flow embedding layer. [11] introduces a CorrBCL layer to compute the correlation between two point clouds, which requires transferring the two point clouds onto the same permutohedral lattice. We present a cost volume layer that takes the motion between points into account and can be applied directly on the features of the two point clouds.

Figure 2: Cost Volume. Our cost volume aggregates the matching costs in a patch-to-patch manner. We first aggregate the costs from a patch in point cloud Q. Then, we aggregate the costs from a patch in the warped point cloud P_w.

Suppose f_i is the feature for p_i ∈ P and g_j is the feature for q_j ∈ Q. The matching cost between a point p_i and a point q_j can be defined as Eq. (1):

Cost(p_i, q_j) = h(f_i, g_j, q_j − p_i)     (1)

where h(·) is a function with learnable parameters.

This cost has been computed in the literature using either concatenation or element-wise product [37], or even a simple arithmetic difference [6], between f_i and g_j. Due to the flexibility of the point cloud, we also add the direction vector q_j − p_i as an input to the function h, and h is a concatenation of its inputs followed by an MLP.

Once we have the matching costs, they can be aggregated to predict the movement between the two point clouds. [22] uses max-pooling to aggregate features in the second point cloud. [11] uses CorrBCL to aggregate features on a permutohedral lattice. However, these methods only aggregate costs in a point-to-point manner. In this work, we propose to aggregate costs in a patch-to-patch manner, as in cost volumes on 2D images [18, 37]. For a point p_i in P, we first find a neighborhood N_P(p_i) around p_i in P. For each point p_k ∈ N_P(p_i), we find a neighborhood N_Q(p_k) around p_k in Q. The cost volume for p_i is defined as Eqs. (2)-(4):

CV(p_i) = Σ_{p_k ∈ N_P(p_i)} W_P(p_k, p_i) · cost(p_k)     (2)
cost(p_k) = Σ_{q_j ∈ N_Q(p_k)} W_Q(q_j, p_k) · Cost(q_j, p_k)     (3)
W_P(p_k, p_i) = MLP(p_k − p_i),   W_Q(q_j, p_k) = MLP(q_j − p_k)     (4)

where W_P and W_Q are the convolutional weights used to aggregate the costs from the patches in P and Q. They are learned as continuous functions of the direction vectors (p_k − p_i) and (q_j − p_k), respectively, with an MLP, similar to PointConv [48] and PCNN [46]. The output of the cost volume layer is a tensor of shape (n1, D), where n1 is the number of points in P and D is the dimension of the cost volume, which encodes all the motion information for each point. The patch-to-patch idea used in the cost volume is illustrated in Fig. 2.

The main difference between this cost volume and conventional cost volumes in stereo and optical flow is that our cost volume is discretized irregularly on the two input point clouds, and the costs are aggregated with point-based convolution. Previously, in order to compute the cost volume over a search window on a 2D image, all the values of a dense 4D tensor (two spatial dimensions and two displacement dimensions) needed to be populated, which is already slow to compute in 2D and would be prohibitively costly in 3D space. Our cost volume is discretized on the input points and avoids this costly operation, while essentially retaining the same capability to perform convolutions on the cost volume. We anticipate it to be widely useful beyond scene flow estimation. Table 2 shows that it is better than FlowNet3D's MLP+Maxpool strategy.
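The sketch below illustrates the patch-to-patch aggregation structure of the cost volume in plain NumPy. The learned pieces (the function h of Eq. (1) and the MLP-predicted weights W_P, W_Q of Eq. (4)) are replaced by hypothetical stand-ins; in the actual layer they are trainable.

```python
import numpy as np

def knn(query, points, k):
    """Indices of the k nearest neighbors in `points` for each query point."""
    d = np.linalg.norm(query[:, None, :] - points[None, :, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

def pair_cost(f_i, g_j, direction):
    """Stand-in for the learnable function h of Eq. (1): concatenate the inputs."""
    return np.concatenate([f_i, g_j, direction])

def weight_fn(direction):
    """Stand-in for the MLP-predicted aggregation weight of Eq. (4)."""
    return np.exp(-np.linalg.norm(direction))

def cost_volume(P, F, Q, G, k_q=8, k_p=8):
    """Patch-to-patch cost volume (Eqs. (2)-(3)): one cost vector per point of P."""
    n1, dim = P.shape[0], F.shape[1] + G.shape[1] + 3
    nq_idx = knn(P, Q, k_q)                       # neighborhood of each p_k in Q
    np_idx = knn(P, P, k_p)                       # neighborhood of each p_i in P
    # First aggregation: cost of each point p_k against its patch in Q (Eq. (3)).
    point_cost = np.zeros((n1, dim))
    for k in range(n1):
        for j in nq_idx[k]:
            point_cost[k] += weight_fn(Q[j] - P[k]) * pair_cost(F[k], G[j], Q[j] - P[k])
    # Second aggregation: cost volume of p_i over its patch in P (Eq. (2)).
    cv = np.zeros((n1, dim))
    for i in range(n1):
        for k in np_idx[i]:
            cv[i] += weight_fn(P[k] - P[i]) * point_cost[k]
    return cv                                     # shape (n1, D)

# Toy example with 64 points and 16-dimensional features per cloud.
P, Q = np.random.rand(64, 3), np.random.rand(64, 3)
F, G = np.random.rand(64, 16), np.random.rand(64, 16)
print(cost_volume(P, F, Q, G).shape)              # (64, 35)
```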

Upsampling Layer. The upsampling layer propagates the scene flow estimated at a coarse level to a finer level. We use a distance-based interpolation to upsample the coarse flow. Let P^{l+1} be the point cloud at the coarser level l+1, SF^{l+1} the estimated scene flow at level l+1, and P^l the point cloud at the finer level l. For each point p_i in the finer-level point cloud P^l, we find its K nearest neighbors N(p_i) in the coarser-level point cloud P^{l+1}. The interpolated scene flow at the finer level is computed using inverse distance weighted interpolation, Eq. (5):

SF_up^l(p_i) = [ Σ_{x_k ∈ N(p_i)} w(p_i, x_k) · SF^{l+1}(x_k) ] / [ Σ_{x_k ∈ N(p_i)} w(p_i, x_k) ]     (5)

where w(p_i, x_k) = 1 / d(p_i, x_k) and d(·, ·) is a distance metric. We use the Euclidean distance in this work.
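A minimal NumPy sketch of the inverse distance weighted interpolation of Eq. (5), assuming K = 3 nearest neighbors and Euclidean distance (both choices are illustrative).

```python
import numpy as np

def upsample_scene_flow(coarse_pts, coarse_flow, fine_pts, k=3, eps=1e-8):
    """Interpolate flow defined on the coarse cloud onto the finer cloud, Eq. (5)."""
    # Pairwise Euclidean distances between finer points and coarser points.
    d = np.linalg.norm(fine_pts[:, None, :] - coarse_pts[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]                       # K nearest coarse neighbors
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + eps)     # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)                        # normalize the weights
    return (w[..., None] * coarse_flow[idx]).sum(axis=1)     # (n_fine, 3) upsampled flow

# Example: flow on 256 coarse points interpolated onto 1024 finer points.
coarse = np.random.rand(256, 3).astype(np.float32)
fine = np.random.rand(1024, 3).astype(np.float32)
flow_up = upsample_scene_flow(coarse, np.random.rand(256, 3) * 0.05, fine)
print(flow_up.shape)                                         # (1024, 3)
```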

Warping Layer. Warping “applies” the computed flow so that afterwards only the residual flow needs to be estimated; hence, the search radius can be smaller when constructing the cost volume. In our network, we first upsample the scene flow from the previous coarser level and then warp before computing the cost volume. Denote the upsampled scene flow as SF_up^l = {sf_i} and the warped point cloud as P_w^l = {p_{w,i}}. The warping layer is simply an element-wise addition, Eq. (6). As shown in Fig. 3, the scene flow at the current level is computed by summing the upsampled flow and the estimated residual flow of the current level.

p_{w,i} = p_i + sf_i,   p_i ∈ P^l     (6)

A similar warping operation is used for visualization, to compare the estimated flow with the ground truth, in [22, 11], but it is not used in coarse-to-fine estimation. [11] uses an offset strategy to reduce the search radius, which is specific to the permutohedral lattice.

Figure 3: Warping. We warp the first point cloud according to the upsampled flow. After warping, we can construct the cost volume in a smaller region. The final flow is computed by summing the upsampled flow and the residual flow at the current level.

Scene Flow Predictor. To obtain a flow estimate at each level, a convolutional scene flow predictor is built from multiple PointConv layers and an MLP. The inputs to the flow predictor are the cost volume, the features of the first point cloud, the upsampled flow from the previous level, and the upsampled feature from the second-to-last layer of the previous level's scene flow predictor, which we call the predictor feature. The intuition for adding the predictor feature from the coarser level is that it encodes all the information needed to predict scene flow at the coarse level; by adding it, we may be able to correct predictions with large errors and improve robustness. The output is the scene flow SF^l of the first point cloud P. The first several PointConv layers merge features locally, and the following MLP estimates the scene flow at each point. We keep the flow predictor structure the same at all levels, but the parameters are not shared.
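A simplified PyTorch sketch of a per-level predictor is given below. The real predictor merges local features with PointConv layers before the MLP; here the PointConv stage is replaced by a pointwise MLP purely for illustration, and all channel sizes are arbitrary.

```python
import torch
import torch.nn as nn

class SceneFlowPredictor(nn.Module):
    """Predict a 3D flow vector per point from concatenated per-point inputs:
    first-cloud feature, cost volume, upsampled flow, and predictor feature."""
    def __init__(self, feat_dim, cost_dim, pred_feat_dim, hidden=128):
        super().__init__()
        in_dim = feat_dim + cost_dim + 3 + pred_feat_dim   # +3 for the upsampled flow
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LeakyReLU(0.1),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.1),
        )
        self.flow_head = nn.Linear(hidden, 3)              # residual scene flow per point

    def forward(self, feat, cost_volume, up_flow, pred_feat):
        x = torch.cat([feat, cost_volume, up_flow, pred_feat], dim=-1)  # (n, in_dim)
        h = self.mlp(x)                                    # predictor feature for next level
        return self.flow_head(h), h

# Example usage with arbitrary sizes.
predictor = SceneFlowPredictor(feat_dim=64, cost_dim=64, pred_feat_dim=128)
flow, pred_feat = predictor(torch.randn(1024, 64), torch.randn(1024, 64),
                            torch.randn(1024, 3), torch.randn(1024, 128))
print(flow.shape, pred_feat.shape)                         # (1024, 3) (1024, 128)
```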

4 Training Loss Functions

4.1 Supervised Loss

We adopt the multi-scale loss function of FlowNet [8] and PWC-Net [37] as a supervised learning loss to demonstrate the effectiveness of the network structure and the design choices. Let SF_GT^l be the ground truth flow at the l-th level. The multi-scale training loss is written as Eq. (7):

ℓ(Θ) = Σ_l α_l Σ_{p ∈ P^l} ‖SF_Θ^l(p) − SF_GT^l(p)‖_2 + γ‖Θ‖_2^2     (7)

where ‖·‖_2 computes the ℓ2-norm, α_l is the weight for pyramid level l, γ is the regularization parameter, and Θ is the set of all learnable parameters in our PointPWC-Net, including the feature extractor, the cost volume layers, and the scene flow predictors at the different pyramid levels. Note that the flow loss is not squared as in [37], for robustness.
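A PyTorch sketch of the flow term of Eq. (7) follows. The weight decay γ‖Θ‖² is typically handled by the optimizer and is omitted; the per-level weights used here are placeholders, not the paper's values.

```python
import torch

def multiscale_flow_loss(pred_flows, gt_flows, alphas):
    """Eq. (7): weighted sum over pyramid levels of the per-point L2 norm
    (not squared) between predicted and ground truth scene flow."""
    loss = torch.tensor(0.0)
    for pred, gt, alpha in zip(pred_flows, gt_flows, alphas):
        # pred, gt: (n_l, 3) flow at pyramid level l.
        loss = loss + alpha * torch.norm(pred - gt, dim=-1).sum()
    return loss

# Example with two pyramid levels and placeholder level weights.
preds = [torch.randn(1024, 3), torch.randn(256, 3)]
gts = [torch.randn(1024, 3), torch.randn(256, 3)]
print(multiscale_flow_loss(preds, gts, alphas=[1.0, 0.5]))
```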

4.2 Self-supervised Loss

Since obtaining ground truth scene flow for 3D point clouds is hard and there are few publicly available datasets for point cloud scene flow learning, it is interesting to investigate self-supervised approaches. In this section, we propose a self-supervised learning objective to learn scene flow on 3D point clouds without supervision. Our loss function contains three parts: Chamfer distance, a smoothness constraint, and Laplacian regularization [45, 33]. To the best of our knowledge, we are the first to study self-supervised learning of scene flow estimation from 3D point clouds.

Chamfer Distance. The goal of the Chamfer loss is to estimate a scene flow that moves the first point cloud as close as possible to the second one. Let SF^l be the scene flow predicted at level l, P_w^l = P^l + SF^l the point cloud warped from the first point cloud according to SF^l at level l, and Q^l the second point cloud at level l. Let p_w and q denote points in P_w^l and Q^l, respectively. The Chamfer loss is written as Eq. (8):

ℓ_CD^l(P_w^l, Q^l) = Σ_{p_w ∈ P_w^l} min_{q ∈ Q^l} ‖p_w − q‖_2^2 + Σ_{q ∈ Q^l} min_{p_w ∈ P_w^l} ‖q − p_w‖_2^2     (8)
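A NumPy sketch of the Chamfer term of Eq. (8), using brute-force nearest neighbors (fine for small clouds; practical implementations use spatial data structures or GPU kernels).

```python
import numpy as np

def chamfer_loss(p_warped, q):
    """Symmetric Chamfer distance between the warped first cloud and the second cloud."""
    d2 = np.sum((p_warped[:, None, :] - q[None, :, :]) ** 2, axis=-1)  # squared distances
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()                 # both terms of Eq. (8)

# Example: the warped cloud should move toward q to reduce this loss.
p_w = np.random.rand(512, 3)
q = np.random.rand(512, 3)
print(chamfer_loss(p_w, q))
```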

Smoothness Constraint. In order to enforce local spatial smoothness, we add a smoothness constraint ℓ_S, which assumes that the predicted scene flow SF^l(p_j) in a local region N(p_i) of p_i should be similar to the scene flow SF^l(p_i) at p_i:

ℓ_S^l = Σ_{p_i ∈ P^l} (1 / |N(p_i)|) Σ_{p_j ∈ N(p_i)} ‖SF^l(p_j) − SF^l(p_i)‖_2^2     (10)

where |N(p_i)| is the number of points in the local region N(p_i).
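A NumPy sketch of the smoothness term of Eq. (10): each point's flow is compared against the flow of its k nearest neighbors (the neighborhood size here is an arbitrary choice).

```python
import numpy as np

def smoothness_loss(points, flow, k=8):
    """Penalize deviation of each neighbor's flow from the center point's flow."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]              # k nearest neighbors, excluding self
    diff = flow[idx] - flow[:, None, :]                  # (n, k, 3) flow differences
    return np.mean(np.sum(diff ** 2, axis=-1))           # mean squared L2 difference

points = np.random.rand(512, 3)
flow = np.random.rand(512, 3) * 0.05
print(smoothness_loss(points, flow))
```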

Laplacian Regularization. Because the points in a point cloud lie only on the surface of an object, their Laplacian coordinate vectors approximate the local shape characteristics of the surface, including the normal direction and the mean curvature [33]. The Laplacian coordinate vector δ(p_i) is computed as Eq. (11):

δ(p_i) = p_i − (1 / |N(p_i)|) Σ_{p_j ∈ N(p_i)} p_j     (11)

For scene flow, the warped point cloud should have the same Laplacian coordinate vector as the second point cloud at the same position. To enforce the Laplacian regularization, we first compute the Laplacian coordinates for each point in the warped point cloud P_w^l and in the second point cloud Q^l. Then, because the points in P_w^l and Q^l do not correspond to each other, we interpolate the Laplacian coordinates of Q^l to obtain a Laplacian coordinate at the position of each point p_w. We use the same inverse distance-based interpolation as Eq. (5), replacing the scene flow with the Laplacian coordinate δ(q). Let δ(p_w^l) be the Laplacian coordinate of point p_w at level l, and δ_inter(q^l) the Laplacian coordinate interpolated from Q^l at the same position as p_w. The Laplacian regularization is defined as Eq. (12):

ℓ_L^l = Σ_{p_w ∈ P_w^l} ‖δ(p_w^l) − δ_inter(q^l)‖_2^2     (12)
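A NumPy sketch of Eqs. (11)-(12): Laplacian coordinates are computed on both clouds, and those of Q are interpolated (inverse distance, as in Eq. (5)) to the warped points before comparison. The neighborhood size and interpolation K are arbitrary choices here.

```python
import numpy as np

def knn_idx(query, points, k):
    """k-nearest-neighbor indices and the full distance matrix."""
    d = np.linalg.norm(query[:, None, :] - points[None, :, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k], d

def laplacian_coords(points, k=8):
    """Eq. (11): offset of each point from the centroid of its k-neighborhood."""
    idx, _ = knn_idx(points, points, k + 1)            # +1: the point itself is included
    return points - points[idx[:, 1:]].mean(axis=1)

def laplacian_loss(p_warped, q, k=8, k_interp=3, eps=1e-8):
    delta_p = laplacian_coords(p_warped, k)
    delta_q = laplacian_coords(q, k)
    # Interpolate Q's Laplacian coordinates to the warped points (inverse distance).
    idx, d = knn_idx(p_warped, q, k_interp)
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + eps)
    w /= w.sum(axis=1, keepdims=True)
    delta_q_at_p = (w[..., None] * delta_q[idx]).sum(axis=1)
    return np.sum((delta_p - delta_q_at_p) ** 2)       # Eq. (12)

p_w = np.random.rand(512, 3)
q = np.random.rand(512, 3)
print(laplacian_loss(p_w, q))
```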

The overall loss is a weighted sum of all losses across all pyramid levels, as in Eq. (13):

ℓ(Θ) = Σ_l α_l (β_1 ℓ_CD^l + β_2 ℓ_S^l + β_3 ℓ_L^l) + γ‖Θ‖_2^2     (13)

where α_l is the factor for pyramid level l, γ is the regularization parameter, and β_1, β_2, β_3 are the scale factors for the respective losses. With the self-supervised loss, our model is able to learn good scene flow from pairs of 3D point clouds without any ground truth supervision.

5 Experiments

In this section, we train and evaluate PointPWC-Net on the FlyingThings3D dataset [23] using both the supervised loss and the self-supervised loss. Then, we test the generalization ability of our model by evaluating it on the real-world KITTI Scene Flow dataset [26, 25] without any fine-tuning. We also compare the time efficiency of our model with previous work. Finally, we conduct ablation studies to analyze the contribution of each part of the model and the loss function.

Implementation Details. We build a 4-level feature pyramid from the input point clouds. The pyramid-level weights α_l and the regularization parameter γ are kept the same for both supervised and self-supervised learning, and the scale factors β_1, β_2, β_3 are used for the self-supervised losses. We train our model from an initial learning rate that is reduced by half every 80 epochs. All the hyperparameters are set using the validation set of FlyingThings3D.

Evaluation Metrics. For fair comparison, we adopt the evaluation metrics used in [11]. Let SF_pred denote the predicted scene flow and SF_GT the ground truth scene flow. The evaluation metrics are computed as follows (a small sketch after the list illustrates the 3D metrics):

EPE3D(m): the main metric, ‖SF_pred − SF_GT‖_2 averaged over each point, in meters.

Acc3DS: the percentage of points whose EPE3D < 0.05m or relative error < 5%.

Acc3DR: the percentage of points whose EPE3D < 0.1m or relative error < 10%.

Outliers3D: the percentage of points whose EPE3D > 0.3m or relative error > 10%.

EPE2D(px): 2D end point error, obtained by projecting the point clouds back onto the image.

Acc2D: the percentage of points whose EPE2D < 2px or relative error < 5%.
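The sketch below computes the 3D metrics in NumPy; the thresholds follow the definitions stated above (adopted from [11]) and should be treated as assumptions if a particular benchmark configures them differently.

```python
import numpy as np

def scene_flow_metrics(sf_pred, sf_gt, eps=1e-8):
    """EPE3D plus accuracy/outlier ratios with the thresholds assumed above."""
    err = np.linalg.norm(sf_pred - sf_gt, axis=-1)          # per-point end point error (m)
    rel = err / (np.linalg.norm(sf_gt, axis=-1) + eps)      # relative error
    return {
        "EPE3D": err.mean(),
        "Acc3DS": np.mean((err < 0.05) | (rel < 0.05)),
        "Acc3DR": np.mean((err < 0.10) | (rel < 0.10)),
        "Outliers3D": np.mean((err > 0.30) | (rel > 0.10)),
    }

pred = np.random.rand(2048, 3) * 0.1
gt = np.random.rand(2048, 3) * 0.1
print(scene_flow_metrics(pred, gt))
```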

Dataset Method Sup. EPE3D Acc3DS Acc3DR Outliers3D EPE2D Acc2D
FlyingThings3D ICP [3] Self 0.4062 0.1614 0.3038 0.8796 23.2280 0.2913
PointPWC-Net Self 0.1246 0.3068 0.6552 0.7032 6.6494 0.4516
FlowNet3D [22] Full 0.1136 0.4125 0.7706 0.6016 5.9740 0.5692
SPLATFlowNet [34] Full 0.1205 0.4197 0.7180 0.6187 6.9759 0.5512
original BCL [11] Full 0.1111 0.4279 0.7551 0.6054 6.3027 0.5669
HPLFlowNet [11] Full 0.0804 0.6144 0.8555 0.4287 4.6723 0.6764
PointPWC-Net Full 0.0588 0.7379 0.9276 0.3424 3.2390 0.7994
KITTI ICP [3] Self 0.5181 0.0669 0.1667 0.8712 27.6752 0.1056
PointPWC-Net Self 0.2549 0.2379 0.4957 0.6863 8.9439 0.3299
FlowNet3D [22] Full 0.1767 0.3738 0.6677 0.5271 7.2141 0.5093
SPLATFlowNet [34] Full 0.1988 0.2174 0.5391 0.6575 8.2306 0.4189
original BCL [11] Full 0.1729 0.2516 0.6011 0.6215 7.3476 0.4411
HPLFlowNet [11] Full 0.1169 0.4783 0.7776 0.4103 4.8055 0.5938
PointPWC-Net Full 0.0694 0.7281 0.8884 0.2648 3.0062 0.7673
Table 1: Evaluation results on the FlyingThings3D and KITTI datasets. "Self" means self-supervised; "Full" means fully-supervised. All methods are trained only on FlyingThings3D; on KITTI, Self and Full refer to the respective model trained on FlyingThings3D and evaluated on KITTI without fine-tuning. Our model outperforms all baselines by a large margin. On EPE2D, our method is the only one under 4px on the FlyingThings3D dataset. Particularly on KITTI, the better Outliers3D result demonstrates the strong generalization ability of PointPWC-Net.

5.1 Supervised Learning on FlyingThings3D

In order to demonstrate the effectiveness of PointPWC-Net, we conduct experiments using our supervised loss. To our knowledge, there is no publicly available large-scale real-world dataset that enables scene flow estimation from point clouds (the input to the KITTI scene flow benchmark is 2D), so we train PointPWC-Net on the synthetic FlyingThings3D dataset. Following the preprocessing procedure in [11], we reconstruct 3D point clouds and ground truth scene flow to build the training and evaluation sets. The training set includes 19,640 pairs of point clouds, and the evaluation set includes 3,824 pairs of point clouds. Our model takes a fixed number of sampled points from each point cloud. To speed up training, we first train the model with one quarter of the training set (4,910 pairs), and then fine-tune it on the whole training set.

Table 1 shows the quantitative evaluation results on the FlyingThings3D dataset. We compare our method with FlowNet3D [22], SPLATFlowNet [34], original BCL [11], and HPLFlowNet [11]. Our method outperforms all of these methods on all metrics by a large margin. Compared to FlowNet3D, our cost volume layer captures the motion information better. Compared to SPLATFlowNet, original BCL, and HPLFlowNet, our method avoids the preprocessing step of building a permutohedral lattice from the input. Moreover, our method improves EPE3D over HPLFlowNet from 0.0804m to 0.0588m, and we are the only method with EPE2D under 4px (3.2390px vs. 4.6723px for HPLFlowNet). The first row of Fig. 4 illustrates some scene flow results with supervised learning.

Figure 4: Scene flow estimation on FlyingThings3D dataset. The blue points are from the first point cloud . The green points are the warped points according to the correctly predicted flow. The “correctness” is measured by Acc3DR. The red points are wrongly predicted. The first row shows results with supervised learning. The second row shows results with self-supervised learning.
Figure 5: Scene flow estimation on KITTI Scene Flow 2015 dataset. The blue points are from the first point cloud . The green points are the warped points according to the correctly predicted flow. The “correctness” is measured by Acc3DR. The red points are wrongly predicted. The first row shows results with supervised learning. The second row shows results with self-supervised learning.

5.2 Self-supervised Learning on FlyingThings3D

Acquiring or annotating dense scene flow from real-world 3D point clouds is very expensive, so it is interesting to evaluate the performance of our self-supervised approach. We train our model using the same procedure as in supervised learning, i.e., first train with one quarter of the training set, then fine-tune on the whole training set. Table 1 gives the quantitative results of PointPWC-Net with self-supervised learning. We compare our method with ICP [3]. Because iterative closest point (ICP) iteratively revises the rigid transformation to minimize an error metric, we can view it as a self-supervised method, albeit with a rigidity constraint. Our PointPWC-Net outperforms ICP on all metrics by a large margin. Note also that our method without supervision achieves 0.1246m EPE3D, which is comparable with SPLATFlowNet (0.1205m) trained with supervision. The second row of Fig. 4 illustrates some scene flow predictions with self-supervised learning.

5.3 Generalization on KITTI Scene Flow 2015

To study the generalization ability of PointPWC-Net, we directly take the model trained on FlyingThings3D and evaluate it on KITTI Scene Flow 2015 [25, 26] without any fine-tuning. KITTI Scene Flow 2015 consists of 200 training scenes and 200 test scenes. To evaluate PointPWC-Net, we use the ground truth labels and trace the raw point clouds associated with the frames, following [22, 11]. Since no point clouds or ground truth are provided for the test set, we evaluate on all 142 scenes in the training set with available point clouds. For a fair comparison with previous methods, we remove the ground from the point clouds by a height threshold, as in [11].

From Table 1, our PointPWC-Net with supervised learning outperforms all state-of-the-art methods, which demonstrates the generalization ability of our model. On EPE3D, our model achieves 0.0694m, improving over HPLFlowNet (0.1169m). On Acc3DS, our method (0.7281) outperforms both FlowNet3D (0.3738) and HPLFlowNet (0.4783). With self-supervised learning, our method also generalizes well, reducing EPE3D from 0.5181m (ICP) to 0.2549m. One interesting result is that our PointPWC-Net trained without ground truth outperforms SPLATFlowNet trained with ground truth flow on Acc3DS. Fig. 5 illustrates some scene flow predictions on the KITTI dataset, with the first row showing supervised learning results and the second row self-supervised learning results.

5.4 Ablation Study

In this section, we conduct ablation studies on the model design choices and the loss function. For the model design, we evaluate different choices of the cost volume layer and the effect of removing the warping layer. For the loss function, we investigate removing the smoothness constraint and the Laplacian regularization from the self-supervised learning loss. All models in the ablation studies are trained on FlyingThings3D, and EPE3D is reported on the FlyingThings3D evaluation set.

Component Status EPE3D
Cost Volume MLP+Maxpool [22] 0.0741
Ours 0.0588
Warping Layer w/o 0.0984
w 0.0588
Table 2: Model design. Using our cost volume instead of the MLP+Maxpool used in FlowNet3D's flow embedding layer improves EPE3D from 0.0741 to 0.0588. Compared to no warping, the warping layer improves EPE3D from 0.0984 to 0.0588.
Chamfer Smoothness Laplacian EPE3D
✓ - - 0.2112
✓ ✓ - 0.1304
✓ ✓ ✓ 0.1246
Table 3: Loss functions. For self-supervised learning, the Chamfer loss alone is not enough to estimate good scene flow. Adding the smoothness constraint improves EPE3D from 0.2112 to 0.1304. Laplacian regularization improves it slightly further, to 0.1246.
Method Runtime(ms)
FlowNet3D [22] 130.8
HPLFlowNet [11] 98.4
PointPWC-Net 117.4
Table 4: Time efficiency. Average runtime (ms) on FlyingThings3D. The runtimes for FlowNet3D and HPLFlowNet are reported from [11] on a single Titan V. The runtime for our PointPWC-Net is reported on a single 1080Ti.
Component Feature Pyramid Cost Volume Upsample + Warping Scene Flow Predictor
Runtime(ms) 43.2 22.7 24.5 27.0
Table 5: Runtime breakdown. The runtime of each critical component in PointPWC-Net.

Tables 2 and 3 show the results of the ablation studies. In Table 2, we can see that our cost volume design obtains significantly better EPE3D than the flow embedding used in FlowNet3D [22], and that the warping layer is crucial for performance. In Table 3, we see that both the smoothness constraint and the Laplacian regularization improve performance in self-supervised learning. In Table 4, we report the runtime of our PointPWC-Net, which is comparable with other deep-learning-based methods. Table 5 breaks down the runtime of each critical component of PointPWC-Net.

With the proposed self-supervised loss, we are able to fine-tune the FlyingThings3D-trained models on KITTI without using any ground truth labels. Table 6 shows that both of our models, pretrained with the supervised and the self-supervised loss, give much better EPE3D after fine-tuning with the self-supervised loss on KITTI.

Sup. Finetune with self-supervision EPE3D
Full - 0.0694
Full ✓ 0.0415
Self - 0.2549
Self ✓ 0.0457
Table 6: Fine-tuning with the self-supervised loss on KITTI. With the proposed self-supervised loss, we can fine-tune the FlyingThings3D-trained models on KITTI without using any ground truth labels. "Full" means the model is pretrained using the fully-supervised loss on FlyingThings3D; "Self" means the model is pretrained using the self-supervised loss on FlyingThings3D. Both models give significantly better results after fine-tuning on KITTI with the self-supervised loss.

6 Conclusion

We proposed a novel network structure, called PointPWC-Net, which predicts scene flow directly from 3D point clouds in a coarse-to-fine fashion. A crucial layer is the cost volume layer based on PointConv, which improves over prior work on feature matching. Because real-world ground truth scene flow is hard to acquire, we also introduce a loss function that trains PointPWC-Net without supervision. Experiments on the FlyingThings3D and KITTI datasets demonstrate the effectiveness of PointPWC-Net and the self-supervised loss function.

References

  • [1] Aseem Behl, Despoina Paschalidou, Simon Donné, and Andreas Geiger. Pointflownet: Learning representations for rigid motion estimation from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7962–7971, 2019.
  • [2] James R Bergen, Patrick Anandan, Keith J Hanna, and Rajesh Hingorani. Hierarchical model-based motion estimation. In European conference on computer vision, pages 237–252. Springer, 1992.
  • [3] Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. In Sensor fusion IV: control paradigms and data structures, volume 1611, pages 586–606. International Society for Optics and Photonics, 1992.
  • [4] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In European conference on computer vision, pages 25–36. Springer, 2004.
  • [5] Andrés Bruhn, Joachim Weickert, and Christoph Schnörr. Lucas/kanade meets horn/schunck: Combining local and global optic flow methods. International journal of computer vision, 61(3):211–231, 2005.
  • [6] Rohan Chabra, Julian Straub, Christopher Sweeney, Richard Newcombe, and Henry Fuchs. Stereodrnet: Dilated residual stereonet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11786–11795, 2019.
  • [7] Ayush Dewan, Tim Caselitz, Gian Diego Tipaldi, and Wolfram Burgard. Rigid scene flow for 3d lidar scans. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1765–1770. IEEE, 2016.
  • [8] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 2758–2766, 2015.
  • [9] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 605–613, 2017.
  • [10] Fabian Groh, Patrick Wieschollek, and Hendrik PA Lensch. Flex-convolution. In Asian Conference on Computer Vision, pages 105–122. Springer, 2018.
  • [11] Xiuye Gu, Yijie Wang, Chongruo Wu, Yong Jae Lee, and Panqu Wang. Hplflownet: Hierarchical permutohedral lattice flownet for scene flow estimation on large-scale point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3254–3263, 2019.
  • [12] Pedro Hermosilla, Tobias Ritschel, Pere-Pau Vázquez, Àlvar Vinacua, and Timo Ropinski. Monte carlo convolution for learning on non-uniformly sampled point clouds. In SIGGRAPH Asia 2018 Technical Papers, page 235. ACM, 2018.
  • [13] Berthold KP Horn and Brian G Schunck. Determining optical flow. Artificial intelligence, 17(1-3):185–203, 1981.
  • [14] Binh-Son Hua, Minh-Khoi Tran, and Sai-Kit Yeung. Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 984–993, 2018.
  • [15] Frédéric Huguet and Frédéric Devernay. A variational method for scene flow estimation from stereo sequences. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–7. IEEE, 2007.
  • [16] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2462–2470, 2017.
  • [17] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems, pages 667–675, 2016.
  • [18] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 66–75, 2017.
  • [19] Minhaeng Lee and Charless C Fowlkes. Cemnet: Self-supervised learning for accurate continuous ego-motion estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [20] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.
  • [21] Liang Liu, Guangyao Zhai, Wenlong Ye, and Yong Liu. Unsupervised learning of scene flow estimation fusing with local rigidity. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 876–882. AAAI Press, 2019.
  • [22] Xingyu Liu, Charles R Qi, and Leonidas J Guibas. Flownet3d: Learning scene flow in 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 529–537, 2019.
  • [23] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
  • [24] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3061–3070, 2015.
  • [25] Moritz Menze, Christian Heipke, and Andreas Geiger. Joint 3d estimation of vehicles and scene flow. In ISPRS Workshop on Image Sequence Analysis (ISA), 2015.
  • [26] Moritz Menze, Christian Heipke, and Andreas Geiger. Object scene flow. ISPRS Journal of Photogrammetry and Remote Sensing (JPRS), 2018.
  • [27] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
  • [28] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017.
  • [29] Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4161–4170, 2017.
  • [30] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016.
  • [31] Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Epicflow: Edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1164–1172, 2015.
  • [32] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3693–3702, 2017.
  • [33] Olga Sorkine. Laplacian mesh processing. In Eurographics (STARs), pages 53–70, 2005.
  • [34] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. Splatnet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2530–2539, 2018.
  • [35] Deqing Sun, Stefan Roth, and Michael J Black. Secrets of optical flow estimation and their principles. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 2432–2439. IEEE, 2010.
  • [36] Deqing Sun, Stefan Roth, and Michael J Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. International Journal of Computer Vision, 106(2):115–137, 2014.
  • [37] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
  • [38] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3887–3896, 2018.
  • [39] Arash K Ushani and Ryan M Eustice. Feature learning for scene flow estimation from lidar. In Conference on Robot Learning, pages 283–292, 2018.
  • [40] Arash K Ushani, Ryan W Wolcott, Jeffrey M Walls, and Ryan M Eustice. A learning approach for real-time temporal scene flow estimation from lidar data. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 5666–5673. IEEE, 2017.
  • [41] Sundar Vedula, Simon Baker, Peter Rander, Robert Collins, and Takeo Kanade. Three-dimensional scene flow. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 722–729. IEEE, 1999.
  • [42] Nitika Verma, Edmond Boyer, and Jakob Verbeek. Feastnet: Feature-steered graph convolutions for 3d shape analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2598–2606, 2018.
  • [43] Christoph Vogel, Konrad Schindler, and Stefan Roth. Piecewise rigid scene flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 1377–1384, 2013.
  • [44] Christoph Vogel, Konrad Schindler, and Stefan Roth. 3d scene flow estimation with a piecewise rigid scene model. International Journal of Computer Vision, 115(1):1–28, 2015.
  • [45] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–67, 2018.
  • [46] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2589–2597, 2018.
  • [47] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):146, 2019.
  • [48] Wenxuan Wu, Zhongang Qi, and Li Fuxin. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9621–9630, 2019.
  • [49] Jia Xu, René Ranftl, and Vladlen Koltun. Accurate optical flow via direct cost volume processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1289–1297, 2017.
  • [50] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1983–1992, 2018.
  • [51] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision (ECCV), pages 36–53, 2018.