1 Introduction
Scene flow is the 3D displacement vector between each surface point in two consecutive frames. As a fundamental low-level understanding of the world, scene flow can be used in many applications, such as motion segmentation, action recognition, and autonomous driving. Traditionally, scene flow was estimated from RGB data
[24, 23, 41, 43]. Recently, due to the increasing deployment of 3D sensors such as LiDAR, there is growing interest in estimating scene flow directly from 3D point clouds. Fueled by recent advances in 3D deep networks that learn effective feature representations directly from point cloud data, recent works adapt ideas from 2D deep optical flow networks to estimate scene flow from point clouds. FlowNet3D [22] operates directly on points with PointNet++ [28]; it proposes a flow embedding, computed in a single layer, to capture the correlation between two point clouds, and then propagates it through finer layers to estimate the scene flow. HPLFlowNet [11] computes the correlation jointly from multiple scales, utilizing the upsampling operation in bilateral convolutional layers.
In this work, we explore another classic optical flow idea, coarse-to-fine estimation [2, 5, 35, 37], in 3D point clouds. Coarse-to-fine flow estimation accommodates large motion at a coarse level without a prohibitive search space; the coarse flow is then upsampled and warped to the next finer level, where only the residual flow is estimated. The process continues until the finest level. Because only a small neighborhood needs to be searched at each level, it is also computationally efficient. We further propose efficient upsampling and warping layers to implement this process on point clouds.
An important component in state-of-the-art deep optical flow estimation networks is the cost volume [18, 49, 37], a 3D tensor that contains matching information between neighboring pixel pairs from consecutive frames. In this paper, we propose a novel point-based cost volume in which the cost volume is discretized onto nearby input point pairs, avoiding the dense 4D tensor that a naive extension from images to point clouds would create. We then apply the efficient PointConv layer
[48] on this irregularly discretized cost volume. We experimentally show that it outperforms previous approaches for associating point cloud correspondences. As in optical flow, accurate scene flow labels are difficult and expensive to acquire. Hence, beyond supervised scene flow estimation, we also explore self-supervised scene flow, which requires no human annotations. To our knowledge, our work is the first to explore self-supervised scene flow estimation from point cloud data. We propose new self-supervised loss terms: Chamfer distance [9], a smoothness constraint, and Laplacian regularization. The Chamfer distance encourages the two point clouds to be close to each other. The smoothness constraint encourages the scene flow within a local region to be similar. The Laplacian regularization encourages the warped point cloud to have a similar shape as the second point cloud.
We conduct extensive experiments on the FlyingThings3D [23] and KITTI Scene Flow 2015 [26, 25] datasets with both the supervised loss and the proposed self-supervised losses. Experiments show that the proposed PointPWC-Net outperforms all previous methods by a large margin. Even the self-supervised version is comparable with some of the previous supervised methods, such as SPLATFlowNet [34]. We also ablate each critical component of PointPWC-Net to understand its contribution.
The key contributions of our work are:

We present a novel model, called PointPWC-Net, that estimates scene flow from two consecutive point clouds in a coarse-to-fine fashion.

We propose a novel PointConv-based cost volume layer that performs convolution on the cost volume without creating a dense 4-dimensional tensor.

We introduce self-supervised losses that can train PointPWC-Net without any ground truth labels.

We achieve state-of-the-art performance on FlyingThings3D and KITTI Scene Flow 2015, surpassing previous methods by a large margin.
2 Related Work
Deep Learning on Point Clouds. Deep learning methods on 3D point clouds have gained increasing attention in the past several years. Some of the latest works [30, 27, 28, 34, 38, 14, 10, 42, 20] directly take raw point clouds as input. [30, 27, 28] use a shared multi-layer perceptron (MLP) and max pooling layers to obtain features of point clouds. SPLATNet [34] projects the input features of the point clouds onto a high-dimensional lattice, and then applies bilateral convolutions on that lattice to aggregate features. Other works [32, 17, 47, 12, 46, 48] propose to learn continuous convolutional filter weights as a nonlinear function of the 3D point coordinates, approximated with an MLP. [12, 48] use density estimation to compensate for non-uniform sampling, and [48] significantly improves memory efficiency with a change-of-summation trick, allowing these networks to scale up.

Optical Flow Estimation.
Optical flow estimation is a core computer vision problem with many applications. Traditionally, the top-performing methods adopt the energy minimization approach [13] and coarse-to-fine, warping-based methods [2, 5, 4]. Since FlowNet [8], there have been many breakthroughs in learning optical flow with deep networks. [16] stacks several FlowNets into a larger one. [29] develops a compact spatial pyramid network. [37] integrates the widely used traditional pyramid, warping, and cost volume techniques into CNNs for optical flow, and outperforms all previous methods with high efficiency. We utilize a basic structure similar to theirs in our PointPWC-Net, but propose novel layers appropriate for point clouds.

Scene Flow Estimation. 3D scene flow was first introduced by [41]. Many works [15, 24, 44] estimate scene flow from RGB data. [15] introduces a variational method to estimate scene flow from stereo sequences. [24] proposes an object-level scene flow estimation method and introduces a dataset for 3D scene flow estimation. [44] presents a piecewise rigid scene model for 3D scene flow estimation.
In recent years, some works [7, 40, 39] have estimated scene flow directly from point clouds using classical techniques. [7] formulates scene flow estimation as an energy minimization problem with assumptions of local geometric constancy and regularization for motion smoothness. [40] proposes a real-time four-step method: constructing occupancy grids, filtering the background, solving an energy minimization problem, and refining with a filtering framework. [39] further improves the method in [40] by using an encoding network to learn features from an occupancy grid.
In the most recent work [46, 22, 11], researchers estimate scene flow from point clouds using deep learning in an end-to-end fashion. [46] uses PCNN to operate on LiDAR data to estimate LiDAR motion. [22] introduces FlowNet3D, based on PointNet++ [28]. FlowNet3D uses a flow embedding layer to encode the motion between point clouds. However, it has only one flow embedding layer, so it must encode a large neighborhood in order to capture large motions. [11] presents HPLFlowNet, which estimates scene flow using Bilateral Convolutional Layers (BCL) that project the point cloud onto a permutohedral lattice. [1] estimates scene flow with a network that jointly predicts 3D bounding boxes and the rigid motions of objects or background in the scene. Different from [1], we require neither the rigid motion assumption nor segmentation-level supervision to estimate scene flow.
Self-supervised Learning. Several recent works [21, 50, 51, 19] jointly estimate multiple tasks, i.e. depth, optical flow, ego-motion, and camera pose, without supervision. They take 2D images as input, which introduces ambiguity when used for scene flow estimation. In this paper, we investigate self-supervised learning of scene flow from 3D point clouds with our PointPWC-Net. To the best of our knowledge, we are the first to study self-supervised learning of scene flow from 3D point clouds.
3 PointPWC-Net
As shown in Fig. 1, PointPWC-Net predicts dense scene flow in a coarse-to-fine fashion. The input to PointPWC-Net is two consecutive point clouds: P = {p_i ∈ R^3} with n_1 points and Q = {q_j ∈ R^3} with n_2 points. We first construct a feature pyramid for each point cloud. Afterwards, we build a cost volume using features from both point clouds at each level. Then, we use the features of P, the cost volume, and the upsampled flow from the coarser level to estimate the finer scene flow. We take the predicted scene flow of a coarse level, upsample it to the finer level, and warp points from P towards Q. Note that both the upsampling and the warping layers are efficient and have no learnable parameters.
Feature Pyramid from Point Cloud. To estimate scene flow with high accuracy, we need to extract strong features from the input point clouds. We generate an L-level pyramid of feature representations, with the top level being the input point cloud, i.e., P^0 = P. For each level l, we use furthest point sampling [28] to downsample the points by a factor of 4 from the previous level l-1, and use PointConv [48] to perform convolution on the features from level l-1. As a result, we generate a feature pyramid with L levels for each input point cloud. After this, we enlarge the receptive field at level l of the pyramid by upsampling the features at level l+1 and concatenating them to the features at level l.
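The downsampling step above can be sketched as follows; this is a minimal NumPy implementation of greedy furthest point sampling for illustration only (real pipelines use the optimized GPU kernels from PointNet++ [28]):

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedy furthest point sampling: repeatedly pick the point farthest
    from everything selected so far. points: (N, 3); returns indices."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    # distance from every point to the closest already-selected point
    dist = np.full(n, np.inf)
    selected[0] = 0  # start from an arbitrary point
    for i in range(1, n_samples):
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        selected[i] = int(np.argmax(dist))
    return selected
```

Each downsampled level keeps a well-spread subset of the previous level's points, which is why furthest point sampling is preferred over random sampling here.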
Cost Volume Layer. The cost volume is one of the key components for optical flow. Most state-of-the-art algorithms, both traditional [36, 31] and modern deep-learning-based ones [37, 49, 6], use a cost volume to estimate optical flow. However, computing cost volumes on point clouds is still an open problem. [22] proposes a flow embedding layer that aggregates feature similarities and spatial relationships to encode point motions; however, motion information between points can be lost due to the max pooling operation in the flow embedding layer. [11] introduces a CorrBCL layer to compute the correlation between two point clouds, which requires transferring the two point clouds onto the same permutohedral lattice. We present a cost volume layer that takes the motion of the points into account and can be applied directly to the features of the two point clouds.
Suppose f_i ∈ R^c is the feature of p_i ∈ P and g_j ∈ R^c the feature of q_j ∈ Q. The matching cost between a point p_i and a point q_j can be defined as Eq.(1):

(1)  Cost(p_i, q_j) = h(f_i, g_j, q_j - p_i)

where h is a function with learnable parameters. In the literature, this cost has been computed using concatenation or element-wise product [37], or even a simple arithmetic difference [6], between f_i and g_j. Due to the flexibility of the point cloud, we also feed the direction vector q_j - p_i to the function h, and h is a concatenation of its inputs followed by an MLP.
Once we have the matching costs, they can be aggregated to predict the movement between the two point clouds. [22] uses max-pooling to aggregate features in the second point cloud. [11] uses CorrBCL to aggregate features on a permutohedral lattice. However, these methods only aggregate costs in a point-to-point manner. In this work, we propose to aggregate costs in a patch-to-patch manner, as in cost volumes on 2D images [18, 37]. For a point p_i in P, we first find a neighborhood N_P(p_i) around p_i in P. For each point p_k ∈ N_P(p_i), we find a neighborhood N_Q(p_k) around p_k in Q. The cost volume for p_i is defined as Eqs.(2)-(4):

(2)  CV(p_i) = Σ_{p_k ∈ N_P(p_i)} W_P(p_k, p_i) PC(p_k)

(3)  PC(p_k) = Σ_{q_j ∈ N_Q(p_k)} W_Q(q_j, p_k) Cost(p_k, q_j)

(4)  W_P(p_k, p_i) = MLP(p_k - p_i),  W_Q(q_j, p_k) = MLP(q_j - p_k)

where W_P and W_Q are the convolutional weights used to aggregate the costs from the patches in P and Q. Each is learned as a continuous function of the directional vectors p_k - p_i and q_j - p_k, respectively, with an MLP, similar to PointConv [48] and PCNN [46]. The output of the cost volume layer is a tensor of shape (n_1, D), where n_1 is the number of points in P and D is the dimension of the cost volume, which encodes all the motion information for each point. The patch-to-patch idea used in the cost volume is illustrated in Fig. 2.
The main difference between this cost volume and conventional cost volumes in stereo and optical flow is that ours is discretized irregularly on the two input point clouds and its costs are aggregated with point-based convolution. Previously, in order to compute a cost volume over a local search window on a 2D image, every entry of a dense 4D tensor (two image dimensions times two search-window dimensions) needed to be populated, which is already slow to compute in 2D and would be prohibitively costly in 3D space. Our cost volume is discretized on the input points and avoids this costly operation, while providing essentially the same capability to perform convolutions on the cost volume. We anticipate it to be widely useful beyond scene flow estimation. Table 2 shows that it is better than FlowNet3D's MLP+Maxpool strategy.
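The two-stage patch-to-patch aggregation can be sketched as follows. In this NumPy sketch the learnable MLPs h, W_P, and W_Q are replaced by fixed toy functions (a feature product concatenated with the direction vector for h, inverse-distance weights for W), so it illustrates only the data flow of the aggregation, not the trained layer:

```python
import numpy as np

def knn(query, ref, k):
    """Indices of the k nearest points in `ref` for each point in `query`."""
    d = np.linalg.norm(query[:, None, :] - ref[None, :, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

def cost_volume(p, q, f, g, k=4, h=None, w=None):
    """Patch-to-patch cost volume discretized on the input points.
    p, q: (N, 3) and (M, 3) coordinates; f, g: their (N, c) and (M, c)
    features. h and w are toy stand-ins for the learnable MLPs."""
    if h is None:
        h = lambda fi, gj, d: np.concatenate([fi * gj, d])
    if w is None:
        w = lambda d: 1.0 / (np.linalg.norm(d) + 1e-8)
    nbr_q = knn(p, q, k)   # neighborhood of each point of P in Q
    nbr_p = knn(p, p, k)   # neighborhood of each point of P in P
    # first aggregation: point-to-patch costs PC over the neighborhood in Q
    pc = np.stack([
        np.sum([w(q[j] - p[i]) * h(f[i], g[j], q[j] - p[i])
                for j in nbr_q[i]], axis=0)
        for i in range(len(p))])
    # second aggregation: patch-to-patch cost volume CV over the
    # neighborhood in P, reusing each neighbor's PC vector
    cv = np.stack([
        np.sum([w(p[j] - p[i]) * pc[j] for j in nbr_p[i]], axis=0)
        for i in range(len(p))])
    return cv
```

The output has one cost vector per point of the first cloud, which is exactly what lets later layers convolve over it with point-based convolution.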
Upsampling Layer. The upsampling layer propagates the scene flow estimated at a coarse level to a finer level. We use distance-based interpolation to upsample the coarse flow. Let P^{l+1} be the point cloud at level l+1, SF^{l+1} the estimated scene flow at level l+1, and P^l the point cloud at level l. For each point p^l_i in the finer-level point cloud P^l, we find its K nearest neighbors N(p^l_i) in the coarser-level point cloud P^{l+1}. The interpolated scene flow at the finer level is computed with inverse-distance-weighted interpolation as in Eq.(5):

(5)  SF^l(p^l_i) = ( Σ_k w(p^{l+1}_k, p^l_i) SF^{l+1}(p^{l+1}_k) ) / ( Σ_k w(p^{l+1}_k, p^l_i) )

where w(p^{l+1}_k, p^l_i) = 1 / d(p^{l+1}_k, p^l_i), p^{l+1}_k ∈ N(p^l_i), and d(·, ·) is a distance metric. We use the Euclidean distance in this work.
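The parameter-free upsampling layer can be sketched in a few lines of NumPy; the K nearest neighbors and the Euclidean distance follow the description above:

```python
import numpy as np

def upsample_flow(p_fine, p_coarse, flow_coarse, k=3, eps=1e-8):
    """Inverse-distance-weighted interpolation of coarse scene flow onto
    the finer point cloud; no learnable parameters."""
    d = np.linalg.norm(p_fine[:, None, :] - p_coarse[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]           # K nearest coarse points
    nd = np.take_along_axis(d, idx, axis=1)      # their distances
    w = 1.0 / (nd + eps)                         # inverse-distance weights
    w = w / w.sum(axis=1, keepdims=True)         # normalize the weights
    return np.einsum('nk,nkc->nc', w, flow_coarse[idx])
```

Because the weights are normalized, a constant coarse flow is interpolated exactly, and nearby coarse points dominate the estimate at each fine point.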
Warping Layer. Warping "applies" the computed flow so that afterwards only the residual flow needs to be estimated; hence, the search radius can be smaller when constructing the cost volume. In our network, we first upsample the scene flow from the previous coarser level and then warp before computing the cost volume. Denote the upsampled scene flow as SF^l_up = {sf^l_i} and the warped point cloud as P^l_w = {p^l_{w,i}}. The warping layer is simply an element-wise addition, as in Eq.(6). As shown in Fig. 3, the scene flow of the current level can then be computed from the upsampled flow and the estimated residual flow of the current level.

(6)  p^l_{w,i} = p^l_i + sf^l_i
A similar warping operation is used in [22, 11] for visualization, to compare the estimated flow with the ground truth, but not in coarse-to-fine estimation. [11] instead uses an offset strategy to reduce the search radius, which is specific to the permutohedral lattice.
Scene Flow Predictor. To obtain a flow estimate at each level, a convolutional scene flow predictor is built from multiple PointConv layers and an MLP. The inputs to the flow predictor are the cost volume, the features of the first point cloud, the upsampled flow from the previous level, and the upsampled features of the second-to-last layer of the previous level's scene flow predictor, which we call the predictor features. The intuition for adding the predictor features from the coarse level is that they encode all the information needed to predict scene flow at that level; by adding them, we may be able to correct predictions with large errors and improve robustness. The output is the scene flow SF^l of the first point cloud P^l. The first several PointConv layers merge features locally, and the following MLP estimates the scene flow at each point. We keep the flow predictor structure the same at all levels, but the parameters are not shared.
4 Training Loss Functions
4.1 Supervised Loss
We adopt the multi-scale loss function of FlowNet [8] and PWC-Net [37] as the supervised learning loss to demonstrate the effectiveness of the network structure and design choices. Let SF^l_GT be the ground truth flow at the l-th level. The multi-scale training loss can be written as Eq.(7):

(7)  ℓ(Θ) = Σ_{l=l_0}^{L} α_l Σ_{p ∈ P^l} || SF^l_Θ(p) - SF^l_GT(p) ||_2 + γ ||Θ||_2^2

where ||·||_2 computes the L2 norm, α_l is the weight for pyramid level l, γ is the regularization parameter, and Θ is the set of all learnable parameters in our PointPWC-Net, including the feature extractor, cost volume layers, and scene flow predictors at the different pyramid levels. Note that, unlike [37], the flow loss is not squared, for robustness.
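As a minimal sketch, the multi-scale flow term (without the weight-decay term, which an optimizer typically handles separately) can be written as:

```python
import numpy as np

def multiscale_flow_loss(pred_flows, gt_flows, alphas):
    """Multi-scale supervised loss: the per-point L2 norm of the flow
    error (not squared), summed over the points of each pyramid level
    and weighted by that level's alpha."""
    total = 0.0
    for sf, gt, a in zip(pred_flows, gt_flows, alphas):
        total += a * np.linalg.norm(sf - gt, axis=1).sum()
    return total
```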
4.2 Selfsupervised Loss
Given that obtaining ground truth scene flow for 3D point clouds is hard and there are few publicly available datasets for point cloud scene flow learning, it is interesting to investigate self-supervised approaches for scene flow on 3D point clouds. In this section, we propose a self-supervised objective function to learn scene flow in 3D point clouds without supervision. Our loss function contains three parts: Chamfer distance, a smoothness constraint, and Laplacian regularization [45, 33]. To the best of our knowledge, we are the first to study self-supervised learning of scene flow estimation from 3D point clouds.
Chamfer Distance. The goal of the Chamfer loss is to estimate scene flow that moves the first point cloud as close as possible to the second. Let SF^l be the scene flow predicted at level l, P^l_w the point cloud warped from the first point cloud according to SF^l at level l (Eq.(8)), and Q^l the second point cloud at level l. Let p̂^l and q^l be points in P^l_w and Q^l. The Chamfer loss ℓ^l_C can be written as Eq.(9):

(8)  P^l_w = P^l + SF^l

(9)  ℓ^l_C = Σ_{p̂^l ∈ P^l_w} min_{q^l ∈ Q^l} || p̂^l - q^l ||_2^2 + Σ_{q^l ∈ Q^l} min_{p̂^l ∈ P^l_w} || q^l - p̂^l ||_2^2
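A brute-force NumPy version of the symmetric Chamfer term above (fine for small clouds; practical implementations use KD-trees or GPU kernels):

```python
import numpy as np

def chamfer_loss(pw, q):
    """Symmetric Chamfer distance between the warped cloud P_w and the
    second cloud Q, via the full O(N*M) pairwise squared distances."""
    d2 = np.sum((pw[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()
```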
Smoothness Constraint. To enforce local spatial smoothness, we add a smoothness constraint ℓ^l_S, which assumes that the predicted scene flow SF(p_j) in a local region N(p_i) of p_i should be similar to the scene flow SF(p_i) at p_i:

(10)  ℓ^l_S = Σ_{p_i ∈ P^l} (1 / |N(p_i)|) Σ_{p_j ∈ N(p_i)} || SF(p_j) - SF(p_i) ||_2^2

where |N(p_i)| is the number of points in the local region N(p_i).
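A NumPy sketch of the smoothness term, where the local region N(p_i) is taken as the k nearest neighbors (an assumption for illustration):

```python
import numpy as np

def smoothness_loss(p, flow, k=4):
    """Smoothness constraint: the flow at each point should be close to
    the flows of the points in its local neighborhood."""
    d = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]   # exclude the point itself
    diff = flow[idx] - flow[:, None, :]       # (N, k, 3) flow differences
    return np.mean(np.sum(diff ** 2, axis=-1))
```

A rigidly moving region incurs zero penalty, so the term mainly suppresses isolated flow outliers.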
Laplacian Regularization. Because the points in a point cloud lie only on object surfaces, the Laplacian coordinate vector δ(p_i) approximates the local shape characteristics of the surface, including the normal direction and the mean curvature [33]. The Laplacian coordinate vector can be computed as Eq.(11):

(11)  δ(p_i) = (1 / |N(p_i)|) Σ_{p_j ∈ N(p_i)} (p_j - p_i)

For scene flow, the warped point cloud should have the same Laplacian coordinate vector as the second point cloud at the same position. To enforce the Laplacian regularization, we first compute the Laplacian coordinate δ for each point in the warped point cloud P^l_w and in the second point cloud Q^l. Then, because the points of P^l_w and Q^l do not correspond to each other, we interpolate the Laplacian coordinates of Q^l to obtain a Laplacian coordinate at each point p̂^l_i, using the inverse-distance-based interpolation of Eq.(5) with the scene flow SF replaced by the Laplacian coordinate δ. Let δ(p̂^l_i) be the Laplacian coordinate of point p̂^l_i at level l, and δ_inter(p̂^l_i) the Laplacian coordinate interpolated from Q^l at the same position. The Laplacian regularization ℓ^l_L is defined as Eq.(12):

(12)  ℓ^l_L = Σ_i || δ(p̂^l_i) - δ_inter(p̂^l_i) ||_2^2
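The Laplacian coordinate vector can be computed as below, again taking the local region to be the k nearest neighbors (an assumption for illustration); the regularization then compares these vectors between the warped cloud and the ones interpolated from the second cloud:

```python
import numpy as np

def laplacian_coordinate(p, k=4):
    """Laplacian coordinate vector: the mean offset from a point to its
    k nearest neighbors, approximating local surface shape."""
    d = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]   # k nearest, excluding self
    return np.mean(p[idx] - p[:, None, :], axis=1)
```

On a locally flat, evenly sampled surface the vector is near zero for interior points, while curvature or distortion introduced by a bad warp makes it grow.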
The overall loss is a weighted sum of the three losses across all pyramid levels, as in Eq.(13):

(13)  ℓ(Θ) = Σ_{l=l_0}^{L} α_l ( β_1 ℓ^l_C + β_2 ℓ^l_S + β_3 ℓ^l_L ) + γ ||Θ||_2^2

where α_l is the factor for pyramid level l, γ is the regularization parameter, and β_1, β_2, β_3 are the scale factors for each loss, respectively. With the self-supervised loss, our model is able to learn good scene flow from pairs of 3D point clouds without any ground truth supervision.
5 Experiments
In this section, we train and evaluate our PointPWC-Net on the FlyingThings3D dataset [23] using both the supervised loss and the self-supervised loss. Then, we test the generalization ability of our model by directly evaluating it on the real-world KITTI Scene Flow dataset [26, 25] without any fine-tuning. We also compare the time efficiency of our model with previous work. Finally, we conduct ablation studies to analyze the contribution of each part of the model and the loss function.
Implementation Details. We build a 4-level feature pyramid from the input point cloud. The pyramid-level weights α_l and the regularization parameter γ are fixed and shared between supervised and self-supervised learning, and the scale factors β_1, β_2, β_3 are fixed for self-supervised learning. We train our model from an initial learning rate that is halved every 80 epochs. All the hyperparameters are set using the validation set of FlyingThings3D.
Evaluation Metrics. For fair comparison, we adopt the evaluation metrics used in [11]. Let SF_pred denote the predicted scene flow and SF_GT the ground truth scene flow. The evaluation metrics are computed as follows:

EPE3D (m): the main metric, ||SF_pred - SF_GT||_2 averaged over each point.
Acc3DS: the percentage of points whose EPE3D < 0.05m or relative error < 5%.
Acc3DR: the percentage of points whose EPE3D < 0.1m or relative error < 10%.
Outliers3D: the percentage of points whose EPE3D > 0.3m or relative error > 10%.
EPE2D (px): the 2D end point error, obtained by projecting the point clouds back onto the image.
Acc2D: the percentage of points whose EPE2D < 3px or relative error < 5%.
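The 3D metrics can be computed as below; the thresholds (0.05m/5%, 0.1m/10%, 0.3m/10%) are stated as an assumption, following the values commonly used with these metric names in the scene flow literature:

```python
import numpy as np

def scene_flow_metrics(pred, gt):
    """EPE3D plus the accuracy and outlier ratios over a point cloud.
    pred, gt: (N, 3) predicted and ground truth scene flow."""
    epe = np.linalg.norm(pred - gt, axis=1)           # per-point 3D error
    rel = epe / (np.linalg.norm(gt, axis=1) + 1e-8)   # relative error
    return {
        'EPE3D': float(epe.mean()),
        'Acc3DS': float(np.mean((epe < 0.05) | (rel < 0.05))),
        'Acc3DR': float(np.mean((epe < 0.1) | (rel < 0.1))),
        'Outliers3D': float(np.mean((epe > 0.3) | (rel > 0.1))),
    }
```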
Table 1 (Sup.: Self = self-supervised, Full = fully supervised; EPE3D, EPE2D, Outliers3D: lower is better; Acc metrics: higher is better):

Dataset  Method  Sup.  EPE3D  Acc3DS  Acc3DR  Outliers3D  EPE2D  Acc2D
FlyingThings3D  ICP [3]  Self  0.4062  0.1614  0.3038  0.8796  23.2280  0.2913
FlyingThings3D  PointPWC-Net  Self  0.1246  0.3068  0.6552  0.7032  6.6494  0.4516
FlyingThings3D  FlowNet3D [22]  Full  0.1136  0.4125  0.7706  0.6016  5.9740  0.5692
FlyingThings3D  SPLATFlowNet [34]  Full  0.1205  0.4197  0.7180  0.6187  6.9759  0.5512
FlyingThings3D  original BCL [11]  Full  0.1111  0.4279  0.7551  0.6054  6.3027  0.5669
FlyingThings3D  HPLFlowNet [11]  Full  0.0804  0.6144  0.8555  0.4287  4.6723  0.6764
FlyingThings3D  PointPWC-Net  Full  0.0588  0.7379  0.9276  0.3424  3.2390  0.7994
KITTI  ICP [3]  Self  0.5181  0.0669  0.1667  0.8712  27.6752  0.1056
KITTI  PointPWC-Net  Self  0.2549  0.2379  0.4957  0.6863  8.9439  0.3299
KITTI  FlowNet3D [22]  Full  0.1767  0.3738  0.6677  0.5271  7.2141  0.5093
KITTI  SPLATFlowNet [34]  Full  0.1988  0.2174  0.5391  0.6575  8.2306  0.4189
KITTI  original BCL [11]  Full  0.1729  0.2516  0.6011  0.6215  7.3476  0.4411
KITTI  HPLFlowNet [11]  Full  0.1169  0.4783  0.7776  0.4103  4.8055  0.5938
KITTI  PointPWC-Net  Full  0.0694  0.7281  0.8884  0.2648  3.0062  0.7673
5.1 Supervised Learning on FlyingThings3D
In order to demonstrate the effectiveness of PointPWC-Net, we conduct experiments with our supervised loss. To our knowledge, there is no publicly available large-scale real-world dataset that enables scene flow estimation from point clouds (the input to the KITTI scene flow benchmark is 2D), so we train our PointPWC-Net on the synthetic FlyingThings3D dataset. Following the preprocessing procedure in [11], we reconstruct the 3D point clouds and the ground truth scene flow to build the training and evaluation sets. The training set includes 19,640 pairs of point clouds, and the evaluation set includes 3,824 pairs. Our model takes n points in each point cloud. To speed up training, we first train the model on one quarter of the training set (4,910 pairs) and then fine-tune it on the whole training set.
Table 1 shows the quantitative evaluation results on the FlyingThings3D dataset. We compare our method with FlowNet3D [22], SPLATFlowNet [34], original BCL [11], and HPLFlowNet [11]. Our method outperforms all of them on all metrics by a large margin. Compared to FlowNet3D, our cost volume layer captures motion information better. Compared to SPLATFlowNet, original BCL, and HPLFlowNet, our method avoids the preprocessing step of building a permutohedral lattice from the input. Moreover, our method outperforms HPLFlowNet on EPE3D (0.0588 vs. 0.0804), and we are the only method with EPE2D under 4px. The first row of Fig. 4 illustrates some scene flow results with supervised learning.
5.2 Selfsupervised Learning on FlyingThings3D
Acquiring or annotating dense scene flow from real-world 3D point clouds is very expensive, so it is interesting to evaluate the performance of our self-supervised approach. We train our model using the same procedure as in supervised learning, i.e., we first train the model with one quarter of the training set and then fine-tune on the whole training set. Table 1 gives the quantitative results of PointPWC-Net with self-supervised learning. We compare our method with ICP [3]: because the iterative closest point (ICP) method iteratively revises a rigid transformation to minimize an error metric, it can be viewed as a self-supervised method, albeit with a rigidity constraint. Our PointPWC-Net outperforms ICP on all metrics by a large margin. Note as well that our method without supervision achieves 0.1246 on EPE3D, which is comparable with SPLATFlowNet (0.1205) trained with supervision. The second row of Fig. 4 illustrates some scene flow predictions with self-supervised learning.
5.3 Generalization on KITTI Scene Flow 2015
To study the generalization ability of our PointPWC-Net, we directly take the model trained on FlyingThings3D and evaluate it on KITTI Scene Flow 2015 [25, 26] without any fine-tuning. KITTI Scene Flow 2015 consists of 200 training scenes and 200 test scenes. To evaluate our PointPWC-Net, we use the ground truth labels and the raw point clouds associated with the frames, following [22, 11]. Since no point clouds and ground truth are provided for the test set, we evaluate on all 142 scenes in the training set with available point clouds. For a fair comparison with previous methods, we remove the ground from the point clouds by a height threshold, as in [11].
From Table 1, our PointPWC-Net trained with supervision outperforms all state-of-the-art methods, which demonstrates the generalization ability of our model. On EPE3D, our model achieves 0.0694, improving over HPLFlowNet (0.1169). On Acc3DS, our method (0.7281) outperforms both FlowNet3D (0.3738) and HPLFlowNet (0.4783). With self-supervised learning, our method also generalizes well, reducing EPE3D from 0.5181 (ICP) to 0.2549. One interesting result is that our PointPWC-Net trained without ground truth outperforms SPLATFlowNet trained with ground truth flow on Acc3DS. Fig. 5 illustrates some scene flow predictions on the KITTI dataset, with the supervised learning results in the first row and the self-supervised learning results in the second row.
5.4 Ablation Study
In this section, we conduct ablation studies on the model design choices and the loss function. For the model design, we evaluate different choices for the cost volume layer and the effect of removing the warping layer. For the loss function, we investigate removing the smoothness constraint and the Laplacian regularization from the self-supervised loss. All models in the ablation studies are trained on FlyingThings3D, and EPE3D is reported on the FlyingThings3D evaluation set.
Table 2 (model design):
Component  Status  EPE3D
Cost Volume  MLP+Maxpool [22]  0.0741
Cost Volume  Ours  0.0588
Warping Layer  w/o  0.0984
Warping Layer  w/  0.0588

Table 3 (self-supervised loss terms):
Chamfer  Smoothness  Laplacian  EPE3D
✓  –  –  0.2112
✓  ✓  –  0.1304
✓  ✓  ✓  0.1246
Table 4 (runtime comparison):
Method  Runtime (ms)
FlowNet3D [22]  130.8
HPLFlowNet [11]  98.4
PointPWC-Net  117.4

Table 5 (runtime per component of PointPWC-Net):
Component  Feature Pyramid  Cost Volume  Upsample + Warping  Scene Flow Predictor
Runtime (ms)  43.2  22.7  24.5  27.0
Tables 2 and 3 show the results of the ablation studies. From Table 2, our cost volume design obtains significantly better EPE3D than the flow embedding of FlowNet3D [22], and the warping layer is crucial for performance. From Table 3, both the smoothness constraint and the Laplacian regularization improve performance in self-supervised learning. In Table 4, we report the runtime of our PointPWC-Net, which is comparable with other deep-learning-based methods. Table 5 lists the runtime of each critical component of PointPWC-Net.
With the proposed self-supervised loss, we can fine-tune the FlyingThings3D-trained models on KITTI without any ground truth labels. Table 6 shows that both of our models, pretrained with the supervised and the self-supervised loss respectively, achieve much better EPE3D after fine-tuning on KITTI with the self-supervised loss.
Table 6 (fine-tuning on KITTI with self-supervision):
Sup. pretraining  Fine-tuned with self-supervision  EPE3D
✓  –  0.0694
✓  ✓  0.0415
–  –  0.2549
–  ✓  0.0457
6 Conclusion
We proposed a novel network structure, called PointPWC-Net, which predicts scene flow directly from 3D point clouds in a coarse-to-fine fashion. A crucial component is the cost volume layer based on PointConv, which improves over prior work on feature matching. Because real-world ground truth scene flow is hard to acquire, we also introduced a loss function that trains PointPWC-Net without supervision. Experiments on the FlyingThings3D and KITTI datasets demonstrate the effectiveness of both PointPWC-Net and the self-supervised loss function.
References

[1] Aseem Behl, Despoina Paschalidou, Simon Donné, and Andreas Geiger. Pointflownet: Learning representations for rigid motion estimation from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7962–7971, 2019.
[2] James R Bergen, Patrick Anandan, Keith J Hanna, and Rajesh Hingorani. Hierarchical model-based motion estimation. In European Conference on Computer Vision, pages 237–252. Springer, 1992.
 [3] Paul J Besl and Neil D McKay. Method for registration of 3d shapes. In Sensor fusion IV: control paradigms and data structures, volume 1611, pages 586–606. International Society for Optics and Photonics, 1992.
 [4] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In European conference on computer vision, pages 25–36. Springer, 2004.
 [5] Andrés Bruhn, Joachim Weickert, and Christoph Schnörr. Lucas/kanade meets horn/schunck: Combining local and global optic flow methods. International journal of computer vision, 61(3):211–231, 2005.
 [6] Rohan Chabra, Julian Straub, Christopher Sweeney, Richard Newcombe, and Henry Fuchs. Stereodrnet: Dilated residual stereonet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11786–11795, 2019.
 [7] Ayush Dewan, Tim Caselitz, Gian Diego Tipaldi, and Wolfram Burgard. Rigid scene flow for 3d lidar scans. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1765–1770. IEEE, 2016.
 [8] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 2758–2766, 2015.
 [9] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 605–613, 2017.
 [10] Fabian Groh, Patrick Wieschollek, and Hendrik PA Lensch. Flexconvolution. In Asian Conference on Computer Vision, pages 105–122. Springer, 2018.
[11] Xiuye Gu, Yijie Wang, Chongruo Wu, Yong Jae Lee, and Panqu Wang. Hplflownet: Hierarchical permutohedral lattice flownet for scene flow estimation on large-scale point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3254–3263, 2019.
[12] Pedro Hermosilla, Tobias Ritschel, Pere-Pau Vázquez, Àlvar Vinacua, and Timo Ropinski. Monte carlo convolution for learning on non-uniformly sampled point clouds. In SIGGRAPH Asia 2018 Technical Papers, page 235. ACM, 2018.
[13] Berthold KP Horn and Brian G Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.

[14] Binh-Son Hua, Minh-Khoi Tran, and Sai-Kit Yeung. Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 984–993, 2018.
[15] Frédéric Huguet and Frédéric Devernay. A variational method for scene flow estimation from stereo sequences. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–7. IEEE, 2007.
 [16] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2462–2470, 2017.
 [17] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems, pages 667–675, 2016.
[18] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 66–75, 2017.
[19] Minhaeng Lee and Charless C Fowlkes. Cemnet: Self-supervised learning for accurate continuous ego-motion estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
 [20] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on xtransformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.
 [21] Liang Liu, Guangyao Zhai, Wenlong Ye, and Yong Liu. Unsupervised learning of scene flow estimation fusing with local rigidity. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 876–882. AAAI Press, 2019.
 [22] Xingyu Liu, Charles R Qi, and Leonidas J Guibas. Flownet3d: Learning scene flow in 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 529–537, 2019.
 [23] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
 [24] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3061–3070, 2015.
 [25] Moritz Menze, Christian Heipke, and Andreas Geiger. Joint 3d estimation of vehicles and scene flow. In ISPRS Workshop on Image Sequence Analysis (ISA), 2015.
 [26] Moritz Menze, Christian Heipke, and Andreas Geiger. Object scene flow. ISPRS Journal of Photogrammetry and Remote Sensing (JPRS), 2018.
 [27] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
 [28] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017.
 [29] Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4161–4170, 2017.
 [30] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016.
 [31] Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Epicflow: Edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1164–1172, 2015.
 [32] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3693–3702, 2017.
 [33] Olga Sorkine. Laplacian mesh processing. In Eurographics (STARs), pages 53–70, 2005.
 [34] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. Splatnet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2530–2539, 2018.
 [35] Deqing Sun, Stefan Roth, and Michael J Black. Secrets of optical flow estimation and their principles. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 2432–2439. IEEE, 2010.
 [36] Deqing Sun, Stefan Roth, and Michael J Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. International Journal of Computer Vision, 106(2):115–137, 2014.
 [37] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
 [38] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and QianYi Zhou. Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3887–3896, 2018.
 [39] Arash K Ushani and Ryan M Eustice. Feature learning for scene flow estimation from lidar. In Conference on Robot Learning, pages 283–292, 2018.
 [40] Arash K Ushani, Ryan W Wolcott, Jeffrey M Walls, and Ryan M Eustice. A learning approach for real-time temporal scene flow estimation from lidar data. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 5666–5673. IEEE, 2017.
 [41] Sundar Vedula, Simon Baker, Peter Rander, Robert Collins, and Takeo Kanade. Three-dimensional scene flow. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 722–729. IEEE, 1999.
 [42] Nitika Verma, Edmond Boyer, and Jakob Verbeek. Feastnet: Feature-steered graph convolutions for 3d shape analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2598–2606, 2018.
 [43] Christoph Vogel, Konrad Schindler, and Stefan Roth. Piecewise rigid scene flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 1377–1384, 2013.
 [44] Christoph Vogel, Konrad Schindler, and Stefan Roth. 3d scene flow estimation with a piecewise rigid scene model. International Journal of Computer Vision, 115(1):1–28, 2015.
 [45] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–67, 2018.
 [46] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2589–2597, 2018.
 [47] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):146, 2019.
 [48] Wenxuan Wu, Zhongang Qi, and Li Fuxin. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9621–9630, 2019.
 [49] Jia Xu, René Ranftl, and Vladlen Koltun. Accurate optical flow via direct cost volume processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1289–1297, 2017.
 [50] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1983–1992, 2018.
 [51] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision (ECCV), pages 36–53, 2018.