1 Introduction
Scene flow estimation [37] aims to provide dense or semi-dense 3D vectors representing the per-point 3D motion between two consecutive frames. This information has proven invaluable for analyzing dynamic scenes. Although significant advances have been made in 2D optical flow, the counterpart on 3D point clouds is far more challenging. This is partly due to the irregularity and sparsity of the data, but also due to the diversity of scenes.
As pointed out in [17], most of the structures in the visual world are rigid or at least nearly so. Many previous top-performing approaches [13, 5, 45, 26] simplify this task as a regression problem by estimating a point-wise translational motion. Although promising results have been achieved, the performance is far from satisfactory because the potential rigid-motion constraints within each local region are ignored. As shown in Fig. 1(a), the results generated by FlowNet3D [13] are deformed and fail to maintain local geometric smoothness. A straightforward remedy is to utilize pairwise regularization to smooth the flow prediction. However, ignoring the potential rigid transformations makes it hard to maintain the underlying spatial structure, as presented in Fig. 1(b).
To address this issue, we propose a novel framework termed HCRF-Flow, which consists of two components: a position-aware flow estimation module (PAFE) for per-point translational motion regression and a continuous high-order CRFs module (Con-HCRFs) that refines the per-point predictions by considering both spatial-smoothness and rigid-transformation constraints. Specifically, in Con-HCRFs, a pairwise term is designed to encourage neighboring points with similar local structure to have similar motions. In addition, a novel high-order term is designed to encourage each point in a local region to take a motion obeying the rigid motion parameters, i.e., translation and rotation parameters, shared within this region.
In point cloud scene flow estimation, it is challenging to aggregate the matching costs, which are calculated by comparing one point with its softly corresponding points. To encode this knowledge into the embedding features, we propose a position-aware flow embedding layer in the PAFE module. In the aggregation step, we introduce a pseudo matching pair that is used to calculate matching-cost differences. For each softly corresponding pair, both its position information and its matching-cost difference are considered when producing the aggregation weights.
Our main contributions can be summarized as follows:

We propose a novel scene flow estimation framework, HCRF-Flow, which combines the strengths of DNNs and CRFs to perform per-point translational motion regression followed by refinement with both pairwise and region-level regularization;

Formulating the rigid motion constraints as a high-order term, we propose continuous high-order CRFs (Con-HCRFs) to model the interaction of points by imposing point-level and region-level consistency constraints;

We present a novel position-aware flow embedding layer to build reliable matching costs and aggregate them based on both position information and matching-cost differences;

Our proposed HCRF-Flow significantly outperforms the state of the art on both the FlyingThings3D and KITTI Scene Flow 2015 datasets. In particular, we achieve Acc3DR scores of 95.07% and 94.44% on FlyingThings3D and KITTI, respectively.
1.1 Related work
Scene flow from RGB or RGB-D images Scene flow was first proposed in [37] to represent the three-dimensional motion field of points in a scene. Many works [8, 25, 32, 36, 38, 39, 40, 20, 16, 29, 6] try to recover scene flow from stereo RGB images or monocular RGB-D images. The local rigidity assumption has been applied to scene flow estimation from images: [39, 40, 20, 16] directly predict the rigidity parameters of each local region to produce scene flow estimates, while [38, 29, 6] add a rigidity term into the energy function to constrain the estimation. Our method differs from these works in the following aspects: 1) our method formulates the rigidity constraint as a high-order term in Con-HCRFs, which encourages region-level rigidity of the point-wise scene flow rather than directly computing rigidity parameters; thus, our Con-HCRFs can easily be added to other point cloud scene flow estimation methods as a plug-in module to improve the rigidity of their predictions; 2) our method targets irregular and unordered point cloud data instead of well-organized 2D images.
Deep scene flow from point clouds Some approaches [4, 35] estimate scene flow from point clouds via traditional techniques. Recently, inspired by the success of deep learning on point clouds, more works [5, 1, 13, 45, 26, 23, 14, 42] have employed DNNs in this field. [13] estimates scene flow based on PointNet++ [28]. [5] proposes a sparse convolution architecture for scene flow learning, [45] designs a coarse-to-fine scene flow estimation framework, and [26] estimates the point translation by point matching. Despite achieving impressive performance, these methods neglect rigidity constraints and estimate each point's motion independently. Although the rigid motion of each point is computed in [1], the per-point rigid parameters are independently regressed by a DNN without fully considering geometric constraints. Unlike previous methods, we design novel Con-HCRFs to explicitly model both spatial-smoothness and rigid-motion constraints.
Deep learning on 3D point clouds Many works [27, 28, 34, 18, 44, 15, 41, 7] focus on learning directly on raw point clouds. PointNet [27] and PointNet++ [28] are the pioneering works, which use shared multi-layer perceptrons (MLPs) to extract features and max pooling to aggregate them. [7, 41] use the attention mechanism to produce aggregation weights. [7, 15] encode shape features from local geometric clues to improve feature extraction. Inspired by these works, we propose a position-aware flow embedding layer that dynamically aggregates matching costs based on both position representations and matching-cost differences for better matching-cost aggregation.
Conditional random fields (CRFs) CRFs are a class of probabilistic graphical models widely used to model the effects of interactions among examples in numerous vision tasks [10, 12, 2, 48, 46]. In point cloud processing, previous works [47, 33, 3] apply CRFs to discrete labeling tasks for spatial smoothness. In contrast to the CRFs in these works, the variables of our Con-HCRFs are defined in a continuous domain, and two different relations are modeled at the point level and the region level.
2 HCRF-Flow
2.1 Overview
In the task of point cloud scene flow estimation, the inputs are two point clouds at consecutive frames: P = {p_i ∈ R³} at frame t and Q = {q_j ∈ R³} at frame t+1, where p_i and q_j are the 3D coordinates of individual points. Our goal is to predict the 3D displacement d_i for each point p_i in P, which describes its motion from frame t to frame t+1. Unless otherwise stated, we use boldfaced uppercase and lowercase letters to denote matrices and column vectors, respectively.
As shown in Fig. 2, HCRF-Flow consists of two components: a PAFE module for per-point flow estimation and a Con-HCRFs module for refinement. For the PAFE module, we try two different architectures: a single-scale architecture following FlowNet3D [13] and a pyramid architecture following PointPWC-Net [45]. To mix the two point clouds, we propose a novel position-aware flow embedding layer in the PAFE module that builds reliable matching costs and aggregates them into flow embeddings encoding the motion information. For better aggregation, we use position information and matching-cost differences as clues to generate the aggregation weights. Sec. 2.2 introduces the details of this layer. In the Con-HCRFs module, we propose novel continuous high-order CRFs to refine the coarse scene flow by encouraging both point-level and region-level consistency. More details are given in Sec. 3.
2.2 Position-aware flow embedding layer
As shown in Fig. 3, the position-aware flow embedding layer aims to produce a flow embedding for each point in P. For each point p_i, we first find its K neighbouring points in frame t+1. Then, following [13], the matching cost between point p_i and a softly corresponding point q_j in Q is computed as:

(1)  c_{i,j} = h(f_i ⊕ g_j ⊕ (q_j − p_i)),

where f_i and g_j are the features of p_i and q_j, respectively, ⊕ denotes concatenation, and h applies an MLP to the concatenation of its inputs. After obtaining the matching costs for point p_i, two sub-branches, a position encoding unit and a pseudo-pairing unit, follow to produce weights for aggregation, as shown in Fig. 3.
Pseudo-pairing unit When aggregating the matching costs, this unit is designed to automatically select prominent ones by assigning them larger weights. To this end, we compare each matching pair with a pseudo stationary pair and use the difference as a clue to measure the importance of that matching pair. The pseudo stationary pair represents the situation in which the point does not move, so its softly corresponding point is itself. Based on Eq. 1, the matching cost of the pseudo stationary pair for each point p_i can be defined as:

(2)  c_i^s = h(f_i ⊕ f_i ⊕ 0).

The matching-cost difference between each matching pair and this pseudo pair can then be expressed as:

(3)  Δc_{i,j} = c_{i,j} − c_i^s.
In the subsequent aggregation procedure, the matching-cost difference is treated as a feature for producing the aggregation weight of each matching cost.
Position encoding unit Further, to improve the aggregation, we incorporate position representations into the aggregation procedure as a significant factor in producing the soft weights. Specifically, inspired by [31] and [7], for each matching pair (p_i, q_j), we utilize the 3D Euclidean distance and the absolute and relative coordinates as position information to encode the position representation r_{i,j}, which can be expressed as:

(4)  r_{i,j} = M(p_i ⊕ q_j ⊕ (q_j − p_i) ⊕ ‖q_j − p_i‖),

where M is an MLP that maps the position information into the position representation, ⊕ is the concatenation operation, and ‖·‖ computes the Euclidean distance between the two points.
Given the matching-cost difference and the position representation, we design a shared function G to produce a unique weight vector for each matching cost. Specifically, G is composed of an MLP followed by a softmax operation that normalizes the weights across all matching costs in a set. The normalized weight for each matching cost can be written as:

(5)  w_{i,j} = softmax_j( G(Δc_{i,j} ⊕ r_{i,j}) ).
Therefore, according to the learned aggregation weights, the final flow embedding for each point p_i can be expressed as:

(6)  e_i = Σ_{q_j ∈ N(p_i)} w_{i,j} ⊙ c_{i,j},

where ⊙ is element-wise multiplication and N(p_i) is the set of softly corresponding points for point p_i in the next frame.
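As a minimal sketch, the layer described by Eqs. 1–6 can be written as follows. Each learned MLP (h, M, and the weight function G) is replaced by a single linear layer with a ReLU, and all array shapes are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def mlp(x, w):
    """Stand-in for a learned shared MLP: one linear layer followed by ReLU."""
    return np.maximum(x @ w, 0.0)

def position_aware_flow_embedding(xi, fi, ys, gs, w_cost, w_pos, w_agg):
    """Aggregate matching costs for one point (Eqs. 1-6, simplified).
    xi: (3,) point in frame t; fi: (C,) its feature.
    ys: (K, 3) softly corresponding points in frame t+1; gs: (K, C) their features.
    """
    K = ys.shape[0]
    # Eq. 1: matching cost for each soft correspondence
    costs = mlp(np.concatenate([np.tile(fi, (K, 1)), gs, ys - xi], axis=1), w_cost)
    # Eq. 2: pseudo stationary pair (the point matched with itself, zero offset)
    pseudo = mlp(np.concatenate([fi, fi, np.zeros(3)]), w_cost)
    # Eq. 3: matching-cost difference as an importance clue
    diffs = costs - pseudo
    # Eq. 4: position representation from absolute/relative coords and distance
    dists = np.linalg.norm(ys - xi, axis=1, keepdims=True)
    pos = mlp(np.concatenate([np.tile(xi, (K, 1)), ys, ys - xi, dists], axis=1), w_pos)
    # Eq. 5: per-cost weight vectors, softmax-normalized over the K candidates
    logits = mlp(np.concatenate([diffs, pos], axis=1), w_agg)
    weights = np.exp(logits - logits.max(axis=0))
    weights /= weights.sum(axis=0, keepdims=True)
    # Eq. 6: weighted aggregation into the final flow embedding
    return (weights * costs).sum(axis=0)
```

Note that the softmax normalizes each channel of the weight vector independently across the K candidates, matching the "unique weight vector for each matching cost" in Eq. 5.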
3 Continuous High-Order CRFs
In this section, we introduce the details of our continuous high-order CRFs. We first formulate the problem of scene flow refinement. Then, we describe the three kinds of potential functions involved in the Con-HCRFs. Lastly, we discuss how to utilize mean-field theory to approximate the Con-HCRFs distribution and obtain the final iterative inference algorithm.
3.1 Overview
Consider a point cloud P with N points, indexed i = 1, …, N. In scene flow refinement, we attempt to assign every point a refined 3D displacement based on the initial scene flow produced by the PAFE module. Let D = [d_1, …, d_N] be a matrix of 3D displacements corresponding to all points in point cloud P, where each d_i ∈ R³. Following [12, 30], we model the conditional probability distribution with the following density function:

Pr(D | P) = (1/Z(P)) exp(−E(D, P)).

Here E(D, P) is the energy function and Z(P) = ∫ exp(−E(D, P)) dD is the partition function. Different from conventional CRFs, the Con-HCRFs proposed in this paper contain a novel high-order potential that imposes rigid-motion constraints in each local region. Specifically, the energy function is defined as:

(7)  E(D, P) = Σ_i ψ_u(d_i) + Σ_i Σ_{j ∈ N_i} ψ_p(d_i, d_j) + Σ_{r ∈ R} Σ_{i ∈ r} ψ_h(d_i, D_{r∖i}),

where N_i represents the set of neighboring points of center point i; R represents the set of rigid regions in the whole point cloud; and D_{r∖i} is a matrix composed of the scene flow of the points belonging to region r with point i excluded (the point set r without point i is denoted r∖i). The unary term encourages the refined scene flow to be consistent with the initial scene flow. The pairwise term encourages neighboring points with similar local structure to take similar displacements. The high-order term encourages points belonging to the same rigid region to share the same rigid motion parameters. In this paper, we use an over-segmentation method to segment the entire point cloud into a series of supervoxels and treat each supervoxel as a rigid region in the high-order term. An illustration is shown in Fig. 4. We drop the conditioning on P in the rest of this paper for convenience.
3.2 Potential functions
Unary potential The unary potential is constructed from the initial scene flow using the ℓ2 norm:

(8)  ψ_u(d_i) = ‖d_i − z_i‖²,

where z_i represents the initial 3D displacement at point i produced by the PAFE module, and ‖·‖ denotes the ℓ2 norm of a vector.
Pairwise potential The pairwise potential is constructed from M types of similarity observations to describe the relation between pairs of hidden variables d_i and d_j:

(9)  ψ_p(d_i, d_j) = Σ_{m=1}^{M} β_m w_{ij}^(m) ‖d_i − d_j‖²,

where w_{ij}^(m) is a weight specifying the relation between points i and j, and β_m denotes the coefficient for each similarity measure. Specifically, we set the weight with a Gaussian kernel, w_{ij}^(m) = exp(−‖f_i^(m) − f_j^(m)‖² / (2θ_m²)), where f_i^(m) and f_j^(m) indicate the features of neighboring points i and j associated with similarity measure m, and θ_m is the kernel's bandwidth parameter. In this paper, we use point position and surface normal as the observations to construct two Gaussian kernels (M = 2).
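As a concrete sketch, the two Gaussian kernels can be computed as follows; the bandwidth values here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def pairwise_kernel_weights(positions, normals, theta_pos=0.05, theta_nrm=0.3):
    """Gaussian-kernel weights w_ij^(m) for the two similarity measures
    (point position and surface normal), over all N x N point pairs."""
    def kernel(feats, theta):
        diff = feats[:, None, :] - feats[None, :, :]   # (N, N, 3) pairwise differences
        return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * theta ** 2))
    return kernel(positions, theta_pos), kernel(normals, theta_nrm)
```

Points that are close and share a similar normal receive a weight near 1, so their displacements are pulled together most strongly by Eq. 9.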
High-order potential For the high-order potential term, we want to exploit the interactions among points in a supervoxel. According to the rigid-motion constraint, the high-order potential term in the CRF can be defined as:

(10)  ψ_h(d_i, D_{r∖i}) = γ ‖d_i − g(D_{r∖i})‖²,

where g(D_{r∖i}) is a displacement produced by a function g: shared rigid motion parameters are computed by g from D_{r∖i}, and the displacement for point i obeying the shared parameters is obtained by applying them back to the original position p_i; γ is a coefficient. In the following, we give details of the computation of g(D_{r∖i}).
In a rigid region r, given point i, we denote the points in region r not containing point i as P_{r∖i} and the corresponding 3D displacements as D_{r∖i}. The warped positions in the next frame can be obtained by adding the scene flow back to the corresponding positions in frame t:

(11)  Q_{r∖i} = P_{r∖i} + D_{r∖i}.

The possible rigid transformation from P_{r∖i} to Q_{r∖i} can be defined by (R, t), where R ∈ SO(3) and t ∈ R³. Inspired by the work [43] in point cloud registration, we can minimize the mean-squared error to find the most suitable rigid motion parameters describing the motion:

(12)  e²(R, t) = (1/n_{r∖i}) Σ_{k ∈ r∖i} ‖R p_k + t − q_k‖²,

(13)  (R*, t*) = argmin_{R, t} e²(R, t),

where n_{r∖i} is the number of points in region r with point i excluded.
Define the centers of P_{r∖i} and Q_{r∖i} as p̄ and q̄, respectively. Then the cross-covariance matrix can be written as H = Σ_{k ∈ r∖i} (p_k − p̄)(q_k − q̄)ᵀ. Using the singular value decomposition (SVD) to decompose H = U S Vᵀ, we can obtain the closed-form solutions:

(14)  R* = V Uᵀ,  t* = q̄ − R* p̄.

Treating the most suitable parameters (R*, t*) as the rigid motion parameters shared by all points in region r, the displacement that satisfies the rigid-motion constraints for point i is given by:

(15)  g(D_{r∖i}) = R* p_i + t* − p_i.
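The closed-form fit of Eqs. 12–15 can be sketched as follows. The determinant guard against reflections is the standard safeguard from rigid registration and is an assumption here, not something stated in the text.

```python
import numpy as np

def fit_rigid_motion(P, Q):
    """Least-squares rigid transform (R*, t*) mapping P to Q, computed in
    closed form via SVD of the cross-covariance matrix (cf. Eqs. 12-14).
    P, Q: (n, 3) arrays of corresponding points, with Q = P + scene flow."""
    p_bar, q_bar = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_bar).T @ (Q - q_bar)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection solution (enforce det(R) = +1)
    S = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])
    R = Vt.T @ S @ U.T
    t = q_bar - R @ p_bar
    return R, t

def rigid_flow(P, R, t):
    """Displacement obeying the shared rigid parameters (cf. Eq. 15): R p + t - p."""
    return P @ R.T + t - P
```

Given exact correspondences generated by a rigid motion, the fit recovers the rotation and translation up to numerical precision.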
3.3 Inference
In order to produce the most probable scene flow, we should solve the MAP inference problem D* = argmax_D Pr(D | P). Following [30], we approximate the original conditional distribution by mean-field theory [9]. Thus, the distribution Pr(D | P) is approximated by a product of independent marginals, i.e., Q(D) = Π_i Q_i(d_i). Minimizing the KL-divergence between Q and Pr(D | P) yields the solution log Q_i(d_i) = E_{j≠i}[log Pr(D | P)] + const, where the expectation is taken under the distributions Q_j over all variables d_j with j ≠ i. Following [30], we represent each Q_i as a multivariate normal distribution; the mean-field updates for the normalization parameter and the mean can then be written as:

(16)  σ_i² = 1 / (2(1 + 2 Σ_{j∈N_i} Σ_m β_m w_{ij}^(m) + γ)),

(17)  μ_i = 2σ_i² (z_i + 2 Σ_{j∈N_i} Σ_m β_m w_{ij}^(m) μ_j + γ g(M_{r∖i})),

where M = [μ_1, …, μ_N] collects the means of all points and σ_i² is the diagonal element of the covariance of Q_i. The detailed derivation of the inference algorithm can be found in the supplementary material. We observe that a supervoxel usually contains hundreds of points, which makes the rigid parameters computed on all points in the supervoxel excluding point i very close to those computed on all points in the supervoxel, i.e., g(M_{r∖i}) is very close to g(M_r). Thus, in practice, we approximate g(M_{r∖i}) in Eq. 17 with g(M_r), and the approximated mean is:

(18)  μ_i = 2σ_i² (z_i + 2 Σ_{j∈N_i} Σ_m β_m w_{ij}^(m) μ_j + γ g(M_r)).
After this approximation, we only need to calculate one set of rigid motion parameters per supervoxel rather than per point, which greatly reduces the time complexity.
In the MAP inference, since we approximate Pr(D | P) with Q(D), an estimate of each d_i can be obtained as the expected value of the Gaussian distribution Q_i:

(19)  d_i* = μ_i.

The inference procedure of our Con-HCRFs is sketched in Algorithm 1.
Moreover, thanks to the differentiable SVD function provided by PyTorch [24], the mean-field update operation is differentiable in our inference procedure. Therefore, following [49, 46], our mean-field algorithm can be fully integrated with deep learning models, which enables end-to-end training of the whole framework.
4 Experiments
In this section, we first train and evaluate our method on the synthetic FlyingThings3D dataset in Sec. 4.1, and then in Sec. 4.2 we test the generalization ability of our method on the real-world KITTI dataset without fine-tuning. In Sec. 4.3, we validate the generality of our Con-HCRFs on other networks. Finally, we conduct ablation studies to analyze the contribution of each component in Sec. 4.4. Note that in the following experiments there are two different architectures of the PAFE module: the single-scale one, denoted PAFE-S, and the pyramid one, denoted PAFE. The corresponding HCRF-Flow models are denoted HCRF-Flow-S and HCRF-Flow, respectively.
Evaluation metrics. Let d_i denote the predicted scene flow and d_i^gt the ground-truth scene flow. The evaluation metrics are computed as follows.
EPE3D (m): the main metric, ‖d_i − d_i^gt‖ averaged over all points. Acc3DS (%): the percentage of points whose EPE3D < 0.05m or relative error < 5%. Acc3DR (%): the percentage of points whose EPE3D < 0.1m or relative error < 10%. Outliers3D (%): the percentage of points whose EPE3D > 0.3m or relative error > 10%. EPE2D (px): the 2D end-point error, a common metric for optical flow. Acc2D (%): the percentage of points whose EPE2D < 3px or relative error < 5%.
4.1 Results on FlyingThings3D
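Before presenting results, the 3D metrics defined above can be computed as follows; the thresholds are the standard values quoted in the metric definitions, and the small epsilon guarding the relative error is an implementation assumption.

```python
import numpy as np

def scene_flow_metrics(pred, gt):
    """Compute EPE3D / Acc3DS / Acc3DR / Outliers3D for (N, 3) predicted
    and ground-truth scene flow arrays."""
    epe = np.linalg.norm(pred - gt, axis=1)            # per-point end-point error
    rel = epe / (np.linalg.norm(gt, axis=1) + 1e-12)   # relative error
    return {
        "EPE3D": float(epe.mean()),
        "Acc3DS": 100.0 * float(np.mean((epe < 0.05) | (rel < 0.05))),
        "Acc3DR": 100.0 * float(np.mean((epe < 0.1) | (rel < 0.1))),
        "Outliers3D": 100.0 * float(np.mean((epe > 0.3) | (rel > 0.1))),
    }
```

A perfect prediction gives EPE3D of 0, accuracy of 100%, and no outliers.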
FlyingThings3D [19] is a large-scale synthetic dataset. We follow [5] to build the training and test sets. Our method takes 8,192 points in each point cloud as input. We train our models on one quarter of the training set (4,910 pairs) and evaluate on the whole test set (3,824 pairs).
Referring to PointPWC-Net [45], we build a pyramid PAFE module, PAFE, and the corresponding HCRF-Flow framework. Note that, compared with the original architecture in [45], there are three adjustments in our PAFE: 1) we replace the MLPs in level 0 with a set conv layer [13]; 2) we replace all PointConv layers [44] with set conv layers [13]; 3) we replace the cost volume layers [45] with our position-aware flow embedding layers. In Con-HCRFs, we utilize the algorithm proposed in [11] for supervoxel segmentation. During training, we first train our PAFE with the multi-scale loss function used in [45]. Then we add the Con-HCRFs to PAFE for fine-tuning. More implementation details are in the supplementary material.
The quantitative evaluation results on FlyingThings3D are shown in Table 1. We compare our method with four baseline models: FlowNet3D [13], HPLFlowNet [5], PointPWC-Net [45], and FLOT [26]. As shown in Table 1, our PAFE module outperforms all four methods. Further, adding Con-HCRFs and fine-tuning on FlyingThings3D, the final method, HCRF-Flow, achieves the best performance on all metrics. Qualitative results are shown in Fig. 5.


Table 1. Evaluation results on FlyingThings3D and KITTI.

Dataset         Method              EPE3D   Acc3DS  Acc3DR  Outliers3D  EPE2D   Acc2D
FlyingThings3D  FlowNet3D [13]      0.0886  41.63   81.61   58.62       4.7142  60.10
                HPLFlowNet [5]      0.0804  61.44   85.55   42.87       4.6723  67.64
                PointPWC-Net [45]   0.0588  73.79   92.76   34.24       3.2390  79.94
                FLOT [26]           0.0520  73.20   92.70   35.70       -       -
                Ours (PAFE module)  0.0535  78.90   94.93   30.51       2.8253  83.46
                Ours (HCRF-Flow)    0.0488  83.37   95.07   26.14       2.5652  87.04
KITTI           FlowNet3D [13]      0.1069  42.77   79.78   41.38       4.3424  57.51
                HPLFlowNet [5]      0.1169  47.83   77.76   41.03       4.8055  59.38
                PointPWC-Net [45]   0.0694  72.81   88.84   26.48       3.0062  76.73
                FLOT [26]           0.0560  75.50   90.80   24.20       -       -
                Ours (PAFE module)  0.0646  80.29   93.47   20.24       2.4829  80.80
                Ours (HCRF-Flow)    0.0531  86.31   94.44   17.97       2.0700  86.56

4.2 Generalization results on KITTI
KITTI Scene Flow 2015 [22, 21] is a well-known dataset for 3D scene flow estimation. In this section, in order to evaluate the generalization ability of our method, we train our model on the FlyingThings3D dataset and test on KITTI Scene Flow 2015 without fine-tuning. The desired supervoxel size and the bandwidth parameters for KITTI are the same as those for FlyingThings3D.
Following [13, 5], we evaluate on all 142 scenes in the training set and remove the ground points by height for a fair comparison. The quantitative evaluation results on KITTI are shown in Table 1. Our method outperforms the competing methods, demonstrating its good generalization ability on real-world data. Fig. 5 shows the qualitative results.
4.3 Generality of Con-HCRFs on other models
In this section, we study the generality of Con-HCRFs by applying it to other scene flow estimation models as a post-processing module. We evaluate our proposed Con-HCRFs with FlowNet3D [13] and FLOT [26], which have shown strong capability on both challenging synthetic data from FlyingThings3D and real LiDAR scans from KITTI. The results are presented in Table 2. Although built upon strong baselines, our proposed Con-HCRFs boost the performance of each baseline by a large margin on both datasets, demonstrating strong robustness and generality.


Table 2. Generality of Con-HCRFs on other models.

Dataset         Method          Acc3DS  ΔAcc3DS
FlyingThings3D  FlowNet3D [13]  41.63   0.00
                + Con-HCRFs     47.01   + 5.38
                FLOT [26]       73.20   0.00
                + Con-HCRFs     78.63   + 5.43
KITTI           FlowNet3D [13]  42.77   0.00
                + Con-HCRFs     46.90   + 4.13
                FLOT [26]       75.50   0.00
                + Con-HCRFs     85.44   + 9.94

4.4 Ablation studies
In this section, we provide a detailed analysis of each component in our method. All experiments are conducted on the FlyingThings3D dataset. Besides the pyramid models, PAFE and HCRF-Flow, for a comprehensive analysis we also evaluate each component in the single-scale models, PAFE-S and HCRF-Flow-S, which are designed following FlowNet3D [13].
Ablation for the position-aware flow embedding layer. We explore the effect of the aggregation strategy in our position-aware flow embedding layer. This strategy is introduced to dynamically aggregate the matching costs based on position information and matching-cost differences. As shown in Table 3, for both baselines, applying the pseudo-pairing unit and the position encoding unit to the flow embedding layer improves Acc3DS by around 8 percentage points. Moreover, to verify the effectiveness of the pseudo pair, we design a naive dynamic aggregation unit, denoted NDA, which directly produces weights from the matching costs rather than from the matching-cost difference between each matching pair and the pseudo pair. As shown in Table 3, after replacing PP with NDA, the improvement in Acc3DS decreases from 6.79 to 2.62 points. Thus, the pseudo-pairing unit is the better choice for this task.


Table 3. Ablation for the position-aware flow embedding layer (PP: pseudo-pairing unit, PE: position encoding unit, NDA: naive dynamic aggregation unit).

Method                     Acc3DS  ΔAcc3DS
Single-scale baseline      41.63   0.00
+ NDA                      44.25   + 2.62
+ PP                       48.42   + 6.79
+ PP + PE (PAFE-S module)  50.08   + 8.45
Pyramid baseline           69.94   0.00
+ PP + PE (PAFE module)    78.90   + 8.96

Ablation for Con-HCRFs. To ensure the spatial smoothness and local rigidity of the final predictions, we propose continuous CRFs with a novel high-order term. The ablation results for Con-HCRFs are presented in Table 4. With only a pairwise term added, denoted (Unary+Pair), the performance gains a slight improvement, because the pairwise term targets spatial smoothness but ignores the potential rigid-motion constraints. Our proposed Con-HCRFs module, which formulates the rigid-motion constraints as its high-order term, boosts the performance by a large margin for both the PAFE-S and PAFE modules. After jointly optimizing the Con-HCRFs and PAFE modules, we observe a further improvement.
Can we replace the rigid motion constraints with region-level smoothness in a supervoxel? We want to explore whether the rigid-motion constraint is a good approach to model the relations among points in a supervoxel. Instead of sharing unique rigid motion parameters, a straightforward alternative is to encourage the points within a rigid region to share the same motion, i.e., to encourage region-level smoothness in a supervoxel. To this end, we design a naive regional term, ψ_n(d_i) = ‖d_i − d̄_r‖², where d̄_r is the average displacement over all points in region r. The results are shown in Table 4, denoted (Unary+Pair+naive Region). As it only enforces spatial smoothness within a region and fails to model suitable dependencies among the points of the rigid region, this variant of the CRFs is ineffective and even worsens the performance. In contrast, applying our proposed Con-HCRFs yields significant improvements in the final scene flow.


Table 4. Ablation for Con-HCRFs ("jointly optimized" denotes joint fine-tuning of the Con-HCRFs and PAFE modules).

Method                                                  Acc3DS  ΔAcc3DS
PAFE-S module                                           50.08   0.00
+ (Unary+Pair)                                          50.24   + 0.16
+ (Unary+Pair+naive Region)                             48.10   − 1.98
+ (Unary+Pair+High-order)/Con-HCRFs                     54.51   + 4.43
+ (Unary+Pair+High-order)/Con-HCRFs, jointly optimized  56.29   + 6.21
PAFE module                                             78.90   0.00
+ (Unary+Pair+High-order)/Con-HCRFs                     81.39   + 2.49
+ (Unary+Pair+High-order)/Con-HCRFs, jointly optimized  83.37   + 4.47



Table 5. Average runtime of each component of Con-HCRFs.

Component  Supervoxel  Pairwise term  High-order term  Total
Time (ms)  115.1       12.3           100.8            228.2

Speed analysis of Con-HCRFs Table 5 reports the average runtime of each component of Con-HCRFs, tested on a single GTX 1080 Ti GPU. As shown in Table 5, Con-HCRFs take about 0.2s to process a scene with 8,192 points, similar to DenseCRF [10], which also takes about 0.2s to process a 320×213 image. Additionally, owing to the approximate computation applied in the high-order term, this term takes only about 0.1s per scene. In contrast, its runtime would increase dramatically from 0.1s to 14s if the rigid motion parameters were calculated for each point instead of each supervoxel. This large gap shows that the approximation discussed in Sec. 3.3 significantly boosts the efficiency of our Con-HCRFs.


Table 6. Impact of the desired point number per supervoxel (EPE3D).

Desired point number  80      100     140     200     PAFE-S
EPE3D                 0.0804  0.0788  0.0782  0.0790  0.0815

Impact of supervoxel sizes. To illustrate the sensitivity to supervoxel sizes, we test our method with supervoxels of different desired point numbers. As shown in Table 6, the method achieves the best performance when the desired point number of each supervoxel is in the range of 140 to 200.
5 Conclusions
In this paper, we have proposed a novel point cloud scene flow estimation method, termed HCRF-Flow, which incorporates the strengths of DNNs and CRFs to perform translational motion regression on each point and refine the result with both pairwise and region-level regularization. Formulating the rigid motion constraints as a high-order term, we proposed a novel high-order CRF-based relation module (Con-HCRFs) considering both point-level and region-level consistency. In addition, we designed a position-aware flow embedding layer for better matching-cost aggregation. Experimental results on the FlyingThings3D and KITTI datasets show that our proposed method performs favorably against competing methods. We have also shown the generality of our Con-HCRFs on other point cloud scene flow estimation methods.
6 Acknowledgements
This research was conducted in collaboration with SenseTime. This work is supported by A*STAR through the Industry Alignment Fund - Industry Collaboration Projects Grant. This work is also supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-RP-2018-003), and the MOE Tier-1 research grants: RG28/18 (S) and RG22/19 (S).
References

[1] Aseem Behl, Despoina Paschalidou, Simon Donné, and Andreas Geiger. PointFlowNet: Learning representations for rigid motion estimation from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7962–7971, 2019.
[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[3] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019.
[4] Ayush Dewan, Tim Caselitz, Gian Diego Tipaldi, and Wolfram Burgard. Rigid scene flow for 3D LiDAR scans. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1765–1770. IEEE, 2016.
 [5] Xiuye Gu, Yijie Wang, Chongruo Wu, Yong Jae Lee, and Panqu Wang. Hplflownet: Hierarchical permutohedral lattice flownet for scene flow estimation on largescale point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3254–3263, 2019.
 [6] Michael Hornacek, Andrew Fitzgibbon, and Carsten Rother. Sphereflow: 6 dof scene flow from rgbd pairs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3526–3533, 2014.
 [7] Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki Trigoni, and Andrew Markham. Randlanet: Efficient semantic segmentation of largescale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11108–11117, 2020.
 [8] Frédéric Huguet and Frédéric Devernay. A variational method for scene flow estimation from stereo sequences. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–7. IEEE, 2007.
 [9] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
 [10] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems, pages 109–117, 2011.
 [11] Yangbin Lin, Cheng Wang, Dawei Zhai, Wei Li, and Jonathan Li. Toward better boundary preserved supervoxel segmentation for 3d point clouds. ISPRS journal of photogrammetry and remote sensing, 143:39–47, 2018.
 [12] Fayao Liu, Chunhua Shen, and Guosheng Lin. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5162–5170, 2015.
 [13] Xingyu Liu, Charles R Qi, and Leonidas J Guibas. Flownet3d: Learning scene flow in 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 529–537, 2019.
 [14] Xingyu Liu, Mengyuan Yan, and Jeannette Bohg. Meteornet: Deep learning on dynamic 3d point cloud sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 9246–9255, 2019.
 [15] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relationshape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8895–8904, 2019.
 [16] WeiChiu Ma, Shenlong Wang, Rui Hu, Yuwen Xiong, and Raquel Urtasun. Deep rigid instance scene flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3614–3622, 2019.
 [17] D Man and A Vision. A computational investigation into the human representation and processing of visual information, 1982.
 [18] Jiageng Mao, Xiaogang Wang, and Hongsheng Li. Interpolated convolutional networks for 3d point cloud understanding. In Proceedings of the IEEE International Conference on Computer Vision, pages 1578–1587, 2019.
 [19] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
 [20] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3061–3070, 2015.
 [21] Moritz Menze, Christian Heipke, and Andreas Geiger. Joint 3d estimation of vehicles and scene flow. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, 2, 2015.
 [22] Moritz Menze, Christian Heipke, and Andreas Geiger. Object scene flow. ISPRS Journal of Photogrammetry and Remote Sensing, 140:60–76, 2018.
 [23] Himangi Mittal, Brian Okorn, and David Held. Just go with the flow: Self-supervised scene flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11177–11185, 2020.
 [24] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
 [25] Jean-Philippe Pons, Renaud Keriven, and Olivier Faugeras. Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. International Journal of Computer Vision, 72(2):179–193, 2007.
 [26] Gilles Puy, Alexandre Boulch, and Renaud Marlet. Flot: Scene flow on point clouds guided by optimal transport. arXiv preprint arXiv:2007.11142, 2020.
 [27] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
 [28] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017.
 [29] Julian Quiroga, Thomas Brox, Frédéric Devernay, and James Crowley. Dense semi-rigid scene flow estimation from rgb-d images. In European Conference on Computer Vision, pages 567–582. Springer, 2014.
 [30] Kosta Ristovski, Vladan Radosavljevic, Slobodan Vucetic, and Zoran Obradovic. Continuous conditional random fields for efficient regression in large fully connected graphs. In Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
 [31] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
 [32] Deqing Sun, Erik B Sudderth, and Hanspeter Pfister. Layered rgbd scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 548–556, 2015.
 [33] Lyne Tchapmi, Christopher Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. Segcloud: Semantic segmentation of 3d point clouds. In 2017 international conference on 3D vision (3DV), pages 537–547. IEEE, 2017.
 [34] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pages 6411–6420, 2019.
 [35] Arash K Ushani, Ryan W Wolcott, Jeffrey M Walls, and Ryan M Eustice. A learning approach for real-time temporal scene flow estimation from lidar data. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 5666–5673. IEEE, 2017.
 [36] Levi Valgaerts, Andrés Bruhn, Henning Zimmer, Joachim Weickert, Carsten Stoll, and Christian Theobalt. Joint estimation of motion, structure and geometry from stereo sequences. In European Conference on Computer Vision, pages 568–581. Springer, 2010.
 [37] Sundar Vedula, Simon Baker, Peter Rander, Robert Collins, and Takeo Kanade. Three-dimensional scene flow. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 722–729. IEEE, 1999.
 [38] Christoph Vogel, Konrad Schindler, and Stefan Roth. 3d scene flow estimation with a rigid motion prior. In 2011 International Conference on Computer Vision, pages 1291–1298. IEEE, 2011.
 [39] Christoph Vogel, Konrad Schindler, and Stefan Roth. Piecewise rigid scene flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 1377–1384, 2013.
 [40] Christoph Vogel, Konrad Schindler, and Stefan Roth. 3d scene flow estimation with a piecewise rigid scene model. International Journal of Computer Vision, 115(1):1–28, 2015.
 [41] Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. Graph attention convolution for point cloud semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10296–10305, 2019.
 [42] Shenlong Wang, Simon Suo, WeiChiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2589–2597, 2018.
 [43] Yue Wang and Justin M Solomon. Deep closest point: Learning representations for point cloud registration. In Proceedings of the IEEE International Conference on Computer Vision, pages 3523–3532, 2019.
 [44] Wenxuan Wu, Zhongang Qi, and Li Fuxin. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9621–9630, 2019.
 [45] Wenxuan Wu, Zhi Yuan Wang, Zhuwen Li, Wei Liu, and Li Fuxin. Pointpwc-net: Cost volume on point clouds for (self-)supervised scene flow estimation. In European Conference on Computer Vision, pages 88–107. Springer, 2020.
 [46] Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. Multiscale continuous crfs as sequential deep networks for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5354–5362, 2017.
 [47] Bo Yang, Jianan Wang, Ronald Clark, Qingyong Hu, Sen Wang, Andrew Markham, and Niki Trigoni. Learning object bounding boxes for 3d instance segmentation on point clouds. In Advances in Neural Information Processing Systems, pages 6737–6746, 2019.
 [48] Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, and Chunhua Shen. Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5217–5226, 2019.
 [49] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.