HCRF-Flow: Scene Flow from Point Clouds with Continuous High-order CRFs and Position-aware Flow Embedding

05/17/2021 · by Ruibo Li, et al.

Scene flow in 3D point clouds plays an important role in understanding dynamic environments. Although significant advances have been made by deep neural networks, performance remains far from satisfactory because only per-point translational motion is considered, neglecting the constraints of rigid motion in local regions. To address this issue, we propose to introduce motion consistency to enforce smoothness among neighboring points. In addition, constraints on the rigidity of the local transformation are added by sharing unique rigid motion parameters for all points within each local region. To this end, a high-order CRFs based relation module (Con-HCRFs) is deployed to explore both point-wise smoothness and region-wise rigidity. To empower the CRFs with a discriminative unary term, we also introduce a position-aware flow estimation module that is incorporated into the Con-HCRFs. Comprehensive experiments on FlyingThings3D and KITTI show that our proposed framework (HCRF-Flow) achieves state-of-the-art performance and outperforms previous approaches by a large margin.


1 Introduction

Scene flow estimation [37] aims to provide dense or semi-dense 3D vectors representing the per-point 3D motion between two consecutive frames. This information has proven invaluable in analyzing dynamic scenes. Although significant advances have been made in 2D optical flow, the counterpart on 3D point clouds is far more challenging, partly due to the irregularity and sparsity of the data, and partly due to the diversity of scenes.

Figure 1: The warped point cloud at the next frame based on different scene flow. Green points represent the point cloud at frame $t$. Red points are the warped results at frame $t+1$, obtained by adding the scene flow back to the corresponding green points. (a) scene flow produced by FlowNet3D [13]; (b) scene flow produced by FlowNet3D and refined by a conventional CRF; (c) scene flow produced by FlowNet3D and refined by our continuous high-order CRFs; (d) ground-truth scene flow. The local structure of the warped point cloud is distorted in FlowNet3D and the conventional CRF but preserved by our method.

As pointed out in [17], most of the structures in the visual world are rigid or at least nearly so. Many previous top-performing approaches [13, 5, 45, 26] simplify this task as a regression problem by estimating a point-wise translational motion. Although promising results have been achieved, the performance is far from satisfactory as the potential rigid motion constraints existing in the local region are ignored. As shown in Fig. 1(a), the results generated by the FlowNet3D [13] are deformed and fail to maintain the local geometric smoothness. A straightforward remedy is to utilize pair-wise regularization to smooth the flow prediction. However, ignoring the potential rigid transformations makes it hard to maintain the underlying spatial structure, as presented in Fig. 1(b).

To address this issue, we propose a novel framework termed HCRF-Flow, which consists of two components: a position-aware flow estimation module (PAFE) for per-point translational motion regression and a continuous high-order CRFs module (Con-HCRFs) for the refinement of the per-point predictions by considering both spatial smoothness and rigid transformation constraints. Specifically, in Con-HCRFs, a pairwise term is designed to encourage neighboring points with similar local structure to have similar motions. In addition, a novel high order term is designed to encourage each point in a local region to take a motion obeying the shared rigid motion parameters, i.e., translation and rotation parameters, in this region.

In point cloud scene flow estimation, it is challenging to aggregate the matching costs, which are calculated by comparing one point with its softly corresponding points. To encode this knowledge into the embedding features, we propose a position-aware flow embedding layer in the PAFE module. In the aggregation step, we introduce a pseudo matching pair that is used to compute matching cost differences. For each softly corresponding pair, both its position information and its matching cost difference are considered when producing the aggregation weights.

Our main contributions can be summarized as follows:

  • We propose a novel scene flow estimation framework HCRF-Flow by combining the strengths of DNNs and CRFs to perform a per-point translational motion regression and a refinement with both pairwise and region-level regularization;

  • Formulating the rigid motion constraints as a high order term, we propose continuous high-order CRFs (Con-HCRFs) to model the interaction of points by imposing point-level and region-level consistency constraints.

  • We present a novel position-aware flow embedding layer to build reliable matching costs and aggregate them based on both position information and matching cost differences.

  • Our proposed HCRF-Flow significantly outperforms the state-of-the-art on both the FlyingThings3D and KITTI Scene Flow 2015 datasets. In particular, we achieve Acc3DR scores of 95.07% and 94.44% on FlyingThings3D and KITTI, respectively.

1.1 Related work

Scene flow from RGB or RGB-D images. Scene flow was first proposed in [37] to represent the three-dimensional motion field of points in a scene. Many works [8, 25, 32, 36, 38, 39, 40, 20, 16, 29, 6] try to recover scene flow from stereo RGB images or monocular RGB-D images. The local rigidity assumption has been applied to scene flow estimation from images: [39, 40, 20, 16] directly predict the rigidity parameters of each local region to produce scene flow estimates, while [38, 29, 6] add a rigidity term to the energy function to constrain the scene flow estimation. Compared with them, our method differs in the following aspects: 1) our method formulates the rigidity constraint as a high order term in Con-HCRFs. It encourages the region-level rigidity of point-wise scene flow rather than directly computing rigidity parameters. Thus, our Con-HCRFs can be easily added to other point cloud scene flow estimation methods as a plug-in module to improve the rigidity of their predictions; 2) our method targets irregular and unordered point cloud data instead of well-organized 2D images.

Deep scene flow from point clouds. Some approaches [4, 35] estimate scene flow from point clouds via traditional techniques. Recently, inspired by the success of deep learning on point clouds, more works [5, 1, 13, 45, 26, 23, 14, 42] have employed DNNs in this field. [13] estimates scene flow based on PointNet++ [28]. [5] proposes a sparse convolution architecture for scene flow learning, [45] designs a coarse-to-fine scene flow estimation framework, and [26] estimates the point translation by point matching. Despite achieving impressive performance, these methods neglect rigidity constraints and estimate each point's motion independently. Although a rigid motion for each point is computed in [1], the per-point rigid parameters are independently regressed by a DNN without fully considering the geometric constraints. Unlike previous methods, we design novel Con-HCRFs to explicitly model both spatial smoothness and rigid motion constraints.

Deep learning on 3D point clouds. Many works [27, 28, 34, 18, 44, 15, 41, 7] focus on learning directly on raw point clouds. PointNet [27] and PointNet++ [28] are the pioneering works, which use shared Multi-Layer Perceptrons (MLPs) to extract features and max pooling to aggregate them. [7, 41] use attention mechanisms to produce aggregation weights, and [7, 15] encode shape features from local geometry cues to improve feature extraction. Inspired by these works, we propose a position-aware flow embedding layer that dynamically aggregates matching costs based on both position representations and matching cost differences.

Conditional random fields (CRFs). CRFs are a type of probabilistic graphical model, widely used to model the interactions among examples in numerous vision tasks [10, 12, 2, 48, 46]. In point cloud processing, previous works [47, 33, 3] apply CRFs to discrete labeling tasks for spatial smoothness. In contrast to the CRFs in these works, the variables of our Con-HCRFs are defined in a continuous domain, and two different relations are modeled by the Con-HCRFs, at the point level and at the region level.

Figure 2: HCRF-Flow architecture. Our HCRF-Flow consists of two components: a PAFE module to produce the per-point initial scene flow and a Con-HCRFs module to refine the initial scene flow. Our proposed position-aware flow embedding layer is employed in the PAFE module to encode motion information. We build two different architectures of the PAFE module: one considers only single-scale features (similar to FlowNet3D [13]); the other introduces a pyramid architecture (similar to PointPWC-Net [45]).
Figure 3: The details of the position-aware flow embedding layer.

2 HCRF-Flow

2.1 Overview

In the task of point cloud scene flow estimation, the inputs are two point clouds at two consecutive frames: $P = \{\mathbf{p}_i \in \mathbb{R}^3\}_{i=1}^{N_1}$ at frame $t$ and $Q = \{\mathbf{q}_j \in \mathbb{R}^3\}_{j=1}^{N_2}$ at frame $t+1$, where $\mathbf{p}_i$ and $\mathbf{q}_j$ are the 3D coordinates of individual points. Our goal is to predict the 3D displacement $\mathbf{d}_i$ for each point $\mathbf{p}_i$ in $P$, which describes the motion of each point from frame $t$ to frame $t+1$. Unless otherwise stated, we use boldfaced uppercase and lowercase letters to denote matrices and column vectors, respectively.

As shown in Fig. 2, HCRF-Flow consists of two components: a PAFE module for per-point flow estimation and a Con-HCRFs module for refinement. For the PAFE module, we try two different architectures: a single-scale architecture referring to FlowNet3D [13] and a pyramid architecture referring to PointPWC-Net [45]. To mix the two point clouds, in the PAFE module, we propose a novel position-aware flow embedding layer to build reliable matching costs and aggregate them to produce flow embeddings that encode the motion information. For better aggregation, we use the position information and the matching cost difference as clues to generate aggregation weights. Sec. 2.2 introduces the details of this layer. In the Con-HCRFs module, we propose novel continuous high order CRFs to refine the coarse scene flow by encouraging both point-level and region-level consistency. More details are given in Sec. 3.

2.2 Position-aware flow embedding layer

As shown in Fig. 3, the position-aware flow embedding layer aims to produce a flow embedding for each point in $P$. For each point $\mathbf{p}_i \in P$, we first find $K$ neighbouring points $\{\mathbf{q}_j\}$ around $\mathbf{p}_i$ in frame $t+1$. Then, following [13], the matching cost between point $\mathbf{p}_i$ and a softly corresponding point $\mathbf{q}_j$ in $Q$ is computed as:

$\mathbf{c}_{ij} = h\left(\mathbf{f}_i,\, \mathbf{g}_j,\, \mathbf{q}_j - \mathbf{p}_i\right), \qquad (1)$

where $\mathbf{f}_i$ and $\mathbf{g}_j$ are the features of $\mathbf{p}_i$ and $\mathbf{q}_j$, respectively, and $h(\cdot)$ is a concatenation of its inputs followed by an MLP. After obtaining the matching costs for point $\mathbf{p}_i$, two sub-branches, a position encoding unit and a pseudo-pairing unit, are applied to produce weights for aggregation, as shown in Fig. 3.

Pseudo-pairing unit. When aggregating the matching costs, this unit is designed to automatically select prominent ones by assigning them larger weights. To this end, we compare each matching pair with a pseudo stationary pair and use the difference as a clue to measure the importance of this matching pair. The pseudo stationary pair represents the situation in which the point does not move, i.e., its softly corresponding point is itself. Based on Eq. 1, the matching cost of the pseudo stationary pair for each point $\mathbf{p}_i$ can be defined as:

$\tilde{\mathbf{c}}_{i} = h\left(\mathbf{f}_i,\, \mathbf{f}_i,\, \mathbf{0}\right). \qquad (2)$

The matching cost difference between each matching pair and this pseudo pair can be expressed as:

$\Delta\mathbf{c}_{ij} = \mathbf{c}_{ij} - \tilde{\mathbf{c}}_{i}. \qquad (3)$

In the subsequent aggregation procedure, the matching cost difference is treated as a feature to produce aggregation weights for each matching cost.

Position encoding unit. To further improve the aggregation, we incorporate position representations into the aggregation procedure as a significant factor in producing the soft weights. Specifically, inspired by [31] and [7], for each matching pair $(\mathbf{p}_i, \mathbf{q}_j)$, we utilize the 3D Euclidean distance, the absolute coordinates, and the relative coordinates as position information to encode the position representation $\mathbf{r}_{ij}$, which can be expressed as:

$\mathbf{r}_{ij} = \mathcal{M}\left(\mathbf{p}_i \oplus \mathbf{q}_j \oplus (\mathbf{q}_j - \mathbf{p}_i) \oplus \|\mathbf{q}_j - \mathbf{p}_i\|\right), \qquad (4)$

where $\mathcal{M}$ is an MLP that maps the position information into the position representation, $\oplus$ is the concatenation operation, and $\|\cdot\|$ computes the Euclidean distance between the two points.

Given the matching cost difference and the position representation, we design a shared function to produce a unique weight vector for each matching cost for aggregation. Specifically, the function is composed of an MLP followed by a softmax operation that normalizes the weights across all matching costs in a set. The normalized weight for each matching cost can be written as:

$\mathbf{w}_{ij} = \underset{\mathbf{q}_j \in \mathcal{N}(\mathbf{p}_i)}{\mathrm{softmax}}\left(\mathrm{MLP}\left(\Delta\mathbf{c}_{ij} \oplus \mathbf{r}_{ij}\right)\right). \qquad (5)$

Therefore, for each point $\mathbf{p}_i$, according to the learned aggregation weights, the final flow embedding $\mathbf{e}_i$ can be expressed as:

$\mathbf{e}_i = \sum_{\mathbf{q}_j \in \mathcal{N}(\mathbf{p}_i)} \mathbf{w}_{ij} \odot \mathbf{c}_{ij}, \qquad (6)$

where $\odot$ is the element-wise multiplication and $\mathcal{N}(\mathbf{p}_i)$ is the set of softly corresponding points for point $\mathbf{p}_i$ in the next frame.
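To make the data flow of Eqs. 1-6 concrete, the following PyTorch sketch outlines the layer described above. The tensor shapes, MLP widths, and the assumption that the $K$ softly corresponding points are given as precomputed nearest-neighbour indices are illustrative choices, not the released implementation.

# A minimal PyTorch sketch of the position-aware flow embedding layer (Eqs. 1-6).
# Tensor shapes, MLP widths, and precomputed K-nearest-neighbour indices are
# illustrative assumptions, not the released implementation.
import torch
import torch.nn as nn

def shared_mlp(dims):
    # point-wise shared MLP: a stack of Linear + ReLU applied to the last dimension
    layers = []
    for i in range(len(dims) - 1):
        layers += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class PositionAwareFlowEmbedding(nn.Module):
    def __init__(self, feat_dim, embed_dim):
        super().__init__()
        self.cost_mlp = shared_mlp([2 * feat_dim + 3, embed_dim])    # h(.) in Eq. 1
        self.pos_mlp = shared_mlp([10, embed_dim])                   # M(.) in Eq. 4
        self.weight_mlp = shared_mlp([2 * embed_dim, embed_dim])     # MLP in Eq. 5

    def forward(self, p, f, q, g, knn_idx):
        # p: (N, 3) points at frame t, f: (N, C) their features
        # q: (M, 3) points at frame t+1, g: (M, C) their features
        # knn_idx: (N, K) indices of the softly corresponding points in q
        N, K = knn_idx.shape
        qj, gj = q[knn_idx], g[knn_idx]                              # (N, K, 3), (N, K, C)
        pi = p.unsqueeze(1).expand(-1, K, -1)
        fi = f.unsqueeze(1).expand(-1, K, -1)
        # Eq. 1: matching cost for each soft correspondence (p_i, q_j)
        c_ij = self.cost_mlp(torch.cat([fi, gj, qj - pi], dim=-1))   # (N, K, D)
        # Eq. 2: pseudo stationary pair, i.e. the point matched to itself
        c_ii = self.cost_mlp(torch.cat([f, f, torch.zeros_like(p)], dim=-1))
        # Eq. 3: matching cost difference w.r.t. the pseudo pair
        delta_c = c_ij - c_ii.unsqueeze(1)
        # Eq. 4: position representation from absolute/relative coordinates and distance
        dist = torch.norm(qj - pi, dim=-1, keepdim=True)
        r_ij = self.pos_mlp(torch.cat([pi, qj, qj - pi, dist], dim=-1))
        # Eq. 5: per-correspondence weights normalised over the K candidates
        w_ij = torch.softmax(self.weight_mlp(torch.cat([delta_c, r_ij], dim=-1)), dim=1)
        # Eq. 6: weighted aggregation of matching costs -> flow embedding e_i
        return (w_ij * c_ij).sum(dim=1)                              # (N, D)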

3 Continuous High-Order CRFs

In this section, we introduce the details of our continuous high order CRFs. We first formulate the problem of scene flow refinement. Then, we describe the details of three kinds of potential functions involved in the Con-HCRFs. Lastly, we discuss how to utilize mean field theory to approximate the Con-HCRFs distribution and obtain the final iterative inference algorithm.

3.1 Overview

Consider a point cloud $P$ with $N$ points, indexed $1, \dots, N$. In the scene flow refinement, we attempt to assign every point a refined 3D displacement based on the initial scene flow produced by the PAFE module. Let $\mathbf{D} = [\mathbf{d}_1, \dots, \mathbf{d}_N]^{\top}$ be the matrix of 3D displacements corresponding to all points in point cloud $P$, where each $\mathbf{d}_i \in \mathbb{R}^3$. Following [12, 30], we model the conditional probability distribution with the following density function:

$\Pr(\mathbf{D} \mid P) = \frac{1}{Z(P)} \exp\left(-E(\mathbf{D}, P)\right).$

Here $E(\mathbf{D}, P)$ is the energy function and $Z(P)$ is the partition function defined as $Z(P) = \int \exp\left(-E(\mathbf{D}, P)\right) \mathrm{d}\mathbf{D}$.

Different from conventional CRFs, the Con-HCRFs proposed in this paper contain a novel high order potential that enforces rigid motion constraints in each local region. Specifically, the energy function is defined as:

$E(\mathbf{D}, P) = \sum_{i} \psi_u(\mathbf{d}_i) + \sum_{i}\sum_{j \in \mathcal{N}_i} \psi_p(\mathbf{d}_i, \mathbf{d}_j) + \sum_{v \in \mathcal{S}}\sum_{i \in v} \psi_h(\mathbf{d}_i, \mathbf{D}_{v \setminus i}), \qquad (7)$

where $\mathcal{N}_i$ represents the set of neighboring points of center point $\mathbf{p}_i$; $\mathcal{S}$ represents the set of rigid regions in the whole point cloud, and $\mathbf{D}_{v \setminus i}$ is a matrix composed of the scene flow of the points belonging to region $v$ with point $i$ excluded. The corresponding point set without point $i$ is denoted as $P_{v \setminus i}$. The unary term encourages the refined scene flow to be consistent with the initial scene flow. The pairwise term encourages neighboring points with similar local structure to take similar displacements. The high order term encourages points belonging to the same rigid region to share the same rigid motion parameters. In this paper, we use an over-segmentation method to segment the entire point cloud into a series of supervoxels and treat each supervoxel as a rigid region in the high order term. An illustration is shown in Fig. 4. We drop the conditioning on $P$ in the rest of this paper for convenience.

Figure 4: Illustration of Con-HCRFs. Red points and blue points represent the point clouds in frame $t$ and frame $t+1$, respectively. The black lines and gray background represent the pairwise relations and the neighborhood in the pairwise term. The dashed boxes cover rigid regions. An arrow with a point represents the rigid motion of a region, with rigid motion parameters shared by all the points in it. The rigid motion constraints compose the high order term in Con-HCRFs.

3.2 Potential functions

Unary potential. The unary potential is constructed from the initial scene flow by considering the $\ell_2$ norm:

$\psi_u(\mathbf{d}_i) = \left\| \mathbf{d}_i - \tilde{\mathbf{d}}_i \right\|^2, \qquad (8)$

where $\tilde{\mathbf{d}}_i$ represents the initial 3D displacement at point $\mathbf{p}_i$ produced by the PAFE module and $\|\cdot\|$ denotes the $\ell_2$ norm of a vector.

Pairwise potential. The pairwise potential is constructed from $M$ types of similarity observations to describe the relation between a pair of hidden variables $\mathbf{d}_i$ and $\mathbf{d}_j$:

$\psi_p(\mathbf{d}_i, \mathbf{d}_j) = \sum_{m=1}^{M} \beta_m\, k^{(m)}_{ij} \left\| \mathbf{d}_i - \mathbf{d}_j \right\|^2, \qquad (9)$

where $k^{(m)}_{ij}$ is a weight specifying the relation between points $\mathbf{p}_i$ and $\mathbf{p}_j$ under the $m$-th similarity measure, and $\beta_m$ denotes the coefficient for each similarity measure. Specifically, we set the weight with a Gaussian kernel, $k^{(m)}_{ij} = \exp\left(-\left\| \mathbf{f}^{(m)}_i - \mathbf{f}^{(m)}_j \right\|^2 / (2\theta_m^2)\right)$, where $\mathbf{f}^{(m)}_i$ and $\mathbf{f}^{(m)}_j$ indicate the features of neighboring points $\mathbf{p}_i$ and $\mathbf{p}_j$ associated with similarity measure $m$, and $\theta_m$ is the kernel's bandwidth parameter. In this paper, we use point position and surface normal as the observations to construct two Gaussian kernels.
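For illustration, the sketch below shows one way these kernels can be computed, assuming point positions and surface normals as the two observations; the bandwidth values are placeholders rather than the tuned parameters.

# Sketch of the Gaussian kernels used as pairwise weights in Eq. 9, assuming
# point position and surface normal as the two similarity observations.
import torch

def pairwise_kernels(pos, normals, knn_idx, theta_pos=0.5, theta_norm=0.3):
    # pos: (N, 3) coordinates, normals: (N, 3) surface normals
    # knn_idx: (N, K) indices of the neighbouring points N_i
    pi, ni = pos.unsqueeze(1), normals.unsqueeze(1)                   # (N, 1, 3)
    pj, nj = pos[knn_idx], normals[knn_idx]                           # (N, K, 3)
    k_pos = torch.exp(-((pi - pj) ** 2).sum(-1) / (2 * theta_pos ** 2))
    k_norm = torch.exp(-((ni - nj) ** 2).sum(-1) / (2 * theta_norm ** 2))
    return k_pos, k_norm                                              # each (N, K)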

High order potential For the high order potential term, we want to explore the effects of interactions among points in a supervoxel. According to the rigid motion constraint, the high order potential term in CRF can be defined as:

$\psi_h(\mathbf{d}_i, \mathbf{D}_{v \setminus i}) = \gamma \left\| \mathbf{d}_i - \mathbf{d}^{r}_i \right\|^2, \qquad (10)$

where $\mathbf{d}^{r}_i$ is a displacement produced by a function $\phi(P_{v \setminus i}, \mathbf{D}_{v \setminus i}, \mathbf{p}_i)$: the shared rigid motion parameters are computed by $\phi$, and the displacement for point $\mathbf{p}_i$ obeying the shared parameters is obtained by applying these parameters back to the original position $\mathbf{p}_i$; $\gamma$ is a coefficient. In the following, we give details about the computation of $\mathbf{d}^{r}_i$.

In a rigid region $v$, given point $\mathbf{p}_i$, we denote the points in region $v$ not containing point $\mathbf{p}_i$ as $P_{v \setminus i}$ and the corresponding 3D displacements as $\mathbf{D}_{v \setminus i}$. The warped positions $P'_{v \setminus i}$ in the next frame can be obtained by adding the scene flow back to the corresponding positions in frame $t$:

$P'_{v \setminus i} = P_{v \setminus i} + \mathbf{D}_{v \setminus i}. \qquad (11)$

The possible rigid transformation from $P_{v \setminus i}$ to $P'_{v \setminus i}$ can be defined as $[\mathbf{R}, \mathbf{t}]$, where $\mathbf{R} \in SO(3)$ and $\mathbf{t} \in \mathbb{R}^3$. Inspired by the work [43] on point cloud registration, we can minimize the mean-squared error to find the most suitable rigid motion parameters to describe the motion:

$E(\mathbf{R}, \mathbf{t}) = \frac{1}{N_{v \setminus i}} \sum_{\mathbf{p}_k \in P_{v \setminus i}} \left\| \mathbf{R}\, \mathbf{p}_k + \mathbf{t} - \mathbf{p}'_k \right\|^2, \qquad (12)$
$[\mathbf{R}^{*}, \mathbf{t}^{*}] = \underset{\mathbf{R} \in SO(3),\, \mathbf{t} \in \mathbb{R}^3}{\arg\min}\; E(\mathbf{R}, \mathbf{t}), \qquad (13)$

where $N_{v \setminus i}$ is the number of points in region $v$ with point $\mathbf{p}_i$ excluded, and $\mathbf{p}'_k \in P'_{v \setminus i}$ is the warped position of $\mathbf{p}_k$.

Define the centers of $P_{v \setminus i}$ and $P'_{v \setminus i}$ as $\bar{\mathbf{p}}$ and $\bar{\mathbf{p}}'$, respectively. Then the cross-covariance matrix $\mathbf{H}$ can be written as:

$\mathbf{H} = \sum_{\mathbf{p}_k \in P_{v \setminus i}} (\mathbf{p}_k - \bar{\mathbf{p}})(\mathbf{p}'_k - \bar{\mathbf{p}}')^{\top}.$

Using the singular value decomposition (SVD) to decompose $\mathbf{H} = \mathbf{U}\mathbf{S}\mathbf{V}^{\top}$, we can obtain the closed-form solutions of $[\mathbf{R}^{*}, \mathbf{t}^{*}]$, written as:

$\mathbf{R}^{*} = \mathbf{V}\mathbf{U}^{\top}, \qquad \mathbf{t}^{*} = \bar{\mathbf{p}}' - \mathbf{R}^{*}\bar{\mathbf{p}}. \qquad (14)$

Treating the most suitable parameters $[\mathbf{R}^{*}, \mathbf{t}^{*}]$ as the rigid motion parameters shared by all points in region $v$, the displacement that satisfies the rigid motion constraints for point $\mathbf{p}_i$ is given by:

$\mathbf{d}^{r}_i = \phi(P_{v \setminus i}, \mathbf{D}_{v \setminus i}, \mathbf{p}_i) = \mathbf{R}^{*}\, \mathbf{p}_i + \mathbf{t}^{*} - \mathbf{p}_i. \qquad (15)$
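The closed-form fit of Eqs. 12-14 is the standard SVD-based solution used in [43]; a short PyTorch sketch is given below. The reflection check on the last singular vector is the usual safeguard and is included here as an assumption for robustness.

# Sketch of the closed-form rigid fit of Eqs. 11-14 via SVD.
import torch

def fit_rigid(src, dst):
    # src: (K, 3) points P_{v\i} at frame t; dst: (K, 3) their warped positions (Eq. 11)
    # returns rotation R (3, 3) and translation t (3,) minimising Eq. 12
    src_c, dst_c = src.mean(dim=0), dst.mean(dim=0)
    H = (src - src_c).t() @ (dst - dst_c)          # cross-covariance matrix
    U, S, V = torch.svd(H)                          # differentiable SVD (cf. Sec. 3.3)
    R = V @ U.t()
    if torch.det(R) < 0:                            # guard against reflections
        V = torch.cat([V[:, :2], -V[:, 2:]], dim=1)
        R = V @ U.t()
    t = dst_c - R @ src_c                           # Eq. 14
    return R, t
# The rigid-motion-consistent displacement of Eq. 15 is then R @ p_i + t - p_i.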

3.3 Inference

In order to produce the most probable scene flow, we need to solve the MAP inference problem $\mathbf{D}^{*} = \arg\max_{\mathbf{D}} \Pr(\mathbf{D})$. Following [30], we approximate the original conditional distribution by mean field theory [9]. Thus, the distribution $\Pr(\mathbf{D})$ is approximated by a product of independent marginals, i.e., $Q(\mathbf{D}) = \prod_i Q_i(\mathbf{d}_i)$. By minimizing the KL-divergence between $Q$ and $\Pr$, the solution for each $Q_i$ can be written as $\log Q_i(\mathbf{d}_i) = \mathbb{E}_{j \neq i}\left[\log \Pr(\mathbf{D})\right] + \mathrm{const}$, where $\mathbb{E}_{j \neq i}$ represents an expectation under the distributions $Q_j$ over all variables $\mathbf{d}_j$ for $j \neq i$.

Following [30], we represent each $Q_i(\mathbf{d}_i)$ as a multivariate normal distribution. The mean field updates for the mean $\boldsymbol{\mu}_i$ and the normalization parameter $\sigma^2_i$ can be written as:

$\sigma^2_i = \frac{1}{2\left(1 + \sum_{j \in \mathcal{N}_i}\sum_{m} \beta_m k^{(m)}_{ij} + \gamma\right)}, \qquad (16)$
$\boldsymbol{\mu}_i = \frac{\tilde{\mathbf{d}}_i + \sum_{j \in \mathcal{N}_i}\sum_{m} \beta_m k^{(m)}_{ij}\, \boldsymbol{\mu}_j + \gamma\, \phi(P_{v \setminus i}, \mathbf{M}_{v \setminus i}, \mathbf{p}_i)}{1 + \sum_{j \in \mathcal{N}_i}\sum_{m} \beta_m k^{(m)}_{ij} + \gamma}, \qquad (17)$

where $\mathbf{M}$ is the set of means $\boldsymbol{\mu}_j$ for all $j$, and $\sigma^2_i$ is the diagonal element of the covariance $\boldsymbol{\Sigma}_i$. The detailed derivation of the inference algorithm can be found in the supplementary material. We observe that there usually exist hundreds of points in a supervoxel, which makes the rigid parameters computed on all points in the supervoxel excluding point $\mathbf{p}_i$ very close to the rigid parameters computed on all points in the supervoxel, i.e., $\phi(P_{v \setminus i}, \mathbf{M}_{v \setminus i}, \mathbf{p}_i)$ is very close to $\phi(P_{v}, \mathbf{M}_{v}, \mathbf{p}_i)$. Thus, in practice, we approximate $\phi(P_{v \setminus i}, \mathbf{M}_{v \setminus i}, \mathbf{p}_i)$ in Eq. 17 with $\phi(P_{v}, \mathbf{M}_{v}, \mathbf{p}_i)$, and the approximated mean is:

$\boldsymbol{\mu}_i \approx \frac{\tilde{\mathbf{d}}_i + \sum_{j \in \mathcal{N}_i}\sum_{m} \beta_m k^{(m)}_{ij}\, \boldsymbol{\mu}_j + \gamma\, \phi(P_{v}, \mathbf{M}_{v}, \mathbf{p}_i)}{1 + \sum_{j \in \mathcal{N}_i}\sum_{m} \beta_m k^{(m)}_{ij} + \gamma}. \qquad (18)$

After this approximation, we only need to calculate a set of rigid motion parameters for each supervoxel rather than for each point, which greatly reduces the time complexity.

In the MAP inference, since we approximate $\Pr(\mathbf{D})$ with $Q(\mathbf{D})$, an estimate of each $\mathbf{d}_i$ can be obtained by computing the expected value of the Gaussian distribution $Q_i(\mathbf{d}_i)$:

$\mathbf{d}^{*}_i = \mathbb{E}_{Q_i}\left[\mathbf{d}_i\right] = \boldsymbol{\mu}_i. \qquad (19)$

The inference procedure of our Con-HCRFs is sketched in Algorithm 1.

Input: Coarse scene flow $\tilde{\mathbf{D}}$; coordinates of point cloud $P$;
Output: Refined scene flow $\mathbf{D}^{*}$;
Procedure:

1:  $\boldsymbol{\mu}_i \leftarrow \tilde{\mathbf{d}}_i$ for all $i$;  ▷ Initialization
2:  while not converged do
3:     Compute $[\mathbf{R}^{*}_v, \mathbf{t}^{*}_v]$ for each supervoxel $v$;
4:     Compute the rigid-motion message $\mathbf{R}^{*}_v \mathbf{p}_i + \mathbf{t}^{*}_v - \mathbf{p}_i$ for every point $\mathbf{p}_i$;  ▷ Message passing from supervoxel
5:     Weight the rigid-motion messages by $\gamma$;
6:     Gather the current means $\boldsymbol{\mu}_j$ of the neighboring points $j \in \mathcal{N}_i$;  ▷ Message passing from neighboring points
7:     Weight the neighbor messages by the kernels $\beta_m k^{(m)}_{ij}$;
8:     Sum the unary, neighbor, and rigid-motion messages;  ▷ Weighted summing
9:     Normalize the sum to obtain the updated mean $\boldsymbol{\mu}_i$ (Eq. 18);  ▷ Normalizing
10: end while
11: $\mathbf{d}^{*}_i \leftarrow \boldsymbol{\mu}_i$ for all $i$.
Algorithm 1: Mean field inference in Con-HCRFs
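To make Algorithm 1 concrete, the following PyTorch sketch mirrors the mean field iteration under the quadratic potentials of Sec. 3.2 and the per-supervoxel approximation of Sec. 3.3. It is meant as an illustration rather than the exact update rules derived in the supplementary material: a single combined pairwise kernel and coefficient beta are assumed, and fit_rigid is the helper sketched after Eq. 15.

# A simplified sketch of the mean field iteration in Algorithm 1, assuming the
# squared-L2 potentials of Sec. 3.2 and the per-supervoxel approximation of Sec. 3.3;
# illustrative only, not the exact update rules.
import torch

def mean_field_refine(points, flow_init, knn_idx, kernels, supervoxels,
                      beta=1.0, gamma=1.0, num_iters=5):
    # points: (N, 3); flow_init: (N, 3) coarse flow from the PAFE module
    # knn_idx: (N, K) neighbour indices; kernels: (N, K) pairwise weights
    #   (e.g. the position and normal kernels already combined)
    # supervoxels: list of LongTensors holding the point indices of each region
    mu = flow_init.clone()                                    # step 1: initialization
    for _ in range(num_iters):
        rigid = torch.zeros_like(mu)
        for v in supervoxels:                                 # steps 3-5: supervoxel message
            R, t = fit_rigid(points[v], points[v] + mu[v])    # one SVD per supervoxel
            rigid[v] = points[v] @ R.t() + t - points[v]      # Eq. 15 applied region-wise
        neigh = (kernels.unsqueeze(-1) * mu[knn_idx]).sum(1)  # steps 6-7: neighbour message
        denom = 1.0 + beta * kernels.sum(1, keepdim=True) + gamma
        mu = (flow_init + beta * neigh + gamma * rigid) / denom  # steps 8-9: sum, normalize
    return mu                                                 # step 11: refined flow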

Moreover, thanks to the differentiable SVD function provided by PyTorch [24], the mean field update operation is differentiable in our inference procedure. Therefore, following [49, 46], our mean field algorithm can be fully integrated with deep learning models, which enables end-to-end training of the whole framework.

4 Experiments

In this section, we first train and evaluate our method on the synthetic FlyingThings3D dataset in Sec. 4.1, and then in Sec. 4.2 we test the generalization ability of our method on the real-world KITTI dataset without fine-tuning. In Sec. 4.3, we validate the generality of our Con-HCRFs on other networks. Finally, we conduct ablation studies to analyze the contribution of each component in Sec. 4.4. Note that in the following experiments there are two different architectures of the PAFE module: the single-scale one, denoted as PAFE-S, and the pyramid one, denoted as PAFE. The corresponding HCRF-Flow models are denoted as HCRF-Flow-S and HCRF-Flow, respectively.

Evaluation metrics. Let $\mathbf{d}^{pred}_i$ denote the predicted scene flow and $\mathbf{d}^{gt}_i$ the ground truth scene flow. The evaluation metrics are computed as follows.

EPE3D (m): $\|\mathbf{d}^{pred}_i - \mathbf{d}^{gt}_i\|$ averaged over all points, the main metric. Acc3DS (%): the percentage of points whose EPE3D < 0.05m or relative error < 5%. Acc3DR (%): the percentage of points whose EPE3D < 0.1m or relative error < 10%. Outliers3D (%): the percentage of points whose EPE3D > 0.3m or relative error > 10%. EPE2D (px): 2D end point error, a common metric for optical flow. Acc2D (%): the percentage of points whose EPE2D < 3px or relative error < 5%.
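For reference, a minimal sketch of how the 3D metrics can be computed from the thresholds listed above is given below; the 2D metrics additionally require projecting the flow onto the image plane and are omitted.

# Sketch of the 3D evaluation metrics (EPE3D, Acc3DS, Acc3DR, Outliers3D).
import torch

def scene_flow_metrics(pred, gt):
    # pred, gt: (N, 3) predicted and ground-truth scene flow in metres
    epe = torch.norm(pred - gt, dim=1)             # per-point end point error
    rel = epe / torch.norm(gt, dim=1).clamp(min=1e-8)
    return {
        "EPE3D": epe.mean().item(),
        "Acc3DS": ((epe < 0.05) | (rel < 0.05)).float().mean().item() * 100.0,
        "Acc3DR": ((epe < 0.1) | (rel < 0.1)).float().mean().item() * 100.0,
        "Outliers3D": ((epe > 0.3) | (rel > 0.1)).float().mean().item() * 100.0,
    }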

4.1 Results on FlyingThings3D

FlyingThings3D [19] is a large-scale synthetic dataset. We follow [5] to build the training set and the test set. Our method takes 8,192 points in each point cloud as input. We train our models on one quarter of the training set (4,910 pairs) and evaluate on the whole test set (3,824 pairs).

Referring to PointPWC-Net [45], we build a pyramid PAFE module, PAFE, and the corresponding HCRF-Flow framework, HCRF-Flow. Note that, compared with the original architecture in [45], there are three adjustments in our PAFE: 1) we replace the MLPs in level 0 with a set conv [13]; 2) we replace all PointConvs [44] with set convs [13]; 3) we replace the cost volume layers [45] with our position-aware flow embedding layers. In Con-HCRFs, we utilize the algorithm proposed in [11] for supervoxel segmentation. During training, we first train our PAFE with the multi-scale loss function used in [45]. Then we add the Con-HCRFs to PAFE for fine-tuning. More implementation details are in the supplementary material.

The quantitative evaluation results on FlyingThings3D are shown in Table 1. We compare our method with four baseline models: FlowNet3D [13], HPLFlowNet [5], PointPWC-Net [45], and FLOT [26]. As shown in Table 1, our PAFE module outperforms the above four methods. Further, after adding Con-HCRFs and fine-tuning on FlyingThings3D, the final method, HCRF-Flow, achieves the best performance on all metrics. Qualitative results are shown in Fig. 5.

 

Dataset Method EPE3D Acc3DS Acc3DR Outliers3D EPE2D Acc2D
FlyingThings3D FlowNet3D [13] 0.0886 41.63 81.61 58.62 4.7142 60.10
HPLFlowNet [5] 0.0804 61.44 85.55 42.87 4.6723 67.64
PointPWC-Net [45] 0.0588 73.79 92.76 34.24 3.2390 79.94
FLOT [26] 0.0520 73.20 92.70 35.70 - -
Ours (PAFE module) 0.0535 78.90 94.93 30.51 2.8253 83.46
Ours (HCRF-Flow) 0.0488 83.37 95.07 26.14 2.5652 87.04
KITTI FlowNet3D [13] 0.1069 42.77 79.78 41.38 4.3424 57.51
HPLFlowNet [5] 0.1169 47.83 77.76 41.03 4.8055 59.38
PointPWC-Net [45] 0.0694 72.81 88.84 26.48 3.0062 76.73
FLOT [26] 0.0560 75.50 90.80 24.20 - -
Ours (PAFE module) 0.0646 80.29 93.47 20.24 2.4829 80.80
Ours (HCRF-Flow) 0.0531 86.31 94.44 17.97 2.0700 86.56

 

Table 1: Evaluation results on FlyingThings3D and KITTI Scene Flow 2015. Our model outperforms all baselines on all evaluation metrics. In particular, the strong performance on KITTI demonstrates the generalization ability of our method.

4.2 Generalization results on KITTI

KITTI Scene Flow 2015 [22, 21] is a well-known dataset for 3D scene flow estimation. In this section, in order to evaluate the generalization ability of our method, we train our model on the FlyingThings3D dataset and test on KITTI Scene Flow 2015 without fine-tuning. The desired supervoxel size and the bandwidth parameters for KITTI are the same as those used for FlyingThings3D.

Following [13, 5], we evaluate on all 142 scenes in the training set and remove the points on the ground by height for a fair comparison. The quantitative evaluation results on KITTI are shown in Table 1. Our method outperforms the competing methods, which demonstrates the good generalization ability of our method on real-world data. Fig. 5 shows the qualitative results.

Figure 5: Qualitative results on FlyingThings3D (top) and KITTI (bottom). Blue points are the point cloud at frame $t$. Green points are the warped results at frame $t+1$ for the points whose predicted displacements are measured as correct by Acc3DR. For the incorrect predictions, we use the ground-truth scene flow to replace them, and the ground-truth warped results are shown as red points.

4.3 Generality of Con-HCRFs on other models

In this section, we study the generalization ability of Con-HCRFs by applying them to other scene flow estimation models as a post-processing module. We evaluate the performance of our proposed Con-HCRFs with FlowNet3D [13] and FLOT [26], which have shown strong capability on both challenging synthetic data from FlyingThings3D and real LiDAR scans from KITTI. The results are presented in Table 2. Although built upon strong baselines, our proposed Con-HCRFs boost the performance of each baseline by a large margin on both datasets, demonstrating strong robustness and generalization.

 

Dataset Method Acc3DS ΔAcc3DS
FlyingThings3D FlowNet3D [13] 41.63 0.0
+ Con-HCRFs 47.01 + 5.38
FLOT [26] 73.20 0.0
+ Con-HCRFs 78.63 + 5.43
KITTI FlowNet3D [13] 42.77 0.0
+ Con-HCRFs 46.90 + 4.13
FLOT [26] 75.50 0.0
+ Con-HCRFs 85.44 + 9.94

 

Table 2: Generalization results of Con-HCRFs on the FlowNet3D and FLOT models. Δ denotes the difference in metrics with respect to each original model. Although built upon strong baselines, our proposed Con-HCRFs boost the performance of each baseline by a large margin on the two datasets.

4.4 Ablation studies

In this section, we provide a detailed analysis of every component of our method. All experiments are conducted on the FlyingThings3D dataset. Besides the pyramid models, PAFE and HCRF-Flow, for a comprehensive analysis we also evaluate the performance of each component when used in the single-scale models, PAFE-S and HCRF-Flow-S. These models are designed referring to FlowNet3D [13].

Ablation for position-aware flow embedding layer. We explore the effect of the aggregation strategy in our position-aware flow embedding layer. This strategy is introduced to dynamically aggregate the matching costs considering position information and matching cost differences. As shown in Table 3, for the two baselines, when applying both the pseudo-pairing unit and the position encoding unit to the flow embedding layer, the performance in Acc3DS is improved by around 8 percentage points. Moreover, to verify the effectiveness of the pseudo pair, we design a naive dynamic aggregation unit, denoted as NDA, which directly produces weights from the matching costs rather than from the matching cost difference between each matching pair and the pseudo pair. As shown in Table 3, after replacing PP with NDA, the improvement in Acc3DS decreases from 6.79 to 2.62 percentage points. Thus, the pseudo-pairing unit is the better choice for this task.

 

Method MP PP PE NDA Acc3DS ΔAcc3DS
Single-scale baseline 41.63 0.00
+NDA 44.25 + 2.62
+PP 48.42 + 6.79
+PP+PE (PAFE-S module) 50.08 + 8.45
Pyramid baseline 69.94 0.0
+PP+PE (PAFE module) 78.90 + 8.96

 

Table 3: Ablation study for the position-aware flow embedding layer. MP: max pooling. PP: pseudo-pairing unit. PE: position encoding unit. NDA: naive dynamic aggregation unit. Δ denotes the difference in metrics with respect to each baseline model.

Ablation for Con-HCRFs. To ensure the spatial smoothness and the local rigidity of the final predictions, we propose continuous CRFs with a novel high order term. The ablation results for Con-HCRFs are presented in Table 4. With the help of a pairwise term, denoted as (Unary+Pair), the performance gains only a slight improvement, since the pairwise term aims at spatial smoothness but ignores the potential rigid motion constraints. Our proposed Con-HCRFs module, which formulates the rigid motion constraints as its high order term, boosts the performance by a large margin for both the PAFE-S and PAFE modules. After jointly optimizing the Con-HCRFs and PAFE modules, we observe a further improvement.

Can we replace the rigid motion constraints by region-level smoothness in a supervoxel? We want to explore whether the rigid motion constraint is a good approach to model the relations among points in a supervoxel. Instead of sharing unique rigid motion parameters, a straightforward alternative is to encourage the points in a rigid region to share the same motion, i.e., to encourage region-level smoothness in a supervoxel. To this end, we design a naive regional term as $\|\mathbf{d}_i - \bar{\mathbf{d}}_v\|^2$, where $\bar{\mathbf{d}}_v$ is the average of the displacements over all points in region $v$. The results are shown in Table 4, denoted as (Unary+Pair+naive Region). As it only enforces spatial smoothness in a region and fails to model suitable dependencies among points in this rigid region, this kind of CRFs is ineffective and even worsens the performance. In contrast, when applying our proposed Con-HCRFs, the final scene flow shows significant improvements.

 

Method Acc3DS ΔAcc3DS
PAFE-S module 50.08 0.00
+ (Unary+Pair) 50.24 + 0.16
+ (Unary+Pair+naive Region) 48.10 - 1.98
+ (Unary+Pair+High-order)/Con-HCRFs 54.51 + 4.43
+ (Unary+Pair+High-order)/Con-HCRFs† 56.29 + 6.21
PAFE module 78.90 0.00
+ (Unary+Pair+High-order)/Con-HCRFs 81.39 + 2.49
+ (Unary+Pair+High-order)/Con-HCRFs† 83.37 + 4.47

 

Table 4: Ablation study for Con-HCRFs. Unary: unary term. Pair: pairwise term. High-order: our proposed high order term. naive Region: a naive regional term designed as a reference to verify the effectiveness of our high order term. Δ denotes the difference in metrics with respect to the PAFE or PAFE-S module, whose details are introduced in Table 3. † means jointly optimizing the Con-HCRFs and PAFE modules.

 

Component Supervoxel Pairwise term High order term Total
Time (ms) 115.1 12.3 100.8 228.2

 

Table 5: Time consumption of Con-HCRFs.

Speed analysis of Con-HCRFs. Table 5 reports the average runtime of each component of Con-HCRFs, tested on a single GTX 1080 Ti GPU. As shown in Table 5, the Con-HCRFs take about 0.2s to process a scene with 8,192 points. The speed of Con-HCRFs is similar to DenseCRF [10], which also takes about 0.2s to process a 320x213 image. Additionally, due to the approximate computation that we apply in the high order term, this term only takes about 0.1s per scene. In contrast, the time for this term would dramatically increase from 0.1s to 14s if the rigid motion parameters were calculated for each point instead of for each supervoxel. This large runtime gap shows that the approximation discussed in Sec. 3.3 significantly boosts the efficiency of our Con-HCRFs.

 

Desired point number 80 100 140 200 PAFE-S
EPE3D 0.0804 0.0788 0.0782 0.0790 0.0815

 

Table 6: The impact of point number of each supervoxel on our method.

Impact of supervoxel sizes. To illustrate the sensitivity to supervoxel sizes, we test our method when facing supervoxels with different point numbers. As shown in Table 6, the method achieves the best performance when the desired point number of each supervoxel is set to a range of 140 to 200.

5 Conclusions

In this paper, we have proposed a novel point cloud scene flow estimation method, termed HCRF-Flow, which incorporates the strengths of DNNs and CRFs to perform translational motion regression on each point and to refine the results with both pairwise and region-level regularization. Formulating the rigid motion constraints as a high order term, we propose a novel high-order CRFs based relation module (Con-HCRFs) that considers both point-level and region-level consistency. In addition, we design a position-aware flow embedding layer for better matching cost aggregation. Experimental results on the FlyingThings3D and KITTI datasets show that our proposed method performs favorably against competing methods. We have also shown the generality of our Con-HCRFs on other point cloud scene flow estimation methods.

6 Acknowledgements

This research was conducted in collaboration with SenseTime. This work is supported by A*STAR through the Industry Alignment Fund - Industry Collaboration Projects Grant. This work is also supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-RP-2018-003), and the MOE Tier-1 research grants: RG28/18 (S) and RG22/19 (S).

References

  • [1] Aseem Behl, Despoina Paschalidou, Simon Donné, and Andreas Geiger. Pointflownet: Learning representations for rigid motion estimation from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7962–7971, 2019.
  • [2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
  • [3] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019.
  • [4] Ayush Dewan, Tim Caselitz, Gian Diego Tipaldi, and Wolfram Burgard. Rigid scene flow for 3d lidar scans. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1765–1770. IEEE, 2016.
  • [5] Xiuye Gu, Yijie Wang, Chongruo Wu, Yong Jae Lee, and Panqu Wang. Hplflownet: Hierarchical permutohedral lattice flownet for scene flow estimation on large-scale point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3254–3263, 2019.
  • [6] Michael Hornacek, Andrew Fitzgibbon, and Carsten Rother. Sphereflow: 6 dof scene flow from rgb-d pairs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3526–3533, 2014.
  • [7] Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki Trigoni, and Andrew Markham. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11108–11117, 2020.
  • [8] Frédéric Huguet and Frédéric Devernay. A variational method for scene flow estimation from stereo sequences. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–7. IEEE, 2007.
  • [9] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • [10] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems, pages 109–117, 2011.
  • [11] Yangbin Lin, Cheng Wang, Dawei Zhai, Wei Li, and Jonathan Li. Toward better boundary preserved supervoxel segmentation for 3d point clouds. ISPRS journal of photogrammetry and remote sensing, 143:39–47, 2018.
  • [12] Fayao Liu, Chunhua Shen, and Guosheng Lin. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5162–5170, 2015.
  • [13] Xingyu Liu, Charles R Qi, and Leonidas J Guibas. Flownet3d: Learning scene flow in 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 529–537, 2019.
  • [14] Xingyu Liu, Mengyuan Yan, and Jeannette Bohg. Meteornet: Deep learning on dynamic 3d point cloud sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 9246–9255, 2019.
  • [15] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8895–8904, 2019.
  • [16] Wei-Chiu Ma, Shenlong Wang, Rui Hu, Yuwen Xiong, and Raquel Urtasun. Deep rigid instance scene flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3614–3622, 2019.
  • [17] David Marr. Vision: A computational investigation into the human representation and processing of visual information. 1982.
  • [18] Jiageng Mao, Xiaogang Wang, and Hongsheng Li. Interpolated convolutional networks for 3d point cloud understanding. In Proceedings of the IEEE International Conference on Computer Vision, pages 1578–1587, 2019.
  • [19] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
  • [20] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3061–3070, 2015.
  • [21] Moritz Menze, Christian Heipke, and Andreas Geiger. Joint 3d estimation of vehicles and scene flow. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, 2, 2015.
  • [22] Moritz Menze, Christian Heipke, and Andreas Geiger. Object scene flow. ISPRS Journal of Photogrammetry and Remote Sensing, 140:60–76, 2018.
  • [23] Himangi Mittal, Brian Okorn, and David Held. Just go with the flow: Self-supervised scene flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11177–11185, 2020.
  • [24] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • [25] Jean-Philippe Pons, Renaud Keriven, and Olivier Faugeras. Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. International Journal of Computer Vision, 72(2):179–193, 2007.
  • [26] Gilles Puy, Alexandre Boulch, and Renaud Marlet. Flot: Scene flow on point clouds guided by optimal transport. arXiv preprint arXiv:2007.11142, 2020.
  • [27] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
  • [28] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017.
  • [29] Julian Quiroga, Thomas Brox, Frédéric Devernay, and James Crowley. Dense semi-rigid scene flow estimation from rgbd images. In European Conference on Computer Vision, pages 567–582. Springer, 2014.
  • [30] Kosta Ristovski, Vladan Radosavljevic, Slobodan Vucetic, and Zoran Obradovic. Continuous conditional random fields for efficient regression in large fully connected graphs. In Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
  • [31] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
  • [32] Deqing Sun, Erik B Sudderth, and Hanspeter Pfister. Layered rgbd scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 548–556, 2015.
  • [33] Lyne Tchapmi, Christopher Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. Segcloud: Semantic segmentation of 3d point clouds. In 2017 international conference on 3D vision (3DV), pages 537–547. IEEE, 2017.
  • [34] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pages 6411–6420, 2019.
  • [35] Arash K Ushani, Ryan W Wolcott, Jeffrey M Walls, and Ryan M Eustice. A learning approach for real-time temporal scene flow estimation from lidar data. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 5666–5673. IEEE, 2017.
  • [36] Levi Valgaerts, Andrés Bruhn, Henning Zimmer, Joachim Weickert, Carsten Stoll, and Christian Theobalt. Joint estimation of motion, structure and geometry from stereo sequences. In European Conference on Computer Vision, pages 568–581. Springer, 2010.
  • [37] Sundar Vedula, Simon Baker, Peter Rander, Robert Collins, and Takeo Kanade. Three-dimensional scene flow. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 722–729. IEEE, 1999.
  • [38] Christoph Vogel, Konrad Schindler, and Stefan Roth. 3d scene flow estimation with a rigid motion prior. In 2011 International Conference on Computer Vision, pages 1291–1298. IEEE, 2011.
  • [39] Christoph Vogel, Konrad Schindler, and Stefan Roth. Piecewise rigid scene flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 1377–1384, 2013.
  • [40] Christoph Vogel, Konrad Schindler, and Stefan Roth. 3d scene flow estimation with a piecewise rigid scene model. International Journal of Computer Vision, 115(1):1–28, 2015.
  • [41] Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. Graph attention convolution for point cloud semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10296–10305, 2019.
  • [42] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2589–2597, 2018.
  • [43] Yue Wang and Justin M Solomon. Deep closest point: Learning representations for point cloud registration. In Proceedings of the IEEE International Conference on Computer Vision, pages 3523–3532, 2019.
  • [44] Wenxuan Wu, Zhongang Qi, and Li Fuxin. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9621–9630, 2019.
  • [45] Wenxuan Wu, Zhi Yuan Wang, Zhuwen Li, Wei Liu, and Li Fuxin. Pointpwc-net: Cost volume on point clouds for (self-) supervised scene flow estimation. In European Conference on Computer Vision, pages 88–107. Springer, 2020.
  • [46] Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5354–5362, 2017.
  • [47] Bo Yang, Jianan Wang, Ronald Clark, Qingyong Hu, Sen Wang, Andrew Markham, and Niki Trigoni. Learning object bounding boxes for 3d instance segmentation on point clouds. In Advances in Neural Information Processing Systems, pages 6737–6746, 2019.
  • [48] Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, and Chunhua Shen. Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5217–5226, 2019.
  • [49] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.